How LLM AI Works: A Very Short Guide

By Andreas Ramos, Professor at CSTU
andreas.ramos@cstu.edu
Palo Alto, November 12, 2024

Based on Tim Lee's article on LLM AI, Stephen Wolfram's short book on LLM, and many conversations with students at CSTU, AI developers, and professors.

LLMs Convert Words into Numbers

Large Language Model (LLM) AIs (such as OpenAI's ChatGPT, Anthropic's Claude, etc.) use tokens, not words.

A token is an ID number for a word or part of a word:

  • A full word (such as "cat") has (for example) the token #4829.
  • A word with several parts can be two or more tokens. For example, "catfood" has two parts: "cat" and "food", so there are two tokens: token #4829 for "cat" and token #7731 for "food".
  • Punctuation marks have tokens.
  • Special characters have more tokens.
  • There are many tokens to identify all the items that can appear in a text.
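The text-to-token mapping can be sketched with a toy vocabulary. The words and ID numbers below are invented for illustration; real tokenizers use learned vocabularies of 50,000+ entries:

```python
# A toy tokenizer. The vocabulary is invented, not a real GPT vocabulary.
VOCAB = {"cat": 4829, "food": 7731, "the": 262, "eats": 5891, ".": 13}

def tokenize(text):
    """Greedily match the longest known piece at each position."""
    tokens = []
    for word in text.lower().replace(".", " .").split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in VOCAB:
                    tokens.append(VOCAB[word[i:j]])
                    i = j
                    break
            else:
                i += 1  # no known piece here; skip (a real tokenizer falls back to bytes)
    return tokens

# "catfood" is not in the vocabulary, so it splits into "cat" + "food".
print(tokenize("The cat eats catfood."))  # [262, 4829, 5891, 4829, 7731, 13]
```

Real tokenizers (such as GPT's byte-pair encoding) learn which pieces to merge from data, but the word-to-ID lookup works the same way.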

An embedding layer converts each token into a vector: a long list of numbers. For example, the vector for "cat" might be [0.0074, -0.0105, 0.0742, ...]. GPT-3 uses 12,288 numbers (dimensions) for each token.
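The embedding lookup itself is just indexing into a table with one row of numbers per token ID. A minimal sketch (the 4-dimensional vectors here are invented; real embeddings are learned during training, and GPT-3's have 12,288 dimensions):

```python
# Embedding table: token ID -> vector. Values are invented for illustration.
EMBEDDINGS = {
    4829: [0.0074, -0.0105, 0.0742, 0.0311],   # "cat"
    7731: [0.0512, 0.0033, -0.0228, 0.0190],   # "food"
}

def embed(token_id):
    """Look up the vector for a token ID."""
    return EMBEDDINGS[token_id]

vector = embed(4829)
print(len(vector))  # 4 dimensions in this toy; GPT-3 uses 12,288
```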

LLM Uses Vectors to Find Relations

Similar to geometry, where you can plot positions on X and Y axes, converting words into numbers (vectors) allows the LLM to calculate the distance between vectors. For example:

  • "Dog" and "puppy" are close together.
  • "Paris" and "France" are similar in relationship to "Berlin" and "Germany".
  • "Happy" and "sad" are at opposite ends of an emotional dimension.

This allows LLMs to find complex relationships between words. For example, the LLM can find "king" minus "man" plus "woman" equals "queen".

(In this example, I used words such as dog, Paris, and happy, but that's just for you: the LLM uses vector numbers.)
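The distance and analogy ideas above can be demonstrated with toy 2-D vectors. The coordinates below are invented so that one axis roughly means "gender" and the other "royalty"; real models learn thousands of dimensions:

```python
import math

# Toy 2-D "embeddings", invented for illustration.
VEC = {
    "king":  [1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "queen": [-1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed element by element
result = [k - m + w for k, m, w in zip(VEC["king"], VEC["man"], VEC["woman"])]

# Which word's vector points most nearly in the same direction?
closest = max(VEC, key=lambda word: cosine(VEC[word], result))
print(closest)  # queen
```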

The Transformer Adds Information to the Tokens

The LLM uses a transformer architecture (a type of neural network), which is a series of layers (steps) that add information to the vectors. As a vector passes through a layer, information is added to it. The enhanced vector is passed to the next layer, which adds more information, and so on through every layer. This process builds a rich context for each vector.

At each layer, the transformer's attention mechanism "pays attention" to the previous words, which identifies the context. For example, in "The cat sits in the sun because it is warm", the LLM notes from context that "it" refers to "sun" (not "cat").

This new additional information is called hidden states (additional vectors, which are metadata). There are the vectors themselves (the numbers for "cat") and hidden vectors (numbers about the numbers). The GPT-3 transformer has 96 layers, and each layer has 96 attention heads, which means 96 layers × 96 heads = 9,216 transformations.

  • The first layers often focus on grammar and syntax (such as noun, verb, prepositions, etc.).
  • Middle layers handle relationships and context (such as mother and daughter).
  • Later layers work on higher-level meaning and task requirements.
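One attention head can be sketched in a few lines: the current position scores every earlier position, softmaxes the scores into weights, and mixes the earlier vectors accordingly. This is a bare-bones illustration with invented 2-D vectors, not GPT's actual weights (real heads also apply learned query/key/value projections first):

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over earlier positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted mix of the value vectors: context from earlier tokens
    # flows into the current token's vector.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# Invented 2-D vectors for three earlier tokens. The query "attends" mostly
# to the tokens whose keys point the same way it does.
keys = values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention([1.0, 0.0], keys, values)
print(out)
```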

Many types of definitions can be applied to vectors:

  • Grammar (noun, verb, etc.).
  • Relationships between words.
  • Topic and theme.
  • Logical flow.

The LLM uses the final layer to calculate probabilities for the next word. For example, "John (identity, role) wants (verb, request) his (relation) wife (identity, role) to __[call]__."

The LLM adds the predicted word to the text and repeats the process all over again. By doing this over and over, with billions of calculations behind each step, the LLM creates readable text.
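The predict-append-repeat loop can be sketched like this. The `next_word_probabilities` function below is a made-up lookup table standing in for the entire transformer, and the probabilities are invented:

```python
# A stand-in for the transformer: given the words so far, return a probability
# distribution over possible next words. All values here are invented.
def next_word_probabilities(words):
    table = {
        ("John", "wants", "his", "wife"): {"to": 0.9, "and": 0.1},
        ("wants", "his", "wife", "to"): {"call": 0.6, "cook": 0.3, "sing": 0.1},
    }
    return table.get(tuple(words[-4:]), {"<end>": 1.0})

def generate(prompt, max_words=10):
    words = prompt.split()
    for _ in range(max_words):
        probs = next_word_probabilities(words)
        word = max(probs, key=probs.get)  # greedy: always take the most likely word
        if word == "<end>":
            break
        words.append(word)
    return " ".join(words)

print(generate("John wants his wife"))  # John wants his wife to call
```

A real LLM usually samples from the distribution instead of always taking the top word; that is where the temperature setting (discussed below) comes in.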

How the LLM Generates Text

To summarize:

  1. The LLM converts words into tokens, and each token into a set of numbers (a vector).
  2. The LLM processes the vectors through transformer layers to add definitions, context, and information to the vectors.
  3. The LLM creates a probability distribution for possible next words.
  4. The LLM chooses a word based on these probabilities.
  5. The LLM adds the new word and repeats the process all over again to write the complete sentence and paragraph.

In very simple LLMs with only a few vectors and a few layers, researchers can understand how the LLM made its decisions. However, it can take months to do this. Large LLMs use trillions of parameters and perform trillions of calculations per second, which makes it impossible for humans to describe or understand how these LLMs make decisions.

Parameters in an LLM

Parameters are like the knobs on your car radio, which you can adjust to change the station, the volume, and so on. Parameters are numbers that are applied to tokenization, vectors, attention heads, feed-forward layers, etc. The numbers can be adjusted (increased or decreased) to change the "weights" and thereby adjust the results.

  • A new LLM starts with random parameters.
  • The training process compares the results with the training data and adjusts the parameters slightly to get better results; human reviewers also guide the tuning (better grammar, less violence, and so on).
  • This happens billions of times until the parameters are well-tuned.

Parameters are shaped by computer science engineers and by training (moderation) from humans. The LLM AI itself also adjusts its own parameters during training.

  • GPT-3 has 175 billion parameters (weights and rules).
  • GPT-4 is reported to have about 1.8 trillion parameters.

For example, an LLM has a temperature control to adjust the level of randomness. A higher temperature produces results that are creative but less focused, while a lower temperature produces more predictable output.
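Temperature works by dividing the model's raw scores (logits) before they are turned into probabilities. A sketch with invented logits for three candidate words:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities; temperature reshapes the distribution."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate words

low = softmax_with_temperature(logits, 0.5)   # sharp: the top word dominates
high = softmax_with_temperature(logits, 2.0)  # flat: other words get real chances

print([round(p, 2) for p in low])
print([round(p, 2) for p in high])
```

At low temperature the top word takes almost all the probability (predictable output); at high temperature the distribution flattens, so sampling picks unlikely words more often (creative but less focused).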

The Data in LLM AI

LLMs use datasets:

  • The Pile dataset has roughly 300 billion tokens (about 825 GB of text).
  • The FineWeb dataset has about 15 trillion tokens (at roughly 100,000 tokens per book, that is on the order of 150 million books).

The information in the datasets is curated (selected) by computer science researchers and library information experts.

This enormous amount of data allows LLMs to discover patterns in human language and knowledge.

Summary of LLMs

LLMs are complex pattern-matching systems, based on mathematics, that recognize and reproduce statistical patterns in human language and knowledge. An LLM AI can't comprehend the way humans can. However, an LLM uses mathematics and massive data to examine far more than any human can.

Extra Stuff about AI

Here are a few personal comments about LLM AI, based on my experience, knowledge, reading lots of books on AI, and talking with many people in Silicon Valley. Many of these questions have come up in class. These are my opinion, which may change as LLM AI develops.

Why do you keep calling it an LLM AI? Why not just say, "AI"?

Because there are 36+ kinds of AIs, of which one category is machine learning (ML) AI, which has a subcategory for Large Language Model (LLM) AI. LLM AI is one small area in the field of AI. Other AIs are different.

Creativity with LLM AI

If it's just math and patterns, what's the good of LLM AI? An LLM AI can come up with many new ideas because it scans so much data and it's very good at finding patterns. Ask your AI, "Suggest ten creative, unique, original, and wild ideas for customer appreciation events for our ski boot store." It'll generate lots of ideas, and you can pick the ones you like.

Does an LLM AI have bias?

There is positive and negative bias (bias toward and bias against). The bias arises from the datasets and from the training by humans. Negative bias is often easy to identify. Positive bias, however, is very difficult to identify because many people think that's the way things should be. For example, LLM AI made in Silicon Valley is biased toward American social values, US laws, the US economic system, and so on. Eventually, there will be LLM AIs from the Islamic world, China, Europe, Africa, and so on that reflect their cultures.

Can all bias be removed?

Imagine someone who has no opinions, preferences, or reactions for or against anything. It'd be like talking with a potato.

Is it possible to make an AI that would be like a human?

Would this be an AI that's like the average person? Average education, average intelligence, average skills? Why? We already have lots of people. Could the AI be a super human in intelligence and skills? That's a low goal. Who wants a car that can gallop like a horse? An airplane that flaps its wings like a bird? Horses and birds are very good at what they do, but we can build things that are better. The proper goal for AI is to build systems that are intelligent in their own way and not limited by human abilities.

The ethics of AI

Many AI companies say they have panels of philosophers to guide their work. The company management may indeed want to behave ethically, but pressure from investors will push to increase revenues and stock value. As we see in many companies, financial pressure wins over everything else (workers, customers, society, the environment, and so on.)

Could an LLM AI be ethical?

Ethics committees can add rules to an AI. These ethical rules may be well-intentioned, the "right thing to do", good for everyone, and so on, but that means they have positive bias that reflects the broader culture that creates the AI. Think of what an AI would be like that's produced by Silicon Valley in comparison to an LLM AI built by Muslims, West Africans, or French. What's ethical in one culture is wrong in another. Any AI will be biased toward the culture that created it.

Can a calculator be ethical?

No, it would not have the number seven, because seven eight nine.

Does AI have emergence?

The idea of emergence is popular in the world of AI. As things become complex, new features and abilities seem to appear. Indeed, nobody thought LLM AI could translate, write Shakespearean sonnets, act as a person, or have a theory of mind (ToM) (aware of the intentions of others). These features arose as the LLM AIs became more complex. AIs appear to show emergence.

How far can this go? As LLMs got more GPUs (chips), faster processing, larger datasets, and more training, the results improved to the point that LLMs can write better than 95% of people. This progress is due to scaling (more, bigger, better, faster). If scaling continues, will the LLM become self-aware?

An LLM AI is a complex mathematical machine. It can generate new material, but it can't become conscious. Computer science developers now seem to agree (late 2024) that progress in LLM AI will plateau soon.

Many developers also agree LLM AI will not lead to AGI (Artificial General Intelligence) or ASI (Artificial Super Intelligence). Another technology could do that, but nobody has any idea how to design that. (Maybe LLM AI can help us to create that technology? 🙂)

But it really feels like the AI is conscious

Yes, it's remarkable how an AI can write thoughtful poetry. Although I know it's a computer and software, I tend to think of an AI as an intelligent person. It helps to think this way when I'm trying to get the AI to write ("hey, AI, write in a casual and friendly tone.") But I know it's a machine.

However… can AI write poetry better than humans? Yes, a study shows that people rate AI-generated poetry to be better than poetry by humans. Here's the study: https://www.nature.com/articles/s41598-024-76900-1

Read the article's summary carefully: an AI can write poems that non-expert humans prefer, because the AI-generated poem is simple, clear, and easy to understand.

Humans don't like poems by expert poets (such as Emily Dickinson, Sylvia Plath, and so on) because these are subtle, complex, and not easy to understand.

This means the vast majority of people will prefer movies, music, theater, poetry, painting, and many other art forms that are produced by AI. The AI will give them what they want.

However, elite aesthetes prefer the complexity of art created by elite artists, such as Rilke, Rodin, or Calder. (By elite art, I don't mean "expensive art for the wealthy"; the wealthy's taste in art is generally the same as the majority's.)

Can laws and legislation control AI?

AI laws require that developers be able to explain the results. An LLM makes some 400 trillion calculations per second, which means no one can understand how it made its decisions. Developments in AI also move so fast that laws are left behind.

Will AI replace workers?

Yes, AI will replace workers and it will create many more new jobs. Over the last 180 years of technological industrialization, every new technology has created many more jobs than it replaced. Horses were replaced by automobiles, which created jobs in manufacturing, components, distribution, shipping, marketing, sales, gasoline stations, maintenance, repairs, and so on, most of which didn't exist before.

If you have other opinions on these points, let me know.