How LLM AI Works: A Very Short Guide

By Andreas Ramos, Professor at CSTU
andreas.ramos@cstu.edu
Palo Alto, November 12, 2024

Based on Tim Lee's article on LLM AI, Stephen Wolfram's short book on LLM, and many conversations with students at CSTU, AI developers, and professors.

LLMs Convert Words into Numbers

Large Language Model (LLM) AIs (such as OpenAI's ChatGPT, Anthropic's Claude, etc.) use tokens, not words.

A token is an ID number for a word or part of a word:

  • A full word (such as "cat") has (for example) the token #4829.
  • A word with several parts can be two or more tokens. For example, "catfood" has two parts: "cat" and "food", so there are two tokens: token #4829 for "cat" and token #7731 for "food".
  • Punctuation marks have tokens.
  • Special characters have more tokens.
  • There are many tokens to identify all the items that can appear in a text.
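The text-to-token mapping can be sketched with a toy vocabulary. The words and ID numbers below are invented for illustration; real tokenizers use learned vocabularies of 50,000+ entries:

```python
# A toy tokenizer. The vocabulary is invented, not a real GPT vocabulary.
VOCAB = {"cat": 4829, "food": 7731, "the": 262, "eats": 5891, ".": 13}

def tokenize(text):
    """Greedily match the longest known piece at each position."""
    tokens = []
    for word in text.lower().replace(".", " .").split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in VOCAB:
                    tokens.append(VOCAB[word[i:j]])
                    i = j
                    break
            else:
                i += 1  # no known piece here; skip (a real tokenizer falls back to bytes)
    return tokens

# "catfood" is not in the vocabulary, so it splits into "cat" + "food".
print(tokenize("The cat eats catfood."))  # [262, 4829, 5891, 4829, 7731, 13]
```

Real tokenizers (such as GPT's byte-pair encoding) learn which pieces to merge from data, but the word-to-ID lookup works the same way.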

An embedding layer converts each token into a vector: a long list of numbers. For example, the vector for "cat" might be [0.0074, -0.0105, 0.0742, ...]. GPT-3 uses 12,288 numbers (dimensions) for each token.
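The embedding lookup itself is just indexing into a table with one row of numbers per token ID. A minimal sketch (the 4-dimensional vectors here are invented; real embeddings are learned during training, and GPT-3's have 12,288 dimensions):

```python
# Embedding table: token ID -> vector. Values are invented for illustration.
EMBEDDINGS = {
    4829: [0.0074, -0.0105, 0.0742, 0.0311],   # "cat"
    7731: [0.0512, 0.0033, -0.0228, 0.0190],   # "food"
}

def embed(token_id):
    """Look up the vector for a token ID."""
    return EMBEDDINGS[token_id]

vector = embed(4829)
print(len(vector))  # 4 dimensions in this toy; GPT-3 uses 12,288
```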

LLM Uses Vectors to Find Relations

Similar to geometry, where you can plot positions on X and Y axes, converting words into numbers (vectors) allows the LLM to calculate the distance between vectors. For example:

  • "Dog" and "puppy" are close together.
  • "Paris" and "France" are similar in relationship to "Berlin" and "Germany".
  • "Happy" and "sad" are at opposite ends of an emotional dimension.

This allows LLMs to find complex relationships between words. For example, the LLM can find "king" minus "man" plus "woman" equals "queen".

(In this example, I used words such as dog, Paris, and happy, but that's just for you: the LLM uses vector numbers.)
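The distance and analogy ideas above can be demonstrated with toy 2-D vectors. The coordinates below are invented so that one axis roughly means "gender" and the other "royalty"; real models learn thousands of dimensions:

```python
import math

# Toy 2-D "embeddings", invented for illustration.
VEC = {
    "king":  [1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "queen": [-1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed element by element
result = [k - m + w for k, m, w in zip(VEC["king"], VEC["man"], VEC["woman"])]

# Which word's vector points most nearly in the same direction?
closest = max(VEC, key=lambda word: cosine(VEC[word], result))
print(closest)  # queen
```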

The Transformer Adds Information to the Tokens

The LLM uses a transformer architecture (a type of neural network), which is a series of layers (steps) that add information to the vectors. As a vector passes through a layer, information is added to it. The enhanced vector is passed to the next layer, which adds more information, and so on through every layer. This process builds a rich context for each vector.

At each layer, the transformer's attention mechanism "pays attention" to the previous words, which identifies the context. For example, in "The cat sits in the sun because it is warm", the LLM notes from context that "it" refers to "sun" (not "cat").

This new additional information is called hidden states (additional vectors, which are metadata). There are the vectors themselves (the numbers for "cat") and hidden vectors (numbers about the numbers). The GPT-3 transformer has 96 layers, and each layer has 96 attention heads, which means 96 layers × 96 heads = 9,216 transformations.

  • The first layers often focus on grammar and syntax (such as noun, verb, prepositions, etc.).
  • Middle layers handle relationships and context (such as mother and daughter).
  • Later layers work on higher-level meaning and task requirements.
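One attention head can be sketched in a few lines: the current position scores every earlier position, softmaxes the scores into weights, and mixes the earlier vectors accordingly. This is a bare-bones illustration with invented 2-D vectors, not GPT's actual weights (real heads also apply learned query/key/value projections first):

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over earlier positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted mix of the value vectors: context from earlier tokens
    # flows into the current token's vector.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# Invented 2-D vectors for three earlier tokens. The query "attends" mostly
# to the tokens whose keys point the same way it does.
keys = values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention([1.0, 0.0], keys, values)
print(out)
```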

Many types of definitions can be applied to vectors:

  • Grammar (noun, verb, etc.).
  • Relationships between words.
  • Topic and theme.
  • Logical flow.

The LLM uses the final layer to calculate probabilities for the next word. For example, "John (identity, role) wants (verb, request) his (relation) wife (identity, role) to __[call]__."

The LLM adds the predicted word to the text and repeats the process all over again. By doing this over and over, with billions of calculations behind each step, the LLM creates readable text.
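The predict-append-repeat loop can be sketched like this. The `next_word_probabilities` function below is a made-up lookup table standing in for the entire transformer, and the probabilities are invented:

```python
# A stand-in for the transformer: given the words so far, return a probability
# distribution over possible next words. All values here are invented.
def next_word_probabilities(words):
    table = {
        ("John", "wants", "his", "wife"): {"to": 0.9, "and": 0.1},
        ("wants", "his", "wife", "to"): {"call": 0.6, "cook": 0.3, "sing": 0.1},
    }
    return table.get(tuple(words[-4:]), {"<end>": 1.0})

def generate(prompt, max_words=10):
    words = prompt.split()
    for _ in range(max_words):
        probs = next_word_probabilities(words)
        word = max(probs, key=probs.get)  # greedy: always take the most likely word
        if word == "<end>":
            break
        words.append(word)
    return " ".join(words)

print(generate("John wants his wife"))  # John wants his wife to call
```

A real LLM usually samples from the distribution instead of always taking the top word; that is where the temperature setting (discussed below) comes in.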

How the LLM Generates Text

To summarize:

  1. The LLM converts words into tokens, and each token into a set of numbers (a vector).
  2. The LLM processes the vectors through transformer layers to add definitions, context, and information to the vectors.
  3. The LLM creates a probability distribution for possible next words.
  4. The LLM chooses a word based on these probabilities.
  5. The LLM adds the new word and repeats the process all over again to write the complete sentence and paragraph.

In very simple LLMs with only a few vectors and a few layers, researchers can understand how the LLM made its decisions. However, it can take months to do this. Large LLMs use trillions of parameters and perform trillions of calculations per second, which makes it impossible for humans to describe or understand how these LLMs make decisions.

Parameters in an LLM

Parameters are like the knobs on your car radio, which you can adjust to change the station, the volume, and so on. Parameters are numbers that are applied to tokenization, vectors, attention heads, feed-forward layers, etc. The numbers can be adjusted (increased or decreased) to change the "weights" and thereby adjust the results.

  • A new LLM starts with random parameters.
  • The training process compares the results with the training data and adjusts the parameters slightly to get better results; human reviewers also guide the tuning (better grammar, less violence, and so on).
  • This happens billions of times until the parameters are well-tuned.

Parameters are shaped by computer science engineers and by training (moderation) from humans. The LLM AI itself also adjusts its own parameters during training.

  • GPT-3 has 175 billion parameters (weights and rules).
  • GPT-4 is reported to have about 1.8 trillion parameters.

For example, an LLM has a temperature control to adjust the level of randomness. A higher temperature produces results that are creative but less focused, while a lower temperature produces more predictable output.
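Temperature works by dividing the model's raw scores (logits) before they are turned into probabilities. A sketch with invented logits for three candidate words:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities; temperature reshapes the distribution."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate words

low = softmax_with_temperature(logits, 0.5)   # sharp: the top word dominates
high = softmax_with_temperature(logits, 2.0)  # flat: other words get real chances

print([round(p, 2) for p in low])
print([round(p, 2) for p in high])
```

At low temperature the top word takes almost all the probability (predictable output); at high temperature the distribution flattens, so sampling picks unlikely words more often (creative but less focused).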

The Data in LLM AI

LLMs use datasets:

  • The Pile dataset has roughly 300 billion tokens (about 825 GB of text).
  • The FineWeb dataset has about 15 trillion tokens (at roughly 100,000 tokens per book, that is on the order of 150 million books).

The information in the datasets is curated (selected) by computer science researchers and library information experts.

This enormous amount of data allows LLMs to discover patterns in human language and knowledge.

Summary of LLMs

LLMs are complex pattern-matching systems, based on mathematics, that recognize and reproduce statistical patterns in human language and knowledge. An LLM AI can't comprehend the way humans can. However, an LLM uses mathematics and massive data to examine far more than any human can.

Extra Stuff about AI

Here are a few personal comments about LLM AI, based on my experience, knowledge, reading lots of books on AI, and talking with many people in Silicon Valley. Many of these questions have come up in class. These are my opinion, which may change as LLM AI develops.

Why do you keep calling it an LLM AI? Why not just say, "AI"?

Because there are 36+ kinds of AIs, of which one category is machine learning (ML) AI, which has a subcategory for Large Language Model (LLM) AI. LLM AI is one small area in the field of AI. Other AIs are different.

Creativity with LLM AI

If it's just math and patterns, what's the good of LLM AI? An LLM AI can come up with many new ideas because it scans so much data and it's very good at finding patterns. Ask your AI, "Suggest ten creative, unique, original, and wild ideas for customer appreciation events for our ski boot store." It'll generate lots of ideas, and you can pick the ones you like.

Does an LLM AI have bias?

There is positive and negative bias (bias toward and bias against). The bias arises from the datasets and from the training by humans. Negative bias is often easy to identify. Positive bias, however, is very difficult to identify because many people think that's the way things should be. For example, LLM AI made in Silicon Valley is biased toward American social values, US laws, the US economic system, and so on. Eventually, there will be LLM AIs from the Islamic world, China, Europe, Africa, and so on that reflect their cultures.

Can all bias be removed?

Imagine someone who has no opinions, preferences, or reactions for or against anything. It'd be like talking with a potato.

Is it possible to make an AI that would be like a human?

Would this be an AI that's like the average person? Average education, average intelligence, average skills? Why? We already have lots of people. Could the AI be a super human in intelligence and skills? That's a low goal. Who wants a car that can gallop like a horse? An airplane that flaps its wings like a bird? Horses and birds are very good at what they do, but we can build things that are better. The proper goal for AI is to build systems that are intelligent in their own way and not limited by human abilities.

The ethics of AI

Many AI companies say they have panels of philosophers to guide their work. The company management may indeed want to behave ethically, but pressure from investors will push to increase revenues and stock value. As we see in many companies, financial pressure wins over everything else (workers, customers, society, the environment, and so on.)

Could an LLM AI be ethical?

Ethics committees can add rules to an AI. These ethical rules may be well-intentioned, the "right thing to do", good for everyone, and so on, but that means they have positive bias that reflects the broader culture that creates the AI. Think of what an AI would be like that's produced by Silicon Valley in comparison to an LLM AI built by Muslims, West Africans, or French. What's ethical in one culture is wrong in another. Any AI will be biased toward the culture that created it.

Can a calculator be ethical?

No, it would not have the number seven, because seven eight nine.

Does AI have emergence?

The idea of emergence is popular in the world of AI. As things become complex, new features and abilities seem to appear. Indeed, nobody thought LLM AI could translate, write Shakespearean sonnets, act as a person, or have a theory of mind (ToM) (aware of the intentions of others). These features arose as the LLM AIs became more complex. AIs appear to show emergence.

How far can this go? As LLMs got more GPUs (chips), faster processing, larger datasets, and more training, the results improved to the point that LLMs can write better than 95% of people. This progress is due to scaling (more, bigger, better, faster). If scaling continues, will the LLM become self-aware?

An LLM AI is a complex mathematical machine. It can generate new material, but it can't become conscious. Computer science developers now seem to agree (late 2024) that progress in LLM AI will plateau soon.

Many developers also agree LLM AI will not lead to AGI (Artificial General Intelligence) or ASI (Artificial Super Intelligence). Another technology could do that, but nobody has any idea how to design that. (Maybe LLM AI can help us to create that technology? 🙂)

But it really feels like the AI is conscious

Yes, it's remarkable how an AI can write thoughtful poetry. Although I know it's a computer and software, I tend to think of an AI as an intelligent person. It helps to think this way when I'm trying to get the AI to write ("hey, AI, write in a casual and friendly tone.") But I know it's a machine.

However… can AI write poetry better than humans? Yes, a study shows that people rate AI-generated poetry to be better than poetry by humans. Here's the study: https://www.nature.com/articles/s41598-024-76900-1

Read the article's summary carefully: an AI can write poems that non-expert humans prefer, because the AI-generated poem is simple, clear, and easy to understand.

Humans don't like poems by expert poets (such as Emily Dickinson, Sylvia Plath, and so on) because these are subtle, complex, and not easy to understand.

This means the vast majority of people will prefer movies, music, theater, poetry, painting, and many other art forms that are produced by AI. The AI will give them what they want.

However, elite aesthetes prefer the complexity of art created by elite artists, such as Rilke, Rodin, or Calder. (By elite art, I don't mean "expensive art for the wealthy"; the wealthy's taste in art is generally the same as the majority's.)

Can laws and legislation control AI?

AI laws require that developers be able to explain the results. An LLM makes some 400 trillion calculations per second, which means no one can understand how it made its decisions. Developments in AI also move so fast that laws are left behind.

Will AI replace workers?

Yes, AI will replace workers and it will create many more new jobs. Over the last 180 years of technological industrialization, every new technology has created many more jobs than it replaced. Horses were replaced by automobiles, which created jobs in manufacturing, components, distribution, shipping, marketing, sales, gasoline stations, maintenance, repairs, and so on, most of which didn't exist before.

If you have other opinions on these points, let me know.