Who is this deep dive for?
A few days ago, Andrej Karpathy released a video titled “Deep dive into LLMs like ChatGPT.” It’s a goldmine of information, but it’s also 3 hours and 31 minutes long.
I watched the whole thing and took a bunch of notes, so I figured why not put together a TL;DR version for anyone who wants the essential takeaways without the large time commitment.
If any of this sounds like you, this post (and the original video) is worth checking out:
- You want to understand how LLMs actually work, not just at the surface level.
- You want to understand confusing fine-tuning terms like `chat_template` and ChatML (especially if you're using Axolotl).
- You want to get better at prompt engineering by understanding why some prompts work better than others.
- You’re trying to reduce hallucinations and want to know how to keep LLMs from making things up.
- You want to understand why DeepSeek-R1 is such a big deal right now.
I won’t be covering everything in the video, so if you have time, definitely watch the whole thing. But if you don’t, this post will give you the key takeaways.
Note: If you are looking for the excalidraw diagram that Andrej made for the video, you can download it here. He shared it through Google Drive, but that link expires after a while, which is why I've decided to host it on my CDN as well.
Pretraining Data
Internet
LLMs start by crawling the internet to build a massive text dataset. The problem? Raw data is noisy and full of duplicate content, low-quality text, and irrelevant information. Before training, it needs heavy filtering.
- If you’re building an English only model, you’ll need a heuristic to filter out non-English text (e.g., only keeping text with a high probability of being English).
- One example dataset is FineWeb, which contains over 1.2 billion web pages.
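As a rough illustration of such a heuristic, the sketch below keeps only documents that a language-identification model scores as English above some threshold. It assumes fastText's `lid.176.bin` model and a 0.65 cutoff; both are illustrative choices, not necessarily what FineWeb uses.

```python
# Toy English-only filter: keep documents scored as English with high confidence.
# Assumes fastText's lid.176.bin language-ID model has been downloaded locally.
import fasttext

lang_model = fasttext.load_model("lid.176.bin")

def is_probably_english(text: str, threshold: float = 0.65) -> bool:
    # fastText returns labels like "__label__en" plus a confidence score.
    # Newlines are stripped because predict() expects a single line of text.
    labels, scores = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and scores[0] >= threshold

docs = ["The quick brown fox jumps over the lazy dog.", "El rápido zorro marrón salta."]
english_docs = [d for d in docs if is_probably_english(d)]
```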
Once cleaned, the data still needs to be compressed into something usable. Instead of feeding raw text into the model, it gets converted into tokens: a structured, numerical representation.
Tokenization
Tokenization is how models break text into smaller pieces (tokens) before processing it. Instead of storing raw words, the model converts them into IDs that represent repeating patterns.
- A popular technique for this is Byte Pair Encoding (BPE).
- There’s an optimal number of symbols (tokens) for compression. For example, GPT-4 uses 100,277 tokens. It is totally dependent on the discretion of the model creator.
- You can visualize how this works using tools like Tiktokenizer.
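If you'd rather poke at this in code than in the browser, the `tiktoken` library exposes the same `cl100k_base` BPE vocabulary that GPT-4 uses (the sample sentence is just an illustration):

```python
# Encode text with GPT-4's cl100k_base BPE vocabulary and inspect the pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Tokenization turns text into integer IDs.")

print(ids)                             # the token IDs the model actually sees
print([enc.decode([i]) for i in ids])  # the text fragment each ID maps back to
print(enc.n_vocab)                     # 100277 -> the vocabulary size
```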
Neural Network I/O
Once the data is tokenized, it’s fed into the neural network. Here’s how that process works:
- The model takes in a context window, a set number of tokens (e.g., 8,000 for some models, up to 128k for GPT-4).
- It predicts the next token based on the patterns it has learned.
- The weights in the model are adjusted using backpropagation to reduce errors.
- Over time, the model learns to make better predictions.
A longer context window means the model can “remember” more from the input, but it also increases computational cost.
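To make that input/output contract concrete, here is a heavily simplified PyTorch sketch of one training step: a toy model (no attention, made-up sizes, random tokens) predicts each next token, and backpropagation nudges the weights to reduce the error. It only illustrates the loop, not a real architecture.

```python
# Minimal next-token-prediction training step on fake data (toy sizes throughout).
import torch
import torch.nn as nn

vocab_size, context_len, dim = 1000, 32, 64      # illustrative, not real model sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, dim),               # token IDs -> vectors
    nn.Linear(dim, vocab_size),                  # vectors -> logits over the next token
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (1, context_len + 1))  # pretend training text
inputs, targets = tokens[:, :-1], tokens[:, 1:]              # predict token i+1 from token i

logits = model(inputs)                                       # (1, context_len, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()                                              # backpropagation
optimizer.step()                                             # adjust the weights
```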
Neural Network Internals
Inside the model, billions of parameters interact with the input tokens to generate a probability distribution for the next token.
- This process is defined by complex mathematical equations optimized for efficiency.
- Model architectures are designed to balance speed, accuracy, and parallelization.
- You can see a production-grade LLM architecture example here.
Inference
LLMs don’t generate deterministic outputs, they are stochastic. This means the output varies slightly every time you run the model.
- The model doesn’t just repeat what it was trained on, it generates responses based on probabilities.
- In some cases, the response will match something in the training data exactly, but most of the time, it will generate something new that follows similar patterns.
This randomness is why LLMs can be creative, but also why they sometimes hallucinate incorrect information.
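The randomness comes from the decoding step: instead of always picking the single most likely token, the model samples from the probability distribution. A minimal sketch with made-up logits (higher temperature means more randomness):

```python
# Sample the next token from toy logits; compare with greedy argmax decoding.
import numpy as np

rng = np.random.default_rng()
logits = np.array([2.0, 1.0, 0.5, -1.0])        # made-up scores for 4 candidate tokens

def sample_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())        # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

print([sample_token(logits) for _ in range(5)])  # varies from run to run (stochastic)
print(int(np.argmax(logits)))                    # greedy decoding would always pick token 0
```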
GPT-2
GPT-2, released by OpenAI in 2019, was an early example of a transformer-based LLM.
Here’s what it looked like:
- 1.6 billion parameters
- 1024-token context length
- Trained on ~100 billion tokens
- The original GPT-2 training cost was $40,000.
Since then, efficiency has improved dramatically. Andrej Karpathy managed to reproduce GPT-2 using llm.c for just $672. With optimized pipelines, training costs could drop even further to around $100.
Why is it so much cheaper now?
- Better pre-training data extraction techniques → Cleaner datasets mean models learn faster.
- Stronger hardware and optimized software → Less computation needed for the same results.
Open Source Base Models
Some companies train massive LLMs and release the base models for free. A base model is essentially the raw, unrefined LLM; it still needs tuning to be useful.
- Base models are trained on raw internet text, meaning they generate completions but don’t understand human intent.
- OpenAI open-sourced GPT-2.
- Meta open-sourced Llama 3.1 (405B parameters), a much more recent base model than GPT-2.
In order to fully open-source a base model, two things are required:
- The code (e.g., a few hundred lines of Python that define the steps the model takes to generate a prediction).
- The parameters (e.g., billions of tuned weights that define the model).
How Base Models Work
- They generate token-level internet-style text.
- Every run produces a slightly different output (stochastic behavior).
- They can regurgitate parts of their training data.
- The parameters are like a lossy zip file of internet knowledge.
- You can already use them for applications like:
- Translation → Using in-context examples (see the sketch after this list).
- Basic assistants → Prompting them in a structured way.
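To show what "translation via in-context examples" means in practice, here is a hypothetical few-shot prompt (the phrase pairs are made up); a base model will tend to continue the pattern with the missing French translation:

```python
# A few-shot prompt: the base model completes the pattern it sees in its context window.
prompt = """English: Good morning
French: Bonjour

English: Thank you very much
French: Merci beaucoup

English: Where is the train station?
French:"""

# Fed to a base model, the most likely continuation is something like " Où est la gare ?"
# No fine-tuning involved; this is pure pattern completion.
```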
Want to experiment with one? Try the Llama 3 (405B base model) here.
At its core, a base model is just an expensive autocomplete. It still needs fine-tuning.
Pre-Training to Post-Training
So far, we’ve looked at base models, which are just pre-trained text generators. But to make an actual assistant, you need post-training.
- Base models hallucinate a lot → They generate text, but it’s not always useful.
- Post-training fixes this by fine-tuning the model to respond better.
- The good news? Post-training is way cheaper than pre-training (e.g., months vs. hours).
Supervised Fine-Tuning (SFT)
Data Conversations
Once the base model is trained on internet data, the next step is post-training. This is where we replace the internet dataset with human/assistant conversations to make the model more conversational and useful.
- Pre-training takes months, but post-training is much faster; it can take just hours.
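For context on the `chat_template`/ChatML terms mentioned at the top: each human/assistant conversation in the SFT dataset gets rendered into plain tokens with a chat template. Here is a hypothetical example in ChatML-style formatting (the conversation content is made up):

```python
# One SFT training example rendered with a ChatML-style chat template.
# Special markers like <|im_start|> / <|im_end|> delimit who is speaking.
example = """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>"""
```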
10 Comments
dzogchen
For a model to be ‘fully’ open source you need more than the model itself and a way to run it. You also need the data and the program that can be used to train it.
See The Open Source AI Definition from OSI: https://opensource.org/ai
bluelightning2k
Great write up of what is presumably a truly great lecture. Debating trying to follow the original now.
albert_e
OT: What is a good place to discuss the original video — once it has dropped out of the HN front-page?
I am going through the video myself — roughly halfway through — and have a few things to bring up.
Here they are now that we have a fresh opportunity to discuss:
1 – MATH and LLMs
I am curious why many of the examples Andrej chose to pose to the LLM were "computational" questions — for instance "what is 2+2" or numerical puzzles that needed algebraic thinking and then some addition/subtraction/multiplication (example around the 1:50 mark about buying apples and oranges).
I can understand these abilities of LLMs are becoming powerful and useful too — but in my mind these are not the "basic" abilities of a next token predictor.
I would have appreciated a more clear distinction of prompts that showcase core LLM ability — to generate text that is acceptable as generally grammatically correct, based in facts and context, without necessarily needing the ability of a working memory / assigning values to algebraic variables / doing arithmetic etc.
If there are any good references to discussion on the mathematical abilities of LLMs and the wisdom of trying to make them do math, versus simply recognizing when math is needed, generating the necessary python/expressions, and letting the tools handle it, I would appreciate them.
2 – META
While Andrej briefly acknowledges the "meta" situation where LLMs are being used to create training data for the training of and judge the outputs of newer LLMs … there is not much discussion on that here.
There are just many more examples of how LLMs are used to prepare mitigations for hallucinations, e.g., by preparing Q&A training sets with "correct" answers.
I am curious to know more about the limitations / perils of using LLMs to train/evaluate other LLMs.
I kind of feel that this is a bit like the Manhattan project and atomic weapons — in that early results and advances are being looped back immediately into the development of more powerful technology. (A smaller fission charge at the core of a larger fusion weapon — to be very loose with analogies)
<I am sure I will have a few more questions as I go through the rest of the video and digest it>
est
I have read many articles about LLMs, and understand how they work in general, but one thing always bothers me: why didn't other models work as well as the SOTA ones? What's the history and reason behind the current model architecture?
khazhoux
I'm still seeking an answer to what DeepSeek really is, especially in the context of their $5M versus ChatGPT's >$1B (source: internet). What did they do versus not do?
miletus
i saw a good thread today: https://x.com/0xmetaschool/status/1888873661840634111
EncomLab
It would be great if the hardware issues were discussed more – too little is made of the distinction between silicon substrate, fixed threshold, voltage moderated brittle networks of solid-state switches and protein substrate, variable threshold, chemically moderated plastic networks of biological switches.
To be clear, neither possesses any magical "woo" outside of physics that gives one or the other some secret magical properties – but these are not arbitrary meaningless distinctions in the way they are often discussed.
thomasahle
I find Meta's approach to hallucinations delightfully counterintuitive. Basically they (and presumably OpenAI and others) find questions the model answers incorrectly and train it to respond "I don't know" to those, rather than training it on the correct answers.
In a way this is obvious in hindsight, but it goes against ML engineers' natural tendency when detecting a wrong answer: teaching the model the right answer.
Instead of teaching the model to recognize what it doesn't know, why not teach it using those same examples? Of course the idea is to "connect the unused uncertainty neuron", which makes sense for out-of-context generalization. But we can at least appreciate why this wasn't an obvious thing to do for generation 1 LLMs.
sylware
It is sad to see that much attention given to LLMs in comparison to other types of AI, like those doing maths (strapped to a formal solver), folding proteins, etc.
We had a talk about those physics AIs using those maths AIs to design hard mathematical models to fit fundamental physics data.
wolfhumble
I haven't watched the video, but was wondering about the Tokenization part from the TL;DR:
"|" "View" "ing" "Single"
Just looking at the text being tokenized in the linked article, it looked (to me) like the text was "I View", but the "I" is actually a pipe "|".
From Step 3 in the link that @miletus posted in the Hacker News comment: https://x.com/0xmetaschool/status/1888873667624661455 the text that is being tokenized is:
|Viewing Single (Post From) . . .
The capitals used (View, Single) also make more sense when you see this part of the sentence.