Behind the Prompt: How LLMs Actually Work
This is what I learned from watching AI researcher Andrej Karpathy's latest video
Don’t understand how LLMs work?
Neither do I.
I’ve mostly been learning on the job by using AI tools without fully grasping how they work under the hood.
That is, until the name Andrej Karpathy popped up a couple of times in my feed. He’s the one who coined the term vibe coding (winging it with AI by “feeling” it, rather than fully understanding what it’s doing). He was also a founding member of OpenAI.
Turns out, he explains this stuff better than anyone. Even though it’s already 3 months (!) old, I totally recommend watching this 2-hour video on how he uses LLMs.
But if you prefer a quick breakdown: These are my 5 takeaways that finally made things click.
1. Basically, you’re talking to a 1TB zip file
An LLM (Large Language Model) is essentially a terabyte-scale zip file of the internet: a compressed snapshot that’s distilled during training.
ChatGPT, Claude, Perplexity, Grok, etc. are called Generative Chat apps because they let you talk to an LLM with generative (= creative) capabilities.
Training happens roughly once every 6 months (depending on the model). That’s because it costs a lot of money (~US$10M) and time (~3 months) to train an LLM.
The zip file is probabilistic: if the model has seen more examples of something, it’s more likely to answer you accurately. It doesn’t think; it recalls patterns from training data.
Michael’s observation: This zip file is NOT like a human brain. 🧠 That’s a current limitation: it’s not continuously learning or updating the more you use it.
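To make “it recalls patterns, it doesn’t think” concrete, here’s a toy sketch in Python. This is NOT how an LLM works internally (real models use neural networks over billions of parameters); it just illustrates the idea that a model trained on more examples of a pattern will reproduce that pattern more confidently. The tiny “training text” is made up for the example.

```python
# Toy illustration (not a real LLM): "recalling" patterns by counting
# which word followed which in the training text. Patterns seen more
# often during training are more likely to be produced.
from collections import Counter, defaultdict

training_text = (
    "the cat sat on the mat . the cat sat on the sofa . "
    "the dog sat on the mat ."
).split()

# Count next-word frequencies - a crude stand-in for training
next_word = defaultdict(Counter)
for current, following in zip(training_text, training_text[1:]):
    next_word[current][following] += 1

# "cat" was followed by "sat" twice, so that's the most likely continuation
print(next_word["cat"].most_common(1))  # [('sat', 2)]
```

An LLM does the same thing in spirit, except over subword tokens instead of words, and with probabilities learned by a neural network instead of raw counts.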
2. Tools can give the model extra knowledge post-training
The model can’t learn more post-training. It’s frozen 🥶
But there are 2 ways to extend the LLM with additional knowledge, which Karpathy calls Quality of Life Features and Tools.
To me, that’s not entirely self-explanatory, so I find it more convenient to think of these 2 extra sources as Pre-Chat and As-You-Chat.
Pre-Chat:
Memory - Information stored about you by the Generative Chat app
Custom instructions - Your personal preferences as to how the Generative Chat app should provide its answers to you
Previous chats
As-You-Chat:
Internet search - In case the LLM doesn’t know the information you’re asking for
Code interpreter (e.g. Python) - To enable more advanced functionality, such as creating a quick web app or doing calculations
File upload - For you to provide documents, images, etc.
Which of these features are available depends on the Generative Chat app. ChatGPT is currently the most “feature-rich”.
3. The base layer of a chat is tokens
Whatever the input to the model (text, audio, image, or video), it all becomes tokens (sequences of numbers). The model doesn’t see the difference; LLMs only understand the patterns in those numbers.
Some models use text as a layer in between, while others can process e.g. audio directly as tokens (like ChatGPT and Grok)
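Here’s a toy sketch of the idea that the model only ever sees numbers. Real tokenizers (like BPE) split text into subword pieces with learned vocabularies; for simplicity, this made-up example maps each character to its Unicode code point instead.

```python
# Toy illustration: text becomes a sequence of numbers (tokens).
# Real LLM tokenizers use subword vocabularies; mapping characters to
# their Unicode code points is a simplified stand-in for the same idea.
def to_tokens(text: str) -> list[int]:
    return [ord(ch) for ch in text]

print(to_tokens("Hi!"))  # [72, 105, 33]
```

Audio and images get the same treatment in multimodal models: chopped into chunks and turned into token numbers before the model ever sees them.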
4. Chats have short-term memory
The chat has a short-term memory (called the context window). It can only hold so many tokens. Once the context window fills up, earlier info gets dropped or blurred
That’s why it’s good to keep your queries concise and start a new chat if you're switching topics 💡
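The context window behaves a lot like a fixed-size buffer. This toy sketch (the window size of 8 is made up; real models hold thousands to millions of tokens) shows how the oldest tokens simply fall out once the window is full.

```python
# Toy illustration of a context window: a fixed-size buffer of tokens.
# Once it fills up, the oldest tokens are dropped - earlier parts of
# the conversation are no longer visible to the model.
from collections import deque

CONTEXT_WINDOW = 8  # made-up size; real models hold far more tokens

context = deque(maxlen=CONTEXT_WINDOW)
for token in range(12):  # feed 12 tokens into an 8-token window
    context.append(token)

print(list(context))  # [4, 5, 6, 7, 8, 9, 10, 11] - tokens 0-3 dropped
```

That’s also why starting a fresh chat for a new topic helps: you’re not wasting the window on stale, irrelevant tokens.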
5. Reasoning models improve accuracy
China’s DeepSeek is the most notable thinking or reasoning model
Reasoning models use step-by-step thinking to solve complex queries. Like showing their work in math class. That’s why accuracy typically improves (for harder questions), but it will also take more time
Reasoning models are calibrated through Reinforcement Learning: the model basically learns thinking strategies that lead to a good outcome (as defined by humans)
Be clear about which model you need for the task you want to get done:
Normal tasks → Standard model suffices (like 4o in ChatGPT)
Need more accuracy and/or standard model can’t produce result → Use reasoning model (like o3 in ChatGPT)
For deeper problem statements → Run deep research, which is essentially web search + reasoning
Bonus Lesson ✨ Don't read a book alone anymore
Karpathy shares that he copies books into chats and asks questions while reading! Or he’ll use Claude to generate a conceptual diagram (e.g. a tree diagram) of a book’s chapter, which helps him understand an upcoming chapter more visually. This makes his reading much more like interactive studying.
Michael’s observation: Somebody should build a tool that will allow people to ask questions from books directly! As if books can talk back 😎
Follow-up questions I have
After watching the video, I wondered:
What else could I expect coming out of the AI space? Are LLMs the only type of AI? Will LLMs be enough to achieve Artificial General Intelligence (= general intelligence on par with humans)?
What are the limitations of reasoning models? I just saw that Apple may have proven that those models actually don’t reason at all. They just memorize patterns really well.