📝 Notes on Intro to Large Language Models
1. What is a Large Language Model (LLM)?
- An LLM is essentially two files (toy sketch below):
  - Parameters file → contains the trained weights (billions of numbers).
  - Inference code → the software that runs those weights to generate predictions.
- Example: LLaMA 2–70B (Meta AI, 70 billion parameters).
- Key distinction:
  - Open-source models (like LLaMA) → weights + architecture available.
  - Closed-source models (like ChatGPT) → only accessible via an API, no raw weights.
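A toy sketch of the two-file view, assuming nothing about the real LLaMA weight format (the file name and parameter count below are made up for illustration):

```python
# Toy stand-in for the "parameters file + inference code" split.
import numpy as np

# --- parameters file: just a blob of numbers on disk ---
np.random.rand(1_000).astype(np.float16).tofile("parameters.bin")

# --- inference code: a small program that loads the numbers and uses them ---
weights = np.fromfile("parameters.bin", dtype=np.float16)
print(f"loaded {weights.size} parameters")  # a real LLaMA 2-70B has ~70 billion
```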
2. Parameters
- Each parameter is just a number (a weight) in the neural network.
- For LLaMA 2–70B → 70 billion parameters (memory sketch below).
- These weights define the model's ability to map text input → output.
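A quick back-of-the-envelope check on what 70 billion parameters means for storage, assuming 2 bytes per parameter (fp16):

```python
# Rough size of the parameters file, assuming fp16 (2 bytes per parameter).
params = 70e9
bytes_per_param = 2                      # fp16; fp32 would be 4, int8 would be 1
size_gb = params * bytes_per_param / 1e9
print(f"~{size_gb:.0f} GB")              # ≈ 140 GB just to store the weights
```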
3. How Inference Works
- Given a prompt, the model:
  - Breaks the text into tokens.
  - Runs them through the Transformer architecture (attention mechanism, stacked layers).
  - Predicts the next token, one step at a time (decoding loop sketched below).
- Example: Autocompleting “Once upon a…” → “time”.
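A minimal sketch of that decoding loop using the Hugging Face transformers library, with a small GPT-2 checkpoint as a stand-in (a LLaMA 2–70B works the same way, just far larger):

```python
# Greedy next-token decoding, one token at a time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Once upon a", return_tensors="pt").input_ids  # text → tokens

for _ in range(5):                              # generate 5 tokens, step by step
    with torch.no_grad():
        logits = model(ids).logits              # forward pass through the Transformer
    next_id = logits[0, -1].argmax()            # greedy: most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))                 # likely continues with " time"
```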
4. Training
- Training an LLM = adjusting the weights on huge datasets.
- Uses gradient descent + backpropagation (single-step sketch below).
- Dataset size matters: trillions of tokens → better performance.
- Compute requirement: thousands of GPUs running for weeks.
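A toy version of one training step on the next-token objective; the tiny linear "model" and random tensors below are stand-ins for a real Transformer and a real corpus:

```python
# One gradient-descent step with backpropagation on a next-token-style loss.
import torch
import torch.nn.functional as F

vocab_size = 50_000
model = torch.nn.Linear(128, vocab_size)            # stand-in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

hidden = torch.randn(8, 128)                        # fake hidden states, 8 positions
targets = torch.randint(0, vocab_size, (8,))        # the "true" next tokens

logits = model(hidden)                              # predicted distribution per position
loss = F.cross_entropy(logits, targets)             # how wrong the predictions were
loss.backward()                                     # backpropagation: compute gradients
optimizer.step()                                    # gradient descent: nudge the weights
optimizer.zero_grad()
```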
5. Inference vs. Training
- Training = expensive and slow; requires large GPU clusters.
- Inference = running the frozen model on new inputs (much cheaper).
- Optimizations: quantization (sketch below), batching, GPU/TPU acceleration.
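A rough sketch of what quantization does to the weights, using a simple symmetric int8 scheme (production schemes such as per-channel or GPTQ-style quantization are more involved):

```python
# Symmetric int8 quantization: store 1 byte per weight instead of 4,
# then approximately reconstruct the weights at inference time.
import torch

w = torch.randn(4, 4)                                 # float32 weights
scale = w.abs().max() / 127                           # map the largest weight to ±127
w_int8 = torch.round(w / scale).to(torch.int8)        # 4x smaller to store
w_dequant = w_int8.float() * scale                    # approximate reconstruction

print((w - w_dequant).abs().max())                    # small quantization error
```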
6. Model Scaling
- Bigger models (more parameters) usually → better results.
- But diminishing returns beyond a certain size.
- Scaling laws tie performance to three levers scaled together: dataset size, parameter count, and compute (illustrative formula below).
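One widely cited form of such a law (from the Chinchilla paper) predicts loss from parameter count N and training tokens D as L(N, D) = E + A/N^α + B/D^β. The coefficients below are placeholders for illustration, not the published fit:

```python
# Illustrative Chinchilla-style scaling law: loss falls as parameters (N) and
# training tokens (D) grow, with diminishing returns.
def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    return E + A / N**alpha + B / D**beta

print(predicted_loss(7e9, 2e12))    # ~7B-parameter model on 2T tokens
print(predicted_loss(70e9, 2e12))   # 10x the parameters, same data → modest gain
```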
7. Open vs Closed Models
- Open (LLaMA, Falcon, Mistral, etc.):
  - Researchers + devs can fine-tune or modify.
  - Transparency in weights and architecture.
- Closed (ChatGPT, Claude, Gemini):
  - Only API access.
  - Easier to use, but less customizable (contrast sketched below).
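A sketch of the practical difference; the checkpoint name is an illustrative assumption (it pulls several GB on first run, and LLaMA 2 checkpoints on Hugging Face also require access approval), and the hosted call is pseudocode rather than any specific SDK:

```python
# Open weights: download the parameters and run or fine-tune them locally.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Closed models: you only exchange text with a hosted endpoint; the raw
# weights never leave the provider.
# response = hosted_client.complete(prompt="Hello")   # pseudocode
```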
8. Applications
- Chatbots, summarization, code generation, translation.
- Domain-specific fine-tuning → legal, medical, finance, etc.
9. Limitations
- Hallucinations (makes up facts).
- Bias (mirrors training-data biases).
- Large compute & memory costs.
- Not inherently reasoning machines → pattern learners.