📝 Notes on Intro to Large Language Models
1. What is a Large Language Model (LLM)?
- An LLM is essentially two files (toy sketch below):
  - Parameters file → contains the trained weights (billions of numbers).
  - Inference code → the software that runs those weights to generate predictions.
- Example: LLaMA 2–70B (Meta AI, 70 billion parameters).
- Key distinction:
  - Open-source models (like LLaMA) → weights + architecture available.
  - Closed-source models (like ChatGPT) → only accessible via an API, no raw weights.
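A toy sketch of the two-file view, assuming nothing about the real LLaMA weight format (the file name and parameter count below are made up for illustration):

```python
# Toy stand-in for the "parameters file + inference code" split.
import numpy as np

# --- parameters file: just a blob of numbers on disk ---
np.random.rand(1_000).astype(np.float16).tofile("parameters.bin")

# --- inference code: a small program that loads the numbers and uses them ---
weights = np.fromfile("parameters.bin", dtype=np.float16)
print(f"loaded {weights.size} parameters")  # a real LLaMA 2-70B has ~70 billion
```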
2. Parameters
- Each parameter is just a number (a weight) in the neural network.
- For LLaMA 2–70B → 70 billion parameters (memory sketch below).
- These weights define the model's ability to map text input → output.
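A quick back-of-the-envelope check on what 70 billion parameters means for storage, assuming 2 bytes per parameter (fp16):

```python
# Rough size of the parameters file, assuming fp16 (2 bytes per parameter).
params = 70e9
bytes_per_param = 2                      # fp16; fp32 would be 4, int8 would be 1
size_gb = params * bytes_per_param / 1e9
print(f"~{size_gb:.0f} GB")              # ≈ 140 GB just to store the weights
```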
3. How Inference Works
- Given a prompt, the model:
  - Breaks the text into tokens.
  - Runs them through the Transformer architecture (attention mechanism, stacked layers).
  - Predicts the next token, one step at a time (decoding loop sketched below).
- Example: Autocompleting “Once upon a…” → “time”.
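A minimal sketch of that decoding loop using the Hugging Face transformers library, with a small GPT-2 checkpoint as a stand-in (a LLaMA 2–70B works the same way, just far larger):

```python
# Greedy next-token decoding, one token at a time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Once upon a", return_tensors="pt").input_ids  # text → tokens

for _ in range(5):                              # generate 5 tokens, step by step
    with torch.no_grad():
        logits = model(ids).logits              # forward pass through the Transformer
    next_id = logits[0, -1].argmax()            # greedy: most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))                 # likely continues with " time"
```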
4. Training
- Training an LLM = adjusting the weights on huge datasets.
- Uses gradient descent + backpropagation (single-step sketch below).
- Dataset size matters: trillions of tokens → better performance.
- Compute requirement: thousands of GPUs running for weeks.
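A toy version of one training step on the next-token objective; the tiny linear "model" and random tensors below are stand-ins for a real Transformer and a real corpus:

```python
# One gradient-descent step with backpropagation on a next-token-style loss.
import torch
import torch.nn.functional as F

vocab_size = 50_000
model = torch.nn.Linear(128, vocab_size)            # stand-in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

hidden = torch.randn(8, 128)                        # fake hidden states, 8 positions
targets = torch.randint(0, vocab_size, (8,))        # the "true" next tokens

logits = model(hidden)                              # predicted distribution per position
loss = F.cross_entropy(logits, targets)             # how wrong the predictions were
loss.backward()                                     # backpropagation: compute gradients
optimizer.step()                                    # gradient descent: nudge the weights
optimizer.zero_grad()
```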
5. Inference vs. Training
- Training = expensive and slow; requires large GPU clusters.
- Inference = running the frozen model on new inputs (much cheaper).
- Optimizations: quantization (sketch below), batching, GPU/TPU acceleration.
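A rough sketch of what quantization does to the weights, using a simple symmetric int8 scheme (production schemes such as per-channel or GPTQ-style quantization are more involved):

```python
# Symmetric int8 quantization: store 1 byte per weight instead of 4,
# then approximately reconstruct the weights at inference time.
import torch

w = torch.randn(4, 4)                                 # float32 weights
scale = w.abs().max() / 127                           # map the largest weight to ±127
w_int8 = torch.round(w / scale).to(torch.int8)        # 4x smaller to store
w_dequant = w_int8.float() * scale                    # approximate reconstruction

print((w - w_dequant).abs().max())                    # small quantization error
```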
6. Model Scaling
- Bigger models (more parameters) usually → better results.
- But diminishing returns beyond a certain size.
- Scaling laws tie performance to three levers scaled together: dataset size, parameter count, and compute (illustrative formula below).
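One widely cited form of such a law (from the Chinchilla paper) predicts loss from parameter count N and training tokens D as L(N, D) = E + A/N^α + B/D^β. The coefficients below are placeholders for illustration, not the published fit:

```python
# Illustrative Chinchilla-style scaling law: loss falls as parameters (N) and
# training tokens (D) grow, with diminishing returns.
def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    return E + A / N**alpha + B / D**beta

print(predicted_loss(7e9, 2e12))    # ~7B-parameter model on 2T tokens
print(predicted_loss(70e9, 2e12))   # 10x the parameters, same data → modest gain
```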
7. Open vs Closed Models
- Open (LLaMA, Falcon, Mistral, etc.):
  - Researchers + devs can fine-tune or modify.
  - Transparency in weights and architecture.
- Closed (ChatGPT, Claude, Gemini):
  - Only API access.
  - Easier to use, but less customizable (contrast sketched below).
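A sketch of the practical difference; the checkpoint name is an illustrative assumption (it pulls several GB on first run, and LLaMA 2 checkpoints on Hugging Face also require access approval), and the hosted call is pseudocode rather than any specific SDK:

```python
# Open weights: download the parameters and run or fine-tune them locally.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Closed models: you only exchange text with a hosted endpoint; the raw
# weights never leave the provider.
# response = hosted_client.complete(prompt="Hello")   # pseudocode
```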
8. Applications
- Chatbots, summarization, code generation, translation.
- Domain-specific fine-tuning → legal, medical, finance, etc.
9. Limitations
- Hallucinations (makes up facts).
- Bias (mirrors training-data biases).
- Large compute & memory costs.
- Not inherently reasoning machines → pattern learners.