Intro to Large Language Models - Karpathy Lecture 1 Summary

 

📝 Notes on Intro to Large Language Models

1. What is a Large Language Model (LLM)?

  • An LLM is essentially:

    • Two files:

      • Parameters file → contains the trained weights (billions of numbers).

      • Inference code → the software that runs those weights to generate predictions (see the sketch at the end of this section).

  • Example: LLaMA 2–70B (Meta AI, 70 billion parameters).

  • Key distinction:

    • Open-source models (like LLaMA) → weights + architecture available.

    • Closed-source models (like ChatGPT) → only accessible via API, no raw weights.
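
A minimal sketch of the two-file picture, in Python. Everything here is illustrative: the file path, the 32,000-token vocabulary size (LLaMA 2's tokenizer), and the dummy next_token function standing in for the real Transformer forward pass.

```python
import numpy as np

def load_parameters(path: str) -> np.ndarray:
    # File #1: the parameters file is just a flat blob of numbers.
    # For LLaMA 2 70B stored in float16, that is about 70e9 * 2 bytes ≈ 140 GB.
    return np.fromfile(path, dtype=np.float16)

def next_token(weights: np.ndarray, tokens: list[int]) -> int:
    # Dummy stand-in (the weights are ignored here): the real inference code
    # runs a Transformer forward pass over the current tokens and the weights,
    # producing a probability distribution over the vocabulary, then samples.
    rng = np.random.default_rng(len(tokens))
    return int(rng.integers(0, 32_000))

def generate(weights: np.ndarray, tokens: list[int], n_new: int) -> list[int]:
    # File #2: the inference code just appends one predicted token at a time.
    for _ in range(n_new):
        tokens.append(next_token(weights, tokens))
    return tokens
```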


2. Parameters

  • Each parameter is just a number (weight) in the neural network.

  • For LLaMA 2–70B → 70 billion parameters (size estimate below).

  • These weights define the model’s ability to map text input → output.
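
A quick back-of-the-envelope check of what 70 billion parameters means on disk, assuming each weight is stored in float16 (2 bytes):

```python
n_params = 70e9             # LLaMA 2 70B
bytes_per_param = 2         # float16: two bytes per weight
size_gb = n_params * bytes_per_param / 1e9
print(f"{size_gb:.0f} GB")  # -> 140 GB, roughly the size of the parameters file
```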


3. How Inference Works

  • Given a prompt, the model:

    • Breaks text into tokens.

    • Uses the Transformer architecture (attention mechanism, layers).

    • Predicts the next token one step at a time, feeding each prediction back in (toy example below).

  • Example: Autocompleting “Once upon a…” → “time”.
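
A toy version of that autocomplete step, with a made-up four-word vocabulary and hand-picked scores standing in for the Transformer's real output:

```python
import numpy as np

vocab = ["time", "midnight", "dragon", "the"]   # toy vocabulary
logits = np.array([5.1, 2.3, 0.4, 1.0])         # model scores after "Once upon a ..."

probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax -> one probability per word

print(vocab[int(np.argmax(probs))])             # greedy pick -> "time"
```

In a real model the chosen token is appended to the context and the same step runs again, producing long completions one token at a time.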


4. Training

  • Training an LLM = adjusting the weights so the model gets better at predicting the next token over a huge text dataset.

  • Uses gradient descent + backpropagation (minimal loop sketched below).

  • Dataset size matters: trillions of tokens → better performance.

  • Compute requirement: thousands of GPUs for weeks.
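
A minimal sketch of such a training loop in PyTorch, with a toy stand-in model (embedding plus linear head) and random tokens in place of a real web-scale dataset; a real run swaps in a Transformer and scales every number up by many orders of magnitude.

```python
import torch
import torch.nn as nn

vocab_size, dim, seq_len, batch = 1000, 64, 32, 8

# Toy "language model": token embedding followed by a linear head over the vocabulary.
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))  # fake training text
    inputs, targets = tokens[:, :-1], tokens[:, 1:]              # objective: predict the next token
    logits = model(inputs)                                       # (batch, seq_len, vocab_size)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()                                              # backpropagation
    opt.step()                                                   # gradient descent update
```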


5. Inference vs. Training

  • Training = expensive, slow, requires clusters.

  • Inference = running the frozen model on new inputs (much cheaper).

  • Optimizations: quantization, batching, GPU/TPU acceleration.
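
Quantization is the easiest of those optimizations to illustrate: store the weights as int8 plus one scale factor, and undo the mapping at run time. This is a minimal symmetric-quantization sketch, not the scheme any particular library uses.

```python
import numpy as np

weights = np.random.randn(1024, 1024).astype(np.float32)   # pretend weight matrix

scale = np.abs(weights).max() / 127.0                       # map the largest weight to 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequant = q.astype(np.float32) * scale                      # approximate weights at run time
print(weights.nbytes / q.nbytes)                            # 4.0 -> four times less memory
print(np.abs(weights - dequant).max())                      # small reconstruction error
```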


6. Model Scaling

  • Bigger models (more parameters) usually → better results.

  • But with diminishing returns beyond a certain size.

  • Scaling laws tie performance to the joint growth of dataset size, parameter count, and compute (illustrated below).
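
The usual way to state this is a power law in both parameter count N and training-token count D. The functional form below follows Chinchilla-style fits, but the constants are invented for illustration, not measured values:

```python
def predicted_loss(n_params: float, n_tokens: float,
                   E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28) -> float:
    # Illustrative constants: loss floor E plus power-law terms in N and D.
    return E + A / n_params**alpha + B / n_tokens**beta

print(predicted_loss(7e9,  2e12))   # ~2.03 for a 7B model trained on 2T tokens
print(predicted_loss(70e9, 2e12))   # ~1.93 with 10x the parameters: better, but not 10x better
```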


7. Open vs Closed Models

  • Open (LLaMA, Falcon, Mistral, etc.):

    • Researchers + devs can download the weights to fine-tune or modify the model (example below).

    • Transparency in weights and architecture.

  • Closed (ChatGPT, Claude, Gemini):

    • Only API access.

    • Easier to use, but less customizable.
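
The practical difference in one snippet: with an open model you download the weights and run (or fine-tune) them yourself, while a closed model is reachable only through the vendor's API. A sketch of the open-weights side, assuming the Hugging Face transformers library and the publicly released Mistral-7B checkpoint; running it for real needs a large GPU.

```python
from transformers import pipeline

# Downloads the open weights and runs generation locally.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-v0.1")
print(generator("Once upon a", max_new_tokens=10)[0]["generated_text"])
```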


8. Applications

  • Chatbots, summarization, code generation, translation.

  • Domain-specific fine-tuning → legal, medical, finance, etc.


9. Limitations

  • Hallucinations (the model makes up plausible but false information).

  • Bias (mirrors training data biases).

  • Large compute & memory costs.

  • Not inherently reasoning machines → pattern learners.