## Understanding the impact of training data on model quality
| Metric | TroelsLLM-Scratch | TroelsLLM-Books | Improvement |
|---|---|---|---|
| Training Tokens | 5,145 | 360,000 | ~70x more |
| Books/Sources | 1 story | 3 classics | 3x diversity |
| Genres | Short story | Romance, Fantasy, Mystery | Multi-genre |
| Vocabulary Exposure | Limited | Rich & diverse | ⬆️⬆️⬆️ |
| Training Time | 10 minutes | 5-6 hours | 30x longer |
| Train Loss | 2.5 | 4.4 | Higher (more complex) |
| Val Loss | 3.5 | 5.9 | Higher (more complex) |
| Output Quality | Story-specific phrases | Diverse, coherent text | ⬆️⬆️⬆️ |
This is counterintuitive but normal!
Higher loss on more diverse data doesn't mean worse quality; it means the model has more patterns to learn. Think of it like exam difficulty: scoring 70% on a wide-ranging test reflects more ability than scoring 90% on a quiz about a single short story.
The key metric is output quality, not the loss value. The books model generates significantly better, more diverse text despite its higher loss.
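One way to see why the loss floor rises with diversity: cross-entropy loss is measured in nats per token, and no model can beat the entropy of its training data. A quick stdlib illustration; the branching factors 12 and 300 below are invented for illustration, not measured from these corpora:

```python
import math

def uniform_entropy(num_choices: int) -> float:
    """Best achievable loss (nats/token) if each position has
    num_choices equally likely continuations."""
    return math.log(num_choices)

# A single repetitive story has few plausible next tokens per position;
# three novels in different genres have many more (made-up numbers).
one_story = uniform_entropy(12)     # ~2.48 nats
three_books = uniform_entropy(300)  # ~5.70 nats
```

So a richer corpus raises the floor that the loss converges toward, which is why the books model's 5.9 validation loss can coexist with clearly better output.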
**Prompt:** "It is a truth universally"

TroelsLLM-Scratch:
"It is a truth universally Gisburn rather a cheap genius..."
→ Falls back on phrases from its single training story

TroelsLLM-Books:
"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife..."
→ Recognizes and continues Pride & Prejudice!

**Prompt:** "Alice was beginning"

TroelsLLM-Scratch:
"Alice was beginning the painting was Gisburn..."
→ Confused, mixing unrelated concepts

TroelsLLM-Books:
"Alice was beginning to get very tired of sitting by her sister on the bank..."
→ A coherent continuation of Alice in Wonderland!

**Prompt:** "Sherlock Holmes"

TroelsLLM-Scratch:
"Sherlock Holmes the verdict was..."
→ No context for this character

TroelsLLM-Books:
"Sherlock Holmes took his bottle from the corner of the mantelpiece..."
→ Generates in Conan Doyle's style!
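The difference in these completions comes down to which next tokens each model assigns high probability; at generation time those probabilities are sampled. A minimal, stdlib-only sketch of temperature plus top-k sampling, the standard decoding recipe for GPT-style models (the logits below are made up, not from either model):

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=0.8, top_k=3, rng=random):
    """Pick one token id: sharpen logits by temperature, restrict to
    the top_k highest-scoring ids, then sample proportionally."""
    scaled = [l / temperature for l in logits]
    top_ids = sorted(range(len(scaled)), key=lambda i: scaled[i],
                     reverse=True)[:top_k]
    probs = softmax([scaled[i] for i in top_ids])
    return rng.choices(top_ids, weights=probs, k=1)[0]

# Made-up logits over a 6-token vocabulary: a model trained on
# Pride & Prejudice strongly favors token 2 ("acknowledged").
logits = [0.1, 1.0, 4.0, 0.5, 2.5, -1.0]
```

Lowering the temperature makes sampling nearly greedy (token 2 almost every time); raising it spreads probability across the top-k candidates, trading coherence for variety.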
This comparison demonstrates fundamental AI scaling laws in miniature.
Why frontier labs need massive compute: reaching GPT-4-level quality requires billions (not thousands) of tokens and thousands of GPUs. This project shows the same principles at small scale.
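As a rough sketch of those scaling laws, the Chinchilla fit from Hoffmann et al. (2022) predicts loss from parameter count N and token count D. Note what it does and doesn't say: it assumes a fixed data distribution, so it predicts loss falling as tokens increase; the higher loss observed in this project reflects a harder, more diverse distribution, not a violation of the law. The constants are the published fits; the N and D values below are illustrative.

```python
# Chinchilla-style scaling law: L(N, D) = E + A/N^alpha + B/D^beta,
# with the fitted constants reported by Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss (nats/token) for a model with n_params
    parameters trained on n_tokens tokens of a fixed distribution."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Same (hypothetical) 1M-parameter model, ~70x more tokens:
tiny_data = predicted_loss(1e6, 5_145)
more_data = predicted_loss(1e6, 360_000)
```

On a fixed distribution, more tokens always helps, and loss asymptotes toward the irreducible term E no matter how much compute is spent; this is why frontier labs scale N and D together rather than either alone.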
- Implemented and trained a transformer architecture from scratch
- Demonstrated the impact of training data on model quality (~70x more tokens)
- Balanced learning value against compute cost (3 books vs the full Gutenberg corpus)
- Made pragmatic tradeoffs (5 hours of training vs 200)
Both models were built with PyTorch and deployed with FastAPI on Hugging Face Spaces.