## Understanding the impact of training data on model quality
| Metric | TroelsLLM-Scratch | TroelsLLM-Books | Improvement |
|---|---|---|---|
| Training Tokens | 5,145 | 360,000 | ~70x more |
| Books/Sources | 1 story | 3 classics | 3x diversity |
| Genres | Short story | Romance, Fantasy, Mystery | Multi-genre |
| Vocabulary Exposure | Limited | Rich & diverse | ⬆️⬆️⬆️ |
| Training Time | 10 minutes | 5-6 hours | 30x longer |
| Train Loss | 2.5 | 4.4 | Higher (more complex) |
| Val Loss | 3.5 | 5.9 | Higher (more complex) |
| Output Quality | Story-specific phrases | Diverse, coherent text | ⬆️⬆️⬆️ |
This is counterintuitive but normal!
Higher loss on more diverse data doesn't mean worse quality; it means the model has more patterns to learn. Think of it like exam difficulty: scoring 70% on a wide-ranging test reflects more ability than scoring 90% on a quiz about a single short story.
The key metric is output quality, not the loss value. The books model generates significantly better, more diverse text despite its higher loss.
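One way to see why the loss floor rises with diversity: cross-entropy loss is measured in nats per token, and no model can beat the entropy of its training data. A quick stdlib illustration; the branching factors 12 and 300 below are invented for illustration, not measured from these corpora:

```python
import math

def uniform_entropy(num_choices: int) -> float:
    """Best achievable loss (nats/token) if each position has
    num_choices equally likely continuations."""
    return math.log(num_choices)

# A single repetitive story has few plausible next tokens per position;
# three novels in different genres have many more (made-up numbers).
one_story = uniform_entropy(12)     # ~2.48 nats
three_books = uniform_entropy(300)  # ~5.70 nats
```

So a richer corpus raises the floor that the loss converges toward, which is why the books model's 5.9 validation loss can coexist with clearly better output.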
**Prompt:** "It is a truth universally"

TroelsLLM-Scratch:
"It is a truth universally Gisburn rather a cheap genius..."
→ Falls back on phrases from its single training story

TroelsLLM-Books:
"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife..."
→ Recognizes and continues Pride & Prejudice!

**Prompt:** "Alice was beginning"

TroelsLLM-Scratch:
"Alice was beginning the painting was Gisburn..."
→ Confused, mixing unrelated concepts

TroelsLLM-Books:
"Alice was beginning to get very tired of sitting by her sister on the bank..."
→ A coherent continuation of Alice in Wonderland!

**Prompt:** "Sherlock Holmes"

TroelsLLM-Scratch:
"Sherlock Holmes the verdict was..."
→ No context for this character

TroelsLLM-Books:
"Sherlock Holmes took his bottle from the corner of the mantelpiece..."
→ Generates in Conan Doyle's style!
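The difference in these completions comes down to which next tokens each model assigns high probability; at generation time those probabilities are sampled. A minimal, stdlib-only sketch of temperature plus top-k sampling, the standard decoding recipe for GPT-style models (the logits below are made up, not from either model):

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=0.8, top_k=3, rng=random):
    """Pick one token id: sharpen logits by temperature, restrict to
    the top_k highest-scoring ids, then sample proportionally."""
    scaled = [l / temperature for l in logits]
    top_ids = sorted(range(len(scaled)), key=lambda i: scaled[i],
                     reverse=True)[:top_k]
    probs = softmax([scaled[i] for i in top_ids])
    return rng.choices(top_ids, weights=probs, k=1)[0]

# Made-up logits over a 6-token vocabulary: a model trained on
# Pride & Prejudice strongly favors token 2 ("acknowledged").
logits = [0.1, 1.0, 4.0, 0.5, 2.5, -1.0]
```

Lowering the temperature makes sampling nearly greedy (token 2 almost every time); raising it spreads probability across the top-k candidates, trading coherence for variety.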
This comparison demonstrates fundamental AI scaling laws in miniature.
Why frontier labs need massive compute: reaching GPT-4-level quality requires billions (not thousands) of tokens and thousands of GPUs. This project shows the same principles at small scale.
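As a rough sketch of those scaling laws, the Chinchilla fit from Hoffmann et al. (2022) predicts loss from parameter count N and token count D. Note what it does and doesn't say: it assumes a fixed data distribution, so it predicts loss falling as tokens increase; the higher loss observed in this project reflects a harder, more diverse distribution, not a violation of the law. The constants are the published fits; the N and D values below are illustrative.

```python
# Chinchilla-style scaling law: L(N, D) = E + A/N^alpha + B/D^beta,
# with the fitted constants reported by Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss (nats/token) for a model with n_params
    parameters trained on n_tokens tokens of a fixed distribution."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Same (hypothetical) 1M-parameter model, ~70x more tokens:
tiny_data = predicted_loss(1e6, 5_145)
more_data = predicted_loss(1e6, 360_000)
```

On a fixed distribution, more tokens always helps, and loss asymptotes toward the irreducible term E no matter how much compute is spent; this is why frontier labs scale N and D together rather than either alone.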
- Implemented and trained a transformer architecture from scratch
- Demonstrated the impact of training data on model quality (~70x more tokens)
- Balanced learning value against compute cost (3 books vs the full Gutenberg corpus)
- Made pragmatic tradeoffs (5 hours of training vs 200)
Both models were built with PyTorch and deployed with FastAPI on Hugging Face Spaces.