📊 Model Comparison

Understanding the impact of training data on model quality

BASELINE

TroelsLLM-Scratch

Training Data: 5,145 tokens
Source: "The Verdict" (one short story)
Training Time: 10 minutes
Final Loss: Train 2.5 | Val 3.5
Purpose: Demonstrate architecture understanding

⭐ IMPROVED

TroelsLLM-Books

Training Data: 360,000 tokens (~70x more!)
Source: 3 classic books (diverse genres)
Training Time: 5-6 hours
Final Loss: Train 4.4 | Val 5.9
Purpose: Demonstrate scaling understanding

📈 Detailed Comparison

| Metric | TroelsLLM-Scratch | TroelsLLM-Books | Improvement |
|---|---|---|---|
| Training Tokens | 5,145 | 360,000 | ~70x more |
| Books/Sources | 1 story | 3 classics | 3x diversity |
| Genres | Short story | Romance, Fantasy, Mystery | Multi-genre |
| Vocabulary Exposure | Limited | Rich & diverse | ⬆️⬆️⬆️ |
| Training Time | 10 minutes | 5-6 hours | 30-36x longer |
| Train Loss | 2.5 | 4.4 | Higher (more complex data) |
| Val Loss | 3.5 | 5.9 | Higher (more complex data) |
| Output Quality | Story-specific phrases | Diverse, coherent text | ⬆️⬆️⬆️ |

🤔 Why is Loss Higher with More Data?

This is counterintuitive but normal!

Higher loss with more diverse data doesn't mean worse quality - it means the model has more patterns to learn. Think of it like memorizing one short story versus studying three full novels: the single story is easy to fit almost perfectly, while the novels present far more vocabulary, styles, and plots to predict.

The key metric is output quality, not loss value. The books model generates significantly better, more diverse text despite its higher loss.
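One way to see why diverse data has a higher loss floor: a model's cross-entropy loss can never drop below the entropy of its training distribution, and more varied text has higher entropy. The toy texts below are illustrative stand-ins, not the actual training corpora:

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Empirical word-level entropy in nats: the loss floor even for
    a perfect unigram model of this text."""
    words = text.split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy stand-ins: a repetitive single story vs. a varied multi-book mix.
one_story = "the verdict was clear the verdict was clear the verdict was clear"
three_books = ("a truth universally acknowledged alice was beginning "
               "to get tired holmes took his bottle from the mantelpiece")

print(unigram_entropy(one_story))    # low: few distinct patterns to predict
print(unigram_entropy(three_books))  # higher: more patterns, higher loss floor
```

The repetitive text scores much lower entropy than the diverse one, so a model trained on it can reach a lower loss without being better at language.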

📝 Sample Outputs

Prompt: "It is a truth universally"

TroelsLLM-Scratch:
"It is a truth universally Gisburn rather a cheap genius..."
→ Defaults to training story phrases

TroelsLLM-Books:
"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife..."
→ Recognizes and continues Pride & Prejudice!

Prompt: "Alice was beginning"

TroelsLLM-Scratch:
"Alice was beginning the painting was Gisburn..."
→ Confused, mixing unrelated concepts

TroelsLLM-Books:
"Alice was beginning to get very tired of sitting by her sister on the bank..."
→ Coherent continuation from Alice in Wonderland!

Prompt: "Sherlock Holmes"

TroelsLLM-Scratch:
"Sherlock Holmes the verdict was..."
→ No context for this character

TroelsLLM-Books:
"Sherlock Holmes took his bottle from the corner of the mantelpiece..."
→ Generates in Conan Doyle's style!
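The demo generates these continuations autoregressively from the transformer; a toy word-bigram version of the same loop (a simplification, not the actual model) shows why the scratch model keeps falling back into its one story: with so little data, nearly every context has only one continuation it has ever seen.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text: str):
    """Count word bigrams - a toy stand-in for the transformer's
    learned next-token distribution."""
    model = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def generate(model, prompt: str, n_tokens: int = 8, seed: int = 0) -> str:
    """Autoregressive loop: look at the last word, sample the next,
    append, repeat - the same shape as the demo's sampler."""
    rng = random.Random(seed)
    out = prompt.split()
    for _ in range(n_tokens):
        options = model.get(out[-1])
        if not options:  # unseen context: nothing learned to continue with
            break
        words, counts = zip(*options.items())
        out.append(rng.choices(words, weights=counts, k=1)[0])
    return " ".join(out)

# Trained on a single short text, every known prompt collapses back into it:
story = "the verdict was that Gisburn was rather a cheap genius"
model = train_bigram(story)
print(generate(model, "the"))  # echoes the training story
```

With 360,000 tokens across three books, the same loop has many plausible continuations per context, which is why the books model can pick up Austen, Carroll, or Conan Doyle depending on the prompt.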

💡 Key Insight: Scaling Laws

This comparison demonstrates a fundamental AI scaling law: loss and output quality improve predictably as training data, parameters, and compute grow.

Why frontier labs need massive compute: To reach GPT-4 level quality requires billions (not thousands) of tokens and thousands of GPUs. This project shows the principles at small scale.
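The data-scaling trend can be made concrete with the loss form fitted in the Chinchilla paper (Hoffmann et al., 2022): L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is training tokens. A sketch using the published fits; note these constants were fitted on large-scale runs, and the 124M parameter count below is an illustrative assumption, not this project's actual model size:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Fitted scaling law from Hoffmann et al. (2022):
    L(N, D) = E + A / N**alpha + B / D**beta  (loss in nats)."""
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params ** alpha + B / n_tokens ** beta

# More tokens shrink the data term - the trend this project shows in miniature.
for tokens in (5_145, 360_000, 1e9):
    print(f"{tokens:>15,.0f} tokens -> predicted loss {chinchilla_loss(124e6, tokens):.2f}")
```

The predicted loss keeps falling as tokens grow toward the billions, which is exactly why frontier labs spend so much compute on data scale.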

🎯 What This Demonstrates

Technical Depth

Implemented and trained transformer architecture from scratch

Scaling Understanding

Demonstrated data impact on model quality (~70x increase in training tokens)

Resource Management

Balanced learning value vs compute cost (3 books vs full Gutenberg)

Product Thinking

Made pragmatic tradeoffs (5 hours vs 200 hours)

Both models built with PyTorch, deployed with FastAPI & Hugging Face Spaces