Model Specifications
Context length: 256 tokens
Data Flow Through the Model
Input Text: "The painting was"
                    │
┌───────────────────────────────────────────┐
│ 1. TOKENIZATION (BPE)                     │
│    "The"      → 464                       │
│    "painting" → 12927                     │
│    "was"      → 373                       │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 2. TOKEN EMBEDDINGS (768-dim vectors)     │
│    464 → [0.23, -0.45, 0.12, ...]         │
│    + POSITIONAL ENCODINGS                 │
│      pos_0, pos_1, pos_2                  │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 3. TRANSFORMER BLOCKS (12 layers)         │
│                                           │
│   ┌─────────────────────────────────────┐ │
│   │ Layer 1                             │ │
│   │  • LayerNorm                        │ │
│   │  • Multi-Head Attention (12 heads)  │ │
│   │  • Residual Connection              │ │
│   │  • LayerNorm                        │ │
│   │  • FeedForward (768 → 3072 → 768)   │ │
│   │  • Residual Connection              │ │
│   └─────────────────────────────────────┘ │
│                    │                      │
│   ┌─────────────────────────────────────┐ │
│   │ Layer 2 (same structure)            │ │
│   └─────────────────────────────────────┘ │
│                    │                      │
│                   ...                     │
│                    │                      │
│   ┌─────────────────────────────────────┐ │
│   │ Layer 12                            │ │
│   └─────────────────────────────────────┘ │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 4. FINAL LAYER NORM                       │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 5. OUTPUT PROJECTION (768 → 50,257)       │
│    Logits for each token in vocabulary    │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 6. SOFTMAX → PROBABILITIES                │
│    P("beautiful") = 0.23                  │
│    P("destroyed") = 0.15                  │
│    P("hanging")   = 0.12                  │
│    ...                                    │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 7. SAMPLING (top-k=50, temp=0.7)          │
│    Selected token: "hanging"              │
└───────────────────────────────────────────┘
                    │
Output: "The painting was hanging"
                    │
(Feed back as input for next token...)
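Steps 5-7 and the feedback loop can be sketched in PyTorch. This is a minimal illustration, not the project's actual code: `DummyLM` and the prompt ids are toy stand-ins for the trained model and the BPE tokenizer output.

```python
import torch

def generate(model, token_ids, max_new_tokens=50, top_k=50, temperature=0.7):
    """Sample one token at a time, feeding each new token back as input."""
    for _ in range(max_new_tokens):
        logits = model(token_ids)[:, -1, :]            # 5. logits at the last position
        logits = logits / temperature                  # temperature < 1 sharpens the distribution
        top_vals, _ = torch.topk(logits, top_k)
        logits[logits < top_vals[:, [-1]]] = -float("inf")   # keep only the top-k candidates
        probs = torch.softmax(logits, dim=-1)          # 6. softmax -> probabilities
        next_id = torch.multinomial(probs, num_samples=1)    # 7. sample one token
        token_ids = torch.cat([token_ids, next_id], dim=1)   # feed back as input
    return token_ids

# Toy stand-in for the trained model: maps token ids to logits over a 100-token vocabulary
class DummyLM(torch.nn.Module):
    def __init__(self, vocab_size=100):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, vocab_size)
    def forward(self, ids):
        return self.emb(ids)

torch.manual_seed(0)
prompt = torch.tensor([[4, 27, 73]])                   # stand-in token ids
out = generate(DummyLM(), prompt, max_new_tokens=5)
print(out.shape)   # torch.Size([1, 8]) -- 3 prompt tokens + 5 generated
```

A lower temperature or smaller top-k makes the output more deterministic; temperature 0.7 with top-k 50 is a common middle ground between repetitive and incoherent text.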
Attention Mechanism Explained
Query: "What am I looking for?"
Each token creates a query vector asking what information it needs.
Key: "What information do I have?"
Each token creates a key vector describing what it offers.
Value: "Here's my actual content"
Each token has a value vector with its information.
Attention Scores: Query · Key → Scores
Dot product between queries and keys determines relevance.
Higher score = more relevant.
Softmax: Scores → Weights (sum to 1.0)
Convert scores to probabilities.
Weighted Sum: Weights × Values → Output
Each token gets a weighted combination of all values,
focusing more on relevant tokens.
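The five steps above translate almost line-for-line into PyTorch. Here is a minimal single-head sketch; the tensor shapes (3 tokens, 8 dimensions) are illustrative only.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: scores -> softmax weights -> weighted sum of values."""
    d_k = Q.size(-1)
    # Query . Key: how relevant is each token to each other token (scaled by 1/sqrt(d_k))
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Softmax turns each row of scores into weights that sum to 1.0
    weights = torch.softmax(scores, dim=-1)
    # Each token's output is a weighted combination of all value vectors
    return weights @ V, weights

# Toy example: 3 tokens, one 8-dimensional head
torch.manual_seed(0)
Q, K, V = torch.randn(3, 8), torch.randn(3, 8), torch.randn(3, 8)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # torch.Size([3, 8])
print(w.sum(dim=-1))  # each row of weights sums to 1.0
```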
Multi-Head Attention
Instead of one attention mechanism, we have 12 parallel heads. Each head learns different relationships:
Head 1: Subject-verb agreement
"The dog [runs]" → focuses on "dog"
Head 2: Coreference
"John loves [his] dog" → links "his" to "John"
Head 3: Semantic similarity
"The [car] raced" → connects "car" with "raced"
Heads 4-12: Other linguistic patterns
(syntax, style, context, etc.)
All heads run in parallel, then their outputs are concatenated and projected back to 768 dimensions.
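The split-attend-concatenate-project pattern can be sketched as follows. This is a minimal illustration, not the project's hand-coded module: it omits the causal mask and dropout, and the layer names are my own.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (no causal mask, no dropout)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads            # 768 / 12 = 64 dims per head
        self.qkv = nn.Linear(d_model, 3 * d_model)  # one projection for Q, K, V
        self.proj = nn.Linear(d_model, d_model)     # project concatenated heads back

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape so each head attends independently: (B, n_heads, T, d_head)
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate heads back to (B, T, 768) and apply the output projection
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

x = torch.randn(2, 5, 768)              # batch of 2 sequences, 5 tokens each
print(MultiHeadAttention()(x).shape)    # torch.Size([2, 5, 768])
```

Note that the 12 heads are not 12 separate modules: one large matrix multiply computes all heads at once, and the reshape is what makes them independent.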
Why This Architecture Works
Residual Connections: Allow gradients to flow directly through all 12 layers during training. Without these, training deep networks would fail.
Layer Normalization: Keeps activations in a stable range (mean=0, std=1). Prevents exploding/vanishing values.
Feed-Forward Expansion: 768 → 3072 → 768 gives the model capacity to learn complex patterns. The 4× expansion is a key design choice.
Causal Masking: Each token can only attend to previous tokens, never future ones, during both training and inference. This is what makes next-token prediction a valid training objective.
Autoregressive Generation: At inference, generate one token at a time, feeding each output back as input. This allows open-ended text generation.
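These pieces combine into one transformer block. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` for brevity rather than a hand-rolled attention module, but the pre-norm layout, causal mask, 4× feed-forward expansion, and both residual connections match the structure described above.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm block: LayerNorm -> attention -> residual, LayerNorm -> FFN -> residual."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                  # 768 -> 3072 -> 768 (4x expansion)
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True above the diagonal = position i cannot attend to j > i
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]  # residual 1
        x = x + self.ffn(self.ln2(x))                                      # residual 2
        return x

x = torch.randn(2, 10, 768)
print(TransformerBlock()(x).shape)   # torch.Size([2, 10, 768])
```

Because each sublayer computes only a residual update (`x + f(x)`), the identity path runs unbroken through all 12 layers, which is exactly why gradients survive the depth.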
Training Process
1. Initialize model with random weights
        ↓
2. Load training data ("The Verdict" - 5,145 tokens)
        ↓
3. Split into batches (batch_size=2, seq_len=256)
        ↓
4. For each batch:
   ┌──────────────────────────────────────┐
   │ a. Forward pass (predict next tokens)│
   │ b. Calculate loss (cross-entropy)    │
   │ c. Backward pass (compute gradients) │
   │ d. Update weights (AdamW optimizer)  │
   └──────────────────────────────────────┘
        ↓
5. Repeat for 10 epochs
        ↓
6. Loss decreases: 11.0 → 2.5 ✓
        ↓
7. Save trained model (model.pth - 623MB)
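The inner loop of step 4 maps directly onto a few lines of PyTorch. A minimal sketch, with a toy model and random batch standing in for the real GPT and "The Verdict" data; the learning rate here is illustrative, not the project's actual setting.

```python
import torch
import torch.nn.functional as F

def train(model, batches, epochs=10, lr=4e-4):
    """Minimal next-token training loop."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for inputs, targets in batches:          # targets = inputs shifted one token right
            logits = model(inputs)               # a. forward pass: (batch, seq, vocab)
            loss = F.cross_entropy(              # b. cross-entropy over the vocabulary
                logits.flatten(0, 1), targets.flatten())
            optimizer.zero_grad()
            loss.backward()                      # c. backward pass (compute gradients)
            optimizer.step()                     # d. AdamW weight update
    return loss.item()                           # loss on the final batch

# Toy demonstration: an embedding + linear "model" and one random batch
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Embedding(50, 16), torch.nn.Linear(16, 50))
batch = (torch.randint(0, 50, (2, 8)), torch.randint(0, 50, (2, 8)))
final_loss = train(model, [batch], epochs=2)
print(f"final loss: {final_loss:.3f}")
```

The starting loss of ~11.0 is no accident: a model with random weights assigns each of the 50,257 vocabulary tokens roughly equal probability, and ln(50257) ≈ 10.8.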
Production System
┌───────────────────────────────────────────┐
│              USER BROWSER                 │
│  → Types prompt                           │
│  → Clicks "Send"                          │
└─────────────┬─────────────────────────────┘
              │ HTTPS POST
              ↓
┌───────────────────────────────────────────┐
│         GITHUB PAGES (Frontend)           │
│  → JavaScript intercepts                  │
│  → Sends to backend API                   │
└─────────────┬─────────────────────────────┘
              │ POST /generate
              ↓
┌───────────────────────────────────────────┐
│       HUGGING FACE SPACES (Backend)       │
│  → FastAPI receives request               │
│  → Loads model.pth (if not in memory)     │
│  → Tokenizes input                        │
│  → Runs forward pass (162M params)        │
│  → Generates 50 tokens (autoregressive)   │
│  → Returns JSON response                  │
└─────────────┬─────────────────────────────┘
              │ JSON response
              ↓
┌───────────────────────────────────────────┐
│              USER BROWSER                 │
│  → Displays generated text                │
│  → Shows timing ("Generated in 2.3s")     │
└───────────────────────────────────────────┘
Cold start (first request after sleep): ~10s
Warm requests: ~2-3s
Key Implementation Details
No Shortcuts: This model was built from scratch using only PyTorch primitives. No transformers library, no pre-built attention modules.
Hand-coded components:
- Multi-head scaled dot-product attention
- Positional encoding generation
- Layer normalization
- Causal attention masking
- Training loop with validation
- Text generation with sampling strategies
Mathematical correctness: Attention scores are scaled by 1/√d_k to prevent softmax saturation. Residuals ensure gradient flow. LayerNorm uses learned scale/shift parameters.
Built with PyTorch, deployed with FastAPI & Hugging Face Spaces