Model Specifications
Context length: 256 tokens
Data Flow Through the Model
Input Text: "The painting was"
                    │
┌───────────────────────────────────────────┐
│ 1. TOKENIZATION (BPE)                     │
│    "The"      → 464                       │
│    "painting" → 12927                     │
│    "was"      → 373                       │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 2. TOKEN EMBEDDINGS (768-dim vectors)     │
│    464 → [0.23, -0.45, 0.12, ...]         │
│    + POSITIONAL ENCODINGS                 │
│      pos_0, pos_1, pos_2                  │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 3. TRANSFORMER BLOCKS (12 layers)         │
│                                           │
│   ┌─────────────────────────────────────┐ │
│   │ Layer 1                             │ │
│   │  • LayerNorm                        │ │
│   │  • Multi-Head Attention (12 heads)  │ │
│   │  • Residual Connection              │ │
│   │  • LayerNorm                        │ │
│   │  • FeedForward (768 → 3072 → 768)   │ │
│   │  • Residual Connection              │ │
│   └─────────────────────────────────────┘ │
│                    │                      │
│   ┌─────────────────────────────────────┐ │
│   │ Layer 2 (same structure)            │ │
│   └─────────────────────────────────────┘ │
│                    │                      │
│                   ...                     │
│                    │                      │
│   ┌─────────────────────────────────────┐ │
│   │ Layer 12                            │ │
│   └─────────────────────────────────────┘ │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 4. FINAL LAYER NORM                       │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 5. OUTPUT PROJECTION (768 → 50,257)       │
│    Logits for each token in vocabulary    │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 6. SOFTMAX → PROBABILITIES                │
│    P("beautiful") = 0.23                  │
│    P("destroyed") = 0.15                  │
│    P("hanging")   = 0.12                  │
│    ...                                    │
└───────────────────────────────────────────┘
                    │
┌───────────────────────────────────────────┐
│ 7. SAMPLING (top-k=50, temp=0.7)          │
│    Selected token: "hanging"              │
└───────────────────────────────────────────┘
                    │
Output: "The painting was hanging"
                    │
(Feed back as input for next token...)
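Steps 5-7 and the feedback loop can be sketched in PyTorch. This is a minimal illustration, not the project's actual code: `DummyLM` and the prompt ids are toy stand-ins for the trained model and the BPE tokenizer output.

```python
import torch

def generate(model, token_ids, max_new_tokens=50, top_k=50, temperature=0.7):
    """Sample one token at a time, feeding each new token back as input."""
    for _ in range(max_new_tokens):
        logits = model(token_ids)[:, -1, :]            # 5. logits at the last position
        logits = logits / temperature                  # temperature < 1 sharpens the distribution
        top_vals, _ = torch.topk(logits, top_k)
        logits[logits < top_vals[:, [-1]]] = -float("inf")   # keep only the top-k candidates
        probs = torch.softmax(logits, dim=-1)          # 6. softmax -> probabilities
        next_id = torch.multinomial(probs, num_samples=1)    # 7. sample one token
        token_ids = torch.cat([token_ids, next_id], dim=1)   # feed back as input
    return token_ids

# Toy stand-in for the trained model: maps token ids to logits over a 100-token vocabulary
class DummyLM(torch.nn.Module):
    def __init__(self, vocab_size=100):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, vocab_size)
    def forward(self, ids):
        return self.emb(ids)

torch.manual_seed(0)
prompt = torch.tensor([[4, 27, 73]])                   # stand-in token ids
out = generate(DummyLM(), prompt, max_new_tokens=5)
print(out.shape)   # torch.Size([1, 8]) -- 3 prompt tokens + 5 generated
```

A lower temperature or smaller top-k makes the output more deterministic; temperature 0.7 with top-k 50 is a common middle ground between repetitive and incoherent text.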
Attention Mechanism Explained
Query: "What am I looking for?"
Each token creates a query vector asking what information it needs.
Key: "What information do I have?"
Each token creates a key vector describing what it offers.
Value: "Here's my actual content"
Each token has a value vector with its information.
Attention Scores: Query · Key → Scores
Dot product between queries and keys determines relevance.
Higher score = more relevant.
Softmax: Scores → Weights (sum to 1.0)
Convert scores to probabilities.
Weighted Sum: Weights × Values → Output
Each token gets a weighted combination of all values,
focusing more on relevant tokens.
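The five steps above translate almost line-for-line into PyTorch. Here is a minimal single-head sketch; the tensor shapes (3 tokens, 8 dimensions) are illustrative only.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: scores -> softmax weights -> weighted sum of values."""
    d_k = Q.size(-1)
    # Query . Key: how relevant is each token to each other token (scaled by 1/sqrt(d_k))
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Softmax turns each row of scores into weights that sum to 1.0
    weights = torch.softmax(scores, dim=-1)
    # Each token's output is a weighted combination of all value vectors
    return weights @ V, weights

# Toy example: 3 tokens, one 8-dimensional head
torch.manual_seed(0)
Q, K, V = torch.randn(3, 8), torch.randn(3, 8), torch.randn(3, 8)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # torch.Size([3, 8])
print(w.sum(dim=-1))  # each row of weights sums to 1.0
```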
Multi-Head Attention
Instead of one attention mechanism, we have 12 parallel heads. Each head learns different relationships:
Head 1: Subject-verb agreement
"The dog [runs]" → focuses on "dog"
Head 2: Coreference
"John loves [his] dog" → links "his" to "John"
Head 3: Semantic similarity
"The [car] raced" → connects "car" with "raced"
Heads 4-12: Other linguistic patterns
(syntax, style, context, etc.)
All heads run in parallel, then their outputs are concatenated and projected back to 768 dimensions.
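The split-attend-concatenate-project pattern can be sketched as follows. This is a minimal illustration, not the project's hand-coded module: it omits the causal mask and dropout, and the layer names are my own.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (no causal mask, no dropout)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads            # 768 / 12 = 64 dims per head
        self.qkv = nn.Linear(d_model, 3 * d_model)  # one projection for Q, K, V
        self.proj = nn.Linear(d_model, d_model)     # project concatenated heads back

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape so each head attends independently: (B, n_heads, T, d_head)
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate heads back to (B, T, 768) and apply the output projection
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

x = torch.randn(2, 5, 768)              # batch of 2 sequences, 5 tokens each
print(MultiHeadAttention()(x).shape)    # torch.Size([2, 5, 768])
```

Note that the 12 heads are not 12 separate modules: one large matrix multiply computes all heads at once, and the reshape is what makes them independent.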
Why This Architecture Works
Residual Connections: Allow gradients to flow directly through all 12 layers during training. Without these, training deep networks would fail.
Layer Normalization: Keeps activations in a stable range (mean=0, std=1). Prevents exploding/vanishing values.
Feed-Forward Expansion: 768 → 3072 → 768 gives the model capacity to learn complex patterns. The 4× expansion is a key design choice.
Causal Masking: Each token can only attend to previous tokens, never future ones, during both training and inference. This is what makes next-token prediction a valid training objective.
Autoregressive Generation: At inference, generate one token at a time, feeding each output back as input. This allows open-ended text generation.
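These pieces combine into one transformer block. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` for brevity rather than a hand-rolled attention module, but the pre-norm layout, causal mask, 4× feed-forward expansion, and both residual connections match the structure described above.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm block: LayerNorm -> attention -> residual, LayerNorm -> FFN -> residual."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                  # 768 -> 3072 -> 768 (4x expansion)
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True above the diagonal = position i cannot attend to j > i
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]  # residual 1
        x = x + self.ffn(self.ln2(x))                                      # residual 2
        return x

x = torch.randn(2, 10, 768)
print(TransformerBlock()(x).shape)   # torch.Size([2, 10, 768])
```

Because each sublayer computes only a residual update (`x + f(x)`), the identity path runs unbroken through all 12 layers, which is exactly why gradients survive the depth.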
Training Process
1. Initialize model with random weights
        ↓
2. Load training data ("The Verdict" - 5,145 tokens)
        ↓
3. Split into batches (batch_size=2, seq_len=256)
        ↓
4. For each batch:
   ┌──────────────────────────────────────┐
   │ a. Forward pass (predict next tokens)│
   │ b. Calculate loss (cross-entropy)    │
   │ c. Backward pass (compute gradients) │
   │ d. Update weights (AdamW optimizer)  │
   └──────────────────────────────────────┘
        ↓
5. Repeat for 10 epochs
        ↓
6. Loss decreases: 11.0 → 2.5 ✓
        ↓
7. Save trained model (model.pth - 623MB)
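The inner loop of step 4 maps directly onto a few lines of PyTorch. A minimal sketch, with a toy model and random batch standing in for the real GPT and "The Verdict" data; the learning rate here is illustrative, not the project's actual setting.

```python
import torch
import torch.nn.functional as F

def train(model, batches, epochs=10, lr=4e-4):
    """Minimal next-token training loop."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for inputs, targets in batches:          # targets = inputs shifted one token right
            logits = model(inputs)               # a. forward pass: (batch, seq, vocab)
            loss = F.cross_entropy(              # b. cross-entropy over the vocabulary
                logits.flatten(0, 1), targets.flatten())
            optimizer.zero_grad()
            loss.backward()                      # c. backward pass (compute gradients)
            optimizer.step()                     # d. AdamW weight update
    return loss.item()                           # loss on the final batch

# Toy demonstration: an embedding + linear "model" and one random batch
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Embedding(50, 16), torch.nn.Linear(16, 50))
batch = (torch.randint(0, 50, (2, 8)), torch.randint(0, 50, (2, 8)))
final_loss = train(model, [batch], epochs=2)
print(f"final loss: {final_loss:.3f}")
```

The starting loss of ~11.0 is no accident: a model with random weights assigns each of the 50,257 vocabulary tokens roughly equal probability, and ln(50257) ≈ 10.8.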
Production System
┌───────────────────────────────────────────┐
│              USER BROWSER                 │
│  → Types prompt                           │
│  → Clicks "Send"                          │
└─────────────┬─────────────────────────────┘
              │ HTTPS POST
              ↓
┌───────────────────────────────────────────┐
│         GITHUB PAGES (Frontend)           │
│  → JavaScript intercepts                  │
│  → Sends to backend API                   │
└─────────────┬─────────────────────────────┘
              │ POST /generate
              ↓
┌───────────────────────────────────────────┐
│       HUGGING FACE SPACES (Backend)       │
│  → FastAPI receives request               │
│  → Loads model.pth (if not in memory)     │
│  → Tokenizes input                        │
│  → Runs forward pass (162M params)        │
│  → Generates 50 tokens (autoregressive)   │
│  → Returns JSON response                  │
└─────────────┬─────────────────────────────┘
              │ JSON response
              ↓
┌───────────────────────────────────────────┐
│              USER BROWSER                 │
│  → Displays generated text                │
│  → Shows timing ("Generated in 2.3s")     │
└───────────────────────────────────────────┘
Cold start (first request after sleep): ~10s
Warm requests: ~2-3s
Key Implementation Details
No Shortcuts: This model was built from scratch using only PyTorch primitives. No transformers library, no pre-built attention modules.
Hand-coded components:
- Multi-head scaled dot-product attention
- Positional encoding generation
- Layer normalization
- Causal attention masking
- Training loop with validation
- Text generation with sampling strategies
Mathematical correctness: Attention scores are scaled by 1/√d_k to prevent softmax saturation. Residuals ensure gradient flow. LayerNorm uses learned scale/shift parameters.
Built with PyTorch, deployed with FastAPI & Hugging Face Spaces