This is a GPT model I built and trained from scratch following "Build a Large Language Model (From Scratch)" by Sebastian Raschka.
Every component was implemented by hand - no pre-built transformers library!
🔧 Key Components
1. Tokenization
- Text is split into tokens using BPE (Byte Pair Encoding)
- Each token is mapped to an ID from a 50,257-token vocabulary
2. Embeddings
- Token IDs are converted to dense 768-dimensional vectors
- Positional encodings are added so the model knows word order
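In PyTorch, both lookups are plain `nn.Embedding` layers and their outputs are summed (a sketch with this model's dimensions; variable names are illustrative):

```python
import torch

vocab_size, emb_dim, context_len = 50257, 768, 256

tok_emb = torch.nn.Embedding(vocab_size, emb_dim)   # token ID -> 768-dim vector
pos_emb = torch.nn.Embedding(context_len, emb_dim)  # learned positional embeddings

token_ids = torch.tensor([[15496, 11, 995]])        # batch of 1, sequence length 3
positions = torch.arange(token_ids.shape[1])        # 0, 1, 2
x = tok_emb(token_ids) + pos_emb(positions)         # positions broadcast over the batch
print(x.shape)  # torch.Size([1, 3, 768])
```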
3. Attention Mechanism
- Self-attention allows tokens to focus on relevant context
- Multi-head attention (12 heads) captures different relationships simultaneously
- Causal masking prevents "seeing the future" during training
4. Transformer Blocks
- 12 stacked layers of attention + feedforward networks
- Layer normalization stabilizes training
- Residual connections let gradients flow through all layers
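Wiring those three ideas together gives the pre-LayerNorm block below (a sketch that leans on PyTorch's built-in `nn.MultiheadAttention` for brevity, unlike the hand-coded version in the repo):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LayerNorm block: attention + feedforward, each wrapped in a residual."""
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # expand to 4x the embedding dim
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),   # project back down
        )

    def forward(self, x):
        t = x.shape[1]
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a                       # residual connection around attention
        x = x + self.ff(self.norm2(x))  # residual connection around feedforward
        return x
```

The full model stacks 12 of these blocks; the residual additions give gradients a direct path back to the earliest layers.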
5. Text Generation
- Autoregressive: generates one token at a time
- Each token is fed back as input to generate the next
- Temperature and top-k sampling control randomness
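The sampling loop above can be sketched as follows (an illustrative function, not the repo's exact code; `model` is any callable mapping token IDs to logits):

```python
import torch

@torch.no_grad()
def generate(model, ids, max_new_tokens, context_len=256, temperature=1.0, top_k=50):
    """Autoregressive sampling: generate one token, append it, repeat."""
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_len:])    # (batch, seq, vocab), crop to context
        logits = logits[:, -1, :] / temperature  # last position; higher temp = more random
        if top_k is not None:
            # Keep only the k most likely tokens; mask out the rest
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits = logits.masked_fill(logits < kth, float("-inf"))
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)   # feed the new token back in
    return ids
```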
📊 Model Specifications
- Architecture: GPT-2 small (the "124M" configuration)
- Parameters: 162,419,712 trainable weights (more than 124M because the output head is not weight-tied to the token embedding)
- Layers: 12 transformer blocks
- Attention Heads: 12 per layer
- Embedding Dimension: 768
- Context Length: 256 tokens
- Vocabulary Size: 50,257 tokens (GPT-2 tokenizer)
- Training Data: "The Verdict" by Edith Wharton
- Training Time: ~10 minutes (10 epochs)
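The parameter count follows directly from the specs above (assuming the book's defaults: no bias on the q/k/v projections, a bias-free output head untied from the token embedding, and a 4x feedforward expansion):

```python
V, D, T, L = 50257, 768, 256, 12              # vocab, emb dim, context, layers

tok_emb = V * D                               # token embedding table
pos_emb = T * D                               # positional embedding table
attn = 3 * D * D + (D * D + D)                # q,k,v (no bias) + output projection
ff = (D * 4 * D + 4 * D) + (4 * D * D + D)    # 4x-expansion feedforward
norms = 2 * 2 * D                             # two LayerNorms (scale + shift each)
block = attn + ff + norms

total = tok_emb + pos_emb + L * block + 2 * D + V * D  # + final norm + output head
print(f"{total:,}")  # 162,419,712
```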
🛠️ Built With
- Backend: Python, PyTorch, FastAPI
- Frontend: Vanilla JavaScript, HTML5, CSS3
- Hosting: Hugging Face Spaces (backend), GitHub Pages (frontend)
- Implementation: Hand-coded from scratch (no transformers library!)