You chose training data, built a transformer from scratch, and watched it learn to write. Here's what actually happened under the hood — and where to go from here.
The notebook built a 10–20 million parameter GPT — the exact same architecture behind ChatGPT, Claude, and Gemini, just smaller. It tokenized raw text into numbers, embedded each token into 384 dimensions, ran it through 6 transformer layers with multi-head self-attention (6 heads), and trained for 5,000 steps of gradient descent to predict the next character. By the end, your model could generate coherent text in the style of its training data.
Notebook: Step 1
The first cell detected your GPU hardware and auto-scaled the model to match. More VRAM means a bigger model — more layers, wider embeddings, more attention heads. On an RTX 4090, you got a 384-dimensional, 6-layer, 6-head transformer (~10 M params). The same training on a personal computer would take hours instead of minutes.
What you learned
GPUs process thousands of matrix operations in parallel — essential for training neural networks at speed
Notebook: Step 2
You chose one of three corpora (tech startup pitches, Python code, or Shakespeare) or pasted your own text. Each built-in corpus is ~400 KB of raw text. The next step (Tokenize) splits this 90/10 into training and validation sets — the validation set is “held back” to test whether the model is genuinely learning patterns or just memorizing the training data.
What you learned
The quality and domain of training data determines what a language model can generate — garbage in, garbage out
Notebook: Step 3
Every unique character in the text gets a number: a=0, b=1, etc. The notebook builds an encode() function (text → numbers) and a decode() function (numbers → text). Your vocabulary was typically 65–95 characters. GPT-4 uses ~100,000 tokens (subword, not character-level), but the principle is identical.
What you learned
GPUs can only do math with numbers — tokenization converts text to integers so the model can process it
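The Step 3 mapping can be sketched in a few lines of plain Python (using a stand-in string for the corpus; the notebook builds the same tables from the full ~400 KB text):

```python
# Character-level tokenizer: every unique character gets an integer id.
text = "hello world"  # stand-in for the real corpus
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)
```

decode(encode(s)) round-trips any string whose characters appear in the corpus; a character outside the vocabulary would raise a KeyError, which is why the tables are built from the full training text.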
Notebook: Step 4
You built a multi-layer transformer decoder from scratch in PyTorch. Each layer contains:
Multi-Head Self-Attention
Each position computes Query, Key, and Value vectors. Attention = softmax(Q·Kᵀ/√d_k)·V. Six heads run in parallel, each learning different patterns.
Feed-Forward Network
A two-layer MLP (expand 4× then project back) applied independently at each position. This is where the model stores learned knowledge.
What you learned
Transformers are built from repeated blocks of self-attention + feed-forward layers — the same pattern at every scale
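The feed-forward sub-layer described above (expand 4×, nonlinearity, project back) can be sketched in dependency-free Python, with weights passed in as plain lists; these are hypothetical names, not the notebook's actual code, which uses nn.Linear:

```python
import math

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand d -> 4d, apply GELU, project 4d -> d.
    x is one position's vector; W1 is d x 4d, W2 is 4d x d (plain lists)."""
    hidden = [gelu(sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * W2[j][k] for j, h in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]
```

Because the same weights are applied independently at every position, the FFN adds capacity without mixing information across the sequence; that mixing is attention's job.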
Notebook: Step 5
The training loop ran 5,000 steps. Each step: grab a batch of text, ask the model to predict the next character at every position, measure how wrong it was (cross-entropy loss), compute gradients, and nudge all weights to be slightly less wrong.
Loss started around 4.0 (random guessing) and dropped below 1.0 (coherent text). The timelapse showed the model's output evolving from random noise → common characters → words → sentences → coherent, styled text. Your GPU processed 200,000+ characters per second in parallel.
What you learned
Training is iterative prediction and correction — thousands of loops of ‘predict, measure error, adjust weights’
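The starting loss of ~4.0 is not arbitrary: a model guessing uniformly over a ~65-character vocabulary has cross-entropy ln(65) ≈ 4.17. A quick check in plain Python:

```python
import math

def cross_entropy(probs, target):
    # negative log-probability assigned to the correct next character
    return -math.log(probs[target])

vocab_size = 65
uniform = [1.0 / vocab_size] * vocab_size  # an untrained model's best guess
initial_loss = cross_entropy(uniform, target=0)  # ln(65), about 4.17
```

Every drop below that number means the model is beating random guessing; a loss below 1.0 means it assigns the correct next character a probability above 1/e on average.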
Notebook: Step 6
The model generated text at three settings: Focused (temp=0.3, top_k=5), Balanced (temp=0.8, top_k=40), and Creative (temp=1.2, top_k=200). Temperature controls randomness — how likely the model is to pick surprising characters. top_k limits the candidate pool — how many characters it considers at each step. Together they shape the output style. Then you typed your own prompt and watched the model continue it in the style it learned.
What you learned
Text generation is sequential prediction — pick a character, feed it back, repeat. Temperature + top_k together control creativity vs safety.
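A plain-Python sketch of how the two knobs interact (`sample_probs` is a hypothetical helper, not the notebook's actual function):

```python
import math

def sample_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits into the distribution the sampler draws from."""
    scaled = [l / temperature for l in logits]           # temperature: sharpen or flatten
    if top_k is not None:                                # top_k: keep only the k best
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)                                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Low temperature concentrates probability on the top choices; top_k=1 collapses to greedy decoding regardless of temperature.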
Notebook: Step 7
The model was tested on text from its own domain (low perplexity — confident) and then on completely different domains (high perplexity — confused). A model trained on Python code is baffled by Shakespeare. This shows it learned the patterns of one dataset, not general knowledge. GPT-4 was trained on a vastly larger and more diverse corpus precisely so it generalizes across domains.
What you learned
Models become domain experts — confident in their training domain, lost outside it. Scale the data to scale the understanding.
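Perplexity, the metric behind this test, is just the exponential of the average cross-entropy; a minimal sketch:

```python
import math

def perplexity(next_char_probs):
    """next_char_probs: the probability the model assigned to each
    character that actually came next in the evaluation text."""
    avg_nll = -sum(math.log(p) for p in next_char_probs) / len(next_char_probs)
    return math.exp(avg_nll)
```

Intuitively it is the effective number of characters the model is choosing between: a confident model scores close to 1, while a confused one scores closer to the vocabulary size.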
Notebook: Step 8
The heatmaps showed which characters your model “looks at” when processing each position. Different attention heads learn different roles: one might focus on the immediately previous character, another on matching brackets or quotes, another on long-range grammatical dependencies. This specialization emerges naturally from training — nobody tells the heads what to learn.
What you learned
Attention heads specialize — some track grammar, some track meaning, some track long-range patterns
Each character asks 'what should I pay attention to?' by comparing its Query against every previous character's Key. The answer determines which Values get read. This is how the model understands context.
Attention(Q,K,V) = softmax(QKᵀ/√d_k)V. Causal masking ensures position i can only attend to positions ≤ i. Multi-head: 6 parallel attention computations with different learned projections, concatenated and projected.
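That formula, causal mask included, in dependency-free Python (single head, plain lists; illustrative only — the notebook's version is a batched PyTorch tensor computation):

```python
import math

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K, V: lists of T vectors, each of dimension d."""
    T, d = len(Q), len(Q[0])
    out = []
    for i in range(T):
        # scores against positions <= i only: the causal mask
        scores = [sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # output: attention-weighted mix of the visible Values
        out.append([sum(w * V[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out
```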
Each character maps to a 384-dimensional vector: a point in mathematical space where similar characters cluster together. The model learns these positions during training.
Token embedding: nn.Embedding(vocab_size, 384). Position embedding: nn.Embedding(block_size, 384). Both are learned parameters. The sum token_emb + pos_emb is the input to the first transformer layer.
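The same lookup-and-add, mirrored with plain lists and toy sizes (an nn.Embedding table is just a learned matrix indexed this way):

```python
import random

d_model, vocab_size, block_size = 4, 10, 8  # toy sizes; the notebook uses 384 dims
random.seed(0)
# one learned vector per token id, one per position
tok_emb = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(block_size)]

def embed(token_ids):
    # input to the first transformer layer: token embedding + position embedding
    return [[t + p for t, p in zip(tok_emb[tok], pos_emb[pos])]
            for pos, tok in enumerate(token_ids)]
```

The position table is what lets the model tell "ab" from "ba": without it, attention would see an unordered bag of characters.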
Stack 6 identical layers, each containing attention + a feed-forward network. Each layer refines the model's understanding. Deeper layers capture more abstract patterns.
Each block: LayerNorm → MultiHeadAttention → residual → LayerNorm → FFN (4× expand) → residual. Total params ≈ 10M for 384d/6L. GELU activation in the FFN.
The model generates one character at a time: predict the next, append it, feed the whole sequence back in, repeat. That's how AI 'writes': just sequential prediction.
Output logits ∈ ℝ^vocab_size → softmax → probability distribution. Temperature τ scales the logits before softmax. Sample from the distribution, append the token, repeat up to max_length.
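The full loop in plain Python, with a stand-in model (any function that returns next-token logits; this one is heavily biased toward token 0 so the output is deterministic):

```python
import math
import random

def generate(model, prompt_ids, max_new, temperature=1.0):
    """Autoregressive sampling: predict, sample, append, repeat.
    `model(ids)` returns a list of logits for the next token."""
    ids = list(prompt_ids)
    for _ in range(max_new):
        logits = model(ids)
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        # draw one token id from the distribution and feed it back
        ids.append(random.choices(range(len(probs)), weights=probs)[0])
    return ids
```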
Each training step processes a batch of text sequences in parallel, computing predictions for every position simultaneously. 5,000 steps × a full forward+backward pass each, done in minutes. Impossible without a GPU.
Batch of 64 sequences × 256 positions = 16,384 predictions per step (RTX 4090 / A100). Forward: ~2 × 10M params ≈ 20M FLOPs per token; backward ≈ 2× forward. Total: on the order of 1 TFLOP per step, × 5,000 steps.
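A back-of-envelope version of that arithmetic, using the common "forward ≈ 2 FLOPs per parameter per token" rule of thumb (an approximation; the exact figure depends on how you count):

```python
batch, block = 64, 256
params = 10_000_000

tokens_per_step = batch * block          # predictions computed per step
flops_forward = 2 * params               # per token, rule of thumb
flops_per_token = 3 * flops_forward      # backward pass costs ~2x the forward
flops_per_step = tokens_per_step * flops_per_token
total_flops = flops_per_step * 5_000     # the whole training run
```

At these sizes the run totals a few petaFLOPs, which a modern GPU sustains in minutes; a CPU doing the same work serially would take hours.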
Same architecture. Same math. Same attention mechanism. The difference is scale — and that's the entire insight.
Take a base LLM (Llama, Mistral) and fine-tune it on your own data with LoRA/QLoRA. Same transformer architecture — just 1000× bigger with subword tokenization.
From text generation to image generation. Deploy SDXL on a GPU and create images from text prompts. Different architecture (diffusion) but same GPU-accelerated training loop.
Apply gradient descent to a different problem — build a CNN that learns to recognize objects from labeled photos. Same optimizer, same loss function, different data modality.
Wrap any trained model in a FastAPI server and expose it as an inference endpoint. Deploy on DAIRX with a custom Docker image for production use.
Built by DAIRX — The AI Resource Exchange.