✅ Lab Setup (5–10 min)
Step 0 — Create folder
Create: genai-section2-lab/ with:
data/
outputs/
lab_section2.ipynb (recommended)
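If you prefer to create the folders from Python, the following is a minimal sketch. It assumes the lab root contains `data/` and `outputs/` subfolders; the helper name `make_lab_dirs` is hypothetical.

```python
from pathlib import Path


def make_lab_dirs(base="genai-section2-lab"):
    """Create the lab root with data/ and outputs/ subfolders (safe to re-run)."""
    root = Path(base)
    for sub in ("data", "outputs"):
        # parents=True creates the root too; exist_ok=True makes this idempotent
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Call `make_lab_dirs()` once, then create the notebook inside the returned folder.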
Step 1 — Install packages
pip install torch transformers tokenizers datasets accelerate sentencepiece matplotlib numpy pandas scikit-learn
Step 2 — Verify GPU (optional)
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
🧪 Part A — Anatomy of Transformers (2.1)
A1) Visualize Self-Attention on a Sentence
Goal: Extract real attention weights from a transformer and visualize which tokens attend to which.
Step A1.1 — Load a small model
Use: bert-base-uncased (encoder) OR distilbert-base-uncased (lighter).
Step A1.2 — Prepare a sentence
Example:
“The mechanic inspected the engine because it was noisy.”
Step A1.3 — Run model with output_attentions=True
Tokenize sentence
Forward pass
Collect attention tensors:
(layers, heads, seq, seq)
Step A1.4 — Plot one attention head as heatmap
Pick: layer 1, head 0
Plot token-to-token attention map
Save image to outputs/attention_heatmap.png
✅ Deliverable: 1 heatmap + 3 bullet observations (e.g., which token attends to “engine” / “noisy”).
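Steps A1.1 to A1.4 could be sketched as below. This is one possible approach, not the only one. The function names are hypothetical, `layer=1, head=0` is read as 0-based indexing (an assumption), and the `outputs/` folder is assumed to exist. Library imports sit inside the functions so the pure-Python helper `top_attended` can be used on its own.

```python
def load_attentions(sentence, model_name="distilbert-base-uncased"):
    """Return (tokens, attn); attn is a NumPy array shaped (layers, heads, seq, seq)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Assumption: if outputs.attentions comes back as None on your transformers
    # version, try from_pretrained(model_name, attn_implementation="eager").
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    # outputs.attentions: one (batch, heads, seq, seq) tensor per layer
    attn = torch.stack(outputs.attentions).squeeze(1)  # drop the batch dimension
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens, attn.numpy()


def plot_head(tokens, attn, layer=1, head=0, path="outputs/attention_heatmap.png"):
    """Save one attention head as a token-to-token heatmap."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 5))
    image = ax.imshow(attn[layer][head], cmap="viridis")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_xlabel("attended-to token")
    ax.set_ylabel("attending token")
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)


def top_attended(tokens, matrix, query, k=3):
    """Return the k (token, weight) pairs that `query` attends to most strongly.

    Uses the first occurrence of `query` if the token appears more than once.
    """
    row = [float(w) for w in matrix[tokens.index(query)]]
    ranked = sorted(zip(tokens, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

In the notebook, `tokens, attn = load_attentions("The mechanic inspected the engine because it was noisy.")` followed by `plot_head(tokens, attn)` writes the heatmap, and `top_attended(tokens, attn[1][0], "it")` gives raw material for the bullet observations.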
A2) Understand Positional Encoding (By Breaking It)
Goal: Prove positions matter by scrambling token order and observing output changes.
Step A2.1 — Create two inputs
Original: “The cat sat on the mat”
Scrambled: “Mat the on sat cat the”
Step A2.2 — Compare embeddings (encoder model)
Extract last hidden state for both
Compute cosine similarity between:
Sentence embeddings (mean pooling)
Token embeddings
Step A2.3 — Explain outcome
Same words but different order → different representations
✅ Deliverable: similarity scores + short explanation: “Why position changes meaning.”
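The comparison in Step A2.2 might look like the sketch below. It assumes `distilbert-base-uncased` as the encoder and compares each shared token using its first occurrence in each sentence; both are simplifications for illustration, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pool(vectors):
    """Average a list of token vectors into one sentence vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def token_states(sentence, model_name="distilbert-base-uncased"):
    """Return (tokens, last hidden states as a list of per-token vectors)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq, hidden)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens, hidden.tolist()


def compare(original, scrambled):
    """Return (sentence-level similarity, {token: token-level similarity})."""
    tok_a, vec_a = token_states(original)
    tok_b, vec_b = token_states(scrambled)
    # Mean pooling here includes [CLS] and [SEP]; fine for a single unpadded sentence
    sentence_sim = cosine(mean_pool(vec_a), mean_pool(vec_b))
    shared = [t for t in dict.fromkeys(tok_a) if t in tok_b and not t.startswith("[")]
    token_sims = {t: cosine(vec_a[tok_a.index(t)], vec_b[tok_b.index(t)]) for t in shared}
    return sentence_sim, token_sims
```

Calling `compare("The cat sat on the mat", "Mat the on sat cat the")` returns the two kinds of scores asked for in the deliverable. Token-level scores below 1.0 for the same word are the evidence that position and context changed the representation.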
A3) Encoder vs Decoder Architecture (Hands-On)
Goal: Experience the difference between “understanding” models and “generation” models.
Step A3.1 — Encoder task (BERT): Fill-Mask
Input:
“Transformers are [MASK] at understanding context.”
Use BERT fill-mask pipeline
Record top 5 predictions
Step A3.2 — Decoder task (GPT2): Generate
Prompt:
“Transformers are powerful because”
Use GPT2 text-generation pipeline
Generate 3 outputs at temperature 0.7
Step A3.3 — Compare in writing
Encoder: bidirectional understanding
Decoder: left-to-right generation
✅ Deliverable: a short comparison paragraph + outputs.
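Both tasks can be run with the Hugging Face `pipeline` helper. The sketch below assumes the `bert-base-uncased` and `gpt2` checkpoints; `max_new_tokens=30` and the fixed seed are arbitrary choices for illustration, and the function names are hypothetical.

```python
def run_fill_mask(text="Transformers are [MASK] at understanding context.", top_k=5):
    """Return the top_k (token, score) predictions for the single [MASK] slot."""
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]


def run_generation(prompt="Transformers are powerful because", n=3, temperature=0.7):
    """Return n sampled continuations of the prompt."""
    from transformers import pipeline, set_seed

    set_seed(0)  # assumption: fixed seed so the samples are repeatable
    generate = pipeline("text-generation", model="gpt2")
    outputs = generate(
        prompt,
        max_new_tokens=30,
        do_sample=True,          # sampling is required for temperature to matter
        temperature=temperature,
        num_return_sequences=n,
    )
    return [o["generated_text"] for o in outputs]


def format_report(predictions, generations):
    """Build a plain-text block with both sets of outputs for the write-up."""
    lines = ["Encoder (fill-mask) top predictions:"]
    lines += [f"  {i}. {tok} ({score:.3f})" for i, (tok, score) in enumerate(predictions, 1)]
    lines.append("Decoder (text-generation) samples:")
    lines += [f"  {i}. {text}" for i, text in enumerate(generations, 1)]
    return "\n".join(lines)
```

`print(format_report(run_fill_mask(), run_generation()))` produces the outputs to paste above your comparison paragraph.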
🧪 Part B — Tokens, Embeddings & Context Windows (2.2)
B1) Tokenization Strategies: Word vs Subword
Goal: See why subword tokenization exists.
Step B1.1 — Compare 2 tokenizers
Use:
GPT-2 tokenizer (BPE)
BERT tokenizer (WordPiece)
Step B1.2 — Test words that break naïve tokenizers
Try:
“unbelievable”
“internationalization”
“electroencephalography”
“bioinformatics”
A typo word: “mecahnical”
Step B1.3 — Print token lists + token counts
Count tokens per word for each tokenizer
Save results in a table
✅ Deliverable: tokenization table + 2 observations:
“Which tokenizer splits more?”
“How does a typo affect the tokens?”
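One way to build the table is sketched below. It assumes the `gpt2` and `bert-base-uncased` tokenizers; the output path `outputs/tokenization_table.csv` and the function names are hypothetical.

```python
import csv

WORDS = [
    "unbelievable",
    "internationalization",
    "electroencephalography",
    "bioinformatics",
    "mecahnical",  # deliberate typo from Step B1.2
]


def tokenize_words(words=WORDS):
    """Return one row per (word, tokenizer) with the token list and count."""
    from transformers import AutoTokenizer

    tokenizers = {
        "gpt2 (BPE)": AutoTokenizer.from_pretrained("gpt2"),
        "bert-base-uncased (WordPiece)": AutoTokenizer.from_pretrained("bert-base-uncased"),
    }
    rows = []
    for word in words:
        for name, tokenizer in tokenizers.items():
            # Note: GPT-2 splits a word differently when it follows a space
            pieces = tokenizer.tokenize(word)
            rows.append({"word": word, "tokenizer": name, "tokens": pieces, "count": len(pieces)})
    return rows


def save_table(rows, path="outputs/tokenization_table.csv"):
    """Write the rows to CSV, joining each token list with spaces."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["word", "tokenizer", "tokens", "count"])
        writer.writeheader()
        for row in rows:
            writer.writerow({**row, "tokens": " ".join(row["tokens"])})
```

`save_table(tokenize_words())` writes the table; comparing the `count` column for "mecahnical" against a correctly spelled "mechanical" makes the typo effect easy to see.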
B2) Embeddings & Semantic Meaning (Mini Semantic Search)
Goal: Prove embeddings cluster meaning.
Step B2.1 — Create 12 sentences (4 topics × 3 sentences)
Example topics:
Aviation maintenance
Finance
Health
Sports
Step B2.2 — Convert each sentence to embeddings
Use one:
sentence-transformers (recommended if allowed)
OR mean-pool BERT last hidden states
Step B2.3 — Measure similarity
Compute cosine similarity matrix (12×12)
Identify top-2 most similar sentences for each one
Step B2.4 — Visualize embedding space
Use PCA (2D) or t-SNE
Plot 2D scatter (label points)
✅ Deliverable: similarity matrix + 2D plot + brief conclusion.
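Steps B2.2 to B2.4 could be sketched as follows. The embedding model name `all-MiniLM-L6-v2` is an assumption (any sentence-transformers model would do, and the package needs a separate `pip install sentence-transformers`). The output path and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(vectors):
    """Return the full pairwise cosine similarity matrix as nested lists."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def top_k_neighbors(sim, k=2):
    """For each row, return the indices of the k most similar other rows."""
    result = []
    for i, row in enumerate(sim):
        others = [j for j in range(len(row)) if j != i]  # skip self-similarity
        result.append(sorted(others, key=lambda j: row[j], reverse=True)[:k])
    return result


def embed(sentences, model_name="all-MiniLM-L6-v2"):
    """Encode sentences into vectors (assumed model; swap in mean-pooled BERT if needed)."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode(sentences).tolist()


def plot_2d(vectors, labels, path="outputs/embedding_pca.png"):
    """Project the vectors to 2D with PCA and save a labeled scatter plot."""
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    points = PCA(n_components=2).fit_transform(vectors)
    fig, ax = plt.subplots(figsize=(7, 6))
    ax.scatter(points[:, 0], points[:, 1])
    for (x, y), label in zip(points, labels):
        ax.annotate(label, (x, y))
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)
```

With your 12 sentences in a list `sentences`, `vectors = embed(sentences)` feeds both `top_k_neighbors(similarity_matrix(vectors))` and `plot_2d(vectors, labels)`. Short topic labels such as "aviation-1" keep the plot readable.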
B3) Context Window Limits (Practical Experiment)
Goal: Experience how long context fails and why pruning matters.
Step B3.1 — Create a long prompt
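Repeat a paragraph until the prompt reaches ~1500–2500 tokens (scale this down for small-context models; GPT-2, for example, accepts at most 1024 tokens)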
Insert a key fact near the start:
“The secret key is ORANGE-9281.”
Step B3.2 — Ask at the end
“What is the secret key?”
Step B3.3 — Test with 2 settings
Short max context (or short prompt)
Long prompt near model limit
Observe:
Does it recall?
Does accuracy degrade?
✅ Deliverable: 2 outputs + explanation:
“Why long context reduces reliability”
Mention: self-attention cost grows quadratically with sequence length.
🧪 Part C — How LLMs Are Trained (2.3)
C1) Pretraining Objectives (Mini Demo)
You’ll simulate what models learn during pretraining.
Option 1: Masked Language Modeling (BERT style)
Goal: Predict missing tokens.
Step C1.1 — Make 10 custom sentences (domain-specific)
Example:
“The aircraft engine requires regular inspection.”
“Hydraulic systems can leak under high pressure.”
Step C1.2 — Mask random tokens
Replace 15% with [MASK].
Step C1.3 — Run BERT fill-mask predictions
Evaluate: how often does top-1 match your original?
✅ Deliverable: masked sentence results + basic “accuracy”.
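The masking and scoring loop might be sketched like this. It masks whole words rather than WordPiece tokens and scores one masked position at a time, which are simplifications of real masked language modeling. The fixed seed, the minimum of one mask per sentence, and all function names are assumptions for illustration.

```python
import random


def mask_tokens(words, rate=0.15, seed=0, mask="[MASK]"):
    """Replace about `rate` of the words (at least one) with the mask token."""
    rng = random.Random(seed)
    n = max(1, round(len(words) * rate))
    positions = sorted(rng.sample(range(len(words)), n))
    masked = [mask if i in positions else word for i, word in enumerate(words)]
    return masked, positions


def top1_accuracy(sentences, predict_top1, rate=0.15, seed=0):
    """Fraction of masked words whose top-1 prediction equals the original word."""
    hits = total = 0
    for sentence in sentences:
        words = sentence.rstrip(".").split()
        _, positions = mask_tokens(words, rate, seed)
        for pos in positions:
            probe = words.copy()
            probe[pos] = "[MASK]"  # one mask per probe keeps the pipeline output simple
            guess = predict_top1(" ".join(probe) + ".")
            hits += guess.strip().lower() == words[pos].lower()
            total += 1
    return hits / total


def bert_top1():
    """Return a function mapping a single-[MASK] sentence to BERT's top-1 token."""
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    return lambda text: fill(text, top_k=1)[0]["token_str"]
```

`top1_accuracy(my_sentences, bert_top1())` gives the basic "accuracy" for the deliverable. Words that BERT splits into several WordPiece tokens can never match a single predicted token, so expect those to count as misses.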
Option 2: Next Token Prediction (GPT style)
Goal: Predict the probability distribution over the next token.
Step C1.1 — Choose prompt
“In aviation maintenance, reliability depends on”
Step C1.2 — Generate with low temperature
temperature 0.1, top_p 0.9
✅ Deliverable: outputs + note: “This is what pretraining optimizes.”
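To see what "low temperature" does to the next-token distribution, the sketch below applies a temperature-scaled softmax to logits. The first function is pure Python and works on any list of scores; the second reads real logits from `gpt2` (an assumed model choice). It leaves out `top_p` filtering to stay short, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(prompt, temperature=0.1, top=5, model_name="gpt2"):
    """Return the `top` most likely next tokens with their probabilities."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # scores for the position after the prompt
    probs = torch.softmax(logits / temperature, dim=-1)
    best = torch.topk(probs, top)
    return [(tokenizer.decode([int(i)]), float(p)) for p, i in zip(best.values, best.indices)]
```

Comparing `next_token_distribution(prompt, temperature=1.0)` with `temperature=0.1` on the aviation prompt shows the same ranking with far more probability mass on the first candidate, which is why low-temperature outputs look nearly deterministic.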
C2) Fine-Tuning vs Instruction Tuning (Hands-On)
Goal: See the difference between fine-tuning for a task and instruction-following behavior.
Step C2.1 — Create a tiny dataset (20 rows)
Make a CSV with:
instruction, input, output
Example tasks:
Convert text → JSON
Summarize into 1 sentence
Classify into labels
Step C2.2 — Run inference BEFORE tuning
Feed 5 examples to base model
Record outputs (often inconsistent)
Step C2.3 — Do a lightweight instruction tuning (conceptual + optional)
If you want a runnable approach without heavy compute:
Use a small model like google/flan-t5-small (already instruction-following)
Compare against a base model that is not instruction tuned (like raw GPT2)
✅ Deliverable: “Before vs after” comparison table:
formatting consistency
instruction adherence
If you want full tuning, we can do LoRA/PEFT in Section 6+ where it fits better.
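The dataset and the base-versus-instruction-tuned comparison could be sketched as below. The CSV path, the prompt layout in `build_prompt`, greedy decoding, and `max_new_tokens=40` are assumptions for illustration, and the function names are hypothetical. Models are loaded with the `Auto*` classes and `generate`.

```python
import csv

FIELDS = ["instruction", "input", "output"]


def write_dataset(rows, path="data/instruction_dataset.csv"):
    """Save rows (dicts with the three fields) as a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_dataset(path="data/instruction_dataset.csv"):
    """Load the CSV back into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def build_prompt(row):
    """Join instruction and optional input into one prompt string."""
    if row["input"]:
        return f"{row['instruction']}\n\n{row['input']}"
    return row["instruction"]


def make_generator(model_name, seq2seq, max_new_tokens=40):
    """Load a model once and return a prompt -> generated text function."""
    from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model_class = AutoModelForSeq2SeqLM if seq2seq else AutoModelForCausalLM
    model = model_class.from_pretrained(model_name)

    def generate(prompt):
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Decoder-only models echo the prompt, so keep only the new tokens
        new_tokens = output[0] if seq2seq else output[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)

    return generate


def compare_models(rows):
    """Return one comparison record per dataset row."""
    base = make_generator("gpt2", seq2seq=False)
    tuned = make_generator("google/flan-t5-small", seq2seq=True)
    return [
        {"prompt": build_prompt(r), "expected": r["output"],
         "base": base(build_prompt(r)), "instruction_tuned": tuned(build_prompt(r))}
        for r in rows
    ]
```

`compare_models(read_dataset()[:5])` yields the raw material for the comparison table; formatting consistency and instruction adherence still need to be judged by hand.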
C3) RLHF (High-Level Intuition) — Mini Simulation
Goal: Understand RLHF as “preference optimization.”
Step C3.1 — Create 6 prompts
Example:
“Write a polite email declining a meeting.”
“Give a safe answer to a risky question.”
Step C3.2 — Generate 2 candidate answers each
Use two settings:
Candidate A: temperature 0.2 (safer)
Candidate B: temperature 0.9 (more creative)
Step C3.3 — Rank each pair (human preference)
Create a simple table:
Prompt
Answer A
Answer B
Winner + why
Step C3.4 — Explain RLHF mapping
Your preference labels ≈ training data for the “reward model”
Optimization pushes model toward preferred style
✅ Deliverable: preference table + 5-line explanation of RLHF.
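To connect the table to the RLHF intuition, the sketch below tallies winners and shows the pairwise loss commonly used to train reward models from such comparisons. The table rows and reward scores are made-up placeholders, not results; the field names and function names are hypothetical.

```python
import math
from collections import Counter


def win_rates(table):
    """Share of prompts won by candidate A and by candidate B."""
    counts = Counter(row["winner"] for row in table)
    total = sum(counts.values())
    return {name: counts[name] / total for name in ("A", "B")}


def pairwise_loss(reward_winner, reward_loser):
    """-log(sigmoid(r_winner - r_loser)).

    Small when the reward model scores the human-preferred answer higher,
    large when it ranks the pair the wrong way round.
    """
    return math.log(1.0 + math.exp(-(reward_winner - reward_loser)))


# Placeholder rows: replace with your own six prompts and judgments
example_table = [
    {"prompt": "Write a polite email declining a meeting.", "winner": "A", "why": "clear and courteous"},
    {"prompt": "Give a safe answer to a risky question.", "winner": "A", "why": "stays cautious"},
    {"prompt": "(your prompt)", "winner": "B", "why": "(your reason)"},
]
```

`win_rates(example_table)` summarizes which temperature setting you tended to prefer. `pairwise_loss(2.0, 0.0)` is smaller than `pairwise_loss(0.0, 2.0)`, which is the signal that pushes a reward model, and then the tuned model, toward the preferred style.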
✅ Final Submission Checklist (Section 2 Lab)
Students submit:
Attention heatmap + observations
Positional encoding experiment results (similarity scores)
Encoder vs decoder outputs + comparison paragraph
Tokenization comparison table
Embedding similarity matrix + PCA/t-SNE plot
Context window experiment outputs + explanation
Pretraining objective demo results
Instruction tuning comparison (base vs instruction-following)
RLHF preference ranking table