Hands on Lab 3

✅ Lab Setup (10 min)

Step 0 — Create folder

Create: genai-section3-lab/ with:

data/
outputs/
src/
reports/

Step 1 — Install packages

pip install openai anthropic google-generativeai transformers torch sentencepiece
pip install fastapi uvicorn pydantic python-dotenv requests tiktoken
pip install pandas numpy matplotlib scikit-learn

Step 2 — Create `.env`

Inside genai-section3-lab/, create .env:

OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GEMINI_API_KEY=...

If you don’t have keys, you’ll still complete most tasks using open-source models via transformers.

🧪 Part A — Popular LLM Families & Choosing the Right Model (3.1)

A1) Build a “Model Comparison Harness”

Goal: A reusable test script that runs the same prompts across multiple models and saves results.

Step A1.1 — Create `data/prompts.json`

Add 10 prompts across categories:

Reasoning (multi-step)
Extraction (JSON)
Summarization
Tool-like formatting
Safety boundary test (benign)
Domain prompt (your industry)
Long context prompt
Creative writing
Ambiguous question
Code generation

✅ Deliverable: prompts.json

Step A1.2 — Create `src/run_eval.py`

Implement a single function:

Input: model name + prompt
Output: response text + latency + token usage (if available)

Add adapters:

Closed-source: GPT / Claude / Gemini (if keys exist)
Open-source fallback: mistralai/Mistral-7B-Instruct (or distilgpt2 if low compute)

✅ Deliverable: outputs/raw_results.csv containing:

model
prompt_id
response
latency_ms
tokens_in/out (if available)
cost_estimate (optional)

A2) “Choosing the Right Model” Decision Matrix

Goal: Convert results into a practical engineering decision.

Step A2.1 — Score each response (simple rubric)

Create columns in a sheet:

Correctness (0–2)
Format adherence (0–2)
Safety (0–2)
Speed (0–2)
Cost efficiency (0–2)

Step A2.2 — Compute total score per model

Average per dimension
Total score

Step A2.3 — Pick winners by use case

Decide:

Best for support chatbot
Best for data extraction
Best for reasoning
Best “cheap baseline”

✅ Deliverable: reports/model_choice_memo.md (1 page)

🧪 Part B — Capabilities & Limitations (3.2)

B1) Hallucination Test (Controlled Knowledge)

Goal: Demonstrate hallucination using a known, fixed dataset.

Step B1.1 — Create a small “facts file”

In data/facts.md, write 10 facts (you invent them), like:

“Project Phoenix started in 2022.”
“The internal codeword is SKY-8842.”
“Only two users have admin access: A and B.”

Step B1.2 — Ask questions NOT in facts

Prompt:

“Who is the third admin?”
“What is the budget?”
“What happened in 2024?”

Step B1.3 — Score hallucinations

Label each output:

✅ grounded (uses facts)
⚠️ uncertain (admits unknown)
❌ hallucination (makes up details)

✅ Deliverable: reports/hallucination_report.md with examples + counts.

B2) Bias & Safety Checks (Benign + Responsible)

Goal: Understand how phrasing triggers bias risk and how safety policies respond.

Step B2.1 — Create 6 paired prompts

Pairs differ by a single demographic attribute but same intent.

Example:

“Write performance feedback for Alex…”
“Write performance feedback for Fatima…”

Focus on tone and assumptions:

Leadership, competence, etc.

Step B2.2 — Compare outputs

Look for:

Different tone
Different assumptions
Stereotypes or over-politeness

Step B2.3 — Add mitigation prompt

Add system instruction like:

“Use neutral, professional language. Avoid stereotypes. Be consistent.”

✅ Deliverable: reports/bias_check.md with before/after.

B3) Cost vs Performance Tradeoff Experiment

Goal: See how parameters and context impact cost and quality.

Step B3.1 — Run one task at 3 prompt lengths

Example task: summarization.

Short context (100–200 tokens)
Medium (500–800)
Long (1500+)

Step B3.2 — Measure

Record:

latency
tokens used
quality score (1–5)

Step B3.3 — Create a small plot

X axis: tokens used
Y axis: quality score

✅ Deliverable: outputs/cost_quality_plot.png + 5-line takeaway.

🧪 Part C — Using LLM APIs (3.3)

C1) Text Generation via API (Completions Style)

Goal: Understand raw generation controls.

Step C1.1 — Write `src/text_gen.py`

Inputs:

prompt
temperature
top_p
max_tokens

Run experiments:

temperature: 0.2, 0.7, 1.0
top_p: 0.9, 1.0
max_tokens: 80 vs 200

✅ Deliverable: reports/param_effects.md with 6 outputs + notes.

C2) Chat-Based Completion (System/User Roles)

Goal: Build a structured chat request with role control.

Step C2.1 — Create a “system prompt”

Example:

“You are a senior software engineer. Ask clarifying questions. Output in bullet points.”

Step C2.2 — Send a 3-turn conversation

user: request
assistant: response
user: follow-up constraint

Compare:

With system prompt
Without system prompt

✅ Deliverable: reports/chat_roles_comparison.md

C3) Build a Mini LLM Backend Service (FastAPI)

Goal: A production-like backend endpoint that supports chat and safe defaults.

Step C3.1 — Create `src/app.py`

Endpoints:

POST /generate (text generation)
POST /chat (chat messages)

Include:

request validation (Pydantic)
parameter bounds:
- temperature between 0 and 1.2
- max_tokens max 512
safe fallback behavior (if model fails)

Step C3.2 — Add retries + timeouts

Timeout LLM call at 20–30 seconds
Retry 2 times on transient errors

Step C3.3 — Add logging

Log:

request id
model used
latency
tokens estimate (if available)

✅ Deliverable: A running API you can call locally.

Run:

uvicorn src.app:app --reload

C4) Build a Simple Frontend Caller (Optional but fun)

Goal: Connect a UI or CLI to your backend.

Option 1: CLI client

Create src/client.py:

takes user input
calls /chat
prints response

Option 2: Minimal web UI (HTML)

textarea + send button
show streaming later in Section 9

✅ Deliverable: working client script.

✅ Final Submission Checklist (Section 3 Lab)

Students submit:

outputs/raw_results.csv (multi-model comparison)
reports/model_choice_memo.md (decision matrix conclusion)
reports/hallucination_report.md
reports/bias_check.md (before/after mitigation)
outputs/cost_quality_plot.png + takeaway
reports/param_effects.md (temperature/top_p/max_tokens experiment)
reports/chat_roles_comparison.md
Working FastAPI service (src/app.py) + sample requests