Hands on Lab 3


✅ Lab Setup (10 min)

Step 0 — Create folder

Create: genai-section3-lab/ with:

Step 1 — Install packages

pip install openai anthropic google-generativeai transformers torch sentencepiece
pip install fastapi uvicorn pydantic python-dotenv requests tiktoken
pip install pandas numpy matplotlib scikit-learn

Step 2 — Create .env

Inside genai-section3-lab/, create .env:

OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GEMINI_API_KEY=...

If you don’t have keys, you’ll still complete most tasks using open-source models via transformers.

🧪 Part A — Popular LLM Families & Choosing the Right Model (3.1)

A1) Build a “Model Comparison Harness”

Goal: A reusable test script that runs the same prompts across multiple models and saves results.

Step A1.1 — Create data/prompts.json

Add 10 prompts across categories:

  1. Reasoning (multi-step)

  2. Extraction (JSON)

  3. Summarization

  4. Tool-like formatting

  5. Safety boundary test (benign)

  6. Domain prompt (your industry)

  7. Long context prompt

  8. Creative writing

  9. Ambiguous question

  10. Code generation

Deliverable: prompts.json

Step A1.2 — Create src/run_eval.py

Implement a single function:

Add adapters:

Deliverable: outputs/raw_results.csv containing:

A2) “Choosing the Right Model” Decision Matrix

Goal: Convert results into a practical engineering decision.

Step A2.1 — Score each response (simple rubric)

Create columns in a sheet:

Step A2.2 — Compute total score per model

Step A2.3 — Pick winners by use case

Decide:

Deliverable: reports/model_choice_memo.md (1 page)

🧪 Part B — Capabilities & Limitations (3.2)

B1) Hallucination Test (Controlled Knowledge)

Goal: Demonstrate hallucination using a known, fixed dataset.

Step B1.1 — Create a small “facts file”

In data/facts.md, write 10 facts (you invent them), like:

Step B1.2 — Ask questions NOT in facts

Prompt:

Step B1.3 — Score hallucinations

Label each output:

Deliverable: reports/hallucination_report.md with examples + counts.

B2) Bias & Safety Checks (Benign + Responsible)

Goal: Understand how phrasing triggers bias risk and how safety policies respond.

Step B2.1 — Create 6 paired prompts

Pairs differ by a single demographic attribute but same intent.

Example:

Focus on tone and assumptions:

Step B2.2 — Compare outputs

Look for:

Step B2.3 — Add mitigation prompt

Add system instruction like:

Deliverable: reports/bias_check.md with before/after.

B3) Cost vs Performance Tradeoff Experiment

Goal: See how parameters and context impact cost and quality.

Step B3.1 — Run one task at 3 prompt lengths

Example task: summarization.

Step B3.2 — Measure

Record:

Step B3.3 — Create a small plot

Deliverable: outputs/cost_quality_plot.png + 5-line takeaway.

🧪 Part C — Using LLM APIs (3.3)

C1) Text Generation via API (Completions Style)

Goal: Understand raw generation controls.

Step C1.1 — Write src/text_gen.py

Inputs:

Run experiments:

Deliverable: reports/param_effects.md with 6 outputs + notes.

C2) Chat-Based Completion (System/User Roles)

Goal: Build a structured chat request with role control.

Step C2.1 — Create a “system prompt”

Example:

Step C2.2 — Send a 3-turn conversation

Compare:

Deliverable: reports/chat_roles_comparison.md

C3) Build a Mini LLM Backend Service (FastAPI)

Goal: A production-like backend endpoint that supports chat and safe defaults.

Step C3.1 — Create src/app.py

Endpoints:

Include:

Step C3.2 — Add retries + timeouts

Step C3.3 — Add logging

Log:

Deliverable: A running API you can call locally.

Run:

uvicorn src.app:app --reload

C4) Build a Simple Frontend Caller (Optional but fun)

Goal: Connect a UI or CLI to your backend.

Option 1: CLI client

Create src/client.py:

Option 2: Minimal web UI (HTML)

Deliverable: working client script.

✅ Final Submission Checklist (Section 3 Lab)

Students submit:

  1. outputs/raw_results.csv (multi-model comparison)

  2. reports/model_choice_memo.md (decision matrix conclusion)

  3. reports/hallucination_report.md

  4. reports/bias_check.md (before/after mitigation)

  5. outputs/cost_quality_plot.png + takeaway

  6. reports/param_effects.md (temperature/top_p/max_tokens experiment)

  7. reports/chat_roles_comparison.md

  8. Working FastAPI service (src/app.py) + sample requests