Hands-on Lab 10

✅ Lab Setup (10–15 min)

Step 0 — Create folder

Create genai-section10-ops-lab/ with these subfolders:

data/
src/
cache/
outputs/
logs/
reports/

Step 1 — Install dependencies

pip install pandas numpy scikit-learn matplotlib
pip install fastapi uvicorn httpx tenacity pydantic python-dotenv
pip install tiktoken jsonschema rapidfuzz
pip install diskcache
pip install locust

Optional (if you want LLM-as-judge):

pip install openai anthropic google-generativeai

🧪 Part A — Evaluating LLM Outputs (10.1)

A1) Create a Test Dataset (Golden Set)

Goal: Build a repeatable evaluation set.

Step A1.1 — Create `data/eval_set.jsonl`

Add 30 test cases across 3 tasks:

Extraction → JSON fields from text
Summarization → 2–3 sentence summary
Q&A grounded in context → answer + citation

Each record:

id
task_type
input
context (optional)
expected_output (gold answer)
must_include (list of key phrases)
must_not_include (list)
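
A minimal sketch of one record and a loader that checks the fields listed above. The `task_type` labels and the example record are illustrative assumptions, not part of the lab spec.

```python
import json

# Field names come from the record list above; the task_type labels are assumed.
REQUIRED_FIELDS = {"id", "task_type", "input", "expected_output",
                   "must_include", "must_not_include"}
TASK_TYPES = {"extraction", "summarization", "grounded_qa"}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is usable."""
    problems = ["missing field: " + name
                for name in sorted(REQUIRED_FIELDS - set(record))]
    if record.get("task_type") not in TASK_TYPES:
        problems.append("unknown task_type: %r" % record.get("task_type"))
    for name in ("must_include", "must_not_include"):
        if name in record and not isinstance(record[name], list):
            problems.append(name + " must be a list")
    return problems


def load_eval_set(lines) -> list:
    """Parse JSONL lines, skip blank ones, and fail fast on a bad record."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        problems = validate_record(record)
        if problems:
            raise ValueError("line %d: %s" % (number, problems))
        records.append(record)
    return records


# One illustrative line of data/eval_set.jsonl (made-up content).
EXAMPLE_LINE = json.dumps({
    "id": "qa-001",
    "task_type": "grounded_qa",
    "input": "How long is the refund window?",
    "context": "Refunds are accepted within 30 days of purchase.",
    "expected_output": "30 days [context]",
    "must_include": ["30 days"],
    "must_not_include": ["60 days"],
})

records = load_eval_set([EXAMPLE_LINE, ""])
```

To load the real file, pass an open file handle instead of the list, for example `load_eval_set(open("data/eval_set.jsonl", encoding="utf-8"))`.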

✅ Deliverable: eval_set.jsonl

A2) Human Evaluation Rubric

Goal: Evaluate quality like a real team.

Step A2.1 — Create rubric in `reports/human_rubric.md`

Score each output 1–5 for:

Accuracy (factually correct)
Relevance (answers the question)
Faithfulness (uses provided context only)
Format compliance (valid JSON / required structure)

Add “Pass/Fail gates”:

invalid JSON = fail
missing required fields = fail
unsafe content = fail
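
The gates can also be checked mechanically for tasks whose output must be JSON. This is a sketch under assumptions: `apply_gates` is a hypothetical helper, and the "unsafe content" decision is passed in as a reviewer's flag rather than detected automatically.

```python
import json


def apply_gates(output_text: str, required_fields: list,
                flagged_unsafe: bool = False) -> list:
    """Return the list of failed gates; an empty list means the output passes."""
    failures = []
    if flagged_unsafe:  # set by a human reviewer or a separate safety check
        failures.append("unsafe content")
    try:
        data = json.loads(output_text)
    except ValueError:
        failures.append("invalid JSON")
        return failures
    # A JSON array or scalar cannot carry named fields, so all count as missing.
    present = data if isinstance(data, dict) else {}
    missing = [name for name in required_fields if name not in present]
    if missing:
        failures.append("missing required fields: " + ", ".join(missing))
    return failures
```

An output with any failed gate is a fail regardless of its 1–5 scores.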

✅ Deliverable: rubric template

A3) Automated Evaluation (Practical Metrics)

Goal: Score outputs at scale without humans.

Step A3.1 — Build `src/auto_eval.py`

Metrics:

JSON validity rate (parse success)
Schema compliance (jsonschema)
String similarity to expected:
- rapidfuzz for extraction/summaries
Keyphrase coverage:
- percent of must_include present
Faithfulness check (simple):
- for grounded tasks, flag “potential hallucination” when the answer contains content words that do not appear in the context
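
A starting point for `src/auto_eval.py`, as a sketch. To stay standard-library only, `difflib` stands in for rapidfuzz here (rapidfuzz's `fuzz.ratio` returns 0–100 rather than 0–1), and schema compliance is left out; `jsonschema.validate` would slot in next to `is_valid_json`. The stopword list and function names are assumptions.

```python
import json
import re
from difflib import SequenceMatcher

# Small assumed stopword list so common words are not flagged as unsupported.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "and", "it"}


def is_valid_json(text: str) -> bool:
    """JSON validity: does the output parse at all?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def similarity(output: str, expected: str) -> float:
    """String similarity in the range 0..1 (difflib stand-in for rapidfuzz)."""
    return SequenceMatcher(None, output.lower(), expected.lower()).ratio()


def keyphrase_coverage(output: str, must_include: list) -> float:
    """Fraction of must_include phrases found in the output (case-insensitive)."""
    if not must_include:
        return 1.0
    text = output.lower()
    hits = sum(1 for phrase in must_include if phrase.lower() in text)
    return hits / len(must_include)


def unsupported_words(answer: str, context: str) -> set:
    """Words in the answer that never appear in the context (naive check)."""
    def words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    return words(answer) - words(context) - STOPWORDS
```

The word-overlap check is deliberately crude: it treats "refund" and "refunds" as different words, so read its output as a flag for human review, not a verdict.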

✅ Deliverable: outputs/auto_eval_scores.csv

A4) LLM-as-a-Judge (Optional but powerful)

Goal: Use a judge model to rate helpfulness + faithfulness.

Step A4.1 — Create judge prompt

Judge outputs strict JSON:

accuracy_score (1–5)
relevance_score (1–5)
faithfulness_score (1–5)
explanation (short)
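
Judge models do not always return clean JSON, so it helps to validate the reply before using the scores. A sketch, assuming the four keys above; the prompt wording and the `parse_judge_output` helper are hypothetical.

```python
import json

# Illustrative prompt template; adapt the wording to your judge model.
JUDGE_PROMPT = (
    "You are grading a model answer.\n"
    "Question: {question}\nContext: {context}\nAnswer: {answer}\n"
    "Reply with one JSON object containing accuracy_score, relevance_score, "
    "and faithfulness_score (each an integer from 1 to 5) and explanation "
    "(a short string). Output JSON only."
)

SCORE_FIELDS = ("accuracy_score", "relevance_score", "faithfulness_score")


def parse_judge_output(raw: str) -> dict:
    """Parse the judge reply and raise ValueError if it breaks the contract."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    for name in SCORE_FIELDS:
        score = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(name + " must be an integer from 1 to 5")
    if not isinstance(data.get("explanation"), str):
        raise ValueError("explanation must be a string")
    return data


prompt = JUDGE_PROMPT.format(question="Q?", context="C.", answer="A.")
```

Count replies that fail validation separately instead of silently dropping them; a high failure rate is itself a finding for the report.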

Step A4.2 — Run judge on 30 outputs

Compare judge scores vs automated metrics.

✅ Deliverable: reports/judge_vs_metrics.md

🧪 Part B — Cost Optimization (10.2)

B1) Token Optimization Experiment

Goal: Reduce tokens without losing quality.

Step B1.1 — Create 3 prompt versions for the same task

Verbose prompt (long instructions + examples)
Optimized prompt (tight instructions)
Ultra-optimized (template + strict schema + short)

Step B1.2 — Measure tokens & quality

For each prompt:

tokens in
tokens out
auto-eval score
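
A sketch for the "tokens in" column. The default counter is only a rough word-and-punctuation proxy, not a real tokenizer; with tiktoken installed you could pass `count=lambda t: len(tiktoken.get_encoding("cl100k_base").encode(t))` instead. The function names are assumptions.

```python
import re


def approx_token_count(text: str) -> int:
    """Rough proxy: count words and punctuation marks. Not a real tokenizer."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def compare_prompts(prompts: dict, count=approx_token_count) -> list:
    """Return one row per prompt with its size and the saving vs the first prompt."""
    baseline = None
    rows = []
    for name, text in prompts.items():
        tokens = count(text)
        if baseline is None:
            baseline = tokens  # the first prompt (e.g. verbose) is the baseline
        saved = 0.0 if baseline == 0 else round(100 * (baseline - tokens) / baseline, 1)
        rows.append({"prompt": name, "tokens_in": tokens,
                     "pct_saved_vs_first": saved})
    return rows
```

Tokens out and the auto-eval score come from actually running each prompt, so add those columns after the model calls.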

B2) Caching Strategies (Huge cost saver)

Goal: Implement caching for repeated prompts.

Step B2.1 — Add response cache

Use diskcache. The cache key should include:

model name
prompt hash
temperature/top_p/max_tokens
system prompt version
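
One way to build such a key is to hash a canonical JSON of those fields. This is a sketch: `make_cache_key` and `cached_call` are hypothetical helpers, and `cache` can be any dict-like store. A `diskcache.Cache("cache/")` object supports the same `in` and `[]` operations, so it should drop in where the plain dict is used here.

```python
import hashlib
import json


def make_cache_key(model: str, prompt: str, params: dict,
                   system_prompt_version: str) -> str:
    """Build a stable key from every input that can change the response."""
    payload = {
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "params": {name: params.get(name)
                   for name in ("temperature", "top_p", "max_tokens")},
        "system_prompt_version": system_prompt_version,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_call(cache, key: str, call):
    """Return (value, was_hit); call() runs only on a cache miss."""
    if key in cache:
        return cache[key], True
    value = call()
    cache[key] = value
    return value, False
```

Leaving any of these fields out of the key risks serving a response that was generated under different settings.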

Step B2.2 — Test cache hit rate

Run the same 20 queries twice:

first run = cold cache
second run = warm cache

Measure:

cache hit rate on the second run
cost reduction (approx)
latency improvement
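
A sketch of the cold/warm comparison. For brevity the cache is keyed by the raw query text, and `call_model` is a stand-in for your real LLM call; in the lab, use the full cache key described in Step B2.1.

```python
import time


def run_pass(queries: list, cache, call_model) -> dict:
    """Run all queries once against the cache and report hits and wall time."""
    hits = 0
    start = time.perf_counter()
    for query in queries:
        if query in cache:
            hits += 1
        else:
            cache[query] = call_model(query)
    elapsed = time.perf_counter() - start
    total = len(queries)
    return {
        "hit_rate": hits / total if total else 0.0,
        "model_calls": total - hits,
        "seconds": elapsed,
    }


def fake_model(query: str) -> str:
    """Stand-in for an LLM call; sleeps briefly to simulate latency."""
    time.sleep(0.001)
    return "answer to " + query


queries = ["question %d" % i for i in range(20)]
cache = {}
cold = run_pass(queries, cache, fake_model)   # first run: every query misses
warm = run_pass(queries, cache, fake_model)   # second run: every query hits
# Approximate cost reduction: share of paid model calls avoided on the warm run.
cost_reduction = 1 - warm["model_calls"] / cold["model_calls"]
```

Compare `cold["seconds"]` with `warm["seconds"]` for the latency improvement.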

✅ Deliverable: reports/caching_results.md

B3) Model Selection Tradeoffs

Goal: Pick the right model per task.

Step B3.1 — Define 3 “tiers”

Fast/cheap model
Balanced model
High quality model

Step B3.2 — Run eval_set across tiers

Measure:

quality score
latency
token usage
estimated cost
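
Estimated cost follows from token counts and per-token prices. The prices below are placeholders for illustration only, not real vendor pricing; substitute the current rates for the models you test.

```python
# Placeholder prices in USD per 1M tokens. NOT real vendor pricing.
TIER_PRICES = {
    "cheap": {"in": 0.20, "out": 0.80},
    "balanced": {"in": 2.00, "out": 8.00},
    "premium": {"in": 10.00, "out": 30.00},
}


def estimate_cost(tokens_in: int, tokens_out: int, tier: str,
                  prices: dict = TIER_PRICES) -> float:
    """Cost in USD = input tokens * input price + output tokens * output price."""
    price = prices[tier]
    return (tokens_in / 1_000_000 * price["in"]
            + tokens_out / 1_000_000 * price["out"])


# Example: 30 eval cases at roughly 800 tokens in and 200 tokens out each.
for tier in TIER_PRICES:
    total = estimate_cost(30 * 800, 30 * 200, tier)
    print("%-9s $%.4f" % (tier, total))
```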

Step B3.3 — Build a routing policy

Example:

Extraction → cheap
RAG answer → balanced
Complex reasoning → premium
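
The example policy can be written as a lookup table. A minimal sketch; the task labels and the choice of "balanced" as the fallback for unknown tasks are assumptions.

```python
# Task type -> tier, mirroring the example policy above.
ROUTING_POLICY = {
    "extraction": "cheap",
    "rag_answer": "balanced",
    "complex_reasoning": "premium",
}
DEFAULT_TIER = "balanced"  # assumed fallback for task types not in the table


def route(task_type: str) -> str:
    """Pick the model tier for a task type."""
    return ROUTING_POLICY.get(task_type, DEFAULT_TIER)
```

Fill the table from your Step B3.2 measurements rather than from intuition: a task belongs on the cheapest tier whose quality score is acceptable.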

✅ Deliverable: reports/model_routing_policy.md

🧪 Part C — Latency & Scaling (10.3)

C1) Streaming vs Batch Inference

Goal: Improve perceived latency using streaming.

Step C1.1 — Implement streaming endpoint (if you built Section 9)

Measure:

Time to first token (TTFT)
Total completion time
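
Both numbers can be measured around any iterator of streamed chunks. A sketch, where `fake_stream` is a stand-in for your streaming client and `measure_stream` is a hypothetical helper.

```python
import time
from typing import Iterable


def measure_stream(chunks: Iterable) -> dict:
    """Consume a stream and record time to first token and total time."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None and chunk:
            ttft = time.perf_counter() - start  # first non-empty chunk arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(parts)}


def fake_stream():
    """Stand-in for a streaming LLM response."""
    for piece in ["Hello", " ", "world"]:
        time.sleep(0.01)
        yield piece


result = measure_stream(fake_stream())
```

For a non-streaming call, TTFT equals total completion time, which is the comparison Step C1.2 asks for.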

Step C1.2 — Compare with non-streaming

Record:

TTFT difference
user experience notes

✅ Deliverable: reports/streaming_vs_batch.md

C2) Async Processing (Parallel requests)

Goal: Handle concurrency efficiently.

Step C2.1 — Implement async requests in backend

Use httpx.AsyncClient for LLM calls.

Step C2.2 — Add request queue limit

max concurrent LLM calls = N (e.g. 10)
overflow requests return HTTP 429 with a Retry-After header
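
A framework-free sketch of the limit. `ConcurrencyLimiter` is a hypothetical helper that returns `(status, headers, body)` tuples; in a FastAPI handler you would likely raise `HTTPException(status_code=429, headers={"Retry-After": ...})` instead of returning the tuple.

```python
import asyncio


class ConcurrencyLimiter:
    """Reject work beyond max_concurrent instead of queueing it."""

    def __init__(self, max_concurrent: int = 10, retry_after_s: int = 2):
        self.max_concurrent = max_concurrent
        self.retry_after_s = retry_after_s
        self.in_flight = 0

    async def run(self, make_call):
        """Await make_call() if a slot is free, else return a 429 response."""
        # No await between the check and the increment, so this is safe
        # on a single event loop.
        if self.in_flight >= self.max_concurrent:
            return 429, {"Retry-After": str(self.retry_after_s)}, None
        self.in_flight += 1
        try:
            body = await make_call()
            return 200, {}, body
        finally:
            self.in_flight -= 1


async def slow_llm_call():
    """Stand-in for an httpx.AsyncClient request to the LLM."""
    await asyncio.sleep(0.01)
    return "ok"


async def demo():
    limiter = ConcurrencyLimiter(max_concurrent=2)
    return await asyncio.gather(*(limiter.run(slow_llm_call) for _ in range(5)))


results = asyncio.run(demo())
```

With a limit of 2 and 5 simultaneous requests, two are served and three receive 429.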

✅ Deliverable: concurrency-limited backend

C3) Load Management + Rate Limits

Goal: Keep system stable under load.

Step C3.1 — Add rate limiting

Strategies:

per-user requests/minute
per-IP requests/minute
global cap
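
All three strategies can share one sliding-window counter keyed differently: a user id, an IP address, or a constant such as "global". A minimal in-memory sketch, assuming a single process; `SlidingWindowLimiter` is a hypothetical name.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within the last `window_s` seconds."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.events = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        timestamps = self.events[key]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_s:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True


per_user = SlidingWindowLimiter(limit=2, window_s=60.0)
```

State kept in process memory is lost on restart and is not shared across workers, so a multi-worker deployment would need a shared store.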

Step C3.2 — Add fallback behavior

When overloaded:

degrade to cheaper model
shorten output
disable “deep reasoning mode”
return “try again soon” gracefully
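
The fallback steps can be expressed as one function from current load to a request plan. A sketch only: the thresholds, token limits, and tier names are assumptions to tune against your own load test.

```python
from dataclasses import dataclass


@dataclass
class RequestPlan:
    tier: str
    max_tokens: int
    deep_reasoning: bool
    reject: bool = False  # True means: answer "try again soon"


def plan_for_load(load: float, wants_deep_reasoning: bool) -> RequestPlan:
    """load = in-flight requests / capacity. Thresholds are illustrative."""
    if load >= 1.0:
        # Saturated: shed the request gracefully.
        return RequestPlan(tier="none", max_tokens=0, deep_reasoning=False, reject=True)
    if load >= 0.85:
        # Heavy load: cheaper model and shorter output.
        return RequestPlan(tier="cheap", max_tokens=256, deep_reasoning=False)
    if load >= 0.6:
        # Moderate load: keep the model, disable deep reasoning, trim output.
        return RequestPlan(tier="balanced", max_tokens=512, deep_reasoning=False)
    tier = "premium" if wants_deep_reasoning else "balanced"
    return RequestPlan(tier=tier, max_tokens=1024, deep_reasoning=wants_deep_reasoning)
```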

✅ Deliverable: reports/load_shedding.md

C4) Load Testing with Locust

Goal: Quantify performance like an engineer.

Step C4.1 — Write `src/locustfile.py`

Simulate:

20 users
each sends 10 chat requests
random delays

Capture:

p50 latency
p95 latency
error rate
throughput (RPS)
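
Locust reports these figures in its own statistics output. The sketch below only shows how each one is derived from raw samples, which helps when explaining the numbers in the report; the helper names and the nearest-rank percentile method are assumptions.

```python
import math


def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: list, duration_s: float) -> dict:
    """samples: list of (latency_ms, ok) pairs collected over duration_s seconds."""
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": errors / len(samples),
        "rps": len(samples) / duration_s,
    }


# Synthetic data: 100 requests over 20 seconds, every 10th one failing.
samples = [(float(i), i % 10 != 0) for i in range(1, 101)]
summary = summarize(samples, duration_s=20.0)
```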

Step C4.2 — Run load test

locust -f src/locustfile.py

✅ Deliverable: reports/load_test_results.md with key metrics.

✅ Final Submission Checklist (Section 10 Lab)

Students submit:

Eval dataset (golden set) + human rubric
Automated eval script + scored CSV
Token optimization report (before/after)
Response caching implementation + hit-rate results
Model routing policy based on measured tradeoffs
Streaming vs batch comparison (TTFT, total latency)
Async processing + concurrency limit
Rate limiting + load shedding plan
Locust load test metrics report
