Hands-on Lab 10


✅ Lab Setup (10–15 min)

Step 0: Create folder

Create genai-section10-ops-lab/ with the subfolders used in later steps: data/, src/, outputs/, and reports/.

Step 1: Install dependencies

pip install pandas numpy scikit-learn matplotlib
pip install fastapi uvicorn httpx tenacity pydantic python-dotenv
pip install tiktoken jsonschema rapidfuzz
pip install diskcache
pip install locust

Optional (if you want LLM-as-judge):

pip install openai anthropic google-generativeai

🧪 Part A: Evaluating LLM Outputs (10.1)

A1) Create a Test Dataset (Golden Set)

Goal: Build a repeatable evaluation set.

Step A1.1: Create data/eval_set.jsonl

Add 30 test cases across 3 tasks:

  1. Extraction → JSON fields from text

  2. Summarization → 2–3 sentence summary

  3. Q&A grounded in context → answer + citation

Each record:
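The record fields are not listed here. As a minimal sketch, one possible shape is shown below; the field names (`id`, `task`, `input`, `context`, `expected`) and the task labels are assumptions for illustration, not a schema this lab prescribes.

```python
import json
from pathlib import Path

# Hypothetical field names and task labels; adapt to your own schema.
REQUIRED_FIELDS = {"id", "task", "input", "expected"}
TASKS = {"extraction", "summarization", "grounded_qa"}


def validate_record(record: dict) -> None:
    """Raise ValueError if a record is missing fields or has an unknown task."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"{record.get('id', '?')}: missing {sorted(missing)}")
    if record["task"] not in TASKS:
        raise ValueError(f"{record['id']}: unknown task {record['task']!r}")


def write_jsonl(records: list[dict], path: Path) -> None:
    """Write one JSON object per line, validating each record first."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            validate_record(record)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Load a JSONL file, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative grounded Q&A record (made-up content).
example = {
    "id": "qa-001",
    "task": "grounded_qa",
    "input": "When was the invoice issued?",
    "context": "Invoice 1042 was issued on 2024-03-01.",
    "expected": {"answer": "2024-03-01", "citation": "Invoice 1042"},
}
```

Calling `write_jsonl(records, Path("data/eval_set.jsonl"))` with your 30 records would produce the file named in Step A1.1.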

✅ Deliverable: eval_set.jsonl

A2) Human Evaluation Rubric

Goal: Evaluate quality like a real team.

Step A2.1: Create rubric in reports/human_rubric.md

Score each output 1–5 for:

Add “Pass/Fail gates”:

✅ Deliverable: rubric template

A3) Automated Evaluation (Practical Metrics)

Goal: Score outputs at scale without humans.

Step A3.1: Build src/auto_eval.py

Metrics:
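The metric list is not given here. The sketch below shows a few checks that fit the three tasks from A1; the choice of metrics and the function names are assumptions. It uses the standard library's `difflib` for similarity so it runs without extra packages; `rapidfuzz` from Step 1 could be swapped in for faster fuzzy matching.

```python
import json
from difflib import SequenceMatcher


def is_valid_json(text: str) -> bool:
    """Extraction check: does the output parse as JSON at all?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose values match exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sentence_count_ok(summary: str, low: int = 2, high: int = 3) -> bool:
    """Rough check of the 2-3 sentence constraint (splits on ., !, ?)."""
    normalized = summary.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    return low <= len(sentences) <= high
```

One row per test case with these scores can then be written to outputs/auto_eval_scores.csv, for example with pandas.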

✅ Deliverable: outputs/auto_eval_scores.csv

A4) LLM-as-a-Judge (Optional but powerful)

Goal: Use a judge model to rate helpfulness + faithfulness.

Step A4.1: Create judge prompt

Judge outputs strict JSON:
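The JSON shape is not shown here. The sketch below assumes a hypothetical schema with two 1–5 integer scores (`helpfulness`, `faithfulness`, matching the goal above) and a `rationale` string, and rejects any reply that deviates from it.

```python
import json

# Hypothetical judge schema: two 1-5 integer scores plus a short rationale.
SCORE_KEYS = ("helpfulness", "faithfulness")


def parse_judge_output(raw: str) -> dict:
    """Parse and validate a judge reply; raise ValueError on any deviation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    for key in SCORE_KEYS:
        score = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {score!r}")
    if not isinstance(data.get("rationale"), str):
        raise ValueError("rationale must be a string")
    return data
```

Failing loudly on malformed replies keeps bad judge outputs from silently skewing the comparison in Step A4.2.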

Step A4.2: Run judge on 30 outputs

Compare judge scores vs automated metrics.

✅ Deliverable: reports/judge_vs_metrics.md

🧪 Part B: Cost Optimization (10.2)

B1) Token Optimization Experiment

Goal: Reduce tokens without losing quality.

Step B1.1: Create 3 prompt versions for the same task

Step B1.2: Measure tokens & quality

For each prompt:
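Input tokens can be counted before any model call. The sketch below uses `tiktoken` (installed in Step 1) with the `cl100k_base` encoding, which is an assumption: pick the encoding that matches your model. If `tiktoken` is unavailable, it falls back to a crude word count. The prompt texts in the usage note are made up.

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken if usable, else approximate by word count."""
    try:
        import tiktoken  # installed in Step 1

        encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding
        return len(encoding.encode(text))
    except Exception:
        # Package missing or encoding files not downloadable (offline):
        # fall back to a whitespace word count as a rough proxy.
        return len(text.split())


def token_report(prompts: dict[str, str]) -> list[dict]:
    """Return one row per prompt version with its input token count."""
    return [
        {"prompt": name, "input_tokens": count_tokens(text)}
        for name, text in prompts.items()
    ]
```

For example, `token_report({"v1_verbose": "...", "v2_concise": "..."})` gives the "Input tokens" column of the table below; output tokens and quality scores come from the model responses and the Part A evaluation.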

✅ Deliverable: reports/token_optimization.md
Include a table:
| Prompt | Input tokens | Output tokens | Quality score | Notes |

B2) Caching Strategies (Huge cost saver)

Goal: Implement caching for repeated prompts.

Step B2.1: Add response cache

Use diskcache:
Cache key should include:
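The key components are not listed here. A common choice, assumed in this sketch, is everything that can change the response: model name, prompt text, and generation parameters. The helper names are hypothetical. `store` can be a plain dict or a `diskcache.Cache`, since both support `in` and item access.

```python
import hashlib
import json


def make_cache_key(model: str, prompt: str, params: dict) -> str:
    """Build a deterministic key from the inputs that affect the response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,  # key must not depend on dict ordering
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(store, key: str, compute):
    """Return (value, hit). `compute` is only called on a cache miss."""
    if key in store:
        return store[key], True
    value = compute()
    store[key] = value
    return value, False
```

Hashing the serialized inputs keeps keys short and makes sure a change in temperature or model produces a different cache entry.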

Step B2.2: Test cache hit rate

Run the same 20 queries twice:

Measure:
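A minimal sketch of the two-pass experiment follows. It uses a dict as the cache and a fake model with an artificial delay, both stand-ins for illustration; in the lab you would use a `diskcache.Cache` and your real LLM call, and a full cache key rather than the raw query.

```python
import time


def run_pass(queries, store, call_model):
    """Run all queries once; return (cache_hits, elapsed_seconds)."""
    hits = 0
    start = time.perf_counter()
    for query in queries:
        if query in store:
            hits += 1
        else:
            store[query] = call_model(query)
    return hits, time.perf_counter() - start


def fake_model(query: str) -> str:
    """Stand-in for a real LLM call (assumption for illustration)."""
    time.sleep(0.01)
    return query.upper()


queries = [f"question {i}" for i in range(20)]
store = {}  # a diskcache.Cache could be used here for persistence

cold_hits, cold_time = run_pass(queries, store, fake_model)  # first run: all misses
warm_hits, warm_time = run_pass(queries, store, fake_model)  # second run: all hits
hit_rate = warm_hits / len(queries)
```

Hit rate on the second run and the cold versus warm elapsed times are the natural numbers to put in the report.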

✅ Deliverable: reports/caching_results.md

B3) Model Selection Tradeoffs

Goal: Pick the right model per task.

Step B3.1: Define 3 “tiers”

Step B3.2: Run eval_set across tiers

Measure:

Step B3.3: Build a routing policy

Example:
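The example policy is not shown here. The sketch below is one hypothetical policy: each task has a default tier, and the router escalates to a larger tier when the input does not fit. Tier names, token limits, and the task-to-tier mapping are placeholders, to be replaced by what your B3.2 measurements support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str              # placeholder label, not a real model name
    max_input_tokens: int  # assumed limit for illustration


# Hypothetical tiers, ordered from cheapest to most capable.
SMALL = Tier("small", 2_000)
MEDIUM = Tier("medium", 8_000)
LARGE = Tier("large", 32_000)
ORDER = [SMALL, MEDIUM, LARGE]

# Assumed default tier per task.
TASK_DEFAULT = {
    "extraction": SMALL,
    "summarization": MEDIUM,
    "grounded_qa": LARGE,
}


def route(task: str, input_tokens: int) -> Tier:
    """Pick the task's default tier, escalating if the input does not fit."""
    start = ORDER.index(TASK_DEFAULT.get(task, MEDIUM))
    for tier in ORDER[start:]:
        if input_tokens <= tier.max_input_tokens:
            return tier
    raise ValueError(f"input of {input_tokens} tokens exceeds every tier")
```

Keeping the policy as data plus one small function makes it easy to document in the report and to change after a new evaluation run.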

✅ Deliverable: reports/model_routing_policy.md

🧪 Part C: Latency & Scaling (10.3)

C1) Streaming vs Batch Inference

Goal: Improve perceived latency using streaming.

Step C1.1: Implement streaming endpoint (if you built Section 9)

Measure time to first token (TTFT) and total latency.
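Time to first token (TTFT) and total latency can be timed around any iterator of chunks. The sketch below uses a fake generator as a stand-in for a streaming response; the function names are hypothetical.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream; record time to first chunk and total time."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(parts)}


def fake_stream(n: int = 5, delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming LLM response (assumption for illustration)."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i} "
```

For a non-streaming call the whole response arrives at once, so its TTFT is effectively its total latency; that gap is what the comparison report should show.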

Step C1.2: Compare with non-streaming

Record:

✅ Deliverable: reports/streaming_vs_batch.md

C2) Async Processing (Parallel requests)

Goal: Handle concurrency efficiently.

Step C2.1: Implement async requests in backend

Use httpx.AsyncClient for LLM calls.

Step C2.2: Add request queue limit
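One way to cap in-flight requests is an `asyncio.Semaphore`. The sketch below interprets the limit as a cap on concurrent LLM calls, which is an assumption; the fake call stands in for an `httpx.AsyncClient` request, and the limit of 3 is arbitrary.

```python
import asyncio

MAX_CONCURRENT = 3  # assumed limit; tune to your provider's rate limits


async def fake_llm_call(prompt: str) -> str:
    """Stand-in for an httpx.AsyncClient request (assumption for illustration)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def run_limited(prompts: list[str], limit: int = MAX_CONCURRENT) -> tuple[list[str], int]:
    """Run all prompts concurrently with at most `limit` in flight.

    Returns results in input order and the peak concurrency observed.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(prompt: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here when `limit` calls are active
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_llm_call(prompt)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(p) for p in prompts))
    return list(results), peak
```

`asyncio.run(run_limited(prompts))` runs the batch; the returned peak lets you confirm the cap is respected.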

✅ Deliverable: concurrency-limited backend

C3) Load Management + Rate Limits

Goal: Keep system stable under load.

Step C3.1: Add rate limiting

Strategies:
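The strategy list is not given here. A token bucket is one widely used option and is sketched below as an illustration, not as the strategy this lab requires. The injectable `clock` parameter is an addition that makes the limiter easy to test.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a FastAPI backend, a request for which `allow()` returns False would typically get an HTTP 429 response.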

Step C3.2: Add fallback behavior

When overloaded:

✅ Deliverable: reports/load_shedding.md

C4) Load Testing with Locust

Goal: Quantify performance like an engineer.

Step C4.1: Write src/locustfile.py

Simulate:

Capture:
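The metrics to capture are not listed here. Median and 95th-percentile latency plus error rate are common choices and are assumed in this sketch, which summarizes raw latencies if you export them yourself. The function names are hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; `pct` is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: list[float], failures: int) -> dict:
    """Summarize successful-request latencies and the overall error rate."""
    total = len(latencies_ms) + failures
    return {
        "requests": total,
        "error_rate": failures / total if total else 0.0,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }
```

Reporting tail latency alongside the median matters because a few slow LLM calls can dominate user experience even when the median looks fine.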

Step C4.2: Run load test

locust -f src/locustfile.py

✅ Deliverable: reports/load_test_results.md with key metrics.

✅ Final Submission Checklist (Section 10 Lab)

Students submit:

  1. Eval dataset (golden set) + human rubric

  2. Automated eval script + scored CSV

  3. Token optimization report (before/after)

  4. Response caching implementation + hit-rate results

  5. Model routing policy based on measured tradeoffs

  6. Streaming vs batch comparison (TTFT, total latency)

  7. Async processing + concurrency limit

  8. Rate limiting + load shedding plan

  9. Locust load test metrics report