Lab Setup (10–15 min)
Step 0 – Create folder
Create genai-section10-ops-lab/ with these subfolders:
data/
src/
cache/
outputs/
logs/
reports/
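If you prefer to script the folder setup, a minimal sketch follows. The helper name `create_lab_folders` is an assumption for illustration; the folder names come from Step 0.

```python
from pathlib import Path

# Subfolder names from Step 0 of the lab.
LAB_SUBFOLDERS = ["data", "src", "cache", "outputs", "logs", "reports"]


def create_lab_folders(root: str = "genai-section10-ops-lab") -> list:
    """Create the lab root and its subfolders; safe to run more than once."""
    created = []
    for name in LAB_SUBFOLDERS:
        path = Path(root) / name
        # parents=True creates the root too; exist_ok=True avoids errors on re-run.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_lab_folders()` from the directory where you want the lab to live creates all six subfolders.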
Step 1 – Install dependencies
pip install pandas numpy scikit-learn matplotlib
pip install fastapi uvicorn httpx tenacity pydantic python-dotenv
pip install tiktoken jsonschema rapidfuzz
pip install diskcache
pip install locust
Optional (if you want LLM-as-judge):
pip install openai anthropic google-generativeai
🧪 Part A – Evaluating LLM Outputs (10.1)
A1) Create a Test Dataset (Golden Set)
Goal: Build a repeatable evaluation set.
Step A1.1 – Create data/eval_set.jsonl
Add 30 test cases across 3 tasks:
Extraction: JSON fields from text
Summarization: 2–3 sentence summary
Q&A grounded in context: answer + citation
Each record:
id
task_type
input
context (optional)
expected_output (gold answer)
must_include (list of key phrases)
must_not_include (list)
Deliverable: eval_set.jsonl
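As a starting point, the record layout above could be written to JSONL roughly like this. The three records are made-up placeholders (ids, texts, and gold answers are assumptions), and the `task_type` labels are just one possible naming; replace them with your own 30 cases.

```python
import json
from pathlib import Path

# Illustrative records only: one per task type, all content invented.
EXAMPLE_RECORDS = [
    {
        "id": "ext-001",
        "task_type": "extraction",
        "input": "Extract name and city as JSON: 'Ana Ruiz lives in Lima.'",
        "context": None,
        "expected_output": '{"name": "Ana Ruiz", "city": "Lima"}',
        "must_include": ["Ana Ruiz", "Lima"],
        "must_not_include": ["unknown"],
    },
    {
        "id": "sum-001",
        "task_type": "summarization",
        "input": "Summarize in 2-3 sentences: The team shipped v2 on Monday...",
        "context": None,
        "expected_output": "The team shipped v2 on Monday.",
        "must_include": ["v2"],
        "must_not_include": ["v3"],
    },
    {
        "id": "qa-001",
        "task_type": "grounded_qa",
        "input": "When does the warranty expire?",
        "context": "The warranty lasts 12 months from the purchase date.",
        "expected_output": "12 months after purchase.",
        "must_include": ["12 months"],
        "must_not_include": ["24 months"],
    },
]

# context is optional, so it is not listed here.
REQUIRED_FIELDS = {"id", "task_type", "input", "expected_output",
                   "must_include", "must_not_include"}


def write_jsonl(records, path):
    """Write one JSON object per line, checking required fields first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            missing = REQUIRED_FIELDS - set(record)
            if missing:
                raise ValueError(f"{record.get('id')}: missing {sorted(missing)}")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load the evaluation set back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Usage: `write_jsonl(EXAMPLE_RECORDS, "data/eval_set.jsonl")`.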
A2) Human Evaluation Rubric
Goal: Evaluate quality like a real team.
Step A2.1 – Create rubric in reports/human_rubric.md
Score each output 1–5 for:
Accuracy (factually correct)
Relevance (answers the question)
Faithfulness (uses provided context only)
Format compliance (valid JSON / required structure)
Add "Pass/Fail gates":
invalid JSON = fail
missing required fields = fail
unsafe content = fail
Deliverable: rubric template
A3) Automated Evaluation (Practical Metrics)
Goal: Score outputs at scale without humans.
Step A3.1 – Build src/auto_eval.py
Metrics:
JSON validity rate (parse success)
Schema compliance (jsonschema)
String similarity to expected: rapidfuzz for extraction/summaries
Keyphrase coverage: percent of must_include present
Faithfulness check (simple): if the answer contains words not in the context for grounded tasks, flag "potential hallucination"
Deliverable: outputs/auto_eval_scores.csv
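A minimal sketch of some of these metrics is shown below. To keep it runnable without extra installs it uses `difflib` from the standard library as a stand-in for rapidfuzz and leaves out the jsonschema check, so treat it as scaffolding rather than the finished auto_eval.py. The function names, the `"extraction"` task label, and the 4-character word cutoff are assumptions.

```python
import json
import re
from difflib import SequenceMatcher


def is_valid_json(text) -> bool:
    """JSON validity: True if the output parses."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def similarity(output: str, expected: str) -> float:
    """0-100 string similarity (difflib stand-in for rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, output, expected).ratio()


def keyphrase_coverage(output: str, must_include) -> float:
    """Percent of must_include phrases found in the output (case-insensitive)."""
    if not must_include:
        return 100.0
    text = output.lower()
    hits = sum(1 for phrase in must_include if phrase.lower() in text)
    return 100.0 * hits / len(must_include)


def unsupported_words(answer: str, context: str, min_len: int = 4) -> set:
    """Crude faithfulness check: answer words that never appear in the context."""
    ctx = set(re.findall(r"\w+", context.lower()))
    ans = set(re.findall(r"\w+", answer.lower()))
    return {w for w in ans if len(w) >= min_len and w not in ctx}


def score_record(record: dict, output: str) -> dict:
    """Combine the metrics for one eval-set record and one model output."""
    scores = {
        "id": record["id"],
        "similarity": similarity(output, record["expected_output"]),
        "keyphrase_coverage": keyphrase_coverage(output, record.get("must_include", [])),
    }
    if record["task_type"] == "extraction":
        scores["json_valid"] = is_valid_json(output)
    context = record.get("context")
    if context:  # only grounded tasks carry a context
        scores["potential_hallucination"] = bool(unsupported_words(output, context))
    return scores
```

The word-overlap check flags any paraphrase that uses vocabulary outside the context, so expect false positives; it is a cheap first filter, not a verdict.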
A4) LLM-as-a-Judge (Optional but powerful)
Goal: Use a judge model to rate accuracy, relevance, and faithfulness.
Step A4.1 – Create judge prompt
The judge outputs strict JSON:
accuracy_score (1–5)
relevance_score (1–5)
faithfulness_score (1–5)
explanation (short)
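One way to enforce the strict-JSON contract is to pair the prompt with a parser that rejects anything outside it. The prompt wording and function names below are assumptions; only the four output fields come from the step above. The actual model call is left out because it depends on the provider you choose.

```python
import json

# Hypothetical prompt wording; adjust to your tasks.
JUDGE_PROMPT_TEMPLATE = (
    "You are a strict evaluator. Rate the candidate answer.\n"
    "Return ONLY a JSON object with the keys accuracy_score, relevance_score, "
    "faithfulness_score (integers from 1 to 5) and explanation (one short sentence).\n\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Candidate answer: {answer}\n"
)

SCORE_KEYS = ("accuracy_score", "relevance_score", "faithfulness_score")


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Fill the template for one eval-set record."""
    return JUDGE_PROMPT_TEMPLATE.format(question=question, context=context, answer=answer)


def parse_judge_output(raw: str) -> dict:
    """Parse and validate the judge reply; raises ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for key in SCORE_KEYS:
        value = data.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
    if not isinstance(data.get("explanation"), str):
        raise ValueError("explanation must be a string")
    return data
```

Counting how many judge replies fail `parse_judge_output` is itself a useful number for reports/judge_vs_metrics.md.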
Step A4.2 – Run judge on 30 outputs
Compare judge scores vs automated metrics.
Deliverable: reports/judge_vs_metrics.md
🧪 Part B – Cost Optimization (10.2)
B1) Token Optimization Experiment
Goal: Reduce tokens without losing quality.
Step B1.1 – Create 3 prompt versions for the same task
Verbose prompt (long instructions + examples)
Optimized prompt (tight instructions)
Ultra-optimized (template + strict schema + short)
Step B1.2 – Measure tokens & quality
For each prompt, record:
input tokens
output tokens
auto-eval score
Deliverable: reports/token_optimization.md
Include a table:
| Prompt | Input tokens | Output tokens | Quality score | Notes |
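Token counts for the table can be produced with a small helper like the one below. It tries tiktoken with the `cl100k_base` encoding, which is an assumption (your model may use a different tokenizer), and falls back to a rough 4-characters-per-token estimate if tiktoken is unavailable. `report_row` is a hypothetical helper that formats one table row.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken if possible, else use a rough estimate."""
    try:
        import tiktoken
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Fallback heuristic only (about 4 characters per token for English text).
        return max(1, len(text) // 4)


def report_row(name: str, prompt: str, output: str, quality: float, notes: str = "") -> str:
    """Format one row of the token optimization table."""
    return (f"| {name} | {count_tokens(prompt)} | {count_tokens(output)} "
            f"| {quality:.2f} | {notes} |")
```

For example, `report_row("Verbose", verbose_prompt, model_output, 0.91)` returns a row matching the header above. Counts from the fallback are estimates, so note in the report which method produced them.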
B2) Caching Strategies (Huge cost saver)
Goal: Implement caching for repeated prompts.
Step B2.1 – Add response cache
Use diskcache for the response store.
The cache key should include:
model name
prompt hash
temperature/top_p/max_tokens
system prompt version
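A sketch of a cache key built from exactly these fields is shown below. The function names are assumptions. `cached_call` works with any dict-like store, so a plain `dict` is enough for testing, and a `diskcache.Cache("cache")` instance can be passed instead since it supports the same `in` and `[]` access.

```python
import hashlib
import json


def make_cache_key(model: str, prompt: str, temperature: float, top_p: float,
                   max_tokens: int, system_prompt_version: str) -> str:
    """Build a stable key from every setting that can change the response."""
    payload = {
        "model": model,
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "system_prompt_version": system_prompt_version,
    }
    # sort_keys makes the serialization deterministic across runs.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_call(cache, key: str, call_llm):
    """Return (response, was_hit); call_llm() runs only on a cache miss."""
    if key in cache:
        return cache[key], True
    response = call_llm()
    cache[key] = response
    return response, False
```

Leaving any of these fields out of the key risks serving a response generated under different settings, for example a reply produced with an older system prompt.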
Step B2.2 – Test cache hit rate
Run the same 20 queries twice:
first run = cold cache
second run = warm cache
Measure:
cost reduction (approx)
latency improvement
Deliverable: reports/caching_results.md
B3) Model Selection Tradeoffs
Goal: Pick the right model per task.
Step B3.1 – Define 3 "tiers"
Fast/cheap model
Balanced model
High quality model
Step B3.2 – Run eval_set across tiers
Measure:
quality score
latency
token usage
estimated cost
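Estimated cost can be derived from the measured token counts. The sketch below assumes per-million-token pricing with separate input and output rates; the prices in the usage note are placeholders, not real rates, so substitute the current numbers from your provider.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_million: float, price_out_per_million: float) -> float:
    """Cost of one request, given prices quoted per one million tokens."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000
```

For example, with hypothetical prices of 0.50 per million input tokens and 1.50 per million output tokens, a request with 1,000 input and 500 output tokens costs (1000 × 0.50 + 500 × 1.50) / 1,000,000 = 0.00125.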
Step B3.3 – Build a routing policy
Example:
Extraction → cheap
RAG answer → balanced
Complex reasoning → premium
Deliverable: reports/model_routing_policy.md
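In code, the example policy can be as simple as a lookup table. The task-type keys, the default tier, and the placeholder model names below are assumptions; fill them in from your Step B3.2 measurements.

```python
# Task type -> tier, mirroring the example policy.
ROUTING_POLICY = {
    "extraction": "cheap",
    "rag_answer": "balanced",
    "complex_reasoning": "premium",
}

# Tier -> model name. Placeholders only; use the models you actually measured.
TIER_MODELS = {
    "cheap": "<fast-cheap-model>",
    "balanced": "<balanced-model>",
    "premium": "<high-quality-model>",
}

DEFAULT_TIER = "balanced"  # assumption: unknown task types get the middle tier


def route(task_type: str) -> str:
    """Return the tier for a task type, falling back to the default."""
    return ROUTING_POLICY.get(task_type, DEFAULT_TIER)


def model_for(task_type: str) -> str:
    """Return the configured model name for a task type."""
    return TIER_MODELS[route(task_type)]
```

Keeping the policy as data rather than branching logic makes it easy to paste into the report and to change after the next evaluation run.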
🧪 Part C – Latency & Scaling (10.3)
C1) Streaming vs Batch Inference
Goal: Improve perceived latency using streaming.
Step C1.1 – Implement streaming endpoint (if you built Section 9)
Measure:
Time to first token (TTFT)
Total completion time
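Both timings can be taken by wrapping whatever iterator yields the streamed chunks. In this sketch `fake_stream` is a stand-in that simulates a model with `time.sleep`; in the lab you would pass the real chunk iterator from your streaming client instead.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream and record time to first token and total time."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # The first chunk has arrived: this is the time to first token.
            ttft = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(parts)}


def fake_stream(tokens, delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming LLM response (assumption for demo purposes)."""
    for token in tokens:
        time.sleep(delay_s)
        yield token
```

For a non-streaming call the whole response arrives at once, so TTFT and total time are effectively the same number; that gap is what the comparison report should show.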
Step C1.2 – Compare with non-streaming
Record:
TTFT difference
user experience notes
Deliverable: reports/streaming_vs_batch.md
C2) Async Processing (Parallel requests)
Goal: Handle concurrency efficiently.
Step C2.1 – Implement async requests in backend
Use httpx.AsyncClient for LLM calls.
Step C2.2 – Add request queue limit
max concurrent LLM calls = N (e.g. 10)
overflow returns HTTP 429 with a Retry-After header
Deliverable: concurrency-limited backend
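The limit itself needs only an in-flight counter. The sketch below is framework-independent: `make_call` stands for whatever coroutine performs the LLM request (for example one using httpx.AsyncClient), and the returned dict would be translated into a real HTTP response by your backend. The class name and result shape are assumptions.

```python
import asyncio


class ConcurrencyLimiter:
    """Reject new work once max_concurrent calls are already in flight."""

    def __init__(self, max_concurrent: int = 10, retry_after_s: int = 1):
        self.max_concurrent = max_concurrent
        self.retry_after_s = retry_after_s
        self.in_flight = 0

    async def run(self, make_call):
        # No await between the check and the increment, so within a single
        # event loop this pair cannot be interleaved with another task.
        if self.in_flight >= self.max_concurrent:
            return {"status": 429,
                    "headers": {"Retry-After": str(self.retry_after_s)}}
        self.in_flight += 1
        try:
            body = await make_call()
            return {"status": 200, "body": body}
        finally:
            self.in_flight -= 1


async def demo():
    """Three simultaneous requests against a limit of two."""
    async def slow_call():
        await asyncio.sleep(0.05)  # simulated LLM latency
        return "ok"

    limiter = ConcurrencyLimiter(max_concurrent=2)
    return await asyncio.gather(*(limiter.run(slow_call) for _ in range(3)))
```

Running `asyncio.run(demo())` lets two requests through and rejects the third with a 429-style result. Rejecting instead of queueing keeps latency bounded; a queue with a timeout is a reasonable alternative if short waits are acceptable.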
C3) Load Management + Rate Limits
Goal: Keep system stable under load.
Step C3.1 – Add rate limiting
Strategies:
per-user requests/minute
per-IP requests/minute
global cap
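All three strategies can share one in-memory limiter keyed by a string: a user id, an IP address, or a fixed key such as "global". The sliding-window approach below is one common choice, not the only one, and the class name is an assumption. State lives in process memory, so it only suits a single-process lab setup.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowRateLimiter:
    """Allow at most max_requests per key within the last window_s seconds."""

    def __init__(self, max_requests: int, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        # `now` can be injected for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Usage: create one limiter per strategy, for example `per_user = SlidingWindowRateLimiter(30)` and `global_cap = SlidingWindowRateLimiter(500)`, and serve a request only if every relevant `allow(...)` call returns True.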
Step C3.2 – Add fallback behavior
When overloaded:
degrade to cheaper model
shorten output
disable "deep reasoning mode"
return "try again soon" gracefully
Deliverable: reports/load_shedding.md
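One way to write the plan down is as a ladder that maps current load to a response mode. Everything numeric here is an illustrative assumption: `load` is taken to be in-flight requests divided by capacity, and the thresholds, tier names, and token limits should be replaced with values from your own load tests.

```python
def choose_degradation(load: float) -> dict:
    """Pick a serving mode from the current load ratio (in-flight / capacity)."""
    if load < 0.70:
        # Normal operation.
        return {"action": "serve", "model_tier": "balanced",
                "max_tokens": 1024, "deep_reasoning": True}
    if load < 0.85:
        # First step: degrade to the cheaper model and disable deep reasoning.
        return {"action": "serve", "model_tier": "cheap",
                "max_tokens": 1024, "deep_reasoning": False}
    if load < 1.00:
        # Second step: also shorten the output.
        return {"action": "serve", "model_tier": "cheap",
                "max_tokens": 256, "deep_reasoning": False}
    # At or over capacity: refuse gracefully.
    return {"action": "reject", "status": 429,
            "message": "The service is busy. Please try again soon."}
```

The order of the steps is a design choice: each rung gives up a little quality to protect availability, and outright rejection is the last resort.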
C4) Load Testing with Locust
Goal: Quantify performance like an engineer.
Step C4.1 – Write src/locustfile.py
Simulate:
20 users
each sends 10 chat requests
random delays
Capture:
p50 latency
p95 latency
error rate
throughput (RPS)
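Locust reports these figures itself, but it helps to know how they are computed when you write them up or cross-check raw timings. The sketch below uses the nearest-rank percentile method, which is one of several definitions, so small differences from Locust's own numbers are expected. The function names and input shape are assumptions.

```python
import math


def percentile(sorted_values, p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_s, ok_flags, duration_s: float) -> dict:
    """Compute p50, p95, error rate, and throughput for one test run."""
    ordered = sorted(latencies_s)
    total = len(ordered)
    failures = sum(1 for ok in ok_flags if not ok)
    return {
        "p50_s": percentile(ordered, 50),
        "p95_s": percentile(ordered, 95),
        "error_rate": failures / total,
        "rps": total / duration_s,
    }
```

For example, ten requests with latencies 0.1 s through 1.0 s, one failure, over a 5-second run give p50 = 0.5 s, p95 = 1.0 s, an error rate of 10%, and 2 requests per second.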
Step C4.2 – Run load test
locust -f src/locustfile.py
Deliverable: reports/load_test_results.md with key metrics.
Final Submission Checklist (Section 10 Lab)
Students submit:
Eval dataset (golden set) + human rubric
Automated eval script + scored CSV
Token optimization report (before/after)
Response caching implementation + hit-rate results
Model routing policy based on measured tradeoffs
Streaming vs batch comparison (TTFT, total latency)
Async processing + concurrency limit
Rate limiting + load shedding plan
Locust load test metrics report