Hands-on Lab 6

✅ Lab Setup (10–15 min)

Step 0 — Create folder

Create genai-section6-rag-lab/ with these subfolders:

data/docs/
data/queries/
indexes/
outputs/
src/
reports/
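
If you prefer to script the setup, the folder tree above can be created with a few lines of Python. This is a minimal sketch; the helper name `create_lab_folders` is an illustration, not part of the lab.

```python
from pathlib import Path

# Subfolders listed in Step 0.
SUBDIRS = ["data/docs", "data/queries", "indexes", "outputs", "src", "reports"]

def create_lab_folders(root="genai-section6-rag-lab"):
    """Create the lab folder tree and return the paths that now exist."""
    root_path = Path(root)
    created = []
    for sub in SUBDIRS:
        path = root_path / sub
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created
```

Calling `create_lab_folders()` from the directory where you want the lab to live is enough; running it twice does no harm.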

Step 1 — Install dependencies

pip install numpy pandas scikit-learn matplotlib
pip install sentence-transformers
pip install chromadb faiss-cpu
pip install rank-bm25 nltk
pip install transformers torch sentencepiece
pip install pydantic python-dotenv tqdm

If the faiss-cpu install fails, skip it and use Chroma only; the rest of this lab does not depend on FAISS.

Step 2 — Download NLTK tokenizer

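python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

Recent NLTK releases look for punkt_tab instead of punkt, so the command downloads both.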

🧪 Part A — Why RAG is Needed (6.1)

A1) Hallucination Baseline (No Retrieval)

Goal: Show that “LLM-only” answers can hallucinate on facts the model has never seen.

Step A1.1 — Create a private knowledge base

Add 6–10 docs in data/docs/ (plain .txt):

“Company Policy Handbook”
“Product FAQ”
“Engineering Runbook”
“Incident Postmortem”
“HR Guidelines”
“Aviation Maintenance Notes” (optional domain)

Each doc: 500–1500 words.

Step A1.2 — Create evaluation questions

Create data/queries/questions.json with 15 questions:

10 answerable from docs
5 NOT answerable (to test honesty)

Each record:

id
question
answerable (true/false)
source_doc_hint (optional)
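
One possible way to produce the file is sketched below. The helper names and the two sample questions are placeholders for illustration only; write your own questions against your own documents.

```python
import json
from pathlib import Path

def make_question(qid, question, answerable, source_doc_hint=None):
    """Build one record with the fields listed in Step A1.2."""
    record = {"id": qid, "question": question, "answerable": answerable}
    if source_doc_hint is not None:  # the hint is optional
        record["source_doc_hint"] = source_doc_hint
    return record

def save_questions(records, path="data/queries/questions.json"):
    """Write the question list as a JSON array."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
    return out

# Placeholder examples: one answerable, one deliberately unanswerable.
example_records = [
    make_question("q01", "What does the Product FAQ say about refunds?", True, "product_faq"),
    make_question("q11", "What was last year's revenue?", False),
]
```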

Step A1.3 — Run “LLM-only” answering

Use a small instruction model for generation:

google/flan-t5-base (light)
or
any API model you already use

Prompt:

“Answer the question. If you don’t know, say you don’t know.”

Save outputs to:

outputs/baseline_llm_only.json
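
A minimal sketch of the baseline run follows, assuming the google/flan-t5-base option and the standard Hugging Face seq2seq classes. The function names, the `max_new_tokens` value, and the output record layout are assumptions made for this example. The model is loaded lazily, and `run_baseline` accepts any `generate(prompt)` callable, so an API model can be swapped in.

```python
import json
from pathlib import Path

PROMPT = "Answer the question. If you don't know, say you don't know.\n\nQuestion: {question}"

def build_prompt(question):
    return PROMPT.format(question=question)

def load_generator(model_name="google/flan-t5-base"):
    """Return a generate(prompt) function backed by a local seq2seq model."""
    # Imported here so the rest of the file works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def generate(prompt, max_new_tokens=128):
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return generate

def run_baseline(questions, generate, out_path="outputs/baseline_llm_only.json"):
    """Answer every question without retrieval and save the results."""
    results = []
    for q in questions:
        results.append({
            "id": q["id"],
            "question": q["question"],
            "answerable": q["answerable"],
            "answer": generate(build_prompt(q["question"])),
        })
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8")
    return results
```

Typical use would be `run_baseline(questions, load_generator())`, where `questions` is the list loaded from data/queries/questions.json.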

Step A1.4 — Score hallucinations manually (fast)

Add these fields to each record in outputs/baseline_llm_only.json:

hallucinated (Y/N)
uncertain (Y/N)
correct (Y/N)

✅ Deliverable: reports/hallucination_baseline.md
Include counts and 2–3 examples of hallucination.
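
Once the Y/N labels are filled in, the counts for the report can be tallied with a short script. This sketch assumes each record carries the three labels as the strings "Y" or "N"; the function names are illustrative.

```python
FIELDS = ("hallucinated", "uncertain", "correct")

def summarize_scores(records, fields=FIELDS):
    """Count Y labels per field and convert them to rates."""
    total = len(records)
    summary = {}
    for field in fields:
        count = sum(1 for r in records if str(r.get(field, "N")).upper() == "Y")
        summary[field] = {"count": count, "rate": count / total if total else 0.0}
    return summary

def to_markdown(summary, total):
    """Render the summary as a small table for reports/hallucination_baseline.md."""
    lines = [f"Total questions: {total}", "", "| label | count | rate |", "|---|---|---|"]
    for field, stats in summary.items():
        lines.append(f"| {field} | {stats['count']} | {stats['rate']:.0%} |")
    return "\n".join(lines)
```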

🧪 Part B — RAG Architecture (6.2)

B1) Document Ingestion Pipeline

Goal: Turn documents into chunks with metadata.

Step B1.1 — Build ingestion script `src/ingest.py`

Read all data/docs/*.txt and produce chunk records with:

chunk_id
doc_id
title
chunk_text
chunk_index
char_start, char_end

Step B1.2 — Implement chunking strategy

Use token-aware chunking:

chunk_size = 250 tokens
overlap = 40 tokens

Save:

outputs/chunks.jsonl

✅ Deliverable: chunk file + chunk count per document.
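
The ingestion and chunking steps can be sketched as follows. To keep character offsets simple, this example treats each whitespace-delimited word as one token, which is an assumption; a model tokenizer or NLTK would give different counts. The `chunk_id` format and the use of the file stem as both `doc_id` and `title` are also illustrative choices.

```python
import json
import re
from collections import Counter
from pathlib import Path

TOKEN_RE = re.compile(r"\S+")  # assumption: one whitespace-delimited word = one token

def chunk_document(doc_id, title, text, chunk_size=250, overlap=40):
    """Split text into overlapping token windows with character offsets."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    spans = [m.span() for m in TOKEN_RE.finditer(text)]
    step = chunk_size - overlap
    chunks = []
    for chunk_index, start in enumerate(range(0, len(spans), step)):
        window = spans[start:start + chunk_size]
        char_start, char_end = window[0][0], window[-1][1]
        chunks.append({
            "chunk_id": f"{doc_id}-{chunk_index:04d}",
            "doc_id": doc_id,
            "title": title,
            "chunk_text": text[char_start:char_end],
            "chunk_index": chunk_index,
            "char_start": char_start,
            "char_end": char_end,
        })
        if start + chunk_size >= len(spans):
            break  # this window already reached the end of the document
    return chunks

def ingest_folder(docs_dir="data/docs", out_path="outputs/chunks.jsonl"):
    """Chunk every .txt file, write one JSON record per line, return counts per doc."""
    all_chunks = []
    for path in sorted(Path(docs_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        all_chunks.extend(chunk_document(path.stem, path.stem, text))
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for chunk in all_chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")
    return Counter(c["doc_id"] for c in all_chunks)
```

With the default settings each window advances by 210 tokens, so a 600-token document yields three chunks. The `Counter` returned by `ingest_folder` gives the chunk count per document asked for in the deliverable.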

B2) Embeddings + Vector Store (Retriever)

Goal: Create a searchable index.

Step B2.1 — Generate embeddings for chunks

Use:

sentence-transformers/all-MiniLM-L6-v2

Save:

outputs/chunk_embeddings.npy
outputs/chunk_metadata.json
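
A sketch of the embedding step is shown below, assuming chunk records shaped like the ones from Step B1.1. Normalizing the embeddings and storing everything except `chunk_text` as metadata are choices made for this example, not requirements of the lab. Row i of the .npy file corresponds to entry i of the metadata file.

```python
import json
from pathlib import Path

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

def split_chunks(chunks):
    """Return (texts, metadata) in the same order so row i of each lines up."""
    texts = [c["chunk_text"] for c in chunks]
    metadata = [{k: v for k, v in c.items() if k != "chunk_text"} for c in chunks]
    return texts, metadata

def embed_and_save(chunks, out_dir="outputs", model_name=MODEL_NAME):
    """Embed all chunk texts and save the matrix plus aligned metadata."""
    # Heavy imports stay local so split_chunks works without them installed.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    texts, metadata = split_chunks(chunks)
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=False)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    np.save(out / "chunk_embeddings.npy", embeddings)
    (out / "chunk_metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return embeddings.shape
```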

Step B2.2 — Store in Chroma

Create a Chroma collection:

name: rag_demo

Insert:

ids = chunk_id
embeddings
documents = chunk_text
metadatas = one dict per chunk: {doc_id, title, chunk_index}

✅ Deliverable: persisted Chroma DB in indexes/chroma/.
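
The insert step might look like the sketch below, assuming a recent chromadb release with `PersistentClient`. The helper names are illustrative, and `upsert` is used instead of `add` so the script can be re-run without duplicate-id problems.

```python
def to_chroma_batch(chunks):
    """Reshape chunk records into the parallel lists Chroma expects."""
    return {
        "ids": [c["chunk_id"] for c in chunks],
        "documents": [c["chunk_text"] for c in chunks],
        "metadatas": [
            {"doc_id": c["doc_id"], "title": c["title"], "chunk_index": c["chunk_index"]}
            for c in chunks
        ],
    }

def build_collection(chunks, embeddings, persist_dir="indexes/chroma", name="rag_demo"):
    """Create (or reopen) the persisted collection and insert all chunks."""
    import chromadb  # imported here so to_chroma_batch works without chromadb

    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection(name=name)
    # Plain float lists avoid any dependence on the array type of `embeddings`.
    vectors = [[float(x) for x in row] for row in embeddings]
    collection.upsert(embeddings=vectors, **to_chroma_batch(chunks))
    return collection
```

Here `embeddings` is expected to be row-aligned with `chunks`, for example the array loaded from outputs/chunk_embeddings.npy.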

B3) Retriever → Generator Flow

Goal: Implement end-to-end RAG.

Step B3.1 — Write retriever function `src/retrieve.py`

Inputs:

query
top_k (default 4)

Steps:

embed query
vector search
return top_k chunks + scores
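
The three steps above can be sketched as a single function. It takes the Chroma collection and a query-embedding callable as arguments, which keeps it easy to test; both parameter names are assumptions. Chroma returns distances (smaller is closer), so the example also derives a `score` as 1 / (1 + distance), a convention chosen here only so that higher means better.

```python
def retrieve(query, collection, embed_query, top_k=4):
    """Embed the query, run a vector search, and return top_k chunks with scores."""
    query_vec = embed_query(query)
    result = collection.query(
        query_embeddings=[query_vec],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    hits = []
    # Chroma returns one inner list per query; we sent a single query.
    rows = zip(
        result["ids"][0],
        result["documents"][0],
        result["metadatas"][0],
        result["distances"][0],
    )
    for chunk_id, text, meta, distance in rows:
        hits.append({
            "chunk_id": chunk_id,
            "chunk_text": text,
            "doc_id": meta["doc_id"],
            "title": meta["title"],
            "chunk_index": meta["chunk_index"],
            "distance": distance,
            "score": 1.0 / (1.0 + distance),  # assumption: monotone rescaling only
        })
    return hits
```

With sentence-transformers, `embed_query` could be `lambda q: model.encode(q, normalize_embeddings=True).tolist()`, using the same model that embedded the chunks.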

Step B3.2 — Build context window manager

Implement “context builder”:

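Sort by score, best match first
Stop adding chunks once the next chunk would push the context past the token budget (e.g., 900 tokens; use a smaller budget if your generator truncates earlier, as flan-t5-base typically does at 512 input tokens)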
Add citations like: [doc_id#chunk_index]
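
A minimal context builder along these lines is sketched below. It assumes each hit is a dict with `doc_id`, `chunk_index`, `chunk_text`, and a `score` where higher is better, and it counts whitespace-separated words as tokens, which only approximates a model tokenizer.

```python
def count_tokens(text):
    """Rough token count; swap in the generator's tokenizer for exact budgets."""
    return len(text.split())

def build_context(hits, token_budget=900):
    """Return (context_text, tokens_used) built from the best-scoring chunks."""
    ordered = sorted(hits, key=lambda h: h["score"], reverse=True)
    parts, used = [], 0
    for hit in ordered:
        citation = f"[{hit['doc_id']}#{hit['chunk_index']}]"
        block = f"{citation} {hit['chunk_text']}"
        cost = count_tokens(block)
        if used + cost > token_budget:
            break  # stop at the first chunk that would exceed the budget
        parts.append(block)
        used += cost
    return "\n\n".join(parts), used
```

Stopping at the first chunk that does not fit keeps the strongest matches contiguous; skipping it and trying smaller chunks is an alternative worth comparing in Part D.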

Step B3.3 — Write generator `src/generate.py`

Prompt template:

System: “You are a grounded assistant. Use only provided context.”
User: question
Context: retrieved chunks with citations

Rules:

If answer not in context: say “Not found in provided documents.”
Provide bullet answer + citations.

Save outputs:

outputs/rag_answers.json

✅ Deliverable: working RAG pipeline producing answers with citations.
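
One way to assemble the prompt is sketched below. Since flan-t5-style models take a single text input, the system, context, and user parts are joined into one string; that layout and the function names are assumptions. `generate` is any callable that maps a prompt to text.

```python
SYSTEM = "You are a grounded assistant. Use only provided context."
NOT_FOUND = "Not found in provided documents."

def build_rag_prompt(question, context):
    """Combine the system rule, retrieved context, and question into one prompt."""
    return (
        f"{SYSTEM}\n"
        f"If the answer is not in the context, reply exactly: {NOT_FOUND}\n"
        "Answer in bullet points and cite sources as [doc_id#chunk_index].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def answer_question(question, context, generate):
    """Return a grounded answer, short-circuiting when nothing was retrieved."""
    if not context.strip():
        return NOT_FOUND  # no context means there is nothing to ground on
    return generate(build_rag_prompt(question, context))
```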

B4) Compare RAG vs LLM-only

Goal: Demonstrate improvement + reduced hallucination.

Step B4.1 — Run same 15 questions through RAG

Save results.

Step B4.2 — Score outputs

Add to each record:

grounded (Y/N)
citation_correct (Y/N)
hallucinated (Y/N)
correct (Y/N), so accuracy can be compared with the baseline in Step B4.3

Step B4.3 — Write comparison summary

Compute:

hallucination rate drop
answer accuracy improvement on answerable questions
honesty improvement on unanswerable questions

✅ Deliverable: reports/rag_vs_baseline.md with a table.
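
The three comparison numbers can be computed with a sketch like this one. It assumes both scored files carry `answerable` (true/false) plus `hallucinated` and `correct` as "Y"/"N", and it defines honesty on unanswerable questions as the share of those answers that are not hallucinated; that definition is an assumption, so adjust it to your rubric.

```python
def rate(records, field, answerable=None):
    """Share of records with field == 'Y', optionally filtered by answerability."""
    rows = [r for r in records if answerable is None or r["answerable"] == answerable]
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) == "Y") / len(rows)

def compare(baseline, rag):
    """Return {metric: (baseline_value, rag_value, change)} for the report."""
    metrics = {
        "hallucination_rate": (rate(baseline, "hallucinated"), rate(rag, "hallucinated")),
        "accuracy_answerable": (rate(baseline, "correct", True), rate(rag, "correct", True)),
        "honesty_unanswerable": (
            1.0 - rate(baseline, "hallucinated", False),
            1.0 - rate(rag, "hallucinated", False),
        ),
    }
    return {name: (b, r, r - b) for name, (b, r) in metrics.items()}

def to_table(result):
    """Render the comparison as a table for reports/rag_vs_baseline.md."""
    lines = ["| metric | LLM-only | RAG | change |", "|---|---|---|---|"]
    for name, (b, r, delta) in result.items():
        lines.append(f"| {name} | {b:.0%} | {r:.0%} | {delta:+.0%} |")
    return "\n".join(lines)
```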

🧪 Part C — Advanced RAG Techniques (6.3)

C1) Hybrid Search (Keyword + Vector)

Goal: Improve retrieval for exact terms and rare keywords.

Step C1.1 — Build BM25 index

Use rank-bm25 over chunk texts.

Step C1.2 — Retrieve with both methods

For each query:

vector top_k = 8
bm25 top_k = 8
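
The keyword side can be sketched with rank-bm25's `BM25Okapi`. The regex tokenizer below is an assumption made to keep the example dependency-free (NLTK's `word_tokenize` is another option), and the function names are illustrative. The same tokenizer must be applied to both chunks and queries.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_top_k(query, chunks, top_k=8):
    """Score every chunk against the query with BM25 and return the best top_k."""
    from rank_bm25 import BM25Okapi  # imported here so tokenize works on its own

    bm25 = BM25Okapi([tokenize(c["chunk_text"]) for c in chunks])
    scores = bm25.get_scores(tokenize(query))
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [
        {"chunk_id": chunk["chunk_id"], "score": float(score)}
        for chunk, score in ranked[:top_k]
    ]
```

For many queries, build the `BM25Okapi` index once and reuse it instead of rebuilding it per call as this short version does.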

Step C1.3 — Merge results

Combine and deduplicate:

Normalize scores
Weighted merge (example: 0.6 vector, 0.4 bm25)
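
A sketch of the merge step follows. It assumes each retriever's results have been reduced to a `{chunk_id: score}` dict with higher meaning better, uses min-max normalization (one of several reasonable choices), and gives a chunk found by only one retriever a zero from the other.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}

def hybrid_merge(vector_scores, bm25_scores, w_vector=0.6, w_bm25=0.4, top_k=8):
    """Combine both result sets, deduplicating by chunk_id."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    merged = {
        cid: w_vector * v.get(cid, 0.0) + w_bm25 * b.get(cid, 0.0)
        for cid in set(v) | set(b)
    }
    # Sort by merged score, breaking ties by chunk_id for stable output.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Because the dict is keyed by `chunk_id`, a chunk returned by both retrievers appears once and benefits from both weights.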

✅ Deliverable: outputs/hybrid_retrieval.json and a short before/after example.

C2) Re-ranking Strategies

Goal: Improve top results before generation.

Option A (No LLM): Cross-encoder re-ranker

Use:

cross-encoder/ms-marco-MiniLM-L-6-v2

Step C2.1 — Take top 12 candidates from hybrid

Step C2.2 — Re-rank to top 4 based on relevance score

Save:

before ranks vs after ranks
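
The re-ranking steps might be sketched as below, using the `CrossEncoder` class from sentence-transformers. `rerank` takes the scoring function as a parameter so the ranking logic can be checked without downloading the model; the function and field names are assumptions.

```python
def load_cross_encoder_scorer(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return score_pairs(pairs) backed by a cross-encoder model."""
    from sentence_transformers import CrossEncoder  # lazy: heavy dependency

    model = CrossEncoder(model_name)

    def score_pairs(pairs):
        return [float(s) for s in model.predict(pairs)]

    return score_pairs

def rerank(query, candidates, score_pairs, top_n=4):
    """Re-score candidates (given in their current rank order) and keep top_n."""
    pairs = [(query, c["chunk_text"]) for c in candidates]
    scores = score_pairs(pairs)
    rows = [
        {"chunk_id": c["chunk_id"], "rank_before": i + 1, "rerank_score": s}
        for i, (c, s) in enumerate(zip(candidates, scores))
    ]
    rows.sort(key=lambda r: r["rerank_score"], reverse=True)
    for new_rank, row in enumerate(rows, start=1):
        row["rank_after"] = new_rank
    return rows[:top_n]
```

Each returned row carries both `rank_before` and `rank_after`, which is the before/after comparison the report asks for.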

✅ Deliverable: reports/reranking_effect.md (show one query where it improves relevance).

C3) Multi-Document Reasoning

Goal: Answer questions requiring combining multiple sources.

Step C3.1 — Create multi-doc questions (5)

Example types:

Compare two policies from different docs
Summarize what changed between the Postmortem and the Runbook
“What are the steps AND the exception rules?”

Step C3.2 — Retrieval strategy

Retrieve top_k=8
Ensure diversity:
- max 2 chunks per doc
- prefer covering multiple doc_ids
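
The diversity rule can be sketched as a greedy filter over the ranked hits. This assumes each hit has `doc_id` and a `score` where higher is better; capping chunks per document is what pushes the selection toward more `doc_id`s, and it is only one possible approach.

```python
from collections import Counter

def select_diverse(hits, top_k=8, max_per_doc=2):
    """Pick the best-scoring hits while allowing at most max_per_doc per document."""
    per_doc = Counter()
    selected = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if per_doc[hit["doc_id"]] >= max_per_doc:
            continue  # this document already contributed its share
        selected.append(hit)
        per_doc[hit["doc_id"]] += 1
        if len(selected) == top_k:
            break
    return selected
```

To leave the filter something to choose from, retrieve more candidates than `top_k` (for example 20) before applying it.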

Step C3.3 — Generate answer with synthesis prompt

Prompt rules:

Must cite at least 2 distinct sources
Provide “combined explanation” section
Provide “source breakdown” section
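
The first rule can be checked automatically after generation. The sketch below assumes citations follow the [doc_id#chunk_index] format used earlier in the lab; the regex and function names are illustrative.

```python
import re

# Matches citations such as [runbook#3]; doc_id may not contain brackets, '#', or spaces.
CITATION_RE = re.compile(r"\[([^\[\]#\s]+)#(\d+)\]")

def cited_docs(answer):
    """Return the set of distinct doc_ids cited in the answer."""
    return {doc_id for doc_id, _ in CITATION_RE.findall(answer)}

def has_multi_source_citations(answer, min_sources=2):
    """True when the answer cites at least min_sources different documents."""
    return len(cited_docs(answer)) >= min_sources
```

An answer that fails the check can be flagged in outputs/multidoc_answers.json or regenerated with a stricter prompt.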

✅ Deliverable: outputs/multidoc_answers.json + 2 strong examples.

🧪 Part D — Context Window Management (Production Practice)

D1) Token Budget Stress Test

Goal: Show what happens when context is too large.

Step D1.1 — Try with large top_k (e.g., 15)

Step D1.2 — Compare strategies

naive: concatenate all
smart: token-budget selection + diversity
summary: summarize chunks then insert
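
To compare the three strategies on equal terms, it helps to measure the size of the context each one produces. The sketch below assumes hits with `doc_id`, `chunk_text`, and `score`, counts whitespace-separated words as tokens, and takes a hypothetical `summarize(text)` callable (for instance a wrapper around your generator). Unlike a strict stop-at-budget builder, the `smart` variant here skips chunks that do not fit and keeps looking, which is a design choice of this example.

```python
def count_tokens(text):
    return len(text.split())  # rough proxy for model tokens

def naive(hits):
    """Concatenate every retrieved chunk, ignoring any budget."""
    return "\n\n".join(h["chunk_text"] for h in hits)

def smart(hits, budget=900, max_per_doc=2):
    """Best-first selection under a token budget with a per-document cap."""
    parts, used, per_doc = [], 0, {}
    for h in sorted(hits, key=lambda h: h["score"], reverse=True):
        cost = count_tokens(h["chunk_text"])
        if per_doc.get(h["doc_id"], 0) >= max_per_doc or used + cost > budget:
            continue
        parts.append(h["chunk_text"])
        used += cost
        per_doc[h["doc_id"]] = per_doc.get(h["doc_id"], 0) + 1
    return "\n\n".join(parts)

def summary(hits, summarize):
    """Summarize each chunk first, then concatenate the summaries."""
    return "\n\n".join(summarize(h["chunk_text"]) for h in hits)

def compare_strategies(hits, summarize, budget=900):
    """Report token usage per strategy and whether it exceeds the budget."""
    contexts = {
        "naive": naive(hits),
        "smart": smart(hits, budget),
        "summary": summary(hits, summarize),
    }
    return {
        name: {"tokens": count_tokens(ctx), "over_budget": count_tokens(ctx) > budget}
        for name, ctx in contexts.items()
    }
```

The token counts only show which contexts fit; the answers generated from each context still need to be compared by hand for the report.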

✅ Deliverable: reports/context_management.md with 3 outputs and conclusions.

✅ Final Submission Checklist (Section 6 Lab)

Students submit:

LLM-only baseline answers + hallucination scoring
Chunking/ingestion outputs with metadata
Vector store index (Chroma) + retrieval function
Full RAG pipeline answers with citations
RAG vs baseline comparison report
Hybrid search implementation + example improvement
Re-ranking report (before/after ranks)
Multi-document reasoning answers with multi-source citations
Context window management stress test report
