Hands-on Lab 5

✅ Lab Setup (10 min)

Step 0 — Create folder

Create: genai-section5-lab/ with:

data/
chunks/
indexes/
outputs/
src/
reports/
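The folder layout above can also be created from a script. This is a minimal sketch using only the standard library; the function name create_lab_folders is a hypothetical helper, not part of the lab.

```python
from pathlib import Path

# Folder names are taken from the list above.
LAB_ROOT = Path("genai-section5-lab")
SUBFOLDERS = ["data", "chunks", "indexes", "outputs", "src", "reports"]


def create_lab_folders(root: Path = LAB_ROOT) -> list[Path]:
    """Create the lab folder tree. Safe to run more than once."""
    created = []
    for name in SUBFOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling create_lab_folders() from the directory where the lab should live creates all six subfolders under genai-section5-lab/.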

Step 1 — Install dependencies

pip install numpy pandas scikit-learn matplotlib
pip install sentence-transformers
pip install faiss-cpu chromadb
pip install nltk tiktoken python-dotenv

If faiss-cpu fails on your OS, skip FAISS and do Chroma only.

Step 2 — Download tokenizer resources (for chunking)

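python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

Recent NLTK releases load the sentence tokenizer from punkt_tab, while older releases use punkt, so the command requests both.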

🧪 Part A — Embeddings & Similarity (5.1)

A1) Create Embeddings for Sentences

Goal: Understand embeddings as numerical representations of meaning.

Step A1.1 — Create sample sentences

Create data/sentences.csv with 15–20 sentences across 4 topics:

Aviation maintenance
Finance
Healthcare
Education

Step A1.2 — Generate embeddings

Use sentence-transformers model:

all-MiniLM-L6-v2 (fast, strong baseline)

Save:

sentence text
embedding vector (as list)
topic label

✅ Deliverable: outputs/sentence_embeddings.parquet (writing Parquet from pandas needs pyarrow or fastparquet, which Step 1 does not install) or CSV + NumPy arrays
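One way Steps A1.1 and A1.2 might look in code, using the CSV + NumPy variant of the deliverable. This is a sketch under assumptions: the CSV column names sentence and topic, the output file names, and the helper names are all hypothetical. The model is imported inside embed_texts so the file helpers work even before sentence-transformers is installed.

```python
import csv
from pathlib import Path

import numpy as np

MODEL_NAME = "all-MiniLM-L6-v2"  # model named in Step A1.2


def load_sentences(csv_path):
    """Read a CSV with 'sentence' and 'topic' columns (assumed column names)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [r["sentence"] for r in rows], [r["topic"] for r in rows]


def embed_texts(texts, model_name=MODEL_NAME):
    """Return one float32 vector per input text."""
    # Lazy import: the first call downloads the model weights.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    vectors = model.encode(texts, convert_to_numpy=True)
    return np.asarray(vectors, dtype=np.float32)


def save_outputs(sentences, topics, vectors, out_dir="outputs"):
    """Save vectors as .npy and text + labels as CSV; row i matches vector i."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    np.save(out / "sentence_embeddings.npy", vectors)
    with open(out / "sentence_embeddings.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["row", "sentence", "topic"])
        for i, (sentence, topic) in enumerate(zip(sentences, topics)):
            writer.writerow([i, sentence, topic])
```

A typical run would call load_sentences("data/sentences.csv"), pass the sentences to embed_texts, and hand all three results to save_outputs.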

A2) Similarity Metrics: Cosine vs Dot Product

Goal: See how similarity changes when vectors are normalized.

Step A2.1 — Compute similarity matrix

Compute cosine similarity for all pairs
Compute dot product similarity for all pairs

Step A2.2 — Normalize embeddings

L2 normalize all embeddings
Recompute dot products

Step A2.3 — Compare rankings

Pick 3 query sentences and show:

top-3 neighbors by cosine
top-3 neighbors by dot product (raw)
top-3 neighbors by dot product (normalized)

✅ Deliverable: reports/similarity_metrics.md with a comparison table + conclusion:

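“Cosine similarity equals the dot product of L2-normalized vectors (up to floating-point error)”
“Raw dot product is biased by vector magnitude”

Note: all-MiniLM-L6-v2 appears to ship with a normalization layer, so its embeddings are typically already unit-length and the raw and normalized rankings may come out identical. If they do, rescale a few vectors (for example, multiply one by 5) to make the magnitude bias visible.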
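The comparison in Steps A2.1 to A2.3 can be sketched with NumPy alone. The four 2-D vectors below are hypothetical toy data chosen so that a long vector wins under raw dot product but loses under cosine; with real embeddings, replace vectors with your saved array.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def dot_matrix(x):
    """All-pairs raw dot products."""
    return x @ x.T


def cosine_matrix(x):
    """All-pairs cosine similarity, computed as dot products of unit vectors."""
    unit = l2_normalize(x)
    return unit @ unit.T


def top_k_neighbors(sim, query_idx, k=3):
    """Indices of the k most similar rows, excluding the query itself."""
    scores = sim[query_idx].copy()
    scores[query_idx] = -np.inf
    return np.argsort(-scores)[:k].tolist()


# Toy data (hypothetical): row 0 is the query.
vectors = np.array([
    [1.0, 0.0],  # 0: query
    [0.9, 0.1],  # 1: nearly the same direction, short
    [5.0, 5.0],  # 2: 45 degrees away, but long
    [0.0, 1.0],  # 3: orthogonal
])

by_cosine = top_k_neighbors(cosine_matrix(vectors), 0)
by_raw_dot = top_k_neighbors(dot_matrix(vectors), 0)
by_norm_dot = top_k_neighbors(dot_matrix(l2_normalize(vectors)), 0)
print(by_cosine, by_raw_dot, by_norm_dot)
```

On this toy data the raw dot product ranks the long vector (row 2) first, while cosine and normalized dot product both rank the closely aligned vector (row 1) first.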

🧪 Part B — Build a Semantic Search Pipeline (5.2)

B1) Create a Small Document Corpus

Goal: Build retrieval on real documents, not just sentences.

Step B1.1 — Create documents

Create 6–10 text files in data/docs/. Examples:

“Aviation Maintenance Manual Notes”
“Finance Risk Policy”
“Healthcare SOP”
“Student Handbook”

Each document should be 400–1200 words.

✅ Deliverable: data/docs/*.txt

B2) Chunking Strategies (Fixed-size vs Sentence-based)

Goal: Learn how chunking affects retrieval quality.

Step B2.1 — Implement two chunkers

Create src/chunking.py:

Chunker A: Fixed token window

chunk size: 250 tokens
overlap: 40 tokens

Chunker B: Sentence-based

group 3–5 sentences per chunk
approximate token cap: 300 tokens

Step B2.2 — Chunk every document

Save chunks to:

chunks/fixed_chunks.jsonl
chunks/sentence_chunks.jsonl

Each chunk record contains:

chunk_id
doc_id
chunk_text
start and end positions (token or sentence offsets within the document)

✅ Deliverable: both chunk files + chunk counts
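A possible starting point for src/chunking.py. To stay dependency-free, this sketch makes two substitutions that you should swap out for the lab: whitespace-separated words stand in for tiktoken tokens, and a simple regex stands in for nltk.sent_tokenize. The record field names start and end and the chunk_id format are assumptions.

```python
import json
import re


def fixed_window_chunks(text, chunk_size=250, overlap=40):
    """Chunker A: sliding window over tokens (whitespace words here)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        end = min(start + chunk_size, len(tokens))
        chunks.append({"start": start, "end": end,
                       "chunk_text": " ".join(tokens[start:end])})
        if end == len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def split_sentences(text):
    """Regex stand-in for nltk.sent_tokenize (splits after . ! or ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_chunks(text, max_sentences=4, max_tokens=300):
    """Chunker B: group whole sentences up to a sentence and token cap.

    A single sentence longer than max_tokens still becomes its own chunk.
    """
    sentences = split_sentences(text)
    chunks, current, current_tokens, start = [], [], 0, 0
    for i, sentence in enumerate(sentences):
        n = len(sentence.split())
        full = len(current) >= max_sentences or current_tokens + n > max_tokens
        if current and full:
            chunks.append({"start": start, "end": i,
                           "chunk_text": " ".join(current)})
            current, current_tokens, start = [], 0, i
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append({"start": start, "end": len(sentences),
                       "chunk_text": " ".join(current)})
    return chunks


def to_records(doc_id, chunks, strategy):
    """Attach chunk_id and doc_id so each chunk matches the record layout."""
    return [{"chunk_id": f"{doc_id}-{strategy}-{i}", "doc_id": doc_id, **c}
            for i, c in enumerate(chunks)]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For Chunker A, start and end are token offsets; for Chunker B they are sentence indexes. Keeping the offsets makes it possible to trace a retrieved chunk back to its place in the document.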

B3) Generate Embeddings for Chunks

Goal: Convert chunks to vectors for retrieval.

Step B3.1 — Embed all chunks

Use the same embedding model and create:

outputs/fixed_chunk_embeddings.npy
outputs/sentence_chunk_embeddings.npy

Store mapping:

chunk_id → vector
chunk_id → doc_id, text

✅ Deliverable: embeddings files + metadata JSON
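The mapping in Step B3.1 is easiest to keep consistent if row i of the .npy array always belongs to entry i of the metadata JSON. The sketch below assumes that convention; the helper names and the row field are hypothetical, and the vectors are expected to come from the same embedding model used in Part A.

```python
import json
from pathlib import Path

import numpy as np


def read_jsonl(path):
    """Load chunk records written one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def save_chunk_embeddings(chunks, vectors, npy_path, meta_path):
    """Save vectors and a row-aligned metadata list (row i <-> vectors[i])."""
    vectors = np.asarray(vectors, dtype=np.float32)
    if len(chunks) != vectors.shape[0]:
        raise ValueError("expected exactly one vector per chunk")
    Path(npy_path).parent.mkdir(parents=True, exist_ok=True)
    Path(meta_path).parent.mkdir(parents=True, exist_ok=True)
    np.save(npy_path, vectors)
    metadata = [
        {"row": i, "chunk_id": c["chunk_id"], "doc_id": c["doc_id"],
         "chunk_text": c["chunk_text"]}
        for i, c in enumerate(chunks)
    ]
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False, indent=2)
    return metadata


def load_chunk_embeddings(npy_path, meta_path):
    """Reload the array and its metadata, checking they are still aligned."""
    vectors = np.load(npy_path)
    with open(meta_path, encoding="utf-8") as f:
        metadata = json.load(f)
    if len(metadata) != vectors.shape[0]:
        raise ValueError("metadata and vectors are out of sync")
    return vectors, metadata
```

Run it once per chunking strategy, for example with chunks from chunks/fixed_chunks.jsonl and the output path outputs/fixed_chunk_embeddings.npy.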

B4) Similarity-Based Retrieval (Top-k Search)

Goal: Build a working retrieval function.

Step B4.1 — Write retrieval function

Create src/retrieval.py with:

input: query string, top_k, strategy ("fixed" or "sentence", selecting which chunk set to search)
steps:
1. embed query
2. compute similarity with all chunk vectors
3. return top_k chunks with scores
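The three steps above can be sketched as a brute-force cosine search. This is one possible shape for src/retrieval.py, not a required design: the stores dictionary (strategy name mapped to a vectors array and a row-aligned metadata list) and the embed_fn parameter are assumptions made so the function can be exercised without loading a model.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def top_k_search(query_vec, chunk_vecs, top_k=3):
    """Return (row_index, cosine_score) pairs, best first."""
    q = l2_normalize(np.asarray(query_vec, dtype=np.float32).reshape(1, -1))
    c = l2_normalize(np.asarray(chunk_vecs, dtype=np.float32))
    scores = (c @ q.T).ravel()          # step 2: similarity with every chunk
    k = min(top_k, len(scores))
    order = np.argsort(-scores)[:k]     # step 3: highest scores first
    return [(int(i), float(scores[i])) for i in order]


def retrieve(query, top_k, strategy, stores, embed_fn):
    """Embed the query, search the chosen chunk set, return chunks + scores.

    stores:   {"fixed": (vectors, metadata), "sentence": (vectors, metadata)}
    embed_fn: callable taking a list of strings, returning a 2-D array
    """
    vectors, metadata = stores[strategy]
    query_vec = embed_fn([query])[0]    # step 1: embed the query
    hits = top_k_search(query_vec, vectors, top_k)
    return [{**metadata[i], "score": score} for i, score in hits]
```

In the lab, embed_fn would wrap the same sentence-transformers model used to embed the chunks; mixing models between chunks and queries makes the scores meaningless.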

Step B4.2 — Create evaluation queries

Create data/queries.json with 12 queries.
Each query includes:

query_text
expected_doc_id (ground truth)

Example:

“What causes hydraulic leakage?” → aviation doc
“Define credit risk controls” → finance doc

Step B4.3 — Evaluate retrieval accuracy

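Compute:

Top-1 accuracy: fraction of queries whose best-ranked chunk comes from expected_doc_id
Top-3 accuracy: fraction of queries with at least one of the top 3 chunks from expected_doc_id

Compare the results for fixed vs sentence-based chunking.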
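Top-k accuracy reduces to a membership check per query. The sketch below assumes you have already collected, for each query, the doc_id of every retrieved chunk in rank order; the function name and the toy data are hypothetical.

```python
def top_k_accuracy(ranked_doc_ids_per_query, expected_doc_ids, k):
    """Fraction of queries whose expected doc appears in the first k results."""
    if not expected_doc_ids:
        return 0.0
    hits = 0
    for ranked, expected in zip(ranked_doc_ids_per_query, expected_doc_ids):
        if expected in ranked[:k]:
            hits += 1
    return hits / len(expected_doc_ids)


# Toy example: four queries that all expect the document "aviation".
ranked = [
    ["aviation", "finance", "health"],   # correct at rank 1
    ["finance", "aviation", "health"],   # correct at rank 2
    ["health", "finance", "aviation"],   # correct at rank 3
    ["finance", "health", "education"],  # missed
]
expected = ["aviation"] * 4

print(top_k_accuracy(ranked, expected, k=1))  # 0.25
print(top_k_accuracy(ranked, expected, k=3))  # 0.75
```

Run the same function on the results from both chunking strategies to fill in the results table.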

✅ Deliverable: reports/retrieval_eval.md with results table.

🧪 Part C — Vector Databases (5.3)

C1) FAISS Index (Local High-Performance Indexing)

Goal: Experience how vector indexing speeds up retrieval.

Step C1.1 — Build FAISS index

Use:

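IndexFlatIP for cosine-style retrieval (inner product on L2-normalized vectors)
OR
IndexFlatL2 for Euclidean distance

Both are exact, exhaustive indexes, so on a corpus this small expect only a modest speedup over NumPy brute force. Approximate indexes such as IndexIVFFlat or IndexHNSWFlat matter at much larger scale.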

Step C1.2 — Add all chunk vectors

Save index to indexes/faiss.index

Step C1.3 — Query FAISS and compare runtime

Time:

brute force search vs FAISS search

Use about 50 queries (for example, cycle through your 12 queries until you reach 50).
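Steps C1.1 to C1.3 might be sketched as follows, assuming the IndexFlatIP option with normalized vectors. The helper names and the best-of-N timing approach are assumptions; faiss is imported inside the function so the brute-force baseline still runs if FAISS could not be installed.

```python
import time

import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def brute_force_search(queries, vectors, top_k):
    """Exact inner-product search with NumPy; returns (scores, indices)."""
    scores = queries @ vectors.T
    order = np.argsort(-scores, axis=1)[:, :top_k]
    return np.take_along_axis(scores, order, axis=1), order


def build_faiss_index(vectors, index_path=None):
    """Build an exact inner-product index; optionally save it to disk."""
    import faiss  # lazy import so the rest of the file works without FAISS

    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)  # FAISS expects contiguous float32 rows
    if index_path is not None:
        faiss.write_index(index, str(index_path))
    return index


def time_call(fn, repeats=5):
    """Best wall-clock time over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_speed(vectors, queries, top_k=3, index_path=None):
    """Time brute force against FAISS on the same normalized inputs."""
    vectors = np.ascontiguousarray(l2_normalize(vectors), dtype=np.float32)
    queries = np.ascontiguousarray(l2_normalize(queries), dtype=np.float32)
    index = build_faiss_index(vectors, index_path)
    return {
        "brute_force_s": time_call(lambda: brute_force_search(queries, vectors, top_k)),
        "faiss_s": time_call(lambda: index.search(queries, top_k)),
    }
```

Passing index_path="indexes/faiss.index" to compare_speed also produces the saved index required by Step C1.2. Taking the best of several runs reduces noise from caching and other processes, which matters when each search takes well under a millisecond.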

✅ Deliverable: reports/faiss_speed.md with timing results.

C2) Chroma (Developer-Friendly Vector DB)

Goal: Store embeddings + metadata and query them like a real app.

Step C2.1 — Create a Chroma collection

collection name: semantic_search_demo

Step C2.2 — Insert

For each chunk, store:

id = chunk_id
embedding vector
metadata:
- doc_id
- chunk_index
- chunking_strategy
document text

Step C2.3 — Query

Search top_k for each query and return:

chunk text
doc_id
distance score (lower means closer; by default Chroma reports an L2-based distance, not a cosine similarity)
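Steps C2.1 to C2.3 could look like this with the chromadb client. It is a sketch under assumptions: the storage path indexes/chroma, the helper names, and the choice to number chunk_index per document are all hypothetical. The chromadb import sits inside open_collection so the payload helpers can be checked on their own.

```python
def build_payload(chunks, vectors, strategy):
    """Arrange chunk records into the keyword arguments collection.add expects."""
    counters = {}
    metadatas = []
    for chunk in chunks:
        index_in_doc = counters.get(chunk["doc_id"], 0)
        counters[chunk["doc_id"]] = index_in_doc + 1
        metadatas.append({
            "doc_id": chunk["doc_id"],
            "chunk_index": index_in_doc,   # position within its document
            "chunking_strategy": strategy,
        })
    return {
        "ids": [c["chunk_id"] for c in chunks],
        "embeddings": [[float(x) for x in v] for v in vectors],
        "metadatas": metadatas,
        "documents": [c["chunk_text"] for c in chunks],
    }


def open_collection(path="indexes/chroma", name="semantic_search_demo"):
    """Open (or create) a persistent collection on local disk."""
    import chromadb  # lazy import

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name)


def flatten_results(result):
    """Turn Chroma's nested result lists (one list per query) into records."""
    return [
        {"chunk_text": doc, "doc_id": meta["doc_id"], "distance": dist}
        for doc, meta, dist in zip(
            result["documents"][0], result["metadatas"][0], result["distances"][0]
        )
    ]


def query_collection(collection, query_vector, top_k=3):
    """Search with a precomputed query embedding and return flat records."""
    result = collection.query(
        query_embeddings=[[float(x) for x in query_vector]],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    return flatten_results(result)
```

To insert, call collection.add(**build_payload(chunks, vectors, "fixed")). Passing your own embeddings for both insert and query keeps Chroma from applying its default embedding function, so the results stay comparable with Parts B and C1.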

✅ Deliverable: outputs/chroma_query_results.json

C3) Compare Vector DBs (FAISS vs Chroma)

Goal: Understand “library vs database” tradeoffs.

Step C3.1 — Compare on 3 dimensions

Latency
Metadata filtering support
Persistence + developer experience

Step C3.2 — Write a short recommendation

When to use:

FAISS
Chroma
Pinecone/Weaviate (managed or hosted services)

✅ Deliverable: reports/vector_db_choice.md (1 page)

🧪 Part D — Performance Considerations (Production Thinking)

D1) Measure Chunk Size Tradeoffs

Goal: Show how chunk size affects retrieval quality + cost.

Step D1.1 — Run retrieval with 3 chunk sizes

150 tokens
250 tokens
400 tokens

Step D1.2 — Compare

For each:

Top-3 accuracy
Chunk count (total, and average per document, as a proxy for storage)
Average query latency
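The storage side of the tradeoff can be estimated before running anything. For a sliding window, a document of n tokens yields ceil((n - size) / (size - overlap)) + 1 chunks once n exceeds the chunk size. The sketch below assumes the 40-token overlap from Part B is kept constant across all three sizes, which the lab does not specify, and uses a hypothetical 600-token document.

```python
import math


def window_count(n_tokens, chunk_size, overlap=40):
    """Number of sliding-window chunks for a document of n_tokens tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    return math.ceil((n_tokens - chunk_size) / step) + 1


def storage_table(doc_token_counts, chunk_sizes=(150, 250, 400), overlap=40):
    """Total chunk count per chunk size across all documents."""
    return {
        size: sum(window_count(n, size, overlap) for n in doc_token_counts)
        for size in chunk_sizes
    }


print(storage_table([600]))  # {150: 6, 250: 3, 400: 2}
```

Smaller chunks multiply the number of vectors to store and search, so the accuracy gain from finer granularity has to be weighed against index size and latency.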

✅ Deliverable: reports/chunk_tradeoffs.md

D2) Add Metadata Filters (Enterprise Pattern)

Goal: Restrict retrieval to relevant subsets.

Step D2.1 — Add metadata fields

Example:

department = "finance" | "aviation" | ...
access_level = "public" | "internal"

Step D2.2 — Query with filters (Chroma)

Example:

“Retrieve only finance documents”
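A filtered query in Chroma passes a where clause alongside the query embedding. The sketch below assumes the department and access_level fields were stored as chunk metadata at insert time; build_where and filtered_query are hypothetical helper names. A single condition can be passed as a plain dictionary, while several conditions are combined under the $and operator.

```python
def build_where(**filters):
    """Build a Chroma where clause from keyword filters, skipping None values."""
    conditions = [{key: value} for key, value in filters.items() if value is not None]
    if not conditions:
        return None                # no filtering
    if len(conditions) == 1:
        return conditions[0]       # e.g. {"department": "finance"}
    return {"$and": conditions}    # all conditions must match


def filtered_query(collection, query_vector, top_k=3, **filters):
    """Query a Chroma collection, restricted to chunks matching the filters."""
    return collection.query(
        query_embeddings=[[float(x) for x in query_vector]],
        n_results=top_k,
        where=build_where(**filters),
    )


print(build_where(department="finance"))
print(build_where(department="finance", access_level="public"))
```

For the "finance only" example, call filtered_query(collection, query_vector, department="finance") and confirm that every returned chunk carries that department in its metadata.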

✅ Deliverable: filtered retrieval demo output.

✅ Final Submission Checklist (Section 5 Lab)

Students submit:

Sentence embeddings file + similarity metric comparison report
Two chunking strategies outputs (fixed vs sentence-based)
Chunk embedding generation output + metadata mapping
Semantic retrieval function + evaluation results (Top-1/Top-3)
FAISS index + speed comparison report
Chroma collection demo + query results JSON
Vector DB choice memo (FAISS vs Chroma vs managed DB)
Chunk size tradeoff report + metadata filtering demo