✅ Lab Setup (10 min)
Step 0 — Create folder
Create genai-section5-lab/ with these subfolders:
data/
chunks/
indexes/
outputs/
src/
reports/
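If you would rather script the folder setup, the following is a minimal sketch. The helper name `create_lab_folders` and its `root` argument are illustrative and not part of the lab spec; `data/docs` is included because Part B stores documents there.

```python
from pathlib import Path

# Subfolders named in Step 0; data/docs is used later in Part B.
SUBFOLDERS = ["data/docs", "chunks", "indexes", "outputs", "src", "reports"]


def create_lab_folders(root="genai-section5-lab"):
    """Create the lab folder tree; safe to run more than once."""
    root_path = Path(root)
    for name in SUBFOLDERS:
        (root_path / name).mkdir(parents=True, exist_ok=True)
    return root_path
```

Calling `create_lab_folders()` from the directory where you want the lab to live creates everything in one go.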
Step 1 — Install dependencies
pip install numpy pandas scikit-learn matplotlib
pip install sentence-transformers
pip install faiss-cpu chromadb
pip install nltk tiktoken python-dotenv
If faiss-cpu fails on your OS, skip FAISS and do Chroma only.
Step 2 — Download tokenizer resources (for chunking)
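python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
Newer NLTK releases load the sentence tokenizer from punkt_tab, so downloading both resources covers either case.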
🧪 Part A — Embeddings & Similarity (5.1)
A1) Create Embeddings for Sentences
Goal: Understand vectors as numerical representations of meaning.
Step A1.1 — Create sample sentences
Create data/sentences.csv with 15–20 sentences across 4 topics:
Aviation maintenance
Finance
Healthcare
Education
Step A1.2 — Generate embeddings
Use the sentence-transformers model:
all-MiniLM-L6-v2 (fast, strong baseline)
Save:
sentence text
embedding vector (as list)
topic label
✅ Deliverable: outputs/sentence_embeddings.parquet (or CSV + numpy arrays)
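Step A1.2 could be sketched as follows. The CSV column names `sentence` and `topic` and all helper names are assumptions for illustration. Note that writing Parquet with pandas needs an engine such as pyarrow, which Step 1 does not install explicitly; the CSV + NumPy route avoids that.

```python
import csv


def load_sentences(csv_path):
    """Read rows from a CSV with 'sentence' and 'topic' columns (assumed names)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Return one embedding per text (384 dimensions for this model)."""
    # Imported here so the other helpers work without the model stack installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts))


def build_records(rows, vectors):
    """Pair each row with its vector, stored as a plain list of floats."""
    if len(rows) != len(vectors):
        raise ValueError("rows and vectors must have the same length")
    return [
        {"sentence": r["sentence"], "topic": r["topic"],
         "embedding": [float(x) for x in v]}
        for r, v in zip(rows, vectors)
    ]
```

With `rows = load_sentences("data/sentences.csv")`, the call `build_records(rows, embed_texts([r["sentence"] for r in rows]))` yields a list that `pandas.DataFrame` accepts directly.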
A2) Similarity Metrics: Cosine vs Dot Product
Goal: See how similarity changes when vectors are normalized.
Step A2.1 — Compute similarity matrix
Compute cosine similarity for all pairs
Compute dot product similarity for all pairs
Step A2.2 — Normalize embeddings
L2 normalize all embeddings
Recompute dot products
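Steps A2.1 and A2.2 can be sketched with plain NumPy. The function names and the toy vectors are illustrative; in the lab you would pass your sentence embeddings instead.

```python
import numpy as np


def dot_matrix(x):
    """All-pairs dot products for the row vectors in x."""
    return x @ x.T


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def cosine_matrix(x):
    """All-pairs cosine similarity: dot products of unit-length rows."""
    unit = l2_normalize(x)
    return unit @ unit.T


# Toy vectors: rows 0 and 1 point the same way but differ in length.
toy = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]])
raw = dot_matrix(toy)                       # sensitive to vector length
cos = cosine_matrix(toy)                    # length-independent
normalized_dot = dot_matrix(l2_normalize(toy))
```

Here `cos` and `normalized_dot` agree entry by entry, while `raw` grows with the length of the vectors involved.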
Step A2.3 — Compare rankings
Pick 3 query sentences and show:
top-3 neighbors by cosine
top-3 neighbors by dot product (raw)
top-3 neighbors by dot product (normalized)
✅ Deliverable: reports/similarity_metrics.md with a comparison table + conclusion:
“Cosine ≈ normalized dot product”
“Raw dot product is biased by vector magnitude”
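The ranking comparison in Step A2.3 might look like the sketch below. `top_k_neighbors` is a hypothetical helper, and the three toy vectors are chosen only to make the magnitude bias visible.

```python
import numpy as np


def top_k_neighbors(scores, query_idx, k=3):
    """Indices of the k highest-scoring rows for one query, excluding itself."""
    row = scores[query_idx].astype(float)  # astype returns a copy
    row[query_idx] = -np.inf               # never return the query as its own neighbor
    return [int(i) for i in np.argsort(-row)[:k]]


# Row 0 is the query, row 1 is nearly parallel to it,
# row 2 points elsewhere but has a much larger magnitude.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [5.0, 5.0]])
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

by_raw_dot = top_k_neighbors(vecs @ vecs.T, query_idx=0, k=2)
by_cosine = top_k_neighbors(unit @ unit.T, query_idx=0, k=2)
```

Raw dot product ranks the long vector (row 2) first, while cosine ranks the nearly parallel one (row 1) first. If your embeddings already come out unit-length (some sentence-transformers models end with a normalization layer, and all-MiniLM-L6-v2 appears to be one of them), the raw and normalized rankings will match, so check the row norms before drawing conclusions.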
🧪 Part B — Build a Semantic Search Pipeline (5.2)
B1) Create a Small Document Corpus
Goal: Build retrieval on real documents, not just sentences.
Step B1.1 — Create documents
Create 6–10 text files in data/docs/:
Examples:
“Aviation Maintenance Manual Notes”
“Finance Risk Policy”
“Healthcare SOP”
“Student Handbook”
Each doc should be 400–1200 words.
✅ Deliverable: data/docs/*.txt
B2) Chunking Strategies (Fixed-size vs Sentence-based)
Goal: Learn how chunking impacts retrieval quality.
Step B2.1 — Implement two chunkers
Create src/chunking.py:
Chunker A: Fixed token window
chunk size: 250 tokens
overlap: 40 tokens
Chunker B: Sentence-based
group 3–5 sentences per chunk
approximate token cap: 300 tokens
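A possible starting point for src/chunking.py is sketched below. Both function names are illustrative. The sentence chunker groups up to 5 sentences and may emit fewer than 3 when the token cap or the end of the document is reached, which is a simplification of the 3–5 target.

```python
def fixed_window_chunks(tokens, size=250, overlap=40):
    """Chunker A: sliding window over a token list.

    Returns (start, end, tokens[start:end]) tuples; consecutive windows
    share `overlap` tokens.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        end = min(start + size, len(tokens))
        chunks.append((start, end, tokens[start:end]))
        if end == len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def sentence_chunks(sentences, max_sentences=5, token_cap=300,
                    count_tokens=lambda s: len(s.split())):
    """Chunker B: group consecutive sentences, up to max_sentences or token_cap.

    count_tokens defaults to a whitespace word count, a rough stand-in
    for a real tokenizer.
    """
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        full = len(current) >= max_sentences or current_tokens + n > token_cap
        if current and full:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Token lists could come from `tiktoken.get_encoding("cl100k_base").encode(text)` and sentences from `nltk.sent_tokenize(text)`. The `cl100k_base` encoding is an assumption, since the lab does not name one.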
Step B2.2 — Chunk every document
Save chunks to:
chunks/fixed_chunks.jsonl
chunks/sentence_chunks.jsonl
Each chunk record contains:
chunk_id
doc_id
chunk_text
start/end positions
✅ Deliverable: both chunk files + chunk counts
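One way to produce the chunk records is sketched here. The `chunk_id` format (`doc_id` plus a zero-padded counter) and the `start`/`end` key names are assumptions; any scheme works as long as IDs are unique.

```python
import json


def make_chunk_records(doc_id, chunks):
    """chunks: iterable of (start, end, text) tuples for one document."""
    return [
        {"chunk_id": f"{doc_id}-{i:04d}", "doc_id": doc_id,
         "chunk_text": text, "start": start, "end": end}
        for i, (start, end, text) in enumerate(chunks)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in records)


def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Writing the file is then `Path("chunks/fixed_chunks.jsonl").write_text(to_jsonl(records), encoding="utf-8")` (with `Path` imported from `pathlib`), and the chunk count is `len(records)`.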
B3) Generate Embeddings for Chunks
Goal: Convert chunks to vectors for retrieval.
Step B3.1 — Embed all chunks
Use the same embedding model and create:
outputs/fixed_chunk_embeddings.npy
outputs/sentence_chunk_embeddings.npy
Store mapping:
chunk_id → vector
chunk_id → doc_id, text
✅ Deliverable: embeddings files + metadata JSON
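A simple way to keep the two mappings consistent is to make them row-aligned: row i of the .npy file belongs to entry i of the metadata JSON. The sketch below assumes chunk records with `chunk_id`, `doc_id`, and `chunk_text` keys; the function names are illustrative.

```python
import json

import numpy as np


def save_embeddings(vectors, records, npy_path, meta_path):
    """Save vectors plus a metadata list in the same row order."""
    vectors = np.asarray(vectors, dtype=np.float32)
    if vectors.shape[0] != len(records):
        raise ValueError("one vector per chunk record is required")
    np.save(npy_path, vectors)
    meta = [
        {"row": i, "chunk_id": r["chunk_id"], "doc_id": r["doc_id"],
         "text": r["chunk_text"]}
        for i, r in enumerate(records)
    ]
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2)
    return meta


def load_embeddings(npy_path, meta_path):
    """Load the vectors and their row-aligned metadata."""
    with open(meta_path, encoding="utf-8") as f:
        return np.load(npy_path), json.load(f)
```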
B4) Similarity-Based Retrieval (Top-k Search)
Goal: Build a working retrieval function.
Step B4.1 — Write retrieval function
Create src/retrieval.py with:
input: query string, top_k, strategy
steps:
embed query
compute similarity with all chunk vectors
return top_k chunks with scores
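The three steps above can be sketched as a single function. The `indexes` dictionary (strategy name mapped to a vectors/metadata pair) and the injected `embed_fn` are assumptions made so the function stays independent of any particular model.

```python
import numpy as np


def retrieve(query, top_k, strategy, indexes, embed_fn):
    """Return the top_k chunks for a query, ranked by cosine similarity.

    indexes: dict mapping a strategy name to (vectors, metadata), where
             metadata[i] describes row i of vectors.
    embed_fn: callable that turns a query string into a 1-D vector.
    """
    vectors, metadata = indexes[strategy]
    vectors = np.asarray(vectors, dtype=np.float32)
    q = np.asarray(embed_fn(query), dtype=np.float32)

    # Normalize both sides so the dot product equals cosine similarity.
    q = q / max(float(np.linalg.norm(q)), 1e-12)
    unit = vectors / np.maximum(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12)

    scores = unit @ q
    order = np.argsort(-scores)[:top_k]
    return [{**metadata[int(i)], "score": float(scores[i])} for i in order]
```

In the lab, `embed_fn` would typically wrap the same sentence-transformers model used for the chunks, for example `lambda text: model.encode(text)`.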
Step B4.2 — Create evaluation queries
Create data/queries.json with 12 queries.
Each query includes:
query_text
expected_doc_id (ground truth)
Example:
“What causes hydraulic leakage?” → aviation doc
“Define credit risk controls” → finance doc
Step B4.3 — Evaluate retrieval accuracy
Compute:
Top-1 accuracy
Top-3 accuracy
Compare fixed vs sentence chunking.
✅ Deliverable: reports/retrieval_eval.md with results table.
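Top-k accuracy can be computed with a few lines of plain Python. This sketch assumes you have already reduced each query's results to a ranked list of doc_ids; the function name is illustrative.

```python
def top_k_accuracy(retrieved_doc_ids, expected_doc_ids, k):
    """Fraction of queries whose expected doc appears in the top k results.

    retrieved_doc_ids: one ranked list of doc_ids per query.
    expected_doc_ids: the ground-truth doc_id for each query.
    """
    if len(retrieved_doc_ids) != len(expected_doc_ids):
        raise ValueError("need one expected doc_id per query")
    if not expected_doc_ids:
        return 0.0
    hits = sum(
        expected in ranked[:k]
        for ranked, expected in zip(retrieved_doc_ids, expected_doc_ids)
    )
    return hits / len(expected_doc_ids)
```

Run it once per chunking strategy with `k=1` and `k=3` to fill the results table.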
🧪 Part C — Vector Databases (5.3)
C1) FAISS Index (Local High-Performance Indexing)
Goal: Experience how vector indexing speeds up retrieval.
Step C1.1 — Build FAISS index
Use:
IndexFlatIP for cosine-style retrieval (normalized vectors)
OR IndexFlatL2 for Euclidean distance
Step C1.2 — Add all chunk vectors
Save index to indexes/faiss.index
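Steps C1.1 and C1.2 could be sketched as below, using the `IndexFlatIP` option. The helper names are illustrative, and FAISS is imported inside the function so the file still loads if faiss-cpu could not be installed.

```python
import numpy as np


def prepare_vectors(vectors):
    """Return a float32, C-contiguous, L2-normalized copy (FAISS expects float32)."""
    x = np.ascontiguousarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return np.ascontiguousarray(x / np.maximum(norms, 1e-12), dtype=np.float32)


def build_faiss_index(vectors, index_path="indexes/faiss.index"):
    """Build an inner-product index over unit vectors and save it to disk."""
    import faiss  # lazy import; skip this step if faiss-cpu is unavailable

    x = prepare_vectors(vectors)
    index = faiss.IndexFlatIP(x.shape[1])  # inner product on unit vectors = cosine
    index.add(x)
    faiss.write_index(index, index_path)
    return index


def search_faiss(index, query_vectors, top_k=3):
    """Return (scores, row_indices), each of shape (n_queries, top_k)."""
    return index.search(prepare_vectors(query_vectors), top_k)
```

The returned row indices point back into the row-aligned chunk metadata. Flat indexes perform exact, exhaustive search, so on a corpus this small the timing gap against NumPy may be modest.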
Step C1.3 — Query FAISS and compare runtime
Time:
brute force search vs FAISS search
Use 50 queries (cycle through your 12 queries until you reach 50).
✅ Deliverable: reports/faiss_speed.md with timing results.
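A small timing harness for Step C1.3 might look like this. `expand_queries` and `time_search` are hypothetical helpers; `search_fn` stands for whichever search you are measuring (brute force or FAISS).

```python
import time
from itertools import cycle, islice


def expand_queries(queries, n=50):
    """Cycle through the query list until n items are collected."""
    return list(islice(cycle(queries), n))


def time_search(search_fn, queries):
    """Run search_fn once per query; return total and mean latency in ms."""
    if not queries:
        raise ValueError("at least one query is required")
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return {
        "n": len(latencies),
        "total_ms": sum(latencies),
        "mean_ms": sum(latencies) / len(latencies),
    }
```

To compare search speed alone, embed the queries once up front and pass pre-computed vectors to both search functions; otherwise embedding time is likely to dominate both measurements.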
C2) Chroma (Developer-Friendly Vector DB)
Goal: Store embeddings + metadata and query them like a real app.
Step C2.1 — Create a Chroma collection
collection name:
semantic_search_demo
Step C2.2 — Insert
For each chunk, store:
id = chunk_id
embedding vector
metadata:
doc_id
chunk_index
chunking_strategy
document text
Step C2.3 — Query
Search top_k for each query and return:
chunk text
doc_id
distance score
✅ Deliverable: outputs/chroma_query_results.json
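Steps C2.1 to C2.3 could be sketched as follows. The storage path `indexes/chroma`, the helper names, and the choice to use each record's list position as `chunk_index` are assumptions. Chroma is imported lazily so the pure-Python helper can be used on its own.

```python
def to_chroma_batch(records, vectors, strategy):
    """Shape chunk records into the parallel lists that Chroma's add() takes."""
    return {
        "ids": [r["chunk_id"] for r in records],
        "embeddings": [[float(x) for x in v] for v in vectors],
        "metadatas": [
            {"doc_id": r["doc_id"], "chunk_index": i, "chunking_strategy": strategy}
            for i, r in enumerate(records)
        ],
        "documents": [r["chunk_text"] for r in records],
    }


def index_chunks(batch, path="indexes/chroma", name="semantic_search_demo"):
    """Create (or reopen) a persistent collection and insert the batch."""
    import chromadb  # lazy import

    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name=name)
    collection.add(**batch)
    return collection


def query_chunks(collection, query_vector, top_k=3):
    """Return chunk text, doc_id, and distance for the top_k matches."""
    res = collection.query(
        query_embeddings=[[float(x) for x in query_vector]],
        n_results=top_k,
    )
    # Each result field holds one list per query; a single query was sent.
    return [
        {"chunk_text": doc, "doc_id": meta["doc_id"], "distance": dist}
        for doc, meta, dist in zip(
            res["documents"][0], res["metadatas"][0], res["distances"][0]
        )
    ]
```

The list returned by `query_chunks` can be written to outputs/chroma_query_results.json with `json.dump`. The meaning of the distance value depends on the collection's distance setting, so smaller means closer rather than "higher is better".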
C3) Compare Vector DBs (FAISS vs Chroma)
Goal: Understand “library vs database” tradeoffs.
Step C3.1 — Compare on 3 dimensions
Latency
Metadata filtering support
Persistence + developer experience
Step C3.2 — Write a short recommendation
When to use:
FAISS
Chroma
Pinecone/Weaviate (cloud)
✅ Deliverable: reports/vector_db_choice.md (1 page)
🧪 Part D — Performance Considerations (Production Thinking)
D1) Measure Chunk Size Tradeoffs
Goal: Show how chunk size affects retrieval quality + cost.
Step D1.1 — Run retrieval with 3 chunk sizes
150 tokens
250 tokens
400 tokens
Step D1.2 — Compare
For each:
Top-3 accuracy
Chunk count (a proxy for storage cost)
Average query latency
✅ Deliverable: reports/chunk_tradeoffs.md
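The storage side of the tradeoff can be estimated before running anything. The sketch below assumes the fixed-window chunker with the same 40-token overlap at every chunk size, which the lab does not specify, and the document token counts in the usage note are made-up placeholders.

```python
import math


def chunk_count(n_tokens, size, overlap=40):
    """Number of sliding windows needed to cover n_tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= size:
        return 1
    # Each window after the first advances by (size - overlap) tokens.
    return math.ceil((n_tokens - overlap) / (size - overlap))


def storage_table(doc_token_counts, sizes=(150, 250, 400), overlap=40):
    """Total and per-document chunk counts for each candidate chunk size."""
    rows = []
    for size in sizes:
        counts = [chunk_count(n, size, overlap) for n in doc_token_counts]
        rows.append({
            "chunk_size": size,
            "total_chunks": sum(counts),
            "avg_chunks_per_doc": sum(counts) / len(counts),
        })
    return rows
```

For example, `storage_table([600, 1200])` describes two hypothetical documents of 600 and 1200 tokens. Smaller chunks mean more vectors to store and search, which is the cost side to weigh against Top-3 accuracy.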
D2) Add Metadata Filters (Enterprise Pattern)
Goal: Restrict retrieval to relevant subsets.
Step D2.1 — Add metadata fields
Example:
department = "finance" | "aviation" | ...
access_level = "public" | "internal"
Step D2.2 — Query with filters (Chroma)
Example:
“Retrieve only finance documents”
✅ Deliverable: filtered retrieval demo output.
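A filtered query in Chroma passes a `where` dictionary alongside the query embedding. The sketch below assumes the `department` and `access_level` fields were stored in each chunk's metadata at insert time; `build_where` and `filtered_query` are hypothetical helpers.

```python
def build_where(department=None, access_level=None):
    """Build a Chroma `where` filter from the optional metadata fields."""
    conditions = []
    if department is not None:
        conditions.append({"department": department})
    if access_level is not None:
        conditions.append({"access_level": access_level})
    if not conditions:
        return None           # no filtering
    if len(conditions) == 1:
        return conditions[0]  # a single equality filter
    return {"$and": conditions}  # several conditions must be combined explicitly


def filtered_query(collection, query_vector, top_k=3, **filters):
    """Query a Chroma collection, restricted to chunks matching the filters."""
    return collection.query(
        query_embeddings=[[float(x) for x in query_vector]],
        n_results=top_k,
        where=build_where(**filters),
    )
```

"Retrieve only finance documents" then becomes `filtered_query(collection, query_vector, department="finance")`.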
✅ Final Submission Checklist (Section 5 Lab)
Students submit:
Sentence embeddings file + similarity metric comparison report
Two chunking strategies outputs (fixed vs sentence-based)
Chunk embedding generation output + metadata mapping
Semantic retrieval function + evaluation results (Top-1/Top-3)
FAISS index + speed comparison report
Chroma collection demo + query results JSON
Vector DB choice memo (FAISS vs Chroma vs managed DB)
Chunk size tradeoff report + metadata filtering demo