Hands-on Lab 5


✅ Lab Setup (10 min)

Step 0 — Create folder

Create genai-section5-lab/ with the subfolders used later in this lab: data/docs/, src/, outputs/, reports/, and indexes/.

Step 1 — Install dependencies

pip install numpy pandas scikit-learn matplotlib
pip install sentence-transformers
pip install faiss-cpu chromadb
pip install nltk tiktoken python-dotenv

If faiss-cpu fails on your OS, skip FAISS and do Chroma only.

Step 2 — Download tokenizer resources (for chunking)

python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

Newer NLTK releases load punkt_tab instead of punkt, so download both.

🧪 Part A — Embeddings & Similarity (5.1)

A1) Create Embeddings for Sentences

Goal: Understand embeddings as numerical representations of meaning.

Step A1.1 — Create sample sentences

Create data/sentences.csv with 15–20 sentences across 4 topics:

Step A1.2 — Generate embeddings

Use a sentence-transformers model:

Save:

Deliverable: outputs/sentence_embeddings.parquet (or CSV + numpy arrays)
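
Step A1.2 can be sketched as below. This is a minimal sketch, not the lab's reference solution: embed_texts and toy_encode are illustrative names, and the encode argument stands in for whichever sentence-transformers model you choose (its encode method returns one vector per input string). The toy encoder only exists so the function can be dry-run without downloading a model.

```python
import numpy as np

def embed_texts(texts, encode):
    """Return a (len(texts), dim) float32 array of embeddings.

    encode is any callable that maps a list of strings to a 2-D array-like,
    for example the encode method of a SentenceTransformer instance
    (assumption: the model is chosen by you).
    """
    texts = list(texts)
    vectors = np.asarray(encode(texts), dtype=np.float32)
    if vectors.ndim != 2 or vectors.shape[0] != len(texts):
        raise ValueError(f"expected shape ({len(texts)}, dim), got {vectors.shape}")
    return vectors

def toy_encode(texts):
    """Stand-in encoder for dry runs: [character count, word count]."""
    return [[len(t), len(t.split())] for t in texts]

sentences = ["Cats sleep a lot.", "Stocks fell today."]
embeddings = embed_texts(sentences, toy_encode)
```

With a real model, pass model.encode in place of toy_encode, then persist the array with np.save (for the "CSV + numpy arrays" option) next to the CSV of sentences.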

A2) Similarity Metrics: Cosine vs Dot Product

Goal: See how similarity changes when vectors are normalized.

Step A2.1 — Compute similarity matrix

Step A2.2 — Normalize embeddings

Step A2.3 — Compare rankings

Pick 3 query sentences and show:

Deliverable: reports/similarity_metrics.md with a comparison table + conclusion:
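
Steps A2.1 to A2.3 can be sketched with a toy example. The vectors below are made up to show the effect, and the function names are illustrative. Dot product rewards long vectors, cosine similarity ignores length, and once every vector is L2-normalized the two metrics give the same ranking.

```python
import numpy as np

def l2_normalize(x):
    """Scale each vector (last axis) to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def dot_similarity(query, matrix):
    """Raw dot product of the query with every row of matrix."""
    return matrix @ query

def cosine_similarity(query, matrix):
    """Dot product after normalizing both sides."""
    return l2_normalize(matrix) @ l2_normalize(query)

def ranking(scores):
    """Row indices sorted from most to least similar."""
    return np.argsort(-scores, kind="stable").tolist()

vectors = np.array([[1.0, 0.0],   # same direction as the query, short
                    [3.0, 3.0],   # 45 degrees off, but long
                    [0.0, 2.0]])  # orthogonal to the query
query = np.array([2.0, 0.0])

dot_rank = ranking(dot_similarity(query, vectors))
cos_rank = ranking(cosine_similarity(query, vectors))
norm_dot_rank = ranking(dot_similarity(l2_normalize(query), l2_normalize(vectors)))
```

Here the long vector wins under raw dot product, while the vector pointing in the query's direction wins under cosine. Run the same comparison on your sentence embeddings for the 3 query sentences.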

🧪 Part B — Build a Semantic Search Pipeline (5.2)

B1) Create a Small Document Corpus

Goal: Build retrieval on real documents, not just sentences.

Step B1.1 — Create documents

Create 6–10 text files in data/docs/:
Examples:

Deliverable: data/docs/*.txt

B2) Chunking Strategies (Fixed-size vs Sentence-based)

Goal: Learn how chunking affects retrieval quality.

Step B2.1 — Implement two chunkers

Create src/chunking.py:

Chunker A: Fixed token window

Chunker B: Sentence-based

Step B2.2 — Chunk every document

Save chunks to:

Each chunk record contains:

Deliverable: both chunk files + chunk counts
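
The two chunkers can be sketched as follows. This sketch makes simplifying assumptions so it runs with the standard library only: whitespace-separated words stand in for tiktoken tokens, and a regular-expression split stands in for the NLTK sentence tokenizer. The window, overlap, and max_sentences defaults and the record fields (chunk_id, doc_id, strategy, text) are hypothetical choices, not values prescribed by the lab.

```python
import re

def fixed_window_chunks(text, window=50, overlap=10):
    """Chunker A: sliding window over tokens (whitespace tokens here)."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    tokens = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

def sentence_chunks(text, max_sentences=3):
    """Chunker B: group consecutive sentences (regex split here)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

def chunk_records(doc_id, chunks, strategy):
    """Wrap chunk texts in dict records (field names are illustrative)."""
    return [{"chunk_id": f"{doc_id}-{strategy}-{i}", "doc_id": doc_id,
             "strategy": strategy, "text": chunk}
            for i, chunk in enumerate(chunks)]

fixed = fixed_window_chunks("a b c d e f g h i j", window=4, overlap=1)
by_sentence = sentence_chunks("One. Two! Three? Four.", max_sentences=2)
records = chunk_records("doc1", fixed, "fixed")
```

To match the lab setup, replace text.split() with token ids from tiktoken and the regex with nltk.sent_tokenize.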

B3) Generate Embeddings for Chunks

Goal: Convert chunks to vectors for retrieval.

Step B3.1 — Embed all chunks

Use the same embedding model and create:

Store mapping:

Deliverable: embeddings files + metadata JSON
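
One possible shape for the mapping is sketched below, assuming each chunk record carries chunk_id and doc_id fields (hypothetical names). The point is that row i of the embedding matrix and entry i of the metadata list always describe the same chunk.

```python
import json

def build_row_mapping(records):
    """Entry i of the result describes row i of the embedding matrix."""
    return [{"row": i, "chunk_id": r["chunk_id"], "doc_id": r["doc_id"]}
            for i, r in enumerate(records)]

# Illustrative chunk records; real ones come from the chunking step.
records = [
    {"chunk_id": "doc1-fixed-0", "doc_id": "doc1", "text": "first chunk"},
    {"chunk_id": "doc2-fixed-0", "doc_id": "doc2", "text": "second chunk"},
]
mapping = build_row_mapping(records)
metadata_json = json.dumps(mapping, indent=2)  # write this string to the metadata file
```

Embed the chunk texts in the same order as records so the row numbers stay valid.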

B4) Similarity-Based Retrieval (Top-k Search)

Goal: Build a working retrieval function.

Step B4.1 — Write retrieval function

Create src/retrieval.py with:
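
A minimal sketch of a top-k retrieval function, assuming cosine similarity over a NumPy matrix of chunk embeddings. The name top_k and its signature are illustrative, not a required interface.

```python
import numpy as np

def top_k(query_vec, chunk_vecs, k=3):
    """Return [(row_index, cosine_score), ...] for the k most similar chunks."""
    q = query_vec / max(np.linalg.norm(query_vec), 1e-12)
    norms = np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    m = chunk_vecs / np.clip(norms, 1e-12, None)
    scores = m @ q
    k = min(k, len(scores))  # never ask for more rows than exist
    order = np.argsort(-scores, kind="stable")[:k]
    return [(int(i), float(scores[i])) for i in order]

chunk_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([1.0, 0.1])
hits = top_k(query_vec, chunk_vecs, k=2)
```

The returned row indices can be looked up in the metadata mapping to recover the chunk text and its source document.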

Step B4.2 — Create evaluation queries

Create data/queries.json with 12 queries.
Each query includes:

Example:

Step B4.3 — Evaluate retrieval accuracy

Compute Top-1 and Top-3 retrieval accuracy over the evaluation queries.

Deliverable: reports/retrieval_eval.md with results table.
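
Top-k accuracy can be sketched as below, assuming each query records the single document expected to answer it. The document names and the ranked lists are made-up placeholders, not real results.

```python
def top_k_accuracy(ranked_results, expected, k):
    """Fraction of queries whose expected doc appears in the first k results."""
    if len(ranked_results) != len(expected):
        raise ValueError("need one ranked list per expected answer")
    if not expected:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_results, expected)
               if gold in ranked[:k])
    return hits / len(expected)

# Placeholder data: retrieved doc ids per query, best match first.
ranked_docs = [
    ["hr_policy", "it_faq", "onboarding"],
    ["it_faq", "hr_policy", "onboarding"],
    ["onboarding", "it_faq", "hr_policy"],
]
gold_docs = ["hr_policy", "hr_policy", "hr_policy"]

top1 = top_k_accuracy(ranked_docs, gold_docs, k=1)
top3 = top_k_accuracy(ranked_docs, gold_docs, k=3)
```

Report both numbers for each chunking strategy so the results table shows which one retrieves better.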

🧪 Part C — Vector Databases (5.3)

C1) FAISS Index (Local High-Performance Indexing)

Goal: Experience how vector indexing speeds up retrieval.

Step C1.1 — Build FAISS index

Use:

Step C1.2 — Add all chunk vectors

Save index to indexes/faiss.index

Step C1.3 — Query FAISS and compare runtime

Time:

Deliverable: reports/faiss_speed.md with timing results.
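
A possible timing harness is sketched below. It assumes an exact inner-product index (faiss.IndexFlatIP) over L2-normalized vectors, which is one reasonable choice rather than the index type the lab prescribes. Random vectors stand in for your chunk embeddings, and the FAISS branch is skipped when faiss-cpu is not installed.

```python
import time
import numpy as np

try:
    import faiss  # optional: skipped if faiss-cpu did not install
except ImportError:
    faiss = None

def brute_force_search(queries, vectors, k):
    """Exact top-k by inner product using plain NumPy."""
    scores = queries @ vectors.T
    return np.argsort(-scores, axis=1, kind="stable")[:, :k]

def time_call(fn, *args, repeats=5):
    """Best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

rng = np.random.default_rng(0)
vectors = rng.standard_normal((2000, 64)).astype(np.float32)  # stand-in data
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
queries = vectors[:5].copy()

brute_ids = brute_force_search(queries, vectors, 3)
timings = {"numpy_brute_force": time_call(brute_force_search, queries, vectors, 3)}

if faiss is not None:
    index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product index
    index.add(vectors)
    timings["faiss_flat_ip"] = time_call(index.search, queries, 3)
```

On a corpus this small the gap may be modest. faiss.write_index(index, "indexes/faiss.index") saves the index to the path named in Step C1.2.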

C2) Chroma (Developer-Friendly Vector DB)

Goal: Store embeddings + metadata and query them like a real app.

Step C2.1 — Create a Chroma collection

Step C2.2 — Insert

For each chunk, store:

Step C2.3 — Query

Search top_k for each query and return:

Deliverable: outputs/chroma_query_results.json
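
Insert and query can be sketched as below. The helper names and the record fields (chunk_id, doc_id, text) are hypothetical. The collection argument is assumed to be a Chroma collection, whose add method takes parallel lists of ids, documents, metadatas, and embeddings, and whose query method returns one inner list per query.

```python
def to_chroma_batch(records, embeddings):
    """Split chunk records into the parallel lists that collection.add expects."""
    return {
        "ids": [r["chunk_id"] for r in records],
        "documents": [r["text"] for r in records],
        "metadatas": [{"doc_id": r["doc_id"]} for r in records],
        "embeddings": [list(map(float, e)) for e in embeddings],
    }

def insert_chunks(collection, records, embeddings):
    """Add every chunk to the collection in one call."""
    collection.add(**to_chroma_batch(records, embeddings))

def query_chunks(collection, query_embedding, top_k=3):
    """Return the top_k matches for a single query as a list of dicts."""
    res = collection.query(
        query_embeddings=[list(map(float, query_embedding))],
        n_results=top_k,
    )
    # One inner list per query; only one query was sent, so take index 0.
    return [
        {"chunk_id": i, "text": d, "metadata": m, "distance": dist}
        for i, d, m, dist in zip(res["ids"][0], res["documents"][0],
                                 res["metadatas"][0], res["distances"][0])
    ]

records = [
    {"chunk_id": "doc1-0", "doc_id": "doc1", "text": "first chunk"},
    {"chunk_id": "doc2-0", "doc_id": "doc2", "text": "second chunk"},
]
batch = to_chroma_batch(records, [[0.1, 0.2], [0.3, 0.4]])
```

With chromadb installed, a collection can be obtained from chromadb.PersistentClient(path=...).get_or_create_collection(name=...). The path and collection name are yours to choose.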

C3) Compare Vector DBs (FAISS vs Chroma)

Goal: Understand “library vs database” tradeoffs.

Step C3.1 — Compare on 3 dimensions

  1. Latency

  2. Metadata filtering support

  3. Persistence + developer experience

Step C3.2 — Write a short recommendation

When to use:

Deliverable: reports/vector_db_choice.md (1 page)

🧪 Part D — Performance Considerations (Production Thinking)

D1) Measure Chunk Size Tradeoffs

Goal: Show how chunk size affects retrieval quality + cost.

Step D1.1 — Run retrieval with 3 chunk sizes

Step D1.2 — Compare

For each:

Deliverable: reports/chunk_tradeoffs.md

D2) Add Metadata Filters (Enterprise Pattern)

Goal: Restrict retrieval to relevant subsets.

Step D2.1 — Add metadata fields

Example:

Step D2.2 — Query with filters (Chroma)

Example:

Deliverable: filtered retrieval demo output.
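
A filtered query can be sketched as below. The metadata fields department and year are hypothetical examples. Chroma accepts a single-field filter as a plain dict and expects several conditions to be combined under the $and operator, which is what build_where does.

```python
def build_where(**filters):
    """Build a Chroma where clause from exact-match metadata filters."""
    conditions = [{key: value} for key, value in filters.items()]
    if not conditions:
        return None            # no filtering
    if len(conditions) == 1:
        return conditions[0]   # e.g. {"department": "hr"}
    return {"$and": conditions}

def filtered_query(collection, query_embedding, top_k=3, **filters):
    """Query a Chroma collection, restricted to chunks matching the filters."""
    return collection.query(
        query_embeddings=[list(map(float, query_embedding))],
        n_results=top_k,
        where=build_where(**filters),
    )

single = build_where(department="hr")
combined = build_where(department="hr", year=2024)
```

The filter fields must have been stored as metadata at insert time, so add them to each chunk's metadata dict in Step D2.1 before querying.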

✅ Final Submission Checklist (Section 5 Lab)

Students submit:

  1. Sentence embeddings file + similarity metric comparison report

  2. Two chunking strategies outputs (fixed vs sentence-based)

  3. Chunk embedding generation output + metadata mapping

  4. Semantic retrieval function + evaluation results (Top-1/Top-3)

  5. FAISS index + speed comparison report

  6. Chroma collection demo + query results JSON

  7. Vector DB choice memo (FAISS vs Chroma vs managed DB)

  8. Chunk size tradeoff report + metadata filtering demo