WEBVTT

00:00.040 --> 00:03.400
We will talk about building semantic search pipelines.

00:03.440 --> 00:10.240
A semantic search pipeline is an end-to-end system designed to retrieve information based on meaning,

00:10.360 --> 00:12.760
not just exact keyword matches.

00:13.520 --> 00:20.200
Unlike traditional search engines that rely on literal text overlap, semantic search interprets user

00:20.200 --> 00:27.520
intent and finds conceptually relevant content even when the wording differs, as illustrated on page

00:27.520 --> 00:28.440
two of the deck.

00:28.720 --> 00:35.680
This pipeline is composed of three tightly connected stages: text chunking, embedding generation, and

00:35.680 --> 00:37.280
similarity-based retrieval.

00:37.880 --> 00:43.960
Each stage plays a critical role, and failure in any one of them degrades the entire system.

00:44.600 --> 00:51.240
This architecture powers modern use cases such as enterprise knowledge bases, internal documentation

00:51.240 --> 00:52.840
search, and retrieval-augmented

00:52.840 --> 00:54.760
generation systems.

00:55.400 --> 01:00.440
When a user asks a question, the system doesn't simply scan for matching words.

01:00.870 --> 01:07.190
Instead, it embeds the query, compares it against embedded document chunks, and retrieves the most

01:07.190 --> 01:09.070
semantically similar results.

01:09.070 --> 01:15.270
The most important takeaway from this slide is that semantic search is not a single API call.

01:15.830 --> 01:19.870
It is a system architecture that must be designed holistically.

01:20.510 --> 01:26.430
Strong results come from thoughtful integration of all components, not from model choice alone.

01:26.750 --> 01:33.590
Large language models operate within fixed context windows, as highlighted on page three of the deck.

01:34.110 --> 01:41.550
These limits mean that long documents such as manuals, research papers, or code bases cannot be processed

01:41.550 --> 01:42.670
in a single pass.

01:43.070 --> 01:45.630
Text chunking is the solution to this constraint.

01:46.030 --> 01:50.070
However, chunking is not just about fitting text into token limits.

01:50.630 --> 01:56.190
The way you split documents directly affects retrieval quality and system reliability.

01:56.990 --> 02:03.710
Poor chunking strategies lead to lost context, irrelevant retrieval and increased hallucinations in

02:03.710 --> 02:08.470
downstream models. When important information is split across chunk boundaries,

02:08.670 --> 02:10.710
semantic relationships are broken.

02:11.230 --> 02:17.350
This causes the embedding model to receive incomplete or confusing input, leading to poor similarity

02:17.350 --> 02:19.190
matching.

02:19.190 --> 02:20.390
In retrieval-augmented systems,

02:20.510 --> 02:26.190
this lack of context forces the generation model to fill gaps with fabricated information.

02:26.430 --> 02:32.870
The engineering goal, as emphasized in the slide, is to create chunks that preserve both semantic

02:32.870 --> 02:35.390
meaning and narrative coherence.

02:36.070 --> 02:40.790
Each chunk should represent a complete, self-contained unit of information.

02:41.470 --> 02:47.950
Effective chunking is one of the highest-leverage improvements you can make in any semantic search or

02:47.990 --> 02:49.070
RAG pipeline.

02:49.830 --> 02:57.270
Fixed-size chunking is the simplest approach to splitting text, as described on page four of the deck.

02:57.630 --> 03:04.620
This method divides documents using predetermined boundaries based on token count or character count.

03:05.220 --> 03:11.420
For example, a system might split text every 512 tokens or every 2000 characters.
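
NOTE
A minimal sketch of the fixed-size splitting just described, assuming character-based boundaries. The function name fixed_size_chunks and its default size are illustrative assumptions, not something taken from the deck.
```python
def fixed_size_chunks(text: str, size: int = 2000) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Stride through the text; the final piece may be shorter than `size`.
    return [text[i:i + size] for i in range(0, len(text), size)]
#
# Usage: a 4500-character document gives pieces of 2000, 2000, and 500 characters.
print([len(c) for c in fixed_size_chunks("x" * 4500)])
```
The split points ignore sentence and paragraph boundaries entirely, which is why this approach suits uniform documents best.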

03:11.980 --> 03:16.020
The primary advantage of fixed-size chunking is predictability.

03:16.220 --> 03:22.020
Chunk sizes are consistent, processing is fast, and system design becomes straightforward.

03:22.540 --> 03:28.580
This approach works well for uniform documents such as product catalogs, standardized reports, or

03:28.580 --> 03:32.020
highly structured data where content patterns are consistent.

03:32.460 --> 03:38.780
However, fixed-size chunking has significant limitations. Because it ignores semantic boundaries, it

03:38.780 --> 03:41.820
often splits text mid-sentence or mid-idea.

03:42.260 --> 03:48.580
This disrupts meaning and breaks logical flow, which negatively impacts embedding quality and retrieval

03:48.580 --> 03:53.900
accuracy. For complex documents such as technical guides or research papers,

03:54.180 --> 03:59.780
fixed-size chunking can lead to poor search results and increased hallucinations downstream.

04:00.260 --> 04:05.940
While attractive for its simplicity, this approach should be used carefully and only when document

04:05.940 --> 04:07.140
structure supports it.

04:07.580 --> 04:14.500
Semantic chunking addresses the shortcomings of fixed-size approaches by respecting the natural structure

04:14.500 --> 04:17.940
of documents, as shown on page five of the deck.

04:18.260 --> 04:24.620
This method splits text at meaningful boundaries such as paragraphs, section headings, or complete

04:24.620 --> 04:25.460
sentences.
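
NOTE
One way to sketch splitting at meaningful boundaries, assuming paragraphs are separated by blank lines. Whole paragraphs are packed into a chunk until a size budget would be exceeded. The name paragraph_chunks and the max_chars budget are hypothetical choices for illustration.
```python
import re
#
def paragraph_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Group whole paragraphs into chunks of at most `max_chars` characters."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            # Keep growing; a single oversized paragraph is kept whole, never cut.
            current = candidate
        else:
            # Adding this paragraph would exceed the budget, so close the chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
#
doc = "Intro paragraph.\n\nSecond paragraph with details.\n\nThird paragraph."
print(paragraph_chunks(doc, max_chars=50))
```
A paragraph is never split here, so a chunk can exceed the budget when one paragraph alone is longer than max_chars.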

04:26.100 --> 04:33.340
By preserving semantic units, semantic chunking maintains the integrity of ideas and improves retrieval

04:33.340 --> 04:34.100
accuracy.

04:34.740 --> 04:40.500
However, even well-chosen boundaries can still cut off important context at the edges.

04:40.940 --> 04:43.820
This is where overlapping chunking becomes essential.

04:44.420 --> 04:51.100
Overlapping chunking introduces redundancy by including a small portion of adjacent text, typically

04:51.100 --> 04:53.980
10 to 20%, from neighboring chunks.

04:54.580 --> 04:59.380
This ensures that important information near boundaries is preserved in multiple chunks.
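
NOTE
A small sketch of overlapping chunking as a sliding window over a token list. The overlap fraction is a parameter, and the 0.15 default is an assumption picked from inside the 10 to 20% range mentioned above. The function name is hypothetical.
```python
def overlapping_chunks(tokens: list[str], size: int = 512, overlap: float = 0.15) -> list[list[str]]:
    """Slide a window of `size` tokens, repeating about `overlap` of each window in the next."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # The window advances by the non-overlapping part of each chunk.
    step = max(1, int(size * (1 - overlap)))
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
#
tokens = "a b c d e f g h i j".split()
# With size=4 and overlap=0.25 the step is 3, so each chunk shares one token with the next.
print(overlapping_chunks(tokens, size=4, overlap=0.25))
```
The token at each boundary appears in two chunks, so a query that matches text near a boundary can still retrieve a chunk with the surrounding context.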

04:59.740 --> 05:06.460
The benefit is significantly improved recall and relevance, especially for complex queries.

05:07.060 --> 05:15.500
The trade-off is increased storage and compute cost, usually around 20 to 30% more embeddings in production

05:15.500 --> 05:16.220
systems.

05:16.420 --> 05:18.940
This trade-off is almost always worth it.

05:19.220 --> 05:25.060
Better chunking leads to better retrieval, which directly translates into better user experience and

05:25.060 --> 05:27.100
more reliable AI behavior.

05:27.780 --> 05:34.740
Once documents are chunked, the next step is embedding generation, as detailed on page six of the deck.

05:35.060 --> 05:43.460
Each chunk is transformed into a high-dimensional vector, typically between 768 and 1536 dimensions,

05:43.700 --> 05:45.900
that represents its semantic meaning.

05:46.580 --> 05:49.020
Choosing the right embedding model is critical.

05:49.540 --> 05:55.900
Dedicated embedding models, such as OpenAI's text embedding series, Cohere's embedding models, or

05:55.900 --> 06:00.420
open source sentence transformers define the structure of your vector space.

06:00.970 --> 06:02.890
Consistency is essential.

06:03.210 --> 06:08.250
The same embedding model must be used for both document indexing and query embedding.

06:08.970 --> 06:15.050
For efficiency, embeddings should be generated in batches and processed offline during ingestion.

06:15.610 --> 06:22.210
Embedding APIs often support batch sizes of hundreds or thousands of chunks per request, dramatically

06:22.210 --> 06:24.210
reducing latency and cost.
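
NOTE
A sketch of batched embedding during ingestion. No real embedding API is called here: embed_batch stands for whatever client function you use, and fake_embed_batch is a made-up stand-in so the example runs. The batch size of 256 is likewise an assumption.
```python
from typing import Callable, Iterator, Sequence
#
def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
#
def embed_corpus(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 256,
) -> list[list[float]]:
    """Embed all chunks with one call per batch instead of one call per chunk."""
    vectors: list[list[float]] = []
    for batch in iter_batches(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
#
def fake_embed_batch(texts: Sequence[str]) -> list[list[float]]:
    # Hypothetical stand-in for a real model: a one-dimensional "embedding" of text length.
    return [[float(len(t))] for t in texts]
#
print(embed_corpus(["a", "bb", "ccc"], fake_embed_batch, batch_size=2))
```
Three chunks with a batch size of two result in two calls to the embedding function rather than three.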

06:24.570 --> 06:31.250
In addition to vectors, embeddings should be stored with rich metadata, including document source,

06:31.290 --> 06:35.370
chunk position, timestamps, and domain-specific tags.

06:35.850 --> 06:38.650
Finally, versioning is essential.

06:38.970 --> 06:45.290
If you change embedding models, you will need to re-embed and re-index your entire corpus to maintain

06:45.290 --> 06:46.410
compatibility.
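
NOTE
A sketch of how a stored chunk might carry its metadata and the embedding model version together. The field names, the model labels, and the sample values are all hypothetical and only illustrate the idea.
```python
from dataclasses import dataclass, field
#
@dataclass
class ChunkRecord:
    chunk_id: str
    text: str
    vector: list[float]
    source: str           # document the chunk came from
    position: int         # index of the chunk within that document
    created_at: str       # ingestion timestamp, ISO 8601 assumed
    embedding_model: str  # name and version of the model that produced `vector`
    tags: list[str] = field(default_factory=list)
#
def needs_reembedding(record: ChunkRecord, current_model: str) -> bool:
    """A vector is only comparable with queries embedded by the same model."""
    return record.embedding_model != current_model
#
record = ChunkRecord(
    chunk_id="manual-0007",
    text="Placeholder chunk text.",
    vector=[0.1, 0.2, 0.3],
    source="manual.pdf",
    position=7,
    created_at="2024-01-01T00:00:00Z",
    embedding_model="example-embed-v1",
    tags=["hardware"],
)
print(needs_reembedding(record, current_model="example-embed-v2"))
```
Storing the model name on every record makes it possible to find stale vectors after a model change instead of guessing which ones were re-embedded.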

06:46.450 --> 06:52.890
After embeddings are generated, they must be stored in a system optimized for high-dimensional similarity

06:52.930 --> 06:53.450
search.

06:53.970 --> 06:57.210
Traditional databases are not designed for this task.

06:57.770 --> 07:04.480
As explained on page seven of the deck, vector databases use approximate nearest neighbor algorithms

07:04.480 --> 07:07.960
to efficiently search millions or even billions of vectors.

07:08.440 --> 07:16.600
Vector databases trade perfect accuracy for massive speed gains, typically achieving 95 to 99% recall

07:16.600 --> 07:18.520
with millisecond-level latency.
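
NOTE
To make the recall figure concrete: recall here is the fraction of the true nearest neighbors that the approximate search actually returned. A minimal sketch, with a hypothetical function name and made-up IDs.
```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Share of the exact top-k neighbors that also appear in the approximate top-k."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
#
# The approximate search found three of the four true neighbors.
print(recall_at_k(approx_ids=[1, 2, 3, 9], exact_ids=[1, 2, 3, 4]))  # 0.75
```
In practice the exact neighbors come from a slow brute-force scan over a sample of queries, which is used only to measure the index.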

07:19.200 --> 07:23.000
This performance is essential for production semantic search systems.

07:23.520 --> 07:28.000
Common options include FAISS for open-source, high-performance search;

07:28.520 --> 07:31.400
Pinecone for fully managed vector storage;

07:31.960 --> 07:39.600
Weaviate for flexible schema-based search with GraphQL; and Milvus for cloud-native, large-scale deployments.

07:39.800 --> 07:46.640
The choice of vector database depends on operational constraints such as scale, latency requirements,

07:46.760 --> 07:49.920
infrastructure expertise, and vendor preferences.

07:50.320 --> 07:56.680
Regardless of the tool, vector databases are a core component of any production semantic search pipeline.

07:57.000 --> 08:00.960
Without them, real-time retrieval at scale is not feasible.

08:01.120 --> 08:08.840
Similarity-based retrieval is the real-time execution phase of the semantic search pipeline, as outlined

08:08.880 --> 08:10.280
on page eight of the deck.

08:10.640 --> 08:16.320
This process mirrors document indexing but operates under strict latency constraints.

08:16.720 --> 08:22.320
First, the user's query is embedded using the same embedding model used for documents.

08:22.720 --> 08:24.800
This consistency is critical.

08:25.120 --> 08:30.920
Mixing embedding models breaks the vector space and invalidates similarity comparisons.

08:31.320 --> 08:38.040
Next, the vector database computes similarity scores between the query vector and the indexed document

08:38.040 --> 08:38.720
vectors.

08:39.240 --> 08:43.040
This step is handled efficiently using ANN algorithms.

08:43.520 --> 08:49.880
Finally, the system retrieves the top k most similar chunks, typically between 5 and 20.
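
NOTE
A sketch of the scoring and top-k step using cosine similarity. This is an exact brute-force scan over a tiny in-memory index, shown only to make the math visible. It is not the ANN search a vector database would run, and the names and vectors are made up.
```python
import math
#
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
#
def top_k(query: list[float], index: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
#
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
# Chunk "a" points the same way as the query, "c" is 45 degrees off, "b" is orthogonal.
print(top_k([1.0, 0.0], index, k=2))
```
The query vector must come from the same embedding model as the indexed vectors, otherwise these scores are meaningless.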

08:50.360 --> 08:55.920
The quality of retrieval depends on two factors: embedding quality and chunk quality.

08:56.240 --> 09:01.080
No amount of tuning can compensate for poor embeddings or badly formed chunks.

09:01.510 --> 09:07.710
Engineers should view retrieval as a mathematical operation whose success is determined upstream by

09:07.710 --> 09:10.230
design choices made during ingestion.

09:11.990 --> 09:19.150
Raw similarity scores provide an initial ranking, but production systems require additional refinement

09:19.150 --> 09:21.230
to deliver high-quality results.

09:21.830 --> 09:27.270
Page nine of the deck outlines several critical post-retrieval optimization techniques.

09:28.150 --> 09:34.310
Similarity thresholds filter out low confidence matches, preventing irrelevant content from polluting

09:34.310 --> 09:35.030
results.

09:35.670 --> 09:41.990
Metadata filtering further narrows retrieval based on attributes such as document type, date, access

09:41.990 --> 09:44.030
permissions, or domain tags.

09:44.750 --> 09:50.190
For higher precision, systems may apply LLM-based reranking to the top-k results.

09:50.870 --> 09:56.670
This step improves relevance but adds latency and cost, so it should be used selectively.

09:57.510 --> 10:03.430
Diversity filtering prevents near-duplicate chunks from dominating the result set, ensuring broader

10:03.430 --> 10:04.990
coverage across sources.
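
NOTE
A sketch that chains three of the refinements above: a similarity threshold, a metadata filter, and a simple diversity rule. Diversity is approximated here by capping hits per source, which is an assumption and only one of several ways to do it. The 0.75 threshold and all names are illustrative. LLM-based reranking is left out.
```python
from dataclasses import dataclass
from typing import Optional
#
@dataclass
class Hit:
    chunk_id: str
    score: float
    source: str
    doc_type: str
#
def refine(
    hits: list[Hit],
    min_score: float = 0.75,
    allowed_types: Optional[set[str]] = None,
    max_per_source: int = 2,
) -> list[Hit]:
    """Apply threshold, metadata, and per-source diversity filters, best score first."""
    kept: list[Hit] = []
    per_source: dict[str, int] = {}
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.score < min_score:
            continue  # similarity threshold: drop low-confidence matches
        if allowed_types is not None and hit.doc_type not in allowed_types:
            continue  # metadata filter: keep only permitted document types
        if per_source.get(hit.source, 0) >= max_per_source:
            continue  # diversity: stop one source from dominating the results
        per_source[hit.source] = per_source.get(hit.source, 0) + 1
        kept.append(hit)
    return kept
#
hits = [
    Hit("a", 0.95, "manual", "guide"),
    Hit("b", 0.90, "manual", "guide"),
    Hit("c", 0.85, "manual", "guide"),
    Hit("d", 0.80, "faq", "guide"),
    Hit("e", 0.60, "faq", "guide"),
    Hit("f", 0.88, "blog", "marketing"),
]
print([h.chunk_id for h in refine(hits, allowed_types={"guide"})])
```
Here "c" is dropped by the per-source cap, "e" by the threshold, and "f" by the metadata filter, leaving "a", "b", and "d".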

10:05.230 --> 10:11.790
These techniques transform a raw list of candidates into a curated, high-quality retrieval set.

10:12.390 --> 10:14.630
The key principle is iteration.

10:15.070 --> 10:22.190
Retrieval performance must be monitored continuously, and thresholds, filters and chunking strategies

10:22.190 --> 10:25.310
should be adjusted based on real user behavior.

10:26.030 --> 10:30.270
This final slide brings the entire semantic search pipeline together.

10:30.870 --> 10:37.790
Documents are ingested, chunked, embedded, and stored in a vector database. At query time,

10:38.030 --> 10:44.590
user input is embedded, compared against stored vectors, and refined through ranking and filtering before

10:44.590 --> 10:45.830
results are returned.

10:46.430 --> 10:53.030
As emphasized throughout the deck, success depends more on pipeline design than on any individual model

10:53.030 --> 10:53.590
choice.

10:54.190 --> 11:00.950
Chunking strategy, embedding consistency, storage architecture, and retrieval tuning all interact

11:00.950 --> 11:02.700
to determine system quality.

11:03.340 --> 11:07.740
Semantic search pipelines form the backbone of modern RAG systems,

11:07.900 --> 11:11.140
enterprise search platforms, and AI assistants.

11:11.580 --> 11:18.300
When designed correctly, they dramatically reduce hallucinations, improve relevance, and enable grounded,

11:18.340 --> 11:20.380
trustworthy AI responses.

11:21.660 --> 11:24.820
The key takeaway is simple but powerful.

11:25.060 --> 11:27.900
Semantic search is an engineering discipline.

11:28.060 --> 11:31.940
It rewards careful design, measurement, and iteration.

11:32.180 --> 11:39.260
Mastering this pipeline prepares you to build scalable, production-grade AI systems that truly understand

11:39.260 --> 11:42.340
user intent, rather than just matching words.
