WEBVTT

00:00.160 --> 00:00.920
All right.

00:00.920 --> 00:04.280
So let's go and implement our retrieval pipeline.

00:04.880 --> 00:08.280
So I'm going to start by creating a main.py file.

00:08.280 --> 00:11.720
And this is going to be the retrieval part of our project.

00:11.760 --> 00:18.960
So let's go now and start with the imports and the initialization of the objects that we'll be using.

00:18.960 --> 00:20.360
And let's start.

00:20.560 --> 00:27.120
So first I want to import the os module because we're going to be accessing environment variables.

00:27.600 --> 00:33.160
And now let's go and import load_dotenv because we're going to be loading all the environment variables

00:33.160 --> 00:35.120
from the .env file over here.

00:35.720 --> 00:39.400
Let's go now and import the ChatPromptTemplate.

00:39.400 --> 00:43.440
Because this is what's going to form the RAG prompt we're going to be using.

00:44.200 --> 00:46.640
We want to import HumanMessage.

00:46.640 --> 00:49.400
This is what is going to be invoking the pipeline.

00:50.160 --> 00:58.480
And let's go and import from LangChain OpenAI the ChatOpenAI for our LLM and OpenAIEmbeddings for our embeddings

00:58.480 --> 01:01.290
modules, which we covered in the previous section.

01:01.290 --> 01:05.970
And also let's go and import, like previously explained, the PineconeVectorStore.

01:06.170 --> 01:10.050
And let's start with the initialization of all of the objects.

01:10.330 --> 01:13.450
So first we want to load all the environment variables.

01:13.810 --> 01:17.410
And let's now go and print initializing components.

01:17.410 --> 01:20.930
So now we're going to be initializing every one of those objects here.

01:21.210 --> 01:23.970
So let me go now and run everything as a sanity check.

01:24.770 --> 01:27.570
And we can see we are printing this indeed.

01:27.930 --> 01:28.410
Cool.

01:28.890 --> 01:33.010
So let's start with initializing OpenAI embeddings.

01:33.650 --> 01:35.530
And let's also initialize the LLM.

01:35.530 --> 01:39.010
And here we're using the LangChain OpenAI defaults.

01:39.010 --> 01:41.250
And we'll see which LLM it's going to be using.

01:42.610 --> 01:46.370
Let's go now and initialize our vector store object.

01:46.370 --> 01:52.450
So in order to initialize it we need to give it the index name which we created earlier in the videos.

01:52.690 --> 01:58.490
And we need to give it the embeddings module, very similar to what we did in the previous video.

01:58.930 --> 02:04.370
And now we want to use this vector store and we want to use its searching capabilities.

02:04.370 --> 02:09.770
So we want to be able to make similarity searches to find these relevant contexts in the RAG pipeline.

02:10.130 --> 02:16.130
So earlier, let's go to the ingestion.py file, we used the from_documents method.

02:16.410 --> 02:18.850
Let's go back now to main.py.

02:19.450 --> 02:24.010
Now we want to go and we want to take this vector store.

02:24.050 --> 02:27.570
And it has a method which is called as_retriever.

02:27.970 --> 02:30.170
And this function is going to return us.

02:30.170 --> 02:32.050
Let's go and open the docs here.

02:32.050 --> 02:36.130
It's going to return us an object of VectorStoreRetriever.

02:36.330 --> 02:43.250
So this VectorStoreRetriever from LangChain, I'm not going to go too deep into this in this section

02:43.250 --> 02:49.570
right now, but it's going to give us a vector store with searching capabilities that the vendors of

02:49.570 --> 02:51.130
the vector store implemented.

02:51.330 --> 02:57.690
And eventually this VectorStoreRetriever class is going to have this searching capability.

02:57.730 --> 03:01.220
It's going to have a search function, which we're going to be using now.

03:01.500 --> 03:05.380
So when we initialize it we're going to give it search_kwargs,

03:05.420 --> 03:06.660
k equals three.

03:06.820 --> 03:12.660
So this means every time we want to search in the vector store and get those relevant chunks those relevant

03:12.660 --> 03:13.420
contexts.

03:13.620 --> 03:19.740
I want to be limiting it to only three top documents that are going to be used.

03:19.940 --> 03:27.020
So if there are maybe ten relevant documents, it's going to sort them by order of importance of what's

03:27.020 --> 03:29.620
the most relevant and what's the least relevant.

03:29.620 --> 03:31.940
And I want to take the top three.
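
NOTE
The top-three selection described here can be sketched as a toy similarity search.
This is only an illustration under stated assumptions: the two-dimensional vectors, the
chunk names, and the helper names cosine and top_k are made up, and a real vector store
does this ranking on its own side with real embeddings.
```python
# Toy top-k retrieval: rank stored chunks by cosine similarity, keep the best k.
# The vectors and texts below are made-up illustration data, not real embeddings.
import math
#
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
#
def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the texts of the k stored chunks most similar to the query vector."""
    # Sort by similarity, most relevant first, then cut the list at k.
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
#
store = [
    ("chunk A", [1.0, 0.0]),
    ("chunk B", [0.9, 0.1]),
    ("chunk C", [0.0, 1.0]),
    ("chunk D", [0.7, 0.7]),
    ("chunk E", [-1.0, 0.0]),
]
results = top_k([1.0, 0.0], store, k=3)
print(results)
```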

03:32.780 --> 03:37.020
And let me now go and initialize the prompt template here.

03:37.380 --> 03:41.020
So I'm going to initialize it with a very simple prompt.

03:41.020 --> 03:43.180
But it's a very powerful prompt.

03:43.500 --> 03:44.940
The prompt goes as follows.

03:44.980 --> 03:48.780
Answer the question based only on the following context.

03:49.020 --> 03:51.460
Here we have the placeholder for the context.

03:51.700 --> 03:57.460
Here we have the original question of the user and provide a detailed answer.

03:57.660 --> 03:59.460
So this context here.

03:59.460 --> 04:03.100
This is going to be the augmentation part of our prompt.

04:03.100 --> 04:06.100
So the question here is going to be our original prompt.

04:06.100 --> 04:08.220
And this is going to be the augmentation.

04:08.220 --> 04:11.780
So this is from the retrieval-augmented generation of RAG.
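
NOTE
A dependency-free sketch of how the prompt gets augmented. The video uses a LangChain
prompt template; plain str.format stands in for it here so the sketch runs on its own.
The exact template wording and the names TEMPLATE and build_prompt are assumptions
based on what is read out in the video, not the file shown on screen.
```python
# Plain-string stand-in for the prompt template described in the video.
TEMPLATE = (
    "Answer the question based only on the following context:\n"
    "{context}\n"
    "Question: {question}\n"
    "Provide a detailed answer:"
)
#
def build_prompt(context: str, question: str) -> str:
    """Fill the two placeholders of the template."""
    # The context is the augmentation; the question is the user's original prompt.
    return TEMPLATE.format(context=context, question=question)
#
prompt_text = build_prompt(
    context="Pinecone is a managed vector database.",
    question="What is Pinecone in machine learning?",
)
print(prompt_text)
```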

04:13.300 --> 04:18.020
And now let's go and implement a very simple auxiliary function.

04:18.020 --> 04:23.940
So we want to implement format_docs, which is going to receive docs, which are going to be LangChain documents.

04:24.380 --> 04:31.860
And this function is going to take the documents that were retrieved from the vector store, and it's

04:31.860 --> 04:34.780
simply going to format them nicely into a nice string.

04:34.940 --> 04:36.500
So you can see all that it's doing.

04:36.500 --> 04:39.660
It's going to iterate through all the documents.

04:39.700 --> 04:41.180
This is the for loop over here.

04:41.580 --> 04:45.420
Every document we saw before has a page content.

04:45.580 --> 04:48.620
So we're going to take this page content which is a string.

04:48.620 --> 04:51.900
And we are simply going to append newlines before it.

04:52.140 --> 04:57.460
So if we have a bunch of documents we're simply going to format them as one string.

04:57.780 --> 05:02.790
And this one string is eventually what's going to be sent here to this context, right?
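
NOTE
A minimal sketch of the formatting helper described above. Doc is a hypothetical
stand-in that models only the page_content attribute, and the blank-line separator
is an assumption; the video only says that newlines are added between documents.
```python
from dataclasses import dataclass
#
@dataclass
class Doc:
    # Hypothetical stand-in for a LangChain document: only page_content is modeled.
    page_content: str
#
def format_docs(docs: list[Doc]) -> str:
    """Combine the text of every retrieved document into one string."""
    # Join the page contents, separated by a blank line.
    return "\n\n".join(doc.page_content for doc in docs)
#
docs = [Doc("first chunk"), Doc("second chunk"), Doc("third chunk")]
context = format_docs(docs)
print(context)
```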

05:03.590 --> 05:09.670
So those are all the initialization of the objects, and those are all the building blocks that we're

05:09.670 --> 05:10.830
going to be using now.

05:11.750 --> 05:12.230
Cool.

05:12.270 --> 05:14.910
So now let's go try to run everything.

05:14.910 --> 05:18.390
Let's see that everything is going to be initialized.

05:18.430 --> 05:18.790
Cool.

05:18.830 --> 05:19.950
We didn't get any error.

05:19.990 --> 05:21.550
This was a sanity check.

05:22.910 --> 05:23.350
Cool.

05:23.550 --> 05:27.430
So now let's go and use our main runner here.

05:27.670 --> 05:31.750
Now I wrote if __name__ == "__main__", and we're going to print retrieving.

05:31.750 --> 05:35.790
And the query is going to be: what is Pinecone in machine learning?

05:35.950 --> 05:40.550
So first let me go and show you what we're going to get if we're not going to be using RAG.

05:40.790 --> 05:45.350
So here I pasted a raw invocation which is simply going to use the LLM.

05:45.470 --> 05:50.830
It's going to send it, in the HumanMessage, the content of the query, which is what is Pinecone in

05:50.830 --> 05:51.550
machine learning.

05:51.790 --> 05:53.310
And then it's going to print it.

05:53.510 --> 05:56.910
So let's go and let me show you what we're going to be getting.

05:57.110 --> 05:58.840
So this is a raw invocation.

05:58.840 --> 06:00.120
We are not using RAG.

06:00.560 --> 06:02.160
So here we can see the answer.

06:02.360 --> 06:09.080
A pinecone algorithm is a method used in machine learning to search for the best configuration hyperparameters.

06:09.600 --> 06:15.400
This is not the answer that we want, because we want to talk about Pinecone as the vector store.

06:15.800 --> 06:16.240
All right.

06:16.240 --> 06:17.760
And let me even go to LangSmith.

06:18.520 --> 06:22.400
Let's go to our LangSmith project here that I initialized.

06:22.400 --> 06:23.360
Here we have the chain.

06:23.360 --> 06:25.280
What is Pinecone in machine learning.

06:25.720 --> 06:27.080
And here we can see the answer.

06:27.080 --> 06:31.160
By the way, the model that we're using is GPT-3.5 Turbo.

06:31.440 --> 06:33.920
So just for the sake of it let's go and change here the model.

06:33.920 --> 06:37.680
Let's try to use model GPT 5.2.

06:51.920 --> 06:53.240
Now we can see the answer.

06:53.240 --> 06:58.320
Pinecone is a managed vector database used in machine learning to store, index.

06:58.360 --> 06:58.640
Okay.

06:58.680 --> 07:00.280
So this is a nice answer.

07:00.280 --> 07:02.240
So this is like the answer we wanted.

07:02.240 --> 07:07.640
And by the way, the difference here, and this is one of the motivations for using RAG, is because

07:07.680 --> 07:15.480
GPT 5.2 was trained in 2025 and Pinecone was well established by then.

07:15.480 --> 07:19.640
So we had lots of documents and lots of training data on Pinecone as the vector store.

07:19.800 --> 07:23.600
And before that it was not. So GPT-3.5,

07:23.640 --> 07:26.240
as we saw before, hallucinated the answer.

07:26.280 --> 07:28.120
This is also something interesting to see.

07:28.160 --> 07:28.560
Cool.

07:28.560 --> 07:31.760
So now let's go and use

07:31.760 --> 07:34.320
ChatOpenAI with GPT-3.5.

07:34.720 --> 07:38.640
And we saw what the answer was without using RAG.

07:38.680 --> 07:43.160
And now let's go and implement the RAG pipeline with all the building blocks.

07:43.200 --> 07:43.760
All right.

07:43.760 --> 07:48.440
So now let's go and implement the retrieval part of our pipeline.

07:48.600 --> 07:54.800
So first let me define a function which is called retrieval chain without LangChain Expression Language.

07:55.160 --> 08:01.450
And this is a function which is going to receive a string, and it's going to return us a response of

08:01.450 --> 08:02.090
the LLM.

08:02.130 --> 08:07.770
The way we're going to be implementing it now is without using LangChain Expression Language, and this

08:07.770 --> 08:12.130
is simply to show you the flow and to show you what's happening under the hood.

08:12.330 --> 08:16.850
And this is to show you how we can do it without LangChain and what it's going to look like.

08:17.090 --> 08:19.210
So let's go and start and implement this.

08:19.210 --> 08:21.410
And this is the most naive implementation.

08:21.730 --> 08:27.570
After we've done that, we are going to implement it with LangChain Expression Language,

08:27.810 --> 08:33.010
which is going to give us a much more elegant solution and many more capabilities.

08:33.250 --> 08:37.170
So here in the description this is going to be a simple retrieval chain.

08:37.210 --> 08:38.610
It's not going to be really a chain.

08:38.650 --> 08:41.650
It's simply going to be a bunch of function invocations.

08:41.970 --> 08:45.010
And here we're going to manually retrieve the documents.

08:45.010 --> 08:46.970
We're going to manually format them.

08:46.970 --> 08:49.370
And we're going to generate the answer.

08:49.570 --> 08:54.170
So in this implementation we have a bunch of limitations which are going to be solved.

08:54.490 --> 08:58.380
In this video we are going to invoke everything manually.

08:58.660 --> 09:00.820
We don't have any streaming support.

09:00.820 --> 09:02.700
We don't have any async support.

09:02.740 --> 09:06.380
It's hard to take this and compose it with other chains.

09:06.380 --> 09:10.020
And this is going to be error prone and it's going to be harder to maintain.

09:10.020 --> 09:14.460
And we're going to see how we're going to be solving every one of those limitations in the LangChain

09:14.460 --> 09:18.300
expression language implementation, which is going to be in this video.

09:18.460 --> 09:24.420
So the first thing we want to do is we want to take the user query, the question, what is Pinecone in

09:24.420 --> 09:25.180
machine learning.

09:25.340 --> 09:28.580
And we want to get the most relevant documents.

09:28.900 --> 09:32.500
We want to perform similarity search in the Pinecone vector store.

09:32.740 --> 09:36.100
So we're going to be using our retriever.invoke.

09:36.380 --> 09:41.220
So retriever right over here is actually a LangChain Runnable.

09:41.500 --> 09:48.900
So if I actually go over here, this is going to return us a VectorStoreRetriever object.

09:48.980 --> 09:53.500
This object is inheriting from BaseRetriever, and BaseRetriever

09:53.500 --> 09:55.100
here is a Runnable.

09:55.500 --> 09:58.660
So this is going to have the invoke method here.

09:58.700 --> 10:05.580
Now when we're going to be using a VectorStoreRetriever in LangChain, because it has an invoke method,

10:05.580 --> 10:09.260
its invoke method is going to run something.

10:09.260 --> 10:14.820
It's going to run a bunch of things, but eventually it's going to run this function, get relevant

10:14.820 --> 10:21.140
documents, which is going to take our query, and it's going to return us a list of documents which

10:21.140 --> 10:23.540
are going to be the most relevant documents.

10:23.780 --> 10:25.780
So these get relevant documents.

10:25.780 --> 10:28.820
This is eventually going to be implemented by the vendor.

10:28.940 --> 10:31.380
Pinecone has its own way of doing things.

10:31.380 --> 10:33.260
It's going to use the Pinecone SDK.

10:33.300 --> 10:37.900
If we're going to be using Chroma it's going to be using another implementation.

10:37.900 --> 10:40.500
And this is eventually what's going to be running.
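
NOTE
The idea that invoke is a shared entry point which ends up calling a backend-specific
search can be sketched with a toy class hierarchy. Everything here is hypothetical:
ToyRetriever and KeywordRetriever are stand-ins, not LangChain classes, and the keyword
match replaces the real vector similarity search purely so the sketch runs offline.
```python
from abc import ABC, abstractmethod
#
class ToyRetriever(ABC):
    # Hypothetical base class: invoke() is the shared entry point,
    # and each backend supplies its own search.
    def invoke(self, query: str) -> list[str]:
        return self._get_relevant_documents(query)
    @abstractmethod
    def _get_relevant_documents(self, query: str) -> list[str]:
        ...
#
class KeywordRetriever(ToyRetriever):
    # Stand-in "vendor" backend: naive keyword match instead of vector search.
    def __init__(self, texts: list[str], k: int = 3) -> None:
        self.texts = texts
        self.k = k
    def _get_relevant_documents(self, query: str) -> list[str]:
        words = set(query.lower().split())
        hits = [t for t in self.texts if words & set(t.lower().split())]
        # Return at most k results, mirroring the k=3 limit from the video.
        return hits[: self.k]
#
retriever = KeywordRetriever(["pinecone stores vectors", "chroma runs locally", "bananas are yellow"])
found = retriever.invoke("what is pinecone")
print(found)
```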

10:40.620 --> 10:44.380
And we have also an async version of getting the relevant documents.

10:44.380 --> 10:50.660
But eventually when we are going to be running this invoke method, it's going to be going and doing

10:50.660 --> 10:50.940
this.

10:50.940 --> 10:53.940
And we're going to see this in the LangSmith trace by the way.

10:53.940 --> 10:55.260
And let's now assume.

10:55.260 --> 11:01.590
So this is going to magically give us the most relevant documents from our original query here.

11:01.590 --> 11:05.070
And because we initialized it with k equals three.

11:05.390 --> 11:10.030
So this means that this is going to be a list of three LangChain documents.

11:10.270 --> 11:12.750
So we have now the relevant documents.

11:12.950 --> 11:17.390
Next we want to go and let's format them into a string.

11:17.670 --> 11:20.110
So we're going to be using the format_docs function.

11:20.110 --> 11:24.510
And it's going to give us here the context which is going to be a string.

11:24.550 --> 11:24.910
Cool.

11:24.910 --> 11:29.230
So right now we have the question we have the context for this question.

11:29.390 --> 11:32.710
So now we want to go and we want to take this prompt.

11:32.710 --> 11:37.630
And we simply want to plug the value of the context and the value of the question here.

11:37.830 --> 11:40.270
And we want to go and send this to the LLM.

11:40.310 --> 11:43.350
So now we are taking the prompt template.

11:43.390 --> 11:45.550
We are going to use format_messages.

11:45.550 --> 11:49.230
And here we're going to provide it with context equals context

11:49.230 --> 11:50.990
and question equals query.

11:51.390 --> 11:54.710
So question here is matching

11:54.710 --> 12:00.470
this placeholder here, and the query is the value that we're going to be giving it.

12:00.790 --> 12:07.590
So now this is going to give us a list of messages holding one message, which is going to be everything

12:07.590 --> 12:09.110
we want to send to the LLM.

12:09.350 --> 12:16.030
And now let's go and invoke our LLM with the list of messages, which is going to be one message.

12:16.190 --> 12:20.150
And we simply want to return the content of that response here.

12:20.270 --> 12:26.030
So this is the entire implementation, the manual implementation of the RAG pipeline over here.
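
NOTE
The four manual steps (retrieve, format, fill the prompt, call the model) can be sketched
end to end with stand-ins. This is not the code from the video: fake_retrieve and fake_llm
are hypothetical placeholders for the vector store retriever and the chat model, and the
function name and prompt wording are assumptions. It only shows the order of the calls.
```python
from typing import Callable
#
def retrieval_chain_manual(
    query: str,
    retrieve: Callable[[str], list[str]],
    llm: Callable[[str], str],
) -> str:
    """Run the RAG steps by hand, one plain function call after another."""
    # Step 1: fetch the most relevant chunks for the query.
    docs = retrieve(query)
    # Step 2: format them into one context string.
    context = "\n\n".join(docs)
    # Step 3: fill the prompt with the context and the question.
    prompt = (
        "Answer the question based only on the following context:\n"
        f"{context}\n"
        f"Question: {query}\n"
        "Provide a detailed answer:"
    )
    # Step 4: send the augmented prompt to the model and return its text.
    return llm(prompt)
#
def fake_retrieve(query: str) -> list[str]:
    # Stand-in for the vector store retriever.
    return ["Pinecone is a managed vector database."]
#
def fake_llm(prompt: str) -> str:
    # Stand-in for the chat model: echoes the prompt so the flow is visible.
    return "ECHO: " + prompt
#
answer = retrieval_chain_manual("What is Pinecone in machine learning?", fake_retrieve, fake_llm)
print(answer)
```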

12:26.230 --> 12:26.750
All right.

12:26.750 --> 12:33.470
And now let me paste in the snippet which is simply going to print that we want to invoke the implementation

12:33.470 --> 12:35.350
without the expression language.

12:35.590 --> 12:37.870
We're going to be taking this function.

12:38.030 --> 12:41.710
And we're going to be running it with the user's query.

12:42.070 --> 12:43.870
And we're going to be printing

12:43.870 --> 12:46.430
the result now. Nothing special over here.

12:46.430 --> 12:46.990
All right.

12:47.030 --> 12:47.710
Let me now.

12:47.750 --> 12:50.350
So let's go now and run everything.

12:54.720 --> 12:59.480
So this is the raw invocation here we see we get the wrong answer.

12:59.480 --> 13:02.080
And let's see the question answered with RAG.

13:02.320 --> 13:08.040
And here we can see Pinecone is a fully managed cloud-based vector database specifically designed for

13:08.040 --> 13:14.200
businesses and organizations looking to build and deploy large scale machine learning applications.

13:14.520 --> 13:18.160
So this is the answer that we wanted, working with RAG here.

13:18.160 --> 13:20.160
So now I want to show you in debug.

13:20.280 --> 13:22.640
And let's see step by step what's happening here.

13:22.920 --> 13:24.880
So let me go to the function here.

13:25.160 --> 13:27.080
And let me now put a breakpoint.

13:30.200 --> 13:31.720
Let's go run it in debug.

13:36.120 --> 13:36.680
All right.

13:36.680 --> 13:41.160
So here we can see we have now the query, what is Pinecone in machine learning.

13:42.000 --> 13:44.080
Let me now go and execute this.

13:50.280 --> 13:51.600
So here we can see.

13:51.640 --> 13:54.360
Now we have the documents variable,

13:54.610 --> 13:58.690
and it's going to contain a list of three LangChain documents.

13:59.090 --> 14:06.330
And every document here is going to have its page content attribute here, which is going to be the

14:06.370 --> 14:08.010
data of this document here.

14:08.010 --> 14:13.210
So these are the most relevant documents that Pinecone found for us.

14:13.410 --> 14:19.650
Now we want to take those documents and now we want to format them as strings.

14:19.850 --> 14:23.250
So let's go and run this format docs function.

14:24.090 --> 14:26.530
And now we have the context variable.

14:26.530 --> 14:29.250
And now we can see this is a string okay.

14:29.290 --> 14:32.290
So now let me copy here this expression.

14:32.290 --> 14:34.090
Let me open it here and let me paste it.

14:34.090 --> 14:38.490
And we can see it's a very big string of all the documents combined.

14:38.730 --> 14:44.770
So now we want to go and we want to initialize the prompt template with the context of the documents.

14:44.770 --> 14:46.370
Here we can see all the documents.

14:46.730 --> 14:49.490
And we want now to give it the query.

14:49.490 --> 14:52.010
And we want to initialize it with the query.

14:52.050 --> 14:54.490
What is Pinecone in machine learning.

14:54.730 --> 14:55.930
Let's go and do that.

14:56.490 --> 14:58.370
Let's go and look for messages.

14:58.370 --> 15:02.490
So it gave us a messages list with one message.

15:02.810 --> 15:06.450
And its content is the prompt we had before.

15:06.570 --> 15:09.730
Answer the question based only on the following context.

15:09.730 --> 15:13.210
And then it pasted the context and gave the question here.

15:13.490 --> 15:14.090
Cool.

15:14.090 --> 15:17.810
And let's go now and invoke the LLM.

15:17.970 --> 15:21.130
And this is a regular LLM invocation right now.

15:21.570 --> 15:23.890
And we get here a response.

15:23.890 --> 15:25.650
Let's go and check out this response.

15:25.650 --> 15:30.690
The response is an AI message and its content is the answer.

15:30.850 --> 15:34.090
So now I want to show you the LangSmith trace.

15:34.570 --> 15:40.050
And because here in this function we did not really use a LangChain chain,

15:40.210 --> 15:47.090
and we did not use the LangChain Expression Language, we simply used all of LangChain's components

15:47.130 --> 15:48.010
separately.

15:48.290 --> 15:56.020
So all the traces are not going to be organized under one component of a chain, and this is not convenient

15:56.020 --> 15:56.420
at all.

15:56.460 --> 15:59.620
Okay, so now let's go and review the traces here.

15:59.620 --> 16:03.500
So let's take the first step of retrieving the relevant documents.

16:03.900 --> 16:08.900
So here we can see we have a vector store retrieval right over here.

16:09.300 --> 16:12.780
And this is the run of this line over here.

16:13.620 --> 16:17.660
So here we can see the input is what is Pinecone in machine learning.

16:17.900 --> 16:24.980
When we ran invoke, the output we got was those three documents of the data that are the most

16:24.980 --> 16:26.180
relevant documents.

16:26.460 --> 16:33.180
So this was the similarity search that Pinecone did for us under the hood here and here we can see we

16:33.180 --> 16:34.700
have the text.

16:34.700 --> 16:36.540
So this is the text of the document.

16:36.540 --> 16:41.180
And here we even have the source which I did not show you in the debug console.

16:41.180 --> 16:46.020
So now after we did that, we went and formatted them nicely.

16:46.020 --> 16:48.020
This has nothing to do with LangSmith.

16:48.020 --> 16:50.780
And here we simply populated the prompt template.

16:50.780 --> 16:54.900
And this is not going to show in LangSmith as well because it's not under the chain.

16:55.140 --> 17:01.380
And finally, we invoked the LLM with all the relevant data in the augmented prompt.

17:01.380 --> 17:04.780
So we can see the LLM invocation right over here.

17:05.060 --> 17:09.300
So the final call to the LLM is the original prompt.

17:09.460 --> 17:11.140
Here we have the context.

17:11.340 --> 17:14.300
And here we have the question that we plugged in.

17:14.340 --> 17:18.180
And this is the answer that the LLM returned to us.

17:18.420 --> 17:23.460
So we can see tracing this and debugging this is going to be a bit harder.

17:23.620 --> 17:24.340
All right.

17:24.340 --> 17:30.660
So what we did now in the implementation is implement everything very very naively.

17:30.860 --> 17:33.780
And it has a bunch of disadvantages.

17:33.780 --> 17:38.340
But the main disadvantage is that it's going to be very hard to trace it.

17:38.380 --> 17:44.820
So in the next video we're going to be implementing the same functionality but binding everything under

17:44.820 --> 17:48.020
a LangChain chain with the LangChain Expression Language.

17:48.020 --> 17:50.820
And this is going to be a much better implementation.

17:50.820 --> 17:53.340
And we're going to see all the reasons why it's better.