WEBVTT

00:00.120 --> 00:00.680
Hey there.

00:00.720 --> 00:01.440
Eden here.

00:01.440 --> 00:08.360
And as you know, before we implement an advanced RAG solution, we first need to index our documents

00:08.360 --> 00:09.480
into a vector store.

00:09.760 --> 00:16.240
So in this video we're going to be implementing the ingestion.py file, where we're going to load articles

00:16.240 --> 00:18.080
into LangChain documents.

00:18.080 --> 00:20.880
We're going to chunk them up into smaller pieces.

00:21.160 --> 00:26.880
And then we're going to embed them and store them in ChromaDB, an open source vector store.

00:27.480 --> 00:34.600
And one quick disclaimer here: in a GenAI application, and specifically in a RAG-based application (and

00:34.600 --> 00:38.200
here we're implementing a very advanced version of RAG),

00:38.640 --> 00:42.560
the ingestion pipeline can have a lot of optimizations in it.

00:42.600 --> 00:49.240
We can optimize every step of this pipeline, from loading the documents, through transforming and chunking

00:49.280 --> 00:53.360
them, up until we embed them, depending on the model we're using.

00:53.560 --> 00:59.360
And in this project, we're going to be focused on the retrieval part rather than the ingestion.

00:59.360 --> 01:04.970
So in the ingestion part, I'm simply going to go back to the defaults and I'm not going to do anything

01:04.970 --> 01:05.570
special.

01:05.610 --> 01:09.170
Simply chunk up the documents and load them into the vector store.

01:09.570 --> 01:15.690
However, in the retrieval part we're going to be implementing, with LangGraph, very advanced techniques

01:15.690 --> 01:16.890
for retrieval.

01:17.570 --> 01:21.770
Alrighty, let's go to the code and let's start with our imports.

01:22.050 --> 01:26.450
We want to import load_dotenv to load the environment variables.

01:26.730 --> 01:31.730
We're going to be using the RecursiveCharacterTextSplitter to split up our documents.

01:32.050 --> 01:36.730
We're going to use WebBaseLoader to load the documents from the internet.

01:37.090 --> 01:40.930
And we want to use Chroma as our vector store.

01:41.450 --> 01:45.850
And let's import OpenAIEmbeddings for the embeddings module.

01:46.170 --> 01:49.170
Finally, I want to load the environment variables.

01:49.650 --> 01:55.050
And let's create a list of URLs, which is going to be the URLs that we're going to be scraping.

01:55.290 --> 02:01.010
So we're going to be scraping these kinds of articles about generative AI.

02:01.050 --> 02:10.130
And this one is discussing autonomous agents and all the cool topics of memory and planning and reasoning.

02:10.290 --> 02:12.130
So this is one article.

02:13.330 --> 02:16.410
The other article is about prompt engineering.

02:18.370 --> 02:25.330
So we can see we have all the prompt engineering techniques: zero-shot,

02:25.370 --> 02:34.210
few-shot, chain of thought, ReAct, etc. And the last one is going to be about adversarial attacks on

02:34.210 --> 02:35.450
LLM security.

02:35.730 --> 02:39.650
So how to hack, prompt hacking, all of those.

02:39.970 --> 02:42.930
And those are the articles we're going to be ingesting.

02:44.530 --> 02:45.170
All right.

02:45.170 --> 02:46.410
Let's go to the code.

02:46.410 --> 02:51.850
And let's first start by loading the URLs into LangChain documents.

02:52.210 --> 02:54.410
So I'm going to use the WebBaseLoader.

02:54.450 --> 02:57.690
I'm going to plug in the URLs for each URL.

02:58.180 --> 03:01.460
And now I'm going to have a list of LangChain documents.

03:03.020 --> 03:11.100
So if we run this in debug just to examine the output of this, we can evaluate this expression.

03:12.060 --> 03:19.620
And we'll see that what we get back is a list where every element is itself a list

03:19.620 --> 03:21.740
that contains only one document.

03:21.740 --> 03:26.860
That's the content of that URL loaded into a LangChain document.

03:27.220 --> 03:30.180
So we want to flatten that list.

03:30.180 --> 03:32.380
So let's go and do that.

03:33.860 --> 03:37.180
So we're going to iterate through the docs.

03:37.180 --> 03:40.700
And each item here is going to be a sublist.

03:40.700 --> 03:46.180
And each item in that sublist is going to be the document that we want.

03:46.220 --> 03:48.940
So eventually we'll get the document list.
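
NOTE
A minimal sketch of the flattening step described above, assuming each loader call returned a one-element list. FakeDocument is a hypothetical stand-in for a LangChain document so the sketch runs with no packages installed, and the contents are placeholders.
```python
from dataclasses import dataclass
@dataclass
class FakeDocument:
    # Hypothetical stand-in for a LangChain document (assumption, not the real class).
    page_content: str
# Shape described in the video: one sublist per URL, one document per sublist.
docs = [
    [FakeDocument("agents article")],
    [FakeDocument("prompt engineering article")],
    [FakeDocument("adversarial attacks article")],
]
# The outer loop walks the sublists, the inner loop pulls out each document.
docs_list = [item for sublist in docs for item in sublist]
print(len(docs_list))  # 3
```
The nested comprehension keeps the original order, so the three articles stay in the order the URLs were listed.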

03:49.740 --> 03:55.380
All right so let's debug and check out that we do have a document list.

03:56.060 --> 03:59.460
And we want to evaluate this expression over here.

04:00.380 --> 04:03.380
And we can see we indeed have a list with three elements.

04:03.380 --> 04:06.100
And each one is a LangChain document.

04:07.340 --> 04:07.780
Cool.

04:08.340 --> 04:10.420
So we've loaded our content.

04:10.460 --> 04:13.220
Now we want to split it up into chunks.

04:14.340 --> 04:18.820
And now we want to use the RecursiveCharacterTextSplitter.

04:18.820 --> 04:21.180
We'll use the from_tiktoken_encoder method.

04:21.660 --> 04:26.500
And we want our chunk size to be 250 with no overlap.

04:27.340 --> 04:31.020
So just some basic text splitting definitions.

04:31.420 --> 04:37.340
And lastly, we want to use the text splitter to split the documents in the document list.

04:37.500 --> 04:41.860
So eventually we'll get from it a list of smaller chunks.
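
NOTE
A hedged sketch of the splitting step, using the chunk size of 250 and zero overlap mentioned above. The import path langchain_text_splitters and the function name split_documents are assumptions; the video's file may import the splitter from a different module. The tiktoken package must be installed for the real call to work.
```python
CHUNK_SIZE = 250  # measured in tokens, as stated in the video
CHUNK_OVERLAP = 0  # no overlap between neighbouring chunks
def split_documents(docs_list):
    """Split LangChain documents into token-bounded chunks."""
    # Imported lazily so this file loads even when the package is missing.
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
    )
    # Returns a flat list of smaller documents, one per chunk.
    return text_splitter.split_documents(docs_list)
```
Calling split_documents on the flattened document list should give back the list of chunks that gets embedded in the next step.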

04:42.780 --> 04:49.060
And let's run this in debug and see that we indeed have the documents chunked up.

04:50.940 --> 04:53.740
Let's go and evaluate this expression.

04:56.030 --> 05:00.430
And we can see we have a lot of chunks here, almost 200.

05:01.110 --> 05:01.590
Cool.

05:01.590 --> 05:07.590
So now we're ready to index them into ChromaDB, which is going to run locally on our machines.

05:09.070 --> 05:09.630
Alrighty.

05:09.630 --> 05:11.470
So we'll use the Chroma object.

05:11.470 --> 05:16.790
It's going to have a from_documents method, which is going to take the documents, which are going to be

05:16.790 --> 05:21.430
the chunks. We're going to set the collection name of our index to rag-chroma.

05:21.590 --> 05:24.350
And we want to use OpenAIEmbeddings.

05:24.350 --> 05:27.750
And I think this will default to the small embeddings 3 model.

05:27.750 --> 05:34.350
And now we want to persist our vector store to disk.

05:34.350 --> 05:37.950
So we're going to add the argument of persist directory.

05:37.950 --> 05:41.230
And I'm going to give it ./.chroma.

05:41.230 --> 05:47.150
So it will persist in that location under the root directory over here after it's done indexing.

05:47.350 --> 05:52.510
And now we want to create a retriever object, a LangChain retriever, from that ChromaDB.

05:52.990 --> 06:00.110
And we'll initialize an object of the Chroma class and use the as_retriever method in order to turn it into

06:00.110 --> 06:01.310
a retriever.

06:01.310 --> 06:05.830
So we'll be able to perform similarity searches, and we want to load it from the disk.

06:05.830 --> 06:10.630
So we're going to give it the collection name we gave earlier and the persist directory.

06:10.950 --> 06:13.710
And of course we need to supply the embedding function.
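
NOTE
A hedged sketch of the indexing and retriever steps described above. The import paths langchain_chroma and langchain_openai, the function names, and the exact spelling of the collection name are assumptions. OpenAIEmbeddings needs an OPENAI_API_KEY in the environment when it is actually called.
```python
COLLECTION_NAME = "rag-chroma"  # spelling assumed from the spoken name
PERSIST_DIRECTORY = "./.chroma"  # directory given in the video
def build_index(doc_splits):
    """Embed the chunks and persist them to disk."""
    # Imported lazily so this file loads even when the packages are missing.
    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings
    return Chroma.from_documents(
        documents=doc_splits,
        collection_name=COLLECTION_NAME,
        embedding=OpenAIEmbeddings(),
        persist_directory=PERSIST_DIRECTORY,
    )
def load_retriever():
    """Reopen the persisted collection and wrap it as a retriever."""
    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings
    return Chroma(
        collection_name=COLLECTION_NAME,
        persist_directory=PERSIST_DIRECTORY,
        embedding_function=OpenAIEmbeddings(),
    ).as_retriever()
```
Keeping the two steps in separate functions means build_index can run once, after which only load_retriever is needed on later runs.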

06:14.470 --> 06:16.110
All right let's run it.

06:16.110 --> 06:22.390
And let's see that we indeed index our documents and we create this subdirectory.

06:22.550 --> 06:27.710
And we can see now we have the directory with the persistent storage.

06:27.910 --> 06:31.590
Now I'm going to comment this out because we don't want to index this

06:31.590 --> 06:33.230
every time we run the program.

06:33.230 --> 06:35.910
We simply want to load everything from the disk.

06:36.270 --> 06:42.550
And let's go and let's run it again to see that everything is working and no errors.

06:43.070 --> 06:49.950
And lastly, all of the code is available on the GitHub repository in the branch three ingestion.

06:50.350 --> 06:54.630
So feel free to compare it and to even use this code in the repository.