WEBVTT

00:00.320 --> 00:05.400
Let's go to the top of this file here, and let's create another function.

00:05.400 --> 00:09.800
And let me paste this snippet over here for starters.

00:10.200 --> 00:12.840
And we want to define a coroutine.

00:13.280 --> 00:16.200
So we'll call it index_documents_async.

00:16.440 --> 00:20.720
And the input is going to be a list of documents, LangChain documents.

00:20.720 --> 00:23.880
It's also going to receive an integer batch size.

00:24.400 --> 00:27.560
And this coroutine is going to take all the documents.

00:27.560 --> 00:32.960
And it's going to batch index them into the vector store that we're going to be using.

00:33.320 --> 00:34.760
So we'll log that.

00:34.760 --> 00:36.720
This is the vector storage phase.

00:37.000 --> 00:42.360
And we'll log how many documents we are now going to index.

00:43.440 --> 00:50.360
We want to create a variable called batches, which is going to be a list that contains lists of documents.

00:50.800 --> 00:57.320
And we're going to split the list into batches with the batch size.

00:57.320 --> 00:59.800
So this is what this line is going to do here.
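
NOTE
A minimal sketch of the batching step described above, assuming plain list slicing.
The helper name split_into_batches is hypothetical; the video's exact line is not shown in the transcript.
```python
from typing import List, Sequence, TypeVar
#
T = TypeVar("T")
#
def split_into_batches(items: Sequence[T], batch_size: int) -> List[List[T]]:
    """Slice items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Step through the sequence in strides of batch_size; the last batch may be shorter.
    return [list(items[i : i + batch_size]) for i in range(0, len(items), batch_size)]
#
# Seven items with a batch size of three give two full batches and one partial batch.
batches = split_into_batches(list(range(7)), batch_size=3)
```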

01:00.240 --> 01:03.780
After that we want to log how many batches we have.

01:04.820 --> 01:06.740
So let's go and do that.

01:09.820 --> 01:17.140
And let me paste here another coroutine, nested inside this coroutine, which is called add_batch.

01:17.580 --> 01:24.140
And this coroutine is going to receive a batch which is a list of documents.

01:24.380 --> 01:29.420
So you can probably expect that we're going to be calling this function with each batch

01:29.420 --> 01:29.820
from above.

01:30.220 --> 01:30.660
Anyways.

01:30.660 --> 01:33.780
The second argument here is going to be the batch number.

01:33.980 --> 01:36.900
And we need this batch number for logging.

01:36.900 --> 01:40.420
So in case one batch fails we'll know exactly which batch fails.

01:40.420 --> 01:42.140
And what the issue was.

01:42.380 --> 01:47.380
Maybe it had some invalid documents, or it failed for any other reason.

01:47.900 --> 01:51.900
And in this function we want to take the vector store.

01:52.140 --> 01:54.620
And here I'm going to be using Pinecone.

01:54.620 --> 01:59.980
And we want to call and await this coroutine, aadd_documents.

02:00.300 --> 02:02.460
And we want to call it with this batch.

02:03.540 --> 02:09.720
And this is going to be implemented by LangChain, and it's going to take all the documents.

02:09.720 --> 02:15.600
It's going to use the embeddings model and transform each document into a vector with the embeddings

02:15.600 --> 02:16.120
model.

02:16.360 --> 02:20.080
And then it's going to index it into the vector store.

02:22.600 --> 02:27.320
And after we do that we want to log success in case we didn't get an exception.

02:27.320 --> 02:34.240
And if we get an exception we want to simply log it and we want to return false.

02:34.680 --> 02:37.960
And if everything succeeded we want to return true.

02:38.080 --> 02:40.040
That tells us we managed to add the batch.
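
NOTE
A rough sketch of the add_batch pattern described above: await the store, log, and return a boolean instead of raising.
Assumptions: FakeVectorStore is a hypothetical stand-in for the real vector store, documents are plain strings,
and the store is passed as an argument here, whereas in the video add_batch is nested and uses the outer vector store.
```python
import asyncio
import logging
from typing import List
#
logger = logging.getLogger("ingestion")
#
class FakeVectorStore:
    """Hypothetical stand-in for a real vector store; it only records what it receives."""
    def __init__(self) -> None:
        self.stored: List[str] = []
    async def aadd_documents(self, documents: List[str]) -> None:
        # Simulate a rejected batch when any document is empty.
        if any(not doc for doc in documents):
            raise ValueError("empty document in batch")
        self.stored.extend(documents)
#
async def add_batch(vectorstore: FakeVectorStore, batch: List[str], batch_num: int) -> bool:
    """Index one batch and report success as a boolean."""
    try:
        await vectorstore.aadd_documents(batch)
    except Exception as exc:
        # The batch number in the log tells us exactly which batch failed and why.
        logger.error("Batch %d failed: %s", batch_num, exc)
        return False
    logger.info("Batch %d indexed (%d documents)", batch_num, len(batch))
    return True
#
store = FakeVectorStore()
ok = asyncio.run(add_batch(store, ["a", "b"], 1))
failed = asyncio.run(add_batch(store, ["c", ""], 2))
```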

02:41.160 --> 02:45.240
And we're going to process these add_batch calls concurrently.

02:45.440 --> 02:47.320
So this is what we're going to do here.

02:47.560 --> 02:50.320
We want to create a variable of tasks.

02:50.320 --> 02:53.320
And here we're going to enumerate over the batches.

02:53.320 --> 02:56.400
And we're going to get a number for every batch here.

02:56.760 --> 03:03.640
And once we have the batch and the batch number we want to create for each batch, a coroutine of the

03:03.640 --> 03:08.120
add_batch, which is going to receive the batch and the batch number.

03:08.380 --> 03:15.540
So once we do that, tasks is going to hold a list of coroutines of all the batches that we need to

03:15.580 --> 03:18.100
index and we want to process.

03:18.860 --> 03:24.220
And once we do that we can call asyncio.gather with the tasks.

03:24.540 --> 03:26.620
And we're going to await that.

03:26.620 --> 03:30.060
And that's going to fire up all the coroutines concurrently.

03:30.220 --> 03:32.380
And they are going to run independently.

03:32.380 --> 03:37.220
And they're going to index all the documents into our vector store.

03:37.380 --> 03:40.180
So this is going to optimize our runtime.

03:40.460 --> 03:45.060
The results variable over here is going to hold a list of booleans.

03:45.300 --> 03:47.860
And hopefully all of them are going to be true.

03:47.900 --> 03:49.500
If all of them succeeded.

03:49.500 --> 03:52.780
And if we had some failure, it's going to hold False for that batch.
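
NOTE
A small sketch of the concurrent step described above, using asyncio.gather over one coroutine per batch.
Assumptions: this add_batch is a hypothetical stand-in that sleeps instead of calling a vector store, and index_all is an illustrative name.
```python
import asyncio
from typing import List
#
async def add_batch(batch: List[str], batch_num: int) -> bool:
    """Hypothetical stand-in: pretend to index a batch, failing on empty ones."""
    await asyncio.sleep(0.01)  # simulate network latency
    return len(batch) > 0
#
async def index_all(batches: List[List[str]]) -> List[bool]:
    # One coroutine per batch, numbered from 1 for readable logs.
    tasks = [add_batch(batch, i) for i, batch in enumerate(batches, start=1)]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*tasks)
#
results = asyncio.run(index_all([["a", "b"], [], ["c"]]))
```
Because each add_batch catches its own exceptions and returns a boolean, one failing batch does not cancel the others.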

03:54.020 --> 03:57.300
And let me go and paste this snippet over here.

03:57.820 --> 04:02.420
And here we simply want to count how many successful batches we had.

04:02.660 --> 04:08.460
And we want to make sure that the number of batches that were successful is equal to the number of

04:08.460 --> 04:09.100
batches.

04:09.500 --> 04:12.160
If so, we'll log that there is a success.

04:12.160 --> 04:16.240
If not, we're simply going to log a warning.
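
NOTE
A sketch of the success count described above, assuming results is the list of booleans returned by the gather call.
The function name summarize and the log messages are illustrative, not the ones pasted in the video.
```python
import logging
from typing import List
#
logger = logging.getLogger("ingestion")
#
def summarize(results: List[bool]) -> int:
    """Count successful batches and log a success message or a warning."""
    successful = sum(1 for ok in results if ok is True)
    if successful == len(results):
        logger.info("All %d batches indexed successfully", len(results))
    else:
        logger.warning("Indexed %d of %d batches", successful, len(results))
    return successful
#
count = summarize([True, False, True])
```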

04:17.280 --> 04:20.400
And let's go to the main function here.

04:20.560 --> 04:23.040
And let's await index_documents_async.

04:23.040 --> 04:26.200
And let's call it with all the split documents.

04:26.360 --> 04:29.440
And we'll give it the batch size of 500.

04:29.760 --> 04:35.320
Now remember depending on which vector store we are going to be using, we will need to adjust this

04:35.320 --> 04:39.080
number here, the batch size.

04:39.080 --> 04:40.680
It's not a magic number.

04:40.960 --> 04:46.440
We need to find the sweet spot of making this number big enough but not too big.

04:46.440 --> 04:51.800
Because if it's going to be too big, we're going to be rate limited and we can be rate limited from

04:51.800 --> 04:53.000
the embeddings model.

04:53.000 --> 05:00.600
So the embeddings model has a tokens-per-minute or tokens-per-second limit that we do not want

05:00.640 --> 05:01.800
to exceed.

05:02.040 --> 05:05.360
And we can even have a vector store limitation.

05:05.360 --> 05:11.040
So the vector store, which is going to be cloud based, usually also has a limit on how much it can process

05:11.080 --> 05:15.660
per minute or per second, but usually we do not hit these limits.

05:15.700 --> 05:19.100
The main limitation here is going to be our embeddings model.

05:19.580 --> 05:23.100
And I'm going to be using Pinecone here as the vector store.

05:24.180 --> 05:26.700
Lastly, let's go and add a bunch of logs.

05:26.700 --> 05:29.100
Once the pipeline is completed, they will tell us

05:29.140 --> 05:31.300
how many documents we scraped,

05:31.940 --> 05:33.620
how many URLs we had,

05:33.660 --> 05:35.340
and how many chunks we indexed.

05:35.340 --> 05:37.660
And yeah, we want all those stats.

05:38.140 --> 05:38.580
Cool.

05:38.620 --> 05:43.060
Let's go now and run all this code here and let's see what we get.

05:44.180 --> 05:47.820
I also want to open Pinecone so we can see everything in real time.

05:47.820 --> 05:50.820
And we can watch the documents being indexed live.

05:51.740 --> 05:56.020
So right now we can see we don't have any documents in the vector store.

05:56.620 --> 06:00.460
And let's wait until we get the documentation.

06:00.660 --> 06:03.820
And right now I'm using the code from the optional video.

06:04.060 --> 06:05.900
I'm getting all of it manually.

06:06.380 --> 06:07.940
So we got the documentation.

06:07.940 --> 06:11.660
We ran Tavily map and Tavily extract, and now we're chunking everything.

06:12.180 --> 06:18.760
After chunking, we want to take all of the splits and create batches.

06:19.040 --> 06:23.240
And each batch is going to be indexed in a different call.

06:23.560 --> 06:27.320
So now we can see we indexed one batch.

06:27.680 --> 06:29.040
Let me refresh the page.

06:29.040 --> 06:36.000
And it might take a couple of seconds until we see the actual vectors in the vector store.

06:36.360 --> 06:38.200
So let me wait for a second.

06:40.320 --> 06:42.800
Let me go and refresh it one more time.

06:54.400 --> 06:58.280
And right now we can see we have the vectors which are ingested.

06:58.520 --> 07:03.560
So we have the actual content and we have the source and we have the vector ID.

07:04.280 --> 07:08.960
Now it's also important to note that we have also the original text.

07:09.440 --> 07:16.920
And this is important to save because there is no backwards function from a vector to the text that

07:16.920 --> 07:18.020
it represents.

07:18.260 --> 07:23.340
So the embeddings function doesn't have an inverse function that corresponds to it.

07:23.700 --> 07:30.500
And if I'm going to geek out for a moment and remember my math classes in university, this is a one

07:30.500 --> 07:32.500
way function, the embeddings function.

07:32.940 --> 07:35.220
So it's important to save also the text.

07:35.220 --> 07:39.180
So when we get back the query results from the vector store,

07:39.180 --> 07:41.980
so during the retrieval, we're getting the text back.

07:41.980 --> 07:43.780
So we're not only getting the vector.

07:43.980 --> 07:46.060
So this is something very important to note.
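
NOTE
A toy illustration of why the original text is stored next to the vector: retrieval ranks by vector similarity, then hands back the saved text.
Assumptions: toy_embed is a made-up letter-count embedding and Record is a hypothetical record layout, not Pinecone's actual schema.
```python
import math
from dataclasses import dataclass
from typing import List
#
@dataclass
class Record:
    vector: List[float]
    text: str    # original chunk, kept because the vector cannot be turned back into it
    source: str
#
def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: counts of the letters a, b, c. Many texts share one vector."""
    return [float(text.count(ch)) for ch in "abc"]
#
def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
#
records = [Record(toy_embed(t), t, s) for t, s in [("aab", "doc1"), ("bcc", "doc2")]]
query_vector = toy_embed("aa")
# Rank by similarity on the vectors, then read the text from the winning record.
best = max(records, key=lambda r: cosine(query_vector, r.vector))
```
Here "ab" and "ba" map to the same toy vector, so nothing could recover the text from the vector alone.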

07:47.460 --> 07:55.660
And we can see that we managed to index 6506 documents.

07:56.300 --> 08:02.380
And you can see here on the right side the number of documents that are ingested in Pinecone.

08:02.620 --> 08:04.300
So the numbers should match.

08:04.420 --> 08:09.780
And if you don't have the same number, then either some of the batches failed or you need to wait a couple

08:09.820 --> 08:11.580
of moments until it finishes syncing.

08:12.420 --> 08:14.820
Alrighty, I want to show you something else here.

08:14.860 --> 08:18.540
Now, remember when we talked about retry_min_seconds?

08:18.740 --> 08:22.680
So let me go and remove these arguments from our embeddings

08:22.720 --> 08:26.600
object of LangChain, and let's see what happens.

08:26.600 --> 08:33.640
And I want to show you that if we remove it in this constellation of all of our documents, we're going

08:33.640 --> 08:34.840
to be rate limited.

08:35.240 --> 08:36.960
So you can see I removed it.

08:36.960 --> 08:41.000
Now let me run everything again and let me show you the error that we get.

08:43.760 --> 08:48.600
So we're expecting to get this error once we have all the documents.

08:48.600 --> 08:50.760
And we are going to index them.

08:50.760 --> 08:56.160
Because in that process we are going to embed every document and turn it into a vector.

08:56.160 --> 08:58.600
And that's where we're going to hit the rate limit.

08:58.720 --> 09:05.080
And because we don't have the retry mechanism, I mean we do, but we have the default which has a lower

09:05.080 --> 09:05.600
value.

09:05.840 --> 09:07.880
Then we're going to get rate limited.
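
NOTE
A generic sketch of the retry idea mentioned above: wait between attempts, doubling the delay up to a cap.
Assumptions: this is not LangChain's or OpenAI's actual retry code; RateLimitError, with_retries, and flaky_embed are hypothetical,
and the delays are tiny so the example finishes quickly.
```python
import asyncio
from typing import Awaitable, Callable, TypeVar
#
T = TypeVar("T")
#
class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""
#
async def with_retries(
    call: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    min_seconds: float = 0.01,
    max_seconds: float = 0.1,
) -> T:
    """Retry call on RateLimitError, doubling the wait up to max_seconds."""
    delay = min_seconds
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, surface the rate limit error
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_seconds)
    raise AssertionError("unreachable")
#
attempts = 0
async def flaky_embed() -> str:
    """Fails twice with a simulated 429, then succeeds."""
    global attempts
    attempts += 1
    if attempts < 3:
        raise RateLimitError("429")
    return "embedded"
#
result = asyncio.run(with_retries(flaky_embed))
```
A minimum wait that is too short uses up the retries before the rate limit window resets, which is the failure the video goes on to show.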

09:09.320 --> 09:09.760
All right.

09:09.760 --> 09:12.560
So now we are indexing the documents.

09:12.560 --> 09:17.320
So the first step is to turn each document into a vector.

09:17.680 --> 09:19.640
And let's wait and see what happens.

09:21.320 --> 09:25.540
And boom, we get an error here, so we can see we failed.

09:25.540 --> 09:31.460
We can see the batch number and we can see we get a 429, which is a rate limit error.

09:31.740 --> 09:38.100
And we can see that the rate limit came from us trying to embed with text-embedding-3-small here.

09:38.300 --> 09:43.500
By the way, in each error message we can see how much time we need to keep waiting.

09:44.580 --> 09:46.740
All right let me go and bring that back.

09:46.740 --> 09:50.540
And now let me use ChromaDB instead of Pinecone.

09:50.700 --> 09:55.140
So I'm going to rename the Chroma variable to be vector store.

09:55.540 --> 09:58.180
And everything should run the same here.

09:58.380 --> 10:04.940
And notice that when we start running it, we're going to create a ChromaDB directory.

10:05.100 --> 10:07.220
And you can see it on the left side here.

10:07.420 --> 10:11.460
And we're going to start indexing there.

10:15.140 --> 10:18.980
And we can see now that Chroma is actually using an SQLite DB.

10:20.220 --> 10:20.700
All right.

10:20.700 --> 10:23.460
So now we can see we're still getting rate limited.

10:23.460 --> 10:29.440
And this is because we haven't really waited the time we needed to wait in between the runs.

10:29.640 --> 10:30.040
So.

10:30.400 --> 10:35.160
Don't mind that, but let me show you that we actually go and index everything in Chroma this

10:35.160 --> 10:36.560
time, and not in Pinecone.

10:39.480 --> 10:40.080
All right.

10:40.080 --> 10:46.840
So we can see that here we managed to index some of the batches.

10:46.840 --> 10:47.640
Some failed.

10:47.920 --> 10:51.480
But anyway, everything is saved here in ChromaDB.

10:51.640 --> 10:53.280
If we want we can use it as well.

10:53.840 --> 10:57.240
It's persistent because it's saved in the ChromaDB directory.

10:57.920 --> 11:02.880
Alrighty, so I hope you enjoyed the video and you enjoyed the ingestion pipeline.

11:02.880 --> 11:06.160
And the next step is to go and do the retrieval.

11:06.320 --> 11:07.800
So let's go and do that.

11:08.240 --> 11:11.920
And by the way, all the videos that you saw in this section are actually new.

11:12.120 --> 11:15.800
So right now I used a different index name.

11:15.800 --> 11:22.480
My index name in Pinecone was langchain-docs-2025, and I'm going to have a different index name in

11:22.480 --> 11:23.480
the rest of the videos.

11:23.480 --> 11:26.160
So just giving you a heads up.

11:26.760 --> 11:27.200
Cool.

11:27.240 --> 11:28.480
So see you in the next video.