WEBVTT

00:00.300 --> 00:03.240
-: In this video we're gonna cover LangChain vector stores

00:03.240 --> 00:04.440
and also we're gonna have a look

00:04.440 --> 00:06.117
at LangChain's Indexing API

00:06.117 --> 00:09.000
and what this means when you're doing data ingestion

00:09.000 --> 00:11.010
into your vector databases.

00:11.010 --> 00:12.270
Firstly, you're gonna have to run

00:12.270 --> 00:14.190
a couple of Python imports for this

00:14.190 --> 00:16.860
and I'd recommend running this specific notebook

00:16.860 --> 00:20.460
on a local computer rather than inside Google Colab

00:20.460 --> 00:24.090
'cause we're also gonna run a Docker command.

00:24.090 --> 00:27.022
So you'll need to install langchain, langchain-openai,

00:27.022 --> 00:31.170
langchain-elasticsearch, and faiss-cpu.

00:31.170 --> 00:32.250
Once you've installed those,

00:32.250 --> 00:33.990
we're gonna import a couple of different things.

00:33.990 --> 00:36.883
So a TextLoader, OpenAIEmbeddings,

00:36.883 --> 00:39.270
a CharacterTextSplitter and FAISS,

00:39.270 --> 00:41.130
which is another vector database.

00:41.130 --> 00:42.750
Now if you haven't already,

00:42.750 --> 00:47.160
you'll need to set up your specific OpenAI API key.

00:47.160 --> 00:48.750
And we've just got some raw text here

00:48.750 --> 00:50.610
that we're just gonna write to a file.

00:50.610 --> 00:53.280
And then once we have that file saved locally,

00:53.280 --> 00:56.070
we're then gonna load that in with the TextLoader

00:56.070 --> 00:57.210
and we're then gonna split that

00:57.210 --> 01:00.720
into chunks of 1,000 characters at a time with no overlap.

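NOTE
A minimal sketch of what "chunks of 1,000 with no overlap" means, using only the standard library.
It is a simplified stand-in, not LangChain's CharacterTextSplitter, which splits on a separator first,
so real chunk boundaries will differ. The function name and sample text are hypothetical.
```python
def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-width character windows that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    # With no overlap, each window starts exactly where the previous one ended.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
# Stand-in for the raw text written to a file in the video (2,700 characters).
raw_text = "digital marketing " * 150
chunks = split_into_chunks(raw_text, chunk_size=1000, chunk_overlap=0)
print([len(chunk) for chunk in chunks])
```
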
01:00.720 --> 01:02.100
And then we're gonna split those

01:02.100 --> 01:04.440
into LangChain documents, and you can see here

01:04.440 --> 01:06.690
you get a list of LangChain documents,

01:06.690 --> 01:09.900
and then we're gonna load our vector database from this.

01:09.900 --> 01:12.300
And this is just giving you another taste of the fact

01:12.300 --> 01:13.650
that there's lots of different methods

01:13.650 --> 01:15.510
on the vector database that you can use.

01:15.510 --> 01:16.830
So if you do db.,

01:16.830 --> 01:19.260
you'll see you can add lots of documents,

01:19.260 --> 01:21.330
add embeddings, you can add text.

01:21.330 --> 01:24.480
And the interesting things that are quite important are,

01:24.480 --> 01:27.780
for example, you've got the ability to look at the index

01:27.780 --> 01:29.460
and you can also save this,

01:29.460 --> 01:31.350
but I specifically recommend having a look

01:31.350 --> 01:33.090
at the similarity search methods.

01:33.090 --> 01:34.110
So you'll see when we do

01:34.110 --> 01:36.270
a similarity search for digital marketing,

01:36.270 --> 01:38.310
at the moment, we only get one document.

01:38.310 --> 01:40.650
And you can also see the scores:

01:40.650 --> 01:42.360
this gives you tuple objects,

01:42.360 --> 01:44.340
and each of these has both the document

01:44.340 --> 01:47.370
and the relevance of that document.

01:47.370 --> 01:49.500
Now you can also change the K as well,

01:49.500 --> 01:50.850
so that's quite nice.

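NOTE
A toy sketch of a scored top-k search that returns (document, score) tuples, as described above.
The three-number vectors are made up by hand in place of real embeddings, and cosine similarity is an assumption.
Score conventions differ between stores and methods (some return a distance where lower is better), so check before comparing.
```python
import math
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
def search_with_score(query_vec, store, k=4):
    """Return the k best (text, score) tuples, highest similarity first."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
# Hypothetical store: each entry pairs a text with a hand-made vector.
store = [
    ("digital marketing is a growing industry", [0.9, 0.1, 0.0]),
    ("bread recipes for beginners", [0.0, 0.2, 0.9]),
]
# Changing k changes how many tuples come back.
results = search_with_score([1.0, 0.0, 0.0], store, k=1)
for text, score in results:
    print(f"{score:.3f}  {text}")
```
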
01:50.850 --> 01:52.680
Now if we have some more documents,

01:52.680 --> 01:54.720
you can also load documents specifically

01:54.720 --> 01:56.430
from Langchain documents.

01:56.430 --> 01:58.080
And you can see we've got two documents here.

01:58.080 --> 02:00.840
We've got the page_content of James Phoenix

02:00.840 --> 02:02.400
and we've also got the page_content,

02:02.400 --> 02:04.650
digital marketing is a growing industry.

02:04.650 --> 02:06.240
Now the important thing

02:06.240 --> 02:08.730
when you're doing LangChain indexing is to remember

02:08.730 --> 02:11.026
to add specific bits of metadata

02:11.026 --> 02:14.220
because it's gonna use both the page content,

02:14.220 --> 02:15.750
so what's inside the page,

02:15.750 --> 02:18.900
and your metadata dictionary,

02:18.900 --> 02:20.640
hashing the two together

02:20.640 --> 02:22.620
to create a unique ID

02:22.620 --> 02:25.050
that's kept inside the record manager.

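NOTE
A minimal sketch of the idea of hashing page content and metadata together into one ID.
This illustrates the principle only and is not LangChain's actual hashing scheme.
The function name, the "source" key, and the file names are hypothetical.
```python
import hashlib
import json
def document_id(page_content: str, metadata: dict) -> str:
    """Return a deterministic ID derived from the content and the metadata together."""
    # sort_keys makes the ID independent of the order the metadata keys were written in.
    payload = json.dumps({"page_content": page_content, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
# Same source, different content, and same content, different source.
three_years = document_id("James Phoenix worked in digital marketing for three years", {"source": "doc1.txt"})
seven_years = document_id("James Phoenix worked in digital marketing for seven years", {"source": "doc1.txt"})
other_source = document_id("James Phoenix worked in digital marketing for three years", {"source": "doc2.txt"})
# A change to either the content or the metadata produces a different ID.
print(three_years != seven_years, three_years != other_source)
```
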
02:25.050 --> 02:28.440
So it's really important that you add specific metadata

02:28.440 --> 02:31.650
to tell LangChain's indexing exactly

02:31.650 --> 02:33.090
where these documents came from

02:33.090 --> 02:36.600
so that when you do then start using an indexer on those,

02:36.600 --> 02:38.970
you can effectively deduplicate the documents.

02:38.970 --> 02:40.410
So you can see, now if we look

02:40.410 --> 02:41.640
at the different LangChain documents,

02:41.640 --> 02:44.250
we've also got that metadata associated

02:44.250 --> 02:45.540
with those documents

02:45.540 --> 02:47.070
and we can add these documents

02:47.070 --> 02:49.680
and FAISS will give us back a unique index ID

02:49.680 --> 02:51.720
for these documents and we can do a similarity search.

02:51.720 --> 02:53.100
And if I search for James,

02:53.100 --> 02:55.080
you can see I can lock in the K at one

02:55.080 --> 02:56.280
and we can see that the document

02:56.280 --> 02:57.540
that you get returned is this

02:57.540 --> 03:00.060
James Phoenix worked in digital marketing for three years

03:00.060 --> 03:02.340
and also we've got that metadata as well.

03:02.340 --> 03:05.490
Now the problem is if we don't use

03:05.490 --> 03:08.130
indexing or a record manager,

03:08.130 --> 03:09.990
there's no way to know necessarily

03:09.990 --> 03:11.790
when you need to reingest your documents

03:11.790 --> 03:13.380
whether anything's changed.

03:13.380 --> 03:15.690
Without knowing how those documents have been indexed,

03:15.690 --> 03:18.750
it's very difficult to keep track of specifically

03:18.750 --> 03:21.030
what documents need to be reingested,

03:21.030 --> 03:22.260
and which

03:22.260 --> 03:23.520
bits of the vector database need

03:23.520 --> 03:26.070
to be pruned, deleted, or updated,

03:26.070 --> 03:28.890
so this is a problem you want to solve.

03:28.890 --> 03:31.020
So LangChain has come up with a really nice way

03:31.020 --> 03:32.790
of solving this called the Indexing API.

03:32.790 --> 03:34.680
And I've linked to that inside of here.

03:34.680 --> 03:36.000
And basically how it works is

03:36.000 --> 03:37.710
you've got a couple of different components.

03:37.710 --> 03:40.140
So one of the components is a record manager

03:40.140 --> 03:42.000
and the record manager basically says,

03:42.000 --> 03:42.960
we'll take the document hash,

03:42.960 --> 03:46.830
which is a hash of both the page content and the metadata,

03:46.830 --> 03:49.200
and then what you can do is you can then deduplicate

03:49.200 --> 03:50.880
based on that document hash.

03:50.880 --> 03:53.400
And so when new documents are ingested,

03:53.400 --> 03:55.410
if the page content's changed

03:55.410 --> 03:58.080
or the metadata has changed for a specific document,

03:58.080 --> 04:00.330
then you need to do some

04:00.330 --> 04:03.120
different types of operations to clean the vector database.

04:03.120 --> 04:04.710
There's different types of deletion modes

04:04.710 --> 04:06.930
and we'll cover that towards the end of the video.

04:06.930 --> 04:08.460
For now, what I'd like you to do is

04:08.460 --> 04:09.900
just walk through an example of this.

04:09.900 --> 04:11.820
Now you will need to have Docker installed

04:11.820 --> 04:13.710
on your local machine to run this.

04:13.710 --> 04:15.420
And what I recommend doing is copying

04:15.420 --> 04:18.840
this docker run command, loading up a terminal window

04:18.840 --> 04:20.880
and then running that docker run command.

04:20.880 --> 04:21.713
And that will ask you

04:21.713 --> 04:23.550
if you want to install a load of packages.

04:23.550 --> 04:25.440
And you'll see here, for my example,

04:25.440 --> 04:27.390
we are loading Elasticsearch

04:27.390 --> 04:29.572
and basically this is a vector database

04:29.572 --> 04:31.470
that we can use locally.

04:31.470 --> 04:33.000
Now what's interesting about having

04:33.000 --> 04:35.820
Elasticsearch running in a Docker

04:35.820 --> 04:38.970
container is that we can then connect to it on this port.

04:38.970 --> 04:41.430
So it's localhost:9200,

04:41.430 --> 04:42.840
we've got an index name

04:42.840 --> 04:44.160
and we've got the embeddings

04:44.160 --> 04:46.620
coming from OpenAIEmbeddings, okay?

04:46.620 --> 04:48.930
Now you also are gonna create something called

04:48.930 --> 04:50.730
the SQLRecordManager.

04:50.730 --> 04:52.980
And remember the SQLRecordManager really does

04:52.980 --> 04:54.390
a couple of different things.

04:54.390 --> 04:56.310
What it's mainly doing is it's mainly storing

04:56.310 --> 04:59.790
what documents have been indexed, what is the document hash,

04:59.790 --> 05:02.310
and then it allows you to keep track of specific records.

05:02.310 --> 05:05.550
So I can run this record_manager.create_schema,

05:05.550 --> 05:07.590
and that creates the database schema.

05:07.590 --> 05:09.210
And then what's interesting about this is

05:09.210 --> 05:10.890
we're just using SQLite at the moment,

05:10.890 --> 05:12.720
but you could use any SQL database you want

05:12.720 --> 05:15.360
for storing the specific types of records.

05:15.360 --> 05:19.050
Now let's see what happens when we pass in a document

05:19.050 --> 05:22.020
that's got exactly the same kind of metadata,

05:22.020 --> 05:23.700
but it has different page content.

05:23.700 --> 05:25.410
So you can see if we scroll up,

05:25.410 --> 05:27.810
the page_content was James Phoenix worked

05:27.810 --> 05:29.640
in digital marketing for three years,

05:29.640 --> 05:31.950
but now we're saying that James Phoenix worked

05:31.950 --> 05:33.540
in digital marketing for seven years.

05:33.540 --> 05:35.220
So we've updated one document,

05:35.220 --> 05:38.400
also you'll see here we've got this little helper function

05:38.400 --> 05:40.380
which allows us to clear content.

05:40.380 --> 05:42.780
So we're gonna go and clear the index to start with.

05:42.780 --> 05:43.620
Then we're gonna go

05:43.620 --> 05:46.230
and with the documents we've already created,

05:46.230 --> 05:49.560
which are just above here, you've got these documents here

05:49.560 --> 05:51.780
and what we're gonna do is we're gonna add those

05:51.780 --> 05:54.300
specifically to the index.

05:54.300 --> 05:55.770
And what you'll see here is

05:55.770 --> 05:58.710
because our record manager was cleared,

05:58.710 --> 06:00.900
basically we've got two documents.

06:00.900 --> 06:03.210
And so for the first time we're saying

06:03.210 --> 06:04.920
use the source_id_key.

06:04.920 --> 06:07.590
So this is the key that we're deduping against,

06:07.590 --> 06:09.390
which is the original source.

06:09.390 --> 06:10.740
And we're basically saying,

06:10.740 --> 06:13.530
with my record manager,

06:13.530 --> 06:16.170
for this vector store, insert these documents.

06:16.170 --> 06:18.480
Now what's interesting about this is

06:18.480 --> 06:20.370
if we have these documents here

06:20.370 --> 06:22.830
and you can see I've got these two documents,

06:22.830 --> 06:26.190
if we then were to rerun this exact same command

06:26.190 --> 06:29.070
then because we're using the incremental cleanup mode,

06:29.070 --> 06:32.040
and because these documents haven't changed,

06:32.040 --> 06:33.450
there's nothing that's gonna happen here.

06:33.450 --> 06:35.460
So you'll see that when I rerun this,

06:35.460 --> 06:37.290
it skips adding any documents

06:37.290 --> 06:40.020
because the document hash hasn't changed.

06:40.020 --> 06:42.750
Now, if I was to then update one of these documents,

06:42.750 --> 06:46.080
so if you see here, we've got these updated_docs

06:46.080 --> 06:47.790
where we've said James Phoenix has worked

06:47.790 --> 06:49.710
in digital marketing for seven years,

06:49.710 --> 06:51.840
when we try and update a single document,

06:51.840 --> 06:54.780
we actually end up deleting one of those documents

06:54.780 --> 06:57.510
based on the source and then adding a new one

06:57.510 --> 06:59.520
based on the fact that document has changed

06:59.520 --> 07:02.280
and we've ingested it into the vector database.

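NOTE
The add, skip, and replace behavior described above can be sketched with a plain dictionary standing in for the record manager.
This is a toy model under stated assumptions, not LangChain's index() function: the helper names,
the dictionary layout, and the file names are hypothetical, and no vector store is involved.
```python
import hashlib
import json
def doc_hash(doc):
    """Hash the page content and metadata of a document together."""
    return hashlib.sha256(json.dumps(doc, sort_keys=True).encode("utf-8")).hexdigest()
def index_incremental(docs, records, source_id_key="source"):
    """Toy incremental indexing; records maps document hash to source id."""
    counts = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    batch_hashes = {doc_hash(d) for d in docs}
    batch_sources = {d["metadata"][source_id_key] for d in docs}
    for d in docs:
        h = doc_hash(d)
        if h in records:
            # Unchanged document: the hash is already recorded, so nothing is written.
            counts["num_skipped"] += 1
        else:
            records[h] = d["metadata"][source_id_key]
            counts["num_added"] += 1
    # Incremental cleanup: drop stale versions, but only for sources seen in this batch.
    stale = [h for h, src in records.items() if src in batch_sources and h not in batch_hashes]
    for h in stale:
        del records[h]
    counts["num_deleted"] = len(stale)
    return counts
docs = [
    {"page_content": "James Phoenix worked in digital marketing for three years", "metadata": {"source": "doc1.txt"}},
    {"page_content": "Digital marketing is a growing industry", "metadata": {"source": "doc2.txt"}},
]
updated_docs = [
    {"page_content": "James Phoenix worked in digital marketing for seven years", "metadata": {"source": "doc1.txt"}},
]
records = {}
first_run = index_incremental(docs, records)
second_run = index_incremental(docs, records)
third_run = index_incremental(updated_docs, records)
print(first_run, second_run, third_run, sep="\n")
```
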
07:02.280 --> 07:03.750
You can also see what it would be like

07:03.750 --> 07:05.130
to add a new document.

07:05.130 --> 07:06.720
So if we just had a new document

07:06.720 --> 07:08.970
with new, different metadata,

07:08.970 --> 07:11.190
you'll see that we've got a num_added,

07:11.190 --> 07:14.010
but we haven't necessarily got any that have been updated,

07:14.010 --> 07:15.420
skipped or deleted.

07:15.420 --> 07:18.540
And also if we already have something that's the same,

07:18.540 --> 07:20.730
then essentially what's gonna end up happening is

07:20.730 --> 07:22.470
you can see we're skipping the document.

07:22.470 --> 07:25.380
So from the document that we just tried to insert

07:25.380 --> 07:26.940
because nothing has changed,

07:26.940 --> 07:28.830
we're getting a skipped document.

07:28.830 --> 07:30.000
Now there's three modes

07:30.000 --> 07:32.850
that you can operate the Indexing API in.

07:32.850 --> 07:35.460
We've been primarily using the one that I'd recommend,

07:35.460 --> 07:36.810
which is incremental.

07:36.810 --> 07:39.150
And so what happens with incremental is

07:39.150 --> 07:42.120
it deduplicates content

07:42.120 --> 07:45.510
for each individual document as you reingest it.

07:45.510 --> 07:46.470
Now if you do Full,

07:46.470 --> 07:48.990
it will do that at the end of the indexing,

07:48.990 --> 07:51.360
and if you do a Cleanup Mode of None,

07:51.360 --> 07:52.650
then basically it's up to you

07:52.650 --> 07:55.440
to clean your specific vector database.

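NOTE
A toy comparison of the three cleanup modes, showing what is left in the store after one updated document is reingested.
It models the outcome only, not when the deletion happens, and it is not LangChain's implementation.
The function name, the (source, text) record layout, and the file names are hypothetical.
```python
def apply_cleanup(existing, batch, cleanup):
    """existing and batch are lists of (source, text) records; returns the store after indexing."""
    merged = list(existing) + [r for r in batch if r not in existing]
    if cleanup is None:
        # Nothing is deleted, so outdated versions stay until you remove them yourself.
        return merged
    if cleanup == "incremental":
        # Old versions are removed only for sources that appear in this batch.
        touched = {source for source, _ in batch}
        return [r for r in merged if r in batch or r[0] not in touched]
    if cleanup == "full":
        # Anything that is not part of this batch is removed.
        return [r for r in merged if r in batch]
    raise ValueError(f"unknown cleanup mode: {cleanup!r}")
existing = [("doc1.txt", "three years"), ("doc2.txt", "growing industry")]
batch = [("doc1.txt", "seven years")]
after_none = apply_cleanup(existing, batch, None)
after_incremental = apply_cleanup(existing, batch, "incremental")
after_full = apply_cleanup(existing, batch, "full")
print(after_none, after_incremental, after_full, sep="\n")
```
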
07:55.440 --> 07:58.440
So again, the key points of this video

07:58.440 --> 08:01.530
and for you to take away are: always make sure

08:01.530 --> 08:05.010
that you add metadata to your LangChain documents.

08:05.010 --> 08:07.740
You can have a record manager

08:07.740 --> 08:10.170
and that will be a database, as you can see here.

08:10.170 --> 08:12.870
I've got this record_manager_cache.sql,

08:12.870 --> 08:14.721
but you could have this specifically

08:14.721 --> 08:17.430
in any kind of SQL database that you want,

08:17.430 --> 08:19.140
whether you're using Postgres

08:19.140 --> 08:21.810
or whether you're using MySQL.

08:21.810 --> 08:24.750
Essentially all you need to do is change this db_url

08:24.750 --> 08:26.850
and supply the relevant connection string

08:26.850 --> 08:28.140
along with the namespace.

08:28.140 --> 08:30.510
So that's one important point.

08:30.510 --> 08:33.390
The other important point is remember to pick the mode

08:33.390 --> 08:34.470
that you're most interested in.

08:34.470 --> 08:36.930
So I would recommend using incremental.

08:36.930 --> 08:39.600
And the reason why all of this is important is

08:39.600 --> 08:43.560
because this means that you can have a reingestion pipeline

08:43.560 --> 08:46.680
that's powered by a record_manager and a vectorstore,

08:46.680 --> 08:48.360
and then it's being deduped

08:48.360 --> 08:51.600
by this source_id_key plus a document hash,

08:51.600 --> 08:54.720
which means that when you rerun your ingestion,

08:54.720 --> 08:57.750
it doesn't have to always reingest every single document

08:57.750 --> 09:00.390
and you can have smarter ingestion pipelines

09:00.390 --> 09:02.340
into your vector database.

09:02.340 --> 09:04.090
Cool, I'll see you in the next one.
