WEBVTT

00:00.000 --> 00:00.960
Alright, welcome back.

00:00.960 --> 00:04.050
So you've been able to now summarize documents,

00:04.050 --> 00:06.450
split documents, and work with different types

00:06.450 --> 00:08.130
of summarization techniques.

00:08.130 --> 00:09.180
Let's step it up a notch

00:09.180 --> 00:11.310
and assume that you've got a range of documents

00:11.310 --> 00:14.520
that you want to specifically tag with certain features.

00:14.520 --> 00:16.680
This is a process called tagging.

00:16.680 --> 00:19.650
Let's have a look at an implementation inside of LangChain.

00:19.650 --> 00:22.080
Firstly, we're gonna import a couple of different packages

00:22.080 --> 00:25.380
such as LangChain, Pandas and nest_asyncio.

00:25.380 --> 00:27.450
After that, we're gonna use a different document

00:27.450 --> 00:29.490
loader to the one that we used previously.

00:29.490 --> 00:31.740
This one is called a sitemap loader and it allows us

00:31.740 --> 00:35.850
to automatically load webpages from a website's sitemap.

00:35.850 --> 00:37.440
We'll use the standard chat model from OpenAI

00:37.440 --> 00:39.240
and we'll also use something called

00:39.240 --> 00:42.090
the create_tagging_chain function.

00:42.090 --> 00:44.610
Firstly, let's have a look at the sitemap loader.

00:44.610 --> 00:46.343
We've loaded in this website sitemap

00:46.343 --> 00:49.260
and we've set the requests per second to five.

00:49.260 --> 00:52.650
After running sitemap_loader.load(), we'll actually end up

00:52.650 --> 00:56.340
with a list of LangChain documents inside of this variable.

00:56.340 --> 00:59.460
Next, what we need to do is define the schema

00:59.460 --> 01:01.560
that we want to look for and tag

01:01.560 --> 01:03.270
on each individual document.

01:03.270 --> 01:05.280
You'll see when we have a look at this schema,

01:05.280 --> 01:07.650
it is a JSON schema and we've defined

01:07.650 --> 01:11.160
three individual properties with their various types.

01:11.160 --> 01:13.320
It's also possible to add descriptions.

01:13.320 --> 01:15.990
So here we've added a description for the primary topic

01:15.990 --> 01:18.840
and we've also made sure that every property is required.
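
NOTE
A minimal sketch of the kind of schema described here, not the exact
one shown on screen. The property names follow the transcript; the
types and the description text are assumptions for illustration.
```python
# Hypothetical tagging schema: three properties, one with a description,
# and all three marked as required.
schema = {
    "properties": {
        "sentiment": {"type": "string"},
        "aggressiveness": {"type": "integer"},
        "primary_topic": {
            "type": "string",
            "description": "The main topic of the document.",
        },
    },
    "required": ["sentiment", "aggressiveness", "primary_topic"],
}
# Sanity check: every required name should be a declared property.
missing = [name for name in schema["required"] if name not in schema["properties"]]
print(missing)
```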

01:18.840 --> 01:23.460
We also use a different flavor of the GPT-3.5 Turbo model,

01:23.460 --> 01:26.820
which allows us to use OpenAI functions.

01:26.820 --> 01:29.340
We then use our create tagging chain function,

01:29.340 --> 01:33.450
passing in the schema, the LLM, and the output key.

01:33.450 --> 01:34.650
After that, then what we're doing

01:34.650 --> 01:36.090
is setting up a Python list,

01:36.090 --> 01:38.700
looping through every document, printing the documents

01:38.700 --> 01:40.543
that you can see here with the page content,

01:40.543 --> 01:43.230
and invoking our chain,

01:43.230 --> 01:46.410
with the input being equal to doc.page_content.

01:46.410 --> 01:49.460
So each of these bits of page content gets injected

01:49.460 --> 01:52.500
and inserted into the chain.

01:52.500 --> 01:53.610
The chain's output,

01:53.610 --> 01:56.520
which we've defined here as you can see on the output key,

01:56.520 --> 01:59.340
we're grabbing and storing

01:59.340 --> 02:01.290
inside of this Python list.
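
NOTE
A rough sketch of the loop described here, runnable without an API key.
Doc and fake_chain are hypothetical stand-ins for a LangChain document
and the tagging chain, and the "output" key name is an assumption.
```python
from dataclasses import dataclass, field
@dataclass
class Doc:
    """Stand-in for a LangChain document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)
def fake_chain(inputs):
    """Stand-in for the tagging chain; returns made-up tags under 'output'."""
    text = inputs["input"]
    tags = {"sentiment": "neutral", "aggressiveness": 1, "primary_topic": text.split()[0]}
    return {"input": text, "output": tags}
docs = [
    Doc("Pricing plans explained", {"source": "https://example.com/pricing"}),
    Doc("Blog post about tagging", {"source": "https://example.com/blog"}),
]
results = []
for doc in docs:
    print(doc.page_content)  # show which page is being tagged
    response = fake_chain({"input": doc.page_content})
    results.append(response["output"])  # keep only the structured tags
```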

02:01.290 --> 02:03.600
Let's have a look at the Python list then.

02:03.600 --> 02:05.490
So in the Python list, we're now getting back some

02:05.490 --> 02:08.857
structured data against each individual webpage.

02:08.857 --> 02:10.500
Perfect.

02:10.500 --> 02:12.510
Now we can also import that same data

02:12.510 --> 02:14.430
into a Pandas data frame,

02:14.430 --> 02:15.660
and you'll see that if we look at

02:15.660 --> 02:17.280
a specific LangChain document,

02:17.280 --> 02:20.160
we can actually access the .metadata property.

02:20.160 --> 02:22.440
This metadata is a dictionary and

02:22.440 --> 02:24.330
because we used one of LangChain's loaders,

02:24.330 --> 02:26.940
it automatically added some metadata for us.

02:26.940 --> 02:29.250
So you'll see that we can use things like the source,

02:29.250 --> 02:32.040
the location, when the webpage was last modified,

02:32.040 --> 02:33.930
and some other sitemap properties.

02:33.930 --> 02:36.120
In our case, let's just assume

02:36.120 --> 02:37.140
that we are really interested

02:37.140 --> 02:38.760
in just adding back the URL

02:38.760 --> 02:41.070
to each of these bits of structured data

02:41.070 --> 02:43.500
so we can loop through for, in this case,

02:43.500 --> 02:46.860
the first 10 URLs, getting the source of that metadata,

02:46.860 --> 02:48.660
and assigning that to the URL.
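
NOTE
A small sketch of attaching URLs back to the tags, using plain Python
dictionaries. The sample tags and URLs are made up; only the "source"
metadata key and the first-10 limit come from the walkthrough.
```python
# Made-up tag results and per-document metadata, in matching order.
results = [
    {"sentiment": "positive", "aggressiveness": 1, "primary_topic": "pricing"},
    {"sentiment": "neutral", "aggressiveness": 2, "primary_topic": "tagging"},
]
metadata = [
    {"source": "https://example.com/pricing"},
    {"source": "https://example.com/blog"},
]
# Copy each document's source URL onto its result (first 10 only).
for result, meta in zip(results[:10], metadata[:10]):
    result["url"] = meta["source"]
print(results)
```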

02:48.660 --> 02:50.460
And in this instance, you can now see

02:50.460 --> 02:52.530
that we've got the sentiment, aggressiveness,

02:52.530 --> 02:54.630
the primary topic, and the URL.

02:54.630 --> 02:57.660
So we've been able to easily tag individual documents

02:57.660 --> 02:59.853
with unique features inside LangChain.