WEBVTT

00:00.000 --> 00:01.530
So in this video we're gonna cover

00:01.530 --> 00:04.710
how you can summarize large documents using LangChain.

00:04.710 --> 00:07.020
So let's start by installing LangChain

00:07.020 --> 00:09.600
and pypdf if you don't already have it.

00:09.600 --> 00:11.340
After installing those packages,

00:11.340 --> 00:13.165
what we're then gonna do is import a range

00:13.165 --> 00:16.137
of different packages from LangChain and requests.

00:16.137 --> 00:19.521
So we're gonna import the chat model, ChatOpenAI.

00:19.521 --> 00:22.200
We're gonna also import a document loader called

00:22.200 --> 00:23.580
PyPDFLoader,

00:23.580 --> 00:25.188
that's going to allow us to load PDF documents

00:25.188 --> 00:28.560
directly into LangChain documents.

00:28.560 --> 00:30.000
We'll also use one of LangChain's

00:30.000 --> 00:31.590
most common text splitters called the

00:31.590 --> 00:33.240
recursive character text splitter,

00:33.240 --> 00:35.160
and we'll load in something called a

00:35.160 --> 00:36.900
load_summarize_chain function.

00:36.900 --> 00:39.180
As well as that we're going to import requests

00:39.180 --> 00:41.632
and we'll be going and getting a marketing book

00:41.632 --> 00:44.640
called Principles of Marketing PDF.

00:44.640 --> 00:46.754
We're gonna just download that

00:46.754 --> 00:48.210
and store it on the local file system

00:48.210 --> 00:49.557
with requests.

00:49.557 --> 00:52.320
After we have downloaded the PDF,

00:52.320 --> 00:54.360
what you're then going to do is you're going

00:54.360 --> 00:56.830
to use a recursive character text splitter,

00:56.830 --> 00:59.940
which will recursively split across a range

00:59.940 --> 01:02.220
of different separators and it will try

01:02.220 --> 01:03.893
and keep within a specific size.

01:03.893 --> 01:06.480
So there are a couple of different parameters

01:06.480 --> 01:08.970
that you can choose with the recursive text splitter.

01:08.970 --> 01:10.560
One that's quite common is using

01:10.560 --> 01:12.420
something called the chunk size.

01:12.420 --> 01:14.340
And the chunk size is where you're defining

01:14.340 --> 01:15.900
how large your chunks are

01:15.900 --> 01:18.060
and you can basically determine

01:18.060 --> 01:20.850
what size the chunks that you would like to have.

01:20.850 --> 01:23.820
And then by default, all of the documents

01:23.820 --> 01:25.290
that you create will have at most that size

01:25.290 --> 01:26.730
in terms of character length.
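
NOTE
To make the chunk size idea concrete, here is a minimal pure-Python sketch of recursive splitting.
It is not LangChain's implementation: the name recursive_split, the separator list, and the sample
text are all assumptions made for illustration. It tries the coarsest separator first and only
falls back to finer ones when a piece is still longer than chunk_size.
```python
from typing import List
def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks
# Hypothetical sample text, split on paragraphs first, then lines, then spaces.
sample = "First paragraph here.\n\nSecond paragraph is a little bit longer.\n\nThird."
chunks = recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", " "])
```
Every chunk comes out at or under 30 characters, which is why chunk size is best read as an
upper bound rather than an exact length.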

01:26.730 --> 01:28.401
You'll see here that we're loading in the PDF

01:28.401 --> 01:29.970
and we're then splitting it.

01:29.970 --> 01:32.340
So we're doing loader.load_and_split.

01:32.340 --> 01:35.370
And then if we have just a brief look here at the pages,

01:35.370 --> 01:38.400
you will see that we now have these pages

01:38.400 --> 01:40.380
which are basically LangChain documents

01:40.380 --> 01:43.230
and it's got some metadata about what the source of it is

01:43.230 --> 01:44.700
and the page of it is.

01:44.700 --> 01:47.130
Also, notice how we've got some of the page content here.

01:47.130 --> 01:49.530
So you can see the first page doesn't have much content.

01:49.530 --> 01:52.080
And then if we start to look at some of the other ones,

01:52.080 --> 01:53.550
we can see that this book is produced

01:53.550 --> 01:55.320
by the University of Minnesota.

01:55.320 --> 01:58.290
What we want to do is we want to load in our chat model

01:58.290 --> 02:00.390
as an LLM, and this is important

02:00.390 --> 02:02.940
because we're going to use this in the various functions.

02:02.940 --> 02:05.490
So what you're gonna do is you're going to import

02:05.490 --> 02:06.840
that ChatOpenAI.

02:06.840 --> 02:09.960
Remember that if you do need to set an API key,

02:09.960 --> 02:11.520
you will have to set that here.

02:11.520 --> 02:13.230
So the way that you can do that,

02:13.230 --> 02:14.760
and I'll leave this in the code,

02:14.760 --> 02:16.650
would be to use the os package

02:16.650 --> 02:18.960
and then you could do os.environ,

02:18.960 --> 02:22.800
and then we'll put the OPENAI_API_KEY

02:22.800 --> 02:24.930
and you add your API key here.

02:24.930 --> 02:26.310
And then that would basically mean

02:26.310 --> 02:28.354
that this model is then authenticated

02:28.354 --> 02:31.170
so that you can easily call OpenAI.
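
NOTE
A minimal sketch of the environment variable step described here. The key value is a placeholder,
not a real key, and using setdefault (so an already exported key is not overwritten) is an
assumption of this sketch rather than what is shown on screen.
```python
import os
# Placeholder only: substitute your own key, ideally loaded from a secrets store
# rather than typed into the notebook.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")
# Simple sanity check that a non-empty key is now visible to client libraries.
api_key_is_set = bool(os.environ.get("OPENAI_API_KEY"))
```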

02:31.170 --> 02:33.120
You'll see that the chat model by default uses

02:33.120 --> 02:34.590
GPT-3.5 Turbo

02:34.590 --> 02:37.384
and this will be, you know, upgraded as more

02:37.384 --> 02:40.110
and more models roll out.

02:40.110 --> 02:42.480
We're gonna use a function today called

02:42.480 --> 02:43.950
load_summarize_chain.

02:43.950 --> 02:47.130
And essentially what this does is it takes in an LLM,

02:47.130 --> 02:50.880
a chain type, so that could be stuff or map_reduce,

02:50.880 --> 02:52.800
there's other ones that are quite important.

02:52.800 --> 02:55.110
I personally find that MapReduce is very good

02:55.110 --> 02:56.610
for large documents.

02:56.610 --> 02:59.010
And essentially what happens in MapReduce is

02:59.010 --> 03:00.780
you can think of MapReduce being,

03:00.780 --> 03:04.230
you will do an operation on each individual document

03:04.230 --> 03:06.090
and then once you've got all those documents,

03:06.090 --> 03:07.954
you will then use a reduce pattern

03:07.954 --> 03:09.870
on all of those documents.

03:09.870 --> 03:11.820
So you could have a hundred documents,

03:11.820 --> 03:13.560
you create a hundred summaries,

03:13.560 --> 03:15.180
and then you would have a reducing pattern

03:15.180 --> 03:17.970
that would then recursively reduce those

03:17.970 --> 03:19.890
hundred documents into a final summary.
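
NOTE
The map and reduce steps just described can be sketched in plain Python. This is an illustration
of the pattern only, not how LangChain implements it: map_reduce_summarize, batch_size, and the
first_sentence stand-in (used here instead of a real LLM call) are all hypothetical names.
```python
from typing import Callable, List
def map_reduce_summarize(docs: List[str], summarize: Callable[[str], str], batch_size: int = 4) -> str:
    """Summarize each document, then repeatedly merge summaries until one remains."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each reduce pass shrinks the list")
    if not docs:
        return ""
    # Map step: one summary per document.
    summaries = [summarize(doc) for doc in docs]
    # Reduce step: combine summaries batch by batch until a single one is left.
    while len(summaries) > 1:
        batches = [summaries[i:i + batch_size] for i in range(0, len(summaries), batch_size)]
        summaries = [summarize("\n".join(batch)) for batch in batches]
    return summaries[0]
def first_sentence(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."
pages = [
    "Marketing creates value. It has four Ps.",
    "Segmentation splits markets. Targeting picks one.",
    "Pricing matters. So does place.",
]
final_summary = map_reduce_summarize(pages, first_sentence, batch_size=2)
```
With a real model in place of first_sentence, every map and reduce call is a separate request,
which is where the long running time on a whole book comes from.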

03:19.890 --> 03:21.640
The way that we can load this up is

03:22.529 --> 03:23.362
by using load_summarize_chain,

03:23.362 --> 03:24.900
the LLM, and the chain type.

03:24.900 --> 03:27.600
And in this case we're passing in map_reduce.

03:27.600 --> 03:31.290
I ran this one in particular over the whole 400 pages of a book.

03:31.290 --> 03:32.790
And that took about 30 minutes.

03:32.790 --> 03:34.200
So it does take a lot of time

03:34.200 --> 03:36.548
to run a large summarization chain.

03:36.548 --> 03:39.210
However, this is kind of the summary that you can see.

03:39.210 --> 03:41.250
And it says this passage covers a range of topics

03:41.250 --> 03:43.260
related to marketing, including defining marketing,

03:43.260 --> 03:45.090
strategic planning, consumer behavior.

03:45.090 --> 03:47.400
So we've got a really nice high level summary

03:47.400 --> 03:49.770
after running that summarization chain on the

03:49.770 --> 03:51.390
entire PDF of the book.

03:51.390 --> 03:53.940
Now I think it's worthwhile just seeing it in action.

03:53.940 --> 03:55.710
So if we just have a look at like, you know,

03:55.710 --> 03:59.670
the 10 pages and running a summarization chain,

03:59.670 --> 04:01.890
you'll see it's taking a bit of time

04:01.890 --> 04:03.450
and what's going on underneath the hood

04:03.450 --> 04:05.357
is it's running that MapReduce pattern

04:05.357 --> 04:07.950
across a range of different documents for you

04:07.950 --> 04:09.587
and essentially getting summaries of each

04:09.587 --> 04:11.460
of the individual pieces

04:11.460 --> 04:14.640
and then reducing that back into a single LLM response,

04:14.640 --> 04:17.970
which you'll then get back as a summarized output.

04:17.970 --> 04:19.710
So we're just gonna see how long this takes here.

04:19.710 --> 04:21.540
We're currently at about half a minute,

04:21.540 --> 04:23.250
so let's actually just see what's come back now.

04:23.250 --> 04:25.020
So we've got various marketing concepts

04:25.020 --> 04:27.800
such as marketing segmentation, targeting, planning,

04:27.800 --> 04:29.040
advertising, it's designed to help individuals

04:29.040 --> 04:31.380
effectively promote products and services.

04:31.380 --> 04:34.320
Cool, so it's very easy for you to, you know,

04:34.320 --> 04:38.400
do a summarization chain on any type of document

04:38.400 --> 04:40.110
and so essentially what we've got here is

04:40.110 --> 04:43.620
we've got this LLM, we have the chain type,

04:43.620 --> 04:46.410
and then we are also passing in these pages,

04:46.410 --> 04:47.880
which is a list of documents.

04:47.880 --> 04:50.310
So now that you can sort of see that running in action,

04:50.310 --> 04:52.710
I think it's important for us to take a step back

04:52.710 --> 04:54.330
and say, well, maybe you want

04:54.330 --> 04:56.940
to do a summary in a different language

04:56.940 --> 04:59.991
or you know, maybe you've got something that you need

04:59.991 --> 05:01.590
to really specifically do at every stage of

05:01.590 --> 05:03.240
that map or reduce pattern.

05:03.240 --> 05:05.370
This is where you can create a MapReduce chain,

05:05.370 --> 05:06.990
which is a custom chain.

05:06.990 --> 05:09.751
And so what you'll see here is we've got this

05:09.751 --> 05:10.584
map pattern here that says,

05:10.584 --> 05:12.450
given the following pages of a marketing book,

05:12.450 --> 05:14.460
generate a summary in Spanish

05:14.460 --> 05:16.080
and we've got the pages here.

05:16.080 --> 05:18.810
And then you say, given the following summaries of pages

05:18.810 --> 05:20.610
of a marketing book, generate a high level

05:20.610 --> 05:22.620
description of the book in Spanish.

05:22.620 --> 05:24.300
We've set up a map prompt

05:24.300 --> 05:27.000
that will map over the pages one at a time.

05:27.000 --> 05:28.860
We also set up a reduce prompt

05:28.860 --> 05:31.170
that will go over the summaries

05:31.170 --> 05:32.400
and then reduce those.
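
NOTE
A rough sketch of the two prompts described here, written as plain Python format strings. The
prompt wording follows what is read out in the video; the template layout, the "Pages:" and
"Summaries:" labels, and the helper names are assumptions for illustration, not LangChain classes.
```python
MAP_TEMPLATE = (
    "Given the following pages of a marketing book, generate a summary in Spanish.\n"
    "Pages:\n{pages}"
)
REDUCE_TEMPLATE = (
    "Given the following summaries of pages of a marketing book, "
    "generate a high level description of the book in Spanish.\n"
    "Summaries:\n{summaries}"
)
def build_map_prompt(page_text: str) -> str:
    """Fill the map template with the text of one chunk of pages."""
    return MAP_TEMPLATE.format(pages=page_text)
def build_reduce_prompt(summaries: list) -> str:
    """Fill the reduce template with all intermediate summaries, one per line."""
    return REDUCE_TEMPLATE.format(summaries="\n".join(summaries))
map_prompt = build_map_prompt("Marketing is about creating value.")
reduce_prompt = build_reduce_prompt(["Resumen uno.", "Resumen dos."])
```
Changing the language or the level of detail then only means editing these two strings.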

05:32.400 --> 05:34.770
We set up two LLM chains.

05:34.770 --> 05:37.320
And then as well as that we have a list of documents

05:37.320 --> 05:39.720
and we're combining that into a single string.

05:39.720 --> 05:43.710
We also have the MapReduceDocumentsChain,

05:43.710 --> 05:45.270
which takes in the map

05:45.270 --> 05:48.090
and then associates all of the documents,

05:48.090 --> 05:50.370
the individual documents with pages.

05:50.370 --> 05:52.890
And we also have the combine document chain here.

05:52.890 --> 05:56.160
So just in case we need to combine these,

05:56.160 --> 05:59.122
that's kind of how that's working underneath the hood.

05:59.122 --> 06:01.620
And then, so that's part

06:01.620 --> 06:04.664
of the reduce section when we're doing that combine.

06:04.664 --> 06:06.723
So that's a chain to use to combine results

06:06.723 --> 06:08.460
of applying LLM chains to documents.

06:08.460 --> 06:11.220
We then set that up with a MapReduceChain,

06:11.220 --> 06:14.100
and we feed it in the MapReduceDocumentsChain.

06:14.100 --> 06:17.160
And as well as that you can also attach a text splitter

06:17.160 --> 06:19.380
and then essentially you just pass in your input text.

06:19.380 --> 06:21.780
So in this scenario here you can see,

06:21.780 --> 06:24.750
we call our MapReduce chain, which is called MapReduce

06:24.750 --> 06:27.413
and then we pass in our input text.

06:27.413 --> 06:28.440
And what I've done is looked at the top

06:28.440 --> 06:30.360
hundred documents to start with

06:30.360 --> 06:32.910
and then what we've done is extracted the page content,

06:32.910 --> 06:34.650
joined those all together and run that

06:34.650 --> 06:36.450
through a MapReduce pattern.
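
NOTE
Building the input text from the first hundred documents can be sketched like this. Page is a
hypothetical stand-in for a LangChain document with page_content and metadata fields, and
build_input_text, the "book.pdf" source, and the blank-line separator are assumptions.
```python
from dataclasses import dataclass, field
from typing import Dict, List
@dataclass
class Page:
    """Hypothetical stand-in for a LangChain document: text plus metadata."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)
def build_input_text(pages: List[Page], limit: int = 100) -> str:
    """Join the page_content of the first `limit` pages into one string."""
    return "\n\n".join(page.page_content for page in pages[:limit])
# Fake pages so the sketch runs without a PDF on disk.
pages = [Page(f"Text of page {i}.", {"source": "book.pdf", "page": i}) for i in range(250)]
input_text = build_input_text(pages)
```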

06:36.450 --> 06:38.220
But notice now that the actual text

06:38.220 --> 06:39.390
that you're getting back is

06:39.390 --> 06:41.550
not necessarily an English

06:41.550 --> 06:43.260
summarization across all of these

06:43.260 --> 06:44.310
first hundred documents.

06:44.310 --> 06:46.140
It's actually a Spanish summary.

06:46.140 --> 06:49.012
So you know, being able to specifically change

06:49.012 --> 06:51.900
what happens at the map section

06:51.900 --> 06:53.550
of your summarization chains,

06:53.550 --> 06:56.130
or even just what's happening when all the documents

06:56.130 --> 06:57.900
are getting stuffed in to get reduced,

06:57.900 --> 07:00.420
can give you the flexibility to really

07:00.420 --> 07:02.613
optimize your summarization techniques.
