WEBVTT

00:00.180 --> 00:01.710
Now that you've explored a range

00:01.710 --> 00:03.390
of different summarization techniques,

00:03.390 --> 00:06.000
let's move on to how we can load documents

00:06.000 --> 00:07.334
and split text.

00:07.334 --> 00:10.830
We've got an example here using another document loader

00:10.830 --> 00:13.530
and we have this package called Beautiful Soup.

00:13.530 --> 00:17.010
So you'll need to install Beautiful Soup when using this.

00:17.010 --> 00:18.870
We then import requests

00:18.870 --> 00:20.340
and we've got this README file

00:20.340 --> 00:22.410
from a GitHub repository.

00:22.410 --> 00:25.770
We make a GET request to the URL to fetch the README.

00:25.770 --> 00:27.300
We then get the text

00:27.300 --> 00:30.480
and we save that locally in a README.md file,

00:30.480 --> 00:32.010
which we then load in.

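NOTE
A minimal sketch of the download, save, and load steps just described, not the lesson's own notebook code.
It swaps in the standard library's urllib for requests so it runs without extra installs.
fetch_text, save_text, and load_text are hypothetical helper names, and the README URL is left for you to supply.
```python
from pathlib import Path
from urllib.request import urlopen
# Download step: fetch the raw text behind a URL (needs network access).
def fetch_text(url: str) -> str:
    """Download a URL and return its body decoded as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
# Save step: write the text to a local file such as README.md.
def save_text(text: str, path: str) -> Path:
    """Write the text to disk and return the path that was written."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target
# Load step: read the saved file back in, ready to wrap in a document.
def load_text(path: str) -> str:
    """Read a UTF-8 text file and return its contents."""
    return Path(path).read_text(encoding="utf-8")
```
A typical call chain would be save_text(fetch_text(url), "README.md") followed by load_text("README.md").
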
00:32.010 --> 00:34.410
So if I run all these bits of code,

00:34.410 --> 00:36.841
what you'll see is we generate these docs

00:36.841 --> 00:40.090
and these documents are LangChain documents.

00:40.090 --> 00:42.960
You can also get these LangChain documents directly

00:42.960 --> 00:47.490
by just typing in from langchain.schema import Document,

00:47.490 --> 00:51.000
and you can create documents just like this.

00:51.000 --> 00:52.500
So we can have a document here

00:53.430 --> 00:55.995
and we can define its page content.

00:55.995 --> 01:00.003
You can also add something called metadata.

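NOTE
A rough stand-in for the document object described here, assuming only the two fields the lesson names: page_content and metadata.
This dataclass is an illustration and is not LangChain's actual Document implementation; the sample strings are made up.
```python
from dataclasses import dataclass, field
# Illustrative stand-in: a piece of text plus an optional dictionary of metadata.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)
# You will generally work with a list of these documents.
docs = [
    Document(page_content="Loaders return lists of documents."),
    Document(page_content="Metadata is optional.", metadata={"source": "README.md"}),
]
```
The default_factory keeps each document's metadata dictionary separate, so editing one does not affect the others.
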
01:00.900 --> 01:03.960
So that is how you can make documents inside Python

01:03.960 --> 01:05.730
and you'll generally have a list

01:05.730 --> 01:07.140
of these LangChain documents.

01:07.140 --> 01:10.553
Now these loaders also give us a nice way

01:10.553 --> 01:15.450
to automatically get LangChain documents

01:15.450 --> 01:18.600
directly from different file sources like Markdown,

01:18.600 --> 01:22.260
GitHub, PDFs, any kind of source you can think of,

01:22.260 --> 01:23.914
Notion databases.

01:23.914 --> 01:26.460
What we need to do though, is we need to make sure

01:26.460 --> 01:28.620
that the documents are not too large.

01:28.620 --> 01:31.440
And to do that we can use a text splitter.

01:31.440 --> 01:33.970
So you'll see this document content is quite large.

01:33.970 --> 01:36.450
It might not necessarily all fit into

01:36.450 --> 01:37.830
the large language model.

01:37.830 --> 01:39.510
So what we can do is use something called a

01:39.510 --> 01:41.568
recursive text splitter,

01:41.568 --> 01:43.890
or, more precisely,

01:43.890 --> 01:45.960
a recursive character text splitter,

01:45.960 --> 01:49.020
which will split based on a series

01:49.020 --> 01:51.450
of characters recursively,

01:51.450 --> 01:53.064
making sure that the splits

01:53.064 --> 01:55.560
adhere to specific settings.

01:55.560 --> 01:57.780
And those settings are the chunk size,

01:57.780 --> 01:59.010
the amount of overlap,

01:59.010 --> 02:00.390
the length function,

02:00.390 --> 02:02.520
and whether it's going to be separated or not.

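NOTE
A simplified sketch of recursive splitting on a series of separators, assuming character counts as the length function.
It is not LangChain's implementation and it leaves out overlap; split_recursively and its default separator list are hypothetical choices for illustration.
```python
def split_recursively(text, chunk_size=300, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # The part still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too big: retry it with the finer separators.
            current = ""
            chunks.extend(split_recursively(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```
Paragraph breaks are tried first, then line breaks, then spaces, so chunks tend to end at natural boundaries before falling back to a hard cut.
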
02:02.520 --> 02:06.030
So let me show you a chunk size of 300.

02:06.030 --> 02:08.580
And what we're doing here is we make our text splitter,

02:08.580 --> 02:10.726
then we take the documents that we loaded

02:10.726 --> 02:12.510
from the README file,

02:12.510 --> 02:15.229
and then split those documents into chunks.

02:15.229 --> 02:17.370
And so you'll now see that we've got all these different

02:17.370 --> 02:19.920
documents, each holding one chunk,

02:19.920 --> 02:21.770
and we can also increase the overlap,

02:21.770 --> 02:24.360
which will mean that we have more documents, right?

02:24.360 --> 02:26.641
So if I then look at the length of these final documents,

02:26.641 --> 02:29.340
you can see I have 182.

02:29.340 --> 02:31.505
As soon as I start reducing the overlap

02:31.505 --> 02:34.530
for the same chunk size,

02:34.530 --> 02:37.180
we then start getting fewer and fewer documents, right?

02:43.310 --> 02:46.200
Because the overlap makes neighboring chunk windows share text,

02:46.200 --> 02:48.810
so different amounts of overlap give different chunk counts, right?

02:48.810 --> 02:50.340
And therefore what we're doing really is

02:50.340 --> 02:51.834
as we increase the overlap,

02:51.834 --> 02:55.350
there's a larger number of chunks that we need to create.

02:55.350 --> 02:57.780
However, we are less likely to lose information.

02:57.780 --> 02:59.898
I find that a small overlap works well.

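NOTE
A small sketch of why more overlap means more chunks, using a plain fixed-width sliding window.
This is an assumption made for illustration: it is not how LangChain's splitter merges pieces, and window_chunks is a hypothetical helper name.
```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # This window already reaches the end of the text.
            break
    return chunks
# Compare chunk counts for the same text and chunk size at different overlaps.
sample = "x" * 3000
for overlap in (0, 50, 150):
    print(overlap, len(window_chunks(sample, chunk_size=300, chunk_overlap=overlap)))
```
A larger overlap shrinks the step between windows, so the same text needs more windows to cover it.
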
02:59.898 --> 03:04.020
And then what we can do is make a model.

03:04.020 --> 03:06.390
So using ChatOpenAI,

03:06.390 --> 03:08.400
loading our load_summarize_chain,

03:08.400 --> 03:11.520
and then we could use a chain type of map_reduce

03:11.520 --> 03:13.260
on the input documents that we had.

03:13.260 --> 03:14.940
And I'm not gonna necessarily show this

03:14.940 --> 03:16.650
'cause it will take a while to run.

03:16.650 --> 03:18.120
But essentially what we're doing here

03:18.120 --> 03:20.514
is we are using the docs that we got from here

03:20.514 --> 03:24.510
and then we pass those in as our input docs variable.

03:24.510 --> 03:26.915
And this will use a MapReduce summary,

03:26.915 --> 03:29.791
just like we learned in the previous lesson.

03:29.791 --> 03:31.560
It'll use the MapReduce type of summary

03:31.560 --> 03:34.650
to create an individual summary of each document.

03:34.650 --> 03:36.120
And then what it will do is

03:36.120 --> 03:37.680
after it's made the summaries,

03:37.680 --> 03:39.570
it will then summarize those summaries

03:39.570 --> 03:41.940
inside of a reduce step.

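NOTE
A toy sketch of the map and reduce steps just described, with a stub in place of the model call.
map_reduce_summary and first_sentence are hypothetical names; a real chain would send each chunk to an LLM rather than keep its first sentence.
```python
from typing import Callable, List
def map_reduce_summary(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk on its own (map), then summarize the joined summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial_summaries))
# Toy stand-in for an LLM call: keep only the text up to the first full stop.
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."
# Two made-up chunks, as if they came out of the text splitter.
chunks = [
    "Loaders read files. They return documents.",
    "Splitters cut text. Chunks stay small.",
]
print(map_reduce_summary(chunks, first_sentence))
```
Because every chunk is summarized independently, the map step can run in parallel, and only the short summaries need to fit into the final reduce call.
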
03:41.940 --> 03:43.890
Cool, I'll see you all in the next one.