WEBVTT

00:00.360 --> 00:03.000
Teacher: A lot of LLM applications involve

00:03.000 --> 00:07.260
somehow connecting LLMs to external data sources.

00:07.260 --> 00:08.310
Like we saw in the course,

00:08.310 --> 00:10.170
where we took Medium articles

00:10.170 --> 00:12.480
and we took the LangChain documentation

00:12.480 --> 00:14.670
and we QAed over them

00:14.670 --> 00:18.780
by using the retrieval-augmented generation method.

00:18.780 --> 00:22.680
But a prerequisite for doing it is to ingest the data

00:22.680 --> 00:27.680
into a format that the LLM can easily understand.

00:27.750 --> 00:29.790
So, most of the time it'll mean

00:29.790 --> 00:31.530
that we need to take our data

00:31.530 --> 00:33.510
and ingest it into a vector store

00:33.510 --> 00:35.070
like we showed in the course.

00:35.070 --> 00:39.150
But before we do it, we need to first split it

00:39.150 --> 00:41.730
or chunkify it into smaller chunks

00:41.730 --> 00:44.520
because we can't just put all of the data as is

00:44.520 --> 00:45.630
in the vector store.

00:45.630 --> 00:47.460
We need to have smaller chunks

00:47.460 --> 00:50.430
so we won't exceed the token limit.

00:50.430 --> 00:53.490
So, while this task may seem trivial,

00:53.490 --> 00:56.670
and we saw this in the course, it's often nuanced

00:56.670 --> 00:59.520
and overlooked because when splitting the text,

00:59.520 --> 01:02.520
we want to ensure that each chunk

01:02.520 --> 01:07.410
has some cohesive information that we can understand

01:07.410 --> 01:09.300
and the LLM can understand.

01:09.300 --> 01:11.550
We don't want to simply split a sentence

01:11.550 --> 01:12.930
in the middle of it.

01:12.930 --> 01:16.260
So, we want to have a chunk that is big enough

01:16.260 --> 01:18.120
that we can understand it,

01:18.120 --> 01:19.770
but not too big

01:19.770 --> 01:20.760
and not too large,

01:20.760 --> 01:23.700
so that it won't have too many tokens.
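
NOTE
A minimal sketch of the idea above, not taken from the video: pack whole sentences into chunks so that no sentence is cut in the middle. The name split_into_sentence_chunks, the max_chars parameter, and the naive regex sentence boundary are assumptions for illustration; this is not LangChain's implementation.
```python
import re
def split_into_sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Naive sentence boundary (assumption): whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if not current or len(candidate) <= max_chars:
            # Either the chunk is empty (an oversized sentence is kept whole)
            # or the sentence still fits into the current chunk.
            current = candidate
        else:
            # The sentence does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```
A sentence longer than max_chars is kept whole here rather than cut, which is one possible trade-off among several.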

01:23.700 --> 01:26.670
So, I often get messages with questions

01:26.670 --> 01:30.330
about how to split the data, how to set the chunk overlap,

01:30.330 --> 01:33.690
and what the chunk size should be, or what we should split on.

01:33.690 --> 01:36.810
And to be honest, there is no correct answer for this,

01:36.810 --> 01:39.420
and every case needs to be examined

01:39.420 --> 01:41.370
and needs to be handled differently.

01:41.370 --> 01:44.670
However, luckily for us, LangChain created a tool

01:44.670 --> 01:48.690
that we can use to visualize how we split our text.

01:48.690 --> 01:51.690
So, this is called the Text Splitter Playground.

01:51.690 --> 01:53.280
So, let me show how to find it.

01:53.280 --> 01:55.950
I'm simply going to Google LangChain

01:55.950 --> 01:57.633
text splitting playground,

01:59.280 --> 02:02.010
and I'm going to select the second result.

02:02.010 --> 02:04.050
This is the GitHub repo. It's open source.

02:04.050 --> 02:05.700
You can check out this code.

02:05.700 --> 02:07.830
And this is basically a Streamlit application.

02:07.830 --> 02:09.360
You can click on this link

02:09.360 --> 02:11.460
because it's hosted by LangChain.

02:11.460 --> 02:13.200
It's very intuitive to use.

02:13.200 --> 02:14.940
We can see that in the first part,

02:14.940 --> 02:17.100
we can play around with the parameters.

02:17.100 --> 02:19.770
So, we can play around with the chunk size,

02:19.770 --> 02:21.420
with the chunk overlap,

02:21.420 --> 02:24.390
and how we calculate this chunk size.
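
NOTE
A minimal sketch, not taken from the video, of why the way the chunk size is calculated matters: the same chunk_size gives different chunks when length is measured in characters versus words. The names pack_words and count_words are hypothetical, and counting words is only a rough stand-in for counting tokens with a real tokenizer.
```python
from typing import Callable
def pack_words(text: str, chunk_size: int, length_function: Callable[[str], int] = len) -> list[str]:
    """Greedily pack words into chunks whose measured length is at most chunk_size."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if not current or length_function(candidate) <= chunk_size:
            # An empty chunk always accepts a word, even an oversized one.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
# Hypothetical length function: number of whitespace-separated words.
def count_words(chunk: str) -> int:
    return len(chunk.split())
# Same text, two ways of measuring the chunk size.
by_characters = pack_words("aa bb cc dd ee", chunk_size=5)
by_words = pack_words("aa bb cc dd ee", chunk_size=3, length_function=count_words)
```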

02:24.390 --> 02:27.000
And we can even select our text splitter.

02:27.000 --> 02:28.260
So, right now we're using

02:28.260 --> 02:31.200
the most common recursive character text splitter

02:31.200 --> 02:32.850
that we used in the course.

02:32.850 --> 02:35.940
And at the bottom, each time we change this parameter,

02:35.940 --> 02:37.980
we can see it reflected in the code,

02:37.980 --> 02:42.030
which we can simply copy paste into our workspace.

02:42.030 --> 02:43.440
Let's see this in action.

02:43.440 --> 02:45.720
And let me go to the LangChain blog

02:45.720 --> 02:49.980
and select a blog post that we can copy data from.

02:49.980 --> 02:52.230
And let's see how we chunk it up.

02:52.230 --> 02:55.420
So, I'm simply going to copy all the text

02:56.760 --> 03:01.760
and I'm going to go back to the text splitter and paste it.

03:01.980 --> 03:03.390
Let's click split text

03:03.390 --> 03:06.750
and let's take a look at the chunks.

03:06.750 --> 03:09.240
We can see visually all the chunks

03:09.240 --> 03:13.170
and we can see if they make sense or not.

03:13.170 --> 03:17.040
This way we can optimize our chunking strategy

03:17.040 --> 03:20.970
and actually visualize and see how our chunks look.

03:20.970 --> 03:23.040
And those are the chunks which are going

03:23.040 --> 03:26.910
to be embedded in the vector store, for example.

03:26.910 --> 03:28.320
This is also a good way

03:28.320 --> 03:30.210
to check out the chunk overlap.

03:30.210 --> 03:33.240
So, for example, we can see between two chunks,

03:33.240 --> 03:37.710
what overlaps according to the overlap size.
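
NOTE
A minimal sketch, not taken from the video, of what chunk overlap means: a fixed-size character window that steps forward by chunk_size minus chunk_overlap, plus a helper that reports the text two neighbouring chunks share. The names split_with_overlap and shared_overlap are hypothetical, and this simple window is an assumption for illustration, not the separator-based logic of the splitter shown in the playground.
```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows where neighbours share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # The window already reached the end of the text.
            break
    return chunks
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
overlap = shared_overlap(chunks[0], chunks[1])
```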

03:37.710 --> 03:39.330
And to wrap up this video,

03:39.330 --> 03:41.520
I really think this is a wonderful tool,

03:41.520 --> 03:43.860
which is helpful when we want

03:43.860 --> 03:46.440
to optimize our chunking strategy

03:46.440 --> 03:48.840
and when we want to visualize our chunks

03:48.840 --> 03:50.853
and what data they hold.