WEBVTT

00:00.120 --> 00:02.960
The recursive character splitter.

00:03.000 --> 00:03.440
Wow.

00:03.480 --> 00:04.720
It was hard to pronounce.

00:05.120 --> 00:11.360
It's a utility class which helps us to split large documents into smaller chunks.

00:11.680 --> 00:12.680
How does it do it?

00:12.880 --> 00:15.760
It does that recursively by characters.

00:16.360 --> 00:24.440
The recursive aspect of this splitter refers to the strategy of splitting the text by progressively trying

00:24.440 --> 00:31.080
different levels of separators in a hierarchical order, until the chunks are small enough.

00:31.520 --> 00:39.920
So it starts by attempting to split the text using larger separators that correspond to bigger semantic

00:39.960 --> 00:40.680
units.

00:40.680 --> 00:48.360
So, for example, paragraphs are going to be split on double newlines, \n\n.

00:48.360 --> 00:54.360
So we want to start by breaking the text into chunks with that separator.

00:54.600 --> 01:02.160
So after we do that, if a chunk is still too large, even after the split, it will recursively try

01:02.320 --> 01:06.560
other smaller separators like a single newline.

01:06.560 --> 01:13.080
So that's \n, which roughly marks a line or sentence boundary, and then we can even go to spaces, which are going

01:13.080 --> 01:14.520
to separate words.

01:14.520 --> 01:18.160
And finally down to the individual character if necessary.
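
NOTE
A minimal sketch of the separator hierarchy described above, assuming the
order "\n\n", "\n", " " and finally "" (individual characters). The name
recursive_split is hypothetical and is not a library API. A production
splitter would typically also merge small neighbouring pieces back up to the
size limit and may add overlap; this sketch does neither.
```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Illustrative only: split text, trying coarse separators before fine ones."""
    if len(text) <= chunk_size:
        # Small enough already; drop empty pieces produced by adjacent separators.
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut into fixed-size character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        # Pieces that are still too large fall through to the next, finer separator.
        chunks.extend(recursive_split(piece, chunk_size, finer))
    return chunks
```
With a limit of 7 characters, a paragraph that already fits is kept whole,
while a longer one is broken on the newline and then on spaces.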

01:18.560 --> 01:25.840
And this recursive approach tries to keep semantically related text together as much as possible.

01:25.840 --> 01:30.800
This is to preserve the natural language flow and coherence within the chunks.

01:30.840 --> 01:37.600
Of course, this is a heuristic, so it doesn't always mean that the text will be split up coherently.

01:37.600 --> 01:46.280
And this method actually contrasts with simple fixed-length splitting by characters or by tokens, because

01:46.280 --> 01:52.800
it respects the inherent structure of text to maintain semantic integrity within chunks.
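
NOTE
For contrast, a minimal sketch of the simple fixed-length strategy mentioned
here, splitting by character count only. The name fixed_length_split is
hypothetical and chosen for illustration. Because it ignores separators, it
can cut through the middle of a word or sentence.
```python
def fixed_length_split(text, chunk_size):
    """Illustrative only: cut text every chunk_size characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
# The word "world" is cut in two because only the character count matters.
example = fixed_length_split("Hello world. Bye now.", 8)
```
Here example is ["Hello wo", "rld. Bye", " now."], which shows the kind of
mid-word break that a structure-aware splitter tries to avoid.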

01:53.000 --> 01:54.680
So it's a different strategy.
