1
00:00:03,450 --> 00:00:11,940
What are splitters in LangChain, and why is it important for you to master this technique?

2
00:00:12,030 --> 00:00:14,040
So, splitters.

3
00:00:14,040 --> 00:00:18,780
We use splitters to split documents in the RAG technique.

4
00:00:18,780 --> 00:00:23,850
Remember, the RAG technique is a central technique

5
00:00:23,850 --> 00:00:31,740
for using LLMs with your own private data.

6
00:00:31,740 --> 00:00:32,040
Right.

7
00:00:32,040 --> 00:00:41,490
So, for example, to be able to ask ChatGPT about private data from your company or about a book or

8
00:00:41,490 --> 00:00:45,240
about a private database or whatever, in order to do that.

9
00:00:45,240 --> 00:00:53,010
Since LLMs are limited by the context window, we use a technique called the RAG technique.

10
00:00:53,010 --> 00:00:54,930
And you remember the RAG technique.

11
00:00:54,930 --> 00:01:02,880
What it does is it takes this huge private document you want to combine with ChatGPT, for example,

12
00:01:02,880 --> 00:01:06,930
and it breaks it down into small chunks of text.

13
00:01:07,320 --> 00:01:17,580
So after breaking it down into small chunks of text, it converts the chunks of text into numbers.

14
00:01:17,580 --> 00:01:23,490
And then it loads these numbers into a vector database.

15
00:01:23,490 --> 00:01:26,520
And that is where you can start playing with the data.

16
00:01:26,520 --> 00:01:32,190
So you remember, the steps we follow in the RAG technique are:

17
00:01:32,190 --> 00:01:36,870
First, we use splitters to create chunks of text.

18
00:01:36,870 --> 00:01:42,840
Second, we use embeddings to convert the chunks of text into numbers.

19
00:01:42,840 --> 00:01:45,570
And then we load those numbers into a vector database.
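
The three steps just listed can be sketched in plain Python. This is only a toy illustration, not LangChain code: the function names, the word-count "embedding", and the list used as a "vector database" are all assumptions made up for this example, standing in for a real splitter, a real embedding model, and a real vector store.

```python
from collections import Counter


def split_into_chunks(text):
    """Step 1 (splitter stand-in): one chunk per non-empty paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def toy_embed(chunk, vocabulary):
    """Step 2 (embedding stand-in): count how often each vocabulary word occurs."""
    counts = Counter(chunk.lower().split())
    return [counts[word] for word in vocabulary]


# A tiny made-up document with two paragraphs.
document = "red dress\n\nred traffic light"

chunks = split_into_chunks(document)
vocabulary = sorted({word for chunk in chunks for word in chunk.lower().split()})

# Step 3 (vector database stand-in): a plain list of (vector, chunk) pairs.
vector_store = [(toy_embed(chunk, vocabulary), chunk) for chunk in chunks]

for vector, chunk in vector_store:
    print(vector, chunk)
```

A real application would replace each stand-in with the corresponding component, but the order of the steps stays the same.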

20
00:01:45,570 --> 00:01:49,290
So the splitters are the first part of the RAG technique.

21
00:01:49,410 --> 00:01:57,390
And using them correctly can very positively affect the quality of our RAG application.

22
00:01:57,390 --> 00:02:06,780
So it's very important to master issues such as the technique for creating relevant text chunks, or

23
00:02:06,780 --> 00:02:10,800
the technique for adding relevant metadata to the text chunks.

24
00:02:10,800 --> 00:02:12,390
And we will see this in the examples.

25
00:02:12,390 --> 00:02:16,950
So we are going to see a few examples of splitters.

26
00:02:16,950 --> 00:02:24,240
The most important thing for me is that you get the concept, that you can start using working code

27
00:02:24,270 --> 00:02:29,490
to experiment and try things out, and that you understand where to find more information.

28
00:02:29,490 --> 00:02:29,940
Right?

29
00:02:29,940 --> 00:02:36,780
So we are not going into the small details, because that's why you have the working code for you to

30
00:02:36,780 --> 00:02:37,470
experiment.

31
00:02:37,470 --> 00:02:43,470
But we are going to focus on the concept and on the main ideas.

32
00:02:43,470 --> 00:02:54,450
So I just want to tell you why it is very important for us to master techniques like

33
00:02:54,450 --> 00:02:55,470
this here.

34
00:02:56,100 --> 00:03:05,940
So say we have a document. This is a sample document that we want to use with our RAG application.

35
00:03:05,940 --> 00:03:06,570
Right.

36
00:03:06,750 --> 00:03:17,160
So the result is going to be very different if we divide this document into, for example,

37
00:03:17,160 --> 00:03:23,070
one-word chunks, or three-word chunks, or paragraph chunks.

38
00:03:23,550 --> 00:03:34,290
So if we separate this document into paragraphs or into whole sentences, ChatGPT is going to be much

39
00:03:34,290 --> 00:03:40,710
more proficient, and the quality of our application is going to be much better

40
00:03:40,710 --> 00:03:43,950
than if we break this document into words.

41
00:03:43,950 --> 00:03:44,610
Why?

42
00:03:44,700 --> 00:03:52,470
Because for ChatGPT and for LLMs, context is very important.

43
00:03:52,620 --> 00:03:55,920
What do I mean by the word context here?

44
00:03:55,950 --> 00:04:00,150
So if you use, for example, the word red.

45
00:04:01,010 --> 00:04:01,940
A.

46
00:04:03,580 --> 00:04:06,280
Probably a person or ChatGPT.

47
00:04:06,370 --> 00:04:09,910
They don't know what this red is about.

48
00:04:09,940 --> 00:04:13,120
Are you talking about a dress?

49
00:04:13,120 --> 00:04:14,470
The color of a dress?

50
00:04:14,500 --> 00:04:17,380
Are you talking about a light, a traffic light?

51
00:04:17,410 --> 00:04:19,390
Are you talking about politics?

52
00:04:19,390 --> 00:04:21,310
What are you talking about when you say red?

53
00:04:21,310 --> 00:04:24,910
So the context around a word is important.

54
00:04:25,180 --> 00:04:34,480
So that's why, for ChatGPT, it is much more useful if we give a sentence than if we give one separate word

55
00:04:34,480 --> 00:04:37,570
or three separate words or words altogether.

56
00:04:37,570 --> 00:04:37,990
Right.

57
00:04:37,990 --> 00:04:45,640
So that's why it is important to master the techniques around splitters.

58
00:04:45,640 --> 00:04:46,000
Right.

59
00:04:46,000 --> 00:04:49,660
So you will see here a few ways of doing things.

60
00:04:49,660 --> 00:04:51,730
Some of them are good, some of them are not.

61
00:04:51,730 --> 00:04:53,950
And you will see that some of them fail.

62
00:04:53,950 --> 00:04:54,430
Right.

63
00:04:54,430 --> 00:05:02,500
So for example, the first thing we are going to use here is the character splitter, the simplest splitter

64
00:05:02,500 --> 00:05:04,060
LangChain provides.

65
00:05:04,060 --> 00:05:06,220
And here you see that.

66
00:05:06,220 --> 00:05:12,400
We are telling it the size of the chunk and the size of the chunk overlap we

67
00:05:12,400 --> 00:05:13,420
want to have, etc.

68
00:05:13,420 --> 00:05:22,150
But you see that this splitter is not very good, because when we provide a document which

69
00:05:22,150 --> 00:05:30,100
is longer than 26 characters, it doesn't divide it. Even with a longer document, it doesn't

70
00:05:30,100 --> 00:05:30,640
divide it.

71
00:05:30,640 --> 00:05:32,860
So this splitter is not good.

72
00:05:32,860 --> 00:05:39,190
But if we take a look at the recursive character splitter, which is the most commonly used, the most

73
00:05:39,190 --> 00:05:41,800
popular in LangChain, you will see different results.

74
00:05:41,800 --> 00:05:50,470
So applying a recursive character text splitter, we immediately see how it separates the first text.

75
00:05:50,470 --> 00:05:52,630
And you will see the overlap here.

76
00:05:52,630 --> 00:05:59,680
So the ending of this initial chunk is the beginning of the second chunk.

77
00:05:59,770 --> 00:06:07,480
And you can see how this recursive splitter works when applied to the long document.
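
To make the ideas of chunk size and chunk overlap concrete, here is a minimal sketch in plain Python. It is a simplified sliding window, not the algorithm LangChain's recursive splitter actually uses, and the function name, the sample text, and the parameter values are assumptions chosen only for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters,
    where each chunk starts chunk_overlap characters before the previous one ends."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample text and sizes.
sample_chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for chunk in sample_chunks:
    print(chunk)
```

In the output, the last four characters of each chunk are the first four characters of the next one. That repeated piece is the overlap described above.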

78
00:06:07,480 --> 00:06:11,080
So we have a lot of, you know, chunks of text.

79
00:06:11,080 --> 00:06:19,000
But you will see that even if these chunks of text look okay, with a similar length, etc., you will

80
00:06:19,000 --> 00:06:25,480
see that they are not going to be very useful for ChatGPT, or at least they are going to be less useful

81
00:06:25,480 --> 00:06:33,460
than if we have just a couple of chunks of text, one paragraph and then the other, because this way

82
00:06:33,460 --> 00:06:38,200
is going to give much more context to ChatGPT.

83
00:06:38,200 --> 00:06:42,550
So it's going to be much more proficient at responding to our questions.

84
00:06:42,550 --> 00:06:49,210
So that's the kind of thing that we are going to learn in the next

85
00:06:49,210 --> 00:06:50,320
chapters.

86
00:06:51,010 --> 00:06:59,650
Another interesting concept is that adding helpful metadata to the text chunks is going to be

87
00:06:59,650 --> 00:07:03,730
good for the performance of the RAG application.

88
00:07:03,730 --> 00:07:09,490
So think of this same text.

89
00:07:09,490 --> 00:07:15,460
So let's say that instead of having two paragraphs, you have like 1 million paragraphs or, you know,

90
00:07:15,460 --> 00:07:17,530
10,000 or whatever, right?

91
00:07:17,530 --> 00:07:23,050
And you load paragraph by paragraph into the vector database, okay.

92
00:07:23,050 --> 00:07:27,040
It's better than word by word, or three words at a time, or whatever.

93
00:07:27,040 --> 00:07:38,020
But it can still be improved with some metadata added to each of the chunks of text.

94
00:07:38,020 --> 00:07:47,290
For example, if we add to this chunk of text some metadata like the name of the document, the page

95
00:07:47,290 --> 00:07:53,350
of the document, the section of the document, or the year of the document,

96
00:07:53,350 --> 00:07:53,860
it can be

97
00:07:54,010 --> 00:08:00,670
very helpful for ChatGPT in order to search and understand what response

98
00:08:00,670 --> 00:08:02,140
it has to provide.

99
00:08:02,140 --> 00:08:06,910
So a little example of that is the last exercise.

100
00:08:06,910 --> 00:08:14,170
So in the last exercise we are making use of a module called MarkdownHeaderTextSplitter.

101
00:08:14,740 --> 00:08:21,010
And what this module does is it uses the Markdown headers as metadata.

102
00:08:21,010 --> 00:08:23,500
So you are probably familiar with Markdown.

103
00:08:23,500 --> 00:08:29,950
If you are not familiar with it, just Google it or go to ChatGPT and ask, you know, what is Markdown?

104
00:08:29,950 --> 00:08:35,590
You will see that Markdown is a way to write text that provides formatting.

105
00:08:35,590 --> 00:08:38,409
So for example, here we are using Markdown

106
00:08:38,409 --> 00:08:41,440
in order to make these letters a title.

107
00:08:41,440 --> 00:08:45,490
We could use Markdown in order to make these letters italic, or whatever.

108
00:08:45,490 --> 00:08:45,820
Right.

109
00:08:45,820 --> 00:08:50,050
So these symbols you see here are Markdown symbols.

110
00:08:50,050 --> 00:08:57,940
This one, for example, is telling us that this thing, My Book, is formatted as a title.

111
00:08:57,940 --> 00:09:02,740
And this second thing is smaller in size, and this one is smaller in

112
00:09:02,910 --> 00:09:04,080
size, etc.

113
00:09:04,080 --> 00:09:04,440
Right.

114
00:09:04,440 --> 00:09:15,540
So what this module does is use this Markdown information in order to include

115
00:09:15,540 --> 00:09:19,800
metadata with our chunks of text.

116
00:09:19,800 --> 00:09:27,090
And I think the easiest way to understand this is to see one chunk of text.

117
00:09:27,090 --> 00:09:33,810
So here you have how this splitter builds a chunk of text.

118
00:09:33,810 --> 00:09:38,790
So this is the chunk of text: "I was born in a very sunny day of summer."

119
00:09:39,610 --> 00:09:43,420
"I was born in a very sunny day of summer." But

120
00:09:44,090 --> 00:09:50,720
with this chunk of text, the splitter is providing metadata.

121
00:09:51,230 --> 00:10:01,130
So this metadata is telling us that this chunk of text belongs to a book with this title, to chapter

122
00:10:01,130 --> 00:10:04,670
one of this book, which has this particular name.

123
00:10:04,670 --> 00:10:08,060
So "I was born in a very sunny day of summer"

124
00:10:08,060 --> 00:10:13,550
belongs to chapter one of the book called My Book.
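
The idea of turning Markdown headers into chunk metadata can be sketched in plain Python. This is not the code of LangChain's MarkdownHeaderTextSplitter, only a minimal stand-in that behaves in a similar spirit. The function name, the metadata keys "Header 1" and "Header 2", the chapter titles, and the second chapter's text are assumptions invented for this example.

```python
def split_by_markdown_headers(markdown_text, headers_to_split_on):
    """Return chunks of body text, each tagged with the headers it sits under.

    headers_to_split_on is a list of (prefix, metadata_key) pairs,
    for example ("#", "Header 1").
    """
    current = {}  # metadata key -> header text currently in effect
    chunks = []
    body = []

    def flush():
        # Store the body lines collected so far as one chunk, if there are any.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"content": text, "metadata": dict(current)})
        body.clear()

    for line in markdown_text.splitlines():
        for prefix, key in headers_to_split_on:
            if line.startswith(prefix + " "):
                flush()
                # A new header discards headers of the same or a deeper level.
                for other_prefix, other_key in headers_to_split_on:
                    if len(other_prefix) >= len(prefix):
                        current.pop(other_key, None)
                current[key] = line[len(prefix) + 1:].strip()
                break
        else:
            # Not a header line: it belongs to the current chunk.
            body.append(line)
    flush()
    return chunks


# Made-up sample document in the spirit of the one shown in the lecture.
sample = (
    "# My Book\n\n"
    "## Chapter 1\n\n"
    "I was born in a very sunny day of summer.\n\n"
    "## Chapter 2\n\n"
    "My first day at school."
)

book_chunks = split_by_markdown_headers(sample, [("#", "Header 1"), ("##", "Header 2")])
for chunk in book_chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Each chunk carries the title of the book and the chapter it belongs to, which is the kind of metadata described above.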

125
00:10:13,850 --> 00:10:15,830
Okay, so.

126
00:10:16,920 --> 00:10:18,810
That's it for this chapter.

127
00:10:18,810 --> 00:10:25,860
The most important thing you need to get from this chapter is that splitters are the first step of

128
00:10:25,860 --> 00:10:34,830
the RAG technique, and that the quality of your RAG applications is going to be affected by how correctly

129
00:10:34,830 --> 00:10:36,240
you use a splitter.

130
00:10:36,240 --> 00:10:39,870
So it is very important to master this technique.

131
00:10:40,140 --> 00:10:44,100
And it's also very important to master the embedding technique for example.

132
00:10:44,100 --> 00:10:47,370
So we will study this further.

133
00:10:47,370 --> 00:10:54,900
But right now you can start experimenting with the text, going to the LangChain documentation

134
00:10:54,900 --> 00:11:01,380
to learn more about that, and to experiment with solutions that fit your particular cases.