1
00:00:11,080 --> 00:00:16,600
So in this lecture, we are going to discuss one of the details we previously glossed over, which is

2
00:00:16,600 --> 00:00:20,620
how do we actually get the words from a string of multiple words?

3
00:00:21,550 --> 00:00:27,460
Now, if you've ever coded in languages like C or C++, you may recall having to write the code to do

4
00:00:27,460 --> 00:00:28,330
this yourself.

5
00:00:28,960 --> 00:00:35,150
Luckily, in higher-level languages like Java and Python, this is not the case.

6
00:00:35,170 --> 00:00:38,020
You simply need to call the string function named split.

7
00:00:38,650 --> 00:00:44,380
What this does is take a string containing multiple tokens and another string to separate each

8
00:00:44,380 --> 00:00:51,040
token by, for example, a space or punctuation or any other character or sequence of characters.

9
00:00:52,270 --> 00:00:57,100
The output of this call is a list of strings where each string in that list is a token.

10
00:00:57,760 --> 00:01:01,990
For example, suppose I pass in the string "I like cats" and call the split function.

11
00:01:02,290 --> 00:01:06,400
This will give me back a list with three items: "I", "like", and "cats".

12
00:01:07,000 --> 00:01:10,450
Note that by default, the split function will split on whitespace.
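
As a minimal sketch of the call just described, here is Python's built-in split, first with no argument (whitespace) and then with an explicit separator. The comma-separated string is only an illustrative input, not something from the lecture.

```python
sentence = "I like cats"

# With no argument, split() breaks the string on runs of whitespace.
tokens = sentence.split()
print(tokens)  # ['I', 'like', 'cats']

# Passing a separator string splits on that string instead.
csv_tokens = "a,b,c".split(",")
print(csv_tokens)  # ['a', 'b', 'c']
```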

13
00:01:11,930 --> 00:01:14,900
So this process is what we call tokenization.

14
00:01:15,440 --> 00:01:21,770
Unfortunately, this is a somewhat confusing term. You see, in the old days, when we said tokenization,

15
00:01:21,770 --> 00:01:25,760
we really meant what we referred to in modern times as a string split.

16
00:01:26,360 --> 00:01:32,450
So if you look at tutorials in C and C++ and they talk about tokenization, that does the same thing

17
00:01:32,450 --> 00:01:39,680
as the split function in Python. Today, when we talk about tokenization, it can be generalized to include

18
00:01:39,680 --> 00:01:40,820
many more features.

19
00:01:42,050 --> 00:01:46,430
The final note I want to mention on this slide is that you can view this lecture from two different

20
00:01:46,430 --> 00:01:47,300
perspectives.

21
00:01:47,780 --> 00:01:53,000
One way to view this lecture is to learn about all the different tokenization options you could apply

22
00:01:53,300 --> 00:01:57,770
if you use the scikit-learn CountVectorizer class.

23
00:01:57,800 --> 00:02:03,530
Tokenization is actually done for you, and so your job is to simply select the right options for your

24
00:02:03,530 --> 00:02:05,090
data and your task.

25
00:02:05,660 --> 00:02:10,580
On the other hand, if you want to implement count vectorization yourself, then you should view

26
00:02:10,580 --> 00:02:14,330
these as possible steps you could take in your implementation.

27
00:02:19,020 --> 00:02:24,660
Luckily for many applications, these extra features we're about to discuss are not needed, and thus

28
00:02:24,930 --> 00:02:29,550
it's completely fine to think about tokenization as simply being a string split.

29
00:02:30,150 --> 00:02:35,490
For example, if we're using deep learning and we want to classify text with a neural network, then just using

30
00:02:35,490 --> 00:02:37,650
a string split turns out to work pretty well.

31
00:02:39,020 --> 00:02:43,090
But there are several ways that we can expand the functionality of a tokenizer.

32
00:02:43,760 --> 00:02:48,380
So in this lecture, we'll be discussing all of these possible varieties of tokenization.

33
00:02:54,030 --> 00:02:56,520
The first thing I want to mention is punctuation.

34
00:02:57,150 --> 00:03:00,600
Punctuation may be important for downstream NLP tasks.

35
00:03:00,870 --> 00:03:02,910
For example, sentiment analysis.

36
00:03:03,510 --> 00:03:07,050
Suppose we have the sentence "I hate cats.", which ends in a period.

37
00:03:07,770 --> 00:03:09,320
Now suppose we have the sentence

38
00:03:09,330 --> 00:03:12,000
"I hate cats?", which ends with a question mark.

39
00:03:12,420 --> 00:03:15,360
These two sentences clearly have different meanings.

40
00:03:15,870 --> 00:03:22,260
The sentence that ends in a period is a statement declaring that I dislike cats very much, but the

41
00:03:22,260 --> 00:03:25,020
sentence that ends in a question mark is a question.

42
00:03:25,500 --> 00:03:30,690
It could mean that the person asking the question is wondering why someone else thinks that they hate

43
00:03:30,690 --> 00:03:31,290
cats.

44
00:03:31,800 --> 00:03:38,340
So in this context, punctuation matters, and thus it is possible, in some tokenizers, to treat

45
00:03:38,340 --> 00:03:40,230
punctuation as tokens as well.

46
00:03:40,920 --> 00:03:46,950
So both of these sentences would be converted into four tokens, where the last token is the punctuation.
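
One possible way to get punctuation as its own token is a small regular expression. This is only a sketch under my own assumptions: the function name tokenize_with_punctuation and the pattern are illustrative, not what any particular library does.

```python
import re

def tokenize_with_punctuation(text):
    # \w+ matches a run of word characters (a word);
    # [^\w\s] matches a single character that is neither word nor whitespace (punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

statement = tokenize_with_punctuation("I hate cats.")
question = tokenize_with_punctuation("I hate cats?")
print(statement)  # ['I', 'hate', 'cats', '.']
print(question)   # ['I', 'hate', 'cats', '?']
```

Both sentences share the first three tokens and differ only in the fourth.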

47
00:03:47,820 --> 00:03:49,680
Now, do you always need to do this?

48
00:03:49,890 --> 00:03:50,880
The answer is no.

49
00:03:51,270 --> 00:03:53,370
It really depends on the results you observe.

50
00:03:53,760 --> 00:03:54,750
Use it if it works.

51
00:03:54,930 --> 00:03:56,460
Don't use it if it doesn't.

52
00:03:56,940 --> 00:04:01,910
By default, the CountVectorizer in scikit-learn completely ignores punctuation.

53
00:04:06,600 --> 00:04:08,880
So that was one way to deal with punctuation.

54
00:04:09,120 --> 00:04:11,310
But here's another way that could be simpler.

55
00:04:12,090 --> 00:04:16,680
As you recall, the simplest way to do tokenization is to just call string split.

56
00:04:17,820 --> 00:04:23,070
Note that if you do the simplest thing, which is to split on whitespace, this will keep the punctuation

57
00:04:23,070 --> 00:04:24,570
along with the word itself.

58
00:04:25,200 --> 00:04:26,460
That is, "I hate cats." the statement would still be treated

59
00:04:26,460 --> 00:04:29,710
differently than "I hate cats?" the question.

60
00:04:29,730 --> 00:04:35,100
The only issue with this is you may need a lot of data for the model to learn from.

61
00:04:35,940 --> 00:04:41,790
That is to say, you'll want a sufficient number of sentences that end with "cats." and a sufficient

62
00:04:41,790 --> 00:04:44,800
number of sentences that end with "cats?".
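
To make this concrete, here is a short sketch of what a plain whitespace split does to the two example sentences. The variable names are only illustrative.

```python
statement_tokens = "I hate cats.".split()
question_tokens = "I hate cats?".split()
print(statement_tokens)  # ['I', 'hate', 'cats.']
print(question_tokens)   # ['I', 'hate', 'cats?']

# The punctuation stays glued to the word, so "cats." and "cats?"
# end up as two separate vocabulary entries.
vocabulary = set(statement_tokens) | set(question_tokens)
print(len(vocabulary))  # 4
```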

63
00:04:45,450 --> 00:04:50,010
As you recall, for machine learning, data is what we need to make our models learn.

64
00:04:50,100 --> 00:04:52,620
And so the more examples we have, the better.

65
00:04:53,190 --> 00:04:59,220
But by keeping punctuation with the words when we do a string split, this effectively increases our

66
00:04:59,220 --> 00:05:00,330
data requirement.

67
00:05:01,050 --> 00:05:06,420
Now, the reason I mentioned this is not to suggest to you what to do or not do, but to simply tell

68
00:05:06,420 --> 00:05:08,370
you the pros and cons of each approach.

69
00:05:08,820 --> 00:05:12,810
Ultimately, it's still up to you to check the performance on your dataset.

70
00:05:17,520 --> 00:05:23,640
OK, so the next modification I want to discuss is casing. Again, suppose we're doing some task like

71
00:05:23,640 --> 00:05:25,800
sentiment analysis or spam detection.

72
00:05:26,340 --> 00:05:31,860
It's probably the case that the word "cat", all lowercase, has the same meaning as "Cat" with a capital

73
00:05:31,860 --> 00:05:32,310
C.

74
00:05:33,000 --> 00:05:36,120
In other words, we may want our model to be case insensitive.

75
00:05:36,870 --> 00:05:41,160
One way to do this is to simply lowercase all the letters in our text corpus.
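
A minimal sketch of lowercasing a corpus before splitting it, using only built-in string methods. The two-document corpus is made up for illustration.

```python
corpus = ["Cat videos are great", "my cat sleeps all day"]

# Lowercase every document, then split on whitespace.
lowercased = [doc.lower() for doc in corpus]
tokens = [doc.split() for doc in lowercased]
print(tokens[0])  # ['cat', 'videos', 'are', 'great']

# "Cat" and "cat" are now the same token.
print(tokens[0][0] == tokens[1][1])  # True
```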

76
00:05:42,030 --> 00:05:47,310
Note that with the CountVectorizer in scikit-learn, this can be done by setting lowercase to True

77
00:05:47,520 --> 00:05:50,070
when you construct your CountVectorizer object.

78
00:05:50,640 --> 00:05:55,590
Now again, whether or not this will help is not guaranteed, and you always have to check the results

79
00:05:55,590 --> 00:05:56,370
for yourself.

80
00:06:01,040 --> 00:06:07,100
Yet another modification I want to discuss is accents. As you recall, some words, although it is

81
00:06:07,100 --> 00:06:09,440
less common in English, have accents.

82
00:06:10,040 --> 00:06:15,530
However, people who speak English tend not to use accents, even if they are using a word that has

83
00:06:15,530 --> 00:06:16,190
accents.

84
00:06:16,640 --> 00:06:18,410
For example, the word "naïve".

85
00:06:18,950 --> 00:06:22,370
Technically, this word is spelled correctly, whether it has accents or not.

86
00:06:22,760 --> 00:06:24,860
Therefore, both should be treated the same.

87
00:06:26,090 --> 00:06:31,970
If you're coding this yourself, you could write a function to map accented characters to the corresponding

88
00:06:31,970 --> 00:06:33,350
non-accented character.

89
00:06:33,830 --> 00:06:39,350
Or if you're using scikit-learn's CountVectorizer, then you can simply use the argument strip_accents.
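
If you code the accent mapping yourself, one way to do it is with Python's standard unicodedata module. This is a sketch, not what scikit-learn does internally, and the function name strip_accents here is just my own helper.

```python
import unicodedata

def strip_accents(text):
    # NFKD normalization splits an accented character into a base
    # character followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "\u00ef" is the letter i with a diaeresis, as in the accented spelling of naive.
print(strip_accents("na\u00efve"))  # naive
```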

90
00:06:44,060 --> 00:06:48,650
OK, so throughout this lecture, we've been assuming that our tokens are full words.

91
00:06:49,130 --> 00:06:51,110
However, this need not be the case.

92
00:06:51,530 --> 00:06:55,580
In fact, it's also possible to build models based on characters as well.

93
00:06:56,150 --> 00:06:58,460
And in fact, we will do so in this course.

94
00:06:59,060 --> 00:07:03,890
Now, this is just a brief preview, but let's go through some pros and cons of different approaches.

95
00:07:04,970 --> 00:07:08,060
As you learn more about different techniques, such as deep learning,

96
00:07:08,330 --> 00:07:11,510
the reasons why these are pros and cons will become more apparent.

97
00:07:13,390 --> 00:07:15,860
So let's think about word based tokenization.

98
00:07:16,450 --> 00:07:21,700
As mentioned, there could be up to a million words in a vocabulary. Because of this, when you're building

99
00:07:21,700 --> 00:07:22,240
a model,

100
00:07:22,270 --> 00:07:25,420
you actually need to store all these million words.

101
00:07:25,870 --> 00:07:31,300
Or if you want vector representations of these words, then you need up to a million vectors, which

102
00:07:31,300 --> 00:07:32,740
could take up lots of space.

103
00:07:33,550 --> 00:07:38,140
Furthermore, in deep learning, if we're going to build some kind of language model, then our neural

104
00:07:38,140 --> 00:07:43,120
network will output a probability distribution over all the words in our vocabulary.

105
00:07:43,660 --> 00:07:49,690
You might not realize why this is bad now, but intuitively you can imagine that a probability distribution

106
00:07:49,690 --> 00:07:54,160
with one million possible values might be hard to get exactly right.

107
00:07:55,840 --> 00:08:00,400
If you know a bit about deep learning, then you know that this would require a final weight matrix

108
00:08:00,580 --> 00:08:05,530
with size one million in the last dimension, which is pretty large by deep learning standards.

109
00:08:06,070 --> 00:08:09,970
So these are some disadvantages of using word based tokenization.

110
00:08:14,620 --> 00:08:17,560
On the other hand, words do contain a lot of information.

111
00:08:18,190 --> 00:08:23,140
For example, when I see the word "cat", this has meaning. We can picture this in our minds.

112
00:08:23,440 --> 00:08:26,170
It's pretty much unambiguous what a cat is.

113
00:08:27,370 --> 00:08:34,929
Now consider character-based tokenization. Characters, unlike words, do not contain much information.

114
00:08:35,620 --> 00:08:36,679
Take the letter C.

115
00:08:37,270 --> 00:08:41,590
I know that the word cat contains the letter C, but so does the word car.

116
00:08:41,679 --> 00:08:42,909
So does the word carbon.

117
00:08:43,390 --> 00:08:45,370
And these are completely different things.

118
00:08:45,970 --> 00:08:50,590
So the letter C does not tell me a whole lot about the idea I want to represent.

119
00:08:51,100 --> 00:08:54,520
So that's a disadvantage of using character based tokenization.

120
00:08:59,230 --> 00:09:02,740
On the other hand, in English, there are only 26 letters.

121
00:09:03,670 --> 00:09:07,090
Add to that a few different whitespace tokens and some punctuation.

122
00:09:07,420 --> 00:09:09,520
And there really aren't that many characters.

123
00:09:10,090 --> 00:09:16,210
Thus, if we use characters, then our vocabulary size is small, much less than one million, and it

124
00:09:16,210 --> 00:09:19,000
is easy and efficient to represent them in a computer.

125
00:09:20,350 --> 00:09:22,840
OK, so these are two options for tokenization.

126
00:09:23,140 --> 00:09:24,940
Word-based or character-based.

127
00:09:25,360 --> 00:09:28,030
Both have pros and cons, and neither is perfect.

128
00:09:32,750 --> 00:09:37,460
Now there is a third type of tokenization, which is sort of a middle ground between word-based and

129
00:09:37,460 --> 00:09:38,390
character based.

130
00:09:38,870 --> 00:09:41,450
This is subword-based tokenization.

131
00:09:42,260 --> 00:09:46,790
In this case, some words can actually be split up into multiple subwords.

132
00:09:47,330 --> 00:09:53,570
For example, take the word "walking". "Walking" can be split up into two subwords: "walk" and "ing".

133
00:09:54,320 --> 00:09:56,000
So why might we want to do this?

134
00:09:56,750 --> 00:10:01,940
Well, consider the fact that the word walk by itself is closely related to the word walking.

135
00:10:02,750 --> 00:10:08,450
Thus, we would want them probably to have some shared representation in our machine learning model.

136
00:10:09,590 --> 00:10:14,630
You can think of the suffix "ing" as simply a modifier on the word "walk".

137
00:10:19,270 --> 00:10:25,120
On the other hand, suppose that we did not tokenize "walking" into two subwords. What would happen

138
00:10:25,120 --> 00:10:29,500
is "walk" and "walking" would be treated like two completely different words.

139
00:10:30,070 --> 00:10:36,430
In this case, "walk" is no closer to "walking" than it is to "tree", which is a completely unrelated word.

140
00:10:37,090 --> 00:10:42,310
If we did this, then our model would have to learn the relationship between "walk" and "walking" using

141
00:10:42,310 --> 00:10:44,080
only the sentences we pass in.

142
00:10:44,860 --> 00:10:47,740
We could only hope that the model realizes they are similar.

143
00:10:48,910 --> 00:10:54,730
If we do split them into subwords, then the model knows that the subword "walk" appears in all cases,

144
00:10:54,940 --> 00:10:59,080
whether it is "walk", "walking", "walked", "walks", and so forth.
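
To illustrate the idea of a shared subword, here is a toy, rule-based splitter. Everything in it is hypothetical: the SUFFIXES list is hand-picked, the minimum stem length is an arbitrary choice, and this is not how any particular tokenizer library works.

```python
# Hypothetical, hand-picked suffix list for illustration only.
SUFFIXES = ("ing", "ed", "s")

def split_subwords(word):
    for suffix in SUFFIXES:
        # Require a stem of at least 3 letters so a word like "sing"
        # is not split into "s" + "ing".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return [word[:-len(suffix)], suffix]
    # No suffix matched: the word is its own single subword.
    return [word]

for w in ["walk", "walking", "walked", "walks"]:
    print(split_subwords(w))
# ['walk']
# ['walk', 'ing']
# ['walk', 'ed']
# ['walk', 's']
```

Every variation now starts with the same token, "walk", which is the shared representation being discussed.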

145
00:10:59,770 --> 00:11:05,050
So the question is, do we want our model to know that the same subword appears whenever any of

146
00:11:05,050 --> 00:11:05,920
these words appear?

147
00:11:06,220 --> 00:11:09,940
Or do we want our model to have to learn about each variation independently?

148
00:11:11,880 --> 00:11:15,510
Now, I realize this makes a strong case for subword tokenization.

149
00:11:15,990 --> 00:11:20,700
And yet we pretty much won't encounter it in NLP until we study Transformers.

150
00:11:21,240 --> 00:11:26,610
What you'll see is that although subword tokenization seems to be a pretty powerful middle ground between

151
00:11:26,610 --> 00:11:30,930
word-based and character-based methods, it's not necessary to build a good model.

152
00:11:35,700 --> 00:11:41,070
Now, in terms of implementation, scikit-learn's CountVectorizer allows you to do both

153
00:11:41,070 --> 00:11:43,590
word-based and character-based tokenization.

154
00:11:44,280 --> 00:11:48,780
We can control what it does by using the analyzer argument in the constructor.

155
00:11:49,500 --> 00:11:51,570
Specifically, we set this to "word"

156
00:11:51,780 --> 00:11:58,140
if we want word-based tokenization, or "char" if we want character-based tokenization. Note that the

157
00:11:58,140 --> 00:11:59,340
default is "word".
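
Here is a short sketch of the analyzer argument in action, assuming scikit-learn is installed. It uses build_analyzer, which returns the function CountVectorizer applies to each document, so we can look at the tokens without fitting anything. The example strings are my own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word-based (the default) and character-based analyzers.
word_analyzer = CountVectorizer(analyzer="word").build_analyzer()
char_analyzer = CountVectorizer(analyzer="char").build_analyzer()

# With default settings, text is lowercased, punctuation is ignored,
# and single-character tokens such as "I" are dropped.
word_tokens = word_analyzer("I hate cats.")
print(word_tokens)  # ['hate', 'cats']

# Character-based: one token per character, after lowercasing.
char_tokens = char_analyzer("Cat")
print(char_tokens)  # ['c', 'a', 't']
```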

158
00:12:00,930 --> 00:12:06,270
Now, if you want to implement tokenization yourself, then note that a string is already a sequence

159
00:12:06,270 --> 00:12:07,140
of characters.

160
00:12:07,680 --> 00:12:13,500
So if you had a string s and you did a for loop, for c in s, then this would give you each character

161
00:12:13,500 --> 00:12:17,070
one at a time, and thus you wouldn't have to call string split.

162
00:12:17,370 --> 00:12:22,980
You would simply use the string as is, perhaps after lowercasing and removing accents and so forth.
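
A minimal sketch of character-based tokenization done by hand, as just described. The input string is only an example.

```python
s = "I hate cats?"

# A string is already a sequence of characters, so we just iterate over it
# (here after lowercasing).
char_tokens = [c for c in s.lower()]
print(char_tokens)
# ['i', ' ', 'h', 'a', 't', 'e', ' ', 'c', 'a', 't', 's', '?']
```

Note that whitespace and punctuation become tokens too.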

163
00:12:27,640 --> 00:12:32,800
Now, the final topic I want to discuss in this lecture is how is tokenization actually done?

164
00:12:33,610 --> 00:12:38,830
I want to mention this because in my first NLP course, I had a lot of beginners get angry with

165
00:12:38,830 --> 00:12:39,670
me for this.

166
00:12:40,120 --> 00:12:44,860
The beginners thought that there was some fancy machine learning algorithm that they had to learn in

167
00:12:44,860 --> 00:12:47,350
order to understand how tokenization works.

168
00:12:47,860 --> 00:12:52,150
So they thought I was withholding information from them or failing to teach them something.

169
00:12:52,690 --> 00:12:54,400
But of course, this is incorrect.

170
00:12:54,880 --> 00:12:58,420
This is one task in NLP that does not require machine learning.

171
00:12:59,200 --> 00:13:05,380
As mentioned, tokenization can be as simple as a string split, and if you want to know how that works,

172
00:13:05,590 --> 00:13:11,650
recall that there are many tutorials out there in C and C++ to explain it. It is just basic coding.

173
00:13:12,610 --> 00:13:17,770
In fact, if you can't imagine immediately how it would be done, I would actually recommend that you

174
00:13:17,770 --> 00:13:19,390
try it as an exercise.
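
If you would like something to compare your attempt against, here is one possible hand-written whitespace split. It is a sketch of the basic idea in Python rather than C, and the function name split_on_whitespace is my own.

```python
def split_on_whitespace(text):
    tokens = []
    current = []  # characters of the token being built
    for ch in text:
        if ch.isspace():
            # Whitespace ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Don't forget the last token if the text doesn't end in whitespace.
    if current:
        tokens.append("".join(current))
    return tokens

result = split_on_whitespace("  I like   cats ")
print(result)  # ['I', 'like', 'cats']
```

No machine learning involved: just a loop and a couple of if statements.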

175
00:13:21,870 --> 00:13:26,940
OK, but what about the more complex types of tokenization like what we mentioned in this lecture?

176
00:13:27,660 --> 00:13:31,560
Well, again, note that none of these involve machine learning in any way.

177
00:13:32,280 --> 00:13:37,350
Lowercasing letters and removing accents are just mapping one character to another character.

178
00:13:38,460 --> 00:13:41,670
OK, but what about punctuation and subword tokenization?

179
00:13:46,440 --> 00:13:50,670
Well, let's remember that languages essentially follow a fixed set of rules.

180
00:13:51,030 --> 00:13:52,560
There is nothing to be learned.

181
00:13:53,040 --> 00:13:57,600
And in fact, there are many exceptions as anyone who has studied language should know.

182
00:13:58,260 --> 00:14:04,620
For example, there's a rule in English that says "i before e, except after c". As an example of a word

183
00:14:04,620 --> 00:14:07,620
that follows this rule, consider the word "receive".

184
00:14:08,220 --> 00:14:10,230
But there are many exceptions to this rule.

185
00:14:10,410 --> 00:14:12,240
For example, the word "weird".

186
00:14:12,840 --> 00:14:17,340
So if you're trying to follow patterns, which is what machine learning does, then you would likely

187
00:14:17,340 --> 00:14:21,060
fail because exceptions are the opposite of patterns.

188
00:14:21,600 --> 00:14:27,030
So in fact, it's not machine learning that we need for tokenization, but simply a rule based routine

189
00:14:27,120 --> 00:14:29,880
that takes into account all of these exceptions.

190
00:14:30,570 --> 00:14:35,580
OK, so I hope that clears up any misconceptions you may have had about string splits.

191
00:14:36,000 --> 00:14:39,330
It's just basic programming, and it is not machine learning.