1
00:00:11,100 --> 00:00:16,320
So in this lecture, we are going to discuss yet another modification of the basic counting method.

2
00:00:17,040 --> 00:00:22,230
The motivation for this lecture is similar to our motivation for using subword tokenization.

3
00:00:23,070 --> 00:00:29,280
As you recall, with basic tokenization and counting, each word, no matter how similar, will occupy

4
00:00:29,280 --> 00:00:31,410
its own component in our final vector.

5
00:00:32,040 --> 00:00:38,250
So if we have the words walk, walking, walks, and walked, they would all be treated as separate words.

6
00:00:38,850 --> 00:00:44,790
In this case, walk is as close to walking as it is to cartwheel, which is definitely an unrelated

7
00:00:44,790 --> 00:00:45,300
word.

8
00:00:46,140 --> 00:00:52,170
Another problem with this approach is again related to high-dimensional vectors. Despite the fact that

9
00:00:52,170 --> 00:00:57,180
these words have very close to the same meaning, we treat them all as separate vector components,

10
00:00:57,600 --> 00:01:00,390
since, as you recall, there is one component per word.

11
00:01:01,110 --> 00:01:06,510
By having separate vector components for many related words, we increase the dimensionality of our

12
00:01:06,510 --> 00:01:12,120
vector representation, which is not good for many reasons, especially when those components may be

13
00:01:12,120 --> 00:01:12,900
correlated.

14
00:01:14,940 --> 00:01:19,080
Yet another reason we don't want to treat these words separately is very practical.

15
00:01:19,890 --> 00:01:23,070
Suppose that we are building a search engine like DuckDuckGo.

16
00:01:23,820 --> 00:01:29,700
Now suppose that I'm searching for content on people running. Content creators might express this

17
00:01:29,700 --> 00:01:31,840
in different ways. They might say,

18
00:01:31,890 --> 00:01:38,010
"Here is a picture of me running," or "This is me when I ran to the park last week," or "I run to the office

19
00:01:38,010 --> 00:01:38,610
on Monday."

20
00:01:39,420 --> 00:01:44,370
All of these are relevant search results, but they wouldn't be considered a match if we treated each

21
00:01:44,370 --> 00:01:47,040
variation of the word run in a different way.

22
00:01:47,940 --> 00:01:50,490
So what is one possible solution to this issue?

23
00:01:52,440 --> 00:01:57,900
The main motivation of this lecture is to take these related words and simply convert them to the same

24
00:01:57,900 --> 00:02:03,600
word as a preprocessing step. That is, walk, walking, walks, and so forth

25
00:02:03,720 --> 00:02:06,460
all become walk. In some sense,

26
00:02:06,480 --> 00:02:08,460
what we want is the root word.

27
00:02:09,030 --> 00:02:13,650
This lecture will look at two of the most popular techniques for doing this, known as stemming and

28
00:02:13,650 --> 00:02:14,730
lemmatization.

29
00:02:19,390 --> 00:02:25,660
So what's the difference between stemming and lemmatization? The difference between stemming and lemmatization

30
00:02:25,660 --> 00:02:30,250
is that stemming is very crude and won't necessarily output real words.

31
00:02:30,610 --> 00:02:34,270
It simply chops off the ends of a word and gives you back the rest.

32
00:02:35,470 --> 00:02:41,380
On the other hand, lemmatization is more sophisticated, and it uses actual rules of the language in order

33
00:02:41,380 --> 00:02:43,960
to give you back the true base word or root.

34
00:02:44,500 --> 00:02:48,640
The root is also known as the lemma, hence the term lemmatization.

35
00:02:49,810 --> 00:02:55,330
Either way, both of these methods will help you reduce your vocabulary size, which will also reduce

36
00:02:55,330 --> 00:03:00,910
your vector dimensionality and hence speed up any further processing you might need to do on your text.

37
00:03:05,450 --> 00:03:11,930
So let's start by looking at stemming. As mentioned, stemming is a very simple method based on heuristics.

38
00:03:12,290 --> 00:03:21,260
For example, if a word ends in "sses", remove the final "es", so bosses becomes boss. Or if a word ends

39
00:03:21,260 --> 00:03:28,760
in "ement", remove the whole thing, so replacement becomes "replac", without an e at the end.

40
00:03:29,360 --> 00:03:31,070
Note that this is not a real word.
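To make the heuristic idea concrete, here is a rough sketch of a suffix-stripping stemmer using only the two example rules above. The names TOY_RULES and toy_stem are hypothetical, and this is a simplified illustration, not the actual Porter algorithm, which has many more rules and conditions.

```python
# Toy suffix-stripping stemmer: a hypothetical illustration, NOT the Porter algorithm.
# Each rule is (suffix to match, replacement); the first matching rule wins.
TOY_RULES = [
    ("sses", "ss"),  # bosses -> boss
    ("ement", ""),   # replacement -> replac
]


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


print(toy_stem("bosses"))       # boss
print(toy_stem("replacement"))  # replac
print(toy_stem("walk"))         # walk (no rule applies)
```

Notice that the function only manipulates characters. It has no idea whether the result is a real word.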

41
00:03:32,270 --> 00:03:35,180
Also, note that stemming is not just one method.

42
00:03:35,540 --> 00:03:37,280
There are multiple kinds of stemmers.

43
00:03:37,760 --> 00:03:43,220
The most popular algorithm is the Porter stemmer, which can be found in NLTK and other libraries

44
00:03:43,220 --> 00:03:43,850
as well.

45
00:03:44,300 --> 00:03:47,240
So if you'd like to use it, here is some sample code.

46
00:03:47,870 --> 00:03:50,690
First, you want to import the PorterStemmer class.

47
00:03:51,290 --> 00:03:54,290
The next step is to create an instance of the PorterStemmer.

48
00:03:54,980 --> 00:03:58,070
At this point, you can pass in any token you want to stem.

49
00:03:58,520 --> 00:04:03,290
For example, if I call its stem function on the token walking, this will return the string walk.
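The steps just described might look roughly like this, assuming NLTK is installed. The variable name porter is an arbitrary choice for this sketch.

```python
from nltk.stem import PorterStemmer

# Create one stemmer instance and reuse it for every token.
porter = PorterStemmer()  # variable name chosen for illustration only

# stem() takes a single token, so tokenize your text first.
for token in ["walking", "walks", "walked"]:
    print(token, "->", porter.stem(token))
```

Each of these tokens should come back as walk, which is what lets them share one vector component.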

50
00:04:07,870 --> 00:04:14,530
Now that we've discussed stemming, the next step is to discuss lemmatization. As mentioned, lemmatization

51
00:04:14,530 --> 00:04:16,149
is a bit more sophisticated.

52
00:04:16,690 --> 00:04:22,690
Basically, you can think of this as a lookup table or, in other words, a dictionary. As an example,

53
00:04:22,690 --> 00:04:27,340
suppose we want to find the root of the word better. If we were to use a stemmer,

54
00:04:27,370 --> 00:04:29,770
it would simply return the same string, better.

55
00:04:30,370 --> 00:04:36,070
But if we use a lemmatizer, it will return good, which is a completely separate word in terms

56
00:04:36,070 --> 00:04:37,120
of how it's spelt.

57
00:04:37,690 --> 00:04:41,140
There's no way to manipulate the string better to become good.

58
00:04:41,410 --> 00:04:44,140
It simply must come from a database of knowledge.

59
00:04:45,580 --> 00:04:51,400
Another example is the pair of words was and is. The word was is the past tense of is,

60
00:04:51,880 --> 00:04:58,870
but these are both derivatives of the word be. A stemmer will take the word was and return wa, without

61
00:04:58,870 --> 00:05:03,040
the s, but a lemmatizer will return be in both cases.

62
00:05:04,410 --> 00:05:06,600
Another nice example is the word mice.

63
00:05:07,140 --> 00:05:12,120
This is one of those strange words that is a plural, but it doesn't simply add an s at the end of the

64
00:05:12,120 --> 00:05:13,050
original word.

65
00:05:13,620 --> 00:05:19,260
A stemmer will simply return the same word, mice, but a lemmatizer will return the word mouse.
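The lookup-table view can be sketched in a few lines of plain Python. The table TOY_LEMMAS and the function toy_lemmatize are hypothetical and cover only the examples from this lecture. A real lemmatizer draws on a much larger database plus language rules.

```python
# A tiny hand-written lemma table: hypothetical, for illustration only.
TOY_LEMMAS = {
    "better": "good",
    "was": "be",
    "is": "be",
    "mice": "mouse",
}


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; fall back to the word itself."""
    return TOY_LEMMAS.get(word.lower(), word)


for word in ["better", "was", "is", "mice", "walk"]:
    print(word, "->", toy_lemmatize(word))
```

None of these mappings can be produced by chopping characters off the end of the word, which is why the knowledge has to be stored somewhere.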

66
00:05:23,800 --> 00:05:30,250
So note that lemmatizers appear in many Python libraries, such as NLTK and spaCy. We'll look

67
00:05:30,250 --> 00:05:34,860
at NLTK since the API is a bit simpler and mirrors that of the stemmer.

68
00:05:35,920 --> 00:05:41,050
So again, we begin by importing NLTK and the WordNetLemmatizer class.

69
00:05:41,920 --> 00:05:47,110
The next step, if you haven't already done so, is to download the WordNet database, which the

70
00:05:47,110 --> 00:05:48,880
WordNetLemmatizer makes use of.

71
00:05:49,780 --> 00:05:55,960
Once we create an object of type WordNetLemmatizer, we can then call its lemmatize function to convert

72
00:05:55,960 --> 00:05:58,870
one word at a time into its corresponding lemma.

73
00:06:00,430 --> 00:06:05,650
One important thing to recognize is that although it's not required, the lemmatize function takes in

74
00:06:05,650 --> 00:06:10,210
one extra argument called pos, which stands for part of speech.

75
00:06:10,840 --> 00:06:15,610
Basically, this signifies whether the word is a noun, verb, adjective and so forth.

76
00:06:16,180 --> 00:06:20,290
The default is noun, but note that this does not work for all cases.

77
00:06:20,770 --> 00:06:26,770
For example, if you pass in the word going by itself, the lemmatizer will return going, which is

78
00:06:26,770 --> 00:06:27,850
not what we want.

79
00:06:28,510 --> 00:06:31,000
Also note that the word going is not a noun.

80
00:06:31,810 --> 00:06:37,030
Instead, we should specify that it's a verb, in which case the word go will be returned.

81
00:06:38,000 --> 00:06:43,310
So although the part-of-speech argument is not required, it is strongly recommended to make use of

82
00:06:43,310 --> 00:06:43,700
it.

83
00:06:48,280 --> 00:06:52,900
So why might we want to do something different based on the part of speech?

84
00:06:53,470 --> 00:06:55,180
Well, let's consider this sentence.

85
00:06:55,600 --> 00:06:57,670
Donald Trump has a devoted following.

86
00:06:58,420 --> 00:07:00,250
And let's consider another sentence.

87
00:07:00,550 --> 00:07:04,930
The cat was following the bird as it flew by. In the first case,

88
00:07:04,960 --> 00:07:08,470
following is a noun, in which case that is its root form.

89
00:07:09,190 --> 00:07:13,810
In the second case, following is a verb, in which case its root form is follow.

90
00:07:14,620 --> 00:07:20,200
Thus, we have seen that the root form of a word can be dependent on its part of speech.

91
00:07:24,850 --> 00:07:30,440
Now, there is one weird quirk about NLTK, which essentially just wraps another library to do

92
00:07:30,460 --> 00:07:35,650
lemmatization. As we just saw, in order to properly use the WordNetLemmatizer,

93
00:07:35,950 --> 00:07:39,160
we need to first do part-of-speech tagging on our sentence.

94
00:07:39,760 --> 00:07:45,370
So you may have assumed that you can just use NLTK to do part-of-speech tagging, which you can,

95
00:07:45,370 --> 00:07:49,600
and then you can pass those part-of-speech tags into the lemmatize function.

96
00:07:50,920 --> 00:07:56,980
But unfortunately, the part-of-speech tags returned by NLTK's tagger are not compatible

97
00:07:57,220 --> 00:07:58,690
with the lemmatize function.

98
00:07:59,380 --> 00:08:02,260
Instead, they both use different sets of tags.

99
00:08:02,800 --> 00:08:08,560
And so one way to get them to work together is to map the tags from the part-of-speech tagger into

100
00:08:08,560 --> 00:08:11,140
a form that's acceptable to the lemmatize function.

101
00:08:11,860 --> 00:08:16,690
We'll go through the exact details in a notebook, but this is something that you have to know if you're

102
00:08:16,690 --> 00:08:18,610
going to use the lemmatizer correctly.
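One common way to do this mapping is sketched below. It assumes the tagger returns Penn Treebank tags (such as NN, VBG, JJ) and that the lemmatizer expects the single-letter WordNet codes n, v, a, and r. The function name treebank_to_wordnet is hypothetical, and the tagged sentence is written by hand rather than produced by a tagger, so this runs without NLTK.

```python
def treebank_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


# Hand-written (token, tag) pairs standing in for tagger output.
tagged = [
    ("The", "DT"), ("cat", "NN"), ("was", "VBD"),
    ("following", "VBG"), ("the", "DT"), ("bird", "NN"),
]

for token, tag in tagged:
    print(token, tag, "->", treebank_to_wordnet(tag))
```

Each mapped code can then be passed as the pos argument when lemmatizing the corresponding token.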

103
00:08:23,210 --> 00:08:29,150
Now, the final topic I want to discuss in this lecture is the application of stemming and lemmatization

104
00:08:29,390 --> 00:08:31,070
to real world scenarios.

105
00:08:31,760 --> 00:08:36,470
You may believe that these techniques are outdated because you can just leave all your words as is,

106
00:08:36,710 --> 00:08:41,929
and use a deep neural network to do any NLP task, but you would be incorrect.

107
00:08:42,890 --> 00:08:48,410
This is due to a bit of shortsightedness and lack of awareness of where NLP is used in the industry.

108
00:08:49,040 --> 00:08:55,430
So if you think that generating text like GPT or doing sentiment analysis and spam detection are

109
00:08:55,430 --> 00:08:57,350
the only applications of NLP,

110
00:08:57,710 --> 00:08:59,060
this is simply not true.

111
00:08:59,750 --> 00:09:03,980
So here are some billion dollar industries where these techniques are used.

112
00:09:05,680 --> 00:09:09,100
Some examples are search engines and document retrieval,

113
00:09:09,340 --> 00:09:12,280
online ads, and social media tags.

114
00:09:14,040 --> 00:09:19,740
As an exercise, I recommend thinking about how stemming and lemmatization could be applied in these

115
00:09:19,740 --> 00:09:20,700
scenarios.

116
00:09:25,410 --> 00:09:29,310
The first example, which I briefly mentioned earlier, is search engines.

117
00:09:29,790 --> 00:09:34,170
In fact, search is how Google became one of the largest tech companies in the world.

118
00:09:34,770 --> 00:09:40,560
As you recall, when a user enters a query, we don't want to only include exact matches because then

119
00:09:40,560 --> 00:09:43,080
we wouldn't get back all the relevant results.

120
00:09:43,710 --> 00:09:49,830
Instead, by converting all the terms into their root form, we can search through more possible matches.

121
00:09:50,580 --> 00:09:55,530
Also, keep in mind that the users who enter these search terms in the first place won't enter them

122
00:09:55,530 --> 00:09:58,620
the same way, even though they might want the same thing.

123
00:09:59,340 --> 00:10:02,790
So one person might type running while another might type ran.

124
00:10:03,300 --> 00:10:06,090
But the meaning of these two words is essentially the same.

125
00:10:06,780 --> 00:10:11,790
We want the search engine to respond to what the user means, not necessarily what they typed.

126
00:10:12,510 --> 00:10:15,570
Furthermore, we want to make things easy for our users.

127
00:10:15,990 --> 00:10:20,750
Imagine if the user had to type in all the different variations of each word: run,

128
00:10:20,790 --> 00:10:25,380
running, runs, ran, and so forth, just to get all the relevant results.

129
00:10:25,800 --> 00:10:30,540
That would take a lot of time, and users would not want to use a search engine that requires them to

130
00:10:30,540 --> 00:10:31,470
do so much work.
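As a rough sketch of this idea, the snippet below converts both the query and the documents to root forms before matching. The lemma table, the documents, and the function names are all hypothetical, and a real search engine is far more involved than this.

```python
# Hypothetical lemma table covering only the variations of "run".
TOY_LEMMAS = {"running": "run", "runs": "run", "ran": "run"}


def normalize(text: str) -> set:
    """Lowercase, split on whitespace, and map each token to its root form."""
    return {TOY_LEMMAS.get(token, token) for token in text.lower().split()}


def search(query: str, documents: list) -> list:
    """Return documents whose root forms contain every root form in the query."""
    query_terms = normalize(query)
    return [doc for doc in documents if query_terms <= normalize(doc)]


documents = [
    "here is a picture of me running",
    "this is me when i ran to the park last week",
    "i run to the office on monday",
    "send flowers today",
]

for doc in search("ran", documents):
    print(doc)
```

With exact matching, the query ran would only find the second document. After converting everything to the root run, it finds the first three.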

131
00:10:36,300 --> 00:10:39,690
Here's another example, which is closely related to Google search.

132
00:10:40,140 --> 00:10:41,850
And this is online advertising.

133
00:10:42,690 --> 00:10:48,270
Although you may know Google because of its famous search engine, it is in fact mainly an

134
00:10:48,270 --> 00:10:51,420
ad company, as that is where it makes most of its revenue.

135
00:10:52,140 --> 00:10:53,610
So how do ads work?

136
00:10:54,240 --> 00:10:56,190
Well, ads are all based on keywords.

137
00:10:57,030 --> 00:11:01,560
You can think of keywords as the relevant terms that a user types into the search box.

138
00:11:02,010 --> 00:11:06,960
If you are an advertiser, then you want to match your ads to those search terms.

139
00:11:07,500 --> 00:11:12,420
For example, suppose you are working for Apple and your job is to run ads for the iPhone.

140
00:11:13,110 --> 00:11:17,430
Now, let's keep in mind that you have to pay Google every time they show one of your ads.

141
00:11:17,910 --> 00:11:22,800
So do you want Google to show your ad when someone types "send flowers" or "Tesla Model 3"?

142
00:11:23,400 --> 00:11:27,510
The answer is no, because iPhone is irrelevant to those search terms.

143
00:11:28,050 --> 00:11:33,450
Instead, you want your ad for the iPhone to show up when users are searching for phones or other

144
00:11:33,450 --> 00:11:34,620
related items.

145
00:11:35,130 --> 00:11:40,110
So perhaps when they are searching for the Google Pixel or a Samsung Galaxy, that is when you want

146
00:11:40,110 --> 00:11:41,010
to show your ad.

147
00:11:41,580 --> 00:11:44,460
Of course, these keywords can have many variations.

148
00:11:44,940 --> 00:11:47,280
Again, imagine you are selling running shoes.

149
00:11:47,610 --> 00:11:49,760
You don't only want to cover the word running.

150
00:11:49,770 --> 00:11:53,400
You also want to cover words like run, runs, and so forth.

151
00:11:53,970 --> 00:11:57,900
So again, stemming and lemmatization are useful in this scenario.