1
00:00:11,090 --> 00:00:16,850
So in this lecture, we will be doing a short demo in Colab to see for ourselves some of the differences

2
00:00:16,850 --> 00:00:23,060
between stemming and lemmatization. Note that many of these are the same examples we've just discussed.

3
00:00:23,360 --> 00:00:27,620
So if you feel you already have a good understanding, then please feel free to move on.

4
00:00:28,160 --> 00:00:33,380
This lecture should help drive the points home, since this is real code and you don't have to believe

5
00:00:33,380 --> 00:00:34,340
anything on faith.

6
00:00:36,120 --> 00:00:42,120
OK, so we'll start by importing NLTK. Take note that in this notebook, we will be importing things

7
00:00:42,120 --> 00:00:42,840
as needed.

8
00:00:43,140 --> 00:00:46,140
But in practice, your imports will typically go at the top.

9
00:00:46,650 --> 00:00:51,570
So just be aware of this stylistic difference, which is only for the purpose of this lecture.

10
00:00:56,240 --> 00:00:58,940
The next step is to import the PorterStemmer class.

11
00:01:04,450 --> 00:01:09,100
Now that we have the PorterStemmer class, we can create an instance, which we will call porter.

12
00:01:14,380 --> 00:01:17,920
The next step is to use the Porter stemmer to stem the word walking.

13
00:01:22,960 --> 00:01:25,870
As you can see, the stem for this word is walk.

14
00:01:29,360 --> 00:01:33,230
Now, let's try walked, which is the past tense version of walk.

15
00:01:36,600 --> 00:01:39,690
Again, we get walk, which is what one might expect.

16
00:01:43,260 --> 00:01:46,230
Now, let's try walks, which is the plural of walk.

17
00:01:50,070 --> 00:01:51,630
So again, we get walk.

18
00:01:54,400 --> 00:01:56,710
Now, let's try something a bit more complex.

19
00:01:57,250 --> 00:01:59,650
The word ran is the past tense of run.

20
00:02:03,540 --> 00:02:07,980
OK, so as you can see, there is no rule for this particular sequence of letters.

21
00:02:08,310 --> 00:02:09,780
The result is still ran.

22
00:02:10,590 --> 00:02:14,250
This could be an issue since one might think of the root word as run.

23
00:02:17,080 --> 00:02:18,220
Now, let's try running.

24
00:02:21,450 --> 00:02:24,660
OK, so in this case, we get run, which makes sense.

25
00:02:27,170 --> 00:02:32,300
Now, let's try the word bosses, which is a plural, but ends with "es" instead of just "s".

26
00:02:36,080 --> 00:02:41,240
OK, so as you can see, this turns out to give us back a real word, which is the singular form

27
00:02:41,240 --> 00:02:41,900
boss.

28
00:02:45,500 --> 00:02:47,180
Now, let's try the word replacement.

29
00:02:50,660 --> 00:02:53,600
OK, so this time we do not get back a real word.
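
The single-word calls described so far might look roughly like the following. This is a minimal sketch, assuming NLTK is installed; the names words and stems are illustrative and not necessarily what the notebook uses.

```python
from nltk.stem import PorterStemmer

# Create one stemmer instance and reuse it for every word.
porter = PorterStemmer()

# The words tried in the lecture so far.
words = ["walking", "walked", "walks", "ran", "running", "bosses", "replacement"]

# stem() takes a single word and returns its stem as a string.
stems = {word: porter.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

As stated in the lecture, walking, walked and walks all become walk, ran stays ran, running becomes run, and bosses becomes boss.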

30
00:02:57,280 --> 00:03:03,580
OK, so in this next block, we are going to stem an entire sentence: "Lemmatization is more sophisticated

31
00:03:03,580 --> 00:03:04,360
than stemming."

32
00:03:06,490 --> 00:03:12,430
So one thing to be mindful of is that the Porter stemmer stems a single word at a time, so if you have

33
00:03:12,430 --> 00:03:15,520
a sentence of words, you don't pass in the whole sentence.

34
00:03:15,760 --> 00:03:19,240
But instead, you split the sentence into single words first.

35
00:03:19,900 --> 00:03:23,260
As you can see, we've called the split function to do just that.

36
00:03:29,300 --> 00:03:34,550
The next step is to loop over each token in the sentence and then call the stem function on each token.

37
00:03:35,360 --> 00:03:41,150
We'll also print out each stem ending with a space so that the entire printout fits in a single line.

38
00:03:46,470 --> 00:03:51,520
OK, so as you can see, several of the words in the above sentence have been transformed.

39
00:03:52,050 --> 00:03:57,990
We can see that lemmatization, sophisticated, and stemming have all been reduced to a simpler form.

40
00:03:58,770 --> 00:04:01,620
Again, note that not all of these are real words.
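
The split-then-stem loop just described could be sketched like this. It assumes NLTK is installed; the variable names are illustrative, and the stems list is added here only so the results can be inspected afterwards.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# The stemmer works on one word at a time, so split the sentence first.
sentence = "Lemmatization is more sophisticated than stemming".split()

stems = []
for token in sentence:
    stem = porter.stem(token)
    stems.append(stem)
    # end=" " keeps the whole printout on a single line.
    print(stem, end=" ")
print()
```

Note that split only breaks on whitespace, so punctuation attached to a word would stay attached to the token.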

41
00:04:05,340 --> 00:04:08,100
So here's one interesting rule in the Porter stemmer.

42
00:04:08,700 --> 00:04:11,580
Let's see what happens if we stem the word unnecessary.

43
00:04:15,600 --> 00:04:21,720
So as you can see, the word didn't get shorter, but the final y was replaced with an i. Also note

44
00:04:21,720 --> 00:04:24,840
that the prefix un at the front of the word was not removed.

45
00:04:25,530 --> 00:04:30,810
This would make sense in terms of doing NLP, since although the word necessary is kind of the root

46
00:04:30,810 --> 00:04:32,280
of the word unnecessary,

47
00:04:32,490 --> 00:04:38,280
they have opposite meanings, so we wouldn't want to treat them the same way in any downstream task.

48
00:04:38,820 --> 00:04:42,300
In other words, it helps us that they would remain distinct words.

49
00:04:45,300 --> 00:04:47,010
Let's try this again on the word berry.

50
00:04:50,710 --> 00:04:54,130
So again, we see that the final y is replaced with an i.

51
00:04:57,540 --> 00:05:00,210
The next step is to import the WordNetLemmatizer.

52
00:05:06,520 --> 00:05:11,740
Now, as you recall, lemmatization essentially amounts to looking things up in a database.

53
00:05:12,250 --> 00:05:19,750
Our database in this case is part of NLTK's WordNet package, which we can download by calling nltk.download.

54
00:05:20,950 --> 00:05:24,700
So you might be wondering what would happen if we did not include this code.

55
00:05:24,820 --> 00:05:29,560
For example, you might forget to include it. This is entirely possible, by the way.

56
00:05:30,010 --> 00:05:34,930
For example, if you installed it a long time ago, you'd probably just forget.

57
00:05:35,500 --> 00:05:40,390
And then when you install your code on a new machine, you'll get an error which will remind you to

58
00:05:40,390 --> 00:05:41,710
download this database.

59
00:05:42,340 --> 00:05:47,140
Now, luckily, NLTK is pretty smart, and it knows what you need to download when you call certain

60
00:05:47,140 --> 00:05:47,860
functions.

61
00:05:48,340 --> 00:05:49,960
So suppose you forgot this?

62
00:05:50,260 --> 00:05:55,600
Well, then you get an error saying to run this line. In effect, you don't have to worry too much about

63
00:05:55,600 --> 00:05:58,930
forgetting this, because NLTK will remind you anyway.

64
00:06:04,700 --> 00:06:09,050
So the next step is to import the wordnet module from nltk.corpus.

65
00:06:09,530 --> 00:06:11,360
We'll see how this is used very shortly.

66
00:06:16,310 --> 00:06:20,240
The next step is to instantiate an object of type WordNetLemmatizer.

67
00:06:25,250 --> 00:06:28,310
Now that we have our lemmatizer, we can test it on words.

68
00:06:28,820 --> 00:06:31,460
So let's start with walking as we did before.

69
00:06:36,310 --> 00:06:40,970
OK, so interestingly, the result is still walking. As you recall,

70
00:06:40,990 --> 00:06:47,560
this is because the lemmatize function takes in an argument for part of speech, which by default is noun.

71
00:06:48,550 --> 00:06:52,630
Walking is not a noun, and this is why the word remains unchanged.

72
00:06:56,030 --> 00:07:01,730
Now, let's try walking again, but this time, let's correctly specify the part of speech as a verb.

73
00:07:02,450 --> 00:07:05,470
Note that these are constants which come from the wordnet module.

74
00:07:09,600 --> 00:07:13,350
OK, so this time the result is correctly transformed into walk.

75
00:07:17,000 --> 00:07:18,500
Now, let's try the word going.

76
00:07:22,240 --> 00:07:25,900
So again, this remains the same because going is a noun.

77
00:07:28,740 --> 00:07:32,460
Now, let's try going again, but this time we'll set pos to verb.

78
00:07:36,890 --> 00:07:39,650
As expected, the lemma of going is go.

79
00:07:42,750 --> 00:07:48,900
Now, let's try the word ran again, setting pos to verb. As you recall, we tried this earlier with

80
00:07:48,900 --> 00:07:51,360
the stemmer, which just gave us back ran.

81
00:07:55,490 --> 00:07:59,960
OK, so as you can see, the lemmatizer is a bit smarter and returns run.

82
00:08:03,650 --> 00:08:06,590
Now, let's try the word mice, which is the plural of mouse.

83
00:08:07,250 --> 00:08:08,870
Notice that we are using the stemmer.

84
00:08:13,020 --> 00:08:15,990
So this returns mice, which is not unexpected.

85
00:08:19,400 --> 00:08:21,470
Now, let's try mice with the lemmatizer.

86
00:08:25,640 --> 00:08:27,890
So this returns mouse, which is correct.

87
00:08:28,610 --> 00:08:32,150
Note that this would be difficult to do based on spelling alone.

88
00:08:32,390 --> 00:08:35,990
We need to be told that mouse is the singular form of mice.

89
00:08:39,419 --> 00:08:41,640
Now, let's try the word was with the stemmer.

90
00:08:45,710 --> 00:08:49,640
OK, so this gets converted to wa, which is not a real word.

91
00:08:50,060 --> 00:08:52,700
It's possible that the stemmer thinks this is a plural.

92
00:08:56,150 --> 00:09:01,160
Now, let's try the word was with the lemmatizer. Note that was is a verb.

93
00:09:04,580 --> 00:09:10,220
So this gives us back the word be, which makes sense. Again, notice how this is the kind of thing

94
00:09:10,220 --> 00:09:12,680
that cannot be inferred by spelling alone.

95
00:09:13,040 --> 00:09:16,100
We need to be given these rules in the form of a database.

96
00:09:18,920 --> 00:09:21,170
Now, let's try the word is with the stemmer.

97
00:09:24,600 --> 00:09:27,300
So we get back is, which is the same word.

98
00:09:30,170 --> 00:09:34,940
Now, let's try is with the lemmatizer. Again, note that is is a verb.

99
00:09:38,620 --> 00:09:44,380
So again, we get back the word be, which makes sense since was is just the past tense of is.

100
00:09:47,470 --> 00:09:49,240
Now, let's try stemming the word better.

101
00:09:52,490 --> 00:09:57,140
OK, so there is no shorter stem for this word, and we simply get back the same word.

102
00:10:00,020 --> 00:10:03,980
Now, let's try to lemmatize better. Note that this is an adjective.

103
00:10:07,320 --> 00:10:13,380
So this time we get back good, which makes sense. As you recall, the word better essentially means

104
00:10:13,380 --> 00:10:14,190
more good.

105
00:10:18,270 --> 00:10:24,090
OK, so it seems like lemmatization is of limited use because we have to make sure to input the correct

106
00:10:24,090 --> 00:10:25,080
parts of speech.

107
00:10:25,680 --> 00:10:27,720
The question is how can we do this?

108
00:10:28,080 --> 00:10:31,050
We can't manually enter the parts of speech for every word.

109
00:10:31,650 --> 00:10:33,960
Instead, we would like to do this automatically.

110
00:10:34,950 --> 00:10:39,750
Luckily, parts of speech tagging is another one of the fundamental NLP tasks.

111
00:10:40,110 --> 00:10:43,980
And luckily, this functionality is included in NLTK.

112
00:10:44,940 --> 00:10:50,760
The problem is that the parts of speech returned by the parts of speech tagger are not compatible with

113
00:10:50,760 --> 00:10:52,290
the input to the lemmatizer.

114
00:10:53,100 --> 00:10:56,700
Therefore, we'll need a function that will convert from one form to another.

115
00:10:57,450 --> 00:11:02,010
So that's what this function below does, which is essentially just a bunch of if statements.

116
00:11:02,640 --> 00:11:07,230
It should be pretty obvious what it's doing, but if you have any questions, please feel free to ask

117
00:11:07,230 --> 00:11:08,310
them on the Q&A.
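
A conversion function of the kind described here could be sketched as follows. The function name get_wordnet_pos is hypothetical, and the sketch assumes the tagger returns Penn Treebank tags. It uses plain one-letter strings, which are the values behind NLTK's wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV constants, so that it runs without any downloads. The tagged list is hand-written for illustration and is not real tagger output.

```python
# One-letter codes matching wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"


def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if treebank_tag.startswith("J"):  # JJ, JJR, JJS: adjectives
        return ADJ
    if treebank_tag.startswith("V"):  # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return VERB
    if treebank_tag.startswith("N"):  # NN, NNS, NNP, NNPS: nouns
        return NOUN
    if treebank_tag.startswith("R"):  # RB, RBR, RBS: adverbs
        return ADV
    # Assumption: fall back to noun, the lemmatizer's default.
    return NOUN


# Illustrative (word, tag) pairs in the shape a tagger returns.
tagged = [("has", "VBZ"), ("a", "DT"), ("devoted", "VBN"), ("following", "NN")]
converted = [(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(converted)
```

Each converted code can then be passed as the pos argument when lemmatizing the corresponding word.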

118
00:11:13,600 --> 00:11:18,010
So the next step is to download the appropriate package for the parts of speech tagger.

119
00:11:18,490 --> 00:11:21,190
This happens to be the averaged perceptron tagger.

120
00:11:26,480 --> 00:11:32,600
The next step is to instantiate a sentence. I've chosen the sentence "Donald Trump has a devoted following."

121
00:11:33,410 --> 00:11:38,390
As you recall, we use this sentence because it's an example of where we use the word following as a

122
00:11:38,390 --> 00:11:40,130
noun instead of a verb.

123
00:11:41,030 --> 00:11:44,540
Note that we also split the sentence into individual words.

124
00:11:48,460 --> 00:11:50,980
The next step is to run the parts of speech tagger.

125
00:11:51,760 --> 00:11:54,370
Note that this returns a list containing tuples.

126
00:11:54,820 --> 00:11:59,290
Each tuple contains two items, which are the word and the corresponding tag.

127
00:12:03,790 --> 00:12:07,420
OK, so as you can see, we get back exactly what I described.

128
00:12:07,930 --> 00:12:09,730
Note that following is a noun.

129
00:12:14,070 --> 00:12:18,990
The next step is to run our lemmatizer on each token, passing in the tags we just got back.

130
00:12:19,680 --> 00:12:23,880
Note that we convert the tags first, using the function we defined above.

131
00:12:28,980 --> 00:12:31,770
OK, so we can see that two words are now different.

132
00:12:32,220 --> 00:12:39,090
The word has has been lemmatized to have, and the word devoted has been lemmatized to devote. Note that

133
00:12:39,090 --> 00:12:41,280
the word following has not been reduced.

134
00:12:44,470 --> 00:12:50,050
OK, so now we're going to try our other sentence, which is "The cat was following the bird as it flew

135
00:12:50,050 --> 00:12:50,650
by."

136
00:12:55,700 --> 00:12:58,700
Again, we'll begin by getting the parts of speech tags.

137
00:13:02,920 --> 00:13:06,220
As you can see, in this example, following is a verb.

138
00:13:10,770 --> 00:13:13,740
The next step is to lemmatize each word in our sentence.

139
00:13:17,300 --> 00:13:21,170
So we see that in this case, following has been reduced to follow.