1
00:00:11,550 --> 00:00:15,990
So in this lecture, we'll be looking at how to do part-of-speech tagging in TensorFlow.

2
00:00:16,800 --> 00:00:22,290
Now, just in case you don't know what this task is, all you have to know is the concept of nouns,

3
00:00:22,290 --> 00:00:24,180
verbs, adjectives and so forth.

4
00:00:24,810 --> 00:00:29,790
If you've never heard of those, take a few minutes to look it up, as it's pretty simple to understand.

5
00:00:31,980 --> 00:00:37,470
So the major difference between this task and the previous task is the type of RNN we need to

6
00:00:37,470 --> 00:00:38,070
use.

7
00:00:38,790 --> 00:00:43,590
For the previous example, we wanted to classify a document into a single class.

8
00:00:44,130 --> 00:00:50,010
This is called a many to one task because we have many items in our sequence of inputs, but only one

9
00:00:50,010 --> 00:00:51,030
item as output.

10
00:00:51,720 --> 00:00:53,220
This task is different.

11
00:00:53,790 --> 00:01:00,240
This task is called a many to many task because there are many items in the sequence of inputs and many

12
00:01:00,240 --> 00:01:02,640
items in the corresponding sequence of outputs.

13
00:01:03,240 --> 00:01:06,870
For each input word, we should have a corresponding part of speech tag.

14
00:01:07,530 --> 00:01:09,930
Note that this is still a classification task.

15
00:01:10,260 --> 00:01:13,560
It's simply that we now have a sequence of classes as the target.

16
00:01:15,760 --> 00:01:22,210
So let's begin by importing a few modules from NLTK, since it conveniently keeps a copy of a parts

17
00:01:22,210 --> 00:01:23,320
of speech data set.

18
00:01:23,920 --> 00:01:26,170
We'll be using a dataset called the Brown Corpus.

19
00:01:33,640 --> 00:01:37,780
The next step is to download the data using NLTK's download method.

20
00:01:43,330 --> 00:01:49,030
The next step is to retrieve the corpus from the module we imported by calling the tagged_sents function.

21
00:01:49,780 --> 00:01:55,570
Note that we specify the tagset as universal, which will give us slightly different targets than the default.

22
00:02:00,590 --> 00:02:04,580
The next step is to simply print the data set so that you know what it looks like.

23
00:02:09,000 --> 00:02:16,110
So as you can see, it's essentially a list of lists of tuples. Each list within this list is a sentence

24
00:02:16,530 --> 00:02:23,850
represented as a list of tuples, and each tuple contains two items: a word and the corresponding tag.

25
00:02:30,930 --> 00:02:36,540
The next step is to put our data set into a format that's easier to work with by separating the inputs

26
00:02:36,540 --> 00:02:42,510
and targets. The output of this will be two lists, which basically have the same structure as what we

27
00:02:42,510 --> 00:02:47,850
just saw. For the inputs, we'll have a list of lists of words. For the targets,

28
00:02:47,850 --> 00:02:49,980
we'll have a list of lists of tags.
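
A minimal sketch of this separation step, in plain Python. The two-sentence `tagged_sentences` list is a made-up stand-in for the real corpus, and the helper name `split_inputs_targets` is hypothetical; only the list-of-lists-of-tuples structure comes from the lecture.

```python
# Toy stand-in for the corpus: a list of sentences, each a list of (word, tag) tuples.
tagged_sentences = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", ".")],
    [("I", "PRON"), ("will", "VERB"), ("bill", "VERB"), ("Bill", "NOUN"), (".", ".")],
]


def split_inputs_targets(corpus):
    """Turn a list of tagged sentences into parallel lists of words and tags."""
    inputs, targets = [], []
    for sentence in corpus:
        inputs.append([word for word, _ in sentence])
        targets.append([tag for _, tag in sentence])
    return inputs, targets


inputs, targets = split_inputs_targets(tagged_sentences)
print(inputs[0])   # the words of the first sentence
print(targets[0])  # the tags of the first sentence, in the same order
```

Each inner list of words has exactly the same length as its list of tags, which is the alignment the rest of the pipeline has to preserve.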

29
00:02:56,630 --> 00:03:02,360
The next step is to do some additional imports, specifically for TensorFlow, NumPy and so forth.

30
00:03:08,890 --> 00:03:11,800
The next step is to split our data into train and test.

31
00:03:18,840 --> 00:03:24,660
The next step is to convert our sentences into sequences of integers. For this data set,

32
00:03:24,690 --> 00:03:29,460
we'll keep the max vocab size as None so that we get a unique token for each word.

33
00:03:30,300 --> 00:03:34,830
I've also set lower to False, which means that we will not lowercase the words.

34
00:03:36,180 --> 00:03:41,530
This might be important since a capitalized word might have a different meaning than a lowercase word.

35
00:03:42,270 --> 00:03:44,490
As an example, consider the word bill.

36
00:03:44,970 --> 00:03:50,910
This might be a name which is a noun, or it may be a verb like I'll bill you for the service.

37
00:03:54,750 --> 00:04:00,180
Now, there's a very important argument here, which is the out-of-vocabulary token. I've set this

38
00:04:00,180 --> 00:04:03,870
string to UNK, which obviously stands for unknown.

39
00:04:05,070 --> 00:04:06,720
So why is this so important?

40
00:04:07,530 --> 00:04:10,380
Well, remember that this is a many to many task.

41
00:04:11,070 --> 00:04:18,089
Recall the default behavior of the Tokenizer object: if it encounters any word that is not in its vocabulary,

42
00:04:18,450 --> 00:04:21,990
it simply ignores that word and doesn't convert it to an integer.

43
00:04:22,740 --> 00:04:27,510
In this case, that would mean that some word showed up in the test set that wasn't in the train set.

44
00:04:28,290 --> 00:04:29,850
But what would be the result?

45
00:04:30,600 --> 00:04:35,850
Well, being a many to many task, we're going to also have a sequence of tags as the target.

46
00:04:37,680 --> 00:04:43,620
The problem is that if we ignore unknown words, the input sequence length will end up being shorter

47
00:04:43,860 --> 00:04:45,360
than the target sequence length.

48
00:04:46,140 --> 00:04:49,980
Now what's really bad is that this bug is completely silent.

49
00:04:50,430 --> 00:04:54,690
You could run this code and get no errors, but your performance would be bad.

50
00:04:55,350 --> 00:04:59,190
The reason it won't throw any errors is because we'll be padding our sequences.

51
00:04:59,790 --> 00:05:05,100
So even if the inputs and targets have different lengths, after we add padding they will all have the

52
00:05:05,100 --> 00:05:05,820
same length.

53
00:05:06,510 --> 00:05:11,760
Of course, that won't lead to runtime errors, but it will mean that your inputs are no longer aligned

54
00:05:11,760 --> 00:05:13,320
with the corresponding targets.

55
00:05:13,890 --> 00:05:19,410
In other words, some of those inputs will be assigned to the wrong tag, so it's important to keep

56
00:05:19,410 --> 00:05:21,930
all words, even those that are unknown.
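
To make the misalignment concrete, here is a small sketch using a toy tokenizer written in plain Python. It is not the Keras Tokenizer; `build_vocab`, `to_sequence` and the example sentences are hypothetical and only imitate the "drop unknown words" versus "map unknown words to UNK" behavior described above.

```python
def build_vocab(train_sentences, oov_token=None):
    """Map each training word to an integer, starting at 1 (0 is kept for padding)."""
    word2idx = {}
    if oov_token is not None:
        word2idx[oov_token] = 1
    for sentence in train_sentences:
        for word in sentence:
            if word not in word2idx:
                word2idx[word] = len(word2idx) + 1
    return word2idx


def to_sequence(sentence, word2idx, oov_token=None):
    """Convert words to integers; unknown words are dropped unless oov_token is given."""
    seq = []
    for word in sentence:
        if word in word2idx:
            seq.append(word2idx[word])
        elif oov_token is not None:
            seq.append(word2idx[oov_token])
        # Otherwise the unknown word is silently skipped.
    return seq


train = [["the", "dog", "barks"]]
test_sentence = ["the", "cat", "barks"]   # "cat" never appears in train
test_tags = ["DET", "NOUN", "VERB"]

without_oov = to_sequence(test_sentence, build_vocab(train))
with_oov = to_sequence(test_sentence, build_vocab(train, "UNK"), "UNK")

print(len(without_oov), len(test_tags))  # input is shorter than the targets
print(len(with_oov), len(test_tags))     # input and targets stay aligned
```

Without the unknown token, the integer for "barks" ends up in the second position, where the target is the tag for "cat". Nothing crashes, which is exactly why the bug is silent.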

57
00:05:29,980 --> 00:05:33,610
The next step is to get our word to index mapping and our vocab size.

58
00:05:38,930 --> 00:05:44,570
The next step is to define a flatten function, which will convert a list of lists into a single list.

59
00:05:45,140 --> 00:05:46,700
We'll be using this in the next step.

60
00:05:52,360 --> 00:05:59,110
The next step is to basically ensure that all the targets appear in both the train and test sets. So we'll

61
00:05:59,110 --> 00:06:04,180
first flatten the train targets into a single list and then convert that list to a set.

62
00:06:08,710 --> 00:06:10,930
OK, so this is our set of train targets.

63
00:06:14,490 --> 00:06:16,770
Now, let's do the same thing with the test targets.

64
00:06:20,890 --> 00:06:22,780
So this is our set of test targets.

65
00:06:23,200 --> 00:06:24,640
They appear to be the same.

66
00:06:28,210 --> 00:06:31,930
Now, just to be very sure, let's check whether or not these sets are equal.

67
00:06:35,640 --> 00:06:41,340
OK, so these sets are equal, meaning that all targets show up in both the train set and the test set.
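
A minimal sketch of the flatten-and-compare check, assuming tiny made-up target lists in place of the real train and test splits:

```python
def flatten(list_of_lists):
    """Convert a list of lists into a single flat list."""
    return [item for sublist in list_of_lists for item in sublist]


# Made-up targets standing in for the real train and test splits.
train_targets = [["DET", "NOUN", "VERB"], ["NOUN", "VERB"]]
test_targets = [["NOUN", "VERB", "DET"]]

train_tag_set = set(flatten(train_targets))
test_tag_set = set(flatten(test_targets))

# True means every tag shows up in both splits.
print(train_tag_set == test_tag_set)
```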

68
00:06:45,460 --> 00:06:49,510
The next step is to convert the targets to sequences of integers as well.

69
00:06:50,260 --> 00:06:54,640
Luckily, we already have a tool that can do this, which is the Tokenizer class.

70
00:06:55,600 --> 00:07:00,190
One small disadvantage of using this is that the class zero will just never be used.

71
00:07:00,610 --> 00:07:06,910
So there is one extra, unnecessary class. Note that there's no need to consider options such as max vocab

72
00:07:06,910 --> 00:07:10,090
size, downcasing or out-of-vocabulary tokens,

73
00:07:10,420 --> 00:07:13,870
since our set of targets is consistent and also pretty small.

74
00:07:19,510 --> 00:07:21,640
The next step is to pad our sequences.

75
00:07:22,300 --> 00:07:27,370
We'll first want to find the maximum length of both the train and test sets and assign that to T.

76
00:07:28,300 --> 00:07:32,110
Now you might think that this is cheating, but in fact, this is totally OK.

77
00:07:32,740 --> 00:07:38,770
This is because an RNN can actually work with arbitrary sequence lengths, and thus in the real world,

78
00:07:39,130 --> 00:07:41,860
truncating any sequences would not be necessary.

79
00:07:48,780 --> 00:07:51,120
The next step is to pad the train inputs.

80
00:07:57,520 --> 00:07:59,680
The next step is to pad the test inputs.

81
00:08:05,700 --> 00:08:07,800
The next step is to pad the train targets.

82
00:08:13,750 --> 00:08:15,880
The next step is to pad the test targets.

83
00:08:22,250 --> 00:08:24,620
The next step is to get the number of classes.

84
00:08:25,250 --> 00:08:31,010
Note that this is the number of tags plus one since zero must be counted, even though it will not be

85
00:08:31,010 --> 00:08:31,640
used.
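
The padding steps and the class count can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's code: `pad_post` is a hypothetical helper, the integer sequences and `tag2idx` are made up, and zeros are appended at the end of each sequence (post-padding), which is what the later "grab only the first part of the sequence" step implies.

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to each sequence until it has length `maxlen`."""
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


# Made-up integer sequences standing in for the tokenized data.
train_inputs_int = [[2, 3, 4], [5, 6]]
test_inputs_int = [[2, 1, 4, 7]]
train_targets_int = [[1, 2, 3], [2, 3]]
test_targets_int = [[1, 2, 3, 2]]
tag2idx = {"DET": 1, "NOUN": 2, "VERB": 3}  # tag indices start at 1

# T is the longest sequence across both the train and test sets.
T = max(len(seq) for seq in train_inputs_int + test_inputs_int)

train_inputs_pad = pad_post(train_inputs_int, T)
test_inputs_pad = pad_post(test_inputs_int, T)
train_targets_pad = pad_post(train_targets_int, T)
test_targets_pad = pad_post(test_targets_int, T)

# Number of classes: one per tag, plus the unused class 0.
K = len(tag2idx) + 1

print(T, K)
print(train_inputs_pad)
```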

86
00:08:38,860 --> 00:08:40,600
The next step is to create our model.

87
00:08:41,320 --> 00:08:45,140
Now this is a relatively straightforward model with just a few lines of code,

88
00:08:45,490 --> 00:08:46,960
like everything else we've seen.

89
00:08:47,770 --> 00:08:50,980
However, there are again some important points to consider.

90
00:08:52,030 --> 00:08:56,500
Firstly, one minor detail is that we're using a bidirectional LSTM here.

91
00:08:57,220 --> 00:08:59,800
This might sound fancy, but it's actually very simple.

92
00:09:00,430 --> 00:09:04,690
Recall that context may be used to help predict the part of speech tag for a word.

93
00:09:05,320 --> 00:09:10,330
Again, consider words like bill or milk that can have different tags, depending on their meaning.

94
00:09:11,500 --> 00:09:18,700
It may be the case that this useful context could come before or after in the sentence. As you've seen,

95
00:09:18,700 --> 00:09:23,740
RNNs only go in one direction, from the first input to the last, in order.

96
00:09:24,340 --> 00:09:27,670
So that would only incorporate past context, but not the future.

97
00:09:28,510 --> 00:09:35,020
Imagine instead that we created two LSTMs. For one of the LSTMs, we pass in the sequence as normal.

98
00:09:35,500 --> 00:09:36,720
For the other LSTM,

99
00:09:36,740 --> 00:09:38,350
we read the sequence backwards.

100
00:09:38,920 --> 00:09:41,980
In this way, we know about both the past and future.

101
00:09:42,670 --> 00:09:48,010
Once we get the hidden states for both the forward and backward LSTMs, we can simply concatenate

102
00:09:48,010 --> 00:09:50,290
them together into a single vector.

103
00:09:51,040 --> 00:09:55,540
So that's what a bidirectional LSTM is. As you can see,

104
00:09:55,570 --> 00:10:01,090
it's a very simple code modification where we just add Bidirectional around the LSTM unit.

105
00:10:03,290 --> 00:10:09,320
Also note that we set return_sequences to True, since we want all the hidden states, one for each

106
00:10:09,320 --> 00:10:14,810
input token. Each of these hidden states will then be passed through the final Dense layer to get

107
00:10:14,810 --> 00:10:16,400
a prediction for each input.

108
00:10:18,590 --> 00:10:21,710
Now, that was just minor stuff, but I said that there was a change here

109
00:10:22,040 --> 00:10:23,330
that's very important.

110
00:10:24,080 --> 00:10:29,240
If you look at the embedding, you can see that there's one additional argument here, which says

111
00:10:29,240 --> 00:10:30,500
mask_zero equals True.

112
00:10:31,190 --> 00:10:32,450
So what is this for?

113
00:10:33,080 --> 00:10:36,080
Well, there is a problem if we use our network as is.

114
00:10:36,860 --> 00:10:43,550
As you recall, this is a many to many network and we padded both our inputs and our targets so that

115
00:10:43,550 --> 00:10:44,960
they all have the same length.

116
00:10:46,160 --> 00:10:52,100
We know that this padding is just a bunch of zeros, but these zeros now appear in the target, which

117
00:10:52,100 --> 00:10:53,320
means we are trying to predict

118
00:10:53,330 --> 00:10:58,910
them. Of course, there's really no way around this because we must have the target be the same length

119
00:10:58,910 --> 00:10:59,840
as the input.

120
00:11:00,560 --> 00:11:05,630
What does matter is how we treat these zeros, and we are currently not treating them correctly.

121
00:11:06,410 --> 00:11:12,380
Effectively, what we would like to do is build our loss function such that it ignores any entries where

122
00:11:12,380 --> 00:11:17,090
the target is zero. In terms of the built-in cross-entropy method in Keras,

123
00:11:17,480 --> 00:11:18,890
there is no way to do this.

124
00:11:19,760 --> 00:11:25,850
What you could do is build a custom loss function, which I have done in other courses, but unfortunately

125
00:11:25,850 --> 00:11:29,750
the skill level required for that goes beyond the scope of this course.

126
00:11:31,670 --> 00:11:37,330
Luckily, there is a built-in method to deal with this in Keras, which is to set this argument, mask_zero,

127
00:11:37,350 --> 00:11:38,360
equal to True.

128
00:11:39,140 --> 00:11:44,960
This will automatically make it so that any inputs which are zero will be ignored throughout the network,

129
00:11:44,960 --> 00:11:46,310
including the loss function.

130
00:11:47,060 --> 00:11:51,380
This effectively does what we want without us having to build a custom loss.
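
Putting the pieces described so far together, a model along these lines might look like the sketch below. It is a reconstruction under assumptions rather than the notebook's exact code: the function name `build_tagger`, the layer sizes, the optimizer, and the choice to output logits are all guesses, while `mask_zero=True`, the `Bidirectional` wrapper, `return_sequences=True` and the final `Dense` layer come from the lecture.

```python
def build_tagger(vocab_size, num_classes, seq_len, embed_dim=32, hidden_units=32):
    """Build a many-to-many tagger: Embedding -> bidirectional LSTM -> Dense."""
    # TensorFlow is imported inside the function so that merely defining this
    # sketch does not require TensorFlow to be installed.
    import tensorflow as tf
    from tensorflow.keras import layers

    i = layers.Input(shape=(seq_len,))
    # Index 0 is padding, so the embedding needs vocab_size + 1 rows.
    # mask_zero=True marks the padded time steps so later layers can skip them.
    x = layers.Embedding(vocab_size + 1, embed_dim, mask_zero=True)(i)
    # return_sequences=True keeps one hidden state per input token.
    x = layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True))(x)
    # One score per class for every time step (logits, by assumption).
    x = layers.Dense(num_classes)(x)

    model = tf.keras.Model(i, x)
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer="adam",
        metrics=["accuracy"],
    )
    return model
```

With the names used in the lecture, this would be called roughly as `build_tagger(V, K, T)`, where V is the vocab size, K the number of classes and T the padded sequence length.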

131
00:11:52,730 --> 00:11:55,490
Now, while this is convenient, there is a downside.

132
00:11:56,000 --> 00:11:58,040
The downside is that this is very slow.

133
00:11:58,700 --> 00:12:02,930
I've tested both methods and using the custom loss function I wrote is much faster.

134
00:12:04,400 --> 00:12:09,950
But whether you use mask_zero or a custom loss, you must do one of these in order for the results to

135
00:12:09,950 --> 00:12:10,730
be correct.

136
00:12:11,390 --> 00:12:16,910
Unfortunately, setting mask_zero to True will make the fitting process go much slower than it would

137
00:12:16,910 --> 00:12:19,280
have had you done it incorrectly.

138
00:12:19,940 --> 00:12:24,290
Unfortunately, this is a necessary evil if you want correct results.

139
00:12:25,520 --> 00:12:29,240
Even worse is that this will go slower on the GPU than the CPU.

140
00:12:29,840 --> 00:12:35,060
It takes about 30 minutes per epoch on GPU, but only five minutes per epoch on CPU.

141
00:12:35,600 --> 00:12:39,440
And both of these are much slower than if you had not used this argument at all.

142
00:12:40,730 --> 00:12:45,800
The blame for this essentially falls to TensorFlow, and I've seen at least one person say that the

143
00:12:45,800 --> 00:12:48,680
solution to this is to simply use PyTorch.

144
00:12:49,370 --> 00:12:55,430
In any case, in my opinion, this is the best solution for this class because I can still get the idea

145
00:12:55,430 --> 00:12:58,310
across while keeping the code as simple as possible.

146
00:12:58,850 --> 00:13:02,390
Otherwise, we'd have to write a custom loss, and that would take a lot of work.

147
00:13:07,890 --> 00:13:11,340
The next step is to call compile and fit, which you've seen before.

148
00:13:30,580 --> 00:13:32,890
The next step is to plot our loss per epoch.

149
00:13:37,330 --> 00:13:38,980
So the loss per epoch looks good.

150
00:13:43,140 --> 00:13:45,660
The next step is to plot the accuracy per epoch.

151
00:13:50,080 --> 00:13:52,120
So the accuracy per epoch looks good.

152
00:13:57,220 --> 00:14:02,890
Now, just as a sanity check, we are going to compute the true accuracy of our model to ensure that

153
00:14:02,890 --> 00:14:09,070
the results we got above are actually correct and to ensure that mask_zero is actually correct.

154
00:14:09,790 --> 00:14:15,010
The first step here is to simply get the length of each sequence for both the train and test sets.

155
00:14:20,830 --> 00:14:23,380
The next step is to get our model's train predictions.

156
00:14:24,130 --> 00:14:30,670
Recall that the model returns probabilities or logits, so the output shape will be N by T by K, which

157
00:14:30,670 --> 00:14:34,690
includes padding. In order to get the predictions without padding,

158
00:14:35,020 --> 00:14:39,130
we're going to loop through the model outputs and sequence lengths in corresponding order.

159
00:14:40,000 --> 00:14:45,370
Inside the loop, we'll grab only the first part of the sequence that is not padded, and we use the

160
00:14:45,370 --> 00:14:46,780
sequence length to do that.

161
00:14:48,220 --> 00:14:52,900
The next step is to take the argmax to get the class integer for each time step.

162
00:14:54,360 --> 00:14:58,170
Finally, we append these predictions to our list of predictions.

163
00:14:59,040 --> 00:15:05,850
The final step after the above loop is complete is to flatten both the predictions and unpadded targets.

164
00:15:06,480 --> 00:15:11,730
This is because the methods to compute various metrics in scikit-learn do not account for nested

165
00:15:11,730 --> 00:15:12,240
lists.

166
00:15:12,750 --> 00:15:18,110
So it's simpler for us to turn them into one dimensional lists so that we can use the functions inside

167
00:15:18,150 --> 00:15:18,750
scikit-learn.
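
The sanity check described above can be sketched without any libraries. The `outputs`, `lengths` and `targets` values below are made up, the helper names are hypothetical, and a hand-written `accuracy` stands in for the scikit-learn metric functions so that the example stays self-contained. Post-padding is assumed, so the real tokens are the first `length` time steps.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda k: values[k])


def unpadded_predictions(outputs, lengths):
    """Keep only the non-padded time steps and take the argmax at each one."""
    predictions = []
    for output, length in zip(outputs, lengths):
        real_steps = output[:length]  # drop the padded tail
        predictions.append([argmax(step) for step in real_steps])
    return predictions


def flatten(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up model outputs with shape N=2, T=3, K=3 (scores per class per step).
outputs = [
    [[0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [0.0, 0.0, 0.0]],  # last step is padding
    [[0.1, 0.2, 3.0], [0.3, 1.1, 0.2], [0.5, 2.2, 0.1]],
]
lengths = [2, 3]
targets = [[1, 2], [2, 1, 2]]  # unpadded targets

preds = unpadded_predictions(outputs, lengths)
acc = accuracy(flatten(targets), flatten(preds))
print(preds, acc)
```

Because the padded time step is sliced off before scoring, it can neither help nor hurt the accuracy.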

169
00:15:24,880 --> 00:15:27,370
The next step is to do the same thing for the test set.

170
00:15:34,210 --> 00:15:39,010
The next step is to compute the accuracy and F1 for both the train and test sets.

171
00:15:45,980 --> 00:15:48,470
So as you can see, our model does pretty well.

172
00:15:48,980 --> 00:15:55,460
As expected, the F1 is a bit less optimistic than the accuracy, since our classes are probably imbalanced.

173
00:16:00,140 --> 00:16:06,080
So you want to keep note of the above scores, since the next step is to compare our model to a baseline.

174
00:16:06,890 --> 00:16:12,170
The results we got are meaningless if we haven't confirmed that they perform better than something else.

175
00:16:12,890 --> 00:16:17,030
For all we know, the baseline could be 100 percent and our model is worse.

176
00:16:18,520 --> 00:16:23,500
To establish our baseline, we're going to create essentially the simplest model possible.

177
00:16:24,250 --> 00:16:27,790
Essentially, what we want to do is, for each word in the train set,

178
00:16:28,090 --> 00:16:34,090
keep track of all the possible tags it was assigned, then just map it to the most common tag.

179
00:16:35,290 --> 00:16:38,860
Now, I won't bore you with the code because it's pretty similar to what we did above.

180
00:16:39,310 --> 00:16:40,930
So let's go down to the results.
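
For reference, a most-common-tag baseline of the kind described could be sketched like this. The function names and the toy data are hypothetical, and the fallback for words never seen in training (the overall most frequent tag) is an assumption; the lecture does not say how unseen words are handled.

```python
from collections import Counter, defaultdict


def fit_baseline(train_inputs, train_targets):
    """For each training word, remember the tag it was assigned most often."""
    tag_counts = defaultdict(Counter)
    overall = Counter()
    for words, tags in zip(train_inputs, train_targets):
        for word, tag in zip(words, tags):
            tag_counts[word][tag] += 1
            overall[tag] += 1
    word2tag = {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}
    # Assumption: unseen words get the most frequent tag overall.
    fallback = overall.most_common(1)[0][0]
    return word2tag, fallback


def predict_baseline(sentences, word2tag, fallback):
    return [[word2tag.get(word, fallback) for word in sentence] for sentence in sentences]


train_inputs = [["the", "bill", "arrived"], ["pay", "the", "bill"], ["bill", "me", "later"]]
train_targets = [["DET", "NOUN", "VERB"], ["VERB", "DET", "NOUN"], ["VERB", "PRON", "ADV"]]

word2tag, fallback = fit_baseline(train_inputs, train_targets)
print(predict_baseline([["the", "bill", "vanished"]], word2tag, fallback))
```

Here "bill" is tagged as a noun twice and as a verb once, so the baseline always predicts NOUN for it, regardless of context.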

181
00:16:52,880 --> 00:16:58,440
So as you can see, the baseline model is pretty good, which makes sense. Most words

182
00:16:58,460 --> 00:17:03,740
only have one possible tag, and if they don't, then it's likely that those words are more commonly

183
00:17:03,740 --> 00:17:05,720
used one way than another.

184
00:17:06,470 --> 00:17:12,349
Even so, our model does perform better than the baseline, which demonstrates that accounting for context

185
00:17:12,349 --> 00:17:16,700
is useful and better than simply memorizing a tag for each word.

186
00:17:21,410 --> 00:17:26,030
Now, there are several exercises you should do after you've gone through this notebook.

187
00:17:26,720 --> 00:17:33,140
As usual, you should try different settings, so try the GRU and SimpleRNN, and try different numbers of hidden

188
00:17:33,140 --> 00:17:36,050
units, different numbers of hidden layers and so forth.

189
00:17:36,650 --> 00:17:41,390
Additionally, you should repeat the exercise with a CNN instead of an RNN.

190
00:17:45,510 --> 00:17:51,870
The second exercise is easier, but in my opinion, more important. For this exercise, you're going

191
00:17:51,870 --> 00:17:57,750
to take this notebook and you're actually going to implement the two major mistakes I mentioned in this

192
00:17:57,750 --> 00:17:58,260
lecture.

193
00:17:58,980 --> 00:18:01,350
As a reminder, these are the two mistakes.

194
00:18:02,100 --> 00:18:09,270
Mistake number one is not using an out-of-vocabulary token in the input tokenizer. As you recall,

195
00:18:09,300 --> 00:18:14,190
this will make it so that the input sequence becomes misaligned with the target sequence,

196
00:18:14,580 --> 00:18:17,010
since unknown words are simply ignored.

197
00:18:17,550 --> 00:18:20,550
So the wrong tags are assigned to the input words.

198
00:18:22,400 --> 00:18:25,730
Mistake number two is not ignoring padding in the loss.

199
00:18:26,300 --> 00:18:32,000
This will lead you to thinking that zeros are targets that were meant to be predicted, even though

200
00:18:32,000 --> 00:18:37,100
they do not represent actual parts of speech tags and have nothing to do with the input text.

201
00:18:37,820 --> 00:18:43,130
You should observe that even when you make both of these mistakes, the code doesn't break.

202
00:18:43,670 --> 00:18:48,440
Instead, you'll get 99 percent accuracy even though your code is incorrect.

203
00:18:49,280 --> 00:18:55,850
As a further exercise, consider why the accuracy is so high and discuss this on the Q&A with your peers.

204
00:18:58,220 --> 00:19:02,300
As a final note for this lecture, I want to discuss why these mistakes happen.

205
00:19:02,960 --> 00:19:08,190
I'm pretty confident that every blog and Kaggle notebook I came across made these mistakes.

206
00:19:08,810 --> 00:19:12,560
These mistakes essentially come from trusting libraries too much.

207
00:19:13,280 --> 00:19:18,140
You think you can just use the Tokenizer, and you think you can just use the Embedding, because that's

208
00:19:18,140 --> 00:19:19,340
what you saw in the past.

209
00:19:20,030 --> 00:19:25,850
Unfortunately, these default values don't work, and the only way to realize this is to think about

210
00:19:26,150 --> 00:19:27,860
what this code actually does.

211
00:19:28,430 --> 00:19:34,790
In other words, the lesson is that you have to think, and sometimes there is no way around understanding

212
00:19:34,790 --> 00:19:35,690
how things work.

213
00:19:36,470 --> 00:19:41,990
If you just use libraries and you don't have any understanding, you will make mistakes and build models

214
00:19:41,990 --> 00:19:44,540
that don't actually work in the real world.

