1
00:00:11,200 --> 00:00:14,560
So in this lecture, we'll be implementing CBOW word2vec.

2
00:00:15,250 --> 00:00:19,470
Let's begin by importing Gensim, which has a nice interface to the data set

3
00:00:19,840 --> 00:00:21,460
we'll be using for this notebook.

4
00:00:27,950 --> 00:00:31,280
The next step is to load in our data set, which is called text8.

5
00:00:41,700 --> 00:00:44,310
The next step is to check the type of our data set.

6
00:00:47,730 --> 00:00:50,150
As you can see, this is a Dataset object.

7
00:00:50,730 --> 00:00:55,800
Now it's not my objective in this notebook to get into the details of what this object is, but feel

8
00:00:55,800 --> 00:00:58,140
free to read the documentation if you like.

9
00:01:02,020 --> 00:01:07,300
So all you really need to know is that you can loop through the data set. The next step is to show

10
00:01:07,300 --> 00:01:12,310
that the data set is an iterable, meaning that you can do a for loop over it to get its items.

11
00:01:13,240 --> 00:01:19,390
I've commented this out since essentially this will go so fast with so much data that it will overwhelm

12
00:01:19,390 --> 00:01:20,950
Colab and you'll get an error.

13
00:01:24,690 --> 00:01:27,150
The next step is to do what we wanted to do above,

14
00:01:27,180 --> 00:01:31,530
but just for the first 10 items so that we won't have any issues with the output.
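
A minimal sketch of previewing only the first few items of a large iterable. The `fake_dataset` generator is a hypothetical stand-in for the real data set, and `first_n` is an illustrative helper, not something from the notebook.

```python
from itertools import islice

def first_n(iterable, n=10):
    """Return the first n items of any iterable without walking the rest."""
    return list(islice(iterable, n))

# Hypothetical stand-in for the real data set: yields one token list per document.
def fake_dataset():
    for i in range(1_000_000):
        yield [f"word{i}", "of", "the"]

preview = first_n(fake_dataset(), 10)
for doc in preview:
    print(doc)
```

Because `islice` stops after ten items, the remaining documents are never generated or printed.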

15
00:01:37,700 --> 00:01:40,700
So as you can see, each item is a list of words.

16
00:01:41,150 --> 00:01:44,450
Thus, this is a data set that is already tokenized.

17
00:01:48,790 --> 00:01:53,890
The next step is to count how many samples we have in our data set to get a sense of its size.

18
00:01:58,220 --> 00:02:01,340
As you can see, we have about 1700 documents.

19
00:02:05,280 --> 00:02:09,570
The next step is to get a sense of how many tokens are contained in each document.

20
00:02:10,259 --> 00:02:14,430
So in this loop, I'm getting the length of each document and saving it to a list.

21
00:02:20,940 --> 00:02:24,150
The next step is to plot a histogram of the document lengths.

22
00:02:29,160 --> 00:02:33,900
As you can see, a great majority of the documents have about 10000 words.

23
00:02:34,440 --> 00:02:38,700
There appear to be a few documents with fewer, down to about 5000.

24
00:02:42,770 --> 00:02:47,240
The next step is to check the mean and standard deviation of the document lengths.

25
00:02:51,360 --> 00:02:56,820
As you can see, the average length is about 10000 with a standard deviation of about 100.

26
00:03:00,900 --> 00:03:03,570
The next step is to import the Keras Tokenizer.

27
00:03:09,260 --> 00:03:15,530
The next step is to fit the tokenizer to our data set and transform the data into sequences of integers.

28
00:03:16,640 --> 00:03:21,350
As you recall, the variable sequences will be a list of lists of integers.

29
00:03:22,550 --> 00:03:25,880
Note that I've set the vocab size limit to be twenty thousand.

30
00:03:33,270 --> 00:03:36,000
OK, so let's check the length of our sequences list.

31
00:03:38,920 --> 00:03:42,880
As expected, it still contains one thousand seven hundred one items.

32
00:03:45,150 --> 00:03:47,460
Let's also check the length of the first document.

33
00:03:50,890 --> 00:03:54,580
So our first document has about 10000 words as expected.

34
00:03:57,590 --> 00:04:02,030
Let's now check the num_words attribute to see how many words our tokenizer contains.

35
00:04:05,370 --> 00:04:08,730
As you can see, this is set to 20000 as expected.

36
00:04:11,400 --> 00:04:14,610
The next step is to check the length of our word to index map.

37
00:04:18,459 --> 00:04:24,130
As you can see, this contains about 250000 words, much more than num_words.

38
00:04:24,700 --> 00:04:29,980
This is because this dictionary stores all the words it encountered, not just the words it keeps.
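
The difference between the full word-to-index map and the words actually kept can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: the toy `docs`, the tiny `num_words` limit, and the rule of keeping indices strictly below `num_words` are all assumptions made for the example.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data).
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]

# Rank every word by frequency; index 1 is the most frequent word.
counts = Counter(word for doc in docs for word in doc)
word_index = {word: rank + 1 for rank, (word, _) in enumerate(counts.most_common())}

# Assumed convention: only words whose index is below num_words survive.
num_words = 3
sequences = [
    [word_index[w] for w in doc if word_index[w] < num_words]
    for doc in docs
]

print(len(word_index))  # every word seen is stored in the map
print(sequences)        # but rarer words are dropped from the sequences
```

The map remembers all five words, while the sequences only ever contain the most frequent ones.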

39
00:04:33,140 --> 00:04:37,130
The next step is to print out the word to index mapping just to see what's in it.

40
00:04:46,760 --> 00:04:51,980
So as you can see, it's a dictionary containing words as keys and integers as values.

41
00:04:52,670 --> 00:04:57,770
Note that they appear to be sorted by frequency, which makes sense since that's how we decide which

42
00:04:57,770 --> 00:04:58,580
words to keep.

43
00:04:59,150 --> 00:05:02,000
So we have "the", "of", "and", and so forth.

44
00:05:06,250 --> 00:05:11,560
Note that the tokenizer also contains the reverse mapping, which is useful if we want to know what

45
00:05:11,560 --> 00:05:13,420
word our neural network has predicted.

46
00:05:22,090 --> 00:05:26,110
So as you can see, this is a dictionary which maps integer back to word.

47
00:05:31,380 --> 00:05:34,320
The next step is to do our imports for the rest of the script.

48
00:05:40,510 --> 00:05:45,760
I've also imported random and set a few random seeds so that you can replicate these results.

49
00:05:51,220 --> 00:05:54,340
The next step is to build our model. To start,

50
00:05:54,370 --> 00:05:59,530
we'll need to decide what we want to set as the context size as well as the embedding dimension.

51
00:06:00,400 --> 00:06:04,380
Note that these are hyperparameters which can be tuned to optimize the results.

52
00:06:04,870 --> 00:06:06,970
I've chosen these values somewhat arbitrarily.

53
00:06:07,270 --> 00:06:09,370
So please feel free to test your own.

54
00:06:10,300 --> 00:06:14,800
In fact, that would be a great way to exercise what you've learned in this part of the course.

55
00:06:15,970 --> 00:06:22,090
Note that the context size is for both sides of the middle word, so this means five words on the left

56
00:06:22,240 --> 00:06:23,650
and five words on the right.

57
00:06:24,190 --> 00:06:29,500
In hindsight, it probably would have been better to simply set a variable for one side and then multiply

58
00:06:29,500 --> 00:06:30,070
by two.

59
00:06:31,870 --> 00:06:34,090
The next step is to create our neural network.

60
00:06:34,720 --> 00:06:41,380
As you can see, this involves four layers: the input, the embedding, a Lambda layer, and a final Dense.

61
00:06:41,980 --> 00:06:44,860
Most of this is probably trivial, except for the lambda.

62
00:06:45,490 --> 00:06:51,250
Basically, the Lambda layer allows you to specify your own function, but as a Keras layer. This

63
00:06:51,250 --> 00:06:56,260
is helpful if you have some custom function you want to define and you want to incorporate it into a

64
00:06:56,260 --> 00:06:57,040
neural network.

65
00:06:58,180 --> 00:07:03,610
So you can see here that all it does is call the TensorFlow reduce_mean function on axis one.

66
00:07:04,330 --> 00:07:11,320
As you recall, the output of the embedding has the shape N by T by D, so axis zero corresponds to

67
00:07:11,320 --> 00:07:15,880
N, axis one corresponds to T, and axis two corresponds to D.

68
00:07:16,480 --> 00:07:22,660
We want to get rid of the T axis because we want to take the mean of the embedding vectors in the sequence

69
00:07:22,660 --> 00:07:23,980
of context vectors.

70
00:07:24,460 --> 00:07:29,710
Because of this, the output will have the shape N by D, which is what we normally have for tabular

71
00:07:29,710 --> 00:07:30,460
data sets.
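
The averaging step can be illustrated with NumPy alone. This is a sketch under assumptions: the tiny sizes and the `embedded` array are made up, and `mean(axis=1)` stands in for the reduce-mean call described in the lecture.

```python
import numpy as np

# Hypothetical tiny sizes: N samples, T context words, D embedding dimensions.
N, T, D = 2, 4, 3

# Stand-in for the embedding layer's output, shape (N, T, D).
embedded = np.arange(N * T * D, dtype=float).reshape(N, T, D)

# Averaging over axis 1 collapses the T context vectors into one vector per sample.
averaged = embedded.mean(axis=1)

print(embedded.shape)  # (2, 4, 3)
print(averaged.shape)  # (2, 3)
```

Each row of `averaged` is the mean of that sample's T context vectors, which is why the T axis disappears.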

72
00:07:31,240 --> 00:07:34,540
Also notice that the final layer sets use_bias to False.

73
00:07:40,880 --> 00:07:45,140
OK, so the next step is to call model.summary to see the structure of our model.

74
00:07:50,990 --> 00:07:53,630
As you can see, that shows us our four layers.

75
00:07:54,320 --> 00:07:58,490
Note that the input layer and lambda layer have no parameters as expected.

76
00:07:59,270 --> 00:08:04,040
On the other hand, both the embedding layer and the dense layer have the same number of parameters,

77
00:08:04,050 --> 00:08:04,970
one million.

78
00:08:05,510 --> 00:08:09,350
This is because both of these store a weight matrix of the same size, just with the dimensions swapped.

79
00:08:09,860 --> 00:08:16,490
So the embedding matrix has the shape 20000 by 50, and the final dense layer's matrix has the shape 50

80
00:08:16,490 --> 00:08:17,710
by 20000.
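
The parameter counts follow directly from the two shapes. A quick arithmetic check, using the vocabulary size and embedding dimension quoted in the lecture (the variable names are illustrative):

```python
# Values quoted in the lecture: vocabulary size V and embedding dimension D.
V = 20_000
D = 50

embedding_shape = (V, D)  # one D-dimensional row per word
dense_shape = (D, V)      # maps a D-vector to one score per word, with no bias term

embedding_params = embedding_shape[0] * embedding_shape[1]
dense_params = dense_shape[0] * dense_shape[1]

print(embedding_params, dense_params)  # 1000000 1000000
```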

81
00:08:18,620 --> 00:08:20,480
Also make note of the output shapes.

82
00:08:21,080 --> 00:08:27,350
Our input has the shape N by T, where T is 10, which is our sequence length or the length of our context.

83
00:08:28,040 --> 00:08:34,010
The output of the embedding layer has the shape N by T by D, where D is 50, which is the size of our

84
00:08:34,010 --> 00:08:35,090
embedding vectors.

85
00:08:36,530 --> 00:08:42,380
The output of the Lambda layer is of size N by D, since this takes the mean along the T dimension.

86
00:08:43,640 --> 00:08:50,180
And finally, the output of the final layer is N by V, where V is the vocab size, since we need to have

87
00:08:50,180 --> 00:08:52,310
an output for every possible word.

88
00:08:56,320 --> 00:08:59,650
The next big block of code is used to build our data generator.

89
00:09:00,130 --> 00:09:02,770
This is perhaps the most complex part of this script.

90
00:09:03,910 --> 00:09:09,520
Basically, this is going to randomly generate context windows from our corpus to feed into the network.

91
00:09:10,210 --> 00:09:15,910
The reason we want to do this is because if we were to convert every possible context and target pair

92
00:09:16,240 --> 00:09:21,940
out of our data set, there would be many redundant entries since all the context windows would overlap.

93
00:09:22,630 --> 00:09:27,190
Therefore, it's more memory efficient to generate inputs and targets on the fly.

94
00:09:28,120 --> 00:09:33,490
So we begin by setting half the context size, which is just the context size divided by two.

95
00:09:37,420 --> 00:09:44,200
Next, we define a data generator function, which takes in a list of sequences as input. Inside the

96
00:09:44,200 --> 00:09:50,320
function, we preallocate X_batch and Y_batch as NumPy arrays, which we will eventually yield from

97
00:09:50,320 --> 00:09:50,980
this function.

98
00:09:54,390 --> 00:09:59,280
The next step is to compute the number of batches, which is the number of sequences divided by the

99
00:09:59,280 --> 00:10:00,780
batch size rounded up.

100
00:10:01,350 --> 00:10:07,110
We round up since if we don't divide evenly, the final batch will simply contain fewer samples.

101
00:10:09,490 --> 00:10:15,190
Next, we enter an infinite loop. Inside the loop, we shuffle the list of sequences.

102
00:10:15,730 --> 00:10:20,710
This is because we don't want to encounter the sequences in the same order on each epoch, which could

103
00:10:20,710 --> 00:10:21,850
bias our results.

104
00:10:25,330 --> 00:10:28,750
The next step is to loop through our batches. For this script,

105
00:10:28,780 --> 00:10:35,320
we'll define one epoch to be one pass through each sample. For each sample, we'll choose only a single

106
00:10:35,320 --> 00:10:36,280
context window.

107
00:10:37,030 --> 00:10:41,860
So inside this inner loop, we first grab the relevant sequences for the current batch.

108
00:10:45,550 --> 00:10:48,600
The next step is to compute the size of this current batch.

109
00:10:49,300 --> 00:10:54,490
As you may recall, this may be less than the actual batch size if we are at the final batch and it

110
00:10:54,490 --> 00:10:55,630
doesn't divide evenly.

111
00:10:56,650 --> 00:10:59,710
The next step is to loop through each sequence in our batch.

112
00:11:01,180 --> 00:11:04,060
Inside this loop, we first grab the current sequence.

113
00:11:04,660 --> 00:11:09,730
The next step is to choose a random position in the sequence, making space for the context.

114
00:11:10,570 --> 00:11:16,930
The next step is to select the context, so X1 represents everything to the left of the middle word,

115
00:11:17,380 --> 00:11:20,620
and X2 represents everything to the right of the middle word.

116
00:11:25,310 --> 00:11:30,080
You can see that I've commented out some code, which is probably less efficient, but describes what

117
00:11:30,080 --> 00:11:31,250
we are doing conceptually.

118
00:11:32,000 --> 00:11:38,300
So conceptually what we are doing is concatenating X1 and X2, which will become the input context.

119
00:11:39,020 --> 00:11:44,840
However, we've already preallocated our X_batch NumPy array, which is easy to index using NumPy

120
00:11:44,840 --> 00:11:50,810
syntax, so we can simply set X1 and X2 separately without doing any concatenation.

121
00:11:53,710 --> 00:11:58,960
The next step is to get the target y, which is the middle word, and to store that in Y_batch.

122
00:12:02,940 --> 00:12:08,790
Once we finish looping through each sequence in our batch, we then yield X_batch and Y_batch up to the

123
00:12:08,790 --> 00:12:09,870
current batch size.
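
The generator steps described above can be sketched as follows. This is a reconstruction from the narration, not the notebook's code: the default `batch_size`, the `seed` argument, and the toy sequences are assumptions, and each sequence is assumed to be longer than the context size.

```python
import random
import numpy as np

def data_generator(sequences, context_size=10, batch_size=128, seed=0):
    """Yield (X_batch, Y_batch) forever, one random context window per sequence."""
    rnd = random.Random(seed)
    half = context_size // 2
    sequences = list(sequences)  # local copy so the caller's list is not reordered

    # Preallocate once and reuse the same arrays for every batch.
    X_batch = np.zeros((batch_size, context_size), dtype=np.int64)
    Y_batch = np.zeros(batch_size, dtype=np.int64)

    # Round up so a final, smaller batch is still produced.
    n_batches = int(np.ceil(len(sequences) / batch_size))

    while True:
        rnd.shuffle(sequences)  # new order on every epoch
        for b in range(n_batches):
            batch = sequences[b * batch_size:(b + 1) * batch_size]
            current_size = len(batch)  # may be smaller for the last batch
            for i, seq in enumerate(batch):
                # Pick a middle position that leaves room for context on both sides.
                j = rnd.randint(half, len(seq) - half - 1)
                X_batch[i, :half] = seq[j - half:j]          # words to the left
                X_batch[i, half:] = seq[j + 1:j + half + 1]  # words to the right
                Y_batch[i] = seq[j]                          # the middle word
            yield X_batch[:current_size], Y_batch[:current_size]

# Toy sequences of consecutive integers, so each context is easy to check by eye.
toy = [list(range(100 * k, 100 * k + 30)) for k in range(5)]
gen = data_generator(toy, context_size=4, batch_size=2)
X, Y = next(gen)
print(X)
print(Y)
```

Note that the yielded arrays are views of the preallocated buffers, so they are overwritten on the next step; copy them if you need to keep a batch around.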

124
00:12:16,920 --> 00:12:20,340
The next step is to call the compile method, which you've seen before.

125
00:12:25,880 --> 00:12:27,620
The next step is to call the fit method.

126
00:12:28,220 --> 00:12:32,900
Note that this is a bit different because our data is not in the form of NumPy arrays.

127
00:12:33,530 --> 00:12:39,080
Luckily, the TensorFlow Fit function is pretty flexible, so it accepts data generators as well.

128
00:12:39,920 --> 00:12:44,660
Note that when we do this, we also have to tell the fit method how many steps are in each epoch.

129
00:12:45,380 --> 00:12:48,680
This is just the number of batches, as we computed above.

130
00:12:49,580 --> 00:12:52,340
Note that this code takes about 45 minutes to run.

131
00:12:52,730 --> 00:12:55,910
So you may want to simply run it locally on your own machine.

132
00:12:57,650 --> 00:13:01,040
Now you might be wondering why I've chosen 10000 epochs.

133
00:13:01,940 --> 00:13:08,630
As you recall, each document in our dataset contains about 10000 words, and each epoch takes only

134
00:13:08,630 --> 00:13:10,940
one context window from each document.

135
00:13:11,510 --> 00:13:18,350
Therefore, on average, we would need approximately 10000 epochs to select all possible context windows.

136
00:13:34,010 --> 00:13:36,230
The next step is to plot the loss per epoch.

137
00:13:40,400 --> 00:13:43,010
Note that the loss per epoch is pretty stochastic.

138
00:13:43,580 --> 00:13:49,010
This is because, as you recall, predicting a word in a sentence, given other words in a sentence

139
00:13:49,280 --> 00:13:50,420
has some variation.

140
00:13:51,110 --> 00:13:56,300
That's why we can build things like language models that generate different text every time we run it.

141
00:13:56,960 --> 00:13:59,570
But overall, the loss is decreasing, which is good.

142
00:14:05,040 --> 00:14:07,620
The next step is to plot the accuracy per epoch.

143
00:14:12,580 --> 00:14:16,480
Again, we see that it's pretty stochastic, but overall it increases.

144
00:14:21,180 --> 00:14:25,800
The next step is to get our embedding matrix, which is stored in the layer at index one.

145
00:14:32,820 --> 00:14:36,990
The next step is to create an instance of scikit-learn's NearestNeighbors object.

146
00:14:37,620 --> 00:14:43,260
Basically, this uses special algorithms to find the closest vectors to a query vector, just like we

147
00:14:43,260 --> 00:14:44,490
did with TF-IDF.

148
00:14:45,270 --> 00:14:50,190
We're going to create this object to look for five neighbors using an algorithm called a ball tree.

149
00:14:51,000 --> 00:14:54,540
We also need to call the fit method passing in our matrix of embeddings.

150
00:14:55,860 --> 00:15:01,140
So basically, what we're going to do later is pass in a query vector, and then this neighbors object

151
00:15:01,140 --> 00:15:04,080
will tell us which of these embedding vectors are the closest.

152
00:15:04,530 --> 00:15:07,590
In particular, it will choose the five closest vectors.

153
00:15:14,350 --> 00:15:19,510
The next step is to demonstrate how to use what we've created to find the closest neighbors to a given

154
00:15:19,510 --> 00:15:22,750
word. To start, I've chosen the word queen.

155
00:15:24,430 --> 00:15:26,620
The first step is to get the integer index

156
00:15:26,620 --> 00:15:33,010
for this word, which is stored in our tokenizer. The next step is to select the corresponding embedding

157
00:15:33,010 --> 00:15:34,300
vector for this word.

158
00:15:35,350 --> 00:15:41,170
Note that because scikit-learn models expect a 2D input array, I've indexed the embeddings at queen_idx

159
00:15:41,170 --> 00:15:45,760
to queen_idx plus one, which gives us back a one by D matrix.

160
00:15:47,140 --> 00:15:51,700
The next step is to call neighbors.kneighbors, passing in our query vector.

161
00:15:52,390 --> 00:15:57,340
This will return both the indices of the closest vectors and their corresponding distances.

162
00:15:57,880 --> 00:16:00,730
We don't need the distances, but I've printed the indices.

163
00:16:05,770 --> 00:16:09,190
OK, so as expected, it gives us back a list of integers.

164
00:16:13,310 --> 00:16:19,070
Luckily, our tokenizer also has a mapping from index back to word, so we can see what words these

165
00:16:19,070 --> 00:16:20,300
indices represent.

166
00:16:25,980 --> 00:16:27,580
So as you can see, we get back

167
00:16:27,600 --> 00:16:32,370
queen, elizabeth, king, mary, and princess, which all makes sense.

168
00:16:33,090 --> 00:16:38,370
Note that the same word as the query will always be in the first position because the distance from

169
00:16:38,370 --> 00:16:40,320
a vector to itself is zero.
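
The lookup can be illustrated with a brute-force search in NumPy. To be clear, this swaps in a plain distance sort for the ball tree used in the lecture, and the random `embeddings`, the `query_idx` value, and the helper name `nearest_neighbors` are all hypothetical.

```python
import numpy as np

def nearest_neighbors(embeddings, query, k=5):
    """Return indices and Euclidean distances of the k rows closest to query."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

# Hypothetical stand-in for a trained embedding matrix: 20 words, 8 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 8))

query_idx = 7
# Slicing with i:i+1 keeps the query two-dimensional, with shape (1, D).
query = embeddings[query_idx:query_idx + 1]

idx, dists = nearest_neighbors(embeddings, query, k=5)
print(idx)
print(dists)
```

The first index returned is the query word itself, since its distance to its own vector is zero.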

170
00:16:44,850 --> 00:16:49,920
The next step is to create a function, to encapsulate what we just wrote in order to check the neighbors

171
00:16:49,920 --> 00:16:50,910
for other words.

172
00:16:56,570 --> 00:16:57,950
So let's try the word uncle.

173
00:17:02,440 --> 00:17:07,750
As you can see, the results other than uncle are cousin, grandfather, mentor and father.

174
00:17:08,260 --> 00:17:11,530
These make sense because they are mostly family relationships.

175
00:17:15,740 --> 00:17:17,000
Let's now try Paris.

176
00:17:21,280 --> 00:17:27,280
As you can see, the results other than Paris are Venice, Vienna, Florence and Milan, which all makes

177
00:17:27,280 --> 00:17:27,760
sense.

178
00:17:28,119 --> 00:17:29,650
These are all places in Europe.

179
00:17:33,040 --> 00:17:34,330
Now, let's try Japan.

180
00:17:38,630 --> 00:17:43,310
So we get Taiwan, Turkey, Singapore and Pakistan, which makes sense.

181
00:17:43,520 --> 00:17:44,660
These are all countries.

182
00:17:48,310 --> 00:17:49,660
Let's now try election.

183
00:17:54,110 --> 00:17:56,360
So we get presidential, elections,

184
00:17:56,450 --> 00:17:59,090
vote, and candidate, which all makes sense.

185
00:18:02,820 --> 00:18:04,290
Let's now try California.

186
00:18:08,880 --> 00:18:14,790
So we get Texas, Florida, Illinois and Michigan, which makes sense since they are all U.S. states.

187
00:18:19,530 --> 00:18:25,530
The next step is to try an analogy. Basically, if we think of this like an equation, we can isolate

188
00:18:25,530 --> 00:18:28,380
queen to get king minus man plus woman.

189
00:18:29,010 --> 00:18:34,260
This gives us a vector and we want to find the closest neighbors of this vector, which we hope will

190
00:18:34,260 --> 00:18:34,920
be queen.
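
A toy version of the analogy query, using hand-made two-dimensional vectors so the arithmetic is easy to follow. The vectors, the word list, and the `analogy` helper are invented for illustration; real embeddings would come from the trained model.

```python
import numpy as np

# Hand-made vectors: first coordinate is "royalty", second is "gender".
words = ["king", "queen", "man", "woman", "apple"]
vecs = np.array([
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [0.0, 1.0],    # man
    [0.0, -1.0],   # woman
    [-1.0, 0.0],   # apple
])

def analogy(a, b, c, words, vecs):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping the query words."""
    index = {w: i for i, w in enumerate(words)}
    query = vecs[index[a]] - vecs[index[b]] + vecs[index[c]]
    dists = np.linalg.norm(vecs - query, axis=1)
    for i in np.argsort(dists):
        if words[i] not in (a, b, c):
            return words[i]

print(analogy("king", "man", "woman", words, vecs))  # queen
```

Skipping the query words follows the usual convention for analogy lookups, since the inputs themselves often sit close to the result vector.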

191
00:18:36,360 --> 00:18:37,440
So let's try this out.

192
00:18:45,310 --> 00:18:52,090
OK, so as you can see, queen appears in the list of top five neighbors. Note that by convention, we

193
00:18:52,090 --> 00:18:54,940
would typically ignore any words that were part of the query.

194
00:18:55,390 --> 00:18:57,910
So in this instance, we would have returned queen.

195
00:19:02,300 --> 00:19:03,530
Let's now try this again,

196
00:19:03,560 --> 00:19:07,850
but for England is to English as Australia is to Australian.

197
00:19:14,960 --> 00:19:19,490
So as you can see, Australian appears as the first neighbor, which is correct.

