1
00:00:11,680 --> 00:00:16,840
In this section of the course, we are going to discuss how to deal with text data in natural language

2
00:00:16,840 --> 00:00:18,280
processing.

3
00:00:18,340 --> 00:00:24,620
This ties directly into the previous section, which was on using RNNs for modeling sequence data.

4
00:00:24,760 --> 00:00:31,030
Of course, text is also sequence data, but there is a major difference between text sequences and the continuous

5
00:00:31,030 --> 00:00:35,000
valued sequences we were dealing with in the previous section.

6
00:00:35,260 --> 00:00:36,250
That difference is this:

7
00:00:36,280 --> 00:00:40,380
Text is made of words and words are categorical objects.

8
00:00:40,480 --> 00:00:42,580
In other words, they are not continuous.

9
00:00:47,770 --> 00:00:49,740
Suppose we're given a sequence of words.

10
00:00:49,750 --> 00:00:56,820
"The quick brown fox jumps over the lazy dog." So x1 equals "the", x2 equals "quick", and so on.

11
00:00:56,920 --> 00:01:02,410
But here's the problem: we know that in order to calculate the output of an RNN unit, we have

12
00:01:02,410 --> 00:01:07,390
to multiply each x by the input-to-hidden weight, W_xh.

13
00:01:07,390 --> 00:01:10,780
You can see that if x is a category, this is not possible.

14
00:01:15,900 --> 00:01:21,450
You might think, "Aha, I've seen this before. When I have categories, all I need to do is one-hot encode

15
00:01:21,500 --> 00:01:21,930
them."

16
00:01:22,350 --> 00:01:29,080
So for example, what you would do is create an array which has length equal to the size of our vocabulary.

17
00:01:29,130 --> 00:01:32,400
In other words, the number of words in the English language.

18
00:01:32,730 --> 00:01:36,020
Typically we denote that with a capital letter V.

19
00:01:36,120 --> 00:01:39,210
So here's how we are going to one-hot encode a word.

20
00:01:39,480 --> 00:01:44,290
First, we're going to create a vector of size V containing all zeros.

21
00:01:44,580 --> 00:01:50,190
Then we're going to create a mapping where we have for each word a corresponding integer starting from

22
00:01:50,190 --> 00:01:51,180
1.

23
00:01:51,180 --> 00:01:59,100
So for example, the word "a" will get the index 1, the next word will get the index 2, the word "activation"

24
00:01:59,160 --> 00:02:00,890
might get the index 100.

25
00:02:01,200 --> 00:02:02,910
And we do this for all other words.

26
00:02:02,910 --> 00:02:08,730
So we get all the way up to "zoo", which maybe has the index 1 million.

27
00:02:08,760 --> 00:02:14,930
Finally, once we have a map from each word to an integer, we simply set that corresponding index to 1.

28
00:02:14,940 --> 00:02:21,030
So as you can see, all of these vectors are a bunch of zeros with only a single one at a unique position.

29
00:02:21,960 --> 00:02:33,590
Thus, our final one-hot encoded word vector becomes a vector of all zeros except for a single one.

30
00:02:33,650 --> 00:02:36,060
So what happens after we do this?

31
00:02:36,080 --> 00:02:42,590
Well, now we can go back to our usual scenario. If we have a sequence of T words and each of them becomes

32
00:02:42,590 --> 00:02:48,560
a vector of size V, then we have a T by V matrix, which represents our sequence.

33
00:02:48,560 --> 00:02:52,490
This is the same shape as our generic T by D sequence.

34
00:02:52,490 --> 00:02:55,700
We can then pass this into our recurrent neural network as normal.

35
00:02:56,330 --> 00:02:58,230
So let's do a very simple example.

36
00:02:58,310 --> 00:03:01,320
Suppose we had the sentence "I like cats."

37
00:03:01,340 --> 00:03:06,290
This is a sequence of length 3 because the sentence has three words.

38
00:03:06,380 --> 00:03:13,520
This might turn into the sequence of vectors [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0].

39
00:03:13,820 --> 00:03:19,670
For this simple example, we assume that our vocabulary only contains four words, which of course is much

40
00:03:19,670 --> 00:03:22,040
smaller than a real English language dataset.
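
Here is a minimal NumPy sketch of the "I like cats" example above. The fourth vocabulary word ("dogs"), the particular word-to-index assignment, and the zero-based indices are assumptions made for illustration; they are chosen only so that the output matches the vectors read out in the lecture.

```python
import numpy as np

# Hypothetical 4-word vocabulary. The indices are arbitrary, as long as they are unique.
word2idx = {"cats": 0, "like": 1, "dogs": 2, "I": 3}
V = len(word2idx)

def one_hot(sentence):
    """Return a T x V matrix with one one-hot row per word in the sentence."""
    words = sentence.split()
    X = np.zeros((len(words), V), dtype=int)
    for t, word in enumerate(words):
        X[t, word2idx[word]] = 1  # set the single "hot" position for this word
    return X

X = one_hot("I like cats")
print(X.shape)  # T = 3 words, V = 4 vocabulary entries
print(X)
```

Each row is one word and each column is one vocabulary entry, so the sentence becomes a T by V matrix.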

41
00:03:27,150 --> 00:03:29,690
But there are some problems with this approach.

42
00:03:29,820 --> 00:03:36,030
Some English language datasets have about 1 million possible tokens, or words. That would lead to an

43
00:03:36,030 --> 00:03:41,750
extremely large one-hot encoded feature vector, which means that your weight matrix would also be very large.

44
00:03:43,170 --> 00:03:47,810
So in other words, you have a T by V matrix, but V is a very big number.

45
00:03:48,000 --> 00:03:53,550
In actuality, the English dictionary contains only about two hundred thousand words, so it really depends

46
00:03:53,550 --> 00:03:57,480
on how you split up your words and where you obtain your vocabulary from.

47
00:03:57,540 --> 00:03:59,900
But this is still a significantly large number.

48
00:04:04,780 --> 00:04:07,320
Here's another problem with this approach.

49
00:04:07,480 --> 00:04:11,280
Remember that we would like our input features to have some structure.

50
00:04:11,470 --> 00:04:16,900
If you recall, one of the earliest rules we encountered was that machine learning is nothing but a geometry

51
00:04:16,900 --> 00:04:18,250
problem.

52
00:04:18,250 --> 00:04:23,200
This rule relies on the fact that there's geometrical structure in our data.

53
00:04:23,200 --> 00:04:27,000
Data vectors from the same class are probably close to each other.

54
00:04:32,130 --> 00:04:34,800
But what do one-hot encoded feature vectors look like?

55
00:04:35,460 --> 00:04:41,820
Well, if we take any two different one-hot encoded feature vectors and calculate their Euclidean distance, we will

56
00:04:41,820 --> 00:04:47,540
always get the square root of 1 squared plus 1 squared, which is just the square root of 2.

57
00:04:47,670 --> 00:04:52,420
It doesn't matter which two words we select, because they are all one-hot encoded.

58
00:04:52,680 --> 00:04:57,170
For example, the distance between "cat" and "feline" would be the square root of 2.

59
00:04:57,720 --> 00:05:02,190
But the distance between "cat" and "airplane" would also be the square root of 2.

60
00:05:02,190 --> 00:05:10,800
In other words this data has no useful geometrical structure.
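
As a quick check of that claim, here is a small NumPy sketch that measures the distance between every pair of distinct one-hot vectors. The vocabulary size of 5 is an arbitrary assumption; any size gives the same result.

```python
import numpy as np

V = 5             # assumed toy vocabulary size
I = np.eye(V)     # row i is the one-hot vector for word index i

# Euclidean distance between every pair of different one-hot vectors.
dists = [
    np.linalg.norm(I[i] - I[j])
    for i in range(V)
    for j in range(V)
    if i != j
]

# Every pair differs in exactly two positions, so every distance is sqrt(1^2 + 1^2).
print(np.allclose(dists, np.sqrt(2)))
```

Because every pair is equally far apart, distance tells us nothing about which words are similar.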

61
00:05:10,960 --> 00:05:13,000
So what can we do instead?

62
00:05:13,000 --> 00:05:19,560
Well, it would be nice if we could map each word to a D-dimensional vector.

63
00:05:19,570 --> 00:05:21,490
Luckily we have just the tool for this.

64
00:05:21,610 --> 00:05:23,740
It's called an embedding layer.

65
00:05:23,740 --> 00:05:29,180
Now, you might be wondering: how can we actually make these vectors have a useful structure?

66
00:05:29,200 --> 00:05:30,250
We'll get back to that later.

67
00:05:30,250 --> 00:05:31,120
So stay tuned.

68
00:05:32,110 --> 00:05:35,200
But before we discuss that, I want to show you a little coding trick.

69
00:05:40,550 --> 00:05:46,070
So consider what happens if we multiply a one-hot encoded vector by a weight matrix.

70
00:05:46,070 --> 00:05:52,250
Let's suppose the vector is just three-dimensional, [1, 0, 0], and the weight matrix just contains the numbers

71
00:05:52,280 --> 00:05:53,950
1 up to 9.

72
00:05:54,050 --> 00:06:00,860
You can verify for yourself that when we multiply this vector by this matrix, we get the vector [1, 2, 3].

73
00:06:06,020 --> 00:06:09,900
Now consider the one-hot encoded vector [0, 1, 0].

74
00:06:09,950 --> 00:06:13,790
What happens when you multiply this vector by the same weight matrix?

75
00:06:14,030 --> 00:06:16,130
We get the vector [4, 5, 6].

76
00:06:21,220 --> 00:06:24,830
Now consider the one-hot encoded vector [0, 0, 1].

77
00:06:24,880 --> 00:06:30,790
What happens when we multiply this vector by the same weight matrix? We get the vector [7, 8, 9].

78
00:06:35,870 --> 00:06:42,980
So what is the pattern we see? If the index in the one-hot encoded vector which was set to 1 was 1,

79
00:06:42,980 --> 00:06:45,800
then we get the first row of the weight matrix.

80
00:06:45,800 --> 00:06:50,630
If the index was set to 2 then we get the second row of the weight matrix.

81
00:06:50,630 --> 00:06:59,660
If the index was set to 3 then we get the third row of the weight matrix.

82
00:06:59,970 --> 00:07:06,140
In other words, if we one-hot encode the integer k and multiply it by the weight matrix,

83
00:07:06,180 --> 00:07:15,370
all that's really doing is selecting the kth row of the weight matrix.
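
The pattern above can be checked with a short NumPy sketch, using the same 3 by 3 weight matrix containing the numbers 1 up to 9. One assumption to note: the code uses zero-based indices, so k = 0 here corresponds to "index 1" in the lecture.

```python
import numpy as np

# The weight matrix from the example: rows [1, 2, 3], [4, 5, 6], [7, 8, 9].
W = np.arange(1, 10).reshape(3, 3)

results = []
for k in range(3):
    x = np.zeros(3, dtype=int)
    x[k] = 1                 # one-hot encode the integer k (zero-based)
    product = x @ W          # the "old way": vector times matrix
    results.append(product)
    # Multiplying by a one-hot vector gives exactly row k of W.
    assert np.array_equal(product, W[k])

print(np.array(results))
```

So the multiplication can be replaced by simply reading off `W[k]`.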

84
00:07:15,470 --> 00:07:16,840
So here's a shortcut.

85
00:07:17,210 --> 00:07:20,130
The old way of doing this took two steps.

86
00:07:20,150 --> 00:07:25,410
First, we had to create a one-hot encoded vector and set the kth entry to 1.

87
00:07:25,460 --> 00:07:33,210
Second, we had to multiply this one-hot encoded vector by the weight matrix. The shortcut way of doing

88
00:07:33,210 --> 00:07:35,150
this takes only one step.

89
00:07:35,160 --> 00:07:38,940
We simply index the weight matrix at row k.

90
00:07:38,940 --> 00:07:49,570
This is obviously much more efficient than creating a one hot vector and then doing matrix multiplication.

91
00:07:49,590 --> 00:07:54,570
Think of it this way: indexing an array is O(1). That's constant time.

92
00:07:54,870 --> 00:08:00,060
But how long does it take to create a one-hot vector and then do matrix multiplication?

93
00:08:00,240 --> 00:08:12,050
If your vector is of size V and the weight matrix is of size V by D, then this is O(V times D).

94
00:08:12,120 --> 00:08:15,930
So this is exactly what the embedding layer in PyTorch does.

95
00:08:16,530 --> 00:08:22,140
Instead of doing all the work yourself to create one-hot encoded vectors, the only step you need to

96
00:08:22,140 --> 00:08:29,900
do is map your data set with sequences of words into a data set with sequences of word indices.

97
00:08:30,000 --> 00:08:36,690
Now that you have only integers, you can use an embedding layer to map each word's integer to a corresponding

98
00:08:36,690 --> 00:08:37,860
word vector.

99
00:08:38,310 --> 00:08:44,610
From there on out, your sequence becomes a T by D matrix, at which point you can use an RNN as

100
00:08:44,610 --> 00:08:45,480
you normally would.
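
Here is a minimal sketch of those two steps using PyTorch's `nn.Embedding`. The tiny vocabulary, the word-to-index mapping, and the embedding size D = 4 are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Step 1: map words to integers (hypothetical 3-word vocabulary, zero-based).
word2idx = {"I": 0, "like": 1, "cats": 2}
V = len(word2idx)
D = 4  # assumed embedding dimension

indices = torch.tensor([word2idx[w] for w in "I like cats".split()])

# Step 2: the embedding layer holds a V x D weight matrix and indexes its rows.
embedding = nn.Embedding(num_embeddings=V, embedding_dim=D)
vectors = embedding(indices)

print(vectors.shape)  # a T x D matrix: one D-dimensional vector per word

# The layer's output is just the selected rows of its weight matrix.
print(torch.equal(vectors, embedding.weight[indices]))
```

The weights start out random here; they only become meaningful once they are trained.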

101
00:08:50,590 --> 00:08:53,880
Conceptually, you can think of the process like this.

102
00:08:54,010 --> 00:09:01,030
First, we have some sentence, a sequence of words. Let's say for simplicity's sake it's "I like cats."

103
00:09:01,030 --> 00:09:05,250
Then this sequence of words becomes a sequence of integers.

104
00:09:05,260 --> 00:09:10,610
Obviously it doesn't matter which integer we assign to each word as long as they are unique.

105
00:09:10,630 --> 00:09:13,100
This is just like when we are doing classification.

106
00:09:13,270 --> 00:09:17,320
It doesn't matter if we set dog to 1 and cat to 0 or vice versa.

107
00:09:17,320 --> 00:09:20,210
They are just arbitrary assignments.

108
00:09:20,260 --> 00:09:26,950
Finally we can use these integers to index an embedding matrix which will convert each integer into

109
00:09:26,950 --> 00:09:28,290
a word vector.

110
00:09:28,600 --> 00:09:31,570
So in the end what we have done is two steps.

111
00:09:31,570 --> 00:09:34,050
First we converted words to integers.

112
00:09:34,060 --> 00:09:36,700
Second, we mapped those integers to vectors.

113
00:09:41,810 --> 00:09:47,240
One question you might have at this point is this: we know intuitively that similar words should be closer

114
00:09:47,240 --> 00:09:49,910
together than dissimilar words.

115
00:09:49,910 --> 00:09:55,080
In other words these feature vectors should exist somewhere meaningful relative to each other.

116
00:09:55,190 --> 00:10:00,980
For example, "king" should be close to "queen" and "car" should be close to "automobile", but "queen" should not

117
00:10:00,980 --> 00:10:08,160
be close to "car" and "king" should not be close to "automobile". So how can we make sure that that's the case?

118
00:10:08,230 --> 00:10:13,760
Luckily, this is the same story as with convolutional filters. Because they are just weights in a neural

119
00:10:13,760 --> 00:10:14,190
network,

120
00:10:14,200 --> 00:10:16,260
they all get trained automatically.

121
00:10:16,270 --> 00:10:22,870
There is nothing fancy you need to do.

122
00:10:23,040 --> 00:10:29,590
Now, there's one caveat to that, which is that people often use pre-trained word vectors.

123
00:10:29,730 --> 00:10:32,640
How this works is, when they create an embedding layer,

124
00:10:32,640 --> 00:10:39,130
they set the weights to a set of pre-trained vectors that were obtained through some other method. Then

125
00:10:39,190 --> 00:10:42,820
they freeze this layer so that those weights are never changed.
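
A minimal sketch of that idea in PyTorch uses `nn.Embedding.from_pretrained` with `freeze=True`. The matrix below is random placeholder data standing in for real pre-trained vectors, and its 5 by 3 shape is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Placeholder for a V x D matrix of pre-trained word vectors.
# Real vectors would be loaded from a file; these are random stand-ins.
pretrained = torch.randn(5, 3)

# Load the vectors into an embedding layer and freeze it,
# so gradient descent never updates these weights.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

print(embedding.weight.requires_grad)  # frozen layers do not require gradients
```

Passing `freeze=False` instead would start from the same vectors but let training keep adjusting them.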

126
00:10:45,890 --> 00:10:50,950
Typically, these word vectors are found through algorithms such as word2vec and GloVe.

127
00:10:50,960 --> 00:10:52,880
Now that is a deep topic in itself.

128
00:10:52,880 --> 00:10:56,830
So in this course we are going to avoid discussing that.

129
00:10:56,870 --> 00:11:04,270
Instead, we will train embeddings like any other layer, by letting gradient descent do its magic. If

130
00:11:04,270 --> 00:11:04,760
you want,

131
00:11:04,790 --> 00:11:14,720
you're encouraged to read the word2vec and GloVe papers, since they are generally pretty accessible.

132
00:11:14,740 --> 00:11:18,000
To summarize this lecture, here's what we discussed.

133
00:11:18,040 --> 00:11:23,390
First, we recognized that it's not possible to multiply a word by a weight matrix.

134
00:11:23,440 --> 00:11:26,270
There is no concept of multiplying words.

135
00:11:26,320 --> 00:11:33,020
Therefore, we needed to somehow convert words into a numerical representation. We first thought about

136
00:11:33,020 --> 00:11:34,370
one-hot encoding each word.

137
00:11:34,670 --> 00:11:38,420
But we soon realized why that would not be a good idea.

138
00:11:39,140 --> 00:11:42,180
First of all it would take up way too much space.

139
00:11:42,230 --> 00:11:47,780
There are around 1 million possible tokens in the English language, so each word would take up 1 million

140
00:11:47,780 --> 00:11:50,830
numbers to represent which is quite silly.

141
00:11:50,930 --> 00:11:56,620
Also, these one-hot encoded vectors are actually meaningless in terms of being feature vectors.

142
00:11:56,630 --> 00:12:02,540
This is because all word vectors, if they're just one-hot encoded vectors, would be equal distances apart.

143
00:12:03,230 --> 00:12:06,660
So "cat" is the same distance to "feline" as it is to "airplane".

144
00:12:06,710 --> 00:12:08,960
So that's not very useful.

145
00:12:08,960 --> 00:12:11,500
It doesn't allow us to really make use of the rule that

146
00:12:11,510 --> 00:12:14,030
machine learning is nothing but a geometry problem.

147
00:12:15,950 --> 00:12:19,940
And thus we need a better way to convert words into vectors.

148
00:12:19,940 --> 00:12:26,180
This method is to take each word, turn it into an integer, and then use that integer to index an embedding

149
00:12:26,180 --> 00:12:27,530
matrix.

150
00:12:27,530 --> 00:12:31,640
This is much faster than one-hot encoding and then doing matrix multiplication.

151
00:12:36,730 --> 00:12:37,570
In this way,

152
00:12:37,840 --> 00:12:44,840
an input sentence consisting of words becomes a T by D matrix of numbers, which you know an RNN

153
00:12:44,890 --> 00:12:47,050
can readily accept.

154
00:12:47,110 --> 00:12:50,640
We saw that this layer's parameters are trained just like any other layer.

155
00:12:50,710 --> 00:12:53,980
So nothing other than gradient descent needs to be done.

156
00:12:55,330 --> 00:13:01,450
Alternatively, if you're interested, you're welcome to read up on word2vec and GloVe, which provide

157
00:13:01,450 --> 00:13:06,190
different ways of initializing word embeddings so that you don't need to train them along with the

158
00:13:06,190 --> 00:13:07,240
rest of the neural network.