1
00:00:11,080 --> 00:00:16,420
So in this lecture, we will be discussing embeddings, which are the deep learning way of converting

2
00:00:16,420 --> 00:00:19,630
words into vectors. To begin this lecture,

3
00:00:19,660 --> 00:00:22,270
let's step outside of NLP for a little bit.

4
00:00:22,960 --> 00:00:27,910
Suppose we're just doing some generic machine learning task, like trying to predict whether or not

5
00:00:27,910 --> 00:00:30,700
a user will make a purchase on your website.

6
00:00:31,510 --> 00:00:35,560
One of the data points you have is how the user got to your website.

7
00:00:36,160 --> 00:00:37,540
There are three possibilities.

8
00:00:38,020 --> 00:00:43,450
Either they got there organically, like searching on DuckDuckGo, or they got there from an advertisement,

9
00:00:43,810 --> 00:00:45,490
or they got there from an affiliate.

10
00:00:46,180 --> 00:00:52,060
This data is categorical, so we cannot just put it into our N by D matrix like we normally would.

11
00:00:53,770 --> 00:00:57,010
So how can we include this data in our machine learning model?

12
00:00:58,000 --> 00:01:02,620
One idea you might have is to simply assign numbers to these possibilities.

13
00:01:03,070 --> 00:01:04,690
So organic becomes zero.

14
00:01:04,810 --> 00:01:06,490
Advertisement becomes one.

15
00:01:06,490 --> 00:01:07,990
And affiliate becomes two.

16
00:01:08,770 --> 00:01:12,970
This is not a good approach, since this puts these different categories on a scale.

17
00:01:13,690 --> 00:01:17,680
Numerical data only makes sense when your data actually is numerical.

18
00:01:18,190 --> 00:01:24,370
For example, like using the number of hours you studied for an exam to predict your exam grade, but

19
00:01:24,370 --> 00:01:27,490
there's no reason for these particular numbers to be assigned.

20
00:01:28,180 --> 00:01:33,100
For example, why can't we make organic minus one million instead?

21
00:01:33,130 --> 00:01:39,160
As you may recall, one popular option is one-hot encoding. Using this method,

22
00:01:39,190 --> 00:01:43,540
this particular input becomes a vector of size three.

23
00:01:43,540 --> 00:01:49,620
One-hot encoding means that each of these possibilities gets a unique vector representation

24
00:01:49,960 --> 00:01:54,880
where one of the positions in this vector has the value one, while the rest are all zeros.

25
00:01:55,600 --> 00:01:58,510
So, for instance, organic becomes one zero zero.

26
00:01:58,900 --> 00:02:01,000
Advertisement becomes zero one zero.

27
00:02:01,330 --> 00:02:03,520
And affiliate becomes zero zero one.
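
As a minimal sketch of the encoding just described, here is one way to one-hot encode the three traffic sources in plain Python. The names `CATEGORIES` and `one_hot` are made up for this illustration and are not from the lecture's code.

```python
# Hypothetical category list for the traffic-source feature from the lecture.
CATEGORIES = ["organic", "advertisement", "affiliate"]


def one_hot(category, categories=CATEGORIES):
    """Return a list with a 1 at the category's position and 0 elsewhere."""
    index = categories.index(category)  # raises ValueError for an unknown category
    return [1 if i == index else 0 for i in range(len(categories))]


for name in CATEGORIES:
    print(name, one_hot(name))
```

Each category gets its own position, so no category is placed "higher" or "lower" than another on a numeric scale.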

28
00:02:04,300 --> 00:02:05,620
Now, this makes sense.

29
00:02:06,340 --> 00:02:11,080
Imagine if you build a model using logistic regression, as we've done in the past.

30
00:02:12,370 --> 00:02:16,270
Suppose that the weight for the first position is very large and positive.

31
00:02:16,960 --> 00:02:22,300
This means that users who came to your site organically are more likely to make a purchase.

32
00:02:23,230 --> 00:02:26,380
Suppose that the weight in the second position is very negative.

33
00:02:26,980 --> 00:02:32,380
This means that users who came to your site from an advertisement are less likely to make a purchase.

34
00:02:36,940 --> 00:02:43,210
Now, clearly, this is directly applicable to NLP, since NLP deals with words, and currently what we

35
00:02:43,210 --> 00:02:49,510
are trying to do is to figure out how to convert text into a numerical feature vector, which is appropriate

36
00:02:49,510 --> 00:02:50,470
for a neural network.

37
00:02:51,910 --> 00:02:57,670
If we treat each word as a category, then all we have to do is create a very big feature vector with

38
00:02:57,670 --> 00:02:59,890
one position for every possible word.

39
00:03:00,610 --> 00:03:06,700
Once we have this vector, which is a vector of mostly zeros except for a single one, we can then use

40
00:03:06,700 --> 00:03:11,530
the usual W Transpose X plus B computation, which you've seen many times.

41
00:03:12,400 --> 00:03:17,020
Unfortunately, there's a problem with this. Because this vector is very big,

42
00:03:17,410 --> 00:03:19,990
this matrix multiplication will be very slow.

43
00:03:24,750 --> 00:03:29,670
So consider what happens if we multiply a one hot encoded vector with a weight matrix.

44
00:03:30,270 --> 00:03:33,780
Let's suppose the vector is just three-dimensional: one, zero, zero.

45
00:03:34,320 --> 00:03:37,350
And the weight matrix just contains the numbers one up to nine.

46
00:03:38,250 --> 00:03:44,520
You can verify for yourself that when we multiply this vector by this matrix, we get the vector one

47
00:03:44,520 --> 00:03:45,090
two three.

48
00:03:50,210 --> 00:03:53,330
Now, consider the one-hot encoded vector zero, one, zero.

49
00:03:54,140 --> 00:03:59,710
What happens when you multiply this vector by the same weight matrix? We get the vector four, five,

50
00:03:59,720 --> 00:04:00,320
six.

51
00:04:05,390 --> 00:04:08,450
Now, consider the one-hot encoded vector zero, zero, one.

52
00:04:09,080 --> 00:04:12,770
What happens when we multiply this vector by the same weight matrix?

53
00:04:13,250 --> 00:04:15,050
We get the vector seven, eight, nine.

54
00:04:20,110 --> 00:04:26,320
So what is the pattern we see? If the index in the one-hot encoded vector which was set to one was

55
00:04:26,320 --> 00:04:26,860
one,

56
00:04:27,190 --> 00:04:31,750
then we get the first row of the weight matrix. If the index was set to two,

57
00:04:32,080 --> 00:04:36,550
then we get the second row of the weight matrix. If the index was set to three,

58
00:04:36,820 --> 00:04:39,040
then we get the third row of the weight matrix.

59
00:04:44,180 --> 00:04:50,570
In other words, if we one-hot encode the integer k and multiply it by the weight matrix, all

60
00:04:50,570 --> 00:04:54,530
that's really doing is selecting the kth row of the weight matrix.
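
The three multiplications above can be checked with a short NumPy sketch. It assumes the weight matrix holds the numbers one to nine laid out row by row, which is what the results in the lecture imply. Note that Python indexes from zero, so the lecture's "first row" is `W[0]`.

```python
import numpy as np

# The 3x3 weight matrix from the lecture: the numbers one up to nine, row by row.
W = np.arange(1, 10).reshape(3, 3)

products = []
for k in range(3):
    x = np.zeros(3, dtype=int)
    x[k] = 1            # one-hot encode the (zero-based) index k
    product = x @ W     # vector-matrix multiplication
    products.append(product)
    print(x, "->", product, "| row", k, "of W:", W[k])
```

In every case the product is identical to the corresponding row of `W`.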

61
00:04:59,680 --> 00:05:00,700
So here's a shortcut.

62
00:05:01,450 --> 00:05:03,820
The old way of doing this took two steps.

63
00:05:04,360 --> 00:05:10,170
First, we had to create a one-hot encoded vector and set the kth entry to one. Second,

64
00:05:10,180 --> 00:05:14,410
we had to multiply this one-hot encoded vector by the weight matrix.

65
00:05:16,400 --> 00:05:18,950
The shortcut way of doing this takes only one step.

66
00:05:19,340 --> 00:05:22,280
We simply index the weight matrix at row k.

67
00:05:23,150 --> 00:05:28,760
This is obviously much more efficient than creating a one-hot vector and then doing matrix multiplication.

68
00:05:33,810 --> 00:05:34,770
Think of it this way.

69
00:05:35,340 --> 00:05:38,460
Indexing an array is O(1). That's constant time.

70
00:05:39,090 --> 00:05:43,500
But how long does it take to create a one hot vector and then do matrix multiplication?

71
00:05:44,430 --> 00:05:51,510
If your vector is of size V and the weight matrix is of size V times D, then this is O(VD).
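
To make the comparison concrete, here is a rough NumPy sketch of both routes. The sizes `V` and `D`, the index `k`, and the random weights are all assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np

V, D = 10_000, 50              # hypothetical vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))    # weight matrix with one row per word
k = 1234                       # hypothetical word index

# Old way: build a length-V one-hot vector, then multiply (about V * D multiply-adds).
x = np.zeros(V)
x[k] = 1.0
slow = x @ W

# Shortcut: index the row directly; no multiplications are needed.
fast = W[k]

same = np.allclose(slow, fast)
multiply_adds = V * D
print(same, multiply_adds)
```

Both routes give the same D-dimensional vector, but the first one does work proportional to V times D, almost all of it multiplying by zero.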

72
00:05:56,510 --> 00:06:02,120
In fact, what I've just described is the embedding layer. The embedding layer is just like a dense

73
00:06:02,120 --> 00:06:02,510
layer

74
00:06:02,630 --> 00:06:08,780
if you passed in a one-hot encoded vector and you didn't have any bias terms. Because of what we

75
00:06:08,780 --> 00:06:09,450
just learned,

76
00:06:09,470 --> 00:06:15,830
we know that if our input is a one-hot encoded vector, then it's much more efficient to simply select

77
00:06:15,830 --> 00:06:20,630
the appropriate row of our weight matrix instead of doing a full matrix multiply.

78
00:06:25,360 --> 00:06:28,780
So let's consider how we will use the embedding layer in code.

79
00:06:29,740 --> 00:06:31,990
As usual, we start with an input layer.

80
00:06:32,650 --> 00:06:36,790
For this input layer, I've specified that the input dimension is just T.

81
00:06:37,690 --> 00:06:41,380
What this means is that we'll be passing in sequences of length T.

82
00:06:42,520 --> 00:06:48,790
The next step is to pass that input into an embedding layer. As you've seen, internally

83
00:06:49,000 --> 00:06:54,970
this is just a weight matrix, and thus we have to specify both the number of inputs and the number

84
00:06:54,970 --> 00:06:55,870
of outputs.

85
00:06:56,530 --> 00:06:59,350
The number of inputs should be our vocabulary size.

86
00:06:59,800 --> 00:07:02,830
This is because we should have a row for each unique word.

87
00:07:03,550 --> 00:07:09,010
The number of outputs should be the vector dimensionality that we choose for our word embeddings.

88
00:07:09,550 --> 00:07:13,570
This is another example of a hyperparameter to be optimized by the user.

89
00:07:18,330 --> 00:07:21,440
Now, as you know, it's always important to think about shapes.

90
00:07:22,150 --> 00:07:27,390
Suppose we called model.summary() and we looked at the shape of the output after our embedding.

91
00:07:28,320 --> 00:07:34,330
What we would see is that our data now has three dimensions: N by T by D, or, in other words,

92
00:07:34,360 --> 00:07:37,590
number of samples, by sequence length, by embedding dimension.

93
00:07:38,550 --> 00:07:40,170
Of course, this makes sense.

94
00:07:40,800 --> 00:07:44,440
If we passed in N documents, we should still have N samples.

95
00:07:44,940 --> 00:07:50,040
If all of those documents contain T words, then they should still have the sequence length T.

96
00:07:51,030 --> 00:07:57,360
The key is that each of those words was converted into a d dimensional vector and therefore our output

97
00:07:57,600 --> 00:07:59,660
has the shape N by T by D.
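
The code shown on the lecture slides is not part of this transcript, so the following is only a NumPy sketch of what an embedding lookup does to the shapes, not the actual layer code. All sizes and the names `embedding_matrix` and `word_indices` are assumptions for illustration.

```python
import numpy as np

N, T = 4, 6     # hypothetical number of documents and sequence length
V, D = 20, 3    # hypothetical vocabulary size and embedding dimension

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(V, D))       # one row per word in the vocabulary
word_indices = rng.integers(0, V, size=(N, T))   # each entry is a word index

# Integer-array indexing looks up one row of the matrix per word index.
embedded = embedding_matrix[word_indices]

print(word_indices.shape)   # input shape: N by T
print(embedded.shape)       # output shape: N by T by D
```

Every integer in the N by T input is swapped for a D-dimensional row, which is where the extra dimension comes from.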

98
00:08:00,600 --> 00:08:04,980
Understanding the shape is critical for understanding CNNs and RNNs.

99
00:08:05,640 --> 00:08:11,010
Essentially, you can think of this like a multidimensional time series dataset, after which you could

100
00:08:11,010 --> 00:08:14,370
pass this into a CNN or RNN like you normally would.

101
00:08:19,060 --> 00:08:24,580
At this point, I would like to take a moment to step back and to think about the big picture of what

102
00:08:24,580 --> 00:08:25,660
we have really done.

103
00:08:26,620 --> 00:08:32,080
Originally, I said we should one-hot encode each word, which technically is still the case.

104
00:08:32,620 --> 00:08:36,880
We just haven't explicitly created the one-hot encodings because that's inefficient.

105
00:08:37,480 --> 00:08:41,980
But notice how such one-hot encoded vectors are not useful geometrically.

106
00:08:42,730 --> 00:08:48,670
As you recall, one of my mottos for machine learning is that machine learning is nothing but geometry.

107
00:08:49,510 --> 00:08:54,160
When we plot word vectors, they should exist in some useful vector space.

108
00:08:55,030 --> 00:09:00,100
So, for instance, if I searched for the nearest neighbors of cat, I might find words like feline,

109
00:09:00,100 --> 00:09:02,260
lion, cougar, and so forth.

110
00:09:02,950 --> 00:09:06,760
But notice how one-hot encoded vectors do not have this characteristic.

111
00:09:07,480 --> 00:09:12,760
If you take any two distinct one-hot encoded vectors, the distance between them is always the square root

112
00:09:12,760 --> 00:09:13,300
of two.

113
00:09:13,930 --> 00:09:18,700
So cat is as close to feline as it is to an unrelated word, such as airplane.
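
The square-root-of-two claim is easy to verify: two different one-hot vectors differ in exactly two positions, each by one, so the Euclidean distance is the square root of 1 + 1. Here is a small check in plain Python, with a made-up vocabulary size `V`.

```python
import math
from itertools import combinations

V = 5  # hypothetical vocabulary size


def one_hot(index, size):
    """Return a list of floats with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Euclidean distance between every pair of distinct one-hot vectors.
distances = [
    math.dist(one_hot(i, V), one_hot(j, V))
    for i, j in combinations(range(V), 2)
]
print(distances)
```

Every pair comes out at the same distance, so the one-hot representation carries no notion of one word being more similar to another.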

114
00:09:23,440 --> 00:09:28,990
Effectively, what we are really doing with an embedding is creating a table of word vectors.

115
00:09:29,590 --> 00:09:32,980
The embedding layer is like a database for these word vectors.

116
00:09:33,550 --> 00:09:40,390
The word index is like a query into that database and the value or the output is the word vector corresponding

117
00:09:40,390 --> 00:09:41,200
to that word.

118
00:09:42,040 --> 00:09:46,210
It is our hope that these vectors will have some useful structure.

119
00:09:46,960 --> 00:09:51,250
So if we search for the nearest neighbors of cat, we will find words like feline.
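
To illustrate the "database" view, here is a toy lookup table with a nearest-neighbor search. The two-dimensional vectors are made up by hand purely for this example; real embeddings are learned and have many more dimensions. The names `vectors`, `word_to_index`, and `nearest_neighbors` are also hypothetical.

```python
import numpy as np

# Made-up 2-D vectors purely for illustration; real embeddings are learned.
words = ["cat", "feline", "lion", "airplane"]
vectors = np.array([
    [1.0, 1.0],     # cat
    [1.1, 0.9],     # feline
    [1.5, 1.4],     # lion
    [-3.0, 4.0],    # airplane
])
word_to_index = {w: i for i, w in enumerate(words)}


def nearest_neighbors(word, k=2):
    """Return the k words whose vectors are closest (Euclidean) to `word`."""
    query = vectors[word_to_index[word]]            # the word index is the query
    distances = np.linalg.norm(vectors - query, axis=1)
    order = np.argsort(distances)
    return [words[i] for i in order if words[i] != word][:k]


print(nearest_neighbors("cat"))
```

With these hand-picked vectors, "feline" and "lion" come back as the neighbors of "cat", while "airplane" is far away. That is the kind of structure we hope learned embeddings will have.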

120
00:09:53,650 --> 00:09:58,960
From this perspective, we have not exactly one-hot encoded each word and multiplied that by a matrix.

121
00:09:59,350 --> 00:10:05,680
But instead, we've simply come up with a better vector representation for each word. In the next lecture,

122
00:10:05,710 --> 00:10:08,410
we'll look at one method to find such vectors.

