1
00:00:11,190 --> 00:00:15,750
So in this lecture, we will briefly introduce the concept of neural word embeddings.

2
00:00:16,290 --> 00:00:20,970
Now this topic is a bit more advanced than the rest of this section, so we won't go into great detail.

3
00:00:21,660 --> 00:00:26,790
However, I would like to introduce this topic now so that you have at least some idea of the other

4
00:00:26,790 --> 00:00:29,040
ways we can create vectors out of text.

5
00:00:30,120 --> 00:00:35,490
When you go on to study deep learning for NLP, then you will be able to study these methods more in

6
00:00:35,490 --> 00:00:35,970
depth.

7
00:00:36,330 --> 00:00:40,350
But for now, this is just a brief introduction to what is possible.

8
00:00:45,080 --> 00:00:49,610
So in the previous lectures, we saw two methods of converting documents into vectors.

9
00:00:50,420 --> 00:00:55,850
That is, we converted each document in a machine learning dataset into a vector where each component

10
00:00:55,850 --> 00:01:02,540
of that vector corresponds to a different unique word. Neural embeddings are a bit different.

11
00:01:03,740 --> 00:01:09,070
Firstly, note that the word "embedding" is just a fancy way of saying "vector", at least for our purposes,

12
00:01:09,080 --> 00:01:11,690
so don't be intimidated by different terminology.

13
00:01:12,620 --> 00:01:18,530
Secondly, these neural methods are typically used to convert words into vectors instead of full documents.

14
00:01:19,010 --> 00:01:21,710
Thus, we get a much more fine-grained representation.

15
00:01:22,340 --> 00:01:27,410
If each word can become a vector, then a document becomes a whole sequence of vectors.

16
00:01:27,890 --> 00:01:32,930
This contains much more information than what we looked at in this section, where the whole document

17
00:01:32,930 --> 00:01:34,910
gets turned into just a single vector.

18
00:01:35,510 --> 00:01:38,420
As you recall, this is called a bag of words.

19
00:01:43,070 --> 00:01:45,590
So why might sequences of vectors be useful?

20
00:01:46,280 --> 00:01:51,230
Well, in machine learning and specifically deep learning, we have special models that are purpose

21
00:01:51,230 --> 00:01:52,490
built for sequences.

22
00:01:52,850 --> 00:01:56,090
These include CNNs, RNNs, and Transformers.

23
00:01:56,660 --> 00:02:01,910
Unlike bag of words methods, these models do consider the ordering of the words in a sentence.

24
00:02:02,450 --> 00:02:07,160
These methods are so powerful that they have led to state-of-the-art performance in many areas.

25
00:02:07,610 --> 00:02:15,320
For example, language translation, question answering, chatbots, speech-to-text, text-to-speech, and

26
00:02:15,320 --> 00:02:17,210
biological sequence analysis.

27
00:02:21,820 --> 00:02:25,720
Now, there are two methods I'd like to discuss when it comes to word embeddings.

28
00:02:26,140 --> 00:02:28,120
These are word2vec and GloVe.

29
00:02:28,660 --> 00:02:33,790
At this point, there's no need to go into any detail since we haven't discussed the necessary prerequisites.

30
00:02:34,240 --> 00:02:37,900
However, I can share with you the kinds of techniques which are used.

31
00:02:42,460 --> 00:02:48,220
So for word2vec, what you need to know is how to build a feedforward neural network in order to

32
00:02:48,220 --> 00:02:49,310
find the word embeddings.

33
00:02:49,330 --> 00:02:54,280
We need to find the weights of the neural network, which effectively are the word embeddings themselves.

34
00:02:55,150 --> 00:02:57,670
The goal of training this neural network is quite simple,

35
00:02:57,880 --> 00:02:59,800
although the details are more complex.

36
00:03:04,380 --> 00:03:09,960
Basically, the goal of this neural network is to predict, given an input word, whether or not another

37
00:03:09,960 --> 00:03:16,260
word would be found in its context. By context, we mean a small window surrounding the input word.

38
00:03:16,860 --> 00:03:21,360
So consider the sentence "The quick brown fox jumps over the lazy dog."

39
00:03:21,990 --> 00:03:27,840
Suppose that our input to the neural network is jumps, and the size of the context window is five words,

40
00:03:28,050 --> 00:03:30,300
which is two to the left and two to the right.

41
00:03:30,960 --> 00:03:38,520
So the output "brown" would be considered a positive match, as would "fox", "over", and "the". On the other hand,

42
00:03:38,520 --> 00:03:40,260
"dog" would be a negative match.

43
00:03:40,950 --> 00:03:46,770
Furthermore, any word not found in the sentence, such as dinosaur, would also be considered a negative

44
00:03:46,770 --> 00:03:47,310
match.

45
00:03:47,910 --> 00:03:48,930
So that's word2vec.

46
00:03:49,260 --> 00:03:54,540
You're essentially training a neural network to make a binary prediction about whether an input word

47
00:03:54,540 --> 00:03:58,610
and some output word can be found close together in the text corpus.

48
00:04:03,260 --> 00:04:08,480
So GloVe doesn't use neural networks, but it was invented at the same time that deep learning became

49
00:04:08,480 --> 00:04:09,830
popular for NLP.

50
00:04:10,430 --> 00:04:15,440
The word embeddings that come from GloVe are also used in neural networks at later stages.

51
00:04:15,920 --> 00:04:20,720
And thus, although we don't technically use neural networks to find the GloVe embeddings, it is still

52
00:04:20,720 --> 00:04:22,700
considered part of deep NLP.

53
00:04:24,560 --> 00:04:25,850
So how does GloVe work?

54
00:04:26,450 --> 00:04:32,780
GloVe essentially works like a recommender system. Suppose that you have a list of 1000 users and 100

55
00:04:32,780 --> 00:04:33,440
movies.

56
00:04:33,920 --> 00:04:38,940
Not every user has watched every movie and thus there are still items to recommend.

57
00:04:39,560 --> 00:04:45,920
Your goal in training a recommender system is to take the existing ratings and to try to predict what

58
00:04:45,920 --> 00:04:48,590
users will rate movies they have not yet seen.

59
00:04:49,220 --> 00:04:54,680
The idea being that if you predict a user will rate a movie highly, then you should recommend that

60
00:04:54,680 --> 00:04:56,210
movie to that user.

61
00:05:00,970 --> 00:05:02,800
So how does this relate back to GloVe?

62
00:05:03,520 --> 00:05:06,970
Well, just like word2vec, we apply the idea of context.

63
00:05:07,480 --> 00:05:09,010
Imagine again this sentence.

64
00:05:09,400 --> 00:05:12,010
The quick brown fox jumps over the lazy dog.

65
00:05:12,610 --> 00:05:14,560
Suppose again, we take the word jumps.

66
00:05:15,160 --> 00:05:20,950
We can see that "jumps" is just one word away from "fox", so perhaps we may give that a score of one out

67
00:05:20,950 --> 00:05:21,490
of one.

68
00:05:22,810 --> 00:05:25,420
We see that jumps is two words away from brown.

69
00:05:25,720 --> 00:05:28,480
So perhaps we may give that a score of one out of two.

70
00:05:29,800 --> 00:05:35,560
So you can see that we are pretending that in this recommender system, the users are words, but the

71
00:05:35,560 --> 00:05:37,390
movies are also words.

72
00:05:37,810 --> 00:05:42,110
Our so-called ratings are simply based on the distance between the words.

73
00:05:42,640 --> 00:05:44,680
And thus, because of this analogy,

74
00:05:44,950 --> 00:05:47,560
we can build a recommender system out of words,

75
00:05:48,010 --> 00:05:50,440
based on how high one word would rate another word.

76
00:05:50,980 --> 00:05:52,480
So that's the essence of GloVe.

77
00:05:53,760 --> 00:05:59,100
Now, please keep in mind that I do have material for you if you want to learn the details about how

78
00:05:59,100 --> 00:06:00,630
word2vec and GloVe work.

79
00:06:01,020 --> 00:06:06,090
So if you're eager to learn about them, please look through the rest of the course materials or ask

80
00:06:06,090 --> 00:06:08,880
me on the Q&A if you can't find what you need.

81
00:06:13,500 --> 00:06:15,840
OK, so now let's get back to the practical world.

82
00:06:16,140 --> 00:06:18,780
What can we actually do with word embeddings?

83
00:06:20,100 --> 00:06:25,560
One simplistic way to use the word embeddings is to convert a full document into a vector, just like

84
00:06:25,560 --> 00:06:26,760
we did in this section.

85
00:06:27,540 --> 00:06:33,480
However, these vectors, unlike count vectors and TF-IDF, will not be sparse and will not be high

86
00:06:33,480 --> 00:06:34,170
dimensional.

87
00:06:34,830 --> 00:06:40,500
This is because we typically choose a small number for the vector dimensionality, such as 20, 50,

88
00:06:40,500 --> 00:06:42,810
100, 300 and so forth.

89
00:06:43,470 --> 00:06:48,330
So these vectors are very small, unlike the vectors from this section, which have size V.

90
00:06:49,290 --> 00:06:50,430
So how does it work?

91
00:06:51,180 --> 00:06:57,750
Well, suppose we did the following steps. Given a document, we first tokenize it into individual words.

92
00:06:58,440 --> 00:07:04,380
The next step is to use our word embeddings to map each of those words to their corresponding word vectors.

93
00:07:05,010 --> 00:07:10,650
Once we have those word vectors, we simply take their average to get a single vector of the same size.

94
00:07:11,160 --> 00:07:16,850
This gives us one vector that represents the whole document without regard to the ordering of the words.

95
00:07:17,280 --> 00:07:20,130
Thus, it is another bag of words representation.

96
00:07:24,720 --> 00:07:28,710
Another interesting thing you can do with word embeddings is word analogies.

97
00:07:29,250 --> 00:07:35,400
This is perhaps the most impressive result of these word embedding techniques. Because what we have is

98
00:07:35,400 --> 00:07:41,340
a set of vectors, we can do arithmetic on these vectors. That is, we can add and subtract.

99
00:07:42,030 --> 00:07:45,870
So consider the analogy: king is to man as what is to woman?

100
00:07:46,680 --> 00:07:48,300
The answer, of course, is queen.

101
00:07:48,900 --> 00:07:54,300
This is because kings are men and queens are women, and they both occupy sort of the same role in their

102
00:07:54,300 --> 00:07:54,960
society.

103
00:07:55,710 --> 00:07:59,080
It turns out that we can express this analogy in math.

104
00:07:59,640 --> 00:08:05,520
We can say that the vector for king minus the vector for man is approximately equal to the vector for

105
00:08:05,520 --> 00:08:07,890
queen minus the vector for woman.

106
00:08:08,610 --> 00:08:11,010
The way we actually do it in code is this.

107
00:08:12,850 --> 00:08:15,490
Suppose we want to answer the question, as I asked it.

108
00:08:15,940 --> 00:08:18,370
King is to man as what is to woman.

109
00:08:19,000 --> 00:08:22,330
In this case, we would take king minus man plus woman.

110
00:08:22,930 --> 00:08:25,090
This would give us some arbitrary vector.

111
00:08:25,840 --> 00:08:30,310
Then we would search through our list of vectors to find the closest one to this vector.

112
00:08:31,000 --> 00:08:34,000
The result is that we find queen as expected.

113
00:08:38,590 --> 00:08:42,340
So it turns out that we can do this type of analogy for all sorts of things.

114
00:08:42,820 --> 00:08:46,480
For example, France is to Paris, as Italy is to Rome.

115
00:08:46,840 --> 00:08:48,680
So that's with countries and cities.

116
00:08:49,180 --> 00:08:53,470
We can also do Japan is to Japanese as China is to Chinese.

117
00:08:53,890 --> 00:08:57,220
So that's the country and what we call the people of that country.

118
00:08:58,150 --> 00:09:02,140
We can also do Miami is to Florida, as Dallas is to Texas.

119
00:09:02,630 --> 00:09:06,040
Thus, the model knows about how cities are related to states.

120
00:09:07,610 --> 00:09:11,620
We can also do December is to November as July is to June.

121
00:09:12,170 --> 00:09:15,350
Thus, the model knows something about the ordering of months.

122
00:09:17,220 --> 00:09:20,250
We can also do man is to woman as he is to she.

123
00:09:20,940 --> 00:09:23,550
So the model knows something about pronouns.

124
00:09:24,810 --> 00:09:28,440
Now at first glance, these might seem like just toy applications.

125
00:09:28,800 --> 00:09:32,670
You might ask, "How is this going to help me predict spam or predict sentiment?"

126
00:09:33,480 --> 00:09:36,600
The answer is that these are just small parts of a bigger machine.

127
00:09:36,900 --> 00:09:42,990
That machine is a neural network. What this tells us is that we are able to learn some sense of

128
00:09:42,990 --> 00:09:49,890
meaning for each of these words by their position in a vector space. That by itself is pretty profound.

129
00:09:50,430 --> 00:09:56,130
And what this tells us is that if we were to use these vectors in a neural network, then at least we

130
00:09:56,130 --> 00:10:01,830
know that this part of the neural network has close to the optimal weights, and thus we could perhaps

131
00:10:01,830 --> 00:10:03,750
avoid having to train those weights.

132
00:10:04,980 --> 00:10:09,750
You can think of this like transfer learning, where you transfer knowledge from one task to another

133
00:10:09,750 --> 00:10:10,440
task.