1
00:00:11,050 --> 00:00:15,100
So in this lecture, we will be looking at some neural word embeddings in code.

2
00:00:15,730 --> 00:00:19,120
The first step will be to download some pre-trained word embeddings.

3
00:00:19,810 --> 00:00:24,070
So in practice, we typically use word embeddings that have been trained by someone else.

4
00:00:24,550 --> 00:00:29,110
The reason why we do this is because these methods tend to use very large data sets.

5
00:00:29,470 --> 00:00:32,680
So you don't want to replicate this work if you don't have to.

6
00:00:33,340 --> 00:00:38,170
Now, if you just want to learn, then that's a great exercise, which we do in my other courses that

7
00:00:38,170 --> 00:00:40,210
focus on these specific algorithms.

8
00:00:40,660 --> 00:00:43,810
But for the purpose of this class, that won't be necessary.

9
00:00:44,860 --> 00:00:49,990
In addition, note that for newer methods like transformers, there is absolutely no way for you to

10
00:00:49,990 --> 00:00:51,340
train those models yourself.

11
00:00:51,910 --> 00:00:57,400
Modern methods can cost millions of dollars per training run, which doesn't include all the experimentation

12
00:00:57,400 --> 00:01:01,480
you need to do to optimize hyperparameters and debug and so forth.

13
00:01:02,110 --> 00:01:07,810
So for these methods, you essentially have no choice but to use pre-trained models.

14
00:01:07,810 --> 00:01:12,400
For word embeddings, the situation is not so severe, but we typically start with pre-trained vectors anyway.

15
00:01:19,380 --> 00:01:22,680
In any case, the next step is to unzip our zipped file.

16
00:01:29,080 --> 00:01:32,230
The next step is to import KeyedVectors from Gensim.

17
00:01:33,400 --> 00:01:37,180
This class has the necessary API for interacting with the word embeddings

18
00:01:37,420 --> 00:01:38,530
we just downloaded.

19
00:01:43,770 --> 00:01:48,510
The next step is to call the function load_word2vec_format, passing in the pre-trained word

20
00:01:48,510 --> 00:01:49,260
embeddings.

21
00:01:53,670 --> 00:01:56,880
The next step is to implement a function called find_analogies.

22
00:01:57,390 --> 00:02:01,600
This function will take in three words: w1, w2, and w3.

23
00:02:02,250 --> 00:02:04,840
Their positions in the analogy are as follows.

24
00:02:05,370 --> 00:02:10,259
We want to have w1 is to w2 as the result is to w3.

25
00:02:11,039 --> 00:02:14,850
So you can see I've represented the result as a question mark in the comments.

26
00:02:15,390 --> 00:02:19,230
For example, we might want to do king minus man equals queen minus woman.

27
00:02:19,740 --> 00:02:22,020
In this case, the question mark should be queen.

28
00:02:22,710 --> 00:02:28,470
If we rearrange the equation so that we isolate the question mark, we get plus king plus woman minus

29
00:02:28,470 --> 00:02:28,920
man.

30
00:02:29,640 --> 00:02:32,190
Basically, the important thing to notice is the sign of each word.

31
00:02:32,880 --> 00:02:36,030
So what we're going to do is call the function most_similar.

32
00:02:36,630 --> 00:02:41,790
The inputs to this function are the words which are represented by positive vectors and the words which

33
00:02:41,790 --> 00:02:43,800
are represented by negative vectors.

34
00:02:44,340 --> 00:02:50,010
As you've seen from above, w1 and w3 should be positive, while w2 should be negative.

35
00:02:50,820 --> 00:02:56,370
Note that the result, which I've called r, actually contains multiple words along with their corresponding

36
00:02:56,370 --> 00:02:57,150
scores.

37
00:02:57,720 --> 00:03:02,520
Thus, to get the top word only, without the score, we index r at [0][0].

38
00:03:03,480 --> 00:03:08,280
The last step is to print out the full analogy so that we can see it in the format we expect.

39
00:03:13,490 --> 00:03:17,660
OK, so the next step is to test our function, starting with our running example.

40
00:03:21,620 --> 00:03:26,420
As you can see, we get king minus man equals queen minus woman, as expected.

41
00:03:29,420 --> 00:03:31,910
Now, let's see if this works for cities and countries.

42
00:03:35,860 --> 00:03:40,420
As you can see, we get France minus Paris equals England minus London.

43
00:03:40,690 --> 00:03:41,680
This makes sense.

44
00:03:44,210 --> 00:03:46,010
Now, let's test this with another city.

45
00:03:49,440 --> 00:03:55,860
So this time we get France minus Paris equals Italy minus Rome. Again, a satisfying result.

46
00:03:59,190 --> 00:04:04,740
So you might wonder what would happen if we reverse the order of the country and city. What do we get

47
00:04:04,740 --> 00:04:08,130
if we use city minus country instead of country minus city?

48
00:04:11,770 --> 00:04:17,230
OK, so this time we get a poor result: Paris minus France equals Lohan minus Italy.

49
00:04:17,829 --> 00:04:20,050
As far as I know, this does not make sense.

50
00:04:20,560 --> 00:04:24,760
This is our first sign that these vectors are not perfect, although they are pretty good.

51
00:04:27,580 --> 00:04:28,180
Now, let's try:

52
00:04:28,210 --> 00:04:31,120
France is to French as what is to English?

53
00:04:34,460 --> 00:04:36,260
As expected, we get England.

54
00:04:39,360 --> 00:04:43,290
The next step is to try the same kind of analogy, but with different countries.

55
00:04:47,010 --> 00:04:49,170
OK, so this time we get a strange result.

56
00:04:49,680 --> 00:04:52,740
We should expect to get China, but instead we get Tibet.

57
00:04:53,280 --> 00:04:57,480
Of course, these are closely related countries, so it's not completely incorrect.

58
00:04:58,170 --> 00:05:02,730
Furthermore, it's still remarkable that the result is a country and not some random word.

59
00:05:05,680 --> 00:05:10,150
The next step is to try the same type of analogy again, but with Italian instead.

60
00:05:13,810 --> 00:05:16,540
OK, so this time we get Italy as expected.

61
00:05:19,370 --> 00:05:25,640
The next step is to test this with months of the year. So December is to November as what is to June?

62
00:05:29,320 --> 00:05:34,810
OK, so we get September, which is not quite what we expected, but still not bad, as it is a month

63
00:05:34,810 --> 00:05:35,530
of the year.

64
00:05:36,280 --> 00:05:39,700
Note that for this type of analogy, I've had better success with GloVe.

65
00:05:42,810 --> 00:05:46,650
The next step is to test out cities and states in the United States.

66
00:05:47,310 --> 00:05:50,460
So Miami is to Florida as what is to Texas?

67
00:05:54,320 --> 00:05:57,530
OK, so we get Dallas, which is an answer that makes sense.

68
00:06:00,430 --> 00:06:02,890
The next step is to test out occupations.

69
00:06:03,340 --> 00:06:06,430
So Einstein is to scientist as what is to painter?

70
00:06:10,070 --> 00:06:13,700
Now for this, I expected to get something like Picasso or Rembrandt.

71
00:06:14,180 --> 00:06:17,120
Personally, I don't know whether or not there is a painter named Jude.

72
00:06:17,390 --> 00:06:20,600
But you can feel free to look that up yourself if you're curious.

73
00:06:21,380 --> 00:06:23,870
At the very least, it is a name, which makes sense.

74
00:06:26,470 --> 00:06:28,780
The next thing we're going to test is pronouns.

75
00:06:29,110 --> 00:06:31,660
So man is to woman as what is to she?

76
00:06:35,020 --> 00:06:37,180
And the answer is he, as expected.

77
00:06:40,190 --> 00:06:43,400
Now, let's try man is to woman as what is to aunt?

78
00:06:47,080 --> 00:06:49,420
And the answer is uncle, as expected.

79
00:06:52,270 --> 00:06:55,240
Let's try man is to woman as what is to sister?

80
00:06:58,940 --> 00:07:01,040
The answer is brother, as expected.

81
00:07:03,870 --> 00:07:06,840
Now, let's try man is to woman as what is to wife?

82
00:07:10,240 --> 00:07:11,440
And the answer is son.

83
00:07:12,190 --> 00:07:17,110
For this, I would have expected to get husband, but again, remember that these vectors are not perfect.

84
00:07:17,710 --> 00:07:23,560
Note that it still may be the case that husband is in the list of returned results; it's just not the top

85
00:07:23,560 --> 00:07:24,100
result.

86
00:07:27,540 --> 00:07:30,810
Now, let's try man is to woman as actor is to actress.

87
00:07:34,030 --> 00:07:36,580
So the result is actor, as expected.

88
00:07:40,050 --> 00:07:43,200
Now, let's try man is to woman as father is to mother.

89
00:07:46,940 --> 00:07:49,310
And again, we get father, as expected.

90
00:07:52,110 --> 00:07:55,500
Now, let's try nephew is to niece as uncle is to aunt.

91
00:07:58,940 --> 00:08:01,100
And again, we get uncle, as expected.

92
00:08:04,650 --> 00:08:10,380
OK, so recall that what we've done is map words to vectors. We've so far looked at analogies,

93
00:08:10,380 --> 00:08:17,040
but an even simpler question is to ask: given a word, what are the most similar words? To do this,

94
00:08:17,040 --> 00:08:21,330
we're going to write a function called nearest_neighbors that takes in only one word.

95
00:08:21,990 --> 00:08:27,900
In fact, this function is much simpler than before because we only have to specify the positive argument.

96
00:08:28,530 --> 00:08:33,210
Once we get the results, we're simply going to loop through them and print out the matching words.

97
00:08:38,409 --> 00:08:40,750
OK, so let's look at the nearest neighbors for king.

98
00:08:45,090 --> 00:08:51,330
So as you can see, we get kings, queen, monarch, crown prince, prince, sultan, and so forth.

99
00:08:51,780 --> 00:08:53,280
So these all seem to make sense.

100
00:08:57,610 --> 00:08:58,900
Now, let's try France.

101
00:09:03,130 --> 00:09:07,540
So we get Spain, French, Germany, Europe, Italy and so forth.

102
00:09:08,170 --> 00:09:13,180
This all makes sense since they are countries, except French, which is not a country but still makes

103
00:09:13,180 --> 00:09:15,610
sense, and Europe, which is a continent.

104
00:09:16,360 --> 00:09:18,670
But notice that all these countries are in Europe.

105
00:09:22,070 --> 00:09:23,420
Now, let's try Einstein.

106
00:09:28,210 --> 00:09:30,640
So in this case, we get some pretty weird results.

107
00:09:31,210 --> 00:09:33,370
Most of these are names, which makes sense.

108
00:09:34,030 --> 00:09:39,730
Albert is somehow in third place, which is odd, and somehow Ellen Fellow is in second place.

109
00:09:40,300 --> 00:09:44,680
Perhaps this has to do with the fact that people often use the term Einstein sarcastically.

110
00:09:47,620 --> 00:09:49,600
Now, let's look at the nearest neighbors for woman.

111
00:09:53,760 --> 00:09:58,380
In this case, we get man, girl, teenage girl, teenager and so on.

112
00:09:58,920 --> 00:10:00,290
So these mostly make sense.

113
00:10:03,270 --> 00:10:04,470
Now, let's try nephew.

114
00:10:08,140 --> 00:10:12,790
So this time we get son, uncle, brother, grandson, cousin, and so forth.

115
00:10:13,150 --> 00:10:14,320
So these all make sense.

116
00:10:17,150 --> 00:10:19,880
Now, let's try February, which is a month of the year.

117
00:10:24,620 --> 00:10:28,970
OK, so as you can see, most of the other matches are months, which makes sense.

118
00:10:32,400 --> 00:10:36,420
So now that you know how word vectors work, it's time to do an exercise.

119
00:10:36,840 --> 00:10:39,870
We've just looked at how to analyze word2vec vectors.

120
00:10:40,320 --> 00:10:44,610
This was relatively easy because we had the Gensim library to do most of the work.

121
00:10:45,240 --> 00:10:48,210
However, note that this work is not really that difficult

122
00:10:48,240 --> 00:10:54,150
once you have the vectors. For this exercise, you're going to download GloVe vectors, which gives you

123
00:10:54,150 --> 00:10:57,750
the actual vectors stored as text instead of a binary file

124
00:10:57,750 --> 00:11:03,150
like Google's word2vec. This file is in a very simple format, which you should be able to figure

125
00:11:03,150 --> 00:11:03,630
out.

126
00:11:04,260 --> 00:11:08,910
Once you've downloaded the GloVe vectors, you should write the code to implement find_analogies and

127
00:11:08,910 --> 00:11:12,990
nearest_neighbors so that they work exactly the same way as they did in this notebook.

128
00:11:13,560 --> 00:11:17,460
Your goal is to check whether or not GloVe vectors are better than word2vec.

129
00:11:17,940 --> 00:11:19,850
Good luck, and I'll see you in the next lecture.

