1
00:00:11,140 --> 00:00:15,850
OK, so in this lecture, we are going to discuss the concept of vector similarity.

2
00:00:16,600 --> 00:00:22,210
Basically, you can think of this as a function that will take in a two vectors as input and as output.

3
00:00:22,210 --> 00:00:26,740
We will get back a single number, which tells us how similar those two vectors were.

4
00:00:27,280 --> 00:00:28,930
You can think of this like a score.

5
00:00:29,680 --> 00:00:31,120
So why is this useful?

6
00:00:32,080 --> 00:00:37,480
Well, suppose that we were successful in mapping all of the documents in our text corpus into vectors,

7
00:00:37,480 --> 00:00:39,120
perhaps using TFI Taf.

8
00:00:39,850 --> 00:00:45,790
Now we can ask questions like given some document, find me another document in the corpus that is most

9
00:00:45,790 --> 00:00:47,080
like the given document.

10
00:00:47,620 --> 00:00:52,540
You can imagine this might be useful for researchers who are looking for journal papers on a specific

11
00:00:52,540 --> 00:00:53,230
subject.

12
00:00:53,920 --> 00:00:58,930
We can also ask the opposite question which document is most unlike the given document?

13
00:00:59,650 --> 00:01:01,690
We can also do the same thing with words.

14
00:01:02,260 --> 00:01:04,970
Suppose that we've managed to map words into vectors.

15
00:01:05,290 --> 00:01:08,050
We can then compute the similarity between words.

16
00:01:08,620 --> 00:01:13,060
So, for example, one might expect the words king and queen to be very similar.

17
00:01:13,690 --> 00:01:17,140
Car vehicle and automobile should also be considered similar.

18
00:01:17,770 --> 00:01:21,190
Conversely, car and drive should be very dissimilar.

19
00:01:25,940 --> 00:01:31,160
This could be useful, for example, if you wanted to build an article spinner where your goal is to

20
00:01:31,160 --> 00:01:37,340
rewrite an existing article, but to replace every few words such that your new article is sufficiently

21
00:01:37,340 --> 00:01:38,930
different from the existing one.

22
00:01:39,740 --> 00:01:43,400
This technique is used by marketers who want to build websites with lots of content.

23
00:01:43,760 --> 00:01:49,160
But, of course, don't have the time or knowledge to write that content themselves, and also may not

24
00:01:49,160 --> 00:01:52,610
have the funds to pay others to write the content for them.

25
00:01:53,450 --> 00:01:58,790
Article spinning is a way to automatically generate content that won't be penalized by search engines

26
00:01:58,790 --> 00:02:01,910
like Google for being too similar to existing content.

27
00:02:03,110 --> 00:02:08,509
Now, I personally would not advocate doing this, since it's considered a Black Hat SEO technique,

28
00:02:08,840 --> 00:02:11,900
but it's an example of how machine learning could be used.

29
00:02:12,440 --> 00:02:17,750
Ultimately, whether it is ethical or not, it is one method that marketers have used to make money

30
00:02:17,750 --> 00:02:18,560
in the past.

31
00:02:19,010 --> 00:02:23,180
So word replacement is one application of vector similarity.

32
00:02:27,840 --> 00:02:33,120
OK, so as you recall, when we talk about vectors in machine learning, we typically assume they live

33
00:02:33,120 --> 00:02:34,470
in a Euclidean space.

34
00:02:35,040 --> 00:02:39,180
So each component is the length of the vector in that respective direction.

35
00:02:39,930 --> 00:02:46,200
As a simple example, the vector of three four means go three units in the X One Direction and go for

36
00:02:46,200 --> 00:02:50,320
units in the X two direction, and that will give you the tip of the vector.

37
00:02:50,400 --> 00:02:55,020
If you think of it as an arrow or simply the data point representing that vector.

38
00:02:56,010 --> 00:03:01,170
Now, because of this, there is a very natural way to define a similarity of vectors, and that is

39
00:03:01,170 --> 00:03:02,460
in terms of distance.

40
00:03:03,120 --> 00:03:09,390
Essentially, distance is the opposite of similarity when two vectors are close together in distance.

41
00:03:09,450 --> 00:03:12,880
They are similar when two vectors are far away in distance.

42
00:03:12,990 --> 00:03:14,070
They are dissimilar.

43
00:03:14,790 --> 00:03:19,380
In this case, we can think of similarity as simply the negative of distance.

44
00:03:23,930 --> 00:03:26,900
So how do we define distance and Euclidean space?

45
00:03:27,650 --> 00:03:32,360
Well, one typical way to define distance is to use the Euclidean distance.

46
00:03:32,930 --> 00:03:37,730
Of course, this makes complete sense Euclidean space, Euclidean distance.

47
00:03:38,360 --> 00:03:41,630
So let's review how to compute the Euclidean distance.

48
00:03:42,380 --> 00:03:49,370
So if we have two vectors X and Y each with size D, then what we do is we take the component wise difference

49
00:03:49,670 --> 00:03:51,350
of each corresponding component.

50
00:03:51,740 --> 00:03:56,450
Square those differences, add them together and then take the square root of the results.

51
00:03:57,170 --> 00:04:02,210
Obviously, if you want the squared Euclidean distance, then simply do not take the square root.

52
00:04:03,820 --> 00:04:09,220
And the reason we can use the square of the Euclidean distance is because the square is a monotonically

53
00:04:09,220 --> 00:04:10,300
increasing function.

54
00:04:10,990 --> 00:04:17,079
Thus, if Point A and B have a smaller Euclidean distance compared to you point A and point C, then

55
00:04:17,079 --> 00:04:22,510
it is also true that point and point B have a smaller squared Euclidean distance compared to point A

56
00:04:22,510 --> 00:04:23,260
and point C.

57
00:04:23,950 --> 00:04:28,570
In other words, when the only thing you care about doing is comparing relative distances, then it

58
00:04:28,570 --> 00:04:33,250
doesn't matter whether you use the squared Euclidean distance or just the Euclidean distance.

59
00:04:37,930 --> 00:04:42,040
Now, Euclidean distance is not the only way to define vector similarity.

60
00:04:42,550 --> 00:04:48,520
Another way to define a similarity is by looking at the angle between two vectors, specifically the

61
00:04:48,520 --> 00:04:49,870
cosine of the angle.

62
00:04:50,590 --> 00:04:53,110
So suppose we have two vectors X and Y.

63
00:04:53,800 --> 00:04:57,400
Recall that the inner product can be defined in two different ways.

64
00:04:58,000 --> 00:05:03,190
The first way to define the inner product is that it's the sum of the Element Y's product of the two

65
00:05:03,190 --> 00:05:03,880
vectors.

66
00:05:04,300 --> 00:05:08,470
So that's x one times y one plus x two times y two and so forth.

67
00:05:09,370 --> 00:05:15,010
Now, the second way to define the inner product is this it's the magnitude of x times the magnitude

68
00:05:15,010 --> 00:05:18,760
of Y times the cosine of the angle between X and Y.

69
00:05:19,390 --> 00:05:22,330
Of course, the cosine of the angle is what we want to solve for.

70
00:05:22,660 --> 00:05:26,740
And so if we equate the two forms of the inner product, this is what we get.

71
00:05:27,280 --> 00:05:33,640
It's the sum of the Element Y's product of the two vectors divided by the magnitudes of the two vectors.

72
00:05:38,300 --> 00:05:40,280
So what is the cosine similarity?

73
00:05:41,000 --> 00:05:46,040
It's simply the cosine of the angle between two vectors, but we just looked at how to compute.

74
00:05:46,790 --> 00:05:48,800
So let's think about why this works.

75
00:05:49,580 --> 00:05:52,940
Firstly, it helps to recall what the cosine function looks like.

76
00:05:53,600 --> 00:05:58,250
So it's a periodic wave, but we are only interested in the values from zero to PI.

77
00:05:59,300 --> 00:06:03,080
Now, suppose that we have two vectors which are very close to being parallel.

78
00:06:03,680 --> 00:06:05,780
The angle between them is about zero.

79
00:06:06,500 --> 00:06:11,900
In this case, the cosine of this angle is about one, which is the maximum value for the cosine.

80
00:06:12,410 --> 00:06:15,980
That is to say, this is when the two vectors are most similar.

81
00:06:20,760 --> 00:06:26,880
Now, suppose that we have two vectors which are A. parallel that is facing opposite directions in this

82
00:06:26,880 --> 00:06:31,980
case, the angle is 180 degrees and the cosine of this angle is minus one.

83
00:06:32,880 --> 00:06:35,190
This is the minimum value of the cosine.

84
00:06:35,790 --> 00:06:38,880
And this corresponds to when the two vectors are at least similar.

85
00:06:39,810 --> 00:06:45,420
OK, so now we're convinced that using the cosine of the angle between two vectors works as a measure

86
00:06:45,420 --> 00:06:46,380
of similarity.

87
00:06:51,060 --> 00:06:56,970
Note that sometimes people refer to the cosine distance, which is one minus the cosine similarity.

88
00:06:57,630 --> 00:06:58,770
So how does this work?

89
00:06:59,460 --> 00:07:04,830
Well, again, we can see that if two vectors are parallel, then they are as similar as possible.

90
00:07:05,400 --> 00:07:09,750
In this case, the cosine similarity will be one and one minus one is zero.

91
00:07:09,990 --> 00:07:16,770
So the cosine distance is zero and this is the minimum possible cosine distance when there is zero angle

92
00:07:16,770 --> 00:07:18,210
between the two vectors.

93
00:07:19,410 --> 00:07:25,500
On the other hand, if two vectors are A. parallel, then the cosine similarity will be minus one and

94
00:07:25,500 --> 00:07:30,630
one minus minus one is two, which is the maximum possible cosine distance.

95
00:07:31,230 --> 00:07:36,810
So whether you want to work with cosine similarity or cosine distance, the principle is exactly the

96
00:07:36,810 --> 00:07:37,350
same.

97
00:07:38,250 --> 00:07:42,720
Note that, strictly speaking, the cosine distance is not a true distance.

98
00:07:43,170 --> 00:07:49,200
This is because it doesn't satisfy the axioms of distance metrics, specifically the triangle inequality

99
00:07:49,860 --> 00:07:50,940
at the same time.

100
00:07:50,970 --> 00:07:54,260
This doesn't really prevent us from doing anything we want to do.

101
00:07:59,020 --> 00:08:04,820
So it turns out that cosine similarity is a bit more common than Euclidean distance in NLP.

102
00:08:05,620 --> 00:08:12,220
If we think about vectors form from counting, we can think about why this would be suppose for simplicity

103
00:08:12,220 --> 00:08:16,180
that we have two words we are counting mitochondria and voltage.

104
00:08:16,690 --> 00:08:22,060
So intuitively, we know that mitochondria would appear more often in books about biology and the life

105
00:08:22,060 --> 00:08:27,160
sciences, while voltage would appear more often in books about electronics or physics.

106
00:08:27,910 --> 00:08:34,450
Now, suppose that we have one very long book where mitochondria appears 500 times and we have one very

107
00:08:34,450 --> 00:08:37,929
short book where mitochondria appear as just a few times.

108
00:08:38,679 --> 00:08:43,659
Now, suppose we have an electronics book where voltage appears just a few times.

109
00:08:44,200 --> 00:08:48,070
What happens when we look at the Euclidean distance between these vectors?

110
00:08:48,700 --> 00:08:50,590
Well, something surprising happens.

111
00:08:51,760 --> 00:08:57,670
It turns out that the electronic book is closer to our short biology book than the short biology book

112
00:08:57,670 --> 00:08:59,620
is to the long biology book.

113
00:09:00,310 --> 00:09:05,230
And this doesn't make sense simply by virtue of having fewer words overall.

114
00:09:05,500 --> 00:09:11,200
We can see that two completely unrelated books have a smaller distance compared to two books on the

115
00:09:11,200 --> 00:09:12,070
same topic.

116
00:09:12,730 --> 00:09:16,930
This might suggest that perhaps Euclidean distance is not the best choice.

117
00:09:18,400 --> 00:09:23,050
Based on this example, the cosine distance would be zero between our two biology books.

118
00:09:23,350 --> 00:09:27,160
Well, it would be one for the biology book compared to the electronic book.

119
00:09:27,850 --> 00:09:33,400
Thus, when we have to deal with it, vectors of different sizes cosine distances might be more useful.

120
00:09:38,110 --> 00:09:43,510
So there is one scenario where a cosine distance and Euclidean distance yields equivalent results.

121
00:09:44,200 --> 00:09:49,870
Suppose that we don't care about distance or similarity value, but instead just the relative distance

122
00:09:50,140 --> 00:09:52,210
or similarity between different items.

123
00:09:52,780 --> 00:09:57,430
For example, if you're building a recommendation system or a search engine, you don't care about the

124
00:09:57,430 --> 00:09:58,240
score itself.

125
00:09:58,690 --> 00:10:02,800
You only care about the ranking of the items after you've sorted them by the score.

126
00:10:03,820 --> 00:10:08,920
So let's suppose that you've normalized your feature vectors now that you know their magnitude can have

127
00:10:08,920 --> 00:10:10,180
an effect on the outcome.

128
00:10:10,750 --> 00:10:13,210
That is, the L2 norm of each vector is one.

129
00:10:14,260 --> 00:10:16,060
In this case, we can note the following.

130
00:10:16,780 --> 00:10:20,080
We'll start with the squared Euclidean distance between X and Y.

131
00:10:20,860 --> 00:10:25,390
Recall that this is simply equal to the vector x minus Y dotted with itself.

132
00:10:26,650 --> 00:10:33,220
The next step is to expand this expression, which gives us x transpose times x minus two times x transpose

133
00:10:33,220 --> 00:10:36,100
times y plus y transpose times y.

134
00:10:36,910 --> 00:10:43,180
However, we've just established that X transpose times X and Y transpose times y are both one since

135
00:10:43,180 --> 00:10:44,710
we normalized X and Y.

136
00:10:45,400 --> 00:10:49,270
So this gives us two minus two times X transpose y.

137
00:10:50,830 --> 00:10:57,700
Furthermore, as you recall, it's transpose Y is just the magnitude of X times the magnitude of Y times

138
00:10:57,700 --> 00:11:00,130
the cosine of the angle between X and Y.

139
00:11:00,640 --> 00:11:03,610
But the magnitudes of X and Y again are just one.

140
00:11:04,360 --> 00:11:08,530
Therefore, we end up with two minus two times the cosine similarity.

141
00:11:09,370 --> 00:11:12,730
Or put another way two times the cosine distance.

142
00:11:13,240 --> 00:11:19,300
And of course, multiplying by two has no effect on the ranking of items which use this as a score.

143
00:11:20,350 --> 00:11:26,050
What this means is that if you normalize your vectors and the only task you need to do is rank items,

144
00:11:26,320 --> 00:11:30,040
then it doesn't matter which similarity or distance you use.