1
00:00:11,050 --> 00:00:14,830
So in this lecture, we are going to answer the question, what is a vector?

2
00:00:15,490 --> 00:00:20,050
Now the reason we need to answer this question is because it's going to be very important throughout

3
00:00:20,050 --> 00:00:20,770
the course.

4
00:00:21,250 --> 00:00:25,840
Vectors are what we are going to use as our numerical representation of text.

5
00:00:26,350 --> 00:00:31,390
They are also the foundation for machine learning, which can be seen as a purely geometrical problem.

6
00:00:32,020 --> 00:00:33,280
We'll speak more about that later.

7
00:00:33,280 --> 00:00:39,190
But for now, just know that vector representations are kind of like the raw materials of machine learning

8
00:00:39,190 --> 00:00:40,390
and data analysis.

9
00:00:45,130 --> 00:00:48,040
OK, so now let's answer the question, what is a vector?

10
00:00:48,760 --> 00:00:53,680
We'll start by answering this question generically, that is, not in a sense that is related to machine

11
00:00:53,680 --> 00:00:54,670
learning or NLP.

12
00:00:55,630 --> 00:00:57,640
So there are two ways to think of vectors.

13
00:00:58,180 --> 00:01:02,860
The first way, which is probably what you were introduced to first in your high school math or physics course,

14
00:01:02,860 --> 00:01:07,420
is that a vector is a quantity that has both magnitude and direction.

15
00:01:07,960 --> 00:01:12,070
This is opposed to a scalar, which has a magnitude and perhaps also a sign,

16
00:01:12,310 --> 00:01:13,900
but no arbitrary direction.

17
00:01:14,830 --> 00:01:19,390
As an example, velocity is a vector because it has both magnitude and direction.

18
00:01:19,960 --> 00:01:26,740
I can say my velocity is 10 meters per second in the northeast direction. At the same time, speed is a

19
00:01:26,740 --> 00:01:27,340
scalar.

20
00:01:27,790 --> 00:01:32,650
I can say that my speed is 10 meters per second, but there is no associated direction.

21
00:01:37,190 --> 00:01:42,470
Now, the second way to think of vectors is arguably more useful, especially in the context of machine

22
00:01:42,470 --> 00:01:48,740
learning, where we work in very high dimensions. In high dimensions, direction is not intuitive, and

23
00:01:48,740 --> 00:01:50,660
the number of angles you need to specify

24
00:01:50,660 --> 00:01:56,660
the direction will grow with the number of dimensions. Thus the second way, which is to represent

25
00:01:56,660 --> 00:02:00,230
a vector as an array of scalars, is typically more useful.

26
00:02:01,010 --> 00:02:07,460
So, for example, the vector or array (3, 4) on the x-y plane is an equivalent way of representing

27
00:02:07,460 --> 00:02:13,280
the vector which has magnitude 5 and direction 0.93 radians from the x-axis.

28
00:02:14,030 --> 00:02:19,400
Now, you may recognize that one of these representations uses a Cartesian coordinate system, while

29
00:02:19,400 --> 00:02:21,560
the other uses a polar coordinate system.

30
00:02:22,730 --> 00:02:27,410
In fact, if you think about it, both of these forms are still arrays of scalar components.

31
00:02:27,800 --> 00:02:31,250
It is simply that the components mean something different in each case.

32
00:02:31,910 --> 00:02:37,820
But notice that with either representation, a two dimensional vector still requires two components.

33
00:02:38,660 --> 00:02:42,290
Now, in practice, our vectors could have hundreds or thousands of components.

34
00:02:42,650 --> 00:02:46,580
So it's more useful to think of them as simply points in a Euclidean space.

35
00:02:47,120 --> 00:02:52,730
That is, each component corresponds to an axis, and each axis is orthogonal to all the other axes.

36
00:02:53,090 --> 00:02:55,970
This is better than thinking of the components as angles.
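
As a quick aside, here is a small sketch in Python of the (3, 4) example from a moment ago, showing that the Cartesian form and the polar form describe the same vector. The function names to_polar and to_cartesian are made up for illustration and are not part of the course code.

```python
import math

def to_polar(x, y):
    """Cartesian components -> (magnitude, angle in radians from the x-axis)."""
    return math.hypot(x, y), math.atan2(y, x)

def to_cartesian(magnitude, angle):
    """Polar components -> (x, y)."""
    return magnitude * math.cos(angle), magnitude * math.sin(angle)

magnitude, angle = to_polar(3.0, 4.0)
print(magnitude, round(angle, 2))  # 5.0 0.93

# Going back the other way recovers the original components (up to rounding).
x, y = to_cartesian(magnitude, angle)
print(round(x, 6), round(y, 6))  # 3.0 4.0
```

Either way, two numbers are needed to pin down a two-dimensional vector. Only the meaning of those two numbers changes.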

37
00:03:00,530 --> 00:03:05,300
OK, so now that we know what vectors are, let's think about why they would be useful, especially

38
00:03:05,300 --> 00:03:06,800
in the context of NLP.

39
00:03:07,790 --> 00:03:13,760
In order to do this, it helps to consider what we would do if we did not convert our text into vectors.

40
00:03:14,540 --> 00:03:18,140
Suppose that we only worked with text or sequences of words.

41
00:03:18,740 --> 00:03:24,590
Suppose I said, write a computer program that will take as input the text of an email and output

42
00:03:24,590 --> 00:03:26,330
whether or not that email is spam.

43
00:03:27,110 --> 00:03:32,420
Consider how you would solve this problem without having to resort to converting the text into a numerical

44
00:03:32,420 --> 00:03:33,380
representation.

45
00:03:34,220 --> 00:03:38,510
In fact, I recommend actually giving some deep thought into how you would write this function.

46
00:03:39,350 --> 00:03:43,790
Once you start to think about this, you should realize that the application of machine learning is

47
00:03:43,790 --> 00:03:45,470
not as trivial as you might assume.

48
00:03:46,310 --> 00:03:51,290
Suppose that this is the 1990s, and machine learning is still not a very popular subject.

49
00:03:52,100 --> 00:03:57,320
Suppose you're a software engineer at AOL, and you just graduated from college with a computer science

50
00:03:57,320 --> 00:03:57,890
degree.

51
00:03:58,430 --> 00:04:03,110
So you know how to write a computer program, but you know absolutely nothing about machine learning.

52
00:04:03,770 --> 00:04:05,120
What would your approach be?

53
00:04:05,840 --> 00:04:07,460
Well, here are some things you could try.

54
00:04:07,910 --> 00:04:10,010
You could try looking for certain keywords.

55
00:04:10,280 --> 00:04:16,730
For example, if an email contains phrases like Nigerian prince or insurance or credit or make money,

56
00:04:17,060 --> 00:04:19,790
you might want to consider these to be spam emails.

57
00:04:20,899 --> 00:04:23,810
In fact, you could set up email filters right now that do this.

58
00:04:24,020 --> 00:04:27,080
For example, in Gmail or a client such as Thunderbird.

59
00:04:27,920 --> 00:04:30,260
Now, of course, your program doesn't end there.

60
00:04:30,650 --> 00:04:32,300
What we have now is imperfect.

61
00:04:32,930 --> 00:04:37,850
What if, for example, you're interested in data science courses that teach you how to build models

62
00:04:38,000 --> 00:04:39,620
to compute credit scores?

63
00:04:40,160 --> 00:04:45,290
In this case, you would not want to mark the email as spam simply because it contained the word credit.

64
00:04:45,980 --> 00:04:52,460
Or perhaps you're taking a course in NLP and the instructor uses the Nigerian prince example in an announcement

65
00:04:52,460 --> 00:04:53,270
about the course,

66
00:04:53,630 --> 00:04:55,610
since the course covers spam detection.

67
00:04:56,270 --> 00:05:01,670
In this case, you would not want to mark the email as spam simply because it contained the phrase Nigerian

68
00:05:01,670 --> 00:05:02,270
prince.

69
00:05:02,930 --> 00:05:05,390
So you can see that things are not so clear cut.

70
00:05:07,010 --> 00:05:11,450
Consider what criteria you would use if you want to consider an email to be safe.

71
00:05:12,110 --> 00:05:17,570
Perhaps if the email contains the terms Udemy or Lazy Programmer, it should be considered safe.

72
00:05:18,260 --> 00:05:23,810
This is because you signed up to receive emails from these entities and thus you actually want to receive

73
00:05:23,810 --> 00:05:25,850
them. As a side note,

74
00:05:25,880 --> 00:05:31,580
recall that we are trying to detect spam based only on the text itself, so we are assuming for now

75
00:05:31,580 --> 00:05:33,080
that we cannot check the sender.

76
00:05:33,860 --> 00:05:38,180
Of course, Udemy is a big company and thus likely to be targeted by spammers.

77
00:05:38,570 --> 00:05:44,600
And so if a spammer decides to write fake emails pretending to be from Udemy, then your rule is broken.

78
00:05:45,950 --> 00:05:49,280
Now what if the previous two rules we just discussed conflict?

79
00:05:49,730 --> 00:05:52,040
How would you write code to deal with this scenario?

80
00:05:52,400 --> 00:05:56,270
Perhaps rule number one should always trump rule number two, or vice versa.

81
00:05:56,810 --> 00:05:59,660
But again, this is clearly not easy to write code for.

82
00:05:59,930 --> 00:06:04,580
And as you add more and more of these rules, you can see how things can become quite complex.

83
00:06:05,300 --> 00:06:09,170
So hopefully this makes you realize the need for numerical representations.

84
00:06:09,350 --> 00:06:11,990
And beyond that, machine learning techniques.
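
To make that concrete, here is a minimal sketch in Python of what the rule-based approach might look like. The phrase lists come from the examples above, but the function name is_spam and the safe_rule_wins flag are hypothetical, just there to show where the conflict has to be resolved by hand.

```python
SPAM_PHRASES = ["nigerian prince", "insurance", "credit", "make money"]
SAFE_PHRASES = ["udemy", "lazy programmer"]

def is_spam(text, safe_rule_wins=True):
    """Keyword rules only: no numerical representation, no machine learning."""
    lowered = text.lower()
    looks_spam = any(phrase in lowered for phrase in SPAM_PHRASES)
    looks_safe = any(phrase in lowered for phrase in SAFE_PHRASES)
    if looks_spam and looks_safe:
        # The two rules conflict, and there is no principled answer.
        # We have to pick a winner arbitrarily.
        return not safe_rule_wins
    return looks_spam

email = "New Udemy course: build models to compute credit scores"
print(is_spam(email))                        # False
print(is_spam(email, safe_rule_wins=False))  # True
```

The same email gets opposite answers depending on an arbitrary choice, and every new rule adds more of these cases.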

85
00:06:16,520 --> 00:06:22,040
OK, so at this point, you now understand that working with raw text is difficult and working with

86
00:06:22,040 --> 00:06:25,280
numerical representations is potentially simpler.

87
00:06:26,240 --> 00:06:30,560
We also now understand that these numerical representations will be vectors.

88
00:06:31,250 --> 00:06:35,180
So the next question to ask is in what way will vectors be useful?

89
00:06:35,990 --> 00:06:40,640
Now this is more just a preview for the rest of the course, since you'll be exposed to more examples

90
00:06:40,640 --> 00:06:42,350
in more detail as we go along.

91
00:06:42,650 --> 00:06:48,650
But I want to give you at least some idea of the utility of vectors, so let's return to this idea of

92
00:06:48,650 --> 00:06:49,760
spam detection.

93
00:06:50,510 --> 00:06:56,990
Suppose that there is some way to map each email to a vector such that all the emails which are spam

94
00:06:57,230 --> 00:06:59,090
fall into one cloud of points,

95
00:06:59,450 --> 00:07:03,470
while all the emails which are not spam fall into a different cloud of points.

96
00:07:04,550 --> 00:07:08,870
At this point, I won't say exactly how we would do this, but suppose that we could.

97
00:07:09,560 --> 00:07:15,260
If we could do this, then it would be very easy to separate spam from non-spam emails.

98
00:07:15,860 --> 00:07:20,390
As you can see, we can simply draw a line between the two clouds of points.

99
00:07:22,080 --> 00:07:25,950
Suppose that we now have a new email whose spam status is unknown.

100
00:07:26,700 --> 00:07:29,850
We can ask the question, is this email spam or not?

101
00:07:30,480 --> 00:07:36,120
And intuitively, you know that this boils down to a much simpler question: which side of the line is

102
00:07:36,120 --> 00:07:42,840
the point on? That is, if we map this new email to a vector, we can determine which side of the line

103
00:07:42,840 --> 00:07:43,530
it falls on.

104
00:07:43,560 --> 00:07:46,020
And if it's on the spam side, we say it's spam.

105
00:07:46,410 --> 00:07:48,120
Otherwise, we say it's not spam.

106
00:07:49,710 --> 00:07:54,600
Now, let's think back to our previous example, where I asked you to think about how to detect spam

107
00:07:54,690 --> 00:07:57,840
based on raw text.

108
00:07:57,870 --> 00:08:01,860
As you saw, this could involve many rules, which translate into many if statements.

109
00:08:02,160 --> 00:08:06,990
And ultimately, it ends up being very complicated, since it's not clear what to do when the rules

110
00:08:06,990 --> 00:08:07,710
conflict.

111
00:08:08,400 --> 00:08:13,140
On the other hand, if you know your high school geometry, then figuring out which side of the line

112
00:08:13,140 --> 00:08:15,810
the point falls on should be relatively easy.
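
As a rough sketch of that geometry in Python: a line in two dimensions can be written as w[0]*x + w[1]*y + b = 0, and the sign of the left-hand side tells you which side a point is on. The numbers below and the choice of which side counts as spam are made up for illustration. In practice, the line would be learned from data.

```python
def side_of_line(point, w, b):
    """Classify a 2D point by the sign of w[0]*x + w[1]*y + b."""
    score = w[0] * point[0] + w[1] * point[1] + b
    # Assumption: the positive side of the line is the spam side.
    return "spam" if score > 0 else "not spam"

# Hypothetical line x + y - 5 = 0.
w, b = (1.0, 1.0), -5.0

print(side_of_line((4.0, 4.0), w, b))  # spam
print(side_of_line((1.0, 2.0), w, b))  # not spam
```

Compare this with the pile of if statements from before: one multiplication per component, one addition, and one comparison.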

113
00:08:16,740 --> 00:08:22,410
Now it does remain to be seen whether or not we can actually convert text into vectors in such a manner.

114
00:08:22,620 --> 00:08:25,080
So for now, you just have to trust that we can.

115
00:08:29,780 --> 00:08:31,850
Here's another thing we can do with vectors.

116
00:08:32,510 --> 00:08:37,730
Suppose that instead of doing spam detection, we simply have a large collection of documents that we

117
00:08:37,730 --> 00:08:38,870
want to organize.

118
00:08:39,350 --> 00:08:43,880
But this collection of documents is way too large to read them one by one.

119
00:08:44,630 --> 00:08:48,710
So we would like to have some automated way to categorize these documents.

120
00:08:49,580 --> 00:08:54,560
You can imagine that this would be useful for businesses who have to deal with large quantities of documents.

121
00:08:55,720 --> 00:08:59,320
So you might recognize this problem as the problem we call clustering.

122
00:09:00,130 --> 00:09:06,550
Well, suppose that we map our documents to vectors and then we plot each vector on a grid, like so. Again,

123
00:09:06,550 --> 00:09:12,250
we find that certain documents tend to fall into the same clusters as each other while being far away

124
00:09:12,250 --> 00:09:13,780
from other such clusters.

125
00:09:14,290 --> 00:09:20,170
In other words, again, it is the case that turning text into vectors has made our problem very easy.

126
00:09:21,340 --> 00:09:25,600
Now, at this point, we're not going to discuss the specific algorithms we use for clustering.

127
00:09:25,750 --> 00:09:29,410
But rest assured, many effective clustering techniques exist.

128
00:09:33,920 --> 00:09:38,900
But again, we should maintain caution because it still remains to be seen whether or not we can actually

129
00:09:38,900 --> 00:09:44,180
convert text into such nice looking vectors that have the properties I've just described.

130
00:09:44,870 --> 00:09:49,700
Of course, in reality, things are not so nice, but in practice you'll see that the methods we will

131
00:09:49,700 --> 00:09:51,920
use tend to work pretty well anyway.

132
00:09:52,790 --> 00:09:58,280
What we do not want is something like this, where all the spam and non-spam emails simply overlap.

133
00:09:58,910 --> 00:10:04,640
In this case, we can see that there is no way to draw a line that can easily separate the spam and

134
00:10:04,640 --> 00:10:05,870
non-spam emails.

135
00:10:06,560 --> 00:10:13,160
So one might say our objective in converting text into vectors will be to do so intelligently such that

136
00:10:13,160 --> 00:10:14,750
we can avoid this situation.

137
00:10:15,770 --> 00:10:21,230
In other words, we don't just want any set of vectors, but we want a useful mapping from text to vector,

138
00:10:21,500 --> 00:10:26,150
which will make things easier for any subsequent machine learning technique that we want to apply.

139
00:10:27,110 --> 00:10:32,210
So in the coming lectures, we will be exploring methods to convert text into vectors, and you'll see

140
00:10:32,210 --> 00:10:36,020
that we can apply various strategies to give us useful vectors.

