1
00:00:11,090 --> 00:00:14,880
So in this video, we are going to discuss support vector machines.

2
00:00:15,470 --> 00:00:20,340
Now, luckily, thanks to the previous two lectures, you now understand the basics of linear models.

3
00:00:20,810 --> 00:00:24,320
This lecture is kind of a different way of putting a linear model together.

4
00:00:25,010 --> 00:00:30,370
So the easiest way to start this discussion of support vector machines is with classification.

5
00:00:31,130 --> 00:00:34,700
As you've heard me say before, this is nothing but a geometry problem.

6
00:00:35,480 --> 00:00:38,270
Now we know how logistic regression finds this line.

7
00:00:38,780 --> 00:00:44,030
It does so by minimizing a lost function, which happens to be the negative log likelihood of a coin

8
00:00:44,030 --> 00:00:45,090
toss distribution.

9
00:00:45,830 --> 00:00:49,100
The details aren't too important, but it's a model based on probability.

10
00:00:49,970 --> 00:00:54,740
In contrast, the support vector machine is a model based completely on geometry.

11
00:00:59,360 --> 00:01:02,480
So how is the support vector machine based on geometry?

12
00:01:03,200 --> 00:01:10,070
Well, the SVM is what we call a maximum margin classifier using this data set as an example, we can

13
00:01:10,070 --> 00:01:14,230
see that there is definitely more than one line that can separate the two classes.

14
00:01:14,690 --> 00:01:17,090
But how do we define what a good line is?

15
00:01:17,630 --> 00:01:22,100
Intuitively, you might say that this line is not a good line because it doesn't generalize.

16
00:01:22,100 --> 00:01:24,740
Well, it's too close to the data points.

17
00:01:25,070 --> 00:01:30,890
If I add a new data point here, it's pretty obvious which class this belongs to, and yet this line

18
00:01:30,890 --> 00:01:33,350
would classify it as the wrong class.

19
00:01:38,010 --> 00:01:44,310
The intuition behind the SVM is that the best line is the line that is as far away from the points as

20
00:01:44,310 --> 00:01:45,010
possible.

21
00:01:45,930 --> 00:01:51,240
You can see that according to this criterion, this line is the best line, because if we rotate it

22
00:01:51,240 --> 00:01:56,000
in any direction, it becomes a worse line going closer to other data points.

23
00:01:57,580 --> 00:02:04,630
The space between the closest data points is called the margin, hence the SVM is a maximum margin classifier.

24
00:02:05,050 --> 00:02:06,990
It tries to maximize this margin.

25
00:02:07,900 --> 00:02:11,470
As mentioned, you can see that this is based completely on geometry.

26
00:02:12,100 --> 00:02:16,660
Furthermore, notice that there are data points which fall on the lines which define the boundaries

27
00:02:16,660 --> 00:02:17,410
of the margin.

28
00:02:18,010 --> 00:02:23,230
These are input vectors which are from our data set because they fall directly on these lines.

29
00:02:23,380 --> 00:02:25,090
We call them support vectors.

30
00:02:25,420 --> 00:02:28,170
Hence why this model is called the support vector machine.

31
00:02:28,750 --> 00:02:33,610
These support vectors are in some sense supporting this line with the corresponding margin.

32
00:02:34,420 --> 00:02:37,940
OK, so hopefully that makes sense why this is called a support vector machine.

33
00:02:38,620 --> 00:02:43,250
So basically the methods of solving this geometry problem are well-established.

34
00:02:43,600 --> 00:02:47,670
So bringing this from concept to code is technically not too difficult.

35
00:02:48,040 --> 00:02:53,290
If you're a person who enjoys working on geometrical problems, I'd encourage you to think about how

36
00:02:53,290 --> 00:02:54,340
this might be solved.

37
00:02:59,070 --> 00:03:04,320
Now, the picture we just discussed is pretty nice, but to ideal, we know that in the real world,

38
00:03:04,350 --> 00:03:06,300
data is not completely separable.

39
00:03:06,780 --> 00:03:12,480
That's why when you train a model, you get 95 percent accuracy and not 100 percent accuracy.

40
00:03:13,020 --> 00:03:18,690
In fact, the methods used to solve the geometrical problem will fail in the case where your data points

41
00:03:18,810 --> 00:03:21,750
cannot be perfectly separated by a line or a plane.

42
00:03:26,420 --> 00:03:33,020
So the solution to this problem is to introduce variables like variables are basically a way for the

43
00:03:33,020 --> 00:03:34,700
optimization to cheat.

44
00:03:35,180 --> 00:03:41,120
Instead of trying to only maximize the margin, we allow a few data points to not be classified correctly.

45
00:03:41,750 --> 00:03:45,000
When this happens, it simply increases our objective function.

46
00:03:45,740 --> 00:03:50,470
So in effect, our optimization ends up being a balance between two terms.

47
00:03:50,840 --> 00:03:53,330
On one hand, we'd like to maximize the margin.

48
00:03:53,630 --> 00:03:57,500
On the other hand, we'd like to have the minimum possible amount of slack.

49
00:03:58,010 --> 00:04:02,840
Of course, this balance can be controlled by the user and it would depend on the problem at hand.

50
00:04:07,600 --> 00:04:12,370
OK, so those are the basics for the support vector machine when we want to do classification.

51
00:04:13,420 --> 00:04:15,430
Remember that this was only intuition.

52
00:04:15,850 --> 00:04:20,370
There are still details we haven't discussed, like how one would handle multiple classes.

53
00:04:20,770 --> 00:04:25,620
You're encouraged to check out extra reading text if you'd like to go into further detail.

54
00:04:26,380 --> 00:04:29,800
The next topic for this video is how do EVMs do a regression?

55
00:04:30,400 --> 00:04:35,230
That is, instead of predicting a category, how do we get the model to predict a number?

56
00:04:39,750 --> 00:04:45,090
So the regression case is kind of interesting because it's not at all clear how this is related to the

57
00:04:45,090 --> 00:04:47,550
model we just built for classification.

58
00:04:48,330 --> 00:04:51,740
There's no real concept of a margin that we want to maximize.

59
00:04:52,260 --> 00:04:57,000
Unfortunately, the link is not really easy to understand unless you've seen the math.

60
00:04:57,010 --> 00:05:00,720
So consider learning this topic in depth if you're curious.

61
00:05:01,680 --> 00:05:07,500
Now, the bright side to this is that understanding the intuition behind SVM for regression is just

62
00:05:07,500 --> 00:05:10,140
as easy as it was before using a picture.

63
00:05:10,800 --> 00:05:12,450
So what is this picture saying?

64
00:05:13,380 --> 00:05:18,180
Well, on the left, we have a picture of our line of best fit, and on the right we have a picture

65
00:05:18,180 --> 00:05:19,730
of the corresponding loss function.

66
00:05:20,370 --> 00:05:23,710
This loss function is called the Epsilon insensitive loss.

67
00:05:24,300 --> 00:05:30,930
The basic idea is there any point within Epsilon distance of the line does not contribute to the loss.

68
00:05:31,140 --> 00:05:32,810
In other words, its losses zero.

69
00:05:33,270 --> 00:05:39,090
In other words, if the line is close enough to all the points that is considered acceptable only when

70
00:05:39,090 --> 00:05:43,020
points go outside the Epsilon boundary do they have non-zero slack.

71
00:05:43,440 --> 00:05:49,920
In this case, the loss would increase above zero and therefore the optimization problem is to minimize

72
00:05:49,920 --> 00:05:55,920
the slack and make it so that as many points as possible are within epsilon distance to the line.

73
00:05:56,790 --> 00:06:01,860
Contrast this with the squared loss, which is positive for every point, which isn't precisely on the

74
00:06:01,860 --> 00:06:02,340
line.

75
00:06:07,100 --> 00:06:12,650
Now, one question worth asking at this point is why should we do all this work just to get yet another

76
00:06:12,650 --> 00:06:15,300
linear model for regression and classification?

77
00:06:16,160 --> 00:06:19,450
In fact, this is not the true power behind this film.

78
00:06:20,210 --> 00:06:25,550
What makes this film really powerful is the fact that it employs something called the kernel trick in

79
00:06:25,550 --> 00:06:27,490
order to find non-linear models.

80
00:06:27,980 --> 00:06:31,520
And again, in order to understand this, you really have to understand the math.

81
00:06:31,670 --> 00:06:34,040
So just accept this as fact until you do.

82
00:06:35,120 --> 00:06:39,710
The simplest way to understand the SVM is by understanding feature engineering.

83
00:06:40,370 --> 00:06:45,980
As you may have learned in the past, one simple way to create nonlinear models out of linear ones is

84
00:06:45,980 --> 00:06:51,030
to simply create non-linear features, for example, X squared, X cubed and so forth.

85
00:06:51,650 --> 00:06:57,440
In general, we can think of a generic feature transformation called FI, and by applying this before

86
00:06:57,440 --> 00:07:01,150
we use our linear model, we effectively get a nonlinear model.

87
00:07:02,030 --> 00:07:06,800
The most popular kind of feature engineering for the SVM is called the RBF Kernel.

88
00:07:11,580 --> 00:07:13,260
So what is the RBF, Colonel?

89
00:07:13,920 --> 00:07:19,520
Well, again, it helps to look at a picture, as you recall, our data comes in the form of vectors.

90
00:07:19,830 --> 00:07:21,810
So imagine some vector space.

91
00:07:22,380 --> 00:07:26,870
Now, imagine that we have some feature vector X and we would like to feature it.

92
00:07:27,810 --> 00:07:32,600
As you recall, what is really important in SVM theory is the support vector.

93
00:07:33,060 --> 00:07:35,760
Sometimes we call these landmarks or exemplars.

94
00:07:36,210 --> 00:07:38,430
Basically, they are the points that really matter.

95
00:07:39,240 --> 00:07:44,470
And so what we effectively want to know is how close is X to each of these landmarks?

96
00:07:44,970 --> 00:07:50,080
Suppose that we have two landmarks, or in other words, to support vectors called L1 and L2.

97
00:07:50,730 --> 00:07:55,830
We want some measure of similarity to tell us how close X is to L1 and L2.

98
00:07:56,370 --> 00:08:02,850
Since X is close to our one, we want that to yield a larger number like zero point nine since X is

99
00:08:02,850 --> 00:08:03,880
far from L2.

100
00:08:04,080 --> 00:08:07,250
We want that to yield a smaller number like zero point zero one.

101
00:08:08,010 --> 00:08:12,630
And note that this measure of similarity always gives us the number between zero and one.

102
00:08:17,360 --> 00:08:21,860
So the army of colonel or the similarity function is essentially based on the Gaussian.

103
00:08:22,400 --> 00:08:26,540
It's very easy to see how this works by plugging in the extreme cases.

104
00:08:27,230 --> 00:08:33,190
If X is equal to the landmark and the distance between them is zero, the exponent of zero is one.

105
00:08:33,410 --> 00:08:35,210
And so the similarities maximum.

106
00:08:35,930 --> 00:08:40,310
If X is infinitely far away from the landmark, then the distance between them is infinity.

107
00:08:40,910 --> 00:08:43,190
The exponent of minus infinity is zero.

108
00:08:43,340 --> 00:08:45,160
And so the similarity is minimum.

109
00:08:45,980 --> 00:08:50,980
Not that we can control how sensitive this function is by using this hyper parameter called beta.

110
00:08:51,650 --> 00:08:55,940
Based on this value, you can control how skinny or fat the bell curve is.

111
00:09:00,570 --> 00:09:03,780
Now, you might be wondering, why is the army of colonels so special?

112
00:09:04,470 --> 00:09:10,470
Well, recall that earlier I said something about applying some feature transformation phi to your input

113
00:09:10,620 --> 00:09:12,260
to obtain a non-linear model.

114
00:09:12,750 --> 00:09:14,370
But FI is not the kernel.

115
00:09:14,880 --> 00:09:20,820
The real power behind the SVM is that it allows you to use Col's instead of computing PHY.

116
00:09:21,300 --> 00:09:26,740
This is helpful, especially when FIA's a very complicated function, perhaps even impossible to compute.

117
00:09:27,450 --> 00:09:32,790
And it turns out that after doing a lot of math, you can prove that the RB of COL corresponds to an

118
00:09:32,790 --> 00:09:34,730
infinite dimensional feature expansion.

119
00:09:35,490 --> 00:09:38,680
In other words, it certainly is impossible to compute directly.

120
00:09:39,540 --> 00:09:45,360
So using the kernel is a trick that allows you to compute something that would be otherwise in computable

121
00:09:45,600 --> 00:09:50,400
simply by calculating the distance between your input and the support vectors.

122
00:09:50,880 --> 00:09:54,930
Again, it's hard to appreciate this without going through the math, but I hope that this gives you

123
00:09:54,930 --> 00:09:57,020
at least some idea of what's going on.