1
00:00:11,550 --> 00:00:16,590
In this lecture, we are going to discuss a tiny bit of theory related to linear classification.

2
00:00:17,310 --> 00:00:19,740
Previously, we discussed linear regression.

3
00:00:20,280 --> 00:00:25,440
As you recall, this is the task where our goal is to find a line or curve of best fit.

4
00:00:26,100 --> 00:00:32,210
You want your model to output predictions which are close to the data points.

5
00:00:32,220 --> 00:00:38,850
In classification, this is not our task; we do not want the line to be close to the data points.

6
00:00:39,420 --> 00:00:45,270
Rather, in classification, our goal is to separate data points into distinct classes.

7
00:00:45,810 --> 00:00:50,250
Although we can consider the case where we have more than two classes, we'll start with two since that

8
00:00:50,250 --> 00:00:51,420
is the most intuitive.

9
00:00:53,270 --> 00:00:58,670
Because this is linear classification, analogous to linear regression, our model is still a line.

10
00:00:59,270 --> 00:01:03,470
It's just that the job that we want this line to do is different from regression.

11
00:01:04,220 --> 00:01:08,210
Unlike regression, we don't care if this line is close to the data points.

12
00:01:08,780 --> 00:01:13,070
Instead, we want the line to separate data points of different classes.

13
00:01:13,670 --> 00:01:19,100
In other words, if one class consists of the red dots and one class consists of the blue dots, then

14
00:01:19,100 --> 00:01:21,890
we want all the red dots to be on one side of the line.

15
00:01:22,160 --> 00:01:24,980
And we want all the blue dots to be on the other side of the line.

16
00:01:26,090 --> 00:01:30,650
In this way, we can say that the line discriminates between the two classes.

17
00:01:31,370 --> 00:01:36,650
As you can see, this is not related at all to having the line be close to the data points.

18
00:01:37,610 --> 00:01:41,480
Also, I hope I haven't lost you in terms of what the colors of the dots mean.

19
00:01:41,960 --> 00:01:42,830
Remember the rule?

20
00:01:42,830 --> 00:01:44,180
All data is the same.

21
00:01:44,600 --> 00:01:50,090
So for one particular data set, the color might represent whether a credit card transaction is fraudulent

22
00:01:50,090 --> 00:01:50,510
or not.

23
00:01:51,170 --> 00:01:55,250
For another data set, the color might represent whether an email is spam or not.

24
00:01:55,730 --> 00:01:59,930
For another data set, the color might represent whether a student will graduate or not.

25
00:02:00,500 --> 00:02:04,550
And as you can see in this example, it represents disease versus healthy.

26
00:02:09,620 --> 00:02:13,400
As with the previous lectures, we are going to start from a very basic perspective.

27
00:02:13,820 --> 00:02:15,440
How would this work in scikit-learn?

28
00:02:16,280 --> 00:02:20,990
We're going to learn what each of these steps means in terms of concepts, and we'll also learn how each

29
00:02:20,990 --> 00:02:22,820
of these steps is done in PyTorch.

30
00:02:24,430 --> 00:02:30,070
Remember that in PyTorch there is no model constructor or fit function or predict function, so we

31
00:02:30,070 --> 00:02:31,600
have to do all that ourselves.

32
00:02:32,500 --> 00:02:33,280
To recap.

33
00:02:33,310 --> 00:02:37,600
Here are the steps we do in a typical machine learning script with scikit-learn.

34
00:02:39,580 --> 00:02:42,850
First, we are going to load in some data; we call that X and Y.

35
00:02:43,690 --> 00:02:45,850
Next, we're going to instantiate a model.

36
00:02:46,820 --> 00:02:50,750
Next, we are going to train or fit the model using model.fit(X, Y).

37
00:02:52,710 --> 00:02:56,910
Next, we are going to make predictions with the model using model.predict(X).

38
00:02:57,300 --> 00:03:01,380
This can be the same X or a different X, for example, X_train or X_test.

39
00:03:02,250 --> 00:03:05,820
Finally, we can evaluate our model using model.score(X, Y).

40
00:03:07,240 --> 00:03:11,110
For classification, the score returns the classification accuracy.
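
The scikit-learn steps listed above can be sketched as follows. This is only an illustration, not the lecture's own script: the choice of `LogisticRegression` and the tiny hand-made arrays are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features per sample, binary labels.
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5],
              [1.0, 2.0], [2.0, 1.0], [1.5, 1.5]])
Y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()     # instantiate a model
model.fit(X, Y)                  # train (fit) the model
predictions = model.predict(X)   # make predictions
accuracy = model.score(X, Y)     # classification accuracy
print(predictions, accuracy)
```

Swapping in a regression class would leave the fit, predict, and score calls unchanged, which is the point the lecture makes next.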

41
00:03:11,710 --> 00:03:17,320
And just to reiterate, this lecture is all about answering the question what actually happens inside

42
00:03:17,320 --> 00:03:18,250
these functions?

43
00:03:18,850 --> 00:03:23,500
This question is really important to answer because you'll notice that if you compare this to linear

44
00:03:23,500 --> 00:03:27,730
regression, literally nothing has changed except for the name of the class.

45
00:03:28,300 --> 00:03:32,020
So really what happens inside these functions makes all the difference.

46
00:03:32,410 --> 00:03:36,130
It's what really matters if you want to know how this stuff actually works.

47
00:03:41,230 --> 00:03:46,480
As a side note, if you're a beginner and you don't yet know what classification accuracy means, here's

48
00:03:46,480 --> 00:03:47,440
a quick definition.

49
00:03:48,070 --> 00:03:50,720
As you know, our model is a binary classifier.

50
00:03:50,740 --> 00:03:53,050
It's predicting spam or not spam, for example.

51
00:03:54,420 --> 00:03:57,150
Therefore its predictions can only be right or wrong.

52
00:03:57,210 --> 00:03:58,800
There are no other possibilities.

53
00:03:58,950 --> 00:04:00,360
Either it's right or it's wrong.

54
00:04:01,140 --> 00:04:05,340
The classification accuracy, then, is simply the number of predictions I get right

55
00:04:05,760 --> 00:04:07,950
divided by the total number of predictions.

56
00:04:08,670 --> 00:04:12,180
The classification error is the number of predictions I get wrong

57
00:04:12,390 --> 00:04:14,430
divided by the total number of predictions.

58
00:04:15,060 --> 00:04:21,450
It should be easy to verify that the classification error is equal to one minus the classification accuracy.
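
These two definitions can be checked with a few lines of plain Python. The function names and the sample lists here are made up for illustration.

```python
def classification_accuracy(targets, predictions):
    """Number of correct predictions divided by the total number."""
    correct = sum(1 for t, p in zip(targets, predictions) if t == p)
    return correct / len(targets)


def classification_error(targets, predictions):
    """Number of wrong predictions divided by the total number."""
    wrong = sum(1 for t, p in zip(targets, predictions) if t != p)
    return wrong / len(targets)


# Hypothetical example: 3 of the 5 predictions match the targets.
targets = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]
accuracy = classification_accuracy(targets, predictions)
error = classification_error(targets, predictions)
print(accuracy, error)
```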

59
00:04:23,530 --> 00:04:29,860
Also note that you can calculate the accuracy and error on both the train and test sets, so you can speak

60
00:04:29,860 --> 00:04:35,830
of the train accuracy and the test accuracy which would be evaluated on the train set and the test set

61
00:04:35,830 --> 00:04:36,580
respectively.

62
00:04:41,690 --> 00:04:45,080
So the first concept we want to consider is what is the model?

63
00:04:45,620 --> 00:04:48,830
More accurately, we want to ask, what is the model architecture?

64
00:04:49,580 --> 00:04:51,580
As usual, it helps to look at a picture.

65
00:04:52,340 --> 00:04:55,610
So here is a typical picture of a linear classification problem.

66
00:04:56,210 --> 00:05:02,030
We have some data points and we want to separate the data points of different colors with a line.

67
00:05:02,030 --> 00:05:02,390
As an aside,

68
00:05:02,390 --> 00:05:05,360
why do we call this linear classification?

69
00:05:06,800 --> 00:05:10,520
Well, you will notice that the word line actually appears in the word linear.

70
00:05:11,060 --> 00:05:12,350
This surprises many people.

71
00:05:12,350 --> 00:05:14,450
So don't worry if you haven't noticed this before.

72
00:05:15,200 --> 00:05:15,590
All right.

73
00:05:15,590 --> 00:05:17,120
So what's the equation for a line?

74
00:05:17,720 --> 00:05:24,680
I claim that a line can be expressed using the equation w1x1 plus w2x2 plus b equals zero.

75
00:05:29,730 --> 00:05:34,470
Now at this point, you might say to yourself, gosh, why does Lazy Programmer have to make

76
00:05:34,470 --> 00:05:35,730
everything so complicated?

77
00:05:36,150 --> 00:05:41,460
I already know that the equation for a line is y equals mx plus b, so why do we have to use this

78
00:05:41,460 --> 00:05:45,960
more complicated looking equation with w's and x1 and x2 and so on?

79
00:05:48,130 --> 00:05:49,300
In fact, we must.

80
00:05:49,840 --> 00:05:52,810
You'll see why this form of the line is useful very shortly.

81
00:05:53,470 --> 00:05:56,680
If this is new to you, I would recommend the following exercise.

82
00:05:57,340 --> 00:06:02,980
First, notice that all we've done is rename the axes from x and y to x1 and x2.

83
00:06:03,580 --> 00:06:07,030
This is important because these axes refer to the input data.

84
00:06:07,870 --> 00:06:10,030
If you recall, y is the target.

85
00:06:10,720 --> 00:06:16,660
For example, if we're going to classify images of cats versus images of dogs, then we might say dogs

86
00:06:16,660 --> 00:06:21,520
are the red dots and cats are the blue dots, but they are not an axis on this graph.

87
00:06:22,150 --> 00:06:27,040
So x one represents the horizontal axis and x two represents the vertical axis.

88
00:06:27,370 --> 00:06:29,060
Y is not an axis.

89
00:06:29,440 --> 00:06:34,030
This is different from linear regression where the target y was an axis on the graph.

90
00:06:39,280 --> 00:06:46,240
Now that you know this, you should be able to rearrange the equation w1x1 plus w2x2 plus b equals zero

91
00:06:46,600 --> 00:06:48,640
into slope-intercept form.

92
00:06:49,420 --> 00:06:55,390
What I mean by that is to have x2 on the left side by itself, and then you'll have some slope times

93
00:06:55,390 --> 00:06:57,190
x1 plus some intercept.

94
00:06:57,760 --> 00:07:03,790
Now it has the same format as y equals mx plus b, except that we use x1 and x2 instead of x

95
00:07:03,790 --> 00:07:04,270
and y.

96
00:07:05,290 --> 00:07:11,650
Also note that the b that appears in y equals mx plus b is not the same as the b that appears in the

97
00:07:11,650 --> 00:07:12,460
other equation.

98
00:07:13,060 --> 00:07:18,460
So in this slide, I've called the intercept b prime to differentiate it from the original b.

99
00:07:20,610 --> 00:07:26,100
As always, if you can't immediately see how I got this, you should try it by yourself on paper so

100
00:07:26,100 --> 00:07:27,030
you know how to do it.

101
00:07:27,540 --> 00:07:30,180
It should only require elementary school algebra.
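
For reference, the rearrangement can be written out as below. It assumes $w_2 \neq 0$, which the lecture does not state but which is needed in order to divide by $w_2$.

```latex
\begin{aligned}
w_1 x_1 + w_2 x_2 + b &= 0 \\
w_2 x_2 &= -w_1 x_1 - b \\
x_2 &= -\frac{w_1}{w_2}\, x_1 - \frac{b}{w_2}
\end{aligned}
```

Under that assumption the slope is $-w_1 / w_2$ and the intercept is $b' = -b / w_2$.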

102
00:07:35,280 --> 00:07:39,510
The next question you might have is how does this line help us make predictions?

103
00:07:41,560 --> 00:07:47,620
Luckily due to the rules of geometry, if we plug in a data point X which is not on the line, then

104
00:07:47,620 --> 00:07:50,890
either we will get a number bigger than zero or less than zero.

105
00:07:51,630 --> 00:07:57,100
In fact, any data point on one side of the line will always give us a number bigger than zero.

106
00:07:57,640 --> 00:08:02,200
Any data point on the other side of the line will always give us a number less than zero.

107
00:08:02,950 --> 00:08:06,340
Using this, it's very easy to turn this into a prediction model.

108
00:08:07,030 --> 00:08:13,300
All we have to do is take in any data point x, which is a vector containing the elements x1 and

109
00:08:13,300 --> 00:08:21,130
x2, pass it into the expression for our line, w1x1 plus w2x2 plus b, and then check

110
00:08:21,130 --> 00:08:21,760
its sign.

111
00:08:22,390 --> 00:08:24,790
If the sign is positive, we predict one.

112
00:08:25,120 --> 00:08:26,800
If it's negative, we predict zero.
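
The sign check can be sketched in a few lines of Python. The weights below describe a hypothetical line chosen for the example, and a point lying exactly on the line is assigned to class zero here, which is a choice of this sketch.

```python
def predict(w1, w2, b, x1, x2):
    """Return 1 if the point is on the positive side of the line, else 0."""
    activation = w1 * x1 + w2 * x2 + b
    return 1 if activation > 0 else 0


# Hypothetical line: x1 + x2 - 1 = 0
w1, w2, b = 1.0, 1.0, -1.0
above = predict(w1, w2, b, 2.0, 2.0)   # activation is 3.0, positive side
below = predict(w1, w2, b, 0.0, 0.0)   # activation is -1.0, negative side
print(above, below)
```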

113
00:08:31,850 --> 00:08:37,310
Mathematically, you can encapsulate this decision rule using the step function, or if you want to

114
00:08:37,310 --> 00:08:40,790
be less formal, you can think of it as the picture that you see here.

115
00:08:41,540 --> 00:08:43,250
We call the value of w1x1 plus w2x2 plus b the activation.

116
00:08:43,520 --> 00:08:46,730
And if the activation is greater than zero, we predict one.

117
00:08:47,090 --> 00:08:48,440
Otherwise we predict zero.

118
00:08:49,430 --> 00:08:55,040
Now, as you may already know, in deep learning, we really like differentiable smooth functions.

119
00:08:55,490 --> 00:09:01,130
So rather than the step function, which is not smooth, what we do is take a smooth version of this

120
00:09:01,310 --> 00:09:02,360
called the sigmoid.

121
00:09:03,020 --> 00:09:09,110
The sigmoid is an S-shaped curve, and it maps the activation to a number between zero and one.

122
00:09:14,200 --> 00:09:19,000
We usually interpret this as the probability that Y equals one given X.

123
00:09:19,840 --> 00:09:23,620
Then when we want to make our prediction, we just round this probability.

124
00:09:24,280 --> 00:09:28,210
So if the probability is greater than 50%, we predict one.

125
00:09:28,450 --> 00:09:29,860
Otherwise, we predict zero.
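
As a rough sketch of the sigmoid and the rounding step in plain Python: the function names are made up for this example, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(a):
    """Map an activation to a number between zero and one."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)          # avoids overflow when a is very negative
    return e / (1.0 + e)


def predict_from_activation(a):
    """Round the probability: predict 1 if it is above 50%, else 0."""
    return 1 if sigmoid(a) > 0.5 else 0


print(sigmoid(0.0), predict_from_activation(2.0), predict_from_activation(-2.0))
```

An activation of zero maps to a probability of exactly 0.5, so rounding the probability agrees with checking the sign of the activation.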

126
00:09:30,730 --> 00:09:35,580
Again, this is called the sigmoid function, which is important to know for our implementation.

127
00:09:37,120 --> 00:09:42,760
As a side note, when we apply the sigmoid function on top of a linear function, we call this model

128
00:09:42,760 --> 00:09:44,170
logistic regression.

129
00:09:44,950 --> 00:09:50,200
This is because we also sometimes call the sigmoid function the logistic function, although this

130
00:09:50,200 --> 00:09:52,540
term is not really used too often these days.

131
00:09:54,300 --> 00:10:00,600
In addition, we also refer to the argument into the logistic function as the logit, but a more current

132
00:10:00,600 --> 00:10:03,180
and generic term for this is activation.

133
00:10:08,420 --> 00:10:13,910
You might realize that using what we know so far, there is a little bit of a notational challenge.

134
00:10:14,480 --> 00:10:18,410
If we keep writing each component of X separately, we're going to run out of space.

135
00:10:18,920 --> 00:10:24,560
It's easy to write w1x1 plus w2x2 when there are only two components of x.

136
00:10:25,160 --> 00:10:27,830
But what if there are 100 or 1000?

137
00:10:28,640 --> 00:10:31,940
Luckily, we have a mathematical way of representing this.

138
00:10:33,410 --> 00:10:40,850
If we consider X to be a feature vector containing each component of X and W to be a weight vector containing

139
00:10:40,850 --> 00:10:46,850
each component of W, then our expression just becomes the dot product between W and X.

140
00:10:47,480 --> 00:10:54,410
So we can write this in a much more compact way by saying probability of y equals one given x is equal

141
00:10:54,410 --> 00:10:57,380
to the sigmoid of W transpose x plus b.
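
The compact form works for any number of features. Here is a minimal sketch in plain Python, where the function name and the example numbers are assumptions made for illustration.

```python
import math


def predict_proba(w, x, b):
    """p(y = 1 | x) = sigmoid(w^T x + b) for feature lists of any length."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b   # dot product plus bias
    return 1.0 / (1.0 + math.exp(-activation))


# Hypothetical example with four features.
w = [0.5, -0.25, 1.0, 2.0]
x = [1.0, 2.0, 0.0, -1.0]
b = 2.0
p = predict_proba(w, x, b)   # activation is 0.5 - 0.5 + 0.0 - 2.0 + 2.0 = 0.0
print(p)
```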

142
00:11:02,460 --> 00:11:02,910
Okay.

143
00:11:02,910 --> 00:11:06,060
So at this point, let's recap the three concepts we need to cover.

144
00:11:06,750 --> 00:11:08,580
Number one, model architecture.

145
00:11:08,730 --> 00:11:11,160
Number two, using the model to make predictions.

146
00:11:11,190 --> 00:11:14,310
And number three, model training. We just covered

147
00:11:14,340 --> 00:11:14,790
number one

148
00:11:14,790 --> 00:11:15,480
and number two.

149
00:11:16,110 --> 00:11:19,890
You know what the model architecture is and you know how to use it to make predictions.

150
00:11:20,790 --> 00:11:25,890
Basically, given some input X, you know how to plug it into the previous formula to get some output

151
00:11:25,890 --> 00:11:26,900
prediction y hat.

152
00:11:27,990 --> 00:11:29,370
The final step is training.

153
00:11:30,150 --> 00:11:35,370
As I promised earlier, we are going to follow the exact same general steps as we did when we learned

154
00:11:35,370 --> 00:11:36,510
about linear regression.

155
00:11:37,560 --> 00:11:42,090
It may surprise you, but these are the same steps that we are going to follow for every subsequent

156
00:11:42,090 --> 00:11:43,410
example in this course.

157
00:11:43,980 --> 00:11:46,200
That's why I said it's the simplest example.

158
00:11:46,440 --> 00:11:48,000
But it was also the most important.

159
00:11:48,540 --> 00:11:49,500
Now you see why.

160
00:11:50,900 --> 00:11:53,570
So as you recall, we need to define a loss function.

161
00:11:54,080 --> 00:11:59,360
Then after we've done that, all we need to do is apply the gradient descent procedure to update the

162
00:11:59,360 --> 00:12:02,360
parameters of the model in order to minimize that loss.

163
00:12:03,600 --> 00:12:08,580
Once we've completed training, we can plot the loss per iteration to ensure that the loss converges

164
00:12:08,580 --> 00:12:13,770
successfully, and we can test our model by checking its accuracy on the train and test sets.

165
00:12:18,930 --> 00:12:23,670
The main difference in the training process between regression and classification is that we have a

166
00:12:23,670 --> 00:12:24,750
different loss function.

167
00:12:25,500 --> 00:12:31,200
As you recall, in regression our target is a real number and our prediction is also a real number.

168
00:12:31,920 --> 00:12:36,300
In this scenario, it makes sense to use a loss like the mean squared error, or MSE.

169
00:12:36,990 --> 00:12:39,720
But in classification our target is a category.

170
00:12:40,650 --> 00:12:45,000
Our model output is the probability that the input belongs to each of the categories.

171
00:12:45,420 --> 00:12:48,360
So the mean squared error doesn't make any sense in this scenario.

172
00:12:49,000 --> 00:12:52,620
In fact, for classification, what we want is the cross entropy loss.

173
00:12:53,790 --> 00:12:58,290
There's one little wrinkle here, which is that we're doing binary classification in which there are

174
00:12:58,290 --> 00:13:03,540
only two possible classes, and the corresponding loss is called the binary cross entropy loss.

175
00:13:04,660 --> 00:13:09,670
In general, if your model can handle any number of classes, then you'll use the regular cross entropy

176
00:13:09,670 --> 00:13:10,090
loss.

177
00:13:10,270 --> 00:13:12,160
But we'll discuss that later in this course.

178
00:13:13,600 --> 00:13:18,430
The reasoning behind the cross entropy loss is quite a bit more complicated than the mean squared error,

179
00:13:18,580 --> 00:13:20,710
so we won't discuss that theory in this lecture.

180
00:13:21,340 --> 00:13:25,510
If you want to learn more, then you should check out the in-depth sections of this course.

181
00:13:26,110 --> 00:13:29,590
For now, all you really need to know is the PyTorch API.

182
00:13:29,950 --> 00:13:35,800
Since as you recall, this course is not about mathematical theory but rather how we do things in pytorch.

183
00:13:36,700 --> 00:13:41,380
Therefore your job really is knowing the right function to call and spelling it correctly.

184
00:13:41,920 --> 00:13:46,960
Indeed, in this computer science course, what matters isn't your math skill but your spelling ability.
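
To make the training steps concrete, here is a minimal PyTorch sketch under stated assumptions. The toy data, learning rate, and number of epochs are made up for the example, and this is not necessarily the code used later in the course. It relies on `nn.Linear`, `nn.Sigmoid`, `nn.BCELoss`, and `torch.optim.SGD`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: two features per sample, targets of shape (N, 1).
X = torch.tensor([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5],
                  [1.0, 2.0], [2.0, 1.0], [1.5, 1.5]])
Y = torch.tensor([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]])

# Linear layer followed by a sigmoid: p(y = 1 | x) = sigmoid(w^T x + b).
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
criterion = nn.BCELoss()                                  # binary cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # gradient descent

losses = []
for epoch in range(200):
    optimizer.zero_grad()          # clear gradients from the previous step
    outputs = model(X)             # predicted probabilities
    loss = criterion(outputs, Y)   # compare probabilities with targets
    loss.backward()                # compute gradients
    optimizer.step()               # update the weights and bias
    losses.append(loss.item())

with torch.no_grad():
    predictions = (model(X) > 0.5).float()   # round the probabilities
    accuracy = (predictions == Y).float().mean().item()
print(losses[0], losses[-1], accuracy)
```

The `losses` list holds the loss per iteration, which is what you would plot to check that training settled down.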

185
00:13:52,010 --> 00:13:56,090
To summarize this lecture, we went over the concepts behind linear classification.

186
00:13:56,750 --> 00:13:59,120
Our goal was to consider these three things.

187
00:13:59,450 --> 00:14:01,700
Number one, what is the model architecture?

188
00:14:02,030 --> 00:14:04,130
Number two, how do we make predictions?

189
00:14:04,340 --> 00:14:06,560
And number three, how do we train the model?

190
00:14:07,850 --> 00:14:13,400
As you saw, it was very similar to linear regression in the sense that our model is still a line or

191
00:14:13,400 --> 00:14:18,040
a hyperplane, but it's how we use this line that's different. In regression,

192
00:14:18,050 --> 00:14:23,330
our goal is to get the line close to the data points, but in classification, our goal is to separate

193
00:14:23,330 --> 00:14:24,200
the data points.

194
00:14:25,480 --> 00:14:30,520
In regression, we just pass our data into W times X plus B and that is our prediction.

195
00:14:31,150 --> 00:14:37,000
But in classification, we still pass our data through W times X plus B; we just have the additional

196
00:14:37,000 --> 00:14:42,610
steps of applying the sigmoid function and rounding the output probability. For number three,

197
00:14:42,670 --> 00:14:47,710
again, it was very similar to our previous example where we set up a loss function and then use gradient

198
00:14:47,710 --> 00:14:49,960
descent to find the weights that minimize it.

199
00:14:50,560 --> 00:14:55,810
But the difference in classification is that instead of using the mean squared error, we use the cross

200
00:14:55,810 --> 00:14:56,320
entropy.

201
00:14:56,590 --> 00:15:01,420
And in our special case of binary classification, we use the binary cross entropy.