1
00:00:11,050 --> 00:00:16,780
So in this lecture, we will be discussing the intuition behind logistic regression, which is the machine

2
00:00:16,780 --> 00:00:19,750
learning model we'll be using to do sentiment analysis.

3
00:00:20,590 --> 00:00:25,480
Now one interesting way to compare what we are learning now to what we previously learned, meaning

4
00:00:25,480 --> 00:00:31,510
naive Bayes, is by thinking about these techniques in terms of the high-level topics in this course.

5
00:00:32,290 --> 00:00:36,820
As you recall, this course started out with two major sections that were very distinct.

6
00:00:37,390 --> 00:00:43,240
One of them was vector models, where we thought about NLP in terms of converting text into vectors.

7
00:00:44,380 --> 00:00:50,140
The other one was probability models where we thought of NLP in terms of Markov chains and other related

8
00:00:50,140 --> 00:00:50,800
methods.

9
00:00:51,430 --> 00:00:55,060
So pretty obviously naive Bayes is a model based on probability.

10
00:00:55,690 --> 00:01:01,420
In this section, you will see that linear classification, of which logistic regression is an example,

11
00:01:01,660 --> 00:01:04,360
is based on vectors instead of probabilities.

12
00:01:05,019 --> 00:01:08,890
So this is a nice way to tie in what we learned earlier in this course.

13
00:01:13,490 --> 00:01:17,390
Let's now review the vector perspective on the classification task.

14
00:01:18,110 --> 00:01:23,690
As you recall, one of the nice consequences of converting text into vectors is that this allows us

15
00:01:23,690 --> 00:01:27,110
to plot them on a grid. Of course, in practice,

16
00:01:27,140 --> 00:01:32,330
these will be very high-dimensional grids, but for the purpose of this visualization, we will pretend

17
00:01:32,330 --> 00:01:33,860
they are two dimensional grids.

18
00:01:34,790 --> 00:01:41,360
So the beauty of this method is that it makes understanding classification a simple problem of geometry.

19
00:01:42,080 --> 00:01:48,140
As you recall, our job in building a so-called classifier is to simply find a line or a curve that

20
00:01:48,140 --> 00:01:50,330
can separate points of different colors.

21
00:01:51,200 --> 00:01:54,170
Remember that different colors represent different classes.

22
00:01:54,770 --> 00:01:59,840
So one color might represent positive sentiment, while another might represent negative sentiment and

23
00:01:59,840 --> 00:02:00,530
so forth.

24
00:02:01,610 --> 00:02:07,590
Note that this approach applies to both binary and multiclass problems. In the multiclass case,

25
00:02:07,610 --> 00:02:12,260
it's simply that we will have more than two colors, but the general idea remains the same.

26
00:02:12,650 --> 00:02:15,740
We want to separate colors by either a line or a curve.

27
00:02:16,400 --> 00:02:21,410
Of course, in higher dimensions, a line becomes a plane or a hyperplane, while the curve becomes

28
00:02:21,410 --> 00:02:23,150
a surface or hypersurface.

29
00:02:24,290 --> 00:02:29,990
Note that logistic regression is a linear model, and so the separating boundary will be a hyperplane

30
00:02:30,230 --> 00:02:31,820
instead of a hypersurface.

31
00:02:36,430 --> 00:02:41,770
Now, in order to really understand logistic regression, it is necessary to use a little bit of math.

32
00:02:42,490 --> 00:02:47,740
Luckily, if you just want to understand the intuition, it doesn't require anything beyond approximately

33
00:02:48,100 --> 00:02:49,630
tenth grade mathematics.

34
00:02:50,320 --> 00:02:53,590
So that being said, let's review the equation for a line.

35
00:02:54,520 --> 00:02:58,720
As you recall, the equation for a line is Y equals m x plus b.

36
00:02:59,410 --> 00:03:04,300
In this form, M represents the slope, while B represents the Y intercept.

37
00:03:04,840 --> 00:03:06,700
So hopefully this is just a review.

38
00:03:06,700 --> 00:03:10,390
But if you don't understand this, please let me know in the Q&A.

39
00:03:15,060 --> 00:03:20,740
Now for machine learning, it is better to use different symbols for our line. In particular,

40
00:03:20,760 --> 00:03:26,580
recall that each axis on this grid represents a component of our feature vector, which we call X.

41
00:03:27,420 --> 00:03:32,820
Furthermore, in machine learning, we typically reserve the letter Y for the output or target of our

42
00:03:32,820 --> 00:03:36,150
model, and thus it's not a good symbol to use on this grid.

43
00:03:36,960 --> 00:03:42,810
So because our feature vector is called X, then we'll simply refer to the components of X as X1 and

44
00:03:42,810 --> 00:03:45,000
X2. As you recall,

45
00:03:45,030 --> 00:03:48,060
this is because our grid currently has two dimensions.

46
00:03:48,660 --> 00:03:50,250
Of course, this is pretty trivial.

47
00:03:50,610 --> 00:03:58,110
Our line now has the equation X2 equals M times X1 plus B, so hopefully not very challenging

48
00:03:58,110 --> 00:03:58,830
so far.

49
00:04:03,390 --> 00:04:06,450
The next step is to represent our line in a different form.

50
00:04:07,320 --> 00:04:13,680
In particular, it's much more convenient to have it in the form of W one times x one plus W two times

51
00:04:13,680 --> 00:04:18,190
X two plus B equals zero. As an exercise,

52
00:04:18,209 --> 00:04:24,240
if this looks unfamiliar to you, I would recommend trying to convert this equation for a line into

53
00:04:24,240 --> 00:04:25,950
the format we previously saw.

54
00:04:26,760 --> 00:04:32,370
Basically, your goal is to isolate X two by moving all the other variables to the other side.

55
00:04:33,060 --> 00:04:40,770
After doing some algebra, you should get that X two is equal to minus W one over W two times X one plus

56
00:04:40,860 --> 00:04:42,690
minus B over W two.

57
00:04:43,350 --> 00:04:46,770
So you might want to try that yourself to confirm that this is correct.

58
00:04:48,540 --> 00:04:55,890
What this tells us is that the slope is minus W one over W2 and the intercept is minus B over W2.

59
00:04:56,760 --> 00:05:00,300
Note that this new B is not the same as the B we saw previously.

60
00:05:00,600 --> 00:05:04,320
It's simply that B is a common letter used in both these scenarios.

61
00:05:05,430 --> 00:05:11,010
In any case, the important point to draw from this is that both of these forms are valid representations

62
00:05:11,010 --> 00:05:11,610
of a line.
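
As a rough sketch of that exercise in Python, here is one way to convert the weight-and-bias form back into slope-intercept form. The helper name to_slope_intercept is made up for illustration, and it assumes W2 is not zero, since otherwise the line is vertical and has no slope.

```python
def to_slope_intercept(w1, w2, b):
    """Convert w1*x1 + w2*x2 + b = 0 into x2 = slope*x1 + intercept.

    Hypothetical helper for illustration; assumes w2 != 0.
    """
    if w2 == 0:
        raise ValueError("w2 must be nonzero (the line would be vertical)")
    slope = -w1 / w2       # minus W1 over W2
    intercept = -b / w2    # minus B over W2
    return slope, intercept


# The line x1 + x2 - 1 = 0 is the same as x2 = -1 * x1 + 1.
slope, intercept = to_slope_intercept(1.0, 1.0, -1.0)
print(slope, intercept)  # -1.0 1.0
```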

63
00:05:16,330 --> 00:05:19,570
So in machine learning, there is some terminology you should know.

64
00:05:20,170 --> 00:05:24,940
In particular, we call the W's the weights, and we call the B term the bias.

65
00:05:25,570 --> 00:05:30,730
So this is the terminology we'll be using throughout the course, especially when it comes to deep learning

66
00:05:30,730 --> 00:05:31,810
and neural networks.

67
00:05:32,500 --> 00:05:37,660
As you'll soon see, logistic regression can be thought of as a neuron, which is the basic building

68
00:05:37,660 --> 00:05:39,220
block of neural networks.

69
00:05:43,680 --> 00:05:45,300
Here's another point to consider.

70
00:05:46,170 --> 00:05:52,020
As you recall, the whole reason we have a line is because we want to use the line to figure out which

71
00:05:52,020 --> 00:05:53,730
side a new data point falls on.

72
00:05:54,300 --> 00:05:57,480
For example, if it falls on one side, we say it's spam.

73
00:05:57,570 --> 00:05:59,310
Otherwise, we say it's not spam.

74
00:06:00,180 --> 00:06:04,110
Now you might at first say, Well, why can't I just look at this graph?

75
00:06:04,740 --> 00:06:06,810
And there are two reasons why that won't work.

76
00:06:07,650 --> 00:06:10,890
The first is that looking at a graph is not automatic.

77
00:06:11,370 --> 00:06:14,580
Remember, the whole point of this is to do everything inside a computer.

78
00:06:15,060 --> 00:06:18,000
So the process can't involve you looking at a graph.

79
00:06:18,780 --> 00:06:24,060
Secondly, recall that in practice, we're going to have many dimensions, maybe even thousands.

80
00:06:24,600 --> 00:06:29,280
Therefore, our line will in fact be a high-dimensional hyperplane, which you cannot see.

81
00:06:29,850 --> 00:06:34,440
And thus, the only way to do it is by doing a calculation inside your computer.

82
00:06:39,250 --> 00:06:44,140
So let's take a very simple line X1 plus X2 minus one equals zero.

83
00:06:44,950 --> 00:06:51,070
In this case, both of our weights W1 and W2 are just one, and the bias term is minus one.

84
00:06:52,210 --> 00:06:59,680
The main thing we want to consider is the expression on the left side, that is, X1 plus X2 minus one. When

85
00:06:59,680 --> 00:07:01,300
this is equal to zero,

86
00:07:01,330 --> 00:07:02,940
it's the equation of our line.

87
00:07:03,310 --> 00:07:07,000
But now this expression by itself is our main object of interest.

88
00:07:07,780 --> 00:07:12,970
So let's see what happens if we plug in the vector X1 equals one and X2 equals zero.

89
00:07:14,050 --> 00:07:19,480
In this case, the point falls on the line, which makes sense since when the left side expression is

90
00:07:19,480 --> 00:07:22,330
equal to zero, that is the equation of our line.

91
00:07:23,500 --> 00:07:25,990
Now let's plug in the point X1 equals zero,

92
00:07:25,990 --> 00:07:27,040
X2 equals zero.

93
00:07:27,910 --> 00:07:32,080
In this case, we get minus one, which falls to the left side of the line.

94
00:07:33,320 --> 00:07:36,770
Now, let's plug in the point X1 equals one and X2 equals one.

95
00:07:37,700 --> 00:07:41,540
In this case, we get plus one, which falls to the right side of the line.

96
00:07:42,650 --> 00:07:48,440
Basically, here's the pattern you should notice: when this expression is equal to zero, that's the

97
00:07:48,440 --> 00:07:49,190
line itself.

98
00:07:49,520 --> 00:07:53,420
So any point that satisfies the equality will be on the line.

99
00:07:54,230 --> 00:07:59,690
When this expression is less than zero, then the point we plugged in falls to the left of the line.

100
00:08:00,470 --> 00:08:02,360
When this expression is greater than zero,

101
00:08:02,450 --> 00:08:05,720
then the point we plugged in falls to the right of the line.

102
00:08:06,350 --> 00:08:12,200
I encourage you to try out different points to convince yourself that this pattern holds for all points.
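
If you'd like to try this on a computer instead of by hand, here is a minimal Python sketch of the check we just did. The function name a and the default weights are just my choice for this example line, X1 plus X2 minus one equals zero.

```python
def a(x1, x2, w1=1.0, w2=1.0, b=-1.0):
    """Left-side expression w1*x1 + w2*x2 + b for the example line."""
    return w1 * x1 + w2 * x2 + b


def side(value):
    """Describe where a point lies, based on the sign of the expression."""
    if value == 0:
        return "on the line"
    if value < 0:
        return "left of the line"
    return "right of the line"


for point in [(1, 0), (0, 0), (1, 1)]:
    value = a(*point)
    print(point, value, side(value))
# (1, 0) 0.0 on the line
# (0, 0) -1.0 left of the line
# (1, 1) 1.0 right of the line
```

Keep in mind that "left" and "right" only apply to this particular choice of weights and bias.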

103
00:08:16,770 --> 00:08:21,840
Note that we can simplify the rule we just came up with by assigning the expression we learned to a

104
00:08:21,840 --> 00:08:25,650
new name, a of X, where a stands for activation.

105
00:08:26,580 --> 00:08:33,150
So our rule can be stated as: if a of X is equal to zero, then X is on the line. When a of X is greater

106
00:08:33,150 --> 00:08:33,700
than zero,

107
00:08:33,720 --> 00:08:35,429
we are on one side of the line.

108
00:08:35,970 --> 00:08:37,549
When a of X is less than zero,

109
00:08:37,559 --> 00:08:39,600
we are on the other side of the line.

110
00:08:44,210 --> 00:08:50,510
One thing to pay attention to is that the equation for a line is not unique in the format we've presented.

111
00:08:51,140 --> 00:08:55,460
For example, take the line X1 plus X2 minus one equals zero.

112
00:08:56,210 --> 00:09:03,740
I encourage you to convince yourself that two times X1 plus two times x two minus two equals zero represents

113
00:09:03,740 --> 00:09:05,030
the exact same line.

114
00:09:06,290 --> 00:09:08,960
Furthermore, we can even negate all the values.

115
00:09:09,440 --> 00:09:14,990
So minus x one minus x two plus one equals zero still represents the same line.

116
00:09:16,480 --> 00:09:21,700
In other words, an infinite collection of weights and biases can represent the exact same boundary.

117
00:09:22,630 --> 00:09:28,690
This is important since this means that a of X being greater than zero doesn't automatically imply that

118
00:09:28,990 --> 00:09:31,840
X will be to the right or to the left of some line.

119
00:09:32,380 --> 00:09:39,490
Instead, it depends on the signs in front of W and B. For example, if we plug in the point one, one into our

120
00:09:39,490 --> 00:09:42,220
previous expression, we would get plus one.

121
00:09:42,730 --> 00:09:46,360
But if we plug it into the negated expression, we would get minus one.

122
00:09:46,690 --> 00:09:49,390
And remember, these represent the exact same line.
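
Here is a small Python sketch to convince yourself of this. The three sets of weights and biases are the ones from this example; the function name activation and the dictionary layout are only for illustration.

```python
def activation(x, w, b):
    """Compute w1*x1 + w2*x2 + b for a 2-D point x."""
    return w[0] * x[0] + w[1] * x[1] + b


# Three parameterizations of the same line x1 + x2 - 1 = 0.
lines = {
    "original": ((1.0, 1.0), -1.0),
    "doubled": ((2.0, 2.0), -2.0),
    "negated": ((-1.0, -1.0), 1.0),
}

on_line = (1.0, 0.0)   # a point on the line
off_line = (1.0, 1.0)  # a point off the line

for name, (w, b) in lines.items():
    print(name, activation(on_line, w, b), activation(off_line, w, b))
# original 0.0 1.0
# doubled 0.0 2.0
# negated 0.0 -1.0
```

The point on the line gives zero in all three cases, but the sign for the other point flips when we negate the weights and bias.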

123
00:09:54,060 --> 00:09:58,920
Now, as you recall, we've been looking at a two-dimensional grid since that's all we can practically

124
00:09:58,920 --> 00:09:59,400
see.

125
00:10:00,030 --> 00:10:04,430
However, in practice, our data sets will have many more dimensions.

126
00:10:05,100 --> 00:10:11,910
In this case, it becomes infeasible to write down W one x one plus W two x two plus W three x three

127
00:10:11,910 --> 00:10:12,750
and so forth.

128
00:10:13,500 --> 00:10:17,160
Instead, what we actually do is use vector notation.

129
00:10:17,970 --> 00:10:20,460
As you recall, X is actually a vector.

130
00:10:21,210 --> 00:10:27,480
Suppose X is a vector of size D. Since there must be a component of W for every component of X,

131
00:10:27,810 --> 00:10:30,720
this means that W is also a vector of size D.

132
00:10:31,950 --> 00:10:37,740
Furthermore, you may recognize that this element-wise product and summation is in fact what is known

133
00:10:37,740 --> 00:10:38,880
as a dot product.

134
00:10:39,450 --> 00:10:44,880
Therefore, a much more compact way to write our expression is W Transpose X plus b.

135
00:10:45,780 --> 00:10:50,520
Now, recall that in linear algebra, vectors are normally thought of as column vectors.

136
00:10:50,970 --> 00:10:54,570
That is, W and X are both D-by-one matrices.

137
00:10:55,230 --> 00:11:01,110
The reason I bring this up is because a few students have asked why we don't use W X transpose instead.

138
00:11:02,550 --> 00:11:08,550
Note that this is uncommon because we don't normally treat vectors as row vectors, and note that it's

139
00:11:08,550 --> 00:11:11,910
also valid to use inner product notation or dot notation.

140
00:11:12,240 --> 00:11:15,120
However, these seem to be less common in machine learning.
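
To make the vector notation concrete, here is a plain Python sketch of W transpose X plus B for any dimension D. The helper names dot and activation and the example numbers are made up for illustration; in practice you would typically use a numerical library rather than a hand-written loop.

```python
def dot(w, x):
    """Dot product: the sum of element-wise products of two equal-length vectors."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of components")
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def activation(w, x, b):
    """Compute w^T x + b for vectors of any dimension D."""
    return dot(w, x) + b


# D = 3 example: 1*0.5 + 1*(-1) + 2*3 - 1 = 4.5
w = [1.0, 1.0, 2.0]
x = [0.5, -1.0, 3.0]
b = -1.0
print(activation(w, x, b))  # 4.5
```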

141
00:11:19,820 --> 00:11:24,740
Now, so far, all we've done in this lecture is describe the form of a linear classifier.

142
00:11:25,370 --> 00:11:31,760
What we have not yet done is explain how this turns into logistic regression. In order to do that,

143
00:11:31,790 --> 00:11:34,280
we need to introduce the logistic function.

144
00:11:35,120 --> 00:11:38,280
Note that the logistic function is also known as the sigmoid,

145
00:11:38,570 --> 00:11:44,210
in case you've seen different terminology elsewhere. The equation for this function is very simple.

146
00:11:44,540 --> 00:11:52,220
Again, just some tenth grade mathematics. In particular, sigma of X is equal to one over one plus the

147
00:11:52,220 --> 00:11:54,050
exponential of minus X.

148
00:11:54,950 --> 00:12:00,320
Note that the equation itself is not too important unless you want to convince yourself of the properties

149
00:12:00,320 --> 00:12:02,510
of the sigmoid I'm about to describe,

150
00:12:02,960 --> 00:12:10,370
or if you want to implement this yourself in code. So pay more attention to the properties. In particular,

151
00:12:10,820 --> 00:12:16,730
note that this function has two asymptotes, one on either side. As the input approaches infinity,

152
00:12:17,090 --> 00:12:23,210
the output of the sigmoid approaches one. On the left side, as the input approaches minus infinity,

153
00:12:23,510 --> 00:12:28,640
the output of the sigmoid approaches zero. In the middle, when its argument is zero,

154
00:12:28,940 --> 00:12:31,040
the function is equal to zero point five.

155
00:12:31,880 --> 00:12:37,000
Importantly, notice that the output of the sigmoid is always between zero and one.
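
If you do want to implement this yourself, here is one possible sketch in Python using only the standard library. The two-branch form is my own choice to avoid overflow for very negative inputs; it computes the same function as one over one plus the exponential of minus the input.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) could overflow, so use the equivalent
    # form exp(z) / (1 + exp(z)) instead.
    ez = math.exp(z)
    return ez / (1.0 + ez)


print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # very close to 1
print(sigmoid(-10.0))  # very close to 0
```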

156
00:12:41,610 --> 00:12:45,270
OK, so what is the purpose of introducing this sigmoid function?

157
00:12:46,110 --> 00:12:50,670
The answer is that it allows us to interpret the output of our model as a probability.

158
00:12:51,420 --> 00:12:57,210
In particular, we're going to apply the sigmoid to our previous expression W transpose X plus B.

159
00:12:58,050 --> 00:13:02,820
We interpret this as the probability that the target Y equals one given X.

160
00:13:03,690 --> 00:13:09,150
As you recall, probabilities must always be between zero and one, which is the case for the sigmoid.

161
00:13:10,320 --> 00:13:14,100
Now you might wonder why should we care to output probabilities?

162
00:13:14,820 --> 00:13:21,270
One use case is to compute metrics like the AUC, which requires the model to output scores such as probabilities, not just class labels.

163
00:13:21,780 --> 00:13:26,940
We can even sample from this distribution, which is a common thing to do with neural networks.

164
00:13:31,450 --> 00:13:36,820
One thing to note is that our original decision rule corresponds to rounding the probability we get

165
00:13:36,820 --> 00:13:37,990
back from the sigmoid.

166
00:13:38,740 --> 00:13:44,860
In particular, suppose that our rule is: if P of Y equals one given X is greater than 0.5,

167
00:13:45,160 --> 00:13:46,330
then predict one.

168
00:13:47,500 --> 00:13:52,120
If P of Y equals one given X is less than 0.5, then predict zero.

169
00:13:52,990 --> 00:13:58,120
Otherwise, we fall directly on the line, in which case it doesn't really matter which class you pick,

170
00:13:58,390 --> 00:14:02,950
as long as you are consistent. In practice, we normally call the round function.

171
00:14:02,960 --> 00:14:07,480
So if you happen to get exactly zero point five, then we would predict one.

172
00:14:09,300 --> 00:14:12,810
So how does this correspond to our decision rule from before?

173
00:14:13,800 --> 00:14:18,120
Recall that this is where, if a of X is greater than zero, then predict one.

174
00:14:18,690 --> 00:14:21,420
If a of X is less than zero, then predict zero.

175
00:14:21,990 --> 00:14:24,120
Otherwise, we fall directly on the line.

176
00:14:25,260 --> 00:14:28,860
This makes perfect sense since this line is the separating boundary.

177
00:14:29,700 --> 00:14:34,050
So if W transpose X plus B is equal to zero, then we are on the line.

178
00:14:34,830 --> 00:14:41,190
But as you recall, the sigmoid of zero is zero point five, which corresponds to a probability of 50

179
00:14:41,190 --> 00:14:43,390
percent. 50 percent

180
00:14:43,410 --> 00:14:49,590
makes perfect sense because if we fall directly on the line, we are equally sure that this data point

181
00:14:49,860 --> 00:14:51,630
could belong to either class.
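
Here is a minimal Python sketch showing that thresholding the probability at 0.5 and checking the sign of the activation give the same prediction. The function names are made up for illustration, and I'm sending the tie case to class one to match what was said earlier. I'm using an explicit comparison rather than Python's built-in round, because round(0.5) returns 0 in Python 3.

```python
import math


def sigmoid(z):
    """Logistic function; fine for the small inputs used here."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_from_probability(p):
    """Predict 1 when p(y=1|x) is at least 0.5 (ties go to class 1)."""
    return 1 if p >= 0.5 else 0


def predict_from_activation(a):
    """Predict 1 when the activation w^T x + b is at least 0."""
    return 1 if a >= 0 else 0


for a in [-3.0, -0.1, 0.0, 0.1, 3.0]:
    by_prob = predict_from_probability(sigmoid(a))
    by_sign = predict_from_activation(a)
    print(a, by_prob, by_sign)
# -3.0 0 0
# -0.1 0 0
# 0.0 1 1
# 0.1 1 1
# 3.0 1 1
```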

182
00:14:56,140 --> 00:15:01,720
Now, using a probability serves another purpose, which is that it allows us to use performance metrics

183
00:15:01,720 --> 00:15:05,590
like the AUC, which can be used when we have imbalanced classes.

184
00:15:06,400 --> 00:15:12,190
As you recall, this is how the ROC curve is drawn: by tracing out different points for different values

185
00:15:12,190 --> 00:15:13,210
of the threshold.

186
00:15:13,840 --> 00:15:18,880
In other words, we don't always have to use a threshold of 50 percent, even though this corresponds

187
00:15:18,880 --> 00:15:20,230
to the separating boundary.

188
00:15:20,950 --> 00:15:26,080
So, for example, we might use 30 percent if we want to make more positive predictions.

189
00:15:26,560 --> 00:15:31,120
This will give us more true positives, but it will also give us more false positives as well.

190
00:15:31,840 --> 00:15:37,300
Or if your model has too many false positives, you might increase the threshold beyond 50 percent.

191
00:15:37,990 --> 00:15:43,780
So another way to use the ROC curve is to choose the best threshold based on the tradeoff you

192
00:15:43,780 --> 00:15:47,710
desire between the true positive and false positive rates.
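
To see this tradeoff in action, here is a small Python sketch that counts true positives and false positives at a few thresholds. The probabilities and labels are made-up toy data, not results from any real model, and the function name is hypothetical.

```python
def count_positives(probs, labels, threshold):
    """Return (true positives, false positives) at the given threshold."""
    tp = fp = 0
    for p, y in zip(probs, labels):
        if p >= threshold:      # model predicts the positive class
            if y == 1:
                tp += 1
            else:
                fp += 1
    return tp, fp


# Toy data for illustration only.
probs = [0.10, 0.35, 0.40, 0.60, 0.80, 0.45]
labels = [0, 0, 1, 1, 1, 0]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, count_positives(probs, labels, threshold))
# 0.3 (3, 2)
# 0.5 (2, 0)
# 0.7 (1, 0)
```

Lowering the threshold to 30 percent catches all three positives but also lets in two false positives, while raising it reduces both.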

193
00:15:52,280 --> 00:15:57,980
So the next thing I want to discuss in this lecture is to compare logistic regression with naive Bayes

194
00:15:57,980 --> 00:16:00,140
in terms of what kinds of models they are.

195
00:16:01,040 --> 00:16:07,390
One way to categorize them is to say that logistic regression is discriminative while naive Bayes is generative.

196
00:16:08,730 --> 00:16:15,420
The reason for this distinction is as follows: logistic regression directly models P of Y given X.

197
00:16:15,840 --> 00:16:17,580
This is what we call discriminative.

198
00:16:18,000 --> 00:16:25,200
It discriminates between the different classes for Y directly. Naive Bayes indirectly models P of Y given X,

199
00:16:25,410 --> 00:16:27,480
but it really models p of X given Y.

200
00:16:28,020 --> 00:16:29,430
This is what we call generative.

201
00:16:30,450 --> 00:16:32,310
So why do we call this generative?

202
00:16:33,240 --> 00:16:39,150
Well, a simple reason is that once we have P of X given Y, we can actually generate samples from

203
00:16:39,150 --> 00:16:40,140
that distribution.

204
00:16:40,860 --> 00:16:47,730
So given a class Y, sample an X that could have come from this class. This is very easy to do once you

205
00:16:47,730 --> 00:16:50,160
have the distribution p of X given Y.

206
00:16:50,640 --> 00:16:54,090
But you cannot do this if you only have P of Y given X.

207
00:16:54,930 --> 00:16:57,720
Now this might seem quite silly in the abstract sense.

208
00:16:57,870 --> 00:16:59,850
Why would you want to generate X?

209
00:17:00,660 --> 00:17:02,760
Well, it helps to think of some examples.

210
00:17:03,480 --> 00:17:05,690
One example is with image recognition.

211
00:17:06,359 --> 00:17:10,380
In this case, X is an image and Y is the label to recognize.

212
00:17:10,950 --> 00:17:17,220
In this case, if we could sample X given Y, then we could generate random images from a given class.

213
00:17:17,910 --> 00:17:21,900
GANs are one example of state-of-the-art generative models for images.

214
00:17:22,380 --> 00:17:28,590
So, for instance, I could say, given the class cat, generate a random image of a cat, or given the class

215
00:17:28,590 --> 00:17:31,200
car, generate a random image of a car.