1
00:00:10,420 --> 00:00:13,780
So in this video, we are going to discuss logistic regression.

2
00:00:14,290 --> 00:00:18,560
Now, despite its name, logistic regression is actually used as a classifier.

3
00:00:19,060 --> 00:00:22,690
It can be thought of as the classification analogue of linear regression.

4
00:00:23,620 --> 00:00:24,880
So how does it work?

5
00:00:25,450 --> 00:00:30,160
Well, the question you have to answer is, why can't we just use linear regression as it is?

6
00:00:30,610 --> 00:00:33,610
For example, applying this rule, "all data is the same,"

7
00:00:33,790 --> 00:00:37,120
You would simply have targets which are made up of zeros and ones.

8
00:00:37,960 --> 00:00:43,180
For example, instead of trying to predict a student's exam grade, you simply want to predict whether

9
00:00:43,180 --> 00:00:44,380
they passed or failed.

10
00:00:44,890 --> 00:00:47,790
So why can't we use linear regression for this task?

11
00:00:52,380 --> 00:00:56,010
Well, the answer is that technically you can, but it's not ideal.

12
00:00:56,670 --> 00:01:02,010
Now, this is outside the scope of this course, but you're encouraged to check extra reading text if

13
00:01:02,010 --> 00:01:03,000
you want to learn more.

14
00:01:04,020 --> 00:01:08,120
Basically, linear regression assumes that your prediction errors are Gaussian.

15
00:01:08,580 --> 00:01:12,300
If that is not the case, then you violated the assumptions of the model.

16
00:01:13,650 --> 00:01:16,680
Instead, when your targets are binary, it's more correct

17
00:01:16,680 --> 00:01:21,840
to base your error on the binary distribution, more commonly known as the Bernoulli distribution.

18
00:01:22,380 --> 00:01:27,240
Now, again, we won't go through any detail since this is just for intuition, but basically this is

19
00:01:27,240 --> 00:01:30,300
the distribution you would use to model a coin toss.

20
00:01:34,960 --> 00:01:40,060
There's just one piece of math you need to know about, which is how we go from a linear model to predicting

21
00:01:40,060 --> 00:01:40,870
a category.

22
00:01:41,710 --> 00:01:45,360
The answer is the logistic function, also known as the sigmoid.

23
00:01:45,790 --> 00:01:48,060
It has this nice formula you can see here.

24
00:01:48,310 --> 00:01:50,640
But the important thing is to understand its shape.

25
00:01:51,130 --> 00:01:55,240
As you can see, it has this shape with asymptotes at the two ends.

26
00:01:55,840 --> 00:02:00,370
If its input goes to infinity, the output of the logistic function approaches one.

27
00:02:00,910 --> 00:02:07,900
If its input goes to minus infinity, the output of the logistic function approaches zero. In the center,

28
00:02:07,900 --> 00:02:09,100
when its input is zero,

29
00:02:09,280 --> 00:02:10,860
the output is zero point five.

30
00:02:11,380 --> 00:02:16,180
Thus, this is a reasonable function for mapping real numbers to probabilities.
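
The formula on the slide is not captured in this transcript. As a minimal sketch, assuming it is the standard logistic function sigmoid(z) = 1 / (1 + e^(-z)), the three properties just described can be checked in plain Python. The function name `sigmoid` and the sample inputs are illustrative choices, not taken from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval between 0 and 1."""
    # Split on the sign of z so that math.exp never receives a large positive
    # argument, which would raise OverflowError.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative input -> close to 0, zero input -> 0.5, large positive input -> close to 1.
for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z))
```

The printed values should move from near zero, through exactly 0.5 at the center, to near one, matching the shape described above.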

31
00:02:20,770 --> 00:02:27,160
As mentioned previously, you really want to understand this pattern, W transpose X plus B. Remember

32
00:02:27,160 --> 00:02:29,840
that intuitively this forms a line or a hyperplane.

33
00:02:30,670 --> 00:02:34,040
Well, this is going to be the input to our logistic function.

34
00:02:34,960 --> 00:02:40,510
So basically, logistic regression is like linear regression, except we have this one extra step where

35
00:02:40,510 --> 00:02:46,690
we pass the result of W transpose X plus B through the sigmoid, and that gives us a probability between

36
00:02:46,690 --> 00:02:50,410
zero and one, which makes sense if you want to predict a coin toss.

37
00:02:50,890 --> 00:02:55,300
In fact, this output tells us the probability that the target is equal to one.

38
00:02:55,960 --> 00:02:59,380
So we write it as P of Y equals one given X.

39
00:03:00,580 --> 00:03:07,240
Our model tells us what the probability of Y being one is, given the input X. For example, given

40
00:03:07,240 --> 00:03:12,940
that Alice studied for two hours and slept for eight hours, her probability of passing the exam is

41
00:03:12,940 --> 00:03:13,790
80 percent.

42
00:03:14,530 --> 00:03:19,630
Now, in order to make a decision, we must still pick either zero or one as a prediction.

43
00:03:20,470 --> 00:03:24,580
In this case, we would round the probability, which gives us either zero or one.

44
00:03:25,540 --> 00:03:29,860
Sometimes people like to choose their own threshold, but we won't discuss that in this course.
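
To make the extra step concrete, here is a rough sketch of the binary prediction in plain Python: compute W transpose X plus B, pass it through the sigmoid to get P of Y equals one given X, then round to get the decision. The weights, the bias, and the helper names `predict_proba` and `predict` are hypothetical and hand-picked for illustration. They are not learned from data and do not come from the lecture.

```python
import math

def sigmoid(z):
    # Plain form of the logistic function; fine for moderate values of z.
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """p(y = 1 | x) = sigmoid(w^T x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def predict(w, b, x):
    """Round the probability to a 0/1 decision (a tie at exactly 0.5 goes to class 1 here)."""
    return 1 if predict_proba(w, b, x) >= 0.5 else 0

# Hypothetical, hand-picked parameters (assumption, not from the lecture).
w = [0.5, 0.1]      # weights for [hours studied, hours slept]
b = -0.4            # bias term
alice = [2.0, 8.0]  # studied for two hours, slept for eight hours

p = predict_proba(w, b, alice)  # roughly 0.80 with these made-up parameters
label = predict(w, b, alice)    # 1, meaning "pass"
print(p, label)
```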

45
00:03:34,460 --> 00:03:39,980
So, as you know, one of my important rules for machine learning is machine learning is nothing but

46
00:03:39,980 --> 00:03:40,730
geometry.

47
00:03:41,330 --> 00:03:45,410
So how does that work for classification and specifically logistic regression?

48
00:03:46,310 --> 00:03:48,890
Well, it helps to think of the extreme cases.

49
00:03:49,880 --> 00:03:54,840
As you know, W transpose X plus B is the equation for a line or a plane.

50
00:03:55,490 --> 00:04:01,540
So if W transpose X plus B is equal to zero, that means our data point lies directly on the line.

51
00:04:02,420 --> 00:04:06,740
As you recall, when we plug zero into the sigmoid, we get zero point five.

52
00:04:07,310 --> 00:04:12,680
This makes sense since if we are directly on the line, then the probability that Y belongs to either

53
00:04:12,680 --> 00:04:14,600
class would be 50 percent.

54
00:04:15,620 --> 00:04:21,790
Now imagine that X is very far away from the line so that W transpose X plus B approaches infinity.

55
00:04:22,460 --> 00:04:26,180
In this case, the probability is going to approach zero or one.

56
00:04:26,810 --> 00:04:31,940
This also makes sense because as you move further and further away from the line, you are more certain

57
00:04:31,940 --> 00:04:34,820
that the data point belongs to one of the two classes.
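
The extreme cases can be checked numerically. This is a small sketch that assumes a hypothetical boundary line x1 minus x2 equals zero (so W is [1, -1] and B is 0) and evaluates the model at a point on the line and at points far away on either side. The boundary and the points are made up for illustration.

```python
import math

def sigmoid(z):
    # Plain form of the logistic function; fine for the moderate z used here.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical decision boundary: x1 - x2 = 0 (assumption for illustration).
w, b = [1.0, -1.0], 0.0

def prob_class_one(x):
    """p(y = 1 | x) for the boundary defined above."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

on_line = prob_class_one([3.0, 3.0])        # w^T x + b = 0   -> exactly 0.5
far_positive = prob_class_one([20.0, 0.0])  # w^T x + b = 20  -> very close to 1
far_negative = prob_class_one([0.0, 20.0])  # w^T x + b = -20 -> very close to 0
print(on_line, far_positive, far_negative)
```

Which side of the line a point falls on determines whether the probability heads toward one or toward zero.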

58
00:04:39,560 --> 00:04:44,390
So as a final point for this lecture, we need to discuss the concept of multiclass problems.

59
00:04:45,290 --> 00:04:49,580
Now, at this point, we won't go too in-depth, since I want to save that for some later time.

60
00:04:49,880 --> 00:04:55,070
But I want to mention that it is possible to extend our linear model to predict multiple classes.

61
00:04:56,090 --> 00:05:01,130
This is similar to the idea that we can have linear regression with multiple output targets.

62
00:05:02,450 --> 00:05:08,390
The basic idea is you're still going to get a probability, but instead of being from a Bernoulli distribution,

63
00:05:08,510 --> 00:05:10,940
it's going to be from a categorical distribution.

64
00:05:11,510 --> 00:05:17,390
Basically, a categorical distribution for K categories can be described by K different probabilities.

65
00:05:18,290 --> 00:05:20,360
For example, imagine rolling a die.

66
00:05:20,810 --> 00:05:26,030
In this case, K equals six, so you'd have the probability of rolling a one or two or three and so

67
00:05:26,030 --> 00:05:26,580
forth.

68
00:05:27,050 --> 00:05:30,800
The only requirement, of course, is that those probabilities have to sum to one.

69
00:05:35,440 --> 00:05:41,620
The only important thing to mention is how this multiclass output works in code. So typically you're

70
00:05:41,620 --> 00:05:44,380
going to make a prediction on N input vectors

71
00:05:44,380 --> 00:05:51,510
at the same time. As you recall, you're going to pass in an input matrix X of shape N by D. The

72
00:05:51,520 --> 00:05:54,010
output of this model will be another matrix of size

73
00:05:54,010 --> 00:06:01,240
N by K. That is to say, for each of the N samples you'll have K output probabilities, which makes

74
00:06:01,240 --> 00:06:03,190
sense in light of what we just discussed.

75
00:06:04,180 --> 00:06:09,760
The important thing to notice is that these probabilities are along the rows, so each row must sum

76
00:06:09,760 --> 00:06:10,280
to one.

77
00:06:11,260 --> 00:06:16,660
Now, in order to decide which class each sample belongs to, it doesn't make sense to round as it does

78
00:06:16,660 --> 00:06:17,830
in the binary case.

79
00:06:18,730 --> 00:06:23,860
Imagine, for example, you had 10 classes and each of them had an output probability of zero point

80
00:06:23,860 --> 00:06:24,310
one.

81
00:06:25,390 --> 00:06:29,770
If you rounded them, you would round each of them to zero, which obviously doesn't make any sense.

82
00:06:30,310 --> 00:06:35,620
In fact, if you rounded K numbers, you'd still have K numbers, which also doesn't make any sense.

83
00:06:36,700 --> 00:06:40,720
So in the multiclass case, the proper thing to do is take the argmax.

84
00:06:41,110 --> 00:06:46,660
That is to say, you pick the class which has the highest probability, and you do this for each row

85
00:06:46,840 --> 00:06:49,300
to get one class for each of the samples.
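
As a rough sketch of how this looks in code, the example below uses a hand-written probability matrix in place of a real model's output, with one row per sample and one column per class. The matrix values, the sizes (three samples, four classes), and the helper name `argmax` are hypothetical. The example checks that each row sums to one, shows that rounding does not give a single class, and then takes the argmax along each row.

```python
# Hypothetical model output for N = 3 samples and K = 4 classes (made-up values).
# Each row holds the K class probabilities for one sample.
P = [
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.30, 0.20],
    [0.70, 0.10, 0.10, 0.10],
]

def argmax(row):
    """Index of the largest entry in the row (the first one wins on ties)."""
    return max(range(len(row)), key=lambda k: row[k])

# Each row should sum to one, up to floating-point error.
row_sums = [sum(row) for row in P]

# Rounding still leaves K numbers per sample, and the row at index 1 rounds to all zeros.
rounded = [[round(p) for p in row] for row in P]

# Taking the argmax along each row gives exactly one class index per sample.
predictions = [argmax(row) for row in P]
print(predictions)  # [1, 2, 0]
```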
