1
00:00:11,200 --> 00:00:17,020
So in this lecture, we will be continuing our discussion about the intuition behind logistic regression.

2
00:00:17,950 --> 00:00:23,140
Now one question you may have at this point is how do we find the weights and the bias terms in the

3
00:00:23,140 --> 00:00:23,920
first place?

4
00:00:24,760 --> 00:00:29,140
This is not a question we'll be answering at this time, but we will eventually.

5
00:00:30,250 --> 00:00:35,410
For now, the important thing to pay attention to is the practical side, which is how do we do it in

6
00:00:35,410 --> 00:00:36,070
Python?

7
00:00:36,730 --> 00:00:40,810
Well, you're in luck, because in addition to my rule that all data is the same,

8
00:00:41,170 --> 00:00:46,270
you now get to learn one of my other rules, which is that all machine learning interfaces are the same.

9
00:00:47,080 --> 00:00:53,110
In particular, if you have used any classifier in scikit-learn, which you have, then you already know

10
00:00:53,110 --> 00:00:53,890
how to do this.

11
00:00:54,670 --> 00:00:59,470
In particular, we begin by creating an instance of the logistic regression class.

12
00:01:00,520 --> 00:01:05,230
The next step is to call the fit function, which will train our model based on the data set we pass

13
00:01:05,230 --> 00:01:05,530
in.

14
00:01:06,520 --> 00:01:10,120
Note that training logistic regression is an iterative process.

15
00:01:10,510 --> 00:01:16,450
So one argument to the constructor is max_iter, which controls the maximum number of iterations when

16
00:01:16,450 --> 00:01:17,260
you call fit.

17
00:01:18,100 --> 00:01:21,400
For some data sets, you may need more iterations than the default.

18
00:01:21,760 --> 00:01:25,860
You really just have to test it out to see what's right for your particular data set.
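As a minimal sketch of the steps just described, assuming scikit-learn is installed: the synthetic dataset and the value max_iter=1000 are made up for illustration and are not taken from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, made up purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# max_iter caps the number of solver iterations used when fit() is called.
# 1000 is an arbitrary illustrative choice, not a recommended setting.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# n_iter_ reports how many iterations the solver actually ran.
print(model.n_iter_)
```

If the solver reaches max_iter before converging, scikit-learn emits a ConvergenceWarning, which is usually the signal to try a larger value.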

19
00:01:27,100 --> 00:01:31,300
After we've trained our model, we can use the model to make predictions for new data.

20
00:01:31,990 --> 00:01:33,370
There are two ways to do this.

21
00:01:34,060 --> 00:01:39,550
The first way is to call model.predict, which returns an N-length, one-dimensional array of

22
00:01:39,550 --> 00:01:40,690
predicted labels.

23
00:01:41,590 --> 00:01:48,250
Note that if your K distinct classes are labeled with integers starting at zero, then this will contain integers from zero up to K minus one

24
00:01:48,250 --> 00:01:48,880
inclusive.

25
00:01:50,410 --> 00:01:56,590
The second way is to call model.predict_proba, which returns a matrix of posterior probabilities.

26
00:01:57,070 --> 00:02:03,040
If you have N data points and K classes, then the output will be a two-dimensional array of size N

27
00:02:03,040 --> 00:02:10,509
by K. The value at the n-th row and k-th column is the probability that the n-th input belongs to class

28
00:02:10,509 --> 00:02:11,020
k.

29
00:02:11,770 --> 00:02:17,020
As you recall, probabilities might be useful in certain scenarios like computing the AUC.
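The two prediction methods can be compared side by side in a short sketch. The iris dataset bundled with scikit-learn is used here only as a convenient three-class example; it is an assumption of this sketch, not the dataset from the lecture.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has N = 150 samples, 4 features, and K = 3 classes labeled 0, 1, 2.
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# predict returns a one-dimensional array of N predicted labels.
labels = model.predict(X)
print(labels.shape)

# predict_proba returns an N-by-K array; row n holds the class probabilities
# for input n, so each row sums to one.
probs = model.predict_proba(X)
print(probs.shape)
print(np.allclose(probs.sum(axis=1), 1.0))
```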

30
00:02:21,690 --> 00:02:27,150
So now that you understand how to train a logistic regression model in practice, it's time to return

31
00:02:27,150 --> 00:02:32,490
to the intuition. In particular, after we've found all the weights and bias terms,

32
00:02:32,850 --> 00:02:35,220
is it possible to interpret what they mean?

33
00:02:35,880 --> 00:02:36,990
The answer is yes.

34
00:02:37,560 --> 00:02:41,010
This is one of the best reasons to use a model like logistic regression.

35
00:02:41,820 --> 00:02:47,760
Not only is it very powerful, it is very easy to interpret, and these explanations will please your

36
00:02:47,760 --> 00:02:49,110
clients and stakeholders.

37
00:02:49,950 --> 00:02:54,540
So to understand how to interpret the model weights, we're going to start with an assumption.

38
00:02:55,290 --> 00:03:01,260
This assumption is that all the input features of X are non-negative, that is, greater than or equal to

39
00:03:01,260 --> 00:03:01,770
zero.

40
00:03:02,550 --> 00:03:07,590
This makes sense in the context of a field such as NLP, where X might represent word counts.

41
00:03:08,310 --> 00:03:12,010
Obviously, word counts must always be greater than or equal to zero.

42
00:03:12,600 --> 00:03:17,760
If this is not the case, then the ideas I'm about to describe still apply, but they're just reversed

43
00:03:17,760 --> 00:03:19,050
when the inputs are negative.

44
00:03:23,760 --> 00:03:29,550
So suppose that our model has three inputs, which are BMI, exercise frequency and hair length.

45
00:03:30,270 --> 00:03:34,830
The output is whether or not the patient will experience a severe reaction to COVID.

46
00:03:35,550 --> 00:03:40,470
Now, if these input features seem a bit weird, don't worry as we'll be explaining them shortly.

47
00:03:41,400 --> 00:03:44,940
It may help to draw a schematic of the logistic regression model.

48
00:03:45,750 --> 00:03:51,780
We draw the three inputs as circles and the weights as lines connecting the circles to a summation node.

49
00:03:52,440 --> 00:03:54,210
After going through the summation node,

50
00:03:54,240 --> 00:03:59,370
we then apply the sigmoid function, which gives us the output p of y equals one given x.
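Written out as a formula, the schematic corresponds to the following. The symbols w_1 to w_3 for the weights, b for the bias term, and a for the activation are notation assumed here for illustration.

```latex
a = w_1 x_{\text{BMI}} + w_2 x_{\text{exercise}} + w_3 x_{\text{hair}} + b,
\qquad
p(y = 1 \mid x) = \sigma(a) = \frac{1}{1 + e^{-a}}
```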

51
00:04:01,180 --> 00:04:07,090
So suppose that the weight for BMI is very large and positive. If you need a concrete value, let's

52
00:04:07,090 --> 00:04:08,170
suppose it's two.

53
00:04:09,380 --> 00:04:15,230
If this is the case, then the BMI multiplied by this weight will be even more positive, which will

54
00:04:15,230 --> 00:04:16,670
make the activation larger.

55
00:04:17,300 --> 00:04:22,340
In fact, multiplying BMI by two will obviously increase it by a factor of two.

56
00:04:23,790 --> 00:04:31,200
Now, as you recall, a large activation pushes the probability closer to one. Therefore, we can interpret

57
00:04:31,200 --> 00:04:37,530
large positive weights to mean that the corresponding feature has a large positive influence on the output

58
00:04:37,530 --> 00:04:38,250
being one.

59
00:04:40,070 --> 00:04:45,950
Now, let's look at the second input, which is exercise frequency. Suppose that the weight for this

60
00:04:45,950 --> 00:04:47,660
input is small and negative.

61
00:04:48,230 --> 00:04:50,480
Suppose it's minus 0.5.

62
00:04:51,230 --> 00:04:52,970
So what effect does this have?

63
00:04:53,750 --> 00:04:59,840
Well, since it's negative, then after multiplying by a positive input feature, it brings the activation

64
00:04:59,840 --> 00:05:04,670
down, and the activation going to the left, or towards negative infinity,

65
00:05:05,000 --> 00:05:08,300
results in the output being closer to zero, as you recall.

66
00:05:09,260 --> 00:05:14,840
However, since the magnitude of this weight is small, it has less of an effect compared to a weight

67
00:05:14,840 --> 00:05:15,860
which is large.

68
00:05:16,730 --> 00:05:22,460
So the lesson is that there are two factors that matter: the magnitude of the weight and the sign.

69
00:05:23,240 --> 00:05:26,900
If the sign is positive, then it brings the output closer to one.

70
00:05:27,440 --> 00:05:30,980
If the sign is negative, then it brings the output closer to zero.

71
00:05:31,700 --> 00:05:37,820
So a positive weight tells us that a larger value of the input feature is correlated with the positive

72
00:05:37,820 --> 00:05:38,480
class.

73
00:05:39,050 --> 00:05:44,300
A negative weight tells us that a larger value of the input feature is correlated with the negative

74
00:05:44,300 --> 00:05:49,220
class. If the magnitude is larger, then it has a more pronounced effect.

75
00:05:49,880 --> 00:05:56,480
So a weight with a larger magnitude tells us that this input matters a lot in predicting the output.

76
00:05:57,050 --> 00:06:01,790
A weight with a small magnitude tells us that this input doesn't matter as much.

77
00:06:02,810 --> 00:06:06,200
Finally, let's consider the third input, which is hair length.

78
00:06:06,830 --> 00:06:12,800
Obviously, this has no effect on the severity of COVID because you could simply cut your hair and thus

79
00:06:13,070 --> 00:06:14,300
its weight would be zero.

80
00:06:14,960 --> 00:06:20,990
So in practice, if you find a weight that is zero or very close to zero, then the interpretation is that

81
00:06:21,230 --> 00:06:22,970
this input is irrelevant.

82
00:06:24,320 --> 00:06:28,790
Of course, this is simply an extension of the fact that the magnitude of the weight matters.
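The effect of sign and magnitude can be checked numerically with a small sketch. The weights 2, minus 0.5 and zero come from the example above; the bias of zero, the baseline feature values of 1.0, and the helper names are assumptions made up for illustration.

```python
import math

def sigmoid(a):
    # Maps any real-valued activation to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-a))

# Weights from the example: large positive, small negative, and zero.
weights = {"bmi": 2.0, "exercise_frequency": -0.5, "hair_length": 0.0}
bias = 0.0  # assumed for illustration

def probability(features):
    # Weighted sum of the inputs plus the bias, passed through the sigmoid.
    activation = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(activation)

# Hypothetical non-negative feature values, not real patient data.
baseline = {"bmi": 1.0, "exercise_frequency": 1.0, "hair_length": 1.0}
p_base = probability(baseline)

# Increase one feature at a time by 1.0 and record the change in probability.
deltas = {}
for name in weights:
    bumped = dict(baseline)
    bumped[name] += 1.0
    deltas[name] = probability(bumped) - p_base

print(deltas)
```

With these numbers, increasing BMI raises the probability, increasing exercise frequency lowers it by a smaller amount, and changing hair length leaves it unchanged.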

83
00:06:33,380 --> 00:06:38,960
So now that we've discussed how to interpret the weights in the binary case, let's consider the multiclass

84
00:06:38,960 --> 00:06:39,500
case.

85
00:06:40,250 --> 00:06:42,950
Luckily, the interpretation is largely the same.

86
00:06:43,610 --> 00:06:48,380
It just seems a bit more complicated because we have more weights and they sort of interact with each

87
00:06:48,380 --> 00:06:49,970
other due to the softmax.

88
00:06:50,750 --> 00:06:54,420
The best way to interpret the weights is to consider them as a matrix.

89
00:06:55,100 --> 00:07:02,090
Suppose that our model has D inputs and K outputs. Instead of doing K separate dot products with K separate weight

90
00:07:02,090 --> 00:07:03,290
vectors of size D,

91
00:07:03,710 --> 00:07:05,570
it's simpler to do a single matrix

92
00:07:05,570 --> 00:07:09,050
multiply with a weight matrix of size D by K.

93
00:07:09,830 --> 00:07:16,460
So if X is a vector of size D and we multiply that with a matrix of size D by K, we will get a vector

94
00:07:16,460 --> 00:07:19,830
of size K, which represents the K activations.

95
00:07:20,540 --> 00:07:22,850
So let's call this weight matrix big W.

96
00:07:24,020 --> 00:07:29,750
Now let's consider the case where the value in big W at row d, column k is large.

97
00:07:30,350 --> 00:07:35,570
If this value is large and positive, then it will increase the k-th activation.

98
00:07:36,380 --> 00:07:42,830
Therefore, the interpretation is that input feature d has a positive correlation with the k-th class

99
00:07:42,830 --> 00:07:43,820
being the target.

100
00:07:44,660 --> 00:07:51,530
Similarly, if the value of W at row d, column k is large and negative, then input feature d has a negative correlation with the

101
00:07:51,530 --> 00:07:53,180
k-th class being the target.

102
00:07:53,990 --> 00:07:57,170
And again, magnitude matters in pretty much the same way.

103
00:07:57,650 --> 00:08:01,640
If the magnitude is small, then this input feature has less of an effect.

104
00:08:02,090 --> 00:08:06,050
If the magnitude is large, then this input feature has more of an effect.

105
00:08:06,590 --> 00:08:10,160
If the magnitude is zero, then the input feature has no effect.
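The multiclass picture can be sketched with NumPy. The sizes D = 3 and K = 4, the feature values, the bias vector b, and the weight value 3.0 are all assumptions made up for illustration.

```python
import numpy as np

def softmax(a):
    # Subtracting the max improves numerical stability without changing the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()

D, K = 3, 4                      # hypothetical number of inputs and classes
x = np.array([1.0, 2.0, 0.5])    # non-negative input features, size D
W = np.zeros((D, K))             # weight matrix of size D by K
b = np.zeros(K)                  # bias terms, one per class

# A single matrix multiply produces all K activations at once.
activations = x @ W + b
probs = softmax(activations)     # all weights are zero, so every class is equally likely

# Make the weight from input feature d = 1 to class k = 2 large and positive.
W[1, 2] = 3.0
probs_after = softmax(x @ W + b)

print(activations.shape)
print(probs)
print(probs_after)
```

Because the softmax normalizes across classes, raising the activation for class 2 also lowers the probabilities of the other classes, which is the sense in which the weights interact with each other.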

