1
00:00:11,060 --> 00:00:16,640
So in this lecture, we will be continuing our discussion about the intuition behind logistic regression.

2
00:00:17,450 --> 00:00:23,270
Now one thing to be aware of is that logistic regression can be used for multiclass problems as well.

3
00:00:23,930 --> 00:00:29,150
This is relevant for this session, since the dataset we'll be looking at has more than two classes.

4
00:00:29,780 --> 00:00:35,480
Sometimes people call this multinomial logistic regression, which is, in my opinion, a poor name,

5
00:00:35,480 --> 00:00:39,920
since there is no point at which we work with the multinomial distribution.

6
00:00:40,580 --> 00:00:45,170
There are other names as well, such as maximum entropy, which are all equally bad.

7
00:00:45,770 --> 00:00:51,560
My preference is to simply call it multiclass logistic regression or just logistic regression.

8
00:00:52,370 --> 00:00:54,350
In any case, here's how this method works.

9
00:00:58,930 --> 00:01:05,950
Suppose that we have K classes. Then we will have K weight vectors; let's call them w1, w2, and so on

10
00:01:05,980 --> 00:01:07,210
up to wK.

11
00:01:08,080 --> 00:01:14,140
Note that in this case, these are not components of w, but rather each of these is a whole D-sized

12
00:01:14,140 --> 00:01:14,650
vector.

13
00:01:15,580 --> 00:01:22,210
Furthermore, suppose that we also have corresponding bias terms b1, b2, all the way up to bK.

14
00:01:23,080 --> 00:01:29,410
From these, we can then compute K activations using the usual formula w transpose X plus b.

15
00:01:30,040 --> 00:01:33,940
So this gives us a1, a2, all the way up to aK.
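
As a minimal sketch of this step, here is how the K activations could be computed in plain Python. The function name `activations` and the numbers below are made up for illustration; they are not from the lecture.

```python
def activations(weights, biases, x):
    """Return a_k = w_k . x + b_k for each class k.

    weights: list of K weight vectors, each of length D
    biases:  list of K bias terms
    x:       one input vector of length D
    """
    return [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]


# Hypothetical example: K = 3 classes, D = 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.5, -1.0]
x = [2.0, 3.0]

a = activations(W, b, x)
print(a)  # [2.0, 3.5, 4.0]
```

Each entry of `a` is one activation, so the list has length K.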

16
00:01:38,540 --> 00:01:44,510
The next step is the critical part. Instead of applying the sigmoid function, which is only used for

17
00:01:44,510 --> 00:01:45,440
the binary case,

18
00:01:45,830 --> 00:01:47,360
we use the softmax.

19
00:01:48,050 --> 00:01:51,020
The softmax function has the formula you see here.

20
00:01:51,680 --> 00:01:55,310
As with the sigmoid, we interpret the output as a probability.

21
00:01:56,000 --> 00:02:02,360
Specifically, if we have little K in the numerator, then the output is the probability that Y belongs

22
00:02:02,360 --> 00:02:03,800
to class little K.
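
For reference, the standard softmax formula, written in the notation used so far, is given below. The summation index j in the denominator is just a label introduced here.

```latex
p(y = k \mid x) = \frac{\exp(a_k)}{\sum_{j=1}^{K} \exp(a_j)},
\qquad a_k = w_k^\top x + b_k,
\qquad k = 1, \dots, K
```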

23
00:02:04,550 --> 00:02:10,220
Now, if you have any math phobias, please don't worry as you don't have to solve any equations.

24
00:02:10,699 --> 00:02:14,090
The important part is just to be able to see how it works.

25
00:02:14,960 --> 00:02:18,710
Firstly, notice that we exponentiate all the activations.

26
00:02:19,310 --> 00:02:24,470
Therefore, every element becomes positive, which is a requirement for this to be a probability.

27
00:02:25,550 --> 00:02:30,950
The second thing to notice is that in the denominator, we take the sum over all possible values of

28
00:02:30,950 --> 00:02:31,670
the numerator.

29
00:02:32,390 --> 00:02:38,510
The result is that the sum of all the probabilities will equal one, which again is a requirement for

30
00:02:38,510 --> 00:02:40,520
this to be a probability distribution.

31
00:02:41,780 --> 00:02:45,220
The third thing to notice is that there are K such probabilities.

32
00:02:45,770 --> 00:02:51,470
As you can see, the little k in the numerator is an index, and the possible values for it are one

33
00:02:51,470 --> 00:02:52,310
through big K.

34
00:02:53,540 --> 00:02:59,330
So I hope you're convinced that the softmax is an appropriate function for converting activations into

35
00:02:59,330 --> 00:03:02,600
a probability distribution in the multiclass case.
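
To make the three observations concrete, here is a small sketch of the softmax in plain Python. Subtracting the maximum activation before exponentiating is a common numerical-stability trick that is not part of the lecture; it does not change the result. The input values are made up.

```python
import math


def softmax(a):
    """Convert a list of K activations into K probabilities."""
    m = max(a)  # stability trick (an addition here); cancels out in the ratio
    exps = [math.exp(v - m) for v in a]  # exponentiate: every term is positive
    total = sum(exps)                    # denominator: sum over all K terms
    return [e / total for e in exps]     # K values that sum to one


# Hypothetical activations for K = 3 classes, including a negative one.
p = softmax([2.0, -1.0, 0.5])

print(len(p))                 # one probability per class
print(all(v > 0 for v in p))  # positive even though one input was negative
print(sum(p))                 # equals one up to floating-point rounding
```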

36
00:03:07,270 --> 00:03:12,850
So due to the facts about softmax from the previous slide, we should be able to infer what the outputs

37
00:03:12,850 --> 00:03:18,910
of the multiclass logistic regression should look like. In particular, for each of our samples, we

38
00:03:18,910 --> 00:03:22,330
should get K different probabilities that all sum to one.

39
00:03:22,960 --> 00:03:28,840
Thus, if we have N samples and K outputs per sample, then our overall output will be a matrix of

40
00:03:28,840 --> 00:03:35,860
size N by K. As you recall, this is also what we get when we call model.predict_proba in

41
00:03:35,860 --> 00:03:36,280
scikit-learn.

42
00:03:37,210 --> 00:03:43,120
Now let's consider what we would do if we wanted to get the actual class predictions instead of probabilities.

43
00:03:43,930 --> 00:03:48,280
The answer is to take the argmax over all the probabilities for a given sample.

44
00:03:48,850 --> 00:03:53,590
That is, we choose the class k-star that yields the maximum p of y given x.

45
00:03:54,280 --> 00:03:58,540
Note that this will yield a one-dimensional array containing the class labels.
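
Here is a minimal sketch of that argmax step in plain Python. The helper name `predict_labels` and the probability matrix are hypothetical, and the returned labels are simply column indices (0 to K-1).

```python
def predict_labels(proba):
    """For each row of an N x K probability matrix, return the index
    of the column with the largest probability (the argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in proba]


# Hypothetical output for N = 2 samples and K = 3 classes.
proba = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]

labels = predict_labels(proba)
print(labels)  # [1, 0]
```

The N x K matrix goes in and a one-dimensional list of N labels comes out.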

46
00:03:59,080 --> 00:04:03,220
So let's do a short quiz to test your understanding of this concept.

47
00:04:07,910 --> 00:04:09,380
Consider this model output.

48
00:04:09,950 --> 00:04:15,350
It's an array containing the numbers zero point two, zero point five, zero point one and zero point

49
00:04:15,350 --> 00:04:15,890
nine.

50
00:04:16,640 --> 00:04:20,839
The question is: for multiclass logistic regression, is this a valid output?

51
00:04:21,350 --> 00:04:25,430
Please pause the video and take a moment to think about what the answer should be.

52
00:04:30,160 --> 00:04:33,040
OK, so the answer is that this is not a valid output.

53
00:04:33,550 --> 00:04:37,930
Firstly, notice that this is only a one-dimensional array, which doesn't make sense.

54
00:04:38,500 --> 00:04:44,740
As you recall, we expect a two-dimensional array of size N by K. The number of rows is the number of

55
00:04:44,740 --> 00:04:47,920
samples, and the number of columns is the number of classes.

56
00:04:48,550 --> 00:04:53,410
Furthermore, note that these values also don't sum to one, so they couldn't have been the distribution

57
00:04:53,410 --> 00:04:54,520
for a single sample.

58
00:04:59,200 --> 00:05:03,940
OK, so let's do another quiz. This time we do have a two-dimensional array.

59
00:05:04,330 --> 00:05:08,040
It contains the numbers that you see on the slide. As before,

60
00:05:08,050 --> 00:05:10,870
we want to know: for multiclass logistic regression,

61
00:05:11,140 --> 00:05:12,550
is this a valid output?

62
00:05:13,630 --> 00:05:17,560
Please pause the video and take a moment to think about what the answer should be.

63
00:05:22,350 --> 00:05:26,010
OK, so again, the answer is no, this is not a valid output.

64
00:05:26,580 --> 00:05:29,190
Note that this matrix does meet one requirement.

65
00:05:29,490 --> 00:05:33,660
It is a two-dimensional array, which is what we learned about from the first quiz.

66
00:05:34,350 --> 00:05:39,930
However, note that in the second row, we have a negative number, which is not allowed because probabilities

67
00:05:40,050 --> 00:05:41,070
can't be negative.

68
00:05:41,790 --> 00:05:47,010
Furthermore, note that there is no way the softmax can output a negative number if its inputs are

69
00:05:47,010 --> 00:05:47,910
real numbers.

70
00:05:52,450 --> 00:05:58,420
OK, so let's do another quiz. This time you can see that not only do we have a two-dimensional array,

71
00:05:58,690 --> 00:06:04,450
but all the values are positive. As before, we want to know: for multiclass logistic regression,

72
00:06:04,510 --> 00:06:05,920
is this a valid output?

73
00:06:06,460 --> 00:06:10,480
Please pause the video and take a moment to think about what the answer should be.

74
00:06:15,210 --> 00:06:18,900
OK, so again, the answer is no, this is not a valid output.

75
00:06:19,500 --> 00:06:22,380
Note that this matrix does meet several requirements.

76
00:06:22,980 --> 00:06:27,210
Firstly, it is a 2D matrix, which is what we learned about from the first quiz.

77
00:06:27,780 --> 00:06:32,250
Secondly, all the values are valid probabilities, which are between zero and one.

78
00:06:33,090 --> 00:06:38,280
The reason it's not a valid output is that not all the rows sum to one, and hence they are not

79
00:06:38,280 --> 00:06:39,570
valid distributions.

80
00:06:40,080 --> 00:06:45,840
In particular, the third row contains zero point five, zero point seven and zero point one, which

81
00:06:45,840 --> 00:06:47,190
add up to one point three.
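
The three quizzes can be summarized as one check, sketched below in plain Python. The function name `is_valid_output` and the tolerance are assumptions for illustration. Only the first quiz's values and the third row of the last quiz come from the lecture; the other rows are made up, since the slides are not reproduced here.

```python
def is_valid_output(out, tol=1e-9):
    """Check the three requirements: 2D shape, no negatives, rows sum to one."""
    # Requirement 1: a two-dimensional array (a non-empty list of rows).
    if not out or not all(isinstance(row, (list, tuple)) for row in out):
        return False
    k = len(out[0])
    for row in out:
        # Every row must have the same number of classes K.
        if len(row) != k:
            return False
        # Requirement 2: probabilities can't be negative.
        if any(p < 0 for p in row):
            return False
        # Requirement 3: each row must sum to one.
        if abs(sum(row) - 1.0) > tol:
            return False
    return True


quiz1 = [0.2, 0.5, 0.1, 0.9]            # from the lecture: only one-dimensional
quiz2 = [[0.5, 0.5], [-0.2, 1.2]]       # made up: contains a negative number
quiz3 = [[0.2, 0.3, 0.5],               # made up rows, except the third,
         [0.1, 0.1, 0.8],               # which is the one from the lecture
         [0.5, 0.7, 0.1]]               # sums to 1.3
valid = [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8]]

print(is_valid_output(quiz1))  # False
print(is_valid_output(quiz2))  # False
print(is_valid_output(quiz3))  # False
print(is_valid_output(valid))  # True
```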