1
00:00:11,670 --> 00:00:19,780
In this lecture we are going to discuss another very important topic: multiclass classification. Previously,

2
00:00:19,790 --> 00:00:24,470
we discussed binary classification, which used a sigmoid at the output.

3
00:00:24,470 --> 00:00:30,260
We also had the sigmoid for the hidden layer activation functions, but we just got rid of that in favor of

4
00:00:30,260 --> 00:00:33,800
the ReLU, which tends to work better. For the output,

5
00:00:33,800 --> 00:00:40,400
however, if we are doing binary classification, the sigmoid is still the right choice. Binary classification

6
00:00:40,400 --> 00:00:42,060
has many applications.

7
00:00:42,200 --> 00:00:46,790
We can predict disease or no disease. Is the user fraudulent or not?

8
00:00:47,090 --> 00:00:49,310
Will the user click my ad or not?

9
00:00:49,460 --> 00:00:51,470
Will the user accept a friend request or not?

10
00:00:51,620 --> 00:00:52,990
And so on and so forth.

11
00:00:53,540 --> 00:00:59,390
But there are situations that binary classification cannot handle, such as when we might have multiple

12
00:00:59,390 --> 00:01:00,860
categorical outcomes.

13
00:01:05,970 --> 00:01:08,120
Let's think of some examples.

14
00:01:08,190 --> 00:01:12,600
One example is optical character recognition or handwriting recognition.

15
00:01:12,690 --> 00:01:19,230
If we want to recognize alphanumeric characters, then there are at least 36 possibilities: 10 digits and

16
00:01:19,230 --> 00:01:21,010
26 letters.

17
00:01:21,030 --> 00:01:23,020
What about speech recognition?

18
00:01:23,190 --> 00:01:30,740
If we want to recognize words, then each word counts as a separate category. Image classification is another

19
00:01:30,740 --> 00:01:36,710
one, and for a long time this was the standard benchmark which proved that deep learning works better

20
00:01:36,710 --> 00:01:38,850
than other machine learning models.

21
00:01:38,990 --> 00:01:43,850
You may have heard of a famous dataset called the ImageNet dataset, which is made up of millions

22
00:01:43,850 --> 00:01:49,440
of images in a thousand categories, such as bagel, vending machine, and soccer ball.

23
00:01:49,700 --> 00:01:55,460
Every year there is a contest to see who can build the best ImageNet classifier, and of course these

24
00:01:55,460 --> 00:01:56,840
are all deep learning models.

25
00:02:01,980 --> 00:02:05,110
So how does multiclass classification work?

26
00:02:05,130 --> 00:02:10,710
Well, suppose we have just calculated the activation at the final layer of our neural network, but have

27
00:02:10,710 --> 00:02:13,680
not yet applied the activation function.

28
00:02:13,680 --> 00:02:21,500
So we have a superscript L equal to W superscript L transpose dot z superscript L minus 1,

29
00:02:21,630 --> 00:02:29,040
so that's the z at the previous layer, plus b superscript L. It's important to remember that if the layer

30
00:02:29,040 --> 00:02:37,530
size is K, then a superscript L is a vector of size K. What we would like to do is map these values to

31
00:02:37,530 --> 00:02:42,570
probabilities for each of the K possible output categories.

32
00:02:42,570 --> 00:02:44,160
Okay so that's your first lesson here.

33
00:02:44,760 --> 00:02:49,670
If we have K possible outcomes then we should have K output nodes.

34
00:02:49,680 --> 00:02:56,310
The question now is: what function can actually map the vector a superscript L to probability values?

35
00:03:01,450 --> 00:03:05,170
Let's think about what the requirements are for these probabilities.

36
00:03:05,170 --> 00:03:09,610
We need a probability distribution over K distinct values.

37
00:03:09,610 --> 00:03:16,660
As a side note, we assign these values the integers 0 to K minus 1. First, since they are probabilities,

38
00:03:16,660 --> 00:03:18,580
they must be non-negative.

39
00:03:18,580 --> 00:03:21,370
In other words they must be greater than or equal to zero.

40
00:03:22,270 --> 00:03:24,600
But there is also an upper limit.

41
00:03:24,700 --> 00:03:27,330
The maximum value of a probability is 1.

42
00:03:27,460 --> 00:03:33,370
But more importantly, for this to be a proper probability distribution, the probabilities of all the possible values

43
00:03:33,520 --> 00:03:34,630
must sum to 1.

44
00:03:35,200 --> 00:03:39,250
So the second requirement is that all the probabilities must sum to 1.

45
00:03:44,420 --> 00:03:48,200
Luckily, there is a function that accomplishes exactly this.

46
00:03:48,200 --> 00:03:54,470
It is called the softmax function. First, let's drop the superscript L on the activation a for now, so

47
00:03:54,470 --> 00:04:03,110
we can just say a is a vector of size K. Then the softmax of this vector a is the exponential of a divided

48
00:04:03,110 --> 00:04:06,400
by the sum of the exponential of a across each component.

49
00:04:07,790 --> 00:04:11,420
As you can see this satisfies both of our requirements.

50
00:04:11,540 --> 00:04:18,830
The exponential of any number is always positive, so each probability must be non-negative.

51
00:04:18,830 --> 00:04:23,930
Second, because the denominator is just the sum of all the possible values of the numerator,

52
00:04:24,080 --> 00:04:29,930
if we sum all the possible values of the numerator we just get the same sum, and therefore we get

53
00:04:29,930 --> 00:04:33,710
the same sum on the top and the bottom, and they cancel out to give us 1.
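
To make the two requirements concrete, here is a minimal sketch of the softmax in plain Python. It is an illustration only, not the code from the lecture: the function name `softmax`, the max-subtraction trick, and the example values are assumptions added here.

```python
import math

def softmax(a):
    """Map a list of K real-valued activations to K probabilities."""
    # Subtracting the max leaves the result unchanged (it cancels in the
    # ratio) but keeps exp() from overflowing on large activations.
    m = max(a)
    exps = [math.exp(x - m) for x in a]   # every term is strictly positive
    total = sum(exps)                     # same sum appears in each denominator
    return [e / total for e in exps]

a = [2.0, 1.0, -1.0]   # hypothetical pre-activation values, K = 3
p = softmax(a)
print(p)               # K non-negative values
print(sum(p))          # sums to 1, up to floating-point rounding
```

Larger activations get larger probabilities, so the ordering of the inputs is preserved in the outputs.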

54
00:04:38,820 --> 00:04:40,950
Now that we know how the softmax works,

55
00:04:40,950 --> 00:04:44,750
let's answer the most important question: how do we use it in PyTorch?

56
00:04:45,390 --> 00:04:49,540
Well, just like most other functions we've applied so far, it's very simple.

57
00:04:49,590 --> 00:04:52,930
There is an object we instantiate called Softmax.

58
00:04:53,070 --> 00:04:56,710
In other words the only requirement is that you spell it correctly.

59
00:04:56,760 --> 00:05:01,130
Of course you can always implement the softmax yourself, but in PyTorch there's no need.
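
As a rough illustration of instantiating that object, the sketch below uses `torch.nn.Softmax`. The input tensor and the choice of `dim=1` (normalizing across the K outputs of each sample in a batch) are assumptions for this example, not something shown in the lecture.

```python
import torch
import torch.nn as nn

# dim=1 normalizes across the K outputs of each sample (rows of the batch).
softmax = nn.Softmax(dim=1)

# Hypothetical final-layer activations: 2 samples, K = 3 classes.
a = torch.tensor([[2.0, 1.0, -1.0],
                  [0.0, 0.0, 0.0]])

p = softmax(a)
print(p)             # each row holds K probabilities
print(p.sum(dim=1))  # each row sums to 1
```

Equal activations, as in the second row, give a uniform distribution over the K classes.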

60
00:05:02,340 --> 00:05:08,670
As a side note, the softmax is considered an activation function, but unlike the ReLU, sigmoid, and

61
00:05:08,670 --> 00:05:15,930
tanh, it is not meant for hidden layer activations. If you want to try using the softmax as a hidden

62
00:05:15,930 --> 00:05:21,760
layer activation, you are most welcome to, but you'll generally find that it doesn't work that well.

63
00:05:21,760 --> 00:05:27,100
So the softmax is technically an activation function, but we normally only use it when we're trying

64
00:05:27,100 --> 00:05:32,920
to get an output probability from a vector of activation values, such as at the end of a neural network.

65
00:05:38,110 --> 00:05:40,780
Now that you know how to use the softmax in PyTorch,

66
00:05:40,900 --> 00:05:46,030
the next thing I'm going to tell you is to not use it. Just like we learned earlier with the sigmoid

67
00:05:46,030 --> 00:05:51,580
and the binary cross-entropy, PyTorch combines those two functions into a single loss for numerical

68
00:05:51,580 --> 00:05:52,730
stability.

69
00:05:52,870 --> 00:05:59,320
So PyTorch is going to combine the softmax function with the cross-entropy loss into a single loss,

70
00:05:59,650 --> 00:06:06,370
which is simply called CrossEntropyLoss. PyTorch does in fact have a standalone categorical loss, which

71
00:06:06,370 --> 00:06:10,910
is called NLLLoss, but that is outside the scope of this course.

72
00:06:11,110 --> 00:06:17,260
In the example below, we've implemented logistic regression, but instead of being a binary logistic regression,

73
00:06:17,560 --> 00:06:22,290
it's now a multiclass logistic regression that predicts K different categories.

74
00:06:22,540 --> 00:06:26,540
As you can see the model is still just a basic linear layer.

75
00:06:26,890 --> 00:06:37,560
All the magic really happens inside the CrossEntropyLoss object.
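
The slide code is not part of this transcript, so the following is only a sketch of what a multiclass logistic regression along these lines might look like. The sizes `N`, `D`, `K`, the random data, the SGD optimizer, and the learning rate are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, D, K = 8, 4, 3                  # hypothetical: 8 samples, 4 features, 3 classes

X = torch.randn(N, D)              # made-up inputs
y = torch.randint(0, K, (N,))      # integer class labels in 0..K-1

model = nn.Linear(D, K)            # just a linear layer: it outputs raw activations
criterion = nn.CrossEntropyLoss()  # applies the softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(20):
    optimizer.zero_grad()
    logits = model(X)              # shape (N, K), no softmax applied here
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Apply the softmax explicitly only when probabilities are actually needed.
with torch.no_grad():
    probs = torch.softmax(model(X), dim=1)
    preds = probs.argmax(dim=1)

print(losses[0], losses[-1])
print(preds)
```

Note that the targets are plain integer labels, not one-hot vectors, and that the model itself contains no softmax layer.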

76
00:06:37,650 --> 00:06:41,940
The last thing I want to do in this lecture is summarize the three types of tasks we've encountered

77
00:06:41,940 --> 00:06:48,760
so far that neural networks can perform and their corresponding activation functions. Two of them you

78
00:06:48,760 --> 00:06:51,070
already saw in the previous section of this course.

79
00:06:51,130 --> 00:06:53,250
And now we have a third.

80
00:06:53,290 --> 00:06:57,690
So first we have regression, where we use no activation function at all.

81
00:06:57,760 --> 00:07:03,040
Sometimes we also call this the identity function which is just a fancy way of saying a function that

82
00:07:03,040 --> 00:07:05,680
returns its input.

83
00:07:05,680 --> 00:07:11,900
Second, we have binary classification, where we use the sigmoid activation function. Using the sigmoid,

84
00:07:11,920 --> 00:07:17,890
we only need one output node, because once we know the probability of the output being 1, the probability

85
00:07:17,890 --> 00:07:20,940
of the output being 0 is just 1 minus that.

86
00:07:21,370 --> 00:07:25,300
This is the same type of thing you see when you're looking at the Bernoulli distribution versus the

87
00:07:25,300 --> 00:07:32,130
categorical distribution. The Bernoulli distribution is a categorical distribution where there can only

88
00:07:32,130 --> 00:07:33,480
be two outcomes.

89
00:07:33,540 --> 00:07:35,710
Thus it only has one parameter.

90
00:07:36,090 --> 00:07:42,990
The categorical distribution, on the other hand, if it has K outcomes, also has K parameters.

91
00:07:42,990 --> 00:07:47,780
And that brings us to our third scenario, which is multiclass classification.

92
00:07:47,880 --> 00:07:50,640
For this, we use the softmax activation function.

93
00:07:55,730 --> 00:08:01,370
One important thing to mention that sometimes confuses people is that these three activation functions

94
00:08:01,370 --> 00:08:07,550
apply whether you are using a feedforward neural network, a linear model, or even a more complicated network

95
00:08:07,550 --> 00:08:08,980
like a CNN or an RNN.

96
00:08:09,740 --> 00:08:15,170
It doesn't matter what kind of model you have; these three activation functions always correspond to

97
00:08:15,170 --> 00:08:16,910
these three tasks.

98
00:08:16,910 --> 00:08:22,400
So for example if I don't want to use a neural network but just a simple linear classifier for multiple

99
00:08:22,400 --> 00:08:27,650
classes, then I would have just a single dense layer with a softmax activation.

100
00:08:27,650 --> 00:08:30,020
This is still called logistic regression by the way.

101
00:08:30,020 --> 00:08:32,090
It's just not binary logistic regression.

102
00:08:32,600 --> 00:08:39,230
Instead, it's called multiclass logistic regression. It's important to remember that what came before

103
00:08:39,230 --> 00:08:42,790
the final layer in the model is not related to the task.

104
00:08:42,950 --> 00:08:48,090
You can have a CNN or an RNN or any other kind of complicated neural network.

105
00:08:48,140 --> 00:08:53,750
It is still the case that the final layer of the network will be a dense layer with one of these three

106
00:08:53,750 --> 00:08:57,230
activation functions corresponding to the task being done.

107
00:09:02,200 --> 00:09:07,630
Another important thing to remember is that the softmax function is more general. Whereas the sigmoid

108
00:09:07,630 --> 00:09:10,260
can only handle binary classification,

109
00:09:10,390 --> 00:09:16,690
the softmax function can handle multiclass classification, which includes binary classification.

110
00:09:16,690 --> 00:09:22,070
So it's possible, if you want, to just use the softmax for all types of classification.

111
00:09:22,240 --> 00:09:26,400
If you set K equals 2, then you're doing binary classification.

112
00:09:26,500 --> 00:09:31,450
It's just that in this scenario your outputs are a bit redundant because you already know that they

113
00:09:31,450 --> 00:09:34,830
both sum to one and hence only one of them is really needed.
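
The redundancy in the K equals 2 case can be checked numerically. The sketch below is an added illustration with made-up activations `a0` and `a1`; it uses the identity that a two-way softmax gives the same probability for class 1 as a sigmoid applied to the difference `a1 - a0`.

```python
import math

def softmax(a):
    """Softmax over a list of activations, with the max subtracted for stability."""
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a0, a1 = 0.3, 1.7                 # hypothetical activations of the two output nodes
p0, p1 = softmax([a0, a1])

print(p1, sigmoid(a1 - a0))       # the two values agree up to rounding
print(p0, 1.0 - p1)               # the second output carries no extra information
```

So with two output nodes, only the difference between the activations matters, which is why a single sigmoid output is enough for binary classification.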