1
00:00:11,550 --> 00:00:17,520
In this lecture, we are going to discuss another very important topic, multiclass classification.

2
00:00:19,080 --> 00:00:23,770
Previously, we discussed binary classification, which used a sigmoid at the output.

3
00:00:24,420 --> 00:00:30,120
We also had sigmoids as hidden layer activation functions, but we just got rid of those in favor

4
00:00:30,120 --> 00:00:33,780
of the ReLU, which tends to work better. For the output,

5
00:00:33,810 --> 00:00:38,480
however, if we are doing binary classification, the sigmoid is still the right choice.

6
00:00:39,120 --> 00:00:41,820
Binary classification has many applications.

7
00:00:42,150 --> 00:00:44,370
We can predict disease or no disease.

8
00:00:44,700 --> 00:00:46,480
Is the user fraudulent or not?

9
00:00:47,040 --> 00:00:48,780
Will the user click my ad or not?

10
00:00:49,380 --> 00:00:51,510
Will the user accept a friend request or not?

11
00:00:51,510 --> 00:00:53,010
And so on and so forth.

12
00:00:53,520 --> 00:00:59,400
But there are situations that binary classification cannot handle, such as when we might have multiple

13
00:00:59,400 --> 00:01:00,840
categorical outcomes.

14
00:01:05,910 --> 00:01:12,090
Let's think of some examples. One example is optical character recognition, or handwriting recognition.

15
00:01:12,600 --> 00:01:19,020
If we want to recognize alphanumeric characters, then there are at least 36 possibilities: 10 digits

16
00:01:19,020 --> 00:01:20,320
and 26 letters.

17
00:01:20,970 --> 00:01:22,470
What about speech recognition?

18
00:01:23,070 --> 00:01:27,770
If we want to recognize words, then each word counts as a separate category.

19
00:01:29,160 --> 00:01:31,000
Image classification is another one.

20
00:01:31,140 --> 00:01:36,720
And for a long time, this was the standard benchmark which proved that deep learning worked better

21
00:01:36,720 --> 00:01:38,330
than other machine learning models.

22
00:01:38,910 --> 00:01:43,830
You may have heard of a famous dataset called the ImageNet dataset, which is made up of millions

23
00:01:43,830 --> 00:01:48,910
of images and a thousand categories, such as bagel, vending machine, and soccer ball.

24
00:01:49,560 --> 00:01:54,300
Every year there is a contest to see who can build the best ImageNet classifier.

25
00:01:54,480 --> 00:01:56,820
And of course, these are all deep learning models.

26
00:02:01,890 --> 00:02:04,350
So how does multiclass classification work?

27
00:02:05,070 --> 00:02:10,740
Well, suppose we have just calculated the activation at the final layer of our neural network but have

28
00:02:10,740 --> 00:02:13,000
not yet applied the activation function.

29
00:02:13,620 --> 00:02:21,540
So we have a superscript L equal to W superscript L transpose dot z superscript L minus one,

30
00:02:21,570 --> 00:02:25,380
so that's the z at the previous layer, plus b superscript L.

31
00:02:26,940 --> 00:02:33,780
It's important to remember that if the layer size is K, then a superscript L is a vector of size K.

32
00:02:34,710 --> 00:02:41,460
What we would like to do is map these values to probabilities for each of the K possible output categories.

33
00:02:42,450 --> 00:02:44,180
OK, so that's your first lesson here.

34
00:02:44,670 --> 00:02:48,870
If we have K possible outcomes, then we should have K output nodes.

35
00:02:49,620 --> 00:02:56,310
The question now is what function can actually map the vector a superscript L to probability values.

36
00:03:01,390 --> 00:03:04,460
Let's think about what the requirements are for these probabilities.

37
00:03:05,080 --> 00:03:08,560
We need a probability distribution over K distinct values.

38
00:03:09,520 --> 00:03:15,340
As a side note, we assign these values the integers zero to K minus one.

39
00:03:15,340 --> 00:03:18,090
First, since they are probabilities, they must be non-negative.

40
00:03:18,490 --> 00:03:21,360
In other words, they must be greater than or equal to zero.

41
00:03:22,270 --> 00:03:24,010
But there is also an upper limit.

42
00:03:24,640 --> 00:03:27,000
The maximum value of a probability is one.

43
00:03:27,400 --> 00:03:32,740
But more importantly, for this to be a proper probability distribution, all the possible different

44
00:03:32,740 --> 00:03:34,600
values must sum to one.

45
00:03:35,140 --> 00:03:39,250
So the second requirement is that all the probabilities must sum to one.

46
00:03:44,330 --> 00:03:49,780
Luckily, there is a function that accomplishes exactly this. It is called the softmax function.

47
00:03:50,540 --> 00:03:56,150
First, let's drop the superscript L on the activation a for now, so we can just say a is a vector of

48
00:03:56,150 --> 00:03:56,450
size K.

49
00:03:56,450 --> 00:04:04,610
Then the softmax of this vector a is the exponential of a divided by the sum of the exponentials

50
00:04:04,610 --> 00:04:06,440
of a across each component.
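
Written out, the definition just described can be rendered as follows. This is a sketch of the spoken formula; the component indices i and j are notation assumed here, since the lecture does not name them.

```latex
\mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_{j=0}^{K-1} \exp(a_j)}, \qquad i = 0, \dots, K-1
```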

51
00:04:07,750 --> 00:04:14,310
As you can see, this satisfies both of our requirements. First, the exponential of any real number is always positive,

52
00:04:14,320 --> 00:04:17,680
so therefore each probability must be non-negative.

53
00:04:18,760 --> 00:04:23,660
Second, because the denominator is just the sum of all the possible values of the numerator,

54
00:04:24,040 --> 00:04:29,770
if we sum all the possible values of the numerator, we just get the same sum, and therefore we

55
00:04:29,770 --> 00:04:33,700
get the same sum on the top and the bottom, and they cancel out to give us one.
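
To make the two requirements concrete, here is a minimal NumPy sketch of the softmax. The function name, the example activations, and the max-subtraction step are assumptions added for illustration; the lecture only gives the formula itself.

```python
import numpy as np

def softmax(a):
    """Map a vector of K activations to K probabilities."""
    # Subtracting the max does not change the result but avoids overflow
    # (an added numerical-stability step, not part of the lecture's formula).
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

a = np.array([2.0, -1.0, 0.5])   # made-up example activations, K = 3
p = softmax(a)
print(p)          # every entry is non-negative
print(p.sum())    # the entries sum to one (up to floating-point rounding)
```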

56
00:04:38,760 --> 00:04:44,640
Now that we know how the softmax works, let's answer the most important question: how do we use it in

57
00:04:44,640 --> 00:04:44,940
TensorFlow?

58
00:04:45,780 --> 00:04:50,040
Well, just like most of the other functions we've applied so far, it's very simple.

59
00:04:50,520 --> 00:04:52,600
We just pass in the string "softmax".

60
00:04:53,220 --> 00:04:56,320
In other words, the only requirement is that you spell it correctly.

61
00:04:57,030 --> 00:05:01,860
Of course, you can always implement the softmax yourself, but in TensorFlow, there's no need.

62
00:05:03,150 --> 00:05:06,930
As a side note, the softmax is considered an activation function.

63
00:05:07,000 --> 00:05:13,140
But unlike the ReLU, sigmoid, and tanh, it is not meant for hidden layer activations.

64
00:05:14,460 --> 00:05:19,080
If you want to try using the softmax as a hidden layer activation, you are most welcome to.

65
00:05:19,440 --> 00:05:21,870
But you'll generally find that it doesn't work that well.

66
00:05:22,650 --> 00:05:28,050
So the softmax is technically an activation function, but we normally only use it when we're trying

67
00:05:28,050 --> 00:05:33,840
to get an output probability from a vector of activation values such as at the end of a neural network.

68
00:05:38,980 --> 00:05:43,300
The last thing I want to do in this lecture is summarize the three types of tasks we've encountered

69
00:05:43,300 --> 00:05:48,460
so far that neural networks can perform, and their corresponding activation functions.

70
00:05:49,440 --> 00:05:53,740
Two of them you already saw in the previous section of this course, and now we have a third.

71
00:05:54,600 --> 00:05:58,410
So first we have regression, where we use no activation function at all.

72
00:05:59,070 --> 00:06:04,410
Sometimes we also call this the identity function, which is just a fancy way of saying a function that

73
00:06:04,410 --> 00:06:05,520
returns its input.

74
00:06:06,930 --> 00:06:13,250
Second, we have binary classification, where we use the sigmoid activation function. Using the sigmoid,

75
00:06:13,260 --> 00:06:19,260
we only need one output node because, since we know the probability of the output being one, the probability

76
00:06:19,260 --> 00:06:21,900
of the output being zero is just one minus that.

77
00:06:22,710 --> 00:06:26,670
This is the same type of thing you see when you're looking at the Bernoulli distribution versus the

78
00:06:26,670 --> 00:06:28,080
categorical distribution.

79
00:06:29,690 --> 00:06:35,180
The Bernoulli distribution is a categorical distribution where there can only be two outcomes; thus

80
00:06:35,180 --> 00:06:38,840
it only has one parameter. The categorical distribution,

81
00:06:38,840 --> 00:06:43,300
on the other hand, if it has K outcomes, also has K parameters.

82
00:06:44,240 --> 00:06:48,560
And that brings us to our third scenario, which is multiclass classification.

83
00:06:49,190 --> 00:06:51,970
For this, we use the softmax activation function.

84
00:06:57,070 --> 00:07:02,740
One important thing to mention that sometimes confuses people is that these three activation functions

85
00:07:02,740 --> 00:07:08,560
apply whether you are using a feedforward neural network, a linear model, or even a more complicated

86
00:07:08,560 --> 00:07:13,330
network like a CNN or an RNN. It doesn't matter what kind of model you have.

87
00:07:13,630 --> 00:07:17,590
These three activation functions always correspond to these three tasks.

88
00:07:18,190 --> 00:07:23,380
So, for example, if I don't want to use a neural network, but just a simple linear classifier for

89
00:07:23,380 --> 00:07:28,310
multiple classes, then I would have just a single dense layer with a softmax activation.

90
00:07:28,990 --> 00:07:30,710
This is still called logistic regression,

91
00:07:30,730 --> 00:07:33,410
by the way; it's just not binary logistic regression.

92
00:07:33,880 --> 00:07:36,760
Instead, it's called multiclass logistic regression.
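
As a rough illustration of that single dense layer with a softmax, here is the forward pass of a multiclass logistic regression in plain NumPy. NumPy is used here instead of TensorFlow only to keep the sketch dependency-light, and the sizes N, D, K and the random data are hypothetical. Samples are stored as rows, so the linear step is written X @ W + b rather than W transpose dot z.

```python
import numpy as np

def softmax(a):
    # Row-wise softmax; subtracting the row max is only for numerical stability.
    exp_a = np.exp(a - a.max(axis=1, keepdims=True))
    return exp_a / exp_a.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, K = 4, 5, 3                    # hypothetical: 4 samples, 5 features, 3 classes
X = rng.normal(size=(N, D))          # made-up inputs
W = rng.normal(size=(D, K))          # weights of the single dense layer (untrained)
b = np.zeros(K)                      # biases, one per output node

probs = softmax(X @ W + b)           # shape (N, K): one distribution per sample
predictions = probs.argmax(axis=1)   # predicted class label in 0..K-1
```

The weights here are random, so the predictions are meaningless; the point is only the shape of the computation: K output nodes, each row of probabilities summing to one.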

93
00:07:38,290 --> 00:07:43,120
It's important to remember that what came before the final layer in the model is not related to the

94
00:07:43,120 --> 00:07:43,690
task.

95
00:07:44,260 --> 00:07:49,070
You can have a CNN or an RNN or any other kind of complicated neural network.

96
00:07:49,420 --> 00:07:55,120
It is still the case that the final layer of the network will be a dense layer with one of these three

97
00:07:55,120 --> 00:07:58,450
activation functions corresponding to the task being done.

98
00:08:03,450 --> 00:08:09,000
Another important thing to remember is that the softmax function is more general. Whereas the sigmoid

99
00:08:09,000 --> 00:08:11,260
can only handle binary classification,

100
00:08:11,730 --> 00:08:17,490
the softmax function can handle multiclass classification, which includes binary classification.

101
00:08:18,030 --> 00:08:24,360
So it's possible, if you want, to just use the softmax for all types of classification. If you set K

102
00:08:24,360 --> 00:08:27,060
equal to two, then you're doing binary classification.

103
00:08:27,780 --> 00:08:32,820
It's just that in this scenario, your outputs are a bit redundant because you already know that they

104
00:08:32,820 --> 00:08:36,210
sum to one and hence only one of them is really needed.
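
The redundancy can be checked numerically. The sketch below, with made-up activation values and helper names chosen for illustration, compares a two-output softmax with a sigmoid applied to the difference of the two activations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    exp_a = np.exp(a - np.max(a))  # max subtracted for numerical stability
    return exp_a / np.sum(exp_a)

a = np.array([0.3, 1.7])           # two arbitrary activations, K = 2
p = softmax(a)

# p[1] equals the sigmoid of the difference of the two activations,
# and p[0] is just one minus that.
print(np.isclose(p[1], sigmoid(a[1] - a[0])))
print(np.isclose(p[0], 1.0 - p[1]))
```

So with K equal to two, the second output carries all the information, which is why a single sigmoid output node is enough for binary classification.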