1
00:00:11,580 --> 00:00:17,490
In this lecture, we are going to discuss another very important topic: multiclass classification.

2
00:00:19,050 --> 00:00:23,760
Previously, we discussed binary classification, which used a sigmoid at the output.

3
00:00:24,390 --> 00:00:27,600
We also had sigmoids for the hidden layer activation functions.

4
00:00:27,660 --> 00:00:33,780
But we just got rid of that in favor of the ReLU, which tends to work better. For the output,

5
00:00:33,810 --> 00:00:38,440
however, if we are doing binary classification, the sigmoid is still the right choice.

6
00:00:39,120 --> 00:00:41,820
Binary classification has many applications.

7
00:00:42,120 --> 00:00:44,340
We can predict disease or no disease.

8
00:00:44,700 --> 00:00:46,500
Is the user fraudulent or not?

9
00:00:47,010 --> 00:00:48,750
Will the user click my ad or not?

10
00:00:49,380 --> 00:00:51,420
Will the user accept a friend request or not?

11
00:00:51,510 --> 00:00:52,950
And so on and so forth.

12
00:00:53,520 --> 00:00:59,400
But there are situations that binary classification cannot handle, such as when we might have multiple

13
00:00:59,400 --> 00:01:00,810
categorical outcomes.

14
00:01:05,880 --> 00:01:07,290
Let's think of some examples.

15
00:01:08,130 --> 00:01:14,670
One example is optical character recognition, or handwriting recognition. If we want to recognize alphanumeric

16
00:01:14,670 --> 00:01:15,390
characters,

17
00:01:15,630 --> 00:01:20,280
then there are at least 36 possibilities: 10 digits and 26 letters.

18
00:01:20,940 --> 00:01:24,840
What about speech recognition? If we want to recognize words,

19
00:01:25,110 --> 00:01:27,780
then each word counts as a separate category.

20
00:01:29,190 --> 00:01:30,990
Image classification is another one.

21
00:01:31,140 --> 00:01:36,690
And for a long time, this was the standard benchmark which proved that deep learning worked better

22
00:01:36,690 --> 00:01:38,310
than other machine learning models.

23
00:01:38,910 --> 00:01:43,830
You may have heard of a famous dataset called the ImageNet dataset, which is made up of millions

24
00:01:43,830 --> 00:01:48,870
of images and a thousand categories, such as bagel, vending machine, and soccer ball.

25
00:01:49,590 --> 00:01:54,300
Every year, there is a contest to see who can build the best ImageNet classifier.

26
00:01:54,480 --> 00:01:56,760
And of course, these are all deep learning models.

27
00:02:01,920 --> 00:02:04,290
So how does multiclass classification work?

28
00:02:05,040 --> 00:02:10,590
Well, suppose we have just calculated the activation at the final layer of our neural network, but

29
00:02:10,590 --> 00:02:12,960
have not yet applied the activation function.

30
00:02:13,650 --> 00:02:21,520
So we have a superscript L equal to W superscript L transpose dot z superscript L minus one,

31
00:02:21,570 --> 00:02:25,380
so that's the z at the previous layer, plus b superscript L.
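
Written out, the expression just read aloud is the one below. The symbol M for the size of the previous layer is an assumption introduced here for the dimensions. It is not named in the lecture.

```latex
a^{(L)} = W^{(L)\top} z^{(L-1)} + b^{(L)},
\qquad
W^{(L)} \in \mathbb{R}^{M \times K},\quad
z^{(L-1)} \in \mathbb{R}^{M},\quad
b^{(L)} \in \mathbb{R}^{K},\quad
a^{(L)} \in \mathbb{R}^{K}
```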

32
00:02:26,970 --> 00:02:33,780
It's important to remember that if the layer size is K, then a superscript L is a vector of size K.

33
00:02:34,710 --> 00:02:41,430
What we would like to do is map these values to probabilities for each of the K possible output categories.

34
00:02:42,510 --> 00:02:44,160
OK, so that's your first lesson here.

35
00:02:44,700 --> 00:02:48,840
If we have K possible outcomes, then we should have K output nodes.

36
00:02:49,620 --> 00:02:56,280
The question now is what function can actually map the vector a superscript L to probability values?

37
00:03:01,360 --> 00:03:04,420
Let's think about what the requirements are for these probabilities.

38
00:03:05,080 --> 00:03:10,320
We need a probability distribution over K distinct values. As usual,

39
00:03:10,330 --> 00:03:15,310
we assign these values the integers zero to K minus one. First,

40
00:03:15,310 --> 00:03:18,040
since they are probabilities, they must be nonnegative.

41
00:03:18,490 --> 00:03:21,340
In other words, they must be greater than or equal to zero.

42
00:03:22,240 --> 00:03:23,980
But there is also an upper limit.

43
00:03:24,640 --> 00:03:26,980
The maximum value of a probability is one.

44
00:03:27,430 --> 00:03:32,710
But more importantly, for this to be a proper probability distribution, all the possible different

45
00:03:32,710 --> 00:03:34,600
values must sum to one.

46
00:03:35,140 --> 00:03:39,250
So the second requirement is that all the probabilities must sum to one.

47
00:03:44,360 --> 00:03:47,690
Luckily, there is a function that accomplishes exactly this.

48
00:03:48,110 --> 00:03:49,790
It is called the softmax function.

49
00:03:50,540 --> 00:03:56,000
First, let's drop the superscript L on the activation a for now, so we can just say a is a vector

50
00:03:56,000 --> 00:03:56,870
of size K.

51
00:03:57,620 --> 00:04:05,090
Then the softmax of this vector a is the exponential of a divided by the sum of the exponentials of a

52
00:04:05,150 --> 00:04:06,380
across each component.

53
00:04:07,720 --> 00:04:10,780
As you can see, this satisfies both of our requirements.

54
00:04:11,500 --> 00:04:14,260
The exponential of any number is always positive.

55
00:04:14,320 --> 00:04:17,620
So therefore, each probability must be nonnegative.

56
00:04:18,760 --> 00:04:23,650
Second, because the denominator is just the sum of all the possible values of the numerator,

57
00:04:24,010 --> 00:04:29,770
if we sum all the possible values of the numerator, we just get the same sum, and therefore we

58
00:04:29,770 --> 00:04:33,670
get the same sum on the top and the bottom, and they cancel out to give us one.
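
The definition and the two requirements above can be checked with a minimal pure-Python sketch. The activation values are hypothetical. Subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the lecture, and it leaves the result unchanged because it cancels in the ratio.

```python
import math

def softmax(a):
    """Map a list of K real activations to K probabilities."""
    # Stability step (an addition, not from the lecture): shifting every
    # activation by the same constant cancels between top and bottom.
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    # Each numerator is divided by the sum of all numerators.
    return [e / total for e in exps]

a = [2.0, 1.0, 0.1]  # hypothetical final-layer activations, K = 3
p = softmax(a)

print(p)       # K values, each strictly between 0 and 1
print(sum(p))  # 1, up to floating-point rounding
```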

59
00:04:38,760 --> 00:04:43,020
Now that we know how the softmax works, let's answer the most important question.

60
00:04:43,200 --> 00:04:45,000
How do we use it in TensorFlow?

61
00:04:45,780 --> 00:04:50,010
Well, just like most of the other functions we've applied so far, it's very simple.

62
00:04:50,490 --> 00:04:52,590
We just pass in the string "softmax".

63
00:04:53,250 --> 00:04:56,310
In other words, the only requirement is that you spell it correctly.

64
00:04:57,060 --> 00:04:59,700
Of course, you can always implement the softmax yourself.

65
00:04:59,970 --> 00:05:01,800
But in TensorFlow, there's no need.

66
00:05:03,180 --> 00:05:06,900
As a side note, the softmax is considered an activation function.

67
00:05:06,960 --> 00:05:13,110
But unlike the ReLU, sigmoid, and tanh, it is not meant for hidden layer activations.

68
00:05:14,490 --> 00:05:19,920
If you want to try using the softmax as a hidden layer activation, you are most welcome to, but you'll

69
00:05:19,920 --> 00:05:21,840
generally find that it doesn't work that well.

70
00:05:22,650 --> 00:05:28,020
So the softmax is technically an activation function, but we normally only use it when we're trying

71
00:05:28,020 --> 00:05:33,810
to get an output probability from a vector of activation values, such as at the end of a neural network.

72
00:05:38,950 --> 00:05:43,300
The last thing I want to do in this lecture is summarize the three types of tasks we've encountered

73
00:05:43,300 --> 00:05:48,430
so far that neural networks can perform and their corresponding activation functions.

74
00:05:49,380 --> 00:05:53,700
Two of them you already saw in the previous section of this course, and now we have a third.

75
00:05:54,600 --> 00:05:58,410
So first we have regression where we use no activation function at all.

76
00:05:59,070 --> 00:06:04,380
Sometimes we also call this the identity function, which is just a fancy way of saying a function that

77
00:06:04,380 --> 00:06:05,520
returns its input.

78
00:06:06,930 --> 00:06:11,490
Second, we have binary classification where we use the sigmoid activation function.

79
00:06:12,270 --> 00:06:13,230
Using the sigmoid,

80
00:06:13,240 --> 00:06:19,230
we only need one output node because, once we know the probability of the output being one, the probability

81
00:06:19,230 --> 00:06:21,840
of the output being zero is just one minus that.

82
00:06:22,680 --> 00:06:27,300
This is the same type of thing you see when you're looking at the Bernoulli distribution versus the categorical

83
00:06:27,300 --> 00:06:28,050
distribution.

84
00:06:29,660 --> 00:06:34,580
The Bernoulli distribution is a categorical distribution where there can only be two outcomes.

85
00:06:34,850 --> 00:06:36,620
Thus, it only has one parameter.

86
00:06:37,400 --> 00:06:43,270
The categorical distribution, on the other hand, if it has K outcomes, also has K parameters.

87
00:06:44,240 --> 00:06:48,530
And that brings us to our third scenario, which is multiclass classification.

88
00:06:49,220 --> 00:06:51,980
For this, we use the softmax activation function.

89
00:06:57,010 --> 00:07:02,710
One important thing to mention that sometimes confuses people is that these three activation functions

90
00:07:02,710 --> 00:07:08,530
apply whether you are using a feedforward neural network, a linear model or even a more complicated

91
00:07:08,530 --> 00:07:10,120
network like a CNN or an RNN.

92
00:07:11,050 --> 00:07:16,510
It doesn't matter what kind of model you have, these three activation functions always correspond to

93
00:07:16,510 --> 00:07:17,560
these three tasks.

94
00:07:18,220 --> 00:07:23,740
So, for example, if I don't want to use a neural network but just a simple linear classifier for multiple

95
00:07:23,740 --> 00:07:28,300
classes, then I would have just a single dense layer with a softmax activation.

96
00:07:28,990 --> 00:07:31,240
This is still called logistic regression, by the way.

97
00:07:31,300 --> 00:07:33,430
It's just not binary logistic regression.

98
00:07:33,910 --> 00:07:36,730
Instead, it's called multiclass logistic regression.

99
00:07:38,290 --> 00:07:43,090
It's important to remember that what came before the final layer in the model is not related to the

100
00:07:43,090 --> 00:07:43,690
task.

101
00:07:44,260 --> 00:07:49,030
You can have a CNN or an RNN or any other kind of complicated neural network.

102
00:07:49,450 --> 00:07:55,090
It is still the case that the final layer of the network will be a dense layer with one of these three

103
00:07:55,090 --> 00:07:58,390
activation functions corresponding to the task being done.

104
00:08:03,480 --> 00:08:08,970
Another important thing to remember is that the softmax function is more general. Whereas the sigmoid

105
00:08:08,970 --> 00:08:11,250
can only handle binary classification,

106
00:08:11,700 --> 00:08:17,430
the softmax function can handle multiclass classification, which includes binary classification.

107
00:08:18,030 --> 00:08:23,010
So it's possible, if you want, to just use the softmax for all types of classification.

108
00:08:23,520 --> 00:08:27,060
If you set K equals two, then you're doing binary classification.

109
00:08:27,810 --> 00:08:32,820
It's just that in this scenario, your outputs are a bit redundant because you already know that they

110
00:08:32,820 --> 00:08:36,179
both sum to one and hence only one of them is really needed.
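
The redundancy for K equals two can be seen in a small pure-Python sketch. The activation values are hypothetical, and the identity used in the comment (the second softmax output equals the sigmoid of the difference of the two activations) is standard algebra that the lecture does not state explicitly.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(a):
    exps = [math.exp(x) for x in a]
    total = sum(exps)
    return [e / total for e in exps]

a0, a1 = 0.3, 1.7  # hypothetical activations of the two output nodes
p0, p1 = softmax([a0, a1])

# exp(a1) / (exp(a0) + exp(a1)) = 1 / (1 + exp(-(a1 - a0))),
# so the second output is a sigmoid of the difference a1 - a0.
print(p1, sigmoid(a1 - a0))
# The first output carries no extra information: it is one minus the second.
print(p0, 1.0 - p1)
```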

