1
00:00:11,670 --> 00:00:17,010
In this lecture we are going to derive the categorical cross entropy loss function which is what we

2
00:00:17,010 --> 00:00:20,410
use when we are doing multiclass classification.

3
00:00:20,700 --> 00:00:26,430
At this point you probably already know that the categorical cross entropy is the negative log likelihood

4
00:00:26,430 --> 00:00:28,260
for some distribution.

5
00:00:28,320 --> 00:00:33,830
So let's jump right in and try to figure out what this distribution is.

6
00:00:33,940 --> 00:00:39,680
In the case where we only have binary outcomes the correct discrete distribution is the Bernoulli distribution.

7
00:00:40,330 --> 00:00:44,730
So what is the correct distribution for multiple categorical outcomes?

8
00:00:44,740 --> 00:00:50,350
Well unlike the Gaussian and Bernoulli distributions this distribution is not named after a person.

9
00:00:50,500 --> 00:00:54,430
It is simply called, quite appropriately, the categorical distribution.

10
00:00:59,580 --> 00:01:04,030
The usual analogy for the categorical distribution is a die roll.

11
00:01:04,050 --> 00:01:05,360
So when you roll a die,

12
00:01:05,490 --> 00:01:06,950
there are six possible outcomes,

13
00:01:06,960 --> 00:01:10,780
typically one, two, three, all the way up to six.

14
00:01:10,800 --> 00:01:12,500
Of course, there are different kinds of dice

15
00:01:12,510 --> 00:01:15,230
where you may have even more possible outcomes.

16
00:01:15,390 --> 00:01:24,220
In general we will refer to the number of possible outcomes as K.

17
00:01:24,400 --> 00:01:27,580
So what does the categorical distribution look like?

18
00:01:27,580 --> 00:01:35,050
Well, it looks a little strange. As you can see, it is a product. Each of the mu_k's is a parameter telling

19
00:01:35,050 --> 00:01:39,170
us the probability that the result of the die roll was k.

20
00:01:39,190 --> 00:01:43,990
In other words, mu_k is equal to the probability that x is equal to k.

21
00:01:44,140 --> 00:01:48,240
So what about the "1" function in the exponent?

22
00:01:48,280 --> 00:01:51,120
This is what is called an indicator function.

23
00:01:51,130 --> 00:01:57,820
It returns 1 when its argument is true and 0 otherwise. Knowing this, it's easy to confirm that

24
00:01:58,210 --> 00:02:03,920
if we plug in x equals 1, we get that the probability that x is equal to 1 equals mu_1.

25
00:02:04,030 --> 00:02:08,280
The probability that x is equal to 2 is mu_2, and so forth.

26
00:02:08,290 --> 00:02:13,440
In other words, we can confirm that this is indeed the PMF for the categorical distribution.
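The slide is not part of this transcript. As a sketch of the PMF being described, assuming x takes values 1 through K and writing the indicator function as a blackboard-bold 1:

```latex
p(x) = \prod_{k=1}^{K} \mu_k^{\mathbb{1}(x = k)},
\qquad \mu_k = p(x = k),
\qquad \sum_{k=1}^{K} \mu_k = 1
```

Plugging in x = 1, every exponent is 0 except the first, so the product collapses to mu_1.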

27
00:02:18,590 --> 00:02:24,170
As usual, we are going to pretend we've collected N data points, and we are going to do maximum likelihood

28
00:02:24,170 --> 00:02:27,410
estimation to find all the mu's.

29
00:02:27,410 --> 00:02:32,540
What is interesting about this likelihood is that, because the likelihood is the product of PMFs

30
00:02:32,750 --> 00:02:40,280
and the PMF already has a product, we end up with two products in the likelihood. Also notice that there

31
00:02:40,280 --> 00:02:42,890
are two indices in the exponent here.

32
00:02:42,890 --> 00:02:50,710
There is the index i and also the index k. That should remind you of a matrix, because a matrix also

33
00:02:50,710 --> 00:02:51,690
has two indices:

34
00:02:51,700 --> 00:02:52,750
the row and the column.

35
00:02:57,900 --> 00:02:59,460
After taking the log,

36
00:02:59,460 --> 00:03:05,160
we know that we get the log likelihood, and we already know that the negative log likelihood is our categorical

37
00:03:05,160 --> 00:03:06,660
cross entropy.

38
00:03:06,660 --> 00:03:12,000
So if we take the negative of this expression we will get back something that looks exactly like the

39
00:03:12,000 --> 00:03:13,500
categorical cross entropy.
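The derivation on the slides is not captured in the transcript. A sketch of the steps just described, assuming N independent samples x_1, ..., x_N and the PMF written with an indicator in the exponent:

```latex
L = \prod_{i=1}^{N} \prod_{k=1}^{K} \mu_k^{\mathbb{1}(x_i = k)}
\\
\log L = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}(x_i = k) \, \log \mu_k
\\
-\log L = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}(x_i = k) \, \log \mu_k
```

Taking the log turns both products into sums and brings the indicator down from the exponent.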

40
00:03:18,610 --> 00:03:23,410
And thus we can relate this back to our categorical cross entropy loss function.

41
00:03:23,410 --> 00:03:27,850
Notice here that it's a tiny bit different from our log likelihood, in that we don't have an indicator

42
00:03:27,850 --> 00:03:29,820
function anymore.

43
00:03:29,830 --> 00:03:33,750
This is because an equivalent way of writing this is double indexing:

44
00:03:33,760 --> 00:03:42,490
y_ik. y_ik is called a one-hot encoded value because it can only take on the values 1 or 0. Each row

45
00:03:42,520 --> 00:03:44,940
can only have one 1.
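As a sketch of the loss in this double-indexed form, assuming the sum over samples is used (some references divide by N) and writing the model's predicted probability for sample i and class k as y-hat_ik in place of mu_k:

```latex
J = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, \log \hat{y}_{ik},
\qquad y_{ik} = \mathbb{1}(x_i = k)
```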

46
00:03:45,010 --> 00:03:49,640
Remember that i represents the sample index and k represents the class label.

47
00:03:49,960 --> 00:03:52,060
The class label can only be one value.

48
00:03:52,120 --> 00:03:57,650
For instance, if I roll a die, I can only get one number.

49
00:03:57,670 --> 00:04:04,870
It must be six or five or four or three; it can't be six and three at the same time. As another example,

50
00:04:04,960 --> 00:04:09,940
if I have a picture and I'm classifying what type of animal it is, it can only be one type of animal.

51
00:04:10,480 --> 00:04:11,960
If it's a picture of a cat,

52
00:04:12,040 --> 00:04:19,330
it cannot simultaneously be a picture of a dog. Therefore, only one of the K values can have a one representing

53
00:04:19,330 --> 00:04:20,200
the target.

54
00:04:20,320 --> 00:04:22,130
The rest of the values must be 0.

55
00:04:27,300 --> 00:04:32,040
You'll notice that one-hot encoding the targets like this can be quite inefficient.

56
00:04:32,070 --> 00:04:33,750
Let's do a simple example.

57
00:04:34,110 --> 00:04:39,890
Suppose we have four samples, so N is 4, and three classes: cat, dog, or bird.

58
00:04:39,900 --> 00:04:42,750
Suppose the targets are cat, dog, bird, and cat.

59
00:04:43,500 --> 00:04:47,520
So numerically, we can represent this with the vector [1, 2, 3, 1].

60
00:04:49,470 --> 00:04:55,230
But in order to one-hot encode this vector, we must turn it into a four by three matrix, or an N by

61
00:04:55,230 --> 00:04:58,650
K matrix containing only zeros and ones.

62
00:04:58,800 --> 00:05:11,060
So the equivalent one-hot encoding is [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0].
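A minimal sketch of building this matrix in NumPy, assuming the labels are shifted to start at 0 (so cat=0, dog=1, bird=2); the variable names are illustrative only:

```python
import numpy as np

# Targets from the example (cat, dog, bird, cat), shifted to 0-based labels.
targets = np.array([0, 1, 2, 0])
N = len(targets)  # number of samples
K = 3             # number of classes

# Start from an N x K matrix of zeros, then put a single 1 in each row.
one_hot = np.zeros((N, K), dtype=int)
for i, k in enumerate(targets):
    one_hot[i, k] = 1

print(one_hot)
```

Each row sums to 1, which matches the point that a sample can belong to only one class.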

63
00:05:11,060 --> 00:05:16,160
Now suppose my prediction probabilities, also an N by K matrix, are as follows.

64
00:05:16,160 --> 00:05:22,280
So for each sample and each class, I have an associated probability. It's the probability that I think

65
00:05:22,280 --> 00:05:24,080
the input belongs to that class.

66
00:05:29,310 --> 00:05:32,070
After applying the categorical cross entropy formula,

67
00:05:32,160 --> 00:05:34,620
we get this value for the cost.

68
00:05:34,620 --> 00:05:39,030
Now you might be thinking that is a whole lot of multiplying by zero.

69
00:05:39,060 --> 00:05:42,190
We already know that anything multiplied by 0 is 0.

70
00:05:42,300 --> 00:05:50,680
So those values don't actually contribute anything to the cost.

71
00:05:50,710 --> 00:05:54,760
The only values that do matter are the ones which match the target.

72
00:05:54,760 --> 00:06:03,640
Remember, the original target is the vector [1, 2, 3, 1]. What I would like to do, ideally, is to only take the

73
00:06:03,640 --> 00:06:07,990
log of these highlighted values and then add them all together.

74
00:06:07,990 --> 00:06:12,240
The question is: can I do that without one-hot encoding the targets first?

75
00:06:17,460 --> 00:06:20,270
And indeed, this is possible in NumPy.

76
00:06:20,280 --> 00:06:26,700
We can use the double indexing trick. NumPy arrays allow you to index using objects such as lists and

77
00:06:26,730 --> 00:06:30,300
other arrays instead of just single integers.

78
00:06:30,300 --> 00:06:35,920
So if I wanted to take my predictions and index only the relevant values, here's what I would do.

79
00:06:36,210 --> 00:06:43,620
I would take y hat at i equals 1 and k equals 1. I would take y hat at i equals 2 and k equals 2.

80
00:06:43,620 --> 00:06:49,860
I would take y hat at i equals 3 and k equals 3, and I would take y hat at i equals 4 and k equals 1.

81
00:06:51,510 --> 00:06:53,760
Using the NumPy double indexing method,

82
00:06:54,060 --> 00:07:00,990
I can actually index all of these values at once. I simply pass in all the row indices and the corresponding

83
00:07:00,990 --> 00:07:03,580
column indices at the same time.

84
00:07:03,630 --> 00:07:09,600
Importantly, note that by doing this I never have to one-hot encode the targets. I can pass them directly

85
00:07:09,990 --> 00:07:14,910
into the square brackets. As an exercise,

86
00:07:14,920 --> 00:07:21,390
you should try this in actual NumPy code to verify that it really works. Just note that in code, we

87
00:07:21,390 --> 00:07:28,450
index starting at 0 instead of 1.
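One way to try the exercise, as a sketch. The prediction matrix on the slide is not in this transcript, so the probabilities below are made up for illustration; the targets are the 0-based version of [1, 2, 3, 1]:

```python
import numpy as np

# Hypothetical prediction probabilities: N=4 samples, K=3 classes, rows sum to 1.
y_hat = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
])

# Original (not one-hot encoded) targets, 0-based: cat, dog, bird, cat.
targets = np.array([0, 1, 2, 0])

# Double indexing: pass all row indices and the matching column indices at once.
rows = np.arange(len(targets))
picked = y_hat[rows, targets]  # -> [0.7, 0.8, 0.6, 0.5]

# Take the log of only the picked values and add them together.
cost = -np.log(picked).sum()
print(picked, cost)
```

The targets go straight into the square brackets as column indices; no N by K matrix of zeros and ones is ever built.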

88
00:07:28,610 --> 00:07:33,110
And so this is the idea behind the sparse categorical cross entropy.

89
00:07:33,110 --> 00:07:39,710
So when you are in TensorFlow 2.0, if you use the regular categorical cross entropy, you are using the

90
00:07:39,710 --> 00:07:46,000
one-hot encoded target, which requires N times K multiplications and additions.

91
00:07:46,130 --> 00:07:52,040
If you use the sparse categorical cross entropy, you are using the original target, which requires only

92
00:07:52,130 --> 00:07:54,930
N multiplications and additions.

93
00:07:55,100 --> 00:08:01,160
Therefore, it's usually better to use the original target and apply the sparse categorical cross entropy

94
00:08:01,160 --> 00:08:01,790
loss.
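To check that the two forms give the same number, here is a small NumPy sketch. It is not TensorFlow's implementation; the function names and the prediction values are hypothetical, and the sum over samples is assumed rather than the mean:

```python
import numpy as np

def categorical_cross_entropy(one_hot_targets, y_hat):
    # Dense form: N*K products, most of them multiplied by 0.
    return -(one_hot_targets * np.log(y_hat)).sum()

def sparse_categorical_cross_entropy(int_targets, y_hat):
    # Sparse form: pick one probability per row using the integer labels.
    rows = np.arange(len(int_targets))
    return -np.log(y_hat[rows, int_targets]).sum()

# Hypothetical predictions for 4 samples and 3 classes.
y_hat = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
])
int_targets = np.array([0, 1, 2, 0])     # original targets, 0-based
one_hot_targets = np.eye(3)[int_targets]  # same targets, one-hot encoded

dense = categorical_cross_entropy(one_hot_targets, y_hat)
sparse = sparse_categorical_cross_entropy(int_targets, y_hat)
print(dense, sparse)
```

Both calls return the same cost; the sparse version just skips the terms that would have been multiplied by 0.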