1
00:00:11,650 --> 00:00:16,900
In this lecture we are going to discuss the binary cross entropy loss, which is the correct loss function

2
00:00:16,900 --> 00:00:20,630
to use when we are doing binary classification.

3
00:00:20,650 --> 00:00:26,140
Let's recap the actual loss function, and then we will use the rest of this lecture to determine how to

4
00:00:26,140 --> 00:00:27,690
arrive at this loss function.

5
00:00:32,860 --> 00:00:38,140
Since we know that this is going to be based in probability, let's think about the associated probability

6
00:00:38,140 --> 00:00:39,270
problem.

7
00:00:39,430 --> 00:00:42,910
We know that the outcomes, the y's, must be binary.

8
00:00:42,910 --> 00:00:49,660
That means the targets in a binary classification problem must be 0 or 1: spam or not spam, fraud or not

9
00:00:49,660 --> 00:00:51,300
fraud, etc.

10
00:00:51,310 --> 00:00:56,410
The next question is: what distribution is used for binary events?

11
00:00:56,410 --> 00:00:58,840
The answer is the Bernoulli distribution.

12
00:01:04,000 --> 00:01:09,120
Probably the most typical example of the Bernoulli distribution is the coin toss.

13
00:01:09,130 --> 00:01:14,560
So imagine if you tossed a coin a bunch of times and you want to calculate the probability of heads.

14
00:01:14,650 --> 00:01:15,960
Call that mu.

15
00:01:16,150 --> 00:01:22,310
Intuitively, we know that mu is the number of heads divided by the total number of coin tosses.

16
00:01:22,360 --> 00:01:28,210
Again we are going to go through the process of maximum likelihood estimation to see how we can arrive

17
00:01:28,210 --> 00:01:33,890
at this answer.

18
00:01:34,080 --> 00:01:34,350
All right.

19
00:01:34,360 --> 00:01:36,720
So let's say we've tossed our coin a bunch of times.

20
00:01:36,730 --> 00:01:40,520
Let's call the results x1, x2, all the way up to xN.

21
00:01:40,600 --> 00:01:45,370
In this case the x's can only take on 2 values: 0 or 1.

22
00:01:45,370 --> 00:01:51,850
Let's say 1 means heads and 0 means tails. In order to calculate the likelihood we need the Bernoulli

23
00:01:51,880 --> 00:01:58,780
PMF, which is the equation you see here. Now there's one small difference between this example and the

24
00:01:58,780 --> 00:02:01,160
previous example on the squared error.

25
00:02:01,450 --> 00:02:07,690
The difference is that for the previous example we were working with heights which are continuous valued.

26
00:02:07,780 --> 00:02:13,660
Now we are working with coin tosses, which are discrete. A coin toss can only give you 0 or 1.

27
00:02:13,660 --> 00:02:16,910
It can't give you, say, 0.5 or 0.9.

28
00:02:17,140 --> 00:02:23,590
So we know from our study of probability that for continuous distributions we use the probability density

29
00:02:23,590 --> 00:02:29,440
function, but for discrete distributions we use the probability mass function, or PMF.

30
00:02:29,530 --> 00:02:34,870
So this is a PMF, and it returns a probability rather than a probability density.
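
For reference, the Bernoulli PMF is usually written as below, where mu is the probability of heads and x is 0 or 1. The slide is not part of this transcript, so treat this as the standard textbook form rather than the exact equation shown on screen.

```latex
p(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}, \qquad x \in \{0, 1\}
```

Plugging in x = 1 gives mu, and plugging in x = 0 gives 1 - mu, which is what we expect for heads and tails.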

31
00:02:40,040 --> 00:02:41,420
Since you already know the steps,

32
00:02:41,420 --> 00:02:42,440
let's get right to it.

33
00:02:43,100 --> 00:02:53,930
We create our likelihood function, which is the product of the PMFs for each of the x's we've collected.

34
00:02:54,060 --> 00:02:59,610
We know that we want to maximize the likelihood but the step we have to do before taking the derivative

35
00:02:59,760 --> 00:03:06,700
is that we want to find the log likelihood. At this point this equation should already look very familiar.

36
00:03:06,720 --> 00:03:10,980
It's in the exact same form as the binary cross entropy, except it's negated.
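
A sketch of the two equations being described, assuming the standard Bernoulli form with N tosses x_1, ..., x_N (each 0 or 1) and probability of heads mu. The notation is ours, since the slides are not in the transcript.

```latex
\begin{aligned}
L(\mu) &= \prod_{i=1}^{N} \mu^{x_i} (1 - \mu)^{1 - x_i} \\
\log L(\mu) &= \sum_{i=1}^{N} \left[ x_i \log \mu + (1 - x_i) \log (1 - \mu) \right]
\end{aligned}
```

The log turns the product into a sum, which is what makes the derivative easy to take.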

37
00:03:16,200 --> 00:03:22,350
As usual, if you were to take the derivative of the log likelihood, set it to zero, and solve for mu,

38
00:03:22,350 --> 00:03:23,580
this is what you would get.
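
The steps can be written out roughly as follows, assuming the standard Bernoulli log likelihood over N tosses x_1, ..., x_N; the symbols here are our own shorthand.

```latex
\begin{aligned}
\frac{d}{d\mu} \sum_{i=1}^{N} \left[ x_i \log \mu + (1 - x_i) \log (1 - \mu) \right]
  &= \frac{1}{\mu} \sum_{i=1}^{N} x_i - \frac{1}{1 - \mu} \sum_{i=1}^{N} (1 - x_i) = 0 \\
(1 - \mu) \sum_{i=1}^{N} x_i &= \mu \left( N - \sum_{i=1}^{N} x_i \right) \\
\mu &= \frac{1}{N} \sum_{i=1}^{N} x_i
\end{aligned}
```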

39
00:03:24,390 --> 00:03:29,250
Curiously, we arrive at the same answer as the Gaussian case: the sample mean of x.

40
00:03:29,310 --> 00:03:30,780
How can this be?

41
00:03:30,840 --> 00:03:31,360
Why isn't it

42
00:03:31,380 --> 00:03:33,910
number of heads divided by N?

43
00:03:33,930 --> 00:03:35,850
In actuality it is.

44
00:03:36,060 --> 00:03:39,870
Remember that X can only take on the values 1 and 0.

45
00:03:39,870 --> 00:03:41,800
So when x is 0 that's tails.

46
00:03:41,820 --> 00:03:48,630
So they don't contribute anything to the sum of x. The sum of x, then, is just the sum of a bunch of ones,

47
00:03:49,050 --> 00:03:51,960
which is the number of times we flipped heads.

48
00:03:51,960 --> 00:03:56,190
So in fact this is exactly the same as number of heads divided by N.
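
We can check this numerically with a minimal sketch. The coin-toss data and the function name `log_likelihood` are made up for illustration; the grid search is only a brute-force sanity check, not how you would normally find the estimate.

```python
import numpy as np

# Hypothetical coin-toss data: 1 = heads, 0 = tails.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])

def log_likelihood(mu, x):
    """Bernoulli log likelihood of the tosses x for a given mu."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

# Closed-form maximum likelihood estimate: the sample mean of x.
mu_mle = x.mean()

# The same quantity, computed as number of heads divided by N.
heads_over_n = np.count_nonzero(x == 1) / len(x)

# Brute-force check: evaluate the log likelihood on a grid of mu values
# and pick the grid point where it is largest.
grid = np.linspace(0.001, 0.999, 999)
mu_grid = grid[np.argmax([log_likelihood(m, x) for m in grid])]

print(mu_mle, heads_over_n, mu_grid)
```

The three printed values should agree, up to the resolution of the grid for the last one.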

49
00:04:01,410 --> 00:04:02,480
As before,

50
00:04:02,490 --> 00:04:08,220
if we take the negative log likelihood and we put it side by side with the binary cross entropy, we can

51
00:04:08,220 --> 00:04:10,070
see the similarities.

52
00:04:10,290 --> 00:04:17,040
We can conclude that what the binary cross entropy is really saying is that y is the result of a coin

53
00:04:17,040 --> 00:04:22,220
toss where the probability of heads for that coin toss is given by y hat.

54
00:04:22,230 --> 00:04:30,460
Analogously, x_i is the result of a coin toss where mu is the probability of heads for that coin

55
00:04:30,460 --> 00:04:31,000
toss.
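
Written side by side, the comparison looks roughly like this. We assume the usual notation: y_i is the 0 or 1 target and \hat{y}_i is the model's predicted probability for sample i.

```latex
\begin{aligned}
\text{Negative log likelihood:} \quad
  & -\sum_{i=1}^{N} \left[ x_i \log \mu + (1 - x_i) \log (1 - \mu) \right] \\
\text{Binary cross entropy:} \quad
  & -\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
\end{aligned}
```

The only structural difference is that mu is a single number shared by every toss, while \hat{y}_i can be different for each sample.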

56
00:04:36,170 --> 00:04:37,040
As a side note,

57
00:04:37,130 --> 00:04:43,010
you want to keep in mind that we also usually divide the sum of the errors by N so that we get the average

58
00:04:43,010 --> 00:04:46,670
binary cross entropy per data point.

59
00:04:46,700 --> 00:04:52,100
This makes it just like the mean squared error in that the value is invariant to the number of samples

60
00:04:52,100 --> 00:04:53,210
we have.

61
00:04:53,390 --> 00:04:59,000
You can imagine that if we have lots of samples and we only use the sum, the error would be very large

62
00:04:59,120 --> 00:05:04,220
only due to the fact that we have a large number of samples. In order to make the value of the error

63
00:05:04,220 --> 00:05:05,300
more meaningful,

64
00:05:05,300 --> 00:05:09,050
we can take the average, and that makes our error more interpretable.
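
Here is a small sketch of that effect. The targets, predictions, and the helper names `bce_sum` and `bce_mean` are invented for illustration; we simply repeat the same data 100 times and compare the two versions of the error.

```python
import numpy as np

def bce_sum(y, y_hat):
    """Summed binary cross entropy over all samples."""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def bce_mean(y, y_hat):
    """Average binary cross entropy per sample (sum divided by N)."""
    return bce_sum(y, y_hat) / len(y)

# Hypothetical targets and predicted probabilities.
y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])

# The same data repeated 100 times: same quality of predictions, more samples.
y_big = np.tile(y, 100)
y_hat_big = np.tile(y_hat, 100)

print("sum: ", bce_sum(y, y_hat), bce_sum(y_big, y_hat_big))
print("mean:", bce_mean(y, y_hat), bce_mean(y_big, y_hat_big))
```

The summed error grows by a factor of 100 even though the predictions are no better or worse, while the mean stays the same.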

65
00:05:14,210 --> 00:05:17,600
So to conclude, what have we done in this lecture?

66
00:05:17,600 --> 00:05:22,910
We have shown that just like the mean squared error the binary cross entropy loss function is based

67
00:05:22,910 --> 00:05:24,320
in probability.

68
00:05:24,620 --> 00:05:30,020
What it says is that for the regression case, where we use the mean squared error, it arises from the

69
00:05:30,020 --> 00:05:36,350
Gaussian distribution. For the binary classification case, where we use the binary cross entropy, it arises

70
00:05:36,350 --> 00:05:38,700
from the Bernoulli distribution.

71
00:05:38,720 --> 00:05:45,590
Importantly, the common pattern between the two is that in both cases the error function is really just

72
00:05:45,590 --> 00:05:52,220
the negative log likelihood, and therefore the solution we end up finding in the end is called the maximum

73
00:05:52,220 --> 00:05:53,540
likelihood solution.
