1
00:00:00,240 --> 00:00:05,790
So now let's talk about loss functions and why they are essential to the training process.

2
00:00:06,870 --> 00:00:09,240
So remember the slide from a previous section?

3
00:00:09,810 --> 00:00:15,960
This is where we input some images into our randomly initialized neural network

4
00:00:16,320 --> 00:00:22,320
and we just got some values, which would essentially be random values, that give us the scores

5
00:00:22,740 --> 00:00:25,080
for each class, each image class.

6
00:00:25,080 --> 00:00:27,480
This one was a six, or a five, or a six, eight, eight, one.

7
00:00:27,960 --> 00:00:33,600
And you can see the random scores generated from the outputs of the randomly initialized neural network.

8
00:00:34,410 --> 00:00:35,880
So what do we need to do?

9
00:00:36,570 --> 00:00:38,760
We know these probabilities are wrong.

10
00:00:39,210 --> 00:00:42,570
The scores are bad, but we don't know how bad they are.

11
00:00:42,600 --> 00:00:48,540
We need to actually find a way to quantify how bad our predictions are or how good they are.

12
00:00:49,260 --> 00:00:54,230
So we'll take a look at the method that's primarily used in convolutional neural networks.

13
00:00:54,240 --> 00:01:00,960
Although this, cross-entropy loss, can be used in any classification problem.

14
00:01:01,530 --> 00:01:07,620
It's not specific to convolutional neural networks. However, because we use CNNs for classification,

15
00:01:08,100 --> 00:01:12,090
it lends itself quite well to our use case at hand.

16
00:01:12,750 --> 00:01:17,200
So imagine these are our classes here for an image

17
00:01:17,200 --> 00:01:17,820
that was input.

18
00:01:18,330 --> 00:01:21,000
And these are the predicted probabilities that come out of it.

19
00:01:21,840 --> 00:01:27,450
So you have 0.1, 0.2, 0.1, some 0.05, 0.3 (notice that this is the

20
00:01:27,450 --> 00:01:30,330
highest one), 0.05 and 0.05.

21
00:01:30,900 --> 00:01:36,270
And look, fortunately, the ground truth for that image was actually a seven.

22
00:01:36,780 --> 00:01:39,060
And the probability was highest for seven.

23
00:01:39,300 --> 00:01:40,050
It was 0.3.

24
00:01:40,620 --> 00:01:42,240
So that's good, right?

25
00:01:42,810 --> 00:01:44,550
Well, technically it is good.

26
00:01:45,060 --> 00:01:51,660
However, we need to find a way to actually use these probabilities to work out a more mathematical

27
00:01:51,690 --> 00:01:55,770
way to measure how good or bad these results really are.

28
00:01:56,100 --> 00:02:01,200
Because even though this is like a perfect result in theory, because the actual outcome is the predicted

29
00:02:01,200 --> 00:02:03,950
class, because the highest probability was for seven.

30
00:02:05,190 --> 00:02:13,860
I find that the 0.2 for class one was quite close, and that's

31
00:02:13,860 --> 00:02:19,890
a bit unsettling, meaning that our CNN might have a problem distinguishing between sevens and ones.

32
00:02:20,700 --> 00:02:23,640
So let's take a look at a formula for cross entropy loss.

33
00:02:24,060 --> 00:02:29,240
Cross-entropy loss basically uses two distributions: the ground truth distribution here,

34
00:02:29,970 --> 00:02:33,360
p(x), and q(x), the predicted distribution.

35
00:02:33,810 --> 00:02:40,590
So in this case here, y is the ground truth and y hat is the predicted distribution,

36
00:02:40,590 --> 00:02:44,340
and the dot is an inner product, and this is the formula we use to compute it.

37
00:02:44,340 --> 00:02:47,100
I will go through this formula in the next slide here.
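For reference, a sketch of the standard cross-entropy definition in the notation described here. The slide itself is not part of this transcript, so the exact layout and the class index i are assumptions:
```latex
H(p, q) = -\sum_{x} p(x) \log q(x)
\qquad
L(y, \hat{y}) = -\, y \cdot \log \hat{y} = -\sum_{i} y_i \log \hat{y}_i
```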

38
00:02:47,820 --> 00:02:49,710
So let's take a look at a simple example.

39
00:02:50,130 --> 00:02:52,620
Let's assume we only are looking at three classes here.

40
00:02:52,980 --> 00:02:57,240
Class zero, class one, class two. Let's pretend it's cats, dogs and penguins.

41
00:02:57,630 --> 00:03:01,860
So these are the probabilities for an image that was input into this.

42
00:03:02,400 --> 00:03:10,770
So if these are cat, dog, penguin, the CNN gives the highest probability

43
00:03:10,770 --> 00:03:14,730
score, 0.6, for it being a dog.

44
00:03:15,480 --> 00:03:18,010
And that corresponds to the ground truth here.

45
00:03:18,030 --> 00:03:18,720
So it's good.

46
00:03:18,960 --> 00:03:20,510
This is a relatively good result.

47
00:03:20,610 --> 00:03:23,310
Yeah, but let's see what the loss actually is.

48
00:03:23,880 --> 00:03:27,360
So by plugging in those values here, this is how we apply the formula.

49
00:03:27,360 --> 00:03:29,070
So this is the distribution here.

50
00:03:29,520 --> 00:03:38,130
So we have zero times log 0.3, plus one times log 0.6, and then, the zero, one,

51
00:03:38,310 --> 00:03:39,270
zero, by the way,

52
00:03:39,270 --> 00:03:40,460
are the ground truth values.

53
00:03:40,470 --> 00:03:41,070
That's the other distribution,

54
00:03:41,370 --> 00:03:46,080
the ground truth distribution. Plus zero times log 0.1 here.

55
00:03:46,920 --> 00:03:52,260
This gives us this value here, and because there's a negative sign in front of it, the negative value

56
00:03:52,260 --> 00:03:54,390
we get in here is cancelled and we get a positive.

57
00:03:54,690 --> 00:03:56,240
So we have 0.222.
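The calculation above can be sketched in Python. This is a minimal illustration, not the course's own code: the function name cross_entropy is hypothetical, and it assumes base-10 logarithms, since that is what reproduces the 0.222 result (natural logs, which most libraries use, give about 0.511 for the same inputs).
```python
import math
def cross_entropy(y_true, y_pred, log=math.log10):
    """Return -sum(y_i * log(y_hat_i)) over all classes."""
    return -sum(t * log(p) for t, p in zip(y_true, y_pred))
# Classes: cat, dog, penguin; the ground truth is "dog" (one-hot encoded).
y_true = [0, 1, 0]
y_pred = [0.3, 0.6, 0.1]
loss = cross_entropy(y_true, y_pred)
print(round(loss, 3))  # 0.222 with base-10 logs
# Same prediction with natural logs (assumption: not what the slide used).
natural_loss = cross_entropy(y_true, y_pred, log=math.log)
# A more confident correct prediction gives a lower loss.
confident_loss = cross_entropy(y_true, [0.05, 0.9, 0.05])
```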

58
00:03:56,850 --> 00:03:59,730
This means that our loss has been quantified.

59
00:04:00,330 --> 00:04:07,560
If this 0.6 was a 0.9, and this was a 0.05 and this was a 0.05, this

60
00:04:07,560 --> 00:04:15,720
score, this loss score, would actually be a lot lower. Think of loss as indicative of error,

61
00:04:15,720 --> 00:04:18,060
so the lower the loss is,

62
00:04:18,060 --> 00:04:20,310
the better our neural network is performing.

63
00:04:20,880 --> 00:04:28,440
Also note that multiclass log loss rewards or penalizes the correct class only, as you can see.

64
00:04:28,660 --> 00:04:32,850
That's because these were zero here, and this is just another way to represent that formula.

65
00:04:34,140 --> 00:04:36,540
So what about other loss functions?

66
00:04:36,540 --> 00:04:37,340
Do they exist?

67
00:04:37,350 --> 00:04:38,970
And yes, they do exist.

68
00:04:39,330 --> 00:04:45,300
And oftentimes loss functions are called cost functions, depending on the field. I think it was in engineering

69
00:04:45,300 --> 00:04:50,890
they use cost functions a lot, and in mathematics they use the loss function terminology a lot.

70
00:04:51,300 --> 00:04:56,730
But it tends to mean the same thing. When we're doing a binary classification problem, we use something

71
00:04:56,730 --> 00:04:59,790
called the binary cross-entropy loss, which is basically the

72
00:04:59,870 --> 00:05:06,590
same thing; however, the formula has a small tweak because there's only one output in that case. For regressions,

73
00:05:06,590 --> 00:05:12,680
regression meaning that instead of predicting a class, we are predicting a value, like, let's suppose

74
00:05:12,680 --> 00:05:18,410
we were to take an image and we needed to get a car value estimate out of it.

75
00:05:18,950 --> 00:05:22,220
So that value that it predicts is a regression.

76
00:05:22,310 --> 00:05:23,840
It's a continuous value.

77
00:05:24,590 --> 00:05:30,770
So for regressions, since we do not have classes to compare, we use something called the mean

78
00:05:30,770 --> 00:05:32,990
squared error, and that's how it's written.

79
00:05:33,000 --> 00:05:38,260
It's a similar type of formula, where we just compare the output distribution to the ground truth

80
00:05:38,270 --> 00:05:41,780
distribution and get a score here.
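As a rough sketch of mean squared error for the car value example: the function name mean_squared_error and all of the numbers below are made up for illustration, not taken from the slide.
```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
# Hypothetical car values (in thousands) for three input images.
actual = [20.0, 35.0, 12.0]
predicted = [22.0, 33.0, 12.0]
mse = mean_squared_error(actual, predicted)
print(mse)  # (4 + 4 + 0) / 3
```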

81
00:05:42,440 --> 00:05:43,640
There are other loss functions,

82
00:05:43,640 --> 00:05:47,270
too: there's L1, L2, hinge loss, mean absolute error.

83
00:05:47,570 --> 00:05:52,910
There's actually a whole host of loss functions, triplet loss and many others that probably escape

84
00:05:52,910 --> 00:05:53,810
my mind right now.

85
00:05:54,260 --> 00:05:57,140
But contrastive loss is another popular one I've used.

86
00:05:57,590 --> 00:06:03,440
But for now, the main one you're going to be working with is cross-entropy loss.

87
00:06:04,880 --> 00:06:08,540
So now that we have a loss, what do we do with that value?

88
00:06:09,440 --> 00:06:11,330
Well, we need to update the weights.

89
00:06:11,330 --> 00:06:18,350
But updating the weights for a CNN isn't a trivial task because, as you can see, there are thousands,

90
00:06:18,350 --> 00:06:19,700
sometimes millions of weights.

91
00:06:20,030 --> 00:06:23,120
How do we know what value to change those random weights to,

92
00:06:23,540 --> 00:06:27,970
so that in the future, in the next iteration, we minimize the loss?

93
00:06:28,100 --> 00:06:32,690
So we need to do an update that allows the loss in the future to be lower.

94
00:06:33,500 --> 00:06:34,490
So how do we do this?

95
00:06:34,970 --> 00:06:40,100
Well, we use a technique called back propagation, which is quite a brilliant technique, which we'll

96
00:06:40,100 --> 00:06:42,050
discuss in the next section.

97
00:06:43,070 --> 00:06:49,220
And we use the loss value for this, so that value is essential for back propagation.

98
00:06:50,120 --> 00:06:55,400
So let's stop there, and in the next section, we'll continue onto back propagation.

99
00:06:56,030 --> 00:06:58,010
So thank you and I'll see you in the next section.
