1
00:00:00,120 --> 00:00:07,050
Hi, and welcome to the section on gradient descent. Gradient descent is basically the algorithm or technique

2
00:00:07,050 --> 00:00:11,100
that we use to find the optimal weights in the training process.

3
00:00:11,850 --> 00:00:13,740
So let's take a look at something here.

4
00:00:13,950 --> 00:00:16,080
Remember the loss function we discussed earlier?

5
00:00:16,680 --> 00:00:18,430
How do you find the lowest loss?

6
00:00:18,450 --> 00:00:20,670
How do you get the gradients to get the lowest loss?

7
00:00:21,150 --> 00:00:27,450
We've already seen that backpropagation is the process by which we get the individual weights' gradients.

8
00:00:27,960 --> 00:00:33,870
And basically our goal here is to find the best set of weights where we can get the lowest loss possible.

9
00:00:34,560 --> 00:00:36,140
So now, this is important.

10
00:00:36,150 --> 00:00:39,840
We have to distinguish here the method by which we achieve this goal.

11
00:00:39,930 --> 00:00:46,170
That is, the method in which we use backpropagation to lower the loss, to change the weights to lower

12
00:00:46,170 --> 00:00:53,340
the loss, is called gradient descent, and gradient descent is basically all about us finding the optimal

13
00:00:53,610 --> 00:00:55,140
weights, where the loss is

14
00:00:55,140 --> 00:00:55,560
lowest.

15
00:00:56,010 --> 00:00:59,670
And this is a very basic 2D example of what I'm talking about.

16
00:01:00,060 --> 00:01:02,220
Imagine this is the loss: high loss equals

17
00:01:02,220 --> 00:01:03,120
bad, low loss

18
00:01:03,120 --> 00:01:03,600
equals good.

19
00:01:04,110 --> 00:01:08,310
And this is the point we want to achieve with the weights. These are the values of the weights, and we want

20
00:01:08,310 --> 00:01:09,390
to end up here.

21
00:01:09,510 --> 00:01:12,390
We want to have a neural network with those weights.

22
00:01:13,170 --> 00:01:14,940
So let's take a look at gradients now.

23
00:01:15,840 --> 00:01:18,030
Gradients are the derivative of a function.

24
00:01:18,600 --> 00:01:23,550
So basically, it tells us the rate of change of one variable with respect to another.

25
00:01:23,790 --> 00:01:24,780
In this case here.

26
00:01:25,260 --> 00:01:31,020
Remember, this was error, or loss, and this was the weights, as we just said.

27
00:01:31,260 --> 00:01:36,180
So this tells us how much error changes with respect to changes in the weights.

28
00:01:36,870 --> 00:01:43,200
So let's take a look at what this means. A positive gradient means loss increases if weight increases,

29
00:01:43,920 --> 00:01:47,790
and a negative gradient means loss decreases if weight increases.

30
00:01:48,210 --> 00:01:49,470
That's in this case here.

31
00:01:50,190 --> 00:01:56,550
So at point A, let's look at this guy, point A: if he's moving to the right, that increases

32
00:01:56,550 --> 00:01:58,650
our weights, because you can see the weights go up

33
00:01:58,890 --> 00:02:00,600
as he moves to the right.

34
00:02:01,170 --> 00:02:03,000
And it also decreases our loss.

35
00:02:03,510 --> 00:02:08,400
That means it's going to have a negative gradient here. At point B, though, let's take a look at this:

36
00:02:08,760 --> 00:02:14,070
moving to the right increases our weights and increases our loss, and that's going to be a positive

37
00:02:14,070 --> 00:02:15,930
gradient on this side because it's going up.

38
00:02:16,710 --> 00:02:21,180
So therefore, the negative of the gradient basically tells us the direction we want to be moving.

39
00:02:21,420 --> 00:02:22,380
In this example.
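
[Editor's sketch, not part of the spoken lesson] The sign rule described above can be checked numerically. The loss curve below is a hypothetical parabola chosen only for illustration, and the names loss, gradient, and descent_direction are assumptions, not something shown on the slides.

```python
# Hypothetical 1-D loss used only for illustration:
# a parabola with its minimum at w = 3.
def loss(w):
    return (w - 3.0) ** 2


def gradient(w, eps=1e-6):
    # Central finite difference: approximates d(loss)/dw at w.
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)


def descent_direction(w):
    # Move opposite to the gradient:
    # +1 means "increase w", -1 means "decrease w", 0 means "stay".
    g = gradient(w)
    if g == 0:
        return 0
    return -1 if g > 0 else 1


point_a = 1.0  # left of the minimum: increasing w lowers the loss
point_b = 5.0  # right of the minimum: increasing w raises the loss

print(gradient(point_a), descent_direction(point_a))
print(gradient(point_b), descent_direction(point_b))
```

At point_a the gradient is negative, so the update moves the weight to the right; at point_b it is positive, so the update moves the weight to the left.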

40
00:02:23,910 --> 00:02:25,800
So let's talk a bit more about gradients.

41
00:02:26,640 --> 00:02:33,120
So a point at which the gradient is zero means that small changes to the left and the right don't change

42
00:02:33,120 --> 00:02:33,570
loss.

43
00:02:34,140 --> 00:02:36,540
This is a good thing, but it's also a bad thing.

44
00:02:36,990 --> 00:02:38,550
And let's take a look at why.

45
00:02:39,240 --> 00:02:42,420
Imagine we're at point C now. So you can see, at

46
00:02:42,420 --> 00:02:48,510
point C, a small change to the left or a small change to the right doesn't really change the loss all

47
00:02:48,510 --> 00:02:49,020
that much.

48
00:02:49,710 --> 00:02:52,860
This means that it's stuck in a local minimum.

49
00:02:53,370 --> 00:02:57,540
So this is a bad thing, though, because we want to end up here.

50
00:02:57,810 --> 00:02:59,040
Remember, look at this point here.

51
00:02:59,370 --> 00:03:01,830
It has a much lower loss than at point C.

52
00:03:02,400 --> 00:03:08,850
However, inadvertently during the training process, if our learning rate, which we'll discuss in the

53
00:03:08,850 --> 00:03:15,630
upcoming slides, if the learning rate is set too small, our neural, our gradient can get stuck,

54
00:03:15,750 --> 00:03:19,650
our function can get stuck here, and it's never going to change its

55
00:03:19,650 --> 00:03:20,190
weights anymore.

56
00:03:20,340 --> 00:03:26,010
It's going to converge at this point, and it's going to say, "This is the best I can do," when in reality,

57
00:03:26,010 --> 00:03:27,240
this is the best you could have done.

58
00:03:28,380 --> 00:03:31,800
So let's take a look at what local and global minimums are.

59
00:03:32,280 --> 00:03:37,560
So just to expand on the previous slide, these points in this graph here, these are the local minimums

60
00:03:37,560 --> 00:03:38,340
of this function.

61
00:03:38,850 --> 00:03:40,780
However, this is the global minimum right here.

62
00:03:40,830 --> 00:03:42,420
This is the point where we want to get to.

63
00:03:42,780 --> 00:03:44,760
This is a point where it has the lowest loss.

64
00:03:45,810 --> 00:03:47,640
So think about gradient descent.

65
00:03:48,300 --> 00:03:49,310
Remember, gradient descent

66
00:03:49,310 --> 00:03:53,910
is the method by which we use backpropagation to get the best weights.

67
00:03:54,780 --> 00:04:02,130
It's basically a system in which we use the backpropagation algorithm to just find a way to get to the

68
00:04:02,130 --> 00:04:06,210
lowest point, the lowest loss, and get the weights to that point.

69
00:04:06,810 --> 00:04:08,550
So imagine this example here.

70
00:04:08,550 --> 00:04:13,860
Imagine being a really tiny person and you're traversing down the slope of this old, rough bowl.

71
00:04:14,400 --> 00:04:21,060
There'd be peaks and valleys and troughs, because if you're quite small and it's an old, rugged type of

72
00:04:21,060 --> 00:04:24,750
bowl, then you're probably going to see a lot of bumps and cracks in it.

73
00:04:25,350 --> 00:04:30,300
So how do you know when you're truly at the bottom? Because you're so small at this point, you can possibly

74
00:04:30,300 --> 00:04:32,670
take large steps so you don't get stuck in a valley.

75
00:04:33,270 --> 00:04:36,450
However, you risk jumping over the bottom global minimum point.

76
00:04:37,110 --> 00:04:38,370
So let's take a look at that here.

77
00:04:38,640 --> 00:04:40,380
Step size is important.

78
00:04:40,890 --> 00:04:45,150
The step size can tell us, basically, if you move in this direction,

79
00:04:45,720 --> 00:04:46,410
that's a good thing.

80
00:04:46,560 --> 00:04:48,690
You'll end up there at this point.

81
00:04:49,200 --> 00:04:54,600
However, if you move a step size of this size again, you end up here.

82
00:04:54,720 --> 00:04:59,520
So now you're left thinking that this is the global minimum,

83
00:04:59,640 --> 00:04:59,880
when

84
00:04:59,910 --> 00:05:00,870
in reality it's not.

85
00:05:02,070 --> 00:05:08,250
So if you were to move backwards, oops, yeah, and if you were to move backwards, you end up in this

86
00:05:08,250 --> 00:05:09,060
same situation.

87
00:05:09,480 --> 00:05:13,870
So what we need here is a larger step size to find a global minimum.

88
00:05:14,250 --> 00:05:19,680
However, step size can be too large, and you can sometimes hop over the global minimum as well.

89
00:05:21,450 --> 00:05:23,100
Let's watch this animation again.

90
00:05:26,250 --> 00:05:33,110
So remember, we talked about lambda. Lambda was the learning rate. Basically, this just

91
00:05:33,120 --> 00:05:34,680
adjusts the step size.

92
00:05:35,520 --> 00:05:39,420
So we're having a learning factor here when we update the weights.

93
00:05:39,780 --> 00:05:44,580
This allows us to control the magnitude of how much we jump when updating our weights.

94
00:05:44,940 --> 00:05:49,980
You want to take small steps, but not too small steps, because then it takes forever to converge and

95
00:05:49,980 --> 00:05:51,660
you risk getting stuck in local minimums.

96
00:05:51,960 --> 00:05:58,290
However, you don't want to take steps so large that you miss the global minimum entirely. So you can

97
00:05:58,290 --> 00:05:58,610
see that

98
00:05:58,620 --> 00:06:00,990
finding the right learning rate is a bit tricky.

99
00:06:02,160 --> 00:06:04,740
However, we'll get to that point afterward.
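
[Editor's sketch, not part of the spoken lesson] The effect of the learning rate on the step size can be seen with the plain update rule w = w - learning_rate * gradient. The convex loss below is a hypothetical example with a single minimum at w = 3, so it only illustrates "too small" versus "too large"; it does not reproduce the local-minimum picture from the slides. The function names are assumptions.

```python
# Hypothetical loss (w - 3)^2 with its minimum at w = 3.
def grad(w):
    # Analytic derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate, steps):
    # Repeatedly step against the gradient, scaled by the learning rate.
    w = w_start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Same starting point and step count, three different learning rates.
too_small = gradient_descent(0.0, learning_rate=0.001, steps=50)
reasonable = gradient_descent(0.0, learning_rate=0.1, steps=50)
too_large = gradient_descent(0.0, learning_rate=1.1, steps=50)

for name, w in [("too small", too_small),
                ("reasonable", reasonable),
                ("too large", too_large)]:
    print(name, "distance from minimum:", abs(w - 3.0))
```

With the tiny rate the weight has barely moved after 50 steps, with the moderate rate it sits very close to the minimum, and with the oversized rate each step overshoots by more than the last, so the weight moves away from the minimum.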

100
00:06:05,400 --> 00:06:07,620
Let's talk about some gradient descent methods.

101
00:06:08,220 --> 00:06:09,960
So, naive gradient descent:

102
00:06:10,200 --> 00:06:14,820
this just passes the entire dataset through the network and updates the weights at the end.

103
00:06:15,570 --> 00:06:19,740
It's computationally expensive and slow, and it's not always a good thing.

104
00:06:19,920 --> 00:06:21,510
So that's not always a good thing.

105
00:06:22,050 --> 00:06:29,460
But stochastic gradient descent updates the weights after each sample, after each input image is forward propagated

106
00:06:29,460 --> 00:06:30,090
through the network.

107
00:06:30,450 --> 00:06:32,850
So in a way, you can see this being better.

108
00:06:32,970 --> 00:06:37,230
It's not going to be as computationally expensive as passing through the entire thing.

109
00:06:37,950 --> 00:06:44,310
However, this leads to a lot of noisy, fluctuating loss values, and it's also very slow to train,

110
00:06:44,310 --> 00:06:50,640
too. So, mini-batch gradient descent, which is what we do in practice in the real world:

111
00:06:50,670 --> 00:06:52,830
This is the method that works best.

112
00:06:53,310 --> 00:06:54,510
It combines both methods.

113
00:06:54,780 --> 00:06:58,720
It takes a batch of data, let's say 16 images, and forward

114
00:06:58,740 --> 00:07:02,010
propagates it all and then updates the gradients at the end.

115
00:07:02,250 --> 00:07:06,900
This leads to faster training and convergence on global minima.

116
00:07:08,100 --> 00:07:09,360
So we'll stop there now.

117
00:07:10,020 --> 00:07:13,950
Just something you should know is that batch sizes are typically 8 to 256.

118
00:07:14,400 --> 00:07:19,350
However, they can be one; you're basically doing stochastic gradient descent at that point.

119
00:07:19,740 --> 00:07:27,180
So it's best to use the biggest batch size you can that your RAM will support, or that the GPU memory will

120
00:07:27,180 --> 00:07:27,630
support.
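
[Editor's sketch, not part of the spoken lesson] The three methods discussed above differ only in how many samples feed each weight update. The toy problem below (fitting a single weight w so that w * x matches y = 2x) and the train function are assumptions made for illustration; shuffling between epochs is left out to keep the run deterministic.

```python
# Toy dataset: 8 samples of y = 2 * x, so the ideal weight is 2.0.
data = [(float(x), 2.0 * x) for x in range(1, 9)]


def train(samples, batch_size, learning_rate=0.01, epochs=100):
    """Fit w in y = w * x with mean squared error.

    batch_size == len(samples) -> naive (full-batch) gradient descent
    batch_size == 1            -> stochastic gradient descent
    anything in between        -> mini-batch gradient descent
    """
    w = 0.0
    updates = 0
    for _ in range(epochs):
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w.
            g = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * g
            updates += 1
    return w, updates


full_w, full_updates = train(data, batch_size=len(data))
mini_w, mini_updates = train(data, batch_size=4)
sgd_w, sgd_updates = train(data, batch_size=1)

print("full batch:", full_w, full_updates)
print("mini-batch:", mini_w, mini_updates)
print("stochastic:", sgd_w, sgd_updates)
```

On this noiseless toy data all three settings reach the same weight. What changes is the number of updates per epoch: one for the full batch, two for a batch size of 4, and eight for a batch size of 1.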

121
00:07:28,740 --> 00:07:33,840
So now we're going to take a look at optimizers as well as learning rate schedulers.

122
00:07:34,200 --> 00:07:35,760
So stay tuned for that lesson.

123
00:07:35,910 --> 00:07:36,360
Thank you.
