1
00:00:00,690 --> 00:00:04,080
Now, let's talk about optimizers and learning rate schedules.

2
00:00:04,470 --> 00:00:10,440
These are methods that we use in addition to gradient descent: basically, more advanced gradient descent methods

3
00:00:10,800 --> 00:00:12,690
that we use to find the optimal weights.

4
00:00:12,870 --> 00:00:13,920
So let's get started.

5
00:00:14,670 --> 00:00:20,550
So remember, optimizers are basically methods like the gradient descent method, which we discussed.

6
00:00:21,030 --> 00:00:27,000
It's an optimization algorithm, which is why we call them optimizers, that allows us to find the

7
00:00:27,570 --> 00:00:30,000
weights and biases that give us the lowest loss.

8
00:00:30,480 --> 00:00:33,450
So in the last section, we covered gradient descent.

9
00:00:33,990 --> 00:00:39,180
So now we'll take a look at some other, more advanced optimization methods that can sometimes offer

10
00:00:39,180 --> 00:00:43,770
better performance or faster convergence when compared to stochastic gradient descent.

11
00:00:45,000 --> 00:00:50,670
So before we go on, let's take a look at some of the problems with the standard stochastic gradient

12
00:00:50,670 --> 00:00:51,990
descent method.

13
00:00:52,560 --> 00:00:59,160
So firstly, we have to manually set the learning rate, or if we're not using a manually set learning rate,

14
00:00:59,160 --> 00:01:02,500
we have to decide on a learning rate schedule, which we'll talk about shortly.

15
00:01:03,120 --> 00:01:09,400
So that's a problem, because now you kind of have to know what values

16
00:01:09,400 --> 00:01:10,290
to use in advance.

17
00:01:10,590 --> 00:01:14,560
We can try a value and wait some time, then try another value and wait some time.

18
00:01:14,910 --> 00:01:18,030
So it's not the best way to actually pick a learning rate.

19
00:01:18,690 --> 00:01:25,230
So using the same learning rate for every parameter oftentimes causes us to get stuck

20
00:01:25,230 --> 00:01:28,590
in local minima, or to jump over the global minimum at times if it's too big.

21
00:01:29,490 --> 00:01:36,030
So to solve some of these problems, a lot of researchers have developed basically extensions to stochastic

22
00:01:36,030 --> 00:01:36,840
gradient descent.

23
00:01:37,500 --> 00:01:40,530
All of these basically operate on the same underlying principles.

24
00:01:40,860 --> 00:01:44,040
They're just different ways to make the update differently.

25
00:01:44,490 --> 00:01:52,110
OK, so let's take a look at the first advanced feature of stochastic gradient descent that we can use.

26
00:01:52,500 --> 00:01:53,970
This one is called momentum.

27
00:01:54,480 --> 00:02:00,510
So one of the issues with stochastic gradient descent is areas where the loss surface is much steeper

28
00:02:00,510 --> 00:02:06,960
in one direction. It basically doesn't work too well in that case because it sees it as a bad thing.

29
00:02:07,860 --> 00:02:12,960
So because it's got a big spike, basically, is what I'm saying, it's not going to want to move

30
00:02:12,960 --> 00:02:13,710
in that direction.

31
00:02:14,190 --> 00:02:20,190
So this results in stochastic gradient descent oscillating around and making very little progress toward

32
00:02:20,190 --> 00:02:21,660
actually finding the minimum point.

33
00:02:22,320 --> 00:02:23,880
Now what momentum does?

34
00:02:24,120 --> 00:02:29,690
It increases the strength of the updates for dimensions whose gradients keep the same direction, and reduces them for those

35
00:02:29,700 --> 00:02:30,630
whose gradients switch directions.

36
00:02:31,140 --> 00:02:33,150
So this dampens oscillations.

37
00:02:33,150 --> 00:02:38,910
As you can see, with momentum here there are a lot fewer oscillations, and it gets faster.

38
00:02:39,050 --> 00:02:39,910
It gets closer,

39
00:02:39,910 --> 00:02:42,120
sorry, to the minimum point here.

40
00:02:42,660 --> 00:02:47,310
Typically, for momentum, we use a value of 0.9, which usually gives us good results.

41
00:02:47,700 --> 00:02:50,940
Next, let's take a look at Nesterov acceleration.

42
00:02:51,570 --> 00:02:55,620
So one of the problems introduced by momentum was overshooting the local minimum.

43
00:02:56,010 --> 00:03:01,980
So what Nesterov acceleration does is basically provide a corrective update, just so that

44
00:03:01,980 --> 00:03:04,170
we don't overshoot the local minimum.

45
00:03:04,530 --> 00:03:06,750
This is an illustration of how it works.

46
00:03:06,780 --> 00:03:08,730
So these are the updates.

47
00:03:08,880 --> 00:03:11,010
This is the initial, sorry,

48
00:03:11,040 --> 00:03:13,530
This is the initial stochastic gradient descent.

49
00:03:13,950 --> 00:03:18,900
This is Nesterov, sort of, with momentum, and this is the Nesterov correction in red.

50
00:03:19,230 --> 00:03:22,260
So this is the final update in vector form here.

51
00:03:23,850 --> 00:03:26,910
So this is a look at the other optimizers that we can use.

52
00:03:27,360 --> 00:03:29,440
I'm not going to go through all of these in detail.

53
00:03:29,460 --> 00:03:34,860
However, I will say that Adam is quite effective in a lot of training scenarios.

54
00:03:35,280 --> 00:03:40,800
I use Adam quite often, and I also use stochastic gradient descent with momentum a lot.

55
00:03:41,250 --> 00:03:46,620
I don't often use the Nesterov correction, but I do use momentum quite a bit and I tend to get good

56
00:03:46,620 --> 00:03:48,990
results. When that doesn't work,

57
00:03:49,020 --> 00:03:55,800
I use Adam, but it depends on your dataset and the model that you're, you know, using, so you can

58
00:03:55,800 --> 00:03:57,750
pause it and read these if you would like.

59
00:03:58,140 --> 00:04:02,670
I'm not going to discuss them in too much detail because they get quite technical.

60
00:04:03,150 --> 00:04:05,370
So it's not that important that you know this anyway.

61
00:04:05,760 --> 00:04:07,920
However, here's a visual comparison of them.

62
00:04:08,130 --> 00:04:15,270
So you could play this animation here, and that shows you basically how long the different methods

63
00:04:15,270 --> 00:04:17,640
take to converge. You can see stochastic gradient descent,

64
00:04:17,910 --> 00:04:23,010
the red one, is taking quite a long time, whereas, let's see, which one is this?

65
00:04:23,010 --> 00:04:28,160
The momentum one was first, perhaps, or Adadelta.

66
00:04:28,170 --> 00:04:32,940
But it's a good animation to see. This other one here shows you a similar thing as well.

67
00:04:34,600 --> 00:04:37,600
So they're all trying to find the global minimum, which is down here.

68
00:04:37,990 --> 00:04:44,440
However, you can notice SGD, stochastic gradient descent, got stuck in a local minimum here.

69
00:04:44,860 --> 00:04:47,560
It never found its way down the slope like the others did.

70
00:04:48,310 --> 00:04:55,840
So you can see why having an optimizer that has the ability to change the learning rate of

71
00:04:55,840 --> 00:04:57,640
the updates is very important.

72
00:04:58,000 --> 00:05:03,910
So now let's move on to learning rate schedules. So, you see, sometimes you may not want to use an

73
00:05:04,120 --> 00:05:06,940
algorithm that uses an advanced optimizer.

74
00:05:07,030 --> 00:05:12,460
So for whatever reason, maybe you're very specific with what learning rates you want to use for your

75
00:05:12,460 --> 00:05:13,090
experiment.

76
00:05:13,600 --> 00:05:18,490
So in that case, we can use stochastic gradient descent with something called a learning rate schedule,

77
00:05:18,880 --> 00:05:21,040
which sets the learning rate for each epoch.

78
00:05:21,350 --> 00:05:26,410
It's basically a simple lookup table that says: at epoch one use this value, at epoch two use that value.

79
00:05:27,070 --> 00:05:33,100
So usually you start at larger learning rates and then get progressively smaller over

80
00:05:33,100 --> 00:05:33,490
time.

81
00:05:34,570 --> 00:05:37,210
We can use this because if the learning rate is too high,

82
00:05:37,450 --> 00:05:40,460
you tend to overshoot minimum points.

83
00:05:40,990 --> 00:05:45,550
So it's a good way to basically adapt your neural network learning process if you wanted to have that

84
00:05:46,210 --> 00:05:47,140
precise control.

85
00:05:47,920 --> 00:05:54,550
So learning rate schedules are actually very simple to implement. PyTorch, Keras and TensorFlow all

86
00:05:54,560 --> 00:05:59,410
incorporate different ways to do it, so you can set it manually, or you can have a decay parameter.

87
00:05:59,860 --> 00:06:03,700
So it is quite configurable, and we'll get to this shortly,

88
00:06:03,730 --> 00:06:05,350
in future lessons.

89
00:06:06,130 --> 00:06:07,390
So we'll stop there.

90
00:06:07,510 --> 00:06:13,720
And basically, this concludes all of the background knowledge you need to understand for training your

91
00:06:13,720 --> 00:06:14,380
own CNN.

92
00:06:14,980 --> 00:06:17,920
However, I understand that this is a bit overwhelming.

93
00:06:17,920 --> 00:06:24,790
It's been roughly an hour of these lessons, and you may be quite confused still, which is understandable

94
00:06:24,790 --> 00:06:26,620
because it's a fairly deep topic.

95
00:06:26,800 --> 00:06:27,580
Pardon the pun.

96
00:06:28,270 --> 00:06:31,480
So what I'll do, I'll have a review session.

97
00:06:32,170 --> 00:06:37,840
This is going to be like a five to 10 minute video in the upcoming section, where

98
00:06:37,840 --> 00:06:41,980
I briefly go over all of the theory we had in the previous slides.

99
00:06:42,370 --> 00:06:47,800
So hopefully that can help you fit together all of the knowledge you just learned so that you can understand

100
00:06:47,810 --> 00:06:49,000
CNNs much better.

101
00:06:49,360 --> 00:06:49,810
Thank you.