1
00:00:02,350 --> 00:00:07,680
Everyone, welcome back to this class, Modern Deep Learning in Python: Deep Learning in Python,

2
00:00:07,680 --> 00:00:08,280
Part Two.

3
00:00:13,230 --> 00:00:14,040
In this lecture,

4
00:00:14,040 --> 00:00:19,590
we are going to look at what I consider to be one of the most effective improvements over plain gradient

5
00:00:19,590 --> 00:00:22,000
descent, called momentum.

6
00:00:22,020 --> 00:00:31,890
Personally, I've found it to be the 80 percent factor that can improve your learning procedure.

7
00:00:31,930 --> 00:00:34,470
So how does momentum work?

8
00:00:34,480 --> 00:00:40,270
I like the analogy of a ball rolling down a hill, which is nice because during gradient descent we imagine

9
00:00:40,270 --> 00:00:42,570
the error surface as the bottom of a hill.

10
00:00:43,450 --> 00:00:49,940
But I think today an even better analogy is moving something on a frictionless surface, like ice.

11
00:00:50,020 --> 00:00:55,450
So if you've ever gone skating before, you know that you can easily glide along the ice very quickly

12
00:00:55,660 --> 00:01:02,350
and without much force, because the momentum just carries you the way you were going before. You can imagine

13
00:01:02,410 --> 00:01:08,800
pushing a box on ice, and the box will move from one point to another quite easily because you can push

14
00:01:08,800 --> 00:01:11,290
it once and it's going to slide.

15
00:01:11,350 --> 00:01:13,780
That's exactly what happens in physics.

16
00:01:13,780 --> 00:01:19,240
If there is no friction, then an object will just continue to go in the direction it was already going.

17
00:01:21,210 --> 00:01:23,580
Of course, ice still does have some friction.

18
00:01:23,640 --> 00:01:30,060
So eventually the box will slow down and stop. As you'll see, momentum in gradient descent

19
00:01:30,240 --> 00:01:32,190
also slows down after a while, too.

20
00:01:37,260 --> 00:01:39,290
To continue with this analogy,

21
00:01:39,370 --> 00:01:42,730
let's try to think of a situation without momentum.

22
00:01:42,730 --> 00:01:48,490
Now keep in mind, I'm talking about gradient descent momentum here, not physics momentum, since the definition

23
00:01:48,490 --> 00:01:51,360
of momentum in physics is a little different.

24
00:01:51,370 --> 00:01:56,590
So imagine now that, instead of ice, you're trying to push a box in sand.

25
00:01:56,590 --> 00:02:01,750
Now you can imagine that this is going to be very difficult because sand has lots of friction.

26
00:02:01,840 --> 00:02:07,010
So if you push a box in sand, it's going to go as far as you push it and no more.

27
00:02:07,150 --> 00:02:10,880
If you want the box to move again, you have to push it again.

28
00:02:11,110 --> 00:02:14,580
That's like gradient descent without momentum. Each time,

29
00:02:14,590 --> 00:02:24,910
if we want to move, there has to be a gradient so that we can move in the direction of the gradient.

30
00:02:24,910 --> 00:02:27,660
Now let's try to put these ideas into math.

31
00:02:27,730 --> 00:02:31,120
Let's start with regular gradient descent without momentum.

32
00:02:31,120 --> 00:02:38,110
So we have theta at time t is equal to theta at time t minus one, minus the learning rate times the gradient

33
00:02:38,110 --> 00:02:39,160
G.

34
00:02:39,370 --> 00:02:45,310
And from this we can see that if the gradient is zero, nothing is going to happen to theta. It just gets

35
00:02:45,400 --> 00:02:46,900
updated to its old value.

36
00:02:46,900 --> 00:02:47,890
It doesn't change.
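
The plain update just described can be written out as a few lines of Python. This is only a minimal sketch for a single scalar parameter: the function name `gd_step` and the numbers used are made up for illustration and do not come from the course scripts.

```python
def gd_step(theta, grad, learning_rate):
    """One plain gradient descent update: theta_t = theta_{t-1} - learning_rate * g."""
    return theta - learning_rate * grad


theta = 3.0  # hypothetical starting value

# Zero gradient: theta is "updated" to its old value, so nothing moves.
unchanged = gd_step(theta, grad=0.0, learning_rate=0.1)

# Nonzero gradient: theta moves by learning_rate * g, and only for this one step.
moved = gd_step(theta, grad=2.0, learning_rate=0.1)

print(unchanged, moved)
```

This is the box-in-sand case: each step depends only on the gradient at that moment.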

37
00:02:53,250 --> 00:02:56,450
Now let's look at gradient descent with momentum.

38
00:02:56,520 --> 00:03:02,210
We use the term momentum very loosely here since it has nothing to do with actual physical momentum.

39
00:03:02,310 --> 00:03:09,750
So we create a new variable called V, which stands for the velocity. It's equal to mu times its old velocity

40
00:03:10,110 --> 00:03:13,110
minus the learning rate times the gradient.

41
00:03:13,110 --> 00:03:18,140
And this is exactly what we talked about earlier when we imagined pushing a box on ice.

42
00:03:18,180 --> 00:03:24,870
The G term is the effect of our pushing the box, but the mu times V of t minus one term is the effect

43
00:03:24,870 --> 00:03:29,810
of continuing to move in the same direction as we were going before.

44
00:03:29,850 --> 00:03:35,250
Now, we talked about how when you push a box on ice, even though there's less friction and the box can

45
00:03:35,250 --> 00:03:38,610
slide, it's still going to stop eventually.

46
00:03:38,640 --> 00:03:45,570
So we want the new V to take on only a fraction of the old V, and so a typical value for mu, which we

47
00:03:45,570 --> 00:03:51,570
also call the momentum term, is 0.9, 0.95, or 0.99.

48
00:03:51,570 --> 00:03:52,710
Somewhere in that area.

49
00:03:58,170 --> 00:04:03,840
One thing you can immediately see is that if we plug in the expression for V of t, we can get the update

50
00:04:03,840 --> 00:04:09,530
for theta of t in terms of only the old theta, the old V, and the gradient G.

51
00:04:09,780 --> 00:04:21,150
From here it's easy to see that if we set mu to zero, we just get back regular old gradient descent.
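
Here is a small Python sketch of the momentum update, again for one scalar parameter. One assumption to flag: the way the velocity is applied to theta is never spoken aloud in this lecture, so the sketch assumes the common form, theta at time t equals theta at time t minus one plus V at time t. The name `momentum_step` and all the numbers are hypothetical.

```python
def momentum_step(theta, v, grad, learning_rate, mu):
    """One momentum update: v_t = mu * v_{t-1} - learning_rate * g."""
    v_new = mu * v - learning_rate * grad
    theta_new = theta + v_new  # assumed form of the theta update
    return theta_new, v_new


# Zero gradient but leftover velocity: theta keeps sliding, and V shrinks by mu.
coast_theta, coast_v = momentum_step(
    theta=3.0, v=-0.5, grad=0.0, learning_rate=0.1, mu=0.9
)

# mu = 0 discards the old velocity, which leaves the plain gradient descent step.
no_mu_theta, _ = momentum_step(
    theta=3.0, v=-0.5, grad=2.0, learning_rate=0.1, mu=0.0
)
plain_theta = 3.0 - 0.1 * 2.0

print(coast_theta, coast_v)
print(no_mu_theta, plain_theta)
```

The first call is the box-on-ice case: there is no push, yet the parameter still moves. The second call shows the special case where mu is zero.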

52
00:04:21,210 --> 00:04:24,270
So what is the effect of using momentum?

53
00:04:24,270 --> 00:04:28,960
Well, as you know, in all of our scripts we like to plot the cost per iteration.

54
00:04:29,220 --> 00:04:34,830
And so what we usually see is that the cost converges to its minimum value much faster than if we had

55
00:04:34,830 --> 00:04:36,360
not used momentum.

56
00:04:36,360 --> 00:04:45,510
So using momentum is pretty great because it significantly speeds up training.

57
00:04:45,520 --> 00:04:50,110
Now I want to discuss one more perspective on momentum before we move on.

58
00:04:50,170 --> 00:04:53,860
This is what is typically taught in deep learning courses these days.

59
00:04:53,860 --> 00:04:59,890
So in this perspective, the problem that momentum is solving is when we have unequal gradients in different

60
00:04:59,890 --> 00:05:01,300
directions.

61
00:05:01,300 --> 00:05:07,420
So for visualization purposes, let's assume we have two parameters to optimize: the vertical parameter

62
00:05:07,420 --> 00:05:08,840
and the horizontal parameter.

63
00:05:09,370 --> 00:05:16,300
So the gradient in one direction is very steep and the gradient in the other direction is very shallow.

64
00:05:16,330 --> 00:05:22,210
The idea is, if you don't have momentum, then you rely purely on the gradient, which points more in the

65
00:05:22,210 --> 00:05:24,960
steep direction than in the shallow direction.

66
00:05:25,090 --> 00:05:27,150
And this is just a property of the gradient.

67
00:05:27,160 --> 00:05:30,070
It's the direction of steepest descent.

68
00:05:30,070 --> 00:05:36,430
So since this gradient vector points more in the steep direction, the result is we're going to zigzag

69
00:05:36,460 --> 00:05:38,500
back and forth across this valley.

70
00:05:38,710 --> 00:05:41,930
And that's a really inefficient way of reaching the minimum.

71
00:05:42,010 --> 00:05:45,630
So what happens when we add momentum to this situation?

72
00:05:45,700 --> 00:05:51,370
Well, since in the shallow direction we move in the same direction every time, those velocities are going

73
00:05:51,370 --> 00:05:52,240
to accumulate.

74
00:05:52,690 --> 00:05:58,410
So we'll have a portion of our old velocity added to our new velocity to help us along in that direction.

75
00:05:59,510 --> 00:06:04,850
The result is that we get there faster by taking bigger steps in the direction of the shallow gradient.
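
To make the valley picture concrete, here is a toy experiment in plain Python. Everything in it is an assumption chosen for illustration: the cost is a made-up bowl, J = 0.5 * (x^2 + 50 * y^2), that is shallow along the horizontal parameter x and steep along the vertical parameter y, and the learning rate, mu, starting point, and step count are arbitrary. The theta update is assumed to be theta plus the new velocity.

```python
# Hypothetical curvatures: shallow in x (horizontal), steep in y (vertical).
CURVATURE = (1.0, 50.0)


def cost(theta):
    """J(x, y) = 0.5 * (1 * x**2 + 50 * y**2)."""
    return 0.5 * sum(c * p * p for c, p in zip(CURVATURE, theta))


def gradient(theta):
    """Gradient of the toy cost: much larger in the steep y direction."""
    return [c * p for c, p in zip(CURVATURE, theta)]


def descend(start, learning_rate, mu, steps):
    """Run gradient descent with momentum; mu = 0 gives plain gradient descent."""
    theta = list(start)
    v = [0.0] * len(theta)
    path = [tuple(theta)]
    for _ in range(steps):
        g = gradient(theta)
        v = [mu * vi - learning_rate * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
        path.append(tuple(theta))
    return path


start = (10.0, 1.0)
plain_path = descend(start, learning_rate=0.035, mu=0.0, steps=100)
momentum_path = descend(start, learning_rate=0.035, mu=0.9, steps=100)

# Without momentum, y flips sign on every step: the zigzag across the valley.
print("plain y, first steps:", [p[1] for p in plain_path[:4]])
print("plain final cost:    ", cost(plain_path[-1]))
print("momentum final cost: ", cost(momentum_path[-1]))
```

With these particular settings the momentum run ends at a lower cost after the same number of steps, because the velocity builds up along the shallow x direction. Other learning rates or values of mu can behave differently, so treat this as an illustration rather than a benchmark.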