1
00:00:11,700 --> 00:00:18,240
In this lecture we are going to discuss gradient descent gradient descent is extremely important because

2
00:00:18,240 --> 00:00:24,510
it's the algorithm that we use to train all of our models from simple neurons to neural networks to

3
00:00:24,510 --> 00:00:27,960
convolution on their own networks and recurrent neural networks.

4
00:00:27,960 --> 00:00:34,530
This one basic idea is the backbone of deep learning and in fact it can be used for other machine learning

5
00:00:34,530 --> 00:00:41,940
models as well such as K Means clustering Hidden Markov models and Matrix Factorization.

6
00:00:42,070 --> 00:00:46,170
Basically it is a general solution with wide applicability.

7
00:00:46,180 --> 00:00:52,600
This lecture will discuss the basic intuitions about gradient descent but if you are an expert in mathematics

8
00:00:52,870 --> 00:00:57,700
you might want to check out a rigorous proof that it works in a document of attached and extra reading

9
00:00:57,700 --> 00:00:59,180
that X T.

10
00:00:59,230 --> 00:01:07,710
It's called gradient descent convergence analysis.

11
00:01:07,720 --> 00:01:12,870
First let's ask ourselves why do we need a gradient descent in the first place.

12
00:01:12,880 --> 00:01:15,750
What is the problem we are trying to solve.

13
00:01:15,760 --> 00:01:21,160
Recall that for problems such as linear regression we are going to come up with an error function also

14
00:01:21,160 --> 00:01:24,100
known as a cause function or lost function.

15
00:01:24,100 --> 00:01:29,960
Our goal is to minimize this error with respect to the model parameters call that W..

16
00:01:29,980 --> 00:01:32,920
So how do we minimize L with respect to W

17
00:01:38,050 --> 00:01:42,730
if you think back to your high school calculus studies you'll realize that you've already encountered

18
00:01:42,730 --> 00:01:43,580
this.

19
00:01:43,840 --> 00:01:49,300
Whenever we have a function and we want to minimize it or maximize it with respect to its arguments

20
00:01:49,360 --> 00:01:50,740
what do we do.

21
00:01:50,890 --> 00:01:53,630
We take its derivative and set it to zero.

22
00:01:53,680 --> 00:01:56,270
Then we saw for the parameter in question.

23
00:01:56,530 --> 00:02:03,460
So for example if you wanted to minimize 5 X squared plus three x plus one with respect to x you would

24
00:02:03,460 --> 00:02:09,070
find the derivative with respect to x set it to zero and solve for x.

25
00:02:09,070 --> 00:02:15,760
The reason why this works is because the derivative or the slope at the minimum and maximum of a function

26
00:02:16,000 --> 00:02:17,500
is zero.

27
00:02:17,500 --> 00:02:20,170
In other words I can draw a line at this point.

28
00:02:20,380 --> 00:02:25,300
Tangent to the curve and it should be a horizontal line.

29
00:02:25,330 --> 00:02:29,080
Now you might ask how do we know if it's a maximum or minimum.

30
00:02:29,080 --> 00:02:34,390
There are ways to determine that but in machine learning we generally do not do that analysis.

31
00:02:34,390 --> 00:02:41,260
We pretty much already know that the lost functions we are working with have minimums and not maximums.

32
00:02:41,490 --> 00:02:47,040
Sometimes we may worry about saddle points or local minima but in general this is still not a problem

33
00:02:47,250 --> 00:02:53,140
for the latest deep learning methods the reason why is outside the scope of this course.

34
00:02:53,230 --> 00:02:55,780
But you are welcome to ask me about it on the Q and A

35
00:03:00,900 --> 00:03:01,240
Okay.

36
00:03:01,240 --> 00:03:05,200
This still doesn't answer the question why do we need a gradient descent.

37
00:03:05,200 --> 00:03:11,140
Well here's the problem sometimes we can take the derivative and try to set it to 0 and solve for the

38
00:03:11,140 --> 00:03:12,160
parameter.

39
00:03:12,370 --> 00:03:19,060
But sometimes this equation is not solvable by the way I've maybe once or twice had someone asked me

40
00:03:19,090 --> 00:03:23,140
why some equations are not solvable while my answer is.

41
00:03:23,140 --> 00:03:29,290
If you think you can solve it then do it and show me the answer you can use your intuition about other

42
00:03:29,290 --> 00:03:31,570
equations we do not know how to solve.

43
00:03:31,630 --> 00:03:33,520
For example we can't solve.

44
00:03:33,520 --> 00:03:38,680
Let's say the roots of a degree 100 polynomial it just simply is impossible

45
00:03:43,780 --> 00:03:44,100
OK.

46
00:03:44,130 --> 00:03:49,950
So what is the general strategy we use in mathematics when we cannot solve an equation.

47
00:03:49,980 --> 00:03:53,820
The answer is we use numerical approximations.

48
00:03:53,820 --> 00:03:56,500
One such example is finding integrals.

49
00:03:56,610 --> 00:03:59,890
Sometimes you can't solve for an integral by hand.

50
00:03:59,910 --> 00:04:05,040
I think anyone who is taking calculus knows that solving some integrals can be quite challenging and

51
00:04:05,040 --> 00:04:07,940
for others literally impossible.

52
00:04:07,950 --> 00:04:09,670
So what do we do.

53
00:04:09,750 --> 00:04:14,640
Well instead of trying to solve the equation by hand we just draw a little trapezoid underneath the

54
00:04:14,640 --> 00:04:17,780
curve and we find the area of those trapezoid.

55
00:04:19,460 --> 00:04:22,970
That will give us an approximation for the area under the curve.

56
00:04:23,720 --> 00:04:25,330
OK so that's the analogy.

57
00:04:25,610 --> 00:04:36,450
When we can't solve something analytically or by hand we resort to a numerical approximation methods.

58
00:04:36,480 --> 00:04:39,320
So what's the idea behind gradient descent.

59
00:04:39,360 --> 00:04:45,210
The idea is if I take the gradient of the function with respect to the parameter W and I take small

60
00:04:45,210 --> 00:04:49,450
steps in that direction the value of the function will decrease.

61
00:04:49,470 --> 00:04:56,630
Provided that my step size is small enough eventually as I keep taking these very small steps I'm going

62
00:04:56,630 --> 00:04:59,430
to get closer and closer to the bottom of the curve.

63
00:04:59,660 --> 00:05:06,910
And eventually I will converge to an answer by converge I mean that when you get near the bottom your

64
00:05:06,910 --> 00:05:10,960
steps will become so small that they will become insignificant.

65
00:05:10,960 --> 00:05:16,760
So w doesn't change as you take more steps at that point.

66
00:05:16,760 --> 00:05:20,340
You have found the value of W that minimizes your function.

67
00:05:25,290 --> 00:05:27,650
So what does this look like in code.

68
00:05:27,660 --> 00:05:34,800
Basically it's just a for loop before the loop we initialize W to some random value then inside the

69
00:05:34,800 --> 00:05:38,640
loop we continually update w using the gradient descent formula

70
00:05:43,810 --> 00:05:46,840
so here's an example of gradient descent at work.

71
00:05:47,080 --> 00:05:49,280
Let's say we want to minimize w squared.

72
00:05:49,300 --> 00:05:51,140
So that's our last function.

73
00:05:51,190 --> 00:05:52,930
The derivative is 2 W.

74
00:05:53,560 --> 00:06:05,680
So inside the loop we iteratively apply the equation W equals W minus eight times two times w.

75
00:06:05,680 --> 00:06:08,830
Now let's look at some code I've pasted that does this.

76
00:06:09,070 --> 00:06:12,130
As you can see I've initialized W to 5.

77
00:06:12,790 --> 00:06:18,550
I created an empty list called L where I'm going to store the loss at each iteration so that I can plot

78
00:06:18,550 --> 00:06:19,650
it later.

79
00:06:19,870 --> 00:06:26,440
Now in a loop I run my gradient descent update one hundred times and it each time I append w square

80
00:06:26,440 --> 00:06:35,090
to l finally I plot out we can see that eventually the value of the function goes down to zero.

81
00:06:35,110 --> 00:06:36,600
It's a minimum value.

82
00:06:36,890 --> 00:06:43,160
The value of W that we find at the end is also zero because this is the value of W that makes the original

83
00:06:43,160 --> 00:06:49,160
function zero.

84
00:06:49,200 --> 00:06:54,780
Now you will notice that there is some hyper parameters that can't quite be ignored in particular how

85
00:06:54,780 --> 00:06:57,580
many iterations should we run the loop for.

86
00:06:57,630 --> 00:07:03,390
This is known as the number of epochs of course we must have a number of epochs high enough so that

87
00:07:03,390 --> 00:07:07,120
the cost function converges but it must not be too high.

88
00:07:07,260 --> 00:07:13,950
Otherwise we'll waste a lot of time where w simply doesn't change how small should our step size be.

89
00:07:13,950 --> 00:07:19,200
I've used the Greek symbol Ada for this and we call it the learning rate the learning rate must be small

90
00:07:19,200 --> 00:07:23,940
enough so that the cost function doesn't explode but it also must be large enough so that the learning

91
00:07:23,940 --> 00:07:26,580
process doesn't take unnecessarily long.

92
00:07:27,960 --> 00:07:33,120
Both of these hyper remedies are values that we must choose using trial and error.

93
00:07:33,150 --> 00:07:37,860
There are a few more advanced methods such as Bayesian optimization that you could use also but they

94
00:07:37,860 --> 00:07:40,320
are outside the scope of this course.

95
00:07:40,320 --> 00:07:46,320
Normally for students of machine learning doing experiments manually and gaining intuition through practice

96
00:07:46,350 --> 00:07:47,510
is the best approach.