1
00:00:11,690 --> 00:00:16,350
In this lecture, we are going to answer the question: how does a model learn?

2
00:00:16,370 --> 00:00:21,950
Let's start with linear regression. In its most basic form, linear regression just means line of best

3
00:00:21,950 --> 00:00:22,370
fit.

4
00:00:23,120 --> 00:00:30,080
As you may recall from your high school math studies, a line has the equation y = mx + b. Here,

5
00:00:30,080 --> 00:00:35,420
x is the input variable, m is the slope, and b is the y-intercept.

6
00:00:35,420 --> 00:00:41,690
When we put these together, we get the equation for a line, and our job, of course, is to find a line that

7
00:00:41,690 --> 00:00:43,010
best fits our input data.

8
00:00:48,250 --> 00:00:53,470
So you can imagine that our input data is a bunch of data points that approximately create the shape

9
00:00:53,470 --> 00:00:54,670
of a line.

10
00:00:54,700 --> 00:00:56,310
This plot is called a scatter plot.

11
00:00:57,130 --> 00:01:03,010
So for each point (x, y) in our dataset, we're going to draw a dot on this chart.

12
00:01:03,010 --> 00:01:07,360
Our goal is to find the line that best passes through all these data points.

13
00:01:07,420 --> 00:01:13,420
In practice, this means finding the slope and y-intercept. One method

14
00:01:13,420 --> 00:01:19,900
you may have used in high school is to take a ruler and try to draw the best line on paper using

15
00:01:19,900 --> 00:01:21,620
visual inspection.

16
00:01:21,730 --> 00:01:25,840
Of course, since this is data science, we no longer use such methods.

17
00:01:25,840 --> 00:01:28,360
We have to be more systematic.

18
00:01:28,360 --> 00:01:32,000
The line that you draw might not be the same as the line that I draw.

19
00:01:32,050 --> 00:01:36,430
And furthermore, if we are in multiple dimensions, this won't be a line at all.

20
00:01:36,430 --> 00:01:44,430
So it's clear that we need something better and more quantitative.

21
00:01:44,460 --> 00:01:50,490
The idea is that we're going to define an error function. For each data point, which is a pair made up of the

22
00:01:50,490 --> 00:01:53,110
input x_i and the target y_i,

23
00:01:53,200 --> 00:02:00,660
we're going to calculate a prediction ŷ_i = m x_i + b. Then we are going to take the

24
00:02:00,660 --> 00:02:07,860
squared difference between y_i and ŷ_i. We're going to do this for each of our data points, for i =

25
00:02:07,860 --> 00:02:13,200
1 up to N. Once we've done this, we can add them all up and divide by N.

26
00:02:13,230 --> 00:02:16,520
This is called the mean squared error.

27
00:02:16,770 --> 00:02:21,470
It's the average squared deviation between our predictions and our targets.
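
As a rough illustration of the mean squared error just described, here is a minimal sketch in plain Python. The function names and the toy data are hypothetical and are not taken from the course code; the data was chosen to lie exactly on the line y = 2x + 1.

```python
def predict(x, m, b):
    """Prediction of the line for a single input: y_hat = m * x + b."""
    return m * x + b


def mean_squared_error(xs, ys, m, b):
    """Average of the squared differences between targets and predictions."""
    n = len(xs)
    return sum((y - predict(x, m, b)) ** 2 for x, y in zip(xs, ys)) / n


# Hypothetical toy data lying exactly on y = 2x + 1 (an assumption for this sketch)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect = mean_squared_error(xs, ys, m=2.0, b=1.0)  # the line that generated the data
worse = mean_squared_error(xs, ys, m=1.0, b=0.0)    # a line that misses every point
print(perfect, worse)
```

The line that generated the data gives an error of zero, and the mismatched line gives a larger one, which is the behavior an error function should have.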

28
00:02:21,480 --> 00:02:24,900
Basically, this is going to tell us how accurate our model is.

29
00:02:25,800 --> 00:02:31,410
If our predictions are equal to the targets, then this error will be zero, because y_i would be equal

30
00:02:31,410 --> 00:02:32,430
to ŷ_i.

31
00:02:33,090 --> 00:02:37,290
Then, the more wrong we are, the larger this error becomes.

32
00:02:37,290 --> 00:02:39,600
So that's generally how an error function should work.

33
00:02:44,890 --> 00:02:45,770
As a side note,

34
00:02:45,790 --> 00:02:51,580
since some students get confused by this at first, we have multiple names for this error, and we usually

35
00:02:51,580 --> 00:02:53,500
use them interchangeably.

36
00:02:53,650 --> 00:02:55,370
Sometimes we call it a loss.

37
00:02:55,390 --> 00:02:59,100
Sometimes we call it an objective or sometimes we call it a cost.

38
00:02:59,110 --> 00:03:01,360
However, these all mean the same thing.

39
00:03:02,110 --> 00:03:07,840
I like the term cost because it makes intuitive sense for business people. As any good businessman would

40
00:03:07,840 --> 00:03:08,650
do,

41
00:03:08,650 --> 00:03:20,380
your job is to minimize your cost. And in fact, this statement alone tells us how to find the weights w.

42
00:03:20,460 --> 00:03:25,290
So how do we minimize this cost, or in other words, make it as small as possible?

43
00:03:25,380 --> 00:03:31,410
Well, it's time to turn to our old friend, calculus. If you recall, we can find the minimum or maximum

44
00:03:31,410 --> 00:03:36,190
of a function by finding where its derivative is equal to zero.

45
00:03:36,240 --> 00:03:42,120
If you're not convinced by this, try to draw a curve which has a minimum or maximum, and draw the slope

46
00:03:42,180 --> 00:03:44,360
at the minimum or maximum point.

47
00:03:44,640 --> 00:03:49,530
If you drew your line correctly, it should be a horizontal line, meaning that the slope at this point

48
00:03:49,530 --> 00:03:50,830
is zero.

49
00:03:51,000 --> 00:03:54,780
And remember, the slope at each point is just the derivative.

50
00:03:54,840 --> 00:04:00,240
So that's why we want to find the derivative, set it to zero, and then solve for the parameter in question.

51
00:04:05,450 --> 00:04:10,400
Now, usually we have more than one parameter. Even in one-dimensional linear regression,

52
00:04:10,400 --> 00:04:14,990
this is the case, because we have both the slope and the y-intercept.

53
00:04:14,990 --> 00:04:19,780
So what happens when we have a function of more than one variable? In this case,

54
00:04:19,790 --> 00:04:23,090
the derivative is actually called the gradient.

55
00:04:23,090 --> 00:04:29,540
If you recall from your calculus studies, the gradient is just a vector or tensor of partial derivatives,

56
00:04:29,870 --> 00:04:36,650
one for each of the variables we're taking the gradient with respect to. So, for example, if you take the square

57
00:04:36,650 --> 00:04:41,900
function in one dimension, that's called a parabola, but if you have two input dimensions, it's called

58
00:04:41,900 --> 00:04:43,520
a paraboloid.

59
00:04:43,520 --> 00:04:50,150
Beyond that, we can't picture it, but the basic idea is still the same: find the gradient, set it to 0, solve

60
00:04:50,150 --> 00:04:54,920
for the parameters. By the way, the term parameter is just another name for the weights.

61
00:04:54,980 --> 00:04:58,340
So that's the w and b we've been referring to throughout this section.

62
00:05:03,610 --> 00:05:10,060
Luckily, in this course we won't be calculating any derivatives manually, since I have many other courses

63
00:05:10,060 --> 00:05:11,430
which do that a lot.

64
00:05:11,710 --> 00:05:13,870
And by a lot, I really mean a lot.

65
00:05:14,590 --> 00:05:19,930
So in this course, we're kind of excused from doing that, because PyTorch already does that for us using

66
00:05:19,930 --> 00:05:27,170
a process called automatic differentiation. Luckily, you don't have to know how that works either, because,

67
00:05:27,230 --> 00:05:30,410
as its name suggests, it's automatic.

68
00:05:30,410 --> 00:05:36,260
So the only thing you really do have to know is that PyTorch is going to automatically find the gradient

69
00:05:36,290 --> 00:05:41,030
for all your weights, and PyTorch uses these gradients to train your model.

70
00:05:46,810 --> 00:05:52,290
At this point, you might be wondering: if all I have to do is find the gradient and set it to zero,

71
00:05:52,390 --> 00:05:57,980
then why did our previous code involve an iterative training process?

72
00:05:58,120 --> 00:06:03,670
Recall that we had to specify the number of epochs to train for, and this resulted in a plot of loss

73
00:06:03,670 --> 00:06:09,020
per iteration which we could check to confirm that the training algorithm converged nicely.

74
00:06:09,310 --> 00:06:14,800
Well, in actuality, it's not possible to solve the equation that you get when you set the

75
00:06:14,800 --> 00:06:16,030
gradient to zero,

76
00:06:16,030 --> 00:06:22,330
most of the time. The one exception to this, in this course at least, is linear regression, where we can

77
00:06:22,330 --> 00:06:28,200
solve it. We call the solution an analytical solution or a closed-form solution.
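
To make the idea of a closed-form solution concrete, here is a minimal sketch for the one-dimensional case. The lecture does not derive the formula; this uses the standard least-squares result for a line (slope equals the sum of products of deviations divided by the sum of squared x deviations), and the function name `fit_line` and the toy data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = m * x + b (standard textbook formula)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx            # slope that sets dJ/dm to zero
    b = y_mean - m * x_mean  # intercept that sets dJ/db to zero
    return m, b


# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_line(xs, ys)
print(m, b)
```

No loop over epochs is needed here, which is exactly what makes linear regression the exception. Note that the sketch assumes the x values are not all identical, otherwise the division by `sxx` fails.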

78
00:06:28,210 --> 00:06:35,140
Basically, this means that we can express the optimal values of w and b using an equation. For logistic

79
00:06:35,140 --> 00:06:38,050
regression and the rest of the models we'll be discussing,

80
00:06:38,050 --> 00:06:40,390
it's not possible to do this.

81
00:06:40,600 --> 00:06:43,100
Therefore, we need another approach.

82
00:06:43,120 --> 00:06:45,450
That approach is called gradient descent.

83
00:06:50,490 --> 00:06:53,860
Now there's a lot more to gradient descent than what's in this lecture.

84
00:06:54,000 --> 00:06:58,290
And you can check out the in-depth section of this course for that if you're interested.

85
00:06:58,290 --> 00:07:01,150
For now, let's just talk about the basics.

86
00:07:01,200 --> 00:07:07,090
Essentially, what we do is start at a randomly initialized point for both w and b.

87
00:07:07,530 --> 00:07:11,850
Since this is random, these probably don't lead to a small cost.

88
00:07:12,000 --> 00:07:19,440
So then we find the gradient of our loss with respect to w and b. We then take small steps in the opposite direction

89
00:07:19,770 --> 00:07:23,310
to update w and b on each iteration.

90
00:07:23,310 --> 00:07:27,600
Remember, these iterations are called epochs. Mathematically,

91
00:07:27,660 --> 00:07:33,210
you can prove that this leads to a decrease in cost, although we won't discuss that in this lecture.

92
00:07:33,210 --> 00:07:38,580
You can imagine that this is exactly what goes on inside the optimizer.step() function.

93
00:07:38,580 --> 00:07:45,520
This is really all that's happening: for epoch in range(epochs), set w equal to w minus eta

94
00:07:45,810 --> 00:07:47,100
times the gradient of the cost J

95
00:07:47,100 --> 00:07:57,070
with respect to w, and set b to b minus eta times the gradient of J with respect to b.
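
The update rule just described can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not what PyTorch runs internally: the gradients of the mean squared error are written out by hand, the starting point is fixed at zero rather than random so the run is reproducible, and the names `gradients` and `train`, the toy data, and the values of `eta` and `epochs` are all hypothetical.

```python
def gradients(xs, ys, w, b):
    """Partial derivatives of J = mean((w * x + b - y) ** 2) with respect to w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    return grad_w, grad_b


def train(xs, ys, eta=0.05, epochs=2000):
    """Plain gradient descent: repeatedly step against the gradient."""
    w, b = 0.0, 0.0  # fixed start for reproducibility; the lecture starts at a random point
    for epoch in range(epochs):
        grad_w, grad_b = gradients(xs, ys, w, b)
        w = w - eta * grad_w  # w <- w - eta * dJ/dw
        b = b - eta * grad_b  # b <- b - eta * dJ/db
    return w, b


# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train(xs, ys)
print(w, b)
```

With this data and step size, the loop ends up very close to the slope and intercept that generated the data. A larger `eta` can make the same loop diverge.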

96
00:07:57,090 --> 00:08:03,060
So again, all we're doing is taking small steps against the direction of the gradient with respect to w and

97
00:08:03,060 --> 00:08:11,290
b. And one obvious question that arises from this is: how small should these small steps actually be?

98
00:08:11,310 --> 00:08:17,640
Well, the step size is specified by this Greek letter eta (η), which we call the learning rate. That specifies

99
00:08:17,640 --> 00:08:20,950
how fast or slow we want to train our model.

100
00:08:20,970 --> 00:08:26,880
It's important to set this value right, because if you don't, then your model won't get good results, even

101
00:08:26,880 --> 00:08:29,030
if your model is actually a good model.

102
00:08:34,180 --> 00:08:38,750
Unfortunately, there is no direct method of choosing a good learning rate.

103
00:08:38,770 --> 00:08:42,720
Generally speaking, the learning rate is something we call a hyperparameter.

104
00:08:42,790 --> 00:08:48,370
This is to differentiate it from W and B which are just regular old parameters.

105
00:08:48,370 --> 00:08:53,500
It's called a hyperparameter because it's still a parameter, but it's not a parameter of the model itself.

106
00:08:53,560 --> 00:08:59,890
It's sort of like a meta-parameter. And in fact, no hyperparameters are really chosen directly.

107
00:08:59,890 --> 00:09:06,010
It's more of a process of trial and error along with intuition that you gain from practicing a lot and

108
00:09:06,010 --> 00:09:07,780
seeing a lot of different examples.

109
00:09:12,900 --> 00:09:14,430
As mentioned before,

110
00:09:14,430 --> 00:09:19,680
one way to know if your learning rate is too high or too low is to check the loss per iteration after

111
00:09:19,680 --> 00:09:21,060
training.

112
00:09:21,060 --> 00:09:26,550
This is somewhat unfortunate because training can take a long time and you only know the result once

113
00:09:26,550 --> 00:09:27,930
it's done.

114
00:09:27,930 --> 00:09:33,540
If you used a bad learning rate and got bad results, well, you still had to wait for them. I really like

115
00:09:33,540 --> 00:09:38,610
this visualization because it encapsulates how to choose the learning rate quite well.

116
00:09:38,760 --> 00:09:43,410
If your learning rate is too high, then your loss will just shoot off to infinity.

117
00:09:43,410 --> 00:09:48,450
Intuitively, this is because you're overshooting the minimum and just ending up on the other side of

118
00:09:48,450 --> 00:09:54,660
the canyon. That means your steps are too large and you need to make them smaller.

119
00:09:54,830 --> 00:10:00,290
Sometimes your learning rate might be a bit too high, so you'll see some convergence, but it'll converge

120
00:10:00,290 --> 00:10:07,710
to a suboptimal value. If your learning rate is too low, then you'll see a really shallow curve.

121
00:10:07,720 --> 00:10:11,440
This is also not good because, as I mentioned, training takes time.

122
00:10:11,470 --> 00:10:14,940
So if your learning rate is too low, you'll have to wait longer.

123
00:10:15,190 --> 00:10:20,050
But not only that, sometimes you can just get stuck at a suboptimal point.

124
00:10:20,050 --> 00:10:21,750
That's also not good.

125
00:10:22,030 --> 00:10:28,030
A good learning rate is in between these two extremes, and so what you'll probably end up doing is finding

126
00:10:28,030 --> 00:10:33,520
the limits for your particular dataset, what's too high and what's too low, and then trying different numbers

127
00:10:33,520 --> 00:10:36,890
in between until you find something good.

128
00:10:37,170 --> 00:10:43,350
Normally, we try numbers that are powers of 10, for example, 0.1, 0.01,

129
00:10:43,350 --> 00:10:44,990
0.001, and so forth.
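
One way to get a feel for this trial-and-error process is to sweep over a few learning rates on a toy problem. The sketch below is an assumption-laden illustration: it minimizes the hypothetical one-parameter cost J(w) = w**2 rather than a real model, and the function name `final_loss`, the step count, and the starting value are made up for the example.

```python
def final_loss(eta, steps=50, w=1.0):
    """Run gradient descent on the toy cost J(w) = w ** 2 and return the final cost."""
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w ** 2
        w = w - eta * grad  # same update rule as before, with a single parameter
    return w ** 2


# One deliberately too-large rate, then the powers of 10 mentioned in the lecture
for eta in [1.5, 0.1, 0.01, 0.001]:
    print(f"eta={eta}: final loss = {final_loss(eta):.3e}")
```

On this toy cost, the too-large rate makes the loss grow instead of shrink, the smallest rates barely move in 50 steps, and the rate in between gets closest to the minimum. Real models behave less cleanly, so treat this only as intuition for reading a loss-per-iteration plot.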
