1
00:00:11,560 --> 00:00:15,650
In this lecture, we are going to answer the question, how does a model learn?

2
00:00:16,330 --> 00:00:19,870
Let's start with linear regression in its most basic form.

3
00:00:19,870 --> 00:00:22,360
Linear regression just means line of best fit.

4
00:00:23,020 --> 00:00:28,720
As you may recall from your high school math studies, a line has the equation y = mx +

5
00:00:28,720 --> 00:00:30,100
b. Here,

6
00:00:30,100 --> 00:00:34,750
x is the input variable, m is the slope, and b is the y-intercept.

7
00:00:35,350 --> 00:00:38,490
When we put these together, we get the equation for a line.

8
00:00:39,160 --> 00:00:43,050
And our job, of course, is to find a line that best fits our input data.

9
00:00:48,180 --> 00:00:53,460
So you can imagine that our input data is a bunch of data points that approximately create the shape

10
00:00:53,460 --> 00:00:54,060
of a line.

11
00:00:54,630 --> 00:00:56,300
This plot is called a scatterplot.

12
00:00:57,060 --> 00:01:02,070
So for each point (x, y) in our data set, we're going to draw a dot on this chart.

13
00:01:02,910 --> 00:01:06,860
Our goal is to find a line that best passes through all these data points.

14
00:01:07,320 --> 00:01:11,200
In practice, this means finding the slope and y-intercept.

15
00:01:12,840 --> 00:01:18,720
One method you may have used in high school is to take a ruler and to try and draw the best line on

16
00:01:18,720 --> 00:01:20,900
paper using visual inspection.

17
00:01:21,630 --> 00:01:25,830
Of course, since this is data science, we no longer use such methods.

18
00:01:25,830 --> 00:01:27,850
We have to be more systematic.

19
00:01:28,290 --> 00:01:31,670
The line that you draw might not be the same as the line that I draw.

20
00:01:31,950 --> 00:01:35,680
And furthermore, if we are in multiple dimensions, this won't be a line at all.

21
00:01:36,390 --> 00:01:39,360
So it's clear that we need something better and more quantitative.

22
00:01:44,400 --> 00:01:50,490
The idea is we're going to define an error function. For each data point, which is a pair made up of the

23
00:01:50,490 --> 00:01:52,170
input x_i and the target

24
00:01:52,170 --> 00:02:00,540
y_i, we're going to calculate a prediction, ŷ_i = m x_i + b. Then we are going to take

25
00:02:00,540 --> 00:02:04,020
the squared difference between y_i and ŷ_i.

26
00:02:04,620 --> 00:02:06,960
We're going to do this for each of our data points.

27
00:02:07,110 --> 00:02:08,660
For i = 1 up to N.

28
00:02:09,600 --> 00:02:12,860
Once we've done this, we can add them all up and divide by N.

29
00:02:13,150 --> 00:02:15,420
This is called the mean squared error.

30
00:02:16,690 --> 00:02:20,750
It's the average squared deviation between our predictions and our targets.
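As a rough illustration of the mean squared error just described, here is a minimal plain-Python sketch. The function name `mean_squared_error` and the sample numbers are made up for this example and are not from the lecture.

```python
def mean_squared_error(xs, ys, m, b):
    """Average squared difference between predictions m*x + b and targets y."""
    n = len(xs)
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        y_hat_i = m * x_i + b           # prediction for this data point
        total += (y_i - y_hat_i) ** 2   # squared difference
    return total / n                    # add them all up and divide by N


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [0, 1, 2]
ys = [1, 3, 5]
print(mean_squared_error(xs, ys, m=2, b=1))  # perfect fit, so the error is 0.0
print(mean_squared_error(xs, ys, m=1, b=1))  # a worse line gives a larger error
```

The second call uses the wrong slope, so its error is larger than the first.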

31
00:02:21,450 --> 00:02:24,610
Basically, this is going to tell us how accurate our model is.

32
00:02:25,680 --> 00:02:31,440
If our predictions are equal to the targets, then this error will be zero, because y_i would be equal

33
00:02:31,440 --> 00:02:36,540
to ŷ_i. The more wrong we are, the larger this error becomes.

34
00:02:37,240 --> 00:02:39,600
So that's generally how an error function should work.

35
00:02:44,810 --> 00:02:50,210
As a side note, since some students get confused by this at first, we have multiple names for this

36
00:02:50,210 --> 00:02:52,970
error and we usually use them interchangeably.

37
00:02:53,570 --> 00:02:58,670
Sometimes we call it a loss, sometimes we call it an objective, or sometimes we call it a cost.

38
00:02:59,060 --> 00:03:01,320
However, these all mean the same thing.

39
00:03:02,030 --> 00:03:07,850
I like the term cost because it makes intuitive sense for business people. As any good businessperson would

40
00:03:07,850 --> 00:03:08,240
do,

41
00:03:08,570 --> 00:03:10,640
your job is to minimize your cost.

42
00:03:11,000 --> 00:03:15,320
And in fact, this statement alone tells us how to find the weights, w.

43
00:03:20,390 --> 00:03:22,280
So how do we minimize this cost?

44
00:03:22,340 --> 00:03:24,650
In other words, make it as small as possible?

45
00:03:25,280 --> 00:03:28,040
Well, it's time to turn to our old friend, calculus.

46
00:03:28,550 --> 00:03:34,460
If you recall, we can find the minimum or a maximum of a function by finding where its derivative is

47
00:03:34,460 --> 00:03:35,410
equal to zero.

48
00:03:36,140 --> 00:03:41,660
If you're not convinced by this, try to draw a curve which has a minimum or a maximum and draw the

49
00:03:41,660 --> 00:03:43,870
slope at the minimum or maximum point.

50
00:03:44,540 --> 00:03:49,550
If you drew your line correctly, it should be a horizontal line, meaning that the slope at this point

51
00:03:49,550 --> 00:03:50,130
is zero.

52
00:03:50,870 --> 00:03:54,580
And remember, the slope at each point is just the derivative.

53
00:03:54,770 --> 00:04:00,230
So that's why we want to find the derivative, set it to zero, and then solve for the parameter in question.

54
00:04:05,370 --> 00:04:10,390
Now, usually we have more than one parameter. Even in one-dimensional linear regression,

55
00:04:10,410 --> 00:04:14,250
this is the case, because we have both the slope and the y-intercept.

56
00:04:14,940 --> 00:04:17,950
So what happens when we have a function of more than one variable?

57
00:04:19,020 --> 00:04:22,420
In this case, the derivative is actually called the gradient.

58
00:04:22,980 --> 00:04:29,520
If you recall from your calculus studies, the gradient is just a vector or tensor of partial derivatives

59
00:04:29,850 --> 00:04:33,840
for each of the variables we're taking the gradient with respect to.

60
00:04:34,800 --> 00:04:39,230
So, for example, if you take the square function in one dimension, that's called a parabola.

61
00:04:39,870 --> 00:04:42,930
But if you have two input dimensions, it's called a paraboloid.

62
00:04:43,470 --> 00:04:45,030
Beyond that, we can't picture it.

63
00:04:45,030 --> 00:04:47,010
But the basic idea is still the same.

64
00:04:47,730 --> 00:04:51,190
Find the gradient, set it to zero, solve for the parameters.

65
00:04:51,990 --> 00:04:54,920
By the way, the term parameter is just another name for the weights.

66
00:04:54,930 --> 00:04:58,350
So that's the w and b we've been referring to throughout this section.

67
00:05:03,540 --> 00:05:09,540
Luckily, in this course, we won't be calculating any derivatives manually, since I have many other

68
00:05:09,540 --> 00:05:13,870
courses which do that a lot and by a lot I really mean a lot.

69
00:05:14,520 --> 00:05:20,820
So in this course, we're kind of excused from doing that because TensorFlow already does that for us using

70
00:05:20,820 --> 00:05:23,340
a process called automatic differentiation.

71
00:05:24,270 --> 00:05:29,750
Luckily, you don't have to know how that works either, because as its name suggests, it's automatic.

72
00:05:30,270 --> 00:05:36,210
So the only thing you really do have to know is that TensorFlow is going to automatically find the

73
00:05:36,210 --> 00:05:39,420
gradient of the loss with respect to all your weights. TensorFlow

74
00:05:39,420 --> 00:05:41,640
uses these gradients to train your model.

75
00:05:46,800 --> 00:05:51,950
At this point, you might be wondering if all I have to do is find the gradient and set it to zero,

76
00:05:52,320 --> 00:05:57,410
then why in our previous code did it involve an iterative training process?

77
00:05:58,080 --> 00:06:03,690
Recall that we had to specify the number of epochs to train for, and this resulted in a plot of loss

78
00:06:03,690 --> 00:06:08,430
per iteration, which we could check to confirm that the training algorithm converges nicely.

79
00:06:09,270 --> 00:06:14,790
Well, most of the time, it's not actually possible to solve the equation that you get when you set the

80
00:06:14,790 --> 00:06:17,030
gradient to zero.

81
00:06:17,940 --> 00:06:22,770
The one exception to this, in this course at least, is linear regression where we can solve it.

82
00:06:23,700 --> 00:06:27,640
We call the solution an analytical solution or a closed form solution.

83
00:06:28,170 --> 00:06:35,160
Basically, this means that we can express the optimal values of w and b using an equation. For logistic

84
00:06:35,160 --> 00:06:39,680
regression and the rest of the models we'll be discussing, it's not possible to do this.
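For the one-dimensional linear regression case, the closed-form solution mentioned above can be sketched in plain Python. The lecture does not state the formula itself; the version below is the standard least-squares result (slope = covariance of x and y divided by variance of x, intercept from the means). The name `fit_line` and the sample data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of products of deviations from the means.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx          # assumes the x values are not all identical
    b = y_mean - m * x_mean
    return m, b


# Hypothetical data that lies exactly on the line y = 2x + 1.
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```

No iteration is needed here, which is what makes linear regression the exception described in this part of the lecture.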

85
00:06:40,560 --> 00:06:42,660
Therefore, we need another approach.

86
00:06:43,080 --> 00:06:45,270
That approach is called gradient descent.

87
00:06:50,400 --> 00:06:54,870
Now, there's a lot more to gradient descent than what's in this lecture, and you can check out the

88
00:06:54,870 --> 00:06:57,730
in-depth section of this course for that if you're interested.

89
00:06:58,230 --> 00:07:00,410
For now, let's just talk about the basics.

90
00:07:01,080 --> 00:07:06,630
Essentially, what we do is we start at a randomly initialized point for both w and b.

91
00:07:07,470 --> 00:07:11,130
Since this is random, these probably don't lead to a small cost.

92
00:07:11,910 --> 00:07:18,720
So then we find the gradient of our loss with respect to w and b. We then take small steps in the

93
00:07:18,720 --> 00:07:22,590
opposite direction to update w and b on each iteration.

94
00:07:23,220 --> 00:07:25,650
Remember, these iterations are called epochs.

95
00:07:26,550 --> 00:07:31,560
Mathematically, you can prove that this leads to a decrease in cost, although we won't discuss that

96
00:07:31,560 --> 00:07:32,250
in this lecture.

97
00:07:33,090 --> 00:07:37,620
You can imagine that this is exactly what goes on inside the Keras fit function.

98
00:07:38,010 --> 00:07:44,910
This is really all that's happening: for epoch in range(epochs), set w equal to w minus eta

99
00:07:45,180 --> 00:07:53,040
times the gradient of the cost J with respect to w, and set b to b minus eta times the gradient of J with respect

100
00:07:53,040 --> 00:07:53,520
to b.
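The loop just described can be sketched in plain Python for the one-dimensional case with a mean squared error cost. This is only an illustration under stated assumptions, not what Keras actually runs: the function name `gradient_descent`, the default values, and the sample data are made up here, and w and b start at zero instead of a random point so the result is reproducible.

```python
def gradient_descent(xs, ys, eta=0.05, epochs=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    n = len(xs)
    w, b = 0.0, 0.0  # the lecture starts from a random point; zeros keep this repeatable
    for epoch in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x_i, y_i in zip(xs, ys):
            error = (w * x_i + b) - y_i        # prediction minus target
            grad_w += 2.0 * error * x_i / n    # partial derivative of the cost w.r.t. w
            grad_b += 2.0 * error / n          # partial derivative of the cost w.r.t. b
        w = w - eta * grad_w                   # step against the gradient
        b = b - eta * grad_b
    return w, b


# Hypothetical data that lies exactly on the line y = 2x + 1.
w, b = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
print(w, b)  # should end up close to 2 and 1
```

Each pass of the outer loop uses the whole data set once, which matches the lecture's use of the word epoch.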

101
00:07:56,420 --> 00:08:02,480
So, again, all we're doing is taking small steps against the direction of the gradient with respect to w and

102
00:08:02,480 --> 00:08:09,860
b. One obvious question that arises from this is: how small should these small steps actually be?

103
00:08:10,610 --> 00:08:16,400
Well, the step size is specified by this Greek letter, eta, which we call the learning rate. That

104
00:08:16,400 --> 00:08:19,560
specifies how fast or slow we want to train our model.

105
00:08:20,270 --> 00:08:25,610
It's important to set this value right, because if you don't, then your model won't get good results,

106
00:08:25,880 --> 00:08:28,430
even if your model is actually a good model.

107
00:08:33,460 --> 00:08:37,390
Unfortunately, there's no direct method of choosing a good learning rate.

108
00:08:38,110 --> 00:08:41,630
Generally speaking, the learning rate is something we call a hyperparameter.

109
00:08:42,160 --> 00:08:47,100
This is to differentiate it from w and b, which are just regular old parameters.

110
00:08:47,650 --> 00:08:52,000
It's called a hyperparameter because it's still a parameter, but it's not a parameter of the model

111
00:08:52,000 --> 00:08:52,620
itself.

112
00:08:52,870 --> 00:08:54,510
It's sort of like a meta parameter.

113
00:08:55,180 --> 00:08:58,640
And in fact, no hyperparameters are really chosen directly.

114
00:08:59,170 --> 00:09:05,110
It's more of a process of trial and error, along with intuition that you gain from practicing a lot

115
00:09:05,200 --> 00:09:07,180
and seeing a lot of different examples.

116
00:09:12,190 --> 00:09:17,800
As mentioned before, one way to know if your learning rate is too high or too low is to check the

117
00:09:17,810 --> 00:09:19,550
loss per iteration after training.

118
00:09:20,410 --> 00:09:25,960
This is somewhat unfortunate because training can take a long time and you only know the result once

119
00:09:25,960 --> 00:09:26,440
it's done.

120
00:09:27,250 --> 00:09:31,510
If you used a bad learning rate and got bad results, well, you still had to wait for them.

121
00:09:32,290 --> 00:09:37,420
I really like this visualization because it encapsulates how to choose the learning rate quite well.

122
00:09:38,080 --> 00:09:41,960
If your learning rate is too high, then your loss will just shoot off to infinity.

123
00:09:42,700 --> 00:09:47,860
Intuitively, this is because you're overshooting the minimum and just ending up on the other side of

124
00:09:47,860 --> 00:09:48,550
the canyon.

125
00:09:49,840 --> 00:09:55,420
That means your steps are too large and you need to make them smaller. Sometimes your learning rate

126
00:09:55,420 --> 00:09:56,730
might be a bit too high,

127
00:09:56,740 --> 00:10:01,180
so you'll see some convergence, but it'll converge to a suboptimal value.

128
00:10:02,960 --> 00:10:06,660
If your learning rate is too low, then you'll see a really shallow curve.

129
00:10:07,130 --> 00:10:10,470
This is also not good because as I mentioned, training takes time.

130
00:10:10,880 --> 00:10:14,060
So if your learning rate is too low, you'll have to wait longer.

131
00:10:14,630 --> 00:10:18,520
But not only that, sometimes you can just get stuck at a suboptimal point.

132
00:10:19,460 --> 00:10:20,480
That's also not good.

133
00:10:21,380 --> 00:10:24,280
A good learning rate is in between these two extremes.

134
00:10:24,290 --> 00:10:30,140
And so what you'll probably end up doing is finding the limits of your particular data set, what's

135
00:10:30,140 --> 00:10:35,270
too high, what's too low, and then trying different numbers in between until you find something good.

136
00:10:36,590 --> 00:10:42,620
Normally we try numbers that are powers of 10, for example, 0.1, 0.01,

137
00:10:42,620 --> 00:10:44,480
0.001, and so forth.
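The trial-and-error search over powers of 10 can be sketched as a small experiment in plain Python. Everything here is an assumption made for illustration: the function name `final_loss`, the tiny data set, the fixed 20 epochs, and the starting point of zero for w and b. On this particular data, a learning rate of 1.0 turns out to be too high and the loss blows up, while 0.001 is so low that the loss barely moves.

```python
def final_loss(xs, ys, eta, epochs=20):
    """Run a few epochs of gradient descent and return the final mean squared error."""
    n = len(xs)
    w, b = 0.0, 0.0  # fixed starting point so the runs are comparable
    for _ in range(epochs):
        grad_w = sum(2.0 * ((w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * ((w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= eta * grad_w
        b -= eta * grad_b
    # Multiply instead of using ** so a diverging run cannot raise OverflowError.
    return sum(((w * x + b) - y) * ((w * x + b) - y) for x, y in zip(xs, ys)) / n


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
for eta in [1.0, 0.1, 0.01, 0.001]:
    print(eta, final_loss(xs, ys, eta))
```

Comparing the printed losses side by side plays the same role as inspecting the loss-per-iteration plot, just with one number per learning rate.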