1
00:00:11,590 --> 00:00:15,640
In this lecture, we are going to answer the question, how does a model learn?

2
00:00:16,300 --> 00:00:21,940
Let's start with linear regression. In its most basic form, linear regression just means line of best

3
00:00:21,940 --> 00:00:22,330
fit.

4
00:00:23,020 --> 00:00:30,070
As you may recall from your high school math studies, a line has the equation y = mx + b. Here,

5
00:00:30,070 --> 00:00:34,720
x is the input variable, m is the slope, and b is the y-intercept.

6
00:00:35,350 --> 00:00:38,500
When we put these together, we get the equation for a line.

7
00:00:39,190 --> 00:00:43,000
And our job, of course, is to find a line that best fits our input data.

8
00:00:48,210 --> 00:00:53,430
So you can imagine that our input data is a bunch of data points that approximately create the shape

9
00:00:53,430 --> 00:00:54,090
of a line.

10
00:00:54,630 --> 00:00:56,290
This plot is called a scatterplot.

11
00:00:57,090 --> 00:01:02,070
So for each point (x, y) in our data set, we're going to draw a dot on this chart.

12
00:01:02,940 --> 00:01:06,810
Our goal is to find the line that best passes through all these data points.

13
00:01:07,320 --> 00:01:11,190
In practice, this means finding the slope and y-intercept.

14
00:01:12,810 --> 00:01:18,690
One method you may have used in high school is to take a ruler and to try and draw the best line on

15
00:01:18,690 --> 00:01:20,910
paper using visual inspection.

16
00:01:21,660 --> 00:01:25,800
Of course, since this is data science, we no longer use such methods.

17
00:01:25,800 --> 00:01:27,810
We have to be more systematic.

18
00:01:28,290 --> 00:01:31,650
The line that you draw might not be the same as the line that I draw.

19
00:01:31,950 --> 00:01:35,670
And furthermore, if we are in multiple dimensions, this won't be a line at all.

20
00:01:36,390 --> 00:01:39,300
So it's clear that we need something better and more quantitative.

21
00:01:44,400 --> 00:01:50,460
The idea is we're going to define an error function. For each data point, which is a pair made up of the

22
00:01:50,460 --> 00:01:58,320
input x_i and the target y_i, we're going to calculate a prediction, ŷ_i = w x_i + b.

23
00:01:59,280 --> 00:02:04,020
Then we are going to take the squared difference between y_i and ŷ_i.

24
00:02:04,620 --> 00:02:08,639
We're going to do this for each of our data points, for i = 1 up to N.

25
00:02:09,570 --> 00:02:12,880
Once we've done this, we can add them all up and divide by N.

26
00:02:13,170 --> 00:02:15,390
This is called the mean squared error.

27
00:02:16,690 --> 00:02:20,710
It's the average squared deviation between our predictions and our targets.

28
00:02:21,430 --> 00:02:24,570
Basically, this is going to tell us how accurate our model is.

29
00:02:25,740 --> 00:02:31,410
If our predictions are equal to the targets, then this error will be zero, because y_i would be equal

30
00:02:31,410 --> 00:02:36,510
to ŷ_i. Then, the more wrong we are, the larger this error becomes.

31
00:02:37,260 --> 00:02:39,570
So that's generally how an error function should work.
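
The mean squared error described above can be sketched in a few lines of NumPy. This is only an illustration: the function names `predict` and `mean_squared_error` and the toy data are assumptions made for this example, not code from the course.

```python
import numpy as np

def predict(x, w, b):
    # Prediction for every input at once: y_hat_i = w * x_i + b
    return w * x + b

def mean_squared_error(y, y_hat):
    # Average of the squared differences between targets and predictions
    return np.mean((y - y_hat) ** 2)

# Toy data (assumed for illustration) that lies exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Predictions that match the targets give zero error
perfect = mean_squared_error(y, predict(x, 2.0, 1.0))

# A worse line gives a larger error
worse = mean_squared_error(y, predict(x, 1.0, 0.0))

print(perfect, worse)
```

The second line is further from the data, so its error is larger, which is the behavior we want from an error function.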

32
00:02:44,810 --> 00:02:50,180
As a side note, since some students get confused by this at first, we have multiple names for this

33
00:02:50,180 --> 00:02:52,940
error and we usually use them interchangeably.

34
00:02:53,600 --> 00:02:55,070
Sometimes we call it a loss.

35
00:02:55,340 --> 00:02:58,670
Sometimes we call it an objective or sometimes we call it a cost.

36
00:02:59,090 --> 00:03:01,310
However, these all mean the same thing.

37
00:03:02,030 --> 00:03:07,820
I like the term cost because it makes intuitive sense for business people. As any good businessman would

38
00:03:07,820 --> 00:03:08,220
do,

39
00:03:08,570 --> 00:03:10,610
your job is to minimize your cost.

40
00:03:11,030 --> 00:03:15,290
And in fact, this statement alone tells us how to find the weights w.

41
00:03:20,390 --> 00:03:24,650
So how do we minimize this cost or, in other words, make it as small as possible?

42
00:03:25,280 --> 00:03:28,010
Well, it's time to turn to our old friend, calculus.

43
00:03:28,580 --> 00:03:34,220
If you recall, we can find the minimum or a maximum of a function by finding where its derivative

44
00:03:34,220 --> 00:03:35,390
is equal to zero.

45
00:03:36,170 --> 00:03:41,630
If you're not convinced by this, try to draw a curve which has a minimum or a maximum and draw the

46
00:03:41,630 --> 00:03:43,850
slope at the minimum or maximum point.

47
00:03:44,570 --> 00:03:49,520
If you drew your line correctly, it should be a horizontal line, meaning that the slope at this point

48
00:03:49,520 --> 00:03:50,120
is zero.

49
00:03:50,900 --> 00:03:54,540
And remember, the slope at each point is just the derivative.

50
00:03:54,800 --> 00:03:56,720
So that's why we want to find the derivative.

51
00:03:56,990 --> 00:04:00,170
Set it to zero and then solve for the parameter in question.

52
00:04:05,370 --> 00:04:10,350
Now, usually we have more than one parameter. Even in one-dimensional linear regression,

53
00:04:10,380 --> 00:04:14,220
this is the case, because we have both the slope and the y-intercept.

54
00:04:14,970 --> 00:04:17,940
So what happens when we have a function of more than one variable?

55
00:04:19,050 --> 00:04:22,410
In this case, the derivative is actually called the gradient.

56
00:04:23,010 --> 00:04:29,490
If you recall from your calculus studies, the gradient is just a vector or tensor of partial derivatives

57
00:04:29,850 --> 00:04:33,810
for each of the variables we're taking the gradient with respect to.

58
00:04:34,830 --> 00:04:39,210
So, for example, if you take the square function in one dimension, that's called a parabola.

59
00:04:39,870 --> 00:04:42,560
But if you have two input dimensions, it's called a paraboloid.

60
00:04:43,470 --> 00:04:46,950
Beyond that, we can't picture it, but the basic idea is still the same.

61
00:04:47,730 --> 00:04:49,590
Find the gradient, set it to zero.

62
00:04:49,770 --> 00:04:54,910
Solve for the parameters. By the way, the term parameter is just another name for the weights.

63
00:04:54,930 --> 00:04:58,290
So that's the w and b we've been referring to throughout this section.
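
For the mean squared error of a line, the gradient is a vector of two partial derivatives, one with respect to w and one with respect to b. The sketch below writes them out and compares them with a finite-difference estimate. The function names and toy data are assumptions for illustration only.

```python
import numpy as np

def mse(w, b, x, y):
    # Cost J(w, b): mean squared error of the line w * x + b
    return np.mean((y - (w * x + b)) ** 2)

def mse_gradient(w, b, x, y):
    # Partial derivatives of J with respect to w and b
    residual = (w * x + b) - y
    dJ_dw = 2.0 * np.mean(residual * x)
    dJ_db = 2.0 * np.mean(residual)
    return np.array([dJ_dw, dJ_db])

def numerical_gradient(w, b, x, y, eps=1e-6):
    # Central differences, used here only as a sanity check
    dJ_dw = (mse(w + eps, b, x, y) - mse(w - eps, b, x, y)) / (2 * eps)
    dJ_db = (mse(w, b + eps, x, y) - mse(w, b - eps, x, y)) / (2 * eps)
    return np.array([dJ_dw, dJ_db])

# Toy data (assumed) that lies exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

analytic = mse_gradient(0.5, -1.0, x, y)
numeric = numerical_gradient(0.5, -1.0, x, y)

# At the best-fitting line, both partial derivatives vanish
at_minimum = mse_gradient(2.0, 1.0, x, y)

print(analytic, numeric, at_minimum)
```

The gradient is zero at w = 2, b = 1, which is the "set it to zero" condition from the lecture.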

64
00:05:03,540 --> 00:05:09,510
Luckily, in this course, we won't be calculating any derivatives manually since I have many other

65
00:05:09,510 --> 00:05:13,860
courses which do that a lot, and by a lot, I really mean a lot.

66
00:05:14,550 --> 00:05:20,220
So in this course, we're kind of excused from doing that because TensorFlow already does that for us

67
00:05:20,460 --> 00:05:23,310
using a process called automatic differentiation.

68
00:05:24,270 --> 00:05:29,700
Luckily, you don't have to know how that works either, because as its name suggests, it's automatic.

69
00:05:30,270 --> 00:05:36,210
So the only thing you really do have to know is that TensorFlow is going to automatically find the

70
00:05:36,210 --> 00:05:41,610
gradient for all your weights, w and b. TensorFlow uses these gradients to train your model.

71
00:05:46,740 --> 00:05:51,930
At this point, you might be wondering: if all I have to do is find the gradient and set it to zero,

72
00:05:52,320 --> 00:05:57,360
then why did our previous code involve an iterative training process?

73
00:05:58,050 --> 00:06:01,440
Recall that we had to specify the number of epochs to train for.

74
00:06:01,710 --> 00:06:06,960
And this resulted in a plot of loss per iteration, which we could check to confirm that the training

75
00:06:06,960 --> 00:06:08,400
algorithm converges nicely.

76
00:06:09,240 --> 00:06:14,700
Well, in actuality, it's not possible to solve the equation that you get when you set

77
00:06:14,700 --> 00:06:15,690
the gradient to zero

78
00:06:15,960 --> 00:06:21,900
most of the time. The one exception to this, in this course at least, is linear regression, where

79
00:06:21,900 --> 00:06:22,740
we can solve it.

80
00:06:23,700 --> 00:06:27,630
We call the solution an analytical solution or a closed-form solution.

81
00:06:28,140 --> 00:06:35,130
Basically, this means that we can express the optimal values of w and b using an equation. For logistic

82
00:06:35,130 --> 00:06:37,710
regression and the rest of the models we'll be discussing,

83
00:06:37,980 --> 00:06:39,630
it's not possible to do this.

84
00:06:40,560 --> 00:06:42,630
Therefore, we need another approach.

85
00:06:43,080 --> 00:06:45,240
That approach is called gradient descent.
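
For one-dimensional linear regression, the closed-form solution mentioned above is the standard least-squares formula for the slope and intercept. The sketch below is illustrative: the function name `fit_line_closed_form` and the synthetic data are assumptions, and `np.polyfit` is used only as a cross-check.

```python
import numpy as np

def fit_line_closed_form(x, y):
    # Least-squares solution obtained by setting the gradient of the MSE to zero:
    #   w = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    #   b = y_mean - w * x_mean
    x_mean, y_mean = x.mean(), y.mean()
    w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - w * x_mean
    return w, b

# Synthetic data (assumed): a noisy line with slope 2 and intercept 1
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

w, b = fit_line_closed_form(x, y)

# np.polyfit with deg=1 returns [slope, intercept]
w_ref, b_ref = np.polyfit(x, y, deg=1)

print(w, b, w_ref, b_ref)
```

No iteration is needed here, which is why linear regression is the exception.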

86
00:06:50,400 --> 00:06:53,760
Now, there's a lot more to gradient descent than what's in this lecture.

87
00:06:53,910 --> 00:06:57,720
And you can check out the in-depth section of this course for that if you're interested.

88
00:06:58,230 --> 00:07:00,360
For now, let's just talk about the basics.

89
00:07:01,140 --> 00:07:08,770
Essentially, what we do is we start at a randomly initialized point for both w and b. Since this is random,

90
00:07:08,790 --> 00:07:11,070
these probably don't lead to a small cost.

91
00:07:11,940 --> 00:07:16,080
So then we find the gradient of our loss with respect to w and b.

92
00:07:16,770 --> 00:07:22,530
We then take small steps in the opposite direction to update w and b on each iteration.

93
00:07:23,250 --> 00:07:25,620
Remember, these iterations are called epochs.

94
00:07:26,550 --> 00:07:31,530
Mathematically, you can prove that this leads to a decrease in cost, although we won't discuss that

95
00:07:31,530 --> 00:07:32,220
in this lecture.

96
00:07:33,120 --> 00:07:37,590
You can imagine that this is exactly what goes on inside the Keras fit function.

97
00:07:38,010 --> 00:07:44,880
This is really all that's happening: for epoch in range(epochs), set w equal to w minus eta

98
00:07:45,150 --> 00:07:46,480
times the gradient of J

99
00:07:46,500 --> 00:07:53,520
with respect to w, and set b to b minus eta times the gradient of J with respect to b.
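
The update loop just described can be sketched in plain NumPy. This is a simplified illustration, not the actual Keras implementation. It assumes the cost J is the mean squared error, it uses the whole data set for every update, and it treats each update as one epoch. The function name, default values, and toy data are also assumptions.

```python
import numpy as np

def gradient_descent(x, y, eta=0.05, epochs=500, seed=0):
    # Start at a randomly initialized point for both w and b
    rng = np.random.default_rng(seed)
    w, b = rng.normal(size=2)
    losses = []
    for epoch in range(epochs):
        residual = (w * x + b) - y
        losses.append(np.mean(residual ** 2))  # cost J before this update
        grad_w = 2.0 * np.mean(residual * x)   # gradient of J with respect to w
        grad_b = 2.0 * np.mean(residual)       # gradient of J with respect to b
        w = w - eta * grad_w                   # step against the gradient
        b = b - eta * grad_b
    return w, b, losses

# Toy data (assumed) that lies exactly on y = 2x + 1
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

w, b, losses = gradient_descent(x, y)
print(w, b, losses[0], losses[-1])
```

The `losses` list plays the role of the loss-per-iteration plot: it should fall steadily if the learning rate is reasonable.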

100
00:07:56,420 --> 00:08:02,360
So again, all we're doing is taking small steps in the direction opposite to the gradient with respect to w

101
00:08:02,360 --> 00:08:09,530
and b. One obvious question that arises from this is: how small should these small steps actually

102
00:08:09,530 --> 00:08:09,860
be?

103
00:08:10,610 --> 00:08:15,590
Well, the step size is specified by this Greek letter eta, which we call the learning rate.

104
00:08:16,250 --> 00:08:19,610
This specifies how fast or slow we want to train our model.

105
00:08:20,270 --> 00:08:25,580
It's important to set this value right, because if you don't, then your model won't get good results

106
00:08:25,880 --> 00:08:28,400
even if your model is actually a good model.

107
00:08:33,520 --> 00:08:37,390
Unfortunately, there's no direct method of choosing a good learning rate.

108
00:08:38,110 --> 00:08:41,590
Generally speaking, the learning rate is something we call a hyperparameter.

109
00:08:42,159 --> 00:08:47,080
This is to differentiate it from w and b, which are just regular old parameters.

110
00:08:47,680 --> 00:08:51,970
It's called a hyperparameter because it's still a parameter, but it's not a parameter of the model

111
00:08:51,970 --> 00:08:52,570
itself.

112
00:08:52,900 --> 00:08:54,490
It's sort of like a meta-parameter.

113
00:08:55,210 --> 00:08:58,630
And in fact, no hyperparameters are really chosen directly.

114
00:08:59,230 --> 00:09:05,200
It's more of a process of trial and error, along with intuition that you gain from practicing a lot

115
00:09:05,200 --> 00:09:07,140
and seeing a lot of different examples.

116
00:09:12,220 --> 00:09:13,420
As mentioned before,

117
00:09:13,750 --> 00:09:19,060
one way to know if your learning rate is too high or too low is to check the loss per iteration after

118
00:09:19,060 --> 00:09:19,570
training.

119
00:09:20,410 --> 00:09:25,930
This is somewhat unfortunate because training can take a long time and you only know the result once

120
00:09:25,930 --> 00:09:26,440
it's done.

121
00:09:27,250 --> 00:09:32,920
If you used a bad learning rate and got bad results, well, you still had to wait for them. I really like

122
00:09:32,920 --> 00:09:37,420
this visualization because it encapsulates how to choose the learning rate quite well.

123
00:09:38,080 --> 00:09:41,950
If your learning rate is too high, then your loss will just shoot off to infinity.

124
00:09:42,700 --> 00:09:47,830
Intuitively, this is because you're overshooting the minimum and just ending up on the other side of

125
00:09:47,830 --> 00:09:48,550
the canyon.

126
00:09:49,810 --> 00:09:53,080
That means your steps are too large and you need to make them smaller.

127
00:09:54,160 --> 00:09:59,650
Sometimes your learning rate might be a bit too high, so you'll see some convergence, but it'll converge

128
00:09:59,650 --> 00:10:01,150
to a suboptimal value.

129
00:10:01,600 --> 00:10:05,560
That's kind of what happened in the Moore's Law example without learning rate scheduling.

130
00:10:07,390 --> 00:10:11,020
If your learning rate is too low, then you'll see a really shallow curve.

131
00:10:11,530 --> 00:10:14,890
This is also not good because as I mentioned, training takes time.

132
00:10:15,310 --> 00:10:18,430
So if your learning rate is too low, you'll have to wait longer.

133
00:10:19,030 --> 00:10:22,900
But not only that, sometimes you can just get stuck at a suboptimal point.

134
00:10:23,860 --> 00:10:24,850
That's also not good.

135
00:10:25,810 --> 00:10:31,510
A good learning rate is in between these two extremes, and so what you'll probably end up doing is

136
00:10:31,510 --> 00:10:33,820
finding the limits of your particular data set:

137
00:10:34,210 --> 00:10:38,080
what's too high, what's too low, and then trying different numbers in between

138
00:10:38,320 --> 00:10:39,700
until you find something good.

139
00:10:40,990 --> 00:10:43,210
Normally, we try numbers that are powers of 10.

140
00:10:43,540 --> 00:10:48,820
For example, 0.1, 0.01, 0.001, and so forth.
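
Trying powers of 10 can be sketched as a small sweep that trains the same model once per learning rate and compares the final losses. This is an illustration under assumptions: it uses a plain full-batch gradient descent loop on toy data rather than Keras, and the values 10.0 and 1.0 are added to the list only to show what "too high" looks like.

```python
import numpy as np

def final_loss(x, y, eta, epochs=100):
    # Plain gradient descent on the MSE, starting from w = 0, b = 0
    w, b = 0.0, 0.0
    for _ in range(epochs):
        residual = (w * x + b) - y
        w -= eta * 2.0 * np.mean(residual * x)
        b -= eta * 2.0 * np.mean(residual)
    return np.mean(((w * x + b) - y) ** 2)

# Toy data (assumed) that lies exactly on y = 2x + 1
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

# Candidate learning rates, all powers of 10
candidates = [10.0, 1.0, 0.1, 0.01, 0.001]
results = {eta: final_loss(x, y, eta) for eta in candidates}

for eta, loss in results.items():
    print(f"eta={eta:<6} final loss={loss:.3e}")

best = min(results, key=results.get)
print("best learning rate:", best)
```

On this toy problem the largest rate blows up, the smallest ones barely move in 100 epochs, and a value in between does best. The right range will differ for other data sets.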