1
00:00:11,560 --> 00:00:16,120
In this lecture we are going to do an example of simple linear regression in code

2
00:00:16,120 --> 00:00:21,660
using PyTorch. In other words, we are going to learn how to find the line of best fit.

3
00:00:21,730 --> 00:00:25,240
This lecture is going to walk you through a prepared Colab notebook.

4
00:00:25,240 --> 00:00:30,940
Although a very good exercise which I always recommend is once you know how this is done to try and

5
00:00:30,940 --> 00:00:37,540
recreate it yourself with as few references as possible. As usual, you can look at the title of the notebook

6
00:00:37,810 --> 00:00:40,360
to determine what notebook we are currently looking at.

7
00:00:42,610 --> 00:00:47,500
To start, we're going to do a few imports. You'll see throughout the course that these imports are what

8
00:00:47,500 --> 00:00:49,530
we'll need pretty much every time.

9
00:00:49,570 --> 00:00:56,050
So we have torch and torch.nn, which is a module that contains a lot of useful stuff in PyTorch.

10
00:00:56,050 --> 00:01:01,840
Next we have NumPy and Matplotlib. NumPy is for doing linear algebra and Matplotlib is for doing

11
00:01:01,840 --> 00:01:07,200
plotting.

12
00:01:07,240 --> 00:01:10,940
Next we're going to generate some data to apply our model to.

13
00:01:11,230 --> 00:01:13,690
Now you might ask why aren't we using real data.

14
00:01:13,990 --> 00:01:15,220
Well don't get too excited.

15
00:01:15,250 --> 00:01:17,170
That's what we're going to do next.

16
00:01:17,350 --> 00:01:22,750
Using synthetic data is actually very important in machine learning but more on that later.

17
00:01:22,780 --> 00:01:27,600
So to generate this data, I'm going to set N equal to 20 data points.

18
00:01:27,640 --> 00:01:32,980
Remember that our convention is to use the capital letter N to represent the total number of samples

19
00:01:33,280 --> 00:01:39,460
and the capital letter D to represent the total number of features. As we move on to more complex datasets

20
00:01:39,580 --> 00:01:40,770
later in this course,

21
00:01:40,840 --> 00:01:44,460
there will be some additional letters. In our case, D equals 1,

22
00:01:44,470 --> 00:01:47,960
so we don't really need to specify anything.

23
00:01:47,990 --> 00:01:51,660
Next we're going to generate some random points on the x axis.

24
00:01:51,740 --> 00:01:56,380
They're going to be uniformly distributed between minus five and plus five.

25
00:01:56,420 --> 00:01:58,630
There are of course several ways you could do this.

26
00:01:58,670 --> 00:02:02,790
So if you don't like the way I've done it you can feel free to implement your own.

27
00:02:02,990 --> 00:02:05,930
And this applies to anything in this course.

28
00:02:05,930 --> 00:02:11,570
So what we've done here is generated 20 random data points using the random function which outputs random

29
00:02:11,570 --> 00:02:13,830
numbers between 0 and 1.

30
00:02:13,940 --> 00:02:18,290
Next we multiply those numbers by 10 so that they range between 0 and 10.

31
00:02:19,070 --> 00:02:20,880
Finally we subtract 5.

32
00:02:20,900 --> 00:02:26,050
So the range becomes minus five to plus five.
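The three steps just described can be sketched as follows. This assumes the "random function" is NumPy's `np.random.random`; the notebook's exact call is not visible in this transcript, so treat it as one possible way to do it.

```python
import numpy as np

N = 20  # total number of samples

# np.random.random(N) draws N values uniformly from [0, 1).
# Multiplying by 10 stretches the range to [0, 10),
# and subtracting 5 shifts it to [-5, 5).
X = np.random.random(N) * 10 - 5

print(X.min(), X.max())
```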

33
00:02:26,100 --> 00:02:31,710
Next we're going to generate the data points on the y axis. Because we're doing linear regression,

34
00:02:31,740 --> 00:02:36,910
we would like these y data points to be linearly correlated with the x data points.

35
00:02:36,990 --> 00:02:43,670
So to do that, I'm going to say y equals 0.5 times x, minus 1, plus some Gaussian noise.

36
00:02:43,950 --> 00:02:46,270
Pay close attention to these numbers.

37
00:02:46,290 --> 00:02:52,120
This means that our true slope is 0.5 and our true y-intercept is minus one.

38
00:02:52,170 --> 00:02:55,630
These are the values that our model is going to try and find.
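A minimal sketch of generating these targets, assuming the Gaussian noise comes from `np.random.randn` (mean 0, variance 1). The constant names `TRUE_SLOPE` and `TRUE_INTERCEPT` are hypothetical and only used here for readability.

```python
import numpy as np

N = 20
TRUE_SLOPE = 0.5        # hypothetical name for the slope stated in the lecture
TRUE_INTERCEPT = -1.0   # hypothetical name for the y-intercept stated in the lecture

# x values uniformly distributed between -5 and +5
X = np.random.random(N) * 10 - 5

# y = 0.5 * x - 1 plus standard Gaussian noise (mean 0, variance 1)
Y = TRUE_SLOPE * X + TRUE_INTERCEPT + np.random.randn(N)

print(Y[:5])
```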

39
00:02:55,660 --> 00:03:01,350
Now you might ask why have I added Gaussian noise with mean zero and variance 1 rather than some other

40
00:03:01,350 --> 00:03:02,760
kind of noise.

41
00:03:02,970 --> 00:03:08,700
In fact, when we have Gaussian noise, the mean squared error becomes the correct loss function to use,

42
00:03:09,030 --> 00:03:14,310
and thus, since we're using the mean squared error as our loss function, we would also like to use Gaussian

43
00:03:14,310 --> 00:03:17,420
noise centered at zero.
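For readers who want the short version of why this holds, here is a sketch of the standard maximum-likelihood argument. The symbols w, b and sigma squared are notation chosen for this sketch, not names taken from the notebook.

```latex
% Model: a line plus zero-mean Gaussian noise
y_i = w x_i + b + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Log-likelihood of the N observed targets
\log p(y_1, \dots, y_N \mid x_1, \dots, x_N; w, b)
  = -\frac{N}{2} \log(2 \pi \sigma^2)
    - \frac{1}{2 \sigma^2} \sum_{i=1}^{N} \left( y_i - w x_i - b \right)^2

% The first term does not depend on w or b, so maximizing the
% likelihood is the same as minimizing the mean squared error
\arg\max_{w, b} \log p
  = \arg\min_{w, b} \frac{1}{N} \sum_{i=1}^{N} \left( y_i - w x_i - b \right)^2
```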

44
00:03:17,650 --> 00:03:22,420
You're not expected to understand this during this course, but when I mention these things, people inevitably

45
00:03:22,420 --> 00:03:24,490
end up asking me why this is the case.

46
00:03:24,820 --> 00:03:29,770
So if you are interested in learning the answer I would recommend you look at the in-depth series of

47
00:03:29,790 --> 00:03:36,960
courses where I've already covered all this material.

48
00:03:37,230 --> 00:03:39,470
Next we're going to do a plot of the data.

49
00:03:39,780 --> 00:03:44,910
As you can see, it sort of roughly forms a line. It's a little bit hard to tell since the noise variance

50
00:03:44,910 --> 00:03:48,200
is quite high compared to the variance of the data itself.

51
00:03:48,330 --> 00:03:56,000
But I think it's pretty clear that the data is trending upward. So that's everything for our dataset.

52
00:03:56,000 --> 00:04:04,170
Next we have the PyTorch stuff. First we're going to define our model, which as promised is just one line:

53
00:04:04,500 --> 00:04:06,750
nn.Linear(1, 1).

54
00:04:06,750 --> 00:04:12,520
This means that we're creating a linear model with one input and one output. Next we're going to begin

55
00:04:12,580 --> 00:04:19,930
the training process. As you know, it starts with defining a loss function object and an optimizer object.

56
00:04:19,930 --> 00:04:21,440
Luckily you've seen this before.

57
00:04:21,460 --> 00:04:23,380
So there are no surprises.

58
00:04:23,380 --> 00:04:33,850
We're using the MSE loss and the SGD optimizer with a learning rate of 0.1.
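The model, loss, and optimizer described here can be sketched as below. The variable names `model`, `criterion`, and `optimizer` are assumptions (the lecture does refer to "the criterion" later on); the classes themselves are standard PyTorch.

```python
import torch
import torch.nn as nn

# Linear model with one input feature and one output
model = nn.Linear(1, 1)

# Mean squared error loss
criterion = nn.MSELoss()

# Stochastic gradient descent over the model's parameters, learning rate 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

print(model)
```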

59
00:04:33,860 --> 00:04:39,140
Next we're going to transform X and Y into data types which are appropriate for PyTorch.

60
00:04:39,590 --> 00:04:47,960
So we start by reshaping X and Y into N-by-1 matrices. Next we cast both X and Y into float32s

61
00:04:48,230 --> 00:04:52,060
and convert them into torch tensors. In the next block,

62
00:04:52,070 --> 00:05:03,450
I printed out the type of the inputs variable so you can see that it is indeed a torch tensor.
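A sketch of that conversion, assuming the data was generated with NumPy as described earlier. The names `inputs` and `targets` are assumptions; the lecture only mentions an "inputs" variable.

```python
import numpy as np
import torch

N = 20
X = np.random.random(N) * 10 - 5
Y = 0.5 * X - 1 + np.random.randn(N)

# Reshape the 1-D arrays into N x 1 matrices (N samples, 1 feature)
X = X.reshape(N, 1)
Y = Y.reshape(N, 1)

# NumPy defaults to float64, so cast to float32 before converting,
# which matches the default dtype of nn.Linear's parameters
inputs = torch.from_numpy(X.astype(np.float32))
targets = torch.from_numpy(Y.astype(np.float32))

print(type(inputs))
```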

63
00:05:03,470 --> 00:05:09,950
Next we have our main training loop. To start, we set n_epochs, the number of iterations of the loop, to

64
00:05:09,940 --> 00:05:11,190
30.

65
00:05:11,210 --> 00:05:16,460
Now you might ask why did I choose 30 and not some other number like 10 or five or 1000.

66
00:05:17,030 --> 00:05:21,600
Well, the trick is I've run the script before and tried different values already.

67
00:05:21,890 --> 00:05:26,060
As a student, you don't really see all the work that goes into choosing hyperparameters.

68
00:05:26,060 --> 00:05:31,340
You only see the final result so you might assume that you can just pick any number and everything will

69
00:05:31,340 --> 00:05:32,090
work out great.

70
00:05:32,690 --> 00:05:34,990
This is often surprising for beginners.

71
00:05:35,120 --> 00:05:40,400
In any case, I discuss hyperparameters a lot elsewhere, so if you're interested in that, please

72
00:05:40,400 --> 00:05:47,920
ask me about it on the Q&A. Something else you didn't see in the code preparation lecture is that

73
00:05:47,920 --> 00:05:53,580
I have a list of losses that I'm going to plot after we are done. In each iteration of the loop,

74
00:05:53,660 --> 00:05:59,530
we are going to store the loss in this list, and at the end we're going to plot the loss per iteration.

75
00:05:59,530 --> 00:06:04,330
This will let us understand whether or not we chose good hyperparameters and whether or not the training

76
00:06:04,330 --> 00:06:06,460
process converged and so forth.

77
00:06:06,460 --> 00:06:07,540
So it's quite useful.

78
00:06:08,290 --> 00:06:10,930
Oh and just a hint for future examples.

79
00:06:10,930 --> 00:06:15,760
We are pretty much always going to plot this after training, so I don't want anyone asking me what

80
00:06:15,760 --> 00:06:19,810
this plot means, because it's going to be the exact same thing every time.

81
00:06:19,840 --> 00:06:20,670
So don't forget it.

82
00:06:23,310 --> 00:06:23,660
All right.

83
00:06:23,670 --> 00:06:27,420
So next we have our training loop which we've seen before.

84
00:06:27,420 --> 00:06:32,160
As I mentioned before, the first step is this weird step: we zero the gradients.

85
00:06:32,340 --> 00:06:37,890
And this is because, behind the scenes, PyTorch is actually accumulating the gradient each time you

86
00:06:37,890 --> 00:06:38,870
call backward.

87
00:06:39,330 --> 00:06:44,790
So this will zero the gradients to prevent them from accumulating and give us the correct answer.

88
00:06:51,160 --> 00:06:54,810
Next we do a forward pass to get the model predictions.

89
00:06:54,850 --> 00:06:59,260
Since you've now seen how to make predictions and get the output as a NumPy array, you know this is

90
00:06:59,260 --> 00:07:04,630
similar to that, except we don't need to take the additional step of converting this into a NumPy array.

91
00:07:04,930 --> 00:07:11,500
This is all in PyTorch land, so we get out a PyTorch tensor that we can use in the next step. The

92
00:07:11,500 --> 00:07:16,750
next step is to calculate the loss using the criterion we defined earlier.

93
00:07:16,750 --> 00:07:19,690
Next we save the loss to our list of losses.

94
00:07:19,690 --> 00:07:24,350
In this case, we do want to take the loss from PyTorch land into Python land.

95
00:07:24,430 --> 00:07:26,950
So we call the function item().

96
00:07:26,950 --> 00:07:32,050
You might ask why we call the function item() and not the function numpy() like we do when we make

97
00:07:32,050 --> 00:07:33,450
predictions.

98
00:07:33,520 --> 00:07:37,300
And that's because the loss is a single number and not a NumPy array.

99
00:07:37,840 --> 00:07:43,090
So when your tensor is a single number and you want to bring that number back to Python land then you

100
00:07:43,090 --> 00:07:45,460
should use the item function.

101
00:07:45,520 --> 00:07:51,640
Next we calculate the gradients. PyTorch encapsulates this in a function called backward(), to be called

102
00:07:51,640 --> 00:07:52,720
from the loss tensor.

103
00:07:53,590 --> 00:07:58,420
Finally, we call optimizer.step() to do one step of gradient descent.

104
00:07:58,480 --> 00:08:02,570
As always I like to print out useful information while the loop runs.

105
00:08:02,590 --> 00:08:07,750
So here you can see I'm printing the iteration number along with the corresponding loss at each iteration

106
00:08:07,750 --> 00:08:08,460
of the loop.
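Putting the steps above together, the training loop might look like the sketch below. Variable names such as `n_epochs`, `losses`, `inputs`, and `targets`, and the exact print format, are assumptions; the order of the steps follows the lecture.

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic data as described earlier in the lecture
N = 20
X = np.random.random(N) * 10 - 5
Y = 0.5 * X - 1 + np.random.randn(N)
inputs = torch.from_numpy(X.reshape(N, 1).astype(np.float32))
targets = torch.from_numpy(Y.reshape(N, 1).astype(np.float32))

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

n_epochs = 30
losses = []  # one Python float per iteration, for plotting later

for it in range(n_epochs):
    # Zero the gradients, since backward() accumulates them
    optimizer.zero_grad()

    # Forward pass: the outputs stay as a PyTorch tensor
    outputs = model(inputs)
    loss = criterion(outputs, targets)

    # item() converts the single-element loss tensor to a Python number
    losses.append(loss.item())

    # Backward pass, then one step of gradient descent
    loss.backward()
    optimizer.step()

    print(f"Epoch {it + 1}/{n_epochs}, Loss: {loss.item():.4f}")
```

Because the data is random, the printed losses differ from run to run. With a learning rate as large as 0.1, an unlucky draw of X can even make the loss grow instead of shrink, in which case a smaller learning rate is the usual thing to try.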

107
00:08:14,030 --> 00:08:17,990
So if you look at these numbers they give you some sense of how training is progressing

108
00:08:24,350 --> 00:08:29,690
but the real value comes from plotting the loss per iteration, which is what we have next.

109
00:08:29,690 --> 00:08:35,330
This is the kind of curve we like to see in deep learning: a steady decrease, fast at the beginning

110
00:08:35,450 --> 00:08:37,070
and slow at the end.

111
00:08:37,100 --> 00:08:41,030
This plot gives us more confidence that training has completed successfully.

112
00:08:41,060 --> 00:08:45,230
It's not a 100 percent guarantee but this is at least one positive sign.

113
00:08:52,500 --> 00:08:57,620
Next we're going to use our trained model to make predictions and plot the result.

114
00:08:57,870 --> 00:09:02,170
As you recall, the task for this model is to find a line of best fit.

115
00:09:02,310 --> 00:09:05,880
It would be quite silly if we didn't actually plot that line.

116
00:09:05,880 --> 00:09:11,520
The next step is therefore to pass our inputs into our model which gives us the predictions in terms

117
00:09:11,520 --> 00:09:15,730
of a torch tensor. Since we would prefer to work with NumPy arrays,

118
00:09:15,810 --> 00:09:20,460
we first detach the tensor and then call the numpy() function.

119
00:09:20,460 --> 00:09:24,420
Now you might be wondering what happens if I do not call detach.

120
00:09:24,420 --> 00:09:26,100
We'll look at that in a moment.

121
00:09:26,280 --> 00:09:31,320
The next step is to plot the data points using a scatter plot and then to plot the predicted line using

122
00:09:31,320 --> 00:09:32,620
a line plot.

123
00:09:32,760 --> 00:09:47,310
As you can see from the result our line is in fact the line of best fit.

124
00:09:47,340 --> 00:09:52,550
Next we look at what will happen if you do not call the detach function. As you can see,

125
00:09:52,560 --> 00:09:59,190
we get an error. It says: can't call numpy() on variable that requires grad, use var.detach().numpy()

126
00:09:59,190 --> 00:10:00,570
instead.

127
00:10:00,570 --> 00:10:01,940
Which is exactly what we did.

128
00:10:05,510 --> 00:10:10,660
However, this error message gives us a hint for an alternative way to make predictions.

129
00:10:10,730 --> 00:10:12,980
We can see that it has something to do with gradients.

130
00:10:20,850 --> 00:10:21,990
In the next block of code,

131
00:10:21,990 --> 00:10:25,620
we can instruct PyTorch not to compute gradients.

132
00:10:25,620 --> 00:10:27,830
We use the context manager with torch.no_grad().

133
00:10:27,840 --> 00:10:31,360
Inside the block, we make our prediction.

134
00:10:31,440 --> 00:10:37,410
But if you notice, there is no call to the detach function. As you can see, the output is printed correctly

135
00:10:37,500 --> 00:10:46,800
without any error.
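The two ways of getting predictions as a NumPy array, and the error in between, can be compared side by side. This is a sketch with an untrained stand-in model and random inputs, so the numbers are meaningless; the variable names are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # untrained stand-in for the trained model
X = np.random.random((20, 1)) * 10 - 5
inputs = torch.from_numpy(X.astype(np.float32))

# Option 1: detach the output from the autograd graph, then convert
predicted_detach = model(inputs).detach().numpy()

# Option 2: disable gradient tracking, so no detach() is needed
with torch.no_grad():
    predicted_no_grad = model(inputs).numpy()

# Without either option, numpy() raises a RuntimeError
try:
    model(inputs).numpy()
except RuntimeError as err:
    print("Error:", err)

# Both options give the same values
print(np.allclose(predicted_detach, predicted_no_grad))
```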

136
00:10:46,820 --> 00:10:51,710
The last thing we would like to do in this script is to inspect the parameters of the model to check

137
00:10:51,710 --> 00:10:54,170
if they're close to the true values.

138
00:10:54,170 --> 00:10:58,580
As you recall, these are the slope and intercept of the line. In machine learning,

139
00:10:58,580 --> 00:11:04,370
we call these the weight and bias. In order to get these values, we call model.weight.data.numpy()

140
00:11:04,370 --> 00:11:12,080
and model.bias.data.numpy(). Intuitively, you can expect that model.weight and

141
00:11:12,080 --> 00:11:17,380
model.bias store the weight and bias as some special PyTorch tensor variable.

142
00:11:17,480 --> 00:11:22,960
However, we would like to bring these back to NumPy land, so in order to do that we call

143
00:11:22,970 --> 00:11:23,930
.data.numpy().

144
00:11:24,200 --> 00:11:29,200
You can imagine that the reason we treat these differently from the model's predictions is because they

145
00:11:29,200 --> 00:11:33,020
are model parameters rather than data inputs or outputs.

146
00:11:33,050 --> 00:11:37,280
If we look at the values we can glean a few key pieces of information.

147
00:11:37,340 --> 00:11:43,970
First, the obvious thing we want to check: are they close to the true values? We get 0.46

148
00:11:43,970 --> 00:11:45,530
and -1.2.

149
00:11:45,560 --> 00:11:52,320
So it appears that they are. One interesting thing about these is that w seems to be a two-dimensional

150
00:11:52,320 --> 00:11:56,110
array whereas b seems to be a one-dimensional array.
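The shapes can be checked directly. This sketch uses a freshly created model as a stand-in, so the values are random rather than the trained 0.46 and -1.2, but the shapes are the same. In `nn.Linear`, the weight has shape (out_features, in_features) and the bias has shape (out_features,).

```python
import torch.nn as nn

model = nn.Linear(1, 1)  # stand-in; values are random until trained

w = model.weight.data.numpy()
b = model.bias.data.numpy()

print(w, b)
print(w.shape, b.shape)  # (1, 1) (1,)
```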

151
00:11:56,250 --> 00:12:01,710
You might ask, why do we need a 2D array and a 1D array to store single numbers?

152
00:12:01,710 --> 00:12:07,040
In fact we don't, but we need our model to be flexible enough to handle different scenarios.

153
00:12:07,260 --> 00:12:13,800
These scenarios are what you'll learn about in the later lectures. The final thing I want to talk about

154
00:12:13,800 --> 00:12:19,310
in this lecture is: what is the point of using synthetic data instead of real data?

155
00:12:19,320 --> 00:12:21,620
This is going to come up once or twice in this course.

156
00:12:21,630 --> 00:12:24,680
So it's good to discuss this early and often.

157
00:12:24,690 --> 00:12:30,690
One of the most important questions we can ask is: does our model actually do what we think it does?

158
00:12:30,780 --> 00:12:37,050
As a beginner taking a course your automatic reflex is to make a guess or to think of the problem philosophically

159
00:12:37,620 --> 00:12:39,480
but this is not a philosophy course.

160
00:12:39,570 --> 00:12:41,600
It is a computer science course.

161
00:12:41,700 --> 00:12:47,040
We don't discuss maybes in this course. It's binary: yes or no, it works or it doesn't.

162
00:12:47,040 --> 00:12:50,250
So synthetic data is very useful in that regard.

163
00:12:50,370 --> 00:12:55,980
I can create synthetic data such that if my model does what I say it does, then it will find the pattern

164
00:12:55,980 --> 00:12:59,930
in the synthetic data that I set it up to find. If it can't,

165
00:12:59,940 --> 00:13:05,520
then my assertion is wrong. Doing things this way allows us to test the strengths and weaknesses of our

166
00:13:05,520 --> 00:13:09,440
models and helps us understand what they can and cannot do.