1
00:00:00,180 --> 00:00:06,090
Earlier in the course in the joint distribution section, we talked about covariance and Pearson's correlation

2
00:00:06,090 --> 00:00:06,870
coefficient.

3
00:00:06,900 --> 00:00:12,780
Now we want to revisit that idea of the correlation coefficient to talk about that and the residual

4
00:00:12,780 --> 00:00:15,280
which are both key parts of regression,

5
00:00:15,300 --> 00:00:17,000
the focus of this section.

6
00:00:17,010 --> 00:00:22,680
So if you remember Pearson's correlation coefficient, which we often indicate with this variable R

7
00:00:22,710 --> 00:00:25,260
is equal to one over n minus one,

8
00:00:25,260 --> 00:00:31,920
where n is the sample size, multiplied by the sum of these products: x sub i minus the mean of

9
00:00:31,920 --> 00:00:38,880
x divided by the sample standard deviation of x, times y sub i minus the mean of y divided by the sample

10
00:00:38,880 --> 00:00:39,660
standard deviation

11
00:00:39,660 --> 00:00:46,920
of y. Realize here that we could also write this formula this way because both of these

12
00:00:46,920 --> 00:00:50,480
values here are basically the Z scores for X and Y.

13
00:00:50,490 --> 00:00:56,700
If we look at these here, we recognize them as the formula for a z-score: x minus the mean of x divided

14
00:00:56,700 --> 00:01:00,180
by its standard deviation is the Z score for X.

15
00:01:00,180 --> 00:01:05,730
So we can replace each of these fractions with the z score of x, the z score of Y.

16
00:01:05,760 --> 00:01:09,360
So this is another way to write the correlation coefficient.

17
00:01:09,360 --> 00:01:15,450
And then this formula here is the one that we used earlier when we talked about the correlation coefficient

18
00:01:15,450 --> 00:01:16,680
and covariance.

19
00:01:16,710 --> 00:01:22,230
All three of these get us to the same place so we can use whichever formula is most convenient.
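For reference, the three equivalent forms can be written out as below. The formulas appear on screen in the video rather than in the spoken text, so the notation here (s_x and s_y for the sample standard deviations, z for the z-scores) is an assumption, and the third form is inferred from the worked example later in this lesson.

```latex
r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}\, z_{y_i}
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \,\sum_{i=1}^{n}(y_i-\bar{y})^2}}
```

The last form follows because each sample standard deviation carries a factor of the square root of n minus one, and those two factors cancel the one over n minus one in front.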

20
00:01:22,260 --> 00:01:30,720
Remember that the value of the correlation coefficient always falls on the interval -1 to 1, so we'll

21
00:01:30,720 --> 00:01:36,330
never find a value for the correlation coefficient that is less than negative one or greater than positive

22
00:01:36,330 --> 00:01:36,780
one.

23
00:01:36,780 --> 00:01:42,210
The correlation coefficient will always take on a value somewhere in the interval, negative one to

24
00:01:42,240 --> 00:01:43,170
positive one.

25
00:01:43,170 --> 00:01:49,620
And remember that the value of this correlation coefficient just tells us how strong the linear relationship

26
00:01:49,620 --> 00:01:52,470
is between these two variables x and y.

27
00:01:52,470 --> 00:01:59,790
So if we think about a scale for the correlation coefficient, so we could almost sketch a number line

28
00:01:59,790 --> 00:02:05,970
like this where the correlation coefficient is negative one over here, positive one over here and zero

29
00:02:05,970 --> 00:02:13,620
at this point. The closer we are to negative one for the correlation coefficient, the more closely

30
00:02:13,620 --> 00:02:17,940
a regression line with a negative slope describes the data.

31
00:02:17,940 --> 00:02:25,230
So if the correlation coefficient is exactly negative one, it means that all of the points in the scatterplot,

32
00:02:25,230 --> 00:02:28,500
all the points in the data set lie along the regression line

33
00:02:28,500 --> 00:02:29,100
exactly.

34
00:02:29,100 --> 00:02:31,380
They all lie directly on the line.

35
00:02:31,380 --> 00:02:33,480
None of the points are away from the line.

36
00:02:33,480 --> 00:02:40,200
So for instance, if we say that this is the regression line, maybe it looks something like this,

37
00:02:40,200 --> 00:02:47,220
then all of the points in the scatterplot lie directly on this line.

38
00:02:47,220 --> 00:02:50,100
There are no points away from the line at all.

39
00:02:50,100 --> 00:02:58,050
Whereas if the correlation coefficient is equal to positive one, that means that a regression line

40
00:02:58,050 --> 00:03:06,900
with a positive slope perfectly describes the relationship in the data, perfectly describes the position

41
00:03:06,900 --> 00:03:08,820
of all the points in the scatterplot.

42
00:03:08,820 --> 00:03:16,320
So that means that all of the points in our scatterplot lie exactly on this regression line.

43
00:03:16,320 --> 00:03:25,320
If the correlation coefficient is exactly zero, that means that a line doesn't do a good job at all

44
00:03:25,320 --> 00:03:32,730
of describing the relationship in the data, which means our scatterplot might look something like this,

45
00:03:32,730 --> 00:03:34,410
kind of like a blob.

46
00:03:34,410 --> 00:03:37,830
There's no obvious linear relationship here at all.

47
00:03:37,830 --> 00:03:45,600
We can't really identify any line that starts to describe some kind of trend in this scatterplot. And

48
00:03:45,600 --> 00:03:48,210
if we lie somewhere in between,

49
00:03:48,210 --> 00:03:57,180
so if r is between zero and negative one, that means that our regression line has a negative slope.

50
00:03:57,180 --> 00:04:01,800
So it might look something like this, but that it doesn't perfectly describe the data.

51
00:04:01,800 --> 00:04:08,880
So we might have some points along the line, but we'll also have some points away from the line like

52
00:04:08,880 --> 00:04:09,420
this.

53
00:04:09,420 --> 00:04:12,210
But we see this general negative trend.

54
00:04:12,210 --> 00:04:18,089
And then of course that means that a correlation coefficient between zero and positive one,

55
00:04:18,089 --> 00:04:23,910
so somewhere over here, means that the regression line is going to have a positive slope.

56
00:04:23,910 --> 00:04:30,390
So maybe something like that, but that the data will not lie all along the line.

57
00:04:30,390 --> 00:04:37,020
So we might have some points on the line, but we'll also have some points away from the line like this.

58
00:04:37,110 --> 00:04:43,470
Don't confuse the value of the correlation coefficient r with the slope of the regression line.

59
00:04:43,470 --> 00:04:50,100
So imagine here all of these regression lines, this regression line, this one we already sketched

60
00:04:50,100 --> 00:04:52,860
and this one right here.

61
00:04:53,010 --> 00:04:59,550
If we had scatter plots for each of these regression lines where all the points in the scatterplot lie

62
00:04:59,940 --> 00:05:05,910
perfectly along the regression line, then each of these scatter plots would have a correlation coefficient

63
00:05:05,910 --> 00:05:06,900
of one.

64
00:05:06,900 --> 00:05:11,060
But you can see that the slope of the regression line is different for each plot.

65
00:05:11,070 --> 00:05:17,460
In other words, a correlation coefficient between zero and one just means that the regression line

66
00:05:17,490 --> 00:05:23,430
has a positive slope and a correlation coefficient between zero and negative

67
00:05:23,430 --> 00:05:30,930
one means that the regression line has a negative slope, but the value of the correlation coefficient

68
00:05:30,930 --> 00:05:33,910
doesn't actually tell us the slope of the regression line.

69
00:05:33,930 --> 00:05:40,020
All it tells us is that the closer we are to positive one, the more tightly clustered the points are

70
00:05:40,020 --> 00:05:45,240
around a regression line with a positive slope, and the closer the correlation coefficient is to negative

71
00:05:45,240 --> 00:05:51,240
one, the more tightly clustered the points in the scatterplot are along a regression line with a negative

72
00:05:51,240 --> 00:05:51,810
slope.
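A minimal sketch of this point, using made-up data rather than anything from the video: points that lie exactly on a line give r of plus or minus one no matter how steep the line is. The helper name `pearson_r`, the intercept of 2, and the slopes tried are all hypothetical choices for illustration.

```python
import statistics


def pearson_r(xs, ys):
    """Sample correlation via the z-score form: sum(z_x * z_y) / (n - 1)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


xs = [0, 1, 2, 3, 4]
for slope in (0.5, 1, 3, -2):            # hypothetical slopes
    ys = [slope * x + 2 for x in xs]     # every point lies exactly on the line
    print(slope, round(pearson_r(xs, ys), 6))
```

Up to floating-point rounding, every positive slope gives the same r, and the negative slope only flips the sign, so r on its own cannot recover the slope.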

73
00:05:51,840 --> 00:05:57,660
Now remember that to calculate the correlation coefficient, we can use any of these formulas here.

74
00:05:57,660 --> 00:06:04,260
If we use this third one, we can create a small chart to quickly make this calculation for us.

75
00:06:04,260 --> 00:06:06,690
And here's what that chart should look like.

76
00:06:06,690 --> 00:06:14,490
So if we have raw data given by these columns X and Y, so these are the columns with our raw data,

77
00:06:14,490 --> 00:06:19,890
the values of X are 0, 2, 4, 6, 8, 10, and 12, and the values of Y that correspond

78
00:06:19,890 --> 00:06:22,880
to each of those X values are given in this column here.

79
00:06:22,890 --> 00:06:26,790
In other words, we have coordinate points here x, y.

80
00:06:26,790 --> 00:06:30,540
So one coordinate point in the scatterplot is (0, 0.8).

81
00:06:30,540 --> 00:06:36,390
Another coordinate point is (2, 1), another coordinate point is (4, 0.2), etc. So we have all these

82
00:06:36,390 --> 00:06:37,350
coordinate points.

83
00:06:37,350 --> 00:06:39,840
You can see that there are seven data points.

84
00:06:39,840 --> 00:06:45,540
And so if we're using this formula to find the correlation coefficient, you can see here that we're

85
00:06:45,540 --> 00:06:51,180
going to need the mean of X and the mean of Y, which means that we sum all the values of X, we get

86
00:06:51,180 --> 00:06:56,010
42 and then we divide by the number of data points we have, which is seven.

87
00:06:56,010 --> 00:06:58,230
We have seven data points here.

88
00:06:58,230 --> 00:07:01,680
So we divide 42 by seven and we get a mean of six.

89
00:07:01,770 --> 00:07:06,420
We sum all the values for y, divide by seven, and we get a mean of 0.8.

90
00:07:06,630 --> 00:07:12,840
So that means that the mean of x here is six, the mean of Y is 0.8.

91
00:07:12,840 --> 00:07:19,680
So we have those two values and then for each value of x, we need to find x sub i minus the mean and

92
00:07:19,680 --> 00:07:23,340
for each value of y we need to find y sub i minus the mean.

93
00:07:23,340 --> 00:07:29,010
So for these next two columns here, to get this negative six, we take this particular value of x, zero, and

94
00:07:29,010 --> 00:07:34,260
we subtract the mean, so zero minus six. To get negative four, we take two minus six.

95
00:07:34,260 --> 00:07:37,230
To get negative two, we take four minus six, etc.

96
00:07:37,230 --> 00:07:45,870
So to find each of these values here, we subtract the corresponding mean from each data point and we

97
00:07:45,870 --> 00:07:47,430
get these two columns.

98
00:07:47,430 --> 00:07:52,800
Now our formula tells us that we have to multiply each of those values together, so each x value multiplied

99
00:07:52,800 --> 00:07:53,940
by each y value.

100
00:07:53,940 --> 00:07:59,310
So negative six times zero gives us zero, negative four times 0.2 gives us -0.8.

101
00:07:59,310 --> 00:08:00,960
So we find that product.

102
00:08:00,960 --> 00:08:08,100
In other words, when we multiply these two together, we get this product here and then summing all

103
00:08:08,100 --> 00:08:09,300
of those products.

104
00:08:09,300 --> 00:08:17,310
So to find this sum or the entire numerator here, we take the sum of all of these products and we see

105
00:08:17,310 --> 00:08:18,990
that sum right there.

106
00:08:18,990 --> 00:08:22,050
So 1.6 is the value of our numerator.

107
00:08:22,050 --> 00:08:28,710
And then in the denominator we again have these x sub i minus the mean of x values, but we need to square

108
00:08:28,710 --> 00:08:29,070
them.

109
00:08:29,070 --> 00:08:34,200
So we use this column here to square all the values in this column.

110
00:08:34,200 --> 00:08:41,460
In other words, we are getting this squared value here in this column and we're getting this squared

111
00:08:41,460 --> 00:08:45,780
value here in this column, and then we need those sums.

112
00:08:45,780 --> 00:08:53,880
So to get this sum here and then separately this sum here, we need to add all the values in each column.

113
00:08:53,880 --> 00:08:57,480
So we get that sum and that sum.

114
00:08:57,480 --> 00:09:03,000
And then taking the square root of the product of those two is the same as taking the square root of

115
00:09:03,000 --> 00:09:07,770
each of them individually and then multiplying them together so we can take the square root of both

116
00:09:07,770 --> 00:09:09,030
of these values.

117
00:09:09,030 --> 00:09:12,600
So we get those square roots here and here.

118
00:09:12,600 --> 00:09:16,890
And if we multiply these together, we get the denominator.

119
00:09:16,890 --> 00:09:22,860
So this entire denominator here is the product of these two values here.

120
00:09:22,860 --> 00:09:29,250
So if we multiply these two together and then we take 1.6 divided by the product of these two, we get

121
00:09:29,250 --> 00:09:30,510
the correlation coefficient

122
00:09:30,510 --> 00:09:39,570
r. And so for this data set here of x and y values, this raw data, our correlation

123
00:09:39,570 --> 00:09:42,630
coefficient is 0.101.
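The chart calculation can be sketched in a few lines of Python. The x-values and the first three y-values are the ones read out in the video, and the y-value of 2 at x = 8 matches the point discussed later. The last two y-values are assumptions, chosen so that the totals agree with the stated mean of 0.8, the numerator of 1.6, and r of about 0.101.

```python
import math

xs = [0, 2, 4, 6, 8, 10, 12]
# The last two y-values (0.8 and 0.6) are ASSUMED, not read out in the video.
ys = [0.8, 1.0, 0.2, 0.2, 2.0, 0.8, 0.6]

n = len(xs)
mean_x = sum(xs) / n                      # 42 / 7 = 6
mean_y = sum(ys) / n                      # about 0.8

dx = [x - mean_x for x in xs]             # x_i minus the mean of x
dy = [y - mean_y for y in ys]             # y_i minus the mean of y

sum_products = sum(a * b for a, b in zip(dx, dy))   # numerator
sum_dx2 = sum(a * a for a in dx)                    # sum of squared x deviations
sum_dy2 = sum(b * b for b in dy)                    # sum of squared y deviations

# Denominator: product of the two square roots, as in the chart.
r = sum_products / (math.sqrt(sum_dx2) * math.sqrt(sum_dy2))
print(round(sum_products, 4), round(r, 3))
```

With these values the numerator comes out to about 1.6 and r rounds to 0.101, in line with the chart.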

124
00:09:42,630 --> 00:09:49,530
So the fact that this value is positive tells us that the regression line through this scatterplot is

125
00:09:49,530 --> 00:09:51,540
going to have a positive slope.

126
00:09:51,690 --> 00:09:56,760
But the fact that this value is very close to zero instead of positive one,

127
00:09:56,760 --> 00:09:59,580
so the value is maybe right about here, means

128
00:09:59,810 --> 00:10:02,210
the correlation is very weak.

129
00:10:02,210 --> 00:10:06,980
We don't have a strong correlation like we do over here where all the points in the scatterplot lie

130
00:10:06,980 --> 00:10:08,630
exactly on the regression line.

131
00:10:08,630 --> 00:10:14,120
Instead, the points are very spread out away from the regression line, but the regression line does

132
00:10:14,120 --> 00:10:15,680
have a positive slope.

133
00:10:15,680 --> 00:10:22,940
So if we replace this table with a scatterplot of the raw data, then we can see the relationship we

134
00:10:22,940 --> 00:10:23,870
were talking about.

135
00:10:23,870 --> 00:10:28,610
Here's the regression line through the scatter plot of that raw data that we were just looking at.

136
00:10:28,610 --> 00:10:33,200
We can see that the regression line does have a positive slope because as we move to the right, the

137
00:10:33,200 --> 00:10:34,220
line moves up.

138
00:10:34,220 --> 00:10:41,120
So we've got a positive slope, but the data is very spread out away from the regression line.

139
00:10:41,120 --> 00:10:45,470
It's not tightly packed or tightly clustered close to the regression line.

140
00:10:45,470 --> 00:10:51,230
So it does make sense that we would find a correlation coefficient that is very close to zero, but

141
00:10:51,230 --> 00:10:55,370
still positive to indicate the positive slope of the regression line.

142
00:10:55,730 --> 00:11:03,140
That being said, as a general rule of thumb, we do classify the strength of the correlation based

143
00:11:03,140 --> 00:11:05,630
on the value of this correlation coefficient.

144
00:11:05,630 --> 00:11:11,960
So in particular, if the correlation coefficient is specifically between negative one and -0.7, we

145
00:11:11,960 --> 00:11:14,480
tend to call that a strong negative correlation.

146
00:11:14,480 --> 00:11:19,760
Whereas if we're kind of in that mid negative range between -0.7 and -0.3, we would

147
00:11:19,760 --> 00:11:22,130
call that a moderate negative correlation.

148
00:11:22,130 --> 00:11:28,130
And then if the correlation coefficient is between -0.3 and zero, we would call that a weak negative

149
00:11:28,130 --> 00:11:28,880
correlation.

150
00:11:28,880 --> 00:11:33,050
And then we have the mirror image of those relationships on the positive side.

151
00:11:33,050 --> 00:11:38,780
So between zero and 0.3, we have a weak positive correlation; between 0.3 and 0.7, a moderate

152
00:11:38,780 --> 00:11:42,800
positive correlation; and between 0.7 and one, a strong positive correlation.

153
00:11:42,800 --> 00:11:50,210
So our correlation coefficient of 0.101 would fall within this range right here.

154
00:11:50,210 --> 00:11:54,620
And so we would say that this data has a weak positive correlation.
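The rule of thumb can be sketched as a small function. The name `correlation_strength` is hypothetical. The video does not say which label the exact cutoffs 0.3 and 0.7 belong to, or what to call r of exactly zero, so the handling of those cases below is an assumption.

```python
def correlation_strength(r):
    """Label r using the rule-of-thumb cutoffs of 0.3 and 0.7."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie in the interval [-1, 1]")
    if r == 0:
        return "no linear correlation"   # assumed label for exactly zero
    size = abs(r)
    if size >= 0.7:                      # assumed: 0.7 itself counts as strong
        strength = "strong"
    elif size >= 0.3:                    # assumed: 0.3 itself counts as moderate
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(correlation_strength(0.101))   # weak positive
```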

155
00:11:54,620 --> 00:11:57,470
Now, related to the correlation coefficient,

156
00:11:57,470 --> 00:12:00,590
at this point we need to talk about the residual as well.

157
00:12:00,590 --> 00:12:05,330
Every data point has a residual.

158
00:12:05,330 --> 00:12:11,240
The residual for any data point in a scatterplot is the difference between the actual value,

159
00:12:11,240 --> 00:12:17,990
so the data point itself, and what the regression line predicts at that same value of the independent

160
00:12:17,990 --> 00:12:18,560
variable.

161
00:12:18,560 --> 00:12:26,600
So if we say here in this scatterplot that this is the horizontal x axis and this is the vertical y axis,

162
00:12:26,720 --> 00:12:34,310
we could look at this data point here as an example. Its actual value is two, but its predicted

163
00:12:34,310 --> 00:12:37,310
value is what we find along the regression line

164
00:12:37,310 --> 00:12:43,580
if we follow this data point down to its corresponding point along the regression line.

165
00:12:43,580 --> 00:12:46,760
So it looks like the predicted value.

166
00:12:46,760 --> 00:12:48,620
So this is the

167
00:12:49,440 --> 00:12:50,610
actual value,

168
00:12:50,610 --> 00:12:53,280
and then this value right here is the

169
00:12:55,200 --> 00:12:56,390
predicted value.

170
00:12:56,400 --> 00:13:01,520
It looks like this predicted value is maybe about 0.85, something like that, roughly.

171
00:13:01,530 --> 00:13:06,000
So the residual is the difference between that actual value and the predicted value.

172
00:13:06,030 --> 00:13:06,930
We can write it this way.

173
00:13:06,930 --> 00:13:13,500
The residual is actual minus predicted, or mathematically, e equals y minus y hat. We often

174
00:13:13,500 --> 00:13:16,140
also think about the residual as error,

175
00:13:16,140 --> 00:13:21,810
and so we represent the residual with the variable e. And then the actual value here is the actual y

176
00:13:21,810 --> 00:13:22,890
value of the data point.

177
00:13:22,890 --> 00:13:30,390
Remember, this data point here is a coordinate point x, y, and so it has its actual y value, but

178
00:13:30,390 --> 00:13:33,930
then there is the corresponding y value along the regression line.

179
00:13:33,930 --> 00:13:38,310
Remember that we write the equation of the regression line

180
00:13:38,310 --> 00:13:46,020
in terms of y hat: the equation of the regression line is y hat equals mx plus b, where m is the slope

181
00:13:46,020 --> 00:13:48,840
of the regression line and b is the y-intercept.

182
00:13:48,840 --> 00:13:53,160
So the predicted value along the regression line is always given by y hat.
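Written out, with the subscript i (an assumed notation) marking the individual data point, the definition is:

```latex
e_i = y_i - \hat{y}_i, \qquad \hat{y}_i = m x_i + b
```

For the example point above, with an actual value of 2 and a predicted value read off the graph as roughly 0.85, this gives a residual of roughly 2 - 0.85 = 1.15.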

183
00:13:53,160 --> 00:14:00,900
So if we look at the residuals for all the points in our scatterplot, we would show them like this.

184
00:14:00,900 --> 00:14:04,230
These are all the residuals.

185
00:14:04,230 --> 00:14:05,040
It's this difference.

186
00:14:05,040 --> 00:14:10,290
We could also think about it as distance, the vertical distance between each point and the regression line.

187
00:14:10,290 --> 00:14:15,000
So the points that are closer to the regression line will have a smaller residual.

188
00:14:15,000 --> 00:14:19,110
The points that are further from the regression line will have a larger residual.

189
00:14:19,110 --> 00:14:24,540
If the point in the scatterplot is below the regression line, its residual will be negative.

190
00:14:24,540 --> 00:14:29,220
So everything down here, all four of these points here will have a

191
00:14:30,000 --> 00:14:38,520
negative residual, whereas these three points up here will have a positive residual because of their

192
00:14:38,520 --> 00:14:41,040
position in relation to the regression line.

193
00:14:41,040 --> 00:14:48,210
And now that we understand this idea of the residual, we can define the regression line in a new way.

194
00:14:48,240 --> 00:14:54,870
What we can say is that for any regression line, any y hat equation representing a regression line,

195
00:14:54,870 --> 00:15:00,420
we know that the sum of all the residuals is equal to zero.

196
00:15:00,450 --> 00:15:04,550
That is another property we can use to characterize the regression line.

197
00:15:04,560 --> 00:15:11,400
The regression line will always make the sum of all of these signed distances equal to zero.

198
00:15:11,430 --> 00:15:14,310
That also means that the mean

199
00:15:15,260 --> 00:15:17,020
of the residuals is equal to zero.

200
00:15:17,030 --> 00:15:22,790
And in general, that should make sense because the idea of the regression line very, very broadly

201
00:15:22,790 --> 00:15:27,710
is that it's trying to balance out all of the data points in the scatterplot.

202
00:15:27,710 --> 00:15:33,440
And if we're creating balance in the scatterplot, we can think about that as trying to balance all

203
00:15:33,440 --> 00:15:37,670
of these residual values, the distance of each point to the regression line.

204
00:15:37,670 --> 00:15:40,040
And so the mean of the residuals is going to be zero.

205
00:15:40,070 --> 00:15:41,970
The sum of the residuals is going to be zero.
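One way to see why the residuals sum to zero, sketched here on the assumption that the line is fitted by least squares with an intercept b: the sum of squared residuals S is minimized, so its derivative with respect to b must vanish.

```latex
\begin{aligned}
S(m, b) &= \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2 \\
\frac{\partial S}{\partial b}
  &= -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right)
   = -2 \sum_{i=1}^{n} e_i = 0 \\
\Longrightarrow \quad \sum_{i=1}^{n} e_i &= 0,
  \qquad \bar{e} = \frac{1}{n} \sum_{i=1}^{n} e_i = 0
\end{aligned}
```

The same condition rearranges to b equals the mean of y minus m times the mean of x, which says the regression line passes through the point of means.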

206
00:15:41,990 --> 00:15:47,120
Now, we've already looked at in the past how to calculate the equation of the regression line.

207
00:15:47,120 --> 00:16:00,530
The equation of this regression line in particular is y hat equal to 0.0143x plus 0.7143.

208
00:16:00,530 --> 00:16:06,680
Both of these values for M and B are rounded to four decimal places, but this is the regression line

209
00:16:06,680 --> 00:16:07,430
equation.

210
00:16:07,430 --> 00:16:14,150
And so what we can do once we have the regression line equation is create a table of actual and predicted

211
00:16:14,150 --> 00:16:21,440
values and then use those to calculate the residual for each data point so we can create a chart.

212
00:16:21,440 --> 00:16:25,310
These first two columns are all of the raw data that we looked at earlier.

213
00:16:25,310 --> 00:16:31,940
These are all of the X values and then these are all of the Y values or the actual values for our equation

214
00:16:31,940 --> 00:16:32,570
here.

215
00:16:32,570 --> 00:16:39,590
The predicted values are the values we get from our y hat equation, our regression line equation.

216
00:16:39,590 --> 00:16:44,300
What we do is we take each value of x and we plug it into the y hat equation.

217
00:16:44,300 --> 00:16:49,520
So you can see here, if we plug in x equals zero, then this term goes away and we're left with y hat

218
00:16:49,520 --> 00:16:51,650
is equal to 0.7143.

219
00:16:51,650 --> 00:16:53,870
And we see that here 0.7143.

220
00:16:54,080 --> 00:16:59,990
So we're getting these predicted values out of the y hat equation and then we know that the residual

221
00:16:59,990 --> 00:17:04,400
we can calculate as y minus

222
00:17:05,180 --> 00:17:06,020
y hat.

223
00:17:06,020 --> 00:17:08,270
And so we find the value of the residual.

224
00:17:08,270 --> 00:17:10,190
We find the value of the error.

225
00:17:10,190 --> 00:17:16,220
And we see here, in fact, these three positive values that represent the three points above the regression

226
00:17:16,220 --> 00:17:21,000
line and the four negative values that represent the four points below the regression line.
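A rough sketch of that table in Python, using the rounded slope and intercept quoted in the video. As before, the last two y-values are assumptions (they are not read out in the video), picked to agree with the stated mean and correlation coefficient.

```python
xs = [0, 2, 4, 6, 8, 10, 12]
# The last two y-values (0.8 and 0.6) are ASSUMED, not read out in the video.
ys = [0.8, 1.0, 0.2, 0.2, 2.0, 0.8, 0.6]

m, b = 0.0143, 0.7143          # rounded slope and intercept from the video

predicted = [m * x + b for x in xs]                         # y hat
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # e = y - y hat

print(" x    y    y_hat   residual")
for x, y, y_hat, e in zip(xs, ys, predicted, residuals):
    print(f"{x:>2}  {y:.1f}  {y_hat:.4f}  {e:+.4f}")

positives = sum(1 for e in residuals if e > 0)   # points above the line
negatives = sum(1 for e in residuals if e < 0)   # points below the line
print(positives, "positive,", negatives, "negative")
print("sum of residuals:", round(sum(residuals), 3))
```

Because m and b are rounded to four decimal places, the sum of the residuals comes out very close to zero rather than exactly zero.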

227
00:17:21,020 --> 00:17:26,839
Now, the reason that we're looking at this chart is because we can use it to illustrate these two facts

228
00:17:26,839 --> 00:17:32,150
here that we just talked about, that the sum of the residuals is zero and that the mean of the residuals

229
00:17:32,150 --> 00:17:32,750
is zero.

230
00:17:32,780 --> 00:17:40,610
Because if we create a new scatterplot where we have our X values along the horizontal axis, just like

231
00:17:40,610 --> 00:17:47,410
up here, but now along the vertical axis, we plot the value of the residual.

232
00:17:47,420 --> 00:17:52,610
So notice here that in our vertical axis, the value of zero is here in the middle.

233
00:17:52,610 --> 00:17:57,400
And so we have some negative values below zero and some positive values above zero.

234
00:17:57,410 --> 00:18:05,420
If we plot each of these values of the residual in this new set of coordinate axes, we get this scatterplot

235
00:18:05,420 --> 00:18:05,900
here.

236
00:18:05,900 --> 00:18:08,090
Ignore the yellow line for a second.

237
00:18:08,090 --> 00:18:10,880
We get this scatterplot with all these new points.

238
00:18:10,880 --> 00:18:16,940
What we see is that they all correspond in terms of their position to the scatterplot up here.

239
00:18:16,940 --> 00:18:19,370
We've just changed the vertical axis.

240
00:18:19,370 --> 00:18:25,760
So now this point, instead of being plotted with a vertical value of two, is being plotted with a

241
00:18:25,760 --> 00:18:28,820
vertical value of about 1.17.

242
00:18:28,820 --> 00:18:30,500
We see that point here.

243
00:18:30,500 --> 00:18:36,800
These two points down here are the -0.57 and the -0.60.

244
00:18:36,830 --> 00:18:44,600
In other words, up here in this scatterplot, all of these points are plotted as (x, y) coordinate pairs;

245
00:18:44,600 --> 00:18:52,250
down here in this scatterplot, all of these points are plotted as (x, e) coordinate pairs.

246
00:18:52,990 --> 00:18:57,940
And when we use this new set of coordinate points, when our coordinate points are given by x and the

247
00:18:57,940 --> 00:19:05,350
residual e, and we use that set of coordinate points to find a new regression line equation, that new regression

248
00:19:05,350 --> 00:19:09,910
line equation is this line here, we'll call it y hat of the residuals.

249
00:19:09,910 --> 00:19:14,270
That new regression line equation is the equation y hat equals zero.

250
00:19:14,290 --> 00:19:20,440
In other words, another way to put this is that if in this plot, the points in the scatterplot are

251
00:19:20,440 --> 00:19:25,870
pretty well spaced out with a bunch above this y equals zero line and a bunch below this y equals

252
00:19:25,870 --> 00:19:26,620
zero line.

253
00:19:26,620 --> 00:19:34,240
That means that a linear model is going to do an okay job describing the trend in the original scatterplot

254
00:19:34,240 --> 00:19:36,070
with the original set of data.

255
00:19:36,070 --> 00:19:41,590
So if you remember, we said before that this regression line, this line of best fit, this line of

256
00:19:41,590 --> 00:19:46,000
least squares, we have a bunch of different names for this regression line, this trend line.

257
00:19:46,000 --> 00:19:53,470
We said before that we were always trying to minimize the sum of squares so we would look at each data

258
00:19:53,470 --> 00:20:01,090
point in the scatterplot and we would think about creating a square between that point and the regression

259
00:20:01,090 --> 00:20:01,450
line.

260
00:20:01,450 --> 00:20:10,330
So we would have a square like this and then we would have a square like this, and then this one up

261
00:20:10,330 --> 00:20:11,800
here would be really, really big.

262
00:20:11,800 --> 00:20:14,890
This square here would look something like that.

263
00:20:14,890 --> 00:20:18,130
Let's just go ahead and make this big one.

264
00:20:18,130 --> 00:20:19,450
So we have this big square.

265
00:20:19,450 --> 00:20:25,330
We said before that we were trying to minimize the sum of all of these square areas.

266
00:20:25,330 --> 00:20:32,290
Well, now that we know that the side of each square can be described as the residual E, if we just

267
00:20:32,290 --> 00:20:41,830
say that e sub n represents all of the residuals and we square that value, this is all of the squared

268
00:20:41,830 --> 00:20:45,970
residuals and this is the sum of all of the squared residuals.

269
00:20:45,970 --> 00:20:52,720
This is the value we're trying to minimize to find the very best fitting line through the data, through

270
00:20:52,720 --> 00:20:57,880
the scatterplot, to find the equation of the regression line, to find the equation of the trend line,

271
00:20:57,880 --> 00:21:00,100
the line of best fit, the line of least squares.

272
00:21:00,100 --> 00:21:06,520
This is the value we're trying to minimize when we describe that value in terms of this new idea we've

273
00:21:06,520 --> 00:21:12,850
introduced, which is the idea of the residual. Thinking about this in the context of the residual also

274
00:21:12,850 --> 00:21:19,060
gives us a better idea of why we look at the squares instead of just the pure distances.

275
00:21:19,060 --> 00:21:24,190
In other words, instead of just pure actual minus predicted, the distance from the data point to the

276
00:21:24,190 --> 00:21:25,030
regression line,

277
00:21:25,030 --> 00:21:31,300
we instead look at the squared area, and the reason is that in the context of the residual here

278
00:21:31,300 --> 00:21:36,310
we get a positive residual for any points that are above the least squares line.

279
00:21:36,310 --> 00:21:41,110
We get a negative residual for any points that are below the least squares line.

280
00:21:41,110 --> 00:21:46,630
And if we have a bunch of points above and below the line, those positive and negative distances would

281
00:21:46,630 --> 00:21:51,100
cancel each other out if we didn't square the values to turn them into areas.

282
00:21:51,100 --> 00:21:58,180
So if for this point we got a residual of positive one, and if for this point we got a residual of

283
00:21:58,180 --> 00:22:03,130
negative one, those two things would cancel each other out completely and we wouldn't necessarily be

284
00:22:03,130 --> 00:22:05,290
optimizing for either of those data points.

285
00:22:05,290 --> 00:22:09,550
When we square the residuals, we get two pieces of squared area.

286
00:22:09,550 --> 00:22:12,280
They both count as positive pieces of area.

287
00:22:12,280 --> 00:22:18,040
And so we can make sure to account for minimizing both of those areas when we're building the equation

288
00:22:18,040 --> 00:22:19,390
of our regression line.

289
00:22:19,390 --> 00:22:25,900
So we're always trying to minimize the sum of the squared residuals or the sum of these square areas.

290
00:22:25,900 --> 00:22:32,350
So that process of trying to minimize the residuals by minimizing the squares of the residuals, that's

291
00:22:32,350 --> 00:22:38,500
where we get that name least squares line or line of least squares, or we often call this process least

292
00:22:38,500 --> 00:22:39,520
squares regression.

293
00:22:39,520 --> 00:22:44,650
We're just trying to minimize the total area of all of these squares.
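To illustrate why the squares matter, here is a small sketch comparing the least-squares line with a second, deliberately worse line. The alternative slope of 0.2 is a hypothetical choice, and the last two y-values are assumptions, as they are not read out in the video. Both lines pass through the point of means, so both have residuals that sum to zero; only the sum of squared residuals tells them apart.

```python
xs = [0, 2, 4, 6, 8, 10, 12]
# The last two y-values (0.8 and 0.6) are ASSUMED, not read out in the video.
ys = [0.8, 1.0, 0.2, 0.2, 2.0, 0.8, 0.6]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n


def residual_sum(m, b):
    """Plain sum of residuals y - (m*x + b); positives and negatives cancel."""
    return sum(y - (m * x + b) for x, y in zip(xs, ys))


def sse(m, b):
    """Sum of squared residuals, the quantity least squares minimizes."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Least-squares slope and intercept.
m_ls = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs))
b_ls = mean_y - m_ls * mean_x

# A different line through the same point of means (hypothetical slope).
m_alt = 0.2
b_alt = mean_y - m_alt * mean_x

for label, m, b in (("least squares", m_ls, b_ls), ("alternative", m_alt, b_alt)):
    print(label, round(m, 4), round(b, 4),
          abs(residual_sum(m, b)) < 1e-9, round(sse(m, b), 4))
```

Both lines pass the "residuals sum to zero" check, but the least-squares line has the clearly smaller total squared area.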

