1
00:00:00,090 --> 00:00:04,920
Throughout this course, we've sort of been following this thread of covariance and then correlation

2
00:00:04,920 --> 00:00:06,780
or the Pearson correlation coefficient.

3
00:00:06,780 --> 00:00:13,110
And now we're going to continue that thread by looking at the coefficient of determination.

4
00:00:13,110 --> 00:00:20,550
So if you remember, covariance tells us about the direction of the relationship between two variables

5
00:00:20,550 --> 00:00:27,930
X and Y, but we don't really use covariance to analyze that relationship because covariance is sensitive

6
00:00:27,930 --> 00:00:30,270
to the scale of the two variables.

7
00:00:30,270 --> 00:00:34,170
So it can take on any value and it's hard to interpret.

8
00:00:34,170 --> 00:00:41,490
What we do instead is use covariance to calculate correlation, which we always indicate with r: Pearson's

9
00:00:41,490 --> 00:00:42,540
correlation coefficient,

10
00:00:42,540 --> 00:00:44,400
r, or just correlation.

11
00:00:44,400 --> 00:00:50,220
And this value tells us about the direction of the relationship between the variables, but also about

12
00:00:50,220 --> 00:00:51,780
the strength of the relationship.

13
00:00:51,780 --> 00:00:57,000
In other words, it kind of answers the question how close are the data points to the regression line

14
00:00:57,000 --> 00:00:59,370
or the line of least squares, or the line of best fit.

15
00:00:59,370 --> 00:01:04,319
And if you remember, we talked about in the last lecture that correlation can take on values between

16
00:01:04,319 --> 00:01:05,670
negative one and one.

17
00:01:05,670 --> 00:01:11,460
So a value of exactly negative one tells us that all of the points in the data set, all the points

18
00:01:11,460 --> 00:01:17,310
in the scatterplot lie exactly on the regression line, and that the regression line has a negative

19
00:01:17,310 --> 00:01:17,820
slope.

20
00:01:17,850 --> 00:01:23,400
So it would tell us that the relationship between X and Y is as strong as it can possibly be in the

21
00:01:23,400 --> 00:01:29,850
negative direction, whereas a correlation of one would tell us that all of the points in the scatterplot

22
00:01:29,850 --> 00:01:35,700
lie exactly on the regression line and that the regression line has some positive slope.

23
00:01:35,700 --> 00:01:40,110
So again, the correlation is very strong and the direction is positive.

24
00:01:40,110 --> 00:01:45,840
So it gives us an idea of how close the data points are to the line of best fit in addition to some

25
00:01:45,840 --> 00:01:46,980
sense of direction.

26
00:01:46,980 --> 00:01:53,940
Now, if we continue this thread one step further, we can think about the coefficient of determination

27
00:01:53,940 --> 00:01:55,920
which really answers the question for us:

28
00:01:55,920 --> 00:01:59,610
how much error is eliminated by using the regression line?

29
00:01:59,610 --> 00:02:04,380
Or we could continue the sentence instead of using the mean of the dependent variable.

30
00:02:04,590 --> 00:02:11,340
So we indicate the coefficient of determination with this lowercase r squared because it is exactly

31
00:02:11,340 --> 00:02:17,130
the square of correlation lowercase r, which means that assuming we have correlation in order to find

32
00:02:17,130 --> 00:02:22,380
the coefficient of determination, all we have to do is square the value of the correlation.

33
00:02:22,380 --> 00:02:26,640
Realize, though, that that only works for simple linear regression.

34
00:02:26,640 --> 00:02:32,460
So we have to be doing linear regression instead of curve fitting the data with a curve of some other

35
00:02:32,460 --> 00:02:33,120
shape.

36
00:02:33,120 --> 00:02:38,220
And we have to have a single independent variable that's affecting a single dependent variable.

37
00:02:38,220 --> 00:02:43,620
In other words, if we're doing multiple regression and/or we're doing nonlinear regression, then

38
00:02:43,620 --> 00:02:46,920
the coefficient of determination is given by capital R squared.

39
00:02:46,920 --> 00:02:53,400
And we need to say that capital R squared is not simply the square of lowercase r, the correlation

40
00:02:53,400 --> 00:02:55,080
or Pearson's correlation coefficient.

41
00:02:55,080 --> 00:03:00,360
So we find this in a different way, but for now we're sticking with simple linear regression, in which

42
00:03:00,360 --> 00:03:06,690
case we can find the coefficient of determination by squaring correlation r, which means that the value

43
00:03:06,690 --> 00:03:12,570
of the coefficient of determination is going to fall between zero and one, and we usually give r

44
00:03:12,570 --> 00:03:14,550
squared in terms of a percentage.

45
00:03:14,550 --> 00:03:19,890
This makes sense because if we think about correlation as taking on values between negative one and

46
00:03:19,890 --> 00:03:26,370
positive one, when we square any value in this interval, we'll end up with values only in the interval

47
00:03:26,370 --> 00:03:27,390
0 to 1.

48
00:03:27,390 --> 00:03:33,780
So that's why this interval of values becomes this interval of values when we square r to get r squared,

49
00:03:33,780 --> 00:03:36,780
which means when we say how much error is eliminated.

50
00:03:36,780 --> 00:03:41,700
What we're really saying here is what percentage of the error is eliminated when we use the regression

51
00:03:41,700 --> 00:03:45,390
line instead of just the mean of the dependent variable.

52
00:03:45,390 --> 00:03:50,820
One last thing we want to say about capital R squared here, the coefficient of determination for multiple

53
00:03:50,820 --> 00:03:52,860
regression or for nonlinear regression.

54
00:03:52,860 --> 00:04:01,350
This value capital r squared can take on values in the interval negative infinity to one, not just

55
00:04:01,350 --> 00:04:02,790
the interval 0 to 1.
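
To see how a value below zero can come about, here is a minimal sketch using the general definition, one minus the squared error of the predictions over the squared error around the mean. The data and the deliberately bad predictions are hypothetical and are not from the lecture; the point is only that predictions worse than the mean push the result negative.

```python
import numpy as np

# Hypothetical observed values and deliberately poor predictions
# (illustration only, not the lecture's data set).
y = np.array([1.0, 2.0, 3.0, 4.0])
bad_predictions = np.array([4.0, 3.0, 2.0, 1.0])  # gets the trend backwards

# Squared error of the predictions, and squared error around the mean of y.
sse = np.sum((y - bad_predictions) ** 2)
sst = np.sum((y - y.mean()) ** 2)

# General coefficient of determination: 1 - SSE / SST.
big_r_squared = 1 - sse / sst
print(big_r_squared)  # negative, because the predictions are worse than the mean
```

Any set of predictions with more squared error than the mean line gives a ratio above one, so the result drops below zero.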

56
00:04:02,790 --> 00:04:08,010
But in either case, whether we're looking at lowercase r squared or capital R squared, the closer

57
00:04:08,010 --> 00:04:14,790
the coefficient of determination is to a value of one or 100%, the more we can say that the regression

58
00:04:14,790 --> 00:04:19,260
line is a better estimate of the data than just using the mean instead.

59
00:04:19,260 --> 00:04:25,170
In other words, when we have a value for R squared that's closer to one, it means that the independent

60
00:04:25,170 --> 00:04:30,690
variable has more explanatory power about the dependent variable, or that there's a higher percentage

61
00:04:30,690 --> 00:04:34,110
of the variation in Y that's explained by x.

62
00:04:34,110 --> 00:04:41,610
So R squared really just tells us the goodness of fit of our regression line or the strength of our

63
00:04:41,610 --> 00:04:42,810
linear regression model.

64
00:04:42,810 --> 00:04:48,270
So with some of that background out of the way, let's actually look at a scatterplot so that we can

65
00:04:48,270 --> 00:04:51,030
get a visual understanding of what we're trying to say here.

66
00:04:51,240 --> 00:04:57,690
So over the last couple of lectures we've been working with this data set here, we've looked at it

67
00:04:57,690 --> 00:04:58,350
a few times.

68
00:04:58,350 --> 00:04:59,700
We have the values of the independent

69
00:04:59,790 --> 00:05:03,110
variable X and the values of the dependent variable Y.

70
00:05:03,120 --> 00:05:07,890
And so each of these pairs gives us a coordinate point in our scatterplot.

71
00:05:07,890 --> 00:05:14,310
So if we look at the point here, (0, 0.8), we find it right here in our scatterplot and these are identical

72
00:05:14,310 --> 00:05:15,000
scatter plots.

73
00:05:15,000 --> 00:05:18,240
They just have a different line through them, which we'll talk about in a second.

74
00:05:18,240 --> 00:05:20,250
But the points are all the same.

75
00:05:20,340 --> 00:05:21,720
So we have this data set.

76
00:05:21,720 --> 00:05:27,120
Each of these points is plotted in the scatterplot, and in this left-hand scatterplot,

77
00:05:27,120 --> 00:05:31,290
the yellow line represents the mean of the dependent variable y.

78
00:05:31,290 --> 00:05:36,840
So we calculated this before, but the mean of all of these Y values is 0.8.

79
00:05:36,870 --> 00:05:41,310
So if we just sketch in the line Y equals 0.8,

80
00:05:41,310 --> 00:05:43,170
we have this line here.

81
00:05:43,170 --> 00:05:49,530
If you think about it, this is the most crude basic way of coming up with some kind of a regression

82
00:05:49,530 --> 00:05:52,020
line for the data in a scatterplot.

83
00:05:52,020 --> 00:05:57,300
All we're doing here is adding up all of the Y values in our scatterplot and dividing by the number

84
00:05:57,300 --> 00:05:58,380
of points that we have.

85
00:05:58,380 --> 00:06:03,750
And then we get that average or that mean and we just sketch the line in at that mean.

86
00:06:03,750 --> 00:06:10,740
And in theory we could call that a line that plots some kind of a path through the scatterplot.

87
00:06:10,770 --> 00:06:17,670
It doesn't do a very elegant job, but it does try somewhat to balance the error that we find in our

88
00:06:17,670 --> 00:06:18,450
data set.

89
00:06:18,450 --> 00:06:26,280
In fact, we can see here that if we look at the error between each data point and the line representing

90
00:06:26,280 --> 00:06:32,580
the mean, so we look at each of those distances, the distance here, the distance here, the distance

91
00:06:32,580 --> 00:06:35,370
from each point to the line.

92
00:06:35,370 --> 00:06:36,480
So all these

93
00:06:37,170 --> 00:06:37,950
different

94
00:06:38,530 --> 00:06:41,030
distances, if we add them all up,

95
00:06:41,050 --> 00:06:43,690
the sum of those distances is zero.

96
00:06:43,720 --> 00:06:51,040
We see that in this column right here, the Y minus Y bar column, and that's because we treat the error

97
00:06:51,040 --> 00:06:56,950
for data points below the mean as negative and the error for data points above the mean as positive.

98
00:06:56,950 --> 00:07:03,640
And so the line representing the mean here just balances out those positive and negative values by finding

99
00:07:03,640 --> 00:07:07,980
the value at which those positive and negative values would sum to zero.

100
00:07:07,990 --> 00:07:10,570
In fact, that's the definition of the mean.
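
A quick way to check that balancing property is to compute it. This is a small sketch with made-up y values (hypothetical, chosen only so the mean is 0.8 like the lecture's); the signed distances cancel, while the squared distances do not.

```python
import numpy as np

# Hypothetical y values whose mean is 0.8 (not the lecture's table).
y = np.array([0.3, 1.2, 0.5, 1.4, 0.6])

y_bar = y.mean()
deviations = y - y_bar  # signed distance of each point from the mean line

total = deviations.sum()                 # cancels to zero, up to floating-point rounding
total_squared = np.sum(deviations ** 2)  # squared distances cannot cancel

print(total, total_squared)
```

That is why the lecture moves on to squared error: the raw sum says nothing about how far the points are from the line.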

101
00:07:10,570 --> 00:07:15,280
But remember that when we're talking about error, we always talk about squared error.

102
00:07:15,280 --> 00:07:23,020
And so if we square each of these distances, if we create an actual square from each of these, so

103
00:07:23,020 --> 00:07:26,680
something that looks like this for each

104
00:07:27,260 --> 00:07:31,520
one of these, and then we have this big square up top

105
00:07:32,240 --> 00:07:40,520
here, etc., we have a certain amount of squared error when we use the mean as our theoretical line of

106
00:07:40,520 --> 00:07:41,170
best fit.

107
00:07:41,180 --> 00:07:47,150
In this last column of this chart, we're finding the area of each of these squares and if we add up

108
00:07:47,150 --> 00:07:52,250
all of that squared area, we get a total sum of 2.24.

109
00:07:52,280 --> 00:07:54,980
That's the total area of all of these squares

110
00:07:54,980 --> 00:07:57,700
when we use the mean for this line here.

111
00:07:57,710 --> 00:08:03,560
Now, if instead we use the regression line, the line of best fit, and we found the equation of this

112
00:08:03,560 --> 00:08:10,130
line earlier, what we expect is that this regression line should do a better job of fitting the data

113
00:08:10,130 --> 00:08:12,080
than just using the mean.

114
00:08:12,080 --> 00:08:13,760
The value of r squared,

115
00:08:13,760 --> 00:08:18,860
the coefficient of determination, tells us exactly how much of a better job this line will do.

116
00:08:19,040 --> 00:08:24,020
So what we can do is a really similar calculation to the one that we did in this table.

117
00:08:24,020 --> 00:08:30,650
Again, we have the same data set with these raw values of X and Y, and we can use the value of X to

118
00:08:30,650 --> 00:08:36,980
find all the values of Y hat, all of the predicted values, the values predicted by the regression line,

119
00:08:36,980 --> 00:08:41,270
by plugging each of these values of X into this equation for the regression line.

120
00:08:41,270 --> 00:08:48,320
So we get the value y hat that corresponds to each of these values of X, and then we take y minus y

121
00:08:48,350 --> 00:08:51,170
hat, which remember is the residual.

122
00:08:51,200 --> 00:08:58,730
This is the residual e, the error, that is, the distance from the actual value y in the coordinate point

123
00:08:58,730 --> 00:09:02,030
to Y's predicted value along the regression line.

124
00:09:02,030 --> 00:09:09,590
So if y minus y bar is the distance of each point to the line representing the mean, then y minus y

125
00:09:09,590 --> 00:09:14,360
hat is the distance from each point to the regression line, the line of best fit.

126
00:09:14,360 --> 00:09:22,280
So these values from this column are again all of these distances, which because this line is a little

127
00:09:22,280 --> 00:09:27,380
different than the mean line, these distances will be slightly different

128
00:09:28,020 --> 00:09:29,700
than the ones that we see over here.

129
00:09:29,700 --> 00:09:38,280
And then when we square those values, we create a square from each of those distances.

130
00:09:38,580 --> 00:09:43,380
So we create a square for each of these, and when we add up

131
00:09:44,230 --> 00:09:52,700
the area of all of these squares, we get the sum of the squared area.

132
00:09:52,720 --> 00:10:00,820
And for us, that total sum here, when we're using the regression line, is this value here, approximately 2.217.

133
00:10:00,820 --> 00:10:07,510
So what we can see is that in this particular example, this value, the amount of area we have over

134
00:10:07,510 --> 00:10:14,710
here is just barely smaller than the amount of area we had over here, which means that the regression

135
00:10:14,710 --> 00:10:21,970
line in this case for this particular dataset does a very, very, very slightly better job at fitting

136
00:10:21,970 --> 00:10:25,240
the data than the mean line itself.

137
00:10:25,240 --> 00:10:31,180
And we can use this formula for the coefficient of determination to figure out exactly what that percentage

138
00:10:31,180 --> 00:10:31,630
is.

139
00:10:31,630 --> 00:10:35,050
So if we do our calculation here, we get one minus.

140
00:10:35,080 --> 00:10:38,230
This is the sum of the squared residuals.

141
00:10:38,230 --> 00:10:40,690
It's the 2.217 value we found.

142
00:10:40,690 --> 00:10:43,840
So this is approximately 2.217.

143
00:10:43,840 --> 00:10:49,420
And then this sum in the denominator is the 2.24 that we calculated.

144
00:10:50,230 --> 00:10:57,310
And so if we take this decimal number divided by this decimal number and then we subtract that result

145
00:10:57,310 --> 00:11:02,610
from one, we get approximately 0.0102.

146
00:11:02,620 --> 00:11:11,260
Or if we change that to a percentage, we get approximately 1.02%, which means that using this regression

147
00:11:11,260 --> 00:11:19,360
line compared to using the mean line eliminates about 1.02% of the error, which we could confidently

148
00:11:19,360 --> 00:11:21,550
say is really not very good.
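
The arithmetic here can be reproduced from the two sums quoted in the lecture. This sketch uses the rounded values 2.217 and 2.24, so the result lands near 0.0103 rather than exactly on the lecture's 0.0102; the gap comes only from rounding the inputs. The variable names are just labels for this example.

```python
# Sums quoted in the lecture (both rounded).
sse = 2.217  # sum of squared residuals around the regression line
sst = 2.24   # sum of squared distances around the mean line

# Coefficient of determination: 1 - SSE / SST.
r_squared = 1 - sse / sst
percent_eliminated = 100 * r_squared

print(f"{r_squared:.4f}")
print(f"{percent_eliminated:.2f}%")
```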

149
00:11:21,550 --> 00:11:26,920
Remember, ideally, if the regression line is a great fit for the data, we would have an r squared

150
00:11:26,920 --> 00:11:29,050
value much closer to

151
00:11:29,800 --> 00:11:32,670
100% than to 1%.

152
00:11:32,680 --> 00:11:37,420
And we can actually see this 1% value reflected in these scatter plots.

153
00:11:37,420 --> 00:11:42,370
With the mean line and the regression line here, you can tell just by looking at these that the regression

154
00:11:42,370 --> 00:11:49,750
line is barely any different at all than the mean line. The square areas look almost identical, and

155
00:11:49,750 --> 00:11:55,690
we can see that the data points in the scatterplot are spread out far away from both the mean line and

156
00:11:55,690 --> 00:11:56,620
the regression line.

157
00:11:56,620 --> 00:12:00,670
So neither of these lines does a great job fitting the data.

158
00:12:00,670 --> 00:12:06,280
The regression line does about a 1% better job, but still overall a pretty poor job.

159
00:12:06,280 --> 00:12:11,380
But in general here, just remember that we're saying that the better that the linear regression line

160
00:12:11,380 --> 00:12:17,650
fits the data in comparison to just the mean line, the closer the value of the coefficient of determination

161
00:12:17,650 --> 00:12:22,030
will be to 100%, or to one

162
00:12:22,030 --> 00:12:25,060
if we're using a decimal number. Here, we ended up with 0.01.

163
00:12:25,060 --> 00:12:29,890
We want that closer to one if our linear regression line is eliminating a lot of error.

164
00:12:30,130 --> 00:12:34,630
So here we calculated the coefficient of determination from scratch.

165
00:12:34,630 --> 00:12:42,280
But in the last video we did look at correlation and we found that correlation for this same dataset

166
00:12:42,280 --> 00:12:45,520
was approximately 0.101.

167
00:12:45,520 --> 00:12:51,910
And if we take the square of this value, we do get close to this 0.0102 value.

168
00:12:51,910 --> 00:12:56,980
This original correlation value is rounded, this value is a little bit approximate, but we could have

169
00:12:56,980 --> 00:13:03,430
just taken our correlation and squared it and come up with the coefficient of determination.

170
00:13:03,430 --> 00:13:07,810
Or we can use this formula to calculate it and work with the data this way.
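
Both routes can be compared side by side. The sketch below uses a small hypothetical data set (the lecture's table is not reproduced in this transcript) and NumPy's `corrcoef` and `polyfit`; for a least-squares line with an intercept, squaring r and computing one minus the ratio of squared errors should agree up to floating-point rounding.

```python
import numpy as np

# Hypothetical (x, y) pairs, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Route 1: square Pearson's correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
r_squared_from_r = r ** 2

# Route 2: fit the least-squares line, then compare squared errors.
slope, intercept = np.polyfit(x, y, 1)  # degree 1: returns slope, then intercept
y_hat = slope * x + intercept           # predicted values on the regression line
sse = np.sum((y - y_hat) ** 2)          # squared error around the regression line
sst = np.sum((y - y.mean()) ** 2)       # squared error around the mean line
r_squared_from_errors = 1 - sse / sst

print(r_squared_from_r, r_squared_from_errors)

# The lecture's rounded correlation, squared.
r_lecture = 0.101
print(r_lecture ** 2)
```

The agreement between the two routes only holds for simple linear regression, which is the case the lecture is sticking with.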

171
00:13:07,930 --> 00:13:14,320
Now the last thing that we want to talk about here is what's called root mean square error, or RMSE, or you'll

172
00:13:14,320 --> 00:13:17,200
also see it referred to as root mean square deviation,

173
00:13:17,470 --> 00:13:18,220
RMSD.

174
00:13:18,250 --> 00:13:20,340
They mean the same thing. Here are

175
00:13:20,350 --> 00:13:23,230
the formulas we use to calculate root mean square error.

176
00:13:23,260 --> 00:13:28,630
This is just two different ways of writing the same formula, but we can think about this value as the

177
00:13:28,630 --> 00:13:31,090
standard deviation of the residuals.

178
00:13:31,120 --> 00:13:35,560
Essentially what we're talking about here is really similar to the standard deviation we've looked at

179
00:13:35,560 --> 00:13:36,220
before.

180
00:13:36,250 --> 00:13:41,020
Remember that standard deviation is all about the spread of the data around the mean.

181
00:13:41,020 --> 00:13:44,590
And up to now we've been thinking about it in terms of a data distribution.

182
00:13:44,590 --> 00:13:47,650
Well, a scatterplot is like a data distribution.

183
00:13:47,650 --> 00:13:50,890
It shows us visually how our data is distributed.

184
00:13:50,890 --> 00:13:57,340
And in a plot like this one, we can see how it's distributed around the regression line,

185
00:13:57,340 --> 00:14:04,450
y hat. And so root mean square error, or root mean square deviation, is going to give us standard deviations

186
00:14:04,480 --> 00:14:06,340
around this regression line.

187
00:14:06,340 --> 00:14:13,150
And if you remember that 68% of the data lies within one standard deviation, that means that 68% of

188
00:14:13,150 --> 00:14:19,480
our data points are going to lie within this same kind of standard deviation, or that 95% of our data

189
00:14:19,480 --> 00:14:24,100
points are going to fall within two standard deviations of the regression line.

190
00:14:24,100 --> 00:14:29,740
So notice here that to calculate root mean squared error or root mean square deviation, we just take

191
00:14:29,740 --> 00:14:32,800
the y minus y hat, quantity squared values.

192
00:14:32,800 --> 00:14:36,940
We already have those in this last column here.

193
00:14:36,940 --> 00:14:38,230
We add them all up.

194
00:14:38,230 --> 00:14:42,070
Notice this formula tells us to sum those so we add them all up.

195
00:14:42,100 --> 00:14:49,570
The sum is about 2.217, and then we divide that sum by the number of data points that we have. In this

196
00:14:49,570 --> 00:14:55,240
case, for this data set, n equals seven because we have one, two, three, four, five, six, seven

197
00:14:55,240 --> 00:14:56,110
data points.

198
00:14:56,110 --> 00:15:02,740
So we divide this 2.217 by seven data points, and then we take the square root of that value.

199
00:15:02,740 --> 00:15:09,850
And in this case we get an approximate RMSE of 0.5628.

200
00:15:09,850 --> 00:15:15,100
So what that tells us is that the distance between each of these

201
00:15:15,700 --> 00:15:17,650
y-intercepts here,

202
00:15:18,620 --> 00:15:19,730
each of these

203
00:15:20,620 --> 00:15:23,650
distances, is 0.56

204
00:15:24,730 --> 00:15:25,600
28.

205
00:15:25,630 --> 00:15:31,000
We can see here that the intercept of the regression line is 0.7143.

206
00:15:31,150 --> 00:15:38,200
If we add 0.7143 to 0.5628, we get about 1.28 or so.

207
00:15:38,200 --> 00:15:43,750
And we see that the y intercept of this blue line here is about at that value.

208
00:15:43,780 --> 00:15:48,470
This is the upper edge of one standard deviation around this regression line.

209
00:15:48,490 --> 00:15:52,390
This is the lower edge of one standard deviation around the regression line.

210
00:15:52,390 --> 00:16:00,160
So we know that in this interval here we should expect to find 68% of our data points.
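
The numbers for that band can be checked directly from the values quoted in the lecture: a squared-residual sum of about 2.217, seven data points, and a regression intercept of 0.7143. This is only a sketch of the arithmetic; the 68% figure itself assumes the residuals are roughly normally distributed, which seven points cannot really confirm.

```python
import math

# Values quoted in the lecture (rounded).
sse = 2.217         # sum of squared residuals
n = 7               # number of data points
intercept = 0.7143  # y-intercept of the regression line

# Root mean square error: square root of the mean squared residual.
rmse = math.sqrt(sse / n)

# y-intercepts of the lines one RMSE above and below the regression line.
upper_intercept = intercept + rmse
lower_intercept = intercept - rmse

print(round(rmse, 4), round(upper_intercept, 2), round(lower_intercept, 2))
```

Shifting the regression line up and down by the RMSE keeps the slope and changes only the intercept, which is why the band edges are parallel to the regression line.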

211
00:16:00,160 --> 00:16:04,900
And then we haven't sketched the lower edge of two standard deviations around the mean.

212
00:16:04,900 --> 00:16:07,120
But we would have another red line down here.

213
00:16:07,120 --> 00:16:12,760
And between this lower red line and this upper red line, that would be two standard deviations around

214
00:16:12,760 --> 00:16:13,780
the regression line.

215
00:16:13,780 --> 00:16:19,330
And so we would expect to find 95% of all of our data points within that interval.

216
00:16:19,360 --> 00:16:25,750
Remember, those values are coming from the empirical rule, which tells us that, for roughly normal data, we find 68% of our

217
00:16:25,750 --> 00:16:34,120
data within one standard deviation, 95% of our data within two standard deviations, and about 99.7%

218
00:16:34,120 --> 00:16:36,210
of our data within three standard deviations.

219
00:16:36,220 --> 00:16:41,950
So just think about this as the standard deviation of the residuals, which means that the larger the

220
00:16:41,950 --> 00:16:43,900
value is of our RMSE,

221
00:16:43,930 --> 00:16:49,900
the further apart these lines will be, which means the more scattered our data points are and the weaker

222
00:16:49,900 --> 00:16:51,630
the correlation is in the data.

223
00:16:51,640 --> 00:16:57,700
If we find a smaller RMSE value, a smaller standard deviation of the residuals, that means these

224
00:16:57,700 --> 00:17:02,200
lines are going to be closer together, which means the data points are more tightly clustered around

225
00:17:02,200 --> 00:17:03,040
the regression line.

226
00:17:03,040 --> 00:17:07,960
And that means that we have a stronger correlation within our data set.