1
00:00:00,090 --> 00:00:07,830
Covariance and correlation both inform us about the relationship between two data series, and while

2
00:00:07,830 --> 00:00:13,830
correlation is the value we end up being more interested in because it's a more helpful metric, we

3
00:00:13,830 --> 00:00:19,410
have to calculate covariance on our way to correlation, which is why we want to start with an understanding

4
00:00:19,410 --> 00:00:20,550
of covariance.

5
00:00:20,760 --> 00:00:25,470
So with the covariance formula, we'll start here and then unpack it piece by piece.

6
00:00:25,650 --> 00:00:32,790
This covariance formula says that the covariance between two series given by X and Y is equal

7
00:00:32,790 --> 00:00:39,840
to, and this right-hand side basically just says, the mean of the square areas that are given between

8
00:00:39,840 --> 00:00:46,140
each particular data point, x sub i, y sub i, and the mean data point, x bar, y bar.

9
00:00:46,170 --> 00:00:49,710
Equivalently, we can write this formula this way.

10
00:00:49,710 --> 00:00:52,440
So these two formulas represent exactly the same thing.

11
00:00:52,440 --> 00:00:54,120
They're just written in different ways.

12
00:00:54,120 --> 00:01:00,150
This second formula here tells us that covariance can also be calculated as the mean product.

13
00:01:00,150 --> 00:01:07,110
So if we multiply each x sub i by its corresponding y sub i and then we take the mean of those products,

14
00:01:07,110 --> 00:01:12,030
we get this value here and then we subtract from that the product of the means.

15
00:01:12,030 --> 00:01:18,090
So the mean of all of the x sub i values, the mean of all the y sub i values, and we multiply those two

16
00:01:18,090 --> 00:01:20,040
means together to get the product of the means.

17
00:01:20,040 --> 00:01:22,530
We subtract that from the mean of the products.

18
00:01:22,530 --> 00:01:25,200
This is another way to calculate covariance.

19
00:01:25,200 --> 00:01:28,470
So when you see these two different formulas, they mean the same thing.
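
For reference, the two formulas described above can be written out roughly like this. This is a reconstruction from the spoken description, assuming the version on screen divides by n (as the "one over n" read out later in the lesson suggests); the exact notation on the slide may differ.

```latex
% Covariance as the mean of the products of deviations from the means
\operatorname{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

% Equivalent form: the mean of the products minus the product of the means
\operatorname{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i \;-\; \bar{x}\,\bar{y}
```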

20
00:01:28,470 --> 00:01:35,730
Now, to understand covariance, what we really want to realize is that this first formula looks a lot

21
00:01:35,730 --> 00:01:42,540
like the variance formula that we learned about earlier. To get back from this formula to the variance

22
00:01:42,540 --> 00:01:43,200
formula,

23
00:01:43,320 --> 00:01:46,680
what we want to do is replace Y with X.

24
00:01:46,680 --> 00:01:59,070
So we would say the covariance of x with itself is one over n times the sum from i equals one to n of x sub

25
00:01:59,070 --> 00:02:04,740
i minus the mean times x sub i minus the mean.

26
00:02:04,740 --> 00:02:10,530
Because again we're just replacing y with x here, and then we can see that we can rewrite

27
00:02:11,440 --> 00:02:18,220
this right-hand side as the square of a single binomial, because we have two identical binomials

28
00:02:18,220 --> 00:02:19,410
multiplied together here.

29
00:02:19,420 --> 00:02:25,540
So instead of saying x sub i minus x bar times the same thing, x sub i minus x bar, we can rewrite that

30
00:02:25,540 --> 00:02:29,500
as x sub i minus x bar, quantity squared.

31
00:02:29,500 --> 00:02:36,280
And we should now recognize this as the variance formula for the variance of x.

32
00:02:36,280 --> 00:02:43,330
And so because these right hand sides are equal, what we're essentially saying here is that the variance

33
00:02:43,330 --> 00:02:50,860
of a single variable X is equivalent to the covariance of that variable with itself.
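
Written as a single chain of equalities, the substitution just described would look something like this. It assumes the same 1/n form of the formulas used in this lesson; the symbols are a reconstruction of what is being pointed at on screen.

```latex
\operatorname{Cov}(x, x)
  = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})
  = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
  = \operatorname{Var}(x)
```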

34
00:02:50,860 --> 00:02:56,470
Given that relationship, if we sort of work backwards from this idea of variance.

35
00:02:56,470 --> 00:03:03,670
So thinking about variance, if we almost just think about a one dimensional number line and we have

36
00:03:03,670 --> 00:03:11,650
some data set, some data series given by this X variable and we say that the mean of that series is

37
00:03:11,650 --> 00:03:20,290
here at x bar. When we want to calculate the variance of this variable X, what we do is we look at all

38
00:03:20,290 --> 00:03:23,320
of the other values of X in the data set.

39
00:03:23,320 --> 00:03:31,570
So let's say that there's another value of X in the data set right here and right here and right here.

40
00:03:32,940 --> 00:03:39,300
And what the variance formula calculates is the mean of the square areas that are generated

41
00:03:39,300 --> 00:03:44,270
when we make a square out of the distance between each of these data points and the mean.

42
00:03:44,280 --> 00:03:50,730
So in other words, looking at this closest data point here, if we create a square between this data

43
00:03:50,730 --> 00:04:00,240
point and the mean and then we create another square between this data point and the mean and finally

44
00:04:00,240 --> 00:04:05,640
another square between this data point and the mean.

45
00:04:05,640 --> 00:04:08,700
Now we have three pieces of area.

46
00:04:08,730 --> 00:04:11,460
This green square is the largest amount of area.

47
00:04:11,460 --> 00:04:15,210
Then the white square and then the blue square is a smaller amount of area.

48
00:04:15,210 --> 00:04:21,360
What this variance formula gives us is the mean amount of area of these three squares or the average

49
00:04:21,360 --> 00:04:21,990
area.

50
00:04:21,990 --> 00:04:24,990
So let's say that this little blue square, let's just make up values.

51
00:04:24,990 --> 00:04:29,190
Let's say that this little blue square has a square area of one square unit.

52
00:04:29,190 --> 00:04:33,000
Let's say that this white square has a square area of four square units.

53
00:04:33,000 --> 00:04:36,510
And the green square has a square area of five square units.

54
00:04:36,510 --> 00:04:39,810
What we would do is add up all of these areas.

55
00:04:39,810 --> 00:04:42,270
So five plus four is nine, plus one is ten.

56
00:04:42,300 --> 00:04:46,800
The total area of all three squares together is ten square units.

57
00:04:46,800 --> 00:04:48,360
And then we have three data points.

58
00:04:48,360 --> 00:04:50,190
So we would divide that by three.

59
00:04:50,400 --> 00:04:53,340
We get about 3.33.

60
00:04:53,340 --> 00:04:59,550
And so the variance of the x variable here, if this is the data set, is about 3.33.

61
00:04:59,550 --> 00:05:06,210
This is what we calculated in the past, the variance of just one variable x, which we now know is

62
00:05:06,210 --> 00:05:10,020
also equivalent to the covariance of X with itself.

63
00:05:10,020 --> 00:05:16,500
But instead of calculating the covariance of X with itself, if we instead want to calculate the covariance

64
00:05:16,500 --> 00:05:23,220
of X with a different variable y, now our little one dimensional picture has to transition to a two

65
00:05:23,220 --> 00:05:24,210
dimensional picture.

66
00:05:24,210 --> 00:05:30,390
So in that two dimensional picture, instead of plotting values of just X along this one dimensional

67
00:05:30,390 --> 00:05:35,280
line, we would plot x, y, coordinate points in a two dimensional plane.

68
00:05:35,280 --> 00:05:42,150
So instead of just now along a line where we have this mean value of X here and then we have this point

69
00:05:42,150 --> 00:05:50,340
x sub one, x sub two and x sub three, now we have coordinate points where maybe this first point here

70
00:05:50,640 --> 00:05:57,840
is the point x one, y one, and this point up here is the point x seven,

71
00:05:58,800 --> 00:06:05,760
y seven. But each of these is just one value from the X data series paired together with its associated

72
00:06:05,760 --> 00:06:10,890
y value from the Y data series and then plotted in the two dimensional plane.

73
00:06:11,040 --> 00:06:17,970
Now what we already know for this series, instead of having just a mean X bar for the one series in

74
00:06:17,970 --> 00:06:24,390
X, when we find a mean for X and Y together, what we're doing is we're averaging all the X values

75
00:06:24,390 --> 00:06:30,180
and we're averaging all the Y values and then we are plotting that mean point.

76
00:06:30,180 --> 00:06:32,670
And this point is x bar,

77
00:06:33,500 --> 00:06:37,430
y bar. It's the mean of all the X values and the mean of all the Y values.

78
00:06:37,430 --> 00:06:44,030
And we plot that point and we can think about this as the balancing point of the entire data set.

79
00:06:44,030 --> 00:06:50,690
Then just like when we had variance in one variable to show covariance with two variables, we just

80
00:06:50,690 --> 00:06:53,720
look at the sum of all those square areas.

81
00:06:53,720 --> 00:07:00,740
So if we plot some of those square areas, we connect each individual point to that mean point.

82
00:07:00,740 --> 00:07:02,240
So we did that here with this point.

83
00:07:02,240 --> 00:07:04,790
Here with this point, we found this smaller square area.

84
00:07:04,790 --> 00:07:08,030
With this point up here, we found a smaller square area.

85
00:07:08,030 --> 00:07:11,570
And then with this last point up here, one of the bigger square areas.

86
00:07:11,570 --> 00:07:13,820
So we plot all those squares.

87
00:07:13,820 --> 00:07:20,630
If we add up all of the areas of all the squares connecting each point to the mean, then we get this

88
00:07:20,630 --> 00:07:22,760
whole sum right here.

89
00:07:22,760 --> 00:07:29,180
When we divide by n, the number of data points, that gives us covariance, which is just the average of

90
00:07:29,180 --> 00:07:30,500
all the square areas.

91
00:07:30,500 --> 00:07:37,310
So this covariance concept is just like the variance value we learned about earlier, except that in

92
00:07:37,310 --> 00:07:40,850
this way we get to see how X and Y varied together.

93
00:07:40,850 --> 00:07:44,660
We're not just looking at one variable on its own.

94
00:07:44,660 --> 00:07:49,460
Instead, here we're looking at the two variables in comparison to each other.

95
00:07:49,460 --> 00:07:55,520
And we can sort of imagine here how both X and Y have an effect on the variance.

96
00:07:55,520 --> 00:08:03,050
This is our mean point, but if we picked a point, let's say way up here, way above the mean, but

97
00:08:03,050 --> 00:08:05,570
not too far to the right of the mean.

98
00:08:05,570 --> 00:08:11,150
And then we created this rectangular area between that point and the mean point.

99
00:08:11,150 --> 00:08:12,980
Maybe that looks something like this.

100
00:08:12,980 --> 00:08:20,630
We can see how this area might actually be quite large because of the amount of variance in Y between

101
00:08:20,630 --> 00:08:22,610
this particular point and the mean.

102
00:08:22,610 --> 00:08:28,520
Even though the variance in X for this particular point is small, the fact that the variance in Y is

103
00:08:28,520 --> 00:08:33,740
larger lets us see that this point actually varies quite a bit from the mean, even though the variance

104
00:08:33,740 --> 00:08:35,059
in x is small.

105
00:08:35,059 --> 00:08:40,490
If we were just looking at one variable alone, just the variance in x, this point wouldn't look like

106
00:08:40,490 --> 00:08:46,250
it varied that much from the mean at all because it's only this tiny little width right here that constitutes

107
00:08:46,250 --> 00:08:47,720
the variance in x.

108
00:08:47,720 --> 00:08:51,590
But this huge height gives us the variance in Y.

109
00:08:51,590 --> 00:08:55,310
And so we can see that this point actually varies quite a bit from the mean.

110
00:08:55,310 --> 00:09:00,560
And so it's that covariance, being able to see what's happening to X and Y at the same time,

111
00:09:00,560 --> 00:09:05,090
that allows us to see more than just this one variable picture over here.

112
00:09:05,090 --> 00:09:11,690
Now, when we calculate a value for covariance, it will help us to see the relationship between the

113
00:09:11,690 --> 00:09:13,520
two variables X and Y.

114
00:09:13,550 --> 00:09:19,100
For instance, with this particular data series here, we can actually sketch a line through the data

115
00:09:19,100 --> 00:09:23,810
and we'll talk more about this later, but sketch a line through the data that shows us that there's

116
00:09:23,810 --> 00:09:26,480
a positive relationship between X and Y.

117
00:09:26,510 --> 00:09:29,150
It's positive because as X increases.

118
00:09:29,150 --> 00:09:35,030
So as we move to the right along the horizontal axis, as X increases Y also increases, we move up

119
00:09:35,030 --> 00:09:36,410
along the vertical axis.

120
00:09:36,410 --> 00:09:39,380
Or we could work backwards and say that as X decreases.

121
00:09:39,380 --> 00:09:43,880
So as we move from right to left this way, Y also decreases.

122
00:09:43,880 --> 00:09:45,320
We move from top to bottom.

123
00:09:45,320 --> 00:09:51,560
So either Y is decreasing as X is decreasing or Y is increasing as X is increasing.

124
00:09:51,560 --> 00:09:56,150
In either case, both variables are moving in the same direction at the same time.

125
00:09:56,150 --> 00:10:00,140
That's what we call a direct relationship or a positive relationship.

126
00:10:00,140 --> 00:10:05,000
If we have instead a relationship moving in the other direction.

127
00:10:05,000 --> 00:10:12,590
So if our line looked something like this instead, maybe like this, that would indicate an inverse

128
00:10:12,590 --> 00:10:17,270
relationship or a negative relationship, because as X is increasing.

129
00:10:17,270 --> 00:10:23,450
So as we move from left to right and the value of X increases, the value of Y decreases, we're moving

130
00:10:23,450 --> 00:10:29,570
down along this line as X increases or working in the other direction as X decreases.

131
00:10:29,570 --> 00:10:35,390
So as we move from the right to the left, as X decreases, Y is increasing.

132
00:10:35,390 --> 00:10:40,490
And so the variables are moving in opposite directions and we call that an inverse or a negative relationship.

133
00:10:40,490 --> 00:10:47,360
The value of covariance will give us that general direction of the relationship between the two variables.

134
00:10:47,360 --> 00:10:53,570
If covariance is positive, that means there's a positive direct relationship between X and Y.

135
00:10:53,600 --> 00:10:59,270
If covariance is negative, that means there's a negative or inverse relationship between X and Y.

136
00:10:59,300 --> 00:11:03,470
So if covariance is positive, the variables generally are moving in the same direction.

137
00:11:03,470 --> 00:11:08,120
If covariance is negative, the variables in general are moving in opposite directions.

138
00:11:08,120 --> 00:11:11,690
For that reason, covariance is a helpful metric.

139
00:11:11,690 --> 00:11:17,060
The problem is that it doesn't tell us the strength of the relationship, only the direction of the

140
00:11:17,060 --> 00:11:20,630
relationship, and it's not a standardized value.

141
00:11:20,630 --> 00:11:25,370
So changing the scale of our data will change the value of covariance.
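
To make the point about scale concrete, here is a minimal sketch in Python. The helper name `covariance` and the height and weight numbers are made up for illustration and are not from the lesson; the helper uses the divide-by-n form of the formula discussed here.

```python
def covariance(xs, ys):
    """Mean of the products of deviations from the means (divide-by-n form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

# Hypothetical data: the same heights expressed in meters and in centimeters
heights_m = [1.5, 1.6, 1.7, 1.8]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [50.0, 60.0, 65.0, 80.0]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)

# Same relationship, different units: the covariance is 100 times larger
print(cov_m, cov_cm)
```

Nothing about the relationship between the two series changed between the two calls, only the unit of one of them, which is why the raw covariance value is hard to interpret on its own.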

142
00:11:25,370 --> 00:11:31,880
One of the ways that we can see that is by simply adding one data point to this existing data set.

143
00:11:31,880 --> 00:11:32,810
So we had all of

144
00:11:32,880 --> 00:11:39,780
these little blue points in our data set, and they're all on or very close to this line running through

145
00:11:39,780 --> 00:11:40,590
the data set.

146
00:11:40,590 --> 00:11:47,160
And we would calculate a covariance by finding the mean of all of these different square or rectangular

147
00:11:47,160 --> 00:11:47,850
areas.

148
00:11:47,850 --> 00:11:54,840
But if we add another point to our data set and let's say that it is directly along this line right

149
00:11:54,840 --> 00:12:01,860
here, even though this point is right in line with each of these other data points in the set, this

150
00:12:01,860 --> 00:12:07,560
point, this point, this point here, here and here, were all right along this line.

151
00:12:07,560 --> 00:12:15,090
And so is this new point, such that this new point isn't really adding any new information beyond all of

152
00:12:15,090 --> 00:12:15,960
these other points here.

153
00:12:15,960 --> 00:12:19,380
It's right in line with all of the other values along this line.

154
00:12:19,380 --> 00:12:28,680
But if we sketch in the area created between this point and the mean right here, we can see that this

155
00:12:28,680 --> 00:12:35,670
massive amount of area is definitely going to increase our covariance calculation, even though this

156
00:12:35,670 --> 00:12:42,030
particular point doesn't really vary off of this trend line in either the X or Y direction.

157
00:12:42,030 --> 00:12:47,070
And we could continue adding points directly along the line way out here in space.

158
00:12:47,070 --> 00:12:53,550
And that huge area is going to contribute to increasing our covariance value, even though we're not

159
00:12:53,550 --> 00:12:56,040
really seeing any variation in the set.
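
A small sketch of that effect, using made-up points that sit exactly on the line y = 2x (these are not the blue points from the picture, and `covariance` is a hypothetical helper using the divide-by-n formula):

```python
def covariance(xs, ys):
    """Mean of the products of deviations from the means (divide-by-n form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

# Five made-up points lying exactly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2 * x for x in xs]
cov_before = covariance(xs, ys)

# Add one more point far out along the very same line
xs_more = xs + [20]
ys_more = ys + [40]
cov_after = covariance(xs_more, ys_more)

# Every point is still perfectly on the line, yet the covariance jumps
print(cov_before, cov_after)
```

Both data sets follow the line perfectly, so the jump in covariance reflects only how far the new point sits from the mean point, not how well the points follow the trend.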

160
00:12:56,070 --> 00:13:02,250
Essentially all we're doing is scaling up this sum right here, which means that we calculate a larger

161
00:13:02,250 --> 00:13:09,210
total area and that tells us that our data points are creating a large amount of area away from the

162
00:13:09,210 --> 00:13:10,140
mean point.

163
00:13:10,140 --> 00:13:17,610
But it doesn't give us any indication about how close or far our data points are from this trend line.

164
00:13:17,610 --> 00:13:19,470
And that's what we're really interested in.

165
00:13:19,500 --> 00:13:24,900
We want to know if all these data points that we're adding to the set are right along this trend line

166
00:13:24,900 --> 00:13:26,190
or far from it.

167
00:13:26,190 --> 00:13:30,300
So this point we added up here was right along the trend line.

168
00:13:30,300 --> 00:13:38,040
We want to be able to distinguish between that kind of a point and maybe a point down here that is far

169
00:13:38,040 --> 00:13:39,660
away from the trend line.

170
00:13:39,660 --> 00:13:43,290
And that's what correlation is going to allow us to do later.

171
00:13:43,290 --> 00:13:48,870
But for now, despite all of its problems, let's just look at an example so that we know how to use

172
00:13:48,870 --> 00:13:51,120
this formula to calculate covariance.

173
00:13:51,120 --> 00:13:56,700
Now that we know what it's doing, we can obviously do this by hand or with a calculator.

174
00:13:56,700 --> 00:14:01,770
But these kinds of calculations, especially as our data sets get larger and larger, are much, much

175
00:14:01,770 --> 00:14:03,870
easier to do with software.

176
00:14:03,870 --> 00:14:10,980
So for instance, here's a table, and these are the values in our X and Y series.

177
00:14:10,980 --> 00:14:16,740
So we have our X values here one, two, three, four, five, three and four, and their corresponding

178
00:14:16,740 --> 00:14:20,670
Y values two, four, six, eight, ten, five and nine.

179
00:14:20,670 --> 00:14:25,200
So essentially these are all of our x sub i and y sub i values.

180
00:14:25,200 --> 00:14:28,290
And then we need to find the mean x bar and y bar.

181
00:14:28,290 --> 00:14:36,180
So we sum the entire X series and the entire Y series and we get sums of 22 and 44 respectively.

182
00:14:36,270 --> 00:14:43,860
And then because there are seven data points, because n is seven in this case, we divide 22 and 44

183
00:14:43,860 --> 00:14:46,410
by seven to get the mean.

184
00:14:46,410 --> 00:14:51,300
So this here is x bar and this is y bar.

185
00:14:51,300 --> 00:14:58,110
And in fact these series are what was represented here in this original data set of blue points in the

186
00:14:58,110 --> 00:14:58,680
plane.

187
00:14:58,800 --> 00:15:04,170
So we calculate the mean of both the X series and the Y series.

188
00:15:04,170 --> 00:15:10,410
Then we need to calculate these x sub i minus x bar values and the y sub i minus y bar values.

189
00:15:10,590 --> 00:15:12,120
So that's where we go next.

190
00:15:12,120 --> 00:15:14,790
We have x sub i minus x bar.

191
00:15:14,790 --> 00:15:23,100
So this first value here is one minus the mean of x, about 3.14. This second value here is two minus

192
00:15:23,100 --> 00:15:26,220
the mean, this is three minus the mean, etc.

193
00:15:26,220 --> 00:15:35,250
And then same thing here for y we take y of two minus the mean of 6.28 to get this -4.28 value.

194
00:15:35,250 --> 00:15:40,080
In this second row we take four minus the mean, we get a -2.28.

195
00:15:40,080 --> 00:15:43,320
So we calculate all those y minus y bar values.

196
00:15:43,320 --> 00:15:46,920
So we calculate all those y sub i minus y bar values.

197
00:15:46,920 --> 00:15:53,790
And then of course we take their product because here we have to multiply these x sub i minus x bar values

198
00:15:53,790 --> 00:15:56,310
by the y sub i minus y bar values.

199
00:15:56,430 --> 00:16:01,080
So once we have these products, we then add them all together.

200
00:16:01,080 --> 00:16:04,680
This sum here says to add all those products together.

201
00:16:04,680 --> 00:16:07,140
That's this sum right here.

202
00:16:07,140 --> 00:16:13,590
So when we add everything together, we get 22.7 approximately, and then we just divide this value

203
00:16:13,590 --> 00:16:20,460
by seven to get the covariance of X and Y as approximately 3.2449.

204
00:16:20,460 --> 00:16:25,290
This is a rounded value here, but this is the covariance of x with Y.
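
The same table can be reproduced in a few lines of Python. This is only a sketch of the steps just described, using the x and y values read out above; the variable names are chosen here for illustration.

```python
# The x and y series from the table in the lesson
xs = [1, 2, 3, 4, 5, 3, 4]
ys = [2, 4, 6, 8, 10, 5, 9]

n = len(xs)                      # 7 data points
x_bar = sum(xs) / n              # 22 / 7
y_bar = sum(ys) / n              # 44 / 7

# Deviation columns: x_i - x_bar and y_i - y_bar
dx = [x - x_bar for x in xs]
dy = [y - y_bar for y in ys]

# Product column, its sum, and the mean of the products
products = [a * b for a, b in zip(dx, dy)]
total = sum(products)
cov_xy = total / n

print(round(total, 1))   # 22.7
print(round(cov_xy, 4))  # 3.2449
```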

205
00:16:25,290 --> 00:16:30,210
And like we said at the beginning, this second formula here will do the same thing.

206
00:16:30,240 --> 00:16:32,230
All we do, once we have our x and y values,

207
00:16:32,350 --> 00:16:32,770
is

208
00:16:32,770 --> 00:16:35,880
take the product of each X and Y.

209
00:16:35,890 --> 00:16:40,080
So here we have our values for X, our values for Y.

210
00:16:40,090 --> 00:16:43,510
So if we multiply one times two, we get two.

211
00:16:43,750 --> 00:16:51,910
Two times four is eight, three times six is 18, four times eight is 32, etc. All the way down to

212
00:16:51,910 --> 00:16:54,250
four times nine is 36.

213
00:16:54,370 --> 00:16:59,050
So if we multiply all the X and Y values, we find all these products.

214
00:16:59,050 --> 00:17:04,450
So this here is the x y column and then we take the mean of these.

215
00:17:04,450 --> 00:17:10,599
So we add all of these together and then we divide by n equals seven, the number of data points we have.

216
00:17:10,599 --> 00:17:16,690
So if we add all these together and then we divide by seven, the result we get is this mean of the

217
00:17:16,690 --> 00:17:24,160
products, which turns out to be 23, and then we just subtract from that the product of the two means,

218
00:17:24,160 --> 00:17:27,280
which we already have, we already have X bar and Y bar.

219
00:17:27,280 --> 00:17:31,000
You can see here we have about 3.14 and 6.28.

220
00:17:31,000 --> 00:17:36,310
If we multiply these values together, we're going to get about 19 and three quarters.

221
00:17:36,310 --> 00:17:39,550
And so when we then take 23 here

222
00:17:40,380 --> 00:17:45,210
minus the product of x bar and y bar, which is about 19.75,

223
00:17:45,210 --> 00:17:45,900
roughly,

224
00:17:45,900 --> 00:17:55,220
the value we end up with is about 3.25, or more specifically, this value here for covariance, 3.2449,

225
00:17:55,230 --> 00:17:57,780
if we round to the first four decimal places.

226
00:17:57,780 --> 00:18:03,690
So these two covariance formulas do calculate exactly the same value just in two different ways.
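
As a quick check that the two formulas agree, here is a sketch in Python that computes the covariance both ways for the table values from the lesson. The variable names are illustrative only.

```python
# The x and y series from the table in the lesson
xs = [1, 2, 3, 4, 5, 3, 4]
ys = [2, 4, 6, 8, 10, 5, 9]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Second formula: mean of the products minus the product of the means
mean_of_products = sum(x * y for x, y in zip(xs, ys)) / n   # 161 / 7
product_of_means = x_bar * y_bar                            # about 19.755
cov_shortcut = mean_of_products - product_of_means

# First formula: mean of the products of deviations from the means
cov_definition = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

print(mean_of_products)          # 23.0
print(round(cov_shortcut, 4))    # 3.2449
print(round(cov_definition, 4))  # 3.2449
```

The two results can differ in the last few floating-point digits, so it is safer to compare them with a small tolerance than with exact equality.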

227
00:18:03,690 --> 00:18:09,480
And so just to tie this covariance back to the data set we started with, we get a value

228
00:18:09,510 --> 00:18:11,340
here of about 3.2.

229
00:18:11,370 --> 00:18:18,150
The fact that it's positive tells us that there is a positive or direct relationship between X and Y,

230
00:18:18,180 --> 00:18:23,730
meaning that as X increases, Y increases or as X decreases, Y decreases.

231
00:18:23,730 --> 00:18:27,350
And we see that with this rough trend line that we sketch through the data.

232
00:18:27,360 --> 00:18:33,360
If we had found a negative value for covariance, it would indicate the opposite relationship between

233
00:18:33,360 --> 00:18:34,980
the X series and the Y series.

234
00:18:34,980 --> 00:18:40,020
It would tell us that there was more area over here in this direction,

235
00:18:41,000 --> 00:18:42,470
down here or

236
00:18:43,170 --> 00:18:46,280
here along this negative trend line,

237
00:18:46,290 --> 00:18:50,420
than there was area in this direction along the positive trend line.

238
00:18:50,430 --> 00:18:56,160
So really the biggest takeaway here is that when we find a positive covariance value, it means we have

239
00:18:56,160 --> 00:19:02,730
a positive or a direct relationship, that the variables change in the same direction instead of in opposite

240
00:19:02,730 --> 00:19:03,530
directions.

241
00:19:03,540 --> 00:19:09,810
And next, we're going to look at how to use what we've learned about covariance and standardize it

242
00:19:09,810 --> 00:19:14,730
into correlation, which will be an even more useful measure for us.

