1
00:00:00,090 --> 00:00:05,070
Now that we understand the idea of covariance, we want to transition into talking about correlation

2
00:00:05,070 --> 00:00:11,040
and specifically the Pearson correlation coefficient, which we often indicate with this letter R or

3
00:00:11,040 --> 00:00:13,100
sometimes the Greek letter rho.

4
00:00:13,110 --> 00:00:19,320
But either way, whether we indicate it with R or rho, that Pearson correlation coefficient is just

5
00:00:19,320 --> 00:00:25,200
referring to the correlation between the variables X and Y, or between these two data series X and

6
00:00:25,200 --> 00:00:26,910
Y, and we calculate it,

7
00:00:26,910 --> 00:00:33,060
and this is why we had to learn covariance first, as the covariance divided by the product of the standard

8
00:00:33,060 --> 00:00:39,300
deviations of X and Y. Given this formula, we can see that there are several ways we could go

9
00:00:39,300 --> 00:00:41,700
about actually calculating correlation.

10
00:00:41,730 --> 00:00:47,580
Of course, we could first calculate a complete covariance value and then we could calculate both standard

11
00:00:47,580 --> 00:00:51,870
deviations and we could divide covariance by the product of the standard deviations.

12
00:00:51,870 --> 00:00:57,390
Or we could break these values down using the formulas for each of them individually.

13
00:00:57,390 --> 00:00:59,310
And it would look something like this.

14
00:00:59,310 --> 00:01:06,300
So here in the numerator we have the formula for covariance between x and Y, and then in the denominator

15
00:01:06,300 --> 00:01:09,120
we have each of the standard deviation values.

16
00:01:09,120 --> 00:01:12,120
So we could go off of this formula as well.
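
The on-screen formula isn't captured in this transcript, so here is a sketch of what it presumably looks like, based on the description above. It assumes the one over n versions of covariance and standard deviation, and the symbols r, s_x, and s_y are notation chosen here for illustration.

```latex
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
  = \frac{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\;
          \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}}
```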

17
00:01:12,240 --> 00:01:18,000
Sometimes we'll see the correlation formula as a simplified version of this one.

18
00:01:18,180 --> 00:01:19,650
It looks like this.

19
00:01:19,650 --> 00:01:25,620
And basically what we've recognized is that in the denominator here, when we have the square root of

20
00:01:25,620 --> 00:01:32,160
the product of one over N and this sum, we can take the square root of each of those individually.

21
00:01:32,160 --> 00:01:37,740
So this here, this full square root, is equivalent to the square root of one over n multiplied by the

22
00:01:37,740 --> 00:01:39,960
square root of this whole sum.

23
00:01:39,960 --> 00:01:44,790
Essentially what that means is that we can kind of pull out in front this square root of one over N

24
00:01:44,790 --> 00:01:50,430
and this square root of one over n, and if we multiply those together, then they simplify to just

25
00:01:50,430 --> 00:01:51,090
one over

26
00:01:51,090 --> 00:01:51,540
N.

27
00:01:51,570 --> 00:01:57,690
If we pull out this and this from underneath the square root, we can write them separately as the square

28
00:01:57,690 --> 00:02:02,130
root of one over n times the square root of one over n.

29
00:02:02,430 --> 00:02:07,980
Of course, when we multiply those square roots together, the roots cancel and we're just left with

30
00:02:07,980 --> 00:02:08,699
one over

31
00:02:08,699 --> 00:02:09,060
N.

32
00:02:09,060 --> 00:02:15,990
So this whole thing here is equivalent to one over n. And now we can see that we get this one over n

33
00:02:15,990 --> 00:02:22,680
to cancel with this one over n, giving us this formula right here, which means we can calculate correlation

34
00:02:22,680 --> 00:02:23,400
either way.

35
00:02:23,400 --> 00:02:26,580
We can calculate it using this formula or using this formula.
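
Written out, the cancellation described above might look like the following. This is a reconstruction from the spoken description, not a copy of the slide, and it again assumes the one over n form of each formula.

```latex
\begin{align*}
\sqrt{\tfrac{1}{n}\sum (x_i - \bar{x})^2}\,\sqrt{\tfrac{1}{n}\sum (y_i - \bar{y})^2}
  &= \sqrt{\tfrac{1}{n}}\,\sqrt{\tfrac{1}{n}}\,
     \sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2} \\
  &= \tfrac{1}{n}\,\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2} \\[6pt]
r = \frac{\tfrac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}
         {\tfrac{1}{n}\,\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}
  &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}
\end{align*}
```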

36
00:02:26,580 --> 00:02:32,160
We just have to be careful that we're not dividing by n here in the numerator and then leaving out the

37
00:02:32,160 --> 00:02:36,060
one over n as part of both roots in the denominator or vice versa.

38
00:02:36,090 --> 00:02:43,350
Because realize that this value we have here in the numerator is not the covariance of x and Y because

39
00:02:43,350 --> 00:02:45,960
it doesn't include this one over n value.

40
00:02:45,960 --> 00:02:52,110
So if we calculate covariance and then use this square root here in the denominator, we'll get an incorrect

41
00:02:52,110 --> 00:02:53,850
value for correlation.

42
00:02:53,850 --> 00:03:00,570
The only reason this formula works is because we cancel this one over n from every part of our formula,

43
00:03:00,570 --> 00:03:02,820
including the covariance formula.

44
00:03:02,820 --> 00:03:05,850
And so this is not the full covariance formula.

45
00:03:05,850 --> 00:03:11,070
So we just have to be really careful with which pieces we're using to calculate correlation.
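
To make that warning concrete, here is a small Python sketch. The data is made up for illustration and is not the table from the lecture, and the variable names are just labels chosen here. It computes correlation both ways and then shows what happens if we mix the pieces.

```python
import math

# Hypothetical data, made up for illustration (not the lecture's table).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# The three sums that appear in the formulas.
sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sum_xx = sum((xi - x_bar) ** 2 for xi in x)
sum_yy = sum((yi - y_bar) ** 2 for yi in y)

# Way 1: full covariance divided by the product of full standard deviations.
cov_xy = sum_xy / n
sd_x = math.sqrt(sum_xx / n)
sd_y = math.sqrt(sum_yy / n)
r_full = cov_xy / (sd_x * sd_y)

# Way 2: the simplified formula, with one over n cancelled everywhere.
r_simplified = sum_xy / (math.sqrt(sum_xx) * math.sqrt(sum_yy))

# The mistake: full covariance on top, but roots without one over n below.
r_mixed = cov_xy / (math.sqrt(sum_xx) * math.sqrt(sum_yy))

print(r_full, r_simplified, r_mixed)
```

The first two values agree, while the mixed version comes out too small by a factor of n, which is exactly the error we're being warned about.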

46
00:03:11,070 --> 00:03:16,980
But other than that, actually calculating this value is pretty straightforward because we already know

47
00:03:16,980 --> 00:03:18,420
how to calculate covariance.

48
00:03:18,420 --> 00:03:21,240
We already know how to calculate standard deviations.

49
00:03:21,240 --> 00:03:26,910
Again, just like with covariance, especially as our data sets get larger and larger, we're definitely

50
00:03:26,910 --> 00:03:34,380
going to want to use a calculator or software to find all of these values, to sum up all of these values.

51
00:03:34,380 --> 00:03:42,090
And so here we've extended the table that we used to look at covariance to include values for correlation.

52
00:03:42,090 --> 00:03:49,470
So we've already seen this table when we talked about covariance up to this part right here and now

53
00:03:49,470 --> 00:03:52,350
we've just added these two extra columns.

54
00:03:52,350 --> 00:03:58,920
So we already have our covariance value and we're essentially going to be using this formula right here.

55
00:03:58,920 --> 00:04:02,220
So let's start by getting rid of all this cancellation we did.

56
00:04:02,580 --> 00:04:08,280
So working from this formula, we essentially just need to calculate the standard deviation with respect

57
00:04:08,280 --> 00:04:09,060
to X and Y.

58
00:04:09,060 --> 00:04:14,940
So if we look at our standard deviation formula here, we have x sub i minus x bar.

59
00:04:14,940 --> 00:04:20,279
Well, we already calculated that value when we found covariance. We have it here in our table.

60
00:04:20,279 --> 00:04:22,470
Now all we have to do is square it.

61
00:04:22,470 --> 00:04:27,120
So in this column here, we're just squaring the value from this column.

62
00:04:27,120 --> 00:04:28,590
So we square that value.

63
00:04:28,620 --> 00:04:34,800
We're also going to need the squared y sub i minus y bar value, which we have here.

64
00:04:34,800 --> 00:04:40,680
And then this summation notation here just tells us to sum up all of those squared values.

65
00:04:40,680 --> 00:04:45,360
So we find all those values and then here we take their sum.

66
00:04:45,360 --> 00:04:52,650
So we have the sum of all the x sub i minus x bar quantity squared values, and that's about 10.8.

67
00:04:52,650 --> 00:04:59,970
And then this 49.4 is this sum here, the sum of all the y sub i minus y bar quantity squared values.

68
00:05:00,540 --> 00:05:05,790
And once we have those two sums, we just have to divide by n, the number of data points, which in this

69
00:05:05,790 --> 00:05:10,590
case is seven. We have here n equals seven.

70
00:05:10,590 --> 00:05:12,240
So we have to divide by seven.

71
00:05:12,240 --> 00:05:18,390
That's these values here, which means that these two values are the variance, not the covariance,

72
00:05:18,390 --> 00:05:21,750
but the variance of X and Y individually.

73
00:05:21,750 --> 00:05:27,660
So the variance of X, the variance of Y, and then we take their square roots and we get the standard

74
00:05:27,660 --> 00:05:28,290
deviations.

75
00:05:28,290 --> 00:05:35,430
So these here, this is the standard deviation for X, and this is the standard deviation for Y, which

76
00:05:35,430 --> 00:05:42,000
means that if we round these values, what we're getting here is approximately the covariance, which

77
00:05:42,000 --> 00:05:51,720
we said was about 3.2449 divided by and I'm just rounding here, but divided by the standard deviation

78
00:05:51,720 --> 00:05:58,440
with respect to X. So that's about 1.245, so 1.245.

79
00:05:58,440 --> 00:06:06,330
And then the standard deviation with respect to Y, which is about 2.657, so 2.657.

80
00:06:06,330 --> 00:06:11,610
So this is the calculation we're doing to find our correlation value right here.

81
00:06:11,610 --> 00:06:17,280
We're just taking this covariance, dividing it by the product of these two standard deviations.

82
00:06:17,280 --> 00:06:23,550
And the correlation value we get is about 0.9805 rounded to four decimal places.

83
00:06:23,550 --> 00:06:28,560
So approximately 0.9805.
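
As a quick check of that arithmetic, here is a short Python sketch that uses only the rounded values quoted above. Because those inputs are already rounded, the result lands near 0.981 rather than exactly on the 0.9805 we get from the unrounded table values.

```python
# Values quoted in the lecture, already rounded.
cov_xy = 3.2449  # covariance of X and Y
sd_x = 1.245     # standard deviation of X
sd_y = 2.657     # standard deviation of Y

# Correlation is covariance divided by the product of the standard deviations.
r = cov_xy / (sd_x * sd_y)
print(round(r, 3))
```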

84
00:06:28,560 --> 00:06:35,760
Now, the reason that correlation is so helpful to us is because it's a standardized version of covariance.

85
00:06:35,760 --> 00:06:42,240
If we think back to when we learned about variance and standard deviation for a single variable, we

86
00:06:42,240 --> 00:06:43,590
can kind of think about it this way.

87
00:06:43,590 --> 00:06:50,460
So when we were talking about variance and standard deviation, we sort of had this business need to

88
00:06:50,460 --> 00:06:54,810
measure the dispersion or the spread of a single variable.

89
00:06:54,810 --> 00:07:01,800
And so we would calculate variance, and variance is a statistical measure of spread, but it's not

90
00:07:01,800 --> 00:07:08,250
as intuitive or helpful because it's not in the same units as the original data.

91
00:07:08,250 --> 00:07:13,500
If the original data was given in meters, then variance would always be in meters squared.

92
00:07:13,500 --> 00:07:19,020
So if we take the square root of variance, we get here standard deviation and the units of standard

93
00:07:19,020 --> 00:07:22,980
deviation match the units of the original data.

94
00:07:22,980 --> 00:07:30,420
And so standard deviation is a standardized, more intuitive measure of spread or dispersion when we're

95
00:07:30,420 --> 00:07:32,790
talking about just one variable.

96
00:07:33,030 --> 00:07:37,410
Well, covariance and correlation sort of act in the same way.

97
00:07:37,440 --> 00:07:43,230
Our goal here is to get a measure of the relationship between two variables instead of the measure of

98
00:07:43,230 --> 00:07:45,660
dispersion or measure of spread of one variable.

99
00:07:45,660 --> 00:07:49,830
That's sort of our business requirement or our real world goal.

100
00:07:49,920 --> 00:07:56,190
And we can calculate covariance as a statistical measure of the relationship between two variables.

101
00:07:56,190 --> 00:08:02,700
And that's helpful to a degree, but it's limited for the same reason that variance was only marginally helpful.

102
00:08:02,910 --> 00:08:10,710
The units of covariance don't match the original units. So if the original units of X are in

103
00:08:11,480 --> 00:08:19,250
meters and the original units of Y are in meters, then the units of covariance are going to be in meters

104
00:08:19,250 --> 00:08:25,160
squared, which gives us some idea of the measure of the relationship but isn't super intuitive.

105
00:08:25,190 --> 00:08:33,200
The correlation coefficient by dividing by the two standard deviations removes these units entirely,

106
00:08:33,200 --> 00:08:38,659
and the correlation coefficient, or just correlation, actually has no units at all.

107
00:08:38,659 --> 00:08:44,270
Because if you think about it, if we said here that covariance was in meters squared, well, the units

108
00:08:44,270 --> 00:08:50,450
of standard deviation for these two measures are going to be in meters and in meters.

109
00:08:50,450 --> 00:08:56,030
And so when we divide meters squared by meters times meters, all the units will cancel and we're left

110
00:08:56,030 --> 00:08:59,270
with a unitless measurement of correlation.
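
One way to see the units cancel is to measure the same thing in two different units. The Python sketch below uses made-up measurements (hypothetical, not from the lecture) and two small helper functions written here for illustration, both using the one over n form. It converts meters to centimeters and compares the results.

```python
import math

def covariance(x, y):
    """Covariance using the one over n form."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(x, x))  # variance of x is cov(x, x)
    sd_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sd_x * sd_y)

# Hypothetical measurements in meters, made up for illustration.
x_m = [1.2, 1.5, 1.9, 2.4, 3.0]
y_m = [2.0, 2.6, 3.1, 4.2, 5.1]

# The same measurements expressed in centimeters.
x_cm = [v * 100 for v in x_m]
y_cm = [v * 100 for v in y_m]

cov_m = covariance(x_m, y_m)      # in meters squared
cov_cm = covariance(x_cm, y_cm)   # in centimeters squared
r_m = correlation(x_m, y_m)       # no units
r_cm = correlation(x_cm, y_cm)    # no units

print(cov_m, cov_cm)
print(r_m, r_cm)
```

Covariance grows by a factor of 10,000 just because we changed units, while correlation stays the same.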

111
00:08:59,270 --> 00:09:08,300
And what that means in this case is that correlation is this standardized value that always falls between

112
00:09:08,450 --> 00:09:11,450
negative one and positive one.

113
00:09:11,450 --> 00:09:19,100
So in the same way that maybe probability exists on a scale of 0 to 1, because we can only have a 0%

114
00:09:19,100 --> 00:09:24,380
chance that something happens all the way up to a 100% chance that something happens, but nothing less

115
00:09:24,380 --> 00:09:26,300
than zero and nothing greater than 100.

116
00:09:26,300 --> 00:09:29,120
So probability is locked on this 0 to 1 scale.

117
00:09:29,120 --> 00:09:33,800
In that same way correlation is locked on this -1 to 1 scale.

118
00:09:33,800 --> 00:09:40,850
And so it sort of gives us a measure of power or this particular scale on which to understand the strength

119
00:09:40,850 --> 00:09:42,470
of a relationship.

120
00:09:42,470 --> 00:09:48,500
So if covariance only gave us the direction of the relationship, correlation not only gives us the

121
00:09:48,500 --> 00:09:52,580
direction of the relationship, but also the strength of the relationship.

122
00:09:52,580 --> 00:10:02,750
So we want to think about covariance as just direction, but correlation as direction and

123
00:10:03,920 --> 00:10:07,970
strength, and that's where the standardized scale comes in.

124
00:10:07,970 --> 00:10:14,000
So we said with covariance, if we got a positive value, it told us that our two variables, or two

125
00:10:14,000 --> 00:10:14,450
measures,

126
00:10:14,450 --> 00:10:16,790
X and Y, moved in the same direction.

127
00:10:16,790 --> 00:10:22,160
They would tend to increase together or decrease together, whereas a negative value for covariance

128
00:10:22,160 --> 00:10:25,250
told us that our variables moved in the opposite direction.

129
00:10:25,250 --> 00:10:28,130
One would decrease while the other would increase or vice versa.

130
00:10:28,130 --> 00:10:33,050
But it didn't give us any indication of the strength of the relationship.

131
00:10:33,050 --> 00:10:41,150
Whereas with correlation here on this -1 to 1 scale, if we think about the number line here, where this is negative

132
00:10:41,150 --> 00:10:44,870
one, this is one, and then we have zero in the middle here.

133
00:10:44,870 --> 00:10:50,960
If we find a value for correlation between zero and one, it means that the direction of the relationship

134
00:10:50,960 --> 00:10:54,890
is positive and so X and Y move in the same direction.

135
00:10:54,890 --> 00:10:56,990
They both increase or they both decrease.

136
00:10:56,990 --> 00:11:02,630
Whereas if we get a value for correlation between zero and negative one, it tells us that the variables

137
00:11:02,630 --> 00:11:05,060
X and Y move in the opposite direction.

138
00:11:05,060 --> 00:11:08,150
One increases, while the other decreases or vice versa.

139
00:11:08,150 --> 00:11:14,480
But the value we find for correlation also gives us the strength of the relationship in the sense that

140
00:11:14,480 --> 00:11:21,650
if we are much closer to positive one over here or to negative one all the way over here on the left,

141
00:11:21,680 --> 00:11:28,100
we know that we have a strong positive relationship on the right or a strong negative relationship on

142
00:11:28,100 --> 00:11:33,900
the left, meaning that the two variables are highly correlated or closely correlated.

143
00:11:33,920 --> 00:11:39,590
Contrast that with finding a value for correlation that's close to zero, whether that's positive or

144
00:11:39,590 --> 00:11:40,040
negative.

145
00:11:40,040 --> 00:11:46,430
So something like 0.1 or -0.1, some value that's close to zero either on the positive side or the negative

146
00:11:46,430 --> 00:11:51,230
side, that would tell us that there is a very weak relationship in the data.

147
00:11:51,230 --> 00:11:57,830
So a weak correlation or a low correlation. And visually we can think about that as just how tightly

148
00:11:57,830 --> 00:12:00,410
clustered the data is around a trend.

149
00:12:00,410 --> 00:12:08,780
So in really super simple terms, if we have a data set and that data set

150
00:12:09,480 --> 00:12:10,320
looks

151
00:12:11,120 --> 00:12:11,930
like this,

152
00:12:11,930 --> 00:12:14,990
we already know that that's a positive correlation.

153
00:12:14,990 --> 00:12:20,420
We know that we would calculate a positive value for covariance, for correlation.

154
00:12:20,420 --> 00:12:25,820
Because of the positive covariance, we know that correlation will also be positive, but because all

155
00:12:25,820 --> 00:12:33,110
of the data is perfectly in alignment along this one line, we know that that's a very strong correlation

156
00:12:33,110 --> 00:12:36,470
in this particular case, a very strong positive correlation.

157
00:12:36,470 --> 00:12:40,340
And so we know that we're going to get a correlation very close to one.

158
00:12:40,340 --> 00:12:43,340
And in fact, that's what we got here with this data set.

159
00:12:43,340 --> 00:12:50,060
We were looking at an extremely strong positive correlation, an extremely strong positive relationship.

160
00:12:50,060 --> 00:12:53,030
And so our data probably looks something much like this.

161
00:12:53,030 --> 00:12:58,040
If we had a very strong negative correlation that would look like this.

162
00:12:58,040 --> 00:13:04,490
So again, a very strong relationship where the data is all tightly clustered close to this line, but

163
00:13:04,490 --> 00:13:06,860
just in the negative direction with a negative slope.

164
00:13:06,860 --> 00:13:08,330
So here this was blue.

165
00:13:08,330 --> 00:13:10,040
Let's say that this is red.

166
00:13:10,160 --> 00:13:16,130
And then we could call a positive but weak correlation, or weak relationship, something that looks like

167
00:13:16,130 --> 00:13:18,860
this. So we might have

168
00:13:20,060 --> 00:13:20,930
data

169
00:13:21,970 --> 00:13:27,130
that looks something like this, so we can see the positive direction, but it's not tightly pulled to

170
00:13:27,130 --> 00:13:34,210
this middle trend line. That's a weaker positive relationship or a weak positive correlation.

171
00:13:34,360 --> 00:13:39,430
And then a weak negative correlation we'll do in purple, and that

172
00:13:40,630 --> 00:13:41,950
would look something

173
00:13:43,130 --> 00:13:49,240
like this, where we see that negative direction of the data, but the correlation is weaker.

174
00:13:49,250 --> 00:13:57,380
Covariance doesn't give us any intuition, any insight whatsoever about how tightly the data is packed

175
00:13:57,380 --> 00:13:58,910
around this trend line.

176
00:13:58,910 --> 00:14:00,830
It just gives us direction.
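
The tight and loose pictures described above can be imitated with simulated data. This Python sketch is only an illustration: the data is randomly generated with a fixed seed, the noise levels of 0.1 and 5.0 are arbitrary choices, and the helper functions are written here rather than taken from the lecture.

```python
import math
import random

def covariance(x, y):
    """Covariance using the one over n form."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

rng = random.Random(0)  # fixed seed so the run is repeatable

# 200 evenly spaced x values, with y following the same upward trend.
x = [i / 20 for i in range(200)]
y_tight = [xi + rng.gauss(0, 0.1) for xi in x]  # hugs the trend line
y_loose = [xi + rng.gauss(0, 5.0) for xi in x]  # scattered around it

cov_tight = covariance(x, y_tight)
cov_loose = covariance(x, y_loose)
r_tight = correlation(x, y_tight)
r_loose = correlation(x, y_loose)

print("covariance:", cov_tight, cov_loose)
print("correlation:", r_tight, r_loose)
```

Both covariances come out positive, so they agree on direction, but only the correlation values show that the first data set sits much closer to its trend line than the second.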

177
00:14:00,830 --> 00:14:06,860
But obviously when we're answering real world questions, we really want to know, is the trend super

178
00:14:06,860 --> 00:14:07,400
strong?

179
00:14:07,400 --> 00:14:09,110
Is it super highly correlated?

180
00:14:09,110 --> 00:14:12,800
Do the two variables match up in lockstep with each other?

181
00:14:12,800 --> 00:14:18,740
Where as one variable changes, the other one changes predictably along this line, and we can say that

182
00:14:18,740 --> 00:14:21,080
the two variables are very highly correlated.

183
00:14:21,080 --> 00:14:22,580
That's something we want to know.

184
00:14:22,580 --> 00:14:29,780
And standardizing covariance by dividing by the product of the standard deviations gives us correlation,

185
00:14:29,780 --> 00:14:33,950
which does give us insight into how strong or weak the relationship is.

186
00:14:33,950 --> 00:14:40,760
If this blue data set here is the data that we get about sales compared to how much money we spend on

187
00:14:40,760 --> 00:14:41,540
marketing,

188
00:14:41,660 --> 00:14:50,330
then this data here is telling us very clearly that as we spend more money on marketing, as our marketing and

189
00:14:50,330 --> 00:14:55,340
advertising budget increases, sales also increase in lockstep.

190
00:14:55,340 --> 00:15:01,040
Then we absolutely want to pour more money into marketing and advertising so that we can see sales go

191
00:15:01,040 --> 00:15:01,430
up.

192
00:15:01,550 --> 00:15:09,530
But if this is the data set we're getting instead, this green data set, we can see that the data does

193
00:15:09,530 --> 00:15:17,570
have a positive correlation, but it's not necessarily always the case that as we increase our marketing

194
00:15:17,570 --> 00:15:21,170
and advertising budget that sales will go up.

195
00:15:21,170 --> 00:15:26,660
The relationship between those two things is not nearly as strong as the relationship we see up here.

196
00:15:26,660 --> 00:15:31,580
And so as a business, we might choose to make slightly different decisions based on whether we get

197
00:15:31,580 --> 00:15:33,800
this blue data set or this green data set.

198
00:15:33,800 --> 00:15:41,090
But we only have insight into how strong this relationship is if we standardize our covariance statistic

199
00:15:41,090 --> 00:15:48,140
by dividing by the product of the standard deviations in order to turn it into this correlation measure.

