1
00:00:00,120 --> 00:00:05,700
Earlier, we briefly mentioned the chi squared test statistic and now we want to look more at chi squared

2
00:00:05,700 --> 00:00:06,370
tests.

3
00:00:06,390 --> 00:00:10,080
There are many different kinds of chi squared tests we can perform.

4
00:00:10,080 --> 00:00:14,160
We'll look at three of them and then in particular we'll focus on one.

5
00:00:14,160 --> 00:00:20,370
So one kind of chi squared test that we can do is a chi squared test for homogeneity, where we take

6
00:00:20,370 --> 00:00:25,350
a sample from two groups and essentially compare their probability distributions.

7
00:00:25,350 --> 00:00:31,800
So an example of a question we could answer here is whether or not gender has an effect on pet preference.

8
00:00:31,800 --> 00:00:39,570
So if we were to ask men and women whether they prefer cats, dogs or some other kind of pet and then

9
00:00:39,570 --> 00:00:45,210
create a table of our data, it might look something like this. We can see that what we're doing here

10
00:00:45,210 --> 00:00:46,890
is sampling from two groups.

11
00:00:46,890 --> 00:00:52,440
One group is men, one group is women, and we are comparing their probability distributions.

12
00:00:52,440 --> 00:00:58,110
So we have a probability distribution for men across cats, dogs and other pets, and then the same

13
00:00:58,110 --> 00:01:00,240
probability distribution for women.

14
00:01:00,240 --> 00:01:06,720
Now, if gender has no effect at all on pet preference, then we would expect those probability distributions

15
00:01:06,720 --> 00:01:07,890
to be similar.

16
00:01:07,890 --> 00:01:15,810
In other words, if men prefer cats at roughly a 25% rate, dogs at 50% and other pets at 25%, then

17
00:01:15,810 --> 00:01:22,500
we would expect roughly the same distribution for women, 25% for cats, 50% for dogs and 25% for other

18
00:01:22,500 --> 00:01:23,100
pets.

19
00:01:23,130 --> 00:01:29,370
Now, of course, we know that these two distributions are very unlikely to be identical, but if gender

20
00:01:29,370 --> 00:01:34,140
doesn't affect pet preference, then we could expect them to be roughly similar.

21
00:01:34,140 --> 00:01:39,450
But if gender does have an effect on pet preference, then we might see these probability distributions

22
00:01:39,450 --> 00:01:41,370
being extremely different.

23
00:01:41,370 --> 00:01:47,670
And the chi squared test statistic lets us look at how different the distributions are and then determine

24
00:01:47,670 --> 00:01:53,220
whether the difference is big enough to be called statistically significant, such that we could reject

25
00:01:53,220 --> 00:01:58,410
the null hypothesis that gender doesn't affect pet preference and therefore lend support to the alternative

26
00:01:58,410 --> 00:02:01,800
hypothesis that pet preference is affected by gender.

27
00:02:01,800 --> 00:02:04,770
So that's one kind of test we can run.

28
00:02:04,800 --> 00:02:11,220
Another kind of test we can run with chi squared is a chi squared test for association or independence.

29
00:02:11,220 --> 00:02:18,540
In this kind of test, we sample from one group and we compare characteristics for that same group.

30
00:02:18,540 --> 00:02:22,770
For instance, we could compare eye color and handedness.

31
00:02:22,770 --> 00:02:29,220
So whether someone is left handed or right handed for the same set of individuals, we might make a

32
00:02:29,220 --> 00:02:30,990
data table that looks like this.

33
00:02:30,990 --> 00:02:38,130
So we pull a sample of people all from the same population and we record whether they are left handed

34
00:02:38,130 --> 00:02:40,890
or right handed in contrast with their eye color.

35
00:02:40,890 --> 00:02:47,640
So out of our one group, we can see that we have 72 people who are left handed with brown eyes, 36

36
00:02:47,640 --> 00:02:52,650
people who are left handed with blue eyes, 130 people who are right handed with green eyes, etc.

37
00:02:52,650 --> 00:02:58,500
So in this kind of test, our null hypothesis would be that eye color and handedness are independent,

38
00:02:58,500 --> 00:03:01,080
they're not associated, they have no association.

39
00:03:01,080 --> 00:03:06,510
And therefore the alternative hypothesis would state that eye color and handedness are associated,

40
00:03:06,510 --> 00:03:12,570
they are dependent and we would again, similar to the homogeneity test, look at the distributions

41
00:03:12,570 --> 00:03:15,270
of the left handed people and the right handed people.

42
00:03:15,270 --> 00:03:21,210
If handedness and eye color are not associated, then the distributions for left handed and right handed

43
00:03:21,210 --> 00:03:22,320
should be similar.

44
00:03:22,320 --> 00:03:27,300
But if handedness and eye color are associated, if they're dependent on one another, then maybe we'll

45
00:03:27,300 --> 00:03:31,320
see different distributions for left handed and right handed people.

46
00:03:31,320 --> 00:03:37,230
And if that difference is significant enough, then we may be able to reject the null and lend support

47
00:03:37,230 --> 00:03:43,290
to our alternative hypothesis that handedness and eye color are associated variables.
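
To make "comparing the distributions" concrete, here is a minimal sketch that turns each handedness row into proportions across eye colors. The counts are reconstructed from the cell values and totals quoted later in this video (a left-handed total of 140 and eye color totals of 532, 251, 150 and 67); the variable and function names are only illustrative.

```python
# Counts reconstructed from figures quoted in the video (see the note above).
eye_colors = ["brown", "blue", "green", "hazel"]
left = [72, 36, 20, 12]
right = [460, 215, 130, 55]

def proportions(counts):
    """Convert raw counts into a probability distribution (rounded)."""
    total = sum(counts)
    return [round(c / total, 3) for c in counts]

left_dist = proportions(left)
right_dist = proportions(right)

# Print the two distributions side by side for each eye color.
for color, l, r in zip(eye_colors, left_dist, right_dist):
    print(color, l, r)
```

If the two variables are independent, the two columns of proportions should be close to each other; the chi squared test is what decides whether any gap is large enough to matter.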

48
00:03:43,290 --> 00:03:47,070
So you can see how these two kinds of tests are similar.

49
00:03:47,070 --> 00:03:51,570
The third kind of test we'll look at is a chi squared goodness of fit test.

50
00:03:51,570 --> 00:03:56,910
And here we're comparing actual versus expected values in a contingency table.

51
00:03:56,910 --> 00:04:03,060
For example, we might try to answer the question, are sales, the number of sales of a product, affected

52
00:04:03,060 --> 00:04:04,230
by month?

53
00:04:04,230 --> 00:04:09,870
And maybe we would use this table here where we have the months of the year, January, February, March,

54
00:04:09,870 --> 00:04:14,370
etc. and we have the number of sales that we make in each month.

55
00:04:14,370 --> 00:04:21,510
So we call this a contingency table because the number of sales are contingent on the month of the year.

56
00:04:21,540 --> 00:04:27,870
To answer this kind of a question, we would think about the expected number of sales in each month

57
00:04:27,870 --> 00:04:30,480
based on the total number of sales for the year.

58
00:04:30,480 --> 00:04:37,080
In other words, if sales aren't really impacted by month, then we would expect a steady level or a

59
00:04:37,080 --> 00:04:39,660
consistent level of sales across the year.

60
00:04:39,660 --> 00:04:44,100
Month wouldn't have an effect on sales, and sales should remain fairly steady.

61
00:04:44,100 --> 00:04:50,340
And so in that sense, just like with these two tests, we would compare the distribution of actual

62
00:04:50,340 --> 00:04:53,190
sales to the distribution of expected sales.

63
00:04:53,190 --> 00:04:59,130
And if those two distributions are significantly different enough, then we might be able to conclude

64
00:04:59,130 --> 00:04:59,850
that sales

65
00:04:59,910 --> 00:05:01,630
are affected by month.

66
00:05:01,650 --> 00:05:06,230
So that gives you a little bit of an idea of what we're doing with chi squared tests.

67
00:05:06,240 --> 00:05:11,970
Let's dig into this goodness of fit test in a little more detail so that we can see an example and work

68
00:05:11,970 --> 00:05:14,310
through a full chi squared test.

69
00:05:14,670 --> 00:05:20,670
So we're starting here with an expanded version of the table that we had for the goodness of fit test.

70
00:05:20,700 --> 00:05:23,850
We're trying to figure out if sales are affected by month.

71
00:05:23,850 --> 00:05:25,920
So we had the months of the year.

72
00:05:26,040 --> 00:05:29,880
These were the observed number of sales in each month.

73
00:05:29,880 --> 00:05:35,970
So in January we sold 60 products, in February we sold 80 products, etc. And when we total up all

74
00:05:35,970 --> 00:05:39,570
those sales, we have 1020 sales for the year.

75
00:05:39,600 --> 00:05:47,700
Now, if we take 1020 and we divide it by 12 months in the year, we get 85.

76
00:05:47,700 --> 00:05:55,170
And so our expected value for each month of the year is 85, because if a month has no effect on number

77
00:05:55,170 --> 00:06:02,430
of sales, then sales should remain steady and our expected value would be a steady, consistent 85

78
00:06:02,430 --> 00:06:07,470
sales per month for a total at the end of the year of 1020 sales.
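
As a quick check of that arithmetic, here is a minimal sketch using only the two figures quoted above (1020 total sales and 12 months); the variable names are illustrative.

```python
# Observed yearly total and number of months, as quoted in the video.
total_sales = 1020
months = 12

# Under the null hypothesis (month has no effect), sales are spread evenly.
expected_per_month = total_sales / months

print(expected_per_month)  # 85.0
```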

79
00:06:07,470 --> 00:06:12,060
What we want to do now is use a chi squared goodness of fit test.

80
00:06:12,060 --> 00:06:21,060
We're going to pick a confidence level of 95%, or an alpha value of 0.05, an alpha value of 5%.

81
00:06:21,060 --> 00:06:26,640
And we're going to use this chi squared test to make a determination about whether sales are affected

82
00:06:26,640 --> 00:06:27,870
by month of the year.

83
00:06:27,870 --> 00:06:35,460
So with all of these chi squared tests, our null hypothesis is that month has no effect or that the

84
00:06:35,460 --> 00:06:37,410
distributions aren't different.

85
00:06:37,410 --> 00:06:40,650
Basically that the two variables don't affect each other.

86
00:06:40,650 --> 00:06:44,130
In this case, that observed and expected values aren't different.

87
00:06:44,130 --> 00:06:47,280
So we would say here that sales

88
00:06:48,240 --> 00:06:51,060
aren't affected by month.

89
00:06:51,090 --> 00:06:57,570
That means that our alternative hypothesis then is that sales are affected by month or that the month of the

90
00:06:57,570 --> 00:06:59,900
year does have an effect on sales.

91
00:06:59,910 --> 00:07:06,090
So again, if sales aren't affected by month, then it would make sense that they would be evenly distributed

92
00:07:06,090 --> 00:07:11,550
over each month and we would see something fairly consistent like this expected value of 85.

93
00:07:11,580 --> 00:07:17,370
Now, keep in mind here that we can't just lend support to the alternative hypothesis that sales are

94
00:07:17,370 --> 00:07:23,610
affected by month just by having different observed values than the value that we expect.

95
00:07:23,610 --> 00:07:31,140
For instance, if our observed values were 84 for January, 86 for February, 84 for March, 86

96
00:07:31,140 --> 00:07:37,380
for April, 84 for May, 86 for June, etc. Yes, those values are technically different than the

97
00:07:37,380 --> 00:07:44,370
expected values of 85 for every single month, but the values are so similar, they're consistent enough

98
00:07:44,370 --> 00:07:50,820
that the difference will not be statistically significant and we will not have enough evidence to reject

99
00:07:50,820 --> 00:07:54,120
the null hypothesis and therefore lend support to the alternative.

100
00:07:54,150 --> 00:07:59,460
These observed values here not only have to be different, they have to be different enough to meet

101
00:07:59,460 --> 00:08:05,610
our level of statistical significance, and that's what we'll determine with our chi squared test statistic.

102
00:08:05,610 --> 00:08:08,970
So to calculate chi squared, that's our next step.

103
00:08:08,970 --> 00:08:14,640
For each of our data points, we find the difference between the observed and expected values.

104
00:08:14,640 --> 00:08:17,760
So we take observed value minus expected value.

105
00:08:17,760 --> 00:08:22,230
That should kind of remind us of the idea of the residual that we've been looking at as we've been working

106
00:08:22,230 --> 00:08:23,520
through regression here.

107
00:08:23,550 --> 00:08:27,780
So we take the observed minus expected and we get that difference.

108
00:08:27,780 --> 00:08:32,990
And so we can see here that the third row in our table is observed minus expected.

109
00:08:33,150 --> 00:08:36,450
Then we square that value to make all the values positive.

110
00:08:36,450 --> 00:08:39,480
That's the last row of our table, all those squared values.

111
00:08:39,480 --> 00:08:45,750
Once we have all those squared values, we add them all up and then divide by the expected value.

112
00:08:45,750 --> 00:08:52,050
We divide each one of them by the expected value, and then we add all of those results together. In

113
00:08:52,050 --> 00:08:56,670
this particular chi squared test, the expected values are all the same.

114
00:08:56,670 --> 00:09:03,390
That means we can add all these squared values in the fourth row and then just divide by 85 all at once.

115
00:09:03,390 --> 00:09:11,130
So the sum here of this fourth row is 6850, which means that the chi squared test statistic for this

116
00:09:11,130 --> 00:09:22,470
particular test will be 6850 divided by our expected value, 85, which is approximately

117
00:09:23,150 --> 00:09:24,010
80.59.
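
The calculation just described can be sketched as follows. The function name is illustrative. Because the full monthly table is not reproduced in this transcript, the general formula is applied to the near-uniform 84/86 example mentioned earlier (assuming that alternating pattern continues through December), and the video's own shortcut, 6850 divided by 85, is computed separately.

```python
def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [85] * 12

# Hypothetical near-uniform year from earlier in the video; the alternating
# 84/86 pattern is assumed to continue through December.
near_uniform = [84, 86] * 6
small_stat = chi_squared(near_uniform, expected)

# Shortcut used in the video: every expected value is 85, so the squared
# differences (which sum to 6850) can be divided by 85 all at once.
video_stat = 6850 / 85

print(round(small_stat, 2))  # 0.14
print(round(video_stat, 2))  # 80.59
```

The tiny value for the near-uniform year shows why observed values that are merely different from 85 do not produce a large test statistic.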

118
00:09:24,020 --> 00:09:31,790
So this is our chi squared test statistic, just like we would calculate a t-test statistic or a Z

119
00:09:31,790 --> 00:09:37,400
score, we calculate this chi squared test statistic and then, as you might have suspected, just like

120
00:09:37,400 --> 00:09:42,050
the T score or the Z score, we need to look this up in a chi squared table.

121
00:09:42,080 --> 00:09:48,500
Chi squared has its own distribution and so it has its own table and the chi squared table looks like

122
00:09:48,500 --> 00:09:49,040
this.

123
00:09:49,040 --> 00:09:56,900
It's very similar to the t table in the sense that we have a degrees of freedom value down this left

124
00:09:56,900 --> 00:09:57,560
hand side.

125
00:09:57,590 --> 00:10:03,650
Now when we think about degrees of freedom for a chi squared test, we always need to go back to the

126
00:10:03,650 --> 00:10:09,290
table that we're starting with and we want to look at the original data that we had, the observed data,

127
00:10:09,290 --> 00:10:10,730
including the total.

128
00:10:10,730 --> 00:10:19,100
So in this case we would be looking at this part right here of our table, all the original observed

129
00:10:19,100 --> 00:10:22,400
data plus the total.

130
00:10:22,400 --> 00:10:27,860
Now the degrees of freedom is the number of values that we would have to include in the body of the

131
00:10:27,860 --> 00:10:34,490
table, not including the total, in order to be able to fill in any other missing values in the table.

132
00:10:34,490 --> 00:10:36,020
So here's what we mean.

133
00:10:36,020 --> 00:10:41,810
If we look at the body of this table, we have all these values in the body 60, 80, 65, all the way

134
00:10:41,810 --> 00:10:44,120
up to 65 here in December.

135
00:10:44,120 --> 00:10:45,710
And then we have the total.

136
00:10:45,740 --> 00:10:51,950
The question is, how many values can we remove from the body of the table before we would no longer

137
00:10:51,950 --> 00:10:54,500
know what to put in each cell?

138
00:10:54,500 --> 00:11:02,720
Well, if we remove the value here in December and we pretend now that we don't have the value for December,

139
00:11:02,720 --> 00:11:06,230
the only values we have are January through November

140
00:11:06,230 --> 00:11:12,620
and the total. The question is, could we figure out what value to put back in for December?

141
00:11:12,620 --> 00:11:19,220
And of course, the answer is yes, because we could take the total 1020 and subtract from it all of

142
00:11:19,220 --> 00:11:22,160
these other values for January through November.

143
00:11:22,160 --> 00:11:27,290
And what we would be left with is 65, which has to be December's value.

144
00:11:27,290 --> 00:11:33,830
So we can remove the value for December and we can still know what all of the values in the table will

145
00:11:33,830 --> 00:11:34,270
be.

146
00:11:34,280 --> 00:11:40,490
Now the question becomes, can we remove another value from the body of the table and still know how

147
00:11:40,490 --> 00:11:42,080
to fill in the rest of the table?

148
00:11:42,080 --> 00:11:44,810
So let's say we remove November's value.

149
00:11:44,810 --> 00:11:48,830
Now the question is, can we accurately fill in November and December?

150
00:11:48,830 --> 00:11:53,000
And the answer is no, because we have the total 1020.

151
00:11:53,000 --> 00:12:00,530
But if we subtract out all these values January through October, we're left with a missing 125.

152
00:12:00,530 --> 00:12:05,030
But we don't know how to split 125 between November and December.

153
00:12:05,030 --> 00:12:11,000
We have no way of telling how much of that 125 goes in November and how much goes in December, which

154
00:12:11,000 --> 00:12:17,990
means that we can only remove one value from the table before we would no longer be able to finish filling

155
00:12:17,990 --> 00:12:18,410
it in.

156
00:12:18,410 --> 00:12:24,770
So if we can only remove one value, that means we have 11 values remaining in the table, which means

157
00:12:24,770 --> 00:12:28,190
in this case the degrees of freedom is 11.

158
00:12:28,190 --> 00:12:33,440
Degrees of freedom is the remaining number of values that we have in the table.

159
00:12:33,440 --> 00:12:39,170
And since that's a little confusing, let's look at one of the tables we had from earlier in the video.

160
00:12:39,200 --> 00:12:42,110
This handedness versus eye color table.

161
00:12:42,110 --> 00:12:47,780
Really what this comes down to is that when we look at the body of the table, we're always able to

162
00:12:47,780 --> 00:12:50,720
remove the last row and/or column of the table.

163
00:12:50,720 --> 00:12:55,580
So in this table we could remove all of these values.

164
00:12:56,230 --> 00:13:01,330
Because you can see that even if we didn't have any of these values here, we could find left handed

165
00:13:01,330 --> 00:13:08,770
hazel by subtracting 72, 36 and 20 from this 140 total on the right hand side. That would allow us to

166
00:13:08,770 --> 00:13:11,230
fill in this left handed hazel here.

167
00:13:11,230 --> 00:13:13,720
So then we would have this value.

168
00:13:13,750 --> 00:13:23,460
And then once we have that value, we can fill in right handed and brown by subtracting 72 from 532.

169
00:13:23,470 --> 00:13:24,580
So we'd have that value.

170
00:13:24,580 --> 00:13:30,640
We would find right handed and blue by subtracting 36 from 251 and so on.

171
00:13:30,640 --> 00:13:39,550
We'd take 150 minus 20 to get this cell and then 67 minus 12 to find this cell, and suddenly we'd have the whole

172
00:13:39,550 --> 00:13:40,420
table again.
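
Here is a minimal sketch of that fill-in process, using only the three left-handed cells and the totals quoted above. The dictionary layout and names are illustrative.

```python
# Values quoted in the video: three left-handed cells plus the totals.
left_known = {"brown": 72, "blue": 36, "green": 20}
left_total = 140
column_totals = {"brown": 532, "blue": 251, "green": 150, "hazel": 67}

# The fourth left-handed cell follows from the row total.
left = dict(left_known)
left["hazel"] = left_total - sum(left_known.values())

# Each right-handed cell follows from its column total.
right = {color: column_totals[color] - left[color] for color in column_totals}

print(left)
print(right)
```

Three known cells are enough to recover all eight cells in the body of the table, which is the idea behind counting degrees of freedom this way.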

173
00:13:40,420 --> 00:13:48,670
So if we're able to remove all of these values here, that means degrees of freedom for this table,

174
00:13:48,670 --> 00:13:54,430
for this scenario is three, because the number of remaining values in the body of our table is three.

175
00:13:54,430 --> 00:13:56,590
We have 72, 36 and 20.

176
00:13:56,590 --> 00:14:01,000
And so degrees of freedom here would be three in the same way that degrees of freedom up here would

177
00:14:01,000 --> 00:14:01,890
be 11.
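
The counting argument above matches the standard degrees of freedom formulas, which the video does not state explicitly: categories minus one for a goodness of fit test, and (rows minus one) times (columns minus one) for a two-way table. A minimal sketch, with illustrative function names:

```python
def df_goodness_of_fit(num_categories):
    """One value is fixed by the total, so the rest are free."""
    return num_categories - 1

def df_two_way_table(num_rows, num_cols):
    """The last row and last column are fixed by the totals."""
    return (num_rows - 1) * (num_cols - 1)

print(df_goodness_of_fit(12))   # 11 (twelve months)
print(df_two_way_table(2, 4))   # 3 (two hands, four eye colors)
```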

178
00:14:01,900 --> 00:14:06,130
So back to our original example, are sales affected by month.

179
00:14:06,130 --> 00:14:11,650
We have 11 degrees of freedom and we're interested in this 5% significance level.

180
00:14:11,650 --> 00:14:19,240
So we come here to 11 degrees of freedom and we work our way across to an upper tail probability of

181
00:14:19,240 --> 00:14:21,190
0.05.

182
00:14:21,190 --> 00:14:26,980
Whatever our level of significance is, we find that in the upper tail probability header row, and then

183
00:14:26,980 --> 00:14:33,490
we come here to their intersection and we see this value, 19.68.

184
00:14:33,490 --> 00:14:38,980
From here, the only thing left to say is whether our chi squared test statistic that we calculated is less than

185
00:14:38,980 --> 00:14:42,700
or greater than the value that we find in our chi squared table.

186
00:14:42,760 --> 00:14:52,660
In this case, we can see obviously that 80.59 is greater than 19.68 and whenever we find a chi squared

187
00:14:52,660 --> 00:14:58,120
test statistic that is greater than this critical value that we find in the chi squared table, that

188
00:14:58,120 --> 00:15:04,780
means that the test statistic we found is significant enough to allow us to reject the null hypothesis.
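
That decision rule can be sketched in a few lines. The critical value is simply copied from the chi squared table (11 degrees of freedom, upper tail probability 0.05) rather than computed, and the variable names are illustrative.

```python
# Test statistic from the video: 6850 / 85, about 80.59.
test_statistic = 6850 / 85

# Critical value read from the chi squared table (df = 11, alpha = 0.05).
critical_value = 19.68

# Reject the null hypothesis only when the statistic exceeds the table value.
reject_null = test_statistic > critical_value

print(reject_null)  # True
```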

189
00:15:04,780 --> 00:15:11,860
In our case, there's a massive difference between 80.59 and 19.68, which tells us that we have very,

190
00:15:11,860 --> 00:15:17,590
very strong evidence that sales are indeed affected by month of the year.

191
00:15:17,590 --> 00:15:25,090
So we can reject this null hypothesis that sales aren't affected by month and we give a lot of support

192
00:15:25,090 --> 00:15:26,980
because of the massive difference here.

193
00:15:26,980 --> 00:15:34,090
We give a lot of support to this alternative hypothesis, the hypothesis that sales are indeed affected

194
00:15:34,090 --> 00:15:40,210
by month of the year, that the distribution of sales doesn't just hold steady and constant unaffected

195
00:15:40,210 --> 00:15:46,450
as we go month to month, and that instead they fluctuate with some level of significance and the number

196
00:15:46,450 --> 00:15:49,870
of sales we make is dictated by the month of the year.

197
00:15:49,870 --> 00:15:51,700
It's affected by month of the year.

198
00:15:51,700 --> 00:15:57,280
So in a way we could say that month of the year and sales are dependent variables.

199
00:15:57,280 --> 00:16:01,420
They are associated, they are not independent of one another.

200
00:16:01,420 --> 00:16:08,560
So even though we looked at different kinds of chi squared tests, they all follow the same kind of general

201
00:16:08,560 --> 00:16:09,160
pattern.

202
00:16:09,160 --> 00:16:16,480
We're always comparing two different distributions and we sort of start with this idea that the distributions

203
00:16:16,480 --> 00:16:20,860
are the same and so we compute this expected value.

204
00:16:20,890 --> 00:16:27,040
Then we look at the difference between actual value or observed value and expected value, and we use

205
00:16:27,040 --> 00:16:30,220
this formula to calculate the chi squared test statistic.

206
00:16:30,220 --> 00:16:36,190
Once we have the chi squared test statistic and the number of degrees of freedom, we can look for our

207
00:16:36,190 --> 00:16:42,040
level of significance in the chi squared table and compare our test statistic to the value that we find

208
00:16:42,040 --> 00:16:47,620
in the table and use the relationship between the test statistic and the value from the table to make

209
00:16:47,620 --> 00:16:52,780
a conclusion about whether or not we can reject the null hypothesis.