1
00:00:00,120 --> 00:00:06,810
Everything we'll cover in this section and the next really depends on our ability to take a sample from

2
00:00:06,810 --> 00:00:12,810
a population, which means that we want to start this section with a good understanding of sampling,

3
00:00:12,810 --> 00:00:17,870
sampling techniques and the bias that we can potentially introduce when we take a sample.

4
00:00:17,880 --> 00:00:25,710
So we'll start by saying that a population is the entire group of subjects or information or people

5
00:00:25,710 --> 00:00:30,600
or data that we're interested in, and we indicate population with a capital N. We've already talked

6
00:00:30,600 --> 00:00:31,410
about population.

7
00:00:31,410 --> 00:00:38,310
We've already seen instances where we refer to the population with a capital N. This capital N value,

8
00:00:38,340 --> 00:00:44,070
the value of N represents the number of people, subjects, members, data, etc. that are included

9
00:00:44,070 --> 00:00:45,390
in the population.

10
00:00:45,390 --> 00:00:50,370
And by definition a sample is a subset of the population.

11
00:00:50,370 --> 00:00:56,880
So if this is our entire population, the entire universe of our population, then we want to understand

12
00:00:56,880 --> 00:01:00,660
that a sample is just a subset of the population.

13
00:01:00,660 --> 00:01:04,650
So maybe our population includes 5000 people.

14
00:01:04,650 --> 00:01:11,070
We might be able to take a sample of, let's say, 700 people of those 5000.

15
00:01:11,070 --> 00:01:17,820
And so the population size would be capital N equals 5000. The sample size would be lowercase n equals

16
00:01:17,820 --> 00:01:23,280
700, and those 700 people would just be taken from this larger population.

17
00:01:23,280 --> 00:01:30,240
Now, of course, this whole idea of sampling exists because we're interested in studying the population,

18
00:01:30,240 --> 00:01:38,190
but oftentimes it's very difficult or even impossible to actually collect data from every member of

19
00:01:38,190 --> 00:01:39,180
the population.

20
00:01:39,180 --> 00:01:44,760
We gave an example earlier in the course about studying the population of the state of California,

21
00:01:44,760 --> 00:01:46,680
which is many millions of people.

22
00:01:46,680 --> 00:01:53,310
If that's our population, it's going to be virtually impossible to collect data from the entire population

23
00:01:53,310 --> 00:01:58,980
because as we're conducting our study or our survey, some people will move out of the state, some

24
00:01:58,980 --> 00:02:00,480
people will move into the state.

25
00:02:00,480 --> 00:02:03,240
New babies will be born, some people might pass away.

26
00:02:03,240 --> 00:02:10,289
It would be essentially impossible to truly capture the entire population as a snapshot in one particular

27
00:02:10,289 --> 00:02:12,660
instant and define that entire population.

28
00:02:12,660 --> 00:02:18,780
And furthermore, the logistics of actually collecting responses from all of those people would be virtually

29
00:02:18,780 --> 00:02:21,150
impossible, even if we could define who they were.

30
00:02:21,150 --> 00:02:27,750
So for so many reasons, taking a smaller sample of the population of the state is going to make a lot

31
00:02:27,750 --> 00:02:28,650
more sense.

32
00:02:28,650 --> 00:02:36,330
Now, in a perfect world, we want this sample that we take to be a representative or unbiased sample.

33
00:02:36,330 --> 00:02:42,240
And what that means is that, of course we want the sample to do a good job representing the population

34
00:02:42,240 --> 00:02:45,360
and ideally we want it to be representative in every way.

35
00:02:45,360 --> 00:02:51,480
So let's take a basic characteristic of the population, like gender, males versus females in the state.

36
00:02:51,510 --> 00:02:54,750
Maybe the state is 54% female.

37
00:02:54,750 --> 00:02:59,490
Well, then ideally we would want our sample to also be 54% female.

38
00:02:59,490 --> 00:03:04,650
Maybe 20% of our population is between ages 35 and 50.

39
00:03:04,650 --> 00:03:10,020
Well, we might want 20% of our sample to fall between ages 35 and 50.

40
00:03:10,020 --> 00:03:16,230
We want the sample to represent the population because the idea here is that we would want to be able

41
00:03:16,230 --> 00:03:19,530
to use the sample to make inferences about the population.

42
00:03:19,530 --> 00:03:25,830
We want to be able to use what we find about the sample and scale it up so that we can make a proportional

43
00:03:25,830 --> 00:03:28,260
conclusion about the population itself.

44
00:03:28,260 --> 00:03:35,550
Let's take a really basic example and say that we have a big jar of marbles and the jar actually holds

45
00:03:35,790 --> 00:03:38,250
20 measuring cups of marbles.

46
00:03:38,250 --> 00:03:42,630
And we want to know the number of red marbles that exist in the jar.

47
00:03:42,630 --> 00:03:48,330
And we want to do that without going through every single marble in the jar and counting all of the

48
00:03:48,330 --> 00:03:49,110
red marbles.

49
00:03:49,110 --> 00:03:55,830
So what we might do instead is take a one cup sample of the 20 cups in the jar.

50
00:03:55,830 --> 00:04:02,250
And let's say that in that one cup sample we find that there are two red marbles.

51
00:04:02,250 --> 00:04:07,020
If our sample is representative, what we're able to do is set up an equation like this,

52
00:04:07,840 --> 00:04:14,620
where we say for every one cup of marbles in the jar, we find two red marbles in that one cup.

53
00:04:14,620 --> 00:04:18,610
And so in 20 cups, we should find how many red marbles.

54
00:04:18,640 --> 00:04:20,320
Well, this is an equation.

55
00:04:20,320 --> 00:04:26,320
We can solve it by recognizing that in order to change one into 20, we have to multiply it by 20,

56
00:04:26,320 --> 00:04:31,270
which means that in order to change two into this value, we also have to multiply it by 20.

57
00:04:31,270 --> 00:04:36,650
So two times 20 is 40 and we say x equals 40.

58
00:04:36,670 --> 00:04:40,930
In other words, 40 over 20 is the same thing as two over one.

59
00:04:40,930 --> 00:04:46,330
And so we would make a conclusion that there might be 40 red marbles in the entire jar.
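The proportion we just solved can be sketched in a few lines of Python. This is only an illustration under the same assumption we made out loud, that the one-cup sample is representative; the function name `scale_up` and its parameter names are made up for this example.

```python
def scale_up(count_in_sample, sample_size, population_size):
    """Scale a count seen in a sample up to the whole population.

    Solves count_in_sample / sample_size = x / population_size for x.
    Only meaningful if the sample is representative.
    """
    return count_in_sample * (population_size / sample_size)


# 2 red marbles in a 1-cup sample, 20 cups in the jar.
estimated_red = scale_up(count_in_sample=2, sample_size=1, population_size=20)
print(estimated_red)
```

With the numbers from the jar, the estimate comes out to 40 red marbles, matching x equals 40.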

60
00:04:46,360 --> 00:04:52,690
This is the kind of conclusion we might be able to make, assuming that this one cup sample of marbles

61
00:04:52,690 --> 00:04:59,460
that we took from the 20 total cups in the jar is actually representative of those 20 total cups.

62
00:04:59,470 --> 00:05:05,440
If this one cup sample we took is not representative, maybe there are no red marbles or there are many

63
00:05:05,440 --> 00:05:07,840
more red marbles in this particular cup

64
00:05:07,840 --> 00:05:09,850
than in another cup we might have chosen,

65
00:05:09,850 --> 00:05:14,740
then the sample isn't going to be representative and we're not going to get a good estimate of the total

66
00:05:14,740 --> 00:05:16,670
red marbles in the entire jar.

67
00:05:16,690 --> 00:05:23,110
So from here on, everything we talk about is going to be how to pick a representative sample and talking

68
00:05:23,110 --> 00:05:26,890
about the bias that we might introduce when we choose a sample.

69
00:05:26,890 --> 00:05:31,090
So let's start by talking about some different sampling techniques.

70
00:05:31,090 --> 00:05:37,390
In our marble example, we had 20 cups of marbles in a jar and we just scooped out a single one cup

71
00:05:37,390 --> 00:05:39,490
measurement and used that as a sample.

72
00:05:39,490 --> 00:05:44,860
But in the real world, with real world kinds of problems, what are the different sampling techniques

73
00:05:44,860 --> 00:05:45,780
that we can use?

74
00:05:45,790 --> 00:05:51,640
Well, one of the most common, because it's the easiest, is called a simple random sample.

75
00:05:51,670 --> 00:05:58,000
The advantage of using a simple random sample is that it can be an easy and therefore quick and cost

76
00:05:58,000 --> 00:05:59,740
effective method of sampling.

77
00:05:59,740 --> 00:06:06,580
And the idea is that we assign a number to every member of our population.

78
00:06:06,580 --> 00:06:12,940
And then ideally we use something like a random number generator to choose members from the population.

79
00:06:12,940 --> 00:06:17,860
So for instance, if we're trying to study all of the employees in our company, let's say there are

80
00:06:17,860 --> 00:06:23,980
1000 employees in our company, then we would assign all of the employees a number, starting with one

81
00:06:23,980 --> 00:06:25,480
all the way through 1000.

82
00:06:25,480 --> 00:06:29,740
So every employee has a number, and then we would use a random number generator.

83
00:06:29,740 --> 00:06:31,720
We can find one online.

84
00:06:31,720 --> 00:06:37,360
Excel also has a function that will randomly generate a number in Excel.

85
00:06:37,360 --> 00:06:41,380
The function is this one here, RANDBETWEEN,

86
00:06:42,230 --> 00:06:48,560
as in a random number between the ends of some interval, and the inputs

87
00:06:48,980 --> 00:06:53,230
a and b are the smallest and largest numbers in our list.

88
00:06:53,240 --> 00:06:58,640
So if our employees have been assigned a number one through 1000, then we would use the function

89
00:06:58,640 --> 00:06:59,030
=RANDBETWEEN(1, 1000),

90
00:06:59,030 --> 00:07:06,350
and Excel will randomly generate for us a number between one and 1000.

91
00:07:06,350 --> 00:07:12,470
Using a random number generator like this one will pick for us random numbers, and that's how we will

92
00:07:12,470 --> 00:07:14,690
build our simple random sample.

93
00:07:14,690 --> 00:07:23,300
So the result might be employee number 17, 162, 414, etc. And we could keep going until we have whatever

94
00:07:23,300 --> 00:07:24,380
sample size we need.

95
00:07:24,380 --> 00:07:27,620
Maybe we want a sample size of 150 employees.

96
00:07:27,620 --> 00:07:32,900
Of the 1000 total employees that we have, we would keep generating random numbers until we've picked

97
00:07:32,900 --> 00:07:35,210
randomly 150 employees.
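Outside of Excel, the same idea can be sketched with Python's standard `random` module. This is a minimal sketch, not the method used in the lecture: `random.sample` draws without replacement, so no employee number can come up twice, whereas generating numbers one at a time can repeat a number we already picked. The seed value is arbitrary and only there so the example is reproducible.

```python
import random

rng = random.Random(42)  # arbitrary seed, used only for reproducibility

# Employees have been numbered 1 through 1000.
employee_ids = range(1, 1001)

# Draw 150 distinct employee numbers, each employee equally likely to be chosen.
sample_ids = rng.sample(employee_ids, k=150)

print(len(sample_ids))   # how many employees we picked
print(sample_ids[:3])    # first few employee numbers in the sample
```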

98
00:07:35,240 --> 00:07:41,840
As we mentioned before, this sampling technique can be easy to implement sometimes and therefore cost

99
00:07:41,840 --> 00:07:43,370
effective and efficient.

100
00:07:43,370 --> 00:07:48,020
And it does a good job at truly picking random members from the population.

101
00:07:48,020 --> 00:07:50,030
But there are a couple of problems.

102
00:07:50,030 --> 00:07:55,640
For instance, we do need to be able to assign a number to every member of the population.

103
00:07:55,640 --> 00:08:02,690
And if the population is so large or hard to quantify that we can't assign a number to every

104
00:08:02,690 --> 00:08:09,470
single population member, then it may be difficult to apply this method, and simple random sampling

105
00:08:09,470 --> 00:08:13,760
might do a bad job ultimately creating a representative sample.

106
00:08:13,760 --> 00:08:20,900
And that's because maybe our population of 1000 employees in our company is made up of exactly 500 men

107
00:08:20,900 --> 00:08:22,220
and 500 women.

108
00:08:22,220 --> 00:08:29,180
We would expect then that our sample might have about 75 men and 75 women, but it's completely possible

109
00:08:29,180 --> 00:08:36,080
that using a totally random sample, we end up with a ratio like 130 men and only 20 women, which would

110
00:08:36,080 --> 00:08:41,659
obviously not be a representative sample of the population, at least in terms of gender.

111
00:08:41,659 --> 00:08:47,030
Now, a sampling technique we can use to potentially address some of the drawbacks of a simple

112
00:08:47,030 --> 00:08:50,960
random sample is what's called stratified random sampling.

113
00:08:50,960 --> 00:08:56,210
When we use this sampling technique, we break our population into strata.

114
00:08:56,210 --> 00:09:02,930
So we said here for the simple random sample example that if we have 1000 employees in our company,

115
00:09:03,470 --> 00:09:11,600
1000 total employees and 500 are men and 500 are women, instead of using a random number generator

116
00:09:11,600 --> 00:09:14,570
to pick a simple random sample from all 1000,

117
00:09:14,570 --> 00:09:20,450
in other words, instead of pulling our sample directly from this group, we could break the population

118
00:09:20,450 --> 00:09:26,630
in half and have a population of men and a population of women, and then take a simple random sample

119
00:09:26,630 --> 00:09:33,440
inside of the male population and a simple random sample inside of the female population to at least

120
00:09:33,440 --> 00:09:34,850
control for gender.

121
00:09:34,850 --> 00:09:42,290
So if our sample size is 150 people, we could take a random sample of 75 men and a random sample of

122
00:09:42,290 --> 00:09:48,590
75 women, and that would give us a 150 person sample that is more representative of the company as

123
00:09:48,590 --> 00:09:50,450
a whole in terms of gender.

124
00:09:50,450 --> 00:09:56,030
If we're going to use this technique, we just need to make sure that our strata don't overlap.

125
00:09:56,030 --> 00:09:59,090
So the strata need to be mutually exclusive.

126
00:09:59,090 --> 00:10:03,680
For example, here we divided the company into 500 men and 500 women.

127
00:10:03,680 --> 00:10:09,800
But let's say instead that we had divided the company by education level, classifying them by the college

128
00:10:09,800 --> 00:10:10,850
degrees they held.

129
00:10:10,850 --> 00:10:15,230
And we called one group bachelor's degrees and the other group master's degrees.

130
00:10:15,230 --> 00:10:19,850
Well, if we have an employee who has both a bachelor's degree and a master's degree and we include

131
00:10:19,850 --> 00:10:25,070
them in both strata, then we've tainted the exclusivity of these two groups.

132
00:10:25,070 --> 00:10:26,810
They need to be mutually exclusive.

133
00:10:26,810 --> 00:10:33,290
We just need to divide them along non-overlapping lines and then we need to take a proportional sample.

134
00:10:33,290 --> 00:10:41,540
So here we had two equally sized groups, but if our company instead had, let's say 750 employees whose

135
00:10:41,540 --> 00:10:48,410
first language is English, and then 250 employees whose first language is something other than English,

136
00:10:48,410 --> 00:10:55,760
and let's say we want to take a 100 employee sample, we would want to take 75 employees from the English

137
00:10:55,760 --> 00:11:02,450
group and then 25 employees from the not English group to keep the samples proportional to the size

138
00:11:02,450 --> 00:11:03,560
of each stratum.

139
00:11:03,560 --> 00:11:10,430
Then we would combine the 75 from this group and the 25 from this group to create our 100 employee sample

140
00:11:10,430 --> 00:11:14,270
where we've controlled somewhat for primary language spoken.
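Here's a rough sketch of that proportional split in Python. The function name `stratified_sample`, the stratum labels, and the ID ranges are all invented for illustration, and the strata are assumed to be mutually exclusive. Each stratum contributes in proportion to its size, and within each stratum we take a simple random sample.

```python
import random


def stratified_sample(strata, total_n, rng):
    """Take a proportional simple random sample from each stratum.

    strata maps a stratum name to a list of member IDs; the lists are
    assumed not to overlap. Rounding can make the pieces add up to
    slightly more or less than total_n when the shares are not whole numbers.
    """
    population_size = sum(len(members) for members in strata.values())
    result = {}
    for name, members in strata.items():
        n_from_stratum = round(total_n * len(members) / population_size)
        result[name] = rng.sample(members, k=n_from_stratum)
    return result


rng = random.Random(0)  # arbitrary seed for reproducibility

# Hypothetical numbering: IDs 1-750 are the English group, 751-1000 the rest.
strata = {
    "english": list(range(1, 751)),
    "not_english": list(range(751, 1001)),
}

sample = stratified_sample(strata, total_n=100, rng=rng)
print(len(sample["english"]), len(sample["not_english"]))
```

With 750 and 250 employees and a sample of 100, the split works out to 75 and 25, the same numbers we used above.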

141
00:11:14,270 --> 00:11:20,960
Now, while stratified random sampling can help to eliminate some of the sample selection bias that

142
00:11:20,960 --> 00:11:26,690
we might create with simple random sampling by controlling for certain characteristics of the population,

143
00:11:26,690 --> 00:11:33,230
this kind of sampling takes more work, and we might not have the resources or time or money or manpower

144
00:11:33,230 --> 00:11:35,510
to perform this type of sampling.

145
00:11:35,510 --> 00:11:41,450
And of course, just because we've controlled for one characteristic still doesn't mean that we're getting

146
00:11:41,450 --> 00:11:41,570
a

147
00:11:41,640 --> 00:11:43,150
perfectly representative sample.

148
00:11:43,170 --> 00:11:49,410
For instance, we might control for gender, but maybe we're not controlling for household income or

149
00:11:49,410 --> 00:11:53,130
language spoken or department of our company.

150
00:11:53,160 --> 00:11:58,140
All of which might be important variables to consider when we're building our sample.

151
00:11:58,140 --> 00:12:00,420
But these aren't our only options.

152
00:12:00,420 --> 00:12:03,960
We can also take what's called a clustered random sample.

153
00:12:03,960 --> 00:12:07,650
This sampling technique can be really useful depending on the situation.

154
00:12:07,650 --> 00:12:13,500
For instance, maybe we're interested in learning something about the employees of Fortune 500 companies.

155
00:12:13,500 --> 00:12:19,080
Well, it might be prohibitively difficult for us to get a complete list of every single employee at

156
00:12:19,080 --> 00:12:22,050
all 500 of these extremely large companies.

157
00:12:22,050 --> 00:12:25,710
So conducting a simple random sample might not make sense.

158
00:12:25,710 --> 00:12:32,850
What we could do instead is say that we have 500 Fortune 500 companies, and we could think about each

159
00:12:32,850 --> 00:12:35,430
of those companies as its own cluster.

160
00:12:35,430 --> 00:12:36,990
So we have 500 clusters.

161
00:12:36,990 --> 00:12:44,850
We could take a random sample of those 500 companies so we could assign each of the Fortune 500 companies

162
00:12:44,880 --> 00:12:47,040
a number one through 500.

163
00:12:47,040 --> 00:12:49,860
Then we could use a random number generator.

164
00:12:49,860 --> 00:12:56,880
Let's say we want to take a sample of 25 of these companies so we would use our random number generator

165
00:12:56,880 --> 00:13:01,740
to pick randomly 25 of the Fortune 500 companies.

166
00:13:01,740 --> 00:13:05,700
So now we have 25 companies instead of 500 companies.

167
00:13:05,700 --> 00:13:14,100
If we then survey every employee at those 25 companies and use them as a sample of all of the employees

168
00:13:14,100 --> 00:13:17,400
at all 500 companies, we would call that a

169
00:13:18,390 --> 00:13:19,710
single-stage

170
00:13:20,710 --> 00:13:24,680
cluster sample. But maybe this sample is still too large.

171
00:13:24,700 --> 00:13:28,750
After all, these companies are enormous global companies.

172
00:13:28,750 --> 00:13:33,360
So adding up all the employees at all 25 companies might still give us a huge list.

173
00:13:33,370 --> 00:13:40,480
Instead, what we could do is take a random sample within each of these companies.

174
00:13:40,480 --> 00:13:48,040
For instance, maybe, using a simple random sample, we sample 100 employees from each

175
00:13:48,040 --> 00:13:49,770
of these 25 companies.

176
00:13:49,780 --> 00:13:54,520
If we do that, then we call this a double-stage, or two-stage,

177
00:13:55,490 --> 00:13:58,610
cluster sample because we're sampling in two stages.

178
00:13:58,610 --> 00:14:00,200
We started with 500 companies.

179
00:14:00,200 --> 00:14:07,370
We consider each company its own cluster of employees, so we sample to choose randomly 25 clusters.

180
00:14:07,370 --> 00:14:13,820
And then within each of those clusters, we pick out 100 employees in the second stage of sampling.

181
00:14:13,820 --> 00:14:22,910
And if we have 100 employees at 25 companies, then the result is a 2500 employee sample, 100 times

182
00:14:22,910 --> 00:14:23,690
25.

183
00:14:23,690 --> 00:14:31,550
We get 2500 total employees and we might use this set of employees as a sample of all of the employees

184
00:14:31,550 --> 00:14:34,640
at all 500 Fortune 500 companies.
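The two stages can be sketched in Python like this. Everything about the data here is made up: we don't have real rosters, so each of the 500 companies gets a hypothetical list of 1000 employee IDs just so the example runs. The seed is arbitrary.

```python
import random

rng = random.Random(1)  # arbitrary seed for reproducibility

# Hypothetical rosters: company number -> list of made-up employee IDs.
companies = {
    c: [f"c{c}-e{e}" for e in range(1, 1001)]
    for c in range(1, 501)
}

# Stage 1: randomly choose 25 of the 500 clusters (companies).
chosen_companies = rng.sample(sorted(companies), k=25)

# Stage 2: within each chosen company, take a simple random sample of 100 employees.
sample = [
    employee
    for c in chosen_companies
    for employee in rng.sample(companies[c], k=100)
]

print(len(chosen_companies), len(sample))
```

If we skipped stage 2 and kept every employee at the 25 chosen companies, that would be the single-stage version.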

185
00:14:34,640 --> 00:14:39,380
But again, just like with the sampling techniques we talked about before, there are always pros and

186
00:14:39,380 --> 00:14:40,340
cons here, right?

187
00:14:40,340 --> 00:14:44,150
This can sometimes be a more labor intensive sampling technique.

188
00:14:44,150 --> 00:14:49,610
And there's nothing really to say that these 25 companies are representative of these 500.

189
00:14:49,610 --> 00:14:56,150
Maybe we would want to use a stratified random sample to pull these 25 from these 500 to help us control

190
00:14:56,150 --> 00:14:57,920
for some characteristics.

191
00:14:57,920 --> 00:15:03,530
There's also nothing to say that these 100 employees are representative of all of the employees at each

192
00:15:03,530 --> 00:15:05,000
of these 25 companies.

193
00:15:05,000 --> 00:15:07,130
There are always pros and cons.

194
00:15:07,130 --> 00:15:11,270
The last sampling technique that we want to talk about is systematic random sampling.

195
00:15:11,270 --> 00:15:16,130
This one's really similar to simple random sampling, and it's pretty easy to do.

196
00:15:16,160 --> 00:15:23,330
We assign a number to every member of the population and then we just choose randomly

197
00:15:23,330 --> 00:15:29,510
(we could use a random number generator for this) some starting number and then

198
00:15:29,510 --> 00:15:33,560
pick some interval at which to pull members from the population.

199
00:15:33,560 --> 00:15:38,720
So again, let's say we have our 1000 employee population at our company.

200
00:15:38,720 --> 00:15:45,860
We assign each of them a number one through 1000, and then we use a random number generator to choose,

201
00:15:45,860 --> 00:15:51,080
let's say the number 167, and then also the number 11.

202
00:15:51,080 --> 00:15:59,390
Then to build our sample, we would start with employee number 167, and then we would just add 11 every

203
00:15:59,390 --> 00:16:02,870
time to this previous number to get the next member of the sample.

204
00:16:02,870 --> 00:16:12,350
So we would also then use employee number 178 and then employee number 189 and then employee number

205
00:16:12,530 --> 00:16:21,800
200, 211, etc., until we had the total number of employees that we wanted to include in our sample.
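As a small sketch of that stepping pattern in Python (the function name `systematic_sample` is made up for this example, and it simply stops when it runs past the last employee number, since we haven't said anything about wrapping back around to the start of the list):

```python
def systematic_sample(start, step, population_size, sample_size):
    """Pick IDs start, start + step, start + 2 * step, ...

    Stops when sample_size IDs have been collected or when the next ID
    would fall beyond population_size, whichever happens first.
    """
    ids = []
    current = start
    while current <= population_size and len(ids) < sample_size:
        ids.append(current)
        current += step
    return ids


first_five = systematic_sample(start=167, step=11, population_size=1000, sample_size=5)
print(first_five)  # [167, 178, 189, 200, 211]
```

One thing this makes visible: starting at 167 and stepping by 11 reaches at most 76 employees before passing employee number 1000, so for a larger sample we would need a smaller interval or an earlier starting number.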

206
00:16:21,800 --> 00:16:25,460
So we've introduced these different sampling techniques.

207
00:16:25,460 --> 00:16:32,660
But remember here that practically speaking, our goal is never really to pick a sample that's absolutely

208
00:16:32,660 --> 00:16:36,470
perfectly representative in every way of our population.

209
00:16:36,470 --> 00:16:39,410
That's almost certainly going to be impossible.

210
00:16:39,410 --> 00:16:43,460
We can't control perfectly for every characteristic of the population.

211
00:16:43,460 --> 00:16:49,340
Instead, our goal is always to think through what we're trying to study, what characteristics we have

212
00:16:49,340 --> 00:16:54,440
in the population, what's maybe most important to try to control for. And we're trying to build the

213
00:16:54,440 --> 00:16:58,760
most unbiased representative sample that we possibly can.

214
00:16:58,760 --> 00:17:05,480
And then, and this is critical, pair that with being completely transparent about our sampling technique.

215
00:17:05,480 --> 00:17:12,349
So we do our best to accomplish this goal of finding a representative, unbiased sample that will let

216
00:17:12,349 --> 00:17:15,440
us make good inferences about the population.

217
00:17:15,440 --> 00:17:21,980
But whatever sampling technique or set of sampling techniques we use to get to that sample, we just

218
00:17:21,980 --> 00:17:28,790
over-communicate about exactly what we did, how we built our sample, what methods we used, etc.,

219
00:17:28,790 --> 00:17:36,620
so that anyone who is trying to interpret the findings of our study can get the clearest possible view

220
00:17:36,620 --> 00:17:39,890
of the limitations that are attached to our results.

221
00:17:39,890 --> 00:17:45,470
In fact, we ourselves might even want to offer up anything that we can spot about the limitations of

222
00:17:45,470 --> 00:17:46,370
our sampling.

223
00:17:46,370 --> 00:17:51,800
For instance, if we took a clustered random sample like this with the Fortune 500 companies, we might

224
00:17:51,800 --> 00:17:57,830
want to point out right up front that we used a random number generator to pick 25 of the Fortune 500

225
00:17:57,830 --> 00:18:02,470
companies, but that these 25 might not be representative of the total 500.

226
00:18:02,480 --> 00:18:02,900
Maybe

227
00:18:02,900 --> 00:18:09,470
we look at the 25 we picked and they are disproportionately concentrated in one or two geographic regions.

228
00:18:09,470 --> 00:18:15,410
And so we might want to point out, along with our result, our conclusions, that we still have questions

229
00:18:15,410 --> 00:18:22,820
about the geographic representation of our sample and that the lack of accurate geographic representation

230
00:18:22,820 --> 00:18:25,070
may be skewing our results.

231
00:18:25,070 --> 00:18:31,040
So we're trying to be as representative as we can be paired with as much transparency as possible about

232
00:18:31,040 --> 00:18:32,570
our sampling techniques.

233
00:18:32,900 --> 00:18:38,540
Now, all that being said about sampling techniques, let's talk about some of the bias we might introduce

234
00:18:38,540 --> 00:18:44,780
depending on the sampling techniques that we use and how we engage with our sample participants.

235
00:18:44,780 --> 00:18:51,200
So when it comes to actually building our sample, we want to be careful about voluntary response or

236
00:18:51,200 --> 00:18:52,910
self-selection bias.

237
00:18:52,910 --> 00:18:55,040
And there are so many different types of

238
00:18:55,390 --> 00:18:56,890
biases we can introduce.

239
00:18:56,890 --> 00:19:02,110
We want to cover just some of the most obvious ones here, but just know that this is certainly not

240
00:19:02,110 --> 00:19:03,550
a complete list.

241
00:19:03,550 --> 00:19:10,120
So the idea with voluntary response bias or self-selection bias is that if people are asked to voluntarily

242
00:19:10,120 --> 00:19:16,360
participate in a study or a survey or they self-select for participating, we ask people to participate

243
00:19:16,360 --> 00:19:19,100
and some people voluntarily participate and others don't.

244
00:19:19,120 --> 00:19:24,850
By definition, we're getting people who are more willing to participate, and that could skew our results.

245
00:19:24,850 --> 00:19:31,240
For instance, maybe we want to ask students about their study habits and we allow students to voluntarily

246
00:19:31,240 --> 00:19:32,080
participate.

247
00:19:32,080 --> 00:19:37,720
Well, maybe students who already have better study habits, who are better students, are more eager

248
00:19:37,720 --> 00:19:43,120
to participate in a study like this one than students with worse study habits or students who maybe

249
00:19:43,120 --> 00:19:45,730
aren't doing as well in their classes.

250
00:19:45,730 --> 00:19:51,310
Which means we're going to end up with a sample of students who maybe study more, are more successful

251
00:19:51,310 --> 00:19:52,300
or getting better grades.

252
00:19:52,300 --> 00:19:58,810
That's going to affect the results of our sample and maybe make our sample not representative of the

253
00:19:58,810 --> 00:20:00,100
entire population.

254
00:20:00,250 --> 00:20:06,850
We can also think about non-response bias, which is all about people who choose not to respond to the

255
00:20:06,850 --> 00:20:08,350
survey or to a study.

256
00:20:08,350 --> 00:20:15,730
So if a political representative asks his constituents to fill out a survey and mail it back, he might

257
00:20:15,730 --> 00:20:21,460
get fewer responses from busier people, people who are working parents, caretakers, people who travel

258
00:20:21,460 --> 00:20:26,710
for work and aren't home as much and don't get their mail, don't have time to fill out the survey and

259
00:20:26,710 --> 00:20:27,730
send it back.

260
00:20:27,730 --> 00:20:33,820
Those groups may fail to respond to the survey and our sample is going to be biased.

261
00:20:33,820 --> 00:20:40,480
Similar to non-response bias is this idea of undercoverage, where we just don't collect data

262
00:20:40,480 --> 00:20:45,520
from entire groups of the population that actually should be included in our study.

263
00:20:45,520 --> 00:20:51,460
Maybe we run a daycare center and we're surveying parents as they come to pick up their children, but

264
00:20:51,460 --> 00:20:57,010
maybe whoever is conducting the study only stays at the daycare center until 5 p.m.

265
00:20:57,010 --> 00:21:02,650
Well, any parents who pick up their children after 5 p.m. won't be represented in the study.

266
00:21:02,650 --> 00:21:05,080
They were just completely missed as a group.

267
00:21:05,080 --> 00:21:10,420
And maybe it's the people who pick up their children later who make more money or less money.

268
00:21:10,420 --> 00:21:13,990
Maybe they're more likely to work full time or more likely to work part time.

269
00:21:13,990 --> 00:21:18,070
Maybe they're more likely to be single income households or dual income households.

270
00:21:18,070 --> 00:21:21,880
We could be missing a whole group of our population.

271
00:21:21,880 --> 00:21:26,320
And so we have this undercoverage risk that's going to bias our sample.

272
00:21:26,320 --> 00:21:30,580
We also have to worry about prescreening or advertising bias.

273
00:21:30,580 --> 00:21:34,690
How do we get people to participate in our sample in the first place?

274
00:21:34,690 --> 00:21:42,280
Going back to the idea of surveying employees at our company, if we post a flyer in the cafeteria of

275
00:21:42,280 --> 00:21:47,440
our company offices, and that's the only place that we advertise the survey, then the only people

276
00:21:47,440 --> 00:21:51,970
that we're going to get included in our sample are people who saw that flyer, which means people who

277
00:21:51,970 --> 00:21:57,760
use the cafeteria, who work when the cafeteria is open and choose to visit the cafeteria, which means

278
00:21:57,760 --> 00:22:01,540
they're people who maybe didn't pack their own lunch and bring it to work.

279
00:22:01,540 --> 00:22:04,930
We're setting ourselves up for a certain kind of sample.

280
00:22:04,930 --> 00:22:11,800
We're introducing bias just in the prescreening process or in the way that we advertised for this study.

281
00:22:11,800 --> 00:22:13,480
And then the last one that we'll talk about

282
00:22:13,480 --> 00:22:19,810
(but remember, these are certainly not all of the different types of bias we can have) is survivorship

283
00:22:19,810 --> 00:22:20,680
bias.

284
00:22:20,680 --> 00:22:26,710
Maybe we're trying to understand more about what the business environment looks like today in present

285
00:22:26,710 --> 00:22:32,590
time, but maybe what we miss is that the business environment, the business climate today is very

286
00:22:32,590 --> 00:22:33,250
difficult.

287
00:22:33,250 --> 00:22:39,970
It's a tough economy and maybe many businesses have just gone out of business and we don't survey or

288
00:22:39,970 --> 00:22:45,880
we don't include any representation of those businesses who just closed, who just shut down.

289
00:22:45,880 --> 00:22:50,770
They were fully operational last month, but in this difficult environment they've closed.

290
00:22:50,770 --> 00:22:53,590
And so we're not including them in our sample.

291
00:22:53,590 --> 00:22:56,500
And we're introducing this idea of survivorship bias.

292
00:22:56,500 --> 00:23:01,780
This could also kind of be an example of undercoverage, where we're not covering that group of businesses

293
00:23:01,780 --> 00:23:02,890
that just closed.

294
00:23:02,890 --> 00:23:09,100
It's also potentially similar to non-response bias, and many of these biases are related to this idea of convenience

295
00:23:09,100 --> 00:23:16,060
sampling where we sample in a way that is convenient and easy for us but actually doesn't build a very

296
00:23:16,060 --> 00:23:17,680
good representative sample.

297
00:23:17,680 --> 00:23:25,240
We want to avoid choosing a sample based on convenience and instead start with an idea about what a

298
00:23:25,240 --> 00:23:27,700
good representative sample might be.

299
00:23:27,700 --> 00:23:34,270
We want to think about the best way to collect a good sample, not how to sample in a way that is most

300
00:23:34,270 --> 00:23:38,380
convenient, easiest, fastest, cheapest for us.

301
00:23:38,380 --> 00:23:44,080
Because if we just sample in a way that is convenient, we may not get a representative sample at all.
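
To make that concrete, here is a minimal sketch in Python that compares a simple random sample with a convenience sample drawn only from cafeteria users. Everything in it is made up for illustration: the 5000 employees, the 40% who use the cafeteria, and the assumption that cafeteria users report higher satisfaction scores.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 5000 employees; the first 2000 use the cafeteria.
# Assumption for illustration only: cafeteria users score higher on satisfaction.
population = []
for i in range(5000):
    uses_cafeteria = i < 2000
    center = 7.0 if uses_cafeteria else 5.0
    population.append((uses_cafeteria, random.gauss(center, 1.0)))

true_mean = statistics.mean(score for _, score in population)

# Simple random sample: every employee has the same chance of being chosen.
random_sample = random.sample(population, 700)
random_mean = statistics.mean(score for _, score in random_sample)

# Convenience sample: only people who could have seen the cafeteria flyer.
cafeteria_users = [person for person in population if person[0]]
convenience_sample = random.sample(cafeteria_users, 700)
convenience_mean = statistics.mean(score for _, score in convenience_sample)

print(f"population mean:         {true_mean:.2f}")
print(f"random sample mean:      {random_mean:.2f}")
print(f"convenience sample mean: {convenience_mean:.2f}")
```

Both samples have 700 people, but only the random sample should land close to the population mean. The convenience sample is pulled toward the cafeteria group because nobody else had a chance to be included.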

302
00:23:44,080 --> 00:23:48,370
So these are all things that we need to be worried about when we are building our sample.

303
00:23:48,370 --> 00:23:54,340
But once we have our sample and now we're engaging with the participants in our sample, we can introduce

304
00:23:54,340 --> 00:23:54,670
all

305
00:23:54,840 --> 00:24:00,710
kinds of problems into the data, just by the way that we engage with the members of our sample.

306
00:24:00,720 --> 00:24:05,940
For instance, we can introduce what's called measurement bias, where maybe there's something wrong

307
00:24:05,940 --> 00:24:11,370
with the tool that we're using to measure data or collect data from our sample.

308
00:24:11,370 --> 00:24:16,440
For instance, maybe the data that we're collecting is the amount of time that our employees work,

309
00:24:16,440 --> 00:24:21,420
and we don't realize that when they're clocking in for work and clocking out of work, the clock that's

310
00:24:21,420 --> 00:24:24,120
keeping track of that time is actually slow.

311
00:24:24,120 --> 00:24:27,540
And so it appears as though they're working for less time than they actually are.

312
00:24:27,570 --> 00:24:31,980
The tool that we're using to measure the data that we're collecting is wrong.

313
00:24:31,980 --> 00:24:32,790
It's inaccurate.

314
00:24:32,790 --> 00:24:34,380
It's not well calibrated.

315
00:24:34,380 --> 00:24:36,150
That could throw off all of our data,

316
00:24:36,150 --> 00:24:39,000
even if we have a great representative sample.
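
As a rough sketch of why a good sample doesn't rescue us here, the Python below assumes the faulty clock distorts every recorded shift by the same fixed amount. The size and direction of that error (15 minutes short) and the shift lengths are invented for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical calibration error: every shift is recorded 0.25 hours short.
CLOCK_ERROR_HOURS = -0.25

# Hypothetical true shift lengths, in hours, for 5000 employees.
true_hours = [random.gauss(8.0, 0.5) for _ in range(5000)]
population_mean = statistics.mean(true_hours)

def recorded_sample_mean(n):
    """Mean of n randomly sampled shifts as recorded by the faulty clock."""
    sample = random.sample(true_hours, n)
    return statistics.mean(hours + CLOCK_ERROR_HOURS for hours in sample)

# The gap between recorded and true means stays near the clock error,
# no matter how large the random sample is.
for n in (50, 700, 5000):
    gap = recorded_sample_mean(n) - population_mean
    print(f"n = {n:4d}  recorded mean minus true mean: {gap:+.3f}")
```

A bigger random sample reduces sampling noise, but the gap settles on the clock error rather than on zero. The only fix is to correct the measuring tool.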

317
00:24:39,000 --> 00:24:42,750
We can also introduce something called social desirability bias.

318
00:24:42,750 --> 00:24:49,470
If we are asking questions of the people in our sample, we have to realize that people have this social

319
00:24:49,470 --> 00:24:50,610
desirability bias.

320
00:24:50,610 --> 00:24:52,830
They want to be seen in a favorable light.

321
00:24:52,830 --> 00:24:57,360
So if we ask them difficult questions, for example, have they ever stolen something?

322
00:24:57,360 --> 00:24:58,950
Or when was the last time they lied?

323
00:24:58,950 --> 00:25:00,570
They may not be truthful.

324
00:25:00,570 --> 00:25:05,610
They may not give us an accurate answer because they're tempted to answer the question in a way that

325
00:25:05,610 --> 00:25:06,870
makes them look better.

326
00:25:06,870 --> 00:25:12,480
So we have to be aware that the data that we're collecting around that question could be skewed toward

327
00:25:12,480 --> 00:25:14,550
a more favorable response.

328
00:25:14,670 --> 00:25:21,810
Somewhat related, we can have this idea of leading questions where the way that we are asking the question

329
00:25:21,810 --> 00:25:25,800
of our participants leads them to answer in a particular way.

330
00:25:25,830 --> 00:25:31,110
Maybe we're asking people about how they'll vote in the next election, and instead of asking them who

331
00:25:31,140 --> 00:25:36,060
they plan to vote for, we ask them whether they'll vote for the more experienced candidate.

332
00:25:36,060 --> 00:25:40,410
Of course, we think of the more experienced candidate as being the better candidate.

333
00:25:40,410 --> 00:25:45,930
And so if we ask the question that way, people might tend to sort of agree with us that they're going

334
00:25:45,930 --> 00:25:50,790
to vote for the more experienced candidate when they actually don't know who that is or they're not

335
00:25:50,790 --> 00:25:54,390
planning to vote for that candidate because it doesn't match their political party.

336
00:25:54,390 --> 00:25:59,820
We might be better off just asking them who they're going to vote for rather than leading them or prompting

337
00:25:59,820 --> 00:26:04,110
them with this idea of will you vote for the more experienced candidate?

338
00:26:04,290 --> 00:26:10,230
And then we can have something like recall bias where if we're asking people to remember something or

339
00:26:10,230 --> 00:26:15,780
give us an account of something that happened in the past, we know that human memory is actually notoriously

340
00:26:15,780 --> 00:26:16,350
bad.

341
00:26:16,350 --> 00:26:22,470
People will recall events in one way when the past actually unfolded differently than their memory.

342
00:26:22,470 --> 00:26:25,770
So asking people about the past is predictably risky.

343
00:26:25,770 --> 00:26:31,500
We introduce this idea of recall bias. Now, just like the biases we talked about for building the

344
00:26:31,500 --> 00:26:31,920
sample,

345
00:26:31,920 --> 00:26:37,260
this is certainly not an exhaustive list of biases that we can introduce when we're actually surveying

346
00:26:37,260 --> 00:26:38,970
the members of our sample.

347
00:26:38,970 --> 00:26:42,690
But it gives you an idea of some of the things that we want to be aware of.

348
00:26:42,690 --> 00:26:49,740
And to summarize, like the idea that we talked about before, the goal here is not to perfectly eliminate

349
00:26:49,740 --> 00:26:53,610
every kind of possible bias; that really is impossible.

350
00:26:53,610 --> 00:26:58,950
Instead, we want to do everything we can to introduce as little bias as possible and then at the same

351
00:26:58,950 --> 00:27:05,220
time be as transparent as we can about how we built our sample and how we collected data from that sample

352
00:27:05,220 --> 00:27:10,590
so that everybody looking at and interpreting the data can have as much information as possible about

353
00:27:10,590 --> 00:27:16,020
how much we can rely on the data from the sample actually representing the population.

