1
00:00:05,500 --> 00:00:09,100
Welcome back, everyone, to this next section of the course on sampling.

2
00:00:09,940 --> 00:00:15,340
When working with data, we often are limited in our ability to collect all the data in existence.

3
00:00:15,520 --> 00:00:20,950
More likely and more realistically, we're really just collecting a sample from a larger population,

4
00:00:20,950 --> 00:00:25,540
and we hope to use that sample to estimate characteristics of the whole population.

5
00:00:26,470 --> 00:00:31,870
Understanding sampling well is critical for testing out potential business decisions or strategies.

6
00:00:31,960 --> 00:00:37,360
As I mentioned, it's unlikely that we're ever going to have access to an entire population of data.

7
00:00:37,360 --> 00:00:43,480
So in setting up ideas to test, we need to understand how sampling plays a role in our experimentation

8
00:00:43,480 --> 00:00:44,320
process.

9
00:00:45,480 --> 00:00:50,340
Now, a quick important note on this idea of sampling versus the entire population.

10
00:00:50,370 --> 00:00:56,400
It should be noted that certain modern technology companies with a billion plus user bases like Gmail

11
00:00:56,400 --> 00:01:02,430
or Facebook or Instagram or these other services, they actually have a strategic advantage due to

12
00:01:02,430 --> 00:01:08,460
their scale of being able to conduct tests on enormous samples while only affecting a small percentage

13
00:01:08,460 --> 00:01:09,270
of users.

14
00:01:09,270 --> 00:01:15,420
So you can imagine that if you have a billion users or 2 billion users on a service such as Facebook,

15
00:01:15,420 --> 00:01:21,330
that if you want to quickly conduct a sample test, so to speak, on a million users, that is a much

16
00:01:21,330 --> 00:01:24,420
larger sample than most other companies have.

17
00:01:24,450 --> 00:01:29,370
However, with the scale of something like Facebook, it's actually a very small percentage of their

18
00:01:29,370 --> 00:01:30,060
users.

19
00:01:30,060 --> 00:01:35,400
And this combination of statistics is actually a strategic advantage, and I want you to keep that in the

20
00:01:35,400 --> 00:01:38,130
back of your mind as we continue to discuss sampling.

21
00:01:39,480 --> 00:01:43,920
So in this section of the course, we're going to be discussing sampling and bias, the central limit

22
00:01:43,920 --> 00:01:47,880
theorem, the Student's t-distribution, and confidence intervals.

23
00:01:49,210 --> 00:01:55,330
I want to begin by exploring some ideas of sampling tests and how sampling relates to the central limit

24
00:01:55,330 --> 00:01:57,430
theorem, which is a crucial concept.

25
00:01:58,580 --> 00:02:03,560
We can think of sampling as grabbing data instances from a larger data distribution.

26
00:02:03,890 --> 00:02:07,190
And let's actually frame this in a more realistic case.

27
00:02:07,220 --> 00:02:13,240
Imagine we are testing delivery via air drones and we're tracking how off target the packages are.

28
00:02:13,250 --> 00:02:18,410
So we take these drones and they happen to, while they're flying in the air, drop a package onto the

29
00:02:18,410 --> 00:02:23,720
ground and you can actually look this up on YouTube and see videos of startups doing this exact task.

30
00:02:24,680 --> 00:02:28,060
So our company, however, is simply a logistics company.

31
00:02:28,070 --> 00:02:30,540
We don't actually want to manufacture the drones.

32
00:02:30,560 --> 00:02:33,260
Instead, what we're going to be doing is running some experiments.

33
00:02:33,260 --> 00:02:39,080
We want to test different brands of drones to understand their performance, more specifically, understand

34
00:02:39,080 --> 00:02:43,970
their accuracy in their ability to drop a product or box on target.

35
00:02:45,050 --> 00:02:50,390
So the purpose of testing the drones is so that when we go into production, we can have some idea of

36
00:02:50,390 --> 00:02:54,770
how accurate different brands of drones are in their delivery targets.

37
00:02:56,130 --> 00:03:02,190
So what we could do is we could begin to think of our testing of the drones we actually have delivered

38
00:03:02,190 --> 00:03:07,950
on hand as a sample of a larger future population of all drone deliveries.

39
00:03:07,950 --> 00:03:12,510
Because remember, realistically, we're not going to be able to test every single drone that we're

40
00:03:12,510 --> 00:03:14,780
ever going to buy off the assembly line.

41
00:03:14,790 --> 00:03:21,120
Instead, we are just working off a sample of drones that we hope is going to be representative of all

42
00:03:21,120 --> 00:03:23,460
the drones that come off the assembly line.

43
00:03:24,950 --> 00:03:29,840
So let's imagine that we're tracking the measurements of how far off the packages are and we end up

44
00:03:29,840 --> 00:03:31,430
creating a sample distribution.

45
00:03:31,430 --> 00:03:36,800
So what we end up doing here is we run a bunch of tests, take some drones, they have a product, and

46
00:03:36,800 --> 00:03:41,020
then they drop it onto the ground and we have a particular target that they need to meet.

47
00:03:41,030 --> 00:03:45,350
And what we're doing is we're going to measure the distance of how off they are from the target.

48
00:03:45,350 --> 00:03:47,510
This is made up data, but you get the idea.

49
00:03:47,510 --> 00:03:52,610
We're just measuring, hey, this particular drop was off by 100 centimeters, this next one was off

50
00:03:52,610 --> 00:03:55,100
by 110 centimeters, etc., etc.

51
00:03:55,100 --> 00:04:00,560
And for each of those, I'm going to draw a little point and I could then further organize that data

52
00:04:00,560 --> 00:04:01,730
into a histogram.

53
00:04:01,730 --> 00:04:06,440
And that essentially creates a data distribution from real life experimentation.

54
00:04:06,440 --> 00:04:12,950
So I'm essentially now just counting how many drops were 100 centimeters to 150 centimeters off target

55
00:04:12,950 --> 00:04:13,940
or something like that.

56
00:04:13,940 --> 00:04:19,459
Depending on how you define your bins, you can see it looks like I'm starting to build out a data distribution

57
00:04:19,459 --> 00:04:21,529
from my data of the drones.
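
A minimal sketch of that counting step, assuming made-up miss distances and a 50-centimeter bin width (both are illustrative choices, not values from the lecture):

```python
from collections import Counter

# Made-up miss distances in centimeters (illustrative values only).
miss_cm = [100, 110, 95, 130, 160, 105, 90, 145, 120, 98, 210, 102]

BIN_WIDTH = 50  # assumed bin width in centimeters


def bin_counts(values, width):
    """Count how many values fall into each [start, start + width) bin."""
    counts = Counter((int(v) // width) * width for v in values)
    return dict(sorted(counts.items()))


histogram = bin_counts(miss_cm, BIN_WIDTH)
for start, count in histogram.items():
    # One '#' per drop that landed in this distance range.
    print(f"{start:>3}-{start + BIN_WIDTH:<3} cm | {'#' * count}")
```

Changing BIN_WIDTH changes the shape of the text histogram, which is the "depending on how you define your bins" point.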

58
00:04:23,160 --> 00:04:29,820
Now, we can also see that in this particular example, if I were to randomly choose a single data point,

59
00:04:29,820 --> 00:04:32,160
it would likely be close to 100.

60
00:04:33,410 --> 00:04:38,960
So far, all these concepts that we've discussed, we've already learned about when talking about data

61
00:04:38,960 --> 00:04:39,980
distributions.

62
00:04:41,160 --> 00:04:46,920
But now I want to present two different scenarios and how the knowledge learned in this section can

63
00:04:46,920 --> 00:04:47,550
help us.

64
00:04:47,550 --> 00:04:51,990
Basically, giving you some motivation and framing of why sampling is so important.

65
00:04:53,500 --> 00:04:56,680
So let's imagine this first situation or scenario one.

66
00:04:57,010 --> 00:05:02,230
Imagine we had some partners across the globe conducting their own tests and they're creating their

67
00:05:02,230 --> 00:05:04,750
own data distribution for another drone.

68
00:05:04,750 --> 00:05:06,420
But now we have a problem.

69
00:05:06,430 --> 00:05:09,430
They forgot to write down the actual brand name.

70
00:05:09,430 --> 00:05:14,500
So they have all this data, but they're actually not sure what the brand of drone was.

71
00:05:15,430 --> 00:05:21,310
So using the idea of sampling from data distributions, would there actually be a way, statistically

72
00:05:21,310 --> 00:05:26,890
speaking, to see if the data sets were likely to come from the same population that is the same brand

73
00:05:26,890 --> 00:05:27,670
of drone?

74
00:05:28,830 --> 00:05:34,290
Well, later on, we're going to learn that we can actually begin to treat this situation as a test,

75
00:05:34,290 --> 00:05:40,050
which is going to allow us to test to see if the samples from both drones were likely to come from

76
00:05:40,050 --> 00:05:41,340
the same population.

77
00:05:41,340 --> 00:05:47,700
And in this particular case, that's going to be based on the data distribution of how far the drops

78
00:05:47,700 --> 00:05:50,430
landed from the target location.

79
00:05:50,460 --> 00:05:56,790
Do these drones actually show similar behavior that is going to indicate they come from the same brand

80
00:05:56,790 --> 00:05:57,540
of drones?

81
00:05:57,540 --> 00:05:59,080
And to be really specific.

82
00:05:59,100 --> 00:06:03,030
This is going to be a two-sample t-test, but we'll learn more about that later.

83
00:06:04,450 --> 00:06:10,060
So the way this would really work is we would compare samples from the data distributions created from

84
00:06:10,060 --> 00:06:15,820
both drones and use what's known as a p-value to see the likelihood of being from the same brand.

85
00:06:15,820 --> 00:06:20,800
So we'd take the data distributions from the real life experiments and we can actually begin to compare

86
00:06:20,800 --> 00:06:21,220
them.
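
As a rough sketch of that comparison, the snippet below computes the test statistic for two samples using only the Python standard library. It uses Welch's variant of the two-sample t-test (which does not assume equal variances), and the miss distances are simulated, hypothetical data. Turning the statistic into a p-value needs the Student's t-distribution, which this sketch leaves out.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical miss distances (cm): our drone and the partners' unlabeled drone.
ours = [random.gauss(100, 15) for _ in range(40)]
partners = [random.gauss(100, 15) for _ in range(40)]


def welch_t(a, b):
    """Return Welch's t statistic for two independent samples."""
    se2_a = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    se2_b = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)


t_stat = welch_t(ours, partners)
print(f"t statistic: {t_stat:.3f}")
```

A t statistic near zero means the two sample means are close relative to their noise. A large absolute value suggests the samples may come from different populations.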

87
00:06:22,410 --> 00:06:26,310
And as I mentioned, later on in this section we're going to dive into the mathematics of how to conduct

88
00:06:26,310 --> 00:06:27,390
such a test.

89
00:06:27,390 --> 00:06:33,600
But I want you to also consider another situation or scenario that sampling and tests can help out with.

90
00:06:33,630 --> 00:06:40,740
This particular situation was a situation where you can use sampling and experimentation to see, Hey,

91
00:06:40,770 --> 00:06:45,510
do these two data sets come from the same population or what's the likelihood they come from the same

92
00:06:45,510 --> 00:06:46,380
population?

93
00:06:47,110 --> 00:06:53,260
Now in scenario two, imagine we're in a situation where we've tested the original drones, but now

94
00:06:53,260 --> 00:06:57,820
I actually want to see if I can make modifications to them to improve their performance.

95
00:06:57,820 --> 00:07:02,170
So it's technically still the same drone, but now I've affected it in some way.

96
00:07:02,200 --> 00:07:09,370
For example, I add a really large propeller to the top of the drone, so technically the same drone,

97
00:07:09,370 --> 00:07:14,440
but now a modification or something's been changed to it to have some sort of effect.

98
00:07:15,660 --> 00:07:20,940
So we are now wanting to know if there are differences within the same population that is the same drone

99
00:07:20,940 --> 00:07:23,190
brand after a change.

100
00:07:24,460 --> 00:07:31,780
So this case is known as a paired t-test, and this is essentially to explore the effects of the change

101
00:07:31,780 --> 00:07:33,400
on the same population.

102
00:07:33,400 --> 00:07:39,190
So we would run the experiments before and after the change and then perform what's known as a paired

103
00:07:39,190 --> 00:07:44,440
t-test, which is going to statistically tell us if we actually had an effect.
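
A small sketch of the before-and-after idea, assuming the same five drones are measured both times. The numbers are hypothetical. The paired statistic is computed from the per-drone differences, and the p-value step is again left out.

```python
import math
import statistics

# Hypothetical miss distances (cm) for the same five drones, before and after the change.
before = [102.0, 98.0, 110.0, 95.0, 105.0]
after = [97.0, 96.0, 101.0, 94.0, 99.0]


def paired_t(x, y):
    """Return the paired t statistic for matched measurements x and y."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]  # one difference per drone
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


t_stat = paired_t(before, after)
print(f"paired t statistic: {t_stat:.3f} with {len(before) - 1} degrees of freedom")
```

Working on the differences is what makes the test "paired": each drone acts as its own baseline.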

104
00:07:46,210 --> 00:07:51,520
Understanding how to use sampling in conjunction with t-tests gives us powerful tools to let us know

105
00:07:51,520 --> 00:07:56,500
whether or not experiments worked and the ability to compare samples from data distributions, to compare

106
00:07:56,500 --> 00:08:02,050
them against our assumptions like belonging to the same population, which in our examples was belonging

107
00:08:02,050 --> 00:08:04,180
to the same actual brand of drone.

108
00:08:04,300 --> 00:08:10,120
And remember, a lot of times with statistics and probability and experiments, a lot of these tests

109
00:08:10,120 --> 00:08:18,040
only give you answers in likelihood, such as it's 99% likely that the two drones came from the same

110
00:08:18,040 --> 00:08:18,650
brand.

111
00:08:18,670 --> 00:08:22,120
You're never really going to get a 100% sure answer.

112
00:08:24,050 --> 00:08:29,240
Now in this section, we're also going to be discussing what's known as the Central Limit Theorem, which

113
00:08:29,240 --> 00:08:33,530
will reveal a fundamental property of data distributions and sampling from them.

114
00:08:34,070 --> 00:08:38,659
Let's get a quick high level overview of the central limit theorem and why it's so important.

115
00:08:40,340 --> 00:08:41,690
Formally defined.

116
00:08:41,690 --> 00:08:48,770
The central limit theorem establishes that in many situations, when independent random variables are

117
00:08:48,770 --> 00:08:55,670
summed up, their properly normalized sum tends toward a normal distribution, even if the original

118
00:08:55,670 --> 00:08:58,560
variables themselves are not normally distributed.
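
One common way to write that statement down is the classical version, which assumes the variables X_1, ..., X_n are independent and identically distributed with mean \mu and finite variance \sigma^2. That assumption is narrower than the "many situations" in the definition above, and the symbols are introduced here only for illustration.

```latex
\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}}
\;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty
```

Subtracting n\mu and dividing by \sigma\sqrt{n} is the "properly normalized" part. Dividing the numerator and denominator by n gives the same statement in terms of the sample mean.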

119
00:08:58,580 --> 00:09:03,050
So that's quite a lot to take in and it's a lot of jargon in there.

120
00:09:03,050 --> 00:09:08,660
So I want you to get a clear understanding by actually showing the central limit theorem in practice.

121
00:09:09,810 --> 00:09:14,740
So let's imagine that we have a data distribution that displays a bimodal behavior.

122
00:09:14,760 --> 00:09:21,210
Bimodal, with bi meaning two and modal for modalities, is kind of a fancy way of saying

123
00:09:21,210 --> 00:09:24,000
there's two humps on this particular data distribution.

124
00:09:24,030 --> 00:09:28,770
You'll notice that one of them is centered around zero and one of them is centered around one.

125
00:09:29,950 --> 00:09:36,670
Now, let's imagine that I were to take a random sample from this bimodal data distribution, and I take

126
00:09:36,670 --> 00:09:37,680
that random sample.

127
00:09:37,690 --> 00:09:40,660
So in this particular case, I grab a couple of points.

128
00:09:40,690 --> 00:09:47,290
Obviously, probabilistically speaking, I'm more likely to grab points towards zero or one than those

129
00:09:47,290 --> 00:09:50,170
in the middle where there's a low probability of picking them.

130
00:09:51,520 --> 00:09:57,970
So after I pick that random sample of points, I take the mean or average of that particular random

131
00:09:57,970 --> 00:09:58,620
sample.

132
00:09:58,630 --> 00:10:01,660
So in this case, let's imagine it's 0.43.

133
00:10:02,630 --> 00:10:09,830
Then what I'm going to do is I take that mean value and I plot it onto its own plot and you can kind

134
00:10:09,830 --> 00:10:11,840
of ignore the height of the line here.

135
00:10:12,320 --> 00:10:15,690
What is really important is the x axis on that right hand plot.

136
00:10:15,710 --> 00:10:19,710
So again, what we're doing here is I start off with this bimodal data distribution.

137
00:10:19,730 --> 00:10:22,030
I take a random sample of points from it.

138
00:10:22,040 --> 00:10:27,860
I take the average value of that sample and then I take that average value and I plot it onto its own

139
00:10:27,860 --> 00:10:28,370
plot.

140
00:10:28,370 --> 00:10:32,690
So all that ends up on the right-hand plot is the value 0.43.

141
00:10:33,880 --> 00:10:38,830
Then what I'm going to do is I'm going to keep repeating this process of taking that random sample,

142
00:10:38,860 --> 00:10:42,610
calculating the mean of that sample, and then plotting it as a point.

143
00:10:43,630 --> 00:10:48,090
So if I do this again, let's say this time I get 0.51.

144
00:10:48,100 --> 00:10:52,720
And so I actually ran a simulation of this using the Python programming language.

145
00:10:53,450 --> 00:10:59,030
And if we do this many more times, you will notice a specific behavior is going to arise.

146
00:10:59,030 --> 00:11:01,010
And this is actually the central limit theorem.

147
00:11:01,900 --> 00:11:08,740
So the central limit theorem states that the means of these samples are going to be normally distributed,

148
00:11:08,800 --> 00:11:14,710
and it may be a little unclear using what is known as a rug plot, which is essentially just lines indicating

149
00:11:14,710 --> 00:11:16,300
a specific point value.

150
00:11:16,300 --> 00:11:18,150
That is the mean value of the samples.

151
00:11:18,160 --> 00:11:24,580
So what I could do is create a histogram from the means of the samples rather than just doing this dashed

152
00:11:24,580 --> 00:11:26,080
rug plot line plot.

153
00:11:26,910 --> 00:11:31,500
So here I now have a histogram of the mean of those samples.

154
00:11:31,500 --> 00:11:36,540
So again, to repeat myself, I took a sample from the data distribution on the left.

155
00:11:36,540 --> 00:11:45,150
Then I calculated the mean value like 0.43 or 0.51, etc. Then I ended up creating a histogram of doing

156
00:11:45,150 --> 00:11:46,800
that many, many times over.

157
00:11:46,800 --> 00:11:52,320
And what the central limit theorem says is that histogram is going to be normally distributed.
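
The repeat-and-plot process can be sketched in a few lines of Python. This is not the simulation from the lecture, only a minimal version under assumed settings: the bimodal population is modeled as an even mix of two normal bumps at 0 and 1, and the spread, sample size, and number of repetitions are all illustrative choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable


def draw_bimodal():
    """Draw one value from an assumed mix of two bumps centered at 0 and 1."""
    center = random.choice([0.0, 1.0])
    return random.gauss(center, 0.15)  # 0.15 is an assumed spread


def sample_mean(sample_size):
    """Take one random sample and return its mean."""
    return statistics.mean(draw_bimodal() for _ in range(sample_size))


SAMPLE_SIZE = 30     # points per random sample (assumed)
REPETITIONS = 2000   # how many times the process is repeated (assumed)
means = [sample_mean(SAMPLE_SIZE) for _ in range(REPETITIONS)]

# Text histogram of the sample means, using bins of width 0.05.
counts = {}
for m in means:
    bin_start = round(m // 0.05 * 0.05, 2)
    counts[bin_start] = counts.get(bin_start, 0) + 1
for bin_start in sorted(counts):
    print(f"{bin_start:5.2f} | {'#' * (counts[bin_start] // 10)}")

print(f"mean of sample means: {statistics.mean(means):.3f}")
```

Even though individual draws almost never land near 0.5, the printed histogram of sample means should show a single bell-shaped hump.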

158
00:11:53,150 --> 00:11:54,530
Now, here's the cool part.

159
00:11:54,650 --> 00:12:01,460
The mean of the samples will be normally distributed around the mean of the data distribution, and

160
00:12:01,460 --> 00:12:05,410
that's actually why I chose this very specific bimodal data distribution.

161
00:12:05,420 --> 00:12:11,420
You'll notice that if one modality is centered around zero and the other modality around one,

162
00:12:11,420 --> 00:12:15,650
then the average of those values is going to be 0.5 right in the middle.

163
00:12:15,650 --> 00:12:18,350
So that's zero plus one divided by two.

164
00:12:18,440 --> 00:12:25,130
And if you keep doing these samplings and calculating of the means, you're going to get a normal distribution

165
00:12:25,130 --> 00:12:32,090
of the mean of those samples centered around the mean of the data distribution, which is why upon simulating

166
00:12:32,090 --> 00:12:37,340
this, you actually get a normal distribution that appears to be centered around 0.5.

167
00:12:38,850 --> 00:12:44,670
So what is absolutely crucial about the central limit theorem or CLT, as you'll often see it shortened

168
00:12:44,670 --> 00:12:49,680
to, is that this behavior is true for almost any data distribution.

169
00:12:49,710 --> 00:12:55,200
One example exception is the Cauchy distribution, but that's a very unique distribution that you probably

170
00:12:55,200 --> 00:12:57,540
won't find yourself using in real life.
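
To see why the Cauchy distribution is an exception, here is a hedged sketch comparing how spread out the sample means are for a standard normal and a standard Cauchy population. The sample sizes, repetition count, and the use of the interquartile range as the spread measure are assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable


def draw_cauchy():
    """Draw from a standard Cauchy distribution via the inverse CDF."""
    return math.tan(math.pi * (random.random() - 0.5))


def iqr_of_sample_means(draw, sample_size, repetitions=1000):
    """Interquartile range of the means of many random samples."""
    means = [
        sum(draw() for _ in range(sample_size)) / sample_size
        for _ in range(repetitions)
    ]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


for n in (10, 100):
    normal_iqr = iqr_of_sample_means(lambda: random.gauss(0, 1), n)
    cauchy_iqr = iqr_of_sample_means(draw_cauchy, n)
    print(f"n={n:>3}  normal IQR={normal_iqr:.3f}  Cauchy IQR={cauchy_iqr:.3f}")
```

For the normal population the spread of the sample means shrinks as the sample size grows. For the Cauchy population it stays roughly the same, so the means never settle into a narrow normal curve.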

171
00:12:59,100 --> 00:13:05,220
So while this is a very interesting property of data distributions and sampling them, the next obvious

172
00:13:05,220 --> 00:13:11,340
question is besides being a neat mathematical trick, why would this property actually be useful in

173
00:13:11,340 --> 00:13:12,270
the real world?

174
00:13:13,350 --> 00:13:20,220
Well, in the real world, you often don't know the true distribution that your data comes from.

175
00:13:20,220 --> 00:13:23,850
So think about this in terms of the entire population.

176
00:13:23,850 --> 00:13:27,890
You often don't really know the true distribution of the entire population.

177
00:13:27,900 --> 00:13:33,630
Maybe it's bimodal, maybe it's exponential, maybe it's normal, maybe it has various modalities

178
00:13:33,630 --> 00:13:36,000
like any of these examples that I'm showing here.

179
00:13:36,120 --> 00:13:41,520
Well, with the central limit theorem, you can just say, well, I don't really care about that true

180
00:13:41,520 --> 00:13:48,150
population distribution, because I know if I keep sampling from this data set, calculating the mean

181
00:13:48,150 --> 00:13:54,870
and then plotting that, then I end up getting that normal distribution around the mean of the original

182
00:13:54,870 --> 00:13:56,070
data distribution.

183
00:13:56,070 --> 00:14:01,590
So the central limit theorem allows us to disregard the original data distribution since we know the

184
00:14:01,590 --> 00:14:07,530
sample means will be normally distributed and we already know the normal distribution has a lot of unique

185
00:14:07,530 --> 00:14:10,530
and useful properties for calculating statistics.

186
00:14:11,880 --> 00:14:17,760
So with normally distributed means, I can then actually perform t-tests on the samples, which

187
00:14:17,760 --> 00:14:19,140
is super useful.

188
00:14:20,490 --> 00:14:27,120
So let's continue to learn more about the power of sampling and how we can use it to conduct tests.

189
00:14:27,150 --> 00:14:28,680
We'll see you in the next lecture.

