1
00:00:05,820 --> 00:00:10,260
Welcome back, everyone, to this section of the course on hypothesis testing.

2
00:00:11,320 --> 00:00:17,920
One of the most crucial capabilities an organization needs is the ability to test a theory or hypothesis.

3
00:00:18,070 --> 00:00:24,100
Companies often want to test potential effects before rolling out new features or services to all their

4
00:00:24,100 --> 00:00:24,880
users.

5
00:00:25,900 --> 00:00:31,480
In this section, we're going to be talking about topics related to hypothesis testing like significance

6
00:00:31,480 --> 00:00:37,560
level and Type I versus Type II errors, one-tailed tests and two-tailed tests, the p-value, and A/B

7
00:00:37,600 --> 00:00:38,500
testing.

8
00:00:39,780 --> 00:00:45,900
Understanding hypothesis testing provides a well-studied statistical process and foundation for testing

9
00:00:45,900 --> 00:00:49,860
new features or analyzing results as a company or organization.

10
00:00:50,490 --> 00:00:55,500
Let me give you a few examples of the types of things companies can test with hypothesis testing.

11
00:00:56,490 --> 00:01:00,970
So you can test things like the effectiveness of vaccines or medications.

12
00:01:00,990 --> 00:01:07,080
You can also do A/B testing for changes on a website, such as testing whether or not a new button color

13
00:01:07,080 --> 00:01:09,090
actually improves conversions.

14
00:01:09,120 --> 00:01:15,120
You can also do things like testing changes to schedules for efficiency, such as: are workers just

15
00:01:15,120 --> 00:01:19,290
working four days a week more productive than workers working five days a week?

16
00:01:19,320 --> 00:01:24,420
You can also do things like test the effects of different agricultural procedures on crops, like whether

17
00:01:24,420 --> 00:01:25,890
or not to use fertilizer.

18
00:01:27,210 --> 00:01:30,570
In general, a hypothesis can be stated as something in this form:

19
00:01:30,660 --> 00:01:33,360
If we do blank, then blank will happen.

20
00:01:33,510 --> 00:01:39,270
And let's get actually a little more specific and frame this in statistical terms when we're thinking

21
00:01:39,270 --> 00:01:42,310
of hypothesis testing and what a hypothesis is.

22
00:01:42,330 --> 00:01:48,660
We're typically thinking of something in the terms of: if we do blank to an independent variable, then

23
00:01:48,660 --> 00:01:51,570
blank will happen to a dependent variable.

24
00:01:52,850 --> 00:01:58,520
So let's actually match up this sort of general statistical definition to some of the example cases

25
00:01:58,520 --> 00:01:59,390
we've mentioned.

26
00:02:00,190 --> 00:02:05,800
So, for example, I could say if we give patients a medication, then their white blood cell count

27
00:02:05,800 --> 00:02:06,670
will increase.

28
00:02:06,670 --> 00:02:10,030
And that could be an example of a hypothesis that I can test.

29
00:02:11,180 --> 00:02:17,060
Also note that I could technically model this as a before versus after medication test so I could take

30
00:02:17,060 --> 00:02:22,490
the same group of people, count their white blood cells, then give them the medication and count their white

31
00:02:22,490 --> 00:02:24,440
blood cells post-medication.

32
00:02:24,440 --> 00:02:30,170
Or I could split people into two groups, a control group that doesn't take the medication versus a

33
00:02:30,170 --> 00:02:32,030
group that does take the medication.

34
00:02:33,830 --> 00:02:39,290
So let's do another example of, again, if we do blank to an independent variable, then blank will happen to

35
00:02:39,290 --> 00:02:40,460
a dependent variable.

36
00:02:41,430 --> 00:02:47,130
So we can say something like, If I add a fertilizer to the soil, then crop yields will increase.

37
00:02:47,960 --> 00:02:54,020
So again, this test would require a fair comparison between a non-fertilized crop and the same

38
00:02:54,020 --> 00:02:54,980
fertilized crop.

39
00:02:54,980 --> 00:03:01,400
So I should try to make sure that all the other independent variables are as similar as possible between

40
00:03:01,400 --> 00:03:02,090
the groups.

41
00:03:02,090 --> 00:03:04,790
So an obvious one would be the crop itself.

42
00:03:04,790 --> 00:03:10,370
I probably want to compare non-fertilized corn versus the same corn, just fertilized.

43
00:03:10,400 --> 00:03:15,950
It's not really helpful to do something like non-fertilized oranges versus fertilized cucumbers.

44
00:03:15,950 --> 00:03:22,010
We want to try to keep all the other independent variables the same and we'll discuss in a second about

45
00:03:22,010 --> 00:03:24,050
testing multiple independent variables.

46
00:03:24,170 --> 00:03:30,290
So for example, if we think of another hypothesis, I could say something like if we change the purchase

47
00:03:30,290 --> 00:03:36,530
button to a larger font and brighter color, then more customers will complete a purchase.

48
00:03:36,530 --> 00:03:40,760
Now you should notice this hypothesis is a little different than the ones we previously saw.

49
00:03:40,970 --> 00:03:47,000
This is basically an A/B test of a website, with A being the original button and B being the new button

50
00:03:47,000 --> 00:03:49,100
with a larger font and brighter color.

51
00:03:49,930 --> 00:03:53,920
However, this particular example has an important factor.

52
00:03:54,070 --> 00:03:59,230
You may have noticed that we're actually changing two independent variables at the same time.

53
00:03:59,230 --> 00:04:03,040
I'm changing both larger font and brighter color.

54
00:04:04,150 --> 00:04:09,700
So while it's certainly possible to conduct a hypothesis test with multiple independent variable changes

55
00:04:09,700 --> 00:04:15,160
against a dependent variable, we can see that this could make it unclear what independent variable

56
00:04:15,160 --> 00:04:18,130
is actually causing the effect or change in behaviour.

57
00:04:18,130 --> 00:04:23,860
And statistically speaking, while there are techniques to actually conduct changes across multiple

58
00:04:23,860 --> 00:04:29,320
independent variables, the cost, so to speak, is that you usually have to gather a lot more data.

59
00:04:29,320 --> 00:04:35,230
And we've previously mentioned that companies with a lot of users like Facebook or Google typically

60
00:04:35,230 --> 00:04:40,150
have enough users that testing multiple independent variable changes is actually not a big deal for

61
00:04:40,150 --> 00:04:45,580
them because they can easily test small percentages of their users but still have really large data

62
00:04:45,580 --> 00:04:49,120
sets like a million users out of 2 billion customers.

63
00:04:51,030 --> 00:04:56,010
So again, to frame that last hypothesis, you're going to be asking yourself questions like was it

64
00:04:56,010 --> 00:05:01,200
the color of the button that caused the effect or the font size or both changes put together.

65
00:05:01,500 --> 00:05:06,270
And as I mentioned, there are many established statistical methods to test changes across multiple independent

66
00:05:06,270 --> 00:05:06,950
variables.

67
00:05:06,960 --> 00:05:12,750
But for right now, I want to focus this section on just testing singular changes to a single independent

68
00:05:12,750 --> 00:05:13,410
variable.

69
00:05:15,090 --> 00:05:20,940
Like many of the other topics we've covered, hypothesis testing can seem intimidating due to terminology,

70
00:05:20,940 --> 00:05:26,910
and throughout this entire course, we've been trying to simplify a lot of the terminology to give you

71
00:05:26,910 --> 00:05:30,450
an intuition of what's actually going on behind the scenes.

72
00:05:30,450 --> 00:05:36,930
So you may hear terminology like null hypothesis and some esoteric phrasing like fail to reject the

73
00:05:36,930 --> 00:05:38,100
null hypothesis.

74
00:05:38,250 --> 00:05:43,320
Let's guide you through a simple example so you get more comfortable with terms like fail to reject

75
00:05:43,320 --> 00:05:44,670
and null hypothesis.

76
00:05:46,830 --> 00:05:49,920
So we're going to start off with a test situation or scenario.

77
00:05:49,950 --> 00:05:54,780
Imagine that we're in charge of a large e-commerce company and we run a website.

78
00:05:54,960 --> 00:06:00,690
And what we're going to do is we're going to change the size of the font to a larger size across the

79
00:06:00,690 --> 00:06:01,470
entire website.

80
00:06:01,470 --> 00:06:04,440
So all the font you see on the website is going to be slightly larger.

81
00:06:04,440 --> 00:06:08,520
And we have a hypothesis that if we're going to change the font size, then the customers are going

82
00:06:08,520 --> 00:06:09,480
to spend more money.

83
00:06:10,840 --> 00:06:16,450
Since our website is online, we can show some customers the original font size and measure their spend

84
00:06:16,450 --> 00:06:22,090
and simultaneously show another segment of customers the larger font size and measure their spend.

85
00:06:23,650 --> 00:06:25,750
When conducting tests like this one,

86
00:06:25,750 --> 00:06:29,740
we always want to try to make sure that the test audiences are similar to each other.

87
00:06:29,740 --> 00:06:35,200
So I want to make sure my control audience that's going to see the original font size is roughly similar

88
00:06:35,200 --> 00:06:40,960
to the test audience that is seeing the larger font size in order to prevent possible outside factors

89
00:06:40,960 --> 00:06:46,540
from contributing to any change in the dependent variable, which in this case is spend.
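One common way to keep the two audiences roughly similar is to assign customers to them at random. Here is a minimal sketch in Python, assuming we have a list of customer IDs; the function name, the 50/50 split, and the fixed seed are illustrative assumptions, not something the lecture prescribes.

```python
import random

def split_into_groups(customer_ids, seed=0):
    """Randomly split customers into a control half and a test half."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(customer_ids)  # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    control = shuffled[:midpoint]  # will see the original font size
    test = shuffled[midpoint:]     # will see the larger font size
    return control, test

# Hypothetical customer IDs 1 through 10.
control_group, test_group = split_into_groups(range(1, 11))
print("control:", sorted(control_group))
print("test:", sorted(test_group))
```

Because the assignment is random, outside factors like age or location should, on average, be spread evenly across both groups rather than piling up in one of them.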

90
00:06:47,940 --> 00:06:52,920
So now that we understand the scenario, let's frame this so that we slowly build up to the definition

91
00:06:52,920 --> 00:06:54,410
of a null hypothesis.

92
00:06:54,420 --> 00:07:00,420
So I'm going to describe a situation where kind of a silly but very specific hypothesis is posited,

93
00:07:00,420 --> 00:07:05,310
and then we're going to use that to build our understanding of rejecting and failing to reject a null

94
00:07:05,310 --> 00:07:06,270
hypothesis.

95
00:07:07,410 --> 00:07:13,190
So let's imagine that I'm going to build up to a kind of hyper specific hypothesis.

96
00:07:13,200 --> 00:07:18,090
I'm going to say customers viewing the larger font size will spend 100 more dollars on the website.

97
00:07:18,090 --> 00:07:20,360
And how did I come up with this hypothesis?

98
00:07:20,370 --> 00:07:25,680
Maybe I ran a really small preliminary test comparing a test group to a control group.

99
00:07:25,800 --> 00:07:31,440
And remember, the test group is going to see that larger font size. So I run this small preliminary test.

100
00:07:32,220 --> 00:07:33,330
And I have a test group.

101
00:07:33,330 --> 00:07:37,200
There's three customers in it, and then a control group, another three customers.

102
00:07:37,200 --> 00:07:38,850
And you probably need more than these customers.

103
00:07:38,850 --> 00:07:40,630
But we're keeping things simple for now.

104
00:07:40,650 --> 00:07:46,230
I take the average values of their spend and I see that on average, the test group viewing the larger

105
00:07:46,230 --> 00:07:50,220
font size is going to spend 100 more dollars versus the control group.

106
00:07:50,220 --> 00:07:55,680
So I come up from this preliminary testing with this very specific hypothesis that customers viewing

107
00:07:55,680 --> 00:08:00,810
the larger font size will spend 100 more dollars on the website, and we'll see that this causes issues

108
00:08:00,810 --> 00:08:01,860
further down the line.

109
00:08:01,860 --> 00:08:07,860
But let's just get this idea of reject versus fail to reject, and then we'll build up to a null hypothesis,

110
00:08:07,860 --> 00:08:10,590
which is a lot broader and a lot easier to use.

111
00:08:11,680 --> 00:08:17,710
So again, there's 100 more dollars spent on average by test versus control.
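As a rough illustration of that preliminary test, here is a small Python sketch that averages the spend of three customers per group and takes the difference. The spend values are made up for illustration and chosen so the gap comes out to $100; they are not data from the lecture.

```python
from statistics import mean

# Hypothetical spend values in dollars, made up for illustration only.
control_spend = [40.0, 55.0, 70.0]   # three customers, original font size
test_spend = [150.0, 140.0, 175.0]   # three customers, larger font size

def mean_difference(test, control):
    """Average spend of the test group minus average spend of the control group."""
    return mean(test) - mean(control)

difference = mean_difference(test_spend, control_spend)
print(f"Test group spent ${difference:.2f} more on average")
```

With only three customers per group, one unusually large order could swing this number a lot, which is part of why a hypothesis tied to an exact dollar figure is fragile.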

112
00:08:19,070 --> 00:08:24,200
So now that I've done the preliminary test, I start rolling out this experimentation to the real world.

113
00:08:24,200 --> 00:08:29,300
So I test this on new customers on the website after my little preliminary test that helped me define

114
00:08:29,300 --> 00:08:30,380
my hypothesis.

115
00:08:31,320 --> 00:08:32,970
But then I have issues.

116
00:08:32,970 --> 00:08:39,210
I notice that when I actually roll this out to the real world, my test group is spending less than

117
00:08:39,210 --> 00:08:40,230
the control group.

118
00:08:40,230 --> 00:08:46,230
And now I see in pretty much every situation that the test group that is viewing the larger font size

119
00:08:46,230 --> 00:08:49,470
ends up spending less than the control group.

120
00:08:50,520 --> 00:08:55,560
So what happens if the rest of the experimental tests end up showing the opposite effect?

121
00:08:55,950 --> 00:09:03,810
Well, in that case, I'm going to decide to reject that hypothesis, since none of my other experiments

122
00:09:03,810 --> 00:09:05,940
actually ended up supporting this.

123
00:09:05,940 --> 00:09:11,790
So I'm going to reject the hypothesis that customers viewing the larger font size will spend 100 more

124
00:09:11,790 --> 00:09:13,230
dollars on the website.

125
00:09:15,180 --> 00:09:18,140
Now let's imagine that we rewind the time.

126
00:09:18,150 --> 00:09:22,740
So now again, I'm going to test the new font size on the web page for a variety of groups.

127
00:09:22,740 --> 00:09:28,890
And this time, if we rewind time and pretend that we were to redo this experiment, what happens if

128
00:09:28,890 --> 00:09:30,450
I get the same trend?

129
00:09:30,450 --> 00:09:33,120
But it's not exactly $100 difference?

130
00:09:33,690 --> 00:09:40,200
So I test this out in a variety of experiments, and the test group does in fact spend more than the

131
00:09:40,200 --> 00:09:41,310
control group.

132
00:09:41,310 --> 00:09:46,800
But you'll notice if you pay close attention to these actual experimental examples, it's not exactly

133
00:09:46,800 --> 00:09:48,090
a $100 difference.

134
00:09:48,090 --> 00:09:50,460
So in one experiment it was 103.

135
00:09:50,460 --> 00:09:54,570
In other experiments it was 160, 75, and 72, and so on.

136
00:09:54,750 --> 00:09:58,050
So here we have a bit of a conundrum.

137
00:09:58,380 --> 00:10:07,020
None of the experiments are exactly reflecting my hypothesis, so I cannot say that my hypothesis is

138
00:10:07,020 --> 00:10:11,130
exactly 100% correct because remember, my hypothesis is actually really specific.

139
00:10:11,130 --> 00:10:15,060
It says that the customers are going to spend 100 more dollars on the website.

140
00:10:15,060 --> 00:10:20,550
And technically speaking, while the general trend was the same through the rest of my experimentation,

141
00:10:20,550 --> 00:10:25,170
none of the actual experiments showed the test group spending exactly $100 more.

142
00:10:26,470 --> 00:10:32,200
So the best that I can conclude here is I'm going to fail to reject the hypothesis.

143
00:10:32,230 --> 00:10:39,760
Notice how that wording is different than saying that the hypothesis is 100% true, or that I accept

144
00:10:39,760 --> 00:10:41,680
the hypothesis as a truth.

145
00:10:41,710 --> 00:10:46,000
Instead, I'm just failing to reject the hypothesis.

146
00:10:47,640 --> 00:10:52,770
Since the amounts weren't exactly $100, I can't say the hypothesis is accepted.

147
00:10:52,770 --> 00:10:58,620
So instead I use that very specific terminology of failing to reject the hypothesis.

148
00:10:59,680 --> 00:11:06,460
So again, the difference is you reject the hypothesis or you fail to reject the hypothesis.

149
00:11:08,410 --> 00:11:14,830
Now, I've already mentioned a couple of times that this idea of exactly $100 more is kind of troublesome.

150
00:11:14,830 --> 00:11:17,800
So is there a better way to actually define the hypothesis?

151
00:11:17,800 --> 00:11:18,490
Perhaps.

152
00:11:19,560 --> 00:11:23,550
Well, so far we've realized I can create the hypothesis and I can reject it.

153
00:11:23,550 --> 00:11:27,170
If the testing data and the experiments don't support the hypothesis.

154
00:11:27,180 --> 00:11:31,800
This was the case when testing showed that the font size change actually led to less spend than the control

155
00:11:31,800 --> 00:11:32,430
groups.

156
00:11:33,290 --> 00:11:38,300
We've also seen the case where the experiments have the same trend, but not the same exact amount of

157
00:11:38,300 --> 00:11:40,020
100 more dollars in spend.

158
00:11:40,040 --> 00:11:44,020
I technically can't say that the hypothesis is absolutely correct.

159
00:11:44,030 --> 00:11:49,880
So instead I frame my phrasing as I fail to reject the hypothesis.

160
00:11:51,490 --> 00:11:53,620
So recall the hypothesis itself.

161
00:11:53,830 --> 00:11:57,700
Customers viewing the larger font size will spend 100 more dollars on the website.

162
00:11:59,470 --> 00:12:04,720
Technically speaking, our goal is not really to show that the change is exactly $100.

163
00:12:04,720 --> 00:12:10,480
But if I think about this from a broader viewpoint of the company or organization, what I'm really

164
00:12:10,480 --> 00:12:15,610
trying to prove is that changing the font size has some effect on spend.

165
00:12:17,600 --> 00:12:22,970
Now, since we've established that we operate in a framework where we either reject or fail to reject

166
00:12:22,970 --> 00:12:29,420
a hypothesis, in this framework, it probably makes more sense to use what's known as a null hypothesis,

167
00:12:29,570 --> 00:12:33,740
where we posit simply that there is no difference between the two groups.

168
00:12:33,740 --> 00:12:35,450
That is the test and the control.

169
00:12:35,570 --> 00:12:40,210
So that is to say there is no effect, or null effect.

170
00:12:40,220 --> 00:12:45,560
That's why it's called the null hypothesis and you'll realize it becomes a lot easier, statistically

171
00:12:45,560 --> 00:12:50,450
speaking, to just frame this null hypothesis regardless of what you're actually testing.

172
00:12:52,060 --> 00:12:58,850
A null hypothesis allows us to not need to worry about an exact quantitative value of change or effect.

173
00:12:58,870 --> 00:13:04,660
So I don't really need to worry about saying it's going to be exactly $100 or exactly $101.

174
00:13:04,690 --> 00:13:10,780
Instead, I reframed the entire question, and now I'm really just asking, did this particular change

175
00:13:10,780 --> 00:13:12,090
have an effect?

176
00:13:12,100 --> 00:13:16,870
And notice how that matches a lot more to our generalized statement of hypothesis.

177
00:13:17,410 --> 00:13:22,600
So, for example, I could say larger font size has no effect on customer spend.

178
00:13:22,660 --> 00:13:27,880
That's the null hypothesis version of the hypothesis that we saw previously.

179
00:13:29,710 --> 00:13:35,620
Notice again how a null hypothesis no longer even requires us to do any preliminary tests because I

180
00:13:35,620 --> 00:13:38,140
don't need to acquire some range like $100.

181
00:13:38,140 --> 00:13:44,050
So this whole idea of trying to do a little preliminary test to figure out the expected range, I throw

182
00:13:44,050 --> 00:13:49,900
that out the window with the null hypothesis, because all I'm doing is I'm just framing it as if this

183
00:13:49,900 --> 00:13:51,780
change is going to have no effect.

184
00:13:51,790 --> 00:13:53,260
That is my hypothesis.

185
00:13:53,260 --> 00:13:59,320
And that null hypothesis works a lot better in the framework of reject or fail to reject.

186
00:13:59,900 --> 00:14:04,670
So now let's go through a simple example of using the null hypothesis framework.

187
00:14:04,790 --> 00:14:10,710
Remember, now my null hypothesis is just saying whatever this change is doesn't have an effect.

188
00:14:10,730 --> 00:14:16,790
It has null effect, which means I no longer need to worry about running a preliminary experiment to

189
00:14:16,790 --> 00:14:18,990
figure out something like a $100 value.

190
00:14:19,010 --> 00:14:23,810
Instead, now that I've established a null hypothesis, I just go and run the experiment.

191
00:14:24,290 --> 00:14:28,190
So let's imagine that we do see a difference between test and control.

192
00:14:28,220 --> 00:14:33,000
Then what I can do here is I can reject the null hypothesis.

193
00:14:33,020 --> 00:14:37,610
Now, in the case where test and control were actually really quite similar and it looked like there

194
00:14:37,610 --> 00:14:41,600
wasn't an effect, then I could say that I fail to reject the null hypothesis.

195
00:14:42,250 --> 00:14:48,820
Now, we haven't really discussed how much of an effect is needed to determine whether we reject or

196
00:14:48,820 --> 00:14:50,040
fail to reject.

197
00:14:50,050 --> 00:14:56,260
That's actually going to come later on in this section with a discussion of P values and statistical

198
00:14:56,260 --> 00:14:57,190
significance.

199
00:14:57,190 --> 00:15:04,000
So there is going to be some metric to understand whether you decide to reject or fail to reject, but

200
00:15:04,000 --> 00:15:05,640
we'll talk about that later on.

201
00:15:05,650 --> 00:15:10,780
Really, the purpose of this lecture was to get you familiar with the idea of a null hypothesis and

202
00:15:10,780 --> 00:15:13,960
the framework of rejecting or failing to reject.
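To give a feel for how the reject or fail-to-reject decision could be automated, here is a hedged Python sketch. It uses a permutation test as one possible metric, which is a stand-in chosen for illustration since this lecture has not yet defined how the decision is made. The function names, the made-up spend data, and the 0.05 cutoff are all assumptions.

```python
import random
from statistics import mean

def permutation_p_value(test, control, n_shuffles=10_000, seed=0):
    """Fraction of random relabelings whose gap in means is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(mean(test) - mean(control))
    pooled = list(test) + list(control)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                   # pretend group labels don't matter
        fake_test = pooled[:len(test)]
        fake_control = pooled[len(test):]
        if abs(mean(fake_test) - mean(fake_control)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

def decide(p_value, alpha=0.05):
    """Map a p-value onto the two allowed conclusions (alpha is an assumed cutoff)."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# Hypothetical spend data: one experiment with a clear gap, one with almost none.
clear_test = [120, 135, 150, 128, 142, 160, 138, 147]
clear_control = [60, 72, 55, 80, 66, 75, 58, 69]
similar_test = [60, 72, 55, 80]
similar_control = [61, 70, 57, 81]

clear_decision = decide(permutation_p_value(clear_test, clear_control))
similar_decision = decide(permutation_p_value(similar_test, similar_control))
print("clear gap:", clear_decision)
print("similar groups:", similar_decision)
```

Notice that the code only ever returns one of two conclusions, and neither of them is "accept the hypothesis", which mirrors the wording used throughout this lecture.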

203
00:15:15,600 --> 00:15:20,500
So alongside a null hypothesis like larger font size has no effect on customer spend,

204
00:15:20,520 --> 00:15:24,980
we also have an alternative hypothesis, which is essentially the opposite.

205
00:15:24,990 --> 00:15:29,610
It basically says that a larger font size does have an effect on customer spend.

206
00:15:29,880 --> 00:15:32,560
So you have the null hypothesis where there is no effect.

207
00:15:32,580 --> 00:15:34,920
The alternative is that there is an effect.

208
00:15:34,950 --> 00:15:41,130
Notice how neither the null hypothesis nor the alternative hypothesis is actually specific to the strength

209
00:15:41,130 --> 00:15:41,980
of the effect.

210
00:15:42,000 --> 00:15:45,000
It's really just there is no effect or there is an effect.
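Written in symbols, this pair of hypotheses could look like the following. The notation is an assumption for illustration: here the two means stand for the average spend of the test group and of the control group.

```latex
H_0:\ \mu_{\text{test}} - \mu_{\text{control}} = 0
\qquad
H_1:\ \mu_{\text{test}} - \mu_{\text{control}} \neq 0
```

Neither line mentions a specific dollar amount, only whether the difference is zero or not.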

211
00:15:46,390 --> 00:15:48,670
Now, to fully understand hypothesis testing

212
00:15:48,670 --> 00:15:54,220
beyond this simple example, we first need to learn how to evaluate the differences between

213
00:15:54,220 --> 00:15:59,920
results of groups to determine whether or not the effect is what's known as statistically significant.

214
00:16:00,610 --> 00:16:06,010
So let's move on to learn about how to use things like one-tailed and two-tailed tests to set up hypothesis

215
00:16:06,010 --> 00:16:11,950
testing for different situations and discover how to use p-values to quantitatively state the likelihood

216
00:16:11,950 --> 00:16:14,470
of an effect being statistically significant.

217
00:16:14,500 --> 00:16:16,330
We'll see you at the next lecture.