1
00:00:00,090 --> 00:00:00,840
In this video,

2
00:00:00,840 --> 00:00:06,660
all we want to do is relate the hypothesis testing process we've just talked through to the AB testing

3
00:00:06,660 --> 00:00:11,960
process because AB testing is something we very often do in the real world.

4
00:00:11,970 --> 00:00:19,350
For instance, one of the most classic examples of AB testing is the example where we test two web page

5
00:00:19,350 --> 00:00:20,760
versions against each other.

6
00:00:20,760 --> 00:00:26,640
In other words, we're trying to determine whether a new version of the web page converts better than

7
00:00:26,640 --> 00:00:27,780
the original version.

8
00:00:27,780 --> 00:00:31,830
Version A is what we're currently displaying on the site.

9
00:00:31,830 --> 00:00:34,590
If you visit the web page, you'll see that version now.

10
00:00:34,590 --> 00:00:40,050
And we want to know whether version B, this new version, this new variation that we've created will

11
00:00:40,050 --> 00:00:45,720
result in a more favorable outcome than the outcome that's currently being generated by version A.

12
00:00:45,750 --> 00:00:54,420
So to be more specific, maybe variation A of this web page has a buy now button that is blue, whereas

13
00:00:54,420 --> 00:01:00,840
in variation B we change the color of that buy now button from blue to green, and we want to know if

14
00:01:00,840 --> 00:01:04,069
changing the button color makes version B perform better.

15
00:01:04,080 --> 00:01:09,630
Of course, this is a really simple example, but we could also run tests where we change the color

16
00:01:09,630 --> 00:01:13,590
of any part of the page or we change the font or the font size.

17
00:01:13,590 --> 00:01:16,740
We could change what the text actually says on the website.

18
00:01:16,740 --> 00:01:22,080
Maybe in version A this button says Buy now and in version B, this button says, Get yours.

19
00:01:22,080 --> 00:01:25,740
We could change the information that's being displayed on any part of the page.

20
00:01:25,740 --> 00:01:28,470
We could add or remove whole sections of the web page.

21
00:01:28,470 --> 00:01:30,510
Really, anything that we could imagine.

22
00:01:30,510 --> 00:01:35,310
And of course, we don't have to limit ourselves to just a single web page.

23
00:01:35,310 --> 00:01:40,110
We can apply this idea of AB testing to an infinite number of real world applications.

24
00:01:40,110 --> 00:01:45,540
So to take something totally different, let's say maybe that we run a transcription company in which

25
00:01:45,540 --> 00:01:52,410
our employees transcribe videos that are sent in by our customers and maybe our company employs thousands

26
00:01:52,410 --> 00:01:58,290
and thousands of people and we want to know whether lowering the temperature in the office by one degree

27
00:01:58,290 --> 00:02:03,060
will increase productivity as measured by the number of transcriptions that our employees finish each

28
00:02:03,060 --> 00:02:03,510
day.

29
00:02:03,510 --> 00:02:07,200
But no matter what we're testing, it's common to only make

30
00:02:07,850 --> 00:02:13,320
one change at a time, because if we make multiple changes at once,

31
00:02:13,340 --> 00:02:18,040
of course we don't know which change had what effect on the results of our AB test.

32
00:02:18,050 --> 00:02:22,700
For instance, in this web page example, if we change the color of the button and the text of the

33
00:02:22,700 --> 00:02:29,030
button at the same time and we find a statistically significant result, we don't know if that statistical

34
00:02:29,030 --> 00:02:34,040
significance is the result of changing the color or changing the text or both,

35
00:02:34,040 --> 00:02:35,240
the combination of both.

36
00:02:35,240 --> 00:02:40,550
When we make just one change at a time, then we know that the change we made was the thing that actually

37
00:02:40,550 --> 00:02:41,210
had an effect.

38
00:02:41,210 --> 00:02:45,680
Or we know that that change didn't result in a statistically significant difference.

39
00:02:45,710 --> 00:02:51,290
Now we said that this AB testing process follows the same kind of hypothesis testing process that we've

40
00:02:51,290 --> 00:02:56,840
been talking about, which means that when we run an AB test, just like with hypothesis testing, we

41
00:02:56,840 --> 00:02:59,450
always start with a pair of hypothesis statements.

42
00:02:59,450 --> 00:03:06,740
So that's step one, and when we're running an AB test, the alternative hypothesis is that the new variation,

43
00:03:06,740 --> 00:03:13,550
this new variation B always leads to better results, whereas the null hypothesis is that the variation

44
00:03:13,550 --> 00:03:14,930
doesn't beat the status quo.

45
00:03:14,960 --> 00:03:21,050
Basically, whatever version we have now, version A, when we create this new version B and we test it,

46
00:03:21,080 --> 00:03:25,640
it doesn't do any better than version A, which means that they have comparable results.

47
00:03:25,640 --> 00:03:26,840
The results don't change.

48
00:03:26,840 --> 00:03:30,290
So this null hypothesis is that the results don't change.

49
00:03:30,290 --> 00:03:36,020
Variation B doesn't perform any better than variation A, whereas our alternative hypothesis says that

50
00:03:36,020 --> 00:03:41,270
yes, variation B does perform better than variation A, which means that they are not equal.

51
00:03:41,270 --> 00:03:47,660
Now realize here that we have used means, so the mean of variation A is not equal to the mean of variation

52
00:03:47,660 --> 00:03:53,660
B. For example, we'll use the mean when the result that we're trying to improve is something like maybe

53
00:03:53,660 --> 00:03:55,340
average revenue per user, right?

54
00:03:55,340 --> 00:04:01,460
If people are looking at this web page and then clicking on this button to make a purchase, maybe we're

55
00:04:01,460 --> 00:04:07,790
trying to measure the amount of money that they spend, in which case we could use a mean because we

56
00:04:07,790 --> 00:04:13,700
could look at the mean spend when our customers see variation A versus the mean spend when our customers

57
00:04:13,730 --> 00:04:19,790
see variation B. And so we would use the mean because our alternative hypothesis might say that the average

58
00:04:19,790 --> 00:04:26,150
revenue per user for variation A is not equal to the average revenue per user for variation B, those

59
00:04:26,150 --> 00:04:27,650
two means are different.

60
00:04:27,650 --> 00:04:30,350
That would be an example of continuous data.

61
00:04:30,350 --> 00:04:33,800
Remember previously that we talked about continuous versus discrete data.

62
00:04:33,800 --> 00:04:35,390
Well, take mean revenue:

63
00:04:35,390 --> 00:04:41,720
if revenue can be really any amount, more or less, then revenue would be an example of continuous data.

64
00:04:41,720 --> 00:04:44,750
But we can also have discrete data, like the simple

65
00:04:44,750 --> 00:04:48,020
yes/no question: did they click this button or not?

66
00:04:48,020 --> 00:04:52,670
In which case we would talk about the proportion of customers who clicked the button when they were

67
00:04:52,670 --> 00:04:57,830
looking at variation A versus the proportion of customers who clicked the button when they were looking

68
00:04:57,830 --> 00:05:03,980
at variation B. And in that case we would write our null and alternative hypothesis statements this way,

69
00:05:03,980 --> 00:05:09,200
where the alternative hypothesis says that the two proportions are not equal, whereas the null hypothesis

70
00:05:09,200 --> 00:05:11,270
says that the two proportions are equal.

71
00:05:11,270 --> 00:05:17,150
Changing to variation B doesn't affect the proportion of customers who click the button.

72
00:05:17,150 --> 00:05:24,470
The proportion remains the same in variation B as it was in variation A. The two proportions are equal

73
00:05:24,470 --> 00:05:26,210
in that null hypothesis.
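Written out, the two pairs of hypothesis statements just described might look like the following. The symbols are notation assumed here for illustration: $\mu_A$ and $\mu_B$ for the mean revenue per user under each variation, and $p_A$ and $p_B$ for the proportion of customers who click the button.

```latex
\begin{aligned}
\text{Means:} \qquad & H_0: \mu_A = \mu_B & \qquad H_a: \mu_A \neq \mu_B \\
\text{Proportions:} \qquad & H_0: p_A = p_B & \qquad H_a: p_A \neq p_B
\end{aligned}
```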

74
00:05:26,240 --> 00:05:31,160
Realize here also that regardless of which kind of hypothesis statements we're looking at, in both

75
00:05:31,160 --> 00:05:34,010
cases, we're running a two tailed test.

76
00:05:34,040 --> 00:05:39,860
We can see that from the fact that our alternative hypothesis is not equal to and our null hypothesis

77
00:05:39,860 --> 00:05:42,920
is equal to in both sets of hypothesis statements.

78
00:05:42,920 --> 00:05:49,940
We have that non directional test compared to a one tailed test or a directional test where we say in

79
00:05:49,940 --> 00:05:54,650
the alternative hypothesis that the mean of A is greater than the mean of B or less than the mean of

80
00:05:54,650 --> 00:05:59,210
B, or that the proportion for A is greater than the proportion for B or less than the proportion for

81
00:05:59,210 --> 00:05:59,600
B.

82
00:05:59,600 --> 00:06:04,640
And if you think about it, it makes sense that we're running a two tailed test when we're AB testing

83
00:06:04,640 --> 00:06:10,220
because not only is a two tailed test more conservative, as we've talked about in the past, but also

84
00:06:10,220 --> 00:06:16,130
intuitively by creating variation B here we're saying that we don't know which variation is going to

85
00:06:16,130 --> 00:06:16,760
perform better.

86
00:06:16,760 --> 00:06:21,020
We don't know if A's going to be better or B's going to be better or they're going to be exactly equal.

87
00:06:21,020 --> 00:06:28,580
So we don't have an idea about the directionality of A versus B. And so because we don't suspect that

88
00:06:28,580 --> 00:06:32,690
direction, because we have no idea whether variation B will perform better or worse,

89
00:06:32,690 --> 00:06:35,900
we need to run a non directional two tailed test.

90
00:06:35,900 --> 00:06:41,960
And so the alternative hypothesis is always A not equal to B, and the null hypothesis is always A equals

91
00:06:41,960 --> 00:06:42,320
B.

92
00:06:42,320 --> 00:06:48,680
So of course that means that for the purposes of an AB test, the null hypothesis automatically assumes

93
00:06:48,680 --> 00:06:55,130
that variation B is not a meaningful improvement over variation A, and running the AB test is either

94
00:06:55,130 --> 00:07:02,300
going to disprove that null by showing that variation B is better and better in a statistically significant

95
00:07:02,300 --> 00:07:02,840
way.

96
00:07:02,840 --> 00:07:06,860
Or we're going to fail to show that B is better, and in that

97
00:07:06,940 --> 00:07:10,060
case, we would fail to disprove the null hypothesis.

98
00:07:10,090 --> 00:07:16,330
Now, just like with hypothesis testing, our next step after we pick hypothesis statements is to choose

99
00:07:16,330 --> 00:07:21,340
a confidence level and therefore the alpha value or the level of significance.

100
00:07:21,340 --> 00:07:26,430
And a 95% confidence level is typically an industry standard.

101
00:07:26,440 --> 00:07:33,580
Again, we can pick technically any confidence level we want, but it's very common to pick a 95% confidence

102
00:07:33,580 --> 00:07:41,490
level and therefore an alpha value of 5% or a level of significance of 5%.

103
00:07:41,500 --> 00:07:45,820
So just like with hypothesis testing, these are values that we set ahead of time.

104
00:07:45,970 --> 00:07:52,720
And it's also really important when we're running an AB test to choose, ahead of time, a sample size and

105
00:07:52,720 --> 00:07:53,980
a time interval.

106
00:07:53,980 --> 00:07:58,060
So the sample size idea really isn't new. In hypothesis testing,

107
00:07:58,060 --> 00:08:01,270
we would pick a sample size ahead of time, and we do the same thing here.

108
00:08:01,270 --> 00:08:03,640
We pick a sample size, but for AB testing,

109
00:08:03,640 --> 00:08:10,360
it's also extremely important that we pick a time interval ahead of time, an interval of time over

110
00:08:10,360 --> 00:08:13,990
which we will run the AB test or conduct the AB test.

111
00:08:13,990 --> 00:08:20,500
And it's really important that we stick to that time interval and not end the test prematurely.

112
00:08:20,500 --> 00:08:26,860
The reason that setting a time interval ahead of time is so important is because AB testing tools often

113
00:08:26,860 --> 00:08:31,720
won't wait for a specific amount of time before returning a result.

114
00:08:31,720 --> 00:08:37,630
Instead, they'll start indicating right away as data is being collected, whether the result is showing

115
00:08:37,630 --> 00:08:38,440
significance.

116
00:08:38,440 --> 00:08:45,850
So imagine going back to our web page example that buying behavior on our website changes drastically

117
00:08:45,850 --> 00:08:49,600
throughout a single week, or maybe even a single month.

118
00:08:49,600 --> 00:08:55,720
For instance, maybe purchase volume is much higher during the week and lower on the weekends, for

119
00:08:55,720 --> 00:08:56,260
example.

120
00:08:56,290 --> 00:09:01,450
Maybe it's also the case for whatever reason, based on the kind of website we're running here, that

121
00:09:01,450 --> 00:09:06,520
most purchases on the site are made closer to the end of a month, as opposed to maybe in the first

122
00:09:06,520 --> 00:09:08,350
half of the month or something like that.

123
00:09:08,350 --> 00:09:14,710
If we pick a time interval that's too short or we don't allow the test to run through the full time

124
00:09:14,710 --> 00:09:20,170
interval, then we'll miss capturing data across the entire week or across the entire month, and we

125
00:09:20,170 --> 00:09:26,740
might start to see that the test is looking significant for maybe just weekday buying behavior.

126
00:09:26,740 --> 00:09:31,930
But we haven't let the test run through the weekend and maybe letting it run through the weekend would

127
00:09:31,930 --> 00:09:34,990
actually bring the results out of statistical significance.

128
00:09:34,990 --> 00:09:40,000
Or maybe if we only ran the test through the first half of the month instead of through the entire month,

129
00:09:40,000 --> 00:09:45,400
when most buying is done near the end of the month, maybe the test wouldn't look statistically significant

130
00:09:45,400 --> 00:09:46,840
for the first half of the month.

131
00:09:46,840 --> 00:09:51,670
But if we wait an entire month now, all of a sudden when we capture all that data at the end of the

132
00:09:51,670 --> 00:09:55,420
month, the test suddenly becomes statistically significant.

133
00:09:55,420 --> 00:10:02,230
So when we're AB testing, we need to pick a time interval that makes sense, and then we need to stick

134
00:10:02,230 --> 00:10:05,050
to that time interval without ending the test early.
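There is also a purely statistical reason not to stop the moment a tool first shows significance, and a small simulation can illustrate it. The sketch below is not how any particular AB testing tool works. It assumes a hypothetical "A/A" setup in which both variations share the same true click rate, uses a pooled two-proportion z-test, and peeks at the result every 100 visitors. All names and numbers are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_tailed_p(clicks_a, clicks_b, n):
    """Two-tailed p-value of a pooled two-proportion z-test, n visitors per variation."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:  # no clicks (or only clicks) so far: nothing to compare
        return 1.0
    z = (clicks_b - clicks_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(runs=200, visitors=1000, look_every=100, rate=0.10, alpha=0.05, seed=0):
    """Return (share of runs flagged at ANY look, share flagged at the planned end)."""
    rng = random.Random(seed)
    flagged_at_any_look = 0
    flagged_at_end = 0
    for _ in range(runs):
        clicks_a = clicks_b = 0
        flagged = False
        for i in range(1, visitors + 1):
            # Both variations have the same true click rate, so any
            # "significant" result here is a false alarm.
            clicks_a += rng.random() < rate
            clicks_b += rng.random() < rate
            if i % look_every == 0 and two_tailed_p(clicks_a, clicks_b, i) < alpha:
                flagged = True
        final = two_tailed_p(clicks_a, clicks_b, visitors) < alpha
        flagged_at_any_look += flagged or final
        flagged_at_end += final
    return flagged_at_any_look / runs, flagged_at_end / runs

any_look_rate, end_only_rate = simulate()
print(any_look_rate, end_only_rate)
```

Because the final look is one of the looks, the first number can never be smaller than the second. Stopping at the first significant reading can only add false alarms compared with waiting for the planned end of the test.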

135
00:10:05,200 --> 00:10:09,580
Another reason for doing this is this idea here of novelty effect.

136
00:10:09,580 --> 00:10:14,290
This idea of novelty effect relates to our existing customers.

137
00:10:14,290 --> 00:10:19,690
If we have regular customers who visit our site over and over and over again, and we've been showing

138
00:10:19,690 --> 00:10:26,320
them variation A for a long period of time, they're very used to seeing this blue button here on the

139
00:10:26,320 --> 00:10:26,860
site.

140
00:10:26,860 --> 00:10:32,830
If we suddenly change the color of the button from blue to green and we show them the green button,

141
00:10:32,830 --> 00:10:38,020
there's this novelty effect where they just notice that difference or they notice that something's different.

142
00:10:38,020 --> 00:10:43,360
And that maybe creates a change in their behavior simply because we changed something.

143
00:10:43,360 --> 00:10:49,570
But maybe it's not actually the case that the green button does better with new customers than the blue

144
00:10:49,570 --> 00:10:50,050
button.

145
00:10:50,050 --> 00:10:54,610
It's only doing better with existing customers because of this novelty effect.

146
00:10:54,610 --> 00:11:01,030
And maybe once our existing customers get used to this green button, given enough time, then it won't

147
00:11:01,030 --> 00:11:05,260
actually be the case that this green button performs any better than the blue button.

148
00:11:05,260 --> 00:11:12,430
So we need to set this time interval so that we make sure to capture a sufficient amount of data, especially

149
00:11:12,430 --> 00:11:18,130
if our data varies over time, like we talked about days of the week or throughout the month, maybe

150
00:11:18,130 --> 00:11:20,050
even an entire buying season.

151
00:11:20,050 --> 00:11:22,870
And we need to be able to get past this novelty effect.

152
00:11:22,870 --> 00:11:27,370
So with all that in mind, we pick a confidence level and therefore an alpha value.

153
00:11:27,400 --> 00:11:33,040
We set a certain sample size and a time interval and we've committed to sticking to that time interval.
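To make the "set it ahead of time" idea concrete, here is a rough sketch of how a sample size per variation could be estimated for a click through rate test. It uses the common normal-approximation formula for comparing two proportions. The baseline rate, the hoped-for rate, and the 80% power figure are assumptions chosen for illustration, and power is a concept this lesson hasn't covered.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(p_a, p_b, confidence=0.95, power=0.80):
    """Approximate visitors needed per variation for a two-tailed two-proportion test."""
    alpha = 1 - confidence                          # 95% confidence -> alpha of 5%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_a - p_b) ** 2)

# Hypothetical numbers: variation A converts at 10%, and we'd like to be able
# to detect an improvement to 12% with variation B.
n = sample_size_per_variation(p_a=0.10, p_b=0.12)
n_small_effect = sample_size_per_variation(p_a=0.10, p_b=0.11)
print(n, n_small_effect)
```

The smaller the difference we want to detect, the more visitors each variation needs, which in turn affects how long the test has to run.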

154
00:11:33,040 --> 00:11:36,010
Our next step is going to be to calculate a test statistic.

155
00:11:36,010 --> 00:11:41,170
But before we go there, let's talk about why we're even doing this AB testing in the first place.

156
00:11:41,170 --> 00:11:48,160
Well, if we go back to our web page example, maybe we get a million visitors per month to our website.

157
00:11:48,160 --> 00:11:54,760
If we just decide today to change this button from blue to green, even if sales revenue or click through

158
00:11:54,760 --> 00:12:01,390
rate improves in variation B here, we can't necessarily conclude that the green button is better because

159
00:12:01,390 --> 00:12:06,190
maybe up until this point we've been running this web page with the blue button and we've been seeing

160
00:12:06,190 --> 00:12:06,820
a certain

161
00:12:06,880 --> 00:12:12,340
level of revenue or a certain click through rate, and then all of a sudden we change the button to

162
00:12:12,340 --> 00:12:16,720
green and let's say that sales revenue increases with that change.

163
00:12:16,720 --> 00:12:22,180
Well, we don't know if it increased because we changed the color of the button or if it increased maybe

164
00:12:22,180 --> 00:12:28,540
because the month in which we changed to this version of the page is maybe a higher volume month for

165
00:12:28,540 --> 00:12:29,590
revenue in general.

166
00:12:29,590 --> 00:12:34,840
Maybe we started showing this green button closer to holiday season and so revenue would have gone up

167
00:12:34,840 --> 00:12:35,500
anyway.

168
00:12:35,500 --> 00:12:40,780
And in fact, maybe keeping the blue button would have made revenue even higher than the revenue we

169
00:12:40,780 --> 00:12:42,550
got when we changed to the green button.

170
00:12:42,550 --> 00:12:48,580
So if we just make the change from variation A to variation B, we just decide overnight to make the

171
00:12:48,580 --> 00:12:48,910
change

172
00:12:48,910 --> 00:12:50,260
instead of doing any testing,

173
00:12:50,260 --> 00:12:55,330
we don't really have any insight into whether the change we made is actually a good change or not.

174
00:12:55,330 --> 00:12:59,830
We don't know if it's making a positive impact, a negative impact or no impact.

175
00:12:59,830 --> 00:13:03,700
So ideally we'd prefer to test instead of just make the change.

176
00:13:03,700 --> 00:13:09,910
And then in addition, there are often risks or even costs involved in making a change.

177
00:13:09,910 --> 00:13:17,170
And so we might want to run a test before we incur all that risk or incur all of that cost to change

178
00:13:17,170 --> 00:13:18,490
from A to B.

179
00:13:18,490 --> 00:13:24,340
So if we have millions of customers coming to our website every month and maybe our revenue is in the

180
00:13:24,340 --> 00:13:30,280
many millions of dollars, making a change without doing any testing potentially puts that revenue at

181
00:13:30,280 --> 00:13:30,970
risk.

182
00:13:30,970 --> 00:13:34,690
And so it's important that we're cautious about any changes that we make.

183
00:13:34,690 --> 00:13:40,450
And doing some AB testing helps us make sure that we're only making changes that the data tell us will

184
00:13:40,450 --> 00:13:42,520
be positive to our bottom line.

185
00:13:42,520 --> 00:13:48,070
Or we talked about the example where we want to see if lowering the temperature of our office space

186
00:13:48,070 --> 00:13:49,690
increases worker productivity.

187
00:13:49,690 --> 00:13:55,600
Well, lowering the temperature for thousands and thousands of employees might cause our utility bills

188
00:13:55,600 --> 00:13:57,610
to increase significantly.

189
00:13:57,610 --> 00:14:03,580
And so maybe we want to run an AB test to look at productivity before we spend tens of thousands of

190
00:14:03,580 --> 00:14:07,420
dollars or even hundreds of thousands of dollars on extra air conditioning.

191
00:14:07,420 --> 00:14:12,610
So those are just some examples of why we do AB testing in the first place.

192
00:14:12,610 --> 00:14:17,500
And because we're doing AB testing, that means that we're doing inferential statistics because we're

193
00:14:17,500 --> 00:14:23,050
taking a sample and then using statistics about the sample to make inferences about the population.

194
00:14:23,050 --> 00:14:28,990
So we are running web page variation B against a sample of our customers.

195
00:14:28,990 --> 00:14:34,870
We're running web page variation A against another sample of customers, and we're comparing those samples.

196
00:14:34,870 --> 00:14:40,360
Or we're lowering the temperature in our office space for employees in some departments and keeping

197
00:14:40,360 --> 00:14:42,580
the temperature the same in other departments.

198
00:14:42,580 --> 00:14:45,610
So inherently we have this idea of sampling.

199
00:14:45,610 --> 00:14:50,320
So these are the parallels between hypothesis testing and AB testing.

200
00:14:50,320 --> 00:14:56,290
And at this point, once we get to calculating the test statistic, what we really need to say is that

201
00:14:56,290 --> 00:15:02,710
this whole AB testing process these days is done in AB testing software or with AB testing tools.

202
00:15:02,710 --> 00:15:09,340
And that software, those tools are going to help us with every step of this process by determining

203
00:15:09,340 --> 00:15:11,260
which kind of test we're going to run

204
00:15:11,260 --> 00:15:15,940
and picking a confidence level and a sample size. We'll be able to input all of this information.

205
00:15:15,940 --> 00:15:21,520
And when it comes to calculating the test statistic, there's a different test statistic for every different

206
00:15:21,520 --> 00:15:24,130
kind of data and every different kind of test.

207
00:15:24,130 --> 00:15:30,580
And so our goal here is not to understand all of the fine detail behind every type of test, but rather

208
00:15:30,580 --> 00:15:35,530
just to get a solid enough foundation so that we understand generally what our AB testing software is

209
00:15:35,530 --> 00:15:41,050
doing and what it's telling us so that we can intelligently interpret the results that we're seeing.

210
00:15:41,230 --> 00:15:45,640
So when it comes to calculating the test statistic, we want to keep in mind that we're going to use

211
00:15:45,640 --> 00:15:49,720
a different test statistic depending on whether we have discrete or continuous data.

212
00:15:49,720 --> 00:15:52,900
We talked about this when we talked about hypothesis statements.

213
00:15:52,900 --> 00:15:56,830
If we have discrete data, like whether our customers click the button,

214
00:15:56,830 --> 00:16:04,060
yes or no, we can think about examples of that, like click through rate, like the number of products

215
00:16:04,060 --> 00:16:04,690
that are purchased,

216
00:16:04,690 --> 00:16:11,320
if we just have a count of products being purchased. And we can contrast that with continuous data where

217
00:16:11,350 --> 00:16:16,630
of course we might have something like what we talked about earlier, the mean revenue per customer.

218
00:16:16,630 --> 00:16:22,420
So you certainly want to have an idea whether your data is discrete or continuous based on what it is

219
00:16:22,420 --> 00:16:24,670
you're trying to measure and improve.

220
00:16:24,670 --> 00:16:31,000
And then depending on the kind of data or the kind of test, we'll use different test statistics.

221
00:16:31,000 --> 00:16:37,840
So for example, if we're testing click through rate, we'll probably use what's called a Fisher's exact

222
00:16:37,840 --> 00:16:38,410
test.

223
00:16:38,440 --> 00:16:44,050
If we're looking at the number of products purchased, we'll very likely use a chi square test, and we'll

224
00:16:44,050 --> 00:16:49,810
talk about chi square later when we look at regression. If we're looking at revenue per customer,

225
00:16:49,810 --> 00:16:54,310
we might use a Welch's t-test or maybe even just a Student's t-test.

226
00:16:54,310 --> 00:17:00,250
And of course we'll use a different test statistic formula for each of these different kinds of tests.

227
00:17:00,250 --> 00:17:04,030
But that's what AB testing software will help us with.

228
00:17:04,030 --> 00:17:06,609
If we feed it the correct information, it'll

229
00:17:06,670 --> 00:17:09,390
help us choose the correct test statistic.
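To give a feel for what the software is doing behind the scenes in the click through rate case, here is a minimal sketch of a two-sided Fisher's exact test written with only the Python standard library. The function name and the click counts are hypothetical, and real AB testing tools may compute this differently. The idea is to add up the probabilities of every possible split of the clicks that is at least as unlikely as the split we actually observed, assuming the null hypothesis that the variation makes no difference.

```python
from math import comb

def fisher_exact_two_sided(clicks_a, n_a, clicks_b, n_b):
    """Two-sided Fisher's exact p-value for clicks out of visitors in A versus B."""
    total = n_a + n_b
    total_clicks = clicks_a + clicks_b
    denom = comb(total, n_a)

    def prob(k):
        # Probability that exactly k of all the clicks land in variation A
        # if clicking is unrelated to the variation (hypergeometric).
        return comb(total_clicks, k) * comb(total - total_clicks, n_a - k) / denom

    observed = prob(clicks_a)
    lowest = max(0, total_clicks - n_b)
    highest = min(total_clicks, n_a)
    # Sum every outcome that is no more likely than the observed one; the
    # small tolerance guards against floating point ties.
    p = sum(prob(k) for k in range(lowest, highest + 1)
            if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, p)

# Hypothetical counts: 100 visitors per variation.
p_same = fisher_exact_two_sided(10, 100, 10, 100)   # identical click rates
p_diff = fisher_exact_two_sided(5, 100, 30, 100)    # 5% versus 30%
print(p_same, p_diff)
```

With identical click rates the p-value is as large as it can be, so we would fail to reject the null. With 5% versus 30% it falls well below an alpha of 5%.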

230
00:17:09,400 --> 00:17:14,380
And then, of course, our hypothesis testing software helps us with the very last step, which is to

231
00:17:14,380 --> 00:17:19,660
draw a conclusion by answering the question, Is the test statistic that we calculated or that the software

232
00:17:19,660 --> 00:17:25,720
helped us calculate significant enough to reject the null hypothesis, in other words, to reject the

233
00:17:25,720 --> 00:17:33,220
idea that variation B, whatever it was, had no effect on the outcome we were looking for, had no effect

234
00:17:33,220 --> 00:17:37,300
on mean revenue, had no effect on click through rate or the number of products purchased.

235
00:17:37,300 --> 00:17:39,520
If we fail to reject the null,

236
00:17:39,520 --> 00:17:45,160
if the test statistic is not significant enough, then we can't say that variation B is better than

237
00:17:45,160 --> 00:17:48,580
variation A. There's no change to the status quo.

238
00:17:48,610 --> 00:17:51,760
We can't prove that there's a significant difference.

239
00:17:51,760 --> 00:17:57,820
But if the test statistic is significant enough to allow us to reject the null hypothesis, then we

240
00:17:57,820 --> 00:18:03,820
reject the null lending support to our alternative hypothesis, which means that we have at least some

241
00:18:03,820 --> 00:18:10,090
support for the idea that variation B performs better than variation A. Now remember, just like with

242
00:18:10,090 --> 00:18:17,110
hypothesis testing, the whole idea here is not about whether we found a better sample mean with variation

243
00:18:17,140 --> 00:18:22,480
B than with variation A or a better sample proportion, like with the set of hypothesis statements,

244
00:18:22,480 --> 00:18:28,990
a better sample proportion with variation B than with variation A. Just like with hypothesis testing,

245
00:18:28,990 --> 00:18:37,150
we might find a better result with variation B than with variation A. But the question is, is it better

246
00:18:37,150 --> 00:18:37,900
enough?

247
00:18:37,900 --> 00:18:44,020
We need it to be statistically significant in order to support the idea that we think variation B will

248
00:18:44,020 --> 00:18:50,590
do better across the entire population, not just the sample, than variation A. So to take a super simple

249
00:18:50,590 --> 00:18:56,920
example, let's say we show variation A of our web page to 1000 customers.

250
00:18:56,920 --> 00:19:03,070
So 1000 customers see the blue button and let's say 1000 customers see the green button, variation

251
00:19:03,070 --> 00:19:11,710
B. And let's say that mean revenue per customer on the blue button page is $212 and mean revenue per

252
00:19:11,710 --> 00:19:16,360
customer on the green button page is, let's say, $220.

253
00:19:16,360 --> 00:19:22,330
Well, you can probably get an intuitive sense here that while there is a difference between 212 and

254
00:19:22,330 --> 00:19:29,560
220 and it looks like variation B produces more revenue than variation A, at least

255
00:19:29,560 --> 00:19:36,490
in this super simple example, the difference doesn't seem significant enough to make us feel very confident,

256
00:19:36,490 --> 00:19:44,200
much less 95% confident that variation B is definitely better than variation A. Whereas if you saw data

257
00:19:44,200 --> 00:19:51,580
where for the 1000 customers that see the blue button mean revenue per customer is 212 but mean revenue

258
00:19:51,580 --> 00:19:55,090
for the 1000 customers who see the green button is

259
00:19:55,840 --> 00:19:57,100
$500,

260
00:19:57,100 --> 00:20:02,110
just at a basic level, that's going to catch your attention way more when the difference here is almost

261
00:20:02,110 --> 00:20:04,340
$300 versus $8.

262
00:20:04,360 --> 00:20:10,810
So even though in both cases variation B is better than variation A, in this second case, variation

263
00:20:10,810 --> 00:20:16,930
B has a much better chance of being statistically significantly better than variation A. And that's the

264
00:20:16,930 --> 00:20:17,470
whole point.

265
00:20:17,470 --> 00:20:18,850
It can't just be better.

266
00:20:18,850 --> 00:20:20,110
It has to be better enough.

267
00:20:20,110 --> 00:20:27,310
It has to be statistically significant in order to feel confident at a 95% confidence level that the

268
00:20:27,310 --> 00:20:33,370
improvement that we're seeing in variation B here is not contained to just the sample, and that that improvement

269
00:20:33,370 --> 00:20:36,310
is actually going to map to the entire population.
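The revenue example can be sketched in code, with one important caveat: whether a given difference is significant depends not only on the two means and the sample sizes, but also on how spread out the individual purchases are. The lecture doesn't give standard deviations, so the values below are assumptions for illustration only. The sketch computes a Welch-style test statistic and, because each sample has 1000 customers, approximates the two-tailed p-value with the normal distribution rather than the t distribution.

```python
from math import sqrt
from statistics import NormalDist

def welch_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch test statistic with a large-sample (normal) two-tailed p-value."""
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    t = (mean_b - mean_a) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

alpha = 0.05  # 95% confidence level

# $212 versus $220, assuming purchases vary a lot (standard deviation of $100).
t_noisy, p_noisy = welch_test(212, 100, 1000, 220, 100, 1000)

# The same $8 difference, assuming purchases vary less (standard deviation of $50).
t_tight, p_tight = welch_test(212, 50, 1000, 220, 50, 1000)

# $212 versus $500, again assuming a standard deviation of $100.
t_big, p_big = welch_test(212, 100, 1000, 500, 100, 1000)

for label, p in [("$8, sd 100", p_noisy), ("$8, sd 50", p_tight), ("$288, sd 100", p_big)]:
    print(label, "reject the null" if p < alpha else "fail to reject the null")
```

Under these assumed spreads, the $8 difference is not significant when purchases are noisy but becomes significant when they are less variable, while the $288 difference is significant either way. That is why we rely on the test statistic rather than on the raw difference alone.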

270
00:20:36,310 --> 00:20:42,100
So hopefully that gives you an idea of the foundational ideas behind this AB testing process and

271
00:20:42,100 --> 00:20:47,650
how they're so closely related to the hypothesis testing process that we've just worked through.

272
00:20:47,650 --> 00:20:53,110
In the next section, we'll start talking about regression, which is all about trends in our data.

273
00:20:53,110 --> 00:20:58,570
And as mentioned earlier, we'll look a little more at this test statistic we mentioned earlier, which

274
00:20:58,570 --> 00:21:01,420
is the chi squared test statistic.