1
00:00:05,540 --> 00:00:10,010
Welcome back, everyone, to this section of the course on the normal distribution.

2
00:00:11,130 --> 00:00:16,379
We've already explored probability mass functions as well as probability density functions for some

3
00:00:16,379 --> 00:00:18,270
particular data distributions.

4
00:00:18,390 --> 00:00:24,210
However, we have yet to explore one of the most frequently used and fundamental distributions known

5
00:00:24,210 --> 00:00:25,800
as the normal distribution.

6
00:00:27,140 --> 00:00:32,240
Let's explore the types of problems we're going to be able to answer once we understand the normal distribution

7
00:00:32,240 --> 00:00:37,070
and identify real world data distributions that follow a normal distribution.

8
00:00:38,940 --> 00:00:44,190
If you end up working with a real world data set that you can end up treating as normally distributed,

9
00:00:44,190 --> 00:00:50,640
you've actually unlocked another set of tools, that is, equations that you can use to calculate the probabilities

10
00:00:50,640 --> 00:00:51,690
of outcomes.

11
00:00:52,930 --> 00:00:57,400
One of the key tools used in normal distributions is known as a Z score.

12
00:00:57,550 --> 00:01:02,290
Fundamentally, once you're able to treat that real world data set as normally distributed, then you

13
00:01:02,290 --> 00:01:06,910
can use the Z score and that's going to allow you to calculate the probability of data points being

14
00:01:06,910 --> 00:01:08,920
within any given interval.

15
00:01:09,670 --> 00:01:12,520
For example, imagine that you run a hospital.

16
00:01:12,520 --> 00:01:18,490
So what you end up doing is you collect historical data around the length of human pregnancies in days.

17
00:01:18,670 --> 00:01:23,980
Once you have the mean and the standard deviation of that data set, and you've tested the overall

18
00:01:23,980 --> 00:01:29,140
data set to understand that it's following a normal distribution, you can calculate the probability

19
00:01:29,140 --> 00:01:32,440
of a pregnancy lasting longer than any number of days.

20
00:01:33,010 --> 00:01:38,680
So you can imagine that calculating that sort of probability can aid in all sorts of tasks, such

21
00:01:38,680 --> 00:01:44,290
as scheduling, hospital staffing, managing inventory for deliveries or alert systems for pregnancies

22
00:01:44,290 --> 00:01:46,420
lasting longer than a certain number of days.

23
00:01:46,450 --> 00:01:51,160
The key idea to keep in mind is that the answers would actually still be in probability terms.

24
00:01:51,160 --> 00:01:56,560
So you would get answers in the form of something like there's a 10% chance that a pregnancy lasts longer

25
00:01:56,560 --> 00:01:57,940
than a certain number of days.

26
00:01:59,390 --> 00:02:01,210
So what are we going to cover in this section?

27
00:02:01,220 --> 00:02:06,100
We'll have a discussion of mean, variance, and standard deviation as they pertain to the normal distribution.

28
00:02:06,110 --> 00:02:11,870
Then we'll have a deeper dive into the normal distribution, otherwise known as a Gaussian distribution.

29
00:02:11,870 --> 00:02:16,910
Then we'll talk about the standard normal distribution, which is a specific case of a normal distribution.

30
00:02:16,910 --> 00:02:22,640
And then finally we'll talk about Z scores, which allow us to actually use a normal distribution to

31
00:02:22,640 --> 00:02:24,110
calculate probabilities.

32
00:02:25,660 --> 00:02:30,310
So the normal distribution is one of the most common distributions we use in business because so many

33
00:02:30,310 --> 00:02:33,670
real life data sets end up resembling a normal distribution.

34
00:02:34,930 --> 00:02:40,870
Formally, the normal distribution is defined as a type of continuous probability distribution for a real

35
00:02:40,870 --> 00:02:42,520
valued random variable.

36
00:02:42,580 --> 00:02:47,620
Now, we already learned about probability density functions, and the general form for

37
00:02:47,620 --> 00:02:50,050
a normal distribution follows this formula.
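
The slide showing the formula is not reproduced in this transcript. For reference, the standard form of the normal probability density function is given below. The symbols are assumptions made here, since the lecture only names them in words: \mu for the mean and \sigma for the standard deviation.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}
```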

38
00:02:51,030 --> 00:02:56,040
While the formula can actually look a bit complex at first sight, for example, it has e and also pi,

39
00:02:56,040 --> 00:02:57,390
which is kind of crazy.

40
00:02:57,420 --> 00:03:03,960
You can take a close look at the terms and realize that in order to figure out the output f of x, you

41
00:03:03,960 --> 00:03:05,370
really just need three things.

42
00:03:05,370 --> 00:03:07,080
You need x itself.

43
00:03:07,080 --> 00:03:11,320
And then just the mean and standard deviation for your particular data set.

44
00:03:11,340 --> 00:03:14,790
You plug those in and then eventually you'll get f of x out.
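
As a minimal sketch of "plug those in", the function below evaluates the standard normal density formula for a given x, mean, and standard deviation. The function name `normal_pdf` and its parameter names are hypothetical, chosen only for this illustration.

```python
import math

def normal_pdf(x, mean, std_dev):
    """Density of a normal distribution with the given mean and standard deviation at x."""
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mean) / std_dev) ** 2
    return coefficient * math.exp(exponent)

# Standard normal (mean 0, standard deviation 1) evaluated at its center.
peak = normal_pdf(0.0, 0.0, 1.0)
print(round(peak, 4))  # 0.3989
```

Only three inputs are needed, which is the point made above: x itself, the mean, and the standard deviation.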

45
00:03:16,020 --> 00:03:20,910
So let's take a step back and think about the shape of a normal distribution and what that implies for

46
00:03:20,910 --> 00:03:24,210
people who want to use its properties to answer questions.

47
00:03:25,670 --> 00:03:31,610
Fundamentally, we can think of the normal distribution as a data distribution where most values tend

48
00:03:31,610 --> 00:03:38,420
to fall closer to a mean value and then are distributed with some degree of variance.

49
00:03:38,420 --> 00:03:42,170
For example, heights of people are normally distributed.

50
00:03:42,230 --> 00:03:46,550
Most people end up being closer to the average height.

51
00:03:46,670 --> 00:03:51,260
Then fewer people are either going to be very short or very tall.

52
00:03:52,790 --> 00:03:55,820
So you'll often see a visualization of a normal distribution

53
00:03:55,820 --> 00:04:02,600
look something like this, where most values tend to be closer to the mean, and then you have the tails

54
00:04:02,600 --> 00:04:04,850
which indicate lower probabilities.

55
00:04:06,080 --> 00:04:11,960
Notice how it's just a probability density function showing the relative likelihood of observing a particular

56
00:04:11,960 --> 00:04:13,430
data feature value.

57
00:04:14,620 --> 00:04:19,810
In this particular case, you're actually seeing a normal distribution that's centered around a mean

58
00:04:19,810 --> 00:04:21,160
value of zero.

59
00:04:22,370 --> 00:04:25,520
However, a normal distribution can have any mean value.

60
00:04:26,530 --> 00:04:32,230
So here we can see a normal distribution for the global heights of women, which tend to have a mean

61
00:04:32,230 --> 00:04:34,750
value of 163 centimetres.

62
00:04:36,280 --> 00:04:42,160
We've also seen the normal distribution is defined by the standard deviation, which is essentially

63
00:04:42,160 --> 00:04:46,150
the square root of the variance of the distribution.

64
00:04:47,700 --> 00:04:55,170
Imagine a data set with a wider variance, such as the prices that used cars are sold at, depending on the condition

65
00:04:55,170 --> 00:04:55,950
of the car.

66
00:04:55,980 --> 00:05:01,890
We could have a wide range of possible prices being paid for used vehicles, thus a larger standard

67
00:05:01,890 --> 00:05:03,000
deviation value.

68
00:05:04,590 --> 00:05:08,220
So here we can see a normal distribution of used car prices.

69
00:05:08,430 --> 00:05:13,650
And for a second, you may be thinking, hey, this looks exactly the same as the chart you just showed

70
00:05:13,650 --> 00:05:15,690
earlier for the heights of women.

71
00:05:16,460 --> 00:05:22,400
However, you should notice that the range on the x axis is a lot wider.

72
00:05:22,430 --> 00:05:29,450
Here we're going all the way from like 5000 to 25000, where previously, if you were to take a look

73
00:05:29,450 --> 00:05:36,590
back at the other slide, the heights had a much tighter standard deviation, measured in

74
00:05:36,590 --> 00:05:37,580
centimeters.

75
00:05:41,350 --> 00:05:44,750
So you're probably wondering how does this work in real life?

76
00:05:44,770 --> 00:05:50,650
I can read all the textbooks I want and see all the pretty visualization diagrams, but in real life

77
00:05:50,650 --> 00:05:53,980
you don't start with a nice probability density function curve.

78
00:05:53,980 --> 00:06:00,700
Instead, you're going to have a series of real values, such as a data set of used car prices or measurements

79
00:06:00,700 --> 00:06:01,660
of heights.

80
00:06:01,660 --> 00:06:08,050
So you actually have to go from the real data set first and then map it to some sort of theoretical

81
00:06:08,050 --> 00:06:10,150
probability density function curve.

82
00:06:10,210 --> 00:06:13,060
And remember, the formula technically allows you to do that.

83
00:06:13,060 --> 00:06:18,490
You just need to know the mean and the standard deviation, and then you can plug in your x value

84
00:06:18,490 --> 00:06:20,350
and start drawing out that curve.
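
Going from raw values to a theoretical curve could look roughly like this. The heights below are hypothetical numbers made up for illustration, not data from the lecture, and `statistics.NormalDist` from Python's standard library stands in for the formula.

```python
import statistics

# Hypothetical heights in centimeters, for illustration only.
heights = [158.0, 163.5, 171.2, 160.4, 166.8, 155.9, 169.3, 162.1]

mean = statistics.mean(heights)
std_dev = statistics.stdev(heights)  # sample standard deviation (n - 1 denominator)

# A NormalDist built from those two numbers gives the theoretical curve.
fitted = statistics.NormalDist(mean, std_dev)
curve = [(x, fitted.pdf(x)) for x in range(150, 181, 5)]
for x, density in curve:
    print(x, round(density, 4))
```

Each printed pair is one point on the fitted density curve; plotting them would trace out the bell shape.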

85
00:06:21,550 --> 00:06:25,870
So imagine that we're in charge of a school and are performing standardized testing.

86
00:06:25,990 --> 00:06:30,970
Perhaps we're trying to analyze the probabilities of students across the nation scoring particularly

87
00:06:30,970 --> 00:06:32,170
well on the test.

88
00:06:32,470 --> 00:06:37,660
Can we use our single school as a sample of the overall student population?

89
00:06:38,960 --> 00:06:44,930
So let's imagine that we have this data set of 100 students and their test scores: the first student

90
00:06:44,930 --> 00:06:48,590
scored 70%, the second student scored 85%, and so on.

91
00:06:50,180 --> 00:06:52,250
So we're going to take a real data set.

92
00:06:53,150 --> 00:06:57,210
And then now we actually need to test if this data is normally distributed.

93
00:06:57,230 --> 00:06:59,120
There's a couple of different ways to do this.

94
00:06:59,150 --> 00:07:05,690
A simple way is to just visualize the data so I can take this data set and then actually create a histogram

95
00:07:05,690 --> 00:07:10,730
and you'll notice that it starts to look like a normal distribution curve and you can play around with

96
00:07:10,730 --> 00:07:14,990
the bin count to get a better idea if it's following a normal distribution.
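
A rough way to try this without any plotting library is a text histogram. The sketch below uses simulated scores as a stand-in for a real data set (an assumption for illustration), and the helper name `text_histogram` is hypothetical. Changing `bin_width` plays the same role as adjusting the bin count.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated stand-in data: 100 test scores, not the lecture's data set.
scores = [random.gauss(69.32, 7.59) for _ in range(100)]

def text_histogram(values, bin_width):
    """Count values per bin; each bin is labeled by its left edge."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

histogram = text_histogram(scores, bin_width=5)
for left_edge, count in histogram.items():
    print(f"{left_edge:>3} | {'#' * count}")
```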

97
00:07:16,590 --> 00:07:21,420
So again, here we can see that visually speaking, it looks like it's following a normal distribution

98
00:07:21,420 --> 00:07:22,530
in the histogram.

99
00:07:23,650 --> 00:07:26,650
But sometimes that may not actually be as clear.

100
00:07:26,680 --> 00:07:31,390
In that case, we have more stringent mathematical tests for normality.

101
00:07:31,450 --> 00:07:33,550
A test for normality basically asks:

102
00:07:33,550 --> 00:07:37,870
Hey, is this data set consistent with being normally distributed?

103
00:07:39,370 --> 00:07:41,530
So there are several normality tests.

104
00:07:41,560 --> 00:07:43,210
Here's a list of a bunch of them.

105
00:07:43,210 --> 00:07:48,070
In fact, there's so many, there's actually a full Wikipedia page on the different normality tests.

106
00:07:48,100 --> 00:07:53,230
I also find it curious that so many of these normality tests have two names attached to them, but that's

107
00:07:53,230 --> 00:07:55,290
just something you can explore on your own.

108
00:07:55,300 --> 00:08:01,000
But keep in mind you basically just pass in your data set to this test and then it reports back a measure

109
00:08:01,000 --> 00:08:02,530
of how consistent the data is with being normally distributed.

110
00:08:03,550 --> 00:08:09,970
So again, technically speaking, these tests do not tell you for certain whether your data set is normally

111
00:08:09,970 --> 00:08:10,810
distributed.

112
00:08:10,840 --> 00:08:17,290
They just usually present some sort of p-value or metric saying, hey, there's no strong evidence that your actual data

113
00:08:17,290 --> 00:08:20,860
set departs from a normal distribution.

114
00:08:22,320 --> 00:08:28,410
Now, more specifically, these tests typically operate using what's known as a hypothesis paradigm,

115
00:08:28,410 --> 00:08:33,659
where you posit a hypothesis that your particular data sample, that is, that data sample of student

116
00:08:33,659 --> 00:08:38,280
test scores, is normally distributed or comes from a normally distributed population.
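
To make the hypothesis idea concrete, here is a sketch of one normality test, the Jarque-Bera test, written with only the standard library. It is swapped in here because it is short to write out; it is not the Shapiro-Wilk test (libraries such as SciPy provide that one as `scipy.stats.shapiro`). The p-value uses a large-sample approximation, so for around 100 points it is only a rough guide. A small p-value is evidence against normality, while a large one only means the test found no strong evidence against it. Both sample data sets are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def jarque_bera(values):
    """Return (statistic, p_value) for the Jarque-Bera normality test."""
    n = len(values)
    mean = fmean(values)
    deviations = [v - mean for v in values]
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under the null hypothesis the statistic is approximately chi-square
    # with 2 degrees of freedom, whose upper-tail probability is exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

# Hypothetical, idealized bell-shaped sample: 100 evenly spaced normal quantiles.
bell = [NormalDist(69.32, 7.59).inv_cdf((i + 0.5) / 100) for i in range(100)]
# Hypothetical, clearly non-normal sample: mostly one value plus a few extremes.
lopsided = [1.0] * 90 + [100.0] * 10

_, p_bell = jarque_bera(bell)
_, p_lopsided = jarque_bera(lopsided)
print(p_bell > 0.05, p_lopsided < 0.05)
```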

117
00:08:39,700 --> 00:08:47,140
Now, keep in mind, we recommend that you perform both a basic visual test and a normality test before

118
00:08:47,140 --> 00:08:49,900
assuming you can treat a data set as normally distributed.

119
00:08:50,020 --> 00:08:55,360
Doing both the normality test, and having that confirm your particular data set is consistent with

120
00:08:55,360 --> 00:08:58,780
a normal distribution, and doing the visual to see,

121
00:08:58,780 --> 00:08:59,020
yes,

122
00:08:59,020 --> 00:09:03,670
indeed, this looks like it's normally distributed, is a really nice way of saying, okay, I can start

123
00:09:03,670 --> 00:09:07,510
treating this data set as being or belonging to a normal distribution.

124
00:09:07,630 --> 00:09:10,960
So let's head back to our real world example of student tests.

125
00:09:12,370 --> 00:09:17,110
So let's imagine we just conducted a Shapiro-Wilk normality test.

126
00:09:17,200 --> 00:09:20,650
Then we've also visualized this particular data set.

127
00:09:20,830 --> 00:09:26,320
In that case, if the Shapiro-Wilk test gives me a high confidence that our actual data set is normally

128
00:09:26,320 --> 00:09:29,610
distributed and visually, I can see it looks like it's normally distributed.

129
00:09:29,620 --> 00:09:34,180
Then what I'm going to do is treat it as normally distributed, or as coming from a normally distributed

130
00:09:34,180 --> 00:09:34,960
population.

131
00:09:36,290 --> 00:09:41,360
So now that I can treat this data as normally distributed, that means I can use that mean and standard

132
00:09:41,360 --> 00:09:43,670
deviation to begin to answer questions.

133
00:09:43,700 --> 00:09:49,850
Again, recall the PDF formula, that is, the probability density function that we saw earlier in the

134
00:09:49,850 --> 00:09:54,350
slides, which just needed the mean and standard deviation in order to draw that curve.

135
00:09:54,380 --> 00:09:57,230
Then you just plug in the x values and it'll draw the curves for you.

136
00:09:58,870 --> 00:10:03,910
You should also keep in mind that normal distributions come with very unique properties of the mean

137
00:10:03,910 --> 00:10:05,200
and standard deviation.

138
00:10:05,380 --> 00:10:15,070
Essentially, 34.1% of all the data is going to fall between the mean (zero here) and one standard deviation below it, and by symmetry another 34.1% falls between the mean and one standard deviation above it.

139
00:10:15,190 --> 00:10:22,420
That is to say that 68.2% of all the values are going to fall within one standard deviation of the

140
00:10:22,420 --> 00:10:22,960
mean.

141
00:10:23,050 --> 00:10:28,540
In this particular diagram, we're showing the mean as zero, but that doesn't have to be the case.
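
These percentages can be checked with Python's standard library. This is a small sketch using `statistics.NormalDist`. The more precise figure is about 68.27%; the 68.2% quoted above comes from doubling the rounded 34.1%. The second distribution reuses the student mean and standard deviation mentioned later in the lecture to show that the proportion does not depend on the mean being zero.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

below_mean_side = standard.cdf(0.0) - standard.cdf(-1.0)   # mean down to -1 sd
within_one_sd = standard.cdf(1.0) - standard.cdf(-1.0)     # -1 sd to +1 sd

print(round(below_mean_side, 3), round(within_one_sd, 3))  # 0.341 0.683

# Same proportion for a distribution with a different mean and spread.
shifted = NormalDist(mu=69.32, sigma=7.59)
shifted_within = shifted.cdf(69.32 + 7.59) - shifted.cdf(69.32 - 7.59)
print(round(shifted_within, 3))  # 0.683
```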

142
00:10:29,880 --> 00:10:34,470
So since we can calculate the mean and standard deviation from our data set, we can now begin to plug

143
00:10:34,470 --> 00:10:36,210
in formulas to answer questions.

144
00:10:37,340 --> 00:10:43,820
These calculations are so common that a standard score, or Z score, system allows us to easily convert

145
00:10:43,820 --> 00:10:48,320
our data set values into the probability space of the standard normal distribution.

146
00:10:48,320 --> 00:10:52,300
And you don't need to worry about completely understanding the diagram or what I'm discussing here.

147
00:10:52,310 --> 00:10:56,240
I just want to give you an overview of what you're going to learn in this section.

148
00:10:56,240 --> 00:11:02,600
So the main idea is you take your real world data set, you check if it's normally distributed, then

149
00:11:02,600 --> 00:11:08,060
you can apply what's known as a Z score, which again uses the mean and standard deviation to begin

150
00:11:08,060 --> 00:11:15,290
to actually answer questions like What's the probability that a value falls above or below X or within

151
00:11:15,290 --> 00:11:17,720
the ranges of particular X values?

152
00:11:19,110 --> 00:11:24,840
The absolute value of Z represents the distance between that raw score X and the population mean in

153
00:11:24,840 --> 00:11:26,640
units of the standard deviation.

154
00:11:26,820 --> 00:11:32,370
This is all to say that you're going to actually be able to calculate probabilities of particular x

155
00:11:32,370 --> 00:11:33,540
interval ranges.

156
00:11:35,190 --> 00:11:39,720
Z is negative when the raw score is below the mean and positive when above.
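
The description above corresponds to the usual Z score formula: the raw value minus the mean, divided by the standard deviation. Here is a minimal sketch; the function name `z_score` and the example values (mean 70, standard deviation 10) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Hypothetical values: mean 70, standard deviation 10.
print(z_score(85, 70, 10))  # 1.5
print(z_score(60, 70, 10))  # -1.0
```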

157
00:11:41,240 --> 00:11:45,500
And again, we're going to discuss Z scores in more detail later on in this section of the course.

158
00:11:45,500 --> 00:11:50,270
And this section of the course basically builds up to understanding Z scores.

159
00:11:50,600 --> 00:11:56,630
Technically, we use the sample statistics and not the population statistics in this particular example.

160
00:11:57,990 --> 00:11:59,790
So let's revisit our data.

161
00:11:59,940 --> 00:12:06,990
Imagine you wanted to know what's the probability of a student in, let's say, the entire nation scoring

162
00:12:06,990 --> 00:12:09,150
above an 80 on the test.

163
00:12:09,180 --> 00:12:14,640
Now, I have my data distribution, but it's technically just a sample from my school.

164
00:12:15,150 --> 00:12:20,640
Now I want to use my sample to figure out a probability for the entire population.

165
00:12:22,470 --> 00:12:26,910
So again, you should notice how this differs from the question How many students in our particular

166
00:12:26,910 --> 00:12:29,610
dataset scored 80 or above on the test?

167
00:12:29,640 --> 00:12:31,840
That's not what I'm actually asking here.

168
00:12:31,860 --> 00:12:38,900
I'm thinking in terms of how do I use this sample dataset to answer questions about the larger population?

169
00:12:38,910 --> 00:12:45,090
And when you read studies in journals or headlines in newspapers, they're not answering this question

170
00:12:45,090 --> 00:12:47,410
of how many students from this particular data set.

171
00:12:47,430 --> 00:12:52,800
Instead, they're using things like normal distributions and Z scores to try to figure out probabilistic

172
00:12:52,800 --> 00:12:55,830
findings reflective of the larger population.

173
00:12:57,320 --> 00:13:03,470
So the purpose of this exercise is to use our student sample to gain insight on the overall population

174
00:13:03,470 --> 00:13:04,520
of students.

175
00:13:04,520 --> 00:13:08,240
That also is going to come with some requirements like having a large enough sample.

176
00:13:08,270 --> 00:13:10,940
I can't do this sort of thing with just one student.

177
00:13:10,940 --> 00:13:16,460
And later on we'll discover that a minimum of around 30 data points is going to be needed to begin to do

178
00:13:16,460 --> 00:13:18,200
things like test for normality.

179
00:13:19,690 --> 00:13:27,160
So in our particular example of the students, we have a mean test score of 69.32 and a standard deviation

180
00:13:27,160 --> 00:13:28,600
of 7.59.

181
00:13:29,830 --> 00:13:35,710
So what I can end up doing is this: if I'm trying to answer the question, what's the probability

182
00:13:35,710 --> 00:13:41,960
that a student in the general population of the nation is going to score higher than 80%?

183
00:13:41,980 --> 00:13:44,270
Then I set X as 80.

184
00:13:44,290 --> 00:13:45,570
Then I take that value.

185
00:13:45,580 --> 00:13:51,850
I plug in x equal to 80, a mean of 69.32, and a standard deviation of 7.59.

186
00:13:51,880 --> 00:13:56,020
Eventually, we'll learn about the Z score formula and we'll plug those in.

187
00:13:56,020 --> 00:14:03,100
That leads to a Z score of 1.407 and then that will end up returning what's known as a P value from

188
00:14:03,100 --> 00:14:04,090
the Z table.

189
00:14:04,120 --> 00:14:07,930
Keep in mind, we're going to cover this in a lot more detail later on, but essentially that begins

190
00:14:07,930 --> 00:14:13,180
to let you answer questions like, Hey, what's the probability that someone scores below that X value?

191
00:14:13,210 --> 00:14:15,880
What's the probability that somebody scores above that X value?

192
00:14:15,880 --> 00:14:19,570
And what's the probability that somebody scores within a particular range?

193
00:14:19,570 --> 00:14:21,550
And you can see those calculations here.

194
00:14:21,550 --> 00:14:26,590
And you can also just Google search for Z score calculator and you'll be able to do this sort of thing

195
00:14:26,590 --> 00:14:27,360
yourself.

196
00:14:29,070 --> 00:14:34,860
So for this particular example, keep in mind I just made up the data, but you would end up calculating

197
00:14:34,860 --> 00:14:39,470
the mean, the standard deviation, and you pass in the particular x value you're interested in.

198
00:14:39,480 --> 00:14:44,940
In this case I said 80, and then after doing this calculation with the Z score and understanding that

199
00:14:44,940 --> 00:14:50,520
it's normally distributed, I can calculate that the probability of a student in the general population

200
00:14:50,520 --> 00:14:54,720
scoring above an 80 on the test is approximately equal to 8%.

201
00:14:54,750 --> 00:14:57,060
Again, this was for this particular made up data set.
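
The arithmetic for this example can be sketched as follows, with `statistics.NormalDist` from Python's standard library standing in for the Z table. The mean, standard deviation, and cutoff of 80 are the lecture's made-up figures; the lower bound of 60 in the range calculation is an extra assumption added here for illustration.

```python
from statistics import NormalDist

mean, std_dev, x = 69.32, 7.59, 80.0

z = (x - mean) / std_dev
standard = NormalDist()          # standard normal: mean 0, standard deviation 1

p_below = standard.cdf(z)        # P(score < 80)
p_above = 1.0 - p_below          # P(score > 80)
# P(60 < score < 80), with 60 as a hypothetical lower bound.
p_between = p_below - standard.cdf((60.0 - mean) / std_dev)

print(round(z, 3), round(p_above, 2))  # 1.407 0.08
```

The same three quantities (below, above, within a range) are what an online Z score calculator reports.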

202
00:14:58,420 --> 00:15:03,370
So just as in our previous discussions of data distributions, once you've recognized and confirmed

203
00:15:03,370 --> 00:15:10,120
a normal distribution in your data, you can use the Z score relationships or formulas to easily answer

204
00:15:10,120 --> 00:15:13,240
probability questions about intervals in your sample data.

205
00:15:14,760 --> 00:15:19,950
The reason why the normal distribution is so critical to understand is because so many data sets in

206
00:15:19,950 --> 00:15:22,380
nature happen to be normally distributed.

207
00:15:22,410 --> 00:15:26,310
Part of the reason behind that has to do with something known as the Central Limit Theorem, which we're

208
00:15:26,310 --> 00:15:28,320
going to discuss later on in the course.

209
00:15:29,540 --> 00:15:34,670
For now, let's take a closer look at the normal distribution and its unique properties.

210
00:15:34,670 --> 00:15:38,330
But again, keep in mind the main steps and the main ideas.

211
00:15:38,330 --> 00:15:43,670
You take your real world data set, you visualize it and maybe perform a normality test to understand

212
00:15:43,670 --> 00:15:44,960
that it's normally distributed.

213
00:15:45,020 --> 00:15:49,550
From there you calculate the mean and standard deviation, and then from that you're going to end up

214
00:15:49,550 --> 00:15:54,860
using what's known as a Z score to be able to answer questions with probabilistic answers.

215
00:15:54,860 --> 00:15:56,590
So let's learn how to do that.

216
00:15:56,600 --> 00:15:58,070
We'll see you at the next lecture.