1
00:00:00,120 --> 00:00:06,060
Now that we understand how to calculate the mean, variance, and standard deviation for a normal data set,

2
00:00:06,060 --> 00:00:11,100
we want to talk about the equation of the probability density function of the normal distribution.

3
00:00:11,340 --> 00:00:19,510
So this function, f of x, is the standard form of the probability density function for a normal distribution.

4
00:00:19,530 --> 00:00:23,820
Notice that it relies upon the standard deviation, given here by sigma.

5
00:00:23,820 --> 00:00:28,250
We also see it here, and the mean, mu, which we see here.

6
00:00:28,260 --> 00:00:33,870
So in order to build the equation for the probability density function, we need to have the mean and

7
00:00:33,870 --> 00:00:36,740
standard deviation that we learned to calculate earlier.
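
The function shown on screen is not reproduced in this transcript. Assuming it is the usual textbook form, with mean \mu and standard deviation \sigma, it would read as follows (the symbols are the conventional ones, not copied from the slide):

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}
```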

8
00:00:36,750 --> 00:00:45,900
So let's say, for instance, that we had from our dataset calculated a mean of five and a standard

9
00:00:45,900 --> 00:00:52,950
deviation of three, then we could plug these values into the probability density function and we would

10
00:00:52,950 --> 00:00:58,380
get one over three times the square root of two pi.

11
00:00:58,380 --> 00:01:07,170
And then Euler's number, e, here raised to the power of negative one half multiplied by x minus our mean of

12
00:01:07,170 --> 00:01:13,590
five, so x minus five, all divided by the standard deviation, so three, quantity squared.

13
00:01:13,590 --> 00:01:20,400
And now we have here the equation for the probability density function of the normal distribution that

14
00:01:20,400 --> 00:01:23,470
has a mean of five and a standard deviation of three.
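
As a rough sketch of the substitution just described, the density for a mean of five and a standard deviation of three can be evaluated in a few lines of Python. The function name `normal_pdf` and the sample points are illustrative choices, not something from the lecture.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

# The example from the lecture: mean 5, standard deviation 3.
for x in (2, 5, 8):
    print(x, normal_pdf(x, mu=5, sigma=3))
```

The values at 2 and 8 should match, since both points sit one standard deviation from the mean and the curve is symmetric.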

15
00:01:23,490 --> 00:01:28,550
Now, we talked earlier about how a normal distribution is a bell-shaped curve.

16
00:01:28,560 --> 00:01:34,050
Well, here are some examples of these probability density functions we've been talking about.

17
00:01:34,050 --> 00:01:40,890
In fact, this distribution in green here is the distribution for the normally distributed data with

18
00:01:40,890 --> 00:01:43,140
mean five and standard deviation three.

19
00:01:43,140 --> 00:01:47,250
In red, we have a mean of five and a standard deviation of two.

20
00:01:47,250 --> 00:01:51,540
And in blue we have a mean of five and a standard deviation of one.

21
00:01:51,570 --> 00:01:59,010
We can see that the mean is five for all three distributions because we see here that all three distributions

22
00:01:59,010 --> 00:02:00,750
are centered at five.

23
00:02:00,750 --> 00:02:06,630
Remember that for any normal distribution, the mean is right here at the center of the curve and for

24
00:02:06,630 --> 00:02:09,900
any normal distribution, this is also where we'll find the median.

25
00:02:09,900 --> 00:02:14,190
In other words, in a normal distribution, the mean and median are always equal.

26
00:02:14,220 --> 00:02:20,550
Now what we notice is that the curve, the graph of the probability density function, is of course just

27
00:02:20,550 --> 00:02:25,830
dictated by these two values, the mean and the standard deviation, because those are the only two

28
00:02:25,830 --> 00:02:30,720
values we need in order to plug into this formula for the probability density function.

29
00:02:30,720 --> 00:02:38,640
And that's why we can express any normal distribution as just its mean and standard deviation, or what

30
00:02:38,640 --> 00:02:44,160
we actually use for our standard notation, which is variance, of course just the square of standard

31
00:02:44,160 --> 00:02:44,910
deviation.

32
00:02:44,910 --> 00:02:50,610
So whenever we're dealing with any normal distribution, we will often just express it this way: the

33
00:02:50,610 --> 00:02:53,940
normal distribution of the mean, comma, the variance.

34
00:02:53,940 --> 00:03:00,300
And so instead of having to express this entire probability density function or give extra information

35
00:03:00,300 --> 00:03:05,160
or a sketch of the curve, all we have to do is say we're dealing with the normal distribution

36
00:03:05,160 --> 00:03:10,110
five comma one, and that indicates a mean of five and a variance of one.

37
00:03:10,110 --> 00:03:15,000
Here we have a normal distribution with a mean of five and a variance of four, and here we have a normal

38
00:03:15,000 --> 00:03:18,030
distribution with a mean of five and a variance of nine.

39
00:03:18,030 --> 00:03:23,310
Now we always give the variance, but of course we know that standard deviation is just the square root

40
00:03:23,310 --> 00:03:24,300
of variance.

41
00:03:24,300 --> 00:03:30,030
So based on these expressions of these three normal curves which correspond to these three curves that

42
00:03:30,030 --> 00:03:38,790
we already sketched, we know that the standard deviations here are one, two and three, the square roots

43
00:03:38,790 --> 00:03:41,070
of one, four and nine respectively.
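
The step from the mean-comma-variance notation to a standard deviation can be sketched as below. The list `distributions` and the helper `std_from_variance` are hypothetical names used only for this illustration.

```python
import math

# (mean, variance) pairs in the mean-comma-variance notation used in the lecture.
distributions = [(5, 1), (5, 4), (5, 9)]

def std_from_variance(variance):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance)

for mean, variance in distributions:
    print(f"N({mean}, {variance}): standard deviation = {std_from_variance(variance)}")
```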

44
00:03:41,070 --> 00:03:46,620
So just from this notation here, we can immediately see the mean and the variance.

45
00:03:46,620 --> 00:03:52,350
We can take the square root of variance to get standard deviation, and then we can use this information

46
00:03:52,350 --> 00:03:58,590
to plug into the equation of the probability density function for a normal curve and use that equation

47
00:03:58,590 --> 00:04:00,450
then to sketch the curve.

48
00:04:00,450 --> 00:04:06,990
If we graph this equation after we've plugged in the mean and the standard deviation, we will get this

49
00:04:06,990 --> 00:04:09,180
picture of the normal curve.

50
00:04:09,180 --> 00:04:14,610
Let's notice again here, too, that as the variance and standard deviation increase, the normal curve

51
00:04:14,640 --> 00:04:16,950
gets flatter and wider.
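
One way to see the flattening numerically is to look at the height of the curve at its mean, which under the usual density formula is one over sigma times the square root of two pi. This is a small sketch under that assumption; `peak_height` is an illustrative name.

```python
import math

def peak_height(sigma):
    """Height of the normal density at its mean, 1 / (sigma * sqrt(2*pi))."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))

# Standard deviations of the blue, red and green curves from the lecture.
for sigma in (1, 2, 3):
    print(f"sigma = {sigma}: peak height = {peak_height(sigma):.4f}")
```

The peak drops as sigma grows, and because the total area under the curve stays at one, the curve has to widen to compensate.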

52
00:04:16,950 --> 00:04:21,180
And that's because variance and standard deviation are measures of spread.

53
00:04:21,180 --> 00:04:26,430
And so when we see a higher value for variance and therefore a higher value for standard deviation,

54
00:04:26,460 --> 00:04:29,610
that means the data is spread out further from the mean.

55
00:04:29,610 --> 00:04:36,300
The data gets pushed out further away from the mean to both the right and left, as opposed to when

56
00:04:36,300 --> 00:04:38,730
variance and standard deviation are smaller,

57
00:04:38,730 --> 00:04:45,330
and we see more of the area under the curve, more of the data pushed closer to the mean on the right

58
00:04:45,330 --> 00:04:46,350
and the left hand side.

59
00:04:46,350 --> 00:04:48,480
That data is pulled in toward the mean.

60
00:04:48,480 --> 00:04:51,720
And so we see the curve rise up and become taller.

61
00:04:51,720 --> 00:04:53,790
That normal curve is taller.

62
00:04:53,790 --> 00:04:56,700
The data is more tightly clustered around the mean.

63
00:04:56,700 --> 00:04:59,910
The spread is smaller. But as long as our data

64
00:05:00,070 --> 00:05:05,680
creates this symmetric bell shape, following this form of a probability density function,

65
00:05:05,680 --> 00:05:12,310
we know that we have normally distributed data, regardless of whether the curve is taller and more narrow,

66
00:05:12,310 --> 00:05:16,900
like this blue curve or shorter and wider like this green curve.

67
00:05:16,930 --> 00:05:23,260
Now, that being said, let's talk about one of the most important conclusions we can draw from a normal

68
00:05:23,260 --> 00:05:30,970
distribution, and that is breaking up the area under the curve into one, two and three standard deviations

69
00:05:30,970 --> 00:05:32,020
around the mean.

70
00:05:32,350 --> 00:05:38,920
So here we have a normal curve, we have the mean here at MU, and we've separated the area under the

71
00:05:38,920 --> 00:05:41,440
curve based on standard deviation.

72
00:05:41,440 --> 00:05:49,780
So this value right here, mu plus sigma, means the mean plus one standard deviation, whereas mu plus

73
00:05:49,780 --> 00:05:53,920
two sigma here means the mean plus two standard deviations.

74
00:05:53,920 --> 00:06:00,430
So here for this green curve, we said that the mean was here at five and that the standard deviation

75
00:06:00,430 --> 00:06:01,540
was three.

76
00:06:01,540 --> 00:06:06,880
So one standard deviation above the mean has to be five plus three or eight.

77
00:06:06,880 --> 00:06:13,630
So we see that here at eight, and then one standard deviation below the mean has to be five minus three

78
00:06:13,630 --> 00:06:15,730
or over here at two.

79
00:06:15,730 --> 00:06:23,380
And so we can see for this green curve that we have the mean here and then this is one standard deviation

80
00:06:23,380 --> 00:06:26,740
below the mean and one standard deviation above the mean.

81
00:06:26,740 --> 00:06:35,590
Contrast that with this blue curve here where we see the mean at the same value of five.

82
00:06:35,590 --> 00:06:41,230
And then we said that standard deviation was equal to one, which means one standard deviation above

83
00:06:41,230 --> 00:06:46,540
the mean of five has to be at six here, and one standard deviation below the mean

84
00:06:46,540 --> 00:06:50,260
has to be at five minus one, or four, right here.

85
00:06:50,290 --> 00:06:57,910
So we have four and six and we can say that for the blue curve, an interval of one standard deviation

86
00:06:57,910 --> 00:07:00,610
around the mean ranges from 4 to 6.

87
00:07:00,610 --> 00:07:06,520
Whereas for the green curve, an interval of one standard deviation around the mean ranges from 2 to

88
00:07:06,520 --> 00:07:12,610
8, a much wider interval for one standard deviation around the mean, which of course makes sense because

89
00:07:12,610 --> 00:07:18,250
we know that for the green curve the data is more spread out, variance and standard deviation are larger.

90
00:07:18,250 --> 00:07:23,560
That measure of spread is larger, the data is more spread out away from the mean, whereas for the

91
00:07:23,560 --> 00:07:26,500
blue curve, the data is more tightly clustered around the mean.

92
00:07:26,500 --> 00:07:28,720
We see that smaller standard deviation.

93
00:07:28,720 --> 00:07:34,720
So for the normal distribution, we're particularly interested in these intervals that define one standard

94
00:07:34,720 --> 00:07:40,630
deviation around the mean between mu minus sigma and mu plus sigma, two standard deviations around

95
00:07:40,630 --> 00:07:46,810
the mean between mu minus two sigma and mu plus two sigma and three standard deviations around the mean

96
00:07:46,810 --> 00:07:50,230
between mu minus three sigma and mu plus three sigma.
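
The interval arithmetic described here can be sketched for the blue and green curves as follows. The function `sigma_interval` is a hypothetical helper introduced only for this example.

```python
def sigma_interval(mu, sigma, k):
    """Return the interval (mu - k*sigma, mu + k*sigma)."""
    return (mu - k * sigma, mu + k * sigma)

# Blue curve: mean 5, standard deviation 1. Green curve: mean 5, standard deviation 3.
for label, sigma in (("blue", 1), ("green", 3)):
    for k in (1, 2, 3):
        print(label, k, sigma_interval(5, sigma, k))
```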

97
00:07:50,260 --> 00:07:59,560
Because what we know to always be true for normally distributed data is that about 68% of the data will always

98
00:07:59,560 --> 00:08:02,500
fall within one standard deviation of the mean.

99
00:08:02,500 --> 00:08:09,430
And because the normal distribution is always symmetric around the mean, that means 34% of the data,

100
00:08:09,430 --> 00:08:17,380
or we could also say 34% of the area under the probability density function, under this curve, or 34%

101
00:08:17,380 --> 00:08:23,950
of the probability has to fall between one standard deviation below the mean and the mean, and 34%

102
00:08:23,950 --> 00:08:29,110
of the probability or the area under the curve has to fall between the mean and one standard deviation

103
00:08:29,110 --> 00:08:30,040
above the mean.

104
00:08:30,040 --> 00:08:36,970
We also know that for every normal distribution, about 95% of the area under the curve will fall within

105
00:08:36,970 --> 00:08:45,160
two standard deviations of the mean, and that about 99.7% of the area under the curve or the probability

106
00:08:45,160 --> 00:08:48,040
will fall within three standard deviations of the mean.
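
The rounded 68, 95 and 99.7 figures can be checked against the normal distribution itself. This sketch assumes the standard identity that the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)); `prob_within` is an illustrative name.

```python
import math

def prob_within(k):
    """P(|X - mu| < k*sigma) for any normal distribution, via the error function."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * prob_within(k):.2f}%")
```

The computed values come out close to 68.27%, 95.45% and 99.73%, which is why the lecture says "about" for the rounded percentages.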

107
00:08:48,070 --> 00:08:54,790
These numbers, this idea, these percentages are called the empirical rule or, more plainly, the

108
00:08:54,790 --> 00:09:00,670
68, 95, 99.7 rule, given by these values: 68, 95, 99.7.

109
00:09:00,670 --> 00:09:06,850
But the empirical rule is what tells us that for every normal distribution, for every set of normally

110
00:09:06,850 --> 00:09:12,550
distributed data, for every probability density function, representing a normal curve, any time we

111
00:09:12,550 --> 00:09:18,430
have this bell shaped curve that is normally distributed, then we can always break up the area under

112
00:09:18,430 --> 00:09:24,490
the curve based on these percentages, which allows us to answer all kinds of probability questions.

113
00:09:24,490 --> 00:09:30,430
For instance, with this information alone, if we know that our normally distributed data has a mean

114
00:09:30,430 --> 00:09:35,440
of five and a standard deviation of three, going back to this curve that we started with here and we

115
00:09:35,440 --> 00:09:40,720
want to answer the question, what is the probability that a randomly selected value in our data set

116
00:09:40,720 --> 00:09:46,210
will fall between five and eight, or will have a value between five and eight?

117
00:09:46,210 --> 00:09:52,630
Well, we know that the mean is five, and that the boundary here of one standard deviation above the

118
00:09:52,630 --> 00:09:54,010
mean is eight.

119
00:09:54,010 --> 00:09:59,080
So the probability that one randomly selected data point is going to fall between five and eight,

120
00:09:59,110 --> 00:09:59,890
that's just

121
00:09:59,970 --> 00:10:04,590
this particular section of the normal distribution.

122
00:10:04,590 --> 00:10:11,460
And we know that 34% of the area under the curve falls within this interval.

123
00:10:11,460 --> 00:10:17,100
And so we know right away that there's a 34% chance that a randomly selected value from our data set

124
00:10:17,100 --> 00:10:19,640
will take on a value between five and eight.
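
The 34% figure can be compared with the value from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Normal distribution with mean 5 and standard deviation 3, as in the lecture.
dist = NormalDist(mu=5, sigma=3)

# Area under the curve between the mean and one standard deviation above it.
prob_5_to_8 = dist.cdf(8) - dist.cdf(5)
print(f"P(5 < X < 8) = {prob_5_to_8:.4f}")
```

The result is roughly 0.3413, in line with the rounded 34% from the empirical rule.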

125
00:10:19,650 --> 00:10:23,850
And of course, we can use these percentages in an infinite number of other ways.

126
00:10:23,850 --> 00:10:26,030
So take this same data set.

127
00:10:26,040 --> 00:10:29,430
What's the probability that we get a value less than five?

128
00:10:29,430 --> 00:10:30,960
Well, five is the mean.

129
00:10:30,960 --> 00:10:36,450
And in a normally distributed data set, 50% of the data is below the mean, 50% of the data is above

130
00:10:36,450 --> 00:10:36,780
the mean.

131
00:10:36,780 --> 00:10:40,890
So the probability of getting a value less than five is 50%.

132
00:10:40,890 --> 00:10:44,670
The probability of getting a value greater than five is also 50%.

133
00:10:44,670 --> 00:10:48,240
What about the probability of getting a value less than 11?

134
00:10:48,240 --> 00:10:51,840
For this normal curve here, the mean is five.

135
00:10:51,840 --> 00:10:53,640
The standard deviation is three.

136
00:10:53,760 --> 00:10:58,320
So one standard deviation above the mean is five plus three equals eight.

137
00:10:58,320 --> 00:11:02,130
Two standard deviations above the mean is eight plus three equals 11.

138
00:11:02,130 --> 00:11:04,920
So 11 is two standard deviations above the mean.

139
00:11:04,950 --> 00:11:09,300
11 puts us right here at two standard deviations above the mean.

140
00:11:09,300 --> 00:11:14,430
And so finding the probability that we get a value less than 11 would just mean that we add up all of

141
00:11:14,430 --> 00:11:21,810
these probabilities, starting with 13.5 and then 34, 34, 13.5 and 2.35.

142
00:11:21,810 --> 00:11:27,270
Or how about the probability that we get a value outside three standard deviations around the mean?

143
00:11:27,360 --> 00:11:34,290
Well, we know that within three standard deviations of the mean, we have 99.7% of the area under the

144
00:11:34,290 --> 00:11:40,830
curve, which means the area outside of three standard deviations is 0.3%.

145
00:11:41,070 --> 00:11:47,430
If we divide that by two to put half of that area into each tail, we see these values right here.

146
00:11:47,430 --> 00:11:54,060
So instead of the total area in these two tails being 0.3%, in this upper tail here we have 0.15%.

147
00:11:54,060 --> 00:12:00,630
And in the lower tail here, we have 0.15% of the area outside of three standard deviations away from

148
00:12:00,630 --> 00:12:01,170
the mean.

149
00:12:01,170 --> 00:12:07,080
And that 0.15% captures all of the area under the curve infinitely

150
00:12:07,080 --> 00:12:10,020
out to the left here, and then infinitely out to the right here.

151
00:12:10,020 --> 00:12:17,550
So within three standard deviations of the mean, we have 99.7% of the area, 99.7% of the probability.

152
00:12:17,550 --> 00:12:25,170
This 0.15% in each tail captures everything else, as far as this tail goes out to the left or as far

153
00:12:25,170 --> 00:12:29,760
as this tail goes out to the right, everything outside of three standard deviations.
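
The band percentages read off the slide (34, 13.5, 2.35 and 0.15) all follow from the three rounded figures of the rule, using symmetry around the mean. A small sketch of that arithmetic, with illustrative names:

```python
# Rounded percentages from the 68, 95, 99.7 rule.
within_1, within_2, within_3 = 68.0, 95.0, 99.7

def band_percentages():
    """Split the area on one side of the mean into bands by standard deviation."""
    return {
        "0 to 1 sd": within_1 / 2,
        "1 to 2 sd": (within_2 - within_1) / 2,
        "2 to 3 sd": (within_3 - within_2) / 2,
        "beyond 3 sd": (100.0 - within_3) / 2,
    }

for band, pct in band_percentages().items():
    print(f"{band}: {pct:.2f}%")
```

Each band appears once on each side of the mean, so doubling the four values and adding them gives back 100%.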

154
00:12:29,760 --> 00:12:35,670
So then, for instance, if we wanted to say what's the probability that we get a value less than 11

155
00:12:35,670 --> 00:12:41,250
going back to this normal curve here with a mean of five and a standard deviation of three?

156
00:12:41,340 --> 00:12:51,840
Well, if our mean is five and then one standard deviation above the mean is five plus three or eight,

157
00:12:51,870 --> 00:12:54,330
then two standard deviations

158
00:12:55,150 --> 00:12:55,540
above

159
00:12:55,540 --> 00:13:01,300
the mean is five plus six, or 11, or eight plus three equals 11.

160
00:13:01,300 --> 00:13:07,750
So we know that 11 for this curve puts us at two standard deviations above the mean, which is right

161
00:13:07,750 --> 00:13:08,230
here.

162
00:13:08,230 --> 00:13:11,480
So for the probability that we get a value less than 11,

163
00:13:11,500 --> 00:13:20,270
all we have to do is add up all these percentages: 0.15, 2.35, 13.5, 34, 34 and 13.5.

164
00:13:20,290 --> 00:13:30,430
Or to take the easier route since probability always sums to 100% or one, we can just take 100% and

165
00:13:30,430 --> 00:13:34,090
we can subtract from that this 2.35

166
00:13:35,060 --> 00:13:48,470
percent value and this 0.15% value, and we get a result of 97.5%, which tells us that we have a 97.5%

167
00:13:48,470 --> 00:13:51,620
chance of finding a value less than 11.
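
As a cross-check on the 97.5% figure, the same probability can be computed from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative, and the small gap between the two results comes from the rounding built into the empirical rule.

```python
from statistics import NormalDist

# Empirical-rule estimate: everything except the top two bands (2.35% and 0.15%).
rule_estimate = 100.0 - 2.35 - 0.15

# Value from the normal CDF for mean 5, standard deviation 3.
exact = 100.0 * NormalDist(mu=5, sigma=3).cdf(11)

print(f"empirical rule: {rule_estimate:.2f}%  normal CDF: {exact:.2f}%")
```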

168
00:13:51,650 --> 00:13:56,060
Given this normally distributed curve with these properties, a mean of five and a standard deviation of

169
00:13:56,060 --> 00:14:02,540
three, it's this idea of the empirical rule and these standardized percentages, because the normal

170
00:14:02,540 --> 00:14:09,290
distribution is a standard curve that's going to allow us, as we continue on here, to answer all kinds

171
00:14:09,290 --> 00:14:15,200
of probability questions and then even start doing some statistical analysis on normally distributed

172
00:14:15,200 --> 00:14:22,430
data, which ultimately leads us to an idea about significance, being able to start with a hypothesis,

173
00:14:22,430 --> 00:14:29,420
run an experiment or collect data, and then determine the significance of the result of that data to

174
00:14:29,420 --> 00:14:33,160
determine how unusual it is to find the result that we did.

175
00:14:33,170 --> 00:14:36,320
We'll talk all about that when we get to hypothesis testing.

176
00:14:36,320 --> 00:14:43,160
But the idea here is that this normal distribution we see over and over and over again in real life

177
00:14:43,160 --> 00:14:49,880
and almost more importantly, even when we collect data and we determine that it's not normally distributed,

178
00:14:49,880 --> 00:14:57,080
we'll learn about a way that we can sort of change that non-normal data into normally distributed data

179
00:14:57,080 --> 00:15:03,050
so that we're able to use this normal distribution and all of these probability figures that go along

180
00:15:03,050 --> 00:15:03,560
with it.

181
00:15:03,560 --> 00:15:09,710
And so because we'll be using this normal distribution over and over and over again for normally distributed

182
00:15:09,710 --> 00:15:17,180
data and even non-normally distributed data, it's critical that we understand this idea of this symmetric

183
00:15:17,180 --> 00:15:23,330
bell shaped normal distribution, where the empirical rule can give us the likelihood that a particular

184
00:15:23,330 --> 00:15:28,670
data point falls within some certain standard deviation interval around the mean.

