1
00:00:00,090 --> 00:00:06,060
So we've talked about different ways to measure the center of a data set and the spread of that data

2
00:00:06,060 --> 00:00:06,540
set.

3
00:00:06,660 --> 00:00:13,590
Now we want to talk about the quartiles and IQR, or interquartile range, of a data set so that we can

4
00:00:13,590 --> 00:00:17,190
understand what they can tell us about our data.

5
00:00:17,340 --> 00:00:22,740
So the easiest way to explain quartiles and IQR is to look at a data set.

6
00:00:22,950 --> 00:00:25,530
Let's say that we have this data set here.

7
00:00:25,680 --> 00:00:31,650
There are 18 values in this set, ranging from 66 to 75.

8
00:00:31,800 --> 00:00:38,810
And we're going to say that this data represents the golf scores of 18 golfers.

9
00:00:38,820 --> 00:00:42,900
So maybe the golf course is a par 72 course.

10
00:00:43,020 --> 00:00:48,870
The best golfer shot a 66 and the worst golfer shot a 75.

11
00:00:48,960 --> 00:00:50,580
So we have 18 golfers.

12
00:00:50,580 --> 00:00:52,110
This is our data set.

13
00:00:52,110 --> 00:00:58,500
And the first thing we want to say is that this idea of a quartile is related to what we already know

14
00:00:58,500 --> 00:01:01,200
about the median of the data set.

15
00:01:01,200 --> 00:01:05,850
So if you remember when we talked about the median, we talked about it as the center of the set

16
00:01:05,850 --> 00:01:12,330
when we ordered the set from smallest to largest and found the center value.

17
00:01:12,450 --> 00:01:13,620
So let's start there.

18
00:01:13,620 --> 00:01:15,030
Let's think about the median.

19
00:01:15,120 --> 00:01:19,500
If we look at this data set, there are 18 values in the set, as we said.

20
00:01:19,500 --> 00:01:24,390
And so we have here the lower nine values and the upper nine values.

21
00:01:24,390 --> 00:01:34,740
Therefore, we know that the median is found by taking the mean of these two center values here.

22
00:01:35,010 --> 00:01:41,970
So if we take the mean of these two values, we would add 69 and 69 and then divide by two.

23
00:01:41,970 --> 00:01:44,970
And of course, the result there will be 69.

24
00:01:44,970 --> 00:01:49,770
So the median here of this data set is 69.
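
As a rough sketch of that median calculation in Python: the list below is an assumed reconstruction of the 18 golf scores, chosen only to agree with the values mentioned in this lesson, since the on-screen data set itself is not part of the transcript.

```python
# Assumed data: 18 golf scores reconstructed to be consistent with the values
# mentioned in the lesson (min 66, max 75, median 69). Not the original data.
scores = [66, 67, 67, 68, 68, 68, 69, 69, 69,
          69, 69, 70, 70, 71, 71, 72, 73, 75]

ordered = sorted(scores)
n = len(ordered)                      # 18, an even number of values
lower_mid = ordered[n // 2 - 1]       # the 9th value
upper_mid = ordered[n // 2]           # the 10th value
median = (lower_mid + upper_mid) / 2  # mean of the two center values

print(median)
```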

25
00:01:49,770 --> 00:01:56,070
Now, when we talk about the quartiles of a data set, we usually talk about the first, second and

26
00:01:56,070 --> 00:01:57,390
third quartiles.

27
00:01:57,390 --> 00:02:04,290
And in general, just at a broad level, you want to think about the first quartile as the median of

28
00:02:04,290 --> 00:02:06,060
this lower half of the data.

29
00:02:06,360 --> 00:02:11,790
You want to think about the third quartile as the median of the upper half of the data.

30
00:02:12,030 --> 00:02:16,230
And you want to think about the second quartile as being equal to the median.

31
00:02:16,230 --> 00:02:24,330
So we can refer to the first quartile sometimes as Q one, sometimes we'll refer to it as the first

32
00:02:24,330 --> 00:02:26,490
quartile or the lower quartile.

33
00:02:26,490 --> 00:02:32,430
So we'll call it the first quartile or we'll call it the lower quartile.

34
00:02:32,460 --> 00:02:35,040
We can also think about it as the

35
00:02:35,800 --> 00:02:38,860
25th percentile of the data.

36
00:02:38,980 --> 00:02:45,610
On the other hand, the third quartile we'll often call Q three or the third quartile.

37
00:02:45,610 --> 00:02:53,050
So instead of the first, we'll call it the third or we'll call it the upper quartile and we can think

38
00:02:53,050 --> 00:02:57,400
about it also as the 75th percentile of the data.

39
00:02:57,400 --> 00:03:02,290
And then the median here we call that the second quartile or Q two.

40
00:03:02,320 --> 00:03:09,190
So for Q two, or the second quartile, instead of the lower or the upper quartile, we would say that this

41
00:03:09,190 --> 00:03:13,900
is the median because the second quartile is equal to the median.

42
00:03:13,900 --> 00:03:22,390
And we would also refer to this as the 50th percentile because the median divides

43
00:03:22,390 --> 00:03:28,750
the lower half of the data from the upper half of the data, meaning that 50% of the data points will

44
00:03:28,750 --> 00:03:33,370
fall below the median and 50% of the data points will fall above the median.

45
00:03:33,370 --> 00:03:35,830
And therefore it's the 50th percentile.

46
00:03:35,830 --> 00:03:45,010
In that same way, 25% of the data points will fall below the first quartile, whereas 75% of the data

47
00:03:45,010 --> 00:03:48,970
points will fall below the third quartile or below the upper quartile.

48
00:03:49,000 --> 00:03:56,770
So the upper quartile divides the lower 75% of the data points from the upper 25% of the data points.

49
00:03:56,770 --> 00:04:04,210
The first quartile divides the lower 25% of the data points from the upper 75% of the data points.

50
00:04:04,210 --> 00:04:07,150
So these are our three quartiles.

51
00:04:07,150 --> 00:04:13,240
And one of the reasons that these values are interesting to us is because they lead us to what we call

52
00:04:13,240 --> 00:04:14,860
a five number summary.

53
00:04:15,010 --> 00:04:21,880
And the five number summary gives us a really good picture of the center and the spread of the data

54
00:04:21,880 --> 00:04:23,500
set at the same time.

55
00:04:23,500 --> 00:04:29,320
So in the past we looked at the center of the data set as mean, median, or mode, and then we looked separately

56
00:04:29,320 --> 00:04:34,060
at the spread of the data set as the variance or standard deviation.

57
00:04:34,060 --> 00:04:39,190
But the five number summary kind of gives us a picture of both the center and the spread at the same

58
00:04:39,190 --> 00:04:39,760
time.

59
00:04:39,760 --> 00:04:45,610
So that five number summary is always going to include the same five values.

60
00:04:45,610 --> 00:04:49,510
It will include the minimum and maximum values from the data set.

61
00:04:49,510 --> 00:04:52,270
So the lowest value and the largest value.

62
00:04:52,270 --> 00:05:01,090
In the case of this data set, of course, that's 66 for the minimum value here and 75 for the maximum

63
00:05:01,090 --> 00:05:01,690
value.

64
00:05:01,720 --> 00:05:05,140
It will include the median, which of course we're already familiar with.

65
00:05:05,140 --> 00:05:11,830
And for this data set, we already calculated that that median was 69. And then it'll include the first

66
00:05:11,830 --> 00:05:18,190
and third quartile, or the lower quartile and the upper quartile. And without even calculating the lower

67
00:05:18,190 --> 00:05:23,620
and upper quartiles for this particular data set, you can already start to get a sense that this five

68
00:05:23,620 --> 00:05:29,500
number summary gives us a pretty good picture of the center and spread of the data set, because not

69
00:05:29,500 --> 00:05:36,460
only do we have the median as the center, but we also get the full range from the minimum to the maximum value.

70
00:05:36,490 --> 00:05:43,990
Let's go ahead and say, while we're at it, that we define the range of a data set as being equal to

71
00:05:43,990 --> 00:05:48,040
the maximum value minus the minimum value.

72
00:05:48,040 --> 00:05:51,640
So in this case, the range of the data set is 75

73
00:05:52,360 --> 00:05:56,860
minus 66, or in our case, a range of nine.

74
00:05:56,980 --> 00:05:59,800
So we get the minimum, the maximum, the total range.

75
00:05:59,800 --> 00:06:01,930
We can see the center as the median.

76
00:06:02,140 --> 00:06:04,540
And these Q one and Q three values

77
00:06:04,540 --> 00:06:11,650
give us a picture of the middle 50% of the data set because we know everything below

78
00:06:11,680 --> 00:06:19,060
Q one, below the first quartile, is the first 25% of the values in the set, and we know everything above

79
00:06:19,090 --> 00:06:27,970
Q three is the last 25%, or the greatest 25%, of the values in the set, but everything between Q one

80
00:06:27,970 --> 00:06:35,170
and Q three between the first and third quartiles represents that middle 50% of the data set of the

81
00:06:35,170 --> 00:06:36,540
values in the data set.

82
00:06:36,550 --> 00:06:39,160
And so those two values taken together,

83
00:06:39,190 --> 00:06:40,600
Q one and Q three,

84
00:06:40,630 --> 00:06:46,410
give us a little bit of a picture of how tightly clustered the data is around the median.

85
00:06:46,420 --> 00:06:52,660
If the difference between Q one and Q three is fairly small, then we know that most of the data is

86
00:06:52,660 --> 00:06:54,340
tightly clustered around the median.

87
00:06:54,340 --> 00:07:00,190
But if the difference between Q one and Q three is large, then we know that the data is more spread

88
00:07:00,190 --> 00:07:00,700
out.

89
00:07:00,790 --> 00:07:07,720
So this five number summary given by these quartiles here is just another way to get a sense of what

90
00:07:07,720 --> 00:07:11,030
our data set is doing. Now, while we're here,

91
00:07:11,050 --> 00:07:14,860
let's go ahead and talk about how to calculate these quartile values.

92
00:07:14,860 --> 00:07:18,880
We already know how to calculate the second quartile because it's equal to the median and we know how

93
00:07:18,880 --> 00:07:19,900
to find the median.

94
00:07:19,900 --> 00:07:20,830
But how do we find

95
00:07:20,860 --> 00:07:22,560
Q one and Q three?

96
00:07:22,570 --> 00:07:29,500
Well, the answer to that is a little nuanced, because technically there are different ways of calculating

97
00:07:29,500 --> 00:07:35,820
the first and the third quartile, and they're not always going to give exactly the same answer.

98
00:07:35,830 --> 00:07:42,780
There's no universally agreed upon method for finding the exact value of Q one and Q three.

99
00:07:42,790 --> 00:07:49,270
Just know that any accepted method we use is still going to give us a good approximation of what Q one

100
00:07:49,270 --> 00:07:55,030
and Q three actually are. One of the methods you can use for calculating these two quartiles,

101
00:07:55,030 --> 00:08:02,320
and maybe the most straightforward one, is to start by considering whether the original data set has

102
00:08:02,320 --> 00:08:05,260
an even or an odd number of data points.

103
00:08:05,260 --> 00:08:08,980
So in this case, this data set has 18 data points.

104
00:08:08,980 --> 00:08:12,940
So we have a data set with an even number of data points.

105
00:08:12,940 --> 00:08:19,600
If there are an even number of data points in the original set, then of course we already know, as

106
00:08:19,600 --> 00:08:26,650
we saw here, that the median is going to be found by taking the mean of these two numbers in the center,

107
00:08:26,650 --> 00:08:31,150
because with an even number of data points, the data is going to split into two halves perfectly.

108
00:08:31,150 --> 00:08:35,890
In this case, we have nine values in the lower half and nine values in the upper half.

109
00:08:35,890 --> 00:08:40,450
So of course, the median is going to be found by taking the mean of these two middle numbers.

110
00:08:40,450 --> 00:08:46,990
And then in order to find the first and third quartile, we can just look at each half of the data set

111
00:08:46,990 --> 00:08:50,590
and find the median of the lower and upper halves.

112
00:08:50,590 --> 00:08:57,490
So if we just look at the lower half here, starting with this lowest value, 66 and going up to this

113
00:08:57,490 --> 00:09:00,760
69 value right here, there are nine values.

114
00:09:00,760 --> 00:09:05,350
Now in this lower half, we can consider the median of these nine values.

115
00:09:05,350 --> 00:09:10,270
Well, of course, if we have nine values and we're looking for the median, we know we're going to

116
00:09:10,270 --> 00:09:15,760
have four values on the low side of this half, four values on the high side of this half.

117
00:09:15,760 --> 00:09:20,860
And then the median is this fifth value in the center right here, 68.

118
00:09:20,860 --> 00:09:24,100
And then we would consider the same thing with the upper half.

119
00:09:24,100 --> 00:09:31,960
The upper half contains nine data points ranging from this 69 value right here up to 75.

120
00:09:31,960 --> 00:09:39,280
So if we ignore the first four data points, 69, 69, 70 and 70, and we ignore the last four data

121
00:09:39,280 --> 00:09:48,280
points, 71, 72, 73 and 75, the median of this upper half is this value here, 71, and we can fill

122
00:09:48,280 --> 00:09:53,940
in our five number summary and say 68 and 71.
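
Here is a minimal sketch of that median-of-halves method for an even number of data points. The data list is an assumed reconstruction that agrees with the values quoted in this lesson, and the function name five_number_summary_even is just a hypothetical helper for illustration.

```python
from statistics import median

# Assumed data: reconstructed to match the values quoted in the lesson.
scores = [66, 67, 67, 68, 68, 68, 69, 69, 69,
          69, 69, 70, 70, 71, 71, 72, 73, 75]

def five_number_summary_even(values):
    """Five number summary using the median of each half (even-sized set)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # lower nine values for our 18 scores
    upper = ordered[half:]   # upper nine values for our 18 scores
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

minimum, q1, med, q3, maximum = five_number_summary_even(scores)
print(minimum, q1, med, q3, maximum)
```

With the assumed scores, this gives a minimum of 66, Q one of 68, a median of 69, Q three of 71, and a maximum of 75, matching the values we found by hand.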

123
00:09:53,950 --> 00:10:00,550
Now once we have Q one and Q three, going back to what we said earlier about the five number summary,

124
00:10:00,550 --> 00:10:06,100
the difference between the first and third quartiles gives us a picture of how spread out the data is

125
00:10:06,100 --> 00:10:12,310
around the median, because what this five number summary tells us is that 50% of all of the values

126
00:10:12,310 --> 00:10:18,250
in the data set fall between 68 and 71 around a median of 69.

127
00:10:18,250 --> 00:10:22,900
That gives us a sense of how tightly clustered the data is around the median.

128
00:10:22,900 --> 00:10:25,450
And that value is important.

129
00:10:25,450 --> 00:10:29,080
That's the interquartile range that we mentioned at the beginning.

130
00:10:29,200 --> 00:10:37,990
The IQR of a data set is equal to the difference between the third quartile and the first quartile.

131
00:10:37,990 --> 00:10:48,280
So in our case, 71 for the third quartile minus 68 for the first quartile, or three.

132
00:10:48,610 --> 00:10:51,160
So the interquartile range, or the IQR, of this

133
00:10:51,210 --> 00:10:53,250
particular data set is three.

134
00:10:53,250 --> 00:10:54,760
It's Q three minus

135
00:10:54,780 --> 00:10:57,900
Q one, the difference between the first and the third quartiles.

136
00:10:57,900 --> 00:11:00,720
So that's what our five number summary looks like.

137
00:11:00,720 --> 00:11:01,700
That's how we calculate

138
00:11:01,710 --> 00:11:08,130
Q one, Q three and the median when we have an even number of data points in our set.

139
00:11:08,160 --> 00:11:10,950
But what about if we have an odd number of data points in the set?

140
00:11:10,950 --> 00:11:19,290
So this is the same data set except for the fact that we have removed one value of 68 from the set.

141
00:11:19,290 --> 00:11:24,150
So now we have 17 golf scores instead of 18 golf scores.

142
00:11:24,150 --> 00:11:30,990
We have an odd number of values in the data set and one common method for computing the first and the

143
00:11:30,990 --> 00:11:37,020
third quartiles, when we have an odd number of data points, is to again first find the median.

144
00:11:37,020 --> 00:11:44,280
So if we look at the lower and upper halves of the data because we have 17 values here in the set,

145
00:11:44,280 --> 00:11:50,700
we've underlined the first eight and the last eight, leaving us with just this center value here of

146
00:11:50,700 --> 00:11:54,900
69 as the median. And one method for computing

147
00:11:54,900 --> 00:12:02,100
Q one and Q three when we have an odd number of data points is just to exclude this median value when

148
00:12:02,100 --> 00:12:04,050
we calculate Q one and Q three.

149
00:12:04,050 --> 00:12:10,740
So excluding the median, we just look at this lower half of the data, these eight values here, and

150
00:12:10,740 --> 00:12:14,370
we find the median of these eight values.

151
00:12:14,370 --> 00:12:23,940
So looking at the median here, we have four values 66, 67, 67, 68, and then four values 68 through

152
00:12:23,940 --> 00:12:24,780
69 here.

153
00:12:24,780 --> 00:12:27,690
So the median should fall right here.

154
00:12:27,720 --> 00:12:34,920
And we know, of course, that that means that we are taking the mean of these two values here.

155
00:12:34,920 --> 00:12:44,070
And so the mean of these two is 68, and the first quartile will be 68. To find Q three, the upper quartile,

156
00:12:44,070 --> 00:12:46,560
we consider just these eight values.

157
00:12:46,650 --> 00:12:53,610
We know that the median should fall between the lower four values and the upper four values.

158
00:12:53,610 --> 00:13:00,900
So right here and therefore we know that that means that we are taking the mean of these two values

159
00:13:00,900 --> 00:13:04,440
and we get an upper quartile of 71.

160
00:13:04,440 --> 00:13:13,470
So in this case, removing that value of 68 didn't change the value of the median or Q one or Q three.

161
00:13:13,500 --> 00:13:15,900
All three values stayed the same.

162
00:13:15,900 --> 00:13:18,600
But of course that won't always be the case.
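
The exclude-the-median method for an odd number of data points could be sketched like this. The 17 scores are an assumed reconstruction (the earlier assumed set with one 68 removed), and quartiles_excluding_median is a hypothetical helper name.

```python
from statistics import median

# Assumed data: the reconstructed 18 scores with one value of 68 removed.
scores_17 = [66, 67, 67, 68, 68, 69, 69, 69, 69,
             69, 70, 70, 71, 71, 72, 73, 75]

def quartiles_excluding_median(values):
    """Return (Q1, median, Q3), leaving the median out of both halves
    when the number of data points is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the center value; for even n, the halves split cleanly.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, med, q3 = quartiles_excluding_median(scores_17)
print(q1, med, q3)
```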

163
00:13:18,600 --> 00:13:24,450
So these methods that we showed here to calculate Q one and Q three when we had an even number of data

164
00:13:24,450 --> 00:13:26,490
points or an odd number of data points,

165
00:13:26,490 --> 00:13:28,560
you can always use these methods.

166
00:13:28,560 --> 00:13:35,700
They'll work, but other methods sometimes have us include this median value here when we calculate

167
00:13:35,700 --> 00:13:39,930
Q one and Q three instead of excluding it as we did with our method.

168
00:13:39,930 --> 00:13:46,500
Sometimes the method for an even number of data points up here will have us calculate this median

169
00:13:46,500 --> 00:13:53,280
and then include this median as if we had an extra value of 69 right here and then go ahead and find

170
00:13:53,280 --> 00:13:54,810
the lower and upper quartile.

171
00:13:54,810 --> 00:14:01,680
So there are slight differences, but as you can imagine, they're always going to produce fairly similar

172
00:14:01,680 --> 00:14:02,580
results.
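
To see how an include-the-median variant might behave, here is one possible reading of it as a sketch. Both data lists are assumed reconstructions, the helper name quartiles_including_median is hypothetical, and other textbooks may define the variant differently.

```python
from statistics import median

# Assumed data: reconstructed to match the values quoted in the lesson.
scores_18 = [66, 67, 67, 68, 68, 68, 69, 69, 69,
             69, 69, 70, 70, 71, 71, 72, 73, 75]
scores_17 = [66, 67, 67, 68, 68, 69, 69, 69, 69,
             69, 70, 70, 71, 71, 72, 73, 75]

def quartiles_including_median(values):
    """Return (Q1, median, Q3), counting the median as part of both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    med = median(ordered)
    if n % 2:
        # Odd n: the center value belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # Even n: treat the computed median as an extra value in each half.
        lower, upper = ordered[:half] + [med], [med] + ordered[half:]
    return median(lower), med, median(upper)

even_result = quartiles_including_median(scores_18)
odd_result = quartiles_including_median(scores_17)
print(even_result, odd_result)
```

With these assumed scores, the 17-value set still gives 68 and 71, while the 18-value set gives an upper quartile of 70.5 instead of 71: a slight difference, but a fairly similar result.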

173
00:14:02,580 --> 00:14:08,250
So the most important thing here is just understanding what the first and third quartiles are, how

174
00:14:08,250 --> 00:14:13,800
they fit into this idea of a five number summary to give us an idea of the center and spread of the

175
00:14:13,800 --> 00:14:18,300
data set and at least one method for calculating the three quartiles.

176
00:14:18,300 --> 00:14:24,690
And of course we calculated these quartile values by hand, but we can also use computers to help us

177
00:14:24,690 --> 00:14:27,900
find these values, especially as the data set gets larger.

178
00:14:27,900 --> 00:14:34,650
For instance, you can use the quartile function in Excel to calculate any of the five values in the

179
00:14:34,650 --> 00:14:40,950
five number summary. The quartile function will just ask you for your array, which will be the range

180
00:14:40,950 --> 00:14:46,470
of cells where you have your data set in the spreadsheet and then for the quartile that you want the

181
00:14:46,470 --> 00:14:48,210
function to return.

182
00:14:48,210 --> 00:14:54,570
And so you would enter the array and then either the number zero, if you want the minimum value, the

183
00:14:54,570 --> 00:15:01,110
number one, if you want the first quartile, or the numbers two, three or four, if you want the median,

184
00:15:01,110 --> 00:15:06,120
third quartile or maximum value of that array, of that data set.

185
00:15:06,120 --> 00:15:12,450
So instead of trying to do this by hand, if we had inputted this entire data set into Excel, we could

186
00:15:12,450 --> 00:15:18,690
have used the quartile function, highlighted this data set, and then input the number one into the

187
00:15:18,690 --> 00:15:21,480
function for Excel to return to us

188
00:15:21,480 --> 00:15:25,200
the first quartile, 68, of this data set.
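
If we wanted to mimic that array-plus-number interface outside of a spreadsheet, a sketch might look like the following. The helper quartile below is hypothetical and is not Excel's implementation; it assumes linear interpolation between neighboring sorted values, which is our understanding of how Excel's QUARTILE function behaves. The data is again an assumed reconstruction.

```python
# Assumed data: reconstructed to match the values quoted in the lesson.
scores = [66, 67, 67, 68, 68, 68, 69, 69, 69,
          69, 69, 70, 70, 71, 71, 72, 73, 75]

def quartile(values, quart):
    """quart: 0 = minimum, 1 = Q1, 2 = median, 3 = Q3, 4 = maximum.
    Assumption: linear interpolation between neighboring sorted values."""
    if quart not in (0, 1, 2, 3, 4):
        raise ValueError("quart must be 0, 1, 2, 3, or 4")
    ordered = sorted(values)
    position = (quart / 4) * (len(ordered) - 1)  # fractional 0-based index
    lo = int(position)
    hi = min(lo + 1, len(ordered) - 1)
    frac = position - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

summary = [quartile(scores, k) for k in range(5)]
print(summary)
```

Under these assumptions the first quartile comes back as 68, as in the lesson, but the third quartile comes back as 70.75 rather than the 71 we found by hand, which is another example of different methods giving slightly different answers.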

189
00:15:25,200 --> 00:15:32,430
So the last thing we want to say is that it's very common for us to use this concept of quartiles to

190
00:15:32,430 --> 00:15:34,890
identify outliers in the data set.

191
00:15:34,890 --> 00:15:40,380
We talked a little bit about outliers before when we talked about measuring the center of a data set,

192
00:15:40,380 --> 00:15:47,520
but determining which values in the set technically qualify as outliers can be more of an art than a

193
00:15:47,520 --> 00:15:48,000
science.

194
00:15:48,000 --> 00:15:50,580
But one method that people use is to

195
00:15:50,900 --> 00:15:56,840
employ these quartiles to give sort of benchmarks for outliers in the data set.

196
00:15:56,840 --> 00:16:08,570
And so sometimes we will say that any value in the set that is lower than the first quartile minus 1.5 times

197
00:16:08,690 --> 00:16:17,450
the IQR of the data set will be an outlier in the data. Or similarly, any value that is larger than

198
00:16:17,450 --> 00:16:25,790
the third quartile plus 1.5 times the IQR, we'll consider any value larger than this value to be

199
00:16:25,790 --> 00:16:28,070
an outlier on the upper end of the data.

200
00:16:28,070 --> 00:16:34,550
So you can imagine with this data set right here, we said that the first quartile was equal to 68,

201
00:16:34,550 --> 00:16:43,580
so let's say 68 and then we'll say minus 1.5 times the IQR, which we calculated to be three.

202
00:16:43,940 --> 00:16:51,560
So we have 68 minus 4.5, which gives us a value of 63.5.

203
00:16:51,560 --> 00:16:58,580
And so by this method, if we choose to use this method for determining outliers, we would say that

204
00:16:58,580 --> 00:17:05,270
any value less than 63.5 would be considered an outlier of this data set.

205
00:17:05,270 --> 00:17:13,339
So if we have 66, 67, 67 all the way to 75, if this is our data set, anything less than 63 and a

206
00:17:13,339 --> 00:17:15,319
half, we would consider an outlier.

207
00:17:15,319 --> 00:17:24,740
And similarly on the high end, if we take Q three, which we know is 71 for this set, so 71 plus 1.5

208
00:17:24,740 --> 00:17:34,100
times an IQR of three is 71 plus 4.5, or 75.5.

209
00:17:34,190 --> 00:17:41,600
So any value greater than 75.5 we could consider to be an outlier for this data set.
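
The two outlier cutoffs can be worked out in a few lines. This is a sketch using the Q one and Q three values from the lesson; the data list is an assumed reconstruction, and the variable names are just illustrative.

```python
# Assumed data: reconstructed to match the values quoted in the lesson.
scores = [66, 67, 67, 68, 68, 68, 69, 69, 69,
          69, 69, 70, 70, 71, 71, 72, 73, 75]

q1, q3 = 68, 71                 # quartiles found by hand in the lesson
iqr = q3 - q1                   # interquartile range: 3

lower_fence = q1 - 1.5 * iqr    # 68 - 4.5 = 63.5
upper_fence = q3 + 1.5 * iqr    # 71 + 4.5 = 75.5

# Any score outside the two fences is flagged as an outlier by this rule.
outliers = [s for s in scores if s < lower_fence or s > upper_fence]
print(lower_fence, upper_fence, outliers)
```

Since every score in the assumed set lies between 66 and 75, none of them falls outside the two cutoffs, so this rule flags no outliers here.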

210
00:17:41,600 --> 00:17:47,180
Again, this isn't some objective, universally accepted standard for outliers.

211
00:17:47,180 --> 00:17:54,170
It's just one way that we could maybe set some parameters to start getting an idea for what would we

212
00:17:54,170 --> 00:17:58,460
even consider an outlier on the low end or an outlier on the high end

213
00:17:58,460 --> 00:18:05,600
for a data set like this one. It sort of gives us some boundaries, some gates around the data set,

214
00:18:05,600 --> 00:18:13,250
a sense of which values are unusually low or unusually high given the rest of the data that we have.

215
00:18:13,250 --> 00:18:19,820
So again, there's a lot that we can do with this concept of quartiles and interquartile range, but all of

216
00:18:19,820 --> 00:18:27,590
this information is really just more context around the center and the spread of a data set.

