1
00:00:00,120 --> 00:00:05,670
Now we want to start talking about distribution plots and specifically the histogram, which is a type

2
00:00:05,670 --> 00:00:07,010
of distribution plot.

3
00:00:07,020 --> 00:00:14,820
So this is an example of a histogram. We have age along the horizontal axis and population in thousands

4
00:00:14,820 --> 00:00:16,560
along the vertical axis.

5
00:00:16,560 --> 00:00:25,020
So this value here, 50, means 50,000, 100 means 100,000, etc., and the ages along the horizontal axis are

6
00:00:25,020 --> 00:00:26,610
ages zero and up.

7
00:00:26,610 --> 00:00:32,009
And then for the second part of the histogram here, ages ten and up, which means this first bar or

8
00:00:32,009 --> 00:00:35,790
this first bin is ages 0 to 9.

9
00:00:35,790 --> 00:00:44,160
This bin right here is ages 10 to 19, this bin is 20 to 29, etc. all the way up to this last bin,

10
00:00:44,160 --> 00:00:47,820
which is 100 to 109.

11
00:00:47,820 --> 00:00:52,620
This is a histogram that, for example, might represent the population of a city.

12
00:00:52,620 --> 00:00:58,350
So maybe in this city, if we're trying to read the histogram, what it tells us is that there are about

13
00:00:58,350 --> 00:01:06,360
25,000 zero to nine year olds living in the city, maybe about, let's say, 40,000 10 to 19 year

14
00:01:06,360 --> 00:01:12,570
olds living in the city all the way up to the largest group, which is the 40 to 49 year olds, of which

15
00:01:12,570 --> 00:01:15,720
it looks like there might be about 170,000.

16
00:01:15,750 --> 00:01:19,500
Here are the things that we want to understand about a histogram.

17
00:01:19,500 --> 00:01:25,230
First of all, this horizontal axis feature, whatever characteristic we're plotting along the horizontal

18
00:01:25,230 --> 00:01:29,250
axis, in this case age, needs to be continuous.

19
00:01:29,250 --> 00:01:36,930
So age is a continuous characteristic, time is continuous, height and weight are both continuous because

20
00:01:36,930 --> 00:01:43,110
all of these things can be broken down into smaller and smaller and smaller measurements, whereas a

21
00:01:43,110 --> 00:01:45,810
different kind of characteristic might be discrete.

22
00:01:45,810 --> 00:01:52,410
So in the last lesson, we looked at line plots and we plotted months along the horizontal axis.

23
00:01:52,410 --> 00:01:58,980
So January, February, March, April, etc. While months are similar to time, if we're specifically

24
00:01:58,980 --> 00:02:04,410
talking about months, there's nothing in between January and February, there's nothing in between

25
00:02:04,410 --> 00:02:05,700
February and March.

26
00:02:05,700 --> 00:02:08,699
Those are distinct discrete buckets.

27
00:02:08,699 --> 00:02:11,820
We're not talking about a continuous characteristic.

28
00:02:11,820 --> 00:02:18,360
Compare that to something like this here with age where in this 0 to 9 years old bucket, we could have

29
00:02:18,360 --> 00:02:24,330
someone who is exactly four years old, but we could also have someone who is four years, three months,

30
00:02:24,330 --> 00:02:31,530
27 days, 4 hours, 6 minutes and 32.3 seconds old, and they would fall into this bucket, and

31
00:02:31,530 --> 00:02:32,790
there's a clean cutoff.

32
00:02:32,790 --> 00:02:37,740
The moment that person turns ten years old, they would graduate from this first bucket into this second

33
00:02:37,740 --> 00:02:38,190
bucket.

34
00:02:38,190 --> 00:02:44,340
So this feature along the horizontal axis needs to be continuous if we're going to use a histogram to

35
00:02:44,340 --> 00:02:45,570
represent the data.

36
00:02:45,570 --> 00:02:53,760
Now, because of the continuity of the data set, we sketch a bar for each bin or class that we list

37
00:02:53,760 --> 00:03:02,010
out here along the horizontal axis, and we include no gaps between the bars in our plot. Including

38
00:03:02,010 --> 00:03:08,580
no gaps suggests the continuous nature of this characteristic along the horizontal axis.

39
00:03:08,580 --> 00:03:14,610
When we talk about bar plots or bar charts, which we'll look at later, they'll look a lot like histograms,

40
00:03:14,610 --> 00:03:20,640
except that we will leave a little space or a little gap between the bars. Here, with the histogram,

41
00:03:20,640 --> 00:03:23,130
we leave no space or no gap between the bars.

42
00:03:23,130 --> 00:03:29,520
So right here or right here, there's no gap between these bars.

43
00:03:29,520 --> 00:03:34,970
And that's because of the continuous nature of this characteristic along the horizontal axis.

44
00:03:34,980 --> 00:03:41,520
Now, that being said, of course, like we've talked about, each bar is a count of the number of occurrences

45
00:03:41,520 --> 00:03:45,090
that fall into this bin or class or interval.

46
00:03:45,090 --> 00:03:51,990
So the height of each bar represents the number of occurrences that we find of people ages 0 to 9 in

47
00:03:51,990 --> 00:03:53,010
this city.

48
00:03:53,010 --> 00:03:58,410
This bar here, the height of it, represents the number of occurrences of people that we find between

49
00:03:58,410 --> 00:04:00,840
ages ten and 19 in this city.

50
00:04:00,840 --> 00:04:06,810
Now, one of the reasons that a histogram is really helpful is because it simplifies what could otherwise

51
00:04:06,840 --> 00:04:11,010
be a very large and messy data set. Here,

52
00:04:11,010 --> 00:04:15,120
we can clearly tell the population of this city is very large.

53
00:04:15,120 --> 00:04:16,980
It's multiple hundreds of thousands.

54
00:04:16,980 --> 00:04:23,430
If we were to try to plot a value along our chart here for the age of every single person in the city,

55
00:04:23,430 --> 00:04:29,430
the chart would quickly get very out of control because again, we would need a bar for a person of

56
00:04:29,430 --> 00:04:30,300
each age.

57
00:04:30,300 --> 00:04:36,840
And so if we had two people whose ages were different by even one day or even a couple of hours or a

58
00:04:36,840 --> 00:04:40,710
second, that would be a different individual bar on our chart.

59
00:04:40,710 --> 00:04:47,730
It would be a different individual age or timestamp along our horizontal axis that would get extremely

60
00:04:47,730 --> 00:04:48,690
overwhelming.

61
00:04:48,690 --> 00:04:53,040
It would make our chart so messy that it would be unreasonable to interpret it.

62
00:04:53,040 --> 00:04:58,680
What makes a lot more sense is that when we have several hundreds of thousands of people, we group

63
00:04:58,680 --> 00:04:59,820
them into bins

64
00:05:00,330 --> 00:05:06,990
and then create this histogram that gives us a much clearer, simpler picture of the population. When

65
00:05:06,990 --> 00:05:08,820
we're creating a histogram,

66
00:05:09,000 --> 00:05:14,520
we want to make sure that our buckets or bins or classes, whatever we're using here along the horizontal

67
00:05:14,520 --> 00:05:19,620
axis, that bin size, that class width, is always the same.

68
00:05:19,620 --> 00:05:27,240
So we have here, for this example, ten year increments. We have ages 0 to 9, 10 to 19, 20 to 29,

69
00:05:27,240 --> 00:05:31,800
30 to 39, 40 to 49, all the way up to 100 to 109.

70
00:05:31,800 --> 00:05:35,070
Keeping that bin width or that class width

71
00:05:35,070 --> 00:05:38,940
the same is the only way that the histogram makes sense.

72
00:05:38,940 --> 00:05:47,910
If we instead have one bin for ages 0 to 9, another one for 10 to 19, and then a bin for 20 to 49,

73
00:05:47,910 --> 00:05:52,320
and then maybe 50 to 59, and then 60 all the way up to 109,

74
00:05:52,320 --> 00:05:57,360
it's going to make the data almost meaningless because those bin widths aren't the same.

75
00:05:57,360 --> 00:06:04,710
So we can't interpret or compare different bins to get an accurate picture of what our age distribution

76
00:06:04,710 --> 00:06:05,670
is looking like.

77
00:06:05,670 --> 00:06:11,340
So we need to make sure that our bin width, that our class width is always equivalent so that this

78
00:06:11,340 --> 00:06:16,050
distribution is consistent across each bucket along the horizontal axis.

79
00:06:16,050 --> 00:06:18,870
So we know that we have to keep the bin width the same.

80
00:06:18,870 --> 00:06:22,920
But how do we know how many bins to use based on our data?

81
00:06:23,070 --> 00:06:27,240
Well, that's a little bit more of an art than it is a perfect science.

82
00:06:27,240 --> 00:06:32,880
But again, as we've said before, with all of these kinds of charts and plots, we're just trying to

83
00:06:32,880 --> 00:06:35,670
communicate a clear picture of the data.

84
00:06:35,670 --> 00:06:41,550
So we need to use whatever number of bins will communicate the idea we're trying to convey.

85
00:06:41,550 --> 00:06:48,270
So for instance, with this age data, for these age intervals across our city's population, we can

86
00:06:48,270 --> 00:06:52,950
see that we have people from ages zero all the way to 109.

87
00:06:52,950 --> 00:07:00,480
So we could certainly break that data into two buckets, classifying people into either the 0 to 54

88
00:07:00,480 --> 00:07:07,110
group or the 55 to 109 group and just have literally two bins or two classes.

89
00:07:07,110 --> 00:07:13,260
And while that's not wrong, it certainly wouldn't give us as clear of a picture as this histogram.

90
00:07:13,260 --> 00:07:17,820
This gives us much more information about the age distribution of our population.

91
00:07:17,820 --> 00:07:23,040
So with an age range like this, we probably wouldn't want to use two buckets, but we also wouldn't

92
00:07:23,040 --> 00:07:30,300
want to use, let's say 55 buckets because that would break this up into only two year bins or two year

93
00:07:30,300 --> 00:07:33,960
increments and that would maybe be more detailed than we need.

94
00:07:33,960 --> 00:07:37,380
So it's about finding something reasonable in the middle

95
00:07:37,410 --> 00:07:43,170
that gives us a good picture of the distribution without giving us more detail than we really need.

96
00:07:43,170 --> 00:07:50,550
And hopefully you have a good gut feel for if our age range is essentially 0 to 110, then breaking

97
00:07:50,550 --> 00:07:56,370
up that interval into ten year buckets gives us 11 bars in our distribution.

98
00:07:56,370 --> 00:07:58,950
That's probably pretty reasonable.

99
00:07:58,950 --> 00:08:04,410
An 11 bar distribution should give us a pretty good picture without having buckets that are so big

100
00:08:04,410 --> 00:08:08,070
that we don't get enough data or so small that we get too much data.

101
00:08:08,070 --> 00:08:13,470
So we're just looking for a good middle ground and anything within that middle ground should give us

102
00:08:13,470 --> 00:08:15,150
a decent histogram.

103
00:08:15,150 --> 00:08:20,550
If you build a histogram and you feel like the bin width was too wide or too narrow, you can always

104
00:08:20,550 --> 00:08:25,500
make an adjustment to get a less detailed distribution or a more detailed distribution.

105
00:08:25,500 --> 00:08:31,800
Now for some of the simple math that goes along with this. Let's say that we wanted to build a histogram

106
00:08:31,800 --> 00:08:36,240
from the raw data of the ages of the people who live in this city.

107
00:08:36,240 --> 00:08:41,580
We want to think about the smallest value in the dataset and the largest value in the dataset.

108
00:08:41,580 --> 00:08:44,370
So in this case, the smallest value is age zero.

109
00:08:44,400 --> 00:08:46,890
The largest value is age 109.

110
00:08:46,890 --> 00:08:54,330
So to get the entire interval, because we're including both zero and 109, we first find the difference,

111
00:08:54,330 --> 00:08:59,400
109 - 0, the largest value minus the smallest value.

112
00:08:59,400 --> 00:09:05,640
But if we're including both of those values in the dataset, both of those endpoints, then the subtraction

113
00:09:05,640 --> 00:09:10,530
is what's called inclusive and we have to add one to get the full range of the data.

114
00:09:10,530 --> 00:09:15,570
In other words, if you were to use your fingers to count starting at zero, so counting on your fingers,

115
00:09:15,570 --> 00:09:19,440
you say zero, one, two, three, four, all the way up to 109,

116
00:09:19,440 --> 00:09:21,810
you would actually count on your fingers to

117
00:09:22,570 --> 00:09:27,030
110 because you're including that zero value and that 109 value.

118
00:09:27,040 --> 00:09:29,080
That's why we add that plus one.

119
00:09:29,080 --> 00:09:33,190
So we have this range of 110 values.

120
00:09:33,190 --> 00:09:40,240
So then we just need to pick whatever seems to be a reasonable class interval or class width or bin

121
00:09:40,240 --> 00:09:40,690
width.

122
00:09:40,690 --> 00:09:45,520
And let's say that we want to divide this up into every ten years like we did.

123
00:09:45,520 --> 00:09:52,870
So if we divide this by ten, then that means that we will have 11 different buckets or 11 different

124
00:09:52,870 --> 00:09:56,080
classes, each with a width of ten.
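The arithmetic just described can be written out as a short Python sketch. The variable names are illustrative assumptions, not part of the lesson, and rounding up with `math.ceil` is an added safeguard for ranges that don't divide evenly by the chosen width.

```python
import math

# Values from the lesson: the youngest and oldest whole-year ages in the data.
smallest_age = 0
largest_age = 109

# Inclusive range: add one because both endpoints are counted.
value_range = largest_age - smallest_age + 1

# Chosen class width (a judgment call; the lesson uses ten years).
bin_width = 10

# Round up so a partial final bin would still get its own class.
bin_count = math.ceil(value_range / bin_width)

print(value_range, bin_count)  # 110 11
```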

125
00:09:56,080 --> 00:09:59,380
So then we take our smallest value, which is zero.

126
00:09:59,410 --> 00:10:01,330
We know the interval is ten.

127
00:10:01,330 --> 00:10:07,840
So if we again count on our fingers from zero, so 0, 1, 2, all the way up until we get to our 10th finger,

128
00:10:07,840 --> 00:10:11,230
then the upper bound of that first interval is nine.

129
00:10:11,230 --> 00:10:17,590
So we know that our first bin, our first class is ages 0 to 9, and then from there we can find the

130
00:10:17,590 --> 00:10:23,380
rest of our classes: ages 10 to 19, ages 20 to 29.

131
00:10:23,380 --> 00:10:30,220
And notice that the difference between all of the class lower bounds will always be the bin width.

132
00:10:30,220 --> 00:10:33,880
So the difference between zero and ten, ten and 20 is always ten.

133
00:10:33,880 --> 00:10:39,010
And the difference between the upper bounds of all of our intervals will also be that same bin width.

134
00:10:39,010 --> 00:10:44,140
So the difference between nine and 19, between 19 and 29 is always ten.

135
00:10:44,140 --> 00:10:50,470
So whatever width we find for each class or each bin, we should see reflected here between the lower

136
00:10:50,470 --> 00:10:53,140
bounds and here between the upper bounds.

137
00:10:53,140 --> 00:11:02,320
And if we keep going here, 30 to 39, all the way down to our highest value, we get to 100 to 109.

138
00:11:02,320 --> 00:11:09,100
If we then count the number of bins that we have, we should have 11 bins since we calculated that bin

139
00:11:09,100 --> 00:11:13,060
count earlier, so 11 bins each with a width of ten.
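As a rough check on the bounds described above, the classes can be generated in Python. This is only a sketch for whole-year ages, and the names `bins`, `lower_steps`, and `upper_steps` are made up for illustration.

```python
smallest_age = 0
bin_width = 10
bin_count = 11

# Each class starts a whole number of widths above the smallest value and
# ends one year before the next class begins.
bins = [
    (smallest_age + i * bin_width, smallest_age + (i + 1) * bin_width - 1)
    for i in range(bin_count)
]

# Differences between neighboring lower bounds and neighboring upper bounds.
lower_steps = {b[0] - a[0] for a, b in zip(bins, bins[1:])}
upper_steps = {b[1] - a[1] for a, b in zip(bins, bins[1:])}

print(bins[0], bins[1], bins[-1])  # (0, 9) (10, 19) (100, 109)
print(lower_steps, upper_steps)    # {10} {10}
```

Both sets contain only the value 10, which matches the point that the lower bounds and the upper bounds each step by the bin width.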

140
00:11:13,090 --> 00:11:20,320
Then once we have the range of each bin, we can in our raw data, count the number of occurrences that

141
00:11:20,320 --> 00:11:21,400
fall within that bin.

142
00:11:21,400 --> 00:11:28,420
So if we have our raw data, we would count up the number of people in our data ages 0 to 9, and we

143
00:11:28,420 --> 00:11:30,190
would record that value.

144
00:11:30,190 --> 00:11:33,700
So let's say that's 25,000 people.

145
00:11:33,700 --> 00:11:37,630
We would count up the number of people who have an age between ten and 19.

146
00:11:37,630 --> 00:11:40,600
And let's say, based on our graph here, that that's about

147
00:11:41,320 --> 00:11:42,370
40,000 people.

148
00:11:42,370 --> 00:11:46,660
So we would count the number of occurrences in each bin or in each class.

149
00:11:46,660 --> 00:11:52,480
And then to sketch the histogram, we just place each class, each bin along the horizontal axis in

150
00:11:52,480 --> 00:11:58,810
order, and then we place counts here along our vertical axis and we plot the height of each bar in

151
00:11:58,810 --> 00:12:03,910
the histogram according to the number of occurrences that we see in each bin or each class.

152
00:12:03,910 --> 00:12:08,320
And that is how we build a histogram from a raw data set.
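The counting step can be sketched in Python as follows. The function name `histogram_counts` and the sample ages are hypothetical, invented for illustration; real data for a city would have hundreds of thousands of entries.

```python
from collections import Counter

def histogram_counts(ages, smallest=0, bin_width=10):
    """Return {(lower, upper): count} for equal-width whole-year age bins."""
    counts = Counter()
    for age in ages:
        # Floor division finds the bin, so an age of 9.99 still lands in 0 to 9.
        index = int((age - smallest) // bin_width)
        lower = smallest + index * bin_width
        counts[(lower, lower + bin_width - 1)] += 1
    return dict(sorted(counts.items()))

# Hypothetical raw ages, made up for illustration only.
sample_ages = [4, 4.26, 9.99, 10, 17, 23, 41, 45, 48, 109]
counts = histogram_counts(sample_ages)

for (lower, upper), count in counts.items():
    # A row of '#' characters stands in for the height of each bar.
    print(f"{lower:>3} to {upper:<3} {'#' * count}")
```

Note that this sketch only reports bins that contain at least one value; a drawn histogram would still reserve space on the horizontal axis for the empty bins.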

153
00:12:08,320 --> 00:12:14,440
That's the basic idea behind a histogram, which is our first type here of distribution plot.

154
00:12:14,470 --> 00:12:20,380
These are going to become more and more important as we go forward because what we realize is that this

155
00:12:20,380 --> 00:12:29,650
distribution can eventually create for us a probability distribution if we connect the top of each bar

156
00:12:29,920 --> 00:12:37,900
in the histogram, if we find the midpoint of each class and then we sketch a point at the top of each

157
00:12:37,900 --> 00:12:38,680
bar here.

158
00:12:38,680 --> 00:12:46,030
And if we eventually connect these points with a smooth curve, what we get is a distribution that models

159
00:12:46,030 --> 00:12:52,660
our raw data and eventually we can use that distribution to answer probability questions about how likely

160
00:12:52,660 --> 00:12:58,810
it is, for instance, that choosing a person at random in our city will give us a person between the

161
00:12:58,810 --> 00:13:02,560
ages of 40 and 49 or between the ages of 70 and 79.
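One simple way to see the link to probability, before any smooth curve is fitted, is to divide each bin count by the total. The Python sketch below uses relative frequencies rather than the smooth curve the lesson describes. Only the 0 to 9, 10 to 19, and 40 to 49 figures echo the rough values read off the chart; every other count is invented for illustration.

```python
# Bin counts in thousands. Apart from the three figures mentioned in the
# lesson (25, 40, and 170), these numbers are made up for this example.
counts_in_thousands = {
    (0, 9): 25, (10, 19): 40, (20, 29): 90, (30, 39): 140, (40, 49): 170,
    (50, 59): 120, (60, 69): 80, (70, 79): 50, (80, 89): 20, (90, 99): 4,
    (100, 109): 1,
}
total = sum(counts_in_thousands.values())

def probability(bin_range):
    """Share of the population that falls in the given age bin."""
    return counts_in_thousands[bin_range] / total

print(round(probability((40, 49)), 3))
print(round(probability((70, 79)), 3))
```

Under these assumed counts, a randomly chosen person is more likely to be 40 to 49 than 70 to 79, simply because the first bar is taller.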

162
00:13:02,560 --> 00:13:08,530
So not only is the histogram an important plot or chart like the scatter plots and line plots that we've

163
00:13:08,530 --> 00:13:12,580
looked at so far and like the charts and graphs that we'll look at next,

164
00:13:12,580 --> 00:13:17,620
but this kind of distribution plot will also be important for us to remember as we move forward to learn

165
00:13:17,620 --> 00:13:20,890
more about probability and statistics.