1
00:00:00,090 --> 00:00:02,830
We talked in the last lecture about histograms.

2
00:00:02,850 --> 00:00:08,970
Now we want to talk about a very similar type of plot, which is the bar plot. An example of a bar

3
00:00:08,970 --> 00:00:11,670
plot or bar chart looks like this.

4
00:00:11,700 --> 00:00:18,960
Remember that we use histograms for continuous variables like age or time or temperature.

5
00:00:18,960 --> 00:00:22,350
So we have that continuous variable along the horizontal axis.

6
00:00:22,350 --> 00:00:28,380
When we use a bar plot, it's usually for a categorical variable which we put along the horizontal axis.

7
00:00:28,380 --> 00:00:33,900
So here, for example, is a bar plot of the number of times that each continent has hosted the Summer

8
00:00:33,900 --> 00:00:34,560
Olympic Games.

9
00:00:34,560 --> 00:00:41,070
So we have Europe, North America, Asia, Australia and South America, and this is a variable which we could

10
00:00:41,070 --> 00:00:42,330
call continent.

11
00:00:42,330 --> 00:00:49,260
So if we say that along the horizontal axis here, we plot continent, continent is a categorical,

12
00:00:49,260 --> 00:00:51,360
not continuous variable.

13
00:00:51,390 --> 00:00:55,020
Obviously there's nothing continuous about different continents.

14
00:00:55,020 --> 00:01:00,960
There's no continent in between Europe and North America or a continent in between North America and

15
00:01:00,960 --> 00:01:01,620
Asia.

16
00:01:01,620 --> 00:01:08,160
By contrast, if we plot age along the horizontal axis, we can think about in-between values,

17
00:01:08,160 --> 00:01:14,730
because age can take on every value as long as we can measure age to a detailed enough degree.

18
00:01:14,730 --> 00:01:17,460
But there's no in-between concept here with continents.

19
00:01:17,460 --> 00:01:20,400
These are just different buckets, different categories.

20
00:01:20,400 --> 00:01:23,940
So this is a categorical, not continuous variable.

21
00:01:23,940 --> 00:01:30,600
When we have a variable like this, it's really common to use a bar plot or bar chart because it's one

22
00:01:30,600 --> 00:01:34,710
of the simplest ways to summarize and graphically represent

23
00:01:34,710 --> 00:01:40,860
a set of data. We'll usually just put the categorical variable along the horizontal axis and then we'll

24
00:01:40,860 --> 00:01:43,380
put count along the vertical axis.

25
00:01:43,380 --> 00:01:45,960
So we could say here that this is count.

26
00:01:45,960 --> 00:01:50,940
And again, this particular bar chart is the number of times each continent has hosted the Summer Olympic

27
00:01:50,940 --> 00:01:51,510
Games.

28
00:01:51,510 --> 00:01:58,410
When we're talking about a categorical variable, we usually refer to each category as an individual.

29
00:01:58,410 --> 00:02:04,010
So we plot the individuals of the categorical variable along the horizontal axis.

30
00:02:04,020 --> 00:02:09,210
Keep in mind, though, that for bar charts or bar graphs, not only can we create a chart where the

31
00:02:09,210 --> 00:02:15,270
bars are shown vertically, we can also create this same chart where the bars are shown horizontally.

32
00:02:15,270 --> 00:02:18,900
This is exactly the same chart, except that we have flipped the axes.

33
00:02:18,900 --> 00:02:24,840
We have the continents here, the categorical variable along this vertical axis, and then along the

34
00:02:24,840 --> 00:02:26,970
horizontal axis we have the count.

35
00:02:27,000 --> 00:02:30,330
Notice one thing that these charts have in common.

36
00:02:30,330 --> 00:02:37,650
We don't always have to do this, but it can be nice for a bar graph to organize the data from least

37
00:02:37,650 --> 00:02:43,350
to greatest or greatest to least. Notice that we have arranged the individuals in this categorical

38
00:02:43,350 --> 00:02:43,860
variable.

39
00:02:43,860 --> 00:02:48,990
We have arranged the continents in such a way that we're showing the continent that's hosted the Summer

40
00:02:48,990 --> 00:02:54,480
Olympics the most times first, and then all the way down to the fewest times last.

41
00:02:54,480 --> 00:02:57,480
And then same thing here with the bar chart flipped the other way.

42
00:02:57,480 --> 00:03:02,520
We have the continent with the highest count on top, all the way down to the lowest count on the bottom.

43
00:03:02,640 --> 00:03:09,150
It can be nice to arrange the data that way because visually it makes it really easy for us to see highest

44
00:03:09,150 --> 00:03:12,030
count, lowest count and everything in between.

45
00:03:12,030 --> 00:03:17,580
But depending on the variable, sometimes it's not going to make sense to put the data in order by count.

46
00:03:17,580 --> 00:03:23,370
For instance, maybe our categorical variable is months of the year, so January, February, March,

47
00:03:23,370 --> 00:03:25,530
April, etc., all the way to December.

48
00:03:25,530 --> 00:03:32,400
If that's the case, we could certainly rearrange the months based on count and show the greatest count

49
00:03:32,400 --> 00:03:34,830
first and the lowest count last.

50
00:03:34,830 --> 00:03:39,870
But it also might make the most sense just to show the months in order, starting with January and ending

51
00:03:39,870 --> 00:03:40,620
in December.
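
To make that choice concrete, here is a minimal sketch of the two orderings. The month counts are made up purely for illustration (they are not from the lecture), and the names `month_counts` and `calendar` are just placeholders.

```python
# Made-up counts for illustration only; they are not from the lecture.
month_counts = {"January": 4, "February": 9, "March": 2, "April": 7}

# Option 1: order the categories by count, largest first.
by_count = sorted(month_counts, key=month_counts.get, reverse=True)

# Option 2: keep calendar order by sorting against an explicit reference ordering.
calendar = ["January", "February", "March", "April"]
by_calendar = sorted(month_counts, key=calendar.index)

print(by_count)
print(by_calendar)
```

Either ordering is valid; which one we pick depends on what we want the chart to communicate.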

52
00:03:40,620 --> 00:03:46,950
It always comes back to what we're trying to represent and what makes most sense to communicate the

53
00:03:46,950 --> 00:03:48,600
picture we're trying to get across.

54
00:03:48,630 --> 00:03:55,980
Notice also that with these bar graphs, we always show a space or a gap between each bar, whereas

55
00:03:55,980 --> 00:03:59,760
with the histogram we eliminated that space between the bars.

56
00:03:59,760 --> 00:04:05,490
And so we had what looked like one continuous distribution. That was intentional, to indicate that we

57
00:04:05,490 --> 00:04:07,380
were dealing with a continuous variable.

58
00:04:07,410 --> 00:04:14,100
These gaps or breaks or this separation in between each bar indicates or gives us another hint that

59
00:04:14,100 --> 00:04:16,790
we're dealing with a categorical variable.

60
00:04:17,010 --> 00:04:23,430
Realize here that any bar plot or bar chart is associated with a frequency table.

61
00:04:23,430 --> 00:04:31,110
So the raw data for these charts might be the complete historical list of the host city for every Summer

62
00:04:31,110 --> 00:04:32,100
Olympic Games.

63
00:04:32,100 --> 00:04:40,380
And using that raw data, we could make a frequency table. We could take every instance of the Summer

64
00:04:40,380 --> 00:04:45,420
Olympic Games and create a table where we group those games together by continent.

65
00:04:45,420 --> 00:04:47,270
And so our table might look like this.

66
00:04:47,280 --> 00:04:49,080
Europe, North America,

67
00:04:49,890 --> 00:04:54,990
Asia, Australia and South America.

68
00:04:55,170 --> 00:04:58,170
And then our count

69
00:04:59,150 --> 00:05:05,810
of games would be 16, 6, 3, 2 and 1.

70
00:05:05,810 --> 00:05:07,610
So this is continent.

71
00:05:08,570 --> 00:05:15,890
And this is count. This is what we call the frequency table that we can use to build a bar chart

72
00:05:15,890 --> 00:05:17,480
or bar plot or bar graph.
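
As a rough sketch of how that grouping step could be done, assuming the raw data is a flat list with one continent label per Summer Games: the list below is a stand-in built from the counts quoted in this lecture, not the real host-city list, and the names `raw_hosts` and `frequency_table` are placeholders.

```python
from collections import Counter

# Stand-in for the raw data: one entry per Summer Games, labeled by host continent.
# The counts (16, 6, 3, 2, 1) are the ones quoted in the lecture; the flat list
# itself is an assumption made for illustration.
raw_hosts = (
    ["Europe"] * 16
    + ["North America"] * 6
    + ["Asia"] * 3
    + ["Australia"] * 2
    + ["South America"] * 1
)

# Group the raw entries by continent to get the frequency table.
frequency_table = Counter(raw_hosts)

# most_common() lists the categories from largest count to smallest.
for continent, count in frequency_table.most_common():
    print(f"{continent:<14} {count}")
```

The two columns printed here are the continent and count columns of the frequency table.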

73
00:05:17,510 --> 00:05:24,560
Obviously, once we have this frequency table, we would list the individuals in the categorical variable

74
00:05:24,590 --> 00:05:30,110
along the horizontal axis, or, if we're doing our chart horizontally, along the vertical axis.

75
00:05:30,110 --> 00:05:36,290
And then we would just make sure that each bar extends to the correct count of that individual.

76
00:05:36,290 --> 00:05:36,810
So,

77
00:05:36,830 --> 00:05:39,430
Europe has hosted the Summer Olympic Games 16 times.

78
00:05:39,440 --> 00:05:47,870
So when we create a category here for Europe, we show a bar that extends up to this point here, 16

79
00:05:48,080 --> 00:05:53,450
and then we show North America right about here at 6, etc.

80
00:05:53,450 --> 00:05:55,490
And so we plot those bars.
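
One way to sketch the idea that each bar extends to its count is a plain-text horizontal bar chart. This is only an illustration using the counts from the frequency table in this lecture; `text_bar_chart` is a hypothetical helper, not part of any plotting library.

```python
# Frequency table from the lecture: continent -> number of Summer Games hosted.
frequency_table = {
    "Europe": 16,
    "North America": 6,
    "Asia": 3,
    "Australia": 2,
    "South America": 1,
}

def text_bar_chart(table):
    """Return one line per category; each bar is as long as that category's count."""
    width = max(len(name) for name in table)
    return [f"{name:<{width}} | {'#' * count} {count}" for name, count in table.items()]

for line in text_bar_chart(frequency_table):
    print(line)
```

Because the categories sit on the vertical axis here, this matches the flipped version of the chart, with the count running along the horizontal direction.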

81
00:05:55,670 --> 00:06:01,400
And then the last thing we want to say about bar plots is that they are also great for comparing two

82
00:06:01,400 --> 00:06:02,600
different variables.

83
00:06:02,600 --> 00:06:08,750
For instance, maybe we want to compare Summer and Winter Olympic Games. In this chart we're showing

84
00:06:09,550 --> 00:06:13,630
Summer Games in blue and we're showing

85
00:06:14,240 --> 00:06:16,510
Winter Games in green.

86
00:06:16,520 --> 00:06:22,280
If this is the case, we obviously want to make sure to include a legend like this that we've shown

87
00:06:22,280 --> 00:06:22,500
here.

88
00:06:22,520 --> 00:06:28,610
Maybe we would do something like this, but we want to indicate which series is associated with which

89
00:06:28,610 --> 00:06:30,020
set of games.

90
00:06:30,020 --> 00:06:36,590
And these bar charts are a really good way to show the comparison between these, because not only can

91
00:06:36,590 --> 00:06:41,270
we see the comparison between games for each continent, so we can quickly see here that Europe has

92
00:06:41,270 --> 00:06:46,190
hosted the Summer Olympics more than the Winter Olympics, whereas it looks like North America has hosted

93
00:06:46,190 --> 00:06:48,740
an equal number of Summer and Winter Olympic Games,

94
00:06:48,740 --> 00:06:54,920
but we can also quickly compare the Summer Olympic Games by continent and the Winter Olympic Games by

95
00:06:54,920 --> 00:06:55,430
continent.

96
00:06:55,430 --> 00:07:00,200
So there are several things we can do all in one chart and we can do it very quickly.

97
00:07:00,200 --> 00:07:03,440
So simple idea here with bar graphs.

98
00:07:03,440 --> 00:07:08,600
The biggest takeaway is that we use a bar graph when we're dealing with a categorical variable,

99
00:07:08,600 --> 00:07:12,530
whereas we use a histogram when we're dealing with a continuous variable.

100
00:07:12,530 --> 00:07:18,620
And if we're creating a bar graph, we want to make sure to leave space in between the bars.

101
00:07:18,620 --> 00:07:24,440
Whereas when we create the histogram, we want to make sure to show no space at all between the bars

102
00:07:24,440 --> 00:07:29,240
so that we can make it clear whether we're dealing with a histogram or a bar graph.

103
00:07:29,240 --> 00:07:35,420
And then assuming we are dealing with a bar graph, if we're starting from a frequency table like this

104
00:07:35,420 --> 00:07:42,620
one, we just want to decide whether or not it makes more sense to order our data by count like we did

105
00:07:42,620 --> 00:07:49,040
here from largest to smallest count, or whether it makes sense to show the individuals in the categorical

106
00:07:49,040 --> 00:07:51,860
variable in a different order like we talked about earlier.

107
00:07:51,860 --> 00:07:57,170
If our individuals are months of the year, then maybe it makes more sense to show them chronologically

108
00:07:57,170 --> 00:08:05,960
as January, February, March, etc., even if the counts then aren't in order from largest to smallest.

