1
00:00:00,090 --> 00:00:06,000
The last couple of plot types we want to talk about are violin plots and kernel density estimation plots.

2
00:00:06,000 --> 00:00:10,860
And both of these are based on the histogram that we already looked at earlier.

3
00:00:10,860 --> 00:00:19,620
So the process of kernel density estimation, or KDE, is what allows us to produce a smooth curve out of

4
00:00:19,620 --> 00:00:20,460
a histogram.

5
00:00:20,640 --> 00:00:25,920
So as an example, if we take this histogram that we were working with earlier, when we first studied

6
00:00:25,920 --> 00:00:32,700
histograms, we remind ourselves that this is a distribution essentially, and kernel density estimation

7
00:00:32,700 --> 00:00:39,510
allows us to create a distribution curve or a density curve out of this histogram, for instance, something

8
00:00:39,510 --> 00:00:41,190
that maybe looks like this.

9
00:00:41,190 --> 00:00:47,430
So essentially what we're doing here is just creating a smooth curve out of the original data set that

10
00:00:47,430 --> 00:00:48,270
we started with.

11
00:00:48,300 --> 00:00:54,090
It lets us visualize the shape of the data, the shape of the distribution without looking at all the

12
00:00:54,090 --> 00:00:56,670
individual bars in this discrete histogram.

13
00:00:56,670 --> 00:01:03,420
So in terms of terminology here, we might say that we use kernel density estimation to create the kernel

14
00:01:03,420 --> 00:01:04,290
density estimate,

15
00:01:04,290 --> 00:01:09,390
where the kernel density estimate is this curve. We might also call it the distribution curve, the

16
00:01:09,390 --> 00:01:13,410
density curve, the kernel density plot, or the KDE plot.

17
00:01:13,410 --> 00:01:19,620
But no matter what we call it, we'll use this KDE plot as part of the violin plot as well.

18
00:01:19,620 --> 00:01:26,700
Now, it's important to say that today we'll basically always use software or calculators to create

19
00:01:26,700 --> 00:01:28,890
violin plots and KDE plots.

20
00:01:28,890 --> 00:01:34,140
We don't really do this process by hand, but we do want to understand the fundamentals of what's going

21
00:01:34,140 --> 00:01:34,890
on behind this.

22
00:01:34,890 --> 00:01:42,300
So what we really want to know about the KDE plot is, first, that we can dictate how smooth the plot

23
00:01:42,300 --> 00:01:42,900
is.

24
00:01:42,900 --> 00:01:48,120
And that smoothness is based on what we call the bandwidth of the kernel function.

25
00:01:48,120 --> 00:01:53,070
So think of the kernel function as something maybe that looks like this.

26
00:01:53,070 --> 00:01:59,340
And this kernel function is essentially looking at all of the data points contained under the curve

27
00:01:59,340 --> 00:02:00,480
of the kernel function.

28
00:02:00,480 --> 00:02:04,860
And we could think about the data points underneath this function based on the data points that are

29
00:02:04,860 --> 00:02:06,960
making up the histogram behind it.

30
00:02:06,960 --> 00:02:12,450
So it basically looks at those data points and based on the number of data points and how close they

31
00:02:12,450 --> 00:02:17,940
are to the center of the kernel function gives us a height for the kernel density estimate or gives

32
00:02:17,940 --> 00:02:19,860
us a height for this density curve.
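To make that idea concrete, here is a minimal Python sketch of how a kernel function turns nearby data points into a height for the density curve. The Gaussian kernel shape, the sample values, and the names `gaussian_kernel` and `kde_height` are assumptions chosen for illustration; they are not taken from any particular software.

```python
import math

def gaussian_kernel(u):
    """Bell-shaped kernel: largest weight at u = 0, fading as |u| grows."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde_height(x, data, bandwidth):
    """Height of the density curve at x.

    Each data point contributes a weight based on how close it is to x
    (measured in units of the bandwidth). The weights are averaged and
    divided by the bandwidth so the whole curve has area 1.
    """
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)

data = [2.0, 2.5, 3.0, 3.2, 7.0]  # made-up sample
near_cluster = kde_height(3.0, data, bandwidth=0.5)
far_from_data = kde_height(5.0, data, bandwidth=0.5)

# Many close points give a tall curve; few distant points give a low one.
print(near_cluster > far_from_data)
```

The curve is taller at 3.0, where several points sit close to the center of the kernel, than at 5.0, where no points are nearby.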

33
00:02:19,860 --> 00:02:24,840
We can use kernel functions of different shapes so we can use a shape like this one.

34
00:02:24,990 --> 00:02:30,720
You'll also see a kernel density function that is a normal distribution like this.

35
00:02:30,720 --> 00:02:36,900
We can use a kernel density function that's a uniform distribution like this one or even a triangular

36
00:02:36,900 --> 00:02:38,460
distribution like this one.
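As a rough sketch of those kernel shapes, the three functions below define a normal, a uniform, and a triangular kernel in Python. The function names and the scaling (each kernel has area 1 and is written in units of the bandwidth) are assumptions made for this example.

```python
import math

def gaussian(u):
    """Normal-distribution kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def uniform(u):
    """Flat kernel: equal weight within one bandwidth of the center."""
    return 0.5 if abs(u) <= 1 else 0.0

def triangular(u):
    """Triangle kernel: weight falls off linearly to zero at one bandwidth."""
    return max(0.0, 1.0 - abs(u))

def area(kernel, lo=-6.0, hi=6.0, steps=12000):
    """Approximate the area under a kernel with a midpoint Riemann sum."""
    width = (hi - lo) / steps
    return sum(kernel(lo + (i + 0.5) * width) for i in range(steps)) * width

for kernel in (gaussian, uniform, triangular):
    # Each kernel should enclose an area close to 1.
    print(kernel.__name__, round(area(kernel), 3))
```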

37
00:02:38,460 --> 00:02:43,770
And when it comes to the kernel density function, we really just want to think about its width and

38
00:02:43,770 --> 00:02:44,340
its height.

39
00:02:44,340 --> 00:02:49,320
So if we have this kernel density function, it has a certain width to it.

40
00:02:49,320 --> 00:02:52,350
We call that the bandwidth and it has a certain height to it.

41
00:02:52,350 --> 00:02:54,000
We call that the amplitude.

42
00:02:54,000 --> 00:03:00,030
So this kernel density function here has a smaller bandwidth and a smaller amplitude

43
00:03:00,120 --> 00:03:05,760
than this kernel density function here, because this larger one here is both wider and taller.

44
00:03:05,760 --> 00:03:08,160
And the fact that it's wider means it has a larger bandwidth.

45
00:03:08,160 --> 00:03:11,820
And the fact that it's taller means it has a larger amplitude. Bandwidth,

46
00:03:11,820 --> 00:03:18,480
the width of the kernel density function, is what dictates how smooth the kernel density plot will appear.

47
00:03:18,480 --> 00:03:24,900
So a lower bandwidth or a narrower kernel density function indicates a less smooth curve.

48
00:03:24,900 --> 00:03:32,340
So if we were starting with this original kernel density estimation plot and it's based on some kernel

49
00:03:32,340 --> 00:03:38,250
density function, if we were to decrease the bandwidth or narrow the width of that kernel density function,

50
00:03:38,250 --> 00:03:44,640
then maybe our kernel density estimation plot changes from looking something like this to looking something

51
00:03:44,640 --> 00:03:45,420
like this.

52
00:03:45,420 --> 00:03:47,370
See how it's much less smooth?

53
00:03:47,370 --> 00:03:51,060
We see a lot more variation throughout the KDE plot.

54
00:03:51,060 --> 00:03:55,290
If we increase the bandwidth again, then we go back to the smoother curve.
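One way to see this effect numerically is to count the bumps in the density curve for a narrow and a wide bandwidth. The sketch below assumes a Gaussian kernel and a made-up sample; `kde` and `count_peaks` are hypothetical helper names.

```python
import math

def kde(x, data, bandwidth):
    """Gaussian kernel density estimate at x."""
    total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
    return total / (len(data) * bandwidth * math.sqrt(2 * math.pi))

def count_peaks(data, bandwidth, lo, hi, steps=400):
    """Count local maxima of the density curve on an evenly spaced grid."""
    xs = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    ys = [kde(x, data, bandwidth) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

data = [1.0, 1.4, 2.1, 3.0, 3.4, 4.8, 5.3, 6.5]  # made-up sample

narrow = count_peaks(data, bandwidth=0.1, lo=0.0, hi=8.0)
wide = count_peaks(data, bandwidth=1.5, lo=0.0, hi=8.0)

# The narrow bandwidth gives a bumpier curve than the wide one.
print(narrow, wide)
```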

55
00:03:55,320 --> 00:04:00,990
So if we were going to summarize that, we could see here a table where as we move from left to right,

56
00:04:00,990 --> 00:04:04,290
the amplitude of the kernel density function increases.

57
00:04:04,290 --> 00:04:08,670
In other words, in each row here, as we move from left to right, we can see that the height of the

58
00:04:08,670 --> 00:04:14,100
triangular distribution here, assuming we're using a triangular distribution, that the height is increasing,

59
00:04:14,100 --> 00:04:16,350
height is increasing, height is increasing.

60
00:04:16,350 --> 00:04:21,930
And as we move from top to bottom, the bandwidth of the kernel density function is increasing.

61
00:04:21,930 --> 00:04:26,310
So even though we're keeping height roughly the same here in this first column, as we move from top

62
00:04:26,310 --> 00:04:30,960
to bottom, the bandwidth increases; the width of that kernel density function increases.

63
00:04:30,960 --> 00:04:36,720
So in the top left here, we have the kernel density function with the smallest bandwidth and amplitude.

64
00:04:36,720 --> 00:04:41,640
And in the lower right we have the kernel density function with the largest bandwidth and amplitude.
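The two knobs in that table can be sketched as a triangular kernel with a separate width and height. In this sketch, `bandwidth` is taken to mean the distance from the center to where the triangle reaches zero; that definition and the function names are assumptions for illustration.

```python
def triangular_kernel(x, center, bandwidth, amplitude):
    """Triangle centered at `center`, zero at `bandwidth` away, peak `amplitude`."""
    distance = abs(x - center)
    if distance >= bandwidth:
        return 0.0
    return amplitude * (1.0 - distance / bandwidth)

def triangle_area(bandwidth, amplitude):
    """Area under the triangle: half of the base (2 * bandwidth) times the height."""
    return 0.5 * (2.0 * bandwidth) * amplitude

# Top-left of the table: narrow and short. Lower-right: wide and tall.
small = triangular_kernel(0.0, center=0.0, bandwidth=1.0, amplitude=0.5)
large = triangular_kernel(0.0, center=0.0, bandwidth=3.0, amplitude=2.0)
print(small, large)
```

Many software implementations scale the kernel so that its area is 1. For this triangle that would tie the amplitude to `1 / bandwidth`, which leaves bandwidth as the only value to choose.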

65
00:04:41,640 --> 00:04:45,930
In other words, this is what's going on behind the scenes when we use software.

66
00:04:45,930 --> 00:04:52,320
So when we use software to build a KDE plot, all we're doing is entering the data that would have created

67
00:04:52,320 --> 00:04:53,610
the histogram in the first place.

68
00:04:53,610 --> 00:04:59,610
So all of the data points that make up our histogram and then we give the software some value for bandwidth

69
00:04:59,610 --> 00:04:59,760
and

70
00:04:59,840 --> 00:05:02,150
amplitude of the kernel density function.

71
00:05:02,150 --> 00:05:06,680
And that creates for us the KDE plot based on those values.

72
00:05:06,680 --> 00:05:14,570
And in short, it's a way to turn a discrete histogram distribution here into a smooth distribution

73
00:05:14,570 --> 00:05:16,570
curve or a smooth density curve.

74
00:05:16,580 --> 00:05:23,840
Now, keeping the shape of this curve in mind, this KDE plot here in red, let's look at how the shape

75
00:05:23,840 --> 00:05:27,830
of this plot here can transition into a violin plot.

76
00:05:28,040 --> 00:05:35,810
So to move toward violin plots, what we've done here is taken the KDE plot for this histogram and we've

77
00:05:35,810 --> 00:05:37,640
created its mirror image.

78
00:05:37,640 --> 00:05:41,810
We've also moved this horizontal axis down here to the bottom.

79
00:05:41,810 --> 00:05:50,630
And now if we take away the histogram, what we have is a violin plot for this same data, which means

80
00:05:50,630 --> 00:05:57,800
that a violin plot is really just a more visual way of showing us the distribution of the data along

81
00:05:57,800 --> 00:05:58,820
some scale.
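The mirroring step can be sketched in a few lines of Python: compute the density curve along the scale, then reflect it to get the other side of the violin. The Gaussian kernel, the sample data, and the names `kde` and `violin_outline` are assumptions for this example.

```python
import math

def kde(x, data, bandwidth):
    """Gaussian kernel density estimate at x."""
    total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
    return total / (len(data) * bandwidth * math.sqrt(2 * math.pi))

def violin_outline(data, bandwidth, steps=50):
    """Return positions along the scale plus the two mirrored edges of the violin."""
    lo, hi = min(data), max(data)
    positions = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    upper = [kde(p, data, bandwidth) for p in positions]  # the density curve
    lower = [-height for height in upper]                 # its mirror image
    return positions, upper, lower

data = [12, 18, 20, 25, 31, 40, 44, 52, 70, 85]  # made-up sample
positions, upper, lower = violin_outline(data, bandwidth=8.0)

# The violin is widest where the density curve is tallest.
widest = positions[upper.index(max(upper))]
print(widest)
```

Plotting `upper` and `lower` against `positions` would trace the symmetric body of the violin along the external scale.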

82
00:05:58,820 --> 00:06:04,790
So violin plots are also really similar to box plots or box and whisker plots in the sense that we always

83
00:06:04,790 --> 00:06:11,900
have this sort of external axis that helps us position where the violin plot actually is according to

84
00:06:11,900 --> 00:06:13,490
this external scale.

85
00:06:13,490 --> 00:06:19,610
So we have this external scale, we have the violin plot, which is just a symmetric picture of that

86
00:06:19,610 --> 00:06:23,870
distribution that was created through the process of kernel density estimation.

87
00:06:23,960 --> 00:06:29,900
And then for violin plots, we usually add additional data, and this is where we see that violin plots

88
00:06:29,900 --> 00:06:32,540
actually are very similar to box plots.

89
00:06:32,540 --> 00:06:38,210
So this red outline here is a visual representation of the data distribution.

90
00:06:38,210 --> 00:06:44,540
But then we add something very similar to a box plot inside of this violin plot.

91
00:06:44,540 --> 00:06:51,920
So this box here is just like the box from the box plot in the sense that this left edge here is the

92
00:06:51,920 --> 00:06:54,500
first quartile or the 25th percentile.

93
00:06:54,590 --> 00:07:00,530
The right edge of the box is the third quartile or the 75th percentile.

94
00:07:00,530 --> 00:07:04,280
And then we also indicate the median, which is this blue dot here.

95
00:07:04,280 --> 00:07:10,220
We indicate the median, which is the second quartile.

96
00:07:10,220 --> 00:07:16,610
And therefore, of course, we can identify the interquartile range as the value here at Q three minus

97
00:07:16,610 --> 00:07:18,320
the value here at Q one.

98
00:07:18,320 --> 00:07:24,650
So it looks like according to this external scale, if we say that the third quartile is maybe 70 and

99
00:07:24,650 --> 00:07:30,560
the first quartile is maybe 20, then the interquartile range would be 70 - 20, or 50.
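As a small worked example, the quartiles and the interquartile range can be computed with Python's standard `statistics` module. The data set here is made up so that the quartiles land on the 20 and 70 from the example above, and the choice of `method="inclusive"` is an assumption; other quartile conventions can give slightly different values.

```python
import statistics

data = [10, 20, 20, 30, 45, 60, 70, 70, 90]  # made-up sample

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 20.0 45.0 70.0
print(iqr)         # 50.0
```

The second quartile matches `statistics.median(data)`, which is the point the box inside the violin marks as the median.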

100
00:07:30,560 --> 00:07:35,900
So from this box plot, we have the first, second and third quartiles, which means we also have the

101
00:07:35,900 --> 00:07:38,660
median, because the median is equivalent to the second quartile.

102
00:07:38,660 --> 00:07:40,970
That's two different ways of saying the same thing.

103
00:07:40,970 --> 00:07:45,530
And then we usually also plot this line behind the box.

104
00:07:45,530 --> 00:07:49,670
And this line is not equivalent to the whiskers of the box plot.

105
00:07:49,670 --> 00:07:57,560
Usually it represents a 95% confidence interval and we'll talk about confidence intervals later in the

106
00:07:57,560 --> 00:07:58,160
course.

107
00:07:58,160 --> 00:08:04,130
But the idea is that 95% of our data points lie within this 95% confidence interval.

108
00:08:04,130 --> 00:08:08,660
We will also sometimes include the mean as part of our violin plot as well.

109
00:08:08,660 --> 00:08:10,040
Not always, but sometimes.

110
00:08:10,040 --> 00:08:14,570
So this extra dot here is supposed to represent the mean.

111
00:08:14,570 --> 00:08:19,040
That's something we can optionally choose to include in our violin plot if we want to.

112
00:08:19,040 --> 00:08:27,230
So sometimes we'll see violin plots sketched out this way with this box plot in the center and the 95%

113
00:08:27,230 --> 00:08:28,310
confidence interval.

114
00:08:28,310 --> 00:08:34,880
But sometimes we'll see violin plots with just three lines running here, perpendicular to the plot

115
00:08:34,880 --> 00:08:38,419
that represent the first, second and third quartiles.

116
00:08:38,419 --> 00:08:43,340
So if we sketched in those lines behind this information, we might show them like this.

117
00:08:43,340 --> 00:08:46,400
You can see here at the first, second and third quartile.

118
00:08:46,400 --> 00:08:51,710
And if we chose to go that direction, then our plot would just look something like this, where we

119
00:08:51,710 --> 00:08:58,010
have this line for the median, this dashed line for the first quartile and this dashed line for the

120
00:08:58,010 --> 00:08:58,880
third quartile.

121
00:08:58,880 --> 00:09:02,960
It gives us a lot of the same information, but in a simpler form.

122
00:09:02,960 --> 00:09:09,410
And then the other thing that we want to say is that we can always display violin plots horizontally

123
00:09:09,410 --> 00:09:13,010
like we did here or vertically like this.

124
00:09:13,010 --> 00:09:19,700
Now, the histogram itself behind the kernel density estimation plot and therefore behind this violin

125
00:09:19,700 --> 00:09:27,200
plot, we almost always display sitting on a horizontal axis with the bars extending up vertically.

126
00:09:27,200 --> 00:09:31,850
So a horizontal violin plot like this one kind of matches that format.

127
00:09:31,850 --> 00:09:38,900
But it's very common to also see violin plots rotated counterclockwise by 90 degrees and shown vertically

128
00:09:38,900 --> 00:09:39,680
like this.

129
00:09:39,680 --> 00:09:41,240
Everything's exactly the same.

130
00:09:41,240 --> 00:09:43,850
It's just that we're showing the distribution vertically.

131
00:09:43,850 --> 00:09:49,670
So it's as if we take a horizontal violin plot like this, grab the right edge of it this way, and

132
00:09:49,670 --> 00:09:57,380
rotate it counterclockwise by 90 degrees just a quarter turn to turn it on its end and create a vertical

133
00:09:57,380 --> 00:09:59,600
plot here. Whether we show the

134
00:09:59,820 --> 00:10:00,430
violin plot

135
00:10:00,450 --> 00:10:08,340
horizontally or vertically, we always run the axis for scale parallel to the body of the violin.

136
00:10:08,340 --> 00:10:09,870
So here we have a horizontal plot.

137
00:10:09,900 --> 00:10:12,460
Our axis for scale is horizontal,

138
00:10:12,480 --> 00:10:16,260
parallel to the plot. Here, because our violin plots are vertical,

139
00:10:16,290 --> 00:10:20,820
this is our axis for scale, and so it runs vertically, parallel to the violins.

140
00:10:20,850 --> 00:10:27,210
Now, violin plots, we can say, have an advantage over the traditional box and whisker plot in the

141
00:10:27,210 --> 00:10:34,530
sense that they give a great visual picture of the distribution as sort of the body of this violin.

142
00:10:34,530 --> 00:10:41,070
But that also makes violin plots more visually busy and sometimes harder to interpret depending on what

143
00:10:41,070 --> 00:10:42,230
it is we're trying to show.

144
00:10:42,240 --> 00:10:48,000
As always, when we're trying to choose a chart type, we should just be thinking about the kind of

145
00:10:48,000 --> 00:10:53,730
chart that we should use in order to best represent or most clearly communicate the data that we're

146
00:10:53,730 --> 00:10:55,000
trying to display.

147
00:10:55,020 --> 00:11:00,960
So maybe depending on the data that we have, it's not super helpful or important to show this visual

148
00:11:00,960 --> 00:11:02,670
representation of the distribution.

149
00:11:02,760 --> 00:11:07,200
We might, in that case, just choose to use simple box plots instead.

150
00:11:07,200 --> 00:11:12,600
But maybe showing this visual representation of the distribution is particularly helpful because we

151
00:11:12,600 --> 00:11:17,220
can more clearly see the similarities or differences across distributions.

152
00:11:17,220 --> 00:11:21,510
We just need to think about what makes the most sense for the data that we're working with and the results

153
00:11:21,510 --> 00:11:23,160
that we're trying to communicate.

154
00:11:23,190 --> 00:11:29,790
If we have multiple data sets like we're showing here in this graph, we have data set A, data set

155
00:11:29,820 --> 00:11:36,660
B, and data set C, it can be particularly helpful to arrange the order of the corresponding violin

156
00:11:36,660 --> 00:11:39,510
plots based on the value of the median.

157
00:11:39,510 --> 00:11:45,840
So in this comparison plot here, we can see that the violins are arranged from smallest median on the

158
00:11:45,840 --> 00:11:51,900
left to largest median on the right, unless there's some other reason to arrange the data in a different

159
00:11:51,900 --> 00:11:52,380
order.

160
00:11:52,380 --> 00:12:01,530
For instance, maybe data set B here represents the month of January, data set C represents February,

161
00:12:01,530 --> 00:12:08,130
and data set A represents March, and maybe this chronology is important to us such that it would make

162
00:12:08,130 --> 00:12:14,970
more sense to show the violin plot for data set B and then C and then A so that we can see change over

163
00:12:14,970 --> 00:12:15,480
time.

164
00:12:15,480 --> 00:12:17,400
That version might make more sense.

165
00:12:17,400 --> 00:12:22,950
Or if we don't have this kind of ordering or something like this doesn't matter, then maybe arranging

166
00:12:22,950 --> 00:12:29,280
them this way as A, B, C in order of increasing median does a better job at communicating the differences

167
00:12:29,280 --> 00:12:30,780
between the data sets.
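The two orderings can be sketched in Python as follows. The values in each data set are invented for illustration, and the month labels simply follow the January, February, March example above.

```python
import statistics

# Made-up samples for the three data sets.
datasets = {
    "A": [3, 5, 6, 8, 9],
    "B": [7, 9, 11, 12, 15],
    "C": [10, 14, 16, 18, 21],
}

# Option 1: arrange the violins from smallest median to largest median.
by_median = sorted(datasets, key=lambda name: statistics.median(datasets[name]))

# Option 2: keep a meaningful order, such as B = January, C = February, A = March.
chronological = ["B", "C", "A"]

print(by_median)
print(chronological)
```

Either list could then be used as the left-to-right order of the violins, depending on which comparison matters more for the data.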

168
00:12:30,780 --> 00:12:35,040
So that's the idea behind creating KDE plots and violin plots.

169
00:12:35,040 --> 00:12:39,570
And now that we understand how to create all these different types of charts and graphs, in the next

170
00:12:39,570 --> 00:12:45,990
lecture, we'll look at common plot pitfalls or things that we should make sure to avoid when we're

171
00:12:45,990 --> 00:12:49,140
creating these different types of charts and graphs.

