1
00:00:05,550 --> 00:00:11,130
Welcome everyone to this section on Core Data Concepts, where we're going to be focusing on the measures

2
00:00:11,130 --> 00:00:11,910
of data.

3
00:00:13,160 --> 00:00:19,010
Now fundamentally at the core of using data science to solve real world problems and find solutions

4
00:00:19,010 --> 00:00:24,260
to business challenges, we use data, and it's important to understand core concepts of data before

5
00:00:24,260 --> 00:00:26,990
continuing on to use it with a variety of methods.

6
00:00:28,160 --> 00:00:32,870
So in this section of the course, we're going to be focusing on some core topics for understanding what

7
00:00:32,870 --> 00:00:34,370
data actually is.

8
00:00:34,370 --> 00:00:39,620
And as you begin to learn about probability, statistics, and visualizations regarding data, we need to

9
00:00:39,620 --> 00:00:44,450
take a little bit of time to understand what we actually mean when we say the term data.

10
00:00:45,780 --> 00:00:48,990
So the natural question arises: what is data?

11
00:00:50,710 --> 00:00:56,410
Formally speaking, we can think of data as collected observations and information about something which

12
00:00:56,410 --> 00:00:58,480
can be structured or unstructured.

13
00:00:58,660 --> 00:01:03,940
Let's think about a few examples of data to get an idea of what we mean by terms like collected information

14
00:01:03,940 --> 00:01:05,770
or structured versus unstructured.

15
00:01:07,080 --> 00:01:11,900
Imagine that you went to Antarctica and you started collecting information about penguins.

16
00:01:11,910 --> 00:01:16,530
Maybe you take a penguin, make some measurements, and then send them back out into the wild.

17
00:01:17,290 --> 00:01:22,840
Well, you can start organizing this information and maybe you give each penguin a unique ID and then

18
00:01:22,840 --> 00:01:26,740
you also do measurements like the flipper length or the body mass in grams.

19
00:01:27,280 --> 00:01:31,620
You should notice already the idea of units of measurement and also structure.

20
00:01:31,630 --> 00:01:36,970
Here we can see a clear structure in a tabular form similar to a spreadsheet where you have columns

21
00:01:36,970 --> 00:01:43,360
and rows, where each column indicates possibly a feature of this particular data set, and then each

22
00:01:43,360 --> 00:01:47,560
row indicates an actual unique measurement of a particular penguin.

23
00:01:47,830 --> 00:01:51,160
Now, often when we think about data, we're thinking about numbers.

24
00:01:51,160 --> 00:01:54,340
But we should point out that not all data has to be numeric.

25
00:01:55,120 --> 00:01:59,660
For example, we could also take note of the sex of the penguin, whether they are male or female.

26
00:01:59,680 --> 00:02:05,590
Or you could take note of maybe the colour of a penguin's beak, whether it's yellow or orange or bright

27
00:02:05,590 --> 00:02:06,800
red, etc.

28
00:02:06,820 --> 00:02:12,880
So again, keep in mind that not all data needs to be numeric or even really have some sort of unit

29
00:02:12,910 --> 00:02:13,750
of measurement.

30
00:02:15,210 --> 00:02:19,290
I should point out, this is actually a real data set, and the data was collected and made available

31
00:02:19,290 --> 00:02:24,870
by Dr. Gorman at the Palmer Station in Antarctica, which is a member of the Long Term Ecological Research

32
00:02:24,870 --> 00:02:25,550
Network.

33
00:02:25,560 --> 00:02:29,220
And you can Google search penguin dataset and you can find more information on it.

34
00:02:31,070 --> 00:02:34,790
So we've got an idea that data doesn't necessarily need to be numerical.

35
00:02:34,790 --> 00:02:39,950
For example, the colors of cars or the sexes of those penguins, whether they're male or female.

36
00:02:39,950 --> 00:02:43,130
And so we have this idea that there are different types of data.

37
00:02:43,130 --> 00:02:48,470
And in fact, we can distinguish between different types of data, such as continuous versus discrete

38
00:02:48,470 --> 00:02:52,160
data as well as structured data versus unstructured data.

39
00:02:52,310 --> 00:02:55,430
Let's talk a little bit more about this vocabulary of data.

40
00:02:56,400 --> 00:03:00,960
So in order to explore some core concepts, we need some vocabulary used to describe them.

41
00:03:00,960 --> 00:03:05,460
So we're going to be talking about continuous versus discrete, sometimes known as continuous versus

42
00:03:05,460 --> 00:03:06,360
categorical.

43
00:03:06,480 --> 00:03:09,330
We'll also talk about structured versus unstructured data.

44
00:03:09,330 --> 00:03:14,280
We'll talk about nominal versus ordinal data as well as population data versus sample data.

45
00:03:15,930 --> 00:03:19,200
Let's begin by discussing continuous versus discrete data.

46
00:03:20,500 --> 00:03:22,420
So what is discrete data?

47
00:03:22,450 --> 00:03:27,820
Formally speaking, you'll often see discrete data described as data that can only take certain values.

48
00:03:27,820 --> 00:03:32,470
And I like to think of it as there's no values in between values.

49
00:03:32,470 --> 00:03:34,480
So let's give you an example of this.

50
00:03:35,380 --> 00:03:40,720
Let's imagine you're running a used car lot and you're taking inventory of your cars and you want to

51
00:03:40,720 --> 00:03:42,820
take inventory of the car model.

52
00:03:42,820 --> 00:03:43,900
Is it a Toyota?

53
00:03:43,930 --> 00:03:44,810
Is it a Tesla?

54
00:03:44,830 --> 00:03:46,740
Is it a Ferrari, etc.?

55
00:03:46,780 --> 00:03:49,220
This is known as discrete data.

56
00:03:49,240 --> 00:03:55,210
The car models can only take certain values and there's no real values in between values.

57
00:03:55,210 --> 00:03:58,940
For example, there's no value that's really in between Toyota and Tesla.

58
00:03:58,960 --> 00:04:04,870
The car model has to be one of those certain values, so it has to be a brand or model of a particular

59
00:04:04,870 --> 00:04:05,440
car.

60
00:04:06,960 --> 00:04:11,760
I should point out that you can also do this with playing card values that have some numbers in them.

61
00:04:11,760 --> 00:04:17,140
For example, playing card values have to be ace, two, three, jack, queen, king, and so on.

62
00:04:17,160 --> 00:04:22,050
There's no playing card value that's 2.5 or in between certain values.

63
00:04:22,050 --> 00:04:26,040
So don't get confused with discrete versus numeric data.

64
00:04:26,070 --> 00:04:31,290
You can have numeric data that is also discrete data depending on the context.

65
00:04:31,290 --> 00:04:32,400
So for example.

66
00:04:33,680 --> 00:04:38,510
Let's imagine that you're marking down all the possible values of rolling a single die.

67
00:04:38,540 --> 00:04:43,910
This is clearly numeric data, but it's still discrete because no matter how many times you roll this

68
00:04:43,910 --> 00:04:48,770
die, you're not going to be able to get any other value besides one, two, three, four, five and

69
00:04:48,770 --> 00:04:49,440
six.

70
00:04:49,460 --> 00:04:52,840
So this is numeric data and it's also discrete data.

71
00:04:52,850 --> 00:04:55,580
There is no 3.5 on the die.

72
00:04:57,510 --> 00:04:59,520
Now let's talk about continuous data.

73
00:04:59,550 --> 00:05:02,820
Formally speaking, continuous data can take any value.

74
00:05:02,850 --> 00:05:07,320
That is to say there's going to be an infinite amount of values in between any two values if you're

75
00:05:07,320 --> 00:05:08,760
able to get precise enough.

76
00:05:09,660 --> 00:05:13,550
For example, the height or weight of people is a continuous value set.

77
00:05:13,560 --> 00:05:18,360
So someone could be 172 centimetres tall or 173 centimetres tall.

78
00:05:18,360 --> 00:05:23,730
And we can see this is continuous data because there are values in between these that someone could be.

79
00:05:24,550 --> 00:05:28,060
So someone could be 172.5 centimetres tall.

80
00:05:28,060 --> 00:05:36,070
And if we're able to get super precise with this, someone could be 172.54 centimetres tall or 172.542

81
00:05:36,100 --> 00:05:42,310
centimetres, etc. So hopefully that distinguishes the idea of continuous data versus discrete data.
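
As a rough sketch of that distinction, here is a small Python example. The names `die_faces` and `midpoint` are made up for illustration, and the die rolls are simulated with the standard `random` module rather than taken from any real data set. The idea is that a die only ever produces its six faces, while between any two heights we can keep finding another value.

```python
import random

random.seed(0)  # fixed seed so repeated runs behave the same way

# Discrete: a die can only land on one of these six values.
die_faces = [1, 2, 3, 4, 5, 6]
rolls = [random.choice(die_faces) for _ in range(1000)]
distinct_rolls = set(rolls)  # never contains anything like 3.5


def midpoint(a, b):
    """Return the value halfway between a and b."""
    return (a + b) / 2


# Continuous: between 172 cm and 173 cm we can keep finding new heights.
low, high = 172.0, 173.0
between = []
for _ in range(3):
    high = midpoint(low, high)
    between.append(high)

print(sorted(distinct_rolls))
print(between)
```

In practice a measuring tool limits how precise we can get, so recorded heights end up rounded, but the underlying quantity is still treated as continuous.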

82
00:05:43,900 --> 00:05:49,780
And remember that while continuous data is numeric, such as a weight being 160 kilograms, discrete

83
00:05:49,780 --> 00:05:54,250
data can be numeric like a dice roll of two or a string like the color blue.

84
00:05:54,400 --> 00:05:59,590
You should also keep in mind that sometimes the context and framing of a data set will decide whether

85
00:05:59,590 --> 00:06:02,400
you should think of data as continuous or discrete.

86
00:06:02,410 --> 00:06:08,110
And in fact, when you're communicating with your colleagues about data, context is super important.

87
00:06:09,250 --> 00:06:14,230
For example, let's imagine I just asked you, is color data continuous or discrete?

88
00:06:14,290 --> 00:06:19,060
Based off what we just described, you may quickly say, oh, it's going to be discrete, like the color

89
00:06:19,060 --> 00:06:19,810
of a car.

90
00:06:20,020 --> 00:06:22,840
So you may just want to immediately say these are discrete values.

91
00:06:22,840 --> 00:06:26,920
Something is yellow, orange or red or green or blue, etc.

92
00:06:27,040 --> 00:06:32,950
But if you start to think about this more like on a spectrum of wavelengths, then maybe the context

93
00:06:32,950 --> 00:06:34,270
is more physics based.

94
00:06:34,360 --> 00:06:39,850
So if we're in a physics lab and we're actually measuring the visible spectrum of light, then color

95
00:06:39,850 --> 00:06:42,670
starts to become more like a continuous data set.

96
00:06:42,670 --> 00:06:47,440
And in this context, maybe color is not the best word, and I should actually be describing something

97
00:06:47,440 --> 00:06:50,260
like wavelengths as a continuous data set.

98
00:06:50,290 --> 00:06:54,590
Although technically, if we're only in the visible spectrum, I could use the word color.

99
00:06:54,610 --> 00:06:56,980
That's why context is so important here.

100
00:06:57,160 --> 00:07:01,390
So now we're really thinking more about wavelengths in nanometers.

101
00:07:01,390 --> 00:07:05,610
And in fact, as I mentioned, does the term color even make sense here?

102
00:07:05,620 --> 00:07:10,560
So there's gamma rays, X-rays, radio waves, etc., which we can't see.

103
00:07:10,570 --> 00:07:13,900
So keep in mind that context is super important.

104
00:07:13,900 --> 00:07:19,480
And being careful with the vocabulary you choose when describing data is also going to be very important.

105
00:07:21,430 --> 00:07:26,500
And also don't confuse numeric and ordered discrete data with continuous data.

106
00:07:26,620 --> 00:07:29,950
Let me talk about an example here to clarify what I mean.

107
00:07:31,120 --> 00:07:33,900
Let's consider an airline with passenger classes.

108
00:07:33,910 --> 00:07:37,450
There's a first class, a second class and a third class.

109
00:07:37,600 --> 00:07:39,670
Now, this is technically numeric data.

110
00:07:39,670 --> 00:07:42,700
We could say it's one, two and three in the same way.

111
00:07:42,700 --> 00:07:45,610
There's one, two, three, four, five, six on a dice roll.

112
00:07:45,910 --> 00:07:48,970
However, there also is an order to this.

113
00:07:48,970 --> 00:07:53,410
So second class is technically in between first and third class.

114
00:07:53,410 --> 00:07:55,930
But why is this still discrete?

115
00:07:55,960 --> 00:08:01,000
That's because the data can't take just any value.

116
00:08:01,000 --> 00:08:05,180
That is, there's only a certain amount of discrete values that the data can take.

117
00:08:05,200 --> 00:08:07,810
It can only take first, second or third class.

118
00:08:07,810 --> 00:08:10,630
There's no 1.57 passenger class.

119
00:08:10,630 --> 00:08:15,010
So this is kind of an interesting example: it's technically numeric data because you could think

120
00:08:15,010 --> 00:08:19,900
of it as one, two, three, and it also has a clear ordering, which kind of gives the presence or

121
00:08:19,900 --> 00:08:22,720
feeling that some values are in between others.

122
00:08:22,720 --> 00:08:28,390
But really it's still discrete because the data itself has to be one of these certain values.

123
00:08:28,390 --> 00:08:31,960
It can't just be any value on a continuous spectrum.

124
00:08:33,789 --> 00:08:38,110
So to help distinguish this type of data, we need some more vocabulary, and that's where the

125
00:08:38,110 --> 00:08:41,230
terms nominal versus ordinal come into play.

126
00:08:42,850 --> 00:08:47,200
Nominal data is classified without a natural order or rank.

127
00:08:47,590 --> 00:08:52,810
So, for example, if we're thinking of discrete categories of animals like dogs, cats, lizards,

128
00:08:52,810 --> 00:08:55,590
horses, etc., this is nominal data.

129
00:08:55,600 --> 00:09:00,640
There's no real ranking that's official to these particular animals, although we may think dogs are

130
00:09:00,640 --> 00:09:01,450
the best animals.

131
00:09:01,450 --> 00:09:02,950
But that's for another time.

132
00:09:04,350 --> 00:09:08,490
A good test for nominal data is to see if it can be clearly sorted or not.

133
00:09:08,520 --> 00:09:10,260
Nominal data cannot be meaningfully sorted.

134
00:09:10,260 --> 00:09:15,120
So if you just gave someone a bunch of species of different animals, there's no real clear definition

135
00:09:15,120 --> 00:09:16,950
for how they should be sorted or not.

136
00:09:18,120 --> 00:09:20,220
Ordinal data can be sorted.

137
00:09:20,220 --> 00:09:22,860
That is to say it has an order to it, that's all.

138
00:09:23,250 --> 00:09:26,640
So that's a good, helpful reminder of what ordinal data is.

139
00:09:27,240 --> 00:09:31,700
So our previous example of passenger classes is an ordinal discrete data set.

140
00:09:31,710 --> 00:09:34,020
So it's still discrete data, it's also numeric.

141
00:09:34,020 --> 00:09:36,540
But the factor here is that it's ordinal.

142
00:09:36,540 --> 00:09:41,160
So there is a natural ranking of these items first, second and third class.

143
00:09:42,300 --> 00:09:45,930
So we understand that second class is in between first and third class.

144
00:09:47,350 --> 00:09:51,280
I should also point out that ordinal data doesn't necessarily need to be numeric.

145
00:09:51,280 --> 00:09:57,310
So weather data terms, such as it's going to be hot, mild or cold, can be said to be ordinal because

146
00:09:57,310 --> 00:10:01,360
we can think of hot, mild and cold as being able to be sorted.

147
00:10:01,360 --> 00:10:04,780
So from hottest to coldest temperature words.
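
To make the sorting test concrete, here is a minimal Python sketch. The list `temperature_order` and the sample observations are invented for illustration. Ordinal labels come with an agreed ranking we can sort by, while for nominal labels the best we can do is fall back on something arbitrary like alphabetical order.

```python
# Ordinal: the labels have an agreed order, listed here from hottest to coldest.
temperature_order = ["hot", "mild", "cold"]
rank = {label: position for position, label in enumerate(temperature_order)}

observations = ["mild", "cold", "hot", "cold", "hot"]
sorted_observations = sorted(observations, key=lambda label: rank[label])

# Nominal: there is no official ranking of animals, so alphabetical order
# is only a display convention and says nothing about the data itself.
animals = ["dog", "cat", "lizard", "horse"]
alphabetical = sorted(animals)

print(sorted_observations)
print(alphabetical)
```

Notice that we had to write the ranking down ourselves. The order lives in our understanding of the labels, not in the strings.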

148
00:10:06,640 --> 00:10:10,370
When thinking about continuous versus discrete and nominal versus ordinal,

149
00:10:10,390 --> 00:10:13,600
try to keep in mind the context of the problem you're trying to solve.

150
00:10:13,630 --> 00:10:18,400
It may not be necessary to apply labels such as ordinal if they aren't useful to the challenge at hand

151
00:10:18,400 --> 00:10:19,930
or the problem you're trying to solve.

152
00:10:21,530 --> 00:10:24,560
Let's also talk about structured versus unstructured data.

153
00:10:24,710 --> 00:10:29,600
So we need to understand that not all data is going to be formatted nicely in a table or spreadsheet.

154
00:10:29,720 --> 00:10:34,010
We should also understand that in some cases you actually don't even want it in a structured format.

155
00:10:35,670 --> 00:10:40,620
Structured data is highly specific and is stored in a predefined format, so this could be something

156
00:10:40,620 --> 00:10:41,890
like an Excel spreadsheet,

157
00:10:41,910 --> 00:10:47,970
JSON files, which are JavaScript Object Notation files, XML files, SQL databases.

158
00:10:47,970 --> 00:10:50,730
They all follow some sort of predefined format.

159
00:10:50,760 --> 00:10:55,890
That is to say that as long as you hand over the data to someone else who is familiar with the format,

160
00:10:55,890 --> 00:10:58,350
they're going to be able to work with that data.

161
00:10:59,530 --> 00:11:05,530
Then there's unstructured data that's not in any particular format, such as audio data or video or

162
00:11:05,530 --> 00:11:10,490
text data that doesn't need to follow any particular predefined structured format.

163
00:11:10,510 --> 00:11:16,390
So, for example, video data, that can be a short little YouTube clip or it can be a two hour movie

164
00:11:16,390 --> 00:11:19,920
and there's no real self-defined structure there.

165
00:11:19,930 --> 00:11:23,380
It just happens to be in an encoding format.

166
00:11:24,040 --> 00:11:30,730
So be careful not to confuse computer encoded file formats with formatted or structured data.

167
00:11:30,730 --> 00:11:36,430
So just because a text is in a PDF format doesn't technically make it structured data because you'd

168
00:11:36,430 --> 00:11:39,720
have to open up that PDF file to understand what you're actually looking at.

169
00:11:39,730 --> 00:11:43,540
Are you looking at reviews or are you looking at some legal documents?

170
00:11:43,540 --> 00:11:49,360
So unstructured data, again, is not in any particular format where you can just easily open it up

171
00:11:49,360 --> 00:11:51,670
in a wide variety of programs.
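
Here is a small Python sketch of that difference. The penguin record, its field names and the free-text note are all made up for illustration and are not taken from the real penguin data set. With the JSON record, anyone who knows the format can pull a field out by name. With the free-text note, the same information is there, but there is no field to look up.

```python
import json

# Structured: a JSON record follows a predefined format (hypothetical values).
structured = '{"penguin_id": "P001", "flipper_length_mm": 181, "body_mass_g": 3750}'
record = json.loads(structured)
body_mass = record["body_mass_g"]  # direct lookup by field name

# Unstructured: the same kind of information as free text.
unstructured = "Measured a penguin today; flippers roughly 181 mm, weighed about 3.75 kg."
# There are no named fields here. Getting the body mass out would need
# custom text processing written for this particular style of note.

print(body_mass)
print(len(unstructured.split()), "words of free text")
```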

172
00:11:53,500 --> 00:11:58,180
So typically unstructured data is going to be harder to work with.

173
00:11:58,180 --> 00:12:01,540
But in certain fields it's actually necessary to achieve results.

174
00:12:02,610 --> 00:12:07,740
Many state-of-the-art deep learning techniques use unstructured data to learn patterns and generate

175
00:12:07,740 --> 00:12:08,790
new objects.

176
00:12:09,920 --> 00:12:13,370
For example, you may have heard of DALL-E 2 from OpenAI.

177
00:12:13,400 --> 00:12:19,730
It's a deep learning program that is actually able to take custom text descriptions such as an astronaut

178
00:12:19,730 --> 00:12:26,210
riding a horse in the style of Andy Warhol and then produce an image to reflect the text description.

179
00:12:26,240 --> 00:12:32,390
Now, this is a super complex, deep learning model, but you get an idea that it actually used unstructured

180
00:12:32,390 --> 00:12:34,370
data for its training set.

181
00:12:34,400 --> 00:12:41,060
It used unstructured text descriptions as well as unstructured image files, and then was able to eventually

182
00:12:41,060 --> 00:12:45,830
learn the relationship between a text description and the output image.

183
00:12:45,830 --> 00:12:51,680
And after being trained on many, many images and many text descriptions, it was able to understand,

184
00:12:51,680 --> 00:12:55,910
given a text description prompt, what sort of image should be output.

185
00:12:55,910 --> 00:13:01,100
And I would highly encourage you to do a Google search for DALL-E 2 to see all the amazing images people

186
00:13:01,100 --> 00:13:03,620
have created with this amazing deep learning program.

187
00:13:05,610 --> 00:13:09,180
And then let's also talk about population versus sample.

188
00:13:09,330 --> 00:13:13,560
So finally, we also need to consider the scope of our data collection.

189
00:13:13,710 --> 00:13:20,040
Is this data a full representation of everything available and known, or is it actually just a sample

190
00:13:20,040 --> 00:13:20,910
of everything?

191
00:13:22,000 --> 00:13:25,630
So a population consists of every member of a group.

192
00:13:25,810 --> 00:13:29,110
Keep in mind, this is dependent on the context of a situation.

193
00:13:29,200 --> 00:13:35,440
For example, a list of all the student names inside of a school contains data on the entire population.

194
00:13:35,440 --> 00:13:40,380
Because in this context, the scope of what we're looking for is just this particular school.

195
00:13:40,390 --> 00:13:45,480
And if we get a list of all the student names in that school, that's the entire population.

196
00:13:45,490 --> 00:13:47,080
Here, context is super important.

197
00:13:48,350 --> 00:13:52,790
Don't confuse the term population in this context with the population of an entire country.

198
00:13:52,790 --> 00:13:57,590
In the context of data science, we use population to describe the entire available dataset.

199
00:13:59,150 --> 00:14:04,500
Now, often it's not going to be possible to record data on an entire population.

200
00:14:04,520 --> 00:14:10,460
In this case, we rely on a sample from the population, which is a subset of all the members of the

201
00:14:10,460 --> 00:14:10,940
group.

202
00:14:12,130 --> 00:14:17,140
For example, let's imagine we had an optional school survey that we only ask some of the students to

203
00:14:17,140 --> 00:14:18,910
fill out. In this context,

204
00:14:18,910 --> 00:14:22,420
that would be a sample of the population of the entire school.

205
00:14:22,630 --> 00:14:28,180
We should always try to have samples that are representative of the population we're trying to understand.

206
00:14:28,210 --> 00:14:33,790
So having a survey where only two students fill it out in a school of 1000 students is probably not

207
00:14:33,790 --> 00:14:35,080
a representative sample.
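
As a quick illustration, here is a Python sketch using only the standard library. The survey scores are simulated with made-up parameters, so treat the numbers as hypothetical. We build a population of 1000 students, then draw a sample of 2 and a sample of 100 and compare each sample mean with the population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so repeated runs behave the same way

# Population: one hypothetical survey score for every student in the school.
population = [random.gauss(70, 10) for _ in range(1000)]
population_mean = statistics.mean(population)

# Samples: subsets of the population, drawn without replacement.
tiny_sample = random.sample(population, 2)
larger_sample = random.sample(population, 100)

print("population mean:   ", round(population_mean, 2))
print("mean of 2 students:", round(statistics.mean(tiny_sample), 2))
print("mean of 100:       ", round(statistics.mean(larger_sample), 2))
```

If you rerun this with different seeds, the mean of the 2-student sample will tend to jump around much more than the mean of the 100-student sample.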

208
00:14:36,110 --> 00:14:40,100
Later on, we'll discover that sample sizes are actually a very well-studied science.

209
00:14:40,100 --> 00:14:45,290
So, for example, how many students should we survey for a school of 1000 students to get a representative

210
00:14:45,290 --> 00:14:45,890
sample?

211
00:14:46,070 --> 00:14:52,100
You can intuitively feel that one or two students is not enough, but you also have an intuition that

212
00:14:52,100 --> 00:14:59,030
it's probably not necessary to survey all 1000 students to get an idea of how students in the general

213
00:14:59,030 --> 00:15:00,420
population feel.

214
00:15:00,440 --> 00:15:06,500
So there's actually a really well-defined science to what sample size you need to try to get a representative

215
00:15:06,500 --> 00:15:07,100
sample.

216
00:15:07,880 --> 00:15:11,780
So in case you're curious about the answer to that particular question, we'll discuss it later on.

217
00:15:11,780 --> 00:15:16,760
But it actually depends a bit on our assumptions of the overall population and its distribution and

218
00:15:16,760 --> 00:15:17,810
the task at hand.

219
00:15:17,840 --> 00:15:23,390
You can find out more now at this particular Wikipedia link on sample size determination, or you can

220
00:15:23,390 --> 00:15:27,110
just keep watching along the course and we'll revisit this idea later on.

221
00:15:28,580 --> 00:15:33,140
So we've seen that data comes in different forms and that we need to be cognizant of the context surrounding

222
00:15:33,140 --> 00:15:36,770
the data and more importantly on what we are using the data for.

223
00:15:36,800 --> 00:15:42,440
The ability to measure certain features of data is crucial to understanding data sets, especially numeric

224
00:15:42,440 --> 00:15:42,980
ones.

225
00:15:42,980 --> 00:15:47,600
And I hope the vocabulary we taught you right now is going to be useful in describing your data sets

226
00:15:47,600 --> 00:15:48,380
later on.

227
00:15:49,710 --> 00:15:53,740
So let's continue on by exploring two key concepts of data measurements:

228
00:15:53,760 --> 00:15:59,400
measurements of central tendency like mean, median and mode, and measurements of dispersion like variance

229
00:15:59,400 --> 00:16:00,630
and standard deviation.

230
00:16:00,720 --> 00:16:02,460
We'll see you at the next lecture.

