WEBVTT

0
00:00.290 --> 00:01.470
In the last lesson,

1
00:01.470 --> 00:03.570
we got started using pandas.

2
00:03.570 --> 00:04.770
We installed the library,

3
00:04.770 --> 00:08.370
and we imported it to read a CSV file.

4
00:08.370 --> 00:11.940
And then, we used that file to get hold of a column,

5
00:11.940 --> 00:15.450
which has been automatically identified as soon as it read

6
00:15.450 --> 00:17.160
this data.csv.

7
00:17.160 --> 00:20.250
So, what exactly are we dealing with here?

8
00:20.250 --> 00:22.830
Well, one of the most useful things I find

9
00:22.830 --> 00:25.950
is to do type checks on any of the objects

10
00:25.950 --> 00:29.520
that I'm working with from a new novel library.

11
00:29.520 --> 00:32.490
For example, we can use the type method to check

12
00:32.490 --> 00:36.000
what exactly is the data type() of this data

13
00:36.000 --> 00:38.070
that we're getting back from pandas.

14
00:38.070 --> 00:42.033
So if I go ahead and print this one, and comment out

15
00:42.033 --> 00:45.390
this second line, then you can see what we're getting

16
00:45.390 --> 00:49.410
is what's called a pandas DataFrame object.

17
00:49.410 --> 00:51.630
In the Package overview, they talk about

18
00:51.630 --> 00:55.380
the two primary data structures of pandas.

19
00:55.380 --> 00:57.870
Series and a DataFrame.

20
00:57.870 --> 00:59.490
A DataFrame is kind of

21
00:59.490 --> 01:02.490
the equivalent of your whole table here.

22
01:02.490 --> 01:05.430
So, every single sheet inside an Excel file

23
01:05.430 --> 01:07.080
or inside a Google Sheet file

24
01:07.080 --> 01:10.203
would be considered a DataFrame in pandas.

25
01:11.040 --> 01:13.350
Now, what about the second part here,

26
01:13.350 --> 01:16.140
where we've gotten hold of one of the columns

27
01:16.140 --> 01:17.550
in our DataFrame?

28
01:17.550 --> 01:22.170
If I do a type check on this object, and I hit Run,

29
01:22.170 --> 01:26.580
then you can see, this is a pandas Series object.

30
01:26.580 --> 01:31.580
The Series is the other super important concept in pandas.

31
01:31.830 --> 01:35.940
And the Series is basically equivalent to a list.

32
01:35.940 --> 01:40.020
It's kind of like a single column in your table.

33
01:40.020 --> 01:43.170
So, the temperature column would be a Series,

34
01:43.170 --> 01:45.450
the condition column would be a Series,

35
01:45.450 --> 01:48.240
and the day column would also be a Series.

36
01:48.240 --> 01:50.100
So, once you've grokked this idea

37
01:50.100 --> 01:54.150
that the whole table is basically a DataFrame in pandas,

38
01:54.150 --> 01:57.270
and every single column is a Series,

39
01:57.270 --> 01:59.880
kind of like a list in pandas,

40
01:59.880 --> 02:01.920
then you're pretty much half of the way there

41
02:01.920 --> 02:05.190
to understanding how this library works.

42
02:05.190 --> 02:07.920
Now if we go over to the API reference,

43
02:07.920 --> 02:08.753
you can see that

44
02:08.753 --> 02:12.060
this is basically a list of all of the things

45
02:12.060 --> 02:13.980
that you can do with pandas.

46
02:13.980 --> 02:17.250
And it is a long list of things.

47
02:17.250 --> 02:20.520
But let's take a look at those two core classes,

48
02:20.520 --> 02:22.740
the DataFrame and the Series.

49
02:22.740 --> 02:25.110
So if we go to the DataFrame, you can see

50
02:25.110 --> 02:28.680
that it has things on how to construct a new DataFrame,

51
02:28.680 --> 02:30.150
how to get hold of the index,

52
02:30.150 --> 02:32.550
how to get hold of the column labels,

53
02:32.550 --> 02:35.070
and there's a whole bunch of attributes

54
02:35.070 --> 02:39.030
that you can tap into as well as many methods.

55
02:39.030 --> 02:41.700
For example, if we take a look at this section

56
02:41.700 --> 02:44.340
on Serialization / IO / conversion,

57
02:44.340 --> 02:47.070
you can see that you can actually convert a DataFrame

58
02:47.070 --> 02:49.800
to various different file types.

59
02:49.800 --> 02:51.810
You can convert it to an Excel file,

60
02:51.810 --> 02:54.060
you can convert it to HTML.

61
02:54.060 --> 02:57.060
You can also convert it to a dictionary.

62
02:57.060 --> 02:57.990
So if we click on this,

63
02:57.990 --> 03:00.810
this takes us to the actual documentation

64
03:00.810 --> 03:03.720
on how you would use this method.

65
03:03.720 --> 03:06.930
And if you look at the basic Python documentation

66
03:06.930 --> 03:08.640
versus panda's documentation,

67
03:08.640 --> 03:10.950
you'll see why this is so much better.

68
03:10.950 --> 03:14.730
It lists out all of the possible parameters.

69
03:14.730 --> 03:16.950
It gives you examples on how you can use

70
03:16.950 --> 03:20.790
each of the methods, and it's also got some related methods

71
03:20.790 --> 03:23.440
that it thinks that you might want to take a look at.

72
03:24.360 --> 03:26.850
So, let's use this method.

73
03:26.850 --> 03:30.060
And what we're going to do is, I'm going to get my data,

74
03:30.060 --> 03:32.880
and I'm going to call to_dict,

75
03:32.880 --> 03:35.370
which is going to convert it into a dictionary.

76
03:35.370 --> 03:37.720
So, let's call that data_dict = data.to_dict().

77
03:39.150 --> 03:42.810
And then if we print out our new data_dictionary,

78
03:42.810 --> 03:46.440
you can see that pandas has taken our table,

79
03:46.440 --> 03:48.990
and taken each column of the table

80
03:48.990 --> 03:53.640
to create a separate dictionary for each of the columns.

81
03:53.640 --> 03:56.850
So, we've got day, temperature, and condition,

82
03:56.850 --> 03:58.890
and we can now work with this

83
03:58.890 --> 04:00.963
as if it were a real dictionary.

84
04:01.920 --> 04:06.480
Now, if we take a look at the Series data type.

85
04:06.480 --> 04:09.300
And you take a look at the Conversion section

86
04:09.300 --> 04:11.760
for this type of data,

87
04:11.760 --> 04:14.010
then you can see that you can actually convert

88
04:14.010 --> 04:17.130
each of the Series to a list, if you want to be able

89
04:17.130 --> 04:20.343
to work with it just as you would with any other list.

90
04:21.360 --> 04:24.150
So, we saw that we can get our data,

91
04:24.150 --> 04:27.180
and then get the temperature column.

92
04:27.180 --> 04:30.540
And this, when we printed out the type, was a Series.

93
04:30.540 --> 04:33.687
So, we can get the Series, and then call .to_list().

94
04:34.980 --> 04:39.210
And this will turn this data Series into a Python list.

95
04:39.210 --> 04:40.710
So, let's call that temp_list.

96
04:42.000 --> 04:44.013
And let's go ahead and print it out.

97
04:46.890 --> 04:49.710
This would be a list of all the temperatures,

98
04:49.710 --> 04:54.540
and this is now converted into the raw Python data type.

99
04:54.540 --> 04:55.950
So, we can do all of the things

100
04:55.950 --> 04:57.960
that we can do to a Python list,

101
04:57.960 --> 05:00.423
like for example, we could check its length.

102
05:01.260 --> 05:03.870
So, here comes another challenge for you.

103
05:03.870 --> 05:06.450
Given what we've spoken about so far,

104
05:06.450 --> 05:08.550
can you figure out if you can work out

105
05:08.550 --> 05:12.720
the average temperature in our column of temperatures?

106
05:12.720 --> 05:15.450
So remember, you can always Google if you don't remember

107
05:15.450 --> 05:18.090
how to calculate the average of something.

108
05:18.090 --> 05:20.193
So, pause the video and give that a go.

109
05:22.410 --> 05:23.940
All right, so we know that we can get

110
05:23.940 --> 05:28.740
a list of all of the temperatures in that list,

111
05:28.740 --> 05:32.640
and we know that Python has a built-in method called sum().

112
05:32.640 --> 05:35.790
So, then we can get the sum of all of the temperatures

113
05:35.790 --> 05:37.500
in our list of temperatures,

114
05:37.500 --> 05:42.213
and then we can divide it by the length of the temp_list.

115
05:43.110 --> 05:44.970
The sum of all the values divided

116
05:44.970 --> 05:48.240
by the number of values gives us the average,

117
05:48.240 --> 05:50.070
which is also known as the mean.

118
05:50.070 --> 05:53.760
And so if we print this out, then you can see,

119
05:53.760 --> 05:58.760
the average temperature of the week was 17.4.

120
05:58.890 --> 06:02.130
Now, an alternative way of solving this challenge is,

121
06:02.130 --> 06:06.660
maybe you took a look through this list of methods,

122
06:06.660 --> 06:09.450
and you might have come across some of the computations

123
06:09.450 --> 06:14.340
and statistics that you can do with your Series in pandas.

124
06:14.340 --> 06:16.740
Now, one of those methods is the mean.

125
06:16.740 --> 06:19.740
So, you can actually get rid of all of this excess work,

126
06:19.740 --> 06:21.900
and take our data Series,

127
06:21.900 --> 06:24.210
which is basically the column of data

128
06:24.210 --> 06:29.210
under the heading temp, and simply call .mean on it.

129
06:29.520 --> 06:33.060
And now if I print this out, you'll see

130
06:33.060 --> 06:35.640
it's the same result as before

131
06:35.640 --> 06:37.773
without a lot of the extra work.

132
06:38.640 --> 06:40.830
In addition to the mean, you can get other things

133
06:40.830 --> 06:43.170
like the median or the mode,

134
06:43.170 --> 06:46.050
and a whole bunch of other things just by calling

135
06:46.050 --> 06:49.200
the right method on the data Series.

136
06:49.200 --> 06:51.120
So, here's a challenge for you.

137
06:51.120 --> 06:54.060
I want you to get hold of the maximum value

138
06:54.060 --> 06:56.190
from this column of temperatures

139
06:56.190 --> 06:59.130
by using one of the data Series methods.

140
06:59.130 --> 07:01.860
Pause the video, have a look at the documentation,

141
07:01.860 --> 07:04.010
and see if you can complete this challenge.

142
07:06.120 --> 07:08.220
All right, to get the maximum value,

143
07:08.220 --> 07:11.160
we're probably going to need this max method.

144
07:11.160 --> 07:14.670
So, we call this method in the same way as we did before,

145
07:14.670 --> 07:17.610
which is get hold of the data Series.

146
07:17.610 --> 07:20.610
So, our entire table is stored in data,

147
07:20.610 --> 07:22.650
and then we can get the column

148
07:22.650 --> 07:24.810
under the heading temperature.

149
07:24.810 --> 07:27.480
So, this is now a data Series,

150
07:27.480 --> 07:30.330
and then we can call that method max on it.

151
07:30.330 --> 07:31.740
And if we print it out,

152
07:31.740 --> 07:35.580
then you can see what we're getting is 24.

153
07:35.580 --> 07:38.193
So, the highest temperature was 24.

154
07:39.780 --> 07:42.120
So, you've seen that when we're working with pandas,

155
07:42.120 --> 07:46.620
it's really easy to get hold of data in a particular column,

156
07:46.620 --> 07:48.930
all we have to do is just take the DataFrame,

157
07:48.930 --> 07:50.700
use a set of square brackets,

158
07:50.700 --> 07:53.880
and then specify the name of the column,

159
07:53.880 --> 07:58.680
which it takes by default as the first row of the data.

160
07:58.680 --> 08:02.490
So, the day column, the temp, the condition.

161
08:02.490 --> 08:05.760
So if I want to get hold of all of the conditions,

162
08:05.760 --> 08:08.520
then I would say data["condition"].

163
08:08.520 --> 08:11.073
And if I go ahead and print this out,

164
08:12.090 --> 08:15.330
you can see it gets hold of all of the weather conditions,

165
08:15.330 --> 08:18.570
and selects that column to print out.

166
08:18.570 --> 08:21.360
Now, an alternative way to using

167
08:21.360 --> 08:24.750
the square bracket notation, where you have to be careful

168
08:24.750 --> 08:26.310
about the string you use here,

169
08:26.310 --> 08:30.480
it has to match, the name of the column exactly.

170
08:30.480 --> 08:32.760
Another way that you can work with the columns

171
08:32.760 --> 08:37.760
is simply by calling data.condition.

172
08:37.980 --> 08:41.010
And the fact that this code is valid at all means that

173
08:41.010 --> 08:45.930
pandas behind the scenes, has taken each of these columns

174
08:45.930 --> 08:47.100
and each of the headings,

175
08:47.100 --> 08:50.370
and converted those headings into attributes.

176
08:50.370 --> 08:53.700
So, we can say data.condition or data.day.

177
08:53.700 --> 08:55.710
And if I print that out, you can see

178
08:55.710 --> 08:57.210
it's actually going to be

179
08:57.210 --> 09:01.293
exactly the same as doing it like this.

180
09:02.220 --> 09:05.040
So, it's up to you which method you want to use

181
09:05.040 --> 09:07.230
to select the columns.

182
09:07.230 --> 09:11.580
But be aware that if your column name has a capital C,

183
09:11.580 --> 09:16.580
for example here, then your key has to be a capital C.

184
09:16.650 --> 09:19.593
And also, your attribute has to be a capital C.

185
09:20.460 --> 09:23.580
So effectively, when you're using a DataFrame like this,

186
09:23.580 --> 09:26.550
it's almost like you're treating it as a dictionary.

187
09:26.550 --> 09:29.580
And you're pulling out each column by the key.

188
09:29.580 --> 09:31.890
Now, when you are using the DataFrame like this,

189
09:31.890 --> 09:34.020
then you're kind of treating it more like an object.

190
09:34.020 --> 09:36.480
You're saying data.attribute,

191
09:36.480 --> 09:39.660
and you get hold of the data in that column.

192
09:39.660 --> 09:42.690
So, I'm going to restore everything to lowercase

193
09:42.690 --> 09:45.600
because I find it easier to read the code.

194
09:45.600 --> 09:48.330
But the next thing I want to show you is a little bit harder,

195
09:48.330 --> 09:50.010
which is how do you get data,

196
09:50.010 --> 09:53.190
which are in the rows of our DataFrame.

197
09:53.190 --> 09:57.060
If I wanted to get hold of the entire row of data

198
09:57.060 --> 10:00.990
for where the day is equal to Monday,

199
10:00.990 --> 10:03.870
then the way that I would do that in pandas, is firstly,

200
10:03.870 --> 10:06.360
get a hold of my entire data table,

201
10:06.360 --> 10:10.140
and then, inside that data table, get hold of the column

202
10:10.140 --> 10:12.090
that I want to search through.

203
10:12.090 --> 10:14.820
So, I'm going to search through the day column,

204
10:14.820 --> 10:19.530
so I can use data.day or data["day"],

205
10:19.530 --> 10:21.630
both will work the same.

206
10:21.630 --> 10:25.170
But once I've got the column, then I can say, well,

207
10:25.170 --> 10:28.140
where inside that column, I want to check

208
10:28.140 --> 10:33.140
for the row where the value is equal to Monday.

209
10:33.630 --> 10:37.200
This is basically going to return my row that I want.

210
10:37.200 --> 10:39.960
So, I'm going to print this out, and I'm going to

211
10:39.960 --> 10:43.830
comment out all of the previous code other than the place

212
10:43.830 --> 10:45.870
where we created our DataFrame.

213
10:45.870 --> 10:48.000
And then, I'm going to run my code.

214
10:48.000 --> 10:51.900
And you can see it's pulled out that correct row

215
10:51.900 --> 10:54.630
where the day is equal to Monday.

216
10:54.630 --> 10:57.843
And it's given me all of the rest of the data for that row.

217
10:59.040 --> 11:00.720
So, here's a challenge for you.

218
11:00.720 --> 11:03.990
Can you figure out how to pull out the row of data

219
11:03.990 --> 11:07.590
from our weather data where the temperature

220
11:07.590 --> 11:09.240
was at the maximum?

221
11:09.240 --> 11:11.940
So, which row of data had the highest temperature

222
11:11.940 --> 11:12.960
in the week?

223
11:12.960 --> 11:14.763
Pause the video and give that a go.

224
11:16.020 --> 11:19.020
So, we know that we can get the maximum temperature

225
11:19.020 --> 11:22.950
in the temperature column just by using this code.

226
11:22.950 --> 11:25.080
Now, you can either use this method

227
11:25.080 --> 11:27.450
where you say data["temp"],

228
11:27.450 --> 11:29.760
or you can use the attribute .temp,

229
11:29.760 --> 11:31.260
which is the code that I prefer.

230
11:31.260 --> 11:34.440
I don't like writing a lot of strings if I can avoid it.

231
11:34.440 --> 11:38.940
In this case, we're checking to see which row inside

232
11:38.940 --> 11:42.060
our column of temperatures

233
11:42.060 --> 11:45.450
is equal to the maximum temperature.

234
11:45.450 --> 11:50.450
We would say data.temp == data.temp.max().

235
11:51.930 --> 11:54.420
And then, we're going to get our data,

236
11:54.420 --> 11:58.350
and access the row that fits that criteria.

237
11:58.350 --> 12:01.470
Now if I print out this row,

238
12:01.470 --> 12:04.140
then you can see it was the row for Sunday,

239
12:04.140 --> 12:08.100
where the temperature was 24 and the condition was sunny.

240
12:08.100 --> 12:10.860
Essentially, when we get our DataFrame

241
12:10.860 --> 12:12.960
and then we use some square brackets,

242
12:12.960 --> 12:15.420
and inside those square brackets, if we only put

243
12:15.420 --> 12:18.420
the name of our column, day, temp, or condition,

244
12:18.420 --> 12:21.000
then we would get the entire column.

245
12:21.000 --> 12:24.900
But if we filter that column by a condition,

246
12:24.900 --> 12:29.130
say when a particular column is equal to a particular value,

247
12:29.130 --> 12:32.610
then we actually get hold of the row instead.

248
12:32.610 --> 12:35.220
Now, once you've gotten hold of the data in the row,

249
12:35.220 --> 12:37.860
you can actually go one step further.

250
12:37.860 --> 12:41.280
Because we know that the row contains lots of data, right?

251
12:41.280 --> 12:45.300
What if we wanted that particular row's temperature

252
12:45.300 --> 12:47.880
or that particular row's condition?

253
12:47.880 --> 12:49.770
Well, let's say that we create a variable

254
12:49.770 --> 12:53.130
called monday, which is equal to our DataFrame,

255
12:53.130 --> 12:55.170
and then searching through that DataFrame,

256
12:55.170 --> 13:00.170
where the [data.day == "Monday"].

257
13:00.480 --> 13:03.330
So, now with this row, monday,

258
13:03.330 --> 13:07.860
we can then tap into the values under different columns

259
13:07.860 --> 13:10.680
by using the same kind of way that we got data

260
13:10.680 --> 13:13.140
in the entire column over here.

261
13:13.140 --> 13:16.350
So, we can say monday.condition.

262
13:16.350 --> 13:21.350
And if I print this out and just comment these other bits,

263
13:23.910 --> 13:26.970
then you can see that I get the actual condition

264
13:26.970 --> 13:30.510
for that particular day, which happens to be Sunny.

265
13:30.510 --> 13:31.860
Now, here's a challenge.

266
13:31.860 --> 13:35.220
I want you to get Monday's temperature,

267
13:35.220 --> 13:38.010
but because my temperatures are in Celsius,

268
13:38.010 --> 13:40.650
I want you to convert it into Fahrenheit.

269
13:40.650 --> 13:42.060
So, pause the video,

270
13:42.060 --> 13:44.210
and see if you can complete that challenge.

271
13:46.560 --> 13:50.130
All right, we know we can get hold of Monday's temperature

272
13:50.130 --> 13:55.130
by saying monday...and then the name of that column

273
13:55.320 --> 14:00.320
in the row, which happens to be temp, T-E-M-P.

274
14:00.690 --> 14:02.670
And then, we get the first value in the series

275
14:02.670 --> 14:03.783
at index zero.

276
14:05.160 --> 14:08.250
And then if we want to convert Celsius to Fahrenheit,

277
14:08.250 --> 14:12.060
all we have to do is to multiply the Celsius

278
14:12.060 --> 14:15.990
by 9 over 5, and then add 32.

279
14:15.990 --> 14:18.360
So, it'll be Monday temperature

280
14:18.360 --> 14:21.930
multiplied by 9 divided by 5,

281
14:21.930 --> 14:24.363
and then add 32.

282
14:31.200 --> 14:35.103
So, monday_temp_F, and then we can print this value out.

283
14:37.470 --> 14:41.130
So now when I run it, we get 53.6.

284
14:41.130 --> 14:43.650
And if I put Monday's temperature, 12,

285
14:43.650 --> 14:48.320
into this Google converter, I get the same value, 53.6.

286
14:49.440 --> 14:53.100
Now, the final thing I want to show you is how you create

287
14:53.100 --> 14:55.893
a DataFrame from scratch.

288
14:56.820 --> 14:59.400
So, in our case, we created our DataFrame

289
14:59.400 --> 15:02.040
by reading from our CSV file,

290
15:02.040 --> 15:04.620
but what if you wanted to create a DataFrame

291
15:04.620 --> 15:08.310
just from some data that you're generating in Python?

292
15:08.310 --> 15:11.010
Let's say that I have this dictionary of values,

293
15:11.010 --> 15:13.800
I've got some students, and these are their names

294
15:13.800 --> 15:17.490
held in a list, and then each of them has a score.

295
15:17.490 --> 15:19.590
And the scores correspond to the students,

296
15:19.590 --> 15:23.250
so 76 is Amy's score, James scored 56,

297
15:23.250 --> 15:25.260
and Angela scored 65.

298
15:25.260 --> 15:29.760
Now, how would we create a DataFrame from this dictionary?

299
15:29.760 --> 15:31.080
It's really simple.

300
15:31.080 --> 15:33.360
We call our pandas library,

301
15:33.360 --> 15:36.300
and we get hold of the DataFrame class,

302
15:36.300 --> 15:39.840
and then we initialize that class with some data.

303
15:39.840 --> 15:42.450
And the data, in our case, is just going to be

304
15:42.450 --> 15:44.580
our data dictionary.

305
15:44.580 --> 15:48.630
And now, if I go ahead and save this as our data,

306
15:48.630 --> 15:51.090
so I'm going to comment out what we had previously,

307
15:51.090 --> 15:53.403
and then print out this data,

308
15:54.390 --> 15:56.910
then you can see, I've now created a table

309
15:56.910 --> 15:59.460
using the values from that dictionary.

310
15:59.460 --> 16:02.220
Now, we can go even further than this.

311
16:02.220 --> 16:05.310
When we've created a DataFrame, we can actually get

312
16:05.310 --> 16:10.050
that DataFrame to be converted to a CSV file.

313
16:10.050 --> 16:15.050
And this to_csv() method takes only one required input,

314
16:15.330 --> 16:18.480
which is the path that you want to save this file.

315
16:18.480 --> 16:21.420
So, let's just create a new file,

316
16:21.420 --> 16:23.120
which we'll call ("new_data.csv").

317
16:25.500 --> 16:28.920
And when I run this code, then watch over here,

318
16:28.920 --> 16:32.820
you'll see a new CSV file being created from thin air,

319
16:32.820 --> 16:36.573
and all of our data has been added to that CSV file.

320
16:37.890 --> 16:39.810
So, we're just getting a glimpse

321
16:39.810 --> 16:43.770
into how powerful this panda library can be.

322
16:43.770 --> 16:47.880
And we're only really using it to read CSV data

323
16:47.880 --> 16:49.920
and write CSV data.

324
16:49.920 --> 16:51.690
Which is a common format

325
16:51.690 --> 16:55.110
that you'll see being manipulated using Python.

326
16:55.110 --> 16:56.190
In later lessons,

327
16:56.190 --> 16:58.620
we're going to dive deeper into data analysis,

328
16:58.620 --> 17:00.630
and we're going to be looking not only at pandas,

329
17:00.630 --> 17:03.230
but we're also going to be looking at NumPy,

330
17:03.230 --> 17:06.840
Matplotlib, and other libraries that make it easier for us

331
17:06.840 --> 17:09.840
to work with large chunks of data.

332
17:09.840 --> 17:10.830
In the next lesson,

333
17:10.830 --> 17:12.960
we're going to be putting what we've learned to use

334
17:12.960 --> 17:17.100
by analyzing some squirrel data from Central Park.

335
17:17.100 --> 17:20.163
So for all of that and more, I'll see on the next lesson.