WEBVTT

0
00:00.180 --> 00:02.340
When I was really young, I remember 

1
00:02.370 --> 00:06.420
a piece of homework that I had to do for Geography in school.

2
00:06.510 --> 00:11.340
And the teacher asked us to basically watch the weather report every single day

3
00:11.400 --> 00:14.550
on the news and note down what was the temperature,

4
00:14.880 --> 00:17.280
what was the weather condition of that day.

5
00:17.640 --> 00:20.910
And then we would end up with the weather condition and temperature for the

6
00:20.910 --> 00:23.460
previous week. Now at the time,

7
00:23.460 --> 00:27.930
it was the point in time where a lot of teachers didn't actually really know what

8
00:27.930 --> 00:30.510
the internet did. So,

9
00:30.810 --> 00:35.550
I realized that I could actually just wait until the day before I have to give

10
00:35.550 --> 00:37.920
the homework in, go onto the internet,

11
00:37.980 --> 00:40.380
find the weather conditions for the previous seven days

12
00:40.500 --> 00:43.980
and then write it out instead of having to watch the weather forecast every

13
00:43.980 --> 00:46.770
single day. So I don't know what that says about me,

14
00:46.800 --> 00:51.800
but if you are somebody who enjoys taking shortcuts in life and refuses to do

15
00:52.410 --> 00:56.760
work that can be done by computers, then you are in the right place.

16
00:56.910 --> 00:59.310
Learning Python is definitely the way to go.

17
01:00.510 --> 01:02.730
So inside Google Sheets,

18
01:02.760 --> 01:06.420
which you can access by going to sheets.google.com,

19
01:06.930 --> 01:11.930
I have created a new spreadsheet of the days of the week and the temperature of

20
01:13.140 --> 01:18.120
each of those days in Celsius and also the weather condition on those days.

21
01:18.630 --> 01:22.560
This is a replica of that homework from many, many moons ago,

22
01:23.040 --> 01:28.040
but we're going to be working with this data to learn how we can read data files

23
01:28.290 --> 01:31.350
and then analyze them and how we can do lots of things with it.

24
01:31.860 --> 01:36.330
So the first thing I want you to do is to head over to the link in the course

25
01:36.330 --> 01:39.090
resources which will take you to this spreadsheet.

26
01:39.570 --> 01:43.320
And then once you've got it open, go to File, then Download,

27
01:43.410 --> 01:48.410
and I want you to download it in the comma separated values or CSV format.

28
01:51.210 --> 01:53.970
And you should end up with a file like this.

29
01:54.540 --> 01:59.540
Now I want you to rename this file so that it's just weather_data.csv,

30
01:59.700 --> 02:04.260
and then I want you to create a new PyCharm project

31
02:04.410 --> 02:06.120
which you can call anything you want.

32
02:06.180 --> 02:10.170
I've called it day-25. And also create your main.py.

33
02:10.860 --> 02:15.860
Now I'm going to drag my weather_data.csv into my folder day-25 and click

34
02:16.980 --> 02:21.570
refactor to move that file into my project folder. Now,

35
02:21.600 --> 02:22.433
at this point,

36
02:22.440 --> 02:27.440
PyCharm recognizes that this is a CSV file and it asks you whether you want to

37
02:27.750 --> 02:32.550
install some plugins to make it easier to view this file. Now,

38
02:32.550 --> 02:33.060
at this point,

39
02:33.060 --> 02:38.060
you can click cancel because we want to view the data as the raw data,

40
02:38.700 --> 02:41.880
which is in this CSV format. Now,

41
02:41.910 --> 02:46.710
CSVs are a very common way of representing tabular data,

42
02:46.710 --> 02:51.710
so data that fits into tables like a spreadsheet. And CSV,

43
02:51.900 --> 02:55.260
as you've already seen, stands for comma separated values.

44
02:56.250 --> 02:58.050
So that's why when you look at the data,

45
02:58.080 --> 03:03.080
you can see each row here is a single set of data and each piece of data is

46
03:06.100 --> 03:08.830
separated by a comma, with no spaces.
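
NOTE
A minimal sketch of what "comma separated" means in practice. The row below is a
made-up example, not copied from the course spreadsheet, so the real values may differ.
Splitting the raw text on commas gives back the individual pieces of data.
```python
# Hypothetical sample row; the real file's values may differ.
raw_line = "Monday,12,Sunny"
# Each comma marks the boundary between two values.
fields = raw_line.split(",")
print(fields)
```
Every value comes back as a string, including the temperature.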

47
03:09.610 --> 03:13.060
So we've seen how we can open files, read files, write

48
03:13.060 --> 03:15.070
to files. As a challenge,

49
03:15.100 --> 03:20.100
I want you to go ahead and open up this weather_data.csv file inside your

50
03:20.560 --> 03:25.560
main.py and add each line of data into a list which we'll call data.

51
03:28.870 --> 03:30.580
Pause the video and give that a go.

52
03:30.630 --> 03:31.463
<v 1>Okay.</v>

53
03:33.390 --> 03:36.360
<v 0>All right. So we know that we're going to need to open the file.</v>

54
03:36.660 --> 03:40.710
So it's stored inside the same folder as our main.py, in day-25.

55
03:40.770 --> 03:45.770
So we can just use a relative file path to tap into this weather_data.csv.

56
03:46.890 --> 03:50.070
And we're going to open this file as data_file.

57
03:51.600 --> 03:55.290
And then we're going to get the data by reading this data file.

58
03:55.620 --> 03:58.830
So data = data_file.read.

59
03:59.280 --> 04:02.850
And not only are we going to read it, we're going to use readlines

60
04:02.880 --> 04:07.880
which we know will take each line in this file and turn it into an item in a

61
04:09.150 --> 04:12.450
list. Now, if I go ahead and print my data,

62
04:16.170 --> 04:21.170
then you can see I've got this list now where each item is a row in that file.
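
NOTE
A rough sketch of the readlines approach described above. To keep it runnable on its
own, it first writes a small stand-in file to a temporary folder. The rows and the
"day" and "condition" labels are assumptions; only the "temp" label comes from the lesson.
```python
import os
import tempfile
# Made-up rows standing in for the downloaded weather_data.csv.
sample = "day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\n"
path = os.path.join(tempfile.mkdtemp(), "weather_data.csv")
with open(path, "w") as sample_file:
    sample_file.write(sample)
# Open the file and turn each line into one item of a list.
with open(path) as data_file:
    data = data_file.readlines()
print(data)
```
In the lesson's project the file sits next to main.py, so the relative path
"weather_data.csv" would be passed to open() directly.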

63
04:23.130 --> 04:27.030
But as you can imagine, it would be pretty painful to work with the data,

64
04:27.300 --> 04:29.400
which is all in a string format.

65
04:29.460 --> 04:32.400
And they are still separated by commas.

66
04:32.730 --> 04:36.930
It would take a lot of cleaning to actually be able to extract each column and

67
04:36.930 --> 04:41.760
each row. So what can we do instead? Well,

68
04:41.760 --> 04:46.760
there's actually an inbuilt library that helps us with CSVs because Python is a

69
04:48.000 --> 04:52.290
language that's used really heavily for data processing, data analysis.

70
04:52.590 --> 04:56.700
There's a lot of great tools for working with tabular data,

71
04:56.910 --> 05:01.320
like our weather data. First, we're going to import the CSV library,

72
05:01.740 --> 05:04.440
and then we're going to, again, open up a file,

73
05:04.530 --> 05:07.740
weather_data.csv as our data file,

74
05:08.550 --> 05:11.670
and then we're going to use this CSV library.

75
05:12.210 --> 05:14.850
And it has a function called reader,

76
05:15.450 --> 05:19.380
which takes the file in question, which has already been opened

77
05:19.410 --> 05:24.410
so this is going to be our data_file, and it can read it and output

78
05:24.480 --> 05:25.313
the data.

79
05:25.710 --> 05:29.550
So now let's go ahead and print this data and let's see what we've got.

80
05:30.420 --> 05:33.720
You can see that it's created a CSV reader

81
05:33.750 --> 05:37.560
object. This object can be looped through.

82
05:37.920 --> 05:42.120
So if we wanted to get each row inside this data,

83
05:42.120 --> 05:44.790
we can say for row in data,

84
05:45.090 --> 05:49.860
go ahead and print each row. And once you've done that,

85
05:49.890 --> 05:54.540
you can see it's taken each of the rows inside our weather_data.csv,

86
05:55.050 --> 06:00.050
and separated out each item into a single value.

87
06:00.770 --> 06:03.980
So for example, on the Monday row,

88
06:04.010 --> 06:06.740
we've got the Monday as a string,

89
06:06.740 --> 06:10.850
we've got the temperature as a string and also the condition as a string.

90
06:11.150 --> 06:14.660
So now it's much easier for us to work with this data.
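
NOTE
A minimal sketch of looping over a csv.reader object. An io.StringIO object stands in
for the opened file so the example runs on its own; the rows are made up, and the
rows list is an extra name added here only to keep a copy of what was printed.
```python
import csv
import io
# io.StringIO stands in for the opened weather_data.csv; the rows are made up.
data_file = io.StringIO("day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\n")
data = csv.reader(data_file)
rows = []
for row in data:
    # Each row is a list of strings, one string per column.
    print(row)
    rows.append(row)
```
With the real file, data_file would come from open("weather_data.csv") instead.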

91
06:15.680 --> 06:18.320
Using what you know about Python lists

92
06:18.590 --> 06:22.790
I want you to create a new list called temperatures,

93
06:23.360 --> 06:28.360
and this list is going to contain all of the temperatures that are inside this

94
06:29.060 --> 06:32.990
weather_data.csv, like this 12 degrees, 14, 15,

95
06:33.350 --> 06:36.140
and it's going to be in the format of an integer.

96
06:36.200 --> 06:40.040
I don't want to see it as a string with quotation marks around it.

97
06:40.370 --> 06:43.460
It should be a pure number so that we can work with it more easily.

98
06:44.090 --> 06:45.890
So this is your challenge.

99
06:46.190 --> 06:50.390
Pause the video and see if you can extract all of the temperatures from this

100
06:50.390 --> 06:52.430
file into this new list.

101
06:52.630 --> 06:53.463
<v 2>Okay.</v>

102
06:55.600 --> 06:58.660
<v 0>All right. So when we printed out each of the rows,</v>

103
06:58.720 --> 07:03.720
we can see that we've created several lists where each list contains an entire

104
07:04.840 --> 07:07.600
row of data from our weather data CSV.

105
07:08.260 --> 07:10.840
If we wanted to get the temperature,

106
07:10.930 --> 07:15.730
then it's going to be the item at index one in that list. For example,

107
07:15.730 --> 07:18.970
if we wanted to get the Monday temperature,

108
07:19.300 --> 07:24.300
then all we have to do is to tap into each of these rows and get the item at

109
07:24.520 --> 07:27.670
index 1. If I go ahead and print this,

110
07:27.700 --> 07:30.730
then you can see we get all of the temperatures,

111
07:31.090 --> 07:33.550
but also the label for that column.

112
07:34.090 --> 07:36.730
So if we want to exclude that label,

113
07:36.760 --> 07:41.760
then all we have to do is use an if statement and check if row at index one

114
07:43.090 --> 07:47.320
does not equal temp, which is the name of that column label,

115
07:47.740 --> 07:52.740
then we're going to tap into our list of temperatures and append this row at

116
07:53.410 --> 07:56.410
index one, which is going to be a temperature number.

117
07:57.040 --> 07:59.980
So now after we've done the entire for loop,

118
08:00.010 --> 08:04.480
then we can print out our list of temperatures. And if you take a look,

119
08:04.510 --> 08:07.030
you can see we've now got a list of all the temperatures

120
08:07.300 --> 08:12.130
excluding that column title, but they are all in the format of strings.

121
08:12.550 --> 08:15.130
So if we want to convert that into an integer,

122
08:15.370 --> 08:18.580
then all we have to do is wrap that inside an int().
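
NOTE
One possible version of the challenge solution walked through above, sketched with
made-up rows in an io.StringIO stand-in for the file. The temperatures 12, 14 and 15
are mentioned in the lesson; the days and conditions paired with them are assumptions.
```python
import csv
import io
# Made-up rows standing in for weather_data.csv.
data_file = io.StringIO("day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\nWednesday,15,Rain\n")
data = csv.reader(data_file)
temperatures = []
for row in data:
    # Skip the header row, whose second item is the column label.
    if row[1] != "temp":
        # Convert the string into an integer before storing it.
        temperatures.append(int(row[1]))
print(temperatures)
```
The list now holds plain integers, so it can be used directly for arithmetic.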

123
08:19.900 --> 08:23.530
So that's the goal of the challenge. Now you can of course,

124
08:23.530 --> 08:26.320
separate out this line into many more lines,

125
08:26.620 --> 08:31.620
but I think this should make enough sense for you at this stage. While CSV is the

126
08:32.710 --> 08:36.580
inbuilt CSV reading and writing library,

127
08:37.210 --> 08:42.210
notice how much faff was involved in just simply getting a single column of

128
08:43.060 --> 08:47.500
data. What are we going to do if we have more data,

129
08:47.500 --> 08:50.920
more complex data with way more columns, way more rows and

130
08:51.140 --> 08:53.440
we want to do more interesting things with it?

131
08:53.920 --> 08:56.460
This is going to be quite painful to work with.

132
08:57.060 --> 09:00.840
This is the point where we want to get the help of some pandas.

133
09:01.170 --> 09:03.750
Not these kinds of pandas. As cute as they are

134
09:03.750 --> 09:06.300
they're not going to help us with our data analysis.

135
09:06.630 --> 09:11.630
But instead, I'm talking about the pandas library and this is a Python data

136
09:12.780 --> 09:14.190
analysis library

137
09:14.520 --> 09:19.520
which is super helpful and super powerful for performing data analysis on tabular

138
09:21.090 --> 09:25.410
data like the one that we have. In order to work with it,

139
09:25.440 --> 09:30.000
we have to import this library. But because it's not inbuilt,

140
09:30.060 --> 09:32.700
you'll need to install it into your project.

141
09:33.090 --> 09:36.870
So the shortcut way of doing this is simply to type import pandas

142
09:36.960 --> 09:38.400
and then once you see the red line,

143
09:38.700 --> 09:42.840
go ahead and hover over it and then click install package pandas,

144
09:43.020 --> 09:47.250
and then you can watch the progress down here. Now, while that's installing,

145
09:47.310 --> 09:50.940
I want to quickly introduce you to the documentation for this library.

146
09:51.240 --> 09:54.690
It's really well documented and it's really powerful,

147
09:54.690 --> 09:58.470
so it has a lot of things in the documentation. If you head over

148
09:58.470 --> 10:02.970
to pandas.pydata.org and then click on documentation,

149
10:03.300 --> 10:06.360
then you will be able to see all the things that you can do with it.

150
10:06.780 --> 10:08.730
So there's an API reference,

151
10:08.730 --> 10:13.730
there's a quick getting started guide as well as a user guide on the key

152
10:13.770 --> 10:18.770
concepts of pandas. When you're using a new library of any sort, a good idea is

153
10:20.490 --> 10:23.700
to take a look at their getting started guide if they have one,

154
10:24.060 --> 10:28.530
because it tells you how you can install it and a number of questions that you

155
10:28.530 --> 10:29.790
might have, for example,

156
10:29.790 --> 10:34.790
what kind of data does pandas handle or how can I read and write tabular data.

157
10:35.100 --> 10:38.100
And it's actually done really, really well. So if you have a moment,

158
10:38.220 --> 10:43.080
take a quick look at this page. And once you head back to PyCharm,

159
10:43.140 --> 10:46.800
your packages should have installed successfully. Now,

160
10:46.800 --> 10:48.840
once we've installed our pandas,

161
10:48.930 --> 10:51.960
you'll see that it's still grey because we're not using it yet.

162
10:52.470 --> 10:57.150
If we want to use pandas, all we have to do is type pandas followed by a dot,

163
10:57.900 --> 11:00.840
and in our case, we actually want to read our CSV.

164
11:01.320 --> 11:03.480
So we can say read_csv

165
11:04.050 --> 11:08.550
and inside this function, you can do lots and lots of things,

166
11:08.610 --> 11:13.140
as you can see by all of the parameter names. But most of these are optional.

167
11:13.560 --> 11:18.270
The only one that is not optional is the path that leads to the CSV file.

168
11:18.750 --> 11:22.440
So if we get hold of our weather_data.csv,

169
11:22.800 --> 11:25.860
then we can read that CSV using pandas.

170
11:26.160 --> 11:29.940
So notice how we don't have to open the file as a data file,

171
11:30.150 --> 11:34.350
or use a CSV reader; it's one step and you're done.

172
11:34.800 --> 11:39.060
So now we've got hold of our data and if I print out the data,

173
11:39.090 --> 11:42.570
you can see how beautifully formatted it is.

174
11:43.110 --> 11:45.930
It's being printed out as an actual table,

175
11:45.960 --> 11:50.790
it's got the column headings on top of each of the columns and each of the rows

176
11:50.790 --> 11:55.600
gets given an index so that we can more easily identify how many records we have

177
11:55.900 --> 11:57.220
and where each one is.
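
NOTE
A small sketch of reading CSV data with pandas, assuming the pandas package is already
installed. pandas.read_csv accepts a path or a file-like object, so io.StringIO with
made-up rows is used here in place of the real weather_data.csv.
```python
import io
import pandas
# Made-up rows standing in for weather_data.csv; only "temp" is a label from the lesson.
csv_text = "day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\n"
data = pandas.read_csv(io.StringIO(csv_text))
# Printing shows a table with column headings and a numbered index.
print(data)
```
In the lesson's project the call would be pandas.read_csv("weather_data.csv").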

178
11:58.390 --> 12:01.840
If we wanted to think about that previous task that we tried to do

179
12:01.840 --> 12:06.840
where we just try to get hold of a single column of data from this table,

180
12:08.830 --> 12:10.240
then using pandas,

181
12:10.330 --> 12:14.950
it is literally as easy as saying data and then square brackets,

182
12:15.070 --> 12:19.750
and then the name of that column. So in our case, it's temp.

183
12:20.860 --> 12:23.620
And now if I go ahead and print this out,

184
12:24.640 --> 12:29.640
you can see it's basically already identified the column and it's printed out

185
12:29.890 --> 12:31.720
all of the data in that column.

186
12:32.140 --> 12:34.720
So the really smart thing that pandas is doing here is

187
12:34.760 --> 12:39.700
it takes that first row to be the names of each column

188
12:40.120 --> 12:43.480
and it automatically knows how to find the data

189
12:43.750 --> 12:47.380
when you just specify the name of the column like this.
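
NOTE
A short sketch of selecting one column by name, assuming pandas is installed. The rows
are made up, and temps is a name introduced here for illustration.
```python
import io
import pandas
# Made-up rows standing in for weather_data.csv.
csv_text = "day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\nWednesday,15,Rain\n"
data = pandas.read_csv(io.StringIO(csv_text))
# The first row of the file supplies the column names used for lookup.
temps = data["temp"]
print(temps)
```
The values in the column come back as numbers rather than strings, so no int()
conversion is needed for this data.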

190
12:48.280 --> 12:51.220
So three lines versus eight lines,

191
12:51.730 --> 12:56.110
and we get better formatting. It's no wonder that most Python developers,

192
12:56.170 --> 12:58.240
as soon as they encounter a CSV file,

193
12:58.270 --> 13:00.910
they will start using pandas to work with it

194
13:01.210 --> 13:03.970
no matter how simple the task or project.

195
13:04.300 --> 13:07.660
So that was a quick introduction to CSV data,

196
13:08.020 --> 13:13.020
working with CSV data and how to get started using pandas. In the next lesson,

197
13:14.080 --> 13:18.100
we're going to dive deeper into this library and see all of the common things that

198
13:18.100 --> 13:19.300
we can do with pandas.