WEBVTT

0
00:00.510 --> 00:04.950
Now that you've learned how to do the basics of data analysis with pandas,

1
00:05.280 --> 00:08.250
it's time to put your knowledge to the test.

2
00:08.700 --> 00:13.700
And we're going to be working with a really interesting data set. Back in 2018/

3
00:13.950 --> 00:18.390
2019, a bunch of volunteers went into New York's Central Park

4
00:18.450 --> 00:23.100
and they basically combed the entire park to find all of the squirrels.

5
00:23.730 --> 00:28.680
So this resulted in a huge dataset of squirrels, squirrel numbers,

6
00:28.680 --> 00:31.410
squirrel fur color and a whole bunch of other things.

7
00:31.860 --> 00:36.860
And all of this data can be found on the New York City Open Data website

8
00:37.020 --> 00:40.740
which we'll link to in the course resources. When you head over here,

9
00:40.740 --> 00:44.130
you can take a look at all of the data that they collected.

10
00:44.490 --> 00:48.540
If you take a look at all of the columns in this dataset and click show

11
00:48.540 --> 00:52.770
all, you can see they've logged the location of each of the squirrels,

12
00:53.010 --> 00:55.320
they gave each squirrel a unique ID,

13
00:55.560 --> 00:59.730
they evaluated whether the squirrel was an adult or a juvenile,

14
01:00.030 --> 01:04.380
and they looked at what the primary fur color was. So it could be gray, cinnamon

15
01:04.380 --> 01:07.860
which is red, or black. Now, if you scroll down

16
01:07.860 --> 01:10.830
there's actually a really interesting visualization of this data,

17
01:11.090 --> 01:13.130
where they've plotted all of the squirrels

18
01:13.130 --> 01:15.800
onto a map of Central Park.

19
01:16.220 --> 01:21.220
So you can take a look at the distribution of the squirrel population in Central

20
01:21.380 --> 01:21.680
Park.

21
01:21.680 --> 01:24.830
So you can see where the red squirrels are, where the gray ones are and where the

22
01:24.830 --> 01:27.860
black ones are. So there's only three colors, really.

23
01:28.040 --> 01:30.830
So what you're going to do is you're going to go to this website,

24
01:30.860 --> 01:34.670
click on export and download the CSV data.

25
01:35.780 --> 01:36.830
Now, once you do that,

26
01:36.830 --> 01:41.830
you'll end up with a 2018 Central Park squirrel data file like this.

27
01:42.890 --> 01:47.890
Now you're going to pull that data into your day-25 project folder and click

28
01:48.470 --> 01:49.770
refactor to move it in.

29
01:50.240 --> 01:55.130
And you can see that this is a lot bigger than whatever data we had before.

30
01:55.430 --> 01:57.800
There's thousands of entries,

31
01:57.860 --> 02:00.740
and it's actually real dedication that these volunteers went around

32
02:00.740 --> 02:04.160
logging all of this data, observing all the squirrels.

33
02:04.340 --> 02:07.040
It must've been a very tedious task.

34
02:07.340 --> 02:09.740
But now that somebody else has carried out the hard part,

35
02:10.040 --> 02:12.980
we can do some data analysis on that data.

36
02:13.430 --> 02:18.020
So I want you to comment out everything that you've got so far in your day-25

37
02:18.020 --> 02:18.853
project

38
02:19.070 --> 02:24.070
and the goal is for you to use that data and use what you've learned about

39
02:24.290 --> 02:25.123
pandas

40
02:25.160 --> 02:30.160
to be able to create a CSV that's called squirrel_count that has a small table

41
02:32.330 --> 02:36.050
which just contains the fur color, so there's only three fur colors,

42
02:36.170 --> 02:40.580
and they are logged under the primary fur color column.

43
02:41.090 --> 02:44.990
And it can either basically be gray, cinnamon

44
02:45.020 --> 02:47.060
which is red, or black.

45
02:47.090 --> 02:50.810
There's only three possible values in that column. Now,

46
02:50.840 --> 02:55.460
what you're going to do is you are going to figure out how many gray squirrels

47
02:55.460 --> 02:56.510
there are in total,

48
02:56.540 --> 03:01.540
how many cinnamon ones and how many black ones based on that primary fur color

49
03:02.020 --> 03:02.853
column.

50
03:03.040 --> 03:07.330
And then you're gonna take that data and build a new data frame from it,

51
03:07.660 --> 03:08.590
and using that,

52
03:08.620 --> 03:13.620
create this final CSV using pandas. Now that you know what you need to do,

53
03:14.260 --> 03:18.160
have a think about the problem and see if you can complete this challenge.

54
03:18.400 --> 03:22.690
Pause the video now. All right.

55
03:22.690 --> 03:25.870
If you've commented out the line where you've imported pandas,

56
03:25.900 --> 03:27.760
then you'll obviously have to do that again.

57
03:28.330 --> 03:32.320
So let's think about what we want to do. We want to isolate the column

58
03:32.320 --> 03:36.190
which is the primary fur color. And if it helps you,

59
03:36.190 --> 03:40.120
you can actually better visualize the data on this website where they've got a

60
03:40.120 --> 03:44.740
table preview. So you can see that here is the primary fur color,

61
03:44.800 --> 03:47.410
and you can see the colors that have been logged.

62
03:48.130 --> 03:52.180
So our goal is to somehow get hold of this data series

63
03:52.210 --> 03:56.170
which contains this entire column, figure out how many of them are gray,

64
03:56.170 --> 04:01.060
how many of them are black and how many are cinnamon. How do we do this?

65
04:01.150 --> 04:03.820
Well, firstly, let's get hold of our data.

66
04:03.820 --> 04:08.820
So we're going to use our pandas.read_csv function.

67
04:09.670 --> 04:14.650
And then we're going to direct it towards this 2018 Central Park squirrel census

68
04:14.650 --> 04:16.030
data. Now, if you want,

69
04:16.030 --> 04:19.780
you can actually right-click refactor and rename it to something a little bit

70
04:19.780 --> 04:23.890
shorter. But because I know that PyCharm will actually fill this in for me

71
04:23.950 --> 04:25.720
as long as I start out with a string

72
04:26.260 --> 04:28.840
and that that file is in the same folder,

73
04:29.110 --> 04:33.460
it'll actually type it all out if I just hit enter, then it doesn't really matter.

74
04:34.000 --> 04:34.780
But of course,

75
04:34.780 --> 04:37.600
make sure that you don't have any typos in here if you're typing it out,

76
04:37.870 --> 04:41.290
because otherwise when you hit run, you're going to get a whole bunch of error

77
04:41.300 --> 04:43.000
text inside your console.

78
04:43.870 --> 04:46.600
So once I've successfully read that CSV,

79
04:46.630 --> 04:50.080
I've now got a dataframe. From that data frame,

80
04:50.110 --> 04:53.980
I can get hold of the column that I'm interested in

81
04:54.220 --> 04:56.530
which is called primary fur color. Now,

82
04:56.530 --> 05:01.150
because it's got spaces, it's easier to access that data by using a square

83
05:01.150 --> 05:04.900
bracket and then putting in the name of that column like this.
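
NOTE
A minimal sketch of the two steps just described: reading a CSV into a DataFrame and picking out one column with square brackets.
Assumptions: the four-row sample below is a made-up stand-in for the downloaded census file, and io.StringIO is used only so the sketch runs without that file.
In the project you would pass the file name of your downloaded CSV to read_csv instead.
```python
import io
import pandas
# Hypothetical four-row stand-in for the downloaded census CSV (values are made up).
sample_csv = io.StringIO(
    "Unique Squirrel ID,Age,Primary Fur Color\n"
    "A-1,Adult,Gray\n"
    "B-2,Juvenile,Cinnamon\n"
    "C-3,Adult,Gray\n"
    "D-4,Adult,Black\n"
)
# read_csv accepts a file path or a file-like object and returns a DataFrame.
data = pandas.read_csv(sample_csv)
# The column name contains spaces, so select it with square brackets.
fur_colors = data["Primary Fur Color"]
print(fur_colors)
```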

84
05:04.990 --> 05:07.630
This is one of the methods that we showed you. Now,

85
05:07.630 --> 05:09.790
once I've gotten hold of that column, well

86
05:09.790 --> 05:14.080
the next thing I need to do is to find all of the rows in that column

87
05:14.110 --> 05:17.470
where the data is equal to each of the colors.

88
05:17.500 --> 05:20.980
So there was the color which is gray,

89
05:20.980 --> 05:24.250
so gray with 'ay', not 'ey'.

90
05:24.970 --> 05:28.600
And then once we've got hold of all the gray squirrels,

91
05:28.870 --> 05:31.840
then we're going to pull that out from our data.

92
05:32.440 --> 05:36.130
So now we should have a bunch of gray squirrels,

93
05:36.580 --> 05:39.010
and it's probably a good idea to print them out

94
05:39.040 --> 05:41.320
just to see if that actually worked.

95
05:41.650 --> 05:45.280
And because I expect there'll be lots of rows with gray squirrels,

96
05:45.490 --> 05:49.900
it makes sense to make it a plural, grey_squirrels. So now if I hit run,

97
05:50.200 --> 05:55.200
you can see listed here are all of the rows where it contains a gray squirrel.
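
NOTE
A rough sketch of the row filter described here, assuming a small made-up DataFrame in place of the real census data.
The comparison inside the brackets produces a True/False value per row, and indexing with it keeps only the rows marked True.
```python
import pandas
# Hypothetical stand-in for the census data; the real file has thousands of rows.
data = pandas.DataFrame({
    "Unique Squirrel ID": ["A-1", "B-2", "C-3", "D-4"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Black"],
})
# Keep only the rows whose primary fur color is exactly "Gray".
grey_squirrels = data[data["Primary Fur Color"] == "Gray"]
print(grey_squirrels)
```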

98
05:57.260 --> 05:59.570
Now it truncated this table

99
05:59.570 --> 06:03.710
so that it can actually display it in the console because we know that there's many,

100
06:03.710 --> 06:05.900
many columns and there's many, many rows.

101
06:06.140 --> 06:10.340
It's just showing you the first few rows and then the last few rows and also the

102
06:10.340 --> 06:12.470
first few columns and the last few columns.

103
06:12.890 --> 06:16.760
So we can be pretty sure that we've managed to get hold of all of the rows that

104
06:16.760 --> 06:20.990
contain gray squirrels as their primary fur color. Now,

105
06:20.990 --> 06:25.910
what if we wanted to know the gray squirrels count? Well,

106
06:25.910 --> 06:30.910
we could use our len function because remember, once we get hold of the rows

107
06:31.610 --> 06:35.000
it kind of gets treated a bit like an iterable,

108
06:35.330 --> 06:40.330
like a list, and we can use functions like len on this data.

109
06:41.060 --> 06:44.330
So now if we print grey_squirrels_count,

110
06:45.350 --> 06:50.350
you can see that we've got a total of 2,473 gray squirrels.

111
06:51.620 --> 06:56.420
So now all I need to do is to repeat this process for the other colored

112
06:56.420 --> 07:00.440
squirrels. So I'm going to call it a red squirrel, even though theoretically,

113
07:00.470 --> 07:02.570
their fur color is cinnamon.

114
07:02.990 --> 07:07.580
So I'm just going to copy and paste that in there in case I make any typos and

115
07:07.580 --> 07:10.400
the final squirrel is the black squirrel.

116
07:11.270 --> 07:13.910
So those are all three squirrel types,

117
07:14.000 --> 07:17.240
and if I go ahead and print out all of these,

118
07:17.510 --> 07:22.510
so the red squirrel count, the black squirrel count and the gray squirrel count,

119
07:23.390 --> 07:28.390
you can see that you've got mostly gray squirrels, a few red ones and very

120
07:28.610 --> 07:30.710
rarely do you actually see a black squirrel.
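
NOTE
A small sketch of counting each fur color with len, assuming a made-up six-row DataFrame rather than the real census file.
The counts printed here come from the sample only; they are not the census totals.
```python
import pandas
# Hypothetical sample data; the real counts come from the downloaded CSV.
data = pandas.DataFrame({
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Black", "Gray", "Cinnamon"],
})
# len on a filtered DataFrame gives the number of matching rows.
grey_squirrels_count = len(data[data["Primary Fur Color"] == "Gray"])
red_squirrels_count = len(data[data["Primary Fur Color"] == "Cinnamon"])
black_squirrels_count = len(data[data["Primary Fur Color"] == "Black"])
print(grey_squirrels_count, red_squirrels_count, black_squirrels_count)
```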

121
07:30.920 --> 07:35.540
I certainly haven't seen one recently. Now that we've got our three values,

122
07:35.570 --> 07:39.980
it's time to construct our data frame. So to construct our data frame,

123
07:40.100 --> 07:43.640
the easiest way is to actually just create a dictionary.

124
07:44.000 --> 07:48.470
So I'm going to create a data dictionary and this dictionary is going to have

125
07:48.530 --> 07:50.450
two key-value pairs.

126
07:50.630 --> 07:53.750
So the first key is going to be the fur color

127
07:55.070 --> 07:59.690
and this is going to contain the three fur colors, which is, um,

128
07:59.720 --> 08:04.400
gray, cinnamon or red, and black.

129
08:05.210 --> 08:06.830
And then, um,

130
08:06.890 --> 08:10.760
the next key value pair is going to be the count.

131
08:11.420 --> 08:15.830
So now we can create a list where the first value is going to be the gray squirrel 

132
08:15.830 --> 08:19.520
count, next is going to be the red squirrel count and finally,

133
08:19.520 --> 08:24.050
it's going to be the black squirrel count. So now that we've got our dictionary

134
08:24.320 --> 08:26.750
and this is what it looks like,

135
08:27.500 --> 08:31.700
then we can go ahead and actually turn this into a data frame.

136
08:32.090 --> 08:32.930
So to do that,

137
08:32.930 --> 08:37.220
we need to get hold of pandas and then get hold of the DataFrame class,

138
08:37.580 --> 08:41.120
and then initialize it using this data dictionary.

139
08:41.570 --> 08:44.300
So I'm going to save that as df, df for data frame.
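
NOTE
A minimal sketch of building the DataFrame from a dictionary, as described above.
Assumptions: the three counts are placeholder numbers, and the key strings "Fur Color" and "Count" are one possible choice of column names.
```python
import pandas
# Placeholder counts for illustration; in the project these come from len().
grey_squirrels_count = 3
red_squirrels_count = 2
black_squirrels_count = 1
# Each key becomes a column; each list holds that column's values in row order.
data_dict = {
    "Fur Color": ["Gray", "Cinnamon", "Black"],
    "Count": [grey_squirrels_count, red_squirrels_count, black_squirrels_count],
}
df = pandas.DataFrame(data_dict)
print(df)
```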

140
08:44.780 --> 08:48.860
And then the final thing I need to do is to get my df

141
08:48.920 --> 08:51.620
to convert to a CSV.

142
08:52.190 --> 08:54.530
So now I get to specify the name of the file,

143
08:54.530 --> 08:59.250
which I will call squirrel_count.csv

144
08:59.730 --> 09:03.540
and once I hit run, you'll see that new file show up right here

145
09:03.990 --> 09:08.990
and you can see that it's constructed my new CSV file with all of the data that

146
09:09.840 --> 09:13.260
I've extracted from my Central Park squirrel census

147
09:13.620 --> 09:17.220
and I've now got a new table with the data that I'm interested in.
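
NOTE
A sketch of the final step, writing the DataFrame out with to_csv, using placeholder counts.
Assumption: the file is written into a temporary folder so the sketch leaves nothing behind; in the project you would pass "squirrel_count.csv" directly.
By default to_csv also writes the row index as an unnamed first column.
```python
import os
import tempfile
import pandas
# Placeholder counts; the real values come from the census data.
df = pandas.DataFrame({
    "Fur Color": ["Gray", "Cinnamon", "Black"],
    "Count": [3, 2, 1],
})
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "squirrel_count.csv")
    # Write the table to disk, then read it back to inspect the result.
    df.to_csv(path)
    with open(path) as file:
        contents = file.read()
print(contents)
```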

148
09:17.790 --> 09:21.780
So did you manage to complete this challenge? If you found it tricky

149
09:21.780 --> 09:25.950
working with the data frames and figuring out how to get hold of the columns or

150
09:25.950 --> 09:29.490
how to get hold of the rows depending on the conditions we're interested

151
09:29.490 --> 09:33.360
in, then I strongly recommend that you just head back to the last lesson,

152
09:33.630 --> 09:37.830
just try to write out the code that we're doing in each step of the video

153
09:37.830 --> 09:38.663
yourself,

154
09:38.790 --> 09:41.910
just to make sure that you're a hundred percent sure with what's going on.

155
09:42.630 --> 09:44.820
Once you are ready, head over to the next lesson,

156
09:44.880 --> 09:49.350
where we're going to finally build our US states game.

157
09:50.030 --> 09:52.410
For all of that and more, I'll see you on the next lesson.