1
00:00:11,670 --> 00:00:16,440
In this lecture, we are going to look at the notebook that implements a feedforward neural network for

2
00:00:16,440 --> 00:00:20,200
image classification on the MNIST dataset.

3
00:00:20,220 --> 00:00:23,330
This lecture is going to walk you through a prepared Colab notebook.

4
00:00:23,610 --> 00:00:29,430
However, a very good exercise that I always recommend, once you know how this is done, is to try to

5
00:00:29,430 --> 00:00:35,820
recreate it yourself with as few references as possible. As usual, you can look at the title of the notebook

6
00:00:36,090 --> 00:00:38,550
to determine what notebook we are currently looking at.

7
00:00:39,870 --> 00:00:44,100
So, as usual, we're going to start by importing PyTorch, NumPy, and Matplotlib.

8
00:00:44,730 --> 00:00:50,250
We'll also be using the torchvision library, which includes the MNIST dataset and utility functions for

9
00:00:50,250 --> 00:00:57,690
handling images.

10
00:00:58,070 --> 00:01:02,810
We'll start by downloading the train data, which happens when we call the

11
00:01:02,810 --> 00:01:05,550
torchvision.datasets.MNIST function for the first time.

12
00:01:05,690 --> 00:01:09,020
Let's go through each of these arguments one by one so you understand them.

13
00:01:09,860 --> 00:01:15,280
First, we have the root argument, where we specify the file path we want to download the data to.

14
00:01:15,470 --> 00:01:19,750
In this case we're going to download the data to the local directory.

15
00:01:19,790 --> 00:01:25,850
Next, we set train equal to True, which indicates that this function will return the train dataset.

16
00:01:25,860 --> 00:01:32,570
Next, we set transform to transforms.ToTensor, which is an operation from the torchvision library

17
00:01:32,900 --> 00:01:35,320
that does some useful preprocessing for us.

18
00:01:36,050 --> 00:01:40,760
This way everything we want to do to the data happens inside this function.

19
00:01:40,760 --> 00:01:44,810
We'll see later how we can make even more sophisticated use of the transform argument.

20
00:01:45,920 --> 00:01:52,620
Finally, we have the download argument, which tells PyTorch to download the data. From the printout,

21
00:01:52,650 --> 00:01:55,890
we can see that, in fact, four files were downloaded.

22
00:01:55,890 --> 00:01:59,840
If you read it carefully, you can deduce that these are the train inputs,

23
00:01:59,850 --> 00:02:03,660
the train labels, the test inputs, and the test labels.

24
00:02:03,660 --> 00:02:12,570
You can also see that they've been downloaded to the folder MNIST/raw.

25
00:02:12,700 --> 00:02:18,570
Next, we're going to print out the data attribute of the train dataset, which, as we know, represents the

26
00:02:18,570 --> 00:02:20,130
input data.

27
00:02:20,130 --> 00:02:23,990
Really what we want to do is make sure what we're seeing makes sense.

28
00:02:24,000 --> 00:02:33,550
So what we can see is that this appears to be a three-dimensional array containing a bunch of zeros.

29
00:02:33,800 --> 00:02:37,180
It also has dtype equal to torch.uint8.

30
00:02:37,490 --> 00:02:39,170
Does this make sense?

31
00:02:39,200 --> 00:02:40,840
In fact it does.

32
00:02:40,850 --> 00:02:46,280
There are, of course, some non-zero values in the dataset, but we can't see them because a large percentage

33
00:02:46,280 --> 00:02:49,730
of each image in the MNIST dataset is black.

34
00:02:49,730 --> 00:02:55,820
Only the digits themselves are non-black, and those take up a small part of the image, so it's not surprising

35
00:02:55,820 --> 00:03:02,900
that, if we print out the values near the edges of the image, they would all be zero.

36
00:03:03,200 --> 00:03:13,270
We can verify this by checking the maximum value in the tensor, which is, as expected, 255.

37
00:03:13,740 --> 00:03:19,280
Next, we can check the shape of the tensor, which is 60,000 by 28 by 28,

38
00:03:19,320 --> 00:03:25,840
as I mentioned earlier. Finally, we can print the targets attribute, which returns a one-dimensional

39
00:03:25,840 --> 00:03:29,610
tensor of integers, which appear to be between 0 and 9,

40
00:03:29,620 --> 00:03:30,400
as expected.

41
00:03:36,060 --> 00:03:40,670
Next, we're going to call the MNIST function again, but this time we want the test dataset.

42
00:03:40,920 --> 00:03:43,860
So we need to pass in train equal to False.

43
00:03:43,860 --> 00:03:46,470
Notice how this time nothing is downloaded.

44
00:03:46,470 --> 00:03:52,030
That's because the four files we needed to download were already downloaded in the previous step.

45
00:03:52,200 --> 00:03:57,160
When we check the shape of the input data, we see that it's 10,000 by 28 by 28,

46
00:03:57,360 --> 00:03:58,650
as promised.

47
00:03:58,650 --> 00:04:02,850
So there are 60,000 training samples and 10,000 test samples.

48
00:04:07,940 --> 00:04:09,270
Now that we have our data,

49
00:04:09,290 --> 00:04:11,070
it's time to build our model.

50
00:04:11,450 --> 00:04:15,600
As you can see it's exactly as I've specified in the earlier lectures.

51
00:04:15,860 --> 00:04:21,530
We have a linear layer, followed by a ReLU, followed by another linear layer, all wrapped in a Sequential,

52
00:04:22,190 --> 00:04:27,500
and there is no need for a softmax since, as we know, it's been combined with the cross-entropy loss for numerical

53
00:04:27,500 --> 00:04:28,100
stability.
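As a rough sketch of the model just described: the hidden size of 128 is an assumption for illustration, not a value stated in the lecture. The input size of 784 is 28 times 28 pixels, and there are 10 output classes.

```python
import torch
import torch.nn as nn

# Linear -> ReLU -> Linear, wrapped in a Sequential.
# The hidden size (128) is an assumed value, not taken from the notebook.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # outputs raw logits; no softmax layer
)

# nn.CrossEntropyLoss expects logits and applies log-softmax internally.
criterion = nn.CrossEntropyLoss()

# Quick shape check on a fake batch of 4 flattened images.
logits = model(torch.randn(4, 784))
print(logits.shape)  # torch.Size([4, 10])
```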

54
00:04:32,100 --> 00:04:34,350
The next step is new. Since,

55
00:04:34,350 --> 00:04:38,060
from this point forward, we'll be looking at larger and larger datasets,

56
00:04:38,070 --> 00:04:44,160
there is a need to make use of the GPU. We know that GPUs are useful for speeding up deep learning

57
00:04:44,490 --> 00:04:50,010
because, while they were originally built for gaming, a lot of the matrix algebra that happens in physics

58
00:04:50,010 --> 00:04:56,820
engines is the same as the matrix algebra that happens in deep learning. So we can make use of gaming

59
00:04:56,820 --> 00:05:04,620
technology to speed up deep learning computations.

60
00:05:04,640 --> 00:05:09,970
So basically, if you have a GPU available, it will be called cuda:0.

61
00:05:09,980 --> 00:05:15,510
So what we're doing here is checking if that string is in the list of available devices.

62
00:05:15,740 --> 00:05:23,200
If it is, we'll set that to be the device, and if it's not, we'll set the device to be the string cpu. Next,

63
00:05:23,270 --> 00:05:27,100
and this is important, we call model.to(device).

64
00:05:27,200 --> 00:05:33,800
This transfers all the parameters of our model to the GPU. You can picture your RAM and your GPU

65
00:05:34,100 --> 00:05:40,940
as being two different physical spaces, which they are, and in order to do any computation, like multiplying

66
00:05:40,940 --> 00:05:45,610
or adding, all the numbers we want to compute on have to be on the same device.

67
00:05:46,130 --> 00:05:51,650
So whenever we want to calculate the output of our model, both the model's parameters and the input

68
00:05:51,650 --> 00:05:54,290
data have to be on either the GPU

69
00:05:54,290 --> 00:05:59,480
or the main RAM, but we can't have one set of numbers on one device and another set of numbers

70
00:05:59,480 --> 00:06:00,500
on the other device.
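A minimal sketch of the device selection and transfer described here. The single linear layer is only a stand-in for the real model, and using torch.cuda.is_available() for the check is an assumption about how the notebook does it.

```python
import torch
import torch.nn as nn

# Use the first GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

# Stand-in model for illustration; move its parameters to the chosen device.
model = nn.Linear(784, 10)
model.to(device)

# The inputs must live on the same device as the parameters.
x = torch.randn(2, 784).to(device)
out = model(x)
print(out.device)
```

If x were left on the CPU while the model sat on the GPU, the forward pass would raise a device mismatch error.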

71
00:06:05,620 --> 00:06:09,900
Next, we're going to set our loss and optimizer, which you've already seen.

72
00:06:09,910 --> 00:06:14,770
Note that this is our new loss, the cross-entropy loss, meant for multiple categories.

73
00:06:20,470 --> 00:06:25,330
Next, there's another new thing: we're going to create generators, which will allow us to loop through

74
00:06:25,330 --> 00:06:29,100
each batch of data as we iterate through each epoch.

75
00:06:29,110 --> 00:06:35,650
These are called DataLoaders in PyTorch. For the input arguments, we specify the data arrays, which

76
00:06:35,650 --> 00:06:42,010
are the train_dataset and test_dataset variables, we specify the batch size, and we specify whether

77
00:06:42,010 --> 00:06:48,940
or not to shuffle the data. You'll notice that we shuffle the training data but not the test data.

78
00:06:48,940 --> 00:06:54,900
This is because, for the training data, if we loop through each sample in the same order each time, this

79
00:06:54,910 --> 00:06:59,110
will introduce unwanted correlations, which can decrease performance.

80
00:06:59,110 --> 00:07:04,360
Think of our example of measuring the average height of everyone in the world or, more practically, drug

81
00:07:04,360 --> 00:07:05,420
testing.

82
00:07:05,500 --> 00:07:10,120
We always want our sample to be random. For the test data,

83
00:07:10,140 --> 00:07:19,710
there is no need to shuffle it, because all we want to do is evaluate the loss and accuracy.
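Here is a small sketch of the two loaders. To keep it self-contained, it uses random tensors wrapped in a TensorDataset as a stand-in for MNIST, and the batch size of 32 is an assumed value.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 100 fake 28x28 "images" with labels 0..9.
inputs = torch.rand(100, 28, 28)
targets = torch.randint(0, 10, (100,))
train_dataset = TensorDataset(inputs[:80], targets[:80])
test_dataset = TensorDataset(inputs[80:], targets[80:])

batch_size = 32  # assumed value for illustration

# Shuffle the training data each epoch, but not the test data.
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# A data loader is iterable, so a plain for loop works.
for x, y in train_loader:
    print(x.shape, y.shape)
```

Note that the last batch is smaller than the rest when the dataset size is not a multiple of the batch size.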

84
00:07:19,720 --> 00:07:24,910
Next we have a little snippet of code just to test how the data loader works.

85
00:07:24,910 --> 00:07:29,020
As mentioned, it's a generator, so we can do a for loop over it.

86
00:07:29,020 --> 00:07:34,180
For this data loader I'm going to set the batch size to 1 and I'm going to print out not just the shape

87
00:07:34,180 --> 00:07:38,740
of the data we get on each iteration but also the data itself.

88
00:07:38,740 --> 00:07:40,050
We can see something interesting

89
00:07:40,060 --> 00:07:47,960
if we look at the data, which is that the data now ranges from 0 to 1 instead of from 0 to 255.

90
00:07:48,450 --> 00:07:53,700
This is kind of unintuitive at first, because you might think: why was it that, when we printed the data

91
00:07:53,730 --> 00:08:01,170
earlier, it was 0 to 255, but only when we use the data loader does it become 0 to 1?

92
00:08:01,170 --> 00:08:05,550
This seems strange because it seems like the data loader object should be generic.

93
00:08:05,550 --> 00:08:09,510
It can generate data from any kind of data set that you pass in.

94
00:08:09,690 --> 00:08:15,480
In fact, the normalization comes from the ToTensor function we passed in earlier to the MNIST function,

95
00:08:16,230 --> 00:08:28,840
although we didn't see it at the time, because it does not get called when you just access the data attribute.

96
00:08:29,130 --> 00:08:35,080
So in the next line we can see what happens if we call the transform function manually on the data.

97
00:08:35,220 --> 00:08:42,290
The max is 1, as expected. So this transform function is applied when the data loader fetches samples from the dataset, but

98
00:08:42,290 --> 00:08:44,870
it's not called when you just do train_dataset.data.
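To make the scaling concrete, this sketch mimics the effect with plain tensor arithmetic rather than calling ToTensor itself: for uint8 images, the result is a float tensor divided by 255. The tiny 1x3 "image" is made up for illustration.

```python
import torch

# Raw pixel values as stored in the dataset: unsigned 8-bit integers.
raw = torch.tensor([[0, 128, 255]], dtype=torch.uint8)

# What the transform effectively does to the values: float conversion and
# division by 255, so the range becomes 0 to 1.
scaled = raw.float() / 255

print(raw.max().item())     # 255
print(scaled.max().item())  # 1.0
```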

99
00:08:54,690 --> 00:08:57,120
Next, we're going to train our model.

100
00:08:57,120 --> 00:09:02,070
You'll notice that we only need 10 epochs to train this model, which is a lot fewer than before.

101
00:09:02,100 --> 00:09:03,410
Why might that be?

102
00:09:03,780 --> 00:09:10,560
If you think about it, this goes back to the theory behind batch gradient descent. As you recall, a batch

103
00:09:10,620 --> 00:09:14,210
is a representative sample of the entire dataset.

104
00:09:14,250 --> 00:09:20,040
In other words, doing batch gradient descent on a single batch is like doing gradient descent on

105
00:09:20,040 --> 00:09:27,720
the entire dataset, but much faster, since we're only looking at a small subset of the data. And keep in

106
00:09:27,720 --> 00:09:33,420
mind that the total number of iterations is still very high, because we have 60,000 data points

107
00:09:33,420 --> 00:09:34,780
to loop over.

108
00:09:34,860 --> 00:09:40,740
In fact, some datasets are so large that we only do a single epoch over the entire dataset.

109
00:09:41,100 --> 00:09:44,190
So that's why the total number of epochs is small here.

110
00:09:44,190 --> 00:09:51,040
It looks small but we're actually still doing a pretty large number of gradient descent steps in total.

111
00:09:51,130 --> 00:09:55,350
Now I hope that you recognize most of the elements of this loop at this point.

112
00:09:55,450 --> 00:10:00,160
As mentioned we'll be looking at more or less the same thing for each example throughout the course.

113
00:10:00,160 --> 00:10:02,030
So what's different?

114
00:10:02,050 --> 00:10:07,480
First, we have two nested loops: one over the epochs and one over the train loader.

115
00:10:07,480 --> 00:10:13,380
Before we go into the inner loop we first need to create an empty list to store the train loss.

116
00:10:13,600 --> 00:10:18,490
We want the loss per epoch but what we actually get is the loss per batch.

117
00:10:18,490 --> 00:10:21,940
So let's just store the losses for each batch and then deal with them at the end.

118
00:10:22,720 --> 00:10:28,890
Alternatively, you could just plot the loss per batch instead of the loss per epoch.

119
00:10:28,930 --> 00:10:32,110
Next we enter the inner loop through the train loader.

120
00:10:32,110 --> 00:10:38,650
The first thing we do in this loop is to transfer the data to the GPU. As you recall, our model parameters

121
00:10:38,650 --> 00:10:40,200
are already on the GPU,

122
00:10:40,390 --> 00:10:47,660
and if we want to do any computation between the model and the data, both have to be on the GPU. Next,

123
00:10:47,660 --> 00:10:54,540
we're going to reshape the data to be N by D, where D is 784 and N is the batch size.

124
00:10:54,560 --> 00:11:00,710
Note that, for flexibility, we can specify the first argument as -1, which tells PyTorch to just

125
00:11:00,710 --> 00:11:04,530
assign whatever value is appropriate for the data we're given.

126
00:11:04,610 --> 00:11:06,220
This works the same in NumPy,

127
00:11:06,230 --> 00:11:07,340
if you want to try it out.
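A quick sketch of the reshape with -1, using zero tensors as stand-ins for batches of images. The batch sizes of 32 and 16 are arbitrary.

```python
import torch

# A full batch of 32 images, each 28 x 28.
batch = torch.zeros(32, 28, 28)
flat = batch.view(-1, 784)  # -1 lets PyTorch infer the batch dimension
print(flat.shape)  # torch.Size([32, 784])

# The same call works unchanged for a smaller final batch.
last_batch = torch.zeros(16, 28, 28)
flat_last = last_batch.view(-1, 784)
print(flat_last.shape)  # torch.Size([16, 784])
```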

128
00:11:13,800 --> 00:11:18,260
Next, we have all the usual steps to perform one step of gradient descent.

129
00:11:18,570 --> 00:11:24,420
At the end when we get a loss we append it to our list of losses for this particular epoch.

130
00:11:24,540 --> 00:11:30,330
Once we're outside the inner loop, we simply take the mean of this list to be the train loss for that epoch.

131
00:11:35,980 --> 00:11:36,630
As noted,

132
00:11:36,640 --> 00:11:38,060
this is a little misleading.

133
00:11:38,650 --> 00:11:43,300
Technically, it's not the loss for the epoch, because we have been training the entire time we've been

134
00:11:43,300 --> 00:11:49,060
looping through this epoch, so the losses we've collected represent a number of different states of our

135
00:11:49,060 --> 00:11:50,260
model.

136
00:11:50,290 --> 00:11:55,510
Nonetheless, it would be inefficient to have to loop through the data twice just to calculate the loss,

137
00:11:55,870 --> 00:11:58,120
so we don't really want to do that if we don't need to.
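The training loop described above can be sketched roughly as follows. This is not the notebook's exact code: the data is random stand-in data, and the layer sizes, batch size, number of epochs, and the choice of Adam as optimizer are all assumptions made so the example runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for the MNIST training set.
dataset = TensorDataset(torch.rand(64, 28, 28), torch.randint(0, 10, (64,)))
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Small model; sizes and optimizer are assumed, not taken from the notebook.
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

n_epochs = 2
train_losses = []

for epoch in range(n_epochs):
    batch_losses = []  # loss per batch within this epoch
    for inputs, targets in train_loader:
        # Move the batch to the same device as the model, then flatten it.
        inputs, targets = inputs.to(device), targets.to(device)
        inputs = inputs.view(-1, 784)

        # One step of gradient descent.
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        batch_losses.append(loss.item())

    # Mean of the per-batch losses stands in for the epoch loss.
    train_losses.append(sum(batch_losses) / len(batch_losses))
    print(f"Epoch {epoch + 1}/{n_epochs}, train loss: {train_losses[-1]:.4f}")
```

Because the weights change after every batch, each entry in batch_losses comes from a slightly different model, which is the caveat mentioned above.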

138
00:12:01,110 --> 00:12:05,160
Next, we calculate the test loss, which goes through a similar loop.

139
00:12:05,160 --> 00:12:09,810
Remember that we can't calculate the test loss all at once, because it may not fit into memory

140
00:12:09,810 --> 00:12:15,690
if our dataset is too large. This loop has all the same steps as the train loop, except for the backward

141
00:12:15,690 --> 00:12:17,100
and optimizer steps.

142
00:12:21,580 --> 00:12:25,950
Finally at the end of the outer loop we store the losses we collected and print them out.

143
00:12:31,800 --> 00:12:42,180
As usual we would like to see the loss per iteration which is what we have next.

144
00:12:42,200 --> 00:12:48,080
The next step is to calculate the accuracy. Again, having a data loader instead of just a plain array

145
00:12:48,140 --> 00:12:49,670
makes things more difficult.

146
00:12:49,730 --> 00:12:54,100
So we have to loop through the data rather than just do a single computation.

147
00:12:54,140 --> 00:12:59,480
We start by initializing the counts for the number correct and the number total to zero.

148
00:12:59,480 --> 00:13:03,290
Then we loop through the data loader. Inside the loop,

149
00:13:03,290 --> 00:13:11,570
we transfer the data to the GPU, reshape the inputs, and then get the outputs. These outputs are logits

150
00:13:11,650 --> 00:13:16,990
and not probabilities, but we can still take the max to get the prediction.

151
00:13:16,990 --> 00:13:23,620
We call torch.max over axis 1, which means that we take the max over the columns, and this returns

152
00:13:23,620 --> 00:13:28,330
both the maximum value in each row and the corresponding indices.

153
00:13:28,420 --> 00:13:34,320
We only want the indices, since those correspond to the classes. So the first thing returned from this

154
00:13:34,320 --> 00:13:36,450
function would be the maximum value.

155
00:13:36,630 --> 00:13:45,490
And the second thing returned from this function would be the index.
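A tiny example of the two return values, using made-up logits for two samples and three classes:

```python
import torch

# Made-up logits: 2 samples (rows), 3 classes (columns).
logits = torch.tensor([[0.1, 2.0, -1.0],
                       [3.0, 0.5, 0.2]])

# Max over axis 1 returns (maximum values, indices of those maxima).
values, predictions = torch.max(logits, 1)
print(values)       # tensor([2., 3.])
print(predictions)  # tensor([1, 0])
```

Since softmax preserves the ordering of the logits, the index of the largest logit is also the index of the largest probability.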

156
00:13:45,580 --> 00:13:51,940
Next, we check how many of the predictions are equal to the targets and add that to n_correct. Since

157
00:13:51,940 --> 00:13:53,320
this is all in PyTorch land,

158
00:13:53,320 --> 00:13:59,410
we have to call .item() to bring it back to Python land. In order to get the number of samples,

159
00:13:59,440 --> 00:14:03,760
that's just the shape of the targets at the zeroth index.

160
00:14:03,790 --> 00:14:09,430
Finally, when we're outside the loop, the accuracy is just the number correct divided by the number total,

161
00:14:09,520 --> 00:14:10,210
as you've seen.
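The counting logic can be sketched on a toy example. Here the "model" is an identity layer and the inputs are already logits, which is purely a stand-in so the result is easy to check by hand; the names n_correct and n_total are assumed.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: the inputs are already logits for a 2-class problem.
logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.4]])
labels = torch.tensor([0, 1, 1, 1])
loader = DataLoader(TensorDataset(logits, labels), batch_size=2)

model = torch.nn.Identity()  # stand-in model for illustration

n_correct = 0
n_total = 0
for inputs, targets in loader:
    outputs = model(inputs)
    _, predictions = torch.max(outputs, 1)

    # .item() converts the one-element tensor to a Python number.
    n_correct += (predictions == targets).sum().item()
    n_total += targets.shape[0]

accuracy = n_correct / n_total
print(accuracy)  # 0.75
```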

162
00:14:16,060 --> 00:14:21,490
Next, we do the same thing for the test accuracy, and when we're done, we print the train accuracy and

163
00:14:21,490 --> 00:14:22,510
the test accuracy.

164
00:14:25,530 --> 00:14:27,320
As you can see, we do pretty well.

165
00:14:34,420 --> 00:14:37,690
Next we have some code to plot a confusion matrix.

166
00:14:37,690 --> 00:14:39,700
This code is outside the scope of this course.

167
00:14:39,730 --> 00:14:44,890
So you don't have to worry about understanding it. Just knowing what a confusion matrix is is what I

168
00:14:44,890 --> 00:14:46,660
want you to focus on.

169
00:14:46,720 --> 00:14:52,810
So the idea is we want to draw a table with the true labels on one axis and the predicted labels on

170
00:14:52,810 --> 00:14:53,740
the other axis.

171
00:15:01,360 --> 00:15:06,390
So in this next block of code, we have a loop similar to the above where we make predictions, but now

172
00:15:06,390 --> 00:15:14,490
we just store them in NumPy arrays instead of calculating the accuracy. So you see here we use concatenate

173
00:15:14,520 --> 00:15:15,450
rather than summing.

174
00:15:18,720 --> 00:15:21,180
And at the end, we call our confusion matrix function.

175
00:15:27,000 --> 00:15:34,750
So basically, the confusion matrix tells us, for each true label, how many samples were assigned each predicted label.

176
00:15:34,770 --> 00:15:40,260
Hopefully, we expect to find most of the entries along the diagonal, where the label is equal

177
00:15:40,260 --> 00:15:41,550
to the prediction.

178
00:15:41,880 --> 00:15:48,060
Of course, since we don't have 100 percent accuracy, there are a few entries not on the diagonal.
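To make the idea concrete, here is a hand-rolled confusion matrix on made-up labels. This is a sketch of the concept only; it is not the plotting code from the notebook.

```python
import numpy as np

# Made-up true labels and predictions for five samples.
y_true = np.array([4, 4, 9, 9, 4])
y_pred = np.array([4, 9, 9, 9, 4])

n_classes = 10
cm = np.zeros((n_classes, n_classes), dtype=int)

# Rows are true labels, columns are predicted labels.
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm[4, 4], cm[4, 9], cm[9, 9])  # 2 1 2
```

The entry cm[4, 9] counts the 4s that were predicted as 9s, which is exactly the kind of off-diagonal entry worth inspecting.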

179
00:15:48,060 --> 00:15:52,530
Importantly, however, it's good to try and see if we can make sense of these results.

180
00:15:52,590 --> 00:15:55,340
They're very interpretable because we're working with images.

181
00:15:55,380 --> 00:15:58,800
And when you have images it's very easy to simply look at them.

182
00:15:59,550 --> 00:16:02,760
So where do we have the highest inaccuracy?

183
00:16:02,760 --> 00:16:07,110
It seems to be where the true label is 4 and the predicted label is 9.

184
00:16:10,000 --> 00:16:11,720
This actually makes sense.

185
00:16:11,920 --> 00:16:19,220
If you write down a 4 and a 9, the connection points and the main lines all appear in approximately the same spot.

186
00:16:19,400 --> 00:16:24,360
You can imagine that it would be easy for a human observer to make the same mistake.

187
00:16:24,500 --> 00:16:29,900
We can look for more instances of a large number of predictions being incorrect, such as a 3 being

188
00:16:29,900 --> 00:16:33,600
confused with an 8 and a 2 being confused with an 8.

189
00:16:33,650 --> 00:16:41,290
You should be able to recognize why these mistakes are being made.

190
00:16:41,410 --> 00:16:47,460
The next thing we do is actually plot some of these misclassified samples to confirm our suspicions.

191
00:16:47,560 --> 00:16:52,720
Perhaps the misclassified 4s really do look like 9s. In order to find this out,

192
00:16:52,750 --> 00:16:58,120
we're going to select a random misclassified sample and plot the image. To do this,

193
00:16:58,120 --> 00:17:01,870
first we have to gather the indices of all the misclassified samples.

194
00:17:02,140 --> 00:17:09,360
We can do this using the NumPy where function. The where function simply returns the index values where

195
00:17:09,360 --> 00:17:16,260
the input array is true, so we can pass in the boolean array p_test not equal to y_test.

196
00:17:16,260 --> 00:17:20,960
Note that the where function returns the results in a tuple, but we only care about the first element.

197
00:17:21,150 --> 00:17:24,450
So that's why we index with [0].

198
00:17:24,470 --> 00:17:30,050
Next, we're going to use np.random.choice to select one of these misclassified indices at random

199
00:17:30,380 --> 00:17:32,750
and assign it to the variable i.
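These two steps can be tried on a small made-up example. The arrays below are invented for illustration; in the lecture, y_test and p_test hold the true and predicted labels for the whole test set.

```python
import numpy as np

# Made-up true labels and predictions.
y_test = np.array([4, 5, 2, 7, 3])
p_test = np.array([9, 5, 8, 7, 3])

# np.where returns a tuple of index arrays; take the first element.
misclassified_idx = np.where(p_test != y_test)[0]
print(misclassified_idx)  # [0 2]

# Pick one misclassified index at random.
i = np.random.choice(misclassified_idx)
print(f"True label: {y_test[i]}, predicted: {p_test[i]}")
```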

200
00:17:33,260 --> 00:17:34,600
So let's check out the results.

201
00:17:37,640 --> 00:17:40,170
So here's a 5 predicted as a 9.

202
00:17:40,310 --> 00:17:42,380
That kind of makes sense, because this looks like a 9.

203
00:17:45,800 --> 00:17:47,840
Here's a 4 predicted as a 9.

204
00:17:48,070 --> 00:17:49,810
As you can see, it kind of looks like a 9.

205
00:17:53,830 --> 00:17:55,270
Here's a 2 predicted as an 8.

206
00:17:59,470 --> 00:18:00,970
Here's a 3 predicted as a 9.

207
00:18:04,270 --> 00:18:05,760
Here's a 5 predicted as a 3.

208
00:18:09,470 --> 00:18:11,320
Here's a 7 predicted as a 2.

209
00:18:11,340 --> 00:18:12,810
So this one actually looks like a 2.

210
00:18:15,580 --> 00:18:21,580
So I think, overall, what we're seeing makes sense: digits that confuse the computer are the same as

211
00:18:21,580 --> 00:18:23,170
digits that would confuse us.