1
00:00:11,580 --> 00:00:16,690
At this point, we've done most of the legwork to describe what we are about to implement in code.

2
00:00:17,280 --> 00:00:22,020
Now we just have to discuss a little more about what the code is going to look like and then we can

3
00:00:22,020 --> 00:00:23,230
look at the actual code.

4
00:00:23,940 --> 00:00:29,290
So let's again recap all the usual steps in the context of our data and our new model.

5
00:00:30,090 --> 00:00:36,390
So step number one is to load in the data. For the coming example, we'll be looking at the classic MNIST

6
00:00:36,390 --> 00:00:40,080
data set, which is a data set containing handwritten digits.

7
00:00:40,950 --> 00:00:47,040
Our job is to accept as input an image of the digit and then classify what the digit is.

8
00:00:47,580 --> 00:00:50,760
Since there are ten possible digits, zero up to nine,

9
00:00:51,120 --> 00:00:53,550
this is a multiclass classification problem.

10
00:00:54,950 --> 00:01:00,230
Also note that even though these data are technically images, they don't come from actual image files

11
00:01:00,230 --> 00:01:07,220
like JPEG or PNG. Instead, the MNIST data set is actually already included in the torchvision

12
00:01:07,220 --> 00:01:10,350
library, which makes things a little bit more convenient for us.

13
00:01:11,180 --> 00:01:15,180
Also, they are grayscale images, so we don't have to worry about color right now.

14
00:01:15,980 --> 00:01:19,430
Later in the course, we'll look at how to load in actual image files.

15
00:01:22,160 --> 00:01:28,310
Step number two is to build our model. As per this section of the course, the model is a feedforward

16
00:01:28,310 --> 00:01:29,090
neural network.

17
00:01:29,660 --> 00:01:35,810
Basically, it's going to be multiclass logistic regression with a few more layers before it, all chained

18
00:01:35,810 --> 00:01:36,920
together sequentially.

19
00:01:38,090 --> 00:01:39,920
Step number three is to train the model.

20
00:01:40,610 --> 00:01:43,790
Now, this is the beauty of deep learning and deep learning frameworks.

21
00:01:44,180 --> 00:01:46,220
There is zero additional work to do here.

22
00:01:46,940 --> 00:01:52,280
All we have to do is pretty much the same training loop we have already seen, although there is one

23
00:01:52,280 --> 00:01:57,620
small modification we are going to make in order to improve the training process called batch gradient

24
00:01:57,620 --> 00:01:58,110
descent.

25
00:01:58,490 --> 00:02:00,740
We'll see what that is later on in this lecture.

26
00:02:02,180 --> 00:02:04,520
Step number four is to evaluate the model.

27
00:02:05,060 --> 00:02:10,070
Again, this is exactly the same. You could say that these steps are model-agnostic.

28
00:02:10,110 --> 00:02:12,800
They don't care what the actual form of the model is.

29
00:02:13,640 --> 00:02:16,740
Step number five is to use the model to make predictions.

30
00:02:17,270 --> 00:02:21,920
We'll do something interesting in this section, which will be to look at which predictions the neural

31
00:02:21,920 --> 00:02:23,140
network is getting wrong.

32
00:02:23,840 --> 00:02:27,800
This might give us some insight on why the neural network is doing what it's doing.

33
00:02:32,900 --> 00:02:38,210
So let's start by breaking down step number one. As mentioned, this data is going to come from the

34
00:02:38,450 --> 00:02:43,940
PyTorch library itself, which will keep things simple for us. Actually, to be more specific,

35
00:02:44,120 --> 00:02:49,220
it comes from a sister library called torchvision, which you normally have to install separately.

36
00:02:49,820 --> 00:02:54,380
But since everything is installed for you in Google Colab, we won't need to worry about that.

37
00:02:55,160 --> 00:03:00,410
torchvision specializes in providing functionality for, you guessed it, machine vision.

38
00:03:00,980 --> 00:03:05,840
This includes both helper functions and data sets, of which we'll see multiple in this course.

39
00:03:06,440 --> 00:03:11,270
Of course, one of these data sets is the famous MNIST data set, which is a standard machine learning

40
00:03:11,270 --> 00:03:13,200
benchmark of handwritten digits.

41
00:03:13,820 --> 00:03:15,790
This is an image classification problem.

42
00:03:15,800 --> 00:03:19,820
So the data are images and the targets are what the image is of.

43
00:03:20,450 --> 00:03:27,140
Each image is the same size, 28 by 28 pixels, which is equal to 784 pixels in

44
00:03:27,140 --> 00:03:27,580
total.

45
00:03:28,400 --> 00:03:31,430
The MNIST data set contains only grayscale images.

46
00:03:31,730 --> 00:03:37,670
So the images are just of size 28 by 28 and not 28 by 28 by 3, which is what

47
00:03:37,700 --> 00:03:39,320
they would be if they had color.

48
00:03:40,640 --> 00:03:42,540
A grayscale image is pretty basic.

49
00:03:42,680 --> 00:03:48,920
Just imagine it like a matrix of numbers. The number in each element of the matrix tells us how dark

50
00:03:48,920 --> 00:03:50,900
or how light that pixel should be.

51
00:03:51,560 --> 00:03:56,410
So if you have a very small number close to zero, you'll get a very dark color like black.

52
00:03:56,930 --> 00:04:02,270
If you have a very large number, close to 255, you'll get a very light color close to white.

53
00:04:07,260 --> 00:04:13,020
When we call the function torchvision.datasets.MNIST and pass in train equal to True, we get

54
00:04:13,020 --> 00:04:17,100
back a training data set object, which is similar to what we saw in scikit-learn.

55
00:04:17,850 --> 00:04:23,910
Specifically, if we call the data attribute, we get the input data, which are the images, and if

56
00:04:23,910 --> 00:04:29,130
we call the targets attribute, we get the targets, which are the labels telling us what digit the

57
00:04:29,130 --> 00:04:30,720
corresponding image is of.

58
00:04:31,350 --> 00:04:37,080
We can think of train_dataset.data as X train and train_dataset.targets as Y train.

59
00:04:37,650 --> 00:04:43,290
So X train is of shape N by 28 by 28, whereas Y train is a 1-D vector of length

60
00:04:43,290 --> 00:04:46,680
N. In our case, N is equal to 60,000.

61
00:04:51,690 --> 00:04:57,480
Correspondingly, if we call the same function, torchvision.datasets.MNIST, but we pass in

62
00:04:57,480 --> 00:05:02,070
train equal to False, then we'll get back the test data set. As before,

63
00:05:02,070 --> 00:05:09,480
we can call test_dataset.data to get X test and we can call test_dataset.targets to get Y test.

64
00:05:10,140 --> 00:05:18,030
X test is of size N test by 28 by 28 and Y test is a 1-D vector of length N test, and in our case

65
00:05:18,210 --> 00:05:19,890
N test is 10,000.

66
00:05:24,940 --> 00:05:31,300
As a reminder, the image pixels are stored as integers from zero to 255, so we'll have to scale them

67
00:05:31,300 --> 00:05:32,530
to go from zero to one.

68
00:05:33,220 --> 00:05:38,210
This is because we like the input data into a neural network to be in a small range, as you recall.

69
00:05:39,010 --> 00:05:44,140
In addition, the neural networks and other machine learning models that we have seen so far expect

70
00:05:44,150 --> 00:05:49,840
their inputs to be of shape N by D, which is a two-dimensional array. N is the number of samples

71
00:05:49,840 --> 00:05:51,390
and D is the number of features.

72
00:05:51,940 --> 00:05:56,920
But as you just learned, our data sets come in the shape N by 28 by 28, which

73
00:05:56,920 --> 00:05:58,330
is a three dimensional array.

74
00:05:59,080 --> 00:06:04,810
Therefore, we'll have to use the special view function in PyTorch to flatten the data into a two-dimensional

75
00:06:04,810 --> 00:06:08,590
array and treat each of the pixels as if they were an input feature.
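
As a minimal sketch of the scaling and flattening step, assuming PyTorch is installed: the tensor below is a small synthetic stand-in with the same layout as the image data (integers from 0 to 255, shape N by 28 by 28), not the real data set, and the names `images`, `scaled`, and `flat` are illustrative only.

```python
import torch

# Synthetic stand-in for the image data: 5 grayscale images of 28 x 28 pixels,
# stored as integers in 0..255 (an assumption mirroring the layout described above).
images = torch.randint(0, 256, (5, 28, 28), dtype=torch.uint8)

# Scale the pixel values to the range [0, 1].
scaled = images.float() / 255.0

# Flatten each 28 x 28 image into a vector of 784 features: N x 28 x 28 -> N x 784.
flat = scaled.view(-1, 28 * 28)

print(flat.shape)
```

Passing -1 as the first dimension lets view infer the number of samples, so the same line works for the training and test sets.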

76
00:06:09,310 --> 00:06:10,630
This is thanks to my rule.

77
00:06:10,630 --> 00:06:11,920
All data is the same.

78
00:06:12,340 --> 00:06:16,840
We don't care if this input represents a breast cancer feature vector or image pixels.

79
00:06:17,140 --> 00:06:19,420
Neural networks will work the same way regardless.

80
00:06:24,480 --> 00:06:26,800
Step number two is to instantiate the model.

81
00:06:27,450 --> 00:06:32,640
So first, let's start with the equation that represents how we get from the input to the output in

82
00:06:32,640 --> 00:06:33,450
our neural network.

83
00:06:34,140 --> 00:06:38,860
If we write it out all in one line, you can see that it follows the pattern inside to outside.

84
00:06:39,450 --> 00:06:42,470
That's just your basic order of operations for arithmetic.

85
00:06:43,170 --> 00:06:48,150
But we can split it up into intermediate variables so that we can see each step one at a time.

86
00:06:48,630 --> 00:06:51,990
So a1 is equal to W1 times x plus b1.

87
00:06:52,380 --> 00:06:54,240
z1 is equal to sigma of a1.

88
00:06:54,600 --> 00:07:01,470
a2 is equal to W2 times z1 plus b2, and then technically y is equal to softmax of a2.

89
00:07:01,800 --> 00:07:06,950
But remember that we omit this step in PyTorch because it's already included in the loss function.
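
Written out, the steps just described might look like the following. This is a reconstruction from the spoken description: the symbols W and b for the weights and biases, and the choice to write the products without transposes, are assumptions about the slide's notation.

```latex
\begin{aligned}
a_1 &= W_1 x + b_1 \\
z_1 &= \sigma(a_1) \\
a_2 &= W_2 z_1 + b_2 \\
y   &= \operatorname{softmax}(a_2)
\end{aligned}
\qquad\text{or, on one line,}\qquad
y = \operatorname{softmax}\bigl(W_2\,\sigma(W_1 x + b_1) + b_2\bigr)
```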

90
00:07:07,880 --> 00:07:13,700
One important thing to note is that even when we use an activation function such as the ReLU, we

91
00:07:13,700 --> 00:07:16,780
often still write it as sigma when we write down the math.

92
00:07:17,270 --> 00:07:19,380
So that's just something to keep in mind for the future.

93
00:07:19,400 --> 00:07:21,110
So you don't get confused when you see it.

94
00:07:26,050 --> 00:07:31,570
Basically, the important thing to remember is that each of these math equations is just an object in

95
00:07:31,570 --> 00:07:32,300
PyTorch.

96
00:07:33,190 --> 00:07:36,450
So we have nn.Linear(784, 128),

97
00:07:36,610 --> 00:07:39,730
nn.ReLU, and a final nn.Linear(128,

98
00:07:39,730 --> 00:07:40,540
10).

99
00:07:41,500 --> 00:07:44,920
As mentioned previously, ReLU is a fine default choice.

100
00:07:45,310 --> 00:07:48,660
But you might be wondering, why did I choose the value 128?

101
00:07:49,270 --> 00:07:51,330
And of course, there is no special reason.

102
00:07:51,760 --> 00:07:56,740
This is just a matter of hyperparameter selection, for which you should use trial and error as well

103
00:07:56,740 --> 00:07:59,070
as past experience to guide your decisions.

104
00:07:59,800 --> 00:08:04,580
You might also notice that, oddly, I had to specify the number 128 twice.

105
00:08:05,200 --> 00:08:10,630
This makes sense because if the output of the first layer is one twenty eight and that's the input to

106
00:08:10,630 --> 00:08:13,690
the next layer, then the next layer must also have input

107
00:08:13,690 --> 00:08:14,890
size 128.

108
00:08:16,510 --> 00:08:22,120
But it also does not make sense, in the sense that this is totally redundant. For example, in TensorFlow

109
00:08:22,120 --> 00:08:27,700
2.0 or Keras, you only have to specify this number once. And it also leads to more opportunities

110
00:08:27,700 --> 00:08:28,360
for error.

111
00:08:28,810 --> 00:08:32,070
If you specify different numbers, your program simply won't run.

112
00:08:32,800 --> 00:08:36,510
You'll notice that the same theme will apply to later sections of the course as well.

113
00:08:36,760 --> 00:08:40,630
And this gets especially complicated for convolutional neural networks.

114
00:08:41,830 --> 00:08:46,990
Finally, there are 10 outputs because there are ten classes which correspond to the 10 digits, zero

115
00:08:46,990 --> 00:08:47,650
up to nine.
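
The model described in this step could be sketched as follows, assuming PyTorch is installed. The random `batch` tensor and the deliberately mismatched `bad_model` are illustrative additions, not part of the lecture's code; the latter shows what happens when the two layer sizes disagree.

```python
import torch
import torch.nn as nn

# 784 input pixels -> 128 hidden units -> 10 class scores.
# The final softmax is omitted because the loss function applies it.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# A random batch of 32 flattened "images", used only to check shapes.
batch = torch.rand(32, 784)
logits = model(batch)
print(logits.shape)

# If the second layer's input size does not match the first layer's output
# size, the forward pass fails with a shape error.
bad_model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(64, 10))
try:
    bad_model(batch)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print(mismatch_raised)
```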

116
00:08:52,740 --> 00:08:58,350
Step number three is training the model. This will almost be the same loop as we had previously, but

117
00:08:58,350 --> 00:08:59,790
with one additional detail.

118
00:09:00,630 --> 00:09:06,330
Previously, on each epoch, we took the entire data set, made a prediction on it, and then calculated

119
00:09:06,330 --> 00:09:08,150
the loss and did gradient descent.

120
00:09:08,760 --> 00:09:13,000
But ask yourself, what if your data set is actually too large to fit into memory?

121
00:09:13,890 --> 00:09:16,350
Previously we just had a few hundred samples.

122
00:09:16,560 --> 00:09:18,380
Now we have 60,000.

123
00:09:19,230 --> 00:09:25,890
Today's data sets, such as the famous ImageNet data set, have millions of images. ImageNet is approximately

124
00:09:25,890 --> 00:09:31,620
one hundred fifty gigabytes, which can't even fit on some laptop drives, never mind fit into memory.

125
00:09:32,190 --> 00:09:34,230
So clearly we need a better solution.

126
00:09:39,310 --> 00:09:45,400
The solution is batch gradient descent, and the idea is this: let's suppose we split our data up into

127
00:09:45,400 --> 00:09:51,940
batches, then instead of having just one loop over each of the epochs, we'll have a nested loop over

128
00:09:51,940 --> 00:09:52,730
each batch.

129
00:09:53,440 --> 00:09:58,990
Then we'll do all our usual steps on that small batch rather than over the entire data set.

130
00:09:59,590 --> 00:10:05,110
That includes zeroing the gradient, calculating the outputs for that batch, calculating the loss and

131
00:10:05,110 --> 00:10:06,700
doing one step of gradient descent.

132
00:10:11,770 --> 00:10:17,650
In fact, batch gradient descent is such a fundamental part of deep learning that the PyTorch library

133
00:10:17,650 --> 00:10:23,080
comes with a DataLoader class, which comes with a batch_size argument and acts as a generator over the

134
00:10:23,080 --> 00:10:23,650
data set.

135
00:10:24,250 --> 00:10:29,800
So although in my in-depth courses I will teach you how to write a batch gradient descent loop by yourself,

136
00:10:30,130 --> 00:10:36,100
in PyTorch, we can just loop over a DataLoader object and it automatically yields batches of inputs

137
00:10:36,100 --> 00:10:36,850
and targets.

138
00:10:41,890 --> 00:10:48,100
The next question to consider is: why does this work? Why is it OK to do gradient descent over batches

139
00:10:48,280 --> 00:10:49,830
instead of the entire data set?

140
00:10:50,710 --> 00:10:51,750
Here's some intuition.

141
00:10:52,420 --> 00:10:55,810
Suppose we want to know the average height of all human beings on Earth.

142
00:10:56,440 --> 00:11:01,840
Think about why that would be an infeasible task, just like how storing all of ImageNet in memory is

143
00:11:01,840 --> 00:11:02,440
infeasible.

144
00:11:03,370 --> 00:11:06,000
Well, there are seven billion people currently on Earth.

145
00:11:06,520 --> 00:11:08,840
You can't possibly ask them all how tall they are.

146
00:11:09,610 --> 00:11:12,490
Therefore, we have to do what all scientific experiments do.

147
00:11:12,850 --> 00:11:15,240
Take a random sample from the population.

148
00:11:15,910 --> 00:11:20,610
This is what scientists do when they want to test drugs or when they want to do a psychology experiment.

149
00:11:21,100 --> 00:11:25,410
They can't possibly ask everyone in the world to participate in their experiment.

150
00:11:25,750 --> 00:11:31,240
So they have to choose just a small sample of people and generate statistics based on this sample

151
00:11:31,240 --> 00:11:37,330
only. At the same time, do we not expect that the sample represents the true population?

152
00:11:37,780 --> 00:11:39,150
The answer is usually yes.

153
00:11:39,970 --> 00:11:45,070
Do we expect that our measurements on the sample will be close to the measurements we could theoretically

154
00:11:45,070 --> 00:11:46,780
take on the entire population?

155
00:11:47,170 --> 00:11:48,370
The answer is yes.

156
00:11:49,150 --> 00:11:54,400
The same thing happens when we do batch gradient descent. Even though we are only looking at a small

157
00:11:54,400 --> 00:11:56,890
sample of the data at any given time,

158
00:11:57,220 --> 00:12:03,550
we do expect that the loss and its gradient will be representative of the entire data set, which itself

159
00:12:03,550 --> 00:12:08,140
is what we expect to be representative of the true distribution of where the data set comes from.

160
00:12:09,160 --> 00:12:14,950
Furthermore, on each epoch we still see every data point, so we don't miss anything, unlike real-world

161
00:12:14,950 --> 00:12:16,180
population samples.
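
The sampling intuition can be checked numerically with a small sketch. The "population" below is simulated with made-up parameters (a mean of 170 and a standard deviation of 10), purely for illustration; it is not real height data.

```python
import random

random.seed(0)

# Simulated population of heights (made-up parameters, for illustration only).
population = [random.gauss(170.0, 10.0) for _ in range(100_000)]
population_mean = sum(population) / len(population)

# A random sample of 1,000 plays the role of a batch.
sample = random.sample(population, 1_000)
sample_mean = sum(sample) / len(sample)

print(round(population_mean, 2), round(sample_mean, 2))
```

The sample mean lands close to the population mean even though it uses only 1% of the values, which is the same reason a batch loss is a reasonable estimate of the loss over the full data set.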

162
00:12:21,230 --> 00:12:27,110
Finally, steps four and five are to evaluate the model and make predictions using the model. As we

163
00:12:27,110 --> 00:12:32,890
just discussed, with large data sets, it may not be feasible to load in the entire data set into memory.

164
00:12:33,530 --> 00:12:38,210
Therefore, we can't just do model(inputs) and pass in the entire data set all at once.

165
00:12:39,700 --> 00:12:45,250
And so this is yet another place where we have to make use of our DataLoader objects. Due to the fact

166
00:12:45,250 --> 00:12:50,740
that our predictions come in batches, we won't just have a single array of predictions to compare to

167
00:12:50,740 --> 00:12:52,180
the targets like we did before.

168
00:12:53,950 --> 00:12:59,230
One way we can get around this is to just apply the definition of accuracy directly and calculate it

169
00:12:59,230 --> 00:12:59,950
ourselves.

170
00:13:00,520 --> 00:13:03,540
For each batch we get the predictions and the targets.

171
00:13:04,060 --> 00:13:09,280
We don't want the accuracy on this batch only, but what we can do is count up the number of predictions

172
00:13:09,280 --> 00:13:09,580
we got

173
00:13:09,580 --> 00:13:10,170
correct.

174
00:13:10,720 --> 00:13:16,240
If we tally up how many we got correct and how many samples we saw in total, then we can just divide

175
00:13:16,240 --> 00:13:19,560
the number correct by the number total once the loop is over.

176
00:13:19,660 --> 00:13:20,860
And that's our accuracy.
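
The tallying idea can be sketched without any framework. The three hand-made batches below are hypothetical model outputs paired with targets, and the prediction for each sample is taken to be the index of its largest output.

```python
def argmax(values):
    """Return the index of the largest element in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical (outputs, targets) batches, as a data loader might yield them.
batches = [
    ([[2.0, 0.1, -1.0], [0.3, 0.2, 1.5]], [0, 2]),
    ([[0.0, 1.0, 0.5], [1.2, 0.7, 0.1]], [1, 1]),
    ([[-0.5, 0.4, 0.9]], [2]),
]

n_correct = 0
n_total = 0
for outputs, targets in batches:
    predictions = [argmax(row) for row in outputs]
    # Count correct predictions in this batch and the samples seen so far.
    n_correct += sum(p == t for p, t in zip(predictions, targets))
    n_total += len(targets)

# Divide only once the loop is over.
accuracy = n_correct / n_total
print(n_correct, n_total, accuracy)  # 4 5 0.8
```

Keeping running counts rather than averaging per-batch accuracies gives the right answer even when the last batch is smaller than the others.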

177
00:13:25,820 --> 00:13:29,730
Another complication is how we actually make predictions in the first place.

178
00:13:30,350 --> 00:13:33,260
Recall that we are no longer doing binary classification.

179
00:13:33,720 --> 00:13:39,080
So it's not a matter of just checking whether the output is greater than zero or less than zero. Instead,

180
00:13:39,140 --> 00:13:40,250
the output is of size K,

181
00:13:41,000 --> 00:13:42,620
where K is the number of classes.

182
00:13:43,220 --> 00:13:49,190
As you know, these K outputs will theoretically be passed through a softmax function to give us the

183
00:13:49,190 --> 00:13:53,100
probability that the input belongs to each of these classes.

184
00:13:53,870 --> 00:13:55,860
But do we need these probabilities?

185
00:13:56,360 --> 00:13:59,520
Well, just like with binary classification, the answer is no.

186
00:14:00,320 --> 00:14:04,140
Why is the softmax function called the softmax function in the first place?

187
00:14:04,730 --> 00:14:10,130
It's because it's kind of like the max function, except that instead of just telling us which element

188
00:14:10,130 --> 00:14:13,940
is the biggest, it just weights each of the elements relative to each other.

189
00:14:14,210 --> 00:14:16,550
In other words, a softer version of the max.

190
00:14:17,710 --> 00:14:23,560
In other words, the maximum value before we do the softmax will still be the maximum value after we

191
00:14:23,560 --> 00:14:24,520
do the softmax.

192
00:14:24,820 --> 00:14:29,140
It's just that the softmax happens to map those values to probabilities.

193
00:14:30,720 --> 00:14:36,240
Therefore, we just need to pick which of the K outputs has the maximum value and not worry about

194
00:14:36,240 --> 00:14:37,650
them being probabilities or not.
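
A small framework-free sketch of why the softmax can be skipped when predicting. The `logits` values are made up, and `softmax` and `argmax` are helper functions written here for illustration.

```python
import math

def softmax(values):
    """Map a list of real numbers to probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest element in a list."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.3, -0.2, 4.1, 0.7]   # made-up outputs for K = 4 classes
probs = softmax(logits)

# The largest output stays the largest after the softmax.
print(argmax(logits), argmax(probs))  # 2 2
```

Because the exponential is an increasing function and every element is divided by the same total, the ordering of the outputs never changes.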

195
00:14:38,860 --> 00:14:44,980
In PyTorch, this is accomplished with torch.max, which returns both the maximum value and

196
00:14:44,980 --> 00:14:45,790
its index.

197
00:14:46,810 --> 00:14:52,500
And because we're doing classification, we're actually interested in the index and not the value itself.