1
00:00:05,620 --> 00:00:08,570
Hi, and welcome back to the course. In this section,

2
00:00:08,650 --> 00:00:14,650
we'll take a look at replicating, or building, the LeNet and AlexNet architectures using Keras with

3
00:00:14,650 --> 00:00:15,760
TensorFlow 2.0.

4
00:00:16,240 --> 00:00:17,390
So let's get started.

5
00:00:17,500 --> 00:00:20,440
So open this notebook here, which I've already done.

6
00:00:20,980 --> 00:00:23,710
And first, we'll start with the LeNet architecture.

7
00:00:23,920 --> 00:00:29,800
So you may remember the LeNet architecture was made famous in the late 90s, when the researchers

8
00:00:29,800 --> 00:00:35,860
trained it on the MNIST dataset that was generated by USPS and achieved some remarkable results.

9
00:00:36,220 --> 00:00:40,890
So we're going to try to implement that exact same architecture using Keras.

10
00:00:41,410 --> 00:00:44,920
So this is an overview of the architecture that the researchers used.

11
00:00:44,920 --> 00:00:47,620
You can see the layers described properly here.

12
00:00:47,620 --> 00:00:49,210
You can see it in the diagram as well.

13
00:00:49,630 --> 00:00:52,360
We have pooling and convolutions here. Now,

14
00:00:52,570 --> 00:00:59,080
one thing to note: the input the researchers used in the 90s, when this paper was published, was 32

15
00:00:59,080 --> 00:00:59,830
by 32.

16
00:00:59,860 --> 00:01:02,770
However, when we load the MNIST dataset, we're loading the one

17
00:01:02,770 --> 00:01:04,900
that's at 28 by 28 resolution.

18
00:01:05,320 --> 00:01:07,090
So we lose four pixels.

19
00:01:07,090 --> 00:01:10,000
Not that much, but we lose some information going forward.

20
00:01:10,090 --> 00:01:13,930
So it's not exactly the same dataset that we'll be working on.

21
00:01:13,990 --> 00:01:16,900
However, we're going to use roughly the same architecture.

22
00:01:16,930 --> 00:01:23,110
However, the output shapes will be slightly different because our input starts at a smaller dimension.

23
00:01:23,680 --> 00:01:29,820
So just to quickly recap, we have our first layer here with five by five filters, stride

24
00:01:29,830 --> 00:01:31,000
one, padding zero.

25
00:01:31,570 --> 00:01:34,780
Then we have a pooling layer, average pooling; they used that, not max pooling.

26
00:01:35,230 --> 00:01:41,710
Then we have a conv layer, average pooling again, and a conv with 120 filters, which is quite a bit.

27
00:01:42,220 --> 00:01:48,100
And then we have the fully connected layers here with 120 units, 84 units.

28
00:01:48,490 --> 00:01:50,800
Then finally, it outputs to the 10 nodes here.

29
00:01:50,980 --> 00:01:53,530
So let's build this in Keras.

30
00:01:54,190 --> 00:01:57,370
So as I said, we're not building exactly the same thing.

31
00:01:57,370 --> 00:01:58,960
We're building something that's quite similar.

32
00:01:59,350 --> 00:02:02,560
But for now, this should be familiar to you, this part of the code.

33
00:02:03,050 --> 00:02:05,160
Here's where we load the MNIST dataset.

34
00:02:05,170 --> 00:02:09,490
Here's where we just get our image rows and columns, which are 28 and 28, respectively.

35
00:02:09,970 --> 00:02:16,600
Here's where we add a fourth dimension by reshaping the data to that specific size that we want,

36
00:02:16,600 --> 00:02:17,470
which is this.

37
00:02:18,490 --> 00:02:20,870
Then we just get the full input image shape.

38
00:02:20,880 --> 00:02:22,990
We convert to float32.

39
00:02:23,470 --> 00:02:28,720
We normalize our data and then we get one-hot encodings for the labels.
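
The preprocessing described here (scaling the pixel values and one-hot encoding the labels) can be sketched in plain Python. This is only an illustration: the helper names `normalize` and `one_hot` are made up for this sketch, and the 0 to 255 pixel range is an assumption based on MNIST being 8-bit grayscale. The notebook itself presumably works on NumPy arrays and uses a Keras utility for the encoding.

```python
def normalize(pixels):
    """Scale 8-bit pixel values (0-255) into the range [0.0, 1.0]."""
    return [p / 255.0 for p in pixels]


def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


# Hypothetical sample values, not taken from the notebook.
print(normalize([0, 51, 255]))
print(one_hot(3))
```

The length of each one-hot vector is the number of classes, which is why that value can be read off the encoded labels afterwards.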

40
00:02:29,290 --> 00:02:34,600
Then we get the number of classes as well from that, as well as the number of pixels, which I don't believe

41
00:02:34,600 --> 00:02:35,590
we use here.

42
00:02:35,590 --> 00:02:41,530
But we might. Actually, maybe I copied and pasted this line unnecessarily, but we'll see if we

43
00:02:41,530 --> 00:02:41,890
use it.

44
00:02:42,370 --> 00:02:44,360
And this is just to go over this.

45
00:02:44,380 --> 00:02:45,460
These are imports.

46
00:02:45,910 --> 00:02:48,580
This is the optimizer we'll be using, Adadelta.

47
00:02:49,590 --> 00:02:55,690
We don't actually use batch norm in this model, so I'm taking this out, and everything else should

48
00:02:55,690 --> 00:02:56,360
be fine.

49
00:02:56,380 --> 00:03:00,160
Here we are ready to run this block of code.

50
00:03:00,220 --> 00:03:01,690
So go ahead and run this.

51
00:03:01,720 --> 00:03:08,470
It may take a little while because it's the first block of code you're running in this notebook because

52
00:03:08,470 --> 00:03:10,240
what happens when you first run a block:

53
00:03:10,630 --> 00:03:16,990
It has to connect to a cloud instance in the Google Cloud platform to give you basically the machine

54
00:03:16,990 --> 00:03:18,250
that we're running this code on.

55
00:03:18,760 --> 00:03:20,680
So now let's replicate

56
00:03:20,680 --> 00:03:28,480
LeNet. If you remember correctly, we had six filters here at five by five kernel sizes, where

57
00:03:28,510 --> 00:03:29,230
we didn't...

58
00:03:29,230 --> 00:03:32,180
we used zero padding initially in the architecture.

59
00:03:32,200 --> 00:03:39,400
However, because we have a 28 by 28 input, which is a slightly smaller input image size than the 32 by 32 the

60
00:03:39,400 --> 00:03:45,100
researchers would put in, I'm going to use padding equal to same. Padding equal to same basically

61
00:03:45,100 --> 00:03:51,190
adds padding around the input image so that the output feature map is the same size as the input feature

62
00:03:51,190 --> 00:03:51,490
map.
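
A minimal sketch of why the padding choice matters for a 28 by 28 input, using the standard output-size formulas for "valid" and "same" padding. The layer stack here is an assumption for illustration: three 5x5 convolutions with stride 1, with a 2x2 pooling layer of stride 2 after each of the first two. The helper names `conv_out` and `trace` are made up for this sketch.

```python
import math


def conv_out(size, kernel, stride=1, padding="valid"):
    """Output size along one spatial dimension of a conv or pooling layer."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown padding mode: {padding}")


def trace(size, conv_padding):
    """Follow the feature-map size through conv, pool, conv, pool, conv."""
    layers = [
        (5, 1, conv_padding),  # 5x5 convolution, stride 1
        (2, 2, "valid"),       # 2x2 pooling, stride 2
        (5, 1, conv_padding),
        (2, 2, "valid"),
        (5, 1, conv_padding),
    ]
    sizes = [size]
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


print(trace(32, "valid"))  # [32, 28, 14, 10, 5, 1]
print(trace(28, "valid"))  # [28, 24, 12, 8, 4, 0]
print(trace(28, "same"))   # [28, 28, 14, 14, 7, 7]
```

Under these assumptions, the 32 by 32 input shrinks to a 1 by 1 map by the third convolution without any padding, while a 28 by 28 input would run out of pixels before that layer. With padding set to same, the convolutions keep the spatial size and only the pooling layers shrink it.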

63
00:03:51,910 --> 00:03:54,490
So I'm going to use that for all the convs here.

64
00:03:54,970 --> 00:04:01,030
However, I'm going to use the same number of filters, 6, 16, and 120, with ReLU activation.

65
00:04:01,600 --> 00:04:03,880
We'll use max pooling instead of average pooling

66
00:04:03,880 --> 00:04:07,140
because I've done this experiment, and max pooling does work better.

67
00:04:07,150 --> 00:04:12,100
So I mean, there's no point in replicating the exact same thing, but we'll replicate something that's

68
00:04:12,100 --> 00:04:12,850
quite similar.

69
00:04:13,600 --> 00:04:16,480
So we do have a max pool there, then we flatten.

70
00:04:16,930 --> 00:04:23,230
We're going to use the same 120 and 84 units, and output to the ten output nodes.

71
00:04:23,770 --> 00:04:25,050
And for the optimizer,

72
00:04:25,060 --> 00:04:32,500
we're going to use Adadelta; for metrics, again, accuracy; and for loss, categorical cross-entropy,

73
00:04:32,500 --> 00:04:36,580
which is appropriate for a multiclass classification problem like this.

74
00:04:36,820 --> 00:04:43,510
So let's run this and we get our model output here, which is quite nice, quite convenient

75
00:04:43,510 --> 00:04:48,010
to have. With this, we can double-check all of these things, all the output sizes.

76
00:04:48,580 --> 00:04:52,410
This is actually quite small, but that should work fine.

77
00:04:52,420 --> 00:04:58,000
The total number of parameters is only a hundred and ninety-one thousand, which is tiny.
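
The parameter total can be checked by hand. The sketch below assumes details the transcript does not fully spell out: 5x5 kernels in all three convolutions, "same" padding, a 2x2 max pool with stride 2 after each convolution (so 28 to 14 to 7 to 3), and biases on every layer. The helper names are made up for this illustration.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolution."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (inputs + 1) * units


# Assumed spatial size after three 2x2 pools on a 28x28 input: 28 -> 14 -> 7 -> 3
flattened = 3 * 3 * 120

layer_params = [
    conv_params(5, 1, 6),          # first conv, 6 filters
    conv_params(5, 6, 16),         # second conv, 16 filters
    conv_params(5, 16, 120),       # third conv, 120 filters
    dense_params(flattened, 120),  # first fully connected layer
    dense_params(120, 84),         # second fully connected layer
    dense_params(84, 10),          # output layer, 10 classes
]

total = sum(layer_params)
print(layer_params)  # [156, 2416, 48120, 129720, 10164, 850]
print(total)         # 191426
```

Under these assumptions the total comes to 191,426, in line with the roughly 191 thousand quoted here. Most of the weights sit in the first fully connected layer, not in the convolutions.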

78
00:04:58,960 --> 00:05:03,100
But let's see how it performs. We'll train 10 epochs on this model.

79
00:05:03,240 --> 00:05:04,860
Let's see if it gives us decent results.

80
00:05:05,310 --> 00:05:11,670
I would say a decent result should be somewhere above 80 percent accuracy after 10 epochs, so let's

81
00:05:11,670 --> 00:05:12,600
see how that goes.

82
00:05:12,720 --> 00:05:14,700
It's going to train quite quickly, as you can see.

83
00:05:17,060 --> 00:05:22,190
So the accuracy of the first epoch wasn't that great, 11 percent, but then it went up to fifteen

84
00:05:22,190 --> 00:05:24,080
point nine, twenty-one point eight.

85
00:05:24,470 --> 00:05:25,100
Let's keep going.

86
00:05:25,130 --> 00:05:34,790
Twenty six point four one point three one nine point twenty three seven point four one.

87
00:05:34,790 --> 00:05:39,140
I don't think we're going to get anywhere near 80 now, but as you can see, it's quite fast to train,

88
00:05:39,770 --> 00:05:45,380
so we can easily train this for like 50 epochs, which I will do. I'll run this for 50 epochs here

89
00:05:45,380 --> 00:05:46,730
and see what it gives us.

90
00:05:47,150 --> 00:05:50,540
So we'll move on to AlexNet, though.

91
00:05:50,870 --> 00:05:56,740
In the meantime, just so you know, there's nothing special going on here. This is the standard code

92
00:05:56,750 --> 00:05:59,240
we use for training. We use a batch size of 128.

93
00:05:59,690 --> 00:06:05,730
We just use the model.fit procedure and specify the batch size and epochs, which we defined up here.

94
00:06:06,410 --> 00:06:09,650
We just point to a validation test dataset and the labels.

95
00:06:10,010 --> 00:06:16,910
We always shuffle the data and save the model at the end, just so we have it saved, just in

96
00:06:16,910 --> 00:06:18,380
case we want to reuse it again.

97
00:06:19,040 --> 00:06:23,600
And we just set verbose equal to one when we evaluate the performance, just to get

98
00:06:23,600 --> 00:06:24,830
some more information out of it.

99
00:06:25,340 --> 00:06:28,760
And then we print the test loss and test accuracy at the end there.

100
00:06:30,170 --> 00:06:35,720
So as you can see, in 50 epochs it's actually starting to do much better, which highlights an important

101
00:06:35,720 --> 00:06:40,910
lesson right now, because initially, remember, when we started, our accuracy was much worse.

102
00:06:41,660 --> 00:06:43,280
Well, there's two things happening here.

103
00:06:43,340 --> 00:06:43,820
One.

104
00:06:46,940 --> 00:06:50,960
Actually, just thinking about it, this may have been a feature implemented in TensorFlow

105
00:06:50,960 --> 00:06:57,110
2.0 that I was not aware of, because I've been using PyTorch so much lately, but it continued training

106
00:06:57,110 --> 00:07:02,870
the model, so it didn't actually start off from scratch. When it initially started,

107
00:07:03,230 --> 00:07:09,560
it was at 11 percent accuracy, and when we ended it after 10 epochs, it was at 55 percent accuracy.

108
00:07:10,040 --> 00:07:15,620
That's why when we run this block of code again, it continues the model, so it saves the weights and

109
00:07:15,620 --> 00:07:16,460
continues training.

110
00:07:16,640 --> 00:07:19,370
Previously, it would have just started from scratch.

111
00:07:19,880 --> 00:07:24,500
So it's actually quite nice to see that we can actually resume training conveniently like this.

112
00:07:24,530 --> 00:07:30,260
So that's how we're able to get so much better accuracy when starting this because we're not starting

113
00:07:30,260 --> 00:07:31,940
from scratch, we're continuing training.

114
00:07:33,980 --> 00:07:39,100
And you can see, at the end, maybe close to 50 epochs, we will be in the 90s.

115
00:07:39,110 --> 00:07:44,500
Let's see if we get close to 99 percent, which is where the researchers got with the initial LeNet architecture.

116
00:07:44,510 --> 00:07:50,690
Somehow, I don't think we'll come close, but I suspect we'll get maybe close to 94 percent accuracy,

117
00:07:51,170 --> 00:07:51,980
so I'll leave this.

118
00:07:57,340 --> 00:08:02,140
OK, so you can see we didn't get 94, but we got 93.13 percent accuracy,

119
00:08:02,680 --> 00:08:03,640
which is quite good.

120
00:08:04,090 --> 00:08:07,240
Not disappointed with that after such a short training time.

121
00:08:07,600 --> 00:08:09,490
We're getting a decent model out of this.

122
00:08:09,940 --> 00:08:14,490
And what's remarkable is that we're getting that performance with such a tiny model. One hundred and

123
00:08:14,500 --> 00:08:18,940
ninety-one thousand parameters is tiny, as you can see, and it trains quite quickly.

124
00:08:19,600 --> 00:08:21,370
So the code is running here right now.

125
00:08:21,580 --> 00:08:27,670
What it's still doing in the background is running model.evaluate, which just gets the performance

126
00:08:27,670 --> 00:08:28,480
metrics out of this.

127
00:08:28,480 --> 00:08:32,340
So we're going to get a test loss and test accuracy at the end of this.

128
00:08:32,340 --> 00:08:33,790
So let's wait for that.

129
00:08:34,600 --> 00:08:39,850
In the meantime, though, let's take a look at the AlexNet network, which we're going to train.

130
00:08:40,360 --> 00:08:46,250
So we're going to train an AlexNet, which is a lot deeper, a lot more complicated network, on the

131
00:08:46,270 --> 00:08:47,770
CIFAR-10 dataset.

132
00:08:48,280 --> 00:08:56,050
So this is an illustration of the AlexNet architecture, and this is an illustration of the CIFAR-10

133
00:08:56,050 --> 00:08:57,100
dataset.

134
00:08:57,110 --> 00:09:01,840
So you can see it has 10 classes, 10 of these fairly different classes.

135
00:09:01,840 --> 00:09:07,060
I would say, although dog and deer could be quite similar sometimes, but the deer do have these horns,

136
00:09:07,060 --> 00:09:08,440
which make them quite distinctive.

137
00:09:08,950 --> 00:09:13,540
Horses can be similar to dogs as well sometimes, but these images do look a lot different.

138
00:09:14,140 --> 00:09:17,890
Frog, apparently, was an important category that they saw fit to fit in.

139
00:09:18,610 --> 00:09:20,360
But anyway, let's keep going.

140
00:09:20,380 --> 00:09:22,600
So what we'll do is this:

141
00:09:22,600 --> 00:09:24,640
load the CIFAR-10 dataset here.

142
00:09:25,180 --> 00:09:26,350
So let's run this.

143
00:09:31,980 --> 00:09:37,390
So it's loading the dataset right now. It should take roughly about a minute, maybe less.

144
00:09:37,410 --> 00:09:42,720
In the meantime, we can scroll back up and check our performance metrics at the end. This is

145
00:09:42,730 --> 00:09:50,220
what we get now: it's giving us 93.1 percent accuracy, and our test loss is point

146
00:09:50,220 --> 00:09:51,720
two four, which is quite good.

147
00:09:51,990 --> 00:09:53,010
It's not amazing.

148
00:09:53,460 --> 00:09:58,860
If you want, I would recommend you train this for maybe 150 epochs. And as well,

149
00:09:58,890 --> 00:10:07,410
it's a very good experiment for you to go to the architecture here of the AlexNet network and add

150
00:10:07,410 --> 00:10:11,040
more filters, and... where did it go? Here.

151
00:10:12,270 --> 00:10:14,700
Add more filters, change the filter size.

152
00:10:15,150 --> 00:10:18,630
Maybe add a new fully connected layer, increase the

153
00:10:18,650 --> 00:10:24,990
number of nodes in these fully connected layers, maybe adjust the stride, or remove some padding. If you

154
00:10:24,990 --> 00:10:27,610
wanted to set padding to be zero,

155
00:10:27,640 --> 00:10:27,930
you can.

156
00:10:27,930 --> 00:10:33,900
You can just put it after the comma there to set the padding at zero.

157
00:10:34,510 --> 00:10:35,250
Use

158
00:10:37,470 --> 00:10:44,460
valid instead of padding equals same. Valid basically tells the padding to be none, so

159
00:10:44,460 --> 00:10:45,180
it's zero padding.

160
00:10:46,440 --> 00:10:48,150
So let's go back down.

161
00:10:48,180 --> 00:10:52,680
Sorry for the quick scrolling. And we've loaded our CIFAR-10 dataset.

162
00:10:52,680 --> 00:11:00,510
We can see it's 50,000 images in the training dataset, 10,000 in the test dataset. Dimensions are 32 by

163
00:11:00,520 --> 00:11:01,950
32, and they're color.

164
00:11:02,070 --> 00:11:03,630
That's why they have a depth of three.

165
00:11:04,830 --> 00:11:07,230
So here is where we build the AlexNet

166
00:11:07,230 --> 00:11:13,050
network. You can see we have five convolutional layers, and I've separated them nicely.

167
00:11:13,590 --> 00:11:21,450
You have 96 filters in the first one, 256 in the second, 512 in the third, 1,024 in

168
00:11:21,450 --> 00:11:23,820
the fourth, and 1,024 in the fifth.

169
00:11:24,270 --> 00:11:29,760
And you can see it starts at eleven by eleven, then five by five, three by three, three by three, three by three,

170
00:11:30,180 --> 00:11:34,150
which is actually against the convention of CNN filters.

171
00:11:34,170 --> 00:11:38,760
I prefer to use smaller filters up in front and larger filters behind.

172
00:11:39,240 --> 00:11:43,980
However, this is what the AlexNet network looks like, so we won't change it too much.

173
00:11:44,490 --> 00:11:48,240
We also use an L2 regularizer here. You can see it's set to zero.

174
00:11:48,660 --> 00:11:49,470
But it does apply

175
00:11:49,480 --> 00:11:51,120
some regularization still.

176
00:11:51,840 --> 00:11:53,610
Actually, I'm not even sure if it does.

177
00:11:54,120 --> 00:12:00,960
Let's set this to zero zero one, actually, just in case. And we're using batch norm here.

178
00:12:01,260 --> 00:12:05,670
Now, I don't believe that batch norm was part of the initial AlexNet network.

179
00:12:06,120 --> 00:12:08,370
However, that's added into this.

180
00:12:08,400 --> 00:12:13,170
Like, let's make some improvements on it, and we will add dropout as well.

181
00:12:13,680 --> 00:12:16,980
And you can see our fully connected layers are quite a bit bigger.

182
00:12:17,520 --> 00:12:21,840
This one has 3,072 nodes, as well as 4,096.

183
00:12:22,320 --> 00:12:23,730
Number of classes is 10.

184
00:12:24,270 --> 00:12:29,420
We apply batch norm again at the end of this, and we're using dropout of 0.5.

185
00:12:29,430 --> 00:12:30,690
I wouldn't really want to go

186
00:12:30,690 --> 00:12:31,590
so high.

187
00:12:31,600 --> 00:12:35,400
I'd want to stick to 0.3, but nevertheless, let's try it.

188
00:12:35,400 --> 00:12:36,870
So let's print this.

189
00:12:36,960 --> 00:12:42,940
And lastly, we're using Adadelta again as the optimizer. So Adadelta is quite good.

190
00:12:42,960 --> 00:12:45,870
However, we could use stochastic gradient descent or Adam.

191
00:12:46,260 --> 00:12:53,760
I tend to use Adam a lot, so you can change it to Adam if you want to see what it would do

192
00:12:53,760 --> 00:12:54,700
to the...

193
00:12:58,380 --> 00:13:01,890
If you wanted to see what the actual way to change this is,

194
00:13:02,340 --> 00:13:06,930
you can dig into the Keras documentation and see what they actually name these different optimizers.

195
00:13:07,410 --> 00:13:14,160
I don't believe it's just Adam, but I'm not entirely sure. It could be a capital A; it's case-sensitive,

196
00:13:14,160 --> 00:13:15,450
so it needs to be set correctly.

197
00:13:17,010 --> 00:13:17,610
So there we go.

198
00:13:17,760 --> 00:13:19,930
So, a long architecture.

199
00:13:19,930 --> 00:13:22,290
As you can see, it's quite extensive.

200
00:13:22,290 --> 00:13:27,820
There are many, many different layers in AlexNet, but not nearly as many as VGG or ResNet or some of those

201
00:13:27,820 --> 00:13:28,920
other deep CNNs.

202
00:13:29,490 --> 00:13:37,230
But we can see there are 78 million parameters, which is quite a lot, actually, for such a... I mean, it's not

203
00:13:37,230 --> 00:13:41,940
a primitive network, but for a network that was introduced maybe ten years ago, this is quite

204
00:13:41,940 --> 00:13:42,240
heavy.

205
00:13:42,840 --> 00:13:50,230
So we'll use a batch size of 64, we'll train for 25 epochs, we'll do the regular training method here, save

206
00:13:50,280 --> 00:13:53,490
the model and get the performance scores at the end.
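
To get a feel for what these settings mean in terms of weight updates, here is a small arithmetic sketch. It assumes the 50,000 CIFAR-10 training images mentioned earlier and counts the final partial batch as a step; the 60,000-image MNIST training set size is the standard figure and is not stated in this video. The function name `steps_per_epoch` is made up for this illustration.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches in one pass over the data, including a partial last batch."""
    return math.ceil(num_samples / batch_size)


cifar_steps = steps_per_epoch(50_000, 64)   # AlexNet run: batch size 64
cifar_updates = cifar_steps * 25            # 25 epochs
mnist_steps = steps_per_epoch(60_000, 128)  # LeNet run: batch size 128

print(cifar_steps)    # 782
print(cifar_updates)  # 19550
print(mnist_steps)    # 469
```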

207
00:13:54,060 --> 00:13:56,900
So let's begin training and let's observe.

208
00:13:56,910 --> 00:14:02,490
So sit back and relax; this one is going to take a while to train, so I'm going to let the video

209
00:14:02,490 --> 00:14:07,290
stop recording now and resume when it's finished training, because this is going to take maybe about

210
00:14:07,290 --> 00:14:10,590
15 minutes, at a rough estimate.

211
00:14:11,400 --> 00:14:11,670
OK.

212
00:14:11,820 --> 00:14:19,590
So that's it for this lesson. When I come back, after this model has been trained, we'll discuss the results

213
00:14:19,590 --> 00:14:19,920
here.

214
00:14:20,550 --> 00:14:20,940
Thank you.

215
00:14:24,450 --> 00:14:31,170
And using the AlexNet architecture, and you can see, after roughly 20 minutes of training, 25 epochs

216
00:14:31,170 --> 00:14:35,880
have completed and we have gotten 61 percent accuracy. Now, that's OK.

217
00:14:35,970 --> 00:14:36,960
It's not too bad.

218
00:14:37,170 --> 00:14:42,780
I mean, we could do a lot better, and you can see, if I scroll down, there's an image here, which

219
00:14:42,780 --> 00:14:47,010
is taken from Papers with Code, and they rank basically all of the state-of-the-art

220
00:14:47,310 --> 00:14:51,120
CNNs, and you can see this is their performance on CIFAR-10.

221
00:14:51,120 --> 00:14:58,110
And you can see, I mean, we've gotten some ridiculously high accuracy rates with the modern-day CNNs.

222
00:14:58,530 --> 00:15:00,690
However, these are very, very deep networks.

223
00:15:00,700 --> 00:15:03,120
You can see this is six hundred and thirty two million parameters.

224
00:15:03,600 --> 00:15:08,010
These are actually not that deep at all, and they're performing quite well on CIFAR-10.

225
00:15:08,430 --> 00:15:12,870
However, they do use extra training data, so it's not cheating.

226
00:15:12,870 --> 00:15:16,650
It's just a different way, to put it in a different way,

227
00:15:16,650 --> 00:15:21,840
that we can get the best results on previous tests, on other test datasets, I should say.

228
00:15:22,770 --> 00:15:27,960
So don't feel bad with 61 percent. To be honest, you can actually try and improve this and get

229
00:15:27,960 --> 00:15:33,330
maybe close to 70, even 75 percent by doing a few tweaks and training for more epochs.

230
00:15:34,020 --> 00:15:41,280
So we'll stop there for now, and then what we'll do is go into some of these pre-trained networks

231
00:15:41,280 --> 00:15:45,430
like VGG and ResNet and a few others, and start loading them.

232
00:15:45,450 --> 00:15:51,990
And I'll show you how to load a pre-trained network in both PyTorch and Keras, and how to just run

233
00:15:51,990 --> 00:15:56,310
some inferences on new images and get the classes back out.

234
00:15:56,910 --> 00:15:59,370
So it's quite a fun project we're going to do next.

235
00:15:59,550 --> 00:16:01,380
So I'll see you in the next section.

236
00:16:01,470 --> 00:16:01,920
Thank you.

237
00:16:02,350 --> 00:16:02,640
Bye.