1
00:00:11,590 --> 00:00:16,660
In this lecture, we're going to improve our previous results using some of the techniques we just

2
00:00:16,660 --> 00:00:20,680
learned about, including data augmentation and batch normalization.

3
00:00:21,340 --> 00:00:26,470
This lecture is going to walk you through a prepared Colab notebook, although a very good exercise,

4
00:00:26,470 --> 00:00:31,720
which I always recommend, is once you know how this is done, to try and recreate it yourself with as

5
00:00:31,720 --> 00:00:33,190
few references as possible.

6
00:00:34,600 --> 00:00:39,580
As usual, you can look at the title of the notebook to determine what notebook we are currently looking

7
00:00:39,580 --> 00:00:39,840
at.

8
00:00:43,070 --> 00:00:47,180
One important feature of this notebook is that it has many options for you to choose from.

9
00:00:47,780 --> 00:00:50,170
We've added quite a few things since the last script.

10
00:00:50,810 --> 00:00:56,720
So as an exercise, one thing you want to try is to experiment with what happens if only some of these

11
00:00:56,720 --> 00:00:58,640
options are used and not others.

12
00:00:59,150 --> 00:01:04,490
For example, check the results when we don't have data augmentation, check the results when we don't

13
00:01:04,490 --> 00:01:08,720
have batch norm, check the results if we remove one layer of convolutions.

14
00:01:09,230 --> 00:01:13,880
This is part of the process of improving your results and hyperparameter optimization.

15
00:01:20,100 --> 00:01:25,920
OK, so at the top, you'll see that when we load in the data set, we specify a transforms.Compose

16
00:01:26,100 --> 00:01:28,500
instead of just transforms.ToTensor.

17
00:01:29,250 --> 00:01:30,780
This is a list of things I tried.

18
00:01:30,780 --> 00:01:34,920
And you definitely want to play around with these to see if you can improve the results.

19
00:01:35,460 --> 00:01:38,940
There are tons of things here and each of these things has its own parameters.

20
00:01:39,210 --> 00:01:41,240
So test them out and see what happens.

21
00:01:47,630 --> 00:01:52,910
So pretty much everything in this script is the same as the previous CIFAR-10 script, so let's just scroll

22
00:01:52,910 --> 00:01:54,380
down to where we build the model.

23
00:01:58,650 --> 00:02:00,780
And so here we're just inspecting the data.

24
00:02:02,060 --> 00:02:03,170
Here's the data loader.

25
00:02:08,630 --> 00:02:10,550
All right, so here's where we define the model.

26
00:02:11,510 --> 00:02:17,150
The first thing you'll see is that I've taken away the strided convolutions. In this scenario, for these

27
00:02:17,150 --> 00:02:18,170
small images,

28
00:02:18,200 --> 00:02:22,100
normal convolution followed by max pooling actually seems to work better.

29
00:02:23,950 --> 00:02:29,680
But also, we're taking inspiration from the VGG network, where they do multiple convolution layers

30
00:02:29,680 --> 00:02:30,850
before they do pooling.

31
00:02:31,660 --> 00:02:34,450
Now, VGG is designed for much larger images.

32
00:02:34,720 --> 00:02:36,540
So it's also a much larger network.

33
00:02:37,030 --> 00:02:42,580
They actually have five groups of convolutions and poolings, and each group has multiple convolutions,

34
00:02:43,480 --> 00:02:46,490
two in the first couple of layers and three in the following layers.

35
00:02:46,990 --> 00:02:48,790
So that's another thing we've done differently.

36
00:02:49,480 --> 00:02:54,740
First convolution with no strides and now multiple convolutions before doing pooling.

37
00:02:56,740 --> 00:03:00,750
Initially, you might want to try only this by itself to see how it works.

38
00:03:04,020 --> 00:03:09,330
The next thing you can see that we've added is a batch norm layer after every convolution. Not

39
00:03:09,330 --> 00:03:12,660
much to say here since we just described how and why it works.

40
00:03:20,480 --> 00:03:25,910
But as I mentioned before, because I added so many things to the script, you want to turn some off

41
00:03:25,910 --> 00:03:31,040
and turn them on in different combinations so you can see for yourself the effect each of them has.

42
00:03:32,000 --> 00:03:35,390
You'll notice that for these convolutions I've used padding equals one.

43
00:03:36,110 --> 00:03:42,050
This is because I wanted to show you how to achieve same padding such that the output image is always

44
00:03:42,050 --> 00:03:44,090
equal in size to the input image.

45
00:03:44,750 --> 00:03:50,090
Basically, we can take the formula from the PyTorch documentation and plug in values for the kernel

46
00:03:50,090 --> 00:03:51,050
size and stride.

47
00:03:51,830 --> 00:03:57,460
Then we use the fact that we want H_out to be equal to H_in, and then we can solve for P.

48
00:03:58,130 --> 00:04:00,230
In this case I get P equals one.

49
00:04:00,470 --> 00:04:05,270
So that means if we want the same mode convolution then we have to use padding equals one.
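
The derivation above can be sketched in a few lines of plain Python. This is only an illustration: the function names `conv_output_size` and `same_padding` are made up for this example, and the formula is the output-size formula given in the PyTorch `Conv2d` documentation, assuming a 3x3 kernel with stride 1 and dilation 1.

```python
def conv_output_size(h_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output height (or width) of a convolution, per the Conv2d shape formula."""
    return (h_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def same_padding(kernel_size, dilation=1):
    """Padding that keeps H_out == H_in, assuming stride 1 and an odd kernel size."""
    # Setting H_out = H_in with stride 1 gives 2P = dilation * (K - 1).
    return dilation * (kernel_size - 1) // 2


p = same_padding(3)
print(p)                                               # 1
print(conv_output_size(32, 3, stride=1, padding=p))    # 32
```

With no padding the same 3x3 convolution would shrink a 32-pixel side to 30, which is why padding of one is needed here.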

50
00:04:07,260 --> 00:04:12,960
This makes convolutional arithmetic a little easier for us. After each convolution, the size of the

51
00:04:12,960 --> 00:04:14,250
image remains the same.

52
00:04:14,520 --> 00:04:17,770
And after the max pool, the image is downsampled by two.

53
00:04:18,330 --> 00:04:22,410
So basically, you can take 32 and divide it by two, three times.

54
00:04:26,290 --> 00:04:32,680
We end up with four, so the size of the input of the first linear layer is 128 times four times four.
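
As a rough check of that arithmetic, here is a small sketch in plain Python. The helper name `flattened_size` is hypothetical; the numbers (32-pixel input, three max pools, 128 output channels) are the ones mentioned in the lecture, and it assumes every convolution uses same padding so only the pooling changes the spatial size.

```python
def flattened_size(image_size=32, num_pools=3, out_channels=128):
    """Number of features fed to the first linear layer after flattening."""
    size = image_size
    for _ in range(num_pools):
        size //= 2  # each 2x2 max pool halves the height and width
    return out_channels * size * size


print(flattened_size())  # 2048, i.e. 128 * 4 * 4
```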

55
00:04:35,840 --> 00:04:39,650
Finally, we have our final dense layers with the dropout, which is the same as before.

56
00:04:42,790 --> 00:04:49,270
All right, so let's scroll down so we instantiate the model, move the model to the GPU, create the

57
00:04:49,270 --> 00:04:52,300
loss and optimizer, and do batch gradient descent.

58
00:04:54,110 --> 00:05:00,890
Observe our losses. These epochs take a little longer, around 15 seconds, and also notice that

59
00:05:00,890 --> 00:05:03,590
we did 80 epochs instead of just 15.

60
00:05:08,510 --> 00:05:13,190
All right, so here's the loss per iteration. Something you may have noticed in the previous script is

61
00:05:13,190 --> 00:05:18,080
that it had some overfitting and you can see here that the overfitting effect is less.

62
00:05:22,380 --> 00:05:23,790
So let's look at the accuracy.

63
00:05:25,600 --> 00:05:31,540
And you can see that we reach much higher accuracy than before, so now we get about 94 percent on the

64
00:05:31,540 --> 00:05:34,500
train set and 87 percent on the test set.

65
00:05:39,390 --> 00:05:41,700
Next, let's look at the confusion matrix.

66
00:05:45,230 --> 00:05:49,910
All right, so this is encouraging because a lot of these off-diagonal numbers are a lot smaller than

67
00:05:49,910 --> 00:05:50,400
before.

68
00:05:51,110 --> 00:05:56,090
You'll notice that a few of them are still quite high, such as when we confuse threes and fives, which

69
00:05:56,090 --> 00:05:57,250
are cats and dogs.

70
00:05:57,770 --> 00:06:03,860
I think this is the most difficult to discriminate in this data set because in a 32 by 32 image, cats

71
00:06:03,860 --> 00:06:05,870
and dogs are going to look extremely similar.

72
00:06:07,740 --> 00:06:12,760
We also still see a lot of ones and nines being confused, which correspond to automobiles and trucks.

73
00:06:13,200 --> 00:06:15,140
Again, that seems to make a lot of sense.

74
00:06:21,670 --> 00:06:25,180
OK, so let's take a look at a few misclassified samples.

75
00:06:26,540 --> 00:06:30,860
So here is a cat predicted as a truck. That doesn't really make sense.

76
00:06:32,260 --> 00:06:34,570
Here's a truck predicted as an automobile.

77
00:06:35,170 --> 00:06:38,350
I think we can agree that that might be a truck.

78
00:06:40,480 --> 00:06:45,940
Here's a cat predicted as a horse. We can kind of see how that might be a horse or a dog.

79
00:06:48,600 --> 00:06:50,370
Here's a frog predicted as a cat.

80
00:06:50,970 --> 00:06:55,410
So this frog is kind of in the same pose a cat might stand in.

81
00:06:58,450 --> 00:07:00,100
Here's a horse predicted as a deer.

82
00:07:02,060 --> 00:07:04,310
So, again, we can kind of see why that might be.

83
00:07:07,320 --> 00:07:09,150
Here's a truck predicted as a ship.

84
00:07:09,960 --> 00:07:11,610
Personally, I don't think it looks like anything.

85
00:07:14,870 --> 00:07:17,270
Here's an automobile predicted as an airplane.

86
00:07:19,010 --> 00:07:20,010
So that seems kind of wrong.

87
00:07:20,030 --> 00:07:20,900
It looks like a car.

88
00:07:23,210 --> 00:07:26,470
Here's a horse predicted as a deer. That seems to make sense.

89
00:07:30,000 --> 00:07:33,020
Here's a dog predicted as a horse. That kind of makes sense.

90
00:07:36,760 --> 00:07:41,410
So these four-legged animals are very hard to distinguish, at least for a neural network.

91
00:07:43,870 --> 00:07:45,640
Here's a bird predicted as a horse.

92
00:07:46,810 --> 00:07:48,910
If we look closely, it kind of looks like a bird.

93
00:07:50,170 --> 00:07:55,190
We can definitely tell it looks more like a bird than a horse, but if you weren't told the label,

94
00:07:55,190 --> 00:07:56,690
you might not even know what it is.

95
00:08:02,010 --> 00:08:06,480
So I think it's clear that while we improved the performance of the neural network, it still doesn't

96
00:08:06,480 --> 00:08:09,420
reach human-level image recognition capability.

97
00:08:17,190 --> 00:08:22,080
One conclusion you can draw from this lecture could be summarized as bigger is better.

98
00:08:22,680 --> 00:08:28,110
We've seen that by virtually adding a lot more data to our data set and using a much larger neural network

99
00:08:28,110 --> 00:08:33,030
than what we've seen so far in this course, we were able to significantly improve the performance of

100
00:08:33,030 --> 00:08:34,830
our network on this data set.

101
00:08:35,640 --> 00:08:40,530
This is a very common theme in deep learning, and it's often surprising because it seems so simple.

102
00:08:41,160 --> 00:08:45,410
The fact that it's so simple actually makes it seem less profound than it really is.

103
00:08:46,080 --> 00:08:51,780
Some people believe that instead of better algorithms, we just need to scale up our data and our models

104
00:08:51,780 --> 00:08:52,530
and our compute.

105
00:08:52,740 --> 00:08:55,650
And this by itself will lead to breakthroughs in AI.

106
00:08:57,680 --> 00:09:02,720
As noted, our final neural network in this lecture is much larger than what we've worked with so far.

107
00:09:03,350 --> 00:09:07,400
In general, CNNs can get pretty deep, even hundreds of layers.

108
00:09:07,940 --> 00:09:13,530
This is a big jump from the last section when all we had was just two dense layers. At this scale,

109
00:09:13,820 --> 00:09:17,210
it helps to be able to see a summary of your model in some way.

110
00:09:23,150 --> 00:09:29,300
And so the torchsummary library allows you to do just that. By calling the summary function and passing

111
00:09:29,300 --> 00:09:35,480
in the model and input size, you get a table that shows every layer in your model, along with some

112
00:09:35,480 --> 00:09:40,100
other useful information, such as the output shape and the number of parameters in each layer.

113
00:09:42,360 --> 00:09:49,230
So first, we have our initial Conv2d layer, which has 896 parameters, and its output shape is 32

114
00:09:49,230 --> 00:09:50,610
by 32 by 32.

115
00:09:51,240 --> 00:09:53,730
This is because we use the same mode convolution.

116
00:09:54,480 --> 00:09:58,350
We can do a quick sanity check to make sure that the number of parameters makes sense.

117
00:09:58,830 --> 00:10:02,000
Our filter is three by three by three by 32.

118
00:10:02,040 --> 00:10:04,050
That's because it has three input channels.

119
00:10:04,470 --> 00:10:08,190
Its size is three by three and it has 32 output channels.

120
00:10:08,460 --> 00:10:12,810
And so three times three times three times 32 is 864.

121
00:10:13,620 --> 00:10:17,330
Then we have the bias term, which is a vector of size 32.

122
00:10:17,850 --> 00:10:22,650
So 864 plus 32 is 896, which checks out.
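
The same sanity check can be written as a tiny plain-Python helper. The name `conv2d_param_count` is invented for this sketch; it simply counts the weights of a square filter plus one bias per output channel, which is the reasoning walked through above.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Weights (in * k * k * out) plus, optionally, one bias per output channel."""
    weights = in_channels * kernel_size * kernel_size * out_channels
    return weights + (out_channels if bias else 0)


print(conv2d_param_count(3, 32, 3, bias=False))  # 864 weights
print(conv2d_param_count(3, 32, 3))              # 896 with the bias vector
```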

123
00:10:30,090 --> 00:10:35,340
If we scroll down to the end, we can see the total number of parameters in the model. For this neural

124
00:10:35,340 --> 00:10:37,650
network, it's about two point four million.

125
00:10:38,010 --> 00:10:41,220
So, again, a big jump from what we were working with previously.
