WEBVTT

00:00.600 --> 00:02.970
-: Hello and welcome back to the course on deep learning.

00:02.970 --> 00:04.560
Today we're talking about Max Pooling

00:04.560 --> 00:07.470
and we've got some very exciting slides coming up ahead

00:07.470 --> 00:11.010
and even a special surprise at the very end of the tutorial.

00:11.010 --> 00:12.420
So let's get started.

00:12.420 --> 00:16.050
The first question is what is pooling and why do we need it?

00:16.050 --> 00:17.310
Well, to answer that question

00:17.310 --> 00:18.630
let's have a look at these images.

00:18.630 --> 00:20.790
On these three images, we've got a cheetah.

00:20.790 --> 00:22.740
In fact, it is the same exact cheetah.

00:22.740 --> 00:26.250
On the first image, the image is positioned properly

00:26.250 --> 00:28.050
and the cheetah is looking straight at you.

00:28.050 --> 00:30.660
On the second image, it's a bit rotated,

00:30.660 --> 00:32.760
and on the third image it's a bit squashed.

00:32.760 --> 00:34.830
And the thing here is that

00:34.830 --> 00:37.350
we want the neural network to be able to

00:37.350 --> 00:41.430
recognize the cheetah in every single one of these images.

00:41.430 --> 00:43.260
In fact, this is just one cheetah.

00:43.260 --> 00:45.090
What if we have lots of different cheetahs?

00:45.090 --> 00:48.780
Here's a cheetah, here's a cheetah, here's another cheetah

00:48.780 --> 00:51.690
here's a cheetah, here's a cheetah, and here's a cheetah.

00:51.690 --> 00:53.790
And we want the neural network to recognize

00:53.790 --> 00:56.250
all of these cheetahs as cheetahs.

00:56.250 --> 00:57.090
And

00:57.090 --> 00:59.760
how can it do that?

00:59.760 --> 01:01.800
If they're all looking in different directions

01:01.800 --> 01:04.530
they're all in different parts of the image, they're like,

01:04.530 --> 01:07.050
their faces are positioned in different parts of the image.

01:07.050 --> 01:08.640
Somebody's on the right hand side

01:08.640 --> 01:11.040
somebody's in the left corner, somebody's in the middle.

01:11.040 --> 01:12.510
They're all a bit different.

01:12.510 --> 01:14.310
The texture's a little bit different.

01:14.310 --> 01:16.200
The lighting is a bit different.

01:16.200 --> 01:17.430
There's lots of little differences.

01:17.430 --> 01:19.920
And so if the neural network looks for

01:19.920 --> 01:22.260
exactly a certain feature, for instance,

01:22.260 --> 01:25.530
a distinctive feature of the cheetah is

01:25.530 --> 01:28.470
the tears that are on its face

01:28.470 --> 01:30.780
going from the eyes or the

01:30.780 --> 01:32.820
shadows that look like tears.

01:32.820 --> 01:36.270
The texture, the pattern that is going from its eyes down

01:36.270 --> 01:38.520
on the sides of its nose, that looks like tears.

01:38.520 --> 01:40.920
That's a distinctive feature of the cheetah.

01:40.920 --> 01:45.240
But if it's looking for that feature, which it learned from

01:45.240 --> 01:50.130
certain cheetahs in an exact location or an exact shape

01:50.130 --> 01:53.430
or form or texture, it'll never find these other cheetahs.

01:53.430 --> 01:58.140
So we have to make sure that our neural network

01:58.140 --> 02:00.900
has a property called "spatial invariance,"

02:00.900 --> 02:04.660
meaning that it doesn't care where the

02:06.000 --> 02:06.833
features are located.

02:06.833 --> 02:10.440
Not, not so much as in which part of the image, because

02:10.440 --> 02:13.170
we've kind of taken that into consideration

02:13.170 --> 02:16.800
with our map, with our convolution layer.

02:16.800 --> 02:21.300
But it doesn't have to care if the features are a bit tilted

02:21.300 --> 02:23.940
if the features are a bit different in texture,

02:23.940 --> 02:25.620
if the features are a bit closer,

02:25.620 --> 02:27.510
or if features are a bit further apart

02:27.510 --> 02:29.250
relative to

02:29.250 --> 02:30.210
relative to each other.

02:30.210 --> 02:33.420
So if the feature itself is a bit distorted

02:33.420 --> 02:37.410
our neural network has to have some level of flexibility

02:37.410 --> 02:39.900
to be able to still find that feature.

02:39.900 --> 02:42.660
And that is what pooling is all about.

02:42.660 --> 02:45.090
So let's have a look at how pooling works.

02:45.090 --> 02:46.170
Here's our feature map.

02:46.170 --> 02:48.360
So we've already done our convolution

02:48.360 --> 02:50.580
and we've completed that part.

02:50.580 --> 02:52.650
And now we're working with the convolution layer.

02:52.650 --> 02:53.880
Now we're going to apply pooling.

02:53.880 --> 02:54.713
So how does it work?

02:54.713 --> 02:56.700
We're going to be applying max pooling.

02:56.700 --> 02:58.770
There are several different types of pooling you can apply.

02:58.770 --> 03:01.020
There's mean pooling, max pooling, sum pooling.

03:01.020 --> 03:03.540
And we'll comment on those towards the end of the tutorial.

03:03.540 --> 03:05.070
But for now, we're just applying max pooling.

03:05.070 --> 03:10.070
So we take a box of two by two pixels, like that.

03:10.080 --> 03:12.330
And again, it doesn't have to be two by two.

03:12.330 --> 03:13.560
You can choose any size of box.

03:13.560 --> 03:14.670
And again, we'll comment on that

03:14.670 --> 03:15.840
towards the end of the tutorial.

03:15.840 --> 03:19.260
And you place it in the top left hand corner.

03:19.260 --> 03:21.930
And you find the maximum value in that box,

03:21.930 --> 03:24.510
and then you record only that value

03:24.510 --> 03:26.280
and you disregard the other three.

03:26.280 --> 03:27.900
So in your box you have four values.

03:27.900 --> 03:30.300
You just disregard three, you only keep one, the maximum,

03:30.300 --> 03:31.800
which is one in this case.

03:31.800 --> 03:34.650
Then you move your box to the right by a stride

03:34.650 --> 03:36.210
you select the stride once again.

03:36.210 --> 03:38.550
So here we select a stride of two

03:38.550 --> 03:41.070
and that's what you normally select.

03:41.070 --> 03:42.960
You can select a stride of one,

03:42.960 --> 03:44.400
so that the boxes overlap.

03:44.400 --> 03:46.860
You can select any kind of stride that you like

03:46.860 --> 03:48.780
even three if you want.

03:48.780 --> 03:50.700
But we're selecting a stride of two here,

03:50.700 --> 03:52.470
and that's what is commonly used.

03:52.470 --> 03:53.940
And then you repeat the process,

03:53.940 --> 03:55.290
you record the maximum.

03:55.290 --> 03:57.960
Here, if you cross over, doesn't matter,

03:57.960 --> 04:00.090
you just continue doing what you're doing.

04:00.090 --> 04:04.260
So you still record the maximum here, zero.

04:04.260 --> 04:05.700
Here the maximum is four.

04:05.700 --> 04:07.110
Here the maximum is two.

04:07.110 --> 04:10.110
Here the maximum is one, zero, one, zero, two,

04:10.110 --> 04:11.370
and then one.
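
NOTE
Editor's sketch, not taken from the lecture slides: the walkthrough above
(a two-by-two box, a stride of two, keep only the maximum, and keep going
when the box crosses the edge) might look roughly like this in plain Python.
The name max_pool and the 5x5 values are made up for illustration; clipping
the box at the border is one assumed way to handle the "cross over" case.
```python
def max_pool(feature_map, size=2, stride=2):
    """Max-pool a 2D list of numbers; boxes that cross the edge are clipped."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows, stride):
        pooled_row = []
        for left in range(0, cols, stride):
            # Collect the values under the box, stopping at the map's border.
            window = [
                feature_map[r][c]
                for r in range(top, min(top + size, rows))
                for c in range(left, min(left + size, cols))
            ]
            pooled_row.append(max(window))  # keep only the maximum
        pooled.append(pooled_row)
    return pooled
# Hypothetical 5x5 feature map (illustrative values only).
FEATURE_MAP = [
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 2, 1],
    [1, 4, 2, 1, 0],
    [0, 0, 1, 2, 1],
]
POOLED = max_pool(FEATURE_MAP)
for row in POOLED:
    print(row)  # the 5x5 map shrinks to 3x3
```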

04:11.370 --> 04:14.010
So as you can see, a few things happened.

04:14.010 --> 04:14.843
First of all,

04:14.843 --> 04:19.110
we still were able to preserve the features, right?

04:19.110 --> 04:21.870
The maximum numbers they represent

04:21.870 --> 04:23.760
because we know how the convolution layer works,

04:23.760 --> 04:26.400
we know that the maximum, or the large numbers

04:26.400 --> 04:27.450
in your feature map,

04:27.450 --> 04:29.400
they represent where you actually found

04:29.400 --> 04:31.650
the closest similarity to your feature.

04:31.650 --> 04:34.560
But by then pooling these features,

04:34.560 --> 04:35.970
we are first of all,

04:35.970 --> 04:38.280
getting rid of 75% of the information

04:38.280 --> 04:40.350
that is not

04:40.350 --> 04:41.430
the feature,

04:41.430 --> 04:45.630
which is not the important thing that we're looking out for

04:45.630 --> 04:49.740
because we are disregarding three pixels out of four.

04:49.740 --> 04:51.480
So we're only keeping 25%.

04:51.480 --> 04:56.100
And then also because we are taking the maximum

04:56.100 --> 05:00.720
of the pixels that we, or the values that we have

05:00.720 --> 05:04.170
we are therefore accounting for any distortion.

05:04.170 --> 05:08.610
So for instance, two images in which, for the example,

05:08.610 --> 05:12.690
the cheetah's tears on the eyes, in one image

05:12.690 --> 05:15.570
they're a bit to the left or a bit rotated to the left.

05:15.570 --> 05:16.620
And in the other one they're a bit,

05:16.620 --> 05:19.980
they're how they're supposed to be, or how we,

05:19.980 --> 05:22.020
like, if we take one as the basis and then the other one,

05:22.020 --> 05:23.782
they're a bit rotated to the left

05:23.782 --> 05:26.580
the pooled feature will be exactly the same.

05:26.580 --> 05:27.600
So you can see here,

05:27.600 --> 05:30.570
if we are talking about the cheetah's tears,

05:30.570 --> 05:32.670
then, let's say this is the four,

05:32.670 --> 05:34.290
and this is where it was here,

05:34.290 --> 05:36.060
then if it was a bit rotated,

05:36.060 --> 05:38.340
so for instance the four ended up over here,

05:38.340 --> 05:40.500
then when we are doing the pooling

05:40.500 --> 05:43.110
we're still going to get the same pooled feature map.
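
NOTE
Editor's sketch, not from the lecture: a small check of the idea above,
assuming a hypothetical 4x4 feature map and a helper named max_pool.
If the strong response (the 4) moves to another cell of the same
two-by-two box, the pooled map is unchanged; if it moves into a
neighbouring box, the pooled map does change, so the tolerance is only
for small shifts.
```python
def max_pool(feature_map, size=2, stride=2):
    """Max pooling over a 2D list, using only boxes that fit entirely."""
    return [
        [
            max(
                feature_map[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            )
            for left in range(0, len(feature_map[0]) - size + 1, stride)
        ]
        for top in range(0, len(feature_map) - size + 1, stride)
    ]
# Hypothetical feature maps: the 4 sits in a different cell each time.
BASE = [
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
NUDGED = [  # the 4 moved one cell up and left, still inside the same box
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
MOVED_FAR = [  # the 4 moved into the neighbouring box
    [0, 0, 4, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(max_pool(BASE) == max_pool(NUDGED))     # True: same pooled map
print(max_pool(BASE) == max_pool(MOVED_FAR))  # False: pooled map changes
```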

05:43.110 --> 05:46.170
And that's kind of the principle behind it.

05:46.170 --> 05:50.250
It's a very rough explanation, again, intuitive explanation

05:50.250 --> 05:51.690
but that's the point of pooling

05:51.690 --> 05:55.110
that we're still able to preserve the features,

05:55.110 --> 05:59.550
and moreover, account for their possible spatial,

05:59.550 --> 06:02.400
or textural, or other kind of distortions.

06:02.400 --> 06:05.820
And in addition to all of that, we are reducing the size.

06:05.820 --> 06:07.410
So there's another benefit.

06:07.410 --> 06:09.960
So we've got, we are preserving the features,

06:09.960 --> 06:12.150
we're introducing spatial invariance,

06:12.150 --> 06:16.230
we're reducing the size by 75%,

06:16.230 --> 06:17.160
which is huge.

06:17.160 --> 06:19.800
Which is really going to help us in terms of processing.

06:19.800 --> 06:23.070
And moreover, another benefit of pooling

06:23.070 --> 06:25.170
is we are reducing the number of parameters.

06:25.170 --> 06:27.780
So we are reducing again, by 75%

06:27.780 --> 06:29.700
we're reducing the number of parameters that are going to

06:29.700 --> 06:32.640
go into our final layers of the neural network,

06:32.640 --> 06:35.280
and therefore we're preventing overfitting.
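
NOTE
Editor's sketch, not from the lecture: the 75% figure worked through for
a hypothetical 28x28 feature map (the size is an assumption, chosen only
because it divides evenly). The pooling step itself has no learned
weights; what shrinks is the number of values handed to the later layers.
The helper name pooled_side is made up for illustration.
```python
def pooled_side(side, size=2, stride=2):
    """Side length after pooling when only boxes that fit entirely are used."""
    return (side - size) // stride + 1
SIDE = 28  # hypothetical feature-map width and height
before = SIDE * SIDE             # values before pooling
after = pooled_side(SIDE) ** 2   # values after 2x2 pooling with stride 2
reduction = 1 - after / before
print(before, after, f"{reduction:.0%}")  # 784 196 75%
```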

06:35.280 --> 06:37.830
It is a very important benefit of pooling

06:37.830 --> 06:41.220
that we are removing information.

06:41.220 --> 06:42.600
And that is a good thing.

06:42.600 --> 06:43.980
That is a good thing because

06:43.980 --> 06:44.890
that way

06:46.200 --> 06:50.160
our model won't be able to overfit onto that information

06:50.160 --> 06:51.510
because, especially because

06:51.510 --> 06:52.740
that information is not relevant.

06:52.740 --> 06:54.900
Remember like at the very start we were talking about

06:54.900 --> 06:56.490
even for us as humans,

06:56.490 --> 06:59.640
it's important to see exactly the features rather than

06:59.640 --> 07:02.760
all this other noise that is coming into our eyes?

07:02.760 --> 07:04.920
Well, same thing for neural networks.

07:04.920 --> 07:08.910
By disregarding the unnecessary, non-important information

07:08.910 --> 07:12.480
we're helping to prevent overfitting.

07:12.480 --> 07:13.313
So there we go.

07:13.313 --> 07:14.610
That is what pooling is about.

07:14.610 --> 07:16.893
And the question here is of course,

07:18.330 --> 07:19.890
why max pooling, right?

07:19.890 --> 07:22.170
There's lots of different types of pooling and you know,

07:22.170 --> 07:25.620
why a stride of two, why a size of two by two pixels?

07:25.620 --> 07:26.730
Lots of all these things.

07:26.730 --> 07:30.300
And on that note, I'd like to introduce you to

07:30.300 --> 07:32.707
this lovely research paper called

07:32.707 --> 07:34.680
"Evaluation of Pooling Operations

07:34.680 --> 07:37.770
in Convolutional Architectures for Object Recognition"

07:37.770 --> 07:41.130
by Dominik Scherer from the University of Bonn.

07:41.130 --> 07:42.150
There's the link.

07:42.150 --> 07:45.000
And the beauty about this paper is that

07:45.000 --> 07:47.580
it's very, very simple, very straightforward.

07:47.580 --> 07:50.220
So if you've never read a research paper before

07:50.220 --> 07:51.540
but you'd like to give it a go,

07:51.540 --> 07:53.850
this is a great place to start.

07:53.850 --> 07:57.060
It's very short, only ten pages, very easy to read.

07:57.060 --> 07:59.070
And plus, the extra benefit is that

07:59.070 --> 08:02.070
now that we've discussed convolution and pooling

08:02.070 --> 08:04.530
you will be totally comfortable with everything

08:04.530 --> 08:06.000
that they're talking about in this paper.

08:06.000 --> 08:07.110
And you,

08:07.110 --> 08:09.450
this is a great way to actually reinforce your knowledge.

08:09.450 --> 08:11.910
So I highly recommend checking this paper out.

08:11.910 --> 08:13.980
It'll take 20 minutes to read it

08:13.980 --> 08:16.320
and you can even skip part two

08:16.320 --> 08:17.580
which is called Related Work,

08:17.580 --> 08:19.890
if it feels a bit far fetched or alienating

08:19.890 --> 08:21.030
just don't read that part.

08:21.030 --> 08:23.880
Just go straight from part one to part three.

08:23.880 --> 08:26.490
And the one thing that you do need to know about this paper,

08:26.490 --> 08:30.540
they talk about a concept called sub-sampling.

08:30.540 --> 08:33.240
Well, sub-sampling is basically average pooling.

08:33.240 --> 08:35.463
So remember how here we were taking,

08:36.300 --> 08:37.410
we were taking the maximum,

08:37.410 --> 08:39.990
so in our square we were taking the maximum value.

08:39.990 --> 08:43.050
There's a concept called mean pooling or sum pooling.

08:43.050 --> 08:45.210
Sum pooling is you just sum these values up.

08:45.210 --> 08:46.830
Average pooling or mean pooling,

08:46.830 --> 08:50.010
you take the average value out of all of these.
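
NOTE
Editor's sketch, not from the lecture or the paper: max, sum and mean
(average) pooling differ only in what is done with the values inside each
box. The helper name pool and the 4x4 values are hypothetical.
```python
from statistics import mean
def pool(feature_map, reducer, size=2, stride=2):
    """Apply reducer (max, sum, mean, ...) to each box of a 2D list."""
    return [
        [
            reducer([
                feature_map[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            ])
            for left in range(0, len(feature_map[0]) - size + 1, stride)
        ]
        for top in range(0, len(feature_map) - size + 1, stride)
    ]
# Hypothetical 4x4 feature map (illustrative values only).
FEATURE_MAP = [
    [1, 0, 2, 2],
    [3, 4, 0, 0],
    [0, 0, 1, 1],
    [0, 4, 1, 1],
]
MAX_POOLED = pool(FEATURE_MAP, max)    # largest value in each box
SUM_POOLED = pool(FEATURE_MAP, sum)    # total of each box
MEAN_POOLED = pool(FEATURE_MAP, mean)  # average of each box
print(MAX_POOLED, SUM_POOLED, MEAN_POOLED)
```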

08:50.010 --> 08:52.740
And sub-sampling is kind of like a generalization

08:52.740 --> 08:53.910
of mean pooling.

08:53.910 --> 08:57.300
It's a more generalized approach

08:57.300 --> 09:00.870
to taking the average of these values.

09:00.870 --> 09:02.460
And you can read a bit more about it in the paper

09:02.460 --> 09:04.500
but otherwise just think of it as

09:04.500 --> 09:06.900
average pooling when you're reading that paper.

09:06.900 --> 09:08.790
And so that's where you can get some additional information

09:08.790 --> 09:09.960
on this topic.

09:09.960 --> 09:12.330
And now let's recap. Where have we gotten to?

09:12.330 --> 09:14.850
So there's our input image.

09:14.850 --> 09:17.400
Then we applied the convolution operation

09:17.400 --> 09:19.050
and we got the convolution layer.

09:19.050 --> 09:22.590
And now to each of those feature maps that we get

09:22.590 --> 09:24.240
we've applied the pooling layer.

09:24.240 --> 09:25.260
So we've got,

09:25.260 --> 09:28.830
we've done these two steps, convolution and pooling.

09:28.830 --> 09:30.960
And now we're going to do something very fun,

09:30.960 --> 09:32.190
something exciting.

09:32.190 --> 09:34.470
We're going to experiment with this.

09:34.470 --> 09:36.750
So this is a screenshot I took

09:36.750 --> 09:40.953
from a tool created by Adam Harley from,

09:42.720 --> 09:45.420
well back when he was at Ryerson University

09:45.420 --> 09:46.380
of Computer Science.

09:46.380 --> 09:50.040
And now he's at Carnegie Mellon I think doing his PhD.

09:50.040 --> 09:51.000
And great tool.

09:51.000 --> 09:53.250
So let's open up, let's have a look.

09:53.250 --> 09:54.180
So you can find it.

09:54.180 --> 09:55.800
You can't actually find it through Google.

09:55.800 --> 09:57.510
You have to know the URL.

09:57.510 --> 09:59.940
It's just hard to find it through Google

09:59.940 --> 10:01.290
'cause there's no text here.

10:01.290 --> 10:02.760
SC

10:02.760 --> 10:06.600
Well just, this URL, (laughs) scs.ryerson.ca,

10:06.600 --> 10:08.280
and then this stuff on the end.

10:08.280 --> 10:12.000
And basically this is exactly what we are doing,

10:12.000 --> 10:12.833
but visualized.

10:12.833 --> 10:14.400
So here you need to draw a number.

10:14.400 --> 10:16.710
So let's say I draw the number four.

10:16.710 --> 10:18.850
And this tool will

10:20.280 --> 10:21.360
put the number four here.

10:21.360 --> 10:24.150
That's your image in our first step.

10:24.150 --> 10:27.090
Then this is the convolution step, right?

10:27.090 --> 10:28.200
And this is the pooling step.

10:28.200 --> 10:30.390
And pooling, by the way, is also called downsampling.

10:30.390 --> 10:33.960
So pooling and downsampling are the same thing.

10:33.960 --> 10:35.820
So you can see it's applied convolution,

10:35.820 --> 10:37.470
then it's applied pooling

10:37.470 --> 10:39.150
and you can see exactly how it works.

10:39.150 --> 10:42.270
So you can see what kind of convolutions it has applied

10:42.270 --> 10:44.910
or what kind of filters it applied, what they look like.

10:44.910 --> 10:47.820
You can see what features it's looking out for.

10:47.820 --> 10:49.380
And then it's applying pooling.

10:49.380 --> 10:50.670
So it's reducing the size.

10:50.670 --> 10:53.400
And you can see here that this is important, right?

10:53.400 --> 10:54.333
So you can see,

10:55.200 --> 10:58.770
that this is the convolved image,

10:58.770 --> 11:00.120
and this is the pooled image.

11:00.120 --> 11:01.800
And you can still see the same features.

11:01.800 --> 11:04.320
It's just less information, but same features, right?

11:04.320 --> 11:05.850
The features are preserved.

11:05.850 --> 11:07.203
That's the important part.

11:08.340 --> 11:10.740
And moreover, you know, if our four was a bit

11:10.740 --> 11:13.260
rotated, kind of like, a bit to the side,

11:13.260 --> 11:15.090
it would still be able to pick up

11:15.090 --> 11:16.950
very similar pooled layers.

11:16.950 --> 11:18.570
And then after that it's got more layers.

11:18.570 --> 11:19.830
We haven't talked about that yet.

11:19.830 --> 11:22.789
So then it's got another convolution,

11:22.789 --> 11:26.193
convolutional layer here which we actually won't have.

11:27.090 --> 11:28.560
And then it has another pooled layer,

11:28.560 --> 11:30.960
but it's basically just repeating that same process.

11:30.960 --> 11:32.070
And then after that,

11:32.070 --> 11:33.120
this is what we're going to be talking about

11:33.120 --> 11:34.950
further down in the course.

11:34.950 --> 11:38.100
It's got the fully connected layers, and so on.

11:38.100 --> 11:39.900
But you can definitely play around with that.

11:39.900 --> 11:43.713
So, if I delete that, like, if I draw a seven,

11:44.760 --> 11:47.790
you'll see that it actually tells you the guess,

11:47.790 --> 11:49.530
it guesses that this is a seven.

11:49.530 --> 11:53.010
And the second guess, the second likelihood, is a three.

11:53.010 --> 11:55.320
So you can draw some challenging things

11:55.320 --> 11:56.430
and see if it can pick them up.

11:56.430 --> 11:59.550
So let's say if I draw something that looks like a zero

11:59.550 --> 12:02.040
but it's not a finished zero, will it pick it up?

12:02.040 --> 12:03.720
Now, this time it didn't pick it up.

12:03.720 --> 12:06.150
Looks like a nine to the tool.

12:06.150 --> 12:08.520
What if I kind of like finish it like that?

12:08.520 --> 12:11.640
See now it thinks it's a zero or a nine.

12:11.640 --> 12:12.690
And you can see over there

12:12.690 --> 12:14.520
what's lighting up, the zero or the nine.

12:14.520 --> 12:16.620
But we'll talk about that part further down.

12:16.620 --> 12:17.453
Let's do one more.

12:17.453 --> 12:18.286
Let's say

12:19.200 --> 12:20.033
like eight,

12:20.033 --> 12:22.560
I think eights are pretty hard for this.

12:22.560 --> 12:23.790
No, picked up an eight.

12:23.790 --> 12:25.950
So you can see that goes into an eight

12:25.950 --> 12:28.920
and then like after that it stops being recognizable.

12:28.920 --> 12:32.130
This stops making sense to us humans, right?

12:32.130 --> 12:34.560
These features that it's working with.

12:34.560 --> 12:35.550
But at the same time

12:35.550 --> 12:38.940
it is correctly recognizing that it's an eight.

12:38.940 --> 12:40.530
Yeah. So definitely play around with that.

12:40.530 --> 12:44.280
You can draw a smiley face, see what happens then.

12:44.280 --> 12:45.910
Looks like a three to this

12:47.040 --> 12:49.140
to this tool because the tool is obviously trained up,

12:49.140 --> 12:51.090
only on digits from zero to nine.

12:51.090 --> 12:53.550
So it has to recognize something there

12:53.550 --> 12:55.953
out of those, and it recognizes a three.

12:56.970 --> 12:59.863
It's like in life when you see something like a

12:59.863 --> 13:02.340
type of fruit that you've never seen before,

13:02.340 --> 13:03.310
like a

13:04.230 --> 13:06.090
custard apple or something

13:06.090 --> 13:09.270
and you think that it's, like it's a,

13:09.270 --> 13:12.360
it's a pear because you've never actually seen one before

13:12.360 --> 13:14.040
you don't know what to classify it as.

13:14.040 --> 13:14.873
Same thing here.

13:14.873 --> 13:17.700
So it hasn't actually trained on smiley faces

13:17.700 --> 13:20.460
and that's why it thinks it's a three.

13:20.460 --> 13:21.293
So there you go.

13:21.293 --> 13:22.680
It's a very powerful, powerful tool.

13:22.680 --> 13:25.290
It'll be helpful for you to play around with it.

13:25.290 --> 13:29.142
Actually, when you put your mouse over a pixel,

13:29.142 --> 13:33.720
it shows you where the feature detector was

13:33.720 --> 13:34.710
to pick up that pixel

13:34.710 --> 13:37.470
so you can see where these pixels are coming from.

13:37.470 --> 13:41.850
And also, so you can see how the filter was kind of like

13:41.850 --> 13:44.100
going through the image exactly how we talked about

13:44.100 --> 13:44.933
in the course.

13:44.933 --> 13:47.400
And here you can see, you can see the pooling.

13:47.400 --> 13:49.500
You can see that the pooling is done with,

13:51.060 --> 13:52.720
the pooling is done with a

13:54.600 --> 13:57.180
little square size of two by two.

13:57.180 --> 13:59.940
And you can see that it's a stride of two, as well

13:59.940 --> 14:03.930
just as we discussed in today's tutorial.

14:03.930 --> 14:06.060
So there you go, have a play around with that.

14:06.060 --> 14:09.210
And I hope you enjoyed today's session.

14:09.210 --> 14:10.590
I look forward to seeing you next time.

14:10.590 --> 14:12.663
And until then, enjoy deep learning.