WEBVTT

00:00.480 --> 00:02.940
Hello, and welcome back to the course on deep learning.

00:02.940 --> 00:05.493
Today we're kicking off convolutional neural networks.

00:05.493 --> 00:06.930
This is gonna be exciting.

00:06.930 --> 00:08.580
Let's dive straight into it.

00:08.580 --> 00:10.890
We're going to start off with an image.

00:10.890 --> 00:13.560
What do you see when you look at this image?

00:13.560 --> 00:15.660
Do you see a person looking at you,

00:15.660 --> 00:18.180
or do you see a person looking to the right?

00:18.180 --> 00:23.180
You can see that your brain is struggling to adjust.

00:24.060 --> 00:25.890
If you look to the right side of the image,

00:25.890 --> 00:27.420
just look at the right border of the image,

00:27.420 --> 00:29.220
you'll see a person looking to the right.

00:29.220 --> 00:31.500
If you look at the left border of the image,

00:31.500 --> 00:33.660
you'll see a person looking at you.

00:33.660 --> 00:38.660
And this just proves that what our brain is looking

00:39.210 --> 00:42.180
for when we see things is features.

00:42.180 --> 00:44.250
Depending on the features that it sees,

00:44.250 --> 00:46.140
depending on the features that you process,

00:46.140 --> 00:48.720
you categorize things in certain ways.

00:48.720 --> 00:51.810
So when you look on the right side of the image,

00:51.810 --> 00:54.030
you see certain features of a person looking to the right,

00:54.030 --> 00:57.240
because they're closer to your center of focus.

00:57.240 --> 00:59.400
And therefore, your brain classifies that

00:59.400 --> 01:00.990
as a person looking to the right.

01:00.990 --> 01:03.300
When you look to the left side of the image,

01:03.300 --> 01:06.150
you see more features of a person looking at you,

01:06.150 --> 01:09.540
and therefore your brain classifies it as such.

01:09.540 --> 01:11.220
So let's have a look at another one.

01:11.220 --> 01:12.900
This is a very famous image.

01:12.900 --> 01:15.900
You probably have already seen it, but what do you see here?

01:16.800 --> 01:17.940
So, some people will say

01:17.940 --> 01:22.940
that they see a young lady wearing a dress looking away.

01:23.820 --> 01:28.170
Some people will say they see an old lady, wearing a scarf

01:28.170 --> 01:30.240
on her head, looking down.

01:30.240 --> 01:32.430
So I'm gonna point these features out,

01:32.430 --> 01:35.096
and you'll see that'll become very obvious.

01:35.096 --> 01:37.500
So, this is the face of the young lady looking away.

01:37.500 --> 01:39.030
She's looking into the distance.

01:39.030 --> 01:41.280
That's her coat, that's her hair,

01:41.280 --> 01:43.650
that's her little feather in her hair.

01:43.650 --> 01:46.411
And on the other hand, this is the head

01:46.411 --> 01:48.990
of the old lady looking down.

01:48.990 --> 01:52.290
That's her nose, that's her mouth, that's her chin.

01:52.290 --> 01:55.770
That's the scarf on her head, and she's looking down.

01:55.770 --> 01:58.590
So as you can see, two in one, and depending

01:58.590 --> 02:01.320
on which features your brain picks up, it will switch

02:01.320 --> 02:06.320
between classifying the image as one or the other.

02:06.870 --> 02:10.530
The oldest one of these illusions recorded

02:10.530 --> 02:13.890
in the printed work is this one.

02:13.890 --> 02:15.180
It's the duck or the rabbit.

02:15.180 --> 02:16.980
So, is this a duck or is this a rabbit?

02:16.980 --> 02:18.390
Another example.

02:18.390 --> 02:21.720
And now I'm gonna show you an image which will just,

02:21.720 --> 02:25.650
for a second, just look at it and see what emotions,

02:25.650 --> 02:29.100
or what kind of visual experience you go through.

02:29.100 --> 02:31.140
So, what do you see?

02:31.140 --> 02:35.730
Do you feel a bit, not dizzy, but a bit dazzled?

02:35.730 --> 02:37.800
Like your brain is trying to understand what

02:37.800 --> 02:40.980
it is, like it's trying to, it's jumping

02:40.980 --> 02:43.890
between her eyes, the up and down eyes.

02:43.890 --> 02:48.890
And this is a classic example of when there are

02:48.930 --> 02:52.140
certain features where it could be this, it could be that,

02:52.140 --> 02:54.120
but your brain cannot decide,

02:54.120 --> 02:57.513
because both seem plausible.

02:58.470 --> 03:01.650
Yeah, so, basically all these examples illustrate

03:01.650 --> 03:02.970
to us how the brain works,

03:02.970 --> 03:04.890
that it processes certain features

03:04.890 --> 03:08.760
on an image, or on whatever you see in real life,

03:08.760 --> 03:10.860
and it classifies that as such.

03:10.860 --> 03:14.130
And you've probably been in situations when you look

03:14.130 --> 03:16.140
over your shoulder quickly and you see something,

03:16.140 --> 03:20.820
you think it's, I don't know, a ball,

03:20.820 --> 03:23.940
but it turns out to be a cat, or you think it's a car,

03:23.940 --> 03:25.530
but it turns out to be a shadow, and things like that.

03:25.530 --> 03:27.390
That's because you don't have enough time

03:27.390 --> 03:29.670
to process those features, or you don't have enough features

03:29.670 --> 03:31.200
to classify things as such.

03:31.200 --> 03:35.430
And this is, for me, this is very interesting,

03:35.430 --> 03:38.580
because what we are going to be doing with neural networks,

03:38.580 --> 03:40.800
with convolutional neural networks, is very similar.

03:40.800 --> 03:44.010
And you'll find that the way that computers are going to

03:44.010 --> 03:46.410
be processing images is going to be extremely similar

03:46.410 --> 03:48.240
to the way we are processing images.

03:48.240 --> 03:50.520
So it's very valuable to understand,

03:50.520 --> 03:52.350
and just kind of remember these things,

03:52.350 --> 03:53.610
that this is how we do it.

03:53.610 --> 03:55.950
And I'm going to take this lady off your screens

03:55.950 --> 03:58.620
because she's probably already freaking you out by now.

03:58.620 --> 04:00.960
So here's something different.

04:00.960 --> 04:02.130
Here's an experiment.

04:02.130 --> 04:05.460
An experiment done on computers,

04:05.460 --> 04:06.990
on convolutional neural networks.

04:06.990 --> 04:11.340
So we're slowly moving now from humans to computers.

04:11.340 --> 04:14.343
And this slide is from a (indistinct) by Geoffrey Hinton.

04:15.330 --> 04:18.540
And here you have, basically it describes an experiment

04:18.540 --> 04:22.320
that he had done on some convolutional neural networks

04:22.320 --> 04:24.450
that he had trained up.

04:24.450 --> 04:26.550
So here you see three images,

04:26.550 --> 04:28.230
and we're gonna go through them left to right,

04:28.230 --> 04:30.120
and see how you would classify them,

04:30.120 --> 04:32.752
and then see how the computer classified them.

04:32.752 --> 04:35.430
So on the left, what do you think this is?

04:35.430 --> 04:37.680
You probably said cheetah, and you would be right.

04:37.680 --> 04:39.895
And this is what the computer said.

04:39.895 --> 04:41.220
So, right away, right off the bat,

04:41.220 --> 04:43.555
we're going to learn how to read these images,

04:43.555 --> 04:45.540
because if you're going to go deep

04:45.540 --> 04:49.620
into convolutional neural networks, no pun intended,

04:49.620 --> 04:52.470
if you're going to start learning more and more about them

04:52.470 --> 04:54.510
and using them, you'll see a lot of these.

04:54.510 --> 04:57.860
And I've actually seen people read them incorrectly.

04:57.860 --> 05:01.470
So here at the top, cheetah is what it actually is.

05:01.470 --> 05:04.860
So that's the actual correct label of the image.

05:04.860 --> 05:07.350
That's what the label of the image is,

05:07.350 --> 05:11.730
regardless of any processing and computer vision.

05:11.730 --> 05:13.920
And then here are the guesses.

05:13.920 --> 05:18.270
The top four, or five sometimes, guesses of the algorithm,

05:18.270 --> 05:20.640
and each one is given a probability.

05:20.640 --> 05:24.090
So the computer said, or the neural network said,

05:24.090 --> 05:26.430
cheetah, leopard, snow leopard, or Egyptian cat can

05:26.430 --> 05:29.160
be one of the four, and cheetah has the highest vote.

05:29.160 --> 05:31.260
And throughout this part of the course,

05:31.260 --> 05:33.674
you'll understand what these votes mean

05:33.674 --> 05:34.860
and how they're derived.

05:34.860 --> 05:36.510
But for now, it's pretty intuitive, right?

05:36.510 --> 05:38.340
So it's a cheetah in reality.

05:38.340 --> 05:41.010
And the neural network guessed right, it said

05:41.010 --> 05:43.500
with a high probability of about, like, 95, 99%,

05:43.500 --> 05:44.333
it's a cheetah.

05:45.900 --> 05:49.170
Then the second one, what do you think that is?

05:49.170 --> 05:51.270
That is a bullet train.

05:51.270 --> 05:54.480
And the neural network was able to distinguish

05:54.480 --> 05:57.580
between bullet train, passenger car, subway, train,

05:57.580 --> 05:59.040
those are the top choices,

05:59.040 --> 06:01.260
of course it had many more options.

06:01.260 --> 06:03.330
These neural networks learn to distinguish

06:03.330 --> 06:05.490
between not just four categories,

06:05.490 --> 06:08.730
but dozens, even thousands, of categories at the same time.

06:08.730 --> 06:10.890
So those are the four options that it picked.

06:10.890 --> 06:12.810
And so that's bullet train, and it's a bullet train.

06:12.810 --> 06:14.660
So what do you think the last one is?

06:16.382 --> 06:18.480
There are a couple of options

06:18.480 --> 06:20.130
where it's not very clear what it is.

06:20.130 --> 06:22.830
It could be a frying pan, it could be a magnifying glass,

06:22.830 --> 06:27.830
it could be even maybe a pair of scissors, some might say.

06:28.080 --> 06:30.750
Well, the neural network said it was a pair of scissors.

06:30.750 --> 06:32.580
But you can see how you can go wrong here.

06:32.580 --> 06:35.490
First of all, it's not a very clear image,

06:35.490 --> 06:39.277
and also you can see that the probabilities

06:40.800 --> 06:41.790
are not as clear here.

06:41.790 --> 06:43.950
So the neural network was a bit confused,

06:43.950 --> 06:46.290
a bit indecisive, just as we are.

06:46.290 --> 06:48.540
So it said scissors was the highest probability,

06:48.540 --> 06:51.810
but then it had hand glass, which it actually was,

06:51.810 --> 06:53.790
not far behind in second place,

06:53.790 --> 06:55.890
and frying pan, stethoscope.

06:55.890 --> 06:58.980
So basically, here you can see that scissors

06:58.980 --> 07:02.574
was its first guess, but the correct option was number two,

07:02.574 --> 07:04.063
and that's why it's colored in red.
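
NOTE
A minimal sketch of how to read a slide like this one, assuming made-up
probabilities (the exact numbers on the slide are not in the transcript).
The helper names top_k and is_top1_correct are hypothetical, for illustration only.
```python
# Hypothetical probabilities; the real values on the slide are not in the transcript.
def top_k(probabilities, k=4):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
def is_top1_correct(probabilities, true_label):
    """True when the network's first guess matches the actual label."""
    return top_k(probabilities, k=1)[0][0] == true_label
cheetah = {"cheetah": 0.97, "leopard": 0.02, "snow leopard": 0.007, "Egyptian cat": 0.003}
hand_glass = {"scissors": 0.40, "hand glass": 0.30, "frying pan": 0.20, "stethoscope": 0.10}
print(is_top1_correct(cheetah, "cheetah"))
print(is_top1_correct(hand_glass, "hand glass"))
print([label for label, _ in top_k(hand_glass, k=2)])
```
The hand glass case is the one where the right label appears among the guesses but is not the first guess.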

07:04.063 --> 07:05.850
So there we go, that's what neural networks

07:05.850 --> 07:07.980
are already capable of, and this is actually quite

07:07.980 --> 07:10.650
an old slide, this was several years ago,

07:10.650 --> 07:11.850
now they're even better.

07:11.850 --> 07:15.120
And you will see that from the practical application

07:15.120 --> 07:16.920
that you'll be coding together with (indistinct).

07:16.920 --> 07:17.970
But now let's try to understand

07:17.970 --> 07:19.320
a bit better what convnets,

07:19.320 --> 07:21.480
or convolutional neural networks actually are,

07:21.480 --> 07:23.940
and why they are gaining so much popularity.

07:23.940 --> 07:25.770
And they actually are gaining popularity,

07:25.770 --> 07:29.550
so you can see here a Google Trends comparison

07:29.550 --> 07:31.740
I did just yesterday.

07:31.740 --> 07:35.670
Here you can see that convolutional neural networks

07:35.670 --> 07:39.480
are even taking over artificial neural networks.

07:39.480 --> 07:43.260
So a massive increase,

07:43.260 --> 07:44.910
and they're just gonna keep going that way

07:44.910 --> 07:48.450
because it is a very important field,

07:48.450 --> 07:50.850
that is where all the things happen,

07:50.850 --> 07:52.530
such as self-driving cars,

07:52.530 --> 07:55.950
how do they recognize people on the road,

07:55.950 --> 07:58.400
how to recognize stop signs and things like that.

07:59.643 --> 08:04.643
How is Facebook able to tag images or people in images?

08:04.920 --> 08:08.790
And not only just like, remember previously years ago,

08:08.790 --> 08:10.590
you had to tag people yourself?

08:10.590 --> 08:14.250
Then it would recognize faces, you had to add the names,

08:14.250 --> 08:16.470
and now it just recognizes the faces

08:16.470 --> 08:18.600
and adds the names at the same time.

08:18.600 --> 08:22.590
Well, that is what convolutional neural networks

08:22.590 --> 08:23.790
are capable of.

08:23.790 --> 08:25.143
And speaking of Facebook,

08:26.130 --> 08:28.980
if Geoffrey Hinton is the godfather

08:28.980 --> 08:32.970
of artificial neural networks and deep learning,

08:32.970 --> 08:36.210
then Yann LeCun is the grandfather

08:36.210 --> 08:39.120
of convolutional neural networks.

08:39.120 --> 08:42.570
Yann LeCun was a postdoctoral researcher in Geoffrey Hinton's lab.

08:42.570 --> 08:45.720
And in fact, here, you can see them together.

08:45.720 --> 08:49.984
And Geoffrey Hinton now is pioneering deep learning

08:49.984 --> 08:51.420
at Google.

08:51.420 --> 08:53.400
Yann LeCun is the director of Facebook

08:53.400 --> 08:54.900
Artificial Intelligence Research,

08:54.900 --> 08:57.000
and also a professor at NYU.

08:57.000 --> 09:00.120
So slowly we're, I love this part of the course,

09:00.120 --> 09:03.360
slowly we're building up these names,

09:03.360 --> 09:06.810
or this kind of picture of the profiles

09:06.810 --> 09:09.390
of the people who are driving this field.

09:09.390 --> 09:13.830
And in the next couple of parts we'll get to know about

09:13.830 --> 09:16.290
a few more and we'll have this whole mafia

09:16.290 --> 09:18.810
as they call themselves, or Yann LeCun calls them,

09:18.810 --> 09:21.090
mafia, or conspiracy of deep learning.

09:21.090 --> 09:22.350
And you'll learn a bit more

09:22.350 --> 09:24.822
about how this whole field developed.

09:24.822 --> 09:27.390
And yeah, these are just some great, great people.

09:27.390 --> 09:30.480
And so, Yann LeCun back in the 80s

09:30.480 --> 09:33.600
and the 90s made significant contributions

09:33.600 --> 09:36.270
to the field of convolutional neural networks.

09:36.270 --> 09:39.720
And as we'll see throughout this course,

09:39.720 --> 09:43.890
has been able to develop, or help the world develop,

09:43.890 --> 09:46.620
something so extremely powerful.

09:46.620 --> 09:51.450
So, moving on to how convolutional neural networks work.

09:51.450 --> 09:52.710
You have an input.

09:52.710 --> 09:54.330
It's very simple, it's very straightforward.

09:54.330 --> 09:56.190
So you have an input image,

09:56.190 --> 09:58.350
it goes through the convolutional neural network,

09:58.350 --> 10:01.980
and you have an output label, so it classifies that image

10:01.980 --> 10:05.760
as something like as a cheetah, or a bullet train,

10:05.760 --> 10:06.780
or something else.

10:06.780 --> 10:10.860
Now, going into a bit more detail.

10:10.860 --> 10:14.617
For instance, after a neural network has been trained up

10:14.617 --> 10:19.617
on certain images, on certain classified images,

10:19.770 --> 10:23.670
or images that have been categorized prior.

10:23.670 --> 10:25.140
After that you can give it,

10:25.140 --> 10:26.730
let's say a neural network has been trained

10:26.730 --> 10:30.510
up to recognize facial expressions, emotions.

10:30.510 --> 10:33.480
You can give it a face of a smiling person.

10:33.480 --> 10:37.350
Not just a face, like a drawing of a face, like this,

10:37.350 --> 10:39.420
but an actual face of a person smiling,

10:39.420 --> 10:41.610
and it'll tell you that that person is happy.

10:41.610 --> 10:44.850
And you can give it the face of a person that's frowning,

10:44.850 --> 10:47.280
it will tell you that that person is sad.

10:47.280 --> 10:48.570
You can recognize these emotions.

10:48.570 --> 10:51.030
And as you can see, that's already very powerful

10:51.030 --> 10:53.310
in terms of so many different applications,

10:53.310 --> 10:57.570
just of this one example you can think of right away.

10:57.570 --> 11:00.510
And in both cases it'll give you a probability.

11:00.510 --> 11:02.546
So it won't say, you know,

11:02.546 --> 11:04.950
with a hundred percent the person's happy or sad,

11:04.950 --> 11:08.970
it'll be 99, or 98, or maybe 80%

11:08.970 --> 11:11.760
when it's unclear of what's going on.

11:11.760 --> 11:12.990
And just like we are, right?

11:12.990 --> 11:16.650
Sometimes we can mistake things for what they're not.

11:16.650 --> 11:20.580
Or sometimes it's just not clear

11:20.580 --> 11:22.260
if the person is smiling or frowning,

11:22.260 --> 11:24.750
or if it's a dog or a cat,

11:24.750 --> 11:28.530
or if it's a train or a bullet train, right?

11:28.530 --> 11:31.020
Sometimes we haven't seen enough features.

11:31.020 --> 11:32.370
And it all goes down to features,

11:32.370 --> 11:35.880
because that's how we process visual information,

11:35.880 --> 11:38.580
as we saw from the start of this tutorial.

11:38.580 --> 11:42.600
So, how is a neural network able

11:42.600 --> 11:44.130
to recognize these features?

11:44.130 --> 11:48.060
Well, it all starts at the very basic level.

11:48.060 --> 11:50.850
You have, let's say you have an image, you have two images.

11:50.850 --> 11:54.000
One is a black and white image of two by two pixels,

11:54.000 --> 11:56.520
and one is a colored image of two by two pixels.

11:56.520 --> 11:59.250
Well, neural networks leverage the fact

11:59.250 --> 12:04.250
that the black and white image is a two-dimensional array.

12:04.680 --> 12:07.200
So the way we see it right now on the left

12:07.200 --> 12:09.720
is just the visual representation, right?

12:09.720 --> 12:11.250
So it's some kind of picture,

12:11.250 --> 12:14.100
and for simplicity's sake it's just a two by two picture.

12:14.100 --> 12:16.950
But in computer terms, it's actually a two-dimensional array

12:16.950 --> 12:20.370
with every single one of those pixels having a value

12:20.370 --> 12:22.350
between zero and 255.

12:22.350 --> 12:25.410
So that's eight bits of information,

12:25.410 --> 12:27.690
two to the power of eight is 256.

12:27.690 --> 12:30.330
So therefore, the values are from zero to 255.

12:30.330 --> 12:32.250
And that's the intensity of the color.

12:32.250 --> 12:34.274
And in this case, the color white.

12:34.274 --> 12:36.300
So zero will be a completely black pixel,

12:36.300 --> 12:38.640
255 will be a completely white pixel,

12:38.640 --> 12:42.120
and between them you have the grayscale range

12:42.120 --> 12:44.610
of possible options for this pixel.

12:44.610 --> 12:46.500
And based on that information,

12:46.500 --> 12:50.010
computers are able to then work with the image,

12:50.010 --> 12:51.420
and that's kind of like the starting point

12:51.420 --> 12:55.170
that any image actually has a digital representation,

12:55.170 --> 12:56.580
has a digital form.

12:56.580 --> 12:59.460
And those are just basically ones and zeros

12:59.460 --> 13:03.240
that form a number, zero to 255 for every single pixel.

13:03.240 --> 13:04.350
And that's what the computer works with.

13:04.350 --> 13:06.884
It doesn't actually work with, you know,

13:06.884 --> 13:07.717
the colors or anything, it works with the ones

13:07.717 --> 13:09.540
and zeros at the end of the day,

13:09.540 --> 13:12.453
that's kind of like the foundation of it all.
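
NOTE
A minimal sketch of the grayscale idea above, assuming a made-up 2x2 image
stored as a plain nested list. The pixel values and the helper name to_bits
are hypothetical; they only illustrate the 0 to 255 range and the 8 bits behind it.
```python
# Hypothetical 2x2 grayscale image: one intensity per pixel, 0 (black) to 255 (white).
gray_image = [
    [0, 128],
    [200, 255],
]
def to_bits(value):
    """Return the 8-bit binary string that represents one pixel intensity."""
    return format(value, "08b")
levels = 2 ** 8  # 8 bits give 256 possible intensities, numbered 0 to 255
bits = [[to_bits(value) for value in row] for row in gray_image]
print(levels)
print(bits[0][0])
print(bits[1][1])
```
The black pixel comes out as eight zeros and the white pixel as eight ones, which is the "ones and zeros" form the computer works with.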

13:13.350 --> 13:15.480
And in a colored image,

13:15.480 --> 13:17.960
it's actually a three-dimensional array.

13:17.960 --> 13:22.020
You've got a blue layer, a green layer, and a red layer.

13:22.020 --> 13:25.410
That stands for RGB, red, green, blue,

13:25.410 --> 13:29.760
and each one of those colors has its own intensity.

13:29.760 --> 13:34.760
So basically, a pixel has three values assigned to it.

13:36.990 --> 13:41.400
Each one of them is between zero and 255.

13:41.400 --> 13:45.969
And therefore you can find out what this image,

13:45.969 --> 13:48.360
what color exactly this pixel is,

13:48.360 --> 13:50.310
by combining those three values.

13:50.310 --> 13:53.520
And again, computers are going to be working with that.

13:53.520 --> 13:55.800
So that's the foundation of it all.

13:55.800 --> 13:59.520
That's the red channel, the green channel, the blue channel.
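
NOTE
A minimal sketch of the color case, assuming a made-up 2x2 image laid out as
rows, then columns, then a [red, green, blue] triple per pixel. That layout,
the pixel values, and the helper name channel are assumptions for illustration.
```python
# Hypothetical 2x2 color image: each pixel is a [red, green, blue] triple, each 0 to 255.
rgb_image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 0]],
]
def channel(image, index):
    """Extract one color channel (0=red, 1=green, 2=blue) as a 2D array."""
    return [[pixel[index] for pixel in row] for row in image]
# Three dimensions: height, width, and the three color values per pixel.
shape = (len(rgb_image), len(rgb_image[0]), len(rgb_image[0][0]))
red = channel(rgb_image, 0)
green = channel(rgb_image, 1)
blue = channel(rgb_image, 2)
print(shape)
print(red)
```
Each extracted channel is itself a two-dimensional array, just like the black and white image.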

13:59.520 --> 14:03.570
And finally, let's have a look at, for instance,

14:03.570 --> 14:05.550
an example, a very trivial example,

14:05.550 --> 14:09.600
of a smiling face in computer terms.

14:09.600 --> 14:12.510
If we just really simplify things instead

14:12.510 --> 14:17.160
of having values from zero to 255,

14:17.160 --> 14:19.050
just so that we can understand things better

14:19.050 --> 14:21.030
and really grasp the concepts,

14:21.030 --> 14:26.030
we're going to say zero is white, one is black, right?

14:26.820 --> 14:30.900
So we're just going to simplify things to the extreme

14:30.900 --> 14:33.990
and you'll see that that image can be represented like that.

14:33.990 --> 14:36.030
So the reason why we've brought this up is

14:36.030 --> 14:38.940
because we're going to, in all of our intuition tutorials

14:38.940 --> 14:40.830
we're going to work with images like this

14:40.830 --> 14:43.773
which are very simple, but at the same time,

14:43.773 --> 14:45.450
then all those concepts can translate back

14:45.450 --> 14:48.960
to the zero to 255 range of values,

14:48.960 --> 14:50.700
and everything applies the same way there.
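
NOTE
A minimal sketch of the simplified smiling face, assuming a made-up 7x7 grid
(the actual grid on the slide is not in the transcript). It follows the
convention stated above, zero is white and one is black. The helper names
render and to_grayscale are hypothetical.
```python
# Hypothetical 7x7 smiley; 0 is white and 1 is black, as in the lecture.
smiley = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
def render(grid):
    """Draw black pixels as '#' and white pixels as '.'."""
    return "\n".join("".join("#" if value else "." for value in row) for row in grid)
def to_grayscale(grid):
    """Map the simplified values back to 0 to 255: black (1) to 0, white (0) to 255."""
    return [[0 if value else 255 for value in row] for row in grid]
print(render(smiley))
```
The to_grayscale helper shows one way the same picture maps back onto the full intensity range.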

14:50.700 --> 14:52.050
And the steps that we're going to be going

14:52.050 --> 14:54.930
through with these images are step number one, convolution,

14:54.930 --> 14:58.500
step number two, max pooling, step number three, flattening,

14:58.500 --> 15:00.570
and step number four, full connection.

15:00.570 --> 15:02.490
And I can imagine that probably none

15:02.490 --> 15:05.760
of these words mean much to you at the moment,

15:05.760 --> 15:08.400
but by the end of this section of the course,

15:08.400 --> 15:11.700
you will understand them in great detail

15:11.700 --> 15:13.950
and exactly what they're doing.

15:13.950 --> 15:16.020
So we'll get started on the next tutorial.

15:16.020 --> 15:19.620
For now, the additional reading that you might want to look

15:19.620 --> 15:23.880
into is Yann LeCun's original paper

15:23.880 --> 15:27.950
that gave rise to convolutional neural networks,

15:27.950 --> 15:30.090
it's called Gradient-Based Learning Applied

15:30.090 --> 15:31.650
to Document Recognition.

15:31.650 --> 15:33.390
You may have seen this image before,

15:33.390 --> 15:35.760
floating around the internet, it is from that paper.

15:35.760 --> 15:38.642
So if you wanna go back to the very beginnings

15:38.642 --> 15:41.623
of how it all happened, where it all came from,

15:41.623 --> 15:43.823
this is the paper to look into.

15:43.823 --> 15:46.410
And I look forward to seeing you on the next tutorial.

15:46.410 --> 15:48.453
Until then, enjoy deep learning.