WEBVTT

00:00.660 --> 00:03.600
-: Hello and welcome back to the course on deep learning.

00:03.600 --> 00:06.900
In today's tutorial, we're talking about gradient descent.

00:06.900 --> 00:10.020
What we learned previously was that in order

00:10.020 --> 00:12.240
for a neural network to learn,

00:12.240 --> 00:14.340
what needs to happen is backpropagation.

00:14.340 --> 00:17.970
And that is when the error, the difference,

00:17.970 --> 00:21.150
or the sum of squared differences between y hat

00:21.150 --> 00:25.441
and y is backpropagated through the neural network

00:25.441 --> 00:28.500
and the weights are adjusted accordingly.

00:28.500 --> 00:30.900
So we saw that and today we're going to

00:30.900 --> 00:34.380
learn exactly how these weights are adjusted.

00:34.380 --> 00:36.090
So let's have a look.

00:36.090 --> 00:41.090
This is our very simple version of a neural network, a perceptron

00:41.840 --> 00:44.820
or a single layer feed-forward neural network.

00:44.820 --> 00:48.060
And what we can see here is this whole process

00:48.060 --> 00:52.260
in action where we've got some input value, then

00:52.260 --> 00:57.000
we've got a weight, then an activation function is applied.

00:57.000 --> 00:58.170
We get y hat

00:58.170 --> 00:59.880
and then we compare it to the actual value.

00:59.880 --> 01:01.830
We calculate the cost function.

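NOTE
A minimal sketch of the cost just described, assuming the plain squared-error form mentioned in this tutorial (y hat minus y, squared, summed over observations).
The function name and the sample numbers are illustrative assumptions, not taken from the lecture.
```python
def squared_error_cost(y_hat, y):
    """Sum of squared differences between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(y_hat, y))
# Hypothetical predictions and targets, chosen only for illustration.
example_cost = squared_error_cost([2.0, 3.0], [1.0, 5.0])
```
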
01:01.830 --> 01:05.400
So how can we minimize the cost function?

01:05.400 --> 01:07.350
What can we do about it?

01:07.350 --> 01:10.170
Well, one approach to do it is a brute force

01:10.170 --> 01:13.830
approach where we just take all, lots

01:13.830 --> 01:16.230
of different possible weights and look

01:16.230 --> 01:18.210
at them and see which one works best.

01:18.210 --> 01:21.810
And what we do is, for instance, we'd try out for, let's say

01:21.810 --> 01:25.020
for example, a thousand weights and we'd try them out.

01:25.020 --> 01:27.720
We'd get something like this for the cost function.

01:27.720 --> 01:29.832
And this is a chart where

01:29.832 --> 01:32.850
on the vertical axis you have the cost function.

01:32.850 --> 01:34.860
On the horizontal axis you have Y hat.

01:34.860 --> 01:37.710
And because the formula is y hat minus

01:37.710 --> 01:40.830
y, squared, the cost function would look

01:40.830 --> 01:42.630
something like that.

01:42.630 --> 01:47.630
And basically you'd find the best one is over here.

01:47.940 --> 01:51.030
So very simple, very intuitive approach.

01:51.030 --> 01:53.220
Why not do this brute force method?

01:53.220 --> 01:55.473
Why not just try out a thousand different,

01:55.473 --> 02:00.120
cost, thousand different parameters

02:00.120 --> 02:03.030
or inputs for weights and see which one works the best?

02:03.030 --> 02:04.380
You'll find the best one that way.

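NOTE
The brute-force idea above can be sketched for a single weight, assuming no activation function (y hat is simply weight times input).
The input, target, and candidate grid are hypothetical values picked for illustration.
```python
def cost(y_hat, y):
    return (y_hat - y) ** 2
def brute_force_weight(x, y, candidates):
    """Try every candidate weight and keep the one with the lowest cost."""
    return min(candidates, key=lambda w: cost(w * x, y))
# Hypothetical setup: input 2.0 and target 3.0, so the ideal weight is 1.5.
candidates = [i * 0.01 for i in range(-500, 500)]  # 1,000 evenly spaced weights
best_weight = brute_force_weight(2.0, 3.0, candidates)
```
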
02:04.380 --> 02:07.800
Well, if you have just one weight to optimize, this might work.

02:07.800 --> 02:10.440
But as you increase the number of weights,

02:10.440 --> 02:13.500
increase the number of synapses in your network

02:13.500 --> 02:16.590
you have to face the curse of dimensionality.

02:16.590 --> 02:19.470
And so what is the curse of dimensionality?

02:19.470 --> 02:21.090
The best way to describe this

02:21.090 --> 02:24.600
or explain it is to just look at a practical example.

02:24.600 --> 02:26.970
So remember this example we had when we were talking

02:26.970 --> 02:30.480
about how neural networks actually work where

02:30.480 --> 02:31.952
we were building or

02:31.952 --> 02:36.952
running a neural network for property valuation.

02:37.110 --> 02:38.460
So this is what it looked

02:38.460 --> 02:40.740
like when it was trained up already.

02:40.740 --> 02:43.320
Well, when it's not trained, before it's trained,

02:43.320 --> 02:45.510
before we know what the weights are,

02:45.510 --> 02:47.940
the actual neural network looks like this, right?

02:47.940 --> 02:52.940
Because we have all these different possible synapses

02:53.160 --> 02:55.140
and we still have to train up the weights.

02:55.140 --> 02:57.270
And here we have a total of 25 weights.

02:57.270 --> 02:59.807
So four times five at the start, plus five more

02:59.807 --> 03:03.630
from the hidden layer to the output layer, 25 weights total.

03:03.630 --> 03:08.630
And let's see how we could possibly brute force 25 weights.

03:09.270 --> 03:12.600
It's a very simple neural network right here.

03:12.600 --> 03:14.640
Very simple, just one hidden layer.

03:14.640 --> 03:18.360
And how could we brute force our way

03:18.360 --> 03:21.330
through a neural network of this size?

03:21.330 --> 03:24.390
Well, let's do some simple mathematical calculations.

03:24.390 --> 03:25.920
We have 25 weights.

03:25.920 --> 03:26.753
So that means

03:26.753 --> 03:28.980
if we have a thousand combinations that we're gonna test

03:28.980 --> 03:30.630
out for every weight, the total number

03:30.630 --> 03:31.920
of combinations is a thousand

03:31.920 --> 03:33.780
to the power of 25,

03:33.780 --> 03:37.800
or 10 to the power of 75 different combinations.

03:37.800 --> 03:41.553
Now, let's see how Sunway TaihuLight,

03:41.553 --> 03:44.471
the world's fastest supercomputer

03:44.471 --> 03:47.010
as of June 2016.

03:47.010 --> 03:49.890
How would it approach this problem?

03:49.890 --> 03:53.010
So Sunway TaihuLight looks

03:53.010 --> 03:56.756
like this, is a whole huge building pretty much

03:56.756 --> 03:59.010
for this one supercomputer.

03:59.010 --> 04:02.460
And it got the Guinness World Record

04:02.460 --> 04:05.190
for being the fastest supercomputer.

04:05.190 --> 04:08.370
Right now, it is the fastest supercomputer in the world

04:08.370 --> 04:11.520
and Sunway TaihuLight can operate

04:11.520 --> 04:15.108
at a speed of 93 petaflops.

04:15.108 --> 04:19.920
FLOPS stands for floating-point operations per second.

04:19.920 --> 04:24.030
So it can do 93 times 10

04:24.030 --> 04:28.080
to the power of 15 floating-point operations per second.

04:28.080 --> 04:29.583
That's how quick it is.

04:31.290 --> 04:33.890
In comparison, average computers right now

04:33.890 --> 04:38.190
they do like just over several gigaflops and so on.

04:38.190 --> 04:40.691
So it's kind of in those ranges,

04:40.691 --> 04:44.370
way less than Sunway TaihuLight.

04:44.370 --> 04:47.680
So Sunway TaihuLight is in the forefront of technology

04:47.680 --> 04:52.680
and let's say hypothetically that it can do one, test

04:55.890 --> 04:59.820
one combination for our neural network in one FLOP,

04:59.820 --> 05:01.650
basically in one floating-point operation.

05:01.650 --> 05:03.240
That is not possible.

05:03.240 --> 05:05.790
That is not practical because you need multiple floating-point

05:05.790 --> 05:09.510
operations to test out a single weight in your neural network.

05:09.510 --> 05:11.280
But even so, let's give it a head start.

05:11.280 --> 05:13.320
Let's say that it can do that.

05:13.320 --> 05:17.400
In an ideal world, it can do that in one floating-point operation.

05:17.400 --> 05:20.070
It can do one test per floating-point operation.

05:20.070 --> 05:22.957
That means it will still require 10 to the power

05:22.957 --> 05:27.957
of 75 divided by 93 times 10 to the power of 15 seconds to

05:28.080 --> 05:32.320
run all of those tests to brute force

05:32.320 --> 05:34.110
through that network.

05:34.110 --> 05:36.970
So that means approximately 10 to the power

05:36.970 --> 05:39.350
of 58 seconds, and that is on the order of 10 to the power

05:39.350 --> 05:42.120
of 50 years.

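NOTE
The arithmetic above can be checked with a short sketch. It keeps the tutorial's generous assumption of one tested combination per floating-point operation.
The 365.25-day year is an added assumption. The result comes out near 10 to the power of 58 seconds, which is on the order of 10 to the power of 50 years.
```python
weights = 4 * 5 + 5                               # 25 weights, as counted in the tutorial
candidates_per_weight = 1_000
combinations = candidates_per_weight ** weights   # equals 10**75
flops_per_second = 93 * 10**15                    # 93 petaflops
seconds = combinations / flops_per_second         # assumes one test per operation
years = seconds / (365.25 * 24 * 3600)            # assumes 365.25-day years
print(f"{seconds:.2e} seconds, about {years:.1e} years")
```
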
05:42.120 --> 05:44.310
That is a huge number,

05:44.310 --> 05:48.210
that is longer than the universe has existed.

05:48.210 --> 05:51.308
And quite simply,

05:51.308 --> 05:53.820
this number is so huge, it's just definitely

05:53.820 --> 05:58.599
not going to work for us at all in our optimization.

05:58.599 --> 05:59.740
So there we go.

05:59.740 --> 06:02.599
This is a no-no, even on the world's fastest

06:02.599 --> 06:05.460
supercomputer Sunway TaihuLight.

06:05.460 --> 06:07.620
So we have to come up with a different approach.

06:07.620 --> 06:10.320
How are we going to find the optimal weights?

06:10.320 --> 06:13.590
By the way, our neural network was very simple.

06:13.590 --> 06:15.360
What if the neural network looks

06:15.360 --> 06:17.190
something like this, right?

06:17.190 --> 06:20.190
Or even greater than that, then yeah

06:20.190 --> 06:22.219
it's just not going to happen at all, ever.

06:22.219 --> 06:24.150
So the method we're going to be looking

06:24.150 --> 06:26.190
at is called gradient descent.

06:26.190 --> 06:28.100
And you may have heard of it already.

06:28.100 --> 06:30.720
If not, we will find out what it is right now.

06:30.720 --> 06:34.617
So there's our cost function

06:34.617 --> 06:39.617
and now we're going to see how we can come up

06:39.720 --> 06:43.200
with a faster way to find the best option.

06:43.200 --> 06:45.060
So let's say we start somewhere.

06:45.060 --> 06:45.930
You gotta start somewhere.

06:45.930 --> 06:50.789
So we start over there and from that point in the top left,

06:50.789 --> 06:54.222
what we're going to do is we're going to look

06:54.222 --> 06:58.620
at the angle of our cost function at that point.

06:58.620 --> 07:00.570
So we're just going to basically, that's why

07:00.570 --> 07:02.130
it's called gradient, because you have to differentiate.

07:02.130 --> 07:04.230
We're not going to look at the mathematical equations.

07:04.230 --> 07:06.270
We will provide some tips

07:06.270 --> 07:09.079
on additional reading at the end of the next lecture.

07:09.079 --> 07:13.380
But basically you just need to differentiate, find

07:13.380 --> 07:16.298
out what the slope is at that specific point

07:16.298 --> 07:19.290
and find out if the slope is positive or negative.

07:19.290 --> 07:20.880
If the slope is negative,

07:20.880 --> 07:23.970
like in this case, it means that you're going downhill.

07:23.970 --> 07:27.300
So to the right is downhill, to the left is uphill.

07:27.300 --> 07:29.760
And from there it means you need to go right.

07:29.760 --> 07:31.650
Basically you need to go downhill.

07:31.650 --> 07:33.060
And that's what we're going to do.

07:33.060 --> 07:35.460
Vroom, takes a step, right?

07:35.460 --> 07:36.720
The ball rolls down.

07:36.720 --> 07:38.340
Again, same thing.

07:38.340 --> 07:39.570
You calculate the slope.

07:39.570 --> 07:42.330
This time the slope is positive, meaning right is uphill

07:42.330 --> 07:45.030
left is downhill and you need to go left.

07:45.030 --> 07:47.248
And you roll the ball down and again

07:47.248 --> 07:52.143
you calculate the slope and you roll the ball right.

07:53.160 --> 07:53.993
There we go.

07:53.993 --> 07:56.160
So, in simple terms,

07:56.160 --> 08:00.419
that's how you find the best weights,

08:00.419 --> 08:04.560
the best situation that minimizes your cost function.

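NOTE
The downhill steps described above can be sketched for one weight, again assuming y hat is weight times input with no activation function.
The learning rate, step count, starting weight, and data values are hypothetical choices, not values from the lecture.
```python
def cost(w, x=2.0, y=3.0):
    return (w * x - y) ** 2
def slope(w, x=2.0, y=3.0):
    """Derivative of the cost with respect to the weight w."""
    return 2 * (w * x - y) * x
def gradient_descent(w, learning_rate=0.05, steps=100):
    for _ in range(steps):
        # Negative slope: move right (increase w). Positive slope: move left.
        w = w - learning_rate * slope(w)
    return w
final_weight = gradient_descent(w=-4.0)
```
Each step here costs one slope evaluation, so 100 steps replace the 1,000 trials of the brute-force search.
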
08:04.560 --> 08:06.900
Of course, it's not going to be like a ball rolling,

08:06.900 --> 08:08.820
it's going to be a very zigzaggy type of approach

08:08.820 --> 08:11.079
but it's easier to remember or kind of it's--

08:11.079 --> 08:14.970
it's more fun to look at it as a ball rolling.

08:14.970 --> 08:16.318
But in reality, yes, you just

08:16.318 --> 08:18.570
it is going to be like a step-by-step approach.

08:18.570 --> 08:21.009
So it's gonna be a zigzaggy type of method.

08:21.009 --> 08:23.580
Yeah, and also there are lots of

08:23.580 --> 08:25.050
other elements to it.

08:25.050 --> 08:28.970
There's things like, for instance, why,

08:28.970 --> 08:31.620
like why does it go down?

08:31.620 --> 08:35.100
Why does it not, like, go way over the line?

08:35.100 --> 08:38.730
So it could have jumped out of this, gone upwards instead

08:38.730 --> 08:40.470
of downwards and things like that.

08:40.470 --> 08:42.960
So there are parameters that you can tweak, and again

08:42.960 --> 08:45.570
we will mention where you can find out more on that.

08:45.570 --> 08:47.700
And plus we'll have this in the practical application.

08:47.700 --> 08:50.520
But in the simplest intuitive approach,

08:50.520 --> 08:51.750
this is what is happening.

08:51.750 --> 08:53.557
We are getting to the bottom

08:53.557 --> 08:56.700
by just understanding which way we need to go.

08:56.700 --> 08:59.550
Instead of brute-forcing through thousands and thousands

08:59.550 --> 09:03.000
and millions and billions and quadrillions of combinations,

09:03.000 --> 09:05.220
we can just simply every time have a look

09:05.220 --> 09:08.910
at which way it is sloping.

09:08.910 --> 09:10.950
So imagine you're standing

09:10.950 --> 09:13.800
on a hill, which way does it feel that it's going downwards?

09:13.800 --> 09:15.300
And whichever way it is going downwards,

09:15.300 --> 09:17.040
you just keep walking that way. Like,

09:17.040 --> 09:18.120
you take 50 steps that way

09:18.120 --> 09:19.590
and then you assess again: okay,

09:19.590 --> 09:21.480
which way is it going downwards? This way?

09:21.480 --> 09:24.660
Okay, now take 50 steps or less, take 40 steps that way

09:24.660 --> 09:27.987
so it gets less and less and less as you get closer.

09:27.987 --> 09:29.640
So here's an example

09:29.640 --> 09:32.700
of gradient descent applied in a two-dimensional space.

09:32.700 --> 09:35.507
So that was a one-dimensional example.

09:35.507 --> 09:38.160
Here we have a two-dimensional space

09:38.160 --> 09:40.290
for the gradient descent.

09:40.290 --> 09:41.520
As you can see, it's getting closer

09:41.520 --> 09:44.610
to the minimum and it's also called gradient descent

09:44.610 --> 09:46.020
because you're descending

09:46.020 --> 09:49.137
into the minimum of the cost function.

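NOTE
A rough sketch of the same idea with two weights. The bowl-shaped cost, its minimum at (1.0, -2.0), the starting point, and the learning rate are all assumptions made up for illustration.
The recorded path is the sequence of points that a plot like the one on screen would show.
```python
def cost(w1, w2):
    # Hypothetical bowl-shaped cost with its minimum at (1.0, -2.0).
    return (w1 - 1.0) ** 2 + 3.0 * (w2 + 2.0) ** 2
def gradient(w1, w2):
    """Partial derivatives of the cost with respect to w1 and w2."""
    return 2.0 * (w1 - 1.0), 6.0 * (w2 + 2.0)
def descend(w1, w2, learning_rate=0.1, steps=200):
    path = [(w1, w2)]
    for _ in range(steps):
        g1, g2 = gradient(w1, w2)
        w1, w2 = w1 - learning_rate * g1, w2 - learning_rate * g2
        path.append((w1, w2))
    return path
path = descend(4.0, 3.0)
final_w1, final_w2 = path[-1]
```
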
09:49.137 --> 09:50.550
And finally,

09:50.550 --> 09:53.400
here's the gradient descent applied in three dimensions.

09:53.400 --> 09:54.233
This is what it looks

09:54.233 --> 09:56.520
like if you project it onto two dimensions

09:56.520 --> 09:59.670
you can see it's zigzagging its way into the minimum.

09:59.670 --> 10:00.503
So there we go.

10:00.503 --> 10:01.860
That was gradient descent.

10:01.860 --> 10:02.820
In the next tutorial we'll talk

10:02.820 --> 10:05.970
about stochastic gradient descent, which will be a continuation

10:05.970 --> 10:08.770
of this tutorial and I look forward to seeing you there.

10:09.667 --> 10:11.303
Until next time, enjoy deep learning.