WEBVTT

00:01.110 --> 00:03.360
-: Hello and welcome back to the course on deep learning.

00:03.360 --> 00:07.230
Today we talk about stochastic gradient descent.

00:07.230 --> 00:10.200
Previously, we learned about gradient descent

00:10.200 --> 00:11.670
and we found out that,

00:11.670 --> 00:15.210
it is a very efficient method to solve

00:15.210 --> 00:16.890
our optimization problem where

00:16.890 --> 00:19.590
we are trying to minimize the cost function.

00:19.590 --> 00:24.590
It basically takes us from 10 to the power of 57 years

00:25.500 --> 00:28.852
to solving a problem within minutes or hours

00:28.852 --> 00:31.050
or within a day or so.

00:31.050 --> 00:33.090
And it really helps speed things up

00:33.090 --> 00:36.360
because we can see which way is downhill

00:36.360 --> 00:38.250
and we can just go in that direction

00:38.250 --> 00:41.610
and take steps and get to the minimum faster.

00:41.610 --> 00:45.450
But the thing with gradient descent

00:45.450 --> 00:47.940
is that this method requires

00:47.940 --> 00:51.120
for the cost function to be convex.

00:51.120 --> 00:52.020
And as you can see here,

00:52.020 --> 00:55.290
we've specifically chosen a convex cost function.

00:55.290 --> 00:59.490
Basically, convex means that the function looks similar

00:59.490 --> 01:01.590
to what we are seeing now.

01:01.590 --> 01:05.520
It just curves in one direction

01:05.520 --> 01:09.330
and that it, in essence, has one global minimum

01:09.330 --> 01:11.640
and that's the one that we're going to find.

01:11.640 --> 01:14.040
But, what if our function is not convex?

01:14.040 --> 01:16.410
What if our cost function is not convex?

01:16.410 --> 01:18.000
What if it looks something like this?

01:18.000 --> 01:19.860
Well, first of all, how could that happen?

01:19.860 --> 01:23.310
Well, that could happen because if we,

01:23.310 --> 01:24.990
first of all, choose a cost function

01:24.990 --> 01:29.790
which is not the squared difference between Y hat and Y

01:29.790 --> 01:33.870
or if we do choose the cost function, which is like that

01:33.870 --> 01:35.850
but then in a multidimensional space

01:35.850 --> 01:39.750
it can actually turn into something that is not convex.

01:39.750 --> 01:40.920
And so what would happen,

01:40.920 --> 01:42.690
in this case if we just tried to apply

01:42.690 --> 01:45.090
our normal gradient descent method,

01:45.090 --> 01:46.470
something like this could happen.

01:46.470 --> 01:49.020
We could find a local minimum

01:49.020 --> 01:51.240
of the cost function rather than the global one.

01:51.240 --> 01:54.750
So, this one was the best one and we found the wrong one.

01:54.750 --> 01:57.750
And therefore we don't have the correct weights.

01:57.750 --> 02:00.270
We don't have an optimized neural network.

02:00.270 --> 02:02.550
We have a subpar neural network.
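
NOTE
A minimal sketch of the local-minimum problem just described, not taken from the lecture slides.
The cost function w**4 - 3*w**2 + w, the two starting points, and the learning rate are assumptions,
chosen only because this curve has one shallow dip and one deeper dip.
```python
# Hypothetical non-convex cost with a local minimum (right) and a global one (left).
def cost(w):
    return w**4 - 3 * w**2 + w
#
def gradient(w):
    # Derivative of the cost above with respect to w.
    return 4 * w**3 - 6 * w + 1
#
def gradient_descent(w, lr=0.01, steps=500):
    """Plain gradient descent: always step downhill from the current w."""
    for _ in range(steps):
        w = w - lr * gradient(w)
    return w
#
w_right = gradient_descent(2.0)   # starts on the right-hand slope
w_left = gradient_descent(-2.0)   # starts on the left-hand slope
print(w_right, cost(w_right))
print(w_left, cost(w_left))
```
Under these assumptions the run that starts at 2.0 settles near w = 1.13, the shallow dip,
while the run that starts at -2.0 settles near w = -1.30, which has the lower cost.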

02:02.550 --> 02:04.650
And so, what do we do in this case?

02:04.650 --> 02:09.650
Well, the answer here is stochastic gradient descent.

02:10.050 --> 02:12.840
And it turns out that stochastic gradient descent doesn't

02:12.840 --> 02:15.360
require the cost function to be convex.

02:15.360 --> 02:18.124
So let's have a look at the two differences

02:18.124 --> 02:20.160
between the normal gradient descent that we talked about

02:20.160 --> 02:21.840
and the stochastic gradient descent.

02:21.840 --> 02:25.530
So normal gradient descent is when we take all of our rows,

02:25.530 --> 02:28.380
we plug them into our neural network, and once again,

02:28.380 --> 02:32.010
here we've got the neural network copied over several times

02:32.010 --> 02:33.720
but the rows are being plugged

02:33.720 --> 02:36.000
into that same neural network every time.

02:36.000 --> 02:37.200
So there's only one neural network.

02:37.200 --> 02:39.300
This is just for visualization purposes.

02:39.300 --> 02:40.710
And then once we've plugged them in,

02:40.710 --> 02:42.180
we've calculated our cost function

02:42.180 --> 02:43.410
based on the formula on the right

02:43.410 --> 02:45.510
and looking at the chart at the bottom.

02:45.510 --> 02:47.517
And then we adjust the weights then,

02:47.517 --> 02:49.740
and this is called the gradient descent method

02:49.740 --> 02:52.470
or it's also, the proper term

02:52.470 --> 02:54.480
is the batch gradient descent method.

02:54.480 --> 02:57.840
So, we take the whole batch from our sample,

02:57.840 --> 03:00.960
we apply it, and then we run that.

03:00.960 --> 03:02.610
The stochastic gradient descent method

03:02.610 --> 03:03.780
is a bit different.

03:03.780 --> 03:05.940
Here, we take the rows one by one.

03:05.940 --> 03:08.940
So, we take this row, we run our neural network

03:08.940 --> 03:12.000
and then we adjust the weights.

03:12.000 --> 03:13.560
Then we move on to the second row.

03:13.560 --> 03:16.560
We take the second row, we run our neural network,

03:16.560 --> 03:17.880
we look at the cost function,

03:17.880 --> 03:20.160
and then we adjust the weights again.

03:20.160 --> 03:22.680
Then we take another row, take row three,

03:22.680 --> 03:23.760
we run our neural network,

03:23.760 --> 03:25.440
we look at the cost function, we adjust the weights.

03:25.440 --> 03:27.933
So basically we are looking at,

03:28.890 --> 03:31.260
we're adjusting the weights after every single row,

03:31.260 --> 03:33.060
rather than doing everything together

03:33.060 --> 03:34.860
and then adjusting the weights.
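
NOTE
The two update schedules described above can be sketched as follows. This is an illustration only:
the one-weight model y_hat = w * x, the cost 0.5 * (y_hat - y) ** 2, the made-up rows,
and the names batch_epoch and sgd_epoch are all assumptions, not the course's code.
```python
# Hypothetical toy model: prediction y_hat = w * x, cost per row 0.5 * (y_hat - y) ** 2.
def batch_epoch(w, rows, lr):
    """Batch gradient descent: look at all rows, then adjust the weight once."""
    grad = sum((w * x - y) * x for x, y in rows) / len(rows)
    return w - lr * grad
#
def sgd_epoch(w, rows, lr):
    """Stochastic gradient descent: adjust the weight after every single row."""
    for x, y in rows:
        grad = (w * x - y) * x
        w = w - lr * grad
    return w
#
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up data following y = 2 * x
w_batch = 0.0
w_sgd = 0.0
for _ in range(50):
    w_batch = batch_epoch(w_batch, rows, lr=0.05)
    w_sgd = sgd_epoch(w_sgd, rows, lr=0.05)
print(w_batch, w_sgd)
```
On this convex toy problem both schedules approach w = 2; the difference is only in how often the weight changes.
A real stochastic run would usually also shuffle the rows, which is left out here to keep the sketch short.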

03:34.860 --> 03:36.110
Two different approaches,

03:37.014 --> 03:39.720
and now we're going to just compare the two side by side.

03:39.720 --> 03:40.590
So, here they are.

03:40.590 --> 03:42.900
This is how to visually remember them.

03:42.900 --> 03:44.700
So, you've got the batch gradient descent

03:44.700 --> 03:49.110
where you're adjusting the weights after you've run them,

03:49.110 --> 03:52.980
after you've run all of the rows in your neural network.

03:52.980 --> 03:54.990
And then, basically you adjust the weights

03:54.990 --> 03:55.950
and you run the whole thing again,

03:55.950 --> 03:57.570
iteration, iteration, iteration.

03:57.570 --> 03:59.490
In the stochastic gradient descent method,

03:59.490 --> 04:01.143
you run one row at a time,

04:03.116 --> 04:03.949
and you adjust the weights

04:03.949 --> 04:05.040
you adjust the weights, you adjust the weights,

04:05.040 --> 04:07.770
and then you do everything again and again.

04:07.770 --> 04:11.220
And that is called the stochastic gradient descent method.

04:11.220 --> 04:12.873
The main two differences are,

04:13.986 --> 04:15.840
that the stochastic gradient descent method

04:15.840 --> 04:19.290
helps you avoid the problem

04:19.290 --> 04:22.950
where you find those local extrema

04:22.950 --> 04:27.950
or local minimums rather than the overall global minimum.

04:29.010 --> 04:31.170
And the reason for that, in simple terms,

04:31.170 --> 04:35.130
is that SGD or the stochastic gradient descent method,

04:35.130 --> 04:38.220
has much higher fluctuations because it can afford them.

04:38.220 --> 04:41.910
It's doing one iteration or one row at a time

04:41.910 --> 04:43.440
and therefore the fluctuations are much higher

04:43.440 --> 04:45.810
and it is much more likely to find

04:45.810 --> 04:49.410
the global minimum rather than just the local minimum.

04:49.410 --> 04:51.983
And the other thing about the stochastic gradient

04:51.983 --> 04:54.210
descent method compared to the batch gradient

04:54.210 --> 04:56.520
is that it's faster.

04:56.520 --> 04:58.680
Like, the first impression that you might have is,

04:58.680 --> 05:00.840
because it's doing every single row one at a time,

05:00.840 --> 05:01.740
it is slower.

05:01.740 --> 05:03.720
But actually, in fact, it is faster

05:03.720 --> 05:08.720
because it doesn't have to load up all the data into memory,

05:09.090 --> 05:11.640
and run, and wait until all of those rows

05:11.640 --> 05:12.660
are run all together.

05:12.660 --> 05:14.220
It can just run them one by one,

05:14.220 --> 05:15.540
so it's a much lighter algorithm.

05:15.540 --> 05:16.830
It's much faster in that sense.

05:16.830 --> 05:21.240
So, in those senses,

05:21.240 --> 05:22.890
it has more advantages

05:22.890 --> 05:25.410
over the batch gradient descent method.

05:25.410 --> 05:29.160
The main advantage or the main pro

05:29.160 --> 05:30.960
for the batch gradient descent method is that

05:30.960 --> 05:34.050
it is a deterministic algorithm rather than

05:34.050 --> 05:36.960
stochastic gradient descent, being a stochastic algorithm,

05:36.960 --> 05:38.370
meaning it's random.

05:38.370 --> 05:40.800
And with the batch gradient descent method,

05:40.800 --> 05:44.310
as long as you have the same starting weights

05:44.310 --> 05:45.480
for your neural network,

05:45.480 --> 05:47.940
every time you run the batch gradient descent method,

05:47.940 --> 05:51.420
you will get the same iterations,

05:51.420 --> 05:55.985
the same results for the way your weights are being updated.

05:55.985 --> 05:58.320
For the stochastic gradient descent method,

05:58.320 --> 06:01.170
you won't get that because it is a stochastic method.

06:01.170 --> 06:02.550
You are picking your rows,

06:02.550 --> 06:06.900
possibly at random and you are updating your neural network

06:06.900 --> 06:07.950
in a stochastic manner.

06:07.950 --> 06:10.680
And therefore you are just going to,

06:10.680 --> 06:11.730
every single time you run

06:11.730 --> 06:13.050
the stochastic gradient descent method,

06:13.050 --> 06:15.270
even if you have the same weights at the start,

06:15.270 --> 06:17.803
you're going to have a different process,

06:17.803 --> 06:20.760
different iterations to get there.
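
NOTE
A small sketch of the deterministic versus stochastic point made above. The toy model y_hat = w * x,
the made-up rows, the seeds, and the helper names batch_path and sgd_path are assumptions for illustration.
Shuffling the rows is one common source of randomness; the lecture only says rows are picked "possibly at random".
```python
import random
#
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # made-up data, y = 2 * x
#
def batch_path(w, rows, lr, steps):
    """Weights visited by batch gradient descent; no randomness is involved."""
    path = []
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in rows) / len(rows)
        w = w - lr * grad
        path.append(w)
    return path
#
def sgd_path(w, rows, lr, rng):
    """Weights visited by one stochastic pass over randomly ordered rows."""
    order = rows[:]
    rng.shuffle(order)  # the row order is where the randomness comes from
    path = []
    for x, y in order:
        w = w - lr * (w * x - y) * x
        path.append(w)
    return path
#
same = batch_path(0.0, rows, 0.01, 5) == batch_path(0.0, rows, 0.01, 5)
run_a = sgd_path(0.0, rows, 0.01, random.Random(1))
run_b = sgd_path(0.0, rows, 0.01, random.Random(2))
print(same)
print(run_a)
print(run_b)
```
The two batch runs match step for step. The two stochastic runs start from the same weight but will usually
visit different weights, because each generator tends to produce a different row order.
Fixing the seed is the usual way to make a stochastic run repeatable.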

06:20.760 --> 06:25.760
So that's in a nutshell what stochastic gradient descent is.

06:26.130 --> 06:28.110
Also, there's a method in between the two

06:28.110 --> 06:30.570
called the mini batch gradient descent method,

06:30.570 --> 06:34.230
where you combine the two and you basically run,

06:34.230 --> 06:37.650
rather than running a whole batch or running one at a time,

06:37.650 --> 06:39.210
you run batches of rows,

06:39.210 --> 06:42.780
maybe 5, 10, 100, however many you decide to set,

06:42.780 --> 06:45.240
you run that number of rows at a time,

06:45.240 --> 06:46.197
then you update your weights,

06:46.197 --> 06:47.880
and you update your weights, and so on.

06:47.880 --> 06:50.550
And that's called the mini batch gradient descent method.
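
NOTE
The mini batch idea can be sketched like this, again under assumptions: the toy model y_hat = w * x,
the made-up rows, the batch size of 5, and the names mini_batches and mini_batch_epoch are illustrative only.
```python
def mini_batches(rows, batch_size):
    """Split rows into consecutive chunks of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
#
def mini_batch_epoch(w, rows, lr, batch_size):
    """Adjust the weight once per mini batch instead of per row or per dataset."""
    for batch in mini_batches(rows, batch_size):
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w = w - lr * grad
    return w
#
rows = [(float(x), 2.0 * x) for x in range(1, 11)]  # made-up data, y = 2 * x
w = 0.0
for _ in range(100):
    w = mini_batch_epoch(w, rows, lr=0.01, batch_size=5)
print(w)
```
In this sketch, batch_size=1 gives one update per row, as in the stochastic method,
and batch_size=len(rows) gives one update per pass, as in the batch method.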

06:50.550 --> 06:53.010
If you'd like to learn more about gradient descent,

06:53.010 --> 06:56.640
there's a great article which you can have a look at.

06:56.640 --> 07:00.463
It's called, "A Neural Network in 13 Lines of Python

07:00.463 --> 07:03.363
(Part 2 - Gradient Descent)" by Andrew Trask.

07:04.530 --> 07:08.605
And the link's below; it's on GitHub, 2015 article.

07:08.605 --> 07:12.930
Very well-written, in very simple terms.

07:12.930 --> 07:16.620
It's got some interesting philosophical,

07:16.620 --> 07:18.700
or just interesting thoughts on

07:19.590 --> 07:22.470
how to apply gradient descent, whether,

07:22.470 --> 07:24.180
you know, the advantages and disadvantages,

07:24.180 --> 07:28.170
and how to do things in certain situations.

07:28.170 --> 07:31.380
So, he's got some very cool tips, tricks, and hacks.

07:31.380 --> 07:33.780
Very easy read, so definitely check that out.

07:33.780 --> 07:37.020
And another one, a bit heavier read,

07:37.020 --> 07:39.390
for those of you who are into mathematics,

07:39.390 --> 07:41.550
who want to get to the bottom of the mathematics,

07:41.550 --> 07:45.300
why gradient descent in that specific,

07:45.300 --> 07:47.310
what are the formulas that are driving gradient descent?

07:47.310 --> 07:48.210
How is it calculated?

07:48.210 --> 07:49.260
And so on.

07:49.260 --> 07:51.630
Check out the article or actually the book.

07:51.630 --> 07:52.897
It's a free online book called,

07:52.897 --> 07:56.340
"Neural Networks and Deep Learning" by Michael Nielsen,

07:56.340 --> 07:57.173
2015 book.

07:57.173 --> 07:59.610
It's just basically, it's all online.

07:59.610 --> 08:02.340
You can go ahead and check it out there.

08:02.340 --> 08:05.310
And there's, again, a very soft introduction

08:05.310 --> 08:06.143
to the mathematics.

08:06.143 --> 08:11.143
But the mathematics are pretty heavy as you go along,

08:11.580 --> 08:13.105
as you read through the article.

08:13.105 --> 08:17.370
But at the same time, it gets you into that move-

08:17.370 --> 08:20.160
I think even it has like a warm up chapter

08:20.160 --> 08:21.810
where you first warm up with the math

08:21.810 --> 08:22.680
and then you jump into them.

08:22.680 --> 08:24.000
So, interested in math?

08:24.000 --> 08:26.490
Then, this is the article to go to.

08:26.490 --> 08:27.323
And, there it goes.

08:27.323 --> 08:30.270
So, that's in a nutshell the difference

08:30.270 --> 08:33.070
between gradient descent and stochastic gradient descent

08:35.188 --> 08:36.420
and how the two work.

08:36.420 --> 08:39.870
And on that note, we're going to wrap up today's tutorial.

08:39.870 --> 08:42.000
I look forward to seeing you on the next one.

08:42.000 --> 08:44.253
And until then, enjoy deep learning.
