WEBVTT

00:00.330 --> 00:03.060
Hello and welcome back to the course on deep learning.

00:03.060 --> 00:06.240
This is an additional tutorial to talk

00:06.240 --> 00:08.640
about the Softmax and Cross Entropy functions.

00:08.640 --> 00:11.760
It is not a hundred percent necessary in order

00:11.760 --> 00:15.090
for you to follow all of the parts that we've been

00:15.090 --> 00:19.050
through in the main part of this section where we're talking

00:19.050 --> 00:21.300
about the convolutional neural networks.

00:21.300 --> 00:24.300
But at the same time, I thought it would be a good addition

00:24.300 --> 00:26.610
to your bag of knowledge and skill set.

00:26.610 --> 00:30.810
So let's go ahead and dig into these functions.

00:30.810 --> 00:33.240
So to start off with, what we have here

00:33.240 --> 00:37.140
is the convolutional neural network that we built

00:37.140 --> 00:40.200
in the main part of the section, and then

00:40.200 --> 00:43.290
at the end it pops out some probabilities

00:43.290 --> 00:48.030
0.95 for a dog and 0.05 for a cat,

00:48.030 --> 00:50.910
given that photo on the left as an input.

00:50.910 --> 00:52.620
This is after the training has been conducted.

00:52.620 --> 00:54.390
This is actually, it's running

00:54.390 --> 00:57.330
and it's classifying a certain image.

00:57.330 --> 00:59.820
And so the question here is how come these two values

00:59.820 --> 01:00.900
add up to one?

01:00.900 --> 01:03.600
Because as far as we know from everything that we've learned

01:03.600 --> 01:07.230
about artificial neural networks, there is nothing to say

01:07.230 --> 01:10.590
that these two final neurons are connected

01:10.590 --> 01:11.700
between each other.

01:11.700 --> 01:14.820
So how would they know what the value of-

01:14.820 --> 01:15.990
how would each one of them know what

01:15.990 --> 01:17.370
the value of the other one is

01:17.370 --> 01:20.310
and how would they know to add their values up to one?

01:20.310 --> 01:22.260
Well, the answer is they wouldn't

01:22.260 --> 01:26.280
in the classic version of an artificial neural network.

01:26.280 --> 01:28.650
And the only way that they do is because we

01:28.650 --> 01:31.890
introduce a special function called the Softmax function

01:31.890 --> 01:33.930
in order to help us out of this situation.

01:33.930 --> 01:36.960
So normally what would happen is the dog

01:36.960 --> 01:40.863
and the cat neurons would have any kind of real values.

01:41.849 --> 01:45.180
They don't have to add up to one.

01:45.180 --> 01:48.450
But then we would apply the Softmax function

01:48.450 --> 01:50.910
which is written up over there at the top

01:50.910 --> 01:53.400
and that would bring these values to be

01:53.400 --> 01:56.370
between zero and one and it would make them add up to one.

01:56.370 --> 02:00.540
And to quote Wikipedia, the Softmax function

02:00.540 --> 02:03.270
or the normalized exponential function is a generalization

02:03.270 --> 02:05.125
of the logistic function that

02:05.125 --> 02:10.125
"squashes" a K dimensional vector of arbitrary real values

02:10.290 --> 02:12.630
to a K dimensional vector of real values

02:12.630 --> 02:15.330
in the range of zero to one that all add up to one.

02:15.330 --> 02:17.640
So basically, it does exactly what we want.

02:17.640 --> 02:20.220
It brings these values to be between zero and one

02:20.220 --> 02:22.920
and makes sure that they add up to one.

02:22.920 --> 02:25.170
And the way it works is that

02:25.170 --> 02:26.730
the way that this is possible is that

02:26.730 --> 02:27.780
at the bottom over here

02:27.780 --> 02:29.970
you can see that there's a summation.

02:29.970 --> 02:33.690
So it takes e and raises it

02:33.690 --> 02:36.570
to the power of Z and adds it up.

02:36.570 --> 02:38.820
So Z one, Z two across all of your classes

02:38.820 --> 02:39.960
all of these values.

02:39.960 --> 02:40.793
And so there

02:40.793 --> 02:44.370
that's your normalization happening right there.

02:44.370 --> 02:47.400
So that's how the Softmax function works.
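
NOTE
A minimal sketch of the Softmax function as described above, in plain Python.
The raw scores for the dog and cat neurons are hypothetical values chosen for
illustration, and subtracting the maximum score is a common numerical-stability
trick that is not mentioned in the lecture; it does not change the result.
```python
import math
def softmax(z):
    """Map a list of real-valued scores to probabilities that sum to one."""
    largest = max(z)
    # e to the power of each (shifted) score; the shift only prevents overflow
    exps = [math.exp(v - largest) for v in z]
    # the summation in the denominator is what normalizes the outputs
    total = sum(exps)
    return [e / total for e in exps]
# Hypothetical raw outputs of the dog and cat neurons (not from the lecture)
scores = [2.0, -1.0]
probs = softmax(scores)
print(probs, sum(probs))
```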

02:47.400 --> 02:50.850
And it makes sense to introduce the Softmax function

02:50.850 --> 02:54.960
into convolutional neural networks because how

02:54.960 --> 02:59.490
strange would it be if you had two possible classes, a dog

02:59.490 --> 03:04.490
and a cat and for the dog class you had probability of 80%

03:04.701 --> 03:08.670
and for the cat class you had a probability of 45%, right?

03:08.670 --> 03:11.400
It just doesn't make sense like that.

03:11.400 --> 03:13.260
And therefore it's much better when

03:13.260 --> 03:14.790
you introduce the Softmax function

03:14.790 --> 03:17.220
and that's what you'll find happening most of the time

03:17.220 --> 03:19.740
in convolutional neural networks.

03:19.740 --> 03:24.030
Now the other thing is that the Softmax function comes hand

03:24.030 --> 03:27.540
in hand with something called the Cross Entropy function

03:27.540 --> 03:29.040
and it's a very handy thing for us.

03:29.040 --> 03:30.630
So let's first look at the formula.

03:30.630 --> 03:33.090
This is what the Cross Entropy function looks like.

03:33.090 --> 03:37.050
We're actually going to be using a different calculation.

03:37.050 --> 03:38.760
We're gonna be using this representation

03:38.760 --> 03:40.680
of the Cross Entropy, but the result's basically the same.

03:40.680 --> 03:42.540
This is just easier to calculate.
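
NOTE
The slide itself is not part of this transcript. Assuming the representation
referred to here is the standard definition of cross entropy, with p as the
label distribution and q as the predicted distribution (the same roles given
to P and Q later in this tutorial), it can be written as:
```latex
H(p, q) = -\sum_{x} p(x) \, \log q(x)
```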

03:42.540 --> 03:46.549
And I know this might sound very unrelated

03:46.549 --> 03:49.080
to anything right now, just formulas on your screen

03:49.080 --> 03:52.140
but there'll be some additional recommended reading

03:52.140 --> 03:53.070
at the end of this section.

03:53.070 --> 03:56.280
So don't worry if you're not picking up on the math

03:56.280 --> 03:58.350
like if we haven't explained the math right now.

03:58.350 --> 04:01.770
But the point here is: what is the Cross Entropy?

04:01.770 --> 04:03.660
Well, a Cross Entropy function

04:03.660 --> 04:06.780
remember how we previously in artificial neural networks

04:06.780 --> 04:11.780
we had a function called the mean squared error function

04:12.540 --> 04:14.880
which we used as the cost function

04:14.880 --> 04:17.790
for assessing our network performance.

04:17.790 --> 04:21.030
And our goal was to minimize the MSE

04:21.030 --> 04:23.910
in order to optimize our network performance.

04:23.910 --> 04:25.980
Well, that was our cost function then.

04:25.980 --> 04:30.117
Here, in convolutional neural networks

04:30.117 --> 04:31.830
you can still use MSE

04:31.830 --> 04:35.490
but a better option in convolutional neural networks

04:35.490 --> 04:37.920
after you apply the Softmax function turns

04:37.920 --> 04:39.810
out to be the Cross Entropy function.

04:39.810 --> 04:42.810
And in convolutional neural networks

04:42.810 --> 04:44.550
when you apply Cross Entropy function

04:44.550 --> 04:46.620
it's not called the cost function anymore

04:46.620 --> 04:48.270
it's called the loss function.

04:48.270 --> 04:49.440
And they're very similar

04:49.440 --> 04:52.260
and there are just little terminological differences

04:52.260 --> 04:55.530
and they are a little bit different in what they mean.

04:55.530 --> 04:59.220
But for our purposes, it's pretty much the same thing.

04:59.220 --> 05:03.990
And what happens is the loss function is again

05:03.990 --> 05:05.910
something that we want to minimize

05:05.910 --> 05:09.660
in order to maximize the performance of our network.

05:09.660 --> 05:12.649
So let's have a look at a quick example

05:12.649 --> 05:15.270
of how this function can be applied.

05:15.270 --> 05:19.113
So let's say we've put an image of a dog into our network.

05:20.977 --> 05:23.220
The predicted value for dog is 0.9

05:23.220 --> 05:24.510
and this is during the training.

05:24.510 --> 05:27.300
So we know the label, that it is a dog.

05:27.300 --> 05:29.430
So the predicted value is 0.9

05:29.430 --> 05:32.340
the predicted value for cat is 0.1.

05:32.340 --> 05:33.720
Then here we have the label.

05:33.720 --> 05:35.850
So we know it's a dog because this is training

05:35.850 --> 05:37.803
and one for dog, zero for cat.

05:38.693 --> 05:42.400
And so in this case, you need to use

05:43.320 --> 05:44.460
you need to plug these numbers

05:44.460 --> 05:47.790
into your formula for the Cross Entropy.

05:47.790 --> 05:51.150
So how you would do it is the values

05:51.150 --> 05:53.400
on the left go into the variable Q

05:53.400 --> 05:56.790
the one that is under the logarithm on the right side

05:56.790 --> 05:59.430
and the values from the right would go into P.

05:59.430 --> 06:02.160
And so it's important to remember which one goes where

06:02.160 --> 06:04.110
because if you get them wrong

06:04.110 --> 06:05.733
you don't wanna be taking a logarithm

06:05.733 --> 06:09.600
from a zero value or a logarithm from a one.

06:09.600 --> 06:12.510
So you just want to plug them in, make sure you plug them

06:12.510 --> 06:17.040
into the correct places, and then you basically add that up.

06:17.040 --> 06:19.470
So that's how the Cross Entropy works.
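
NOTE
A small sketch of plugging the dog example into the Cross Entropy formula:
the label goes into p and the prediction goes into q, the one under the
logarithm. The natural logarithm is an assumption here; it is the base that
reproduces the figures quoted later in this tutorial.
```python
import math
def cross_entropy(p, q):
    """Return minus the sum over classes of p * log(q)."""
    # Classes with p == 0 contribute nothing, so they are skipped;
    # this also avoids taking the logarithm of a prediction we do not need
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
label = [1.0, 0.0]      # dog, cat (known, because this is training)
predicted = [0.9, 0.1]  # dog, cat (output of the network after Softmax)
loss = cross_entropy(label, predicted)
print(round(loss, 3))  # 0.105
```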

06:19.470 --> 06:20.430
And we'll look at a-

06:20.430 --> 06:22.350
Actually right now we're just going to look at a

06:22.350 --> 06:24.570
specific step by step example

06:24.570 --> 06:26.730
of applying this function in real life.

06:26.730 --> 06:30.300
And it'll kind of make more sense what Cross Entropy is.

06:30.300 --> 06:31.988
And it'll be less...

06:31.988 --> 06:35.790
My goal in this tutorial is to make you more comfortable

06:35.790 --> 06:39.370
with Cross Entropy because it can sound very convoluted

06:40.403 --> 06:42.033
and no pun intended.

06:42.870 --> 06:46.080
It can, like convolutional neural networks

06:46.080 --> 06:49.560
it can sound very complex, right? Scary.

06:49.560 --> 06:51.630
But it's not, that's the point.

06:51.630 --> 06:52.710
So let's go ahead and apply it

06:52.710 --> 06:54.090
just so we know that it's not scary.

06:54.090 --> 06:56.370
So here's neural net

06:56.370 --> 06:59.340
and also this will explain why we're doing this

06:59.340 --> 07:01.770
why we are looking at different cost functions.

07:01.770 --> 07:03.720
So neural network one, neural network two

07:03.720 --> 07:05.730
let's say we have two neural networks

07:05.730 --> 07:07.430
and then we pass an image of a dog

07:08.430 --> 07:12.150
and we know that this is a dog and not a cat.

07:12.150 --> 07:13.350
And then we have another image

07:13.350 --> 07:17.910
of a cat this time, an animal, and it's a cat, not a dog.

07:17.910 --> 07:20.790
And here we have a weird looking animal

07:20.790 --> 07:24.300
which is in fact a dog, not a cat if you look very closely.

07:24.300 --> 07:27.540
So we wanna see what our neural networks will predict.

07:27.540 --> 07:29.370
In the first case, neural network one

07:29.370 --> 07:33.330
90% dog, 10% cat, correct.

07:33.330 --> 07:36.720
Neural network number two, 60% dog, 40% cat.

07:36.720 --> 07:39.123
Still correct, worse but correct.

07:40.260 --> 07:43.770
Second option, first neural network

07:43.770 --> 07:47.310
10% dog, 90% cat, correct.

07:47.310 --> 07:51.540
Neural network number two, 30% dog, 70% cat.

07:51.540 --> 07:53.520
Worse but still correct.

07:53.520 --> 07:55.450
And then finally, neural network one

07:56.440 --> 08:01.440
on image three: 40% dog, 60% cat. Incorrect.

08:01.890 --> 08:06.390
Neural network number two, 10% dog, 90% cat.

08:06.390 --> 08:08.250
Incorrect and worse.

08:08.250 --> 08:10.740
So the key here is that even though

08:10.740 --> 08:13.080
both networks got it wrong in the last one

08:13.080 --> 08:17.070
throughout all three images, neural network one

08:17.070 --> 08:18.870
was outperforming neural network two.

08:18.870 --> 08:23.310
So even in the last case, where it was wrong,

08:23.310 --> 08:26.423
it gave dog like a 40% chance as opposed to

08:26.423 --> 08:29.160
neural network two only gave dog a 10% chance.

08:29.160 --> 08:31.380
So neural network one is outperforming

08:31.380 --> 08:35.580
across the board when compared to neural network two.

08:35.580 --> 08:37.230
And so now we're going to look

08:37.230 --> 08:41.040
at the functions that can measure performance

08:41.040 --> 08:43.020
that we've kind of talked about already.

08:43.020 --> 08:44.820
So let's put these into a table.

08:44.820 --> 08:48.330
So there's neural network one, you have the row number

08:48.330 --> 08:49.500
so that's the image number.

08:49.500 --> 08:52.440
And then for image one, you have what it predicted

08:52.440 --> 08:54.090
90% dog, 10% cat.

08:54.090 --> 08:55.440
So those are the hat variables.

08:55.440 --> 08:57.360
And then you have the actual values.

08:57.360 --> 09:00.540
So dog, correct, cat incorrect.

09:00.540 --> 09:03.420
Same thing for image number two

09:03.420 --> 09:05.220
and same thing for image number three.

09:05.220 --> 09:07.710
And same for neural network number two.

09:07.710 --> 09:11.070
So dog, 60%, cat 40% in the first image

09:11.070 --> 09:12.150
that's what it predicted.

09:12.150 --> 09:15.180
Correct answer was dog, not cat and so on.

09:15.180 --> 09:18.961
And so now let's see what errors we can actually get.

09:18.961 --> 09:22.230
So what errors we can calculate to estimate the performance

09:22.230 --> 09:24.930
and monitor the performance of our networks.

09:24.930 --> 09:28.620
So one type of error is called the classification error.

09:28.620 --> 09:32.700
And that is basically just asking

09:32.700 --> 09:34.020
did you get it right or not?

09:34.020 --> 09:36.090
Regardless of the probabilities, it's just

09:36.090 --> 09:37.950
did you get it right or did you not get it right?

09:37.950 --> 09:42.950
So in both cases, for both neural networks, each of them

09:43.410 --> 09:46.350
they got one, so this is how many they got wrong.

09:46.350 --> 09:48.450
So they got one out of three wrong.

09:48.450 --> 09:52.350
So 33% error rate for neural network one

09:52.350 --> 09:54.807
and 33% error rate for neural network two.

09:54.807 --> 09:57.030
And so basically from this standpoint

09:57.030 --> 09:59.160
both neural networks perform at the same level

09:59.160 --> 10:00.210
but we know that's not true.

10:00.210 --> 10:01.980
We know that neural network one

10:01.980 --> 10:04.203
is outperforming neural network two.

10:05.100 --> 10:08.220
That's why a classification error is not a good measure

10:08.220 --> 10:11.013
especially for the purposes of back propagation.

10:11.850 --> 10:13.800
Mean squared error is different,

10:13.800 --> 10:16.920
and by the way, I did these calculations in Excel.

10:16.920 --> 10:18.450
I just didn't want to bore you with them

10:18.450 --> 10:20.010
but you can totally just sit down

10:20.010 --> 10:22.020
and do them on a paper or in Excel.

10:22.020 --> 10:23.700
These are very straightforward calculations.

10:23.700 --> 10:27.784
Just basically take the sum of squared errors

10:27.784 --> 10:30.040
and then just take the average across your

10:31.740 --> 10:32.940
across your observations.

10:32.940 --> 10:35.040
And that's pretty much it.

10:35.040 --> 10:39.090
So for neural network one, you get 25%.

10:39.090 --> 10:43.350
For neural network two, you get 71% error rate.

10:43.350 --> 10:45.930
So as you can see, this one is more accurate.

10:45.930 --> 10:47.670
It's telling us that neural network one

10:47.670 --> 10:50.270
has a much lower error rate than neural network two.

10:51.150 --> 10:53.850
And then Cross Entropy, again we've seen the formula.

10:53.850 --> 10:54.960
You can also calculate this.

10:54.960 --> 10:56.940
This is actually even easier to calculate than

10:56.940 --> 11:00.870
the mean squared error. Cross Entropy gives you 0.38

11:00.870 --> 11:05.490
for neural network one and 1.06 for neural network two.
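
NOTE
The three error measures above can be reproduced with a short sketch using
the predictions from the table. Two assumptions: the squared errors are summed
per image and then averaged over the three images, and the Cross Entropy uses
the natural logarithm. Under those assumptions the numbers match the ones
quoted in this tutorial.
```python
import math
labels = [[1, 0], [0, 1], [1, 0]]  # [dog, cat] for images 1 to 3
nn1 = [[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]]
nn2 = [[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
def classification_error(preds, labels):
    # fraction of images where the most probable class is not the labelled one
    wrong = sum(p.index(max(p)) != y.index(max(y)) for p, y in zip(preds, labels))
    return wrong / len(labels)
def mean_squared_error(preds, labels):
    # sum of squared errors per image, averaged over the images
    per_image = [sum((pi - yi) ** 2 for pi, yi in zip(p, y)) for p, y in zip(preds, labels)]
    return sum(per_image) / len(labels)
def mean_cross_entropy(preds, labels):
    # minus the log of the probability given to the correct class, averaged over the images
    per_image = [-sum(yi * math.log(pi) for pi, yi in zip(p, y) if yi > 0) for p, y in zip(preds, labels)]
    return sum(per_image) / len(labels)
results = {}
for name, preds in [("NN1", nn1), ("NN2", nn2)]:
    results[name] = (
        round(classification_error(preds, labels), 2),
        round(mean_squared_error(preds, labels), 2),
        round(mean_cross_entropy(preds, labels), 2),
    )
    print(name, results[name])
# NN1 (0.33, 0.25, 0.38)
# NN2 (0.33, 0.71, 1.06)
```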

11:05.490 --> 11:09.570
So you can see the results are a bit different when you look

11:09.570 --> 11:12.510
at them like that, when you look at, you know

11:12.510 --> 11:14.703
the mean squared error and Cross Entropy.

11:16.230 --> 11:21.070
The question of why would you use Cross Entropy over

11:23.250 --> 11:26.130
mean squared error isn't just about the kind

11:26.130 --> 11:27.960
of numbers that they spit out.

11:27.960 --> 11:30.180
These calculations were just to show you

11:30.180 --> 11:32.520
that this is all, it's all doable.

11:32.520 --> 11:34.680
You can just do it on a paper.

11:34.680 --> 11:37.890
These are not very intense mathematics.

11:37.890 --> 11:41.190
These are pretty simple, straightforward things.

11:41.190 --> 11:45.000
But the question of why would you use Cross Entropy

11:45.000 --> 11:46.290
over mean squared error?

11:46.290 --> 11:48.240
It's a very, very good question to ask.

11:48.240 --> 11:49.340
I'm glad you asked it.

11:50.460 --> 11:52.330
The answer to that is like

11:53.245 --> 11:55.300
there's several advantages

11:57.271 --> 11:58.110
of Cross Entropy

11:58.110 --> 12:01.410
over mean squared error which are not obvious.

12:01.410 --> 12:06.210
And so I'll mention a couple, then I'll let you know

12:06.210 --> 12:08.117
where you can find out more.

12:08.117 --> 12:12.510
So one of them is that if for instance

12:12.510 --> 12:17.040
at the very start of your back propagation

12:17.040 --> 12:22.040
your output value is very, very, very, very tiny. Very tiny.

12:22.320 --> 12:25.710
So it's much smaller than the actual value that you want.

12:25.710 --> 12:28.500
Then at the very start, the gradient

12:28.500 --> 12:32.220
in your gradient descent will be very, very low

12:32.220 --> 12:33.810
and it won't be enough.

12:33.810 --> 12:36.960
It'll be very hard for the neural network

12:36.960 --> 12:39.480
to actually start doing something

12:39.480 --> 12:42.590
and start moving around and start adjusting those weights

12:42.590 --> 12:45.120
and start actually moving in the right direction.

12:45.120 --> 12:47.820
Whereas when you use something like the Cross Entropy

12:47.820 --> 12:49.670
because it's got that logarithm in it

12:50.837 --> 12:55.320
it actually helps the network assess even a small error

12:55.320 --> 12:57.300
like that and do something about it.

12:57.300 --> 12:58.500
Here's how to think about it.

12:58.500 --> 12:59.440
So let's say

13:00.480 --> 13:03.270
again, this is a very intuitive approach.

13:03.270 --> 13:06.030
There's gonna be a link to the mathematics

13:06.030 --> 13:07.980
and you can derive these things

13:07.980 --> 13:09.480
through the mathematics in more detail

13:09.480 --> 13:11.220
but a very intuitive approach.

13:11.220 --> 13:12.053
Let's say

13:14.760 --> 13:17.700
your outcome that you want is one

13:17.700 --> 13:22.700
and right now you are at one one-millionth of one, right?

13:23.070 --> 13:24.453
0.000001.

13:25.383 --> 13:29.910
And then you improve next time, you improve your outcome

13:29.910 --> 13:32.673
from one-millionth to one-thousandth.

13:33.919 --> 13:37.290
And in terms of if you calculate the squared error

13:37.290 --> 13:40.260
you're just subtracting one from the other, or basically

13:40.260 --> 13:42.210
in each case you're calculating the square error.

13:42.210 --> 13:43.890
And you'll see that the squared errors

13:43.890 --> 13:46.740
when you compare one case versus the other

13:46.740 --> 13:48.150
it didn't change that much.

13:48.150 --> 13:50.520
You didn't improve your network that much when

13:50.520 --> 13:52.110
you're looking at the mean squared error.

13:52.110 --> 13:55.380
But if you're looking at the Cross Entropy

13:55.380 --> 13:57.960
because you're taking a logarithm and then

13:57.960 --> 14:01.380
you are comparing the two, dividing one by the other.

14:01.380 --> 14:04.650
You will see that you have actually improved

14:04.650 --> 14:06.150
your network significantly.
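
NOTE
A rough numeric illustration of the jump from one-millionth to one-thousandth
when the target is one. This is only a sketch of the intuition, assuming a
single output and the natural logarithm; it is not the full gradient argument.
```python
import math
target = 1.0
before, after = 1e-6, 1e-3  # network output before and after the improvement
# Squared error barely moves: both values are very close to 1
se_before = (target - before) ** 2
se_after = (target - after) ** 2
se_drop = se_before - se_after
# Cross Entropy for the correct class is minus the log of the output
ce_before = -math.log(before)
ce_after = -math.log(after)
ce_drop = ce_before - ce_after
print(se_drop)  # roughly 0.002
print(ce_drop)  # roughly 6.9, the log of the 1000-fold improvement
```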

14:06.150 --> 14:10.590
So that jump from one-millionth to one-thousandth

14:10.590 --> 14:12.810
in mean squared error terms will be very low.

14:12.810 --> 14:14.790
It will be insignificant and

14:14.790 --> 14:19.440
it won't guide your gradient descent process

14:19.440 --> 14:22.110
or your back propagation in the right direction.

14:22.110 --> 14:24.240
It will guide it in the right direction

14:24.240 --> 14:26.760
but it'll be like a very slow guidance.

14:26.760 --> 14:29.700
It won't have enough power.

14:29.700 --> 14:32.280
Whereas if you do it through Cross Entropy

14:32.280 --> 14:34.530
Cross Entropy will understand that, oh, even though

14:34.530 --> 14:37.830
these are very small adjustments that are just

14:37.830 --> 14:41.040
you know, making a tiny change in absolute terms

14:41.040 --> 14:43.860
in relative terms, it's a huge improvement

14:43.860 --> 14:46.110
and we are definitely going in the right direction.

14:46.110 --> 14:47.220
Let's keep going that way.

14:47.220 --> 14:50.890
So Cross Entropy will help your neural network

14:54.150 --> 14:57.930
get to the optimal state. It's a better way

14:57.930 --> 15:01.110
for the neural network to get to an optimal state.

15:01.110 --> 15:02.220
But bear in mind

15:02.220 --> 15:05.610
that the Cross Entropy

15:05.610 --> 15:08.250
is the preferred method only for classification.

15:08.250 --> 15:11.430
So if you're talking about things like regression

15:11.430 --> 15:13.770
like which we had in artificial neural networks

15:13.770 --> 15:17.520
then you would rather go with mean squared error.

15:17.520 --> 15:20.640
Whereas Cross Entropy is better for classification.

15:20.640 --> 15:22.530
And again, it has to do with the fact

15:22.530 --> 15:23.700
that we're using Softmax function.

15:23.700 --> 15:26.760
So that's a kind of intuitive explanation of that.

15:26.760 --> 15:29.370
A good place to learn a bit more about that

15:29.370 --> 15:31.157
if you're really interested in why

15:31.157 --> 15:35.280
are we using Cross Entropy versus mean squared error.

15:35.280 --> 15:38.190
Google a video by Geoffrey Hinton called

15:38.190 --> 15:40.680
the Softmax Output Function

15:40.680 --> 15:42.900
and he explains it very well.

15:42.900 --> 15:46.380
And, you know, being the godfather of deep learning

15:46.380 --> 15:48.030
who can explain it better anyway.

15:48.870 --> 15:51.690
And by the way, any video by Geoffrey Hinton is golden.

15:51.690 --> 15:54.303
He's just got a huge talent for explaining things.

15:55.290 --> 15:58.650
Anyway, so that's Softmax and Cross Entropy.

15:58.650 --> 16:00.180
I hope that gives you kind of

16:00.180 --> 16:02.130
like an intuitive understanding of what's going on here.

16:02.130 --> 16:05.040
But more importantly that you are not put off

16:05.040 --> 16:06.540
by the term Cross Entropy

16:06.540 --> 16:09.090
because Hadelin will mention it in the practical tutorials

16:09.090 --> 16:12.059
and I wanted to make sure that you're prepared for that.

16:12.059 --> 16:13.350
And it's just another way

16:13.350 --> 16:15.690
of calculating your loss function

16:15.690 --> 16:17.640
and another way of optimizing your network

16:17.640 --> 16:21.870
which is specifically tailored to classification problems

16:21.870 --> 16:24.090
and therefore convolutional neural networks

16:24.090 --> 16:28.260
and comes hand in hand with the Softmax function.

16:28.260 --> 16:30.480
So additional reading, if you'd like

16:30.480 --> 16:34.590
a light introduction into Cross Entropy

16:34.590 --> 16:37.260
if you're interested in Cross Entropy a bit more, of course

16:37.260 --> 16:39.300
a good article to check out is called

16:39.300 --> 16:41.770
A Friendly Introduction to Cross Entropy Loss

16:42.771 --> 16:45.330
by Rob DiPietro, 2016.

16:45.330 --> 16:48.333
Here's the link below. Very, very nice.

16:50.195 --> 16:54.480
Very soft, no super complex math.

16:54.480 --> 16:56.160
Good analogies, good examples.

16:56.160 --> 16:57.540
He uses analogies of cars

16:57.540 --> 16:59.670
and you're looking at cars and he talks about information

16:59.670 --> 17:02.430
and bits and restrictions and you know

17:02.430 --> 17:03.330
how would you encode this?

17:03.330 --> 17:04.163
How would you encode that?

17:04.163 --> 17:05.880
It's a good article to have a look at

17:05.880 --> 17:08.747
and will give you a good overview of Cross Entropy

17:08.747 --> 17:12.900
like from an introductory standpoint. If you want to dig

17:12.900 --> 17:16.344
into the heavy math, like what you see here, then check

17:16.344 --> 17:20.340
out an article, or rather a blog post, called How to Implement

17:20.340 --> 17:22.530
a Neural Network Intermezzo 2.

17:22.530 --> 17:25.240
So intermezzo is like an intermediary thing

17:27.090 --> 17:29.700
intermittent, you know, like when you go

17:29.700 --> 17:34.650
to a theater and you have like a break between

17:34.650 --> 17:36.750
the first part and the second part

17:36.750 --> 17:38.609
because he's like going through all these steps

17:38.609 --> 17:41.523
and then he says, Oh, I gotta explain this first.

17:42.450 --> 17:44.100
And yeah, so that's why it's called Intermezzo

17:44.100 --> 17:46.680
No other reason, as far as I understand.

17:46.680 --> 17:50.760
The article is by Peter Roelants, 2016 as well.

17:50.760 --> 17:52.560
So both are quite recent.

17:52.560 --> 17:55.980
And yeah, check out this if you would like to dig into

17:55.980 --> 17:59.760
the mathematics behind Cross Entropy

17:59.760 --> 18:02.910
behind Softmax and Cross Entropy in this article, actually.

18:02.910 --> 18:03.840
So there we go.

18:03.840 --> 18:07.350
That's all there is to these two.

18:07.350 --> 18:10.920
Hopefully I was able to add some additional clarity

18:10.920 --> 18:12.750
and good luck with that.

18:12.750 --> 18:16.950
It's gonna be fun and enjoy the practical tutorials.

18:16.950 --> 18:18.060
I'll see you next time.

18:18.060 --> 18:19.893
Until then, enjoy deep learning.