WEBVTT

00:01.470 --> 00:03.360
-: Hi, everyone, and welcome back.

00:03.360 --> 00:05.670
In the last lecture, you saw how we used

00:05.670 --> 00:08.160
our helper functions to really set up the core

00:08.160 --> 00:11.370
of our functionality to train our model.

00:11.370 --> 00:13.950
So we had our terminal state, our starting location,

00:13.950 --> 00:17.310
our next action, next location and shortest path.

00:17.310 --> 00:19.920
That really defines the majority of how our agent

00:19.920 --> 00:21.720
or how our environment's going to work.

00:21.720 --> 00:24.270
The last thing we have to do is use our training.

00:24.270 --> 00:25.620
In order to use our training,

00:25.620 --> 00:26.940
it's pretty straightforward in the sense

00:26.940 --> 00:29.100
of what we need to include,

00:29.100 --> 00:32.400
but how we go about it is a different story.

00:32.400 --> 00:35.010
That being said, also, if you guys use a different approach

00:35.010 --> 00:37.200
or you want to customize this and experiment with it,

00:37.200 --> 00:38.760
it's highly recommended.

00:38.760 --> 00:40.800
In addition to that, if you use a different approach

00:40.800 --> 00:41.880
and you wanna discuss it,

00:41.880 --> 00:43.890
please feel free to share it in the Q and A,

00:43.890 --> 00:45.540
more than happy to discuss it with you.

00:45.540 --> 00:46.560
It's such a great idea,

00:46.560 --> 00:48.780
because there's so many ways to solve this,

00:48.780 --> 00:51.210
and you might find a way that's way better,

00:51.210 --> 00:52.320
more advantageous.

00:52.320 --> 00:54.090
We could look at pros and cons.

00:54.090 --> 00:55.830
Again, more than happy to discuss it.

00:55.830 --> 00:57.390
All right, so what do we have to do?

00:57.390 --> 00:59.340
We have to launch our training.

00:59.340 --> 01:01.620
For our training, we want to use the epsilon,

01:01.620 --> 01:04.320
so basically the percentage of the time that we're going to take

01:04.320 --> 01:09.090
the best action instead of a random action.

01:09.090 --> 01:11.280
So let's set our epsilon,

01:11.280 --> 01:13.260
and let's set it to 0.9,

01:13.260 --> 01:15.780
and we're gonna do something similar with our values

01:15.780 --> 01:18.270
for our discount factor and our learning rate.

01:18.270 --> 01:23.270
So let's do discount factor equal to 0.9.

01:23.820 --> 01:28.820
Let's also take our learning rate and set it to equal 0.9.

01:30.060 --> 01:32.280
That's going to be the rate at which our agent

01:32.280 --> 01:34.110
is going to or should learn.

01:34.110 --> 01:36.450
And we also want to take the number of training episodes.

01:36.450 --> 01:39.450
Let's call it N training episodes.

01:39.450 --> 01:41.460
So we have number of training episodes.

01:41.460 --> 01:42.630
Let's set it to a thousand.

01:42.630 --> 01:43.950
That's how many times this is going to train

01:43.950 --> 01:46.683
or how many episodes this is going to train on.
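
NOTE A minimal sketch of the hyperparameters just described. The values are the ones stated in the lecture; the exact variable names are assumed, since the code itself is not in the transcript.

```python
# Hyperparameters as stated in the lecture (variable names are assumed).
epsilon = 0.9               # fraction of the time we take the best-known action
discount_factor = 0.9       # how much future rewards count
learning_rate = 0.9         # how strongly each update moves a Q-value
n_training_episodes = 1000  # number of training episodes
```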

01:48.330 --> 01:50.970
Awesome, now essentially, what we want to do

01:50.970 --> 01:53.010
is iterate through each episode.

01:53.010 --> 01:56.520
We wanna take the range of our number of training episodes

01:56.520 --> 02:00.240
and set, basically, our epsilon.

02:00.240 --> 02:01.920
We want to find out our temporal difference.

02:01.920 --> 02:04.710
To set that, we need to look at some previous Q values.

02:04.710 --> 02:07.290
We need to calculate our starting position

02:07.290 --> 02:08.790
and our action index.

02:08.790 --> 02:11.190
So let's get started.

02:11.190 --> 02:12.600
Let's take a big for loop.

02:12.600 --> 02:17.600
So we have our for episode in the range

02:18.210 --> 02:20.133
of our number of training episodes.

02:22.260 --> 02:25.530
There we go, barring any syntax error on my part there.

02:25.530 --> 02:27.820
We want to take the row index

02:29.190 --> 02:34.000
and the column index for our grid, for our maze,

02:36.240 --> 02:40.080
and set it to our get starting location.

02:40.080 --> 02:41.400
We need a starting location,

02:41.400 --> 02:45.270
so let's have that for each episode.

02:45.270 --> 02:48.900
Now, we can set our terminal state.

02:48.900 --> 02:53.900
While not is terminal state.

02:56.520 --> 03:01.520
We wanna look at the row index and the column index.

03:02.430 --> 03:07.430
Okay, next, let's set our action index for our next action.

03:07.560 --> 03:09.540
So we do need our next action as this agent

03:09.540 --> 03:11.490
iterates through the maze.

03:11.490 --> 03:16.487
For our action index, we can use our get next action,

03:17.940 --> 03:22.940
and we can use our row index, column index,

03:22.950 --> 03:24.870
and let's call our epsilon.

03:24.870 --> 03:28.470
So we want to have this, instead of our random action,

03:28.470 --> 03:33.470
let's use column index and we need epsilon, awesome.
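
NOTE A minimal sketch of an epsilon-greedy next-action helper like the one being called here. Assumptions not confirmed by the lecture: q_values is passed in explicitly, it has shape (rows, columns, actions), and the 11x11 grid with 4 actions is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)

def get_next_action(q_values, row_index, column_index, epsilon):
    """Pick the best-known action with probability epsilon, else a random one."""
    if rng.random() < epsilon:
        # Exploit: index of the largest Q-value for this cell.
        return int(np.argmax(q_values[row_index, column_index]))
    # Explore: uniformly random action index.
    return int(rng.integers(q_values.shape[2]))

# Hypothetical Q-table: 11x11 grid, 4 actions per cell.
q_values = np.zeros((11, 11, 4))
q_values[3, 9, 2] = 1.0  # pretend action 2 looks best at row 3, column 9

# With epsilon = 1.0 the greedy branch is always taken.
action = get_next_action(q_values, 3, 9, epsilon=1.0)
```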

03:35.490 --> 03:39.960
If I could spell correctly here, we want epsilon.

03:39.960 --> 03:41.820
All right, let me just delete this.

03:41.820 --> 03:44.070
Okay, let me grab the rest of the code,

03:44.070 --> 03:46.050
and we'll walk through it so we don't have

03:46.050 --> 03:48.540
to watch me typing out each line,

03:48.540 --> 03:50.670
but just keep that in mind how we want

03:50.670 --> 03:53.520
to approach this by setting the old row.

03:53.520 --> 03:55.350
We would need an old row index.

03:55.350 --> 03:58.140
We would also want to have the rewards

03:58.140 --> 04:00.540
for our columns and our rows.

04:00.540 --> 04:03.570
We need to take a look at Q values and old Q values,

04:03.570 --> 04:06.360
and we need to calculate our temporal difference.

04:06.360 --> 04:08.640
Okay, so you see the change in code

04:08.640 --> 04:10.170
or the update to the code.

04:10.170 --> 04:12.360
We are taking our old row index

04:12.360 --> 04:14.520
and our old column index to look

04:14.520 --> 04:16.620
at our row index and column index,

04:16.620 --> 04:20.130
setting our row index and column index equal to our next location,

04:20.130 --> 04:21.750
and our next location would need

04:21.750 --> 04:24.333
the row index, column index, and action index.

04:25.410 --> 04:27.870
Lastly, we're looking at our rewards.

04:27.870 --> 04:30.630
Our old Q values would be Q values

04:30.630 --> 04:33.360
with the old row index, the old column index,

04:33.360 --> 04:36.120
and again, referencing our action index

04:36.120 --> 04:38.760
and the calculation of our temporal difference,

04:38.760 --> 04:41.220
which is our reward plus the discount factor

04:41.220 --> 04:45.420
times the maximum Q value at the new row index and column index

04:45.420 --> 04:48.483
minus the old Q value.

04:50.130 --> 04:52.800
Almost done. Now we need our new Q value,

04:52.800 --> 04:55.710
which is our old Q value plus the learning rate

04:55.710 --> 04:57.480
times temporal difference,

04:57.480 --> 05:00.660
and our Q values with the old row index

05:00.660 --> 05:04.593
and our old column index equal to the new Q value.
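
NOTE A minimal sketch of the temporal-difference update just walked through, using the standard Q-learning rule with the maximum over next-state Q-values. The function name q_update, its signature, and the tiny 2x2 table are assumptions for illustration only.

```python
import numpy as np

def q_update(q_values, old_row, old_col, action_index, reward,
             new_row, new_col, discount_factor=0.9, learning_rate=0.9):
    """Apply one Q-learning update in place and return the new Q-value."""
    old_q_value = q_values[old_row, old_col, action_index]
    # Temporal difference: reward plus discounted best future value,
    # minus what we previously believed.
    temporal_difference = (reward
                           + discount_factor * np.max(q_values[new_row, new_col])
                           - old_q_value)
    new_q_value = old_q_value + learning_rate * temporal_difference
    q_values[old_row, old_col, action_index] = new_q_value
    return new_q_value

# Hypothetical 2x2 grid with 4 actions; the next cell already has a Q of 10.
q_values = np.zeros((2, 2, 4))
q_values[0, 1, 0] = 10.0

# Arithmetic: 0 + 0.9 * (-1 + 0.9 * 10 - 0) = 7.2
new_q = q_update(q_values, 0, 0, 1, reward=-1.0, new_row=0, new_col=1)
```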

05:05.910 --> 05:08.130
If you guys want to discuss this any further,

05:08.130 --> 05:10.020
again, highly recommend the Q and A.

05:10.020 --> 05:11.460
Please feel free to print.

05:11.460 --> 05:13.770
Try to experiment with all the variables used in here.

05:13.770 --> 05:16.200
If you wanna see the shapes, see how they can be used,

05:16.200 --> 05:18.660
what it's referencing, I highly recommend it,

05:18.660 --> 05:21.240
and I'm more than happy to discuss it further.

05:21.240 --> 05:23.220
One last thing we can do so we have

05:23.220 --> 05:24.840
a little bit of a notification.

05:24.840 --> 05:27.090
Let's add a print statement so we know

05:27.090 --> 05:30.150
when the training is complete for our thousand episodes.

05:30.150 --> 05:33.600
All right, if there are no syntax errors on my part,

05:33.600 --> 05:34.770
we should be good to go.
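
NOTE A self-contained sketch of the whole training loop on a toy grid, not the course notebook. The 3x3 reward grid, the reward values, the action names, and the helper bodies are all assumptions; only the loop structure and the hyperparameter values follow the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3x3 environment: -1 per open cell, -100 wall, +100 goal.
rewards = np.full((3, 3), -1.0)
rewards[1, 1] = -100.0  # wall
rewards[0, 2] = 100.0   # goal
n_rows, n_cols = rewards.shape
actions = ["up", "right", "down", "left"]
q_values = np.zeros((n_rows, n_cols, len(actions)))

def is_terminal_state(row, col):
    # Any cell that is not an open (-1) cell ends the episode.
    return rewards[row, col] != -1.0

def get_starting_location():
    # Sample random cells until we land on a non-terminal one.
    while True:
        row, col = int(rng.integers(n_rows)), int(rng.integers(n_cols))
        if not is_terminal_state(row, col):
            return row, col

def get_next_action(row, col, epsilon):
    if rng.random() < epsilon:
        return int(np.argmax(q_values[row, col]))
    return int(rng.integers(len(actions)))

def get_next_location(row, col, action_index):
    # Moves that would leave the grid keep the agent in place.
    action = actions[action_index]
    if action == "up" and row > 0:
        row -= 1
    elif action == "right" and col < n_cols - 1:
        col += 1
    elif action == "down" and row < n_rows - 1:
        row += 1
    elif action == "left" and col > 0:
        col -= 1
    return row, col

epsilon, discount_factor, learning_rate = 0.9, 0.9, 0.9
n_training_episodes = 1000

for episode in range(n_training_episodes):
    row_index, column_index = get_starting_location()
    while not is_terminal_state(row_index, column_index):
        action_index = get_next_action(row_index, column_index, epsilon)
        old_row_index, old_column_index = row_index, column_index
        row_index, column_index = get_next_location(
            row_index, column_index, action_index)
        reward = rewards[row_index, column_index]
        old_q_value = q_values[old_row_index, old_column_index, action_index]
        temporal_difference = (
            reward
            + discount_factor * np.max(q_values[row_index, column_index])
            - old_q_value)
        new_q_value = old_q_value + learning_rate * temporal_difference
        q_values[old_row_index, old_column_index, action_index] = new_q_value

print("Training complete!")
```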

05:34.770 --> 05:36.870
The Colab notebook is initialized.

05:36.870 --> 05:38.790
I reran everything, so let me run this.

05:38.790 --> 05:41.760
It should be very quick, since we're only using NumPy.

05:41.760 --> 05:44.220
It's a very optimized model in a sense.

05:44.220 --> 05:46.080
Let me click this, and we will then look

05:46.080 --> 05:47.180
for the shortest path.

05:48.480 --> 05:50.760
All right, we have our training complete.

05:50.760 --> 05:54.900
Let's also print the shortest path for some starting option.

05:54.900 --> 05:57.270
So we're starting at row three, column nine

05:57.270 --> 06:00.300
with the shortest path option with our get shortest path.

06:00.300 --> 06:02.700
We wanna look at row five and column zero,

06:02.700 --> 06:07.260
and starting at row nine and column number five,

06:07.260 --> 06:09.030
let's print these out.

06:09.030 --> 06:12.630
Awesome, we have the shortest path, but we're not done.

06:12.630 --> 06:14.070
We're almost there.

06:14.070 --> 06:16.860
We can see our postman automatically get the shortest path

06:16.860 --> 06:20.460
from any location we consider legal

06:20.460 --> 06:22.920
in our city to the item packaging area.

06:22.920 --> 06:25.650
But what about the reverse, the opposite scenario?

06:25.650 --> 06:29.250
Basically referring to can our postman deliver an item

06:29.250 --> 06:31.560
from anywhere in the city to the packaging area?

06:31.560 --> 06:35.580
But after dropping off the item, it would then need to go from that area,

06:35.580 --> 06:38.460
from the packaging area, to another location in the city,

06:38.460 --> 06:40.770
because it would have to pick up the next item.

06:40.770 --> 06:43.980
So what can we do to solve this?

06:43.980 --> 06:45.540
And it's actually pretty simple.

06:45.540 --> 06:48.870
You could reverse the order of the shortest path.

06:48.870 --> 06:50.490
Try to think about that for a second.

06:50.490 --> 06:52.340
And then, this would be the solution.

06:53.490 --> 06:55.260
We can use the get shortest path.

06:55.260 --> 06:57.960
So let's take row five and column two,

06:57.960 --> 06:59.760
and then all you would need to do is use

06:59.760 --> 07:01.440
the reverse option with Python.

07:01.440 --> 07:03.030
We use path dot reverse,

07:03.030 --> 07:04.710
and we could print out the path.
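
NOTE A minimal sketch of reversing a path list in Python. The path below is a made-up list of [row, column] pairs standing in for whatever the shortest-path helper returns; the real output depends on the trained Q-values.

```python
# Hypothetical path from row 5, column 2 to the packaging area.
path = [[5, 2], [4, 2], [3, 2], [3, 3]]

# Slicing builds a reversed copy and leaves the original list alone.
reversed_copy = path[::-1]

# list.reverse() reverses in place and returns None,
# so call it on its own line rather than printing its result.
path.reverse()
print(path)
```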

07:04.710 --> 07:06.363
So we're looking at five and two,

07:07.260 --> 07:09.480
and there we go, we have the shortest path.

07:09.480 --> 07:10.800
And it's really helpful if you want

07:10.800 --> 07:12.600
to grab the image in the cell

07:12.600 --> 07:15.030
and bring it down to compare and look at those paths.

07:15.030 --> 07:16.440
But you can actually see the paths

07:16.440 --> 07:19.200
if you take the printed-out statement,

07:19.200 --> 07:21.030
look at the column and the order

07:21.030 --> 07:23.370
and see how the salesman is traveling.

07:23.370 --> 07:27.000
So awesome work; you are using Q-learning.

07:27.000 --> 07:29.190
What we learned from this course has a little bit of a bonus

07:29.190 --> 07:32.760
and fun scenario section to solve this,

07:32.760 --> 07:35.190
essentially, traveling salesman kind of problem.

07:35.190 --> 07:37.980
We're looking at the postman to deliver packages

07:37.980 --> 07:39.720
and items in the city to find

07:39.720 --> 07:42.240
those shortest options and shortest paths.

07:42.240 --> 07:44.370
Highly recommend that you customize,

07:44.370 --> 07:47.220
test out other options, change the hyperparameters.

07:47.220 --> 07:48.630
And if you discover anything better,

07:48.630 --> 07:53.160
if you discover any optimized or best parameters to use,

07:53.160 --> 07:55.770
please feel free to share them in the Q and A.

07:55.770 --> 07:58.470
Amazing, hope you guys really enjoyed this.

07:58.470 --> 08:01.560
Please customize, experiment, and keep learning.

08:01.560 --> 08:05.133
It's so much fun to work with Q-learning and enjoy AI.
