WEBVTT

00:01.530 --> 00:03.480
-: Hi everyone, and welcome back.

00:03.480 --> 00:05.820
In the last lecture, we ended up finishing

00:05.820 --> 00:09.060
or finalizing the actual definition of our environment,

00:09.060 --> 00:12.270
the representation of our values, our punishment

00:12.270 --> 00:14.610
and rewards, and our actions in the environment.

00:14.610 --> 00:15.660
In this lecture,

00:15.660 --> 00:19.410
what I want to do is to introduce the training, the idea

00:19.410 --> 00:23.070
behind how to train the model, and to help get you started.

00:23.070 --> 00:25.320
And then in the next lecture, provide the solution,

00:25.320 --> 00:28.740
and more of a breakdown of the steps involved.

00:28.740 --> 00:31.110
So, we need to train the model.

00:31.110 --> 00:33.750
What would be required for training?

00:33.750 --> 00:37.830
This is an approach that we can use, with the following steps

00:37.830 --> 00:38.910
for training the model.

00:38.910 --> 00:41.970
We wanna choose a random non-terminal state,

00:41.970 --> 00:43.920
which would be the white square of our agent,

00:43.920 --> 00:45.510
and that's how we want to begin

00:45.510 --> 00:47.460
in our episode for the training.

00:47.460 --> 00:48.630
We then of course

00:48.630 --> 00:50.460
want to choose an action for the current state.

00:50.460 --> 00:52.230
We need our agent to be able to move

00:52.230 --> 00:53.340
around the environment.

00:53.340 --> 00:57.480
So, actions for us and our agent in this challenge

00:57.480 --> 00:59.910
are gonna be chosen using epsilon-greedy.

00:59.910 --> 01:02.700
This algorithm will usually choose the most promising action

01:02.700 --> 01:04.860
for the AI agent, but occasionally choose

01:04.860 --> 01:06.810
a less promising option in order to encourage

01:06.810 --> 01:08.580
the agent to explore the environment.

01:08.580 --> 01:11.430
We really wanna find that optimal policy.
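
NOTE
A minimal sketch of the epsilon-greedy choice described above, not the course solution.
Assumptions: the helper name get_next_action, a Q-table shaped (rows, columns, actions),
and epsilon taken as the probability of exploring (the course may define it the other way around).
```python
import numpy as np
# Hypothetical helper: q_values is assumed to have shape (rows, columns, actions).
def get_next_action(q_values, row, col, epsilon, rng):
    # Occasionally (with probability epsilon) explore: pick any action at random.
    if rng.random() < epsilon:
        return int(rng.integers(q_values.shape[2]))
    # Otherwise exploit: pick the most promising action for this state.
    return int(np.argmax(q_values[row, col]))
rng = np.random.default_rng(0)
q_values = np.zeros((3, 3, 4))
q_values[1, 1, 2] = 5.0  # make action 2 the most promising one at (1, 1)
greedy_action = get_next_action(q_values, 1, 1, epsilon=0.0, rng=rng)
random_action = get_next_action(q_values, 1, 1, epsilon=1.0, rng=rng)
```
With epsilon set to 0.0 the agent never explores, so greedy_action is the best-known action.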

01:11.430 --> 01:13.800
Then, we wanna perform the chosen action

01:13.800 --> 01:15.330
and transition to the next state.

01:15.330 --> 01:16.560
Move to the next location.

01:16.560 --> 01:18.300
And the reason I'm saying this is that I want you guys

01:18.300 --> 01:20.280
to think about how you can break this down

01:20.280 --> 01:23.700
into functions to solve this problem.

01:23.700 --> 01:26.790
Next, we need to receive a reward for going

01:26.790 --> 01:29.910
to a new state, and then calculate the temporal difference,

01:29.910 --> 01:31.470
and update the Q-value

01:31.470 --> 01:33.150
for the previous state-action pair.

01:33.150 --> 01:36.360
And if the new current state is a terminal state,

01:36.360 --> 01:37.770
we would then go back to step one.

01:37.770 --> 01:39.600
Otherwise, we'd go to step number two.
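
NOTE
A sketch of the temporal difference step, assuming the standard Q-learning update.
The helper name update_q_value, its parameters, and the (rows, columns, actions)
Q-table layout are assumptions made for illustration, not the course code.
```python
import numpy as np
# Hypothetical helper; states are (row, column) tuples.
def update_q_value(q_values, old_state, action, reward, new_state, learning_rate, discount_factor):
    old_row, old_col = old_state
    new_row, new_col = new_state
    old_q = q_values[old_row, old_col, action]
    # Temporal difference: reward plus discounted best future value, minus the old estimate.
    td = reward + discount_factor * np.max(q_values[new_row, new_col]) - old_q
    # Nudge the old estimate toward the target by the learning rate.
    q_values[old_row, old_col, action] = old_q + learning_rate * td
    return td
q_values = np.zeros((2, 2, 4))
q_values[0, 1, 3] = 10.0  # pretend the next state already has a known good action
td = update_q_value(q_values, (0, 0), 1, -1.0, (0, 1), learning_rate=0.5, discount_factor=0.9)
```
Here the temporal difference is -1 + 0.9 * 10 - 0, and half of it is added to the old Q-value.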

01:39.600 --> 01:41.820
So, the entire process, we're gonna aim to run

01:41.820 --> 01:43.830
for a thousand episodes to train.

01:43.830 --> 01:48.480
This is gonna give us, or rather our AI agent,

01:48.480 --> 01:51.570
sufficient opportunity to calculate that shortest path

01:51.570 --> 01:54.330
between the item packaging area and other locations

01:54.330 --> 01:55.863
in our example city.
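
NOTE
One way the steps above might fit together, sketched on a toy 3x3 grid invented for this note.
The grid, the move set, and the epsilon, discount factor, and learning rate values are all
assumptions; epsilon is treated here as the chance of exploring. The course solution may differ.
```python
import numpy as np
# Toy rewards grid (hypothetical): -1 marks white squares, anything else is terminal.
rewards = np.array([[-1.0, -1.0, 100.0],
                    [-1.0, -100.0, -1.0],
                    [-1.0, -1.0, -1.0]])
n_rows, n_cols = rewards.shape
moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left
q_values = np.zeros((n_rows, n_cols, len(moves)))
rng = np.random.default_rng(0)
epsilon, discount_factor, learning_rate = 0.1, 0.9, 0.9  # assumed values
for episode in range(1000):
    # Step 1: start on a random non-terminal (white) square.
    row, col = rng.integers(n_rows), rng.integers(n_cols)
    while rewards[row, col] != -1.0:
        row, col = rng.integers(n_rows), rng.integers(n_cols)
    # Keep stepping until the agent lands on a terminal square.
    while rewards[row, col] == -1.0:
        # Step 2: epsilon-greedy action choice.
        if rng.random() < epsilon:
            action = int(rng.integers(len(moves)))
        else:
            action = int(np.argmax(q_values[row, col]))
        # Step 3: move, staying inside the grid.
        new_row = min(max(row + moves[action][0], 0), n_rows - 1)
        new_col = min(max(col + moves[action][1], 0), n_cols - 1)
        # Steps 4 to 6: reward, temporal difference, Q-value update.
        reward = rewards[new_row, new_col]
        td = reward + discount_factor * np.max(q_values[new_row, new_col]) - q_values[row, col, action]
        q_values[row, col, action] += learning_rate * td
        row, col = new_row, new_col
best_action_next_to_goal = int(np.argmax(q_values[0, 1]))
```
After training, the square just left of the goal should prefer moving right (action index 1).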

01:56.730 --> 01:57.660
Awesome.

01:57.660 --> 01:59.880
So, please think about how you would approach this,

01:59.880 --> 02:03.180
and I wanna help give you an idea to try and solve this.

02:03.180 --> 02:05.970
So, we'd be looking at, in our solution, we're gonna be

02:05.970 --> 02:08.700
using the following functions.

02:08.700 --> 02:09.930
I can actually comment this out

02:09.930 --> 02:12.090
since it's a code cell, my apologies.

02:12.090 --> 02:13.440
We don't want a dollar sign.

02:13.440 --> 02:15.720
We want to comment these out.

02:15.720 --> 02:17.190
And I'll help get you started

02:17.190 --> 02:18.243
on the first one.

02:19.140 --> 02:23.400
Overall, we're gonna have these functions defining our steps

02:23.400 --> 02:25.110
for training the model.

02:25.110 --> 02:27.510
And to help get you started, let's take a look at how

02:27.510 --> 02:29.520
we would approach is_terminal_state

02:29.520 --> 02:31.860
in creating a function within Python for this.

02:31.860 --> 02:33.870
But first things first, we would of course want

02:33.870 --> 02:36.430
to define our function as is

02:37.770 --> 02:39.543
terminal state.

02:40.830 --> 02:42.930
And these are the names of the functions that you'll see

02:42.930 --> 02:45.630
to give you an idea that might help you break it down.

02:45.630 --> 02:49.590
What we wanna do is take the current row index,

02:49.590 --> 02:51.273
and the current column index.

02:53.130 --> 02:55.170
This is gonna help give us the position

02:55.170 --> 02:58.620
of our agent, and we can add

02:58.620 --> 03:02.100
here for our rewards, a true or false.

03:02.100 --> 03:03.300
So, we need an if statement.

03:03.300 --> 03:08.300
So, let's set our if, our rewards of our current

03:09.600 --> 03:14.250
row index, and current

03:14.250 --> 03:17.943
column index, equal to negative one.

03:18.900 --> 03:20.253
If they're in that state,

03:24.480 --> 03:27.060
we would return false.

03:27.060 --> 03:31.650
Otherwise, or else we would return

03:31.650 --> 03:32.483
true.
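
NOTE
What the function dictated above might look like once typed out. The small rewards grid
is invented for this note, and the underscore spelling of the names is an assumption.
```python
import numpy as np
# Toy rewards grid (hypothetical): -1 marks a white square the agent can stand on.
rewards = np.array([[-100.0, -1.0],
                    [-1.0, 100.0]])
def is_terminal_state(current_row_index, current_column_index):
    # A white square (reward of -1) is not terminal, so training continues from it.
    if rewards[current_row_index, current_column_index] == -1.0:
        return False
    else:
        # Any other reward value ends the episode.
        return True
on_white_square = is_terminal_state(0, 1)
on_goal_square = is_terminal_state(1, 1)
```
The if/else could be collapsed into a single comparison, but this form mirrors the lecture.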

03:33.750 --> 03:35.640
Pretty straightforward enough.

03:35.640 --> 03:37.410
And this is how we're gonna get an idea

03:37.410 --> 03:39.540
of our is_terminal_state function.

03:39.540 --> 03:42.450
After we've checked whether it's in a terminal state,

03:42.450 --> 03:44.460
we then wanna get the starting location.

03:44.460 --> 03:47.400
As a hint, you can take a look at using the current

03:47.400 --> 03:49.920
row index and the current column index,

03:49.920 --> 03:52.920
and setting them with NumPy's random module.

03:52.920 --> 03:54.840
We wanna initialize that randomly

03:54.840 --> 03:57.480
to the environment rows, environment columns.
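
NOTE
A possible shape for the starting-location hint, assuming the name get_starting_location,
a toy rewards grid, and NumPy's default_rng generator (the course may use a different
NumPy random call). It redraws until the agent lands on a white square.
```python
import numpy as np
# Toy rewards grid (hypothetical): -1 marks a white, non-terminal square.
rewards = np.array([[-100.0, -1.0, 100.0],
                    [-1.0, -1.0, -100.0]])
environment_rows, environment_columns = rewards.shape
def get_starting_location(rng):
    # Draw a random row and column inside the environment.
    current_row_index = rng.integers(environment_rows)
    current_column_index = rng.integers(environment_columns)
    # Keep drawing until we land on a white (non-terminal) square.
    while rewards[current_row_index, current_column_index] != -1.0:
        current_row_index = rng.integers(environment_rows)
        current_column_index = rng.integers(environment_columns)
    return int(current_row_index), int(current_column_index)
rng = np.random.default_rng(0)
start_row, start_column = get_starting_location(rng)
```
Every call returns a position whose reward is -1, so an episode never begins on a terminal square.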

03:57.480 --> 04:00.720
But in the next lecture, you'll see a breakdown,

04:00.720 --> 04:02.820
you'll see the rest of the functions

04:02.820 --> 04:05.010
with some notes to help give you an idea.

04:05.010 --> 04:06.960
I really hope that you guys take the chance to experiment

04:06.960 --> 04:09.600
with this because it is just an awesome way to learn,

04:09.600 --> 04:12.420
help give you an idea, and this is to get you started.

04:12.420 --> 04:15.510
So don't worry, you'll get the solution in the next lecture.

04:15.510 --> 04:18.660
And then, we're gonna wrap things up by actually

04:18.660 --> 04:21.960
assigning an epsilon, discount factor, learning rate,

04:21.960 --> 04:24.540
those types of things for training, running the training,

04:24.540 --> 04:26.220
and viewing the results.

04:26.220 --> 04:27.540
Awesome.

04:27.540 --> 04:29.040
Let's stop here.

04:29.040 --> 04:31.980
Again, try to solve this, but if not,

04:31.980 --> 04:33.840
if you wanna just advance, go to the next lecture,

04:33.840 --> 04:34.890
and you'll get the solution

04:34.890 --> 04:38.610
for the training and for defining these functions.

04:38.610 --> 04:40.913
All right, I'll see you guys in the next lecture.
