WEBVTT

00:00.630 --> 00:02.250
Hi everyone, and welcome back.

00:02.250 --> 00:03.090
In the last lecture

00:03.090 --> 00:05.430
we ended up starting to build our environment.

00:05.430 --> 00:08.430
We have essentially an 11x11 grid here

00:08.430 --> 00:11.580
that we want our Postman to iterate through

00:11.580 --> 00:14.160
to solve our challenge.

00:14.160 --> 00:17.610
We want to now start to think about,

00:17.610 --> 00:18.540
once we have the grid,

00:18.540 --> 00:20.520
we also need to have actions

00:20.520 --> 00:21.870
for our Agent.

00:21.870 --> 00:24.000
And I'm gonna delete these really quickly;

00:24.000 --> 00:25.080
we don't need these cells.

00:25.080 --> 00:27.960
I just wanted to have it in for some spacing.

00:27.960 --> 00:29.520
Just makes it easier to view.

00:29.520 --> 00:31.050
Gimme one second.

00:31.050 --> 00:36.050
Let me delete these and we can leave this one here.

00:36.090 --> 00:39.240
So, I left the text in that we have

00:39.240 --> 00:41.760
and we want to set our actions as

00:41.760 --> 00:43.800
up, right, down and left.

00:43.800 --> 00:45.690
With Python, it's pretty straightforward.

00:45.690 --> 00:48.420
We can build the list and set our actions.

00:48.420 --> 00:53.220
Our actions are going to be equal to,

00:53.220 --> 00:55.530
as we have, up, right, down, left.

00:55.530 --> 00:57.780
So let's set it up,

00:57.780 --> 00:59.400
right,

00:59.400 --> 01:00.690
down,

01:00.690 --> 01:01.530
and left.
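
NOTE
A minimal sketch of the actions list described here. Storing each action
as a lowercase string is an assumption; the lecture only names the four
moves and their order.
```python
# Assumed representation: one string per move, in the order named in the lecture.
actions = ['up', 'right', 'down', 'left']
# Each action can then be referred to by its position in the list (0 to 3).
for action_index, action in enumerate(actions):
    print(action_index, action)
```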

01:01.530 --> 01:04.080
We need to give our act, our Agent, excuse me,

01:04.080 --> 01:06.540
some ability to maneuver with these actions

01:06.540 --> 01:07.390
through the maze.

01:08.430 --> 01:12.330
In addition to that, we also have to start setting rewards.

01:12.330 --> 01:13.950
Now, this is where it's going to

01:13.950 --> 01:16.140
start becoming a little more tricky

01:16.140 --> 01:20.220
because we have to set different states of our environment,

01:20.220 --> 01:21.600
different states in the sense of

01:21.600 --> 01:26.280
we want to be able to assign these -100 and -1,

01:26.280 --> 01:31.020
these steps or these state values to each square

01:31.020 --> 01:32.220
within the grid.

01:32.220 --> 01:35.490
So we see, to help our AI Agent learn

01:35.490 --> 01:37.590
each state or location in our city,

01:37.590 --> 01:39.390
we want to have a reward value.

01:39.390 --> 01:41.160
That's how our Agent's going to learn.

01:41.160 --> 01:43.590
So the Agent may begin at any white square

01:43.590 --> 01:45.420
but its goal is always the same.

01:45.420 --> 01:48.750
It wants to maximize its total rewards within Q-learning.

01:48.750 --> 01:49.583
Negative rewards,

01:49.583 --> 01:52.170
we know that they're referred to as punishments.

01:52.170 --> 01:54.180
These are used for all states except the goal.

01:54.180 --> 01:57.390
That's how we're gonna establish that optimal policy.

01:57.390 --> 01:59.490
Which encourages the AI to identify

01:59.490 --> 02:03.360
the shortest path to the goal by minimizing its punishments.
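
NOTE
A small worked example of why the -1 punishments favor short paths. It
assumes the total is a plain undiscounted sum of rewards; the helper name
total_reward and the two constants are hypothetical, not from the lecture.
```python
STEP_REWARD = -1    # punishment for entering a white square
GOAL_REWARD = 100   # reward for reaching the green square
def total_reward(steps_before_goal):
    """Undiscounted sum of rewards for a path that ends at the goal."""
    return steps_before_goal * STEP_REWARD + GOAL_REWARD
# A 4-step path beats a 10-step path because it collects fewer punishments.
print(total_reward(4))   # 96
print(total_reward(10))  # 90
```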

02:03.360 --> 02:07.290
All right, also to maximize the cumulative rewards,

02:07.290 --> 02:09.630
the AI Agent will need to find the shortest path

02:09.630 --> 02:12.660
between the item packaging area, our green square remember,

02:12.660 --> 02:13.950
and the other locations of the city

02:13.950 --> 02:16.860
where the Postman can travel: the white squares.

02:16.860 --> 02:19.230
Agent's going to learn to avoid crashing

02:19.230 --> 02:20.550
into any of the city boundaries.

02:20.550 --> 02:23.280
Those are the black squares as we see with a -100.

02:23.280 --> 02:24.630
We wanna stay away from them.

02:24.630 --> 02:27.060
They have more of a punishment.

02:27.060 --> 02:29.550
So in order to do this, we have our grid,

02:29.550 --> 02:32.940
the environment that we created above with our rows,

02:32.940 --> 02:36.540
but we also want to assign these values to it.

02:36.540 --> 02:39.150
So in order to do that, let's try to think

02:39.150 --> 02:40.710
about how we can establish that.

02:40.710 --> 02:41.850
We can use NumPy,

02:41.850 --> 02:44.310
and we can also start setting it to a -100

02:44.310 --> 02:47.250
for environment rows and environment columns.

02:47.250 --> 02:49.710
So we have our environment rows and environment columns.

02:49.710 --> 02:52.120
So let's call this rewards

02:53.040 --> 02:54.630
equal to

02:54.630 --> 02:56.073
numpy.full,

02:57.450 --> 03:02.450
and let's pass in our environment rows, environment columns.

03:04.500 --> 03:09.300
And to set our values, we can start with -100.
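
NOTE
A sketch of the step just described, assuming the names environment_rows
and environment_columns and the 11x11 size mentioned at the start of the
lecture. The np alias is also an assumption.
```python
import numpy as np
# Assumed names and sizes: the lecture describes an 11x11 grid.
environment_rows = 11
environment_columns = 11
# Start with every square at -100; walkable squares are overwritten later.
rewards = np.full((environment_rows, environment_columns), -100.)
print(rewards.shape)  # (11, 11)
```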

03:12.060 --> 03:13.560
In addition to that,

03:13.560 --> 03:18.330
we also want to set our rewards array

03:18.330 --> 03:19.163
to

03:20.190 --> 03:24.420
use the indexes of zero and five,

03:24.420 --> 03:25.923
equal to 100.

03:26.820 --> 03:28.680
And this is gonna make sense in a second.

03:28.680 --> 03:30.330
So we're taking a look at our green square.

03:30.330 --> 03:31.440
We have our zero and five,

03:31.440 --> 03:33.540
we have our green square set as 100,

03:33.540 --> 03:37.680
taking these or this location to set the value.
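
NOTE
A sketch of setting the goal square, assuming the rewards array is the
11x11 NumPy array filled with -100 from the previous step.
```python
import numpy as np
rewards = np.full((11, 11), -100.)  # assumed 11x11 grid, as described earlier
# Row 0, column 5 is the green square (the item packaging area).
rewards[0, 5] = 100.
print(rewards[0])
```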

03:37.680 --> 03:40.830
Now I'm gonna paste in the next snippet of code

03:40.830 --> 03:41.820
so we can go through it,

03:41.820 --> 03:43.530
so you don't have to watch me write each step out

03:43.530 --> 03:44.910
because it's a little repetitive.

03:44.910 --> 03:47.850
And now we have our white spaces.

03:47.850 --> 03:52.320
Let's set our note for our reward points.

03:52.320 --> 03:55.500
And in this snippet of code, we're using a dictionary

03:55.500 --> 03:57.570
and setting each of our values within our dictionary

03:57.570 --> 03:58.560
so we have our aisles.

03:58.560 --> 04:01.500
We're thinking about that as each individual row.

04:01.500 --> 04:05.400
And we can set with our slicing, with our index,

04:05.400 --> 04:08.970
one through nine, and we want to use an iteration

04:08.970 --> 04:12.030
with our for loop to set these values.

04:12.030 --> 04:13.860
In doing so, you'll see if we look

04:13.860 --> 04:16.380
at 1-10, 1, 7, and 9,

04:16.380 --> 04:19.650
and by using this, we can actually set the row index

04:19.650 --> 04:21.000
in our range of 1-10,

04:21.000 --> 04:23.670
which we're working through in our environment,

04:23.670 --> 04:28.670
we can set the column index in our aisles of our row index

04:29.280 --> 04:33.060
with our dictionary, our rewards row index and column index,

04:33.060 --> 04:35.010
we can set it to -1.

04:35.010 --> 04:37.110
So what this is doing is essentially,

04:37.110 --> 04:39.660
if we look at each specific-

04:39.660 --> 04:41.790
If we take a range here, for example

04:41.790 --> 04:43.920
aisle 9 for i in our range

04:43.920 --> 04:47.340
we have -1 set throughout the entire row,

04:47.340 --> 04:49.710
or each state within that row.

04:49.710 --> 04:52.560
For eight, we have three and seven.

04:52.560 --> 04:54.060
So if we can scroll up,

04:54.060 --> 04:56.250
we can see that in three and seven

04:56.250 --> 04:57.450
we're setting a negative one,

04:57.450 --> 04:59.100
since they would otherwise all be -100,

04:59.100 --> 05:00.420
or set to negative one hundred.

05:00.420 --> 05:03.480
And with this iteration, we can set those rewards

05:03.480 --> 05:06.540
or set each state that we're defining

05:06.540 --> 05:09.030
within our aisles to negative one.
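
NOTE
A hedged sketch of the dictionary-and-loop approach. Only the column lists
spoken in the lecture are included, and attaching them to rows 1, 2, 8 and
9 is partly an assumption (the lecture names rows 8 and 9 explicitly). The
rest of the map is left out rather than guessed, so this loops over the
dictionary items instead of a fixed range of 1 to 10.
```python
import numpy as np
rewards = np.full((11, 11), -100.)  # assumed 11x11 grid
rewards[0, 5] = 100.                # goal square
# Walkable columns per row; rows not listed here stay at -100.
aisles = {}
aisles[1] = [i for i in range(1, 10)]  # columns 1 through 9
aisles[2] = [1, 7, 9]
aisles[8] = [3, 7]
aisles[9] = [i for i in range(11)]     # assumed to cover the whole row
# Overwrite the listed squares with -1; everything else is untouched.
for row_index, columns in aisles.items():
    for column_index in columns:
        rewards[row_index, column_index] = -1.
print(rewards[8])
```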

05:09.030 --> 05:10.500
It makes it very easy.

05:10.500 --> 05:13.560
Instead of having to write further logic

05:13.560 --> 05:16.530
or maybe more detailed functions or statements

05:16.530 --> 05:19.110
we can iterate through and set these values.

05:19.110 --> 05:21.600
I highly recommend that you take a minute to

05:21.600 --> 05:24.390
explore and experiment if you wanna change the environment later

05:24.390 --> 05:26.070
after we run this solution.

05:26.070 --> 05:29.550
It's a great way to help learn and reinforce these policies.

05:29.550 --> 05:32.190
But this is starting to take shape.

05:32.190 --> 05:35.130
And a cool thing we can do is we can actually visualize it.

05:35.130 --> 05:38.793
So let's do for row in rewards.

05:40.200 --> 05:42.420
Print row.
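
NOTE
A sketch of the print loop, assuming rewards is the 11x11 NumPy array
built in the earlier steps (only the goal square is set here to keep the
example short).
```python
import numpy as np
rewards = np.full((11, 11), -100.)  # assumed 11x11 grid
rewards[0, 5] = 100.
# Iterating over a 2D array yields one row (a 1D array) at a time.
for row in rewards:
    print(row)
```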

05:42.420 --> 05:43.680
And let's print this.

05:43.680 --> 05:46.740
And we can see I might have to actually re-run the cells,

05:46.740 --> 05:48.630
my apologies, I wasn't connected

05:48.630 --> 05:50.430
to the notebook working through here.

05:50.430 --> 05:51.360
Give it one second.

05:51.360 --> 05:52.800
It's gonna throw an error.

05:52.800 --> 05:55.380
I need to go back through and re-run the cells.

05:55.380 --> 05:57.330
So let me just run this really quickly.

05:57.330 --> 05:58.950
I want to import NumPy.

05:58.950 --> 06:02.280
I can actually just, oh, my apologies.

06:02.280 --> 06:04.980
Let me go through this here and I'm gonna come down.

06:04.980 --> 06:08.670
We wanna run our environment, we wanna run our actions.

06:08.670 --> 06:10.800
The other ones are just text, so we don't need them

06:10.800 --> 06:12.570
but I like to have that in

06:12.570 --> 06:14.880
so you guys can have a reference to it.

06:14.880 --> 06:16.290
We want reward points

06:16.290 --> 06:19.020
and we finally want to visualize this.

06:19.020 --> 06:21.000
We can see the visualization

06:21.000 --> 06:23.760
the actual numerical representation

06:23.760 --> 06:26.070
in NumPy of our environment.

06:26.070 --> 06:27.180
Really cool.

06:27.180 --> 06:29.310
So we have our environment set up.

06:29.310 --> 06:30.300
Amazing work.

06:30.300 --> 06:31.890
Hope you guys are finding this useful.

06:31.890 --> 06:34.080
Now we're gonna cap it off here because

06:34.080 --> 06:36.930
in the next lecture we're gonna start training the model.

06:36.930 --> 06:39.570
So this was to set our Agent's actions

06:39.570 --> 06:42.960
to set our environment, to set our rewards, our punishment.

06:42.960 --> 06:44.580
This is gonna help the Agent establish

06:44.580 --> 06:46.860
the optimal policy within Q-learning.

06:46.860 --> 06:49.920
Overall, we have this visual representation

06:49.920 --> 06:51.780
or this image built

06:51.780 --> 06:55.053
and we can see it here if we print out our row.

06:56.160 --> 06:57.060
Amazing.

06:57.060 --> 06:59.130
All right, I won't keep rambling on.

06:59.130 --> 07:00.750
Let's cap it off here. In the next lecture,

07:00.750 --> 07:02.760
let's start training the model.

07:02.760 --> 07:04.643
I'll see you guys in the next lecture.