WEBVTT

00:00.840 --> 00:02.220
Hello, and welcome back

00:02.220 --> 00:04.530
to the course on artificial intelligence.

00:04.530 --> 00:06.060
I hope you're excited about today's tutorial

00:06.060 --> 00:08.850
because we are taking our very first step

00:08.850 --> 00:10.440
into the world of AI.

00:10.440 --> 00:13.260
And today, we're talking about reinforcement learning.

00:13.260 --> 00:15.218
It's a very important tutorial

00:15.218 --> 00:17.552
because it will underpin everything else

00:17.552 --> 00:18.750
that's going to be happening in this course.

00:18.750 --> 00:20.730
So, let's get started.

00:20.730 --> 00:23.010
Here we've got a little maze,

00:23.010 --> 00:26.730
and this maze is our representation of an environment.

00:26.730 --> 00:27.780
And that's what we're going to be dealing

00:27.780 --> 00:29.220
with in this course.

00:29.220 --> 00:31.410
We're going to be dealing with certain environments

00:31.410 --> 00:33.420
in which our artificial intelligence

00:33.420 --> 00:35.130
is going to be performing,

00:35.130 --> 00:36.870
is going to be taking actions,

00:36.870 --> 00:39.110
is going to be looking to beat these environments,

00:39.110 --> 00:42.330
going to be looking to win in these environments.

00:42.330 --> 00:44.340
And here, we've got an agent.

00:44.340 --> 00:47.040
The agent is our artificial intelligence.

00:47.040 --> 00:48.060
That's the person,

00:48.060 --> 00:50.700
or that's the mind that's going to be navigating

00:50.700 --> 00:51.690
these environments

00:51.690 --> 00:53.610
and learning from the feedback

00:53.610 --> 00:55.140
that the environments are going to be giving it

00:55.140 --> 00:57.150
in order to perform certain actions.

00:57.150 --> 00:58.110
And so the way it works

00:58.110 --> 01:02.340
is the agent performs certain actions in this environment.

01:02.340 --> 01:06.270
And as a result, the state it is in will change.

01:06.270 --> 01:08.220
So, it might be further or closer,

01:08.220 --> 01:10.050
or more to the left, more to the right.

01:10.050 --> 01:12.120
It might have certain other parameters

01:12.120 --> 01:13.440
that describe its state.

01:13.440 --> 01:15.180
And those parameters are going to change.

01:15.180 --> 01:16.830
So, the state is going to change

01:16.830 --> 01:18.270
because of the action it takes,

01:18.270 --> 01:21.060
and it will also get rewards based on the actions.

01:21.060 --> 01:22.620
So, every time it takes an action,

01:22.620 --> 01:24.930
the state will change and it'll get a reward.

01:24.930 --> 01:26.820
Now, bear in mind sometimes it might happen

01:26.820 --> 01:28.290
that it won't change the state,

01:28.290 --> 01:29.820
the action won't change the state,

01:29.820 --> 01:31.590
or there won't be a reward

01:31.590 --> 01:34.680
for taking that action in that certain state it was in.

01:34.680 --> 01:36.153
But nevertheless, the agent's going to keep doing that.

01:36.153 --> 01:37.740
It's going to be taking actions,

01:37.740 --> 01:40.230
changing the state, getting rewards, changing action,

01:40.230 --> 01:42.810
taking actions, changing the state and getting rewards.

01:42.810 --> 01:44.640
And by doing that process,

01:44.640 --> 01:45.993
it's going to be learning about the environment,

01:45.993 --> 01:48.150
it's going to be exploring the environment,

01:48.150 --> 01:51.150
understanding what actions lead to good rewards,

01:51.150 --> 01:52.530
and favorable states,

01:52.530 --> 01:56.010
and what actions lead to bad rewards and unfavorable states.
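
NOTE
Editor's sketch of the loop described above: the agent takes an action, the state may change, and a reward comes back. This is only an illustration under stated assumptions, not the course's code. The corridor "maze", GOAL, step and run_episode are all hypothetical names, and the agent here acts purely at random and does not learn yet.
```python
import random
# Hypothetical corridor "maze": positions 0..4, with the goal at position 4.
GOAL = 4
def step(state, action):
    """Apply an action (-1 = left, +1 = right); return (next_state, reward)."""
    next_state = min(max(state + action, 0), GOAL)  # walls keep the agent inside
    reward = 1 if next_state == GOAL else 0         # reward only on reaching the goal
    return next_state, reward
def run_episode(rng, max_steps=100):
    """Let a random agent act; record (state, action, reward, next_state) tuples."""
    state, history = 0, []
    for _ in range(max_steps):
        action = rng.choice([-1, 1])
        next_state, reward = step(state, action)
        history.append((state, action, reward, next_state))
        state = next_state
        if state == GOAL:
            break
    return history
history = run_episode(random.Random(0))
print(len(history), "steps, total reward:", sum(r for _, _, r, _ in history))
```
Walking left into the wall at position 0 is a case where the action changes neither the state nor the reward, as mentioned in the lecture.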

01:56.010 --> 01:58.590
And this is a very simplistic representation

01:58.590 --> 01:59.670
of a very global problem.

01:59.670 --> 02:01.830
So, if you think about it,

02:01.830 --> 02:04.380
environments actually don't have to be just mazes.

02:04.380 --> 02:06.660
It's not just about getting out of a maze

02:06.660 --> 02:09.150
or finding a treasure in a maze.

02:09.150 --> 02:11.760
An environment can be pretty much anything in life.

02:11.760 --> 02:13.440
So, imagine you waking up

02:13.440 --> 02:15.390
in the morning and cooking an omelet.

02:15.390 --> 02:17.490
So, in order to make that omelet,

02:17.490 --> 02:19.860
you need to go through certain steps.

02:19.860 --> 02:22.650
You need to get the salt, get the eggs,

02:22.650 --> 02:24.180
get the frying pan,

02:24.180 --> 02:25.140
switch the fire on, and so on.

02:25.140 --> 02:27.750
And it does sound like a routine mundane thing,

02:27.750 --> 02:28.590
but it's become routine

02:28.590 --> 02:29.970
because you've done it so many times.

02:29.970 --> 02:32.370
But in reality, it's an environment

02:32.370 --> 02:33.870
where you're performing certain actions.

02:33.870 --> 02:35.220
You're putting the fire on,

02:35.220 --> 02:36.870
you're putting the frying pan on

02:36.870 --> 02:39.930
the fire, you're putting all the eggs into the frying pan,

02:39.930 --> 02:41.610
and you're putting some salt on the eggs,

02:41.610 --> 02:43.170
and you're turning them over and so on.

02:43.170 --> 02:46.038
So, as you can see, there are certain actions

02:46.038 --> 02:49.200
which you're taking in certain states.

02:49.200 --> 02:51.540
And those actions lead to certain other states

02:51.540 --> 02:52.500
and sometimes rewards.

02:52.500 --> 02:55.200
So, for instance, when you put the fire on

02:55.200 --> 02:56.550
and you wait, wait, wait, wait, wait,

02:56.550 --> 02:59.220
you're taking the action of wait, wait, wait, wait too long.

02:59.220 --> 03:01.890
And then you put the eggs into the frying pan,

03:01.890 --> 03:03.570
the rewards are gonna be very negative.

03:03.570 --> 03:05.130
It's all gonna burn.

03:05.130 --> 03:07.800
On the other hand, if you do all the correct actions

03:07.800 --> 03:09.000
in the correct times...

03:09.000 --> 03:10.620
So, it's also very important to understand

03:10.620 --> 03:13.860
that actions should be taken at the correct points in time.

03:13.860 --> 03:16.710
So, for instance, putting the salt in the frying pan

03:16.710 --> 03:20.760
before you put the eggs in might not be the best idea.

03:20.760 --> 03:23.160
You might want to take that action of putting the salt

03:23.160 --> 03:26.220
into the frying pan after the eggs are in there

03:26.220 --> 03:28.350
so in a different state.

03:28.350 --> 03:29.730
So, it's important to remember that.
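
NOTE
Editor's sketch of the point above: the same action can earn a different reward depending on the state it is taken in. The state names, action name and reward values are hypothetical and chosen only for illustration.
```python
# Hypothetical states, actions and reward values, for illustration only.
REWARD = {
    ("pan_empty", "add_salt"): -1,   # salt before the eggs: not the best idea
    ("eggs_in_pan", "add_salt"): 1,  # same action, different state, better outcome
}
def reward(state, action):
    """Look up the reward for taking `action` in `state` (0 if not listed)."""
    return REWARD.get((state, action), 0)
print(reward("pan_empty", "add_salt"), reward("eggs_in_pan", "add_salt"))  # prints: -1 1
```
Keying the table on the (state, action) pair, rather than on the action alone, is what lets timing matter.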

03:29.730 --> 03:31.140
And at the same time,

03:31.140 --> 03:32.490
if you take all the correct actions

03:32.490 --> 03:34.560
in the correct order, in the correct states,

03:34.560 --> 03:37.470
your final reward could be that you get an omelet

03:37.470 --> 03:38.880
which you can eat.

03:38.880 --> 03:42.060
And so that's a very basic activity in your life.

03:42.060 --> 03:43.260
But if you think about it,

03:43.260 --> 03:44.970
it is actually an environment

03:44.970 --> 03:47.910
and you are the agent going through this environment

03:47.910 --> 03:48.960
and performing tasks.

03:48.960 --> 03:50.760
You don't really need to learn anything

03:50.760 --> 03:52.230
because you already know it pretty well.

03:52.230 --> 03:53.280
But at the same time, you could learn.

03:53.280 --> 03:55.560
Maybe you could learn how to make a better omelet.

03:55.560 --> 03:57.810
Or especially if it's your first omelet that you're making,

03:57.810 --> 03:59.010
you're probably gonna screw it up.

03:59.010 --> 04:00.360
But you will learn from that

04:00.360 --> 04:02.730
because you will understand what actions lead

04:02.730 --> 04:04.140
to what states and rewards.

04:04.140 --> 04:06.690
And anything else in life, for instance,

04:06.690 --> 04:08.910
even trading on the stock market,

04:08.910 --> 04:09.743
and you know,

04:09.743 --> 04:11.910
buying and selling, and getting certain feedback

04:11.910 --> 04:15.150
from the market in the sense of return,

04:15.150 --> 04:17.730
positive or negative returns, that's also an environment

04:17.730 --> 04:18.840
and that's you participating

04:18.840 --> 04:20.190
in that environment as an agent.

04:20.190 --> 04:22.320
Driving a car is also an environment

04:22.320 --> 04:23.940
where you can turn the steering wheel,

04:23.940 --> 04:26.010
you can accelerate, you can brake, and so on,

04:26.010 --> 04:27.780
and you're getting feedback from the environment.

04:27.780 --> 04:28.613
And you know,

04:28.613 --> 04:30.690
one of those feedbacks is the policeman

04:30.690 --> 04:32.130
giving you a speeding fine

04:32.130 --> 04:34.680
if you're going above the acceptable

04:34.680 --> 04:37.020
or allowed speed limit on that highway.

04:37.020 --> 04:38.887
And therefore from there, you learn that,

04:38.887 --> 04:40.950
"Okay, that's not something that should be done

04:40.950 --> 04:43.200
because it leads to a negative reward."

04:43.200 --> 04:44.490
So, rewards don't have to be just

04:44.490 --> 04:45.600
at the very end of the process.

04:45.600 --> 04:48.000
They can be throughout the journey, throughout the process.

04:48.000 --> 04:49.500
So, those are a couple of examples.

04:49.500 --> 04:51.450
And in terms of AI,

04:51.450 --> 04:53.430
the simplest way to think of reinforcement learning

04:53.430 --> 04:54.750
is like training a dog.

04:54.750 --> 04:58.200
When you train a dog, you give it certain commands,

04:58.200 --> 05:00.660
and if it obeys those commands, then you give it a treat.

05:00.660 --> 05:02.430
You give it like a biscuit or something.

05:02.430 --> 05:03.750
If it doesn't obey those commands,

05:03.750 --> 05:05.160
you tell it that it's a bad dog

05:05.160 --> 05:06.840
or you just don't give it a treat.

05:06.840 --> 05:08.910
And through that process,

05:08.910 --> 05:13.050
it learns what certain commands or what it needs to do,

05:13.050 --> 05:15.000
what action it needs to take in certain states.

05:15.000 --> 05:18.450
And the states are the commands that you're giving it.

05:18.450 --> 05:21.510
And based on that, it'll get certain rewards.

05:21.510 --> 05:24.600
Of course, in the world of AI, it's not that complex.

05:24.600 --> 05:26.940
You don't have to give the AI treats,

05:26.940 --> 05:28.590
you don't have to have like a bag of biscuits

05:28.590 --> 05:30.150
with you every time.

05:30.150 --> 05:32.250
You just give it a plus one or a minus one.

05:32.250 --> 05:35.190
So, it's a huge advantage that in the world of AI,

05:35.190 --> 05:37.290
we've created these AIs ourselves.

05:37.290 --> 05:40.110
So, the rewards that we're giving them,

05:40.110 --> 05:41.610
if you think about, this is really cool,

05:41.610 --> 05:43.530
the rewards you're giving them, they don't actually exist.

05:43.530 --> 05:45.180
They're just a plus or a minus one,

05:45.180 --> 05:48.540
or a one or a zero, something like that.

05:48.540 --> 05:51.120
So, it's all non-existent, it's all imaginary stuff.

05:51.120 --> 05:53.190
But at the same time, it leads to great results.

05:53.190 --> 05:55.143
We can create these amazing things,

05:57.140 --> 05:59.595
this amazing artificial intelligence,

05:59.595 --> 06:01.710
by just providing rewards which don't really exist,

06:01.710 --> 06:02.910
the plus and minus one.

06:02.910 --> 06:03.780
It doesn't cost us anything

06:03.780 --> 06:05.880
but at the same time, that leads to results.

06:05.880 --> 06:08.160
So, very similar to the real world,

06:08.160 --> 06:09.810
and you know, that example of dogs,

06:09.810 --> 06:14.810
but here, the rewards are digital and just numbers.

06:15.090 --> 06:16.860
And with that in mind,

06:16.860 --> 06:18.780
we can talk a little bit about Robodog.

06:18.780 --> 06:19.613
I love this example.

06:19.613 --> 06:21.330
So, this is just a random picture.

06:21.330 --> 06:23.790
It's not necessarily that exact Robodog,

06:23.790 --> 06:26.700
you know, that is trained through reinforcement learning.

06:26.700 --> 06:29.100
Some Robodogs, especially the older ones,

06:29.100 --> 06:31.350
you'd have an algorithm in there.

06:31.350 --> 06:33.750
And this is actually a good example

06:33.750 --> 06:37.980
of the difference between pre-programmed agents,

06:37.980 --> 06:39.900
and reinforcement learning agents.

06:39.900 --> 06:41.520
So, you could have a Robodog

06:41.520 --> 06:45.180
which is pre-programmed with how to walk.

06:45.180 --> 06:46.170
It will say...

06:46.170 --> 06:48.240
So, in the algorithm behind the dog

06:48.240 --> 06:49.073
and the software will say,

06:49.073 --> 06:50.910
"Okay, so in order to walk,

06:50.910 --> 06:54.439
you need to move your front left leg forward,

06:54.439 --> 06:57.360
then your back right leg forward,

06:57.360 --> 06:58.620
then your front right leg forward,

06:58.620 --> 07:00.780
then your back left leg forward, and repeat that action."

07:00.780 --> 07:02.970
And you know, that's a definition of walking.

07:02.970 --> 07:05.130
It's a function inside this dog.

07:05.130 --> 07:07.500
And then it might have, you know, how to sit,

07:07.500 --> 07:09.690
how to stand and things like that.
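
NOTE
Editor's sketch of a pre-programmed walk as a fixed function, for contrast with what follows. The leg labels, GAIT and walk are hypothetical names, and "front_left" is my reading of the first leg named in the lecture. Nothing here is learned: the order is simply written down by the programmer.
```python
from itertools import cycle, islice
# Leg order as described in the lecture; the labels themselves are hypothetical.
GAIT = ["front_left", "back_right", "front_right", "back_left"]
def walk(steps):
    """Pre-programmed walking: repeat the fixed leg sequence for `steps` moves."""
    return list(islice(cycle(GAIT), steps))
print(walk(6))
```
A sit or stand routine would be another hard-coded function of the same kind.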

07:09.690 --> 07:12.840
Whereas in a Robodog that is trained

07:12.840 --> 07:14.580
through reinforcement learning,

07:14.580 --> 07:16.710
what happens is you don't pre-program it.

07:16.710 --> 07:18.870
This is the key concept to everything here,

07:18.870 --> 07:21.570
that you don't have any algorithm

07:21.570 --> 07:24.810
inside that is hard-coded into the dog.

07:24.810 --> 07:28.440
Instead, you have what we'll be discussing in the future.

07:28.440 --> 07:31.440
You have this reinforcement learning algorithm

07:31.440 --> 07:33.097
which is told that,

07:33.097 --> 07:38.097
"Okay, so the goal is to get from where you are now,

07:38.130 --> 07:41.334
not knowing anything to the end of the room," for example.

07:41.334 --> 07:42.231
"And here are the certain actions you can take.

07:42.231 --> 07:46.530
You can move your right foot, you can move your left foot,

07:46.530 --> 07:49.380
you can move your right back foot, or left back foot.

07:49.380 --> 07:51.270
So, here are all the degrees of freedom that you can do.

07:51.270 --> 07:53.160
You can move them like this, you can move them like that."

07:53.160 --> 07:55.230
So, like a list of actions you can take.

07:55.230 --> 07:59.190
And your rewards are every time you take a step forward

07:59.190 --> 08:00.300
you get a plus one.

08:00.300 --> 08:02.850
Every time you fall over, you get a minus one.

08:02.850 --> 08:04.140
And that's all there is to it.

08:04.140 --> 08:05.490
And then they just leave the dog

08:05.490 --> 08:07.350
and let it figure it out on its own.

08:07.350 --> 08:10.620
So, the dog tries to stand up, it falls,

08:10.620 --> 08:11.737
and it realizes that,

08:11.737 --> 08:13.920
"Okay, I shouldn't do that action that led to me falling,

08:13.920 --> 08:15.570
because every time I fall, I get a minus one

08:15.570 --> 08:16.740
which is not good for me."

08:16.740 --> 08:19.050
Then, so it does the other action that helped it stand up.

08:19.050 --> 08:21.120
And then it figures out, it just experiments,

08:21.120 --> 08:22.200
experiments, experiments,

08:22.200 --> 08:24.210
tries things randomly, and then figures out

08:24.210 --> 08:25.470
that it can make a step forward

08:25.470 --> 08:28.090
by moving its right front foot.

08:28.090 --> 08:30.067
And then it gets a plus one and it realizes,

08:30.067 --> 08:31.440
"Oh, I should do more of that."

08:31.440 --> 08:32.273
Okay, cool.

08:32.273 --> 08:34.950
So, it now learns that it should do more of this

08:34.950 --> 08:35.783
and less of that.

08:35.783 --> 08:37.590
And through this learning process,

08:37.590 --> 08:42.420
it very quickly understands how it can walk.
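
NOTE
Editor's sketch of learning from plus one and minus one by trial and error. It uses a simple epsilon-greedy running average per action, which is my choice for illustration and not necessarily the algorithm used later in the course. The action names, fall probabilities, reward_for and train are all hypothetical, and the "robot" is only simulated.
```python
import random
# Hypothetical actions, each with a made-up probability of ending in a fall.
FALL_PROBABILITY = {"move_front_right": 0.1, "move_both_front": 0.9}
def reward_for(action, rng):
    """+1 for a step forward, -1 for falling over (simulated outcome)."""
    return -1 if rng.random() < FALL_PROBABILITY[action] else 1
def train(trials=2000, epsilon=0.1, seed=0):
    """Estimate the average reward of each action from repeated attempts."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in FALL_PROBABILITY}  # running average reward per action
    count = {a: 0 for a in FALL_PROBABILITY}
    for _ in range(trials):
        if rng.random() < epsilon:
            action = rng.choice(list(value))    # explore: try something at random
        else:
            action = max(value, key=value.get)  # exploit: best action found so far
        r = reward_for(action, rng)
        count[action] += 1
        value[action] += (r - value[action]) / count[action]  # incremental mean
    return value
values = train()
print(values)
```
After training, the action that rarely leads to a fall should carry the higher estimate, which is the "more of this, less of that" behavior described in the lecture.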

08:42.420 --> 08:45.840
And those dogs that figure it out on their own

08:45.840 --> 08:48.060
can actually sometimes walk better

08:48.060 --> 08:49.920
than dogs that are pre-programmed.

08:49.920 --> 08:51.210
Because when we pre-program things,

08:51.210 --> 08:52.650
we look at real live dogs,

08:52.650 --> 08:53.730
or, you know,

08:53.730 --> 08:55.800
we use our own imagination of how to do it.

08:55.800 --> 08:57.746
Whereas a reinforcement learning dog

08:57.746 --> 09:00.330
can optimize things on its own.

09:00.330 --> 09:01.830
And because it's an AI,

09:01.830 --> 09:03.630
sometimes it can get even better results.

09:03.630 --> 09:04.680
And that's how they can train

09:04.680 --> 09:07.500
these same Robodogs to play soccer.

09:07.500 --> 09:09.840
You can't train a normal dog to play soccer

09:09.840 --> 09:12.990
because you know, simply the whole approach is different,

09:12.990 --> 09:16.320
and it's not something that you know,

09:16.320 --> 09:19.800
probably a normal dog has been trained to do,

09:19.800 --> 09:23.010
or has ever done in its process of evolution.

09:23.010 --> 09:25.530
Whereas reinforcement learning Robodogs

09:25.530 --> 09:27.780
can very easily understand how to play soccer

09:27.780 --> 09:29.520
as long as you tell them what the rewards are,

09:29.520 --> 09:30.840
what the goals are,

09:30.840 --> 09:33.060
what the possible actions they can take are.

09:33.060 --> 09:36.960
So, that is how reinforcement learning works in general.

09:36.960 --> 09:39.150
That's a quick overview of reinforcement learning.

09:39.150 --> 09:40.830
I hope that got you very excited

09:40.830 --> 09:42.330
about what's going to come next,

09:42.330 --> 09:45.540
because it's a completely different world

09:45.540 --> 09:47.880
compared to pre-programmed solutions,

09:47.880 --> 09:50.070
or hard-coded solutions

09:50.070 --> 09:51.960
where you have the if-else conditions.

09:51.960 --> 09:53.820
This is very different

09:53.820 --> 09:56.100
and we're gonna be talking more about that.

09:56.100 --> 09:59.220
In the meantime, we've got some additional reading for you.

09:59.220 --> 10:03.660
So, if you'd like to have some supporting materials,

10:03.660 --> 10:06.870
here's a great article which you can look into.

10:06.870 --> 10:09.390
It's called "Simple Reinforcement Learning with Tensorflow."

10:09.390 --> 10:10.560
It's got 10 parts.

10:10.560 --> 10:14.511
The link is here and you'll find the clickable link

10:14.511 --> 10:16.161
in the course resources.

10:16.161 --> 10:17.760
It's by Arthur Juliani.

10:17.760 --> 10:19.950
It's a 2016 article.

10:19.950 --> 10:22.290
And you can follow along with this course

10:22.290 --> 10:24.780
and also get additional information from that article.

10:24.780 --> 10:27.630
But bear in mind that that article is with TensorFlow,

10:27.630 --> 10:29.670
whereas in this course we are using PyTorch.

10:29.670 --> 10:32.100
So, different implementations

10:32.100 --> 10:33.810
but at the same time,

10:33.810 --> 10:36.390
you might pick up a few things here and there

10:36.390 --> 10:40.080
that might supplement your learning

10:40.080 --> 10:41.250
that we're gonna be doing in this course.

10:41.250 --> 10:42.630
So, great article to follow

10:42.630 --> 10:45.030
even if you're not considering following it for sure.

10:45.030 --> 10:45.863
Still just in case,

10:45.863 --> 10:49.260
check out that first part and see if you like it.

10:49.260 --> 10:52.200
See if you would like to read it a bit more.

10:52.200 --> 10:55.530
And then we've got, specific to this tutorial

10:55.530 --> 10:56.670
about reinforcement learning,

10:56.670 --> 10:58.950
there's a paper by Richard Sutton,

10:58.950 --> 11:02.520
which is called "Reinforcement Learning: An Introduction."

11:02.520 --> 11:04.860
It's a 1998 paper, so quite old,

11:04.860 --> 11:06.570
but at the same time,

11:06.570 --> 11:09.240
you can learn a bit about reinforcement learning,

11:09.240 --> 11:11.610
some of the examples like that omelet example

11:11.610 --> 11:12.600
and other examples

11:12.600 --> 11:15.120
of where reinforcement learning can be applied.

11:15.120 --> 11:17.730
And just a general overview of reinforcement learning

11:17.730 --> 11:20.760
if you are looking for some additional reading.

11:20.760 --> 11:23.250
And on that note, we're going to wrap up this tutorial.

11:23.250 --> 11:24.660
Can't wait to see you next time.

11:24.660 --> 11:26.763
And until then, enjoy AI.