1
00:00:11,670 --> 00:00:17,400
In this lecture we are going to discuss reinforcement learning from a more technical standpoint and

2
00:00:17,400 --> 00:00:23,380
this will allow us to define most of the terminology involved in reinforcement learning problems.

3
00:00:23,400 --> 00:00:25,670
I'm a big believer in learning by example.

4
00:00:25,680 --> 00:00:31,410
So in this lecture it's not going to be so much about abstract and technical definitions as it is about

5
00:00:31,650 --> 00:00:35,410
providing examples of everything.

6
00:00:35,510 --> 00:00:38,670
Let's start with the main objects in a reinforcement learning problem.

7
00:00:38,750 --> 00:00:41,210
The agent and the environment.

8
00:00:41,210 --> 00:00:44,000
The best example of this is yourself.

9
00:00:44,000 --> 00:00:47,340
You are an agent and the world is your environment.

10
00:00:47,690 --> 00:00:54,020
Maybe your long-term goal is to ace your math exam, and so, just like an autonomous vehicle driving to

11
00:00:54,020 --> 00:01:00,320
a destination, you must observe your environment and make the correct decisions every day in order to

12
00:01:00,320 --> 00:01:01,870
achieve your goal.

13
00:01:01,880 --> 00:01:08,330
That means studying, going to class, taking notes, doing your homework, asking questions when you are confused,

14
00:01:08,360 --> 00:01:09,620
and so on.

15
00:01:09,710 --> 00:01:11,210
So that's one basic example.

16
00:01:16,290 --> 00:01:21,960
Here's another example that's closer to what we might actually use reinforcement learning for.

17
00:01:21,960 --> 00:01:25,340
Suppose you're writing a program that plays tic tac toe.

18
00:01:25,470 --> 00:01:29,260
What's the agent and what's the environment in this case?

19
00:01:29,260 --> 00:01:35,610
The environment is composed of the computer program that implements this tic tac toe game.

20
00:01:35,690 --> 00:01:41,480
Of course this computer program may also involve some form of A.I. that will be the other player in

21
00:01:41,480 --> 00:01:42,170
the game.

22
00:01:42,260 --> 00:01:49,330
But for all intents and purposes, just pretend it's a bunch of if statements and predefined rules, so

23
00:01:49,330 --> 00:01:51,270
it's not intelligence per se.

24
00:01:51,310 --> 00:02:01,360
It's just a computer program written by someone else, just part of the greater tic tac toe program.

25
00:02:01,590 --> 00:02:07,140
You can imagine that this tic tac toe program will have an API that will allow you to interact with

26
00:02:07,140 --> 00:02:08,840
it programmatically.

27
00:02:08,880 --> 00:02:12,520
So for example there might be a function to start a new game.

28
00:02:12,600 --> 00:02:17,580
There might be a function to place your X or your O at some location on the board.

29
00:02:17,730 --> 00:02:22,380
There might be a function to read the state of the board, so you can see where all the X's and O's have

30
00:02:22,380 --> 00:02:24,390
been placed so far.

31
00:02:24,390 --> 00:02:28,980
There might be a function to check whether the game is over and, if so, who won the game.

32
00:02:29,070 --> 00:02:30,480
The computer or your agent.

33
00:02:31,260 --> 00:02:32,400
So that's the environment.
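To make the API idea concrete, here is a minimal sketch of what such an environment could look like. The class name TicTacToeEnv and the method names new_game, place, get_state, winner, and game_over are all made up for illustration; the lecture does not name a specific library, and this toy version does not include the computer opponent or enforce turn order.

```python
from typing import List, Optional, Tuple


class TicTacToeEnv:
    """Hypothetical tic tac toe environment with the four functions described above."""

    # All eight winning lines, using board positions numbered 0..8 row by row.
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

    def __init__(self) -> None:
        self.board: List[str] = [" "] * 9

    def new_game(self) -> None:
        """Start a new game with an empty board."""
        self.board = [" "] * 9

    def place(self, symbol: str, position: int) -> None:
        """Place an 'X' or an 'O' at a position numbered 0..8."""
        if symbol not in ("X", "O"):
            raise ValueError("symbol must be 'X' or 'O'")
        if self.board[position] != " ":
            raise ValueError("that square is already taken")
        self.board[position] = symbol

    def get_state(self) -> Tuple[str, ...]:
        """Read the board: nine entries, each 'X', 'O', or ' ' for empty."""
        return tuple(self.board)

    def winner(self) -> Optional[str]:
        """Return 'X' or 'O' if someone has three in a row, otherwise None."""
        for a, b, c in self.LINES:
            if self.board[a] != " " and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return None

    def game_over(self) -> bool:
        """The game ends when someone has won or the board is full."""
        return self.winner() is not None or " " not in self.board


# Example interaction: X takes the top row while O plays elsewhere.
env = TicTacToeEnv()
env.new_game()
for symbol, position in [("X", 0), ("O", 3), ("X", 1), ("O", 4), ("X", 2)]:
    env.place(symbol, position)
print(env.get_state(), env.game_over(), env.winner())
```

An agent would only ever call these functions; it never needs to look inside the class.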

34
00:02:37,480 --> 00:02:44,240
Your agent, on the other hand, will be another computer program that interfaces with the tic tac toe game.

35
00:02:44,260 --> 00:02:49,840
Your agent may use an algorithm such as one from reinforcement learning in order to learn how to play

36
00:02:49,840 --> 00:02:52,050
tic tac toe from experience.

37
00:02:52,480 --> 00:02:56,050
It makes use of the API to interface with the environment.

38
00:02:57,070 --> 00:02:59,700
So here we have a program that plays the game.

39
00:03:00,100 --> 00:03:04,990
Part of this code is where the agent reads the state of the game board and chooses the most intelligent

40
00:03:04,990 --> 00:03:06,190
action.

41
00:03:06,190 --> 00:03:09,070
That's our A.I., represented by our agent.

42
00:03:14,220 --> 00:03:17,670
Here's another popular example: video games.

43
00:03:17,670 --> 00:03:23,750
This is a famous classic Atari game known as Breakout. By the way, if you don't know how this game works,

44
00:03:23,880 --> 00:03:27,230
there are many places where you can play this game online for free.

45
00:03:27,330 --> 00:03:31,740
So if you've never played this game before please go and give it a try.

46
00:03:31,980 --> 00:03:35,810
In this game the environment is obviously the game itself.

47
00:03:35,880 --> 00:03:42,390
The goal is to clear all the blocks and your job is to move the paddle in such a way that the ball destroys

48
00:03:42,390 --> 00:03:45,300
the blocks but never falls to the ground.

49
00:03:45,330 --> 00:03:50,730
The agent would be your computer program, which can read information from the game, like what the screen

50
00:03:50,730 --> 00:03:55,830
currently looks like, so it can figure out where the blocks are, where the paddle is, where the ball is

51
00:03:55,830 --> 00:03:57,900
going and so forth.

52
00:03:57,900 --> 00:04:00,810
Its job is to control where the paddle goes.

53
00:04:00,810 --> 00:04:03,510
So basically, you can move left, right, or do nothing.

54
00:04:08,710 --> 00:04:09,190
Next,

55
00:04:09,250 --> 00:04:12,010
let's continue defining more terms.

56
00:04:12,010 --> 00:04:15,340
So far you know about the agent and the environment.

57
00:04:15,400 --> 00:04:21,560
The next term I want to define is episode. What happens when I play a game of tic tac toe or Breakout?

58
00:04:22,150 --> 00:04:29,450
Well, some sequence of events will occur, and then at the end I will win or lose. With my math exam example,

59
00:04:29,530 --> 00:04:33,980
you will take your math exam and then you'll get your grade.

60
00:04:34,210 --> 00:04:37,900
Now we know that with learning algorithms the way that they learn is with data.

61
00:04:38,470 --> 00:04:43,930
So if you're training a dog versus cat classifier, you'll need lots of labeled images of dogs and cats.

62
00:04:45,350 --> 00:04:47,860
Similarly, with tic tac toe or Breakout,

63
00:04:47,960 --> 00:04:51,440
once the game is over I can opt to play again.

64
00:04:51,440 --> 00:04:57,340
This is the method through which I will gain experience or, to be more technical, data.

65
00:04:57,530 --> 00:05:03,380
You might call these games or rounds or matches, but in reinforcement learning the official term is episode.

66
00:05:05,080 --> 00:05:10,960
So when you're training an agent to play tic tac toe, you're going to play multiple episodes, and at the

67
00:05:10,960 --> 00:05:14,530
end of each episode your agent will have won or lost.

68
00:05:15,370 --> 00:05:20,560
Hopefully, by the end of the training process, or in other words after many episodes, your agent will be

69
00:05:20,560 --> 00:05:26,730
winning more than losing.
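Here is a rough sketch of what playing many episodes looks like in code. The environment is a made-up coin-guessing game (CoinGuessEnv, reset, and step are hypothetical names, not from any library), and the agent just guesses at random, so it does not learn anything yet. The point is only the shape of the loop: reset, play until the episode ends, record a win or a loss, repeat.

```python
import random


class CoinGuessEnv:
    """Toy episodic environment (hypothetical): guess a hidden coin flip."""

    def __init__(self, rng: random.Random) -> None:
        self.rng = rng
        self.hidden = None
        self.done = True

    def reset(self) -> None:
        """Start a fresh episode; nothing carries over from the last one."""
        self.hidden = self.rng.choice(["heads", "tails"])
        self.done = False

    def step(self, guess: str) -> bool:
        """Take one action. The episode ends immediately; True means a win."""
        self.done = True
        return guess == self.hidden


def run_episodes(num_episodes: int, seed: int = 0):
    """Play several episodes with a random agent and tally wins and losses."""
    rng = random.Random(seed)
    env = CoinGuessEnv(rng)
    wins = losses = 0
    for _ in range(num_episodes):
        env.reset()
        won = False
        while not env.done:
            guess = rng.choice(["heads", "tails"])  # the agent's action
            won = env.step(guess)
        if won:
            wins += 1
        else:
            losses += 1
    return wins, losses


wins, losses = run_episodes(1000)
print("wins:", wins, "losses:", losses)
```

A learning agent would replace the random guess with a choice informed by what happened in earlier episodes.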

70
00:05:26,770 --> 00:05:32,500
Of course, not all reinforcement learning environments are episodic. To say an environment is episodic

71
00:05:32,830 --> 00:05:37,390
means that it ends at some point and you can start again with a fresh episode.

72
00:05:39,110 --> 00:05:45,900
Furthermore, there is no relationship between one episode and the next, so the fact that I lost the previous

73
00:05:45,900 --> 00:05:50,310
tic tac toe episode will have no effect on the environment in the next episode.

74
00:05:52,200 --> 00:05:56,040
However, there are examples of non-episodic environments.

75
00:05:56,130 --> 00:05:58,460
Take, for example, the stock market.

76
00:05:58,470 --> 00:06:01,700
For all intents and purposes this can go on forever.

77
00:06:01,740 --> 00:06:03,420
There is no real notion of the end.

78
00:06:04,080 --> 00:06:07,830
Well, if you lose all your money, then technically there is nothing more you can do.

79
00:06:07,830 --> 00:06:11,990
But it's not the same as losing a game of tic tac toe and starting again.

80
00:06:12,090 --> 00:06:15,330
You can't go back in time and restart the stock market.

81
00:06:17,150 --> 00:06:24,100
Another example is an online advertising system. Your agent's job will be to choose the right advertisements

82
00:06:24,110 --> 00:06:29,560
to show to users at any given moment in order to maximize your company's revenue.

83
00:06:29,720 --> 00:06:31,620
It should do this continuously.

84
00:06:31,730 --> 00:06:37,330
There is no concept of an end to an online advertising service.

85
00:06:37,440 --> 00:06:43,290
All right, so those are some examples of non-episodic environments. We can refer to such environments

86
00:06:43,320 --> 00:06:50,660
as having infinite horizons.

87
00:06:50,750 --> 00:06:51,070
All right.

88
00:06:51,080 --> 00:06:57,860
So to recap the terms we've discussed so far, we have agent, environment, and episode. The next few

89
00:06:57,860 --> 00:07:02,350
terms I would like to think about are the state, action, and reward.

90
00:07:02,630 --> 00:07:09,690
These items help us describe what goes on when the agent and environment interact. Let's again use our

91
00:07:09,690 --> 00:07:11,390
tic tac toe example.

92
00:07:11,880 --> 00:07:15,120
In this scenario the state would be the configuration of the board.

93
00:07:15,900 --> 00:07:19,930
So for each position on the board, I want to know: is there an X there?

94
00:07:19,950 --> 00:07:22,700
Is there an O there, or is it empty?

95
00:07:22,710 --> 00:07:28,500
This information is all that I need in order for my agent to make an intelligent decision about what

96
00:07:28,500 --> 00:07:30,770
move to play next.

97
00:07:30,800 --> 00:07:36,490
Speaking of which, the moves that the agent makes are what we refer to as the action.

98
00:07:36,650 --> 00:07:44,450
So in tic tac toe, to take an action would mean placing a new X or O somewhere on the board. Finally,

99
00:07:44,480 --> 00:07:51,260
the reward is just a number that you can receive at any moment as you play an episode of the game. By

100
00:07:51,260 --> 00:07:51,560
the way,

101
00:07:51,560 --> 00:07:56,940
keep in mind when I say the word game, I don't necessarily mean a board game like tic tac toe or a video

102
00:07:56,940 --> 00:07:58,610
game like breakout.

103
00:07:58,610 --> 00:08:02,690
When I say game I mean it in more of a generic sense.

104
00:08:02,690 --> 00:08:08,610
In any case, perhaps the reward you get in tic tac toe may be plus one for winning, minus one for losing,

105
00:08:08,630 --> 00:08:09,620
and zero for a draw.

106
00:08:10,340 --> 00:08:12,360
Although this is just an example.

107
00:08:12,740 --> 00:08:18,500
In general you can always assign rewards yourself in order to improve the training of your reinforcement

108
00:08:18,500 --> 00:08:19,160
learning agent.

109
00:08:24,270 --> 00:08:28,020
Here's another example of states, actions, and rewards.

110
00:08:28,020 --> 00:08:30,770
Think of a maze. In this maze,

111
00:08:30,780 --> 00:08:34,020
the state is your position in the maze.

112
00:08:34,050 --> 00:08:39,990
Your actions may consist of the various directions you can go: for example, up, down, left, or right.

113
00:08:41,220 --> 00:08:43,110
The reward is tricky.

114
00:08:43,140 --> 00:08:48,780
Remember I said it's up to you to think of a good reward to assign to your agent to encourage it to

115
00:08:48,780 --> 00:08:54,810
learn how to solve the environment. You might say plus one for solving the maze and zero otherwise.

116
00:08:55,090 --> 00:08:58,080
But ask yourself: is this a good strategy?

117
00:08:58,120 --> 00:09:02,520
Imagine you throw your agent into this maze and it has to learn what to do.

118
00:09:02,530 --> 00:09:07,810
Imagine your agent has played this game ten thousand times and has never solved the maze.

119
00:09:07,810 --> 00:09:13,240
We can pretend the environment is episodic so that after taking one hundred steps you reach a terminal

120
00:09:13,240 --> 00:09:16,550
state and the game is over.

121
00:09:16,800 --> 00:09:23,570
What happens if we get zero reward each time? Well, in that case the agent learns that it does not matter

122
00:09:23,570 --> 00:09:27,740
at all what it does, because doing anything always leads to the same reward:

123
00:09:27,770 --> 00:09:33,210
zero. In this case, the agent has no incentive to solve the maze.

124
00:09:33,320 --> 00:09:39,230
Your agent will never prioritize one action over another because it knows that no matter what it does

125
00:09:39,290 --> 00:09:42,170
it always gets zero reward.

126
00:09:42,170 --> 00:09:44,060
In this case, all actions are equal.

127
00:09:49,320 --> 00:09:56,890
Perhaps a better reward structure would be to assign a minus one reward at every step. In this case,

128
00:09:56,890 --> 00:10:01,360
you can maximize your reward by solving the maze as fast as possible.

129
00:10:01,600 --> 00:10:07,720
Doing any extraneous actions will lead to a more negative reward. Not solving the maze will lead to the

130
00:10:07,720 --> 00:10:09,850
most negative reward.
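To see the difference between the two reward schemes in numbers, here is a small sketch. The function episode_return and its parameter names are invented for this illustration, and it assumes the one-hundred-step cutoff mentioned earlier. It simply adds up the rewards for an episode under each scheme.

```python
def episode_return(steps_to_goal, max_steps, step_reward, goal_reward):
    """Total reward collected in one maze episode.

    steps_to_goal: steps the agent takes to reach the exit, or None if it never does.
    max_steps:     the episode is cut off after this many steps.
    step_reward:   reward received on every step.
    goal_reward:   extra reward received on the step that solves the maze.
    """
    if steps_to_goal is not None and steps_to_goal <= max_steps:
        return step_reward * steps_to_goal + goal_reward
    return step_reward * max_steps


# Scheme A: plus one for solving the maze, zero otherwise.
sparse = {"step_reward": 0, "goal_reward": 1}
# Scheme B: minus one on every step, nothing extra at the exit.
per_step = {"step_reward": -1, "goal_reward": 0}

for label, steps in [("solved in 10 steps", 10),
                     ("solved in 40 steps", 40),
                     ("never solved", None)]:
    a = episode_return(steps, 100, **sparse)
    b = episode_return(steps, 100, **per_step)
    print(f"{label:>20}: scheme A = {a:4d}, scheme B = {b:4d}")
```

Under scheme A the fast and slow solutions look identical, and an agent that never reaches the exit only ever sees zero. Under scheme B every outcome gets a different total, so shorter paths are always preferred.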

131
00:10:09,850 --> 00:10:15,760
So in this case, assigning a negative reward upon reaching any state will allow your agent to solve the

132
00:10:15,760 --> 00:10:16,410
environment in

133
00:10:21,560 --> 00:10:27,440
Now, you have to keep in mind this is not English class, so you have to remove any bias you may have about

134
00:10:27,440 --> 00:10:31,710
what connotations are associated with the term reward.

135
00:10:31,880 --> 00:10:34,890
You might think of reward as a good thing like a prize.

136
00:10:34,910 --> 00:10:40,520
For example, if you're a dog and you've just successfully completed a trick, your owner may give you a

137
00:10:40,520 --> 00:10:42,140
treat as a reward.

138
00:10:42,410 --> 00:10:46,530
But in reinforcement learning this is not what we mean by reward.

139
00:10:46,610 --> 00:10:50,160
The only constraint is that the reward is a real number.

140
00:10:50,180 --> 00:10:57,630
It can be positive, negative, or zero. You will also receive this number at every step in the environment,

141
00:10:57,800 --> 00:11:01,340
not just when you reach some goal or definitively fail to achieve it.

142
00:11:02,910 --> 00:11:09,270
The agent, as you will learn later in this section, will try to maximize its reward over each episode.

143
00:11:09,330 --> 00:11:12,740
For example, you may get a reward of minus one hundred.

144
00:11:12,780 --> 00:11:19,230
This is better than a reward of minus one million. Maybe minus one hundred reward corresponds to successfully

145
00:11:19,230 --> 00:11:22,970
solving the environment, but in the end it's just a number.

146
00:11:23,100 --> 00:11:28,800
Don't associate minus one hundred with a negative connotation and plus one hundred with a positive connotation.

147
00:11:31,250 --> 00:11:32,290
So just remember this:

148
00:11:32,300 --> 00:11:34,160
The reward is not a prize.

149
00:11:34,160 --> 00:11:37,670
The reward is a number which is to be maximized.

150
00:11:37,700 --> 00:11:40,730
You can think of it as kind of the opposite of a loss function.

151
00:11:41,060 --> 00:11:46,490
Whereas we want to minimize the loss in a supervised or unsupervised learning problem, in reinforcement

152
00:11:46,490 --> 00:11:46,840
learning

153
00:11:46,850 --> 00:11:48,680
we want to maximize reward.

154
00:11:53,910 --> 00:11:55,200
Since I love examples,

155
00:11:55,200 --> 00:11:56,820
here's one more.

156
00:11:56,820 --> 00:11:59,030
Imagine again the game Breakout.

157
00:11:59,550 --> 00:12:03,430
In this case we actually have several options for the state.

158
00:12:03,450 --> 00:12:06,960
For example we may have perfect information about the game.

159
00:12:07,140 --> 00:12:12,600
We could be told the exact positions of all the blocks. We could be told the position and velocity of

160
00:12:12,600 --> 00:12:13,080
the ball.

161
00:12:13,860 --> 00:12:19,380
We could be told the location of our paddle and we could be told our current score and the number of

162
00:12:19,380 --> 00:12:21,360
lives we have left.

163
00:12:21,360 --> 00:12:26,750
Although I think you'll find that most reinforcement learning applications do not make use of such information.

164
00:12:32,030 --> 00:12:37,520
Another way you can read the information about the state of the environment in Breakout is to look

165
00:12:37,520 --> 00:12:38,910
at the game's RAM.

166
00:12:39,140 --> 00:12:42,520
In other words look at the values it has stored in memory.

167
00:12:42,800 --> 00:12:49,700
In contrast to the above this actually is one method used in modern reinforcement learning applications.

168
00:12:49,700 --> 00:12:53,690
It's a proxy to the above perfectly defined state.

169
00:12:53,690 --> 00:12:58,940
You can imagine that it should be quite possible to derive the locations of the blocks and the position

170
00:12:58,940 --> 00:13:03,170
and velocity of the ball and so forth from the values stored in RAM.

171
00:13:08,310 --> 00:13:13,980
Although I think the most common way to represent the state in contemporary reinforcement learning is

172
00:13:13,980 --> 00:13:20,820
to use screenshots from the game. In this way, our reinforcement learning agent is learning to interpret

173
00:13:20,880 --> 00:13:24,940
images of the video game just as we do as humans.

174
00:13:24,990 --> 00:13:30,360
I think this is the most meaningful because it's the closest match to how you and I play video games.

175
00:13:30,360 --> 00:13:36,210
We look at the screen. You can imagine that models like convolutional neural networks would be useful

176
00:13:36,210 --> 00:13:36,540
here.

177
00:13:41,620 --> 00:13:47,560
One complication that can arise from only looking at images of the screen is that you don't really have

178
00:13:47,560 --> 00:13:55,130
information about movement. An image is only a frozen picture of the game at a single point in time.

179
00:13:55,540 --> 00:14:02,080
Looking at this image, how can I tell which direction the ball is moving? And so this allows us to consider

180
00:14:02,080 --> 00:14:08,010
an important point: the state need not be just what I observe in the environment.

181
00:14:08,050 --> 00:14:13,060
It can also be information derived from both current and past observations.

182
00:14:13,900 --> 00:14:21,280
So one way of dealing with the problem of frozen pictures is to simply include past frames as well.

183
00:14:21,280 --> 00:14:27,100
In the famous DQN paper, they use the most recent four consecutive frames of the game to represent

184
00:14:27,130 --> 00:14:28,050
a single state.
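Here is a minimal sketch of the frame-stacking idea. The FrameStack class and its reset and push methods are hypothetical names, and this is not the paper's actual preprocessing code. One assumption to flag: at the start of an episode there are no past frames yet, so this sketch fills the stack by repeating the first frame. To keep it short, each "frame" is just the ball's horizontal position instead of a full image.

```python
from collections import deque


class FrameStack:
    """Keep the most recent frames and expose them together as one state."""

    def __init__(self, num_frames: int = 4) -> None:
        # A deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        """Start an episode: repeat the first frame to fill the stack (an assumption)."""
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def push(self, frame):
        """Add the newest frame and return the stacked state."""
        self.frames.append(frame)
        return tuple(self.frames)


# Toy frames: the ball's x position on each screen capture.
stack = FrameStack(num_frames=4)
state = stack.reset(5)   # only one observation so far
state = stack.push(6)
state = stack.push(7)
print(state)

# A single frame cannot tell us the direction, but two consecutive ones can.
moving_right = state[-1] > state[-2]
print("ball moving right:", moving_right)
```

With real screenshots each entry would be an image array, but the bookkeeping is the same.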

185
00:14:33,030 --> 00:14:37,970
To get back to our states, actions, and rewards, the rest of it is quite basic.

186
00:14:37,990 --> 00:14:41,340
The actions consist of the various moves you can do in the game.

187
00:14:41,500 --> 00:14:46,770
You can think of this in terms of pressing buttons on a joystick or a control pad. In Breakout,

188
00:14:46,770 --> 00:14:51,000
you can move the paddle left or right. For the reward,

189
00:14:51,010 --> 00:14:55,300
as an example, you might get plus one reward every time you destroy a block.

190
00:15:00,480 --> 00:15:02,070
As a final note in this lecture,

191
00:15:02,070 --> 00:15:07,510
I want to introduce you to the concept of state spaces and action spaces.

192
00:15:07,530 --> 00:15:13,380
This is important as we move down from high-level ideas and concepts to the actual math that will allow

193
00:15:13,380 --> 00:15:19,880
us to solve reinforcement learning problems. The particular math concept that we need to describe state

194
00:15:19,880 --> 00:15:26,780
spaces and action spaces is the set. The state space is the set of all possible states, and the action

195
00:15:26,780 --> 00:15:29,810
space is the set of all possible actions.

196
00:15:29,930 --> 00:15:31,370
We don't need to go further than this.

197
00:15:31,370 --> 00:15:33,430
We just need to know what it means.

198
00:15:38,240 --> 00:15:44,180
So as an example, consider the canonical example of a reinforcement learning problem known as

199
00:15:44,180 --> 00:15:46,520
Gridworld. In Gridworld,

200
00:15:46,520 --> 00:15:52,790
the idea is you're going to start in the bottom left square and your goal is to arrive at the top right

201
00:15:52,790 --> 00:16:00,490
square. If you make it there, you get a reward of plus 1. Below that, there is a losing state

202
00:16:00,640 --> 00:16:07,330
where, if you arrive there, you'll get a reward of minus 1. And in the second row, second column, there is

203
00:16:07,330 --> 00:16:14,790
a wall, meaning that your agent cannot go to that square. So that's the basics of the game. To describe

204
00:16:14,790 --> 00:16:15,800
the state space,

205
00:16:15,810 --> 00:16:19,570
that's simply the set of all possible positions on the board.

206
00:16:19,740 --> 00:16:24,690
So you may want to pause this video and look closely at this list of coordinates to confirm that they

207
00:16:24,690 --> 00:16:36,630
correspond to positions on the board. The action space consists of the actions up, down, left, and right.
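Since state spaces and action spaces are just sets, they can be written down directly as sets in code. The sketch below assumes a board with 3 rows and 4 columns and uses (row, column) coordinates with row 0 at the top. The board size and coordinate convention are my assumptions for illustration; the transcript does not state them, so check them against the list of coordinates on the slide.

```python
# Assumed board size; the lecture does not state it in words.
ROWS, COLS = 3, 4

# Special squares as (row, column), with row 0 at the top (assumed convention).
START = (ROWS - 1, 0)        # bottom left square
GOAL = (0, COLS - 1)         # top right square, reward +1
LOSE = (1, COLS - 1)         # the square below the goal, reward -1
WALL = (1, 1)                # second row, second column; the agent cannot enter

# State space: the set of all positions the agent can occupy.
state_space = {(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != WALL}

# Action space: the set of all actions.
action_space = {"up", "down", "left", "right"}

# Rewards for arriving in a state; every other state gives 0 in this description.
rewards = {GOAL: 1, LOSE: -1}

# Reaching either of these ends the episode.
terminal_states = {GOAL, LOSE}

print(sorted(state_space))
print(len(state_space), "states,", len(action_space), "actions")
```

Under these assumptions the board has twelve squares, and removing the wall leaves eleven states.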

208
00:16:36,780 --> 00:16:42,150
Now the reason we had to talk about Gridworld for a bit is because for our other examples, such as tic

209
00:16:42,150 --> 00:16:46,510
tac toe and Breakout, the state space is much more complicated.

210
00:16:46,700 --> 00:16:52,380
The action spaces are pretty simple since for tic tac toe it consists of all the possible positions

211
00:16:52,440 --> 00:16:54,320
you can draw an X or an O.

212
00:16:54,450 --> 00:17:02,670
And for Breakout, it consists of moving left, right, or doing nothing. But for tic tac toe, the state space

213
00:17:02,700 --> 00:17:08,430
is quite large, as there are many possible configurations of the board. As an exercise,

214
00:17:08,460 --> 00:17:15,000
I would strongly recommend trying to write a computer program that can enumerate all the possible configurations

215
00:17:15,030 --> 00:17:17,010
of a tic tac toe board.

216
00:17:17,040 --> 00:17:21,690
This should give you some intuition about why games like chess and Go are very difficult.

217
00:17:22,860 --> 00:17:27,990
If a three by three board with only two possible characters can have thousands

218
00:17:27,990 --> 00:17:31,860
of states, imagine how many states are involved in chess and Go.
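As a starting point for that exercise, here is a rough sketch that only computes an upper bound. Each of the nine squares can hold an X, an O, or be empty, so there are at most 3 to the power 9, or 19,683, ways to fill the board. The sketch then keeps only boards with a plausible number of pieces, assuming X always moves first. The helper name plausible is made up, and this still counts boards that could never happen in a real game, such as ones where play continued after somebody had already won. Removing those is the rest of the exercise.

```python
from itertools import product

# Every way to fill 9 squares with ' ' (empty), 'X', or 'O'.
all_boards = list(product(" XO", repeat=9))


def plausible(board) -> bool:
    """Piece counts are consistent with alternating turns, X moving first (assumed)."""
    x_count = board.count("X")
    o_count = board.count("O")
    return x_count == o_count or x_count == o_count + 1


plausible_boards = [board for board in all_boards if plausible(board)]

print("upper bound:", len(all_boards))
print("with plausible piece counts:", len(plausible_boards))
```

Even this loose count is already in the thousands for a three by three board.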

219
00:17:36,960 --> 00:17:37,670
For Breakout,

220
00:17:37,680 --> 00:17:40,410
the number of states is even larger.

221
00:17:40,410 --> 00:17:45,780
It's equal to the number of possible colors per pixel, two to the

222
00:17:45,780 --> 00:17:50,090
power twenty four, or two to the power eight to the power of three, raised to the power of the number of pixels on the screen.

223
00:17:50,160 --> 00:17:57,120
But for all intents and purposes, we can consider images, just like time series, to be continuous valued.

224
00:17:57,150 --> 00:18:02,910
The only reason they appear to be discrete is because computers have finite precision, and so the values

225
00:18:02,910 --> 00:18:04,650
need to be quantized.

226
00:18:05,070 --> 00:18:11,040
When we have continuous values, this means that the number of possible values is actually infinite.

227
00:18:11,050 --> 00:18:16,440
In fact, it's possible for actions to be continuous also, so that the action space is also infinite.

228
00:18:21,560 --> 00:18:23,120
Since this lecture was quite long,

229
00:18:23,210 --> 00:18:25,940
let's summarize everything we learned.

230
00:18:25,940 --> 00:18:31,400
This lecture was all about defining some reinforcement learning terminology to help us in our discussion

231
00:18:31,400 --> 00:18:32,530
of reinforcement learning.

232
00:18:34,340 --> 00:18:36,950
First we define the terms agent and environment.

233
00:18:37,610 --> 00:18:42,170
You can think of the environment as the world or whatever computer game you are teaching your agent

234
00:18:42,170 --> 00:18:43,170
to win.

235
00:18:43,610 --> 00:18:50,970
You can think of your agent as your computer program, the one that does the learning.

236
00:18:51,140 --> 00:18:53,720
Next we define the term episode.

237
00:18:53,720 --> 00:19:00,340
This is like one round or one match of a game. As you know, machine learning models learn through data

238
00:19:00,760 --> 00:19:07,180
or, in reinforcement learning parlance, experience. And so you can imagine that in order to sufficiently

239
00:19:07,180 --> 00:19:08,660
learn how to play a game,

240
00:19:08,800 --> 00:19:13,240
this is going to require multiple episodes.

241
00:19:13,300 --> 00:19:17,140
Next we learn about states, actions, and rewards.

242
00:19:17,140 --> 00:19:21,240
A reward is a number, which can be any number, positive or negative.

243
00:19:21,280 --> 00:19:28,360
The job of a reinforcement learning agent is to maximize its reward. Actions are what an agent does in

244
00:19:28,360 --> 00:19:29,440
an environment.

245
00:19:29,650 --> 00:19:33,440
For example, playing a move in tic tac toe or going left or right

246
00:19:33,460 --> 00:19:41,130
in a video game. States are what we observe from the environment, but they can also be values derived

247
00:19:41,190 --> 00:19:45,720
from those observations or even a sequence of past observations.

248
00:19:46,110 --> 00:19:52,230
To add a little more to this, we call the last state of an episode a terminal state. So when you reach

249
00:19:52,230 --> 00:19:55,080
a terminal state, that is the end of your episode.

250
00:19:56,490 --> 00:20:00,240
Lastly, we define the terms state space and action space.

251
00:20:00,240 --> 00:20:05,990
These are the set of all states and the set of all actions, respectively. Using these terms,

252
00:20:06,000 --> 00:20:11,550
we can now talk about reinforcement learning coherently and build a framework that allows us to solve

253
00:20:11,820 --> 00:20:13,460
reinforcement learning problems.