1
00:00:11,720 --> 00:00:16,700
In this lecture, we are going to discuss the concepts of states, actions, and rewards.

2
00:00:16,700 --> 00:00:23,840
More specifically, this lecture is about how we encode states and actions when we are programming. Talking

3
00:00:23,840 --> 00:00:30,050
about states, actions, and rewards theoretically is nice, but eventually we're going to have to put this

4
00:00:30,050 --> 00:00:32,700
into code. As a side note,

5
00:00:32,870 --> 00:00:35,690
since rewards are just numbers, encoding them is trivial.

6
00:00:35,690 --> 00:00:38,840
There is no need to discuss how we represent numbers in code.

7
00:00:44,180 --> 00:00:45,850
Let's start with the state.

8
00:00:46,070 --> 00:00:52,580
As mentioned previously, the state can be discrete or continuous. In a game like tic-tac-toe,

9
00:00:52,580 --> 00:00:58,530
the states are discrete because they are just different configurations of the tic-tac-toe board.

10
00:00:58,640 --> 00:01:04,490
If we make a robot that has sensors like a camera, a microphone, a gyroscope, a proximity sensor, and so

11
00:01:04,490 --> 00:01:05,190
forth,

12
00:01:05,330 --> 00:01:16,410
those would all be continuous values, so you would end up with a vector of continuous values.

13
00:01:16,500 --> 00:01:20,280
In fact, this takes us back to regular supervised learning.

14
00:01:20,280 --> 00:01:25,050
If our targets are categorical, how do we represent them in code?

15
00:01:25,050 --> 00:01:29,910
If there are K categories, we will use the integers 0 up to K minus 1.

16
00:01:30,660 --> 00:01:35,430
So we might say that dog equals 0, cat equals 1, mouse equals 2, and so forth.

17
00:01:36,390 --> 00:01:40,950
Obviously it does not matter which category is assigned to which number.

18
00:01:40,980 --> 00:01:46,080
And the reason we want to do this is because at some point we're going to have to use these categories

19
00:01:46,110 --> 00:01:47,540
as array indices.

20
00:01:48,270 --> 00:01:55,290
And so similarly, if we have S states, then we'll represent them in code using the integers 0 up to

21
00:01:55,290 --> 00:01:59,310
S minus 1. For continuous values,

22
00:01:59,310 --> 00:02:04,590
it makes sense to store them in a vector, although if you have something like an image, then it will be

23
00:02:04,590 --> 00:02:06,470
a three-dimensional tensor.

24
00:02:06,660 --> 00:02:12,870
So more generally, if you have continuous values, you can think of them as a tensor with one or more

25
00:02:12,870 --> 00:02:13,770
dimensions.
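
The encodings described so far can be sketched in plain Python. This is only a minimal illustration, not code from the lecture: the names `CATEGORIES`, `category_to_index`, `sensor_state`, and `image_state`, and all of the numbers, are made up for the example.

```python
# Illustrative names and values only; none of these come from the lecture's slides.
CATEGORIES = ["dog", "cat", "mouse"]  # K = 3 categories
category_to_index = {name: i for i, name in enumerate(CATEGORIES)}

# Integer codes double as array indices, e.g. into a table of per-category counts.
counts = [0] * len(CATEGORIES)
counts[category_to_index["cat"]] += 1

# A continuous state from hypothetical robot sensors is a vector of floats.
sensor_state = [0.12, -3.4, 0.98]  # e.g. proximity and gyroscope readings

# An image is a 3-dimensional tensor: height x width x color channels.
HEIGHT, WIDTH, CHANNELS = 2, 2, 3
image_state = [[[0.0] * CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]
```

Discrete states would be handled the same way as the categories: each one gets an integer from 0 up to S minus 1.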

26
00:02:18,920 --> 00:02:22,390
Now that you know how to represent states and actions in code,

27
00:02:22,430 --> 00:02:25,310
it's time to talk about policies.

28
00:02:25,310 --> 00:02:30,920
When I first started learning about reinforcement learning, I found policies to be kind of an odd concept.

29
00:02:30,980 --> 00:02:35,420
Well, actually, I found all of reinforcement learning to be kind of odd and foreign, but you'll get used

30
00:02:35,420 --> 00:02:36,330
to it.

31
00:02:36,620 --> 00:02:42,590
The idea of a policy makes sense at a high level, but it becomes somewhat ambiguous when you start thinking

32
00:02:42,590 --> 00:02:46,770
about how to represent it mathematically or in code.

33
00:02:46,910 --> 00:02:53,000
The policy is what the agent uses to determine what action to perform given a state.

34
00:02:53,000 --> 00:02:59,030
It's important to keep in mind that the policy yields an action using only the current state. It doesn't

35
00:02:59,030 --> 00:03:04,160
use any combination of the current state and previous states, and it doesn't use any information about

36
00:03:04,160 --> 00:03:08,220
rewards. Technically, as I mentioned earlier,

37
00:03:08,300 --> 00:03:13,940
the state may be made up of multiple observations, and that may also include the reward, although that

38
00:03:13,970 --> 00:03:20,810
is unconventional. But strictly speaking, the policy will yield an action using only the current state.

39
00:03:25,920 --> 00:03:31,860
The simplest way to think about a policy is that it's a dictionary mapping, or a function that returns

40
00:03:31,860 --> 00:03:32,440
an action

41
00:03:32,460 --> 00:03:40,590
given a state. Here is such a function. As you can see, the only input is a state s, and it returns an action

42
00:03:40,620 --> 00:03:48,120
a, where the state s is the key to the dictionary and the action a is the value. The real question is:

43
00:03:48,390 --> 00:03:50,340
how can we represent this mathematically?

44
00:03:55,430 --> 00:04:00,200
This is why it's useful to talk about how we encode states and actions first.

45
00:04:00,200 --> 00:04:06,860
So imagine again we are in grid world. Your agent now has a dictionary representing what action to perform

46
00:04:07,130 --> 00:04:10,660
given the state, as you can see here.

47
00:04:10,790 --> 00:04:17,120
So for example, if we're to the left of the goal state, then the appropriate action is to move right. And

48
00:04:17,120 --> 00:04:24,390
just to be clear, the state to the left of the goal state is (0, 2). If we're at the initial state, then the appropriate

49
00:04:24,390 --> 00:04:31,740
action is to move up. And just to be clear, that is the state (2, 0). Now, from the initial state,

50
00:04:31,760 --> 00:04:36,560
moving right is just as valid, since we can still reach the goal from that point.

51
00:04:36,860 --> 00:04:43,040
You can see that, although I've encoded the states here explicitly as tuples, we could gain more efficiency

52
00:04:43,280 --> 00:04:49,540
by encoding them as integers corresponding to the tuples and using those integers to index an array.

53
00:04:49,760 --> 00:04:56,240
As you may recall from your computer science studies, indexing arrays is typically faster than indexing dictionaries.
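
A dictionary policy of the kind described above, together with the integer encoding, might look something like the following sketch. The grid size (3 rows by 4 columns), the action names, and every entry other than the states (0, 2) and (2, 0) are assumptions made for illustration; `get_action` and `state_to_index` are hypothetical helper names.

```python
# Assumed layout: a 3-row by 4-column grid, states written as (row, column).
# The lecture only names (0, 2) and (2, 0); the other entries are made up.
N_ROWS, N_COLS = 3, 4

policy = {
    (2, 0): "U",  # initial state: move up ("R" would be just as valid)
    (1, 0): "U",
    (0, 0): "R",
    (0, 1): "R",
    (0, 2): "R",  # left of the goal: move right
}


def get_action(s):
    """Deterministic policy: the state is the key, the action is the value."""
    return policy[s]


def state_to_index(s):
    """Flatten a (row, column) tuple into an integer in 0..N_ROWS*N_COLS-1."""
    row, col = s
    return row * N_COLS + col


# The same policy stored in a list indexed by the integer state.
policy_array = [None] * (N_ROWS * N_COLS)
for state, action in policy.items():
    policy_array[state_to_index(state)] = action
```

With this encoding, `policy_array[state_to_index(s)]` returns the same action as `policy[s]` for every state in the dictionary.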

54
00:05:01,390 --> 00:05:05,470
Thinking about policies as dictionary mappings is somewhat limited.

55
00:05:05,470 --> 00:05:07,710
There are two reasons for this.

56
00:05:07,720 --> 00:05:09,570
The first is that this won't work

57
00:05:09,580 --> 00:05:15,280
if you have an infinite state space. You would need an infinite-sized dictionary, which is not possible

58
00:05:15,280 --> 00:05:17,050
to have.

59
00:05:17,050 --> 00:05:21,590
The second is that it doesn't allow our agent to explore its environment.

60
00:05:21,730 --> 00:05:25,350
Think of training your agent like teaching a baby. At first,

61
00:05:25,350 --> 00:05:26,880
a baby knows nothing.

62
00:05:26,950 --> 00:05:33,730
It must try new things in order to figure out how the world works and build up its intuition. A reinforcement

63
00:05:33,730 --> 00:05:35,620
learning agent is the same way.

64
00:05:35,940 --> 00:05:42,880
If it has a fixed policy and it only does the same thing all the time, then it can't gain new experiences.

65
00:05:42,880 --> 00:05:49,560
Thus, it makes sense for policies to be stochastic. Stochastic is just a fancy word for random.

66
00:05:49,630 --> 00:05:55,450
In other words, a more general way to represent policies is to represent them as probabilities.

67
00:06:00,600 --> 00:06:06,450
Representing policies as probabilities actually solves both of the problems I posed above.

68
00:06:06,450 --> 00:06:11,290
Let's see how. First, it deals with this problem of randomness.

69
00:06:11,570 --> 00:06:17,120
The common way of dealing with this in reinforcement learning, to allow the agent to explore, is to give

70
00:06:17,120 --> 00:06:20,810
it a small chance of performing a random action.

71
00:06:20,840 --> 00:06:24,330
So here is a Python function that can accomplish this.

72
00:06:24,440 --> 00:06:26,240
We first generate a random number.

73
00:06:26,900 --> 00:06:32,540
If this number is less than some small number epsilon, let's say 0.1, then we will choose an

74
00:06:32,540 --> 00:06:39,570
action at random from the action space. Otherwise, we will grab an action from our fixed policy dictionary

75
00:06:39,570 --> 00:06:41,100
mapping.

76
00:06:41,100 --> 00:06:46,860
This method is called epsilon-greedy, and you'll learn later in this section why it's useful and what

77
00:06:46,860 --> 00:06:53,780
the relevance of exploration is.
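
The function shown on the slide is not part of this transcript, so here is a minimal sketch of what an epsilon-greedy function could look like, following the steps just described. The names `ACTION_SPACE`, `policy`, and `epsilon_greedy`, the `rng` parameter, and the policy entries are assumptions for illustration.

```python
import random

ACTION_SPACE = ["U", "D", "L", "R"]  # assumed action names
policy = {(2, 0): "U", (0, 2): "R"}  # fixed policy; entries are illustrative


def epsilon_greedy(policy, s, eps=0.1, rng=random):
    """With probability eps pick a random action, otherwise follow the fixed policy."""
    p = rng.random()  # uniform in [0, 1)
    if p < eps:
        return rng.choice(ACTION_SPACE)
    return policy[s]
```

Setting `eps=0.0` recovers the fixed dictionary policy, while `eps=1.0` makes every action random.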

78
00:06:53,830 --> 00:06:56,480
So what about continuous state spaces?

79
00:06:56,650 --> 00:07:04,360
In fact, seeing policies as probabilistic easily lends itself to continuous or infinite state spaces.

80
00:07:04,360 --> 00:07:07,320
Imagine your state is a vector s.

81
00:07:07,330 --> 00:07:14,710
Now imagine we have some policy parameters W. The shape of W is the dimensionality of the state space

82
00:07:14,830 --> 00:07:17,300
by the size of the action space.

83
00:07:17,410 --> 00:07:24,580
For now, we'll presume that the action space is still categorical. So what do we do when we want to output

84
00:07:24,580 --> 00:07:27,820
probabilities for a set of categories?

85
00:07:27,910 --> 00:07:33,370
In fact, this is just like classification: we can use the softmax function.

86
00:07:33,370 --> 00:07:40,670
So now our policy is the softmax of W dotted with the state s, as you can see.

87
00:07:40,810 --> 00:07:44,050
This allows us to introduce a little more notation.

88
00:07:44,050 --> 00:07:49,630
It's common in reinforcement learning to denote the policy with the symbol pi, not to be confused with

89
00:07:49,630 --> 00:07:50,700
the number pi.

90
00:07:51,560 --> 00:07:57,650
For any given state, we can calculate a probability distribution over the action space, pi of a given

91
00:07:57,650 --> 00:08:02,630
s. Then, to decide which action to perform in the environment,

92
00:08:02,660 --> 00:08:05,840
we can simply sample from this distribution.

93
00:08:06,050 --> 00:08:11,780
This allows us to explore if necessary, but we can still treat this policy deterministically

94
00:08:11,900 --> 00:08:19,490
if we want, simply by using the argmax. Also note that it's not necessary to use a linear model as we

95
00:08:19,490 --> 00:08:20,840
are doing here.

96
00:08:20,930 --> 00:08:24,740
We can in fact use any function approximator, such as a neural network.
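
The linear softmax policy can be sketched in plain Python as follows. This is only an illustration under stated assumptions: W is stored as a list of rows with shape (state dimensionality) by (number of actions), actions are integer indices, and the function names and the numbers in `W` and `s` are made up.

```python
import math
import random


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(W, s):
    """pi(a | s) for a linear model; W has shape (state dim) x (number of actions)."""
    n_actions = len(W[0])
    logits = [sum(W[d][a] * s[d] for d in range(len(s))) for a in range(n_actions)]
    return softmax(logits)


def sample_action(W, s, rng=random):
    """Stochastic use of the policy: draw an action index from pi(. | s)."""
    probs = policy_probs(W, s)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_action(W, s):
    """Deterministic use of the policy: take the argmax."""
    probs = policy_probs(W, s)
    return max(range(len(probs)), key=lambda a: probs[a])


# Made-up numbers: a 2-dimensional state and 3 actions.
W = [[0.5, -1.0, 0.0],
     [1.0, 0.2, -0.3]]
s = [1.0, 2.0]
```

Calling `sample_action(W, s)` explores by sampling from the distribution, while `greedy_action(W, s)` always returns the most probable action.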

97
00:08:29,940 --> 00:08:30,630
Now, at this point,

98
00:08:30,630 --> 00:08:37,920
you may be wondering: how can an intelligent agent possibly know what to do using only the current state?

99
00:08:37,950 --> 00:08:44,340
This is the problem we described before. Just by looking at a still image of the road, how can I know

100
00:08:44,640 --> 00:08:50,750
what the correct action is? If you're more inclined to think inside a supervised learning paradigm,

101
00:08:50,770 --> 00:08:56,200
you may insist that there should be a target here, so that the agent learns to associate this state with

102
00:08:56,200 --> 00:09:04,000
this action. But in fact, it's quite possible for an agent to learn how to plan for the future using experience

103
00:09:04,210 --> 00:09:11,020
collected by playing multiple episodes. Even without an explicit target for a given state,

104
00:09:11,080 --> 00:09:17,360
the agent can still learn what action to perform such that it maximizes rewards in the future.

105
00:09:17,380 --> 00:09:19,720
This is what we will learn about in the coming lectures.