1
00:00:11,690 --> 00:00:17,090
Previously, we learned about terminology: the various words we use to describe reinforcement learning

2
00:00:17,090 --> 00:00:21,530
problems which allow us to solve reinforcement learning problems.

3
00:00:21,530 --> 00:00:27,560
Now that you understand concepts such as agents, environments, policies, states, actions, and rewards, we

4
00:00:27,560 --> 00:00:29,670
can build on this.

5
00:00:29,690 --> 00:00:34,270
The goal is to have a framework which we can then use to find solutions.

6
00:00:34,280 --> 00:00:40,130
So at this stage we are still working to more accurately and more narrowly define a problem.

7
00:00:40,130 --> 00:00:45,320
Once we have an accurately defined problem we can work within this framework and whatever assumptions

8
00:00:45,320 --> 00:00:52,720
it involves to derive solutions.

9
00:00:52,780 --> 00:00:57,340
The main assumption we make in reinforcement learning is the Markov assumption.

10
00:00:57,400 --> 00:01:01,720
This is something we often discuss in terms of Markov models and sequence modelling.

11
00:01:01,810 --> 00:01:04,160
But let's review it here anyway.

12
00:01:04,180 --> 00:01:06,750
The Markov assumption goes like this.

13
00:01:06,790 --> 00:01:11,350
Suppose we want to predict whether tomorrow will be rainy, sunny, or cloudy.

14
00:01:11,350 --> 00:01:17,860
Perhaps your idea is to base this on whether it was rainy, sunny, or cloudy on each of the past seven days.

15
00:01:17,860 --> 00:01:23,340
Well, the Markov assumption is that tomorrow's weather doesn't depend on all of the past seven days,

16
00:01:23,410 --> 00:01:25,270
only on the immediately preceding day.
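As a rough illustration of the weather example, here is a minimal Python sketch of a Markov chain in which tomorrow's weather is sampled from today's weather alone. The transition probabilities and the names `TRANSITIONS`, `sample_tomorrow`, and `simulate` are invented for this sketch; they are not from the lecture.

```python
import random

# Hypothetical transition probabilities P(tomorrow | today).
# The numbers are made up for illustration only.
TRANSITIONS = {
    "sunny":  {"sunny": 0.7, "cloudy": 0.2, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}

def sample_tomorrow(today, rng=random):
    """Sample tomorrow's weather using only today's weather."""
    options = TRANSITIONS[today]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]

def simulate(start, days, seed=0):
    """Generate a weather sequence, one Markov step per day."""
    rng = random.Random(seed)
    history = [start]
    for _ in range(days):
        # Only history[-1] is consulted; earlier days are ignored.
        history.append(sample_tomorrow(history[-1], rng))
    return history
```

Calling `simulate("sunny", 7)` produces an eight-day sequence in which each day was drawn while looking only at the day before it.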

17
00:01:30,360 --> 00:01:33,430
Here's another example of the Markov assumption.

18
00:01:33,450 --> 00:01:36,540
Suppose I want to predict the next word of a sentence.

19
00:01:36,540 --> 00:01:40,780
I tell you that the previous word in the sentence is "lazy".

20
00:01:41,010 --> 00:01:45,570
The Markov assumption is that the next word only depends on the previous word.

21
00:01:45,570 --> 00:01:56,620
Therefore, if the Markov assumption is true, then you should be able to predict the next word in my sentence.

22
00:01:56,640 --> 00:02:02,400
Of course, things are not so easy. You might think, because you are taking a course by the Lazy Programmer,

23
00:02:02,700 --> 00:02:07,530
that the next word in the sentence is "programmer", but in fact that's not what I had in mind.

24
00:02:09,390 --> 00:02:15,870
Now let's suppose I tell you the full sentence so far is "The quick brown fox jumps over the lazy".

25
00:02:16,020 --> 00:02:22,410
Of course we know, since we've seen this example many times, that the next word is "dog". So you might

26
00:02:22,410 --> 00:02:26,800
think this Markov assumption thing doesn't really seem to be a great idea.

27
00:02:26,850 --> 00:02:31,590
In fact, there has been some work in reinforcement learning that does not make use of the Markov

28
00:02:31,590 --> 00:02:35,160
assumption, although that is outside the scope of this course.

29
00:02:35,250 --> 00:02:38,600
The Markov assumption has actually been quite successful so far.

30
00:02:43,350 --> 00:02:49,980
In general, the Markov assumption states that the probability of the state at time t depends only on

31
00:02:49,980 --> 00:02:55,840
the state at time t-1 and not on any state that came before that. Now, by itself,

32
00:02:55,840 --> 00:03:01,540
the Markov assumption is weak, but as you recall, I said earlier that we can make the state whatever

33
00:03:01,540 --> 00:03:02,200
we want.

34
00:03:02,680 --> 00:03:06,790
So if we want to make the state three or four words long that's fine too.

35
00:03:07,000 --> 00:03:12,790
The words are merely observations, but the state is made up of a sequence of observations.

36
00:03:12,970 --> 00:03:17,450
In this way the Markov assumption is not as bad as you might initially think.
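To make the idea of a state built from several observations concrete, here is a small Python sketch that predicts the next word from a state consisting of the last n words. The toy corpus and the helper names `build_model` and `predict` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_model(words, n):
    """Count next-word frequencies, keyed by a state of the last n words."""
    counts = defaultdict(Counter)
    for i in range(len(words) - n):
        state = tuple(words[i:i + n])  # the state is a window of n observations
        counts[state][words[i + n]] += 1
    return counts

def predict(counts, state):
    """Return the most frequent next word for a state, or None if unseen."""
    followers = counts.get(tuple(state))
    return followers.most_common(1)[0][0] if followers else None

# Toy corpus, invented for this sketch.
corpus = ("the quick brown fox jumps over the lazy dog . "
          "i am the lazy programmer .").split()
one_word = build_model(corpus, 1)     # state = previous word only
three_words = build_model(corpus, 3)  # state = previous three words
```

With a one-word state, "lazy" is followed by both "dog" and "programmer", so the prediction is ambiguous. With a three-word state, `predict(three_words, ["over", "the", "lazy"])` returns "dog", because the longer state carries enough context to tell the two sentences apart.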

37
00:03:22,570 --> 00:03:25,670
So why do we need to know about the Markov assumption?

38
00:03:25,720 --> 00:03:31,150
This is because reinforcement learning problems are commonly described as a Markov decision process

39
00:03:31,150 --> 00:03:32,900
or MDP.

40
00:03:32,980 --> 00:03:39,340
Previously we discussed the Markov assumption in terms of a state only, but as you know, reinforcement

41
00:03:39,340 --> 00:03:44,800
learning problems involve other objects as well, namely actions and rewards.

42
00:03:44,800 --> 00:03:52,090
So the way we describe an MDP is to use the state transition probability. It's the probability of arriving

43
00:03:52,090 --> 00:03:58,690
in the state at time t+1 and getting the reward at time t+1, given the state at time t and

44
00:03:58,690 --> 00:04:05,950
taking the action at time t. Another, simpler way of writing this without time indices is just to write

45
00:04:06,220 --> 00:04:14,170
p(s', r | s, a). Note that, because the reward r has no prime symbol, the prime symbol

46
00:04:14,200 --> 00:04:21,720
does not indicate time t+1. You get the reward at time t+1 for arriving in state s'.

47
00:04:21,790 --> 00:04:23,890
But we do not put a prime symbol on r.
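Written out, the two notations described here can be related as follows. The capital letters for the random variables at each time step are a common textbook convention, assumed here rather than taken from the lecture.

```latex
p(s', r \mid s, a) \;=\; \Pr\left( S_{t+1} = s',\; R_{t+1} = r \;\middle|\; S_t = s,\; A_t = a \right),
\qquad
\sum_{s'} \sum_{r} p(s', r \mid s, a) = 1 \quad \text{for every } (s, a).
```

The second expression just says that, for any state and action, the probabilities over all possible next states and rewards add up to one.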

48
00:04:28,970 --> 00:04:34,340
So I just showed you the most general way of writing down the state transition probability, but often

49
00:04:34,340 --> 00:04:36,130
we can make it less general.

50
00:04:36,500 --> 00:04:43,160
For example, if we are solving a maze, then most likely we are going to make the reward deterministic.

51
00:04:43,160 --> 00:04:47,560
In other words there's no need to represent it as a probability distribution.

52
00:04:47,810 --> 00:04:53,900
In this case we can use the notation p(s' | s, a), and the reward can be a symbol all by

53
00:04:53,900 --> 00:04:58,280
itself, usually denoted as r(s, a, s').

54
00:04:58,310 --> 00:05:00,970
This encodes the idea that we were in state s,

55
00:05:01,040 --> 00:05:08,010
we performed action a, and we arrived in the next state s'. We can even just say r(s) or

56
00:05:08,010 --> 00:05:08,910
r(s')

57
00:05:08,940 --> 00:05:14,120
in the case where the reward depends only on the state you arrive in, which is actually quite common.
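As a sketch of this less general form, the Python below models a tiny one-dimensional "maze" with a transition distribution p(s' | s, a) and a deterministic reward r(s'). The corridor layout, the slip probability, and the function names are all hypothetical choices made for this example.

```python
# Hypothetical 1-D corridor "maze": states 0..3, with the goal at state 3.
GOAL = 3
ACTIONS = {"left": -1, "right": +1}
SLIP = 0.1  # assumed chance that the agent stays put instead of moving

def transition_probs(s, a):
    """Return p(s' | s, a) as a dict mapping next state to probability."""
    target = min(max(s + ACTIONS[a], 0), GOAL)  # walls clamp the move
    probs = {s: 0.0, target: 0.0}
    probs[s] += SLIP
    probs[target] += 1.0 - SLIP
    return probs

def reward(s_next):
    """Deterministic reward r(s'): depends only on the state we arrive in."""
    return 1.0 if s_next == GOAL else 0.0
```

Here the randomness lives entirely in `transition_probs`, while `reward` is an ordinary function of the arrival state, so there is no need to treat the reward as a probability distribution.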

58
00:05:19,330 --> 00:05:25,600
An important point to consider is: what is the usefulness of the state transition probability?

59
00:05:25,600 --> 00:05:31,480
You can imagine that if we are playing some game like Breakout on Atari, it is very unlikely we will

60
00:05:31,480 --> 00:05:37,150
ever be able to calculate these probabilities, given that the state space would be infeasible to enumerate.

61
00:05:38,110 --> 00:05:43,330
And in fact, for Q-learning, the main algorithm we're going to discuss in this section, this probability

62
00:05:43,330 --> 00:05:44,530
is not used at all.

63
00:05:46,430 --> 00:05:52,300
I want you to think of the MDP and the state transition probability as stepping stones.

64
00:05:52,430 --> 00:05:58,580
They are simply conceptual tools which we will use to further advance our knowledge and take us to a

65
00:05:58,580 --> 00:06:05,420
point where we can actually come up with a practical algorithm for reinforcement learning. In other words,

66
00:06:05,450 --> 00:06:10,500
while we're not going to be using state transition probabilities directly in Q-learning,

67
00:06:10,580 --> 00:06:16,250
they do help us build on what we've done so far, so that we can actually arrive at Q-learning in a logical

68
00:06:16,250 --> 00:06:21,730
manner.

69
00:06:21,760 --> 00:06:27,690
Why else is the state transition probability useful? Imagine a game like tic-tac-toe.

70
00:06:27,810 --> 00:06:31,330
You might think there is nothing probabilistic about this game.

71
00:06:31,530 --> 00:06:35,420
When I write down an X, I know that's where the X or the O goes.

72
00:06:35,460 --> 00:06:38,340
Why is there a probability associated with that?

73
00:06:38,340 --> 00:06:41,240
Why is my action not deterministic?

74
00:06:41,640 --> 00:06:49,520
And in fact it is entirely possible for your action to deterministically bring you to the next state. Imagine,

75
00:06:49,520 --> 00:06:53,200
for example, a classic task known as the inverted pendulum.

76
00:06:54,170 --> 00:06:59,330
In this reinforcement learning task, your job is to control an upside-down pendulum so that it does

77
00:06:59,330 --> 00:07:04,370
not fall down by moving the cart left or right as necessary.

78
00:07:04,370 --> 00:07:08,340
Now you might think to yourself: how do we describe such a system?

79
00:07:08,360 --> 00:07:15,820
Well, we use the laws of physics. And now think to yourself: are the laws of physics not deterministic?

80
00:07:15,860 --> 00:07:21,410
For example, when we learn Newton's three laws of motion, do those laws of motion involve probability?

81
00:07:21,950 --> 00:07:23,540
The answer is no.

82
00:07:23,570 --> 00:07:25,250
So then what in the world do we need

83
00:07:25,250 --> 00:07:26,150
probability for?

84
00:07:31,250 --> 00:07:37,790
The answer is that your state may not completely capture all the possible information about the environment.

85
00:07:37,880 --> 00:07:41,060
Consider tic-tac-toe again. In tic-tac-toe,

86
00:07:41,060 --> 00:07:46,840
there is another player. That player's moves cannot be predicted by a tic-tac-toe agent.

87
00:07:46,970 --> 00:07:52,310
Therefore, there are multiple possible moves that could occur between the agent's previous move and the

88
00:07:52,310 --> 00:07:54,110
agent's next move.
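One way to picture this is a sketch in which the agent's own move is deterministic, but the state it sees next depends on the opponent's reply. The Python below assumes, purely for illustration, an opponent that picks uniformly at random among empty cells, and it ignores win detection; `next_state_distribution` is a hypothetical helper name.

```python
def next_state_distribution(board, agent_move):
    """p(s' | s, a) for tic-tac-toe, assuming a uniformly random opponent.

    board is a tuple of 9 cells, each "X", "O", or " ".
    """
    after_agent = list(board)
    after_agent[agent_move] = "X"  # the agent's own move is deterministic
    empty = [i for i, cell in enumerate(after_agent) if cell == " "]
    if not empty:
        # Board is full: the opponent cannot reply, so there is one outcome.
        return {tuple(after_agent): 1.0}
    dist = {}
    for i in empty:
        nxt = after_agent.copy()
        nxt[i] = "O"  # one possible opponent reply
        dist[tuple(nxt)] = 1.0 / len(empty)
    return dist
```

Starting from an empty board, placing an X in the centre leads to eight possible next states, each with probability 1/8 under this assumed opponent.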

89
00:07:54,110 --> 00:08:00,320
If we're talking about physical systems, we also have to take into account chaos theory. That is, even

90
00:08:00,320 --> 00:08:02,630
if you know the exact laws of motion,

91
00:08:02,660 --> 00:08:05,970
this does not mean you can accurately predict the future.

92
00:08:06,080 --> 00:08:11,510
In fact, the further into the future you try to predict, the more unreliable your predictions become.

93
00:08:12,680 --> 00:08:18,350
Sometimes we refer to the transition probability as the environment dynamics, which makes sense when

94
00:08:18,350 --> 00:08:21,780
you think about it in the context of physical systems.

95
00:08:22,070 --> 00:08:26,020
A system like an inverted pendulum is in fact a dynamical system.

96
00:08:31,210 --> 00:08:35,710
The last thing I want to do in this lecture is to bring us back to this picture, which you've

97
00:08:35,710 --> 00:08:37,600
probably seen several times at this point.

98
00:08:38,680 --> 00:08:45,400
An MDP, or a reinforcement learning problem, consists of these two objects, the agent and the environment,

99
00:08:45,700 --> 00:08:47,390
going back and forth.

100
00:08:47,620 --> 00:08:52,240
The agent reads the state from the environment and decides what action to take.

101
00:08:52,240 --> 00:08:58,510
It takes that action in the environment and the environment is updated based on that action and brings

102
00:08:58,510 --> 00:09:03,370
the agent to the next state while also returning an associated reward.

103
00:09:03,370 --> 00:09:07,140
The agent can then read this next state, take the next action, and so forth.

104
00:09:08,080 --> 00:09:11,220
So they just go back and forth in this circular pattern.
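The back-and-forth loop described here can be sketched in a few lines of Python. The toy `CorridorEnv` environment, the `random_policy` agent, and the `reset`/`step` method names are assumptions made for this example, not an API from the lecture.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: walk from state 0 to the goal state 3."""
    GOAL = 3

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 or +1); return (next_state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.GOAL)
        done = self.state == self.GOAL
        return self.state, (1.0 if done else 0.0), done

def random_policy(state, rng):
    """A placeholder agent: ignores the state and picks an action uniformly."""
    return rng.choice([-1, +1])

def run_episode(env, policy, seed=0, max_steps=1000):
    """Run the agent-environment loop until the goal or the step limit."""
    rng = random.Random(seed)
    state = env.reset()
    total_reward, steps = 0.0, 0
    for steps in range(1, max_steps + 1):
        action = policy(state, rng)             # agent reads state, picks action
        state, reward, done = env.step(action)  # environment updates, returns reward
        total_reward += reward
        if done:
            break
    return total_reward, steps
```

For instance, `run_episode(CorridorEnv(), lambda s, rng: +1)` uses an agent that always moves right; it reaches the goal in three steps and collects a total reward of 1.0.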

105
00:09:11,470 --> 00:09:18,160
What we've done so far is to represent both of these objects with probabilities. The environment is represented

106
00:09:18,160 --> 00:09:22,760
by the state transition probability p(s', r | s, a).

107
00:09:22,780 --> 00:09:29,890
The agent is represented by the probability π(a | s). This is more helpful than you probably realize

108
00:09:29,890 --> 00:09:36,640
at this point. By representing both the agent and the environment as probabilities, we can describe

109
00:09:36,640 --> 00:09:40,840
reinforcement learning problems mathematically. In particular,

110
00:09:40,840 --> 00:09:45,400
once we have an equation, we can solve that equation. Without an equation,

111
00:09:45,400 --> 00:09:46,990
there isn't really anything to solve.

112
00:09:47,560 --> 00:09:49,370
That's a pretty deep insight.

113
00:09:49,510 --> 00:09:55,870
In order to come up with a solution, we have to have a well-defined problem. Using mathematics, specifically

114
00:09:55,870 --> 00:09:59,740
probability, allows us to create this well-defined problem.

115
00:09:59,740 --> 00:10:02,100
And that's the first step towards finding a solution.