1
00:00:11,570 --> 00:00:17,330
In this lecture, I'm going to discuss what it means to learn in a reinforcement learning problem and

2
00:00:17,330 --> 00:00:20,080
reinforcement learning, there are two main types of tasks.

3
00:00:20,900 --> 00:00:25,850
Previously, we discuss what I call the prediction problem that is given a policy pie.

4
00:00:26,120 --> 00:00:28,730
Find the associated value function VVS.

5
00:00:29,510 --> 00:00:32,300
The second type of task is called the control problem.

6
00:00:32,900 --> 00:00:37,830
This means to find the optimal policy pie, which leads to the maximum VVS.

7
00:00:38,450 --> 00:00:43,010
In other words, the control problem is maximizing the sum of future rewards.

8
00:00:44,270 --> 00:00:48,060
You've already known it since the beginning of the section that this is our goal.

9
00:00:48,800 --> 00:00:53,000
Now we can accurately define what this means and thus find a solution.

10
00:00:58,120 --> 00:01:03,400
First, we need a little bit more math and a few more symbols to accurately describe what we are doing.

11
00:01:04,210 --> 00:01:10,390
You already know about the value function A given a state VVS, to be more precise, we refer to this

12
00:01:10,390 --> 00:01:11,890
as the state value function.

13
00:01:12,640 --> 00:01:18,520
There is another quantity known as the action value function, where the value depends both on the state

14
00:01:18,520 --> 00:01:21,880
s and the action we denoted with the simple.

15
00:01:21,880 --> 00:01:29,620
Q As you can see, it has almost the same definition of yes, except that it's also conditioned on the

16
00:01:29,620 --> 00:01:29,950
action.

17
00:01:30,970 --> 00:01:36,550
And so in the bellman equation, since A is given, we do not sum over the policy distribution.

18
00:01:41,630 --> 00:01:47,240
One interesting question to consider from a computer science perspective is how much space is required

19
00:01:47,240 --> 00:01:48,770
to store the value functions?

20
00:01:49,770 --> 00:01:54,720
Let's consider the scenario where both states and actions are discrete and that we have a finite number

21
00:01:54,720 --> 00:01:55,120
of them.

22
00:01:55,770 --> 00:01:59,460
So let's say we have big states and big AI actions.

23
00:01:59,970 --> 00:02:04,770
In this case, we can store the state value function in an array of size, a big --.

24
00:02:05,440 --> 00:02:10,410
But for the action value function, we'll have to store it in a two dimensional array of size, big

25
00:02:10,410 --> 00:02:16,980
-- times bigger, and thus the storage required for Q is quadratic, whereas the storage required for

26
00:02:16,980 --> 00:02:18,420
V is only linear.

27
00:02:23,450 --> 00:02:26,450
OK, so why do we need this concept of the auction value?

28
00:02:27,080 --> 00:02:29,600
Well, this is going to help us find the optimal policy.

29
00:02:30,500 --> 00:02:34,150
Remember, some policies may be good and some policies may be bad.

30
00:02:34,820 --> 00:02:39,000
The optimal policy is the best policy, the one that maximizes the value.

31
00:02:40,100 --> 00:02:43,510
First, let's just think about how to compare two different policies.

32
00:02:44,180 --> 00:02:47,000
We can say that policy one is better than policy.

33
00:02:47,000 --> 00:02:55,100
Two, if given PPI, one is greater than Vivace given PPI, two for all the states in the states face.

34
00:02:56,180 --> 00:02:59,830
So this helps us describe the relative good ness of different policies.

35
00:03:04,950 --> 00:03:11,370
From here, we can define the best policy and the best value function, the best value function is the

36
00:03:11,370 --> 00:03:14,490
value function for which there is no greater value function.

37
00:03:15,060 --> 00:03:19,000
It's the max over all possible policies of Vivus given PI.

38
00:03:19,920 --> 00:03:22,150
So we'll denote it with the symbol Vistar.

39
00:03:23,430 --> 00:03:28,160
Similarly, the best policy will be the Amax overall policies pie.

40
00:03:28,890 --> 00:03:30,450
So we call that pie star.

41
00:03:35,590 --> 00:03:39,740
So far, we haven't needed to invoke the action value function, but let's see how it's related.

42
00:03:40,600 --> 00:03:45,050
First, the optimal action value is defined similarly to the optimal state value.

43
00:03:45,610 --> 00:03:47,680
It's the max over all possible policies.

44
00:03:47,680 --> 00:03:51,220
Pae of Q given PI, we'll call this Q Star.

45
00:03:52,620 --> 00:03:55,560
This must hold over all states and all actions.

46
00:03:56,550 --> 00:04:02,560
Furthermore, the optimal state value is equal to the max over all actions from the optimal action value.

47
00:04:04,020 --> 00:04:07,500
So from this, we can define the relationship between Q and V.

48
00:04:12,590 --> 00:04:16,850
The real value, no pun intended, of the action value function is this.

49
00:04:17,690 --> 00:04:22,830
Suppose we are playing some game and we would like to know what's the best action to perform right now?

50
00:04:23,570 --> 00:04:26,460
Well, we have a dictionary telling us exactly what to do.

51
00:04:27,110 --> 00:04:31,660
All we have to do is find the Amax or Rikyu given a status.

52
00:04:32,300 --> 00:04:36,200
In other words, the best action to perform becomes a simple dictionary.

53
00:04:36,200 --> 00:04:36,740
Look-Up.

54
00:04:41,850 --> 00:04:45,750
As a sign, no notice how the optimal policy is not unique.

55
00:04:46,800 --> 00:04:47,700
It could be there.

56
00:04:47,730 --> 00:04:51,510
Multiple different policies lead to the same at best value function.

57
00:04:52,290 --> 00:04:54,890
In this case, it suffices to find just one of them.

58
00:04:55,980 --> 00:05:00,750
On the other hand, at the optimal value function is unique because if there are two different value

59
00:05:00,750 --> 00:05:04,290
functions, then logically one of them must be greater than the other one.

60
00:05:09,410 --> 00:05:14,570
Before moving on, let's think about what a good first approach might be to actually finding an optimal

61
00:05:14,570 --> 00:05:15,110
policy.

62
00:05:16,160 --> 00:05:18,180
Remember, this is called the control problem.

63
00:05:19,100 --> 00:05:24,050
Suppose we are playing a game like Grid World or tic tac toe where our states based in action space

64
00:05:24,050 --> 00:05:24,980
are both finite.

65
00:05:25,790 --> 00:05:28,540
In this case, a naive search can solve this problem.

66
00:05:33,630 --> 00:05:39,870
First, let's create a list of all possible policies that can exist, obviously, some may be bad and

67
00:05:39,870 --> 00:05:43,770
some may be good, but one or more of them can be defined to be the best.

68
00:05:45,630 --> 00:05:51,690
Then in a loop, we can test each policy, so first we call a function to find the value for the current

69
00:05:51,690 --> 00:05:52,190
policy.

70
00:05:52,440 --> 00:05:53,940
So that's the evaluate function.

71
00:05:55,080 --> 00:05:59,030
Remember that as we discussed earlier, this is just a linear algebra problem.

72
00:06:00,330 --> 00:06:05,350
Then once we have the value function, we can compare it to our current best value function.

73
00:06:06,090 --> 00:06:11,640
If the new value function is better, then we make this the new best value function and make this policy

74
00:06:11,640 --> 00:06:12,690
the new best policy.

75
00:06:13,530 --> 00:06:17,580
When we are done looping through all the policies, we have found the optimal policy.

76
00:06:20,330 --> 00:06:23,840
In the next lecture, we'll discuss these two functions here in greater detail.

77
00:06:24,740 --> 00:06:27,810
You may have noticed that we don't yet really know how to implement them.

78
00:06:28,730 --> 00:06:30,970
What I would like you to know now is this.

79
00:06:31,700 --> 00:06:33,460
First, the evaluate function.

80
00:06:33,680 --> 00:06:36,750
Finding Vivus, given a policy is not all that hard.

81
00:06:37,390 --> 00:06:39,860
In fact, we already discussed how we would do this.

82
00:06:39,980 --> 00:06:45,980
If you knew both the policy distribution and the environment dynamics, it becomes a simple system of

83
00:06:45,980 --> 00:06:47,120
linear equations.

84
00:06:49,040 --> 00:06:52,520
In the next lecture, we'll look at a more practical way of implementing this.

85
00:06:54,070 --> 00:07:01,690
Second, the other function enumerate all possible policies is conceptually simple but impractical as

86
00:07:01,690 --> 00:07:06,780
an exercise, you may want to try implementing this in code for a simple example like Grid World.

87
00:07:08,110 --> 00:07:13,270
In other words, while this method here appears to be nice and simple, it is not actually practical

88
00:07:13,270 --> 00:07:13,660
to do.