1
00:00:11,670 --> 00:00:18,960
Earlier in this section I told you about the value of treating the policy as a probability distribution.

2
00:00:18,960 --> 00:00:25,560
First it allows us to describe the entire MBP as a set of two probabilities one representing the environment

3
00:00:25,560 --> 00:00:28,740
dynamics and one representing the agent.

4
00:00:28,740 --> 00:00:34,320
This allows us to reason about the MVP mathematically and find a solution to both the prediction and

5
00:00:34,320 --> 00:00:37,450
control problems as we just did.

6
00:00:37,680 --> 00:00:43,290
The other important reason which I mentioned briefly earlier is that in order to find the best action

7
00:00:43,380 --> 00:00:47,920
we must first know the result of performing different actions.

8
00:00:47,950 --> 00:00:55,370
In other words we must have collected those samples.

9
00:00:55,380 --> 00:01:01,800
Here is the problem when we take the ARG Max overall actions queue of essay The Queue value here is

10
00:01:01,800 --> 00:01:05,210
not changing that much from one episode to the next.

11
00:01:05,430 --> 00:01:10,980
Let's say we have three actions so we want to choose from queue of as a one queue of essay 2 and queue

12
00:01:10,980 --> 00:01:12,090
of as a 3.

13
00:01:12,660 --> 00:01:18,720
While perhaps we initialize queue of as a 1 and queue of as a 2 to 0 while we initialize queue of as

14
00:01:18,750 --> 00:01:27,180
a 3 to 1 also assume that all the rewards are positive during the process of playing the game.

15
00:01:27,180 --> 00:01:31,090
Maybe a queue of essay 3 gets updated to a value of 2.

16
00:01:31,500 --> 00:01:37,260
We will now always choose action 3 because that gives us the best queue value out of the three possible

17
00:01:37,260 --> 00:01:38,880
actions.

18
00:01:38,880 --> 00:01:44,610
The problem is we don't know the true values of queue of essay 1 and 2 of essay 2 because we never collected

19
00:01:44,610 --> 00:01:47,760
data from performing those actions.

20
00:01:47,760 --> 00:01:50,070
How do we know the values of the other actions.

21
00:01:50,220 --> 00:01:53,260
If we have never actually performed those actions.

22
00:01:53,580 --> 00:01:55,410
In fact we don't know those values at all.

23
00:01:56,650 --> 00:02:00,660
We just happen to initialize them to zero because of that.

24
00:02:00,760 --> 00:02:08,700
There is no way for us to confidently choose the right action.

25
00:02:08,710 --> 00:02:13,000
This is a more general problem known as the explore exploit dilemma.

26
00:02:13,000 --> 00:02:18,350
Suppose you go to a casino and you're playing a simplified version of slot machines.

27
00:02:18,460 --> 00:02:23,060
You have several slot machines to choose from and you don't know which one you should play.

28
00:02:23,140 --> 00:02:28,420
However you do know that not all these slot machines yield the same reward.

29
00:02:28,420 --> 00:02:32,520
As I said these slot machines are simplified so there are only two possible outcomes.

30
00:02:32,530 --> 00:02:39,430
You can either win or you lose if you win you get one dollar and if you lose you get zero dollars.

31
00:02:39,430 --> 00:02:49,280
Obviously you would like to play the slot machine in which your chances of winning are highest.

32
00:02:49,470 --> 00:02:49,760
OK.

33
00:02:49,800 --> 00:02:54,080
So how can you figure out which slot machine will give you the best chance of winning.

34
00:02:54,090 --> 00:02:58,060
Well you cannot ask the manager of the casino because he wants to make money.

35
00:02:58,140 --> 00:03:00,680
He will not share that information with you.

36
00:03:00,720 --> 00:03:02,480
So what do you do.

37
00:03:02,610 --> 00:03:07,610
Your only method of figuring out which slot machine is best is to collect some data.

38
00:03:07,800 --> 00:03:13,890
So let's say you play each slot machine 1000 times the win rate of the slot machine is simply equal

39
00:03:13,890 --> 00:03:18,120
to the number of times you won divided by the total number of times you played

40
00:03:23,430 --> 00:03:29,280
but what's the problem with what we just did let's say you have to choose between five slot machines

41
00:03:30,090 --> 00:03:35,620
if you played a slot machine 1000 times that means you've played 5000 games.

42
00:03:35,700 --> 00:03:40,800
Now let me remind you that playing a slot machine at a casino is not free.

43
00:03:40,800 --> 00:03:43,590
In general collecting data is not free.

44
00:03:43,590 --> 00:03:47,070
It usually requires time resources or both.

45
00:03:47,400 --> 00:03:51,650
Let's say each time you play a slot machine you spend 25 cents.

46
00:03:51,720 --> 00:03:57,280
You need to win at least one out of every four times in order to break even.

47
00:03:57,390 --> 00:03:59,270
Imagine that for some of these slot machines.

48
00:03:59,280 --> 00:04:02,010
You win zero times.

49
00:04:02,150 --> 00:04:07,760
Now it becomes clear that playing those slot machines 1000 times would be a huge waste of both time

50
00:04:07,760 --> 00:04:08,200
and money

51
00:04:13,250 --> 00:04:15,630
here's another question to consider.

52
00:04:15,770 --> 00:04:19,180
Why should we collect 1000 samples per slot machine.

53
00:04:19,340 --> 00:04:21,790
Why not one hundred Why not ten.

54
00:04:22,040 --> 00:04:29,920
As you know the more samples you collect the better and more accurate your estimate becomes unfortunately.

55
00:04:30,020 --> 00:04:37,910
This also equates to more time and money spent.

56
00:04:37,950 --> 00:04:39,960
That's why we call this a dilemma.

57
00:04:40,290 --> 00:04:46,410
The explore exploit dilemma means we are trying to balance both exploration and exploitation.

58
00:04:46,440 --> 00:04:53,150
We want to explore so that we can collect more data to find out which slot machine is truly the best.

59
00:04:53,190 --> 00:04:58,860
On the other hand we want to exploit because we want to play what we believe is the best slot machine

60
00:04:59,130 --> 00:05:02,400
in order to win more often and to make more money.

61
00:05:02,400 --> 00:05:04,650
These two forces always oppose each other

62
00:05:09,760 --> 00:05:15,040
the typical answer to this problem in reinforcement learning is called Epsilon greedy.

63
00:05:15,070 --> 00:05:20,730
This allows us to be greedy and exploit what we think will yield the highest future sum of rewards.

64
00:05:21,370 --> 00:05:27,640
But we have a small probability epsilon of choosing a random value so that we can explore and collect

65
00:05:27,640 --> 00:05:31,920
samples for our Q table to make our estimates of Q more accurate.

66
00:05:32,320 --> 00:05:37,210
And remember they do need to be somewhat accurate in order for us to confidently choose the best action

67
00:05:42,300 --> 00:05:48,780
here's how we will do this in code instead of always performing the action specified by the policy we

68
00:05:48,780 --> 00:05:52,010
will generate a random number between 0 and 1.

69
00:05:52,080 --> 00:05:56,810
If this random number is less than epsilon we will choose an action at random.

70
00:05:57,090 --> 00:06:02,700
Otherwise we will perform the action specified by the greedy policy which is greedy with respect to

71
00:06:02,700 --> 00:06:04,410
Q Given the state s.
