1
00:00:11,730 --> 00:00:17,860
In this lecture, we are going to describe the agent. The agent is our artificial intelligence.

2
00:00:17,880 --> 00:00:24,030
It's responsible for taking past experiences, learning from them, and taking actions such that they will

3
00:00:24,030 --> 00:00:26,790
maximize future rewards.

4
00:00:26,800 --> 00:00:32,530
First, we have our constructor. It takes in two arguments: the size of the state vector and the number

5
00:00:32,530 --> 00:00:33,800
of actions.

6
00:00:34,000 --> 00:00:39,820
These correspond to the number of inputs and outputs of our neural network, respectively.

7
00:00:39,820 --> 00:00:46,760
Next, we initialize an instance of the replay buffer object with a memory size of 500. Next, we have

8
00:00:46,760 --> 00:00:52,520
a few hyperparameters, such as gamma (the discount rate), the initial value of epsilon, the final value

9
00:00:52,520 --> 00:00:58,520
of epsilon, and the factor to change epsilon by on each round.

10
00:00:58,560 --> 00:01:03,330
Lastly we create an instance of our model by calling the MLP constructor.

11
00:01:03,360 --> 00:01:12,700
We also create the associated loss and optimizer and make these attributes of the agent.

12
00:01:12,810 --> 00:01:13,930
Next, we have the

13
00:01:13,950 --> 00:01:15,370
update_replay_memory function.

14
00:01:15,390 --> 00:01:17,400
This takes in a state, action, reward,

15
00:01:17,400 --> 00:01:24,520
next state, and done flag, and stores them in the replay buffer.

16
00:01:24,540 --> 00:01:26,630
Next, we have the act function.

17
00:01:26,700 --> 00:01:32,490
This takes in a state and uses epsilon-greedy to choose an action based on that state.

18
00:01:32,720 --> 00:01:38,050
First we generate a random number between 0 and 1 and check if it's less than epsilon.

19
00:01:38,240 --> 00:01:42,370
If it is, we perform a random action by calling np.random.choice.

20
00:01:43,160 --> 00:01:50,940
Otherwise, we perform a greedy action by grabbing all the Q-values for the input state. Then the action

21
00:01:50,940 --> 00:01:55,260
to perform is the action which leads to the maximum Q value.

22
00:01:55,260 --> 00:02:01,760
So we take the argmax over the model predictions. Remember that the output of a PyTorch model is

23
00:02:01,760 --> 00:02:03,970
batch size by number of outputs.

24
00:02:04,040 --> 00:02:09,000
So we have to index the return value by zero before taking the argmax.

25
00:02:09,320 --> 00:02:10,610
The batch size is just one.

26
00:02:10,640 --> 00:02:18,410
So we're looking at the zeroth index.
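
To make the epsilon-greedy step concrete, here is a minimal sketch. It is not the lecture's actual code: this hypothetical act function takes the Q-values directly (an array of shape 1 by number of actions, standing in for the model output) instead of calling the model on a state.

```python
import numpy as np

def act(q_values_batch, epsilon):
    """Epsilon-greedy choice over one row of Q-values.

    q_values_batch has shape (1, n_actions); it is an assumed stand-in
    for what the model would return for a single input state.
    """
    n_actions = q_values_batch.shape[1]
    # Explore: with probability epsilon, pick a uniformly random action.
    if np.random.rand() < epsilon:
        return int(np.random.choice(n_actions))
    # Exploit: index the batch dimension by zero, then take the argmax.
    return int(np.argmax(q_values_batch[0]))

q = np.array([[0.1, 0.9, 0.3]])
greedy_action = act(q, epsilon=0.0)  # never explores, so picks the best Q-value
```

With epsilon set to zero the random branch is never taken, so the call always returns the index of the largest Q-value.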

27
00:02:18,520 --> 00:02:23,410
Next, we have the replay function, which is the most important function for this class.

28
00:02:23,410 --> 00:02:25,860
This is the function that does the learning.

29
00:02:26,200 --> 00:02:32,520
It accepts one argument, the batch size, which tells us how many samples to grab from the replay memory.

30
00:02:32,920 --> 00:02:38,360
So to start we first check if the size of the replay buffer is greater than the batch size.

31
00:02:38,410 --> 00:02:40,520
If it's not, we can't grab a full batch,

32
00:02:40,540 --> 00:02:42,610
so we just return.

33
00:02:42,610 --> 00:02:45,140
Otherwise we continue.

34
00:02:45,170 --> 00:02:50,880
Next, we call self.memory.sample_batch with the batch size argument.

35
00:02:51,040 --> 00:02:58,430
Remember, this returns a dictionary, so we grab the states, actions, rewards, and so on using the corresponding

36
00:02:58,430 --> 00:03:05,600
keys. Next, we calculate the targets for each state. For reference,

37
00:03:05,600 --> 00:03:06,870
here is the equation again.

38
00:03:06,900 --> 00:03:14,160
Mathematically, it's a little different here because we are now working with a batch of data rather than

39
00:03:14,160 --> 00:03:16,950
just one sample.

40
00:03:16,980 --> 00:03:22,410
Also, we have to remember that, by definition, the value of a terminal state is zero.

41
00:03:22,410 --> 00:03:27,080
Therefore, if the next state is a terminal state, then y should just be the reward,

42
00:03:27,100 --> 00:03:34,470
r. To that end, we can multiply by one minus done. Since done equals one if it's a terminal state, then

43
00:03:34,470 --> 00:03:38,670
you would get one minus one, which is zero, and anything times zero is zero.

44
00:03:39,120 --> 00:03:42,180
So effectively, the target would just be r, the reward.
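
Here is a minimal sketch of that batch target calculation, using the standard Q-learning target (reward plus gamma times the max next-state Q-value). The numbers, the gamma value, and the next_q array are all made up for illustration; next_q stands in for the model's predictions on the next states.

```python
import numpy as np

gamma = 0.95  # assumed discount rate, for illustration only

# A made-up batch of three transitions; the last one is terminal.
rewards = np.array([1.0, -2.0, 0.5])
done = np.array([0.0, 0.0, 1.0])

# Stand-in for the model's Q-value predictions on the next states,
# shape (batch_size, n_actions).
next_q = np.array([[1.0, 3.0],
                   [0.5, 0.2],
                   [4.0, 5.0]])

# Multiplying by (1 - done) zeroes out the future value for terminal
# transitions, so their target is just the reward.
target = rewards + (1 - done) * gamma * np.max(next_q, axis=1)
```

Note that the result is a 1D array with one target per sample, and the third entry equals its reward because that transition is marked done.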

45
00:03:47,390 --> 00:03:47,600
Now,

46
00:03:47,600 --> 00:03:51,380
interestingly, this may or may not be correct for this scenario.

47
00:03:51,380 --> 00:03:57,410
Realistically, the stock market never ends, and so you don't really have a proper terminal state.

48
00:03:57,410 --> 00:03:59,700
This is only where our data ends.

49
00:03:59,780 --> 00:04:05,630
So one might theorize that it may be better not to do this. Saying that this is the end of the episode

50
00:04:05,630 --> 00:04:07,470
is sort of pedantic.

51
00:04:07,610 --> 00:04:10,390
It's not really the end of the episode; it's just the end of our data.

52
00:04:10,400 --> 00:04:13,990
The stock market will effectively go on forever.

53
00:04:14,090 --> 00:04:17,300
Of course, you can just test it out and see if it has any effect.

54
00:04:23,050 --> 00:04:25,900
Now, at this point, even though we've calculated the targets,

55
00:04:25,990 --> 00:04:28,720
we are technically not done. Currently,

56
00:04:28,750 --> 00:04:35,110
the targets are just a 1D array of length batch size, but the model predictions are a 2D array of batch

57
00:04:35,110 --> 00:04:39,660
size by number of actions, so they don't match.

58
00:04:39,670 --> 00:04:45,100
The problem is the targets and the predictions don't have the same shape. As always,

59
00:04:45,100 --> 00:04:48,400
I'll remind you how important it is to think about shapes.

60
00:04:48,670 --> 00:04:54,310
Now, one option to solve this problem would be to just use a custom loss function, but if you would like

61
00:04:54,310 --> 00:04:59,370
to use the default mean squared error, then the targets and the predictions have to be the same shape.

62
00:05:00,820 --> 00:05:05,010
In other words, for each sample, we must have a target for each action,

63
00:05:05,170 --> 00:05:11,620
even if that action was not the one taken by the agent. In order to do this, we're going to use our model

64
00:05:12,010 --> 00:05:13,930
to make a prediction for each state

65
00:05:13,930 --> 00:05:17,630
and action. We'll call that target_full.

66
00:05:17,770 --> 00:05:22,380
So this will be batch size by number of actions.

67
00:05:22,470 --> 00:05:28,300
Now, the key is we only want to change this array for actions for which we really have targets,

68
00:05:28,410 --> 00:05:32,280
that is, the targets we calculated above. To do this,

69
00:05:32,280 --> 00:05:38,730
we can use double indexing, where the first index, which is np.arange of batch size,

70
00:05:39,330 --> 00:05:46,980
represents which row to choose. The second index, which is actions, tells us which column to choose.

71
00:05:47,550 --> 00:05:51,950
And we can set these to the targets we calculated previously.

72
00:05:51,960 --> 00:05:58,590
The trick is, for all other actions, the target will be equal to the prediction, because the target actually

73
00:05:58,590 --> 00:06:05,450
is the prediction. Therefore, the error for those values will be 0, and they will not have any influence

74
00:06:05,450 --> 00:06:09,700
on the gradient descent step.
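
A small sketch of this double-indexing trick, with made-up numbers. The predictions array stands in for the model's output on the sampled states, and the variable names here are chosen for illustration.

```python
import numpy as np

batch_size = 3

# Stand-in for the model's predictions on the sampled states,
# shape (batch_size, n_actions).
predictions = np.array([[0.2, 0.7],
                        [1.0, -1.0],
                        [0.0, 0.4]])

actions = np.array([1, 0, 1])           # action taken in each sample
target = np.array([3.85, -1.525, 0.5])  # 1D targets computed earlier

# Start from the predictions, then overwrite only the entries for the
# actions that were actually taken: row i, column actions[i].
target_full = predictions.copy()
target_full[np.arange(batch_size), actions] = target

# Entries for untaken actions equal the prediction, so their error is zero.
squared_error = (target_full - predictions) ** 2
```

Only one entry per row of squared_error is nonzero, which is why the untaken actions contribute nothing to the mean squared error gradient.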

75
00:06:09,730 --> 00:06:16,030
Next, we run one step of gradient descent using the function model.train_on_batch, passing in

76
00:06:16,030 --> 00:06:23,060
the inputs and targets we just calculated. Finally, we update epsilon to reduce the amount of exploration

77
00:06:23,060 --> 00:06:23,630
over time.

78
00:06:24,590 --> 00:06:30,260
If epsilon is still bigger than epsilon_min, then we decrease epsilon by a factor of epsilon_decay.
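
As a quick sketch of that decay rule: the decay helper below is hypothetical, and the starting value, minimum, and decay factor are assumed numbers, not necessarily the ones used in the lecture's script.

```python
def decay(epsilon, epsilon_min, epsilon_decay):
    """Shrink epsilon by a constant factor while it is above the minimum."""
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    return epsilon

# Assumed values, for illustration only.
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995

# Simulate many replay calls; epsilon shrinks and then stops changing.
for _ in range(2000):
    epsilon = decay(epsilon, epsilon_min, epsilon_decay)
```

Because the check happens before the multiplication, epsilon can end up slightly below epsilon_min on the last decay step, and after that it stays fixed.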

79
00:06:34,350 --> 00:06:38,820
After this, we have two simple functions to save and load the model weights.

80
00:06:38,820 --> 00:06:42,480
These will be useful since the script takes some time to train.

81
00:06:42,540 --> 00:06:48,360
So what we can do is train the script in one run, save the weights, and then test it later with different

82
00:06:48,360 --> 00:06:49,470
configurations.