1
00:00:11,650 --> 00:00:16,980
In this lecture, we are going to discuss the layout and design of our reinforcement learning trading script,

2
00:00:17,000 --> 00:00:22,570
but first, at a very high level, we are going to have two modes of operation:

3
00:00:22,570 --> 00:00:23,680
Train and test.

4
00:00:24,340 --> 00:00:30,580
As usual, we want all of our training data to be in the past and all of our test data to be stock prices

5
00:00:30,580 --> 00:00:32,940
that came after the training data.

6
00:00:33,010 --> 00:00:40,700
So we are going to train our agent to maximize its reward over an episode using only the training data.

7
00:00:40,700 --> 00:00:46,630
Then we are going to use this trained agent on the test data to see what the value of our portfolio

8
00:00:46,630 --> 00:00:46,900
is

9
00:00:46,900 --> 00:00:48,670
by the end of the test period.
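As a minimal sketch of that split, assuming the prices are stored oldest first (the numbers below are made up, and the halfway split point is an arbitrary choice for illustration):
```python
# Hypothetical daily closing prices, oldest first (made-up numbers).
prices = [10.0, 10.5, 10.2, 11.0, 11.3, 11.1, 12.0, 12.4, 12.1, 13.0]
# Split by time, never by shuffling: earlier days train, later days test.
n_train = len(prices) // 2
train_data = prices[:n_train]   # the past
test_data = prices[n_train:]    # everything that came after
print(len(train_data), len(test_data))
```
The point is only that every test price comes after every training price.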

10
00:00:53,910 --> 00:00:59,580
To start, let's think about how this would work if we had access to all the objects we needed.

11
00:00:59,680 --> 00:01:04,460
Even without the agent, this will help us organize the majority of our code.

12
00:01:04,540 --> 00:01:07,700
The main part of the code will look something like this.

13
00:01:07,960 --> 00:01:13,790
First, create an instance of the environment. Next, create an instance of the agent.

14
00:01:13,840 --> 00:01:20,320
Don't worry about what this does yet. Then, in a loop, we're going to have a function called play one episode,

15
00:01:20,770 --> 00:01:26,230
which accepts the environment and the agent and returns the value of the portfolio at the end of the

16
00:01:26,230 --> 00:01:27,450
episode.

17
00:01:27,760 --> 00:01:33,190
When our loop is done, we're going to save the portfolio values for later so that we can plot them and

18
00:01:33,190 --> 00:01:33,970
analyze them.

19
00:01:34,690 --> 00:01:45,820
So this is pretty simple, but now we have to figure out what should go in the function play one episode.
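The main part described so far might be sketched like this. The classes and the stub function here are placeholders invented so the sketch runs on its own, and the is_train argument is an assumed way of passing the mode:
```python
# Placeholder stand-ins so this sketch runs by itself; the real environment,
# agent, and play_one_episode are covered later in the lecture.
class DummyEnv:
    pass
class DummyAgent:
    pass
def play_one_episode(env, agent, is_train):
    # Hypothetical stub: pretend every episode ends with the same portfolio value.
    return 20000.0
def main(num_episodes=5, mode="train"):
    env = DummyEnv()        # 1. create an instance of the environment
    agent = DummyAgent()    # 2. create an instance of the agent
    portfolio_values = []
    for _ in range(num_episodes):
        # 3. play an episode and record the final portfolio value
        value = play_one_episode(env, agent, is_train=(mode == "train"))
        portfolio_values.append(value)
    # 4. hand back the values so they can be saved, plotted, and analyzed
    return portfolio_values
portfolio_values = main()
print(portfolio_values)
```
Here the values are simply returned as a list; writing them to disk for later plotting would be one more line.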

20
00:01:45,900 --> 00:01:49,650
So here's what the play one episode function might look like.

21
00:01:49,650 --> 00:01:54,270
As always, we start by resetting the environment to get back to the initial state.

22
00:01:55,740 --> 00:02:03,590
Next, we initialize our done flag to false and enter a loop that only quits when done becomes true. Inside

23
00:02:03,590 --> 00:02:03,980
the loop,

24
00:02:03,980 --> 00:02:05,990
we choose an action.

25
00:02:06,080 --> 00:02:11,270
Now, at this point you know that this action is coming from our agent, but we'll defer how the agent works

26
00:02:11,300 --> 00:02:14,120
until later. Next,

27
00:02:14,150 --> 00:02:19,820
we're going to call the env.step function to perform the action and get back the next state, reward,

28
00:02:19,820 --> 00:02:27,390
and so on. Next, we're going to check if our script is in train mode. If it is, then we have to train our

29
00:02:27,390 --> 00:02:28,290
agent.

30
00:02:28,710 --> 00:02:35,430
We want to update our replay buffer by adding the most recent (s, a, r, s', done) tuple to the replay

31
00:02:35,430 --> 00:02:43,100
buffer. Then we'll call agent.replay to grab a sample from our replay buffer and run one step of gradient

32
00:02:43,100 --> 00:02:51,590
descent. Finally, we'll set the current state to be the next state for the next iteration of this loop. When

33
00:02:51,590 --> 00:02:53,990
we're done, we'll return the value of our portfolio.
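Those steps could be sketched as follows. ToyEnv and ToyAgent are hypothetical stand-ins invented for this example, and reading the final portfolio value from an info dictionary is an assumption rather than something fixed by the lecture:
```python
import random
class ToyEnv:
    """Hypothetical stand-in: the state is a day counter and an episode lasts 3 steps."""
    def reset(self):
        self.day = 0
        self.value = 100.0
        return self.day
    def step(self, action):
        self.day += 1
        prev_value = self.value
        self.value += action              # pretend the action moves the portfolio value
        reward = self.value - prev_value
        done = self.day == 3
        return self.day, reward, done, {"portfolio_value": self.value}
class ToyAgent:
    def __init__(self):
        self.memory = []
        self.replay_calls = 0
    def get_action(self, state):
        return random.choice([-1, 0, 1])
    def update_replay_memory(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    def replay(self):
        self.replay_calls += 1            # a real agent would run one gradient step here
def play_one_episode(env, agent, is_train):
    state = env.reset()                   # back to the initial state
    done = False
    while not done:
        action = agent.get_action(state)
        next_state, reward, done, info = env.step(action)
        if is_train:
            agent.update_replay_memory(state, action, reward, next_state, done)
            agent.replay()
        state = next_state                # next state becomes the current state
    return info["portfolio_value"]
agent = ToyAgent()
final_value = play_one_episode(ToyEnv(), agent, is_train=True)
print(final_value, len(agent.memory))
```
In test mode the same loop runs, but nothing is stored and no training step is taken.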

34
00:02:59,120 --> 00:03:03,680
One additional detail to keep in mind is that our data is not yet normalized.

35
00:03:03,710 --> 00:03:09,740
You can imagine that our state, which is composed of three parts, can have vastly different ranges.

36
00:03:09,740 --> 00:03:12,860
The first part consists of the number of shares we own.

37
00:03:13,010 --> 00:03:18,740
The second part consists of share prices, and the third part consists of how much cash we have sitting

38
00:03:18,740 --> 00:03:19,430
uninvested.

39
00:03:20,150 --> 00:03:22,650
So we'll want to normalize this data.

40
00:03:22,760 --> 00:03:26,320
We can do this very simply whenever we get a new state.

41
00:03:26,390 --> 00:03:32,090
We'll have a scaler object from scikit-learn, which will take our state and standardize it to have zero

42
00:03:32,090 --> 00:03:33,800
mean and unit variance.

43
00:03:33,980 --> 00:03:36,320
So, not a huge addition to our previous code.
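A minimal sketch of that normalization, using scikit-learn's StandardScaler. The sample states are made-up numbers for one stock, and how the sample is collected (for example, by playing an episode with random actions) is an assumption:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler
# Hypothetical sample of states: [shares owned, share price, uninvested cash].
sample_states = np.array([
    [0.0, 50.0, 20000.0],
    [100.0, 55.0, 14500.0],
    [50.0, 60.0, 17500.0],
    [200.0, 45.0, 11000.0],
])
# Fit once on the sample so the scaler learns each column's mean and variance.
scaler = StandardScaler()
scaler.fit(sample_states)
# transform expects a 2-D array, so a single state is wrapped in a list.
state = np.array([100.0, 52.0, 15000.0])
scaled_state = scaler.transform([state])
print(scaled_state)
```
After this, all three parts of the state are on a comparable scale before they reach the model.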

44
00:03:41,570 --> 00:03:42,260
Next,

45
00:03:42,290 --> 00:03:46,310
let's imagine what our environment object will actually look like.

46
00:03:46,440 --> 00:03:53,380
First, it's going to accept a time series of stock prices as input into its constructor. We'll also have

47
00:03:53,380 --> 00:03:55,440
a pointer to tell us what day it is,

48
00:03:55,540 --> 00:04:01,520
so we know the current stock prices. We'll also want to know how much cash we initially start with, our

49
00:04:01,520 --> 00:04:09,510
initial investment. From this, we can do everything we need to do. Our reset function will bring our pointer

50
00:04:09,540 --> 00:04:15,210
back to the beginning of the time series and recalculate our state, which of course should be all cash

51
00:04:15,240 --> 00:04:22,510
and no investment. Our step function will take in an action and then buy and sell the stocks specified

52
00:04:22,510 --> 00:04:29,220
by the action. Then it'll set our pointer to the next day's stock prices.

53
00:04:29,590 --> 00:04:33,280
We'll also calculate the next state and the portfolio value.

54
00:04:33,280 --> 00:04:35,600
From this we can calculate the reward.

55
00:04:36,090 --> 00:04:38,260
The done flag will simply be set to true

56
00:04:38,320 --> 00:04:44,360
if we reach the end of our time series. So that's basically it for the environment: we have the constructor,

57
00:04:44,540 --> 00:04:46,340
a reset function, and a step function.
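Here is one way such an environment could be sketched. The class name, the action encoding (one entry per stock: 0 = sell all, 1 = hold, 2 = buy as much as cash allows), and the reward (change in portfolio value) are all assumptions made for illustration:
```python
import numpy as np
class StockEnv:
    """Sketch of a trading environment: constructor, reset, and step."""
    def __init__(self, prices, initial_investment=20000.0):
        self.prices = np.asarray(prices, dtype=float)    # shape: (n_days, n_stocks)
        self.n_days, self.n_stocks = self.prices.shape
        self.initial_investment = initial_investment
        self.reset()
    def _state(self):
        # State = [shares owned..., current prices..., uninvested cash]
        return np.concatenate([self.shares, self.prices[self.day], [self.cash]])
    def _value(self):
        return float(self.shares @ self.prices[self.day] + self.cash)
    def reset(self):
        self.day = 0                                     # pointer back to the first day
        self.shares = np.zeros(self.n_stocks)            # no investment
        self.cash = self.initial_investment              # all cash
        return self._state()
    def step(self, action):
        prev_value = self._value()
        price = self.prices[self.day]
        # Sell first so the proceeds are available for buying.
        for i, a in enumerate(action):
            if a == 0:
                self.cash += self.shares[i] * price[i]
                self.shares[i] = 0.0
        for i, a in enumerate(action):
            if a == 2:
                n = self.cash // price[i]                # whole shares only
                self.shares[i] += n
                self.cash -= n * price[i]
        self.day += 1                                    # move the pointer to the next day
        value = self._value()
        reward = value - prev_value                      # assumed reward: change in value
        done = self.day == self.n_days - 1               # reached the end of the series
        return self._state(), reward, done, {"portfolio_value": value}
env = StockEnv([[10.0], [12.0], [11.0]], initial_investment=100.0)
state = env.reset()
state, reward, done, info = env.step([2])   # buy at 10, then the price moves to 12
print(state, reward, done, info)
```
In this tiny example, buying ten shares at 10 and seeing the price rise to 12 gives a reward of 20.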

58
00:04:51,580 --> 00:04:57,210
Finally, let's consider our agent object, which is complicated, but no more complicated than the environment.

59
00:04:58,510 --> 00:05:00,390
The essential parts of the agent are

60
00:05:00,400 --> 00:05:06,000
the replay buffer, which is what we use to store our transitions in the environment, and a neural

61
00:05:06,000 --> 00:05:12,140
network or some other kind of model, which is what we will use to approximate our Q-values.

62
00:05:12,190 --> 00:05:16,170
So what are the essential functions of the agent? As we've seen,

63
00:05:16,180 --> 00:05:22,170
we'll need one function called update replay memory, which takes in a state, action, reward, next state,

64
00:05:22,180 --> 00:05:26,930
and done flag, and stores this in our replay buffer object.

65
00:05:27,030 --> 00:05:28,170
Next, we'll need a get

66
00:05:28,170 --> 00:05:34,550
action function, which accepts as input a state and decides what action to perform in the environment.

67
00:05:34,830 --> 00:05:40,200
And because this is Q-learning, it's going to use the Q-learning rule or some variant of it, like

68
00:05:40,200 --> 00:05:45,890
epsilon-greedy. Finally, we'll need a replay function, which does the following.

69
00:05:46,000 --> 00:05:50,570
First, it grabs a random sample from the replay buffer. Next,

70
00:05:50,590 --> 00:05:56,980
it uses this to calculate a supervised learning dataset, which consists of input and target pairs, which

71
00:05:56,980 --> 00:06:01,320
is what we need to train our model. Once we have our dataset,

72
00:06:01,350 --> 00:06:10,610
we can call model.train to run one iteration of gradient descent.
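To make the three functions concrete, here is a rough sketch. The linear model stands in for the neural network, and the class names, hyperparameters, and buffer size are all assumptions chosen for illustration:
```python
import random
from collections import deque
import numpy as np
class LinearModel:
    """Hypothetical stand-in for the neural network: Q(s) = s @ W + b."""
    def __init__(self, state_dim, n_actions, lr=0.01):
        self.W = np.zeros((state_dim, n_actions))
        self.b = np.zeros(n_actions)
        self.lr = lr
    def predict(self, states):
        return states @ self.W + self.b
    def train(self, states, targets):
        # One gradient descent step on half the mean squared error.
        error = self.predict(states) - targets
        self.W -= self.lr * states.T @ error / len(states)
        self.b -= self.lr * error.mean(axis=0)
class Agent:
    def __init__(self, state_dim, n_actions, gamma=0.95, epsilon=0.1, batch_size=4):
        self.n_actions = n_actions
        self.gamma, self.epsilon, self.batch_size = gamma, epsilon, batch_size
        self.memory = deque(maxlen=500)                  # replay buffer
        self.model = LinearModel(state_dim, n_actions)
    def update_replay_memory(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    def get_action(self, state):
        if random.random() < self.epsilon:               # explore with probability epsilon
            return random.randrange(self.n_actions)
        return int(np.argmax(self.model.predict(state[None, :])[0]))   # otherwise be greedy
    def replay(self):
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(list(self.memory), self.batch_size)
        states = np.array([t[0] for t in batch])
        actions = np.array([t[1] for t in batch])
        rewards = np.array([t[2] for t in batch])
        next_states = np.array([t[3] for t in batch])
        dones = np.array([t[4] for t in batch], dtype=float)
        # Q-learning target: r + gamma * max Q(s', a'), with no bootstrap at episode end.
        max_next_q = self.model.predict(next_states).max(axis=1)
        target_values = rewards + (1.0 - dones) * self.gamma * max_next_q
        # Only the taken action gets a new target; the others keep their prediction.
        targets = self.model.predict(states)
        targets[np.arange(self.batch_size), actions] = target_values
        self.model.train(states, targets)
agent = Agent(state_dim=3, n_actions=3)
for _ in range(8):
    s = np.random.rand(3)
    agent.update_replay_memory(s, agent.get_action(s), 1.0, np.random.rand(3), False)
agent.replay()
print(agent.model.b)
```
The input and target pairs built inside replay are the supervised learning dataset, and model.train is the single gradient descent step.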

73
00:06:10,650 --> 00:06:12,710
All right, so that's it for this lecture.

74
00:06:12,750 --> 00:06:17,190
There are certainly more details we haven't yet discussed, since this was designed to just give you an

75
00:06:17,190 --> 00:06:19,400
overview. At this point,

76
00:06:19,410 --> 00:06:23,320
you understand that our script will have a train mode and a test mode.

77
00:06:23,520 --> 00:06:29,790
In both cases, there will be a main loop where we call play one episode again and again. Playing one

78
00:06:29,790 --> 00:06:35,010
episode involves basically just going back and forth between the agent and the environment.

79
00:06:35,010 --> 00:06:41,460
The environment produces states and rewards. The agent takes in states and returns actions to perform

80
00:06:41,460 --> 00:06:44,250
in the environment. During train mode,

81
00:06:44,250 --> 00:06:47,970
the agent will store the states, actions, and rewards and perform

82
00:06:47,970 --> 00:06:51,600
Q-learning updates in order to train the Q-function approximator.
