1
00:00:11,660 --> 00:00:17,070
In this lecture we are going to describe the environment we'll be working with in the following lectures.

2
00:00:17,150 --> 00:00:20,700
Firstly, because we'll be working with historical stock data,

3
00:00:20,720 --> 00:00:22,460
this is a simulation.

4
00:00:22,700 --> 00:00:26,630
Of course you would not want to try such an experiment with real money.

5
00:00:26,630 --> 00:00:33,660
So really our job is to figure out how to build an environment object in code that simulates the stock

6
00:00:33,660 --> 00:00:34,060
market.

7
00:00:39,200 --> 00:00:39,890
In general,

8
00:00:40,010 --> 00:00:43,660
here's how we can think of the API for an environment. By the way,

9
00:00:43,670 --> 00:00:47,950
if you're familiar with OpenAI Gym, then this is probably just review for you,

10
00:00:48,170 --> 00:00:50,200
but it's good to go over this anyway.

11
00:00:50,360 --> 00:00:52,550
So the idea is this.

12
00:00:52,550 --> 00:00:56,420
First we are going to instantiate an environment object.

13
00:00:56,420 --> 00:01:02,110
Then we are going to initialize a boolean done flag equal to False.

14
00:01:02,700 --> 00:01:08,770
We are also going to call env.reset, which puts us back into the starting position for this environment

15
00:01:09,130 --> 00:01:11,480
and returns the initial state.

16
00:01:11,650 --> 00:01:17,390
As a side note, this state vector may not be at a good scale to pass into a neural network.

17
00:01:17,710 --> 00:01:23,320
Recall that we like to normalize data before passing it into a neural network or linear

18
00:01:23,320 --> 00:01:31,000
regression, so keep in mind that you can do this as an optional step. Then we are going to enter a loop

19
00:01:31,330 --> 00:01:35,380
which only quits when done becomes true. Inside the loop,

20
00:01:35,400 --> 00:01:39,070
we are going to choose an action to perform in the environment.

21
00:01:39,100 --> 00:01:44,620
This might come from our agent, but that's not a necessary detail at this stage, because we are only thinking

22
00:01:44,620 --> 00:01:51,080
about the API for the environment. You could just as easily choose a random action, although most likely

23
00:01:51,080 --> 00:01:54,950
this will lead to a suboptimal reward.

24
00:01:54,990 --> 00:01:59,050
The next step is to actually perform the action in the environment.

25
00:01:59,220 --> 00:02:07,500
We do that by calling env.step and passing in the action as an argument. This will return a few things.

26
00:02:07,510 --> 00:02:10,100
First it returns the next state.

27
00:02:10,100 --> 00:02:14,480
Second it returns the reward for arriving in the next state.

28
00:02:14,480 --> 00:02:19,970
Third it returns a done flag to tell us whether or not the episode is over.

29
00:02:19,970 --> 00:02:25,370
Finally, it returns an info dictionary, which can tell us additional information about the environment.

30
00:02:27,300 --> 00:02:33,510
This one is not strictly necessary, and in fact it's empty for many environments, but for us, we are actually

31
00:02:33,510 --> 00:02:39,230
going to populate the info dictionary to tell us the current value of our portfolio.

32
00:02:39,240 --> 00:02:43,530
This isn't part of the state, but it can be calculated from the state variables.

33
00:02:44,100 --> 00:02:51,380
Thus it's easier to simply calculate it inside the environment and return it along with everything else.

34
00:02:51,390 --> 00:02:57,030
Finally, we assign the next state variable to the state variable, in case on the next step the

35
00:02:57,030 --> 00:02:59,640
agent needs to use the state to choose an action.
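
The loop just described can be sketched in a few lines of Python. This is only an illustration: `ToyEnv` is a hypothetical stand-in environment invented here (it is not the stock environment built later in the course), and the policy is a random choice. It follows the classic Gym convention where `step` returns four values; newer Gymnasium releases return five values from `step` and a pair from `reset`, so adjust accordingly if you use those.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment, used only to exercise the API."""

    def __init__(self, n_steps=5):
        self.n_steps = n_steps
        self.t = 0

    def reset(self):
        # Put the environment back at its starting position
        # and return the initial state.
        self.t = 0
        return [0.0]

    def step(self, action):
        # Perform the action and report what happened.
        self.t += 1
        next_state = [float(self.t)]
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.n_steps
        info = {"t": self.t}  # extra information, not part of the state
        return next_state, reward, done, info


def run_episode(env, choose_action):
    """Play one episode using the reset/step loop from the lecture."""
    done = False
    state = env.reset()
    total_reward = 0.0
    steps = 0
    info = {}
    while not done:
        action = choose_action(state)
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
        state = next_state  # the next decision uses the new state
    return total_reward, steps, info


# A random policy: valid, but most likely suboptimal.
total_reward, steps, info = run_episode(
    ToyEnv(n_steps=5), lambda state: random.choice([0, 1])
)
print(total_reward, steps, info)
```

Because the loop only talks to `reset` and `step`, the same `run_episode` function would work unchanged with any environment that follows this interface.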

36
00:03:04,810 --> 00:03:09,430
So that's pretty simple, and you'll find that in general, no matter what environment you are looking at,

37
00:03:09,790 --> 00:03:11,060
it's going to have an API

38
00:03:11,070 --> 00:03:18,460
just like what we saw. The questions we really want to answer now are: what should the state be, what

39
00:03:18,460 --> 00:03:21,920
should the action be, and what should the reward be?

40
00:03:21,940 --> 00:03:28,310
The reason we need to discuss these is because there are an endless number of possibilities and complications.

41
00:03:28,390 --> 00:03:34,120
We are necessarily going to have to simplify the problem a little bit, but first let me explain to

42
00:03:34,120 --> 00:03:36,820
you why this simplification is necessary.

43
00:03:41,940 --> 00:03:43,680
Let's start with the state.

44
00:03:43,680 --> 00:03:46,200
There are many things you could consider here.

45
00:03:46,200 --> 00:03:49,630
First you can think of it exactly like a time series problem.

46
00:03:49,710 --> 00:03:55,020
Look at the pattern of stock movements in the past and from that make a decision.

47
00:03:55,020 --> 00:04:00,060
That's probably the first thing you and I would do when we decide if we're going to buy or sell a stock.

48
00:04:00,870 --> 00:04:03,640
However there are other things to consider.

49
00:04:03,660 --> 00:04:11,480
We also have to ask: do we have enough cash to buy the stocks we want to buy? And given the prices of the existing

50
00:04:11,480 --> 00:04:12,580
shares I own,

51
00:04:12,770 --> 00:04:17,290
is it worth it to sell them in order to get more cash to buy a different stock?

52
00:04:18,110 --> 00:04:26,110
So in fact this can become a complex decision problem.

53
00:04:26,260 --> 00:04:31,150
We are going to borrow some ideas from a paper called "Practical Deep Reinforcement Learning Approach

54
00:04:31,210 --> 00:04:32,170
for Stock Trading".

55
00:04:33,840 --> 00:04:39,540
This approach used a more advanced reinforcement learning technique known as DDPG, but we can apply a

56
00:04:39,540 --> 00:04:42,060
few of the ideas they proposed.

57
00:04:42,060 --> 00:04:44,960
So here's how we're going to represent our state.

58
00:04:45,030 --> 00:04:50,800
It will consist of three parts. First, we're going to record how many shares of each stock we own.

59
00:04:51,400 --> 00:04:57,310
So, for example, if I'm looking at Apple, Motorola, and Starbucks, this means I own three shares of Apple,

60
00:04:57,580 --> 00:05:02,520
five shares of Motorola, and seven shares of Starbucks. Second,

61
00:05:02,530 --> 00:05:06,130
we're going to list out the current share price of each stock.

62
00:05:06,160 --> 00:05:09,630
So this means Apple is trading at fifty dollars per share.

63
00:05:09,640 --> 00:05:12,100
Motorola is trading at twenty dollars per share.

64
00:05:12,280 --> 00:05:15,560
And Starbucks is trading at thirty dollars per share.

65
00:05:15,800 --> 00:05:20,900
Finally, the last value of the state is how much pure cash we have.

66
00:05:20,900 --> 00:05:27,440
That's cash that's not invested in any stock, which just sits there and doesn't gain any interest. So

67
00:05:27,470 --> 00:05:34,880
let's say we have one hundred dollars in cash. Then our total state vector will be [3, 5, 7, 50, 20, 30,

68
00:05:34,920 --> 00:05:41,990
100]. You should be able to confirm that if we have N stocks, then the size of our state vector will

69
00:05:41,990 --> 00:05:43,340
be 2N + 1.
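
As a small illustration of the state layout described above, the following sketch builds the example vector with NumPy. The function name `make_state` and its argument names are hypothetical, chosen only for this example.

```python
import numpy as np


def make_state(shares_owned, share_prices, cash):
    """Concatenate shares owned, share prices, and cash into one vector.

    For N stocks the result has length 2N + 1.
    """
    shares = np.asarray(shares_owned, dtype=float)
    prices = np.asarray(share_prices, dtype=float)
    if shares.shape != prices.shape:
        raise ValueError("need one price per stock")
    return np.concatenate([shares, prices, [cash]])


# Apple, Motorola, Starbucks: the example from the lecture.
state = make_state([3, 5, 7], [50, 20, 30], 100)
print(state)       # the 7 values: shares, then prices, then cash
print(len(state))  # 2 * 3 + 1
```

Note that the raw values sit on very different scales (share counts, prices, cash), which is why normalizing the state before feeding it to a model can help.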

70
00:05:48,570 --> 00:05:49,130
Next,

71
00:05:49,170 --> 00:05:55,530
let's consider the actions. Again, if we consider the sheer number of possibilities, the action space would

72
00:05:55,530 --> 00:05:58,830
be extremely large. For any given stock,

73
00:05:58,830 --> 00:06:00,520
I have three possible options.

74
00:06:00,630 --> 00:06:02,850
I can sell, I can buy, or I can hold,

75
00:06:02,850 --> 00:06:04,720
which means do nothing.

76
00:06:04,740 --> 00:06:07,350
Now you might think three is not bad.

77
00:06:07,350 --> 00:06:12,230
But now remember we have three stocks to consider. For each of these,

78
00:06:12,270 --> 00:06:15,720
I can exercise any of the three options above.

79
00:06:15,750 --> 00:06:21,220
So that gives me 3 to the power 3 possible actions, or 27 actions.
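
One way to see where the count comes from is to enumerate every combination explicitly. The sketch below does this with `itertools.product`; the ordering of the choices and the name `build_action_list` are assumptions made for illustration only.

```python
import itertools

# The three per-stock options. The ordering here is an arbitrary choice.
CHOICES = ("sell", "hold", "buy")


def build_action_list(n_stocks):
    """Return every combination of per-stock choices (3 ** n_stocks of them)."""
    return list(itertools.product(CHOICES, repeat=n_stocks))


actions = build_action_list(3)
print(len(actions))  # 3 ** 3 combinations
print(actions[0])    # first combination: sell every stock
```

With a list like this, a single integer index can stand for a whole action vector, which is convenient when a model has to output one value per possible action.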

80
00:06:21,360 --> 00:06:28,050
For example, my action vector may be "sell, sell, sell", which means sell my Apple shares, sell my Motorola

81
00:06:28,050 --> 00:06:36,460
shares, and sell my Starbucks shares. Or it might be "buy, sell, hold", which means buy Apple shares, sell Motorola

82
00:06:36,460 --> 00:06:43,080
shares, and do nothing with my Starbucks shares. However, this is still not the end of the story, because

83
00:06:43,080 --> 00:06:46,940
this doesn't say anything about how many shares to buy or sell.

84
00:06:47,250 --> 00:06:53,160
If I own 10 shares of a stock, I can sell anywhere from 0 to 10 of those shares.

85
00:06:53,220 --> 00:06:56,110
Luckily we are going to simplify this problem a little bit.

86
00:07:01,120 --> 00:07:04,420
So here's how we're going to treat actions in our example.

87
00:07:04,420 --> 00:07:09,610
It's going to be extremely simplified compared to how things work in the real world but it's a decent

88
00:07:09,610 --> 00:07:10,000
start.

89
00:07:11,710 --> 00:07:14,770
First we're not going to consider any transaction costs.

90
00:07:14,830 --> 00:07:20,710
For example if you buy shares using your bank's investing platform usually that would cost you ten dollars

91
00:07:20,710 --> 00:07:21,490
or so.

92
00:07:21,730 --> 00:07:26,300
For us, it will be zero. Next, when we sell,

93
00:07:26,300 --> 00:07:28,700
we will always sell all of our shares of that stock.

94
00:07:29,600 --> 00:07:33,720
So let's say we own 10 shares of Apple stock and we decide to sell.

95
00:07:33,740 --> 00:07:41,400
That means we sell all 10 shares. Secondly, when we buy, we are going to buy as many shares as possible

96
00:07:41,700 --> 00:07:44,850
for the stock we choose to buy.

97
00:07:44,860 --> 00:07:50,400
Now you might wonder: if I choose multiple stocks to buy and I want to buy as many as possible,

98
00:07:50,410 --> 00:07:51,870
how can I do that?

99
00:07:51,880 --> 00:07:53,860
Well it's kind of ambiguous.

100
00:07:53,860 --> 00:07:59,630
You might think you want to choose the stocks in such a way that leads to using up as much cash as possible.

101
00:07:59,830 --> 00:08:03,810
But in fact this is actually a hard problem known as the knapsack problem.

102
00:08:05,840 --> 00:08:10,520
So what we're going to do is take a simple greedy approach: loop through every stock

103
00:08:10,550 --> 00:08:14,930
and buy one share of each stock, and keep doing that in a loop until we run out of money.

104
00:08:16,540 --> 00:08:22,330
Third, we will also sell the stocks we want to sell before we buy anything. That will leave us with more

105
00:08:22,330 --> 00:08:29,890
cash that we can use to buy new stocks. This may seem like a very cautious approach, but in fact this

106
00:08:29,890 --> 00:08:36,400
already leaves us with 27 possible actions, which means our neural network will have to approximate 27

107
00:08:36,490 --> 00:08:44,680
different values, which is pretty large already. And so an action in this environment is not just making

108
00:08:44,680 --> 00:08:54,720
a single trade but rather it will involve doing all the steps in the specified order.
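
To make the order of operations concrete, here is a minimal sketch of how one action vector might be applied under the simplifications above: no transaction costs, sell everything first, then buy one share at a time in a round-robin loop. The function name `apply_action` is hypothetical. Two assumptions are made here that the lecture does not state: the round-robin only visits stocks marked "buy", and all prices are positive (otherwise the loop would never end).

```python
def apply_action(shares, prices, cash, action):
    """Apply one action vector and return the new (shares, cash).

    `action` holds one of "sell", "hold", "buy" per stock.
    Assumes every price is strictly positive.
    """
    shares = list(shares)  # work on a copy, leave the caller's list alone

    # Step 1: sell first, so the proceeds are available for buying.
    for i, choice in enumerate(action):
        if choice == "sell":
            cash += shares[i] * prices[i]
            shares[i] = 0  # all or nothing: sell every share held

    # Step 2: greedy round-robin buying, one share per stock per pass,
    # repeated until no selected stock is affordable any more.
    buy_indices = [i for i, choice in enumerate(action) if choice == "buy"]
    bought_something = True
    while bought_something:
        bought_something = False
        for i in buy_indices:
            if cash >= prices[i]:
                shares[i] += 1
                cash -= prices[i]
                bought_something = True

    return shares, cash


# Buy Apple, sell Motorola, hold Starbucks, starting from the lecture's state.
new_shares, new_cash = apply_action([3, 5, 7], [50, 20, 30], 100,
                                    ("buy", "sell", "hold"))
print(new_shares, new_cash)
```

This greedy loop does not try to use up cash optimally, which is the knapsack problem mentioned earlier; it just spreads purchases across the chosen stocks.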

109
00:08:54,720 --> 00:08:56,810
Finally we have the reward.

110
00:08:56,850 --> 00:08:58,080
This one is simple.

111
00:08:58,260 --> 00:09:02,270
The reward will just be the change in the value of our portfolio.

112
00:09:02,340 --> 00:09:08,800
Now it's worth thinking about how we'll calculate the value of our portfolio. As an example, suppose

113
00:09:08,800 --> 00:09:14,140
we own 10 shares of Apple, 5 shares of Motorola, and 3 shares of Starbucks.

114
00:09:14,140 --> 00:09:19,840
The corresponding share prices are 50 dollars for Apple, 20 dollars for Motorola, and 30 dollars

115
00:09:19,840 --> 00:09:21,580
for Starbucks.

116
00:09:21,580 --> 00:09:27,130
Let's also suppose we have one hundred dollars in cash not invested in any stock.

117
00:09:27,130 --> 00:09:34,270
Then the total value of our portfolio will be 10 times 50, plus 5 times 20, plus 3 times 30, plus

118
00:09:34,270 --> 00:09:35,220
100.

119
00:09:35,380 --> 00:09:37,810
That's equal to 790 dollars.

120
00:09:42,990 --> 00:09:43,790
In general,

121
00:09:43,860 --> 00:09:49,290
if we store the shares we own in a vector called S, and we store the corresponding share prices in an

122
00:09:49,290 --> 00:09:55,650
array called P, and we store the amount of cash we have in a variable called C, then the total value of

123
00:09:55,650 --> 00:09:58,980
our portfolio can be calculated as follows.

124
00:09:59,280 --> 00:10:06,620
It is equal to the dot product of S and P, plus C. The reward will then just be the difference in portfolio

125
00:10:06,620 --> 00:10:07,170
value,

126
00:10:07,250 --> 00:10:10,610
comparing the most recent time step and the previous time step.
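
The portfolio value and reward can be checked with a few lines of NumPy. The function name `portfolio_value` is hypothetical, and the "current" prices in the second call are made-up numbers used only to show a non-zero reward.

```python
import numpy as np


def portfolio_value(shares, prices, cash):
    """Value = dot(S, P) + C, with S shares owned, P prices, C cash."""
    return float(np.dot(shares, prices) + cash)


# The lecture's example: 10 * 50 + 5 * 20 + 3 * 30 + 100.
prev_value = portfolio_value([10, 5, 3], [50, 20, 30], 100)

# Hypothetical prices one step later (invented for illustration).
cur_value = portfolio_value([10, 5, 3], [52, 19, 30], 100)

# The reward is the change in portfolio value between the two steps.
reward = cur_value - prev_value
print(prev_value, cur_value, reward)
```

Since the value depends only on quantities already in the state, it is cheap to compute inside the environment and return through the info dictionary.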

127
00:10:15,880 --> 00:10:21,600
To summarize this lecture, let's recap the important points about the environment and its implementation.

128
00:10:22,610 --> 00:10:29,090
First, our environment will be an object that mimics the OpenAI Gym API, so it will have functions like

129
00:10:29,090 --> 00:10:35,880
reset and step, which return all the information we need to implement our reinforcement learning program.

130
00:10:36,140 --> 00:10:44,190
Next, for our environment, we'll be considering three stocks: Apple, Motorola, and Starbucks. Next,

131
00:10:44,190 --> 00:10:47,570
our state is a vector with three pieces of information.

132
00:10:47,810 --> 00:10:52,510
First, it includes the number of shares we own of each stock that we want to consider.

133
00:10:52,510 --> 00:10:56,740
Second it includes the share price for each of those stocks.

134
00:10:56,740 --> 00:11:02,320
Third it includes the amount of cash we have that's not invested in any stock.

135
00:11:02,620 --> 00:11:10,030
Next our actions are a simplified subset of the large number of actions we can perform in the real world.

136
00:11:10,030 --> 00:11:12,970
We also assume there are no transaction costs.

137
00:11:12,970 --> 00:11:18,280
Simply put, we have three options for each stock: buy, sell, or hold.

138
00:11:18,280 --> 00:11:23,500
We'll take an all or nothing approach where if we buy we're going to buy as many shares as possible

139
00:11:23,830 --> 00:11:26,910
and if we sell we're going to sell all of the shares we own.

140
00:11:28,400 --> 00:11:32,570
And these actions can be applied in any combination for each stock we own.

141
00:11:33,350 --> 00:11:41,050
So if we're considering three stocks, then we'll have 3 to the power 3 possible actions. You'll

142
00:11:41,050 --> 00:11:46,870
notice that even with just three stocks and a much simplified action space, we still have quite a large

143
00:11:46,870 --> 00:11:53,270
number of actions. So if we have N stocks, we would have 3 to the power N possible actions.

144
00:11:53,270 --> 00:11:55,910
It grows exponentially with the number of stocks we own.

145
00:11:56,060 --> 00:11:59,420
So encoding the actions in this way will not scale

146
00:11:59,420 --> 00:12:07,420
if we have many stocks to consider. Finally, the reward is just the change in value of our portfolio from

147
00:12:07,420 --> 00:12:10,000
the previous step to the current step.

148
00:12:10,000 --> 00:12:13,900
The value of our portfolio is just the price of each stock we own

149
00:12:13,900 --> 00:12:17,560
times the number of shares we own, summed over all stocks, plus any uninvested cash we have.
