1
00:00:11,620 --> 00:00:17,050
In this lecture, we are going to finish looking at the rest of this script and take a look at the results.

2
00:00:17,950 --> 00:00:23,530
First, we have the play_one_episode function, which is pretty much exactly the same as what we discussed

3
00:00:23,530 --> 00:00:26,010
in the theory. Outside the loop,

4
00:00:26,020 --> 00:00:29,980
we reset the environment, grab the initial state and transform it.

5
00:00:30,820 --> 00:00:33,190
Next, we initialize done to False.

6
00:00:35,810 --> 00:00:39,530
Next, we enter a loop which exits only when done becomes True.

7
00:00:40,520 --> 00:00:43,940
Next, we use the agent to determine the next action.

8
00:00:44,750 --> 00:00:47,110
Then we perform the action in the environment.

9
00:00:48,020 --> 00:00:49,340
We get back the next state,

10
00:00:49,340 --> 00:00:51,750
reward, done flag and info dictionary.

11
00:00:52,700 --> 00:00:54,800
Next, we scale the next state.

12
00:00:56,250 --> 00:01:02,670
Next, we check if we are in train mode. If we are, then we add the latest transition to our replay

13
00:01:02,670 --> 00:01:07,470
buffer and we call the replay function to run one step of gradient descent.

14
00:01:08,690 --> 00:01:13,610
Next, we set the state variable to the next state for the next iteration of the loop.

15
00:01:14,970 --> 00:01:19,080
Lastly, when we exit the loop, we return the current value of the portfolio.
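As a rough sketch of the loop we just walked through: the ToyEnv and RandomAgent classes, the method names act, update_replay_memory and replay, and the "cur_val" info key are all assumptions made up for illustration, not the classes from the actual script. The scale argument stands in for the scaler's transform.

```python
import random


class ToyEnv:
    """Hypothetical single-stock environment; actions: 0 = sell, 1 = buy, 2 = hold."""

    def __init__(self, prices, initial_investment=20000.0):
        self.prices = prices
        self.initial_investment = initial_investment

    def reset(self):
        self.t = 0
        self.cash = self.initial_investment
        self.shares = 0.0
        return self._state()

    def _state(self):
        return [self.shares, self.prices[self.t], self.cash]

    def _value(self):
        return self.cash + self.shares * self.prices[self.t]

    def step(self, action):
        prev_value = self._value()
        price = self.prices[self.t]
        if action == 1 and self.cash > 0:      # buy with all available cash
            self.shares += self.cash / price
            self.cash = 0.0
        elif action == 0 and self.shares > 0:  # sell everything
            self.cash += self.shares * price
            self.shares = 0.0
        self.t += 1                            # move to the next day
        cur_value = self._value()
        done = self.t == len(self.prices) - 1
        return self._state(), cur_value - prev_value, done, {"cur_val": cur_value}


class RandomAgent:
    """Hypothetical agent that acts randomly; a real DQN agent would learn in replay()."""

    def __init__(self, n_actions=3, seed=0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        self.memory = []

    def act(self, state):
        return self.rng.randrange(self.n_actions)

    def update_replay_memory(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        pass  # placeholder: one step of gradient descent would go here


def play_one_episode(agent, env, scale, is_train, batch_size=32):
    # Outside the loop: reset, grab the initial state and transform it.
    state = scale(env.reset())
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward, done, info = env.step(action)
        next_state = scale(next_state)
        if is_train:
            # Store the latest transition and run one training step.
            agent.update_replay_memory(state, action, reward, next_state, done)
            agent.replay(batch_size)
        state = next_state
    # Return the portfolio value at the end of the episode.
    return info["cur_val"]
```

Calling play_one_episode(RandomAgent(), ToyEnv([10.0, 11.0, 12.0, 13.0]), lambda s: s, is_train=True) plays three steps and returns the final portfolio value.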

16
00:01:23,850 --> 00:01:30,990
Next, we have the main section. First, there are a few configuration variables: models_folder tells

17
00:01:30,990 --> 00:01:32,520
us where we will save our models.

18
00:01:33,030 --> 00:01:38,400
rewards_folder tells us where we will store our rewards from both the training and testing phases.

19
00:01:39,360 --> 00:01:41,730
The number of episodes to run is 2000.

20
00:01:42,270 --> 00:01:48,120
The batch size for sampling from the replay memory is 32 and our initial investment is twenty

21
00:01:48,120 --> 00:01:48,750
thousand.

22
00:01:52,360 --> 00:01:57,830
Next, we create an ArgumentParser object so that we can run the script with command line arguments.

23
00:01:58,420 --> 00:02:02,650
We'll have one argument, the mode, where we can pass in train or test.
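A minimal sketch of that parser might look like the following. The -m flag matches how the script is run later in the lecture; the long name --mode, the choices restriction and the parse_args helper are assumptions added for illustration.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # One required argument: the mode, either "train" or "test".
    parser.add_argument(
        "-m", "--mode",
        type=str,
        required=True,
        choices=["train", "test"],  # assumption: reject anything else up front
        help='either "train" or "test"',
    )
    return parser.parse_args(argv)


# Passing a list lets us try the parser without touching the real command line.
args = parse_args(["-m", "train"])
print(args.mode)
```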

24
00:02:07,890 --> 00:02:13,260
Next, we create the model directory and rewards directory in the case where they do not yet exist.

25
00:02:14,790 --> 00:02:18,110
Next, we call the get_data function to get our time series.

26
00:02:18,810 --> 00:02:23,040
We call the shape attribute to get the number of time steps and the number of stocks.

27
00:02:24,500 --> 00:02:30,410
Next, we split the data into train and test: the first half is train and the second half is test.

28
00:02:33,980 --> 00:02:37,790
Next, we create an instance of our environment object with the training data.

29
00:02:38,480 --> 00:02:43,630
After that, we get the dimensionality of the state and the dimensionality of the action space.

30
00:02:44,750 --> 00:02:48,140
We pass these into the agent constructor to get the agent.

31
00:02:49,230 --> 00:02:52,140
We also call get_scaler to get the scaler object.

32
00:02:55,480 --> 00:03:00,720
Next, we initialize an empty list, which will store the portfolio values for each episode we play.

33
00:03:07,600 --> 00:03:13,180
Next, we have an important if statement. This is going to overwrite some of the things we just created

34
00:03:13,510 --> 00:03:15,540
in the case where we are in test mode.

35
00:03:16,210 --> 00:03:21,580
So if we're in test mode, we want to use the scaler that we had during training so that the neural

36
00:03:21,580 --> 00:03:26,320
network corresponds with the same scaler and we don't accidentally use a different scaler.

37
00:03:30,080 --> 00:03:35,210
Next, we recreate the environment with the test data rather than the training data.

38
00:03:36,140 --> 00:03:39,500
Also, we have to make sure to set Epsilon to a small value.

39
00:03:39,860 --> 00:03:43,980
Otherwise, its default value is one, which is just pure exploration.

40
00:03:45,080 --> 00:03:51,740
Note that if you set it to zero, there is no point in running multiple episodes because both the data

41
00:03:51,740 --> 00:03:53,720
and our agent will be deterministic.

42
00:03:54,110 --> 00:03:55,910
So you'll get the same result each time.

43
00:03:57,600 --> 00:04:02,340
Finally, we have to make sure we load up the weights that we saved during training. We call

44
00:04:02,490 --> 00:04:04,500
agent.load and pass in those weights.

45
00:04:12,430 --> 00:04:17,840
Next, we enter a loop to play our episodes, num_episodes times. Inside the loop,

46
00:04:17,860 --> 00:04:22,660
we grab the current time since we would like to know the duration of each loop iteration.

47
00:04:23,920 --> 00:04:28,980
Next, we play one episode and receive the portfolio value at the end of the episode.

48
00:04:29,890 --> 00:04:35,140
Then we get the current time and subtract the start time to get the duration of the episode.

49
00:04:36,820 --> 00:04:42,220
Next, we print out all this information: the episode number, the portfolio value and the episode

50
00:04:42,220 --> 00:04:42,820
duration.

51
00:04:44,820 --> 00:04:49,110
Next, we append the portfolio value to our list of portfolio values.
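The episode loop can be sketched roughly like this. The run_episodes helper and its play_episode callback are assumptions for illustration; in the script the loop body calls the play_one_episode function directly.

```python
from datetime import datetime


def run_episodes(play_episode, num_episodes):
    portfolio_values = []
    for e in range(num_episodes):
        t0 = datetime.now()           # start time of this iteration
        val = play_episode()          # portfolio value at the end of the episode
        dt = datetime.now() - t0      # duration of the episode
        print(f"episode: {e + 1}/{num_episodes}, "
              f"episode end value: {val:.2f}, duration: {dt}")
        portfolio_values.append(val)
    return portfolio_values


# Usage with a dummy episode that always ends at the initial investment.
values = run_episodes(lambda: 20000.0, num_episodes=3)
```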

52
00:04:53,200 --> 00:04:59,190
Next, we check if the mode is train. At this point, we are finished playing all of our episodes,

53
00:04:59,680 --> 00:05:04,900
so if the mode is train, then we'll save our neural network to a file called dqn.h5.

54
00:05:04,900 --> 00:05:09,550
We'll also save the scaler to scaler.pkl.

55
00:05:11,250 --> 00:05:17,730
Lastly, for both train and test, we'll save the rewards we got, so the train data will be saved in

56
00:05:17,880 --> 00:05:21,480
train.npy and the test data will be saved in test.npy.

57
00:05:26,030 --> 00:05:30,710
All right, so now we're going to look at another script, which is just for convenience, for plotting

58
00:05:30,710 --> 00:05:34,290
the rewards we saved, just so you don't have to write this yourself.

59
00:05:35,000 --> 00:05:36,230
This can be found in

60
00:05:36,230 --> 00:05:37,790
plot_rl_rewards.py.

61
00:05:38,480 --> 00:05:43,700
Basically, it loads up the rewards, which you can switch to train or test depending on what you want

62
00:05:43,700 --> 00:05:44,180
to plot.

63
00:05:44,690 --> 00:05:48,450
And then it prints the average reward, the min and the max.

64
00:05:49,100 --> 00:05:51,550
It also plots a histogram of the reward.

65
00:05:52,220 --> 00:05:55,580
So we know the distribution of rewards over each episode.

66
00:06:00,560 --> 00:06:05,330
Next, I'm going to show you how to run this. When you want to run this in train mode and plot the

67
00:06:05,330 --> 00:06:12,860
train results, you're going to run python rl_trader.py -m train and then you can do python

68
00:06:13,400 --> 00:06:15,710
plot_rl_rewards.py -m train.

69
00:06:16,730 --> 00:06:19,240
When you want to test, you just switch the mode to test.

70
00:06:19,250 --> 00:06:26,780
So you do python rl_trader.py -m test and python plot_rl_rewards.py -m

71
00:06:26,780 --> 00:06:27,380
test.

72
00:06:32,590 --> 00:06:35,170
So here are the results for training.

73
00:06:35,200 --> 00:06:41,050
The rewards are pretty good. We can even get up to a 5x increase on our original investment, which

74
00:06:41,050 --> 00:06:42,100
is extremely high.

75
00:06:48,300 --> 00:06:54,570
For testing, they are also pretty good. Our original investment usually increases by a significant

76
00:06:54,570 --> 00:06:54,980
amount.

77
00:06:56,530 --> 00:07:02,890
By the way, remember that our data set is from daily close prices over a span of five years. Since

78
00:07:02,890 --> 00:07:05,090
the test period is half of the data set,

79
00:07:05,530 --> 00:07:09,420
that means this increase is over a span of about 2.5 years.

80
00:07:09,970 --> 00:07:16,450
So you can take this profit and calculate the effective yearly return to get a better idea of how this

81
00:07:16,450 --> 00:07:20,080
agent performs compared to, say, the S&P 500.
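One way to do that calculation is with the standard compound growth formula. The annualized_return helper and the ending value of 40,000 below are purely illustrative assumptions, not results from the lecture; only the 20,000 initial investment and the 2.5 year span come from the script.

```python
def annualized_return(initial_value, final_value, years):
    """Effective yearly return implied by growing initial_value to final_value."""
    return (final_value / initial_value) ** (1.0 / years) - 1.0


# Hypothetical example: the portfolio doubles over the 2.5 year test period.
r = annualized_return(20000.0, 40000.0, 2.5)
print(f"effective yearly return: {r:.2%}")
```

Plug in the average ending portfolio value from your own test run to get a number you can compare against a benchmark.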