1
00:00:11,670 --> 00:00:15,890
In this lecture I'm going to introduce you to the next project of this course.

2
00:00:16,050 --> 00:00:21,960
Creating a reinforcement learning trading bot. Usually when people think about applying machine learning

3
00:00:21,960 --> 00:00:28,230
to the stock market, they usually think about it in terms of predicting the value of a stock, which includes

4
00:00:28,260 --> 00:00:32,820
even just the direction like whether it will go up tomorrow or down tomorrow.

5
00:00:32,910 --> 00:00:37,130
Of course that information by itself doesn't do anything physically.

6
00:00:37,320 --> 00:00:41,280
You still need to sit down at your computer and make a trade.

7
00:00:41,340 --> 00:00:45,730
If we're talking about automated high-frequency trading, then that's a different story.

8
00:00:45,840 --> 00:00:50,950
Even so, let's say your model predicts that the stocks you are looking at will go up tomorrow.

9
00:00:50,970 --> 00:00:52,770
Does that mean you'll make a trade?

10
00:00:52,770 --> 00:00:53,600
Maybe.

11
00:00:53,610 --> 00:00:55,830
But what if you're busy and you forget?

12
00:00:55,830 --> 00:01:01,470
Or what if you believe it's going to go up only slightly and then decrease rapidly? Then probably you

13
00:01:01,470 --> 00:01:03,080
don't want to buy that stock.

14
00:01:04,970 --> 00:01:10,550
This is the key difference between traditional supervised and unsupervised learning versus reinforcement

15
00:01:10,550 --> 00:01:14,070
learning. Supervised learning only makes a prediction.

16
00:01:14,120 --> 00:01:17,750
It doesn't actually take any action based on that prediction.

17
00:01:17,900 --> 00:01:23,270
I can predict tomorrow's stock price but I still have to choose whether I will act on that information

18
00:01:23,270 --> 00:01:24,590
or not.

19
00:01:24,620 --> 00:01:30,290
Reinforcement learning, on the other hand, not only makes predictions but also takes actions in the environment

20
00:01:30,440 --> 00:01:31,340
that you provide.

21
00:01:32,120 --> 00:01:37,520
So in this section of the course we are going to study how a reinforcement learning algorithm might

22
00:01:37,520 --> 00:01:40,460
take such actions in a stock trading environment.

23
00:01:45,610 --> 00:01:49,240
Let's just go over a rough outline of how this is going to work.

24
00:01:49,330 --> 00:01:55,570
Currently you probably just think of stock prices as a simple time series data set: at this time it has

25
00:01:55,570 --> 00:01:59,280
this value at the next time it has this other value and so on.

26
00:01:59,470 --> 00:02:04,590
That sounds more appropriate for a recurrent neural network rather than reinforcement learning.

27
00:02:04,720 --> 00:02:12,180
So what makes this a reinforcement learning problem?

28
00:02:12,190 --> 00:02:14,440
Well it's a matter of perspective.

29
00:02:14,500 --> 00:02:21,400
Imagine, let's say, you're hooked up to a stock trading API. Using this API you can call functions which

30
00:02:21,400 --> 00:02:23,680
do real world financial transactions.

31
00:02:24,250 --> 00:02:32,020
So if I call the buy function and pass in the arguments GOOG and 10, that means I'm going to buy 10 shares

32
00:02:32,020 --> 00:02:33,430
of Google stock.

33
00:02:33,760 --> 00:02:40,180
If each share is worth fifty dollars that means five hundred dollars is going to be deducted from my

34
00:02:40,180 --> 00:02:44,230
bank account and instead I will now own 10 shares of Google.

35
00:02:49,280 --> 00:02:54,680
Let's say I call the sell function and I pass in the argument AAPL with the number five.

36
00:02:54,680 --> 00:02:58,100
That means I just sold five shares of Apple stock.

37
00:02:58,100 --> 00:03:03,770
If one share of Apple is worth thirty dollars then I will now receive one hundred fifty dollars in my

38
00:03:03,770 --> 00:03:13,900
bank account and I will own five less shares of Apple stock than I did before.

39
00:03:13,920 --> 00:03:19,080
Importantly, you can see that the act of calling these functions is an action. You might think of your

40
00:03:19,080 --> 00:03:22,860
state as information such as recent stock prices,

41
00:03:22,950 --> 00:03:25,220
how much cash you have to buy more stocks,

42
00:03:25,320 --> 00:03:30,210
how many shares of each stock you own, the values of those stocks, and so on.

43
00:03:30,210 --> 00:03:33,230
The environment is the actual stock market.

44
00:03:33,420 --> 00:03:38,820
It's inherently random because you can't really predict what's going to happen to tomorrow's stock price

45
00:03:39,540 --> 00:03:43,230
but your actions affect the environment.

46
00:03:43,230 --> 00:03:49,020
In other words these are all the ingredients we need to specify our problem as a reinforcement learning

47
00:03:49,020 --> 00:03:50,190
problem.

48
00:03:50,760 --> 00:03:57,300
We can perform actions such as buy and sell in the environment and our state is made up of information

49
00:03:57,300 --> 00:04:03,200
about various stocks in our own portfolio and the environment is the stock market itself.

50
00:04:03,240 --> 00:04:06,540
The reward is some function of the money we made or lost.

51
00:04:11,760 --> 00:04:13,010
Something useful to try,

52
00:04:13,410 --> 00:04:19,500
which is probably something many of you have done already, is to think about how you yourself are a reinforcement

53
00:04:19,500 --> 00:04:20,640
learning agent.

54
00:04:21,030 --> 00:04:26,430
When you are looking at a stock and trying to decide whether or not to purchase some shares you generally

55
00:04:26,430 --> 00:04:29,220
want to follow the rule: buy low, sell high.

56
00:04:29,940 --> 00:04:33,270
So for example here we can see a dip in value.

57
00:04:33,270 --> 00:04:35,010
This would be a really good time to buy.

58
00:04:35,970 --> 00:04:37,310
And here we see a peak.

59
00:04:37,320 --> 00:04:39,640
This would be a really good time to sell. Well,

60
00:04:39,660 --> 00:04:41,460
only if you need the money.

61
00:04:41,460 --> 00:04:45,590
Hopefully you are investing in something where the general trend is always going up.

62
00:04:45,630 --> 00:04:50,790
So if you don't need the money then the best thing to do is just let it sit and continue to increase

63
00:04:50,790 --> 00:04:52,190
in value.

64
00:04:52,440 --> 00:04:58,470
Of course the problem is that in the real world you are trying to make these decisions without knowledge

65
00:04:58,470 --> 00:04:59,820
of the future.

66
00:04:59,820 --> 00:05:03,850
How do you know if the most recent price is a dip or a peak?

67
00:05:03,930 --> 00:05:05,430
In fact, we do not.

68
00:05:05,430 --> 00:05:08,910
And so perhaps this is a job best left for the machines.