1
00:00:11,660 --> 00:00:17,600
In this lecture we are going to discuss how to implement a replay memory or replay buffer efficiently

2
00:00:18,470 --> 00:00:20,330
to understand why we would want to do this.

3
00:00:20,360 --> 00:00:26,690
Let's first consider how we might implement a replay buffer using a naive programming approach.

4
00:00:26,690 --> 00:00:31,030
Once we've done that we'll discuss why it's not ideal and then move on to a better version.

5
00:00:36,270 --> 00:00:40,040
In essence a replay buffer is simply a python list.

6
00:00:40,350 --> 00:00:46,610
Inside this list we are going to store tuples with five elements each SARS Prime and the Dunn flag.

7
00:00:47,400 --> 00:00:54,030
So as our agent takes a step and encounters a new tuple it will just store this tuple in our list.

8
00:00:54,030 --> 00:00:56,990
The buffer of course has a maximum size.

9
00:00:57,090 --> 00:01:01,290
So once we've hit the size limit we're going to throw out the old its value.

10
00:01:01,290 --> 00:01:02,740
How do we do that.

11
00:01:02,760 --> 00:01:07,360
Well we can use the python function pop and passing the argument 0.

12
00:01:07,530 --> 00:01:10,290
This means remove the element at index 0

13
00:01:15,450 --> 00:01:16,160
next.

14
00:01:16,170 --> 00:01:19,620
Let's consider how we will use the buffer during training.

15
00:01:19,620 --> 00:01:24,570
What we would like to do is grab a batch of random samples from our buffer.

16
00:01:24,570 --> 00:01:30,750
From this we can calculate input target pairs and then do one step of gradient descent.

17
00:01:30,750 --> 00:01:36,450
Remember that the target will be calculated as our plus gamma times the mass over a prime of Q of vast

18
00:01:36,450 --> 00:01:41,970
prime a prime and this only applies if as prime is not a terminal state.

19
00:01:42,230 --> 00:01:45,420
Otherwise the target is just the reward by itself.

20
00:01:45,440 --> 00:01:50,060
In other words we're going to do some kind of calculation over the batch of data we just sampled.

21
00:01:55,220 --> 00:01:57,010
Here's how you might do this.

22
00:01:57,020 --> 00:01:59,630
Start with an empty array of targets.

23
00:01:59,750 --> 00:02:06,080
Then loop through each of the tuples you sampled for each iteration check whether the done flag is true

24
00:02:06,080 --> 00:02:07,190
or not.

25
00:02:07,190 --> 00:02:09,440
If it's true then the target is just ah.

26
00:02:09,800 --> 00:02:13,070
Otherwise it's just the formula we saw previously.

27
00:02:13,280 --> 00:02:16,910
Of course as you may know we don't like doing for loops in Python.

28
00:02:16,940 --> 00:02:22,880
If we can avoid for loops we'll be slower than doing batch operations and num PI.

29
00:02:22,910 --> 00:02:28,400
It's true that you could gather the batches together and cast them into an umpire race but that means

30
00:02:28,400 --> 00:02:31,040
you still had to store a bunch of lists not ideal

31
00:02:36,230 --> 00:02:41,300
here's another problem with continually appending a new tuples and popping old ones.

32
00:02:41,360 --> 00:02:47,060
It may seem like this will take up only a constant amount of memory since the memory storing the old

33
00:02:47,120 --> 00:02:50,490
removed values should be freed by the garbage constructed.

34
00:02:51,200 --> 00:02:58,250
Unfortunately python will not do this and you will end up with a memory leak and a very slow computer.

35
00:02:58,250 --> 00:03:03,380
This probably won't be an issue with a modestly sized dataset like the one we have but it's good to

36
00:03:03,380 --> 00:03:09,560
be aware of the kinds of problems that may arise when you want to graduate to more advanced applications.

37
00:03:14,690 --> 00:03:20,430
So what's the solution the solution is to implement your own replay buffer.

38
00:03:20,430 --> 00:03:23,640
Here's a high level view of how it'll work.

39
00:03:23,670 --> 00:03:29,160
First we're going to pre allocate the arrays we will use to store the states actions rewards and so

40
00:03:29,160 --> 00:03:30,420
on.

41
00:03:30,420 --> 00:03:33,600
So this will really take up a constant amount of RAM.

42
00:03:33,660 --> 00:03:37,540
We will never allocate new arrays nor try to remove any arrays.

43
00:03:37,560 --> 00:03:42,140
All we will do is populate our pre allocated arrays.

44
00:03:42,150 --> 00:03:48,620
Second we'll use a pointer which will tell us the current insertion location of our transition tuples.

45
00:03:48,840 --> 00:03:55,510
So every time we encounter a new tuple we'll store it in the next available location in the array.

46
00:03:55,520 --> 00:03:57,860
Now what happens when the array is full.

47
00:03:58,030 --> 00:04:00,350
While all this value is at the beginning.

48
00:04:00,350 --> 00:04:05,750
So that means we can simply overwrite it in this way our buffer is circular.

49
00:04:05,750 --> 00:04:09,680
When we get to the end we just go back to the beginning.

50
00:04:09,950 --> 00:04:15,930
In this example you see here we can see that the pointer shows us the value at the current time t.

51
00:04:15,980 --> 00:04:18,490
But before that we have the value at the previous time.

52
00:04:18,500 --> 00:04:20,060
T minus 1.

53
00:04:20,060 --> 00:04:22,540
This goes back to T minus 2 T minus 3.

54
00:04:22,760 --> 00:04:24,740
But whereas T minus 4.

55
00:04:24,740 --> 00:04:30,970
In fact it's at the end of the array because when we store the value for T minus 4 we were at the end.

56
00:04:31,040 --> 00:04:36,980
So the next value T minus 3 would have been stored at the beginning since we had to circle back to replace

57
00:04:36,980 --> 00:04:41,040
the all this value on the next step.

58
00:04:41,130 --> 00:04:44,460
We would replace the current all this value which is at T minus nine

59
00:04:49,670 --> 00:04:50,500
one small detail.

60
00:04:50,510 --> 00:04:55,080
We have to think about is what happens when the array is not yet full.

61
00:04:55,100 --> 00:04:59,190
In this case we will still need to keep track of the current size of our buffer.

62
00:04:59,390 --> 00:05:06,710
For example the pre allocated array may be of size 10 but we will only have store 5 values than the

63
00:05:06,860 --> 00:05:09,480
effective size of the buffer will be 5.

64
00:05:09,860 --> 00:05:14,360
When we sample a batch from this buffer we can only sample values from zero up to 5

65
00:05:19,650 --> 00:05:25,020
so that's it for this lecture instead of using the naive approach of storing our transitions in a list

66
00:05:25,350 --> 00:05:28,200
then sampling from the list to get yet another list.

67
00:05:28,320 --> 00:05:34,500
We will instead use a fixed size array and store transitions in this fixed size array in a circular

68
00:05:34,500 --> 00:05:35,040
fashion.