1
00:00:11,680 --> 00:00:16,370
In this lecture, we are going to use an RNN for time series prediction.

2
00:00:16,450 --> 00:00:22,710
The previous lecture looked at how to use an autoregressive linear model for time series prediction.

3
00:00:22,750 --> 00:00:30,070
So in this lecture, our main goal is to ask the question: does an RNN do better? This lecture is going

4
00:00:30,070 --> 00:00:32,840
to walk you through a prepared Colab notebook.

5
00:00:32,980 --> 00:00:38,740
Although a very good exercise, which I always recommend, is, once you know how this is done, to try and

6
00:00:38,740 --> 00:00:42,740
recreate it yourself with as few references as possible.

7
00:00:43,030 --> 00:00:48,340
As usual you can look at the title of the notebook to determine what notebook we are currently looking

8
00:00:48,340 --> 00:00:48,580
at.

9
00:00:51,700 --> 00:00:55,650
As you know most of this script is the same as what you saw previously.

10
00:00:55,690 --> 00:01:03,260
So first we do all our usual imports.

11
00:01:03,650 --> 00:01:07,910
Next, we create our data, which is a sine wave with some optional noise.

12
00:01:13,040 --> 00:01:15,200
So here's the sine wave without noise.

13
00:01:20,700 --> 00:01:26,160
As with the last lecture, we are going to run this notebook twice, once without noise and once with noise,

14
00:01:26,220 --> 00:01:28,020
to see how the RNN performs.

15
00:01:32,130 --> 00:01:34,640
Next, we're going to create our dataset.

16
00:01:34,650 --> 00:01:40,410
This turns our problem into a supervised learning format, where we have an N by T by D input and an

17
00:01:40,410 --> 00:01:41,640
N-length target.

18
00:01:43,940 --> 00:01:49,160
So I assume you already understand this loop since it's the same as the last script.

19
00:01:49,160 --> 00:01:55,050
The only difference here is that at the end we need to reshape X to be of shape N by T by D,

20
00:01:55,130 --> 00:01:58,260
where D is one. In order to do this,

21
00:01:58,270 --> 00:02:04,420
we can call reshape(-1, T, 1). So minus one acts as a wildcard here.

22
00:02:04,420 --> 00:02:09,670
It basically means: make it whatever size is necessary so that the second and third dimensions are

23
00:02:09,670 --> 00:02:17,690
T and 1.
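As a rough sketch of the dataset step just described, the loop and the final reshape might look like the following. The window length T = 10, the series length, and the sine frequency are assumptions for illustration; the lecture does not state them.

```python
import numpy as np

# Assumed values for illustration; the lecture does not state them.
T = 10                                  # number of past values per input window
series = np.sin(0.1 * np.arange(200))   # a noiseless sine wave

X, Y = [], []
for t in range(len(series) - T):
    X.append(series[t:t + T])   # T consecutive values form one input sequence
    Y.append(series[t + T])     # the value right after the window is the target

# -1 is the wildcard: NumPy infers N so that the last two dimensions are T and 1
X = np.array(X).reshape(-1, T, 1)   # shape N x T x D, with D = 1
Y = np.array(Y)                     # N-length target
N = len(X)
```

With these assumed values, N works out to 200 - 10 = 190 windows.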

24
00:02:17,750 --> 00:02:19,820
Next, we're going to set the device.

25
00:02:26,860 --> 00:02:29,590
Next, we're going to build our RNN model.

26
00:02:29,620 --> 00:02:34,410
Luckily, we saw this in the last lecture, so this will just be a review. First,

27
00:02:34,430 --> 00:02:39,320
our SimpleRNN will inherit from nn.Module. Inside the constructor,

28
00:02:39,340 --> 00:02:44,740
we're going to take in several arguments: the number of input features, the number of hidden units, the

29
00:02:44,740 --> 00:02:52,420
number of RNN layers, and the number of outputs. First, we assign these arguments to instance variables

30
00:02:52,650 --> 00:02:57,260
D, M, K, and L, as per our conventions in this course.

31
00:02:57,400 --> 00:02:59,620
Next, we instantiate the RNN object.

32
00:03:03,190 --> 00:03:09,430
For the input size we pass in D, for the hidden size we pass in M, for the number of layers we pass in

33
00:03:09,430 --> 00:03:15,260
L, for the nonlinearity we pass in ReLU, and we also want to set the argument batch_first

34
00:03:15,260 --> 00:03:16,600
equal to True.

35
00:03:16,600 --> 00:03:23,100
This is to make sure the RNN module uses the convention that the N dimension comes first.

36
00:03:23,110 --> 00:03:33,360
Next, we have our final dense layer of size M by K.

37
00:03:33,570 --> 00:03:36,360
Next, we have the forward function. As input,

38
00:03:36,360 --> 00:03:40,610
we take in X, a batch of sequences of shape N by T by D.

39
00:03:41,010 --> 00:03:46,770
Inside the function, we start by creating the initial hidden state h0, a tensor of zeros of shape

40
00:03:46,770 --> 00:03:48,480
L by N by M.

41
00:03:48,720 --> 00:03:53,700
This is because we'll have an initial hidden state for each RNN layer, each sample, and each

42
00:03:53,700 --> 00:03:55,490
hidden feature.

43
00:03:55,620 --> 00:04:02,550
Next, we pass X and h0 into the RNN module, which returns us two sets of hidden states.

44
00:04:02,550 --> 00:04:07,940
We typically only want the first one since it's indexed by time rather than indexed by the hidden layer.

45
00:04:09,740 --> 00:04:15,260
In this script, we're building a many-to-one RNN, and in this case we only need the hidden state at

46
00:04:15,260 --> 00:04:18,410
the final time step. In order to obtain it,

47
00:04:18,440 --> 00:04:24,410
we're going to index the RNN output by minus 1 at the second dimension, which gives us back an array

48
00:04:24,410 --> 00:04:32,900
of size N by M. Finally, we pass this N by M value into our final dense layer, which gives us an N

49
00:04:32,900 --> 00:04:43,530
by K output.
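The model described above could be sketched roughly as follows. This is a reconstruction from the spoken description, not the notebook's exact code: the constructor argument names and the hidden size of 5 are assumptions.

```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    # Argument names are hypothetical; the lecture only describes their meaning.
    def __init__(self, n_inputs, n_hidden, n_rnnlayers, n_outputs):
        super().__init__()
        self.D = n_inputs      # number of input features
        self.M = n_hidden      # number of hidden units
        self.K = n_outputs     # number of outputs
        self.L = n_rnnlayers   # number of RNN layers
        self.rnn = nn.RNN(
            input_size=self.D,
            hidden_size=self.M,
            num_layers=self.L,
            nonlinearity='relu',
            batch_first=True)  # inputs are N x T x D rather than T x N x D
        self.fc = nn.Linear(self.M, self.K)  # final dense layer, M -> K

    def forward(self, X):
        # Initial hidden state: one vector per layer, per sample (L x N x M)
        h0 = torch.zeros(self.L, X.size(0), self.M, device=X.device)
        # out is indexed by time (N x T x M); the second return value is
        # indexed by layer and is not needed here
        out, _ = self.rnn(X, h0)
        # Many-to-one: keep only the final time step (N x M), then map to N x K
        return self.fc(out[:, -1, :])

# Hidden size 5 is an assumed value, not taken from the lecture.
model = SimpleRNN(n_inputs=1, n_hidden=5, n_rnnlayers=1, n_outputs=1)
y = model(torch.randn(4, 10, 1))   # a batch of N=4 sequences with T=10, D=1
```

Here y has shape 4 by 1, matching the N by K output described in the lecture.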

50
00:04:43,540 --> 00:04:57,490
Next, we instantiate our SimpleRNN and move the parameters to the GPU.

51
00:04:57,720 --> 00:05:01,770
Next, we create our loss and optimizer: the MSE loss and the Adam optimizer.

52
00:05:04,760 --> 00:05:09,370
Now, I just want to mention this again since it's important. When we do our train-test split,

53
00:05:09,500 --> 00:05:15,180
we don't want to just randomly pick samples across the entire time period for the validation set.

54
00:05:15,380 --> 00:05:21,320
The purpose of forecasting is to predict the future and thus when you split up your data your validation

55
00:05:21,320 --> 00:05:23,460
set should always be future data points.

56
00:05:23,510 --> 00:05:30,030
And the train set should only contain points before that.
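A minimal sketch of such a chronological split is shown below. The placeholder arrays and the half-and-half split point are assumptions for illustration; the lecture does not say where the notebook cuts the data.

```python
import numpy as np

# Placeholder data in time order (hypothetical sizes): N windows of length T = 10
N = 190
X = np.arange(N * 10, dtype=float).reshape(N, 10, 1)
Y = np.arange(N, dtype=float)

# Assumed split point: first half for training, second half for validation.
split = N // 2
X_train, Y_train = X[:split], Y[:split]   # only the earlier points
X_test, Y_test = X[split:], Y[split:]     # strictly later points
```

Because the samples are sliced rather than shuffled, every validation target comes after every training target in time.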

57
00:05:30,080 --> 00:05:31,940
Next, we move the data to the GPU.

58
00:05:36,210 --> 00:05:38,190
Next, we train the model.

59
00:05:38,310 --> 00:05:39,980
Again this is all stuff you've seen before.

60
00:05:39,980 --> 00:05:53,300
So there is no need to repeat it.

61
00:05:53,650 --> 00:05:58,270
Next we plot our loss.

62
00:05:58,420 --> 00:06:02,760
So that looks pretty good.

63
00:06:02,850 --> 00:06:05,360
Next, we do our one-step forecast.

64
00:06:05,700 --> 00:06:11,030
This is almost the same as our autoregressive example, except the shapes are a bit different.

65
00:06:11,040 --> 00:06:15,570
The important part of this is how we shape the input and how we retrieve the output.

66
00:06:15,630 --> 00:06:21,870
Since we would like the input to be of shape N by T by D, we reshape it to 1 by T by 1, since N

67
00:06:21,870 --> 00:06:29,510
is 1 and D is 1. When we retrieve the output, which is N by K, we have N equals 1 and K equals 1,

68
00:06:29,640 --> 00:06:31,150
so we index by [0, 0].

69
00:06:35,530 --> 00:06:37,450
So let's plot the predictions.

70
00:06:41,100 --> 00:06:42,020
And as expected,

71
00:06:42,040 --> 00:06:45,250
this looks pretty nice, but of course we know this can be misleading.

72
00:06:48,220 --> 00:06:50,940
So let's scroll down to the real multi-step forecast.

73
00:07:03,460 --> 00:07:08,640
So here we can see that our RNN with the default parameters does not perform as well.

74
00:07:08,650 --> 00:07:10,710
This is a very intriguing result.

75
00:07:11,020 --> 00:07:16,840
If you recall, the autoregressive linear model does this perfectly, and in fact we proved that we could

76
00:07:16,840 --> 00:07:20,250
find the coefficients of an AR(2) model analytically.
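To illustrate why the AR(2) model handles the multi-step forecast so well, here is a small sketch using the identity sin(wt) = 2 cos(w) sin(w(t-1)) - sin(w(t-2)), which gives the two coefficients directly. The frequency, the series length, and the function names are assumptions; the multi-step loop feeds each prediction back in as input, which is the idea behind the forecast shown in the notebook, though not its exact code.

```python
import math

omega = 0.1   # assumed angular frequency of the sine wave
series = [math.sin(omega * t) for t in range(200)]

# Analytic AR(2) coefficients from the trigonometric identity above
a1, a2 = 2 * math.cos(omega), -1.0

def predict(window):
    # One-step prediction from the two most recent values
    return a1 * window[-1] + a2 * window[-2]

def multi_step_forecast(predict, history, n_steps):
    # Hypothetical helper: roll the model forward on its own predictions
    window = list(history)
    forecast = []
    for _ in range(n_steps):
        y = predict(window)
        forecast.append(y)
        window = window[1:] + [y]   # drop the oldest value, append the prediction
    return forecast

# Start from the last two "training" points and forecast the remaining 100
forecast = multi_step_forecast(predict, series[98:100], 100)
max_error = max(abs(p - s) for p, s in zip(forecast, series[100:]))
```

Since the model never sees a true future value during the loop, any error in one step carries into the next. The AR(2) model has essentially nothing to accumulate here, so max_error stays at the level of floating-point rounding.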

77
00:07:20,380 --> 00:07:23,370
So why does an RNN not perform as well?

78
00:07:23,380 --> 00:07:26,920
The answer is that an RNN has too much flexibility.

79
00:07:26,920 --> 00:07:30,580
This is just like how an ANN is more general than a CNN.

80
00:07:30,580 --> 00:07:36,430
But just because an ANN is more general and more flexible than a CNN, it does not mean that an ANN

81
00:07:36,430 --> 00:07:38,640
will perform better than a CNN.

82
00:07:38,790 --> 00:07:46,300
In fact, constraining the parameters of a CNN to be shared makes the CNN better, and we see the same theme

83
00:07:46,300 --> 00:07:52,700
here. An AR(2) model would work best because that model actually perfectly fits the dataset.

84
00:07:52,720 --> 00:07:56,530
It's the correct model. An RNN model does not.

85
00:07:56,530 --> 00:07:59,740
It has too much flexibility and too many parameters.

86
00:07:59,740 --> 00:08:06,760
We say that it is overparameterized. Although, as you'll see later, the flexibility of an RNN

87
00:08:06,820 --> 00:08:13,960
does allow it to do more powerful things that an AR model cannot.

88
00:08:14,260 --> 00:08:16,420
Next, we're going to go back and add some noise to

89
00:08:16,420 --> 00:08:20,020
this dataset so that it better mimics what you might find in the real world.

90
00:08:28,240 --> 00:08:29,410
So let's run all this again.

91
00:08:35,280 --> 00:08:36,190
So there's the data,

92
00:08:44,780 --> 00:08:49,760
the loss, and there's the one-step forecast.

93
00:08:49,810 --> 00:08:53,660
So if we look at the one-step forecast, we can see that it looks pretty good,

94
00:08:53,680 --> 00:08:54,820
which is what we expect.

95
00:09:03,110 --> 00:09:06,820
But if we look at the multi-step forecast, it looks pretty terrible.

96
00:09:06,830 --> 00:09:12,680
So in fact, the RNN performs worse in this scenario. I should mention that sometimes when I run this,

97
00:09:12,740 --> 00:09:17,680
I see good results from the RNN, but generally speaking, the results are unpredictable.

98
00:09:17,870 --> 00:09:21,070
Sometimes it captures the periodicity and sometimes not.

99
00:09:21,290 --> 00:09:23,960
So it may suggest that more tuning would be required.
