1
00:00:11,580 --> 00:00:15,720
Okay, so in this video, we're going to use RNNs to do time series forecasting.

2
00:00:16,600 --> 00:00:16,770
Okay.

3
00:00:16,770 --> 00:00:18,630
So what's on the agenda for this lecture?

4
00:00:18,720 --> 00:00:24,060
So this lecture is going to follow basically the same steps as we did with the previous deep learning

5
00:00:24,060 --> 00:00:30,330
architectures, where we do the one step forecast, the incremental multi step forecast and the multi

6
00:00:30,330 --> 00:00:32,520
output multi step forecast.

7
00:00:34,870 --> 00:00:35,200
Okay.

8
00:00:35,200 --> 00:00:39,410
And so just as a reminder, there are many configurations we can try.

9
00:00:39,490 --> 00:00:42,520
And so these are some examples of what you can do.

10
00:00:42,550 --> 00:00:47,830
So the most basic method is to take the final hidden state from the RNN and then pass that through

11
00:00:47,830 --> 00:00:49,000
the final dense layer.

12
00:00:49,720 --> 00:00:54,910
The second method is to take all the hidden states, use global max pooling, and then this will give

13
00:00:54,910 --> 00:00:59,260
us a single hidden vector which we can then pass into the final dense layer.

14
00:00:59,800 --> 00:01:03,820
And then the third option is to stack multiple RNNs together.

15
00:01:03,880 --> 00:01:06,580
So in this lecture, we're going to use LSTMs.

16
00:01:07,750 --> 00:01:13,300
But be aware that you can exchange LSTMs with GRUs at any point in the script.

17
00:01:13,840 --> 00:01:15,550
Basically with no changes in code.

18
00:01:15,550 --> 00:01:17,200
So you might want to give that a try.

19
00:01:19,360 --> 00:01:21,430
But otherwise all the imports are the same.

20
00:01:25,720 --> 00:01:26,740
Okay.

21
00:01:27,040 --> 00:01:33,670
So the next step is to update scikit-learn so that we get the latest metrics.

22
00:01:37,530 --> 00:01:38,340
Okay.

23
00:01:39,900 --> 00:01:40,350
Okay.

24
00:01:40,350 --> 00:01:44,430
And so from scikit-learn we're going to import the metric.

25
00:01:49,540 --> 00:01:49,780
All right.

26
00:01:49,780 --> 00:01:53,440
So the next step is to download the airline passengers CSV.

27
00:01:56,410 --> 00:01:57,880
And you've seen all this before.

28
00:01:57,880 --> 00:02:09,130
So we read in the CSV using read_csv, and then we take the log of the passengers column and then we split

29
00:02:09,130 --> 00:02:10,600
the data into train and test.

30
00:02:10,600 --> 00:02:13,690
So the last 12 data points will be used for the test set.

31
00:02:18,270 --> 00:02:18,750
Okay.

32
00:02:18,750 --> 00:02:24,330
And so for indexing purposes later on in this notebook, we're going to want the train index and the

33
00:02:24,330 --> 00:02:24,960
test index.

34
00:02:24,960 --> 00:02:29,370
So we want a Boolean to tell us which indices belong to which data sets.

35
00:02:34,800 --> 00:02:39,030
And so the next step is to calculate the first difference of the log passengers.

36
00:02:39,330 --> 00:02:41,130
Remember that you don't have to do this.

37
00:02:41,340 --> 00:02:46,320
You can always try this script without taking the difference first and then compare the results.

38
00:02:50,720 --> 00:02:54,410
Okay, so the next step is to make our supervised data set.

39
00:02:55,220 --> 00:02:59,720
So we begin by dropping any NaNs from our differenced column.

40
00:03:00,140 --> 00:03:05,720
And then we set our past time steps to ten, and then we create our empty arrays for X and Y.

41
00:03:06,290 --> 00:03:13,370
And then we loop through the time series, and then we assign past data points to X.

42
00:03:13,610 --> 00:03:15,980
And the next single data point to Y.

43
00:03:16,280 --> 00:03:19,400
So this is our data set for making one step forecasts.

44
00:03:20,840 --> 00:03:25,310
And so note that at the end of this we reshape X.

45
00:03:25,940 --> 00:03:34,070
So originally X is just N by T because there's no feature dimension, but RNNs expect an input

46
00:03:34,070 --> 00:03:35,900
of shape N by T by D.

47
00:03:36,020 --> 00:03:37,760
Even though D is just one.

48
00:03:43,360 --> 00:03:47,660
And so you can see we have 133 data points for both X and Y.
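The windowing step described above can be sketched in NumPy as follows. This is a minimal sketch, not the notebook's code: `series` is a synthetic stand-in for the log passengers column (assumed here to have 144 values), and the names `diff`, `T`, `X`, and `Y` are illustrative.

```python
import numpy as np

# Hypothetical stand-in for the log-passengers series (144 monthly values assumed).
series = np.log(np.linspace(100.0, 600.0, 144))

# First difference; np.diff already drops the first value, which has no predecessor.
diff = np.diff(series)               # length 143

T = 10                               # number of past time steps per sample
X, Y = [], []
for t in range(len(diff) - T):
    X.append(diff[t:t + T])          # T past values
    Y.append(diff[t + T])            # the next single value

X = np.array(X).reshape(-1, T, 1)    # N x T x D, with D = 1
Y = np.array(Y)                      # N
print(X.shape, Y.shape)
```

With 143 differenced values and T = 10, the loop produces 143 - 10 = 133 samples, which matches the count mentioned in the lecture.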

49
00:03:51,080 --> 00:03:51,680
Okay.

50
00:03:51,680 --> 00:03:55,160
And so the next step is to split the data into train and test.

51
00:03:59,890 --> 00:04:02,890
Okay, so the next step is to build our RNN.

52
00:04:03,700 --> 00:04:04,730
So it's very simple.

53
00:04:04,750 --> 00:04:10,840
So notice this looks pretty much exactly like a regular ANN, except that we've replaced the second

54
00:04:10,840 --> 00:04:15,400
layer or the second object with LSTM instead of dense.

55
00:04:15,610 --> 00:04:21,250
And again, feel free to try different RNN layers like the SimpleRNN or the GRU.

56
00:04:24,160 --> 00:04:28,660
And so it's important to notice that the input shape is T by one.

57
00:04:28,870 --> 00:04:35,140
So t because we have t time steps and the feature dimension is just one because we only have one time

58
00:04:35,140 --> 00:04:35,800
series.

59
00:04:37,150 --> 00:04:44,230
And so for the LSTM, now a lot of people, well, not a lot, but some get confused about what this number

60
00:04:44,230 --> 00:04:44,830
means.

61
00:04:44,980 --> 00:04:51,040
So remember that this number 24 represents the size of the hidden vector.

62
00:04:51,520 --> 00:04:55,810
And so one common mistake I see a lot is confusing

63
00:04:55,840 --> 00:05:00,630
T with this value, which we've called M in the theory lectures.

64
00:05:00,640 --> 00:05:03,460
So that's M. So M is not the same as T.

65
00:05:03,730 --> 00:05:09,430
M is the size of the hidden vector h, whereas t is the length of the sequence.

66
00:05:09,610 --> 00:05:11,710
So that's how many vectors we have.
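One way to see that M and T are different things is to count the weights of a standard LSTM layer with a bias term. The sketch below assumes the usual four-gate LSTM formulation; `lstm_param_count` and `dense_param_count` are hypothetical helper names, not library functions. Note that the sequence length T does not appear anywhere in the count.

```python
def lstm_param_count(M, D):
    """Weights in a standard LSTM with bias: 4 gates, each with
    input weights (D x M), recurrent weights (M x M), and a bias (M)."""
    return 4 * (D * M + M * M + M)


def dense_param_count(n_in, n_out):
    """Weights in a dense layer: one weight per input-output pair plus biases."""
    return n_in * n_out + n_out


M, D = 24, 1                          # hidden size and feature dimension from the lecture
print(lstm_param_count(M, D))         # LSTM layer
print(dense_param_count(M, 1))        # final dense layer with one output
```

Changing T from 10 to any other length leaves both counts unchanged, because the same weights are reused at every time step.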

67
00:05:13,940 --> 00:05:14,330
Okay.

68
00:05:14,330 --> 00:05:17,300
And some other comments with this.

69
00:05:18,080 --> 00:05:21,770
So the LSTM by default.

70
00:05:21,890 --> 00:05:25,330
So notice how we haven't specified any activation function.

71
00:05:25,370 --> 00:05:29,090
And this is because the LSTM uses the tanh by default.

72
00:05:29,090 --> 00:05:32,920
So it's kind of a weird layer, whereas all the other layers just do nothing by default.

73
00:05:32,930 --> 00:05:39,050
The LSTM uses tanh by default, and also by default,

74
00:05:39,050 --> 00:05:42,700
This returns just the final hidden state.

75
00:05:42,710 --> 00:05:48,530
So we're getting h at big T, and so all that's implicit.

76
00:05:48,530 --> 00:05:50,480
We don't have to specify that directly.

77
00:05:52,060 --> 00:05:58,360
Okay, So I just wanted to make an extra quick note that if you want to make use of the GPU while you're

78
00:05:58,360 --> 00:06:05,560
using an LSTM, you need to follow certain requirements, so you can check the TensorFlow documentation

79
00:06:05,560 --> 00:06:06,010
for this.

80
00:06:06,010 --> 00:06:11,770
But basically it's that you have to use the tanh activation, which we've mentioned is the default.

81
00:06:11,950 --> 00:06:14,880
You have to use the sigmoid as the recurrent activation.

82
00:06:14,890 --> 00:06:20,480
So you saw those in the equations from the theory lecture and you must not use dropout.

83
00:06:20,500 --> 00:06:24,100
You must use unroll equal to False and you must include the bias term.

84
00:06:24,580 --> 00:06:27,520
And the rest of this stuff is not really relevant at this time.

85
00:06:27,910 --> 00:06:28,980
So just keep that in mind.

86
00:06:28,990 --> 00:06:34,690
So if you were thinking about changing the activation to something like the ReLU, keep in mind that

87
00:06:34,690 --> 00:06:37,720
this would prevent you from making use of the GPU.

88
00:06:40,610 --> 00:06:40,780
Okay.

89
00:06:40,850 --> 00:06:45,290
And so for the final layer, we have just a dense layer of size one because we are making one prediction.

90
00:06:50,160 --> 00:06:53,340
Okay, so let's do a model summary to see what we got back.

91
00:06:56,650 --> 00:06:58,150
Okay, So it's pretty straightforward.

92
00:06:58,150 --> 00:07:02,320
So our input is of shape N by T by D as expected.

93
00:07:02,890 --> 00:07:10,210
And by the end of the LSTM we have N by 24. 24, because that's the size of our hidden vector, and we

94
00:07:10,210 --> 00:07:13,810
only have one hidden vector at the end of the LSTM.

95
00:07:14,950 --> 00:07:16,750
So it's basically N by M.

96
00:07:17,740 --> 00:07:22,540
And then once we have that, we can pass it through a final dense layer and then we get N by one.

97
00:07:22,930 --> 00:07:26,110
And that one is the prediction for the next time step.

98
00:07:29,330 --> 00:07:29,870
Okay.

99
00:07:30,020 --> 00:07:36,080
So the next step is to call compile with the loss as mean squared error and the optimizer as Adam, as usual.

100
00:07:39,990 --> 00:07:45,510
And the next step is to fit our model on the train set and then validate on the test set.

101
00:07:45,540 --> 00:07:47,460
We'll do this for 100 epochs.

102
00:07:53,280 --> 00:07:53,930
Okay.

103
00:07:53,940 --> 00:08:00,600
And so while this is running, I'll just talk about briefly why did we choose 24 hidden units?

104
00:08:00,750 --> 00:08:04,800
And the answer again is this is just an arbitrary number, right?

105
00:08:04,800 --> 00:08:10,020
You have to test different numbers and see what works best on the out of sample data, of course.

106
00:08:11,220 --> 00:08:15,840
So there's no formula for this, as some people might expect.

107
00:08:18,690 --> 00:08:23,070
Okay, so this is our training loss and validation loss.

108
00:08:24,000 --> 00:08:29,010
It's always a good idea to plot the loss to see that it converged.

109
00:08:30,360 --> 00:08:34,290
So it looks like both the train loss and the test loss have converged.

110
00:08:36,620 --> 00:08:45,560
Okay, so the next step is to set the first T plus one values to False in the train index.

111
00:08:46,130 --> 00:08:48,260
And this is because they are not predictable.

112
00:08:48,440 --> 00:08:53,510
So we're not going to make any predictions on the first few values because the model doesn't have enough

113
00:08:53,510 --> 00:08:55,520
past values to make those predictions.
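The Boolean indexing described here can be sketched as follows. This is an illustration under assumptions: a series of 144 points, the last 12 held out, and T = 10; the names `train_idx` and `test_idx` are illustrative rather than taken from the notebook.

```python
import numpy as np

N_total = 144        # assumed length of the full series
Ntest = 12           # last 12 points form the test set
T = 10               # past time steps used as input

# True where a row belongs to the train set, False otherwise.
train_idx = np.ones(N_total, dtype=bool)
train_idx[-Ntest:] = False
test_idx = ~train_idx

# The first T + 1 rows cannot be predicted: one value is lost to
# differencing and T more are needed to fill the first input window.
train_idx[:T + 1] = False

print(train_idx.sum(), test_idx.sum())
```

The number of True entries left in `train_idx` equals the number of train samples in the supervised data set, so the predictions line up with the rows they belong to.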

114
00:08:58,220 --> 00:08:58,640
Okay.

115
00:08:58,640 --> 00:09:05,270
And so the next step is to call predict on both the train and test set and then to flatten those outputs.

116
00:09:05,850 --> 00:09:05,990
Okay.

117
00:09:05,990 --> 00:09:08,300
So this gives us Ptrain and Ptest.

118
00:09:13,260 --> 00:09:18,000
And so the next step is to compute the un-differenced predictions.

119
00:09:18,330 --> 00:09:23,400
As you recall, this requires us to have the data at the previous time step.

120
00:09:24,330 --> 00:09:29,520
So we shift the log passengers backwards, and then we save it to a variable called prev.

121
00:09:32,970 --> 00:09:33,380
Okay.

122
00:09:34,710 --> 00:09:38,550
And we also keep track of the last known train value.

123
00:09:39,570 --> 00:09:45,810
So this is the value we're going to accumulate from when we're making our multistep predictions.

124
00:09:51,570 --> 00:09:55,230
Okay, so the next step is to make our one step forecast.

125
00:09:56,160 --> 00:10:02,960
And so remember that because we have differenced the data, we need to add it back, basically.

126
00:10:03,300 --> 00:10:08,970
So the prediction is not just the prediction because these are the differenced predictions, but we need

127
00:10:08,970 --> 00:10:10,890
to add it back to the previous value.

128
00:10:11,610 --> 00:10:17,190
So it's basically the previous value plus the delta that gives us our current prediction.

129
00:10:17,730 --> 00:10:19,680
And we do this for both train and test.

130
00:10:19,890 --> 00:10:22,390
And remember, these are just one-step forecasts.

131
00:10:22,410 --> 00:10:27,900
So that's why we can use the previous value without any philosophical trouble.
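As a small worked example of "previous value plus the delta", here is a sketch with made-up numbers. The arrays `log_y` and `pred_diff` are hypothetical; in the notebook the deltas would come from the model's predictions.

```python
import numpy as np

# Hypothetical log-series values and hypothetical predicted differences.
log_y = np.array([4.0, 4.2, 4.1, 4.5, 4.6])
pred_diff = np.array([0.15, -0.05, 0.30, 0.20])   # one delta per step after the first

# The true value at the previous time step, aligned with each prediction.
prev = log_y[:-1]

# One-step forecast: true previous value plus the predicted change.
one_step = prev + pred_diff
print(one_step)
```

Each forecast uses the true previous value, which is only legitimate because the forecast horizon is a single step.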

132
00:10:32,930 --> 00:10:36,800
Okay, so the next step is to plot the one step forecast.

133
00:10:40,930 --> 00:10:43,840
And so as you can see, it looks pretty good as expected.

134
00:10:45,760 --> 00:10:50,500
So the orange line is the train prediction and the green line is the test prediction.

135
00:10:54,290 --> 00:10:54,740
Okay.

136
00:10:54,740 --> 00:10:58,220
So the next step is to do our multi step forecast.

137
00:10:58,460 --> 00:11:00,800
And so remember that we have to do this incrementally.

138
00:11:01,190 --> 00:11:04,190
So we create an empty array to store the predictions.

139
00:11:04,430 --> 00:11:08,180
And we begin by grabbing the first test input.

140
00:11:09,970 --> 00:11:10,210
Okay.

141
00:11:10,240 --> 00:11:19,360
And so once we have that, we do a loop and we loop while the length of our prediction array is less

142
00:11:19,360 --> 00:11:20,470
than Ntest.

143
00:11:20,650 --> 00:11:23,320
So once it's equal to Ntest, then we've made enough predictions.

144
00:11:25,300 --> 00:11:28,930
And so inside this loop we call model.predict.

145
00:11:28,930 --> 00:11:31,450
And so what do we call model.predict on?

146
00:11:32,050 --> 00:11:38,350
So we call model.predict on last_x, which is going to be updated within this loop, by the way, and

147
00:11:38,350 --> 00:11:41,500
we reshape it to be of size N by T by D.

148
00:11:42,700 --> 00:11:47,530
And I'm just being lazy here with the minus one because minus one is a wildcard that fills in whatever

149
00:11:47,530 --> 00:11:48,760
value needs to go there.

150
00:11:49,780 --> 00:11:54,430
Some people believe that you should be more explicit, but I think this is fine for now.

151
00:11:54,700 --> 00:11:56,890
So you could put T in here if you wanted.

152
00:11:57,400 --> 00:11:58,870
It would be just as easy.

153
00:11:59,680 --> 00:12:07,870
But just remember this is N, T, and D, and both N and D are one because we only have one sample and

154
00:12:07,870 --> 00:12:11,170
we only have one individual univariate time series.

155
00:12:12,190 --> 00:12:12,720
Okay.

156
00:12:13,800 --> 00:12:20,700
And so once we have our prediction, we grab the zeroth element and we'll call that P.

157
00:12:20,820 --> 00:12:23,370
So remember that model.predict always returns an array.

158
00:12:26,400 --> 00:12:30,570
And so once we have our prediction P, we append that to our list of predictions.

159
00:12:31,740 --> 00:12:35,400
And then once we have done that, we make our new input.

160
00:12:36,610 --> 00:12:38,110
Okay, so last_x.

161
00:12:38,560 --> 00:12:43,180
So we're basically shifting everything backwards and adding the newest prediction to make the prediction

162
00:12:43,180 --> 00:12:44,470
for the next time step.

163
00:12:45,260 --> 00:12:48,670
Now, you've seen this before, so I won't spend much time explaining it again.
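The incremental loop described above can be sketched without a trained network. In this sketch `predict` is a hypothetical stand-in for the model's predict call (it just averages the input window), and `last_x` starts from made-up values rather than the real first test input.

```python
import numpy as np


def predict(x):
    # Hypothetical stand-in for a trained model:
    # takes an N x T x D array and returns an N x 1 array.
    return x.mean(axis=1)


T = 10
Ntest = 12

# Made-up values standing in for the first test input (T past differences).
last_x = np.arange(T, dtype=float)

multistep_predictions = []
while len(multistep_predictions) < Ntest:
    # Reshape to 1 x T x 1; -1 lets NumPy infer T.
    p = predict(last_x.reshape(1, -1, 1))[0, 0]
    multistep_predictions.append(p)

    # Shift the window back by one and put the newest prediction at the end.
    last_x = np.roll(last_x, -1)
    last_x[-1] = p

print(len(multistep_predictions))
```

After the first iteration the window already contains a predicted value, so later forecasts are built on earlier forecasts rather than on true data.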

164
00:12:54,610 --> 00:12:55,180
Okay.

165
00:12:55,180 --> 00:13:00,580
And so the next step is to save the multi-step forecast to our data frame.

166
00:13:01,280 --> 00:13:06,970
And so just to recall, how do we make the multi step forecast, the incremental multistep forecast.

167
00:13:07,480 --> 00:13:10,240
So we take the last train point.

168
00:13:10,390 --> 00:13:10,680
Okay.

169
00:13:10,890 --> 00:13:13,150
So this is the last data point that we know.

170
00:13:13,720 --> 00:13:20,110
And then we do the cumulative sum of the differenced predictions.

171
00:13:20,500 --> 00:13:23,730
And so you saw the formula for that in the theory lecture.

172
00:13:23,740 --> 00:13:25,390
So you understand why this works.
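The cumulative-sum step can be checked on a toy series. In this sketch the "predicted" differences are simply the true differences of a made-up array `log_y`, so the reconstruction should recover the held-out values; all names are illustrative.

```python
import numpy as np

# Made-up log-series; the last Ntest values play the role of the test set.
log_y = np.array([4.0, 4.2, 4.1, 4.5, 4.6, 4.4])
Ntest = 3

# Last known train value: the point just before the test period.
last_train = log_y[-Ntest - 1]

# Pretend the model predicted the true differences over the test period.
diff_preds = np.diff(log_y)[-Ntest:]

# Forecast in the original (log) scale: last known value plus running total of deltas.
forecast = last_train + np.cumsum(diff_preds)
print(forecast)
```

Because each forecast is the last known value plus the running total of predicted changes, an error in an early delta carries through to every later step.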

173
00:13:31,370 --> 00:13:31,870
Okay.

174
00:13:31,880 --> 00:13:37,580
And so in this plot, we're going to do the multi step forecast and the one step forecast together.

175
00:13:38,360 --> 00:13:43,280
And so notice we're not bothering to do the train predictions anymore.

176
00:13:47,320 --> 00:13:47,890
Okay.

177
00:13:47,890 --> 00:13:51,010
So you can see that they're both pretty close.

178
00:13:51,010 --> 00:13:57,970
But of course, the one step forecast is slightly better because it uses the true past data.

179
00:14:03,500 --> 00:14:07,580
Okay, So the next step is to do the multi output forecast.

180
00:14:07,580 --> 00:14:10,400
So we're going to start by making the data set for this.

181
00:14:11,600 --> 00:14:11,880
Okay.

182
00:14:11,900 --> 00:14:14,270
And so just a reminder about how this works.

183
00:14:14,660 --> 00:14:21,500
We're going to have multiple time steps for our input and we have multiple time steps for our output.

184
00:14:22,310 --> 00:14:26,480
So our model is going to predict 12 different values simultaneously.

185
00:14:29,120 --> 00:14:36,200
And so you've seen all this code before, but just really quick, we create our X and Y again and notice

186
00:14:36,230 --> 00:14:38,810
Tx and Ty may be different.

187
00:14:40,070 --> 00:14:47,150
And then we loop through the time series up to a point so that we don't go out of bounds.

188
00:14:48,200 --> 00:14:55,820
And so inside this loop we grab two data points and we make that into X, and then we grab T data points

189
00:14:55,820 --> 00:15:02,460
the next T data points, so offset by X and we append that to Y.

190
00:15:04,160 --> 00:15:12,500
And so outside this loop we have to reshape X and Y, and so X as usual is going to be n by T, by dx,

191
00:15:12,950 --> 00:15:16,970
whereas y is now going to be n by t y.

192
00:15:19,450 --> 00:15:23,930
Okay, So this is like a neural network that outputs multiple values for say, classification.

193
00:15:24,530 --> 00:15:27,070
It's just that instead of n by K, we have n by t.

194
00:15:27,070 --> 00:15:27,590
Y.

195
00:15:32,860 --> 00:15:33,550
Okay.

196
00:15:34,720 --> 00:15:39,400
And so the next step is to split our data into train and test.

197
00:15:44,990 --> 00:15:47,300
And the next step is to create our new

198
00:15:47,420 --> 00:15:47,960
RNN.

199
00:15:49,280 --> 00:15:50,450
So notice that for this

200
00:15:50,540 --> 00:15:51,080
RNN,

201
00:15:51,620 --> 00:15:55,040
we're using return_sequences equal to True.

202
00:15:55,520 --> 00:15:57,500
And so this gives us all the hidden states.

203
00:15:57,500 --> 00:16:05,210
So h1, h2, all the way up to hT. And then we use global max pooling to take the maximum over time of

204
00:16:05,210 --> 00:16:07,070
each component of the hidden states.

205
00:16:08,300 --> 00:16:10,160
And so that gives us a single vector.

206
00:16:10,850 --> 00:16:16,520
And then once we have that single feature vector, we pass it through a final dense layer with Ty outputs.
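To make the pooling step concrete, here is a tiny NumPy sketch. The array `H` is a made-up set of hidden states for one sample with T = 3 time steps and M = 2 hidden units; it is not output from a real LSTM.

```python
import numpy as np

# Made-up hidden states, shape N x T x M = 1 x 3 x 2.
H = np.array([[[1.0, -2.0],
               [0.5, 3.0],
               [-1.0, 0.0]]])

# Global max pooling over time: for each hidden component, keep the largest
# value seen at any time step. The result has shape N x M.
pooled = H.max(axis=1)

# For comparison, the final hidden state alone (what the first model used).
final_state = H[:, -1, :]

print(pooled, final_state)
```

Both options reduce the sequence to a single N x M feature matrix, so either can feed the final dense layer.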

207
00:16:18,410 --> 00:16:24,260
And so there's no reason that we had to use return_sequences equal to True and global max pooling on

208
00:16:24,260 --> 00:16:27,410
this example, whereas we didn't use it on the previous example.

209
00:16:27,440 --> 00:16:27,860
Right?

210
00:16:27,860 --> 00:16:32,990
You can mix and match any way you like, but obviously we don't want to try every single combination

211
00:16:32,990 --> 00:16:34,790
since that would be very tedious.

212
00:16:39,670 --> 00:16:40,360
And okay.

213
00:16:40,360 --> 00:16:44,620
And so the next step, I believe the training was a bit erratic.

214
00:16:44,620 --> 00:16:47,830
So I decided to create a checkpoint to save the best model.

215
00:16:48,700 --> 00:16:55,660
So we have a model checkpoint, and this will save the best model according to the validation loss.

216
00:16:57,880 --> 00:17:05,170
And so the next step is to compile using the same loss, the MSE, and optimizer, Adam, as usual.

217
00:17:07,240 --> 00:17:12,070
And then we're going to call model fit for 300 epochs.

218
00:17:13,090 --> 00:17:19,480
And notice we're passing in our new data, Xtrain and Ytrain, for multi-output.

219
00:17:19,780 --> 00:17:23,530
And we also have our callback to save the best model.

220
00:17:27,810 --> 00:17:27,900
And.

221
00:17:28,040 --> 00:17:29,460
Okay, so let's wait for this.

222
00:17:33,200 --> 00:17:33,950
Okay.

223
00:17:34,520 --> 00:17:37,910
So, as usual, we're going to plot our loss per epoch.

224
00:17:41,450 --> 00:17:43,670
Okay, so the loss per epoch looks good.

225
00:17:44,630 --> 00:17:49,610
Maybe you could have trained it for longer, but it does look like the test loss is maybe creeping upwards.

226
00:17:53,560 --> 00:17:54,070
Okay.

227
00:17:54,160 --> 00:17:56,980
So the next step is to load our best model.

228
00:17:56,980 --> 00:18:00,760
So remember that we save the best model using a callback.

229
00:18:04,570 --> 00:18:04,840
Okay.

230
00:18:05,170 --> 00:18:11,770
And so the next step is to call model.predict to get the predictions for the train set and the test

231
00:18:11,770 --> 00:18:12,370
set.

232
00:18:16,140 --> 00:18:22,920
Okay, so the next step is to check the shape of Ptrain and Ptest just as a sanity check and to make

233
00:18:22,920 --> 00:18:24,600
sure we understand what's going on.

234
00:18:26,040 --> 00:18:34,440
So the train predictions, we have 121 train predictions, and each of those will make a prediction

235
00:18:34,440 --> 00:18:37,440
for all 12 time steps in the future.

236
00:18:38,310 --> 00:18:42,990
So obviously there are some redundant predictions, but they're using different past data.

237
00:18:44,510 --> 00:18:45,140
Okay.

238
00:18:45,320 --> 00:18:49,010
And so for the test set, we only have one prediction.

239
00:18:49,190 --> 00:18:51,320
But for 12 different time steps.

240
00:18:51,830 --> 00:18:57,440
So basically, instead of 12 separate predictions, we have one prediction with 12 prediction heads.

241
00:19:00,610 --> 00:19:03,790
And so this is just to make sure that our data is in the right shape.

242
00:19:08,720 --> 00:19:09,410
Okay.

243
00:19:09,410 --> 00:19:14,600
And so the next step is to save the multi-output forecast to our data frame.

244
00:19:14,870 --> 00:19:21,530
And notice again, we use the cumulative sum because we've predicted the differences and not the actual

245
00:19:21,800 --> 00:19:23,300
time series values.

246
00:19:24,260 --> 00:19:29,960
So that's the last known data point plus the cumulative sum of the difference predictions of the deltas.

247
00:19:34,150 --> 00:19:34,870
Okay.

248
00:19:34,870 --> 00:19:38,470
And so the next step is to plot all the forecasts at the same time.

249
00:19:38,650 --> 00:19:43,510
So basically we're just adding the multi output forecast to the forecast we already made.

250
00:19:48,330 --> 00:19:48,950
Okay.

251
00:19:49,020 --> 00:19:56,610
And so it's hard to see, but it looks like the multi-output forecast, which is in red, is closer than

252
00:19:56,610 --> 00:19:57,510
the other ones.

253
00:19:57,810 --> 00:20:01,800
But we'll use actual quantitative metrics later on in this lecture.

254
00:20:05,190 --> 00:20:12,690
And so just as one final experiment to show you the different possibilities of what you can try is we're

255
00:20:12,690 --> 00:20:15,120
going to add another LSTM layer.

256
00:20:15,270 --> 00:20:17,580
So we're going to stack LSTM layers together.

257
00:20:18,330 --> 00:20:22,620
Now, people always assume that this is just going to be better.

258
00:20:22,890 --> 00:20:23,880
More is better.

259
00:20:24,210 --> 00:20:27,090
But maybe you'll be surprised at the result.

260
00:20:29,010 --> 00:20:35,640
So we have for this example, I've chosen 16 hidden units arbitrarily.

261
00:20:37,320 --> 00:20:38,820
And so why did I do that?

262
00:20:38,820 --> 00:20:40,140
No particular reason.

263
00:20:40,440 --> 00:20:46,230
Well, maybe one reason for this is because when we had fewer layers, we used more hidden units.

264
00:20:46,230 --> 00:20:49,380
And now that we have more layers, we can use fewer hidden units.

265
00:20:50,280 --> 00:20:55,140
So just to keep from blowing up the number of parameters, Okay.

266
00:20:55,290 --> 00:21:00,480
But otherwise, you know, you can use whatever number you like here, whatever you feel is appropriate

267
00:21:00,480 --> 00:21:02,040
and whatever you want to test.

268
00:21:03,210 --> 00:21:07,620
And then we're going to use our global max pooling once again.

269
00:21:07,770 --> 00:21:10,560
So just to be clear about what's happening.

270
00:21:10,680 --> 00:21:14,100
So the input is a bunch of vectors.

271
00:21:14,100 --> 00:21:18,180
So Tx vectors that make up a time series.

272
00:21:18,450 --> 00:21:20,010
And then we pass it through

273
00:21:20,010 --> 00:21:23,490
the first LSTM with return_sequences equal to True.

274
00:21:23,550 --> 00:21:26,730
So we again get a sequence of more hidden vectors.

275
00:21:27,210 --> 00:21:30,780
So this is again, vectors of length.

276
00:21:31,350 --> 00:21:33,300
So we still have Tx hidden vectors.

277
00:21:34,500 --> 00:21:40,980
And so when we pass this through the second LSTM, again, we're passing in our Tx hidden vectors and

278
00:21:40,980 --> 00:21:43,320
we get back Tx hidden vectors again.

279
00:21:44,880 --> 00:21:50,100
And it's only after we do global max pooling that we take those Tx hidden vectors and then we shrink

280
00:21:50,100 --> 00:21:51,780
them down into one hidden vector.

281
00:21:52,860 --> 00:21:55,470
Again, you can see you can do that as many times as you like, right?

282
00:21:55,470 --> 00:22:00,000
Because the input is kind of the same as the output. Input is Tx hidden vectors.

283
00:22:00,000 --> 00:22:01,650
Output is Tx hidden vectors.
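The "sequence in, sequence out" idea can be traced with shapes alone. The sketch below swaps in a plain tanh recurrence with random, untrained weights instead of an LSTM, purely to show how the shapes flow through two stacked recurrent layers and a pooling step; `simple_rnn_layer` is a hypothetical helper, not a library function.

```python
import numpy as np


def simple_rnn_layer(x, M, rng):
    """Plain tanh recurrence (not an LSTM) that returns every hidden state.
    Input: N x T x D. Output: N x T x M."""
    N, T, D = x.shape
    Wx = rng.normal(scale=0.1, size=(D, M))   # input-to-hidden weights
    Wh = rng.normal(scale=0.1, size=(M, M))   # hidden-to-hidden weights
    b = np.zeros(M)
    h = np.zeros((N, M))
    states = []
    for t in range(T):
        h = np.tanh(x[:, t] @ Wx + h @ Wh + b)
        states.append(h)
    return np.stack(states, axis=1)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 10, 1))        # N = 2 samples, Tx = 10 steps, D = 1 feature

h1 = simple_rnn_layer(x, 16, rng)      # Tx vectors in, Tx hidden vectors out
h2 = simple_rnn_layer(h1, 16, rng)     # same again for the second layer
pooled = h2.max(axis=1)                # pooling collapses the Tx vectors into one

print(h1.shape, h2.shape, pooled.shape)
```

Because each layer's output has the same N x T x M layout as its input, more layers can be stacked before the pooling step without any other change.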

284
00:22:06,830 --> 00:22:07,070
Okay.

285
00:22:07,160 --> 00:22:10,580
And so, again, we're going to create a checkpoint to save the best model.

286
00:22:11,330 --> 00:22:19,880
And we're going to compile using the same arguments and we're going to fit with the same number of epochs.

287
00:22:28,090 --> 00:22:30,070
Okay, So it still goes pretty fast.

288
00:22:35,240 --> 00:22:36,560
Okay, so that's done.

289
00:22:36,560 --> 00:22:39,350
And we can now plot the loss again.

290
00:22:40,500 --> 00:22:43,980
Just to make sure that our model has trained successfully.

291
00:22:45,240 --> 00:22:47,580
And it does look like our loss has gone down.

292
00:22:47,880 --> 00:22:51,960
And again, perhaps you could have done more epochs if you wanted.

293
00:22:52,500 --> 00:22:54,450
You can try that on your own if you like.

294
00:22:56,250 --> 00:22:56,860
Okay.

295
00:22:56,880 --> 00:23:01,470
And so, again, we're going to load up the best model that was saved before during training.

296
00:23:03,420 --> 00:23:05,670
And so, the same steps as above.

297
00:23:05,670 --> 00:23:10,740
We're going to call model.predict and we're going to grab the relevant indices.

298
00:23:15,140 --> 00:23:21,080
And again, we're going to do the cumulative sum to get the prediction into the right format.

299
00:23:21,380 --> 00:23:24,290
So we'll save this as the column multi-output 2.

300
00:23:28,050 --> 00:23:28,880
Okay.

301
00:23:28,890 --> 00:23:36,570
And so this time we're going to plot, I think, just the multi-step and the multi-output forecast for

302
00:23:36,570 --> 00:23:37,410
the second one.

303
00:23:38,100 --> 00:23:44,370
So not all the forecasts, just the incremental multi step and the second multi output forecast.

304
00:23:47,190 --> 00:23:47,460
Yeah.

305
00:23:47,460 --> 00:23:51,420
So I think if you did too many, the plot would be too crowded.

306
00:23:51,420 --> 00:23:59,610
So we're just doing these two and you can see that the second multi output forecast is a lot closer

307
00:23:59,610 --> 00:24:01,260
than the incremental forecast.

308
00:24:06,590 --> 00:24:06,810
Okay.

309
00:24:07,030 --> 00:24:12,070
And so the final step is to compute our metrics for each of our experiments.

310
00:24:13,330 --> 00:24:18,160
And so notice we're not bothering with the one step because that's not really comparable, because it

311
00:24:18,160 --> 00:24:23,320
uses true past data, whereas all of these methods do not use true past data.

312
00:24:25,430 --> 00:24:26,780
Okay, so let's run this.

313
00:24:28,850 --> 00:24:39,770
And we see that we get the best result with the single LSTM, which is 0.0054, versus when we had two LSTMs,

314
00:24:39,770 --> 00:24:44,290
it was 0.00579. Okay.

315
00:24:45,770 --> 00:24:51,590
So this shows that just adding more layers isn't necessarily better, as some people assume.

316
00:24:52,340 --> 00:24:56,330
Okay. And it also could be because we didn't choose the right number of hidden units.

317
00:24:56,330 --> 00:24:59,120
So I will let you try that on your own as well.

318
00:25:02,750 --> 00:25:03,260
Okay.

319
00:25:03,260 --> 00:25:09,740
And so as a final thing for this lecture, here are some exercises, things for you to think about.

320
00:25:10,190 --> 00:25:16,280
And as always, you can try this in the code as well to confirm or deny your predictions.

321
00:25:17,300 --> 00:25:18,500
So think about.

322
00:25:19,870 --> 00:25:20,800
Is our improvement.

323
00:25:20,800 --> 00:25:26,500
So notice how the model improved after we added global max pooling, but we also changed other things

324
00:25:26,500 --> 00:25:27,550
at the same time.

325
00:25:28,120 --> 00:25:29,890
So when did we add global max pooling?

326
00:25:32,050 --> 00:25:34,570
So we initially added global max pooling.

327
00:25:36,120 --> 00:25:40,260
So our first experiment was without pooling; we just took the final hidden state.

328
00:25:41,340 --> 00:25:49,440
And then from that we did both the one step and multistep forecast.

329
00:25:54,690 --> 00:25:57,840
And then we made a multi output data set.

330
00:25:58,110 --> 00:26:01,560
And so we use global max pooling for the multi output forecast.

331
00:26:02,580 --> 00:26:05,940
So really we changed multiple things at the same time.

332
00:26:06,120 --> 00:26:10,140
So it's not clear why that model was better.

333
00:26:10,350 --> 00:26:12,960
Is it better because it's predicting multiple outputs at the same time?

334
00:26:12,960 --> 00:26:15,390
Or is it better because it's using global max pooling?

335
00:26:16,110 --> 00:26:23,610
So think about that and also think about whether multiple LSTM layers help or hurt.

336
00:26:24,450 --> 00:26:28,500
And so remember that this also depends on the number of hidden units you choose.

337
00:26:28,740 --> 00:26:30,270
So it's not just one thing.

338
00:26:32,040 --> 00:26:32,550
Okay.

339
00:26:32,550 --> 00:26:38,190
And so another question to consider is, do you think differencing is necessary?

340
00:26:39,030 --> 00:26:44,580
So I know a lot of people who come to this course, they came here because they wanted to learn how

341
00:26:44,580 --> 00:26:47,160
to use LSTMs for time series forecasting.

342
00:26:47,670 --> 00:26:54,930
They didn't really care so much about ARIMA or ETS or ANNs or CNNs, but we've seen that those models

343
00:26:54,930 --> 00:26:55,950
work pretty well.

344
00:26:57,020 --> 00:26:57,650
Okay.

345
00:26:59,230 --> 00:26:59,550
Okay.

346
00:26:59,590 --> 00:27:04,540
But we also noticed that they all don't work very well when you don't use differencing.

347
00:27:06,610 --> 00:27:13,030
So what I'm getting at is people often believe that because LSTMs and RNNs have been so powerful for

348
00:27:13,030 --> 00:27:17,560
NLP that they must be really, really powerful for Time series as well.

349
00:27:17,560 --> 00:27:22,600
And you don't have to do anything like differencing because, you know, RNNs are just magic, right?

350
00:27:22,600 --> 00:27:24,420
They'll automatically learn the pattern.

351
00:27:24,430 --> 00:27:26,290
This is what some people believe.

352
00:27:26,800 --> 00:27:33,460
So if you think this is true, if you believe this is true, I would recommend trying to not do

353
00:27:33,460 --> 00:27:36,760
differencing and seeing if that gives you a good result.

354
00:27:37,090 --> 00:27:40,450
So just give that a try if you think differencing isn't necessary.

355
00:27:43,270 --> 00:27:44,660
And there's another thing to consider.

356
00:27:44,680 --> 00:27:48,130
Does logging work or is it also unnecessary?

357
00:27:48,880 --> 00:27:52,840
So try not taking the log transform of the time series and see if it's okay.

358
00:27:54,520 --> 00:27:54,790
Okay.

359
00:27:54,790 --> 00:28:00,190
So another question that you might want to consider is do you think including more past lags would have

360
00:28:00,190 --> 00:28:00,940
been useful?

361
00:28:01,810 --> 00:28:08,380
So this data set obviously is very small, so you can't have like 100 past data points.

362
00:28:08,560 --> 00:28:12,280
But if you have a larger data set, you might want to give that a try as well.

363
00:28:13,570 --> 00:28:17,860
Okay, compare, say, ten past data points versus 100 past data points.

364
00:28:18,700 --> 00:28:23,350
So these are always experiments you have to do, and the answer is going to be different based on which

365
00:28:23,350 --> 00:28:24,610
data set you are using.

366
00:28:27,130 --> 00:28:28,900
And here's another thing to consider.

367
00:28:29,260 --> 00:28:35,650
When you're doing this in the so-called real world, you might want to consider using walk-forward validation

368
00:28:35,650 --> 00:28:37,690
to optimize those parameters.

369
00:28:37,810 --> 00:28:42,730
For example, the number of hidden units, the number of hidden layers, whether you should use global

370
00:28:42,730 --> 00:28:44,020
max pooling or not.

371
00:28:45,400 --> 00:28:49,270
So all of these questions, they can be answered by doing walk forward validation.
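As a rough illustration of walk-forward validation, the sketch below generates expanding-window train and validation index ranges. This is one common variant, assumed here for illustration; `walk_forward_splits` is a hypothetical helper, and the lecture does not specify the exact scheme.

```python
def walk_forward_splits(n, horizon, n_folds):
    """Yield (train_range, valid_range) pairs for an expanding-window scheme.
    Each fold validates on the `horizon` points that follow its train range,
    and the last fold ends at the final point of the series."""
    for k in range(n_folds, 0, -1):
        valid_end = n - (k - 1) * horizon
        valid_start = valid_end - horizon
        yield range(0, valid_start), range(valid_start, valid_end)


# Example: 144 points, 12-step horizon, 3 folds.
splits = list(walk_forward_splits(144, 12, 3))
for train_range, valid_range in splits:
    print(len(train_range), valid_range.start, valid_range.stop)
```

For each candidate setting (number of hidden units, number of layers, pooling or not), a model would be fit on every train range and scored on the matching validation range, and the setting with the best average score kept.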