1
00:00:11,650 --> 00:00:16,710
In this lecture we are going to talk about how to write code in PyTorch for a simple RNN.

2
00:00:19,780 --> 00:00:24,940
This is in preparation for the next lecture, where we are going to do the same forecasting exercise that

3
00:00:24,940 --> 00:00:30,780
we previously did, but we're going to replace the autoregressive linear model with a simple RNN.

4
00:00:31,750 --> 00:00:32,230
By the way.

5
00:00:32,230 --> 00:00:38,560
Remember that I'm not saying simple RNN to mean that simple is an adjective to describe the complexity

6
00:00:38,560 --> 00:00:44,320
of this kind of RNN. Rather, this is the actual name we use for this kind of RNN.

7
00:00:44,320 --> 00:00:48,820
In other words, we call this a simple RNN because that is its name.

8
00:00:48,970 --> 00:00:54,880
And I'm not saying it's simple as opposed to being complex because in fact what we are doing now is

9
00:00:54,880 --> 00:00:56,320
actually quite complex.

10
00:01:01,490 --> 00:01:05,920
So let's begin, again, by recalling the basic steps we are going to do in our script.

11
00:01:06,740 --> 00:01:12,680
Luckily we already encountered most of this in the previous script so most of your focus in this lecture

12
00:01:13,010 --> 00:01:16,680
is on how to make an RNN. As usual,

13
00:01:16,730 --> 00:01:22,640
Step number one is to load in the data. We are going to use our synthetically generated dataset once

14
00:01:22,640 --> 00:01:25,880
again so that remains the same.

15
00:01:25,880 --> 00:01:32,550
The only difference is that it is not going to be the right shape for an RNN. Step number two is

16
00:01:32,550 --> 00:01:34,580
to instantiate our model.

17
00:01:34,620 --> 00:01:40,010
This is the focus of this lecture because we're learning about a new kind of neural network.

18
00:01:40,230 --> 00:01:42,930
Step number three is to train the model.

19
00:01:42,930 --> 00:01:48,640
Luckily this is very simple due to my rule: all machine learning interfaces are the same.

20
00:01:48,810 --> 00:01:53,100
Step number four is to evaluate the model which again stays the same.

21
00:01:53,220 --> 00:01:56,880
Step number five is to make predictions using the model.

22
00:01:56,880 --> 00:02:00,310
This will be somewhat tricky again because of the shapes.

23
00:02:00,390 --> 00:02:04,020
Have I mentioned how important it is to keep track of the shapes in an RNN?

24
00:02:09,250 --> 00:02:09,570
Alright.

25
00:02:09,580 --> 00:02:11,950
So let's review step number one.

26
00:02:11,950 --> 00:02:14,230
Basically it's the same setup as before.

27
00:02:14,470 --> 00:02:20,110
We're going to create a sine wave with and without noise. Then we're going to create a supervised learning

28
00:02:20,130 --> 00:02:27,310
dataset with both inputs and targets, the input being T steps of the sequence and the target being

29
00:02:27,310 --> 00:02:30,150
the next step after those T steps.

30
00:02:30,190 --> 00:02:36,910
Remember that we count from zero up to len(series) minus big T because we want the final target to be

31
00:02:36,910 --> 00:02:41,610
len(series) minus 1, which is of course the final value of our series.

32
00:02:41,650 --> 00:02:47,440
The difference between this script and the previous script is that linear regression expects a 2D array

33
00:02:47,470 --> 00:02:51,100
as input, so we passed in an N by T array.

34
00:02:51,850 --> 00:02:57,520
However, as you know, an RNN expects an N by T by D array as input.

35
00:02:57,520 --> 00:03:08,220
Therefore we need to add a superfluous dimension of size 1 at the end to make it N by T by 1.
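
As a rough sketch of the data step just described, the windowing and the extra trailing dimension might look like the following. The window length T = 10, the series length of 200, and the 0.1 frequency are illustrative assumptions, not values given in the lecture, and noise is left out for brevity.

```python
import torch

# Illustrative values only; the lecture does not state them here.
T = 10                                             # past steps per input
series = torch.sin(0.1 * torch.arange(200, dtype=torch.float32))

X, Y = [], []
# t runs from 0 up to len(series) - T - 1, so the final target
# is series[len(series) - 1], the last value of the series.
for t in range(len(series) - T):
    X.append(series[t:t + T])                      # T consecutive values
    Y.append(series[t + T])                        # the value right after them

X = torch.stack(X).reshape(-1, T, 1)               # N x T x D, with D = 1
Y = torch.stack(Y).reshape(-1, 1)                  # N x K, with K = 1
print(X.shape, Y.shape)
```

Without the final reshape, X would be N by T, which is what the linear model accepted but not what an RNN layer expects.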

36
00:03:08,280 --> 00:03:10,160
Step two is to instantiate the model.

37
00:03:10,770 --> 00:03:15,840
Luckily I've already shown you the most crucial part of this, which is how to create custom models

38
00:03:15,840 --> 00:03:17,310
in PyTorch.

39
00:03:17,310 --> 00:03:22,250
Now you understand why it was so important to do it early in a simpler environment.

40
00:03:22,290 --> 00:03:26,430
To recap, we know that we're going to have to subclass nn.Module.

41
00:03:26,430 --> 00:03:28,690
The next step is to set up our layers.

42
00:03:28,800 --> 00:03:31,280
But the question is, what layers do we need?

43
00:03:36,420 --> 00:03:37,180
In PyTorch,

44
00:03:37,200 --> 00:03:40,310
the simple RNN layer is simply called RNN.

45
00:03:40,800 --> 00:03:46,020
And the reason we need to use a custom model is because RNN layers don't work like the previous layers

46
00:03:46,020 --> 00:03:51,750
we've seen. In fact, with most deep learning libraries, RNNs are where things start to get complicated.

47
00:03:52,860 --> 00:03:58,020
With linear models, ANNs, and CNNs, everything is pretty straightforward and the interface to each

48
00:03:58,020 --> 00:04:00,320
layer follows the same kind of pattern.

49
00:04:00,420 --> 00:04:02,700
Specifically a feed forward pattern.

50
00:04:02,700 --> 00:04:08,860
You have one input and you get one output. RNNs are tricky because they have this feedback loop.

51
00:04:09,030 --> 00:04:13,470
We know that we want to implement this equation which looks kind of like a logistic regression neuron

52
00:04:14,190 --> 00:04:20,670
and we know that the RNN module does just that, but it seems to have a lot of interesting arguments.

53
00:04:20,690 --> 00:04:24,670
So let's discuss them one by one.

54
00:04:24,670 --> 00:04:30,880
First we have the input size and the hidden size, which should be pretty easy to understand. Linear layers

55
00:04:30,880 --> 00:04:33,550
and convolution layers have similar arguments.

56
00:04:33,580 --> 00:04:39,970
We have the input dimensionality and the output dimensionality. But as you can see, RNNs have a few more

57
00:04:39,970 --> 00:04:46,080
arguments that we have to specify. The next one is num_layers. Unlike ANNs and CNNs,

58
00:04:46,090 --> 00:04:49,690
you don't need to add new layers manually. As you recall,

59
00:04:49,690 --> 00:04:54,730
if you want more dense layers in an ANN, you would add more layers of type nn.Linear.

60
00:04:54,730 --> 00:05:00,880
If you wanted more convolutions, you would add more convolution layers. But with RNNs,

61
00:05:00,910 --> 00:05:06,250
if you want more RNN layers, you don't need to declare more than one RNN object.

62
00:05:06,250 --> 00:05:13,670
Instead, you just pass in a different number for the argument num_layers. The next argument is the

63
00:05:13,670 --> 00:05:18,230
nonlinearity, where you can pass in something like ReLU or tanh.

64
00:05:18,240 --> 00:05:23,990
Again, this is different from ANNs and CNNs, where if you want a nonlinear activation function you

65
00:05:23,990 --> 00:05:29,150
would specify that manually as a separate object or function. With RNNs,

66
00:05:29,150 --> 00:05:37,280
it's simply an argument. Also, RNNs in PyTorch only accept the ReLU or tanh activation functions.

67
00:05:37,640 --> 00:05:43,300
The basic idea is, because RNNs are tricky to implement, if you need to stack multiple RNN layers,

68
00:05:43,310 --> 00:05:48,800
it's more efficient to have them implemented like this rather than gluing multiple RNNs together

69
00:05:49,010 --> 00:05:51,040
and doing the computations one at a time.

70
00:05:53,480 --> 00:05:59,030
Finally we set the argument batch_first equal to True, which tells the RNN layer that our sequence data

71
00:05:59,060 --> 00:06:02,780
will be of shape N by T by D instead of T by N by D.
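
To make the arguments just listed concrete, here is a minimal sketch of constructing the layer. The sizes (1 input feature, 5 hidden units, 2 layers) are made-up values for illustration; the argument names are the ones `torch.nn.RNN` uses.

```python
import torch.nn as nn

rnn = nn.RNN(
    input_size=1,          # D: number of input features
    hidden_size=5,         # M: number of hidden features
    num_layers=2,          # L: stacked RNN layers inside this one object
    nonlinearity="relu",   # only "tanh" (the default) or "relu" are accepted
    batch_first=True,      # inputs are N x T x D instead of T x N x D
)

# One object holds the weights for every stacked layer.
print(rnn.weight_ih_l0.shape)  # torch.Size([5, 1])  first layer reads D inputs
print(rnn.weight_ih_l1.shape)  # torch.Size([5, 5])  second layer reads M hidden values
```

The second layer's input weights are M by M because it consumes the hidden states of the first layer, which is the stacking that num_layers takes care of.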

72
00:06:07,910 --> 00:06:10,470
So the full constructor looks like this.

73
00:06:10,520 --> 00:06:16,840
First we inherit from nn.Module. Inside the constructor we call the parent class's constructor.

74
00:06:16,880 --> 00:06:21,890
Next we assign some instance variables, including the number of inputs, the number of hidden units, the

75
00:06:21,890 --> 00:06:26,240
number of hidden layers and the number of outputs.

76
00:06:26,620 --> 00:06:29,830
Next we declare our RNN layer or layers.

77
00:06:29,830 --> 00:06:33,850
And finally we declare our final dense layer, which is an nn.Linear.

78
00:06:38,840 --> 00:06:44,390
The forward function is similarly tricky because, unlike ANNs and CNNs, it's not just a matter of

79
00:06:44,390 --> 00:06:47,300
passing the data through one layer after another.

80
00:06:47,420 --> 00:06:55,510
Instead we have to pay special attention to the interface of the RNN layer. Unlike in an ANN or a CNN layer,

81
00:06:55,860 --> 00:06:59,920
where you only have one input and a set of weights, with an RNN layer

82
00:06:59,920 --> 00:07:07,380
you have two inputs and a set of weights. Specifically, the output depends not only on the input x but

83
00:07:07,380 --> 00:07:13,280
also on the initial hidden state h0. Therefore, in the forward function, before we even call the

84
00:07:13,320 --> 00:07:13,940
RNN,

85
00:07:14,100 --> 00:07:16,820
we have to first define the initial hidden state.

86
00:07:17,340 --> 00:07:22,380
As mentioned, typically we just set it to all zeros. In PyTorch,

87
00:07:22,410 --> 00:07:28,310
this is done using the torch.zeros function, which is analogous to the numpy.zeros function.

88
00:07:28,320 --> 00:07:32,430
We also have to remember to move all the values to the GPU.

89
00:07:32,430 --> 00:07:37,680
One important thing to keep in mind is that, due to the interface of the RNN module, it may actually

90
00:07:37,680 --> 00:07:40,590
comprise multiple RNN layers.

91
00:07:40,590 --> 00:07:46,350
If you recall, we store the number of layers in an attribute called L, and because each RNN layer

92
00:07:46,350 --> 00:07:48,270
requires its own initial state,

93
00:07:48,690 --> 00:07:53,400
we need to define the initial hidden states to be a three-dimensional tensor rather than just a one

94
00:07:53,400 --> 00:08:01,350
dimensional vector. And not only can we have separate initial hidden states per layer, but also per sample.

95
00:08:01,460 --> 00:08:08,090
Because of that, the initial hidden state is three-dimensional, L by N by M, where L is the number of RNN

96
00:08:08,120 --> 00:08:15,710
layers, N is the number of samples or the batch size, and M is the number of hidden features. As you

97
00:08:15,710 --> 00:08:20,920
recall from earlier, we can obtain the batch size by calling X.size(0).

98
00:08:20,960 --> 00:08:26,450
Once we have our initial hidden state h0, we can pass it into our RNN module.

99
00:08:26,450 --> 00:08:32,200
So the first argument is X, which is the data, and the second argument is h0, which is the initial hidden

100
00:08:32,240 --> 00:08:38,870
state. You can see that the output interface of the RNN module is also a little different.

101
00:08:38,930 --> 00:08:41,660
It seems to return two outputs rather than just one.
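
A small sketch of the call described above, creating the all-zeros initial hidden state and passing it in as the second argument. The sizes are illustrative assumptions, and the device line falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

L, M, D = 2, 5, 1                                  # layers, hidden features, input features
rnn = nn.RNN(input_size=D, hidden_size=M, num_layers=L, batch_first=True).to(device)

X = torch.randn(4, 10, D).to(device)               # N x T x D
h0 = torch.zeros(L, X.size(0), M).to(device)       # L x N x M, one state per layer and per sample

out, h_n = rnn(X, h0)                              # two return values, not one
print(out.shape, h_n.shape)
```

Note that h0 stays L by N by M even though batch_first=True; that flag only affects the layout of X and of the first output.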

102
00:08:46,820 --> 00:08:51,560
The output from the PyTorch RNN module is very strange if you haven't worked with it before.

103
00:08:51,860 --> 00:08:57,380
Just to remind you, the RNN module in PyTorch is not just a single RNN layer, but rather it could

104
00:08:57,380 --> 00:08:59,780
be a stack of RNN layers.

105
00:08:59,780 --> 00:09:03,160
If you think about all the possible hidden states we can calculate,

106
00:09:03,170 --> 00:09:09,870
we would have quite a few dimensions. First we have N, which is the number of samples or the batch size.

107
00:09:09,890 --> 00:09:16,370
This one we expect because it's always there. So small n tells us which sample. Next we have L, which

108
00:09:16,370 --> 00:09:18,060
tells us the number of layers.

109
00:09:18,200 --> 00:09:24,840
So the small l index tells us which layer. Next we have T, which tells us the number of time steps or

110
00:09:24,840 --> 00:09:26,130
the sequence length.

111
00:09:26,250 --> 00:09:28,530
Therefore the little t index is the time step.

112
00:09:29,460 --> 00:09:33,420
Finally we have little j, which as usual indexes the feature dimension.

113
00:09:34,020 --> 00:09:38,580
So overall we have four possible ways to index all the hidden states,

114
00:09:38,640 --> 00:09:44,670
if we conceptualize them as being one giant array: for each sample, each layer, each time step, and each

115
00:09:44,670 --> 00:09:45,090
feature.

116
00:09:50,150 --> 00:09:54,050
In PyTorch, the return values are broken up in kind of a strange way.

117
00:09:54,410 --> 00:10:00,670
The first output gives us the hidden states for the final layer at each time step. By default,

118
00:10:00,680 --> 00:10:06,410
its shape is T by N by M, where T is the sequence length, N is the batch size, and M is the number of

119
00:10:06,410 --> 00:10:07,620
hidden features.

120
00:10:07,850 --> 00:10:13,400
But since we passed in the argument batch_first equals True earlier, that means it is reshaped to be

121
00:10:13,460 --> 00:10:21,750
N by T by M. The second output gives us the hidden states over all hidden layers, but only at the final

122
00:10:21,750 --> 00:10:22,800
time step.

123
00:10:22,950 --> 00:10:29,130
So by default its shape is L by N by M, where L is the number of layers, N is the batch size, and M is

124
00:10:29,130 --> 00:10:36,190
the number of hidden features. Curiously, most common RNNs make use of the first output rather than the

125
00:10:36,190 --> 00:10:37,020
second.

126
00:10:37,030 --> 00:10:39,500
So in this course we'll be ignoring the second output.

127
00:10:40,600 --> 00:10:45,340
However, I did want to give you an explanation for why it's there, and not just ignore it without telling

128
00:10:45,340 --> 00:10:51,220
you what it is.
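
Here is a short check of how the two return values relate, under assumed sizes of N = 3, T = 7, D = 1, M = 4, L = 2. It omits h0, which `nn.RNN` treats as all zeros.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, T, D, M, L = 3, 7, 1, 4, 2
rnn = nn.RNN(D, M, num_layers=L, batch_first=True)

X = torch.randn(N, T, D)
out, h_n = rnn(X)          # h0 omitted, so it defaults to zeros

print(out.shape)           # torch.Size([3, 7, 4])  final layer, every time step
print(h_n.shape)           # torch.Size([2, 3, 4])  every layer, final time step

# The two outputs overlap in exactly one place:
# the final layer at the final time step.
same = torch.allclose(out[:, -1, :], h_n[-1])
print(same)                # True
```

So for a model that only needs the final layer's hidden states, the first output already contains everything, which is why the second one can be ignored.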

129
00:10:51,250 --> 00:10:56,020
The last thing we have to do in the forward function is pass the hidden state through the final dense layer.

130
00:10:56,740 --> 00:11:01,090
Since the hidden state we've obtained from the RNN module is N by T by M,

131
00:11:01,150 --> 00:11:07,330
it's not necessarily the right shape for our use case. Remember that this gives us all the hidden states

132
00:11:07,330 --> 00:11:13,450
for every time step, but sometimes, like for the next example, we only want the hidden state at the final

133
00:11:13,450 --> 00:11:20,490
time step, because that takes into account all the previous sequence data. In order to obtain the hidden

134
00:11:20,490 --> 00:11:22,360
state at the final time step,

135
00:11:22,410 --> 00:11:29,330
we do exactly the same thing we would do if it were a NumPy array. If the RNN output is called out,

136
00:11:29,660 --> 00:11:35,360
then I can index it with a colon, then a minus one, and then another colon. The first colon means

137
00:11:35,390 --> 00:11:40,850
grab all the samples in the first dimension, the minus 1 means grab only the last item from the second

138
00:11:40,850 --> 00:11:47,180
dimension, and the final colon means grab all the features in the final dimension. After indexing in

139
00:11:47,180 --> 00:11:53,390
this way, the output will be of size N by M, which is the same shape we have when we work with ANNs.

140
00:11:53,390 --> 00:11:56,150
So what happens after this should now be trivial for you.
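
Putting the constructor and the forward function together, one possible version of the model looks like this. The class name, argument names, and the choice of ReLU are assumptions made for this sketch, not taken from the lecture's slides, and the hidden state is created on X's device rather than moved afterwards.

```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_rnnlayers, n_outputs):
        super().__init__()                         # parent class's constructor
        self.D = n_inputs                          # input features
        self.M = n_hidden                          # hidden features
        self.L = n_rnnlayers                       # stacked RNN layers
        self.K = n_outputs                         # output nodes
        self.rnn = nn.RNN(
            input_size=self.D,
            hidden_size=self.M,
            num_layers=self.L,
            nonlinearity="relu",
            batch_first=True,
        )
        self.fc = nn.Linear(self.M, self.K)        # final dense layer

    def forward(self, X):
        # Initial hidden state: L x N x M, all zeros, on the same device as X.
        h0 = torch.zeros(self.L, X.size(0), self.M, device=X.device)
        out, _ = self.rnn(X, h0)                   # out is N x T x M
        last = out[:, -1, :]                       # N x M, final time step only
        return self.fc(last)                       # N x K

model = SimpleRNN(n_inputs=1, n_hidden=5, n_rnnlayers=1, n_outputs=1)
y = model(torch.randn(8, 10, 1))                   # N = 8, T = 10, D = 1
print(y.shape)                                     # torch.Size([8, 1])
```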

141
00:12:01,240 --> 00:12:06,300
Now that we've done all the hard work, the rest of the code will mostly be business as usual.

142
00:12:06,340 --> 00:12:08,980
First we instantiate the model we just created.

143
00:12:08,980 --> 00:12:16,190
We also move the model to the GPU. We create the loss and optimizer as before. We'll be using the mean

144
00:12:16,190 --> 00:12:22,980
squared error with the Adam optimizer. We create the inputs and targets for both the train and test sets.

145
00:12:22,980 --> 00:12:24,860
We move the data to the GPU.

146
00:12:25,170 --> 00:12:30,450
We do full gradient descent and we plot the loss per iteration to make sure the loss converges.
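
The training steps above could be sketched roughly as follows. The compact TinyRNN class, the learning rate of 0.01, the 50 iterations, and the half-and-half train/test split are all assumptions chosen to keep the example short. Plotting is left out; train_losses is the list you would plot.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TinyRNN(nn.Module):                          # compact stand-in for the full model
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(1, 5, batch_first=True)
        self.fc = nn.Linear(5, 1)

    def forward(self, X):
        out, _ = self.rnn(X)                       # h0 omitted, defaults to zeros
        return self.fc(out[:, -1, :])

# Synthetic sine data, shaped N x T x 1 and N x 1.
T = 10
series = torch.sin(0.1 * torch.arange(200, dtype=torch.float32))
X = torch.stack([series[t:t + T] for t in range(len(series) - T)]).reshape(-1, T, 1)
Y = series[T:].reshape(-1, 1)

N = len(X)
X_train, Y_train = X[:N // 2].to(device), Y[:N // 2].to(device)
X_test, Y_test = X[N // 2:].to(device), Y[N // 2:].to(device)

model = TinyRNN().to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

train_losses = []
for it in range(50):                               # full gradient descent, no batching
    optimizer.zero_grad()
    loss = criterion(model(X_train), Y_train)
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())

with torch.no_grad():
    test_loss = criterion(model(X_test), Y_test).item()
print(train_losses[0], train_losses[-1], test_loss)
```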

147
00:12:35,600 --> 00:12:36,550
The last step,

148
00:12:36,680 --> 00:12:43,000
making our predictions, might be a little complicated since we have to pay special attention to the shapes.

149
00:12:43,040 --> 00:12:49,550
Remember that in general our input is of shape N by T by D, where N is the number of samples, T is the

150
00:12:49,550 --> 00:12:56,390
sequence length, and D the number of features. The output is of shape N by K, where K is the number of output

151
00:12:56,390 --> 00:12:57,480
nodes.

152
00:12:57,890 --> 00:13:02,350
In our case this is just one because we are doing scalar regression.

153
00:13:02,380 --> 00:13:09,250
Importantly, we recognize that a single input, which is the time series, is just technically a one-dimensional

154
00:13:09,250 --> 00:13:12,010
array of length T.

155
00:13:12,010 --> 00:13:18,040
So in order to make this into an appropriate input for our RNN, we actually have to reshape it to one

156
00:13:18,040 --> 00:13:20,030
by T by 1.

157
00:13:20,420 --> 00:13:26,130
That's because the number of samples is 1 and the number of feature dimensions is also 1.

158
00:13:26,150 --> 00:13:32,940
Then when we get our output, it's going to be one by one, since again N is 1 and K is 1.

159
00:13:33,110 --> 00:13:40,910
Therefore, in order to get this value as a scalar, we need to index it at 0 twice. You saw this already

160
00:13:40,910 --> 00:13:44,490
in our linear regression script so it shouldn't be too difficult by now.
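
For the prediction step, the reshaping and double indexing might look like this. The model here is an untrained stand-in with assumed sizes, so only the shapes are meaningful, not the predicted value.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):                          # untrained stand-in model
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(1, 5, batch_first=True)
        self.fc = nn.Linear(5, 1)

    def forward(self, X):
        out, _ = self.rnn(X)                       # h0 omitted, defaults to zeros
        return self.fc(out[:, -1, :])

model = TinyRNN()
T = 10
window = torch.sin(0.1 * torch.arange(T, dtype=torch.float32))   # 1-D array of length T

x = window.reshape(1, T, 1)                        # 1 x T x 1: one sample, one feature
with torch.no_grad():
    p = model(x)                                   # 1 x 1, since N = 1 and K = 1

value = p[0, 0].item()                             # index at 0 twice to get a scalar
print(x.shape, p.shape, value)
```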