1
00:00:11,730 --> 00:00:16,430
In this lecture we are going to begin discussing recurrent neural networks.

2
00:00:16,470 --> 00:00:20,350
Let's first think about why an RNN would be useful.

3
00:00:20,550 --> 00:00:24,830
As usual, we are going to start with our dumb-as-possible approach.

4
00:00:24,900 --> 00:00:27,370
What is the simplest thing we could do?

5
00:00:27,390 --> 00:00:31,290
Let's think back to how we approached image classification.

6
00:00:31,290 --> 00:00:36,690
Before we learned about CNNs, which are specialized for looking at images, we used ANNs.

7
00:00:36,960 --> 00:00:42,750
We were able to do this because we took an image and we just flattened it into a feature vector which

8
00:00:42,750 --> 00:00:46,400
is the only type of data you can pass into an ANN.

9
00:00:46,440 --> 00:00:48,650
Recall that this goes along with our motto:

10
00:00:48,720 --> 00:00:50,420
All data is the same.

11
00:00:50,550 --> 00:00:56,370
It doesn't matter if your feature vector is a flattened image or some data you collected in a survey.

12
00:00:56,370 --> 00:00:57,840
The ANN doesn't care.

13
00:01:02,970 --> 00:01:06,350
so let's consider whether we can take the same approach now.

14
00:01:06,510 --> 00:01:10,910
In fact we already did when we looked at our time series forecasting model.

15
00:01:10,980 --> 00:01:17,690
We just said: pretend T is D, and pass that N by T matrix into a linear regression model.

16
00:01:17,820 --> 00:01:21,620
The model has no idea that what you're passing in is a time series.

17
00:01:21,660 --> 00:01:23,850
It's just doing the same algorithm it's always done.

18
00:01:24,840 --> 00:01:28,650
But here's the question: what if your D is not one?

19
00:01:28,650 --> 00:01:35,380
In other words, you have a multi-dimensional time series. Recall that one example of this is if you work

20
00:01:35,380 --> 00:01:39,820
at Neuralink and you've implanted multiple electrodes into your brain.

21
00:01:39,910 --> 00:01:45,310
Now you have multiple recordings at every time step making it a multi-dimensional time series

22
00:01:50,520 --> 00:01:54,360
Well here's an idea that requires pretty much no imagination at all.

23
00:01:54,360 --> 00:02:02,520
Why not flatten this data too? Let's take a T by D matrix and turn it into a T times D sized vector.

24
00:02:02,520 --> 00:02:07,390
Let's say you have five electrodes in your head and you make a recording with sequence length one hundred

25
00:02:07,470 --> 00:02:09,990
so one hundred time steps.

26
00:02:10,080 --> 00:02:14,050
Then if you flatten this you would just get a vector of size 500

27
00:02:16,870 --> 00:02:20,800
in fact this is even easier to visualize than a flattened image.

28
00:02:20,800 --> 00:02:23,590
Here we just have five different time series signals.

29
00:02:23,680 --> 00:02:28,440
Now all we're going to do is concatenate them together and say it's one big vector.

30
00:02:28,510 --> 00:02:29,230
Easy peasy
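
The flattening step just described can be sketched in a few lines of plain Python. This is only an illustration under assumed values: the name `recording` and the made-up numbers in it are hypothetical, and the row-by-row ordering is one possible choice (concatenating the five signals one after another would give a different ordering but the same length of 500).

```python
# Hypothetical recording: T = 100 time steps, D = 5 electrodes.
T, D = 100, 5

# A T x D "matrix" stored as a list of T rows, each row a list of D values.
# The values are placeholders, chosen only so each entry is distinguishable.
recording = [[float(t * D + d) for d in range(D)] for t in range(T)]

# Flatten row by row into a single feature vector of length T * D.
flat = [value for row in recording for value in row]

print(len(flat))
```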

31
00:02:34,400 --> 00:02:39,790
all right but we already know that this approach is possible and we've pretty much done it in code already.

32
00:02:39,820 --> 00:02:45,130
If you're confused about what I mean by that, it would be pretty much the same as our MNIST script, which at

33
00:02:45,130 --> 00:02:50,740
this point is pretty simple compared to what we've done so far so let's go to the next step and think

34
00:02:50,740 --> 00:02:53,020
about why that might not be a good idea.

35
00:02:58,100 --> 00:03:03,940
The first problem is the same as with CNNs: if you were to do a full matrix multiplication,

36
00:03:04,130 --> 00:03:10,670
it would just take up way too much space. If D equals one hundred and T equals ten thousand, which by

37
00:03:10,670 --> 00:03:13,100
the way is not unrealistic,

38
00:03:13,100 --> 00:03:17,930
then all of a sudden your flattened feature vector is of size 1 million.

39
00:03:17,930 --> 00:03:22,280
You probably don't want a one-million-size feature vector going into your ANN.

40
00:03:27,430 --> 00:03:34,250
As with CNNs, RNNs take advantage of the special structure of the data. An ANN is the most general.

41
00:03:34,270 --> 00:03:40,600
It just connects every input to every feature. Every feature, in other words, is directly composed of every

42
00:03:40,600 --> 00:03:41,830
single input.

43
00:03:42,220 --> 00:03:47,410
We know that for images this is not really a great idea because we're actually looking for very tiny

44
00:03:47,410 --> 00:03:50,710
patterns relative to the full image.

45
00:03:50,740 --> 00:03:54,790
The question is: in what way can we exploit the structure of a sequence?

46
00:03:59,900 --> 00:04:03,950
Here's the idea: we can take inspiration from forecasting.

47
00:04:04,370 --> 00:04:11,070
We know that to predict the next value, past values are useful. But what if we apply this to the hidden

48
00:04:11,070 --> 00:04:12,480
feature vectors?

49
00:04:12,600 --> 00:04:18,110
So let's take an ANN where the hidden feature vector is calculated from the input vector.

50
00:04:18,330 --> 00:04:20,940
The output is then calculated from the hidden vector.

51
00:04:26,050 --> 00:04:32,590
Now, to make an RNN, all we need to do is make the hidden vector loop back to itself.

52
00:04:32,590 --> 00:04:39,420
In other words the hidden vector depends not only on the input but also on its own previous value.

53
00:04:39,430 --> 00:04:44,530
This is just like our very basic linear regression forecasting model where the prediction was just a

54
00:04:44,530 --> 00:04:47,190
linear function of past values.

55
00:04:47,200 --> 00:04:54,730
Now we are saying the hidden state is a nonlinear function of past values. Specifically, the nonlinear

56
00:04:54,730 --> 00:04:57,430
function is a neuron.

57
00:04:57,430 --> 00:05:02,500
As a side note the loop back implicitly assumes that there's a time delay of one step

58
00:05:07,680 --> 00:05:10,300
All right, so how do we calculate the output of an RNN?

59
00:05:11,320 --> 00:05:17,290
Well, we use the calculation you see here. As you can see, there's only one new thing: the hidden state

60
00:05:17,450 --> 00:05:22,970
h of t now depends on the hidden state at the previous time step as well.

61
00:05:23,050 --> 00:05:28,770
Also, x, y, and h are indexed by time, but that should be a given.

62
00:05:29,140 --> 00:05:34,300
Some other assumptions we made here are that we're using the sigmoid activation for both the hidden

63
00:05:34,300 --> 00:05:36,040
and output layers.

64
00:05:36,040 --> 00:05:40,870
This is of course not necessary but we need to write something down so that's what I decided to write

65
00:05:40,870 --> 00:05:41,860
down.
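
The slide with the calculation is not part of this transcript, so the following is only a reconstruction from the spoken description: a sigmoid at both the hidden and output layers, and a hidden state that depends on the current input and the previous hidden state. The transposes are an assumption, chosen to match the shapes given later in the lecture (the input-to-hidden weight is D by M, the hidden-to-hidden weight is M by M).

```latex
h_t = \sigma\left( W_{xh}^\top x_t + W_{hh}^\top h_{t-1} + b_h \right)

\hat{y}_t = \sigma\left( W_o^\top h_t + b_o \right)
```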

66
00:05:42,400 --> 00:05:47,470
As usual, the output activation is dependent on the task being done.

67
00:05:47,470 --> 00:05:50,050
So a sigmoid for binary classification.

68
00:05:50,230 --> 00:05:54,130
Nothing for regression, and a softmax for multi-class classification.

69
00:05:56,280 --> 00:06:00,260
Another assumption is that this neural network only has one hidden layer.

70
00:06:00,400 --> 00:06:03,960
In fact, with RNNs this is pretty often the case.

71
00:06:04,020 --> 00:06:07,540
You could stack up more hidden layers but I don't see that too often.

72
00:06:07,620 --> 00:06:11,600
As usual you'll want to test that out on your particular dataset.

73
00:06:11,670 --> 00:06:13,230
If you think it might be a good idea

74
00:06:18,350 --> 00:06:21,870
it's helpful to name these weight matrices so we don't get them confused later on.

75
00:06:22,670 --> 00:06:24,890
Luckily it's pretty intuitive.

76
00:06:24,950 --> 00:06:32,010
The reason we call W_xh W_xh is because the input is x and the output is h.

77
00:06:32,210 --> 00:06:35,240
So we call this the input-to-hidden weight.

78
00:06:35,510 --> 00:06:46,010
The reason we call W_hh W_hh is because the input is the previous h and the output is the next h.

79
00:06:46,320 --> 00:06:54,030
So we call this the hidden-to-hidden weight. b_h we call the hidden bias, and finally W_o and b_o

80
00:06:54,390 --> 00:06:57,270
are the output weight and output bias.

81
00:06:57,300 --> 00:07:03,750
The important thing there was to differentiate between W_xh and W_hh, since they are both weights belonging

82
00:07:03,750 --> 00:07:06,000
to the recurrent layer.

83
00:07:06,000 --> 00:07:12,150
As a side note, this layer itself has several names. The most common name is just the simple recurrent

84
00:07:12,150 --> 00:07:17,430
unit but since this has been studied in the past it's also associated with a person.

85
00:07:17,790 --> 00:07:24,940
And so it is also known as the Elman unit.

86
00:07:25,080 --> 00:07:29,680
All right, so for most people these equations are pretty intuitive, but for some they're not.

87
00:07:29,940 --> 00:07:35,310
So let's just be super clear about how we are going to calculate the output prediction given a sequence

88
00:07:35,310 --> 00:07:36,760
of inputs.

89
00:07:36,810 --> 00:07:42,150
Actually this will help us uncover some hidden details that we need to consider that are not obvious

90
00:07:42,150 --> 00:07:43,250
at first.

91
00:07:44,520 --> 00:07:51,780
First, we assume that we are given a sequence of input vectors, x_1 all the way up to x_T. Each individual

92
00:07:51,780 --> 00:07:54,540
x is of course a vector of size D.

93
00:07:55,680 --> 00:08:02,280
So this entire sequence would be stored in code as a matrix of size T by D.

94
00:08:02,370 --> 00:08:06,530
In math however we assume that we are working with individual vectors.

95
00:08:06,810 --> 00:08:09,980
So let's start with x_1. From x_1,

96
00:08:09,990 --> 00:08:13,210
we can calculate h_1. But wait.

97
00:08:13,530 --> 00:08:20,330
Now you see one of these hidden details that you probably didn't think about: h_1 depends on h_0.

98
00:08:20,500 --> 00:08:22,490
But what is h_0?

99
00:08:22,480 --> 00:08:30,030
In fact, we refer to this as the initial hidden state. Typically we just set this to an array of zeros,

100
00:08:30,360 --> 00:08:33,000
but it can also be a learned parameter.

101
00:08:33,000 --> 00:08:38,170
In other words, you can learn this vector using gradient descent. In PyTorch,

102
00:08:38,170 --> 00:08:42,470
it's possible to make the initial state a learnable parameter, but it's simpler not to,

103
00:08:42,490 --> 00:08:44,260
and in fact quite conventional.

104
00:08:44,380 --> 00:08:52,270
So in this course that's the approach we're going to take.

105
00:08:52,390 --> 00:08:52,680
All right.

106
00:08:52,690 --> 00:09:04,000
So now we have h_1. From this, we can calculate y hat 1 using the usual neuron formula.

107
00:09:04,060 --> 00:09:07,310
Next, we consider the input vector x_2.

108
00:09:07,360 --> 00:09:13,410
From this, we can calculate h_2. h_2 depends on x_2 and h_1.

109
00:09:13,540 --> 00:09:15,020
We just calculated h_1.

110
00:09:15,070 --> 00:09:17,110
So that's not a problem.

111
00:09:17,110 --> 00:09:19,450
From this, we can calculate y hat 2.

112
00:09:24,570 --> 00:09:31,530
Now we can repeat this process to calculate h_3 and y hat 3, h_4 and y hat 4, and so on, all

113
00:09:31,530 --> 00:09:33,430
the way up to h_T

114
00:09:33,450 --> 00:09:34,580
and y hat T.

115
00:09:39,690 --> 00:09:40,400
Okay great.

116
00:09:40,400 --> 00:09:42,440
That all makes sense by the way.

117
00:09:42,440 --> 00:09:47,240
If you found that a little confusing don't worry because we are also going to do this calculation in

118
00:09:47,240 --> 00:09:48,890
code as well.

119
00:09:48,890 --> 00:09:52,640
Just as a little extra practice to make sure you understand how it works.

120
00:09:53,420 --> 00:09:56,960
But here's something you probably noticed which is strange.

121
00:09:56,960 --> 00:10:02,450
Why do we have a y hat for every time step? If we are doing something like forecasting,

122
00:10:02,450 --> 00:10:08,460
we only want to predict the very next value. If we are predicting something like the sentiment of a tweet,

123
00:10:08,540 --> 00:10:09,870
we only have one answer:

124
00:10:09,920 --> 00:10:12,410
positive sentiment or negative sentiment.

125
00:10:12,410 --> 00:10:17,090
So what is the purpose of having a y hat for each time step?

126
00:10:17,450 --> 00:10:21,200
Each of the Y hats depends only on the x's up to that point.

127
00:10:26,350 --> 00:10:32,140
The answer is that for these types of problems which I just described all the y hats except for the

128
00:10:32,140 --> 00:10:34,190
final y hat are ignored.

129
00:10:34,390 --> 00:10:41,540
So y hat of T gives us the final prediction, and all the previous y hats are discarded.

130
00:10:41,670 --> 00:10:44,130
You can think of them as like temporary variables

131
00:10:49,290 --> 00:10:50,500
at the same time.

132
00:10:50,610 --> 00:10:54,910
There are cases where we would want to keep all the y hats at each time step.

133
00:10:55,170 --> 00:11:01,000
One such example is neural machine translation. In neural machine translation,

134
00:11:01,080 --> 00:11:04,410
both your input and your output are sentences.

135
00:11:04,410 --> 00:11:07,410
They are just sentences in different languages.

136
00:11:07,410 --> 00:11:13,110
In this case both the input and the target are sequences and therefore you need to capture the predictions

137
00:11:13,110 --> 00:11:13,980
at each time point

138
00:11:19,030 --> 00:11:21,370
Considering neural machine translation,

139
00:11:21,370 --> 00:11:27,590
let's think about how we can reason about what an RNN is predicting in probabilistic terms.

140
00:11:27,760 --> 00:11:33,910
If you recall, when we looked at ANNs, and hence this also applies to CNNs, the output prediction

141
00:11:34,240 --> 00:11:40,960
after applying the softmax function is a probability distribution. For a regular ANN or a CNN,

142
00:11:41,050 --> 00:11:48,520
it's the probability that y equals k given x. x is the input vector, or in the case of CNNs an image,

143
00:11:48,940 --> 00:11:53,260
and y is the label. For neural machine translation,

144
00:11:53,260 --> 00:11:55,100
this is also classification.

145
00:11:55,410 --> 00:12:00,110
We are trying to predict a category which is a word in the target language.

146
00:12:00,400 --> 00:12:07,120
So the question for RNNs is: if we are doing classification, we will have the probability of y of

147
00:12:07,120 --> 00:12:10,110
t equals k given something.

148
00:12:10,330 --> 00:12:11,560
But what is this something?

149
00:12:16,710 --> 00:12:21,060
One picture that helps us to understand this is the unrolled RNN.

150
00:12:21,210 --> 00:12:26,230
This is like a block diagram of all of our calculations and their dependencies.

151
00:12:26,400 --> 00:12:32,130
First, we can see that h_1 depends on h_0 and x_1. y hat 1

152
00:12:32,130 --> 00:12:37,830
depends on h_1, and therefore it indirectly depends on x_1. h_2

153
00:12:37,830 --> 00:12:41,580
depends on h_1 and x_2. y hat 2

154
00:12:41,580 --> 00:12:43,290
depends on h_2.

155
00:12:43,470 --> 00:12:50,990
Thus y hat 2 indirectly depends on x_1 and x_2. Similarly, y hat 3

156
00:12:50,990 --> 00:12:53,610
depends on x_1, x_2, and x_3.

157
00:12:53,780 --> 00:12:57,720
y hat 4 depends on x_1, x_2, x_3, and x_4.

158
00:12:58,370 --> 00:13:00,170
Finally, y hat 5

159
00:13:00,170 --> 00:13:05,350
depends on all the x's: x_1, x_2, x_3, x_4, and x_5.

160
00:13:05,420 --> 00:13:08,980
In general, we can say that y hat T

161
00:13:09,110 --> 00:13:16,910
depends on every x in the input sequence.

162
00:13:17,010 --> 00:13:23,820
It should be clear from this picture that each y hat of t overall depends only on the current and past values

163
00:13:23,820 --> 00:13:24,990
of x.

164
00:13:25,170 --> 00:13:32,110
So what does this mean in terms of the probability distribution that the RNN is modelling? The probability

165
00:13:32,110 --> 00:13:36,620
of y_1 equals k given x_1 is the first distribution.

166
00:13:36,700 --> 00:13:44,100
But now notice that y hat 2 depends on both x_2 and h_1, which in turn depends on x_1.

167
00:13:44,170 --> 00:13:53,780
So we write the probability of y_2 equals k given x_1 and x_2. Next, we have the probability of y_3

168
00:13:53,780 --> 00:13:57,800
equals k given x_1, x_2, and x_3.

169
00:13:57,800 --> 00:14:04,910
And this pattern continues all the way up to the probability of y_T equals k given x_1, x_2, x_3, all

170
00:14:04,910 --> 00:14:06,300
the way up to x_T.
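
Written out in symbols, the sequence of distributions just described would look roughly like this. The notation is an assumption based on the spoken description, since the slide itself is not in the transcript; k indexes the classes.

```latex
p(y_1 = k \mid x_1)

p(y_2 = k \mid x_1, x_2)

p(y_3 = k \mid x_1, x_2, x_3)

\vdots

p(y_T = k \mid x_1, x_2, \ldots, x_T)
```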

171
00:14:11,220 --> 00:14:14,000
Now this is not that important in and of itself.

172
00:14:14,100 --> 00:14:18,960
You're not really going to be manipulating these probabilities or doing anything mathematically, but

173
00:14:18,960 --> 00:14:19,740
it's interesting.

174
00:14:19,740 --> 00:14:26,500
If you've ever studied Markov models, the idea with Markov models is that they make the Markov assumption.

175
00:14:26,640 --> 00:14:28,890
What is the Markov assumption?

176
00:14:28,890 --> 00:14:35,410
It's easiest to understand with words. The Markov assumption says that the probability of the next word

177
00:14:35,470 --> 00:14:41,260
depends only on the current word in the sentence but not on any previous words.

178
00:14:41,440 --> 00:14:44,720
You might think that sounds very unrealistic.

179
00:14:44,740 --> 00:14:50,560
For example, if the current word in a sentence is "the", how can you possibly predict the next word in a

180
00:14:50,560 --> 00:14:54,500
sentence? The next word could be any number of words.

181
00:14:54,550 --> 00:14:57,390
So Markov models are pretty weak if you use them that way.

182
00:15:02,530 --> 00:15:06,530
Now consider an RNN for predicting the next word in a sentence.

183
00:15:06,550 --> 00:15:14,290
So we're modeling the probability of x of t plus 1, the word at time t plus 1. Then our output distribution

184
00:15:14,320 --> 00:15:21,400
is the probability of x of t plus 1 given x of 1, x of 2, all the way up to x of t.

185
00:15:21,400 --> 00:15:27,250
As you can see, the RNN accounts for all the previous words in a sentence rather than just the most

186
00:15:27,250 --> 00:15:28,490
recent word.

187
00:15:28,630 --> 00:15:32,710
This is one reason why RNNs are so powerful for modeling sequences.
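
Side by side, the two modelling assumptions for next-word prediction could be written as follows. This is a sketch in assumed notation, with x_t standing for the word at time t:

```latex
\text{Markov model:} \quad p(x_{t+1} \mid x_t)

\text{RNN:} \quad p(x_{t+1} \mid x_1, x_2, \ldots, x_t)
```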

188
00:15:37,830 --> 00:15:41,480
As a final thought exercise for helping you understand RNNs,

189
00:15:41,580 --> 00:15:47,090
let's write some pseudocode that would give us the y hats at each time step.

190
00:15:47,140 --> 00:15:49,760
First we have to consider what we're given.

191
00:15:49,780 --> 00:15:57,970
We're given the initial state h_0 and the weights of the RNN. So we have W_xh, which maps from x to h.

192
00:15:58,390 --> 00:16:02,980
We have W_hh, which maps from the previous h to the current h.

193
00:16:02,980 --> 00:16:09,460
We have b_h, the bias term for the recurrent hidden layer. We have W_o, the output weight, and b_o, the

194
00:16:09,460 --> 00:16:17,270
output bias. And of course, we're also given a single input sequence X, which has the shape T by D.

195
00:16:17,270 --> 00:16:20,080
It represents T sequential vectors of size D.

196
00:16:20,900 --> 00:16:24,470
Let's assume a tanh activation and a softmax at the output.

197
00:16:29,570 --> 00:16:29,890
okay.

198
00:16:29,910 --> 00:16:36,300
So to calculate the output predictions, we're going to start by initializing y hat, the output, as an empty

199
00:16:36,300 --> 00:16:37,420
list.

200
00:16:37,770 --> 00:16:44,800
Then we're going to set h_last, the previous value of h, to h_0. Then we're going to loop through each

201
00:16:44,830 --> 00:16:47,080
time step from 1 up to T.

202
00:16:50,080 --> 00:16:50,850
Inside the loop,

203
00:16:50,850 --> 00:16:56,830
we're going to calculate h of t using the recurrent neuron equation we saw earlier.

204
00:16:56,830 --> 00:17:01,110
Then we can use h of t to calculate y hat at time t.

205
00:17:01,120 --> 00:17:08,100
As usual, by the way, this is just the dense layer. Next, we append y hat of t to the list of predictions,

206
00:17:08,100 --> 00:17:16,890
y hat. Finally, and this step is important not to forget, we assign h of t to h_last, so that h_last always

207
00:17:16,890 --> 00:17:22,140
represents the last value of h of t when we do our calculation for the next h of t.
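
The pseudocode just described can be turned into a small runnable sketch. It is written with plain Python lists rather than a tensor library so it runs anywhere, and it is not the course's own script. The helper names (`matvec_t`, `softmax`, `rnn_forward`, `rand_matrix`), the toy sizes, and the random weights are all assumptions made for illustration. Weights are stored with shape (inputs, outputs), matching the D by M and M by M shapes mentioned in the lecture.

```python
import math
import random


def matvec_t(W, v):
    """Return W^T v, where W is a list of len(v) rows, each of length n_out."""
    n_out = len(W[0])
    return [sum(W[i][j] * v[i] for i in range(len(v))) for j in range(n_out)]


def softmax(a):
    """Numerically stable softmax over a list of logits."""
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_forward(X, h0, Wxh, Whh, bh, Wo, bo):
    """Return the list of y hats, one per time step, for a single sequence X."""
    y_hats = []      # start with an empty list of predictions
    h_last = h0      # previous hidden state, initially h_0
    for x_t in X:    # loop over time steps 1..T
        pre = [p + q + b for p, q, b in
               zip(matvec_t(Wxh, x_t), matvec_t(Whh, h_last), bh)]
        h_t = [math.tanh(v) for v in pre]            # recurrent unit with tanh
        logits = [p + b for p, b in zip(matvec_t(Wo, h_t), bo)]
        y_hats.append(softmax(logits))               # dense layer + softmax
        h_last = h_t                                 # do not forget this step
    return y_hats


# Toy sizes, chosen arbitrarily: T steps, D inputs, M hidden units, K classes.
T, D, M, K = 4, 3, 5, 2
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


X = rand_matrix(T, D)
Wxh, Whh, Wo = rand_matrix(D, M), rand_matrix(M, M), rand_matrix(M, K)
bh, bo = [0.0] * M, [0.0] * K
h0 = [0.0] * M   # initial hidden state set to zeros, as in the lecture

y_hats = rnn_forward(X, h0, Wxh, Whh, bh, Wo, bo)
print(len(y_hats), len(y_hats[0]))
```

For a task with a single answer per sequence, only `y_hats[-1]` would be used; for a sequence-to-sequence task, the whole list is kept.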

208
00:17:27,280 --> 00:17:33,220
At this point, we are going to briefly go back to the biological inspiration for neural networks.

209
00:17:33,490 --> 00:17:39,640
Recall earlier when we talked about how neurons are organized in the brain. We can imagine that there

210
00:17:39,710 --> 00:17:43,960
is no reason for the neurons in your brain to all go in one direction.

211
00:17:43,960 --> 00:17:48,240
In fact recurrent circuits in the brain have been well studied in biology.

212
00:17:48,490 --> 00:17:52,540
The idea that neurons can connect in a loop goes back decades.

213
00:17:52,540 --> 00:17:56,600
And personally I find this topic very interesting.

214
00:17:56,620 --> 00:18:02,230
One such example is the Hopfield network. One of the methods used to train Hopfield networks is called

215
00:18:02,230 --> 00:18:05,640
Hebbian learning, invented by Donald Hebb.

216
00:18:05,680 --> 00:18:11,670
He inspired the famous rule: "neurons that fire together wire together, and neurons that fire out of sync

217
00:18:11,680 --> 00:18:13,190
fail to link."

218
00:18:13,210 --> 00:18:19,330
It's actually quite unfortunate that most people have a very narrowly focused view on deep learning.

219
00:18:19,450 --> 00:18:24,820
It would be very interesting to see if we applied modern technology to these old ideas what we could

220
00:18:24,820 --> 00:18:30,430
discover.

221
00:18:30,460 --> 00:18:35,230
The last thing we are going to do in this lecture, now that you understand RNNs, is to ask the

222
00:18:35,230 --> 00:18:38,450
question: what are our savings?

223
00:18:38,620 --> 00:18:45,040
If you recall, we did this example with CNNs also. The idea was that because CNNs take advantage of the

224
00:18:45,040 --> 00:18:52,660
structure of images, we can use shared weights. The same idea applies to RNNs. With RNNs,

225
00:18:52,660 --> 00:18:58,480
we apply the same weight matrix to each input vector x of t, and we apply the same hidden-to-hidden

226
00:18:58,480 --> 00:19:01,610
weight for each calculation of h of t.

227
00:19:01,750 --> 00:19:07,450
Thus it's useful to compare how much we would save if we used an RNN instead of an ANN with

228
00:19:07,450 --> 00:19:10,180
a flattened time series.

229
00:19:10,630 --> 00:19:14,730
So let's do a simple calculation for an ANN. For simplicity's sake,

230
00:19:14,740 --> 00:19:16,480
we won't consider bias terms.

231
00:19:21,630 --> 00:19:28,190
So let's say that T, the sequence length, is one hundred, and D, the feature vector size, is 10.

232
00:19:28,200 --> 00:19:32,060
Let's say that the hidden vector size M is 15.

233
00:19:32,100 --> 00:19:35,850
Then if we flatten this, we would have a length-1000 input vector.

234
00:19:38,250 --> 00:19:39,180
Correspondingly,

235
00:19:39,180 --> 00:19:41,350
we would also have T hidden states.

236
00:19:41,520 --> 00:19:46,720
So the hidden size in the ANN would be T times M, which is fifteen hundred.

237
00:19:46,720 --> 00:19:51,750
Finally let's suppose we are just doing binary classification so the number of output nodes is just

238
00:19:51,750 --> 00:19:55,020
1. In this scenario,

239
00:19:55,020 --> 00:20:00,870
our input-to-hidden weight would be of size one thousand times fifteen hundred, which is one point five

240
00:20:00,870 --> 00:20:07,960
million. Our hidden-to-output weight would be of size fifteen hundred by one, which is just fifteen hundred.

241
00:20:08,100 --> 00:20:12,480
Therefore in total we have approximately one point five million weights

242
00:20:17,620 --> 00:20:17,830
Now,

243
00:20:17,830 --> 00:20:19,540
compare this to the simple RNN.

244
00:20:20,320 --> 00:20:27,350
Remember, we are calculating the size of W_xh, W_hh, and W_o. W_xh,

245
00:20:27,370 --> 00:20:36,490
since it goes from input to hidden, must be of size D by M, which is 10 times 15, which is 150. W_hh, since

246
00:20:36,490 --> 00:20:45,060
it goes from hidden to hidden, must be of size M by M, which is 15 times 15, which is 225.

247
00:20:45,120 --> 00:20:53,830
Finally, we have W_o, which goes from hidden to output, which must be of size M by 1, which is just 15. So

248
00:20:53,830 --> 00:21:00,310
in total we have 150 plus 225 plus 15, which is 390.

249
00:21:00,430 --> 00:21:06,370
So if we take the ANN, which has 1,501,500 weights, and divide that

250
00:21:06,370 --> 00:21:14,050
by 390, we find that the ANN has 3,850 times more parameters, which is a huge

251
00:21:14,050 --> 00:21:15,680
amount of savings.
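
The arithmetic above can be checked with a few lines of Python. The variable names are chosen here for illustration; the sizes (T = 100, D = 10, M = 15, one output node, no bias terms) are the ones stated in the lecture.

```python
# Sizes from the lecture: sequence length, feature size, hidden size, outputs.
T, D, M, K = 100, 10, 15, 1

# ANN on the flattened series: (T*D) inputs, (T*M) hidden units, K outputs.
ann_input_to_hidden = (T * D) * (T * M)
ann_hidden_to_output = (T * M) * K
ann_total = ann_input_to_hidden + ann_hidden_to_output

# Simple RNN: W_xh is D x M, W_hh is M x M, W_o is M x K.
rnn_total = D * M + M * M + M * K

print(ann_total, rnn_total, ann_total / rnn_total)
```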

252
00:21:15,730 --> 00:21:22,240
Thus we learn again that by taking advantage of the structure of specialized data we can create much

253
00:21:22,240 --> 00:21:26,200
simpler and more compact models compared to full ANNs.
