1
00:00:11,550 --> 00:00:16,770
In this lecture, we are going to start discussing modern RNN units. You may have heard of these

2
00:00:16,770 --> 00:00:19,080
already, so I'll just tell you right now what they are.

3
00:00:19,230 --> 00:00:21,270
The LSTM and the GRU.

4
00:00:21,840 --> 00:00:27,600
LSTM stands for long short-term memory and GRU stands for gated recurrent unit.

5
00:00:28,410 --> 00:00:34,290
They are both essentially the same thing, with the GRU meant to be a more simplified and efficient version

6
00:00:34,290 --> 00:00:35,370
of the LSTM.

7
00:00:36,120 --> 00:00:42,270
This lecture will discuss why the LSTM and GRU are needed in the first place when the simple

8
00:00:42,270 --> 00:00:43,980
RNN seems pretty good already.

9
00:00:44,580 --> 00:00:50,310
We'll also look at the underlying equations for the LSTM and GRU and, importantly, a convenient

10
00:00:50,310 --> 00:00:51,880
perspective for looking at them.

11
00:00:52,710 --> 00:00:58,460
One mistake a lot of beginners make is that they see a bunch of equations and they get very intimidated.

12
00:00:58,740 --> 00:01:00,180
They think, what the heck is this?

13
00:01:00,180 --> 00:01:01,430
Just a bunch of equations.

14
00:01:01,440 --> 00:01:02,510
What do I do with this?

15
00:01:02,910 --> 00:01:09,330
So hopefully this lecture will take you from that lazy beginner's mindset to how a proper machine learning

16
00:01:09,330 --> 00:01:12,630
engineer would think about LSTMs and GRUs.

17
00:01:17,710 --> 00:01:22,070
First, we need to discuss why we have these fancy new recurrent units in the first place.

18
00:01:22,540 --> 00:01:26,620
Why is a simple RNN unit not good enough? To understand this,

19
00:01:26,620 --> 00:01:29,170
we have to go back to the vanishing gradient problem.

20
00:01:30,750 --> 00:01:36,840
As with ANNs, we can represent the output prediction, y hat of big T, as a function of the inputs

21
00:01:37,050 --> 00:01:39,300
x1, x2, all the way up to x of big T.

22
00:01:40,470 --> 00:01:44,460
Once we do this, we can see that it's just a really big composite function.

23
00:01:45,180 --> 00:01:51,930
Now, suppose we want to take the derivative with respect to W_xh. We can see that W_xh appears several

24
00:01:51,930 --> 00:01:53,220
times in this equation.

25
00:01:58,220 --> 00:02:02,980
Now, the actual derivatives themselves and how to find them are not what we have to consider here,

26
00:02:03,680 --> 00:02:07,070
but what is important is which derivatives are going to show up.

27
00:02:07,910 --> 00:02:15,380
Well, if W_xh appears multiplied by x of t, then we're going to have to find the derivative of W_xh

28
00:02:15,650 --> 00:02:24,050
times x of t at some point in our expression for the gradient with respect to W_xh. It should be clear that W_xh appears

29
00:02:24,050 --> 00:02:26,750
in front of all the x's for every time step.

30
00:02:27,380 --> 00:02:32,870
So ultimately all of these single derivatives are going to appear somewhere in our expression for the

31
00:02:32,870 --> 00:02:34,160
gradient of our cost

32
00:02:34,400 --> 00:02:36,290
with respect to W_xh.

33
00:02:41,350 --> 00:02:46,810
But another thing that's important to consider is how deeply nested each of these terms is with respect

34
00:02:46,810 --> 00:02:54,160
to y hat of big T. What is clear is that the term involving x1 is the most deeply nested.

35
00:02:54,490 --> 00:02:58,430
The term involving x2 is the second most deeply nested, and so on.

36
00:02:59,110 --> 00:03:04,750
Now, what do we remember about the derivative of composite functions from our theory of ANNs?

37
00:03:05,380 --> 00:03:09,760
Well, we know that composite functions turn into multiplication in the gradient.

38
00:03:10,240 --> 00:03:12,460
This is due to the chain rule of calculus.

39
00:03:13,120 --> 00:03:19,060
Therefore, the more deeply nested you are, the more terms you have to multiply by when you're finding

40
00:03:19,060 --> 00:03:19,640
the gradient.

41
00:03:20,350 --> 00:03:25,300
In other words, RNNs are particularly vulnerable to the vanishing gradient problem.

42
00:03:26,110 --> 00:03:32,320
The farther back in the sequence the input is, the more vulnerable it is to the effects of the vanishing

43
00:03:32,320 --> 00:03:32,860
gradient.

44
00:03:34,360 --> 00:03:40,180
This is why it's often said that RNNs have problems learning long-term dependencies. The RNN

45
00:03:40,330 --> 00:03:43,240
simply can't learn from inputs that are too far back.
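
To make the chain-rule argument concrete, here is a minimal NumPy sketch (not from the lecture). It assumes a simple tanh RNN with random weights, and it rescales the hidden-to-hidden weight so its largest singular value is 0.9; that rescaling, the sizes, and the function name are all illustrative assumptions. Each step back in time multiplies in one more Jacobian, so the gradient of the last hidden state with respect to earlier hidden states shrinks.

```python
import numpy as np

def backprop_jacobian_norms(T=30, M=8, D=4, seed=0):
    """Spectral norm of d h_T / d h_(T-k) for k = 1..T in a simple tanh RNN."""
    rng = np.random.default_rng(seed)
    W_xh = rng.normal(size=(D, M))
    W_hh = rng.normal(size=(M, M))
    # Assumption for illustration: force the largest singular value to 0.9.
    W_hh *= 0.9 / np.linalg.norm(W_hh, 2)
    b_h = np.zeros(M)
    xs = rng.normal(size=(T, D))

    # Forward pass, storing every hidden state.
    h = np.zeros(M)
    hs = []
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        hs.append(h)

    # Chain rule: one extra Jacobian factor per time step we go back.
    norms = []
    J = np.eye(M)
    for h_t in reversed(hs):
        # d h_t / d h_(t-1) = diag(1 - h_t^2) @ W_hh.T
        J = J @ (np.diag(1.0 - h_t ** 2) @ W_hh.T)
        norms.append(np.linalg.norm(J, 2))
    return norms

norms = backprop_jacobian_norms()
print("1 step back:  ", norms[0])
print("30 steps back:", norms[-1])
```

Under these assumptions every factor has norm at most 0.9, so the printed norm for 30 steps back is far smaller than the norm for 1 step back. With larger weights the product can instead blow up, which is the related exploding gradient problem.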

46
00:03:48,460 --> 00:03:55,030
So how does the vanishing gradient problem manifest in actual machine learning applications? Let's suppose

47
00:03:55,030 --> 00:03:59,450
we're doing natural language processing and we would like to extract some information about a document.

48
00:03:59,920 --> 00:04:06,010
So to give you a concrete example, suppose your RNN is reading the Wikipedia page about Albert Einstein.

49
00:04:06,340 --> 00:04:07,940
It's read the entire page.

50
00:04:08,410 --> 00:04:11,910
Now you ask your RNN: on what day was Albert Einstein born?

51
00:04:12,610 --> 00:04:18,610
Unfortunately, a simple RNN would not be able to answer this, even though this information appears

52
00:04:18,610 --> 00:04:19,310
in the article.

53
00:04:19,960 --> 00:04:22,700
The problem is it appeared at the beginning of the article.

54
00:04:22,870 --> 00:04:28,530
So by the end of the article, the simple RNN has simply forgotten what it had read earlier.

55
00:04:33,760 --> 00:04:39,130
Now, you might think, aha, I have a solution to this vanishing gradient problem: it is to use

56
00:04:39,140 --> 00:04:39,940
ReLUs.

57
00:04:41,690 --> 00:04:47,000
Unfortunately, with RNNs, things aren't so simple. Of course, you can always try to use ReLUs

58
00:04:47,000 --> 00:04:53,750
in your RNNs and verify experimentally what the results are. But in deep learning, we found

59
00:04:53,750 --> 00:04:59,300
that the most effective way of dealing with this problem is to use entirely different units, namely

60
00:04:59,300 --> 00:05:01,430
the LSTM and the GRU.

61
00:05:06,490 --> 00:05:12,440
One interesting historical fact is that the LSTM was created a very long time ago, in 1997.

62
00:05:13,630 --> 00:05:17,150
This is more than a decade before deep learning was even called deep learning.

63
00:05:17,800 --> 00:05:21,610
So this is why I always say to people: research interesting ideas.

64
00:05:21,940 --> 00:05:25,120
Don't just go through your career chasing what is popular.

65
00:05:25,750 --> 00:05:29,920
By doing that, you are like a hamster running on a spinning wheel, going nowhere.

66
00:05:30,580 --> 00:05:36,840
If you were only chasing popular mainstream ideas, you would never have come across this idea of LSTMs.

67
00:05:37,600 --> 00:05:41,680
It's more than a decade old, part of a field that nobody even cared about.

68
00:05:42,430 --> 00:05:44,620
Do what is interesting, not what is popular.

69
00:05:46,450 --> 00:05:54,100
Now, the LSTM is an extremely complex unit. The GRU was invented more recently, in 2014, but it uses

70
00:05:54,100 --> 00:05:55,470
a lot of the same ideas.

71
00:05:55,930 --> 00:06:01,510
So I always like to start with the GRU so that you can learn the principles in a simpler setting and

72
00:06:01,510 --> 00:06:03,130
then go back to the LSTM.

73
00:06:08,140 --> 00:06:09,920
So what is the gated recurrent unit?

74
00:06:10,840 --> 00:06:14,620
First, let's start with the basic principle from a simple RNN.

75
00:06:15,250 --> 00:06:21,970
We still want to have some hidden state at time t, h of t. This hidden state will still depend on x of

76
00:06:21,970 --> 00:06:26,590
t, the current input, and h of t minus one, the previous hidden state.

77
00:06:27,400 --> 00:06:30,310
The only thing that will change is how h of t is calculated.

78
00:06:35,410 --> 00:06:41,920
Now, there's a little bit of a debate on how to present the GRU. Is it more useful to look at the

79
00:06:41,920 --> 00:06:45,370
equations or is it more useful to look at diagrams?

80
00:06:45,910 --> 00:06:51,590
Now, if you are a beginner, you might automatically say pictures, pictures are the best for learning.

81
00:06:52,030 --> 00:06:55,030
Unfortunately for you, that's not what we are going to do.

82
00:06:55,900 --> 00:07:01,540
A lot of poorly written blogs that all copy from each other use these same diagrams as if they were

83
00:07:01,540 --> 00:07:02,610
self-explanatory.

84
00:07:02,860 --> 00:07:07,890
You look at this picture and all of a sudden, bam, you understand LSTMs and GRUs.

85
00:07:08,380 --> 00:07:10,540
Personally, I find this doesn't work for me.

86
00:07:11,050 --> 00:07:13,570
I know that other prominent lecturers feel the same way.

87
00:07:14,050 --> 00:07:19,360
Of course, if you find these pictures useful, you're welcome to look at them and try to make sense

88
00:07:19,360 --> 00:07:19,780
of them.

89
00:07:20,140 --> 00:07:26,470
But for me and many others I know, it's actually the equations which are most useful. That might seem

90
00:07:26,470 --> 00:07:32,080
backwards to a beginner who is very terrified of math, but hopefully you are past that point right

91
00:07:32,080 --> 00:07:32,440
now.

92
00:07:33,010 --> 00:07:36,190
I promise you that there's nothing here that we haven't seen before.

93
00:07:41,230 --> 00:07:42,680
OK, so enough chitchat.

94
00:07:42,700 --> 00:07:43,720
Where's the GRU?

95
00:07:44,410 --> 00:07:49,180
Well, here are the equations for the GRU. To sort of give you some perspective on this,

96
00:07:49,190 --> 00:07:54,670
I've also included the equation for a simple recurrent unit so that you can compare the complexity of

97
00:07:54,670 --> 00:07:55,150
the two.

98
00:07:55,900 --> 00:07:58,690
The simple recurrent unit is essentially just one line.

99
00:07:59,110 --> 00:08:06,730
h of t depends on x of t and h of t minus one. The GRU is three lines because we have to calculate

100
00:08:06,940 --> 00:08:08,160
three different things.

101
00:08:08,800 --> 00:08:12,610
First, we calculate z of t, which is called the update gate vector.

102
00:08:13,300 --> 00:08:17,240
Second, we calculate r of t, which is called the reset gate vector.

103
00:08:17,770 --> 00:08:23,920
And finally we calculate h of t, which is the hidden state vector, which is, as with the simple recurrent

104
00:08:23,930 --> 00:08:28,150
unit, what gets passed on to the next layer in the neural network.

105
00:08:28,630 --> 00:08:32,500
So let's analyze these equations so that we can make sense of the GRU.
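
The slide with the equations is not part of this transcript, so here is a reconstruction from the narration that follows. The weight names (W_xz, W_hz, and so on) and the generic activation f are assumptions chosen to match the spoken description; f is commonly tanh. The transposes follow from the weight shapes discussed next (D by M and M by M).

```latex
% Simple recurrent unit (one line)
h_t = f\left(W_{xh}^\top x_t + W_{hh}^\top h_{t-1} + b_h\right)

% GRU (three lines)
z_t = \sigma\left(W_{xz}^\top x_t + W_{hz}^\top h_{t-1} + b_z\right)
r_t = \sigma\left(W_{xr}^\top x_t + W_{hr}^\top h_{t-1} + b_r\right)
h_t = (1 - z_t) \odot h_{t-1}
      + z_t \odot f\left(W_{xh}^\top x_t + W_{hh}^\top (r_t \odot h_{t-1}) + b_h\right)
```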

106
00:08:37,660 --> 00:08:43,030
First, it's important to discuss shapes. You want some visualization of the objects we're dealing

107
00:08:43,030 --> 00:08:43,430
with?

108
00:08:43,990 --> 00:08:49,810
Well, it's helpful to know that all of these new vectors are just vectors of size M, the same size as

109
00:08:49,810 --> 00:08:52,070
h of t. As always,

110
00:08:52,120 --> 00:08:56,720
this is a hyperparameter, so you can choose how many features the hidden layer should have.

111
00:08:57,580 --> 00:09:00,190
This also implies the shape of all these weights.

112
00:09:00,820 --> 00:09:07,000
If a weight is going from x of t to one of the gate vectors, then it must be of size D by M.

113
00:09:07,690 --> 00:09:13,780
If a weight is going from h of t minus one to one of the gate vectors, then it must be of size M by M.

114
00:09:14,470 --> 00:09:20,440
And of course, since all the vectors are of size M, then all of the bias terms must also be of size

115
00:09:20,440 --> 00:09:20,920
M.
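
As a quick check of these shapes, here is a small NumPy sketch. The helper name `init_gru_params`, the parameter names, and the sizes D=3 and M=5 are hypothetical, chosen only for illustration.

```python
import numpy as np

def init_gru_params(D, M, seed=0):
    """Create one input weight, one hidden weight and one bias per GRU line."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("z", "r", "h"):  # update gate, reset gate, hidden state
        params[f"W_x{name}"] = rng.normal(scale=0.1, size=(D, M))  # from x(t): D by M
        params[f"W_h{name}"] = rng.normal(scale=0.1, size=(M, M))  # from h(t-1): M by M
        params[f"b_{name}"] = np.zeros(M)                          # bias: size M
    return params

params = init_gru_params(D=3, M=5)
for name, value in params.items():
    print(name, value.shape)
```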

116
00:09:26,020 --> 00:09:28,960
Next, let's consider what the GRU is trying to do.

117
00:09:29,530 --> 00:09:31,330
I think the names are very helpful here.

118
00:09:32,090 --> 00:09:36,640
z of t is the update gate vector and r of t is the reset gate vector.

119
00:09:37,330 --> 00:09:42,650
By the way, if you didn't know, the circle with a dot in it means element-wise multiplication.

120
00:09:43,300 --> 00:09:50,040
So one minus z is an M-sized vector and h of t minus one is also an M-sized vector.

121
00:09:50,560 --> 00:09:54,350
When you multiply these together, you are doing element-wise multiplication.

122
00:09:55,240 --> 00:09:58,210
So let's start by looking at z of t. How is it used?

123
00:09:58,960 --> 00:10:06,040
We can see that z of t gets multiplied by one term involving x of t and h of t minus one. Then

124
00:10:06,040 --> 00:10:10,700
one minus z gets multiplied by the previous hidden state, h of t minus one.

125
00:10:11,440 --> 00:10:12,490
This is very helpful.

126
00:10:13,120 --> 00:10:16,330
You can think of z of t as telling us what we should do.

127
00:10:16,630 --> 00:10:21,250
Should we take this new value for the hidden state, or should we just remember the previous hidden

128
00:10:21,250 --> 00:10:23,090
state, h of t minus one?

129
00:10:23,950 --> 00:10:24,970
Why is that helpful?

130
00:10:26,230 --> 00:10:28,840
Well, remember what we learned about the vanishing gradient.

131
00:10:29,350 --> 00:10:33,610
If we use a simple RNN, it's prone to forgetting things that it's seen in the past.

132
00:10:34,360 --> 00:10:40,180
By using this gate, it explicitly allows us to remember the previous hidden state so that the hidden

133
00:10:40,190 --> 00:10:43,110
state can be carried forward to the next hidden state.

134
00:10:44,230 --> 00:10:49,720
By the way, since z of t is the output of a sigmoid, its values are always between zero and one.

135
00:10:50,380 --> 00:10:56,440
Therefore, for h of t, we are always taking a weighted sum of the previous h of t minus one and

136
00:10:56,440 --> 00:10:57,340
this other function.

137
00:10:58,030 --> 00:11:03,310
So if z of t is close to zero, then we'll remember the old value, h of t minus one.

138
00:11:03,940 --> 00:11:09,720
But if it is close to one, then we'll take the new value and forget the old h of t minus one.
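
The two extremes just described can be checked with a tiny NumPy sketch. The function name `mix` and the example vectors are made up for illustration; `h_new` stands in for "this other function" of x of t and h of t minus one.

```python
import numpy as np

def mix(z, h_prev, h_new):
    """Element-wise weighted sum: (1 - z) * old value + z * new value."""
    return (1.0 - z) * h_prev + z * h_new

h_prev = np.array([1.0, -2.0, 0.5])   # previous hidden state h(t-1)
h_new = np.array([0.0, 3.0, -1.0])    # stand-in for the new candidate value

keep_old = mix(np.zeros(3), h_prev, h_new)   # z = 0 everywhere: remember h(t-1)
take_new = mix(np.ones(3), h_prev, h_new)    # z = 1 everywhere: take the new value
per_element = mix(np.array([0.0, 1.0, 0.25]), h_prev, h_new)  # decided per feature

print(keep_old, take_new, per_element)
```

Because z of t is a vector, the choice is made separately for each of the M features, which is what the last line illustrates.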

139
00:11:10,630 --> 00:11:15,810
By the way, you can think of this other function as analogous to what the simple RNN is doing.

140
00:11:16,510 --> 00:11:21,310
This is very important, but there's one more piece of the puzzle before we can move on to a helpful

141
00:11:21,310 --> 00:11:23,020
interpretation of what's going on.

142
00:11:28,100 --> 00:11:33,050
I want you to look very carefully at this equation for z of t. What does this look like to you?

143
00:11:33,860 --> 00:11:37,340
Well, it's a sigmoid of some dot products, plus a bias term.

144
00:11:38,000 --> 00:11:39,140
What does that look like?

145
00:11:39,770 --> 00:11:42,950
Well, you might recognize this as exactly like a dense layer.

146
00:11:43,880 --> 00:11:46,360
Now, you might object saying that there are two weights.

147
00:11:46,640 --> 00:11:48,460
So it's not exactly like a dense layer.

148
00:11:49,310 --> 00:11:55,160
But in fact, the way that these units are often implemented, what we do is we concatenate X of T and

149
00:11:55,160 --> 00:11:57,720
H of T minus one into a single vector.

150
00:11:57,770 --> 00:11:59,240
Let's just call it v for short.

151
00:12:00,020 --> 00:12:05,660
If we concatenate those inputs and we concatenate the weights, then we just get back to our regular

152
00:12:05,660 --> 00:12:11,000
old dense layer where we have one input multiplied by one weight matrix plus a bias term.
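
Here is a minimal NumPy sketch of that equivalence. The sizes and random values are arbitrary, and the weight names are assumptions; the point is only that the two-weight form and the concatenated dense-layer form give the same z of t.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, M = 3, 4
x_t = rng.normal(size=D)          # current input
h_prev = rng.normal(size=M)       # previous hidden state
W_xz = rng.normal(size=(D, M))    # input-to-gate weight
W_hz = rng.normal(size=(M, M))    # hidden-to-gate weight
b_z = rng.normal(size=M)

# Two separate weights, as in the GRU equation for the update gate.
z_two_weights = sigmoid(x_t @ W_xz + h_prev @ W_hz + b_z)

# One concatenated input v and one stacked weight matrix: a regular dense layer.
v = np.concatenate([x_t, h_prev])   # size D + M
W = np.vstack([W_xz, W_hz])         # shape (D + M, M)
z_dense = sigmoid(v @ W + b_z)

print(np.allclose(z_two_weights, z_dense))
```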

153
00:12:11,750 --> 00:12:13,050
In other words, what is this?

154
00:12:13,430 --> 00:12:14,670
This is just a neuron.

155
00:12:15,380 --> 00:12:17,900
Now notice that it also ends with a sigmoid.

156
00:12:18,470 --> 00:12:21,330
Importantly, this sigmoid is not a hyperparameter.

157
00:12:21,950 --> 00:12:27,130
This is always a sigmoid because the output must always be a number between zero and one.

158
00:12:27,590 --> 00:12:32,210
That's how we can take z of t and one minus z of t later on in the GRU.

159
00:12:33,290 --> 00:12:34,910
So what is this neuron predicting?

160
00:12:35,660 --> 00:12:41,150
It's like a binary classifier telling us essentially the probability that we should take the new value

161
00:12:41,150 --> 00:12:45,140
for h of t or if we should keep the old value, h of t minus one.

162
00:12:50,250 --> 00:12:56,680
So if we ignore the reset gate for now, here's how we can interpret the GRU. First, remember that h of t

163
00:12:57,360 --> 00:13:04,640
is essentially the weighted sum of two things: h of t minus one and the output of a simple RNN. z of

164
00:13:04,680 --> 00:13:11,010
t, the update gate vector, is like an output probability, telling us the probability that we should keep

165
00:13:11,010 --> 00:13:12,620
the output of the simple RNN.

166
00:13:13,440 --> 00:13:20,610
In other words, keep h of t minus one with probability one minus z of t and keep the simple RNN

167
00:13:20,610 --> 00:13:23,160
output with probability z of t.

168
00:13:24,090 --> 00:13:29,910
The final h of t is then just a mixture of these two, allowing us to keep the most important parts

169
00:13:30,030 --> 00:13:32,810
of h of t minus one and discard the rest.

170
00:13:33,980 --> 00:13:39,510
Now, I want to make one small clarification. Please note that this is really for intuition only.

171
00:13:40,100 --> 00:13:45,340
These are not really probabilities of keeping and discarding because we never actually keep or discard.

172
00:13:45,830 --> 00:13:50,120
Instead, what we actually do is take a mixture of the two components.

173
00:13:50,630 --> 00:13:55,640
For instance, twenty five percent of component one and seventy five percent of component two.

174
00:13:56,060 --> 00:13:57,890
But we don't actually keep or discard.

175
00:14:02,540 --> 00:14:05,120
In actuality, the word gate is very helpful.

176
00:14:05,720 --> 00:14:07,970
Imagine two openings side by side.

177
00:14:08,960 --> 00:14:14,600
Now, imagine a gate which is the same size as each opening, but can only slide between the two openings.

178
00:14:15,170 --> 00:14:21,480
Thus it can only fully cover one of the openings at once, or it can simply cover a part of each.

179
00:14:22,040 --> 00:14:27,060
But notice that in total, it always covers an area equal in size to one full opening.

180
00:14:27,740 --> 00:14:33,020
In other words, if one opening is twenty five percent clear, the other opening will be seventy five

181
00:14:33,020 --> 00:14:36,310
percent clear and they always sum up to one hundred percent.

182
00:14:37,130 --> 00:14:42,140
So picture this like an ice cream maker that can give you some mixture of chocolate ice cream and vanilla

183
00:14:42,140 --> 00:14:42,820
ice cream.

184
00:14:43,400 --> 00:14:49,220
Maybe you like chocolate ice cream more so you choose 90 percent chocolate and 10 percent vanilla sliding

185
00:14:49,220 --> 00:14:51,950
the gate to the appropriate point to achieve this.

186
00:14:52,760 --> 00:14:55,920
Note that in practice, this is called a convex combination.

187
00:14:56,330 --> 00:14:59,570
You can also think of it like a soft mixture, a soft selection.

188
00:15:04,190 --> 00:15:09,230
Let's now turn our attention to the other gate vector, the reset gate vector. How does this come

189
00:15:09,230 --> 00:15:09,740
into play?

190
00:15:10,700 --> 00:15:13,070
Well, first, let's consider how this is calculated.

191
00:15:14,410 --> 00:15:19,900
As you can see, it's just another neuron, so that should give you some comfort. There is really nothing

192
00:15:19,900 --> 00:15:20,510
new here.

193
00:15:20,530 --> 00:15:26,530
As I promised, we're just making use of neurons, which we've already learned about, just neurons

194
00:15:26,530 --> 00:15:28,060
connected to more neurons.

195
00:15:28,810 --> 00:15:32,680
Now it's called the reset gate vector, which is a good hint about what it's for.

196
00:15:33,490 --> 00:15:40,030
Remember that we called the second part of the calculation a simple RNN, but this wasn't entirely

197
00:15:40,030 --> 00:15:40,390
true.

198
00:15:41,230 --> 00:15:44,250
This is because this is where the reset gate appears.

199
00:15:45,160 --> 00:15:51,220
As you can see, the reset gate is used by doing an element-wise multiplication with h of t minus

200
00:15:51,220 --> 00:15:51,550
one.

201
00:15:52,570 --> 00:15:57,800
Importantly, remember that the values of r of t are always between zero and one.

202
00:15:58,540 --> 00:16:03,040
What happens if we multiply h of t minus one by a value close to zero?

203
00:16:03,610 --> 00:16:05,230
Well, it just gets closer to zero.

204
00:16:05,860 --> 00:16:07,330
What if we multiply by one?

205
00:16:07,750 --> 00:16:09,760
Then it just stays the same as it was before.

206
00:16:10,420 --> 00:16:17,350
In other words, before multiplying by the hidden-to-hidden weight W_hh, we decide which parts of h

207
00:16:17,350 --> 00:16:21,340
of t minus one we want to remember and which parts we want to forget.

208
00:16:22,690 --> 00:16:27,430
So we forget them simply by resetting them back to zero, hence the reset gate.

209
00:16:28,540 --> 00:16:32,670
Note that this performs a very similar function to the update gate, z of t.

210
00:16:32,680 --> 00:16:34,140
It's just in a different place.

211
00:16:34,630 --> 00:16:40,390
So it's just reinforcing our ability to remember and forget different parts of h of t minus one.
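
Putting the three lines together, a single GRU step might be sketched in NumPy as follows. This is an illustrative sketch based on the narration, not the lecture's own code: the parameter names, the choice of tanh for the activation, and the sizes are all assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p, f=np.tanh):
    """One GRU step: update gate, reset gate, then the mixed hidden state."""
    z = sigmoid(x_t @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])  # update gate
    r = sigmoid(x_t @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])  # reset gate
    # The reset gate scales h(t-1) element-wise before the hidden-to-hidden weight.
    h_candidate = f(x_t @ p["W_xh"] + (r * h_prev) @ p["W_hh"] + p["b_h"])
    # Convex combination of the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_candidate

rng = np.random.default_rng(0)
D, M, T = 3, 5, 10
params = {}
for name in ("z", "r", "h"):
    params[f"W_x{name}"] = rng.normal(scale=0.1, size=(D, M))
    params[f"W_h{name}"] = rng.normal(scale=0.1, size=(M, M))
    params[f"b_{name}"] = np.zeros(M)

xs = rng.normal(size=(T, D))   # a made-up input sequence
h = np.zeros(M)                # initial hidden state
for x_t in xs:
    h = gru_step(x_t, h, params)
print(h.shape)
```

Note that the function takes x of t and h of t minus one and returns h of t, the same interface as a simple recurrent unit; only the calculation inside has changed.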

212
00:16:45,370 --> 00:16:50,350
OK, so to summarize what we've learned so far, what is the GRU and why is it useful?

213
00:16:50,960 --> 00:16:56,140
A GRU, first and foremost, has the same API as the simple recurrent unit.

214
00:16:56,770 --> 00:17:01,840
The output is still h of t and it still depends on h of t minus one and x of t.

215
00:17:02,650 --> 00:17:08,740
The difference between this and the simple recurrent unit is it has functionality for remembering and

216
00:17:08,740 --> 00:17:11,450
forgetting what was in h of t minus one.

217
00:17:12,460 --> 00:17:17,860
This solves a problem we had with simple RNNs where, due to the vanishing gradient problem, they would

218
00:17:17,860 --> 00:17:20,680
forget things that they saw earlier in a sequence.

219
00:17:21,400 --> 00:17:27,340
We accomplish this by adding little binary classifiers in the form of logistic regression neurones that

220
00:17:27,340 --> 00:17:30,490
output a prediction on whether to remember or forget.