1
00:00:11,610 --> 00:00:15,480
In this lecture, we are going to start discussing modern RNN units.

2
00:00:15,900 --> 00:00:19,080
You may have heard of these already, so I'll just tell you right now what they are.

3
00:00:19,200 --> 00:00:27,210
The LSTM and the GRU. LSTM stands for long short-term memory and GRU stands for gated recurrent

4
00:00:27,210 --> 00:00:27,600
unit.

5
00:00:28,410 --> 00:00:34,290
They are both essentially the same thing, with the GRU being a more simplified and efficient version

6
00:00:34,290 --> 00:00:35,260
of the LSTM.

7
00:00:36,120 --> 00:00:42,510
This lecture will discuss why the LSTM and GRU are needed in the first place when the simple RNN

8
00:00:42,690 --> 00:00:43,980
seems pretty good already.

9
00:00:44,550 --> 00:00:50,940
We'll also look at the underlying equations for the LSTM and GRU and, importantly, a convenient perspective

10
00:00:50,940 --> 00:00:51,870
for looking at them.

11
00:00:52,680 --> 00:00:58,410
One mistake a lot of beginners make is that they see a bunch of equations and they get very intimidated.

12
00:00:58,740 --> 00:01:01,400
They think, what the heck is this? Just a bunch of equations?

13
00:01:01,410 --> 00:01:02,490
What do I do with this?

14
00:01:02,910 --> 00:01:09,300
So hopefully this lecture will take you from that lazy beginner's mindset to how a proper machine learning

15
00:01:09,300 --> 00:01:12,600
engineer would think about LSTMs and GRUs.

16
00:01:17,710 --> 00:01:22,060
First, we need to discuss why we have these fancy new recurrent units in the first place.

17
00:01:22,540 --> 00:01:26,620
Why is the simple RNN unit not good enough? To understand this,

18
00:01:26,620 --> 00:01:29,170
we have to go back to the vanishing gradient problem.

19
00:01:30,780 --> 00:01:36,810
As with our CNN, we can represent the output prediction y hat of Big T as a function of the inputs

20
00:01:37,050 --> 00:01:39,300
x one, x two, all the way up to x of big T.

21
00:01:40,440 --> 00:01:44,430
Once we do this, we can see that it's just a really big composite function.

22
00:01:45,180 --> 00:01:51,930
Now suppose we want to take the derivative with respect to W x h. We can see that W x h appears several

23
00:01:51,930 --> 00:01:53,190
times in this equation.

24
00:01:58,220 --> 00:02:02,960
Now, the actual derivatives themselves and how to find them are not what we have to consider here.

25
00:02:03,680 --> 00:02:07,040
But what is important is which derivatives are going to show up.

26
00:02:07,880 --> 00:02:14,750
Well, if w x h appears multiplied by X of T, then we're going to have to find the derivative of W

27
00:02:14,750 --> 00:02:20,750
x h times x of T at some point in our expression for the gradient of W x h.

28
00:02:21,380 --> 00:02:26,690
It should be clear that W x h appears in front of all the x's for every time step.

29
00:02:27,380 --> 00:02:32,780
So ultimately, all of these single derivatives are going to appear somewhere in our expression for

30
00:02:32,780 --> 00:02:36,260
the gradient of our cost with respect to W x h.

31
00:02:41,380 --> 00:02:46,810
But another thing that's important to consider is how deeply nested each of these terms is with respect

32
00:02:46,810 --> 00:02:48,670
to y hat of big T.

33
00:02:49,570 --> 00:02:54,130
What is clear is that the term involving X1 is the most deeply nested.

34
00:02:54,520 --> 00:02:58,420
The term involving X2 is the second most deeply nested and so on.

35
00:02:59,080 --> 00:03:02,770
Now what do we remember about the derivative of composite functions?

36
00:03:03,040 --> 00:03:04,720
From our theory of ANNs?

37
00:03:05,380 --> 00:03:09,730
Well, we know that composite functions turn into multiplication in the gradient.

38
00:03:10,240 --> 00:03:12,430
This is due to the chain rule of calculus.

39
00:03:13,120 --> 00:03:19,060
Therefore, the more deeply nested you are, the more terms you have to multiply by when you're finding

40
00:03:19,060 --> 00:03:19,630
the gradient.

41
00:03:20,350 --> 00:03:25,270
In other words, RNNs are particularly vulnerable to the vanishing gradient problem.

42
00:03:26,080 --> 00:03:32,290
The farther back in the sequence the input is, the more vulnerable it is to the effects of the vanishing

43
00:03:32,290 --> 00:03:32,830
gradient.

44
00:03:34,330 --> 00:03:38,890
This is why it's often said that RNNs have problems learning long-term dependencies.

45
00:03:39,370 --> 00:03:43,210
The RNN simply can't learn from inputs that are too far back.
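
To make the "more nesting means more multiplication" point concrete, here is a minimal sketch, assuming a scalar simple RNN with a tanh activation (the weights 0.9 and 1.0 and the random inputs are made up for illustration, and the function name is hypothetical). It tracks the derivative of the last hidden state with respect to the first one, which is the chain of factors that the term involving x one has to pass through.

```python
import numpy as np

def grad_wrt_first_hidden(w_hh, w_xh, xs):
    """d h_T / d h_1 for a scalar RNN h_t = tanh(w_hh * h_{t-1} + w_xh * x_t)."""
    h = 0.0
    grad = 1.0
    for t, x in enumerate(xs):
        h = np.tanh(w_hh * h + w_xh * x)
        if t > 0:
            # Chain rule: every extra time step multiplies in d h_t / d h_{t-1}.
            grad *= w_hh * (1.0 - h ** 2)
    return grad

rng = np.random.default_rng(0)
xs = rng.normal(size=50)

short = abs(grad_wrt_first_hidden(0.9, 1.0, xs[:5]))  # 5-step sequence
long = abs(grad_wrt_first_hidden(0.9, 1.0, xs))       # 50-step sequence
print(short, long)
```

Each factor here has magnitude at most 0.9, so the 50-step product is far smaller than the 5-step one. With a weight larger than one the same product can instead blow up, which is the exploding gradient side of the same issue.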

46
00:03:48,460 --> 00:03:53,770
So how does the vanishing gradient problem manifest in actual machine learning applications?

47
00:03:54,460 --> 00:03:58,600
Let's suppose we're doing natural language processing, and we would like to extract some information

48
00:03:58,600 --> 00:03:59,440
about a document.

49
00:03:59,920 --> 00:04:05,980
So to give you a concrete example, suppose our A.I. is reading the Wikipedia page about Albert Einstein.

50
00:04:06,370 --> 00:04:07,930
It's read the entire page.

51
00:04:08,440 --> 00:04:11,890
Now you ask your A.I., on what day was Albert Einstein born?

52
00:04:12,640 --> 00:04:18,579
Unfortunately, a simple RNN would not be able to answer this, even though this information appears

53
00:04:18,579 --> 00:04:19,300
in the article.

54
00:04:19,959 --> 00:04:22,750
The problem is it appeared at the beginning of the article.

55
00:04:22,870 --> 00:04:28,480
So by the end of the article, the simple RNN has simply forgotten what it had read earlier.

56
00:04:33,790 --> 00:04:38,110
Now you might think, aha, I have a solution to this vanishing gradient problem.

57
00:04:38,200 --> 00:04:39,910
It is to use ReLUs.

58
00:04:41,690 --> 00:04:44,660
Unfortunately, with RNNs, things aren't so simple.

59
00:04:45,230 --> 00:04:50,780
Of course, you can always try to use ReLUs in your RNNs and verify experimentally what the

60
00:04:50,780 --> 00:04:51,500
results are.

61
00:04:52,100 --> 00:04:57,980
But in deep learning, we found that the most effective way of dealing with this problem is to use entirely

62
00:04:57,980 --> 00:05:01,400
different units, namely the LSTM and the GRU.

63
00:05:06,490 --> 00:05:12,490
One interesting historical fact is that the LSTM was created a very long time ago, in 1997.

64
00:05:13,600 --> 00:05:17,140
This is more than a decade before deep learning was even called deep learning.

65
00:05:17,800 --> 00:05:21,580
So this is why I always say to people: research interesting ideas.

66
00:05:21,910 --> 00:05:25,090
Don't just go through your career chasing what is popular.

67
00:05:25,750 --> 00:05:29,860
By doing that, you are like a hamster running on a spinning wheel, going nowhere.

68
00:05:30,610 --> 00:05:36,610
If you're only chasing popular mainstream ideas, you would have never come across this idea of the LSTM.

69
00:05:37,600 --> 00:05:41,650
It's more than a decade old, part of a field that nobody even cares about.

70
00:05:42,430 --> 00:05:44,620
Do what is interesting, not what is popular.

71
00:05:46,450 --> 00:05:48,970
Now, the LSTM is an extremely complex unit.

72
00:05:49,570 --> 00:05:55,450
The GRU was invented more recently, in 2014, but it uses a lot of the same ideas.

73
00:05:55,930 --> 00:06:01,480
So I always like to start with the GRU so that you can learn the principles in a simpler setting and

74
00:06:01,480 --> 00:06:03,130
then go back to the LSTM.

75
00:06:08,170 --> 00:06:09,910
So what is the gated recurrent unit?

76
00:06:10,840 --> 00:06:14,400
First, let's start with the basic principle from a simple RNN.

77
00:06:15,220 --> 00:06:19,000
We still want to have some hidden state at time t h of T.

78
00:06:19,810 --> 00:06:26,110
This hidden state will still depend on x of T, the current input, and h of T minus one, the previous hidden

79
00:06:26,110 --> 00:06:26,560
state.

80
00:06:27,370 --> 00:06:30,280
The only thing that will change is how h of T is calculated.

81
00:06:35,410 --> 00:06:39,880
Now, there's a little bit of a debate on how to present the GRU and LSTM.

82
00:06:40,450 --> 00:06:42,820
Is it more useful to look at the equations?

83
00:06:43,090 --> 00:06:45,340
Or is it more useful to look at diagrams?

84
00:06:45,880 --> 00:06:51,580
Now, if you are a beginner, you might automatically say, pictures! Pictures are the best for learning.

85
00:06:52,060 --> 00:06:58,630
Unfortunately for you, that's not what we are going to do. A lot of poorly written blogs that all copy

86
00:06:58,630 --> 00:06:59,290
from each other

87
00:06:59,560 --> 00:07:02,620
use these same diagrams as if they were self-explanatory.

88
00:07:02,860 --> 00:07:07,870
You look at this picture and all of a sudden, bam, you understand LSTMs and GRUs.

89
00:07:08,380 --> 00:07:10,510
Personally, I find this doesn't work for me.

90
00:07:11,050 --> 00:07:13,540
I know that other prominent lecturers feel the same way.

91
00:07:14,080 --> 00:07:19,330
Of course, if you find these pictures useful, you're welcome to look at them and try to make sense

92
00:07:19,330 --> 00:07:19,750
of them.

93
00:07:20,140 --> 00:07:25,360
But for me and many others I know, it's actually the equations which are most useful.

94
00:07:25,870 --> 00:07:31,600
That might seem backwards to a beginner who is very terrified of math, but hopefully you are past that

95
00:07:31,600 --> 00:07:32,440
point right now.

96
00:07:33,040 --> 00:07:36,150
I promise you that there's nothing here that we haven't seen before.

97
00:07:41,230 --> 00:07:42,670
OK, so enough chitchat.

98
00:07:42,700 --> 00:07:43,720
Where's the GRU?

99
00:07:44,380 --> 00:07:49,140
Well, here are the equations for the GRU. To sort of give you some perspective on this,

100
00:07:49,150 --> 00:07:54,550
I've also included the equation for a simple recurrent unit so that you can compare the complexity

101
00:07:54,550 --> 00:07:55,150
of the two.

102
00:07:55,900 --> 00:07:58,660
The simple recurrent unit is essentially just one line.

103
00:07:59,110 --> 00:08:02,710
h of T depends on x of T and h of T minus one.

104
00:08:03,520 --> 00:08:08,140
The GRU is three lines because we have to calculate three different things.

105
00:08:08,770 --> 00:08:12,610
First, we calculate Z of T, which is called the update gate vector.

106
00:08:13,300 --> 00:08:17,230
Second, we calculate R of T, which is called the reset gate vector.

107
00:08:17,800 --> 00:08:23,500
And finally, we calculate h of T, which is the hidden state vector, which is, as with the simple

108
00:08:23,500 --> 00:08:27,940
recurrent unit, what gets passed on to the next layer in the neural network.

109
00:08:28,630 --> 00:08:32,500
So let's analyze these equations so that we can make sense of the GRU.
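
The slide with the equations is not part of this transcript, so here is a reconstruction based on how the lecture describes them. Treat it as a sketch under assumptions: the weight names (W with subscripts such as xz and hz), the transposes, and the tanh activation in the last line follow one common GRU formulation and may not match the slide symbol for symbol.

```latex
\begin{aligned}
\text{Simple RNN:}\quad
h_t &= \tanh\!\left(W_{xh}^{\top} x_t + W_{hh}^{\top} h_{t-1} + b_h\right) \\[6pt]
\text{GRU:}\quad
z_t &= \sigma\!\left(W_{xz}^{\top} x_t + W_{hz}^{\top} h_{t-1} + b_z\right) \\
r_t &= \sigma\!\left(W_{xr}^{\top} x_t + W_{hr}^{\top} h_{t-1} + b_r\right) \\
h_t &= (1 - z_t) \odot h_{t-1}
      + z_t \odot \tanh\!\left(W_{xh}^{\top} x_t + W_{hh}^{\top}\,(r_t \odot h_{t-1}) + b_h\right)
\end{aligned}
```

Here sigma is the sigmoid and the circled dot is element-wise multiplication.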

110
00:08:37,630 --> 00:08:43,390
First, it's important to discuss shapes. You want some visualization of the objects we're dealing with?

111
00:08:43,960 --> 00:08:49,570
Well, it's helpful to know that all of these new vectors are just vectors of size M, the same size

112
00:08:49,570 --> 00:08:50,380
as h of T.

113
00:08:51,220 --> 00:08:56,230
As always, this is a hyperparameter, so you can choose how many features the hidden layer should

114
00:08:56,230 --> 00:08:56,680
have.

115
00:08:57,580 --> 00:09:00,160
This also implies the shape of all these weights.

116
00:09:00,850 --> 00:09:06,910
If a weight is going from X of T to one of the gate vectors, then it must be of size D by M.

117
00:09:07,720 --> 00:09:13,420
If a weight is going from H of T minus one to one of the gate vectors, then it must be of size m by

118
00:09:13,420 --> 00:09:13,780
m.

119
00:09:14,500 --> 00:09:20,410
And of course, since all the vectors are of size M, then all of the bias terms must also be of size

120
00:09:20,410 --> 00:09:20,920
m.
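
As a quick check of these shapes, here is a small NumPy sketch. The sizes D = 4 and M = 3 and the names W_xz, W_hz and b_z are hypothetical, chosen only for illustration; the inputs are treated as row vectors so that a D by M weight can be applied directly.

```python
import numpy as np

D, M = 4, 3  # hypothetical input size D and hidden size M

rng = np.random.default_rng(0)
W_xz = rng.normal(size=(D, M))  # weight from x of T to a gate vector: D by M
W_hz = rng.normal(size=(M, M))  # weight from h of T minus one to a gate vector: M by M
b_z = np.zeros(M)               # bias term: size M

x_t = rng.normal(size=D)
h_prev = rng.normal(size=M)

# Both products land in a vector of size M, so they can be added to the bias.
pre_activation = x_t @ W_xz + h_prev @ W_hz + b_z
print(pre_activation.shape)  # (3,)
```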

121
00:09:26,020 --> 00:09:28,960
Next, let's consider what the GRU is trying to do.

122
00:09:29,560 --> 00:09:31,330
I think the names are very helpful here.

123
00:09:32,080 --> 00:09:36,640
z of T is the update gate vector and r of T is the reset gate vector.

124
00:09:37,330 --> 00:09:42,640
By the way, if you didn't know, the circle with a dot in it means element-wise multiplication.

125
00:09:43,300 --> 00:09:50,020
So one minus z of T is an m size vector and H of T minus one is also an m size vector.

126
00:09:50,560 --> 00:09:54,310
When you multiply these together, you are doing element wise multiplication.

127
00:09:55,240 --> 00:09:56,740
So let's start by looking at z of T.

128
00:09:57,250 --> 00:09:58,180
How is it used?

129
00:09:58,960 --> 00:10:04,870
We can see that z of T gets multiplied by one term involving x of T and h of T minus one.

130
00:10:05,530 --> 00:10:10,690
Then one minus z of T gets multiplied by the previous hidden state h of T minus one.

131
00:10:11,410 --> 00:10:12,490
This is very helpful.

132
00:10:13,120 --> 00:10:16,360
You can think of z of T as telling us, what should we do?

133
00:10:16,600 --> 00:10:18,850
Should we take this new value for the hidden state?

134
00:10:19,150 --> 00:10:23,080
Or should we just remember the previous hidden state h of T minus one?

135
00:10:23,920 --> 00:10:24,940
Why is that helpful?

136
00:10:26,200 --> 00:10:28,780
Well, remember what we learned about the vanishing gradient.

137
00:10:29,380 --> 00:10:33,580
If we use a simple RNN, it's prone to forgetting things that it's seen in the past.

138
00:10:34,360 --> 00:10:40,060
By using this gate, it explicitly allows us to remember the previous hidden state so that the hidden

139
00:10:40,060 --> 00:10:43,090
state can be carried forward to the next hidden state.

140
00:10:44,230 --> 00:10:49,720
By the way, since z of T is the output of a sigmoid, its values are always between zero and one.

141
00:10:50,350 --> 00:10:55,930
Therefore, for h of T, we are always taking a weighted sum of the previous h of T minus one

142
00:10:56,200 --> 00:10:57,310
and this other function.

143
00:10:58,030 --> 00:11:03,280
So if z of T is close to zero, then we'll remember the old value h of T minus one.

144
00:11:03,940 --> 00:11:09,700
But if Z of T is close to one, then we'll take the new value and forget the old H of T minus one.

145
00:11:10,600 --> 00:11:15,790
By the way, you can think of this other function as analogous to what the simple RNN is doing.

146
00:11:16,480 --> 00:11:21,280
This is very important, but there's one more piece of the puzzle before we can move on to a helpful

147
00:11:21,280 --> 00:11:22,990
interpretation of what's going on.

148
00:11:28,130 --> 00:11:33,050
I want you to look very carefully at this equation for Z of T, what does this look like to you?

149
00:11:33,860 --> 00:11:37,340
Well, it's a sigmoid of some dot products, plus a bias term.

150
00:11:38,000 --> 00:11:39,110
What does that look like?

151
00:11:39,740 --> 00:11:42,980
Well, you might recognize this as exactly like a dense layer.

152
00:11:43,910 --> 00:11:48,440
Now you might object saying that there are two weights, so it's not exactly like a dense layer.

153
00:11:49,310 --> 00:11:55,130
But in fact, the way that these units are often implemented, what we do is we concatenate X of T and

154
00:11:55,130 --> 00:11:57,710
H of T minus one into a single vector.

155
00:11:57,740 --> 00:11:59,210
Let's just call it V for short.

156
00:12:00,050 --> 00:12:05,630
If we concatenate those inputs and we concatenate the weights, then we just get back to our regular

157
00:12:05,630 --> 00:12:10,970
old dense layer where we have one input multiplied by one weight matrix plus a bias term.

158
00:12:11,750 --> 00:12:13,040
In other words, what is this?

159
00:12:13,400 --> 00:12:14,660
This is just a neuron.

160
00:12:15,350 --> 00:12:17,870
Now, notice that it also ends with a sigmoid.

161
00:12:18,500 --> 00:12:21,320
Importantly, this sigmoid is not a hyperparameter.

162
00:12:21,920 --> 00:12:27,110
This is always a sigmoid because its output is always a number between zero and one.

163
00:12:27,590 --> 00:12:32,180
That's how we can take z of T and one minus z of T later on in the GRU.

164
00:12:33,290 --> 00:12:34,910
So what is this neuron predicting?

165
00:12:35,660 --> 00:12:41,120
It's like a binary classifier telling us essentially the probability that we should take the new value

166
00:12:41,120 --> 00:12:45,110
for h of T, or if we should keep the old value of T minus one.

167
00:12:50,250 --> 00:12:54,570
So if we ignore the reset gate for now, here's how we can interpret the GRU.

168
00:12:55,350 --> 00:13:01,500
First, remember that h of T is essentially the weighted sum of two things: h of T minus one and

169
00:13:01,500 --> 00:13:05,100
the output of a simple RNN. z of T,

170
00:13:05,130 --> 00:13:10,980
the update gate vector, is like an output probability, telling us the probability that we should keep

171
00:13:10,980 --> 00:13:12,660
the output of the simple RNN.

172
00:13:13,470 --> 00:13:20,460
In other words, keep h of T minus one with probability one minus z of T and keep the simple RNN

173
00:13:20,460 --> 00:13:23,160
output with probability z of T.

174
00:13:24,060 --> 00:13:29,880
The final H of T is then just a mixture of these two, allowing us to keep the most important parts

175
00:13:30,060 --> 00:13:32,760
of h of T minus one and discard the rest.

176
00:13:33,980 --> 00:13:36,290
Now, I want to make one small clarification.

177
00:13:36,800 --> 00:13:39,500
Please note that this is really for intuition only.

178
00:13:40,070 --> 00:13:45,290
These are not really probabilities of keeping and discarding because we never actually keep or discard.

179
00:13:45,830 --> 00:13:52,490
Instead, what we actually do is take a mixture of the two components, for instance, 25 percent of

180
00:13:52,490 --> 00:13:55,700
component one and 75 percent of component two.

181
00:13:56,060 --> 00:13:57,830
But we don't actually keep or discard.

182
00:14:02,570 --> 00:14:05,120
In actuality, the word gate is very helpful.

183
00:14:05,750 --> 00:14:07,910
Imagine two openings side by side.

184
00:14:08,960 --> 00:14:14,570
Now imagine a gate, which is the same size as each opening, but can only slide between the two openings.

185
00:14:15,200 --> 00:14:21,440
Thus, it can only fully cover one of the openings at once, or it can simply cover a part of each.

186
00:14:22,040 --> 00:14:27,080
But notice that in total, it always covers an area equal in size to one full opening.

187
00:14:27,800 --> 00:14:34,400
In other words, if one opening is 25 percent clear, the other opening will be 75 percent clear and

188
00:14:34,400 --> 00:14:36,290
they always sum up to 100 percent.

189
00:14:37,130 --> 00:14:42,140
So picture this like an ice cream maker that can give you some mixture of chocolate ice cream and vanilla

190
00:14:42,140 --> 00:14:42,770
ice cream.

191
00:14:43,400 --> 00:14:48,440
Maybe you like chocolate ice cream more, so you choose 90 percent chocolate and 10 percent vanilla,

192
00:14:48,710 --> 00:14:51,920
sliding the gate to the appropriate point to achieve this.

193
00:14:52,760 --> 00:14:55,910
Note that in practice, this is called a convex combination.

194
00:14:56,360 --> 00:14:59,570
You can also think of it like a soft mixture or a soft selection.

195
00:15:04,160 --> 00:15:07,970
Let's now turn our attention to the other gate vector, the reset gate vector.

196
00:15:08,570 --> 00:15:09,740
How does this come into play?

197
00:15:10,700 --> 00:15:13,040
Well, first, let's consider how this is calculated.

198
00:15:14,440 --> 00:15:18,400
As you can see, it's just another neuron, so that should give you some comfort.

199
00:15:19,030 --> 00:15:20,500
There is really nothing new here.

200
00:15:20,530 --> 00:15:26,950
As I promised, we're just making use of neurons, which we've already learned about just neurons connected

201
00:15:26,950 --> 00:15:28,030
to more neurons.

202
00:15:28,810 --> 00:15:32,710
Now it's called the reset gate vector, which is a good hint about what it's for.

203
00:15:33,490 --> 00:15:39,490
Remember that we called the second part of the h of T calculation a simple RNN, but this wasn't

204
00:15:39,490 --> 00:15:40,390
entirely true.

205
00:15:41,200 --> 00:15:44,230
This is because this is where the reset gate appears.

206
00:15:45,160 --> 00:15:51,190
As you can see, the reset gate is used by doing an element-wise multiplication with h of T minus

207
00:15:51,190 --> 00:15:51,520
one.

208
00:15:52,570 --> 00:15:57,760
Importantly, remember that the values of r of T are always between zero and one.

209
00:15:58,510 --> 00:16:03,040
What happens if we multiply h of T minus one by a value close to zero?

210
00:16:03,580 --> 00:16:05,230
Well, it just gets closer to zero.

211
00:16:05,830 --> 00:16:07,320
What if we multiply by one?

212
00:16:07,750 --> 00:16:09,760
Then it just stays the same as it was before.

213
00:16:10,450 --> 00:16:17,860
In other words, before multiplying by the hidden-to-hidden weight W h h, we decide which parts of h of T minus

214
00:16:17,860 --> 00:16:21,310
one we want to remember and which parts we want to forget.

215
00:16:22,690 --> 00:16:25,780
So we forget them simply by resetting them back to zero.

216
00:16:25,990 --> 00:16:27,430
Hence, the reset gate.

217
00:16:28,540 --> 00:16:32,650
Note that this performs a very similar function to the update gate z of T.

218
00:16:32,680 --> 00:16:34,120
It's just in a different place.

219
00:16:34,630 --> 00:16:40,380
So it's just reinforcing our ability to remember and forget different parts of h of T minus one.
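
Putting the update gate and the reset gate together, a single GRU time step can be sketched in NumPy as follows. This is an illustration under assumptions, not a reference implementation: the parameter names, the tanh activation for the candidate state, the small random initialization, and the helper names gru_step and init_params are all choices made here for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; p is a dict of weights and biases (names are hypothetical)."""
    # Update gate: how much of the new candidate to take.
    z = sigmoid(x_t @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])
    # Reset gate: which parts of h_prev the candidate is allowed to see.
    r = sigmoid(x_t @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])
    # Candidate state: like a simple RNN, but h_prev is first scaled by r.
    h_candidate = np.tanh(x_t @ p["W_xh"] + (r * h_prev) @ p["W_hh"] + p["b_h"])
    # Convex combination of the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_candidate

def init_params(D, M, rng):
    p = {}
    for gate in "zrh":
        p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(D, M))
        p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(M, M))
        p[f"b_{gate}"] = np.zeros(M)
    return p

rng = np.random.default_rng(0)
D, M, T = 4, 3, 6  # hypothetical sizes and sequence length
params = init_params(D, M, rng)
xs = rng.normal(size=(T, D))

h = np.zeros(M)
for x_t in xs:
    h = gru_step(x_t, h, params)
print(h.shape)  # (3,)
```

One way to see the "remember" behaviour is to set the update gate bias b_z to a large negative number: z of T is then close to zero and the step returns h of T minus one almost unchanged.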

220
00:16:45,430 --> 00:16:47,680
OK, so to summarize what we've learned so far.

221
00:16:48,010 --> 00:16:50,350
What is the GRU and why is it useful?

222
00:16:51,310 --> 00:16:51,820
The GRU,

223
00:16:51,850 --> 00:16:58,990
first and foremost, has the same API as the simple recurrent unit. The output is still h of T, and it

224
00:16:58,990 --> 00:17:01,810
still depends on h of T minus one and x of T.

225
00:17:02,650 --> 00:17:08,710
The difference between this and the simple recurrent unit is that it has functionality for remembering and

226
00:17:08,710 --> 00:17:11,440
forgetting what was in h of T minus one.

227
00:17:12,430 --> 00:17:17,710
This solves a problem we had with simple RNNs, where due to the vanishing gradient problem, it

228
00:17:17,710 --> 00:17:20,619
would forget things that it saw earlier in a sequence.

229
00:17:21,400 --> 00:17:26,710
We accomplish this by adding little binary classifiers in the form of logistic regression neurons

230
00:17:27,069 --> 00:17:30,460
that output a prediction on whether to remember or forget.