1
00:00:11,680 --> 00:00:18,580
In this lecture, we are going to continue our discussion of the LSTM and GRU. Previously, we discussed

2
00:00:18,580 --> 00:00:25,450
the GRU, and now we are going to move on to the LSTM. To recap, what did we learn about the GRU?

3
00:00:26,350 --> 00:00:32,320
First, we learned that the simple RNN unit by itself is problematic because it can't remember long-term

4
00:00:32,320 --> 00:00:33,190
dependencies.

5
00:00:33,970 --> 00:00:39,130
The way we solve this is to make the hidden state the weighted sum of the previous hidden state

6
00:00:39,400 --> 00:00:45,070
and the new value for the state, which is essentially the same calculation as the simple RNN unit.

7
00:00:45,880 --> 00:00:52,150
In this way, the GRU always has the opportunity to remember the previous hidden state, allowing

8
00:00:52,150 --> 00:00:56,860
it to carry that state forward in time, reducing the chance of forgetting it later.

9
00:00:58,560 --> 00:01:04,320
Importantly, we recognize that these gates, which decide whether we should remember or forget, are

10
00:01:04,320 --> 00:01:06,090
just binary classifiers.

11
00:01:06,750 --> 00:01:10,890
They are little logistic regression neurons, something we are very comfortable with.

12
00:01:15,910 --> 00:01:21,490
Now we are going to move on to the LSTM, which is a little more complicated, but follows the same idea.

13
00:01:22,180 --> 00:01:28,660
Interestingly, when GRUs first came out, the initial research suggested that there was no clear winner

14
00:01:28,660 --> 00:01:29,440
between the two.

15
00:01:30,040 --> 00:01:34,600
It was reported that the GRU performed comparably with the LSTM.

16
00:01:35,350 --> 00:01:40,330
This made the GRU a great choice because it has fewer parameters and is therefore more efficient than

17
00:01:40,330 --> 00:01:41,120
the LSTM.

18
00:01:42,070 --> 00:01:47,920
These days, the current belief is that this isn't true and that the LSTM actually ended up being better.

19
00:01:48,580 --> 00:01:51,340
In fact, at the time I'm making this course,

20
00:01:51,760 --> 00:01:57,450
this latest research only came out about a year ago, which is actually two years after I made my first

21
00:01:57,450 --> 00:01:58,330
RNN course.

22
00:01:58,840 --> 00:02:01,810
So that just goes to show how fast things move in this field.

23
00:02:02,680 --> 00:02:08,740
Experimental results are always changing our thinking about best practices and which techniques to choose

24
00:02:08,740 --> 00:02:09,520
over others.

25
00:02:10,570 --> 00:02:16,020
This is also a great example of my rule: machine learning is experimentation, not philosophy.

26
00:02:16,750 --> 00:02:20,640
I know for a lot of beginners, the approach they take is a philosophical approach.

27
00:02:21,100 --> 00:02:24,340
They would try to use logical reasoning to deduce which is better,

28
00:02:24,490 --> 00:02:30,070
the LSTM or the GRU: the LSTM has these merits, the GRU has these merits, and so on.

29
00:02:31,390 --> 00:02:32,650
Such things are useless.

30
00:02:32,710 --> 00:02:38,790
Please don't waste your time on philosophy. Instead, follow the example that these researchers have set

31
00:02:39,040 --> 00:02:41,800
and use experiments to decide which is best.

32
00:02:42,370 --> 00:02:46,850
There are two papers I've attached in extra_reading.txt that you'll want to check out.

33
00:02:47,530 --> 00:02:54,340
The first is called "On the Practical Computational Power of Finite Precision RNNs for Language Recognition".

34
00:02:54,730 --> 00:03:00,200
And the second is called "Massive Exploration of Neural Machine Translation Architectures".

35
00:03:00,820 --> 00:03:03,970
They both conclude that the LSTM outperforms the GRU.

36
00:03:03,970 --> 00:03:08,500
And so, given the choice, I would probably choose the LSTM by default.

37
00:03:13,610 --> 00:03:15,320
So how does the LSTM work?

38
00:03:16,040 --> 00:03:22,340
Basically, you can think of it as being like the GRU, but with more state vectors and more gates. Again,

39
00:03:22,340 --> 00:03:25,850
it's my preference to understand the LSTM through the equations.

40
00:03:25,850 --> 00:03:31,180
But if you find these visualizations helpful, feel free to look at them as well.

41
00:03:36,430 --> 00:03:43,540
The first difference between the GRU and the LSTM is that the LSTM has two states. In addition to h(t),

42
00:03:43,990 --> 00:03:45,320
which we call the hidden state,

43
00:03:45,610 --> 00:03:52,100
we also have c(t), the cell state. Sometimes the cell state is considered an additional hidden state.

44
00:03:52,540 --> 00:03:56,350
So you actually might pass both of these on to the next layer.

45
00:03:57,930 --> 00:04:02,820
For us, we will usually ignore the cell state, so you can think of it more like an intermediate value

46
00:04:02,820 --> 00:04:05,300
that you calculate, just like the gate vectors.

47
00:04:05,340 --> 00:04:09,100
So you calculate it, but you don't actually use it to pass on to the next layer.

48
00:04:09,930 --> 00:04:17,460
So even though the LSTM has an extra state, you can still think of it as having almost the same API as

49
00:04:17,460 --> 00:04:18,840
a simple RNN and

50
00:04:18,840 --> 00:04:19,300
GRU.

51
00:04:19,920 --> 00:04:23,160
It still outputs the final hidden state h(T).

52
00:04:24,940 --> 00:04:30,940
It's just that for the inputs, in addition to x(1) all the way up to x(T) and the initial

53
00:04:30,940 --> 00:04:32,360
hidden state h(0),

54
00:04:32,740 --> 00:04:40,330
we also have the initial cell state c(0). Optionally, the LSTM unit can output the cell state,

55
00:04:40,540 --> 00:04:42,160
but we'll usually ignore this.

56
00:04:47,180 --> 00:04:53,240
OK, so let's get to the LSTM calculations. I want to remind you that while this might seem complicated,

57
00:04:53,240 --> 00:04:59,150
it's not. Don't be scared by the sight of equations; instead, try to see the forest for the trees.

58
00:04:59,630 --> 00:05:01,680
Each of these is just a neuron.

59
00:05:02,270 --> 00:05:08,360
Once you start to realize that each thing here is a binary classifier or logistic regression, it becomes

60
00:05:08,360 --> 00:05:09,710
a lot simpler to interpret.

61
00:05:10,520 --> 00:05:11,430
So what do we see?

62
00:05:12,410 --> 00:05:15,160
First, we see that there are more of these gates.

63
00:05:15,290 --> 00:05:18,980
Specifically, we have f(t), the forget gate vector.

64
00:05:19,730 --> 00:05:25,850
Next, we have i(t), the input gate vector, also known as the update gate vector, analogous to the

65
00:05:25,850 --> 00:05:26,480
GRU.

66
00:05:26,930 --> 00:05:30,110
And next, we have o(t), the output gate vector.

67
00:05:31,770 --> 00:05:36,990
So hopefully this is easy to remember, because the word forget starts with F, the word input starts

68
00:05:36,990 --> 00:05:42,540
with I, and the word output starts with O. Next, we have the cell state c(t).

69
00:05:43,260 --> 00:05:49,410
You might notice that this actually takes on the role of h(t) from the GRU. Specifically,

70
00:05:49,530 --> 00:05:51,570
we have the weighted sum of two terms.

71
00:05:52,140 --> 00:05:55,920
The first term is just the previous cell state c(t-1).

72
00:05:56,580 --> 00:06:01,410
How much of this we end up keeping is controlled by the forget gate f(t).

73
00:06:01,830 --> 00:06:05,100
So it controls how much of c(t-1) we want to forget.

74
00:06:06,150 --> 00:06:08,500
The second term is the simple RNN.

75
00:06:09,060 --> 00:06:12,440
How much of this we end up keeping is controlled by the input gate

76
00:06:12,450 --> 00:06:20,130
i(t). And notice that this has exactly the same equation as the simple RNN, unlike the GRU.

77
00:06:21,470 --> 00:06:27,980
Now, note that for the simple RNN term, we have the activation function f_c. This can be anything,

78
00:06:28,250 --> 00:06:30,690
but usually, by default, it's the tanh.

79
00:06:31,790 --> 00:06:37,280
Finally, we have the hidden state h(t), which is just a simple transformation of the cell state

80
00:06:37,280 --> 00:06:39,500
c(t). In particular,

81
00:06:39,680 --> 00:06:46,010
we apply another activation function f_h, which can also be anything, but by default it's also

82
00:06:46,010 --> 00:06:47,210
usually a tanh.

83
00:06:47,780 --> 00:06:50,600
We then multiply this by the output gate o(t).

84
00:06:52,400 --> 00:06:58,160
So the output gate controls which values of the cell state we actually pass through to the hidden state

85
00:06:58,190 --> 00:06:58,910
h(t).
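
Written out, the update described in this part of the lecture looks roughly like the following. The weight names (W, U, b with a gate subscript), the sigmoid symbol, and the element-wise product symbol are notation assumed here for illustration, since the lecture slides are not part of this transcript.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot f_c(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot f_h(c_t)
\end{aligned}
```

The first three lines are the logistic regression gates, the fourth is the weighted sum of the old cell state and the simple RNN term, and the last is the squashed cell state filtered by the output gate.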

86
00:07:03,930 --> 00:07:11,130
Note that in TensorFlow Keras, it's actually not possible to choose f_c and f_h individually. Of

87
00:07:11,130 --> 00:07:16,650
course, if you were writing your own LSTM, you could customize this any way you wanted, but for TensorFlow

88
00:07:16,650 --> 00:07:17,570
and Keras,

89
00:07:17,910 --> 00:07:20,980
this is controlled by the activation argument, as usual.

90
00:07:21,810 --> 00:07:26,370
So if you set activation to ReLU, then both of these would be changed to ReLU.
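
As a small sketch of that argument, assuming TensorFlow 2.x is installed: the layer size (8) and the input shape below are made-up values for illustration only.

```python
import tensorflow as tf

# `activation` sets both f_c and f_h at once; the three gates are governed
# separately by `recurrent_activation` (sigmoid by default).
lstm_default = tf.keras.layers.LSTM(8)                   # tanh for f_c and f_h
lstm_relu = tf.keras.layers.LSTM(8, activation="relu")   # ReLU for f_c and f_h

# Illustrative input: N=4 samples, T=10 time steps, D=3 features.
x = tf.random.normal((4, 10, 3))
print(lstm_default(x).shape, lstm_relu(x).shape)
```

With ReLU in both places and a zero initial cell state, the cell state and therefore the hidden state can never go negative, which is one way to see that both functions really were changed.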

91
00:07:31,480 --> 00:07:38,170
So you can see that by building up our knowledge first through simple RNNs, then GRUs, then LSTMs,

92
00:07:38,590 --> 00:07:42,370
we can make something complex like the LSTM seem very simple.

93
00:07:42,910 --> 00:07:49,120
It's nothing but three logistic regressions or neurons to give us the forget gate, the input gate, and

94
00:07:49,120 --> 00:07:49,870
the output gate.

95
00:07:50,560 --> 00:07:56,890
Then the cell state is the weighted sum of the previous cell state and the simple RNN, weighted

96
00:07:56,890 --> 00:07:58,700
by the forget gate and the input gate.

97
00:07:59,680 --> 00:08:05,530
And remember, we want to offer the cell state the possibility to remember its old state so that we

98
00:08:05,530 --> 00:08:07,750
can learn long-term dependencies.

99
00:08:09,730 --> 00:08:16,120
Finally, the hidden state h(t) is just a squashed version of the cell state, with the output gate controlling

100
00:08:16,120 --> 00:08:18,070
which values are allowed to pass through.
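
To make this summary concrete, here is a minimal sketch of a single LSTM time step in plain Python. It is not how TensorFlow implements the layer; the function names (`lstm_step`, `affine`) and the `params` dictionary layout are hypothetical choices made for this illustration.

```python
import math


def sigmoid(a):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, U, b, x, h_prev):
    """Compute W x + U h_prev + b using plain Python lists."""
    return [
        sum(w * xj for w, xj in zip(W[k], x))
        + sum(u * hj for u, hj in zip(U[k], h_prev))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, params, f_c=math.tanh, f_h=math.tanh):
    """One LSTM time step; params maps 'f', 'i', 'o', 'c' to (W, U, b)."""
    f = [sigmoid(a) for a in affine(*params["f"], x, h_prev)]  # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], x, h_prev)]  # input gate
    o = [sigmoid(a) for a in affine(*params["o"], x, h_prev)]  # output gate
    cand = [f_c(a) for a in affine(*params["c"], x, h_prev)]   # simple RNN term
    # Cell state: weighted sum of the old cell state and the simple RNN term.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, cand)]
    # Hidden state: squashed cell state, filtered by the output gate.
    h = [ok * f_h(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy example: D=2 inputs, M=1 hidden unit, all weights set to zero.
zeros = ([[0.0, 0.0]], [[0.0]], [0.0])  # W (1x2), U (1x1), b (length 1)
params = {name: zeros for name in "fioc"}
h, c = lstm_step([1.0, -1.0], [0.0], [2.0], params)
print(h, c)
```

With all-zero weights every gate evaluates to 0.5 and the simple RNN term is tanh(0) = 0, so the new cell state is exactly half of the old one and the hidden state is 0.5 times tanh of that.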

101
00:08:18,820 --> 00:08:24,330
By the way, I say squashed because most of the time we will use the default tanh activation.

102
00:08:25,000 --> 00:08:28,120
So this is really not a hyperparameter you have to play around with.

103
00:08:28,570 --> 00:08:30,430
Just choose the tanh by default.

104
00:08:31,420 --> 00:08:38,050
As a side note, if you change the activation, your LSTM unit won't use the fast cuDNN GPU implementation, at least in this

105
00:08:38,050 --> 00:08:39,150
version of TensorFlow.

106
00:08:39,730 --> 00:08:44,320
There are some other requirements too for the LSTM to be GPU compatible.

107
00:08:44,620 --> 00:08:49,110
So you'll want to check out the TensorFlow documentation for the most up-to-date list.

108
00:08:54,190 --> 00:08:59,270
In terms of code, how do the GRU and LSTM work? As promised,

109
00:08:59,320 --> 00:09:02,350
these have the same API as the simple RNN.

110
00:09:02,680 --> 00:09:09,280
So if all you want to do is plug in a GRU or LSTM in place of the simple RNN, it's as simple as

111
00:09:09,280 --> 00:09:10,280
changing the name.

112
00:09:10,930 --> 00:09:16,080
In other words, just type in a different name and all of a sudden you have a more powerful model.

113
00:09:16,390 --> 00:09:16,870
Easy.
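
A sketch of what "changing the name" could look like with the Keras functional API, assuming TensorFlow 2.x. The sizes T, D, M, K and the helper `build_model` are made up for this illustration.

```python
from tensorflow.keras.layers import Input, SimpleRNN, GRU, LSTM, Dense
from tensorflow.keras.models import Model

# Illustrative sizes: sequence length, input features, hidden units, outputs.
T, D, M, K = 10, 3, 5, 1


def build_model(rnn_layer):
    """Build the same model around whichever recurrent layer class is passed in."""
    i = Input(shape=(T, D))
    x = rnn_layer(M)(i)   # final hidden state h(T), shape (N, M)
    x = Dense(K)(x)
    return Model(i, x)


# Only the layer class changes between the three models.
models = {cls.__name__: build_model(cls) for cls in (SimpleRNN, GRU, LSTM)}
for name, model in models.items():
    print(name, model.output_shape, model.count_params())
```

The printed parameter counts also show the trade-off mentioned earlier: the GRU has fewer parameters than the LSTM, and the simple RNN has fewer than both.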

114
00:09:21,900 --> 00:09:26,700
Later on, there will be some useful options for these RNN units that we can make use of.

115
00:09:27,150 --> 00:09:32,520
I want to briefly mention these to you now so that you know they exist, although we won't be making

116
00:09:32,520 --> 00:09:33,620
use of them right away.

117
00:09:35,220 --> 00:09:41,040
First, recall that for each input x(1), x(2), all the way up to x(T), we will calculate

118
00:09:41,040 --> 00:09:44,800
a hidden state h(1), h(2), all the way up to h(T).

119
00:09:45,780 --> 00:09:52,800
Currently, we've seen that all the RNN units, the simple RNN, the GRU, and the LSTM, only return this

120
00:09:52,800 --> 00:09:55,080
final hidden state value h(T).

121
00:09:56,270 --> 00:10:01,490
But there are scenarios where we would want the other hidden states too, so that we can calculate the

122
00:10:01,490 --> 00:10:03,980
corresponding output predictions ŷ(1),

123
00:10:03,980 --> 00:10:06,950
ŷ(2), and all the way up to ŷ(T).

124
00:10:08,210 --> 00:10:14,060
In order to do that, it's very simple: all we need to do is pass in the argument

125
00:10:14,060 --> 00:10:14,780
return_sequences=True.

126
00:10:15,470 --> 00:10:21,500
When you do this, the output of the RNN unit will be of shape N by T by M, because remember,

127
00:10:21,650 --> 00:10:27,560
the hidden feature vector is of size M, and then when you apply a dense layer on top of this, your

128
00:10:27,560 --> 00:10:32,390
final output is of shape N by T by K, where K is the number of output nodes.

129
00:10:34,780 --> 00:10:38,410
So you have an output for every timestep, for every sample.
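
The shapes above can be checked with a short sketch, assuming TensorFlow 2.x and NumPy. The concrete sizes and the variable names are illustrative only; D, the number of input features, is a symbol assumed here.

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# Illustrative sizes: samples, time steps, input features, hidden units, outputs.
N, T, D, M, K = 4, 10, 3, 5, 2

i = Input(shape=(T, D))
h_all = LSTM(M, return_sequences=True)(i)   # every hidden state: (N, T, M)
y_hat = Dense(K)(h_all)                     # Dense acts on the last axis: (N, T, K)

hidden_model = Model(i, h_all)
output_model = Model(i, y_hat)

X = np.random.randn(N, T, D).astype("float32")
print(hidden_model.predict(X, verbose=0).shape)   # (4, 10, 5)
print(output_model.predict(X, verbose=0).shape)   # (4, 10, 2)
```

Without return_sequences=True, the first shape would collapse to (4, 5), since only h(T) would be returned.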

130
00:10:43,400 --> 00:10:49,520
Another interesting argument is the return_state argument. For the simple RNN and GRU, this

131
00:10:49,520 --> 00:10:50,820
is a bit superfluous.

132
00:10:51,230 --> 00:10:53,800
We already know that the output is the hidden state.

133
00:10:54,290 --> 00:10:59,730
So when we use the functional API and we grab the return value, this is already the hidden state.

134
00:11:00,560 --> 00:11:02,540
So if you set return_state=True,

135
00:11:02,690 --> 00:11:04,040
you'll just get back the same thing

136
00:11:04,040 --> 00:11:06,890
returned twice. For the LSTM,

137
00:11:07,100 --> 00:11:09,360
the cell state is also considered a state.

138
00:11:09,770 --> 00:11:11,660
So if you set return_state=True

139
00:11:11,840 --> 00:11:12,890
for the LSTM,

140
00:11:13,130 --> 00:11:19,040
it'll return three things: the output, the hidden state, and the cell state. The output and the hidden

141
00:11:19,040 --> 00:11:22,340
state are still the same thing, but the cell state is different.

142
00:11:23,030 --> 00:11:27,680
Again, this is kind of random information at this point since we won't be using it immediately.

143
00:11:28,070 --> 00:11:31,430
But it's good to have in the back of your mind just in case.
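
For reference, a minimal sketch of return_state, assuming TensorFlow 2.x and NumPy. For brevity it calls the layers directly on a tensor rather than building a functional model; the sizes and variable names are made up for this illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import GRU, LSTM

# Illustrative sizes: samples, time steps, input features, hidden units.
N, T, D, M = 4, 10, 3, 5
X = tf.constant(np.random.randn(N, T, D).astype("float32"))

# GRU: the extra return value is just the hidden state again.
gru_out, gru_h = GRU(M, return_state=True)(X)

# LSTM: output, hidden state, and cell state.
lstm_out, lstm_h, lstm_c = LSTM(M, return_state=True)(X)

print(np.allclose(gru_out.numpy(), gru_h.numpy()))    # output equals hidden state
print(np.allclose(lstm_out.numpy(), lstm_h.numpy()))  # output equals hidden state
print(lstm_c.shape)                                   # cell state has the same shape
```

The cell state has the same shape as the hidden state but different values, because the hidden state is the cell state after the activation and the output gate.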