1
00:00:11,710 --> 00:00:18,580
In this lecture, we are going to continue our discussion of the LSTM and GRU. Previously, we discussed

2
00:00:18,580 --> 00:00:22,360
the GRU, and now we are going to move on to the LSTM.

3
00:00:23,170 --> 00:00:25,450
To recap, what did we learn about the GRU?

4
00:00:26,320 --> 00:00:32,290
First, we learned that the simple RNN unit by itself is problematic because it can't remember long-term

5
00:00:32,290 --> 00:00:33,160
dependencies.

6
00:00:33,940 --> 00:00:39,120
The way we solve this is to make the hidden state the weighted sum of the previous hidden state

7
00:00:39,430 --> 00:00:44,740
and the new value for the state, which is essentially the same calculation as the simple RNN

8
00:00:44,740 --> 00:00:45,040
unit.

9
00:00:45,910 --> 00:00:52,210
In this way, the GRU always has the opportunity to remember the previous hidden state, allowing it

10
00:00:52,210 --> 00:00:56,850
to carry that state forward in time, reducing the chance of forgetting it later.

11
00:00:58,560 --> 00:01:04,319
Importantly, we recognize that these gates, which decide whether we should remember or forget, are

12
00:01:04,319 --> 00:01:06,060
just binary classifiers.

13
00:01:06,720 --> 00:01:10,860
They are little logistic regression neurons, something we are very comfortable with.

14
00:01:15,910 --> 00:01:20,980
Now we are going to move on to the LSTM, which is a little more complicated, but follows the same

15
00:01:20,980 --> 00:01:21,430
idea.

16
00:01:22,210 --> 00:01:28,300
Interestingly, when GRUs first came out, the initial research suggested that there was no clear

17
00:01:28,300 --> 00:01:29,440
winner between the two.

18
00:01:30,070 --> 00:01:34,580
It was reported that the GRU performed comparably with the LSTM.

19
00:01:35,350 --> 00:01:40,330
This made the GRU a great choice because it has fewer parameters and is therefore more efficient than

20
00:01:40,330 --> 00:01:41,080
the LSTM.

21
00:01:42,070 --> 00:01:47,470
These days, the current belief is that this isn't true and that the LSTM actually ended up being

22
00:01:47,470 --> 00:01:47,890
better.

23
00:01:48,610 --> 00:01:51,340
In fact, at the time I'm making this course,

24
00:01:51,730 --> 00:01:57,420
this latest research only came out about a year ago, which is actually two years after I made my first

25
00:01:57,420 --> 00:01:58,300
RNN course.

26
00:01:58,840 --> 00:02:01,750
So that just goes to show how fast things move in this field.

27
00:02:02,680 --> 00:02:08,710
Experimental results are always changing our thinking about best practices and which techniques to choose

28
00:02:08,710 --> 00:02:09,490
over others.

29
00:02:10,539 --> 00:02:12,710
This is also a great example of my rule:

30
00:02:13,060 --> 00:02:16,000
machine learning is experimentation, not philosophy.

31
00:02:16,750 --> 00:02:20,620
I know for a lot of beginners, the approach they take is a philosophical approach.

32
00:02:21,100 --> 00:02:24,310
They try to use logical reasoning to deduce which is better,

33
00:02:24,430 --> 00:02:25,990
the LSTM or the GRU.

34
00:02:26,560 --> 00:02:30,040
The LSTM has these merits, the GRU has these merits, and so on.

35
00:02:31,390 --> 00:02:34,600
Such things are useless. Please don't waste your time on philosophy.

36
00:02:35,230 --> 00:02:41,260
Instead, follow the example that these researchers have set and use experiments to decide which is

37
00:02:41,260 --> 00:02:41,800
best.

38
00:02:42,370 --> 00:02:46,840
There are two papers I've attached in the extra reading text that you'll want to check out.

39
00:02:47,500 --> 00:02:54,340
The first is called "On the Practical Computational Power of Finite Precision RNNs for Language Recognition".

40
00:02:54,730 --> 00:03:00,160
And the second is called "Massive Exploration of Neural Machine Translation Architectures".

41
00:03:00,790 --> 00:03:06,700
They both conclude that the LSTM outperforms GRUs, and so, given the choice, I would probably

42
00:03:06,700 --> 00:03:08,470
choose the LSTM by default.

43
00:03:13,640 --> 00:03:15,260
So how does the LSTM work?

44
00:03:16,040 --> 00:03:22,310
Basically, you can think of it as being like the GRU, but with more state vectors and more gates. Again,

45
00:03:22,310 --> 00:03:25,840
it's my preference to understand the LSTM through the equations.

46
00:03:25,850 --> 00:03:31,190
But if you find these visualizations helpful, then you can feel free to look at those as well.

47
00:03:36,370 --> 00:03:43,540
The first difference between the GRU and the LSTM is that the LSTM has two states. In addition to h of T,

48
00:03:43,960 --> 00:03:45,310
which we call the hidden state,

49
00:03:45,580 --> 00:03:47,850
we also have C of T, the cell state.

50
00:03:48,700 --> 00:03:52,120
Sometimes the cell state is considered an additional hidden state.

51
00:03:52,540 --> 00:03:56,320
So you actually might pass both of these on to the next dense layer.

52
00:03:57,930 --> 00:04:02,790
For us, we will usually ignore the cell state, so you can think of it more like an intermediate value

53
00:04:02,790 --> 00:04:07,530
that you calculate, just like the gate vectors: you calculate it, but you don't actually use it

54
00:04:07,710 --> 00:04:09,090
to pass on to the next layer.

55
00:04:09,930 --> 00:04:17,430
So even though the LSTM has an extra state, you can still think of it as having almost the same API as

56
00:04:17,430 --> 00:04:19,260
a simple RNN or a GRU.

57
00:04:19,950 --> 00:04:23,160
It still outputs the final hidden state, h of Big T.

58
00:04:24,970 --> 00:04:30,910
It's just that for the inputs, in addition to X of one all the way up to X of Big T and the initial

59
00:04:30,910 --> 00:04:32,350
hidden state h of zero,

60
00:04:32,710 --> 00:04:35,600
we also have the initial cell state C of zero.

61
00:04:36,370 --> 00:04:42,130
Optionally, the LSTM unit can output the cell state, but we'll usually ignore this.

62
00:04:47,210 --> 00:04:49,850
OK, so let's get to the LSTM calculations.

63
00:04:50,360 --> 00:04:53,780
I want to remind you that while this might seem complicated, it's not.

64
00:04:54,440 --> 00:04:59,120
Don't be scared by the sight of equations, but instead try to see the forest for the trees.

65
00:04:59,630 --> 00:05:01,640
Each of these is just a neuron.

66
00:05:02,270 --> 00:05:08,330
Once you start to realize that each thing here is a binary classifier or logistic regression, it becomes

67
00:05:08,330 --> 00:05:09,700
a lot simpler to interpret.

68
00:05:10,520 --> 00:05:11,450
So what do we see?

69
00:05:12,410 --> 00:05:15,170
First, we see that there are more of these gates.

70
00:05:15,320 --> 00:05:18,980
Specifically, we have f of T, the forget gate vector.

71
00:05:19,730 --> 00:05:24,740
Next, we have i of T, the input gate vector, also known as the update gate vector,

72
00:05:24,740 --> 00:05:26,450
analogously to the GRU.

73
00:05:26,930 --> 00:05:30,080
And next, we have o of T, the output gate vector.

74
00:05:31,770 --> 00:05:36,960
So hopefully this is easy to remember because the word forget starts with f, the word input starts

75
00:05:36,960 --> 00:05:39,370
with I and the word output starts with O.

76
00:05:40,440 --> 00:05:42,510
Next, we have the cell state C of T.

77
00:05:43,260 --> 00:05:49,410
You might notice that this actually takes on the role of h of T from the GRU. Specifically,

78
00:05:49,500 --> 00:05:51,540
we have the weighted sum of two terms.

79
00:05:52,140 --> 00:05:55,890
The first term is just the previous cell state, C of T minus one.

80
00:05:56,610 --> 00:06:01,410
How much of this we end up keeping is controlled by the forget gate f of T.

81
00:06:01,860 --> 00:06:05,070
So f of T controls how much of C of T minus one we want to forget.

82
00:06:06,150 --> 00:06:08,490
The second term is the simple RNN.

83
00:06:09,060 --> 00:06:12,410
How much of this we end up keeping is controlled by the input gate,

84
00:06:12,420 --> 00:06:20,100
i of T. And notice that this has exactly the same equation as the simple RNN, unlike the GRU.

85
00:06:21,500 --> 00:06:27,950
Now, note that for the simple RNN term, we have the activation function f_c. This can be anything,

86
00:06:28,250 --> 00:06:30,600
but usually, by default, it's the tanh.

87
00:06:31,790 --> 00:06:37,280
Finally, we have the hidden state h of T, which is just a simple transformation on the cell state

88
00:06:37,280 --> 00:06:37,970
C of T.

89
00:06:38,540 --> 00:06:45,530
In particular, we apply another activation function f_h, which can also be anything, but by default,

90
00:06:45,530 --> 00:06:47,150
it's also usually a tanh.

91
00:06:47,750 --> 00:06:50,600
We then multiply this by the output gate, o of T.

92
00:06:52,400 --> 00:06:58,150
So the output gate controls which values of the cell state we actually pass through to the hidden state

93
00:06:58,460 --> 00:06:58,880
h of T.
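
For reference, the calculations just described can be written out as a small NumPy sketch. This is only an illustration under stated assumptions, not how TensorFlow implements the layer: the helper name lstm_step, the weight layout (Wx, Wh, b per gate), and the toy sizes D and M are all made up for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params, f_c=np.tanh, f_h=np.tanh):
    """One LSTM time step for a single sample (hypothetical helper)."""
    def affine(name):
        # Every gate and the candidate share the same "neuron" form.
        Wx, Wh, b = params[name]
        return x_t @ Wx + h_prev @ Wh + b

    f_t = sigmoid(affine("f"))  # forget gate
    i_t = sigmoid(affine("i"))  # input (update) gate
    o_t = sigmoid(affine("o"))  # output gate
    # Cell state: weighted sum of the old cell state and the simple RNN term.
    c_t = f_t * c_prev + i_t * f_c(affine("c"))
    # Hidden state: squashed cell state, filtered by the output gate.
    h_t = o_t * f_h(c_t)
    return h_t, c_t

# Toy sizes (assumed): D input features, M hidden units.
D, M = 3, 4
rng = np.random.default_rng(0)
params = {
    name: (rng.normal(size=(D, M)), rng.normal(size=(M, M)), np.zeros(M))
    for name in "fioc"
}
x_t = rng.normal(size=D)
h1, c1 = lstm_step(x_t, np.zeros(M), np.zeros(M), params)
print(h1.shape, c1.shape)
```

Each gate is a sigmoid neuron, so its values lie between 0 and 1, which is what lets it act as a soft "keep or discard" switch on the term it multiplies.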

94
00:07:03,960 --> 00:07:10,350
Note that in TensorFlow and Keras, it's actually not possible to choose f_c and f_h individually.

95
00:07:10,890 --> 00:07:15,390
Of course, if you were writing your own LSTM, you could customize this any way you wanted.

96
00:07:15,930 --> 00:07:20,970
But for TensorFlow and Keras, this is controlled by the activation argument as usual.

97
00:07:21,810 --> 00:07:26,370
So if you set activation to ReLU, then both of these would be changed to ReLU.
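
As a rough sketch of the point above, this is what setting the activation argument looks like in Keras. The toy sizes N, T, D, and M are assumptions for this example, and it assumes TensorFlow 2.x is installed.

```python
import numpy as np
import tensorflow as tf

N, T, D, M = 2, 5, 3, 4  # assumed toy sizes: samples, time steps, features, hidden units
x = np.random.randn(N, T, D).astype("float32")

# `activation` is used for both the candidate term and the squashing of the
# cell state. The gates keep their own sigmoid through the separate
# `recurrent_activation` argument, left at its default here.
relu_lstm = tf.keras.layers.LSTM(M, activation="relu")
h_T = relu_lstm(x)
print(h_T.shape)
```

With ReLU in place of tanh, the hidden state is a gate value between 0 and 1 times a non-negative number, so it can no longer be negative.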

98
00:07:31,480 --> 00:07:37,210
So you can see that by building up our knowledge first through simple RNNs, then GRUs, then

99
00:07:37,390 --> 00:07:42,370
LSTMs, we can make something complex like the LSTM seem very simple.

100
00:07:42,940 --> 00:07:48,930
It's nothing but three logistic regressions, or neurons, to give us the forget gate, the input gate,

101
00:07:48,940 --> 00:07:49,840
and the output gate.

102
00:07:50,560 --> 00:07:57,040
Then the cell state is the weighted sum of the previous cell state and the simple RNN, weighted by

103
00:07:57,040 --> 00:07:58,690
the forget gate and the input gate.

104
00:07:59,680 --> 00:08:05,530
And remember, we want to offer the cell state the possibility to remember its old state so that we

105
00:08:05,530 --> 00:08:07,690
can learn long-term dependencies.

106
00:08:09,730 --> 00:08:15,310
Finally, the hidden state h of T is just a squashed version of the cell state, with the output

107
00:08:15,310 --> 00:08:18,040
gate controlling which values are allowed to pass through.

108
00:08:18,790 --> 00:08:24,280
By the way, I say squashed because most of the time we will use the default tanh activation.

109
00:08:25,000 --> 00:08:28,100
So this is really not a hyperparameter you have to play around with.

110
00:08:28,570 --> 00:08:30,430
Just choose the tanh by default.

111
00:08:31,450 --> 00:08:38,049
As a side note, if you change the activation, your LSTM unit won't use the GPU, at least in this

112
00:08:38,049 --> 00:08:39,159
version of TensorFlow.

113
00:08:39,730 --> 00:08:45,520
There are some other requirements, too, for the LSTM to be GPU compatible, so you'll want to check

114
00:08:45,520 --> 00:08:49,090
out the TensorFlow documentation for the most up-to-date list.

115
00:08:54,220 --> 00:08:59,270
In terms of code, how do the GRU and LSTM work? As promised,

116
00:08:59,290 --> 00:09:02,350
these have the same API as the simple RNN.

117
00:09:02,680 --> 00:09:09,280
So if all you want to do is plug in a GRU or LSTM in place of the simple RNN, it's as simple as

118
00:09:09,280 --> 00:09:10,270
changing the name.

119
00:09:10,930 --> 00:09:16,060
In other words, just type in a different name, and all of a sudden you have a more powerful model.

120
00:09:16,360 --> 00:09:16,840
Easy.
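
A minimal sketch of "just changing the name", assuming TensorFlow 2.x. The toy sizes N, T, D, and M are assumptions for this example, not values from the lecture.

```python
import numpy as np
import tensorflow as tf

N, T, D, M = 2, 5, 3, 4  # assumed toy sizes: samples, time steps, features, hidden units
x = np.random.randn(N, T, D).astype("float32")

outputs = {}
# The three layers are drop-in replacements for one another: same
# constructor argument, same input shape, same output shape.
for layer_cls in (tf.keras.layers.SimpleRNN, tf.keras.layers.GRU, tf.keras.layers.LSTM):
    layer = layer_cls(M)
    outputs[layer_cls.__name__] = layer(x)

for name, h in outputs.items():
    print(name, h.shape)
```

By default each layer returns only the final hidden state, so all three outputs have shape N by M.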

121
00:09:21,870 --> 00:09:26,670
Later on, there will be some useful options for these RNN units that we can make use of.

122
00:09:27,180 --> 00:09:32,490
I want to briefly mention these to you now so that you know they exist, although we won't be making

123
00:09:32,490 --> 00:09:33,600
use of them right away.

124
00:09:35,190 --> 00:09:41,010
First, recall that for each input, x of one, x of two, all the way up to x of Big T, we will calculate

125
00:09:41,010 --> 00:09:44,820
a hidden state: h of one, h of two, all the way up to h of Big T.

126
00:09:45,760 --> 00:09:52,530
So far, we've seen that all the RNN units, the simple RNN, the GRU, and the LSTM, only return

127
00:09:52,530 --> 00:09:54,110
this final hidden value,

128
00:09:54,150 --> 00:09:55,050
h of Big T.

129
00:09:56,240 --> 00:10:01,460
But there are scenarios where we would want the other hidden states too, so that we can calculate the

130
00:10:01,460 --> 00:10:06,920
corresponding output predictions: y hat of one, y hat of two, all the way up to y hat of Big T.

131
00:10:08,240 --> 00:10:10,130
In order to do that, it's very simple.

132
00:10:10,550 --> 00:10:16,260
All we need to do is pass in the argument return_sequences=True. When you do this,

133
00:10:16,280 --> 00:10:22,340
the output of the RNN unit will be of shape N by T by M, because, remember, the hidden feature

134
00:10:22,340 --> 00:10:23,840
vector is of size M.

135
00:10:24,440 --> 00:10:30,770
And then when you apply a dense layer on top of this, your final output is of shape N by T by K, where

136
00:10:30,770 --> 00:10:32,360
K is the number of output nodes.

137
00:10:34,810 --> 00:10:38,380
So you have an output for every time step, for every sample.
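
The shapes described above can be checked with a short sketch, assuming TensorFlow 2.x. The sizes N, T, D, M, and K are toy values chosen for this example.

```python
import numpy as np
import tensorflow as tf

N, T, D, M, K = 2, 5, 3, 4, 1  # assumed toy sizes
x = np.random.randn(N, T, D).astype("float32")

# Default behaviour: only the final hidden state, shape N x M.
h_last = tf.keras.layers.LSTM(M)(x)

# return_sequences=True: one hidden state per time step, shape N x T x M.
h_all = tf.keras.layers.LSTM(M, return_sequences=True)(x)

# A Dense layer acts on the last axis, giving one prediction per time
# step per sample, shape N x T x K.
y_hat = tf.keras.layers.Dense(K)(h_all)

print(h_last.shape, h_all.shape, y_hat.shape)
```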

138
00:10:43,430 --> 00:10:49,070
Another interesting argument is the return_state argument. For the simple RNN and the GRU,

139
00:10:49,310 --> 00:10:50,810
this is a bit superfluous.

140
00:10:51,230 --> 00:10:53,810
We already know that the output is the hidden state.

141
00:10:54,290 --> 00:10:59,720
So when we use the functional API and we grab the return value, this is already the hidden state.

142
00:11:00,560 --> 00:11:02,540
So if you set return_state=True,

143
00:11:02,690 --> 00:11:04,040
you'll just get back the same thing

144
00:11:04,040 --> 00:11:06,740
returned twice. For the LSTM,

145
00:11:07,100 --> 00:11:12,860
the cell state is also considered a state, so if you set return_state=True for the LSTM,

146
00:11:13,160 --> 00:11:19,010
it will return three things: the output, the hidden state, and the cell state. The output and the hidden

147
00:11:19,010 --> 00:11:22,310
state are still the same thing, but the cell state is different.

148
00:11:23,060 --> 00:11:28,460
Again, this is kind of random information at this point since we won't be using it immediately, but

149
00:11:28,730 --> 00:11:31,400
it's good to have in the back of your mind, just in case.
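
A minimal sketch of return_state, assuming TensorFlow 2.x. The toy sizes and the variable names are assumptions for this example.

```python
import numpy as np
import tensorflow as tf

N, T, D, M = 2, 5, 3, 4  # assumed toy sizes
x = np.random.randn(N, T, D).astype("float32")

# GRU: the output and the final hidden state are the same values.
gru_out, gru_h = tf.keras.layers.GRU(M, return_state=True)(x)

# LSTM: output, final hidden state, and final cell state.
lstm_out, lstm_h, lstm_c = tf.keras.layers.LSTM(M, return_state=True)(x)

print(gru_out.shape, gru_h.shape)
print(lstm_out.shape, lstm_h.shape, lstm_c.shape)
```

Without return_sequences, the output is the final hidden state, which is why it comes back twice. The cell state is the only genuinely new value here.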