1
00:00:11,680 --> 00:00:18,580
In this lecture, we are going to continue our discussion of the LSTM and GRU. Previously, we discussed

2
00:00:18,580 --> 00:00:25,450
the GRU, and now we are going to move on to the LSTM. To recap, what did we learn about the GRU?

3
00:00:26,350 --> 00:00:32,320
First, we learned that the simple RNN unit by itself is problematic because it can't remember long-term

4
00:00:32,320 --> 00:00:33,190
dependencies.

5
00:00:33,970 --> 00:00:39,130
The way we solve this is to make the hidden state the weighted sum of the previous hidden state

6
00:00:39,400 --> 00:00:45,090
and the new value for the state, which is essentially the same calculation as the simple RNN unit.

7
00:00:45,880 --> 00:00:52,150
In this way, the GRU always has the opportunity to remember the previous hidden state, allowing

8
00:00:52,150 --> 00:00:56,860
it to carry that state forward in time, reducing the chance of forgetting it later.

9
00:00:58,560 --> 00:01:04,320
Importantly, we recognize that these gates, which decide whether we should remember or forget, are

10
00:01:04,320 --> 00:01:06,090
just binary classifiers.

11
00:01:06,750 --> 00:01:10,890
They are little logistic regression neurons, something we are very comfortable with.

12
00:01:15,910 --> 00:01:21,490
Now we are going to move on to the LSTM, which is a little more complicated, but follows the same idea.

13
00:01:22,180 --> 00:01:28,660
Interestingly, when GRUs first came out, the initial research suggested that there was no clear winner

14
00:01:28,660 --> 00:01:29,440
between the two.

15
00:01:30,040 --> 00:01:34,600
It was reported that the GRU performed comparably with the LSTM.

16
00:01:35,350 --> 00:01:40,330
This made the GRU a great choice because it has fewer parameters and is therefore more efficient than

17
00:01:40,330 --> 00:01:41,120
the LSTM.

18
00:01:42,100 --> 00:01:47,920
These days, the current belief is that this isn't true and that the LSTM actually ended up being better.

19
00:01:48,580 --> 00:01:54,370
In fact, at the time I am making this course, this latest research only came out about a year ago,

20
00:01:54,730 --> 00:01:58,330
which is actually two years after I made my first RNN course.

21
00:01:58,840 --> 00:02:01,810
So that just goes to show how fast things move in this field.

22
00:02:02,680 --> 00:02:08,740
Experimental results are always changing our thinking about best practices and which techniques to choose

23
00:02:08,740 --> 00:02:09,500
over others.

24
00:02:10,570 --> 00:02:16,020
This is also a great example of my rule: machine learning is experimentation, not philosophy.

25
00:02:16,720 --> 00:02:20,640
I know for a lot of beginners, the approach they take is a philosophical approach.

26
00:02:21,100 --> 00:02:24,340
They would try to use logical reasoning to deduce which is better.

27
00:02:24,490 --> 00:02:30,070
The LSTM or the GRU? The LSTM has these merits, the GRU has these merits, and so on.

28
00:02:31,390 --> 00:02:32,650
Such things are useless.

29
00:02:32,710 --> 00:02:38,790
Please don't waste your time on philosophy. Instead, follow the example that these researchers have set

30
00:02:39,040 --> 00:02:41,800
and use experiments to decide which is best.

31
00:02:42,400 --> 00:02:46,850
There are two papers I've listed in extra_reading.txt that you'll want to check out.

32
00:02:47,530 --> 00:02:54,340
The first is called "On the Practical Computational Power of Finite Precision RNNs for Language Recognition."

33
00:02:54,730 --> 00:03:00,200
And the second is called "Massive Exploration of Neural Machine Translation Architectures."

34
00:03:00,820 --> 00:03:03,970
They both conclude that the LSTM outperforms the GRU.

35
00:03:03,970 --> 00:03:08,500
And so, given the choice, I would probably choose the LSTM by default.

36
00:03:13,610 --> 00:03:15,320
So how does the LSTM work?

37
00:03:16,040 --> 00:03:22,340
Basically, you can think of it as being like the GRU, but with more state vectors and more gates. Again,

38
00:03:22,340 --> 00:03:25,850
it's my preference to understand the LSTM through the equations.

39
00:03:25,850 --> 00:03:31,180
But if you find these visualizations helpful, then feel free to look at those as well.

40
00:03:36,430 --> 00:03:43,540
The first difference between the GRU and the LSTM is that the LSTM has two states. In addition to h(t),

41
00:03:43,990 --> 00:03:45,320
which we call the hidden state,

42
00:03:45,610 --> 00:03:52,110
we also have c(t), the cell state. Sometimes the cell state is considered an additional hidden state,

43
00:03:52,540 --> 00:03:56,350
so you actually might pass both of these on to the next layer.

44
00:03:57,930 --> 00:04:02,820
For us, we will usually ignore the cell state, so you can think of it more like an intermediate value

45
00:04:02,820 --> 00:04:05,300
that you calculate, just like the gate vectors.

46
00:04:05,340 --> 00:04:09,100
So you calculate it, but you don't actually use it to pass on to the next layer.

47
00:04:09,930 --> 00:04:17,460
So even though the LSTM has an extra state, you can still think of it as having almost the same API as

48
00:04:17,460 --> 00:04:18,840
a simple RNN or

49
00:04:18,840 --> 00:04:19,300
GRU.

50
00:04:19,920 --> 00:04:23,160
It still outputs the final hidden state h(T).

51
00:04:24,940 --> 00:04:30,940
It's just that for the inputs, in addition to x(1) all the way up to x(T) and the initial

52
00:04:30,940 --> 00:04:32,360
hidden state h(0),

53
00:04:32,740 --> 00:04:37,280
we also have the initial cell state c(0). For the outputs,

54
00:04:37,300 --> 00:04:43,360
it's a little strange due to how PyTorch works. Basically, we'll get h(T) and c(T), but

55
00:04:43,360 --> 00:04:47,380
the details might be surprising. We'll discuss that more later in this lecture.

56
00:04:52,380 --> 00:04:55,030
OK, so let's get to the LSTM calculations.

57
00:04:55,530 --> 00:05:00,780
I want to remind you that while this might seem complicated, it's not. Don't be scared by the sight

58
00:05:00,780 --> 00:05:04,360
of equations, but instead try to see the forest for the trees.

59
00:05:04,830 --> 00:05:06,880
Each of these is just a neuron.

60
00:05:07,470 --> 00:05:13,560
Once you start to realize that each thing here is a binary classifier or logistic regression, it becomes

61
00:05:13,560 --> 00:05:14,910
a lot simpler to interpret.

62
00:05:15,720 --> 00:05:16,650
So what do we see?

63
00:05:17,610 --> 00:05:20,450
First, we see that there are more of these gates.

64
00:05:20,490 --> 00:05:24,180
Specifically, we have f(t), the forget gate vector.

65
00:05:24,930 --> 00:05:31,050
Next, we have i(t), the input gate vector, also known as the update gate vector, analogously to the

66
00:05:31,050 --> 00:05:31,680
GRU.

67
00:05:32,130 --> 00:05:35,310
And next we have o(t), the output gate vector.

68
00:05:36,970 --> 00:05:42,190
So hopefully this is easy to remember, because the word forget starts with f, the word input starts

69
00:05:42,190 --> 00:05:47,740
with i, and the word output starts with o. Next, we have the cell state c(t).

70
00:05:48,460 --> 00:05:52,620
You might notice that this actually takes on the role of h(t) from the GRU.

71
00:05:52,630 --> 00:05:56,770
Specifically, we have the weighted sum of two terms.

72
00:05:57,340 --> 00:06:01,120
The first term is just the previous cell state, c(t-1).

73
00:06:01,810 --> 00:06:06,610
How much of this we end up keeping is controlled by the forget gate f(t).

74
00:06:07,030 --> 00:06:10,300
So it controls how much of c(t-1) we want to forget.

75
00:06:11,350 --> 00:06:13,690
The second term is the simple RNN.

76
00:06:14,260 --> 00:06:17,640
How much of this we end up keeping is controlled by the input gate

77
00:06:17,650 --> 00:06:24,030
i(t). Notice that this has exactly the same equation as the simple RNN,

78
00:06:24,070 --> 00:06:25,330
unlike the GRU.

79
00:06:26,680 --> 00:06:29,050
Now note that for the simple RNN term,

80
00:06:29,320 --> 00:06:35,900
we have the activation function f_c. This can be anything, but usually by default it is the tanh.

81
00:06:36,970 --> 00:06:42,490
Finally, we have the hidden state h(t), which is just a simple transformation of the cell state

82
00:06:42,490 --> 00:06:50,320
c(t). In particular, we apply another activation function f_h, which also can be anything, but by

83
00:06:50,320 --> 00:06:52,420
default, it's also usually a tanh.

84
00:06:52,990 --> 00:06:55,810
We then multiply this by the output gate o(t).

85
00:06:57,600 --> 00:07:03,360
So the output gate controls which values of the cell state we actually pass through to the hidden state

86
00:07:03,390 --> 00:07:04,110
h(t).
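
To make the equations just described concrete, here is a minimal sketch of a single LSTM time step in plain Python. It assumes scalar inputs and states (D = M = 1) to keep it short, and the names lstm_step and params are hypothetical, chosen only for this illustration. Both f_c and f_h are taken to be tanh, matching the default mentioned above.

```python
import math

def sigmoid(z):
    """Logistic function, used by all three gates."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input and scalar states.

    params maps each of "f", "i", "o", "c" to a (w_x, w_h, b) tuple
    (a hypothetical layout chosen for this sketch).
    """
    def neuron(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(neuron("f"))             # forget gate f(t)
    i = sigmoid(neuron("i"))             # input gate i(t)
    o = sigmoid(neuron("o"))             # output gate o(t)
    candidate = math.tanh(neuron("c"))   # simple RNN term, f_c = tanh
    c = f * c_prev + i * candidate       # weighted sum of old cell state and new value
    h = o * math.tanh(c)                 # squashed cell state (f_h = tanh), gated by o(t)
    return h, c

# Biases chosen so the forget gate is almost 1 and the input gate almost 0:
# the cell should simply remember its previous state.
params = {
    "f": (0.0, 0.0, 10.0),
    "i": (0.0, 0.0, -10.0),
    "o": (0.0, 0.0, 10.0),
    "c": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=0.5, h_prev=0.0, c_prev=0.8, params=params)
print(h, c)  # c stays very close to the previous cell state 0.8
```

With the gates pushed toward "remember", the new cell state is almost identical to the old one, which is the mechanism that lets the LSTM carry information across many time steps.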

87
00:07:09,150 --> 00:07:14,910
Note that in PyTorch, it's not possible to choose f_c and f_h individually, and in fact, you can't

88
00:07:14,910 --> 00:07:21,360
even change them from the default. As mentioned before, because RNNs are so complicated, they're

89
00:07:21,360 --> 00:07:26,730
usually programmed all in one module for the sake of efficiency, and they're not really composable

90
00:07:26,730 --> 00:07:28,470
like the other modules we've learned about.

91
00:07:30,220 --> 00:07:35,200
Of course, if you were writing your own LSTM, you could customize this any way you wanted, but its

92
00:07:35,200 --> 00:07:37,270
performance would most likely be suboptimal.

93
00:07:42,340 --> 00:07:49,000
So you can see that by building up our knowledge first through simple RNNs, then GRUs, then LSTMs,

94
00:07:49,420 --> 00:07:53,210
we can make something complex like the LSTM seem very simple.

95
00:07:53,740 --> 00:07:59,950
It's nothing but three logistic regressions or neurons to give us the forget gate, the input gate, and

96
00:07:59,950 --> 00:08:00,700
the output gate.

97
00:08:01,420 --> 00:08:07,990
Then the cell state is the weighted sum of the previous cell state and the simple RNN, weighted by the

98
00:08:07,990 --> 00:08:09,530
forget gate and the input gate.

99
00:08:10,510 --> 00:08:16,390
And remember, we want to offer the cell state the possibility to remember its old state so that we

100
00:08:16,390 --> 00:08:18,580
can learn long-term dependencies.

101
00:08:20,570 --> 00:08:26,390
Finally, the hidden state h(t) is just a squashed version of the cell state, with the output gate

102
00:08:26,390 --> 00:08:28,890
controlling which values are allowed to pass through.

103
00:08:29,720 --> 00:08:34,940
By the way, I say squashed because we use the tanh activation by default and it's not possible to

104
00:08:34,940 --> 00:08:35,520
change.

105
00:08:35,960 --> 00:08:39,080
So this is really not a hyperparameter you have to play around with.

106
00:08:44,380 --> 00:08:49,730
In terms of code, how do the GRU and LSTM work? As promised,

107
00:08:49,780 --> 00:08:52,990
these have the same or similar API as the simple RNN.

108
00:08:53,380 --> 00:08:59,650
So if all you wanted to do is plug in a GRU or LSTM in place of a simple RNN, it can be

109
00:08:59,650 --> 00:09:01,690
almost as simple as changing the name.

110
00:09:02,470 --> 00:09:05,580
Let's start with how to instantiate a GRU and LSTM.

111
00:09:06,370 --> 00:09:09,960
As you can see, it's almost exactly the same as a simple RNN.

112
00:09:10,960 --> 00:09:16,210
We pass in the number of input features D, the number of hidden features M, and the number of RNN

113
00:09:16,210 --> 00:09:17,590
layers to stack, L.

114
00:09:18,130 --> 00:09:23,920
I also pass in batch_first equal to True so that all my data is always N by something, instead of having

115
00:09:23,920 --> 00:09:25,450
the N dimension somewhere else.

116
00:09:26,800 --> 00:09:32,380
The crucial difference is that in a GRU and LSTM, there is no nonlinearity argument, because it's

117
00:09:32,380 --> 00:09:33,700
not possible to change.
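
As a rough sketch of the instantiation just described, the snippet below builds a GRU and an LSTM in PyTorch next to a simple RNN for comparison. The sizes D, M and L are arbitrary values picked for illustration, not values from the lecture.

```python
import torch.nn as nn

# Illustrative sizes (assumptions for this sketch)
D = 3  # number of input features
M = 5  # number of hidden features
L = 2  # number of stacked recurrent layers

# batch_first=True means the data is laid out as N x T x D
gru = nn.GRU(input_size=D, hidden_size=M, num_layers=L, batch_first=True)
lstm = nn.LSTM(input_size=D, hidden_size=M, num_layers=L, batch_first=True)

# For comparison: only the simple RNN accepts a nonlinearity argument
rnn = nn.RNN(input_size=D, hidden_size=M, num_layers=L,
             nonlinearity="relu", batch_first=True)
```

Apart from the missing nonlinearity argument, swapping one recurrent module for another really is mostly a matter of changing the class name.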

118
00:09:38,760 --> 00:09:42,780
Let's now talk about how the GRU and LSTM will be used in the forward function.

119
00:09:43,680 --> 00:09:48,000
First, let's discuss the initial hidden state and what to pass into the RNN.

120
00:09:48,840 --> 00:09:55,050
Conveniently, the GRU has the exact same interface as the simple RNN. As the input, we pass in the

121
00:09:55,050 --> 00:10:00,020
data as the first argument and the initial state h(0) as the second argument.

122
00:10:01,640 --> 00:10:08,510
As before, the initial state h(0) is of shape L by N by M: number of layers, by number of samples,

123
00:10:08,510 --> 00:10:16,460
by number of features. For the LSTM, we also require an initial cell state. Luckily, because the cell state has

124
00:10:16,490 --> 00:10:18,630
the same dimensionality as the hidden state,

125
00:10:18,860 --> 00:10:21,950
we will also initialize this to L by N by M.

126
00:10:26,980 --> 00:10:33,640
Now, let's talk about how to interpret the outputs from the GRU and LSTM. As you recall, RNNs

127
00:10:33,640 --> 00:10:40,030
in PyTorch return two things, not just one. With the Elman RNN, we get one h, which is of shape N

128
00:10:40,030 --> 00:10:46,870
by T by M, which tells us the hidden state at the final RNN layer for each sample, each time step,

129
00:10:46,870 --> 00:10:47,650
and each feature.

130
00:10:48,430 --> 00:10:53,670
The second h is of shape L by N by M, which tells us the hidden state at the final time

131
00:10:53,680 --> 00:10:57,430
step for each sample, each RNN layer, and each feature.

132
00:10:58,500 --> 00:11:00,720
In the GRU, it works exactly the same.

133
00:11:01,700 --> 00:11:04,790
In the LSTM, we have something similar, but not the same.

134
00:11:05,540 --> 00:11:09,950
The first thing that is returned is the same as what we get with the Elman unit: the hidden value

135
00:11:09,950 --> 00:11:12,240
at each sample, each time step, and each feature.

136
00:11:12,950 --> 00:11:14,200
The second thing is different.

137
00:11:14,750 --> 00:11:19,310
We get the hidden state h and the cell state c, but only at the final time step.

138
00:11:20,830 --> 00:11:26,410
So this second h will be of shape L by N by M, which tells us the hidden state at each sample, each

139
00:11:26,410 --> 00:11:32,320
layer, and each feature. We also get the cell state of the same shape, which means it's the cell state at each

140
00:11:32,320 --> 00:11:34,200
sample, each layer and each feature.

141
00:11:34,930 --> 00:11:39,100
Unfortunately, I'm unaware of how you would get the cell states at different time steps.
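
The input and output shapes discussed above can be checked with a short sketch like the following. The sizes N, T, D, M and L and the random input are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions for this sketch)
N, T, D, M, L = 4, 7, 3, 5, 2

x = torch.randn(N, T, D)    # data: N x T x D because batch_first=True
h0 = torch.zeros(L, N, M)   # initial hidden state: L x N x M
c0 = torch.zeros(L, N, M)   # initial cell state: same shape as h0

gru = nn.GRU(D, M, L, batch_first=True)
lstm = nn.LSTM(D, M, L, batch_first=True)

# GRU: same interface as the simple RNN
out_gru, h_gru = gru(x, h0)

# LSTM: initial states go in as a tuple, final states come back as a tuple
out_lstm, (h_lstm, c_lstm) = lstm(x, (h0, c0))

print(out_gru.shape)   # N x T x M: final layer, every time step
print(h_gru.shape)     # L x N x M: every layer, final time step
print(out_lstm.shape)  # N x T x M
print(h_lstm.shape)    # L x N x M
print(c_lstm.shape)    # L x N x M
```

Note that the states keep the L by N by M layout even with batch_first=True; only the data tensors put N first.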