1
00:00:11,100 --> 00:00:15,270
OK, so we've just learned about the simple RNN, also called the Elman unit.

2
00:00:15,870 --> 00:00:21,690
Now, the next obvious question is: how do we use this to solve actual problems?

3
00:00:21,700 --> 00:00:24,810
At this point, I'm going to introduce you to two kinds of problems we can solve.

4
00:00:25,380 --> 00:00:27,380
One is called the many to one task.

5
00:00:27,870 --> 00:00:31,320
That means you have a whole sequence of inputs, but only one output.

6
00:00:32,010 --> 00:00:36,070
Some examples of this are spam detection and sentiment analysis.

7
00:00:36,660 --> 00:00:39,960
So for spam, we're going to take in a whole sequence of words.

8
00:00:39,960 --> 00:00:43,500
But the target is just a single class, either spam or not spam.

9
00:00:44,160 --> 00:00:45,440
For sentiment analysis,

10
00:00:45,450 --> 00:00:46,580
we have the same idea.

11
00:00:46,920 --> 00:00:52,710
Our input is a whole sequence of words, like a movie review, but the target is just whether or not

12
00:00:52,710 --> 00:00:54,190
it was positive or negative.

13
00:00:55,410 --> 00:00:58,660
The second kind of task is called a many to many task.

14
00:00:59,070 --> 00:01:02,720
This is when we have a sequence of inputs and a sequence of outputs.

15
00:01:03,270 --> 00:01:08,400
Some examples of this are part-of-speech tagging and anomaly detection in a time series.

16
00:01:09,270 --> 00:01:13,580
So, as you recall, parts of speech are labels like noun, verb and so forth.

17
00:01:14,100 --> 00:01:19,890
Clearly these need to be applied to each word and thus every word in the input will have a corresponding

18
00:01:19,890 --> 00:01:22,300
target. For anomaly detection,

19
00:01:22,320 --> 00:01:26,880
you can imagine that any point in the time series could be classified as anomalous.

20
00:01:27,270 --> 00:01:33,840
For example, you might want to monitor CPU usage on a cluster of machines. Every point in the time series

21
00:01:33,840 --> 00:01:36,060
would be labeled as anomalous or not.

22
00:01:40,800 --> 00:01:44,340
OK, so what's the point of discussing these different kinds of problems?

23
00:01:45,180 --> 00:01:48,300
Well, this will determine the architecture of the RNN.

24
00:01:49,140 --> 00:01:53,920
Conceptually, our neural network will end with a final dense layer, as it always does.

25
00:01:54,450 --> 00:01:59,320
As usual, it will have a final activation, which is appropriate for the task at hand.

26
00:01:59,880 --> 00:02:04,530
However, recognize that there are multiple ways to connect this final dense layer.

27
00:02:05,430 --> 00:02:09,450
As you recall, the RNN outputs a hidden state for every time step.

28
00:02:09,810 --> 00:02:12,690
We have h1, h2, all the way up to hT.

29
00:02:13,350 --> 00:02:16,500
The question is, what do we do with all these hidden states?

30
00:02:17,070 --> 00:02:19,830
The answer depends on which kind of task we are doing.

31
00:02:21,150 --> 00:02:26,610
If we are doing a many to one task, then we only keep the final hidden state, which contains all the

32
00:02:26,610 --> 00:02:29,090
information from the entire time series.

33
00:02:29,490 --> 00:02:33,120
We then pass this through a final dense layer to get a single prediction.
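The many to one setup just described can be sketched in plain Python. This is only an illustrative sketch with random, untrained weights; the names simple_rnn and dense_sigmoid, the tanh activation, and the sizes T, D, M are assumptions made for the example, not something taken from the lecture.

```python
import math
import random

def simple_rnn(xs, Wx, Wh, b):
    """Run an Elman-style RNN over xs (T x D); return all T hidden states (T x M)."""
    M = len(b)
    h = [0.0] * M                      # initial hidden state
    states = []
    for x in xs:                       # one step per time index
        h = [
            math.tanh(
                sum(Wx[j][d] * x[d] for d in range(len(x)))
                + sum(Wh[j][k] * h[k] for k in range(M))
                + b[j]
            )
            for j in range(M)
        ]
        states.append(h)
    return states

def dense_sigmoid(h, w, c):
    """Single-output dense layer followed by a sigmoid (e.g. spam vs. not spam)."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + c
    return 1.0 / (1.0 + math.exp(-z))

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
T, D, M = 5, 3, 4                      # assumed sizes for illustration
xs = rand_matrix(T, D)                 # input sequence, T x D
Wx, Wh, b = rand_matrix(M, D), rand_matrix(M, M), [0.0] * M
w, c = [random.uniform(-1, 1) for _ in range(M)], 0.0

states = simple_rnn(xs, Wx, Wh, b)
prediction = dense_sigmoid(states[-1], w, c)   # many to one: keep only the final hidden state
```

Only states[-1] reaches the dense layer, so the whole sequence yields a single number between 0 and 1.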

34
00:02:34,380 --> 00:02:39,960
If we are doing a many to many task, then we keep all the hidden states, each of which contains only

35
00:02:39,960 --> 00:02:41,830
the information up to that point.

36
00:02:42,480 --> 00:02:48,420
In this case, we pass every hidden state through the final dense layer to get big T separate predictions,

37
00:02:48,690 --> 00:02:54,090
one for each time step. Note that the same dense layer is applied to all time steps,

38
00:02:54,330 --> 00:02:55,170
just as the same

39
00:02:55,170 --> 00:02:59,440
simple RNN unit is applied to all time steps of the input time series.

40
00:02:59,970 --> 00:03:04,440
This is another example of the concept of shared weights, which you may have seen when you studied

41
00:03:04,440 --> 00:03:05,310
CNNs.
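As a rough illustration of the shared dense layer in the many to many case, the sketch below applies one set of weights to every hidden state. The hidden state values, the weights, and the softmax output are made-up assumptions for the example.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dense(h, W, b):
    """Dense layer: maps a length-M vector to a length-K vector."""
    return [sum(W[k][j] * h[j] for j in range(len(h))) + b[k] for k in range(len(b))]

# Hypothetical hidden states for T = 3 time steps, each of size M = 2.
hidden_states = [[0.1, -0.4], [0.7, 0.2], [-0.3, 0.9]]

# One shared set of dense weights (K = 3 classes), reused at every time step.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b = [0.0, 0.0, 0.0]

# Many to many: one prediction per time step, so the output is T x K.
outputs = [softmax(dense(h, W, b)) for h in hidden_states]
```

The same W and b appear in every iteration, which is the weight sharing described above.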

42
00:03:10,020 --> 00:03:15,390
Now, there's one more option to consider in the many to one case. This will bring us back yet again

43
00:03:15,450 --> 00:03:16,950
to the concept of shapes.

44
00:03:17,340 --> 00:03:22,020
So hopefully you're beginning to see why thinking about shapes is such an important thing.

45
00:03:23,010 --> 00:03:26,010
So if you studied CNNs, this will make a lot of sense.

46
00:03:26,010 --> 00:03:29,470
But if not, that's OK too. Just try to think of the intuition.

47
00:03:30,480 --> 00:03:33,300
So suppose we're looking at 1-D convolutions.

48
00:03:33,720 --> 00:03:39,090
After doing a series of convolutions and poolings, you'll have a feature time series of size T

49
00:03:39,090 --> 00:03:42,230
by M where M represents the number of feature maps.

50
00:03:42,780 --> 00:03:47,220
But note something interesting, which is that RNNs give us the exact same shape.

51
00:03:47,790 --> 00:03:52,850
After going through an RNN with hidden output size M, we will have a T by M sequence.

52
00:03:53,310 --> 00:03:57,590
In this case, we say that we have T hidden state vectors, each of size M.

53
00:03:58,410 --> 00:04:05,370
So in both cases the output size is T by M. Just because they have different names does not mean

54
00:04:05,370 --> 00:04:06,660
they are different things.

55
00:04:06,990 --> 00:04:10,060
They are just different perspectives on the same kind of data.

56
00:04:10,650 --> 00:04:14,870
One perspective is convolutions, while the other perspective is RNNs.

57
00:04:15,210 --> 00:04:18,960
Now again, if you haven't seen convolutions before, please don't worry.

58
00:04:23,740 --> 00:04:29,110
Now, as you recall, before passing our data through the final dense layers, we need to obtain a single

59
00:04:29,110 --> 00:04:30,390
flat feature vector.

60
00:04:31,000 --> 00:04:34,800
One way of obtaining such a feature vector is to use global max pooling.

61
00:04:35,350 --> 00:04:41,060
What this does is take the maximum value over time, such that you end up with M different features.

62
00:04:41,530 --> 00:04:44,130
Put simply, we're getting rid of the time dimension.

63
00:04:44,770 --> 00:04:49,990
It makes sense to pick the maximum since we use that as a proxy for which value matters the most.

64
00:04:51,650 --> 00:04:55,440
Intuitively, you can think of this in terms of sentiment analysis.

65
00:04:55,970 --> 00:05:01,430
Suppose that we're looking at some movie review and the word terrible appears in the review, but not

66
00:05:01,430 --> 00:05:03,440
necessarily at the end of the sentence.

67
00:05:03,950 --> 00:05:08,840
Due to the vanishing gradient problem, the RNN might not be able to recognize the word terrible

68
00:05:09,050 --> 00:05:11,130
if it appears too far away from the end.

69
00:05:11,600 --> 00:05:16,880
However, by taking the maximum, we can look at all the hidden values from every time step, which

70
00:05:16,880 --> 00:05:21,440
lets us see more clearly which words in the sentence matter most for predicting the target.
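Global max pooling over time can be sketched in a few lines. The hidden state values below are invented for illustration, and global_max_pool is a hypothetical helper name.

```python
def global_max_pool(hidden_states):
    """Collapse a T x M sequence to a length-M vector by taking the max over time."""
    return [max(column) for column in zip(*hidden_states)]

# Hypothetical hidden states: T = 4 time steps, M = 3 hidden units.
hidden_states = [
    [0.1, -0.4, 0.3],
    [0.9, 0.2, -0.5],
    [-0.3, 0.6, 0.0],
    [0.2, 0.1, 0.1],
]

features = global_max_pool(hidden_states)
print(features)  # [0.9, 0.6, 0.3]
```

Each of the M features takes its value from whichever time step produced the largest activation, so an early time step can still contribute to the final feature vector.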

71
00:05:26,140 --> 00:05:30,700
So for the many to many case, let's consider what the shape of our output would be.

72
00:05:31,480 --> 00:05:37,180
Suppose that we have an input sequence of shape T by D. After the RNN layer, we will have a hidden

73
00:05:37,180 --> 00:05:39,250
state vector sequence of shape

74
00:05:39,250 --> 00:05:47,200
T by M. That is, every time step gets its own hidden state vector. After passing each of these hidden state

75
00:05:47,200 --> 00:05:52,900
vectors through one or more final dense layers with output shape K, we will have a sequence

76
00:05:52,900 --> 00:05:54,350
of shape T by K.

77
00:05:55,090 --> 00:05:59,710
So imagine that our task is to predict parts of speech of which there are eight kinds.

78
00:06:00,040 --> 00:06:01,810
In this case, K would be eight.

79
00:06:03,140 --> 00:06:09,530
OK, so you can see how this is analogous to ANNs. With ANNs, we have an input vector of size D,

80
00:06:09,710 --> 00:06:16,310
a hidden vector of size M, and an output vector of size K. With many to many RNNs, we have an input

81
00:06:16,310 --> 00:06:22,540
of size T by D, a hidden sequence of size T by M, and an output sequence of size T by K.

82
00:06:27,170 --> 00:06:33,230
Now, one question students always ask is: can you stack multiple RNN layers? And of course,

83
00:06:33,230 --> 00:06:35,520
this is deep learning, so the answer is yes.

84
00:06:36,170 --> 00:06:39,620
Remember that neural networks are essentially repeating structures.

85
00:06:39,920 --> 00:06:46,380
If the output of one RNN layer is T by M1, that's just another multivariate time series we can pass

86
00:06:46,380 --> 00:06:48,710
as input through another RNN layer.

87
00:06:49,430 --> 00:06:53,670
From this, we'll get another hidden state sequence of size T by M2.

88
00:06:54,590 --> 00:06:59,510
Of course, this is just another multivariate time series, which we can pass through yet another RNN

89
00:06:59,510 --> 00:07:01,100
layer, ad infinitum.

90
00:07:02,090 --> 00:07:05,290
Now, whether or not this actually helps remains to be seen.

91
00:07:05,810 --> 00:07:10,550
As always, it's just a matter of testing it on your data set to see how it performs.
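To make the stacking idea concrete, here is a minimal sketch that feeds the T by M1 output of one layer into a second layer. The rnn_layer helper, the random untrained weights, the omitted bias terms, and the chosen sizes are all assumptions for illustration.

```python
import math
import random

def rnn_layer(xs, M, rng):
    """Elman-style layer with random (untrained) weights and no bias: T x D_in -> T x M."""
    D = len(xs[0])
    Wx = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(M)]
    Wh = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(M)]
    h = [0.0] * M
    out = []
    for x in xs:
        h = [
            math.tanh(
                sum(a * v for a, v in zip(Wx[j], x))
                + sum(a * v for a, v in zip(Wh[j], h))
            )
            for j in range(M)
        ]
        out.append(h)
    return out

rng = random.Random(0)
T, D, M1, M2 = 6, 3, 5, 4              # assumed sizes
xs = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]

h1 = rnn_layer(xs, M1, rng)            # T x M1: itself a multivariate time series
h2 = rnn_layer(h1, M2, rng)            # T x M2: second layer consumes the first layer's states

print(len(h1), len(h1[0]))             # 6 5
print(len(h2), len(h2[0]))             # 6 4
```

The sequence length T is unchanged from layer to layer; only the feature dimension changes, from D to M1 to M2.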

92
00:07:15,260 --> 00:07:19,250
Here's another way of looking at a neural network with multiple RNN layers.

93
00:07:20,120 --> 00:07:25,280
Now, this is a good time to mention that a pretty common mistake amongst beginners is that they confuse

94
00:07:25,280 --> 00:07:29,500
T, which is the sequence length with M, which is the number of hidden units.

95
00:07:30,830 --> 00:07:35,930
Basically, it's important to remember that these neural networks are much larger than ANNs, so we

96
00:07:35,930 --> 00:07:37,970
can't draw them with the same granularity.

97
00:07:38,780 --> 00:07:43,370
Previously, when you saw a circle in a diagram for an ANN, you could think of it like a single

98
00:07:43,370 --> 00:07:43,890
number.

99
00:07:44,330 --> 00:07:49,720
But for these RNN diagrams, notice that each circle does not represent a number, but a whole RNN

100
00:07:49,720 --> 00:07:51,760
unit, which outputs a vector.

101
00:07:53,120 --> 00:07:58,370
Another point to notice is that although it's possible to make it so that each RNN layer has a different

102
00:07:58,370 --> 00:08:01,600
number of hidden units, it would be unusual to do so.

103
00:08:02,120 --> 00:08:06,320
In other words, it's typically the case that M1 is equal to M2.

104
00:08:11,010 --> 00:08:15,960
One important fact I want to mention is that the architectures you learned about in this lecture apply

105
00:08:15,960 --> 00:08:17,790
to all kinds of RNN units.

106
00:08:18,570 --> 00:08:24,390
At this point, we've studied the Elman unit, and later on we will study the GRU and the LSTM.

107
00:08:25,800 --> 00:08:31,230
However, although the units themselves are different, the way they are incorporated into RNNs remains

108
00:08:31,230 --> 00:08:31,860
the same.

109
00:08:32,700 --> 00:08:37,920
So when it comes to using RNNs with TensorFlow, although what you are using might be more complex,

110
00:08:38,250 --> 00:08:41,380
the code just amounts to changing what kind of object you use.

111
00:08:41,760 --> 00:08:45,420
In other words, the interface to all these RNN units is the same.

112
00:08:50,200 --> 00:08:55,510
OK, so to summarize this lecture, we've just learned how to build RNN architectures for two different

113
00:08:55,510 --> 00:08:56,210
tasks.

114
00:08:56,770 --> 00:08:59,460
The first kind of task is the many to one task.

115
00:08:59,980 --> 00:09:04,030
The second kind of task is the many to many task. In the many

116
00:09:04,030 --> 00:09:06,490
to one case, we learned two different options.

117
00:09:06,940 --> 00:09:12,430
The first option is to simply take the final hidden state vector, h of big T, and pass this through

118
00:09:12,430 --> 00:09:13,720
the final dense layers.

119
00:09:14,350 --> 00:09:20,680
The second option is to use global max pooling instead. For many to many tasks, we have to keep all

120
00:09:20,680 --> 00:09:22,810
big T hidden state vectors.

121
00:09:23,500 --> 00:09:29,710
We pass each of these through the same final dense layers, and our output has the shape T by K, where K is

122
00:09:29,710 --> 00:09:31,510
the number of outputs for each target.

123
00:09:32,740 --> 00:09:37,270
So for example, if you're doing spam or not spam, then K would just be one or two.
