1
00:00:11,100 --> 00:00:15,270
OK, so we've just learned about the simple RNN, also called the Elman unit.

2
00:00:15,870 --> 00:00:21,670
Now the next obvious question is: how do we use this to solve actual problems? At this point,

3
00:00:21,690 --> 00:00:24,780
I'm going to introduce you to two kinds of problems we can solve.

4
00:00:25,350 --> 00:00:27,360
One is called the many to one task.

5
00:00:27,870 --> 00:00:31,320
That means you have a whole sequence of inputs, but only one output.

6
00:00:32,040 --> 00:00:36,060
Some examples of this are spam detection and sentiment analysis.

7
00:00:36,690 --> 00:00:42,060
So for spam, we're going to take in a whole sequence of words, but the target is just a single class,

8
00:00:42,060 --> 00:00:43,500
either spam or not spam.

9
00:00:44,160 --> 00:00:46,560
For sentiment analysis, we have the same idea.

10
00:00:46,950 --> 00:00:50,310
Our input is a whole sequence of words like a movie review.

11
00:00:50,580 --> 00:00:54,150
But the target is just whether or not it was positive or negative.

12
00:00:55,380 --> 00:00:58,620
The second kind of task is called a many to many task.

13
00:00:59,070 --> 00:01:02,700
This is when we have a sequence of inputs and a sequence of outputs.

14
00:01:03,270 --> 00:01:08,370
Some examples of this are part-of-speech tagging and anomaly detection in a time series.

15
00:01:09,270 --> 00:01:13,560
So as you recall, parts of speech are labels like noun, verb and so forth.

16
00:01:14,100 --> 00:01:19,860
Clearly, these need to be applied to each word, and thus every word in the input will have a corresponding

17
00:01:19,860 --> 00:01:22,260
target. For anomaly detection,

18
00:01:22,290 --> 00:01:26,850
you can imagine that any point in the time series could be classified as anomalous.

19
00:01:27,270 --> 00:01:33,810
For example, you might want to monitor CPU usage on a cluster of machines. Every point in the time series

20
00:01:33,810 --> 00:01:36,030
would be labeled as anomalous or not.

21
00:01:40,860 --> 00:01:44,310
OK, so what's the point of discussing these different kinds of problems?

22
00:01:45,180 --> 00:01:48,030
Well, this will determine the architecture of the RNN.

23
00:01:49,140 --> 00:01:53,910
Conceptually, our neural network will end with a final dense layer, as it always does.

24
00:01:54,450 --> 00:01:59,310
As usual, it will have a final activation, which is appropriate for the task at hand.

25
00:01:59,880 --> 00:02:04,530
However, recognize that there are multiple ways to connect this final dense layer.

26
00:02:05,460 --> 00:02:09,419
As you recall, the RNN outputs a hidden state for every time step.

27
00:02:09,780 --> 00:02:12,690
We have h_1, h_2, all the way up to h_T.

28
00:02:13,350 --> 00:02:16,500
The question is what do we do with all these hidden states?

29
00:02:17,100 --> 00:02:19,830
The answer depends on which kind of task we are doing.

30
00:02:21,150 --> 00:02:26,610
If we are doing a many to one task, then we only keep the final hidden state, which contains all the

31
00:02:26,610 --> 00:02:29,070
information from the entire time series.

32
00:02:29,490 --> 00:02:33,120
We then pass this through a final dense layer to get a single prediction.

33
00:02:34,410 --> 00:02:39,930
If we are doing a many to many task, then we keep all the hidden states, each of which contains only

34
00:02:39,930 --> 00:02:41,820
the information up to that point.

35
00:02:42,510 --> 00:02:47,640
In this case, we pass every hidden state through the final dense layer to get big T separate

36
00:02:47,640 --> 00:02:54,060
predictions, one for each time step. And note that the same dense layer is applied to all time steps,

37
00:02:54,300 --> 00:02:59,430
just as the same simple RNN is applied to all time steps of the input time series.

38
00:02:59,940 --> 00:03:04,410
This is another example of the concept of shared weights, which you may have seen when you studied

39
00:03:04,410 --> 00:03:05,280
CNNs.
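
To make the two ways of connecting the final dense layer concrete, here is a minimal sketch in plain Python. It is not the lecture's own code: the sizes T, D, M and K, the random weights, and the helper names rand_matrix, matvec, simple_rnn and dense are all assumptions chosen for illustration, and the final activation is left out.

```python
import math
import random

random.seed(0)

T, D, M, K = 5, 3, 4, 2  # assumed sizes: sequence length, input, hidden, output


def rand_matrix(rows, cols):
    """Small random weights; a stand-in for trained parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, v):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def simple_rnn(xs, Wx, Wh, b):
    """Elman recurrence h_t = tanh(Wx x_t + Wh h_{t-1} + b); returns all T hidden states."""
    h = [0.0] * len(b)
    states = []
    for x in xs:
        pre = [a + c + d for a, c, d in zip(matvec(Wx, x), matvec(Wh, h), b)]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states


def dense(h, Wo, bo):
    """Final dense layer (activation omitted for brevity)."""
    return [a + c for a, c in zip(matvec(Wo, h), bo)]


xs = rand_matrix(T, D)  # toy input sequence, shape T x D
Wx, Wh, b = rand_matrix(M, D), rand_matrix(M, M), [0.0] * M
Wo, bo = rand_matrix(K, M), [0.0] * K

hidden = simple_rnn(xs, Wx, Wh, b)  # shape T x M

# Many to one: only the final hidden state h_T goes through the dense layer.
many_to_one = dense(hidden[-1], Wo, bo)  # shape K

# Many to many: the same dense weights are applied at every time step.
many_to_many = [dense(h, Wo, bo) for h in hidden]  # shape T x K

print(len(hidden), len(hidden[0]))              # 5 4
print(len(many_to_one))                         # 2
print(len(many_to_many), len(many_to_many[0]))  # 5 2
```

Because the dense weights are shared, the last row of the many to many output is exactly the many to one prediction in this sketch.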

40
00:03:10,020 --> 00:03:15,390
Now, there's one more option to consider in the many to one case. This will bring us back yet again

41
00:03:15,420 --> 00:03:16,950
to the concept of shapes.

42
00:03:17,340 --> 00:03:22,020
So hopefully you're beginning to see why thinking about shapes is such an important thing.

43
00:03:23,010 --> 00:03:26,000
So if you studied CNNs, this will make a lot of sense.

44
00:03:26,010 --> 00:03:27,690
But if not, that's OK, too.

45
00:03:27,720 --> 00:03:29,460
Just try to think of the intuition.

46
00:03:30,510 --> 00:03:33,270
So suppose we're looking at 1-D convolutions.

47
00:03:33,750 --> 00:03:39,060
After doing a series of convolutions and pooling, you'll have a time series of features of size T

48
00:03:39,060 --> 00:03:42,210
by M where M represents the number of feature maps.

49
00:03:42,780 --> 00:03:47,160
But note something interesting, which is that RNNs give us the exact same shape.

50
00:03:47,820 --> 00:03:52,800
After going through an RNN with hidden output size M, we will have a T by M sequence.

51
00:03:53,340 --> 00:03:57,930
In this case, we say that we have T hidden state vectors, each of size M.

52
00:03:58,410 --> 00:04:05,370
So in both cases, the output size is T by M. But just because they have different names does not mean

53
00:04:05,370 --> 00:04:06,630
they are different things.

54
00:04:06,990 --> 00:04:09,990
They are just different perspectives on the same kind of data.

55
00:04:10,620 --> 00:04:14,840
One perspective is convolutions, while the other perspective is RNNs.

56
00:04:15,210 --> 00:04:18,930
Now again, if you haven't seen convolutions before, please don't worry.

57
00:04:23,740 --> 00:04:29,080
Now, as you recall, before passing our data through the final dense layers, we need to obtain a single

58
00:04:29,080 --> 00:04:30,380
flat feature vector.

59
00:04:30,970 --> 00:04:34,780
One way of obtaining such a feature vector is to use global max pooling.

60
00:04:35,350 --> 00:04:41,050
What this does is it takes the maximum value over time, such that you end up with M different features.

61
00:04:41,530 --> 00:04:44,110
Put simply, we're getting rid of the time dimension.

62
00:04:44,800 --> 00:04:49,960
It makes sense to pick the maximum, since we use that as a proxy for which value matters the most.

63
00:04:51,680 --> 00:04:55,430
Intuitively, you can think of this in terms of sentiment analysis.

64
00:04:55,970 --> 00:05:01,400
Suppose that we're looking at some movie review and the word terrible appears in the review, but not

65
00:05:01,400 --> 00:05:06,800
necessarily at the end of the sentence. Due to the vanishing gradient problem, the RNN might not

66
00:05:06,800 --> 00:05:11,120
be able to recognise the word terrible if it appears too far away from the end.

67
00:05:11,630 --> 00:05:16,880
However, by taking the maximum, we can look at all the hidden values from every time step, which

68
00:05:16,880 --> 00:05:21,440
lets us see more clearly which words in the sentence matter most for predicting the target.
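
As a rough illustration of global max pooling over time, here is a small sketch in plain Python. The hidden state values are made up, and the helper name global_max_pool is an assumption for this example, not a library function.

```python
# Toy hidden states, shape T x M (T = 4 time steps, M = 3 hidden units).
hidden = [
    [0.1, -0.2, 0.3],
    [0.9, 0.1, 0.0],   # e.g. the step where a strong word such as "terrible" appears
    [0.2, 0.4, -0.1],
    [0.0, 0.1, 0.2],
]


def global_max_pool(states):
    """Take the max over the time axis for each hidden unit: T x M -> M."""
    return [max(column) for column in zip(*states)]


pooled = global_max_pool(hidden)  # one value per hidden unit, time dimension removed
last = hidden[-1]                 # what the "final hidden state" option would keep

print(pooled)  # [0.9, 0.4, 0.3]
print(last)    # [0.0, 0.1, 0.2]
```

In this toy case the large value from the second time step survives pooling, even though it has faded from the final hidden state.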

69
00:05:26,170 --> 00:05:30,610
So for the many to many case, let's consider what the shape of our output would be.

70
00:05:31,480 --> 00:05:34,720
Suppose that we have an input sequence of shape T by D.

71
00:05:35,200 --> 00:05:40,090
After the RNN layer, we will have a hidden state vector sequence of shape T by M.

72
00:05:40,930 --> 00:05:44,320
That is, every time step gets its own hidden state vector.

73
00:05:45,370 --> 00:05:50,410
After passing each of these hidden state vectors through one or more final dense layers, each with

74
00:05:50,410 --> 00:05:57,190
output shape K, we will have a sequence of shape T by K. So imagine that our task is to predict parts

75
00:05:57,190 --> 00:05:59,680
of speech of which there are eight kinds.

76
00:06:00,070 --> 00:06:01,780
In this case, K would be eight.

77
00:06:03,170 --> 00:06:09,050
OK, so you can see how this is analogous to ANNs. With ANNs, we have an input vector of size

78
00:06:09,050 --> 00:06:15,560
D, a hidden vector of size M, and an output vector of size K. With many to many RNNs,

79
00:06:15,560 --> 00:06:17,660
we have an input of size T by D,

80
00:06:17,810 --> 00:06:22,520
a hidden sequence of size T by M, and an output sequence of size T by K.

81
00:06:27,260 --> 00:06:32,120
Now, one question students always ask is, can you stack multiple RNN layers?

82
00:06:32,570 --> 00:06:35,480
And of course, this is deep learning, so the answer is yes.

83
00:06:36,170 --> 00:06:41,630
Remember that neural networks are essentially repeating structures. If the output of one RNN layer

84
00:06:41,630 --> 00:06:42,920
is T by M1,

85
00:06:43,220 --> 00:06:45,530
that's just another multivariate time series

86
00:06:45,680 --> 00:06:48,710
we can pass as input through another RNN layer.

87
00:06:49,430 --> 00:06:53,690
From this, we'll get another hidden state sequence of size T by M2.

88
00:06:54,650 --> 00:06:59,540
Of course, this is just another multivariate time series which we can pass through yet another RNN

89
00:06:59,550 --> 00:07:01,100
layer ad infinitum.
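
Stacking can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the sizes T, D, M1 and M2 are arbitrary, the weights are random rather than trained, biases are left out, and the helper names rand_matrix and simple_rnn_layer are made up for this example.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    """Small random weights; a stand-in for trained parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def simple_rnn_layer(xs, n_hidden):
    """Elman-style layer with fresh random weights (no bias); returns all T hidden states."""
    n_in = len(xs[0])
    Wx, Wh = rand_matrix(n_hidden, n_in), rand_matrix(n_hidden, n_hidden)
    h = [0.0] * n_hidden
    states = []
    for x in xs:
        # The comprehension reads the previous h before h is reassigned.
        h = [
            math.tanh(sum(w * v for w, v in zip(Wx[j], x))
                      + sum(w * v for w, v in zip(Wh[j], h)))
            for j in range(n_hidden)
        ]
        states.append(h)
    return states


T, D, M1, M2 = 6, 3, 5, 4   # assumed sizes
xs = rand_matrix(T, D)      # toy input, shape T x D

h1 = simple_rnn_layer(xs, M1)  # first layer output: T x M1
h2 = simple_rnn_layer(h1, M2)  # second layer treats h1 as its input series: T x M2

print(len(h1), len(h1[0]))  # 6 5
print(len(h2), len(h2[0]))  # 6 4
```

The sequence length T is unchanged by each layer; only the feature dimension changes, from D to M1 to M2.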

90
00:07:02,090 --> 00:07:05,250
Now, whether or not this actually helps remains to be seen.

91
00:07:05,810 --> 00:07:10,520
As always, it's just a matter of testing it on your data set to see how it performs.

92
00:07:15,260 --> 00:07:19,190
Here's another way of looking at a neural network with multiple RNN layers.

93
00:07:20,120 --> 00:07:25,250
Now this is a good time to mention that a pretty common mistake amongst beginners is that they confuse

94
00:07:25,250 --> 00:07:29,480
T, which is the sequence length with M, which is the number of hidden units.

95
00:07:30,860 --> 00:07:35,900
Basically, it's important to remember that these neural networks are much larger than ANNs, so we

96
00:07:35,900 --> 00:07:42,020
can't draw them with the same granularity. Previously, when you saw a circle in a diagram for an ANN,

97
00:07:42,050 --> 00:07:43,880
you could think of it like a single number.

98
00:07:44,330 --> 00:07:46,040
But for these RNN diagrams,

99
00:07:46,070 --> 00:07:51,710
notice that each circle does not represent a number, but a whole RNN unit, which outputs a vector.

100
00:07:53,120 --> 00:07:58,340
Another point to notice is that although it's possible to make it so that each RNN layer has a different

101
00:07:58,340 --> 00:08:01,580
number of hidden units, it would be unusual to do so.

102
00:08:02,120 --> 00:08:06,320
In other words, it's typically the case that M1 is equal to M2.

103
00:08:10,980 --> 00:08:15,930
One important fact I want to mention is that the architectures you learned about in this lecture apply

104
00:08:15,930 --> 00:08:17,730
to all kinds of RNN units.

105
00:08:18,570 --> 00:08:24,360
At this point, we've studied the Elman unit. Later on, we will study the GRU and the LSTM.

106
00:08:25,800 --> 00:08:30,660
However, although the units themselves are different, the way they are incorporated into RNNs

107
00:08:30,810 --> 00:08:31,830
remains the same.

108
00:08:32,700 --> 00:08:37,890
So when it comes to using RNNs with TensorFlow, although what you are using might be more complex,

109
00:08:38,250 --> 00:08:41,340
the code just amounts to changing what kind of object you use.

110
00:08:41,760 --> 00:08:45,420
In other words, the interface to all these RNN units is the same.

111
00:08:50,260 --> 00:08:55,480
OK, so to summarize this lecture, we've just learned how to build RNN architectures for two different

112
00:08:55,480 --> 00:08:56,200
tasks.

113
00:08:56,740 --> 00:08:59,440
The first kind of task is the many to one task.

114
00:08:59,950 --> 00:09:02,980
The second kind of task is the many to many task.

115
00:09:03,490 --> 00:09:06,490
In the many to one case, we learned two different options.

116
00:09:06,940 --> 00:09:12,430
The first option is to simply take the final hidden state vector, h_T, and pass this through

117
00:09:12,430 --> 00:09:13,690
the final dense layers.

118
00:09:14,350 --> 00:09:17,170
The second option is to use global max pooling instead.

119
00:09:18,100 --> 00:09:22,750
For many to many tasks, we have to keep all big T hidden state vectors.

120
00:09:23,470 --> 00:09:28,900
We pass each of these through the same final dense layers, and our output has the shape T by K,

121
00:09:29,110 --> 00:09:31,510
where K is the number of outputs for each target.

122
00:09:32,770 --> 00:09:37,270
So for example, if you're doing spam or not spam, then K would just be one or two.

