1
00:00:11,760 --> 00:00:16,980
In this lecture, we are going to discuss how we can use convolutional neural networks for sequences,

2
00:00:17,040 --> 00:00:23,100
specifically text, although the type of CNN we are about to discuss will work just as well for generic

3
00:00:23,100 --> 00:00:23,980
sequences too.

4
00:00:24,840 --> 00:00:28,580
You might be wondering, how can this be if CNNs are for images?

5
00:00:28,650 --> 00:00:32,190
How is it possible for CNNs to also work with sequences?

6
00:00:37,330 --> 00:00:39,850
First, let's recall the basics of convolution.

7
00:00:40,510 --> 00:00:44,290
The idea is this: you have an image, which is the big square.

8
00:00:44,620 --> 00:00:49,930
You have a filter, which is the small square, and you're going to slide the filter along each possible

9
00:00:49,930 --> 00:00:51,520
position in the big square.

10
00:00:51,910 --> 00:00:57,370
And at each point, you're going to multiply all the overlapping values and add them together just like

11
00:00:57,370 --> 00:00:58,240
a dot product.

12
00:00:58,990 --> 00:01:04,330
Now, I know this is obvious, but it's worth taking note of: an image has two dimensions, height and

13
00:01:04,330 --> 00:01:04,810
width.

14
00:01:05,590 --> 00:01:07,120
There is also the feature dimension.

15
00:01:07,300 --> 00:01:09,970
But that's not an actual dimension which has correlation.

16
00:01:10,660 --> 00:01:12,460
And recall what I mean by correlation.

17
00:01:13,030 --> 00:01:19,090
If you have a picture of a red car and you find a pixel in the image which is red, probably the neighboring

18
00:01:19,090 --> 00:01:20,380
pixels are also red.

19
00:01:21,070 --> 00:01:25,120
In other words, pixels beside each other likely have similar values.

20
00:01:30,220 --> 00:01:35,680
Now, think of a sequence. Unlike an image, a sequence has just one dimension that is not a feature:

21
00:01:35,980 --> 00:01:36,490
time.

22
00:01:37,210 --> 00:01:38,860
So it's time instead of space.

23
00:01:39,490 --> 00:01:41,140
But notice one important detail.

24
00:01:41,710 --> 00:01:46,080
We have the same type of correlation in sequences: data which are near

25
00:01:46,090 --> 00:01:49,160
other data in time are also close in value.

26
00:01:50,020 --> 00:01:55,960
That's why this appears like a smooth curve rather than just noise jumping around with random, uncorrelated

27
00:01:55,960 --> 00:01:56,650
values.

28
00:01:57,370 --> 00:02:00,220
This suggests that convolution might be useful here as well.

29
00:02:05,370 --> 00:02:10,830
Luckily, convolution in one dimension is actually much simpler than convolution in two dimensions.

30
00:02:11,430 --> 00:02:12,720
Instead of a big square,

31
00:02:13,050 --> 00:02:16,170
we just have a big line, and the filter is a smaller line.

32
00:02:17,010 --> 00:02:20,640
Then we slide the small line across every position in the big line.

33
00:02:21,150 --> 00:02:24,130
We multiply all the overlapping values and add them together.

34
00:02:29,320 --> 00:02:32,740
Let's do a simple example. For the sequence data, we have

35
00:02:32,830 --> 00:02:34,240
1, 2, 3, 2, 1.

36
00:02:34,510 --> 00:02:39,340
And for the filter, we have simply +1, -1. As an exercise,

37
00:02:39,400 --> 00:02:43,600
you might want to try to figure out the answer first before we move on to the solution.

38
00:02:48,770 --> 00:02:52,460
OK, so here's what we get. At the first position, we have

39
00:02:52,520 --> 00:02:56,130
one times one plus two times minus one, which is minus one.

40
00:03:01,300 --> 00:03:07,210
At the second position, we get two times one plus three times minus one, which is minus one.

41
00:03:12,440 --> 00:03:17,960
At the third position, we get three times one plus two times minus one, which is one.

42
00:03:23,090 --> 00:03:29,030
And at the fourth and final position, we get two times one plus one times minus one, which is one.

43
00:03:29,730 --> 00:03:32,250
So essentially exactly what we would expect.
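
The worked example above can be checked with a few lines of plain Python. This is only a minimal sketch: the function name `cross_correlate_1d` is made up for illustration, and it only evaluates positions where the filter fully overlaps the sequence (no padding).

```python
def cross_correlate_1d(sequence, kernel):
    """Slide `kernel` along `sequence`; at each position, multiply the
    overlapping values and add them together (a dot product)."""
    n_positions = len(sequence) - len(kernel) + 1
    return [
        sum(sequence[t + tau] * kernel[tau] for tau in range(len(kernel)))
        for t in range(n_positions)
    ]


sequence = [1, 2, 3, 2, 1]
kernel = [1, -1]
result = cross_correlate_1d(sequence, kernel)
print(result)  # [-1, -1, 1, 1]
```

A length 5 sequence and a length 2 filter give four valid positions, matching the four steps of the example.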

44
00:03:37,450 --> 00:03:41,650
As with images, it's possible to express this operation as an equation.

45
00:03:43,570 --> 00:03:49,370
And remember that in deep learning, while we call this convolution, pure mathematicians and statisticians

46
00:03:49,370 --> 00:03:51,080
would call this cross-correlation.

47
00:03:51,710 --> 00:03:55,830
So just keep in mind the sign is reversed. As an exercise,

48
00:03:55,880 --> 00:04:01,100
you might want to try and confirm to yourself that this equation does, in fact, implement the operation

49
00:04:01,130 --> 00:04:03,500
we just performed in the previous example.
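
The slide with the equation is not part of this transcript, so the following is a reconstruction under assumptions rather than the lecture's own notation. Here x is the input sequence of length T, w is the filter of length K (K is a symbol introduced for this sketch), and y is the output:

```latex
% Deep-learning "convolution" (cross-correlation): the filter index is added.
y(t) = \sum_{\tau=0}^{K-1} x(t + \tau)\, w(\tau), \qquad t = 0, \dots, T - K

% Convolution in the signal-processing sense: the sign on \tau is reversed.
y(t) = \sum_{\tau=0}^{K-1} x(t - \tau)\, w(\tau)
```

With x = (1, 2, 3, 2, 1) and w = (1, -1), the first form gives y(0) = 1 - 2 = -1, which agrees with the first step of the previous example.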

50
00:04:08,600 --> 00:04:11,960
And remember that another thing we can do is add on features.

51
00:04:12,530 --> 00:04:18,770
So imagine X as a T by D array, where T is the number of time steps and D the number of input features.

52
00:04:19,610 --> 00:04:25,850
Then imagine the output as another two-dimensional array of size T by M, where T is the number of time

53
00:04:25,850 --> 00:04:28,370
steps and M is the number of output features.

54
00:04:29,690 --> 00:04:33,890
Then W must be a three-dimensional array, T by D by M.

55
00:04:34,250 --> 00:04:36,460
And then we would have the equation that we see here.
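
The equation on the slide is not available here, so this is a hedged sketch of what a multi-feature 1-D convolution computes, written with plain nested lists. The name `conv1d_multi` is hypothetical. Two assumptions differ from the spoken description: the first axis of W indexes the filter's time offsets (called K here), and only fully overlapping positions are computed, so the output has T - K + 1 rows rather than T.

```python
def conv1d_multi(x, w):
    """x: T x D nested list; w: K x D x M nested list.
    Returns a (T - K + 1) x M nested list where
    y[t][m] = sum over tau and d of x[t + tau][d] * w[tau][d][m]."""
    T, D = len(x), len(x[0])
    K, M = len(w), len(w[0][0])
    return [
        [
            sum(x[t + tau][d] * w[tau][d][m] for tau in range(K) for d in range(D))
            for m in range(M)
        ]
        for t in range(T - K + 1)
    ]


# T = 4 time steps, D = 2 input features
x = [[1, 0], [2, 1], [3, 0], [2, 1]]
# K = 2 filter taps, D = 2 input features, M = 1 output feature
w = [[[1], [0]], [[-1], [1]]]
y = conv1d_multi(x, w)
print(y)  # [[0], [-1], [2]]
```

Each output value mixes all D input features across K neighboring time steps, which is why W needs three axes.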

56
00:04:38,680 --> 00:04:43,240
This is just like convolution with images. For a two-dimensional convolution,

57
00:04:43,330 --> 00:04:49,030
we have two spatial dimensions, plus one dimension for the input features plus one dimension for the

58
00:04:49,030 --> 00:04:49,930
output features.

59
00:04:50,350 --> 00:04:54,700
So that's four dimensions in total. For a one-dimensional convolution,

60
00:04:55,120 --> 00:05:00,460
we have one time dimension plus one dimension for the input features plus one dimension for the output

61
00:05:00,460 --> 00:05:01,030
features.

62
00:05:01,330 --> 00:05:03,220
So that's three dimensions in total.

63
00:05:08,420 --> 00:05:12,280
As usual, while convolution might seem like an abstract concept,

64
00:05:12,650 --> 00:05:17,270
remember that there are multiple convenient perspectives on it that make it seem like things we are

65
00:05:17,390 --> 00:05:18,740
already familiar with.

66
00:05:19,520 --> 00:05:22,610
So one perspective is that it's just matrix multiplication,

67
00:05:23,030 --> 00:05:25,070
just like a regular feedforward neural network,

68
00:05:25,370 --> 00:05:31,160
except that we have shared weights in order to take advantage of the special structure and correlation

69
00:05:31,160 --> 00:05:31,760
in the data.

70
00:05:34,020 --> 00:05:39,420
Another intuitive perspective is that it's just a sliding dot product, and the dot product is just a

71
00:05:39,420 --> 00:05:40,560
correlation finder.

72
00:05:41,940 --> 00:05:45,390
Correlation is just another name for pattern matching or similarity.

73
00:05:45,990 --> 00:05:51,030
So really what we are doing is asking, is this part of the sequence similar to my filter?

74
00:05:51,660 --> 00:05:55,170
And thus the filter becomes a pattern matcher or a pattern finder.
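
To make the pattern-finder idea concrete, here is a small sketch with made-up data. The sequence uses only +1 and -1 values so that the dot product is largest exactly where the window equals the filter; with arbitrary magnitudes, large values can also produce large responses. The name `sliding_dot` is hypothetical.

```python
def sliding_dot(sequence, pattern):
    """Dot product of `pattern` with every full-length window of `sequence`."""
    k = len(pattern)
    return [
        sum(s * p for s, p in zip(sequence[t:t + k], pattern))
        for t in range(len(sequence) - k + 1)
    ]


sequence = [-1, 1, 1, -1, 1, -1, -1, 1]
pattern = [1, -1, 1]
scores = sliding_dot(sequence, pattern)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores)  # [-1, -1, 3, -3, 1, 1]
print(best)    # 2, the position where the window equals the pattern
```

The most negative score, at position 3, marks the window that is the exact opposite of the filter.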

75
00:05:56,190 --> 00:06:00,270
All these concepts that you learned about convolution before still apply here.

76
00:06:05,510 --> 00:06:10,430
All right, so now that you know how convolution works with sequences, how do we apply this to text?

77
00:06:11,210 --> 00:06:17,510
Well, luckily, when we use embeddings, that already gives us exactly what we need for a one-dimensional

78
00:06:17,510 --> 00:06:18,230
convolution.

79
00:06:18,290 --> 00:06:24,080
We need an input, which is a T by D sequence, where T is the number of time steps and D is the number

80
00:06:24,080 --> 00:06:24,860
of features.

81
00:06:25,400 --> 00:06:30,700
And of course, this is exactly what we get after we pass our sentence through an embedding layer.

82
00:06:31,730 --> 00:06:38,570
We go from a length T sequence of words to a length T sequence of integers to a length T sequence of

83
00:06:38,570 --> 00:06:39,890
D length vectors.

84
00:06:40,580 --> 00:06:45,320
Since we have T vectors, each of length D, this makes up a T by D matrix

85
00:06:45,350 --> 00:06:51,650
when you stack them all together, and thus we have exactly what we need to build a CNN for text.
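
The words-to-integers-to-vectors chain can be sketched without any library. The vocabulary, the embedding values, and the sentence below are all invented for illustration; a real embedding layer would learn its table during training.

```python
# Hypothetical vocabulary mapping each word to an integer index.
word_to_index = {"the": 0, "movie": 1, "was": 2, "great": 3}

# Hypothetical embedding table: one vector of length D = 3 per vocabulary entry.
embedding_table = [
    [0.1, 0.0, 0.2],
    [0.5, 0.3, 0.1],
    [0.0, 0.2, 0.0],
    [0.9, 0.7, 0.4],
]

sentence = ["movie", "was", "great"]                # length T sequence of words
indices = [word_to_index[w] for w in sentence]      # length T sequence of integers
vectors = [embedding_table[i] for i in indices]     # T vectors, each of length D

T, D = len(vectors), len(vectors[0])
print(indices)  # [1, 2, 3]
print(T, D)     # 3 3
```

Stacking the T vectors row by row gives the T by D array that a one-dimensional convolution takes as input.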

86
00:06:56,880 --> 00:06:58,380
So here's how it all looks in code.

87
00:06:59,070 --> 00:07:00,150
First, we have our input.

88
00:07:00,330 --> 00:07:02,970
That's a length T sequence of word indexes.

89
00:07:03,600 --> 00:07:04,920
Then we have an embedding layer.

90
00:07:05,490 --> 00:07:09,030
The output of that is a T by D sequence of word vectors.

91
00:07:09,960 --> 00:07:11,610
Then we have a 1-D convolution.

92
00:07:11,700 --> 00:07:13,630
This is just the Conv1d class.

93
00:07:14,580 --> 00:07:20,190
Then we follow the same typical CNN architecture where we have convolution followed by pooling and so

94
00:07:20,190 --> 00:07:20,670
forth.

95
00:07:21,270 --> 00:07:24,390
So the same pattern applies here that we had for images.

96
00:07:25,230 --> 00:07:30,480
Generally speaking, the data shrinks in the time dimension, but grows in the feature dimension.

97
00:07:31,260 --> 00:07:34,500
So that's why you see the number of feature maps getting larger and larger.

98
00:07:35,400 --> 00:07:40,470
Once we've done that, then we can do a flatten or we can do a global max pooling, which will give

99
00:07:40,470 --> 00:07:42,750
us a single vector of size M3.

100
00:07:45,310 --> 00:07:50,620
Finally, we pass this through one or more dense layers to get a single scalar, assuming we are doing

101
00:07:50,620 --> 00:07:52,030
binary classification.

102
00:07:53,540 --> 00:07:59,570
As you can see, this is no different from a CNN meant for images, except that all the convolutions

103
00:07:59,570 --> 00:08:02,660
and pooling are one-dimensional instead of two-dimensional.

104
00:08:07,670 --> 00:08:10,350
There is one small caveat to what we just discussed.

105
00:08:10,620 --> 00:08:16,140
But it's not obvious until we try to implement the forward function. Inside the constructor,

106
00:08:16,200 --> 00:08:19,950
there's no real problem because all we're doing is instantiating some objects.

107
00:08:20,430 --> 00:08:26,520
But remember that in PyTorch, the general convention is that for convolution, features come first.

108
00:08:27,150 --> 00:08:31,350
We saw this earlier in the context of two dimensional convolution on images.

109
00:08:31,990 --> 00:08:38,700
PyTorch convolution works on images of size N by C by H by W, which means that color comes

110
00:08:38,700 --> 00:08:40,350
before the spatial dimensions.

111
00:08:40,710 --> 00:08:42,630
In other words, a feature comes first.

112
00:08:43,320 --> 00:08:48,270
However, TensorFlow, OpenCV, and other libraries use the convention

113
00:08:48,510 --> 00:08:49,080
N by H

114
00:08:49,080 --> 00:08:50,160
by W by C,

115
00:08:50,550 --> 00:08:52,830
which means that the feature dimension comes last.

116
00:08:53,580 --> 00:08:59,730
Torchvision in some sense accepts that this is the convention and makes its interface such that you

117
00:08:59,730 --> 00:09:05,190
can work with N by H by W by C all throughout your code and never have to worry about the fact that

118
00:09:05,190 --> 00:09:07,020
in PyTorch, things are backwards.

119
00:09:07,740 --> 00:09:11,580
This is because you don't ever interact with the data explicitly. Torchvision

120
00:09:11,610 --> 00:09:12,750
does all that work for you.

121
00:09:14,340 --> 00:09:16,650
But now we have this mixture of conventions.

122
00:09:17,250 --> 00:09:19,290
The output of an embedding is N by T

123
00:09:19,290 --> 00:09:19,760
by D,

124
00:09:20,130 --> 00:09:22,860
where T is the sequence length and D the feature dimension.

125
00:09:23,370 --> 00:09:24,660
So the feature comes last

126
00:09:24,690 --> 00:09:26,200
when it's the output of an embedding.

127
00:09:27,300 --> 00:09:32,820
But the Conv1d function, just like the Conv2d function, expects the features to be first.

128
00:09:34,250 --> 00:09:37,700
In other words, it expects the features to come before the time dimension.

129
00:09:38,060 --> 00:09:42,200
So it wants to see N by D by T and not N by T by D.

130
00:09:43,070 --> 00:09:47,900
What we end up having to do is permuting the data right after the embedding and before the first

131
00:09:47,900 --> 00:09:48,650
convolution.

132
00:09:49,640 --> 00:09:55,280
So after the embedding, it's features last, but before the first convolution, we permute it to features

133
00:09:55,280 --> 00:09:55,910
first.

134
00:09:56,540 --> 00:10:02,090
Then after all of our convolutions and poolings have been done, we permute the data back from features

135
00:10:02,090 --> 00:10:02,430
first

136
00:10:02,440 --> 00:10:03,410
to features last.
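
The two permutes around the convolution can be sketched as follows. This assumes PyTorch is installed; the sizes N, T, V, D, and M are illustrative values chosen for this example, not ones given in the lecture, and `padding=1` is used only so that the sequence length stays the same.

```python
import torch
import torch.nn as nn

# Illustrative sizes: batch, sequence length, vocabulary, embedding dim, feature maps.
N, T, V, D, M = 2, 10, 50, 8, 16

embedding = nn.Embedding(V, D)
conv = nn.Conv1d(in_channels=D, out_channels=M, kernel_size=3, padding=1)

x = torch.randint(0, V, (N, T))      # N x T integer word indexes
out = embedding(x)                   # N x T x D (features last)
out = out.permute(0, 2, 1)           # N x D x T (features first, as Conv1d expects)
out = conv(out)                      # N x M x T (padding=1 keeps the length)
out = out.permute(0, 2, 1)           # N x T x M (features last again)
print(tuple(out.shape))              # (2, 10, 16)
```

Note that `permute` swaps axes without scrambling values, whereas a plain reshape from N by T by D to N by D by T would mix up which numbers belong to which time step.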

137
00:10:06,130 --> 00:10:10,960
At that point, we can pass the data through the usual dense layers that make up the rest of the CNN.

138
00:10:16,080 --> 00:10:19,770
So here it is in action. Since there are many layers in a CNN,

139
00:10:19,860 --> 00:10:21,690
I've split this up to make it more readable.

140
00:10:22,380 --> 00:10:25,320
Here we take in the input and pass it through an embedding layer.

141
00:10:25,980 --> 00:10:28,860
Then we do a series of convolutions and poolings.

142
00:10:29,430 --> 00:10:33,960
I'm not going to show all the convolutions and poolings since we just repeat the same pattern over and

143
00:10:33,960 --> 00:10:34,350
over.

144
00:10:35,010 --> 00:10:36,750
But we can reason about the shapes.

145
00:10:37,410 --> 00:10:41,100
So first, the input X is an N by T array of integers.

146
00:10:41,640 --> 00:10:45,630
After the embedding it becomes N by T by D word vectors.

147
00:10:46,350 --> 00:10:47,820
Next we permute the dimensions.

148
00:10:47,880 --> 00:10:49,320
So it becomes N by D

149
00:10:49,350 --> 00:10:49,920
by T.

150
00:10:50,670 --> 00:10:53,700
After the first convolution, we change the number of features.

151
00:10:53,910 --> 00:10:59,700
So it becomes N by M by T. The ReLU, which comes next, doesn't change the shape of the data.

152
00:11:00,510 --> 00:11:05,770
After pooling, which changes the length of the sequence, we get N by M by T2.

153
00:11:06,660 --> 00:11:12,300
You can think of T2 as T divided by two or T divided by three, depending on the pooling parameters.
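
How pooling shortens the time dimension can be seen in a small plain-Python sketch. The function `max_pool_1d` is written for illustration only, and it assumes the stride equals the pool size unless told otherwise, with any leftover values at the end dropped.

```python
def max_pool_1d(sequence, kernel_size, stride=None):
    """Take the max over consecutive windows; stride defaults to kernel_size."""
    stride = kernel_size if stride is None else stride
    return [
        max(sequence[start:start + kernel_size])
        for start in range(0, len(sequence) - kernel_size + 1, stride)
    ]


sequence = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]   # T = 12
pooled_by_2 = max_pool_1d(sequence, 2)
pooled_by_3 = max_pool_1d(sequence, 3)
print(pooled_by_2)  # [3, 4, 9, 6, 5, 8]  -> length T / 2
print(pooled_by_3)  # [4, 9, 6, 8]        -> length T / 3
```

In a real network this is applied separately to each of the M feature maps, so only the time dimension shrinks.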

154
00:11:13,020 --> 00:11:16,250
Next we go through the second convolution and so on and so forth.

155
00:11:21,260 --> 00:11:23,450
So let's pretend that we had three convolutions.

156
00:11:23,540 --> 00:11:26,540
So after the final convolution, our data has the shape

157
00:11:26,630 --> 00:11:31,670
N by M3 by T3. At this point, we need the features to be last again.

158
00:11:32,030 --> 00:11:38,900
So we call permute again and we get N by T3 by M3. After doing a global max pool along the

159
00:11:38,900 --> 00:11:40,220
time dimension, we get

160
00:11:40,400 --> 00:11:47,090
N by M3. Note that because we can specify the axis to perform the global max pool on, we didn't

161
00:11:47,090 --> 00:11:50,630
necessarily have to permute the data after the final convolution.

162
00:11:50,810 --> 00:11:52,580
You could just choose a different axis.

163
00:11:53,180 --> 00:11:55,580
Although I think this makes it more clear what's going on.

164
00:11:56,540 --> 00:12:01,430
Lastly, we pass the data through a final dense layer and, as usual, we get back an array of size

165
00:12:01,490 --> 00:12:02,300
N by K.
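
Putting the whole walkthrough together, here is a hedged sketch of a three-convolution text CNN. It assumes PyTorch is installed. The code shown in the lecture is not part of this transcript, so the class name `TextCNN`, the layer sizes (32, 64, 128), the kernel size, and the padding are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class TextCNN(nn.Module):
    def __init__(self, n_vocab, embed_dim, n_outputs):
        super().__init__()
        # The constructor only instantiates layers; no shapes matter yet.
        self.embed = nn.Embedding(n_vocab, embed_dim)
        self.conv1 = nn.Conv1d(embed_dim, 32, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool1d(2)
        self.conv2 = nn.Conv1d(32, 64, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool1d(2)
        self.conv3 = nn.Conv1d(64, 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, n_outputs)

    def forward(self, x):
        out = self.embed(x)                             # N x T x D
        out = out.permute(0, 2, 1)                      # N x D x T
        out = self.pool1(torch.relu(self.conv1(out)))   # N x M1 x T2
        out = self.pool2(torch.relu(self.conv2(out)))   # N x M2 x T3
        out = torch.relu(self.conv3(out))               # N x M3 x T3
        out = out.permute(0, 2, 1)                      # N x T3 x M3
        out, _ = torch.max(out, dim=1)                  # global max pool over time: N x M3
        return self.fc(out)                             # N x K


model = TextCNN(n_vocab=100, embed_dim=20, n_outputs=1)
x = torch.randint(0, 100, (4, 16))   # N = 4 sentences, T = 16 word indexes each
logits = model(x)
print(tuple(logits.shape))           # (4, 1)
```

With `n_outputs=1` the result is one scalar per sentence, which fits binary classification. Taking the max over `dim=2` before the second permute would give the same N by M3 result, as noted in the lecture.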