1
00:00:11,680 --> 00:00:16,990
In this lecture, we are going to discuss how we can use convolutional neural networks for sequences,

2
00:00:17,020 --> 00:00:23,080
specifically text, although the type of CNN we are about to discuss will work just as well for generic

3
00:00:23,080 --> 00:00:23,980
sequences too.

4
00:00:24,760 --> 00:00:26,620
You might be wondering, how can this be?

5
00:00:27,100 --> 00:00:32,140
If CNNs are for images, how is it possible for CNNs to also work with sequences?

6
00:00:37,300 --> 00:00:39,820
First, let's recall the basics of convolution.

7
00:00:40,480 --> 00:00:46,150
The idea is this: you have an image, which is the big square; you have a filter, which is the small

8
00:00:46,150 --> 00:00:51,520
square, and you're going to slide the filter along each possible position in the big square.

9
00:00:51,820 --> 00:00:57,340
And at each point, you're going to multiply all the overlapping values and add them together just like

10
00:00:57,340 --> 00:00:58,180
a dot product.

11
00:00:58,930 --> 00:01:01,480
Now I know this is obvious, but it's worth taking note of.

12
00:01:01,960 --> 00:01:04,750
An image has two dimensions: height and width.

13
00:01:05,530 --> 00:01:09,970
There is also the feature dimension, but that's not an actual dimension which has correlation.

14
00:01:10,600 --> 00:01:12,430
And recall what I mean by correlation.

15
00:01:12,970 --> 00:01:19,090
If you have a picture of a red car and you find a pixel in the image which is red, probably the neighboring

16
00:01:19,090 --> 00:01:20,320
pixels are also red.

17
00:01:21,010 --> 00:01:25,060
In other words, pixels beside each other likely have similar values.

18
00:01:30,130 --> 00:01:35,680
Now, think of a sequence. Unlike an image, a sequence has just one dimension that is not a feature:

19
00:01:35,950 --> 00:01:38,800
time, so it's time instead of space.

20
00:01:39,490 --> 00:01:40,990
But notice one important detail.

21
00:01:41,620 --> 00:01:46,050
We have the same type of correlation in sequences: data which are nearby

22
00:01:46,060 --> 00:01:49,300
other data in time are also close in value.

23
00:01:49,960 --> 00:01:55,930
That's why this appears like a smooth curve, rather than just noise jumping around with random uncorrelated

24
00:01:55,930 --> 00:01:56,590
values.

25
00:01:57,340 --> 00:02:00,240
This suggests that convolution might be useful here as well.

26
00:02:05,310 --> 00:02:11,760
Luckily, convolution in one dimension is actually much simpler than convolution in two dimensions. Instead

27
00:02:11,760 --> 00:02:12,720
of a big square,

28
00:02:12,990 --> 00:02:16,140
we just have a big line, and the filter is the smaller line.

29
00:02:16,980 --> 00:02:20,610
Then we slide the small line across every position in the big line.

30
00:02:21,090 --> 00:02:24,120
We multiply all the overlapping values and add them together.

31
00:02:29,260 --> 00:02:34,210
Let's do a simple example. For the sequence data, we have 1, 2, 3, 2, 1.

32
00:02:34,450 --> 00:02:37,480
And for the filter, we have simply +1, -1.

33
00:02:38,380 --> 00:02:43,570
As an exercise, you might want to try to figure out the answer first before we move on to the solution.

34
00:02:48,700 --> 00:02:52,420
OK, so here's what we get. At the first position, we have

35
00:02:52,480 --> 00:02:56,110
one times one plus two times minus one, which is minus one.

36
00:03:01,210 --> 00:03:07,180
At the second position, we get two times one plus three times minus one, which is minus one.

37
00:03:12,350 --> 00:03:17,900
At the third position, we get three times one plus two times minus one, which is one.

38
00:03:23,010 --> 00:03:29,010
And at the fourth and final position, we get two times one plus one times minus one, which is one.

39
00:03:29,670 --> 00:03:32,220
So essentially exactly what we would expect.
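
The worked example above can be checked with a few lines of Python. This is a minimal sketch, not code from the lecture: the function name `cross_correlate_1d` is made up for illustration, and it assumes no padding, so only positions where the filter fully overlaps the sequence are computed.

```python
def cross_correlate_1d(x, w):
    """Slide filter w along sequence x and take a dot product at each position.

    Only positions where the filter fits entirely inside x are used (no padding),
    so the output has len(x) - len(w) + 1 entries.
    """
    n_out = len(x) - len(w) + 1
    return [sum(x[t + k] * w[k] for k in range(len(w))) for t in range(n_out)]


x = [1, 2, 3, 2, 1]   # the sequence from the example
w = [1, -1]           # the filter from the example
y = cross_correlate_1d(x, w)
print(y)  # [-1, -1, 1, 1]
```

The four outputs match the four positions worked through by hand: negative while the sequence is rising, positive while it is falling.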

40
00:03:37,360 --> 00:03:41,650
As with images, it's possible to express this operation as an equation.

41
00:03:43,500 --> 00:03:49,380
And remember that in deep learning, while we call this convolution, pure mathematicians and statisticians

42
00:03:49,380 --> 00:03:51,090
would call this cross-correlation.

43
00:03:51,660 --> 00:03:57,210
So just keep in mind that the sign is reversed. As an exercise, you might want to try and confirm to

44
00:03:57,210 --> 00:04:03,480
yourself that this equation does in fact implement the operation we just performed in the previous example.
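
The equation shown on the slide is not captured in this transcript. A plausible form, assuming x is the input sequence, w is the filter, y is the output, and K is the filter length (K is a symbol introduced here, not one from the lecture), is the following. The second line shows the reversed sign that distinguishes convolution in the mathematical sense.

```latex
% Cross-correlation: what deep learning calls "convolution"
\[ y(t) = \sum_{k=0}^{K-1} x(t+k)\, w(k) \]

% Convolution in the mathematical sense: note the reversed sign on k
\[ y(t) = \sum_{k=0}^{K-1} x(t-k)\, w(k) \]

% First position of the earlier example, x = (1, 2, 3, 2, 1), w = (1, -1):
\[ y(0) = x(0)\,w(0) + x(1)\,w(1) = 1 \cdot 1 + 2 \cdot (-1) = -1 \]
```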

45
00:04:08,510 --> 00:04:11,900
And remember that another thing we can do is add on features.

46
00:04:12,500 --> 00:04:18,740
So imagine X as a T by D array, where T is the number of time steps and D is the number of input features.

47
00:04:19,550 --> 00:04:25,820
Then imagine the output as another two-dimensional array of size T by M where T is the number of time

48
00:04:25,820 --> 00:04:33,860
steps and M is the number of output features. Then W must be a three-dimensional array, T by D by M,

49
00:04:34,160 --> 00:04:36,440
and then we would have the equation that we see here.

50
00:04:38,640 --> 00:04:45,360
This is just like convolution with images. For a two-dimensional convolution, we have two spatial dimensions,

51
00:04:45,660 --> 00:04:49,890
plus one dimension for the input features, plus one dimension for the output features.

52
00:04:50,310 --> 00:04:54,690
So that's four dimensions in total. For a one-dimensional convolution,

53
00:04:55,050 --> 00:05:00,150
we have one time dimension, plus one dimension for the input features, plus one dimension for the

54
00:05:00,150 --> 00:05:00,990
output features.

55
00:05:01,290 --> 00:05:03,150
So that's three dimensions in total.
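
To make the shapes concrete, here is a rough NumPy sketch of a 1-D convolution with input and output features. It is an illustration under stated assumptions, not the lecture's code: the function name `conv1d_features` is hypothetical, and the filter's time length is written as K (a symbol introduced here) because in practice the filter is usually much shorter than the sequence. With no padding the output has T - K + 1 time steps; an output of exactly T by M, as described above, would correspond to padding the input so the length is preserved.

```python
import numpy as np


def conv1d_features(x, w):
    """1-D cross-correlation with features.

    x: input of shape (T, D)     -> T time steps, D input features
    w: filters of shape (K, D, M) -> K filter length, M output features
    Returns y of shape (T - K + 1, M), where
        y[t, m] = sum over k and d of x[t + k, d] * w[k, d, m]
    """
    T, D = x.shape
    K, D_w, M = w.shape
    assert D == D_w, "input feature dimensions must match"
    y = np.zeros((T - K + 1, M))
    for t in range(T - K + 1):
        # (K, D) window against (K, D, M) filters: sum over time offset and input feature
        y[t] = np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1]))
    return y


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))    # T=10 time steps, D=4 input features
w = rng.normal(size=(3, 4, 6))  # K=3, D=4, M=6 output features
y = conv1d_features(x, w)
print(y.shape)  # (8, 6)
```

With D = 1 and M = 1 this reduces to the single-feature example from earlier in the lecture.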

56
00:05:08,330 --> 00:05:13,850
As usual, while convolution might seem like an abstract concept, remember that there are multiple

57
00:05:13,850 --> 00:05:18,710
convenient perspectives on it that make it seem like things we are already familiar with.

58
00:05:19,460 --> 00:05:24,560
So one perspective is that it's just matrix multiplication, just like a regular feedforward neural

59
00:05:24,560 --> 00:05:30,530
network, except that we have shared weights in order to take advantage of the special structure and

60
00:05:30,530 --> 00:05:31,730
correlation in the data.

61
00:05:33,920 --> 00:05:39,410
Another intuitive perspective is that it's just a sliding dot product, and the dot product is just a

62
00:05:39,410 --> 00:05:40,580
correlation finder.

63
00:05:41,900 --> 00:05:45,380
Correlation is just another name for pattern matching or similarity.

64
00:05:45,950 --> 00:05:52,130
So really, what we are doing is asking: is this part of the sequence similar to my filter? And thus

65
00:05:52,130 --> 00:05:55,190
the filter becomes a pattern matcher or a pattern finder.

66
00:05:56,120 --> 00:06:00,260
All of these concepts that you learned about convolution before still apply here.
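
The "matrix multiplication with shared weights" perspective can be sketched in NumPy by writing the earlier example as a matrix times a vector. This is only an illustration: the variable names are chosen here, and no padding is assumed. Every row of the matrix holds the same filter weights, shifted one step to the right.

```python
import numpy as np

x = np.array([1, 2, 3, 2, 1])  # the sequence from the earlier example
w = np.array([1, -1])          # the filter from the earlier example

T, K = len(x), len(w)
# Build a (T - K + 1) x T weight matrix: the same two weights appear in every row,
# just shifted, which is what "shared weights" means here.
W = np.zeros((T - K + 1, T), dtype=int)
for t in range(T - K + 1):
    W[t, t:t + K] = w

print(W.tolist())
# [[1, -1, 0, 0, 0], [0, 1, -1, 0, 0], [0, 0, 1, -1, 0], [0, 0, 0, 1, -1]]
print((W @ x).tolist())  # [-1, -1, 1, 1]
```

A regular dense layer of the same size would have 4 x 5 = 20 free weights; here there are only 2, repeated along the diagonal. Each row of the product is also exactly one of the sliding dot products.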

67
00:06:05,430 --> 00:06:10,380
All right, so now that you know how convolution works with sequences, how do we apply this to text?

68
00:06:11,130 --> 00:06:17,490
Well, luckily, when we use embeddings, that already gives us exactly what we need for one-dimensional

69
00:06:17,490 --> 00:06:18,230
convolution.

70
00:06:18,240 --> 00:06:24,060
We need an input which is a T by D sequence, where T is the number of time steps and D is the number

71
00:06:24,060 --> 00:06:24,810
of features.

72
00:06:25,320 --> 00:06:30,810
And of course, this is exactly what we get after we pass our sentence through an embedding layer.

73
00:06:31,650 --> 00:06:39,120
We go from a length-T sequence of words to a length-T sequence of integers to a length-T sequence of length-D

74
00:06:39,120 --> 00:06:39,840
vectors.

75
00:06:40,530 --> 00:06:46,110
Since we have T vectors, each of length D, this makes up a T by D matrix when you stack them all

76
00:06:46,110 --> 00:06:51,570
together, and thus we have exactly what we need to build a CNN for text.
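
The words-to-integers-to-vectors pipeline can be sketched with a plain NumPy lookup. Everything specific here is made up for illustration: the tiny vocabulary, the example sentence, and the random embedding matrix (in a real model the embedding weights are learned).

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for unknown words.
vocab = {"<unk>": 0, "the": 1, "movie": 2, "was": 3, "great": 4, "bad": 5}
V, D = len(vocab), 3  # vocabulary size and embedding dimension

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(V, D))  # one length-D vector per word index

words = ["the", "movie", "was", "great"]                # length-T sequence of words
indices = [vocab.get(word, 0) for word in words]        # length-T sequence of integers
X = embedding_matrix[indices]                           # T x D matrix of word vectors

print(indices)  # [1, 2, 3, 4]
print(X.shape)  # (4, 3)
```

Row t of X is the vector for the t-th word, so X has exactly the T by D layout that a 1-D convolution expects.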

77
00:06:56,820 --> 00:06:58,350
So here's how it all looks in code.

78
00:06:59,010 --> 00:07:00,210
First, we have our input.

79
00:07:00,270 --> 00:07:02,910
That's a length-T sequence of word indexes.

80
00:07:03,540 --> 00:07:04,920
Then we have an embedding layer.

81
00:07:05,430 --> 00:07:08,970
The output of that is a T by D sequence of word vectors.

82
00:07:09,900 --> 00:07:11,640
Then we have a 1-D convolution.

83
00:07:11,670 --> 00:07:13,620
This is just the Conv1D class.

84
00:07:14,520 --> 00:07:19,800
Then we follow the same typical CNN architecture where we have convolution followed by pooling

85
00:07:19,800 --> 00:07:20,610
and so forth.

86
00:07:21,210 --> 00:07:24,330
So the same pattern applies here that we had for images.

87
00:07:25,170 --> 00:07:30,420
Generally speaking, the data shrinks in the time dimension, but grows in the feature dimension.

88
00:07:31,200 --> 00:07:34,500
So that's why you see the number of feature maps getting larger and larger.

89
00:07:35,340 --> 00:07:40,470
Once we've done that, then we can do a flattening or we can do a global max pooling, which will give

90
00:07:40,470 --> 00:07:42,750
us a single vector of size M3.

91
00:07:45,280 --> 00:07:49,360
Finally, we pass this through one or more dense layers to get a single scalar,

92
00:07:49,660 --> 00:07:52,000
assuming we are doing binary classification.

93
00:07:53,440 --> 00:07:59,650
As you can see, this is no different from a CNN meant for images, except that all the convolutions and

94
00:07:59,650 --> 00:08:02,620
pooling are one-dimensional instead of two-dimensional.
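
The code shown on the slide is not part of this transcript. The following is a sketch of what such a model might look like in Keras, assuming TensorFlow 2 is installed. The sequence length, vocabulary size, embedding size, filter counts, and kernel and pool sizes are all illustrative assumptions rather than values from the lecture.

```python
from tensorflow.keras.layers import (
    Input, Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense
)
from tensorflow.keras.models import Model

# Assumed values for illustration only.
T = 100    # sequence length (time steps)
V = 20000  # vocabulary size
D = 20     # embedding dimension

i = Input(shape=(T,))                    # length-T sequence of word indexes
x = Embedding(V, D)(i)                   # T x D sequence of word vectors
x = Conv1D(32, 3, activation="relu")(x)  # time shrinks to 98, features grow to 32
x = MaxPooling1D(3)(x)                   # time shrinks to 32
x = Conv1D(64, 3, activation="relu")(x)  # time 30, features 64
x = MaxPooling1D(3)(x)                   # time 10
x = Conv1D(128, 3, activation="relu")(x) # time 8, features 128
x = GlobalMaxPooling1D()(x)              # max over time: a single vector of size 128
x = Dense(1, activation="sigmoid")(x)    # single scalar for binary classification

model = Model(i, x)
model.summary()
```

The summary shows the pattern described above: the time dimension shrinks from layer to layer while the number of feature maps grows, and global max pooling collapses whatever time steps remain into one vector.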