1
00:00:11,600 --> 00:00:16,790
In this lecture, we are going to continue our discussion on alternative perspectives of convolution.

2
00:00:17,330 --> 00:00:23,150
Again, this is very helpful for understanding convolution, but it doesn't teach you anything new mechanically.

3
00:00:23,360 --> 00:00:25,130
So this lecture is optional.

4
00:00:25,130 --> 00:00:29,030
If you want to just move on and learn about how to make scenes.

5
00:00:29,990 --> 00:00:36,050
So in this lecture, what we want to talk about is the equivalence of convolution and a matrix multiplication.

6
00:00:36,470 --> 00:00:40,010
In other words, in what scenario are they actually the same thing?

7
00:00:44,750 --> 00:00:46,100
Let's do an example.

8
00:00:46,280 --> 00:00:49,900
This is easier to see if we only do a one dimensional convolution.

9
00:00:49,910 --> 00:00:52,280
So let's first go over how that would work.

10
00:00:52,460 --> 00:00:56,960
This should be easy since two dimensional convolution is just a generalization of this.

11
00:00:57,530 --> 00:01:00,080
So let's start with a one dimensional image.

12
00:01:00,110 --> 00:01:02,600
A equals a one, two, three, four.

13
00:01:02,810 --> 00:01:04,150
We also have a filter.

14
00:01:04,160 --> 00:01:05,660
W equals W one.

15
00:01:05,660 --> 00:01:06,500
W two.

16
00:01:07,510 --> 00:01:14,500
Then the convolution output will be B, which is a key involved with W, and the elements are a one

17
00:01:14,500 --> 00:01:22,660
times w one plus a two times w2a2 times w one plus a three times W two and a three times w one plus

18
00:01:22,660 --> 00:01:24,160
a four times w two.

19
00:01:28,980 --> 00:01:37,440
So in general we can write one dimensional convolution as the sum from ei prime equals one up to K of

20
00:01:37,530 --> 00:01:41,640
at ei plus prime times w of prime.

21
00:01:42,480 --> 00:01:46,950
You'll notice that this is the same equation we had for a two dimensional convolution, just with the

22
00:01:46,950 --> 00:01:48,510
second index dropped.

23
00:01:53,330 --> 00:01:58,910
Using this, it's pretty easy to see how we would implement the same thing using matrix multiplication.

24
00:01:59,510 --> 00:02:05,630
What we've done here is we created a matrix where we repeat the filter along each row, but on each

25
00:02:05,630 --> 00:02:08,720
row we shift it to the right by one space.

26
00:02:09,110 --> 00:02:12,530
Now, I recommend you do this by hand so you can see that it works.

27
00:02:12,530 --> 00:02:18,860
If you multiply the matrix by the original input vector A, you get the correct output of the convolution.

28
00:02:23,630 --> 00:02:25,070
So what's the lesson here?

29
00:02:25,430 --> 00:02:31,730
It's that by repeating the same filter again and again inside a matrix, we can implement convolution

30
00:02:31,730 --> 00:02:34,020
without actually doing convolution.

31
00:02:34,040 --> 00:02:37,220
Instead, we can just use matrix multiplication.

32
00:02:41,930 --> 00:02:47,360
But there is another problem with this particular method of doing matrix multiplication, and this is

33
00:02:47,360 --> 00:02:51,680
that the equivalent matrix repeats the filter multiple times.

34
00:02:51,890 --> 00:02:57,680
So the result is that the matrix takes up a lot more space than the original filter, which was just

35
00:02:57,680 --> 00:02:59,540
a two element array originally.

36
00:03:00,080 --> 00:03:05,300
In other words, while this was a helpful perspective, you don't necessarily want to implement convolution

37
00:03:05,300 --> 00:03:05,960
this way.

38
00:03:10,840 --> 00:03:14,440
Instead, it's helpful to look at this from the opposite direction.

39
00:03:14,710 --> 00:03:20,800
What if there are instances where instead of doing a full matrix multiplication, we can replace it

40
00:03:20,800 --> 00:03:21,850
with convolution?

41
00:03:26,730 --> 00:03:29,760
This is the idea behind parameter sharing or weight sharing.

42
00:03:30,240 --> 00:03:36,690
Think of it this way Inside a neural network, we are constantly doing this operation A equals W transpose

43
00:03:36,690 --> 00:03:43,530
X Here A is the output activation and x is the input feature vector W is the weight matrix.

44
00:03:48,260 --> 00:03:53,540
Well, what if instead of a full weight matrix, we just had the same two weights repeating over and

45
00:03:53,540 --> 00:03:54,260
over again?

46
00:03:54,590 --> 00:03:56,530
Then we have less parameters.

47
00:03:56,540 --> 00:04:02,030
Our neural network takes up less memory in RAM, and thus we can make the computation more efficient.

48
00:04:02,660 --> 00:04:07,670
In other words, convolution saves both space and time by using less weights.

49
00:04:12,470 --> 00:04:14,090
Why might we want to do this?

50
00:04:14,720 --> 00:04:19,790
Consider what we did in the previous section where we had a fully connected neural network looking at

51
00:04:19,790 --> 00:04:21,050
a single image.

52
00:04:21,440 --> 00:04:28,700
Luckily, those images were just gray scale images of size 28 by 28, which gives us 784 features.

53
00:04:29,120 --> 00:04:32,420
But what if we had a slightly larger image and add color?

54
00:04:32,960 --> 00:04:35,870
The key for ten data set is one such example.

55
00:04:36,200 --> 00:04:39,080
Those images are 32 by 30, two by three.

56
00:04:39,560 --> 00:04:42,890
In this case, we have 3072 features.

57
00:04:42,920 --> 00:04:45,470
That's quite a large increase from 784.

58
00:04:45,800 --> 00:04:51,080
And keep in mind a 32 by 32 image is quite modest in size, to say the least.

59
00:04:51,530 --> 00:04:57,500
Modern scenes such as VG look at images of size 224 by 224.

60
00:04:57,770 --> 00:05:04,580
If you use a full weight matrix, then you would have 150,528 features.

61
00:05:05,090 --> 00:05:11,150
Suppose you're looking at an image with modern HD resolution that's 1280 by 720.

62
00:05:11,660 --> 00:05:15,590
In this case you would have about 2.8 million features.

63
00:05:15,770 --> 00:05:19,040
As you can imagine, this is much too large for a neural network.

64
00:05:24,010 --> 00:05:28,810
But there is another good reason to use weight sharing, which is that you don't need different weights

65
00:05:28,810 --> 00:05:30,640
for each part of the neural network.

66
00:05:30,970 --> 00:05:35,830
Remember, that convolution is actually correlation and the filter is really a pattern finder.

67
00:05:36,100 --> 00:05:41,620
In this scenario, we actually want the same filter to look at all locations on the image.

68
00:05:41,740 --> 00:05:44,920
This is the idea behind Translational Invariance.

69
00:05:49,790 --> 00:05:52,730
Suppose we are building a dog versus cat recognizer.

70
00:05:53,360 --> 00:05:55,430
Here is an image containing a cat.

71
00:05:56,090 --> 00:05:58,190
Here's another image containing a cat.

72
00:05:58,580 --> 00:06:01,160
But what's the difference between these two images?

73
00:06:01,520 --> 00:06:04,850
The only difference is that the cat is in a different position.

74
00:06:05,630 --> 00:06:10,400
Now, if we use the fully connected or dense neural network, we would have to learn the weights for

75
00:06:10,400 --> 00:06:12,290
each of these positions separately.

76
00:06:12,710 --> 00:06:15,650
And on top of that, this neural network won't generalize.

77
00:06:15,650 --> 00:06:21,140
Well, because if we come across the same cat, but in a new position, the neural network would have

78
00:06:21,140 --> 00:06:22,250
failed to recognize it.

79
00:06:22,790 --> 00:06:28,220
So, in fact, it's better to have a shared pattern finder because this pattern finder looks at all

80
00:06:28,220 --> 00:06:29,750
locations on the image.

81
00:06:29,870 --> 00:06:35,510
You don't need to learn the weights to look at a cat at every single possible point, which would actually

82
00:06:35,510 --> 00:06:36,590
be infeasible.
