1
00:00:11,590 --> 00:00:16,660
In this lecture, we are going to continue our discussion on alternative perspectives of convolution.

2
00:00:17,410 --> 00:00:22,900
Again, this is very helpful for understanding convolution, but it doesn't teach you anything new mechanically.

3
00:00:23,560 --> 00:00:28,810
So this lecture is optional if you want to just move on and learn about how to make CNNs.

4
00:00:30,100 --> 00:00:35,050
So in this lecture, what we're going to talk about is the equivalence of convolution and matrix

5
00:00:35,050 --> 00:00:35,830
multiplication.

6
00:00:36,550 --> 00:00:39,760
In other words, in what scenario are they actually the same thing?

7
00:00:44,810 --> 00:00:45,830
Let's do an example.

8
00:00:46,460 --> 00:00:49,730
This is easier to see if we only do a one-dimensional convolution.

9
00:00:50,060 --> 00:00:52,010
So let's first go over how that would work.

10
00:00:52,640 --> 00:00:56,690
This should be easy since two dimensional convolution is just a generalization of this.

11
00:00:57,710 --> 00:00:59,900
So let's start with a one dimensional image.

12
00:01:00,080 --> 00:01:02,320
a equals (a1, a2, a3, a4).

13
00:01:02,900 --> 00:01:06,320
We also have a filter, w equals (w1, w2).

14
00:01:07,610 --> 00:01:14,760
Then the convolution output will be b, which is a convolved with w, and the elements are a1 times

15
00:01:15,050 --> 00:01:22,730
w1 plus a2 times w2, a2 times w1 plus a3 times w2, and a3 times w1 plus

16
00:01:22,760 --> 00:01:23,960
a4 times w2.

17
00:01:29,080 --> 00:01:36,580
So in general, we can write one-dimensional convolution as the sum from i prime equals one up to K

18
00:01:37,030 --> 00:01:41,350
of a of i plus i prime, times w of i prime.

19
00:01:42,470 --> 00:01:46,910
You'll notice that this is the same equation we had for two dimensional convolution, just with the

20
00:01:46,910 --> 00:01:48,260
second index dropped.
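
The sum just described can be sketched in a few lines of NumPy. This is only an illustration: the function name `conv1d_valid` and the sample numbers are made up for the example, and the indices are zero-based so that the first output matches a1 times w1 plus a2 times w2 from the worked example.

```python
import numpy as np

def conv1d_valid(a, w):
    """1-D convolution as used in this lecture (no filter flipping).

    Zero-based form of the sum: b[i] = sum over i' of a[i + i'] * w[i'].
    """
    a = np.asarray(a, dtype=float)
    w = np.asarray(w, dtype=float)
    n_out = len(a) - len(w) + 1  # the filter must fit entirely inside the input
    b = np.zeros(n_out)
    for i in range(n_out):
        for i_prime in range(len(w)):
            b[i] += a[i + i_prime] * w[i_prime]
    return b

# Hypothetical values standing in for (a1, a2, a3, a4) and (w1, w2).
a = [1.0, 2.0, 3.0, 4.0]
w = [10.0, 1.0]
b = conv1d_valid(a, w)
print(b)  # [12. 23. 34.]
```

Because the filter is not flipped, this should agree with `np.correlate(a, w, mode="valid")`.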

21
00:01:53,370 --> 00:01:58,710
Using this, it's pretty easy to see how we would implement the same thing using matrix multiplication.

22
00:01:59,640 --> 00:02:05,610
What we've done here is we created a matrix where we repeat the filter along each row, but on each

23
00:02:05,610 --> 00:02:08,430
row we shift it to the right by one space.

24
00:02:09,270 --> 00:02:12,330
Now I recommend you do this by hand so you can see that it works.

25
00:02:12,600 --> 00:02:18,660
If you multiply the matrix by the original input vector a, you get the correct output of the convolution.

26
00:02:23,750 --> 00:02:30,020
So the lesson here is that by repeating the same filter again and again inside a matrix, we

27
00:02:30,020 --> 00:02:33,830
can implement convolution without actually doing convolution.

28
00:02:34,130 --> 00:02:37,040
Instead, we can just use matrix multiplication.

29
00:02:42,080 --> 00:02:47,330
But there is another problem with this particular method of doing matrix multiplication, and this is

30
00:02:47,330 --> 00:02:51,440
that the equivalent matrix repeats the filter multiple times.

31
00:02:52,070 --> 00:02:57,670
So the result is that the matrix takes up a lot more space than the original filter, which was just

32
00:02:57,680 --> 00:02:59,330
a two element array originally.

33
00:03:00,140 --> 00:03:05,270
In other words, while this was a helpful perspective, you don't necessarily want to implement convolution

34
00:03:05,270 --> 00:03:05,780
this way.
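
Here is a small sketch of the shifted-filter matrix, for anyone who would rather check it in code than by hand. The helper name `filter_to_matrix` and the sample numbers are assumptions for illustration only; the slides may lay the matrix out differently.

```python
import numpy as np

def filter_to_matrix(w, n_inputs):
    """Build the matrix that repeats the filter on each row, shifted right by one."""
    w = np.asarray(w, dtype=float)
    k = len(w)
    n_out = n_inputs - k + 1
    M = np.zeros((n_out, n_inputs))
    for row in range(n_out):
        M[row, row:row + k] = w  # same weights, one position further right
    return M

# Hypothetical values standing in for (a1, a2, a3, a4) and (w1, w2).
a = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([10.0, 1.0])

M = filter_to_matrix(w, len(a))
print(M)
print(M @ a)  # [12. 23. 34.], the same values as sliding the filter over a
print(M.size, "matrix entries versus", w.size, "filter weights")
```

Even in this tiny case the matrix stores 12 numbers to represent a filter that has only 2 weights, and the gap widens as the input gets longer.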

35
00:03:10,830 --> 00:03:14,160
Instead, it's helpful to look at this from the opposite direction.

36
00:03:14,880 --> 00:03:20,820
What if there are instances where instead of doing a full matrix multiplication, we can replace it

37
00:03:20,820 --> 00:03:21,660
with convolution?

38
00:03:26,810 --> 00:03:29,690
This is the idea behind parameter sharing or weight sharing.

39
00:03:30,410 --> 00:03:34,220
Think of it this way: inside a neural network, we are constantly doing

40
00:03:34,220 --> 00:03:37,100
this operation: a equals W transpose x.

41
00:03:37,670 --> 00:03:40,850
Here, a is the output activation, x is the input feature

42
00:03:40,850 --> 00:03:43,310
vector, and W is the weight matrix.

43
00:03:48,370 --> 00:03:53,530
Well, what if instead of a full weight matrix, we just had the same two weights repeating over and

44
00:03:53,530 --> 00:03:59,830
over again? Then we have fewer parameters, our neural network takes up less memory and RAM, and thus

45
00:03:59,830 --> 00:04:01,810
we can make the computation more efficient.

46
00:04:02,740 --> 00:04:07,450
In other words, convolution saves both space and time by using fewer weights.

47
00:04:12,480 --> 00:04:13,800
Why might we want to do this?

48
00:04:14,790 --> 00:04:19,740
Consider what we did in the previous section where we had a fully connected neural network looking at

49
00:04:19,740 --> 00:04:20,790
a single image.

50
00:04:21,600 --> 00:04:28,470
Luckily, those images were just grayscale images of size 28 by 28, which gives us 784 features.

51
00:04:29,250 --> 00:04:32,220
But what if we had a slightly larger image and had color?

52
00:04:33,090 --> 00:04:35,640
The CIFAR-10 dataset is one such example.

53
00:04:36,360 --> 00:04:38,940
Those images are 32 by 32 by three.

54
00:04:39,660 --> 00:04:42,660
In this case, we have 3072 features.

55
00:04:43,110 --> 00:04:45,210
That's quite a large increase from 784.

56
00:04:45,960 --> 00:04:50,820
And keep in mind, a 32 by 32 image is quite modest in size, to say the least.

57
00:04:51,690 --> 00:04:57,240
Modern CNNs, such as VGG, look at images of size 224 by 224.

58
00:04:57,930 --> 00:05:03,660
If you use a full weight matrix, then you would have one hundred fifty thousand five hundred twenty

59
00:05:03,660 --> 00:05:04,350
eight features.

60
00:05:05,250 --> 00:05:11,010
Suppose you're looking at an image with modern HD resolution, that's 1280 by 720.

61
00:05:11,820 --> 00:05:15,330
In this case, you would have about 2.8 million features.

62
00:05:15,900 --> 00:05:18,990
As you can imagine, this is much too large for a neural network.
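
The feature counts quoted here are just height times width times channels. A quick sketch to check the arithmetic, assuming three color channels for the 224 by 224 and HD cases (the helper name `num_features` is made up for this example):

```python
def num_features(height, width, channels):
    """Number of input features when an image is flattened into one vector."""
    return height * width * channels

# (height, width, channels) for each case mentioned in the lecture.
sizes = {
    "28 x 28 grayscale": (28, 28, 1),
    "CIFAR-10, 32 x 32 color": (32, 32, 3),
    "224 x 224 color": (224, 224, 3),
    "HD, 1280 x 720 color": (720, 1280, 3),
}

for name, shape in sizes.items():
    print(f"{name}: {num_features(*shape):,} features")
```

The HD case comes out to 2,764,800, which is the "about 2.8 million" figure.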

63
00:05:24,150 --> 00:05:29,160
But there's another good reason to use convolution, which is that you don't need different weights for each

64
00:05:29,160 --> 00:05:30,420
part of the neural network.

65
00:05:31,110 --> 00:05:35,580
Remember that convolution is actually correlation and the filter is really a pattern finder.

66
00:05:36,270 --> 00:05:41,310
In this scenario, we actually want the same filter to look at all locations on the image.

67
00:05:41,880 --> 00:05:44,700
This is the idea behind translational invariance.

68
00:05:49,870 --> 00:05:52,460
Suppose we are building a dog versus cat recognizer.

69
00:05:53,500 --> 00:05:55,330
Here is an image containing a cat.

70
00:05:56,200 --> 00:05:58,030
Here's another image containing a cat.

71
00:05:58,750 --> 00:06:00,910
But what's the difference between these two images?

72
00:06:01,660 --> 00:06:04,570
The only difference is that the cat is in a different position.

73
00:06:05,650 --> 00:06:10,390
Now, if we use a fully connected or dense neural network, we would have to learn the weights for

74
00:06:10,390 --> 00:06:12,100
each of these positions separately.

75
00:06:12,880 --> 00:06:18,400
And on top of that, this neural network won't generalize well because if we come across the same cat

76
00:06:18,700 --> 00:06:22,090
but in a new position, the neural network would have failed to recognize it.

77
00:06:22,900 --> 00:06:28,780
So in fact, it's better to have a shared pattern finder because this pattern finder looks at all locations

78
00:06:28,780 --> 00:06:29,530
on the image.

79
00:06:30,010 --> 00:06:35,500
You don't need to learn the weights to look at a cat at every single possible point, which would actually

80
00:06:35,500 --> 00:06:36,400
be infeasible.

