1
00:00:00,000 --> 00:00:01,755
In the previous example,

2
00:00:01,755 --> 00:00:04,440
you saw how you could create
a neural network called

3
00:00:04,440 --> 00:00:06,720
a deep neural network
to pattern match

4
00:00:06,720 --> 00:00:09,585
a set of images of
fashion items to labels.

5
00:00:09,585 --> 00:00:11,790
In just a couple of minutes,
you were able to train it to

6
00:00:11,790 --> 00:00:14,520
classify with pretty high
accuracy on the training set,

7
00:00:14,520 --> 00:00:16,575
but a little less
on the test set.

8
00:00:16,575 --> 00:00:18,600
Now, one of the things
that you would have

9
00:00:18,600 --> 00:00:20,190
seen when you looked at

10
00:00:20,190 --> 00:00:21,720
the images is that
there's a lot of

11
00:00:21,720 --> 00:00:23,835
wasted space in each image.

12
00:00:23,835 --> 00:00:26,490
While there are only 784 pixels,

13
00:00:26,490 --> 00:00:28,230
it would be interesting
to see if there

14
00:00:28,230 --> 00:00:29,940
was a way that we could
condense the image

15
00:00:29,940 --> 00:00:32,219
down to the important features

16
00:00:32,219 --> 00:00:34,160
that distinguish what
makes it a shoe,

17
00:00:34,160 --> 00:00:35,780
or a handbag, or a shirt.

18
00:00:35,780 --> 00:00:37,885
That's where
convolutions come in.

19
00:00:37,885 --> 00:00:40,005
So, what's a convolution,
you might ask?

20
00:00:40,005 --> 00:00:43,340
Well, if you've ever done
any kind of image processing,

21
00:00:43,340 --> 00:00:46,040
it usually involves having
a filter and passing

22
00:00:46,040 --> 00:00:47,840
that filter over
the image in order

23
00:00:47,840 --> 00:00:50,030
to change the underlying image.

24
00:00:50,030 --> 00:00:52,820
The process works a
little bit like this.

25
00:00:52,820 --> 00:00:55,245
For every pixel, take its value,

26
00:00:55,245 --> 00:00:57,485
and take a look at
the value of its neighbors.

27
00:00:57,485 --> 00:00:59,450
If our filter is three by three,

28
00:00:59,450 --> 00:01:01,550
then we can take a look at
the immediate neighbors,

29
00:01:01,550 --> 00:01:04,225
so that you have a corresponding
three by three grid.

30
00:01:04,225 --> 00:01:06,795
Then to get the new value
for the pixel,

31
00:01:06,795 --> 00:01:09,050
we simply multiply each neighbor

32
00:01:09,050 --> 00:01:11,470
by the corresponding value
in the filter.

33
00:01:11,470 --> 00:01:13,320
So, for example, in this case,

34
00:01:13,320 --> 00:01:15,660
our pixel has the value 192,

35
00:01:15,660 --> 00:01:18,560
and its upper left neighbor
has the value zero.

36
00:01:18,560 --> 00:01:21,560
The upper left value in
the filter is negative one,

37
00:01:21,560 --> 00:01:23,920
so we multiply zero
by negative one.

38
00:01:23,920 --> 00:01:26,390
Then we would do the same
for the upper neighbor.

39
00:01:26,390 --> 00:01:28,280
Its value is 64 and

40
00:01:28,280 --> 00:01:30,260
the corresponding
filter value is zero,

41
00:01:30,260 --> 00:01:32,105
so we'd multiply those out.

42
00:01:32,105 --> 00:01:34,280
Repeat this for each neighbor and

43
00:01:34,280 --> 00:01:36,230
each corresponding filter value,

44
00:01:36,230 --> 00:01:39,650
and you would then have the new
pixel with the sum of each of

45
00:01:39,650 --> 00:01:41,480
the neighbor values multiplied

46
00:01:41,480 --> 00:01:43,430
by the corresponding
filter value,

47
00:01:43,430 --> 00:01:44,945
and that's a convolution.

48
00:01:44,945 --> 00:01:47,005
It's really as simple as that.
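Here's a minimal sketch of that multiply-and-sum step in Python with NumPy. Only the center pixel (192), its upper left neighbor (0), its upper neighbor (64), and the two matching filter values (negative one and zero) come from the example. Every other number, and the name convolve_pixel, is made up for illustration.

```python
import numpy as np

def convolve_pixel(image, kernel, row, col):
    """Return the weighted sum of the 3x3 neighborhood centered on (row, col)."""
    patch = image[row - 1:row + 2, col - 1:col + 2]
    # Multiply each pixel by the filter value in the same position, then add up.
    return float(np.sum(patch * kernel))

# From the example: center 192, upper left 0, upper 64. The rest is assumed.
image = np.array([
    [0,   64,  128],
    [48,  192, 144],
    [142, 226, 168],
], dtype=float)

# From the example: upper left -1, upper 0. The rest is assumed.
kernel = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)

new_value = convolve_pixel(image, kernel, 1, 1)
print(new_value)  # 346.0
```

With these assumed numbers the new pixel value is 346, so in practice you would usually clip or rescale the result back into the 0 to 255 range.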

49
00:01:47,005 --> 00:01:48,580
The idea here is that

50
00:01:48,580 --> 00:01:50,965
some convolutions
will change the image

51
00:01:50,965 --> 00:01:52,390
in such a way that

52
00:01:52,390 --> 00:01:55,045
certain features in
the image get emphasized.

53
00:01:55,045 --> 00:01:57,565
So, for example, if you
look at this filter,

54
00:01:57,565 --> 00:02:00,530
then the vertical lines in
the image really pop out.

55
00:02:00,530 --> 00:02:03,915
With this filter,
the horizontal lines pop out.
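The filters on screen aren't spelled out here, so as a rough sketch, assume a Sobel-style vertical-edge filter and its transpose for horizontal edges. The tiny test image, the filter values, and the helper name convolve2d_valid are all assumptions for illustration.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely inside the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Assumed filters: one responds to vertical edges, its transpose to horizontal ones.
vertical_filter = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]], dtype=float)
horizontal_filter = vertical_filter.T

# A 5x5 image that is dark on the left and bright on the right: one vertical edge.
image = np.zeros((5, 5))
image[:, 3:] = 255.0

vertical_response = convolve2d_valid(image, vertical_filter)
horizontal_response = convolve2d_valid(image, horizontal_filter)
```

On this image the vertical filter gives a strong response where its window covers the edge, while the horizontal filter gives zero everywhere, because there is no horizontal edge to find.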

56
00:02:03,915 --> 00:02:06,100
Now, that's a very
basic introduction

57
00:02:06,100 --> 00:02:07,810
to what convolutions do,

58
00:02:07,810 --> 00:02:10,450
and when combined with
something called pooling,

59
00:02:10,450 --> 00:02:12,545
they can become really powerful.

60
00:02:12,545 --> 00:02:16,245
Put simply, pooling is a way
of compressing an image.

61
00:02:16,245 --> 00:02:18,160
A quick and easy way to do this

62
00:02:18,160 --> 00:02:21,310
is to go over the image
four pixels at a time, i.e.,

63
00:02:21,310 --> 00:02:22,840
the current pixel and

64
00:02:22,840 --> 00:02:25,475
its neighbors underneath
and to the right of it.

65
00:02:25,475 --> 00:02:29,425
Of these four, pick the biggest
value and keep just that.

66
00:02:29,425 --> 00:02:31,835
So, for example, you
can see it here.

67
00:02:31,835 --> 00:02:34,220
My 16 pixels on the left are

68
00:02:34,220 --> 00:02:36,710
turned into the four
pixels on the right,

69
00:02:36,710 --> 00:02:38,750
by looking at them
in two-by-two grids

70
00:02:38,750 --> 00:02:40,370
and picking the biggest value.

71
00:02:40,370 --> 00:02:42,290
This will preserve
the features that

72
00:02:42,290 --> 00:02:44,135
were highlighted by
the convolution,

73
00:02:44,135 --> 00:02:47,540
while simultaneously quartering
the size of the image.

74
00:02:47,540 --> 00:02:51,210
We halve the horizontal
and vertical axes.
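The two-by-two pooling step can be sketched in a few lines of NumPy. The 16 pixel values below are made up, and max_pool_2x2 is a hypothetical helper that assumes the image height and width are both even.

```python
import numpy as np

def max_pool_2x2(image):
    """Keep the largest value from each non-overlapping 2x2 block."""
    h, w = image.shape
    # Split rows and columns into pairs, then take the max over each pair of pairs.
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Assumed 4x4 image (16 pixels).
image = np.array([
    [0,   64,  128, 128],
    [48,  192, 144, 144],
    [142, 226, 168, 0],
    [255, 0,   0,   64],
])

pooled = max_pool_2x2(image)
print(pooled)
# [[192 144]
#  [255 168]]
```

The 4x4 input becomes a 2x2 output: each axis is halved, so the image ends up a quarter of its original size.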