1
00:00:11,670 --> 00:00:15,750
In this lecture we're going to fill in the missing detail from the last lecture.

2
00:00:15,750 --> 00:00:23,680
How do we calculate the dimensionality of the initial feature vector that goes into the dense layers of a CNN?

3
00:00:23,870 --> 00:00:28,240
For this we need to know about a topic called convolutional arithmetic.

4
00:00:28,280 --> 00:00:33,410
Basically these are the concepts that allow us to calculate what the output size of a convolution will

5
00:00:33,410 --> 00:00:40,700
be, given some input size, some filter size, and various options such as padding and strides. As a reminder,

6
00:00:40,700 --> 00:00:46,550
the reason why we need this is because PyTorch requires you to specify both the input size and output

7
00:00:46,550 --> 00:00:48,320
size of each layer,

8
00:00:48,320 --> 00:00:51,690
even though we implicitly already know this information.

9
00:00:51,770 --> 00:00:54,420
What's weird is that these calculations are well-defined.

10
00:00:54,440 --> 00:00:58,790
So there isn't really a great reason to require us to do them ourselves.

11
00:00:58,850 --> 00:01:04,970
Libraries such as Keras make things super easy by automatically inferring the input size of each layer.

12
00:01:04,970 --> 00:01:12,540
So you only need to specify the output size.

13
00:01:12,540 --> 00:01:17,970
The reason I mention Keras a lot is because Keras is usually a student's first introduction to deep

14
00:01:17,970 --> 00:01:18,420
learning.

15
00:01:19,080 --> 00:01:24,000
And even if you don't know Keras, you could easily pick it up in a few minutes, provided you have some

16
00:01:24,000 --> 00:01:30,820
base level of programming skill. So another great thing about Keras is that it allows you to pass in

17
00:01:30,820 --> 00:01:36,400
a string to specify the type of padding that is used for convolution. As a reminder,

18
00:01:36,400 --> 00:01:43,390
we generally have three modes of convolution. In valid mode, the output size is N minus K plus 1. In

19
00:01:43,390 --> 00:01:49,480
same mode, the output size is just N, the input size. And in full mode, which no longer exists by the way,

20
00:01:49,840 --> 00:01:52,850
the output size is N plus K minus 1.

21
00:01:52,900 --> 00:01:57,180
And remember that these are just the basic form of convolution with no strides.
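
As a rough illustration of the three modes just described, here is a minimal Python sketch for one dimension with no strides. The function name `basic_conv_output_size` is hypothetical and not part of any library; `n` is the input size and `k` is the filter size.

```python
def basic_conv_output_size(n, k, mode):
    """Output size of a basic (stride-1) convolution along one dimension."""
    if mode == "valid":
        return n - k + 1   # filter must fit entirely inside the input
    if mode == "same":
        return n           # output is padded to match the input size
    if mode == "full":
        return n + k - 1   # every partial overlap counts
    raise ValueError(f"unknown mode: {mode}")


for mode in ("valid", "same", "full"):
    print(mode, basic_conv_output_size(32, 3, mode))
```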

22
00:01:57,760 --> 00:01:59,440
So you might think well that's great.

23
00:01:59,440 --> 00:02:05,770
If I just use same mode for all my convolutions, then calculating the output size is super easy. But wait:

24
00:02:06,610 --> 00:02:11,760
passing in the mode of convolution as a string exists only in Keras, not PyTorch.

25
00:02:16,940 --> 00:02:22,640
The reason I mention this is because it's a natural question to ask: why not use same mode so that

26
00:02:22,640 --> 00:02:26,640
it's easy to compute the size of the image after each convolution?

27
00:02:26,990 --> 00:02:32,660
If we do that then let's say we have an input image of size 32 that goes through three convolutions

28
00:02:32,660 --> 00:02:39,320
with a stride of two. After the first convolution, the image shrinks down to 16. After the second convolution,

29
00:02:39,350 --> 00:02:44,240
it shrinks down to 8, and after the third convolution, it shrinks down to 4. Easy.

30
00:02:44,900 --> 00:02:49,660
Well, let's ask ourselves: how would we do the equivalent of same mode in PyTorch?

31
00:02:50,000 --> 00:02:56,930
Let's see: we have to specify this padding argument, which tells PyTorch how many spaces to pad the input

32
00:02:56,930 --> 00:03:04,040
image on each side. But wait: in order to specify this argument correctly, so that you achieve the equivalent

33
00:03:04,040 --> 00:03:05,300
of same mode,

34
00:03:05,300 --> 00:03:10,140
you need to do convolutional arithmetic to calculate the correct amount of padding.

35
00:03:10,190 --> 00:03:16,070
In other words, you still need to do convolutional arithmetic whether you have same mode or not.

36
00:03:16,070 --> 00:03:23,590
So there's no benefit to using same mode, because either way you still have to do convolutional arithmetic.

37
00:03:23,630 --> 00:03:28,860
In fact, this might be a little harder because you have to do convolutional arithmetic in reverse.

38
00:03:28,890 --> 00:03:38,930
The only good reason to use same mode would be if you really believe it benefits your model's predictions.
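
To make "arithmetic in reverse" concrete, here is a small sketch under simplifying assumptions: stride 1, no dilation, and an odd filter size. Both helper names are hypothetical. Setting the output size n + 2p - k + 1 equal to n and solving for the padding p gives p = (k - 1) / 2.

```python
def output_size(n, k, padding):
    """Stride-1, dilation-1 output size: n + 2 * padding - k + 1."""
    return n + 2 * padding - k + 1


def same_padding(k):
    """Padding per side that keeps the output size equal to the input size.

    Solving n + 2 * p - k + 1 = n for p gives p = (k - 1) / 2,
    which is a whole number only when k is odd.
    """
    if k % 2 == 0:
        raise ValueError("this sketch only handles odd filter sizes")
    return (k - 1) // 2


for k in (1, 3, 5, 7):
    p = same_padding(k)
    print(k, p, output_size(32, k, p))
```

With strides or even filter sizes the answer is no longer this simple, which is part of why the reverse calculation can be harder.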

39
00:03:38,930 --> 00:03:44,470
OK, so now you're convinced that there is no way out of doing convolutional arithmetic. In this course,

40
00:03:44,470 --> 00:03:46,960
it's not my goal to go in-depth about this topic.

41
00:03:47,560 --> 00:03:52,510
If you want to gain a better intuition about convolutional arithmetic, I've left a link to

42
00:03:52,510 --> 00:03:55,440
deeplearning.net in the file extra_reading.txt,

43
00:03:55,450 --> 00:04:00,400
which leads you to a tutorial about convolutional arithmetic and how the calculations change when you

44
00:04:00,400 --> 00:04:02,560
have padding, strides, and so forth.

45
00:04:02,560 --> 00:04:10,160
It's pretty nontrivial. However, an easy way to get around this is to simply use the formula that

46
00:04:10,160 --> 00:04:13,040
PyTorch provides in its documentation.

47
00:04:13,040 --> 00:04:18,330
With this formula you can plug in all your values and calculate the output dimensions.

48
00:04:18,420 --> 00:04:22,400
You don't need to understand convolutional arithmetic to use this formula.

49
00:04:22,500 --> 00:04:25,070
You just need to know how to multiply, add, and divide.

50
00:04:30,230 --> 00:04:35,360
Let's do a simple example to make sure you know how to use this formula. Since the formulas for both

51
00:04:35,360 --> 00:04:37,360
the height and width dimensions are the same,

52
00:04:37,370 --> 00:04:44,280
we only need to do one. Firstly, the concept of dilation is outside the scope of this course, and we will

53
00:04:44,280 --> 00:04:47,350
use the default value of one. Next,

54
00:04:47,370 --> 00:04:52,660
let's assume we have padding equals one, kernel size equals three, and stride equals two.

55
00:04:52,740 --> 00:04:58,260
Let's set H_in to 32, which is common for deep learning image datasets.

56
00:04:58,350 --> 00:05:05,340
If we plug these values into our formula, we get 32 plus 2 times 1, minus 1 times the quantity 3

57
00:05:05,340 --> 00:05:13,020
minus 1, minus 1, all divided by 2. Then we add 1 and take the floor.

58
00:05:13,350 --> 00:05:19,050
So that gives us 31 over 2 plus 1, which is 15.5 plus 1, which is

59
00:05:19,050 --> 00:05:20,210
16.5.

60
00:05:20,310 --> 00:05:27,010
Then we take the floor of 16.5 and we get 16, so H_out equals 16.

61
00:05:27,130 --> 00:05:30,450
Now it's easy to make mistakes with these and I wouldn't be surprised if I did.

62
00:05:30,520 --> 00:05:36,430
So if you want to double-check my work, please do. Alternatively, you could just implement this in code.
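
One possible way to implement the formula in code is sketched below. The function name `conv_output_size` is hypothetical; the arguments mirror the padding, dilation, kernel size, and stride terms from the worked example. Floor division gives the same result as dividing, adding one, and taking the floor, because all the values are integers.

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output size along one dimension (height or width) of a convolution.

    Computes floor((size_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1).
    """
    numerator = size_in + 2 * padding - dilation * (kernel_size - 1) - 1
    return numerator // stride + 1


# The worked example, repeated over three stacked convolutions.
h = 32
for layer in range(3):
    h = conv_output_size(h, kernel_size=3, stride=2, padding=1)
    print(h)  # prints 16, then 8, then 4
```

The same function works for the width; call it once per spatial dimension.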

63
00:05:41,630 --> 00:05:46,070
You may have noticed something a little curious on the previous slide, which is that in the PyTorch

64
00:05:46,070 --> 00:05:52,560
documentation, the input size and the output size are specified as N by color by height by width.

65
00:05:52,640 --> 00:05:57,680
This is contrary to what we've been working with so far, which is datasets of size N by height by

66
00:05:57,680 --> 00:05:59,180
width by color.

67
00:05:59,180 --> 00:06:01,330
Why is there this discrepancy?

68
00:06:01,340 --> 00:06:04,250
Well it has to do with the conventions of the library.

69
00:06:04,460 --> 00:06:09,080
Remember that when you're a programmer building a library you have full control over how it works.

70
00:06:09,410 --> 00:06:14,330
So you get to decide what the conventions are, how you want to structure the data, and so forth.

71
00:06:16,410 --> 00:06:21,990
It turns out that some developers decided to put the color channel first while other developers decided

72
00:06:21,990 --> 00:06:23,970
to put the color channel last.

73
00:06:24,390 --> 00:06:29,740
In our case, libraries such as Theano and PyTorch decided to put the color channel first.

74
00:06:29,940 --> 00:06:35,490
Libraries like OpenCV, TensorFlow, Matplotlib, and Pillow use the convention that the color channel

75
00:06:35,490 --> 00:06:36,980
comes last,

76
00:06:37,320 --> 00:06:43,560
although OpenCV does other weird things, like using the BGR color ordering instead of RGB. So it seems

77
00:06:43,560 --> 00:06:46,670
that putting the color channel last has now become more common.

78
00:06:47,190 --> 00:06:50,300
But PyTorch was built with the color channel first.

79
00:06:50,340 --> 00:06:54,430
We can't just switch it around one day and expect everyone to update their code.

80
00:06:54,570 --> 00:06:59,680
Therefore, PyTorch still uses the convention that the color channel comes first.

81
00:06:59,730 --> 00:07:01,220
This detail is hidden from you

82
00:07:01,260 --> 00:07:08,390
if you use utilities such as those from the torchvision library. Since we haven't worked with color

83
00:07:08,390 --> 00:07:13,820
images yet in this course, we haven't had the opportunity to uncover these hidden details, but now we

84
00:07:13,820 --> 00:07:14,750
do.

85
00:07:15,080 --> 00:07:20,870
As you recall, if you inspect the data attribute of a dataset loaded in from torchvision, it gives you

86
00:07:20,870 --> 00:07:23,640
the untransformed version of the data.

87
00:07:23,660 --> 00:07:30,140
These will be N by height by width by color. But when you yield batches of data from the data loader,

88
00:07:30,470 --> 00:07:35,790
this data is transformed in whatever way is needed before it goes into the neural network.

89
00:07:35,870 --> 00:07:41,390
And so when you inspect these batches of data yielded by the data loader you will see that they appear

90
00:07:41,390 --> 00:07:48,700
as N by color by height by width. So in fact, PyTorch still uses the convention that the color

91
00:07:48,700 --> 00:07:49,970
channel comes first.

92
00:07:50,110 --> 00:07:55,060
But it comes with interfaces that work with the more commonly used convention that the color channel

93
00:07:55,060 --> 00:07:55,810
comes last.
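
To see the two conventions side by side, here is a small sketch that uses NumPy rather than PyTorch so it runs without a deep learning library installed. The helper name `to_channels_first` and the batch of zeros are made up for illustration. In PyTorch, the analogous reordering on a tensor would be `permute(0, 3, 1, 2)`.

```python
import numpy as np


def to_channels_first(batch):
    """Reorder a batch from N x height x width x color to N x color x height x width."""
    return np.transpose(batch, (0, 3, 1, 2))


# A made-up batch of 8 color images, 32 x 32, with the color channel last.
batch_channels_last = np.zeros((8, 32, 32, 3), dtype=np.float32)
batch_channels_first = to_channels_first(batch_channels_last)

print(batch_channels_last.shape)   # (8, 32, 32, 3)
print(batch_channels_first.shape)  # (8, 3, 32, 32)
```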