1
00:00:00,330 --> 00:00:08,940
Welcome to this section on modeling. In this section, we'll work with the CSRNet model, which,

2
00:00:08,940 --> 00:00:16,850
when given an input image and an output density map, can be used to train a people-counting model.

3
00:00:16,860 --> 00:00:23,400
This model is based on the paper Dilated Convolutional Neural Networks for Understanding the Highly

4
00:00:23,400 --> 00:00:34,590
Congested Scenes by Yuhong Li, Xiaofan Zhang, and Deming Chen. In this paper, they propose CSRNet

5
00:00:34,590 --> 00:00:41,670
to provide a data-driven, deep learning method that can understand highly congested scenes and perform

6
00:00:41,670 --> 00:00:47,460
accurate count estimation as well as present high quality density maps.

7
00:00:48,270 --> 00:00:55,680
The proposed CSRNet is composed of two major components: a convolutional neural network as the front

8
00:00:55,680 --> 00:00:58,050
end for 2D feature extraction.

9
00:00:58,050 --> 00:01:06,150
This is what we generally call our base model, and a dilated CNN for the back end, which uses dilated

10
00:01:06,150 --> 00:01:14,790
kernels to deliver larger receptive fields and to replace pooling operations. CSRNet is easy to train

11
00:01:14,790 --> 00:01:19,920
because of its purely convolutional structure. The backbone of CSRNet

12
00:01:20,070 --> 00:01:28,890
is the VGG-16 model, and this was picked because of its flexible architecture for easily concatenating the

13
00:01:28,890 --> 00:01:29,610
back end

14
00:01:29,610 --> 00:01:37,290
for density map generation, which we shall look at shortly, and also because of its strong transfer learning

15
00:01:37,290 --> 00:01:38,130
ability.

16
00:01:39,540 --> 00:01:44,490
Looking at this configuration right here, we have four different possibilities.

17
00:01:44,490 --> 00:01:51,390
That's A, B, C, and D, and later on they choose B because of the results they obtained right

18
00:01:51,390 --> 00:01:51,790
here.

19
00:01:51,810 --> 00:01:56,820
That is, B had the lowest mean absolute error and mean squared error.

20
00:01:57,000 --> 00:02:00,720
Anyway, we have these four possibilities: A, B, C, and D.

21
00:02:00,720 --> 00:02:03,470
Initially all of them have the same kind of input.

22
00:02:03,480 --> 00:02:07,170
In this case it's 768 by 1024.

23
00:02:07,170 --> 00:02:12,930
Then we have our base model, which is fine-tuned from VGG-16.

24
00:02:12,930 --> 00:02:17,400
Here, we remove the top layers so we're left with the first layers.

25
00:02:17,640 --> 00:02:23,970
Now, to see exactly the layers from VGG-16 which we shall use for feature extraction, we have

26
00:02:23,970 --> 00:02:24,600
this here.

27
00:02:24,600 --> 00:02:29,160
These two three-by-three conv layers with 64 channels.

28
00:02:29,160 --> 00:02:37,140
We have a max pool, we have these two three-by-three conv layers with 128 channels, max pool, this one, max

29
00:02:37,140 --> 00:02:46,080
pool, three by three, 512, and note that the dilation rate equals one all throughout the VGG model.

30
00:02:46,080 --> 00:02:50,910
So basically this is our base model, which is the VGG. Now, coming to the back end.

31
00:02:50,910 --> 00:02:52,740
That's what we call the back end here.

32
00:02:54,110 --> 00:02:55,470
We have these four.

33
00:02:55,490 --> 00:02:56,780
Let's take the case

34
00:02:56,810 --> 00:03:04,370
B. We have this three-by-three conv layer with 512 channels and a dilation rate of two.
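As a rough aside on what a dilation rate of two means for the receptive field: a k-by-k kernel with dilation d covers k + (k - 1)(d - 1) input positions along each axis. The helper name below is hypothetical, chosen only for illustration.
```python
def effective_kernel_size(kernel_size: int, dilation_rate: int) -> int:
    """Span, in input pixels, covered by one side of a dilated kernel."""
    # (dilation_rate - 1) gaps sit between each pair of adjacent kernel taps.
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)
# Dilation 1 is an ordinary 3x3 convolution; dilation 2 widens the span
# while the kernel still has only 3 * 3 = 9 weights per channel pair.
print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))  # 3 5
```
So each back-end layer in configuration B sees a 5-by-5 neighbourhood without any pooling.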

35
00:03:05,000 --> 00:03:13,400
Also notice how easy it is for us to go from this output here and then continue with those back

36
00:03:13,400 --> 00:03:14,510
end layers.

37
00:03:15,220 --> 00:03:21,670
Now, after going through these back-end layers, we finish up with this one-by-one conv layer with

38
00:03:21,670 --> 00:03:24,610
one channel and a dilation rate of one.

39
00:03:25,210 --> 00:03:30,850
And then we notice that we have this input, and the output needs to be one eighth

40
00:03:30,850 --> 00:03:33,040
of this input.

41
00:03:33,040 --> 00:03:38,530
So we note that going through this first max pooling divides our input by two.

42
00:03:38,560 --> 00:03:40,780
That is, the input dimension is divided by two.

43
00:03:40,780 --> 00:03:41,860
Then when we go to this

44
00:03:41,860 --> 00:03:43,420
next one, we divide it again by two.

45
00:03:43,420 --> 00:03:48,280
So now we've divided by four with respect to the input, and then this one divides it by eight with respect

46
00:03:48,280 --> 00:03:48,820
to the input.

47
00:03:48,820 --> 00:03:54,670
So this is how we ensure that we actually get an output which is eight times

48
00:03:55,730 --> 00:03:58,220
smaller than the input right here.
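The arithmetic above can be checked with a small sketch, assuming each max pool is the usual 2-by-2 pool with stride 2 (the function name is illustrative, not taken from the course code).
```python
def output_size(height: int, width: int, num_pools: int = 3) -> tuple:
    """Spatial size left after a number of stride-2 max pools."""
    # Each pool halves both spatial dimensions.
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return height, width
# Three pools give a total downsampling factor of 2 * 2 * 2 = 8.
print(output_size(768, 1024))  # (96, 128)
```
This is why the density map for a 768 by 1024 image comes out at 96 by 128.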

49
00:03:58,760 --> 00:04:04,430
Now, looking at the code, we are going to fine-tune our VGG-16 model, which has been trained on

50
00:04:04,430 --> 00:04:05,180
ImageNet.

51
00:04:05,210 --> 00:04:09,450
We have the input, and in fact we have the input shape right here.

52
00:04:09,470 --> 00:04:12,050
We don't include the top, so let's run this.

53
00:04:13,400 --> 00:04:17,150
Here are the results we obtain for our VGG-16 model. We'll see that

54
00:04:17,300 --> 00:04:23,390
it's a 14 million parameter model which takes in this input and then goes right up to this

55
00:04:23,390 --> 00:04:25,290
because we don't include the top.

56
00:04:25,310 --> 00:04:33,800
Now, since we have an output like this, we notice that we shall actually use these

57
00:04:33,800 --> 00:04:38,930
first layers right up to this one right here as given in the paper.

58
00:04:38,930 --> 00:04:41,300
Because in the paper we have these first two,

59
00:04:41,330 --> 00:04:42,530
max pool, these two,

60
00:04:42,530 --> 00:04:45,910
max pool, these three, max pool, these three.

61
00:04:45,920 --> 00:04:53,480
So we won't consider all these layers at the end. Strictly, these aren't

62
00:04:53,480 --> 00:04:58,160
the last layers of VGG-16, because we've already decided not to include the top right here.

63
00:04:58,250 --> 00:05:01,910
Now what we'll do is we shall copy out this name.

64
00:05:01,910 --> 00:05:04,850
So we take this block4_conv3.

65
00:05:06,990 --> 00:05:08,910
And then it shall be used now.

66
00:05:08,910 --> 00:05:16,800
So instead of outputting the whole model, we shall have this model output only at the level

67
00:05:16,800 --> 00:05:18,750
of this block4_conv3 right here.

68
00:05:18,750 --> 00:05:24,200
So we have this block4_conv3 layer, which we pick out, and then we have it as our output.

69
00:05:24,210 --> 00:05:26,130
So that's it.

70
00:05:26,130 --> 00:05:32,130
We now return this model, which takes the inputs and returns the output at this point,

71
00:05:32,130 --> 00:05:34,500
block4_conv3, right here.

72
00:05:34,530 --> 00:05:38,550
Now let's rerun it and see what we obtain.

73
00:05:38,550 --> 00:05:43,380
Now we see we have this suitable output right here which matches what we had in the paper.
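A minimal sketch of the truncation just described, assuming TensorFlow's Keras API. The transcript does not show the code itself, so the function name `build_base_model` and its `weights` argument are illustrative assumptions.
```python
import tensorflow as tf
def build_base_model(input_shape=(768, 1024, 3), weights="imagenet"):
    """VGG-16 front end cut at block4_conv3 (three max pools applied)."""
    # include_top=False drops the fully connected classifier layers.
    vgg = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=input_shape
    )
    # Keep everything up to block4_conv3; later pools and block5 are unused.
    features = vgg.get_layer("block4_conv3").output
    # Layers stay trainable (the Keras default), so VGG-16 is fine-tuned.
    return tf.keras.Model(inputs=vgg.input, outputs=features, name="base_model")
```
Calling `build_base_model().summary()` should report an output of shape (None, 96, 128, 512); passing `weights=None` skips the ImageNet download when only the shapes matter.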

74
00:05:43,920 --> 00:05:53,160
Also recall that, following the paper, we are fine-tuning the VGG-16 model, so we are not setting these layers

75
00:05:53,160 --> 00:05:57,320
to non-trainable, so we don't freeze these first layers right here.

76
00:05:57,330 --> 00:06:06,210
Now that we have this base model, we can go ahead and add the last layers, which in the paper

77
00:06:06,210 --> 00:06:07,710
they call the back end.

78
00:06:08,130 --> 00:06:16,920
We now define these layers, where we have our input in x. In y, we get the result of passing the inputs

79
00:06:16,920 --> 00:06:19,530
into our base model, which is what we've just defined.

80
00:06:19,530 --> 00:06:26,820
We have this random initialization, which is also from the paper, where they propose a standard

81
00:06:26,820 --> 00:06:29,520
deviation of 0.01.

82
00:06:30,000 --> 00:06:36,390
Then, following what's given in the paper, we have 512 channels, a three-by-three kernel,

83
00:06:36,390 --> 00:06:41,730
ReLU activation, and dilation rate two. We are using configuration B, so that's why we have these dilation rates

84
00:06:41,730 --> 00:06:42,990
of 2, 2, 2.

85
00:06:42,990 --> 00:06:49,260
And then finally, this is one, and the padding is "same" so that we maintain this dimension of 96

86
00:06:49,260 --> 00:06:50,550
by 128.

87
00:06:50,580 --> 00:06:53,940
Now we could run this and there we go.

88
00:06:53,940 --> 00:06:55,050
That's what we have.

89
00:06:55,170 --> 00:06:56,390
So that's fine.

90
00:06:56,400 --> 00:07:02,910
Now note that we also have to do this reshape, because after going through this conv layer, we

91
00:07:02,910 --> 00:07:05,610
have this 96 by 128 by one.

92
00:07:05,610 --> 00:07:08,670
So we reshape this to obtain this value right here.
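A hedged sketch of the back end described here, assuming TensorFlow's Keras API. The channel widths follow configuration B in the CSRNet paper; the function name, the stand-in input, and the (96, 128) reshape target are assumptions, since the transcript does not show the actual code.
```python
import tensorflow as tf
def build_back_end(feature_shape=(96, 128, 512)):
    """Configuration-B back end applied to a block4_conv3-sized feature map."""
    # Stand-in input; in the full model this is the base model's output.
    features = tf.keras.Input(shape=feature_shape)
    init = tf.keras.initializers.RandomNormal(stddev=0.01)
    x = features
    # Six 3x3 convolutions, all with dilation rate 2 and "same" padding.
    for filters in (512, 512, 512, 256, 128, 64):
        x = tf.keras.layers.Conv2D(
            filters, 3, dilation_rate=2, padding="same",
            activation="relu", kernel_initializer=init,
        )(x)
    # 1x1 convolution with one channel and dilation rate 1.
    x = tf.keras.layers.Conv2D(1, 1, padding="same", kernel_initializer=init)(x)
    # Drop the trailing channel axis: (96, 128, 1) -> (96, 128).
    density_map = tf.keras.layers.Reshape(feature_shape[:2])(x)
    return tf.keras.Model(features, density_map, name="back_end")
```
Because every layer uses "same" padding and no pooling, the 96 by 128 spatial size is preserved from input to output.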

93
00:07:08,700 --> 00:07:11,820
Now, once we have this, we are ready to train our model.
