1
00:00:00,330 --> 00:00:04,530
Welcome to the section on people counting with TensorFlow.

2
00:00:04,920 --> 00:00:14,310
This task involves taking in an image like this one and then automatically counting the number of people

3
00:00:14,310 --> 00:00:23,370
in this image, or in this frame from a video feed, which in this case gives us a total of 54 people.

4
00:00:24,150 --> 00:00:31,530
So this means that in this case, our dataset is made up of inputs, which are images of this kind, and

5
00:00:31,530 --> 00:00:35,810
then outputs, which contain these points here.

6
00:00:35,820 --> 00:00:45,270
So this is a sample output, where we have all the positions in this input image where we have the head

7
00:00:45,270 --> 00:00:46,680
of a person found.

8
00:00:47,990 --> 00:00:52,370
The dataset I'll be working with is the ShanghaiTech dataset.

9
00:00:53,270 --> 00:00:59,750
The ShanghaiTech dataset is made up of two main parts, Part A and Part B. In Part

10
00:00:59,750 --> 00:01:00,010
A,

11
00:01:00,020 --> 00:01:11,030
we have very densely packed people in an image, whereas Part B contains images where the

12
00:01:11,030 --> 00:01:14,660
people are not very close to each other.

13
00:01:14,660 --> 00:01:15,860
So let's look at Part

14
00:01:15,890 --> 00:01:20,690
A. Here we have the training and the test data, each with its ground truth.

15
00:01:21,500 --> 00:01:23,000
We have the images here.

16
00:01:23,000 --> 00:01:28,940
So here we have 182 images and 182 ground truth files.

17
00:01:28,970 --> 00:01:31,760
We also have the training data, the images,

18
00:01:32,610 --> 00:01:35,040
300 and then the ground truth.

19
00:01:35,070 --> 00:01:41,100
So what happens here is we have this image, for example, image one right here.

20
00:01:41,100 --> 00:01:46,920
We see that it's a very densely packed image.

21
00:01:47,730 --> 00:01:55,410
And then this ground truth data right here contains all the points in the image where there is a human head.

22
00:01:56,780 --> 00:02:03,440
If we enlarge those images, we see that we have these very crowded images.

23
00:02:03,440 --> 00:02:11,780
So we see that this dataset can be used to train models which could, for example, in the case

24
00:02:11,780 --> 00:02:17,570
where we have a drone at an event, count or

25
00:02:18,980 --> 00:02:29,090
predict the number of people who are at the event just by taking an aerial photo of the event.

26
00:02:29,090 --> 00:02:36,410
So you see the kind of images we have. Now for Part B, we have less crowded images.

27
00:02:36,410 --> 00:02:37,820
We can see that they're just

28
00:02:38,450 --> 00:02:48,200
footage from CCTV cameras of people moving across the streets, maybe going shopping or going about their various

29
00:02:48,200 --> 00:02:49,220
occupations.

30
00:02:49,520 --> 00:02:56,240
Whereas in Part A, we have very large crowds, showing that people have actually come

31
00:02:56,240 --> 00:02:57,770
together for an event.

32
00:02:58,130 --> 00:03:01,700
You could get this dataset via this link right here.

33
00:03:02,240 --> 00:03:09,890
Now, once we have this input as an image and then we have this output, which shows us exactly where

34
00:03:09,890 --> 00:03:17,270
the different people are found in the image, the first idea that comes to mind would be maybe

35
00:03:17,270 --> 00:03:23,630
to take that input and then train end to end with these output points.

36
00:03:23,840 --> 00:03:30,740
But generally in people counting, we don't use this output like this.

37
00:03:30,740 --> 00:03:35,450
We actually transform this output into a Gaussian map.

38
00:03:35,450 --> 00:03:38,150
Now a Gaussian map looks like this.

39
00:03:38,630 --> 00:03:42,080
This here is what we obtain when we have many points.

40
00:03:42,080 --> 00:03:50,370
But now if we have just one point, that is, just one person, the map we obtain will look like this.

41
00:03:50,570 --> 00:03:56,990
Now, instead of focusing on just the one pixel where the person's head is found and giving it the value

42
00:03:56,990 --> 00:03:57,620
of one,

43
00:03:57,620 --> 00:04:07,010
what we do is we draw some sort of circle around that person's head, where the sum of all the points,

44
00:04:07,010 --> 00:04:07,280
or rather

45
00:04:07,280 --> 00:04:10,700
the different values we give them, will be equal to one.

46
00:04:11,120 --> 00:04:19,490
So this means that that point with the value one has been scattered all around that circle, such that

47
00:04:19,490 --> 00:04:23,420
all those different values together now sum to the value

48
00:04:23,420 --> 00:04:30,680
one, unlike the previous situation where we have just one point, which has the value one, and all the

49
00:04:30,680 --> 00:04:33,100
surroundings have the value zero.

50
00:04:33,110 --> 00:04:40,430
In this case, we have that one point which may take a value of, say, 0.2, the points around

51
00:04:40,430 --> 00:04:46,010
it 0.1, those further out 0.05, 0.005, and so on and so forth.
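
To make this idea of sharing the value one around the head concrete, here is a minimal sketch in NumPy. The function name `gaussian_kernel`, the 7 by 7 window, and the sigma of 1.5 are illustrative assumptions of ours, not values from the lesson. It just shows that the center gets the largest value, the neighbors get smaller ones, and everything adds up to one.

```python
import numpy as np

def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Return a (size x size) Gaussian window whose values sum to 1.

    `size` is assumed to be odd so that the head sits on the center pixel.
    """
    half = size // 2
    ax = np.arange(-half, half + 1)           # pixel offsets from the head
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()              # share the single "1" over all pixels

kernel = gaussian_kernel(size=7, sigma=1.5)   # illustrative values only
print("center value:", kernel[3, 3])
print("corner value:", kernel[0, 0])
print("sum of all values:", kernel.sum())
```

The normalization on the last line of the function is what guarantees the sum of one, whatever window size or sigma we pick.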

52
00:04:46,340 --> 00:04:52,550
But what's important to note is that the sum of all those gives us a value of one such that when we

53
00:04:52,550 --> 00:04:59,930
have many of them, like in this case, we see that wherever we have so many people, we have this

54
00:04:59,930 --> 00:05:00,620
yellow color,

55
00:05:00,620 --> 00:05:08,120
telling us that these circles that we have here have somehow come together, because if we have,

56
00:05:08,120 --> 00:05:14,210
say, let's take these two people here, if we have these two people together, this person's own circle

57
00:05:14,210 --> 00:05:21,440
together with this other person's own circle will tend to form a region where the

58
00:05:22,330 --> 00:05:28,600
values are higher than in the situation where we have these two people who are a bit separated from each

59
00:05:28,600 --> 00:05:29,140
other.

60
00:05:29,380 --> 00:05:33,550
But in the case where we were using the raw point maps, we wouldn't have this,

61
00:05:33,550 --> 00:05:35,590
because this point will have a value of

62
00:05:35,590 --> 00:05:38,800
one, and all those around it will have value zero.

63
00:05:38,800 --> 00:05:44,020
This point has value one, with everything around it having value zero; this point one, this point one,

64
00:05:44,020 --> 00:05:46,000
with everything around it having value zero.
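
The difference between the raw point map and the spread-out map can be sketched along a single row of pixels. This is only an illustration under our own assumptions: the helper `blob_1d`, the row length of 100, the head positions, and the sigma of 3 are all made up for the example.

```python
import numpy as np

def blob_1d(length: int, center: int, sigma: float) -> np.ndarray:
    """One head spread along a row of pixels, with values summing to 1."""
    x = np.arange(length)
    g = np.exp(-0.5 * ((x - center) / sigma) ** 2)
    return g / g.sum()

length, sigma = 100, 3.0                      # illustrative values only

# Raw point map: value 1 at each head, 0 everywhere else.
point_row = np.zeros(length)
point_row[[48, 52]] = 1.0

# Spread-out maps: two heads close together vs. two heads far apart.
close_row = blob_1d(length, 48, sigma) + blob_1d(length, 52, sigma)
apart_row = blob_1d(length, 20, sigma) + blob_1d(length, 80, sigma)

print("point map between the heads:", point_row[50])
print("peak, heads close together:", close_row.max())
print("peak, heads far apart:", apart_row.max())
print("sums:", close_row.sum(), apart_row.sum())
```

Both spread-out rows still sum to two, one per person, but only the row with the heads close together builds up a higher peak where the two circles overlap.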

65
00:05:46,000 --> 00:05:51,310
But with the Gaussian maps, as we can see now, regions surrounding

66
00:05:53,460 --> 00:06:01,290
places where there are many people tend to take higher values as compared to, say,

67
00:06:01,290 --> 00:06:02,970
regions like this one

68
00:06:03,080 --> 00:06:05,670
around these two people here.

69
00:06:06,090 --> 00:06:15,270
It happens that the way we distribute these points around a person's head follows a Gaussian distribution.

70
00:06:15,630 --> 00:06:24,300
Now, although we're using a two-dimensional Gaussian distribution here for the points surrounding the

71
00:06:24,300 --> 00:06:30,930
person, we are going to explain this using this one-dimensional Gaussian distribution.

72
00:06:31,380 --> 00:06:39,810
This x axis represents the position of a pixel, and then the Y axis represents the probability of that

73
00:06:39,810 --> 00:06:41,370
pixel being a head.

74
00:06:41,400 --> 00:06:47,850
So this means that the closer we are to a given head, the higher the probability score.

75
00:06:47,880 --> 00:06:54,060
As you can see right here, for these pixel positions which are around the head, we can consider

76
00:06:54,060 --> 00:06:57,140
the head position to be the mean.

77
00:06:57,150 --> 00:07:00,120
Now this position right here is the mean.

78
00:07:00,120 --> 00:07:10,620
And then we also have a standard deviation, which is like a distance from the mean beyond

79
00:07:10,620 --> 00:07:14,980
which the probability of being a head is very, very low.

80
00:07:15,000 --> 00:07:21,450
So if our standard deviation is, say, at this point here, we have the mean right here, which

81
00:07:21,450 --> 00:07:25,080
is considered to be mu.

82
00:07:25,110 --> 00:07:27,990
As you can see right here, we have mu minus sigma and

83
00:07:27,990 --> 00:07:29,310
mu plus sigma.

84
00:07:29,340 --> 00:07:34,500
And we'll notice that between mu minus sigma and mu plus sigma,

85
00:07:34,500 --> 00:07:43,920
if we cut through the Gaussian bell-shaped curve, we'll see that all values which fall in this

86
00:07:43,920 --> 00:07:48,720
range have a reasonable probability of being a head.

87
00:07:48,720 --> 00:07:53,460
But as soon as we move out of that range, this probability becomes very, very small.

88
00:07:54,120 --> 00:08:00,780
And that's why we see right here that at the actual point where the person's head is found,

89
00:08:00,780 --> 00:08:11,790
we have this very yellow color, and as we move outward it turns greenish and then bluish,

90
00:08:11,790 --> 00:08:13,800
and then finally we get this purple color.

91
00:08:13,830 --> 00:08:22,670
Now, note that the very yellow actually stands for one, and then this purple color stands for zero.

92
00:08:22,680 --> 00:08:29,640
But nonetheless, the peak isn't actually one, since the value of one has been shared among all the

93
00:08:29,640 --> 00:08:31,260
different points surrounding the person.

94
00:08:31,260 --> 00:08:37,860
So here we could have a value of, say, 0.5, and then the values surrounding it could be 0.1 and

95
00:08:37,860 --> 00:08:41,220
then 0.05 and so on and so forth.

96
00:08:42,360 --> 00:08:48,480
Now, to get the details of the Gaussian distribution, you can look at our course on probability.

97
00:08:49,140 --> 00:08:54,630
Now we are going to be generating these kinds of maps for each of the points.

98
00:08:54,660 --> 00:08:58,290
Now let's look at the formula for this probability distribution right here.

99
00:08:58,320 --> 00:09:04,830
This probability distribution is defined such that we have P of x, where x is the position, equal

100
00:09:04,830 --> 00:09:10,570
to one divided by the square root of two pi sigma squared, where sigma is the standard deviation.

101
00:09:10,590 --> 00:09:19,740
Now note that the standard deviation can be defined such that this region within which we have this

102
00:09:19,740 --> 00:09:29,550
reasonable probability score of being a head, of being a person, is restricted to only positions which

103
00:09:29,550 --> 00:09:32,010
are very close to the person's head.

104
00:09:33,210 --> 00:09:39,810
So in that sense, mu minus sigma now falls around this position and mu plus sigma falls around this position

105
00:09:39,810 --> 00:09:40,650
right here.

106
00:09:41,280 --> 00:09:47,910
And from this we see clearly that this is going to be a hyperparameter for our model, where we could

107
00:09:47,910 --> 00:09:56,520
vary the value of sigma to see, when we decide to work only with points

108
00:09:56,520 --> 00:09:58,590
very close to the person's head,

109
00:10:00,410 --> 00:10:06,110
or decide to also work with points which are very far from the person's head, that is, generating

110
00:10:06,110 --> 00:10:09,860
maps such that this circle is very large,

111
00:10:10,730 --> 00:10:19,880
how it affects our model's accuracy in predicting whether there is a person in that position

112
00:10:19,880 --> 00:10:26,360
or not, or better still, in counting the number of people who are located in that image.
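
To get a feel for what varying sigma does to the map itself, here is a small NumPy sketch. The helper `single_head_map`, the 101 by 101 grid, the three sigma values, and the 1% threshold are assumptions we picked for illustration. It does not measure model accuracy; it only shows how the circle grows and the peak drops as sigma increases, while the total stays at one.

```python
import numpy as np

def single_head_map(shape, head, sigma):
    """Map for one head at (row, col), with values summing to 1."""
    rows, cols = np.indices(shape)
    blob = np.exp(-((rows - head[0]) ** 2 + (cols - head[1]) ** 2) / (2.0 * sigma**2))
    return blob / blob.sum()

results = {}
for sigma in (1.0, 4.0, 8.0):                 # illustrative sigma values
    dmap = single_head_map((101, 101), (50, 50), sigma)
    peak = dmap.max()
    # Count pixels that received a noticeable share (more than 1% of the peak).
    covered = int((dmap > 0.01 * peak).sum())
    results[sigma] = (peak, covered, dmap.sum())
    print(f"sigma={sigma}: peak={peak:.5f}, pixels above 1% of peak={covered}")
```

A small sigma keeps almost all of the score on the pixels right at the head, while a large sigma shares it over a much wider circle.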

113
00:10:27,170 --> 00:10:32,510
So that said, the standard deviation, as its name suggests, is a deviation.

114
00:10:32,510 --> 00:10:39,110
So it tells us how much, or better still, how many points surrounding the head

115
00:10:39,110 --> 00:10:42,860
we decide to share that one score of the head with.

116
00:10:43,730 --> 00:10:45,320
So here we have this.

117
00:10:45,350 --> 00:10:48,160
Now this is what we get here.

118
00:10:48,170 --> 00:10:55,720
times e to the negative half of x minus mu, where mu is the mean, divided by sigma,

119
00:10:55,730 --> 00:10:56,780
all of that squared.

120
00:10:57,200 --> 00:11:00,610
We are going to consider mu to be equal to zero.

121
00:11:00,620 --> 00:11:05,960
So the exponent is going to be minus half of x divided by sigma, all of that squared.

122
00:11:06,800 --> 00:11:12,560
And since the Gaussian distribution is a probability distribution, more specifically a continuous probability

123
00:11:12,560 --> 00:11:18,540
distribution, the integral from negative infinity to positive infinity of P of x dx equals one.

124
00:11:18,560 --> 00:11:25,160
Now this comes back to the fact that we are having a point here at the center which we are sharing across

125
00:11:25,160 --> 00:11:27,230
all the different points surrounding it.

126
00:11:27,230 --> 00:11:32,080
And that's why, when we compute the integral here, we get a value of one.

127
00:11:32,090 --> 00:11:37,100
Now recall this integral is just like finding the area under this curve right here.

128
00:11:37,100 --> 00:11:43,010
So you see that portion in orange here; this area should always be equal to one.
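
Written out, the formula and the area property described above look like this. This is the standard one-dimensional Gaussian density, with x as the pixel position, mu as the mean, and sigma as the standard deviation, as in the spoken explanation.

```latex
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,
       \exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)

% with \mu = 0:
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,
       \exp\!\left(-\frac{1}{2}\left(\frac{x}{\sigma}\right)^{2}\right),
\qquad
\int_{-\infty}^{\infty} p(x)\,dx = 1
```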

129
00:11:43,790 --> 00:11:50,600
And then coming back to the image, this comes to the fact that the sum of all these different pixel

130
00:11:50,600 --> 00:11:54,500
values surrounding this head should be equal to one.

131
00:11:55,400 --> 00:12:02,060
And that said, when we have so many different positions like this, if we sum all the pixel values,

132
00:12:02,060 --> 00:12:04,490
each of them should give us a value of one.
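
Since each head contributes a total of one, summing the whole map gives back the number of people. Here is a minimal NumPy sketch of that, assuming head annotations given as (row, column) pairs. The function name `heads_to_density_map`, the image size, the sigma of 4, and the coordinates are all made up for illustration and are not taken from the ShanghaiTech files.

```python
import numpy as np

def heads_to_density_map(shape, heads, sigma=4.0):
    """Add one Gaussian per annotated head, each normalized to sum to 1."""
    rows, cols = np.indices(shape)
    dmap = np.zeros(shape, dtype=np.float64)
    for r, c in heads:
        blob = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma**2))
        dmap += blob / blob.sum()             # each head contributes exactly 1
    return dmap

# Hypothetical head positions (row, col) in a 96 x 128 image.
heads = [(20, 30), (22, 33), (60, 90), (75, 15), (40, 64)]
dmap = heads_to_density_map((96, 128), heads)

print("annotated heads:", len(heads))
print("sum of the density map:", dmap.sum())
```

Normalizing each blob on the image grid is a design choice in this sketch: it keeps the count exact even for heads near the image border, where part of the circle would otherwise fall outside the image.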