1
00:00:11,940 --> 00:00:16,410
In this lecture we are going to discuss the theory behind the facial recognition approach

2
00:00:16,410 --> 00:00:19,550
we are going to take in this section. At this point,

3
00:00:19,560 --> 00:00:24,930
I hope you have given some thought as to how you might approach facial recognition if you had to create

4
00:00:24,930 --> 00:00:27,020
your own implementation.

5
00:00:27,030 --> 00:00:32,430
I think this is a worthwhile exercise, especially if you've never seen any facial recognition approaches

6
00:00:32,430 --> 00:00:33,760
before.

7
00:00:33,900 --> 00:00:39,390
This is kind of a one-time opportunity, because once you do learn about how facial recognition is currently

8
00:00:39,390 --> 00:00:41,870
done, it will taint your thinking.

9
00:00:41,910 --> 00:00:48,420
So take this time to think about how you would approach facial recognition without any external influences.

10
00:00:53,520 --> 00:00:53,840
OK.

11
00:00:53,840 --> 00:00:59,360
So now that you've had the chance to think about how you might approach facial recognition, let me describe

12
00:00:59,360 --> 00:01:02,960
to you what most might consider a good first crack at this problem.

13
00:01:04,110 --> 00:01:08,220
Facial recognition seems like nothing but image classification.

14
00:01:08,220 --> 00:01:13,510
Suppose, for example, our school has 100 students and, given an image of a face,

15
00:01:13,590 --> 00:01:19,560
we want our neural network to tell us which student that is. So we might build a CNN with

16
00:01:19,570 --> 00:01:27,620
101 classes, where the extra class would be for images of faces which are not one of the 100 students.

17
00:01:27,700 --> 00:01:32,040
Now this is not necessarily a bad approach but there are some issues.

18
00:01:32,080 --> 00:01:37,500
For example, what if one student leaves the school or another student joins the school?

19
00:01:37,510 --> 00:01:43,600
Now we would need to build a new classifier. Or what if we need our neural network to recognize millions

20
00:01:43,600 --> 00:01:48,880
of people? Having a neural network with millions of output classes is generally not a good idea.

21
00:01:54,030 --> 00:01:58,980
In this section we are going to take an approach similar to binary classification.

22
00:01:59,010 --> 00:02:04,470
Suppose we have some neural network where we can feed in two images. The output of the neural network

23
00:02:04,620 --> 00:02:06,500
is just a binary decision.

24
00:02:06,570 --> 00:02:09,090
Are these two images of the same person or not?

25
00:02:14,170 --> 00:02:18,640
The architecture we will use for this approach is called a Siamese network.

26
00:02:18,640 --> 00:02:22,540
This is of course named after the famous Siamese twins.

27
00:02:22,540 --> 00:02:28,090
The idea is this: we are going to have two CNNs with the exact same shared weights.

28
00:02:28,090 --> 00:02:29,920
Hence they are twins.

29
00:02:29,950 --> 00:02:35,590
Neither of these CNNs is a classifier, so they don't have any classification head and they don't end

30
00:02:35,590 --> 00:02:38,070
in a softmax.

31
00:02:38,400 --> 00:02:44,310
Instead, the final layer is just a linear regression with no activation, which just outputs some feature

32
00:02:44,310 --> 00:02:44,730
vector.

33
00:02:49,820 --> 00:02:56,660
Here's what we'd like these CNNs to do. If we input two images of the same person, even if they are different

34
00:02:56,660 --> 00:03:01,050
images, we should get feature vectors that are very close to one another.

35
00:03:01,220 --> 00:03:07,250
So the CNN would have to be trained to ignore things like lighting conditions, different angles, different

36
00:03:07,250 --> 00:03:08,570
face sizes.

37
00:03:08,570 --> 00:03:11,600
Maybe one day you are wearing glasses and one day you are not.

38
00:03:11,630 --> 00:03:13,200
And so forth.

39
00:03:13,220 --> 00:03:19,190
In other words, the CNN has to learn the features that are actually distinctive to your face, not other

40
00:03:19,190 --> 00:03:20,540
differences in images.

41
00:03:25,650 --> 00:03:31,990
On the other hand, if we input two images of different people, the feature vectors should be far apart.

42
00:03:32,190 --> 00:03:37,590
So even if the two images have the same lighting conditions, are at the same angle, have the same size,

43
00:03:37,590 --> 00:03:43,800
and so on, the CNN should learn how to distinguish those two faces using the actual features of those

44
00:03:43,800 --> 00:03:44,460
faces.

45
00:03:49,590 --> 00:03:51,940
You can think of this in terms of word embeddings.

46
00:03:52,470 --> 00:03:59,190
We know that after training an algorithm like word2vec or GloVe, we can map words into a vector space.

47
00:03:59,190 --> 00:04:01,170
Similar words appear close to one another,

48
00:04:01,260 --> 00:04:03,920
while dissimilar words appear far apart.

49
00:04:04,140 --> 00:04:10,920
For example, fish should be close to carp and laundry should be close to dryer, whereas fish should be

50
00:04:10,920 --> 00:04:17,330
far away from laundry. In the same way, the Siamese network outputs face embeddings.

51
00:04:17,580 --> 00:04:21,720
These are simply a mapping from an image of a face to a feature vector.

52
00:04:26,890 --> 00:04:27,600
At this point,

53
00:04:27,610 --> 00:04:32,320
you may be wondering: once we've converted both input faces into feature vectors,

54
00:04:32,350 --> 00:04:34,170
what do we do with them?

55
00:04:34,210 --> 00:04:38,400
I said earlier that we would do something like binary classification.

56
00:04:38,500 --> 00:04:43,760
Well in fact there's nothing stopping us from doing exactly binary classification.

57
00:04:44,140 --> 00:04:49,060
So let's say we take these two feature vectors and then we calculate the absolute difference between

58
00:04:49,060 --> 00:04:51,880
them to give us a new vector.

59
00:04:51,880 --> 00:04:56,610
Then we simply pass this new vector z through a logistic regression.

60
00:04:57,010 --> 00:05:02,260
In this way we are doing exactly binary classification.

61
00:05:02,290 --> 00:05:06,270
Now you might wonder: why should we take the absolute difference?

62
00:05:06,280 --> 00:05:12,720
Well, this is because if we input image 1 into the first neural network and image 2 into the second

63
00:05:12,770 --> 00:05:15,830
neural network, it should give us the same feature vector

64
00:05:15,910 --> 00:05:22,660
if we reversed the images, so that we input image 2 into the first neural network and image 1 into

65
00:05:22,660 --> 00:05:24,690
the second neural network.

66
00:05:24,700 --> 00:05:30,550
The difference between these two images should be the same in both cases, and thus the feature vector

67
00:05:30,820 --> 00:05:31,950
must be symmetric.
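
The symmetry argument above can be checked with a small NumPy sketch. It assumes the two embeddings have already been computed by the shared network; the names `pair_features`, `match_probability`, `w` and `b` are hypothetical and the weights are random rather than trained.

```python
import numpy as np

def pair_features(f1, f2):
    # Absolute difference of the two embeddings: |f1 - f2| == |f2 - f1|.
    return np.abs(f1 - f2)

def match_probability(f1, f2, w, b):
    # Logistic regression on the symmetric pair feature vector z.
    z = pair_features(f1, f2)
    return 1.0 / (1.0 + np.exp(-(z @ w + b)))

# Stand-in embeddings and untrained weights (assumptions for illustration).
rng = np.random.default_rng(0)
f1 = rng.normal(size=4)
f2 = rng.normal(size=4)
w = rng.normal(size=4)
b = 0.1

p_forward = match_probability(f1, f2, w, b)
p_reverse = match_probability(f2, f1, w, b)  # same value as p_forward
```

Using the raw difference `f1 - f2` instead would flip sign when the inputs are swapped, so the prediction could change with the order of the images.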

68
00:05:37,140 --> 00:05:42,420
Another option here is to build the feature vector using something called the chi-square similarity,

69
00:05:42,870 --> 00:05:45,660
which uses the formula you see here.

70
00:05:45,660 --> 00:05:51,000
Basically, it's just a different way to calculate the features. You should confirm to yourself that the

71
00:05:51,000 --> 00:05:53,820
chi-square similarity is also symmetric.

72
00:05:59,020 --> 00:06:03,740
But for our coding exercises, I would actually like to show you a different approach.

73
00:06:03,850 --> 00:06:09,640
This uses what is called the contrastive loss, and it accomplishes exactly what our intuition wants us

74
00:06:09,640 --> 00:06:10,930
to do.

75
00:06:11,200 --> 00:06:17,410
Remember, the idea is that images of the same person should have feature vectors which are close together,

76
00:06:18,010 --> 00:06:21,880
whereas images of different people should have features which are far apart.

77
00:06:22,720 --> 00:06:27,670
So how can we push the feature vectors closer together or further apart?

78
00:06:27,670 --> 00:06:34,970
The answer is the contrastive loss.

79
00:06:35,070 --> 00:06:36,970
So how does the contrastive loss work?

80
00:06:37,710 --> 00:06:40,120
Well it helps to look at the formula.

81
00:06:40,260 --> 00:06:45,480
Basically, it looks like what you would get if you combined the squared error with the binary cross-entropy,

82
00:06:46,230 --> 00:06:51,240
which kind of makes sense, because this is a binary problem but we are also working with continuous feature

83
00:06:51,240 --> 00:06:52,960
vectors.

84
00:06:53,010 --> 00:06:57,790
First let's talk about how it works and then we'll talk about why it works.

85
00:06:58,140 --> 00:07:00,000
So let's discuss all the variables.

86
00:07:00,000 --> 00:07:06,460
What is D? That's the Euclidean distance between the two feature vectors, or the two embeddings.

87
00:07:06,480 --> 00:07:12,730
You can also think of this as the L2 norm of f of x1 minus f of x2.

88
00:07:12,790 --> 00:07:21,900
Next, what is y? y is the target: it's 1 if x1 and x2 are images of the same person, and 0 otherwise.

89
00:07:22,990 --> 00:07:27,680
Next, what is m? m is what we refer to as the margin.

90
00:07:27,850 --> 00:07:30,490
This is the key ingredient of this loss.

91
00:07:30,520 --> 00:07:35,650
We'll discuss this more when we talk about why this loss works. Finally,

92
00:07:35,670 --> 00:07:41,910
note that this loss is specified for a single training sample, which consists of a pair of images x1

93
00:07:41,910 --> 00:07:43,250
and x2.

94
00:07:43,380 --> 00:07:47,070
So if we have multiple training samples, then we would just take the mean of this.
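
As a rough sketch, the variables just described can be put together in NumPy as follows. The formula itself is on the slide and not in this transcript, so the squared hinge on the second term is an assumption based on the common form of the contrastive loss; the function name `contrastive_loss` and the example embeddings are hypothetical.

```python
import numpy as np

def contrastive_loss(f1, f2, y, m=1.0):
    # D: Euclidean (L2) distance between each pair of embeddings.
    d = np.linalg.norm(f1 - f2, axis=-1)
    # First term: active when y == 1 (same person), pulls the pair together.
    same_term = y * d ** 2
    # Second term: active when y == 0 (different people), pushes the pair
    # apart until D reaches the margin m. Squaring it is an assumption.
    diff_term = (1 - y) * np.maximum(m - d, 0.0) ** 2
    # Mean over all training pairs.
    return np.mean(same_term + diff_term)

# Three example pairs with distances 0.5, 0.5 and 5.0.
f1 = np.zeros((3, 2))
f2 = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
y = np.array([1.0, 0.0, 0.0])

loss = contrastive_loss(f1, f2, y, m=1.0)
```

The third pair is a non-match that is already farther apart than the margin, so it contributes nothing to the loss.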

95
00:07:52,190 --> 00:07:52,650
Okay.

96
00:07:52,670 --> 00:07:57,450
So now that you know what each of the variables in the loss means, why does it actually work?

97
00:07:58,510 --> 00:08:01,030
Let's look at the first term since it's a little simpler.

98
00:08:01,030 --> 00:08:03,630
It's y times D squared.

99
00:08:03,700 --> 00:08:09,880
Remember that y can only be one or zero. It's one when the two images are of the same person.

100
00:08:10,750 --> 00:08:17,170
Therefore when both images are of the same person only the first term matters and we want D squared

101
00:08:17,170 --> 00:08:18,550
to be small.

102
00:08:18,550 --> 00:08:28,030
We want to minimize D squared if both images are of the same person.

103
00:08:28,100 --> 00:08:33,550
The other scenario is when we have images of different people. When y is 0,

104
00:08:33,650 --> 00:08:35,960
only the second term matters.

105
00:08:36,020 --> 00:08:44,970
Notice how we are taking the max of m minus D and 0. Basically, it says if D is bigger than m, then m minus

106
00:08:44,970 --> 00:08:46,320
D will be negative,

107
00:08:46,320 --> 00:08:48,410
so just take the loss to be zero.

108
00:08:49,080 --> 00:08:56,430
Thus it's saying: if we have two images of different people, we want D to be large, but as long as it's greater

109
00:08:56,430 --> 00:09:01,140
than or equal to m, we don't really care how large exactly it is.

110
00:09:01,290 --> 00:09:05,720
So we don't want to push D to infinity; just bigger than m is good enough.

111
00:09:10,890 --> 00:09:13,080
What is the result of this loss?

112
00:09:13,080 --> 00:09:19,290
Well, hopefully, if we draw a histogram of the distances we get when we feed in images of matches and

113
00:09:19,290 --> 00:09:22,140
non-matches, it will look like this.

114
00:09:22,140 --> 00:09:24,850
You can see that we get two distributions.

115
00:09:24,900 --> 00:09:30,630
The distribution for the matches should hopefully be very close to zero and the distribution for the

116
00:09:30,630 --> 00:09:35,910
non-matches should hopefully all be equal to or greater than m.

117
00:09:35,910 --> 00:09:41,700
The idea is once we can see a plot like this we know how to classify whether or not two images are a

118
00:09:41,700 --> 00:09:42,960
match.

119
00:09:42,960 --> 00:09:46,010
We can choose a threshold. For example, with this plot,

120
00:09:46,020 --> 00:09:49,290
that might be 0.75.

121
00:09:49,380 --> 00:09:54,720
Then when we plug two images into our Siamese network and we get a distance less than

122
00:09:54,720 --> 00:09:58,090
0.75, we can predict that they are a match.

123
00:09:58,170 --> 00:10:02,870
If it's bigger than 0.75, we can predict that it's not a match.
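
The thresholding rule just described might look like this in NumPy. This is only a sketch: `predict_match` is a hypothetical helper, the embeddings are made-up values, and 0.75 is the example threshold read off the plot in the lecture.

```python
import numpy as np

def predict_match(f1, f2, threshold=0.75):
    # Distance between the two embeddings from the Siamese network.
    d = np.linalg.norm(f1 - f2, axis=-1)
    # Below the threshold: predict a match. Otherwise: not a match.
    return d < threshold

anchor = np.array([0.0, 0.0])
near = np.array([0.3, 0.4])   # distance 0.5 from anchor
far = np.array([3.0, 4.0])    # distance 5.0 from anchor

is_match_near = predict_match(anchor, near)
is_match_far = predict_match(anchor, far)
```

In practice the threshold would be chosen by looking at the histogram of match and non-match distances on held-out pairs.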

124
00:10:02,970 --> 00:10:08,370
As a side note, you can see that we are going to set m to 1, which seems to work well enough for this

125
00:10:08,370 --> 00:10:09,630
problem.

126
00:10:09,630 --> 00:10:12,900
You are, of course, encouraged to try different hyperparameters on your own.