1
00:00:12,050 --> 00:00:17,360
As I mentioned in the introduction to this section, the theory behind Siamese networks is not all that

2
00:00:17,360 --> 00:00:18,520
difficult.

3
00:00:18,590 --> 00:00:23,780
It's more complicated than a CNN you would use for image classification, but it still follows the same

4
00:00:23,780 --> 00:00:24,870
steps.

5
00:00:24,890 --> 00:00:29,270
We have a system of CNNs, we set up a loss, and then we minimize that loss.

6
00:00:29,270 --> 00:00:31,370
Relatively straightforward.

7
00:00:31,370 --> 00:00:39,320
The real challenge here is: what do we do with the data?

8
00:00:39,360 --> 00:00:45,960
Normally, when we train a CNN for image classification, we have a simple dataset of input-target pairs.

9
00:00:46,530 --> 00:00:47,010
For x,

10
00:00:47,010 --> 00:00:52,130
we have a single image, and for y we have the class which the image belongs to.

11
00:00:52,290 --> 00:00:54,770
But what happens with a Siamese network?

12
00:00:54,990 --> 00:01:00,440
We don't necessarily have a label for a specific image, or at least not the labels we will use directly.

13
00:01:05,540 --> 00:01:07,060
Here's what will be given.

14
00:01:07,400 --> 00:01:11,150
We're still going to get a list of images. For each image,

15
00:01:11,150 --> 00:01:12,890
we will be told its identity.

16
00:01:12,890 --> 00:01:16,190
For example, the first image belongs to subject 1.

17
00:01:16,190 --> 00:01:19,230
The next image belongs to subject 2, and so forth.

18
00:01:19,460 --> 00:01:23,390
But these are not the labels that the Siamese network cares about.

19
00:01:23,390 --> 00:01:28,970
Remember that what we care about is whether a pair of images is a match or a non-match.

20
00:01:29,510 --> 00:01:30,590
So how do we do this?

21
00:01:35,720 --> 00:01:41,770
We have to make our own dataset. To keep things simple, let's say we have three subjects and we have

22
00:01:41,830 --> 00:01:44,350
two images for each subject.

23
00:01:44,350 --> 00:01:51,970
Let's call these images A, B, C, D, E, and F. So A and B belong to subject 1, C and D belong to subject

24
00:01:51,970 --> 00:01:59,790
2, and E and F belong to subject 3.

25
00:01:59,800 --> 00:02:02,260
Now this becomes a counting problem.

26
00:02:02,260 --> 00:02:09,840
What are all the possible ways we can combine these images into pairs? If you recall, when we have N images,

27
00:02:09,840 --> 00:02:12,810
we should end up with N choose 2 pairs.

28
00:02:12,810 --> 00:02:14,670
So let's see what we get.

29
00:02:14,670 --> 00:02:26,790
We get AB, AC, AD, AE, and AF. Then we have BC, BD, BE, and BF. Then we have CD, CE, and CF. Then we have DE

30
00:02:26,940 --> 00:02:29,950
and DF. And then we have just EF.

31
00:02:30,270 --> 00:02:32,680
And of course we have the corresponding labels.

32
00:02:32,940 --> 00:02:40,700
So A matches with B, C matches with D, and E matches with F. All other samples are non-matches.
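To make the counting concrete, here is a minimal Python sketch of the pairing we just did by hand. The names `subject_of` and `make_pairs` are made up for this illustration and are not from the course code; the labels 1 and 0 for match and non-match are also an assumption.

```python
from itertools import combinations

# Hypothetical toy data: image name -> subject id, as in the spoken example.
subject_of = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}

def make_pairs(subject_of):
    """Return every unordered pair of images with label 1 (match) or 0 (non-match)."""
    pairs = []
    for left, right in combinations(sorted(subject_of), 2):
        label = 1 if subject_of[left] == subject_of[right] else 0
        pairs.append((left, right, label))
    return pairs

pairs = make_pairs(subject_of)
matches = [(left, right) for left, right, label in pairs if label == 1]

print(len(pairs))  # 15, which is 6 choose 2
print(matches)     # [('A', 'B'), ('C', 'D'), ('E', 'F')]
```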

33
00:02:45,850 --> 00:02:50,460
You will notice a problem which is inherent in the way that this data is created.

34
00:02:50,560 --> 00:02:55,660
It's that we will always have a much higher number of non-matches than matches.

35
00:02:55,660 --> 00:02:58,950
In other words, we will always have imbalanced classes.
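We can put rough numbers on this imbalance. As a simplifying assumption that is not stated in the lecture, suppose there are K subjects with m images each (the symbols K and m are introduced here only for illustration). Then the pair counts work out as follows:

```latex
\begin{align*}
\text{total pairs} &= \binom{Km}{2} = \frac{Km\,(Km-1)}{2} \\
\text{matches}     &= K\binom{m}{2} = \frac{Km\,(m-1)}{2} \\
\text{non-matches} &= \binom{K}{2}\,m^{2} = \frac{K(K-1)\,m^{2}}{2} \\
\frac{\text{non-matches}}{\text{matches}} &= \frac{(K-1)\,m}{m-1}
\end{align*}
```

For our example with K = 3 and m = 2, that gives 15 pairs, 3 matches, and 12 non-matches, a ratio of 4 to 1. Under this assumption the ratio grows roughly in proportion to the number of subjects.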

36
00:02:59,050 --> 00:03:04,510
So if we try to train the Siamese network with the usual batch gradient descent, we are going to see

37
00:03:04,510 --> 00:03:07,840
a lot more of the non-match class than of the match class.

38
00:03:12,960 --> 00:03:14,030
To solve this problem,

39
00:03:14,040 --> 00:03:20,750
we are going to do two things. First, because we always have N choose 2 combinations of images,

40
00:03:20,790 --> 00:03:24,990
there will always be many more training samples than actual images.

41
00:03:24,990 --> 00:03:29,940
We don't want to store all those pairs of images in memory because that would be redundant.

42
00:03:30,120 --> 00:03:32,970
We would just be storing the same images over and over again.

43
00:03:33,960 --> 00:03:39,080
Thus, we want to create our own generator and loop over that inside our training function.

44
00:03:39,090 --> 00:03:45,850
This will allow us to generate the data on the fly and not have to store it in some gigantic matrix.
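As a rough sketch of the idea, assuming the images live in one array and the subject ids in a parallel list, a generator can yield pairs of indices instead of pairs of images. The name `pair_index_generator` is hypothetical, and this is not the course's actual implementation.

```python
from itertools import combinations

def pair_index_generator(labels):
    """Lazily yield (i, j, is_match) for every unordered pair of image indices.

    Only indices are produced, so each image is stored once and looked up
    when the pair is actually needed.
    """
    for i, j in combinations(range(len(labels)), 2):
        yield i, j, int(labels[i] == labels[j])

# Subject ids for images A..F from the example (assumed ordering).
labels = [1, 1, 2, 2, 3, 3]
gen = pair_index_generator(labels)
first = next(gen)  # the pair (A, B), which is a match
print(first)       # (0, 1, 1)
```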

45
00:03:45,870 --> 00:03:48,180
Second, running our own generator

46
00:03:48,360 --> 00:03:50,590
offers us some flexibility.

47
00:03:50,820 --> 00:03:56,910
What we're going to do is, for each batch, we'll take, say, 32 pairs from the match class, and then we will

48
00:03:56,910 --> 00:04:01,410
sample 32 pairs from the non-match class.

49
00:04:01,500 --> 00:04:06,840
In other words, for every epoch we'll be looking at all the matches, but only some of the non-matches.

50
00:04:11,960 --> 00:04:18,680
So for our simple example earlier, where we had the images A, B, C, D, E, and F, here's what we would do.

51
00:04:18,950 --> 00:04:25,820
Only AB, CD, and EF are matches. All the other training examples are non-matches.

52
00:04:25,820 --> 00:04:31,610
Now let's say we do gradient descent with a batch size of six. Then three training examples from the

53
00:04:31,610 --> 00:04:39,080
batch would be the matches AB, CD, and EF. The other three training examples would be chosen randomly from

54
00:04:39,080 --> 00:04:40,910
the non-match class.

55
00:04:40,910 --> 00:04:44,810
So let's say we choose AC, DE, and BF.
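Here is a minimal sketch of that balanced sampling, assuming subject ids are given as a list and that there are at least as many non-matches as matches. The function name `balanced_batches` and its parameters `half_batch` and `seed` are invented for this illustration; the code in the following lectures may look different.

```python
import random
from itertools import combinations

def balanced_batches(labels, half_batch, seed=0):
    """Yield one epoch of batches of (i, j, label) index pairs.

    Each batch holds up to half_batch matches (label 1) and the same number
    of randomly sampled non-matches (label 0). Every match is used once per
    epoch; only some of the non-matches are.
    """
    rng = random.Random(seed)
    # Only index pairs are listed here, never copies of the images.
    index_pairs = list(combinations(range(len(labels)), 2))
    matches = [p for p in index_pairs if labels[p[0]] == labels[p[1]]]
    non_matches = [p for p in index_pairs if labels[p[0]] != labels[p[1]]]
    rng.shuffle(matches)
    for start in range(0, len(matches), half_batch):
        pos = matches[start:start + half_batch]
        neg = rng.sample(non_matches, len(pos))  # assumes enough non-matches
        batch = [(i, j, 1) for i, j in pos] + [(i, j, 0) for i, j in neg]
        rng.shuffle(batch)
        yield batch

# Images A..F with subjects 1, 1, 2, 2, 3, 3 and a batch size of six.
labels = [1, 1, 2, 2, 3, 3]
batches = list(balanced_batches(labels, half_batch=3))
print(len(batches), len(batches[0]))  # 1 6
```

With the six-image example this gives a single batch per epoch: the three matches plus three non-matches drawn at random.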

56
00:04:44,810 --> 00:04:51,710
Of course, this means that on each epoch we will always see all of the matches, but only some of the

57
00:04:51,710 --> 00:04:59,860
non-matches. In the following lectures, we will look at the actual code to see how the process we just discussed

58
00:04:59,890 --> 00:05:00,790
is implemented.