1
00:00:00,480 --> 00:00:06,930
Hi, and welcome to the lecture on Siamese networks, where we talk about how Siamese networks can be used

2
00:00:06,930 --> 00:00:09,030
for tasks such as image similarity.

3
00:00:09,570 --> 00:00:10,710
So let's get started.

4
00:00:10,920 --> 00:00:18,630
So if you're unfamiliar with the term Siamese, Siamese refers to conjoined twins, that is, twins that

5
00:00:18,630 --> 00:00:25,650
were unfortunately joined together, sometimes even sharing organs. So, similarly to the word Siamese,

6
00:00:25,650 --> 00:00:28,830
Siamese networks refers to networks that are connected.

7
00:00:29,250 --> 00:00:36,120
They consist of two or more identical subnetworks, and these subnetworks all share the same parameters

8
00:00:36,120 --> 00:00:38,180
and weights. The parameter

9
00:00:38,190 --> 00:00:40,170
updating is mirrored across both networks,

10
00:00:40,440 --> 00:00:45,570
which means that when one network updates its weights, the other one consequently updates its weights

11
00:00:45,570 --> 00:00:47,940
to exactly the same values as well.

12
00:00:48,330 --> 00:00:49,710
And that's an important fact.

13
00:00:50,190 --> 00:00:54,480
So now let's move on to what Siamese networks are used for.

14
00:00:55,320 --> 00:01:00,300
Well, they are very, very useful in comparing images or gauging image similarity.

15
00:01:00,780 --> 00:01:06,120
Things like signature verification, face recognition and fingerprint matching are all examples where

16
00:01:06,120 --> 00:01:11,970
we can use Siamese networks, and you would be very familiar with face recognition, which has been

17
00:01:11,970 --> 00:01:13,560
a bit controversial of late.

18
00:01:14,310 --> 00:01:17,760
So Siamese networks have a simple binary output.

19
00:01:18,360 --> 00:01:23,640
Yes if the images match, or no if they don't match. And here's an example of some signatures that

20
00:01:23,640 --> 00:01:26,910
do match and some signatures that don't match.

21
00:01:27,720 --> 00:01:32,610
So here's a high level diagram of what a Siamese network's architecture looks like.

22
00:01:33,210 --> 00:01:35,950
Firstly, we have the image being fed here.

23
00:01:35,970 --> 00:01:39,810
We have two images, the images that we're trying to compare.

24
00:01:39,840 --> 00:01:41,490
So we have image one and image two.

25
00:01:42,150 --> 00:01:44,390
And then we have the ConvNets that we've created.

26
00:01:44,400 --> 00:01:47,160
And these are identical convolutional networks.

27
00:01:47,160 --> 00:01:47,820
You remember that.

28
00:01:48,300 --> 00:01:55,320
So what these convolutional networks produce is an encoding that stores a representation of this image,

29
00:01:55,830 --> 00:02:04,050
so the networks learn to create an encoding that we use later on to compare in the

30
00:02:04,050 --> 00:02:04,820
differencing layer.

31
00:02:05,310 --> 00:02:11,100
Now, the differencing layer is where we actually compare the values using something like Euclidean or cosine

32
00:02:11,100 --> 00:02:17,280
distance, and we use a loss function, typically triplet loss or contrastive loss or even binary cross

33
00:02:17,280 --> 00:02:18,000
entropy loss.

34
00:02:18,630 --> 00:02:20,790
And this produces a similarity score.

35
00:02:21,300 --> 00:02:28,230
So if the similarity score is low, the images are similar, meaning that they're a very, very close match.

36
00:02:28,740 --> 00:02:34,440
Otherwise, if the similarity score is large, which is kind of reversed and unintuitive, but just remember

37
00:02:34,440 --> 00:02:36,270
that's how the similarity score metric works,

38
00:02:36,270 --> 00:02:38,850
in this case, the images are different.

39
00:02:38,970 --> 00:02:41,160
So a low score means the images are similar.

40
00:02:42,360 --> 00:02:45,990
So here's another diagram of a typical Siamese network.

41
00:02:45,990 --> 00:02:48,180
You can see the convolutional layers here.

42
00:02:48,690 --> 00:02:50,370
Then it has fully connected layers.

43
00:02:50,370 --> 00:02:52,860
Then it goes to the contrastive loss here.

44
00:02:53,400 --> 00:03:00,420
These fully connected layers produce the embedding that stores the image, and that converts

45
00:03:00,420 --> 00:03:02,130
this image to this embedding here.

46
00:03:02,550 --> 00:03:08,190
Likewise, it takes another image, the image that we are comparing here, and here it's converted in exactly

47
00:03:08,190 --> 00:03:12,570
the same way, by the exact same architecture, into another embedding here.

48
00:03:13,080 --> 00:03:19,920
And then we use the contrastive loss function to compare them using Euclidean or cosine distance, and this

49
00:03:19,920 --> 00:03:22,380
outputs a similarity score at the end.

50
00:03:23,460 --> 00:03:28,650
So here are some key points to remember when thinking about a Siamese network architecture.

51
00:03:29,100 --> 00:03:34,770
Firstly, we notice the output of the convolutional layers in a Siamese network is a flattened matrix.

52
00:03:35,280 --> 00:03:37,980
This is an illustration of what I'm talking about right here.

53
00:03:38,400 --> 00:03:45,630
Remember, we have two subnetworks, both convolutional networks if we're using them for images.

54
00:03:46,200 --> 00:03:49,800
And these ConvNets produce a one-dimensional vector.

55
00:03:50,160 --> 00:03:55,620
That's basically the embedding layer or the encoding, which can be considered the features that have

56
00:03:55,620 --> 00:03:59,850
been extracted from the image, since we're using a ConvNet to produce that encoding.

57
00:04:00,660 --> 00:04:02,310
Now what do we do with that encoding?

58
00:04:02,760 --> 00:04:04,710
Well, we feed both encodings,

59
00:04:04,710 --> 00:04:06,900
remember, each image produces an encoding,

60
00:04:07,500 --> 00:04:09,950
we feed them now into the differencing layer.

61
00:04:10,590 --> 00:04:15,660
And that's where we can use something like Euclidean or cosine distance to find the difference between

62
00:04:15,660 --> 00:04:18,090
the matrices produced in each network.

63
00:04:18,810 --> 00:04:24,240
We can then use a loss function to assess whether the Siamese network has made the

64
00:04:24,240 --> 00:04:26,820
right decision from the comparison.

65
00:04:28,050 --> 00:04:31,340
So let's talk about the loss functions.

66
00:04:31,560 --> 00:04:35,430
There are three common loss functions we typically use in Siamese networks.

67
00:04:35,910 --> 00:04:37,890
Firstly, there's the triplet loss function.

68
00:04:38,520 --> 00:04:44,580
This is where a baseline input, here shown as the anchor in this diagram, is compared to a positive

69
00:04:44,580 --> 00:04:45,050
input.

70
00:04:45,090 --> 00:04:46,110
That's this one here.

71
00:04:46,110 --> 00:04:48,540
And the negative input, that's one that's different.

72
00:04:49,020 --> 00:04:52,940
And what we try to do, as we can see here, is this:

73
00:04:52,950 --> 00:04:57,450
We try to minimize the distance between the positive and the anchor, and then maximize

74
00:04:57,450 --> 00:04:59,910
the distance between the negative and the

75
00:05:00,020 --> 00:05:00,950
anchor, right here.

76
00:05:02,000 --> 00:05:07,400
Secondly, there's contrastive loss. Now, the goal of a Siamese network is to differentiate between

77
00:05:07,400 --> 00:05:08,480
pairs of images.

78
00:05:08,990 --> 00:05:15,200
So contrastive refers to the fact that these losses are computed by contrasting two or more data point

79
00:05:15,200 --> 00:05:16,130
representations.

80
00:05:16,580 --> 00:05:20,600
And you can see an illustration of the contrastive loss concept here.

81
00:05:21,500 --> 00:05:24,590
And then lastly, we have binary cross entropy.

82
00:05:25,220 --> 00:05:32,510
So given that the output of the Siamese network is binary, that is, similar or dissimilar, it's kind

83
00:05:32,510 --> 00:05:38,000
of intuitive that we can use something like a binary cross entropy loss function, as it might be an obvious

84
00:05:38,000 --> 00:05:39,440
choice for these types of networks.

85
00:05:40,070 --> 00:05:45,560
So hopefully you have an intuition and understanding of Siamese networks.

86
00:05:45,560 --> 00:05:47,090
This is a high level overview.

87
00:05:47,570 --> 00:05:51,650
Next, we're going to discuss how we can actually train these Siamese networks.

88
00:05:52,070 --> 00:05:53,360
So stay tuned for that.

89
00:05:53,390 --> 00:05:54,740
I'll see you there. Bye.