1
00:00:00,170 --> 00:00:06,110
Hi there, and welcome to this new session, in which we shall be treating the variational autoencoder.

2
00:00:06,650 --> 00:00:10,700
And we shall see how it could be used in image generation.

3
00:00:10,880 --> 00:00:17,030
In this first part, we shall dive deep into understanding the theory behind the variational autoencoder,

4
00:00:17,030 --> 00:00:19,130
and then, in the subsequent sections,

5
00:00:19,130 --> 00:00:25,160
we shall practically implement a working variational autoencoder.

6
00:00:25,670 --> 00:00:33,380
That said, we shall start by explaining the autoencoder, and we'll make use of

7
00:00:33,380 --> 00:00:35,810
this blog post by Jeremy Jordan.

8
00:00:35,900 --> 00:00:42,290
Now, to understand the autoencoder, we can break this word into two parts.

9
00:00:42,290 --> 00:00:47,870
That's "auto" and "encoder".

10
00:00:48,350 --> 00:00:54,110
So essentially we have a system which self-encodes.

11
00:00:55,310 --> 00:00:59,810
Now, suppose you have an image like this one right here.

12
00:01:01,210 --> 00:01:06,680
We pass this into some encoder block; let's have something like this.

13
00:01:06,700 --> 00:01:13,050
We have some encoder block and then we could obtain this output vector.

14
00:01:13,060 --> 00:01:16,480
Now, this output vector is six-dimensional.

15
00:01:16,480 --> 00:01:18,130
So we have six different positions.

16
00:01:18,130 --> 00:01:24,580
And each of the positions represents a specific characteristic of this image.

17
00:01:24,580 --> 00:01:25,840
You could see smile,

18
00:01:25,840 --> 00:01:28,750
0.99. Skin tone, 0.85.

19
00:01:28,750 --> 00:01:30,970
Gender -0.73.

20
00:01:31,000 --> 00:01:32,560
Beard 0.85.

21
00:01:32,560 --> 00:01:34,510
Glasses 0.002.

22
00:01:34,540 --> 00:01:36,430
Hair color 0.68.

23
00:01:36,430 --> 00:01:45,160
So all six values we have here characterize our image.

24
00:01:45,160 --> 00:01:56,110
So they encode information; that is, information about this image is encoded in this vector

25
00:01:56,110 --> 00:01:56,830
right here.

26
00:01:57,610 --> 00:02:05,540
And then on the other hand, when we want to retrieve this encoded information, what we could now do

27
00:02:05,540 --> 00:02:15,380
is we get a decoder which takes this encoded information and then reproduces this original image.

28
00:02:16,010 --> 00:02:24,290
And so that's globally how we produce this kind of system, which could be used in image compression,

29
00:02:24,290 --> 00:02:31,490
where we could take this image, encode it so that we have just this vector, then we could pass this

30
00:02:31,490 --> 00:02:34,250
vector via some network.

31
00:02:34,250 --> 00:02:43,130
And then on the other side of the network, we decode this vector such that we have the original image.

32
00:02:43,610 --> 00:02:53,300
Apart from compression, another field where we could apply this kind of autoencoder network is image

33
00:02:53,300 --> 00:02:54,110
search.

34
00:02:54,110 --> 00:03:00,230
So let's suppose that we have this image right here and we have this vector.

35
00:03:00,230 --> 00:03:08,030
Now if we have another image of this same person here, so we have another image of the same person,

36
00:03:08,360 --> 00:03:15,840
we call this image image B, and here we have image A, which produces a vector which we will call

37
00:03:15,840 --> 00:03:25,370
v_A. This vector lives in a 6-D vector space: six because we have six different values here, but

38
00:03:25,370 --> 00:03:30,700
it could be 128-D or whatever dimension we make it to be.

39
00:03:30,710 --> 00:03:36,980
So as we're saying, we have this image B, which is an image of this same

40
00:03:36,980 --> 00:03:42,650
person: not necessarily the same image, but maybe some other image of the same person.

41
00:03:42,650 --> 00:03:51,350
Then, after passing it through the encoder, we have our 6-D vector v_B. But because

42
00:03:51,350 --> 00:03:59,600
it's the same person, we would expect these values to be similar.

43
00:03:59,600 --> 00:04:13,790
And so v_A will be close to v_B. And if we have another person, let's say another person C here,

44
00:04:13,790 --> 00:04:24,980
and we generate that person's v_C, that's the encoded vector, then we would expect v_A to be much

45
00:04:24,980 --> 00:04:28,130
different from v_C.

46
00:04:29,110 --> 00:04:37,480
And so this means that in an image search scenario, we just pass this input, we obtain this vector,

47
00:04:37,480 --> 00:04:44,260
and then we compare the two vectors to see whether they belong to the same person or not.
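
The comparison described here can be sketched in a few lines of plain Python. This is only an illustration: the vector for image A reuses the six values read out earlier, while the vectors for images B and C, the Euclidean distance as the similarity measure, and the 0.5 threshold are all assumptions made for the example.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# 6-D codes: smile, skin tone, gender, beard, glasses, hair color.
v_a = [0.99, 0.85, -0.73, 0.85, 0.002, 0.68]   # image A (values from the lecture)
v_b = [0.95, 0.83, -0.70, 0.80, 0.010, 0.65]   # image B, same person (made up)
v_c = [0.10, 0.20, 0.90, -0.60, 0.900, -0.30]  # image C, other person (made up)

THRESHOLD = 0.5  # assumed cut-off; in practice it would be tuned on labelled pairs


def same_person(u, v, threshold=THRESHOLD):
    """Treat two codes as the same person when they lie close together."""
    return euclidean_distance(u, v) < threshold


print(same_person(v_a, v_b))  # A and B are close in the latent space
print(same_person(v_a, v_c))  # A and C are far apart
```

Other similarity measures, such as cosine similarity, could be swapped in without changing the idea.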

48
00:04:44,830 --> 00:04:51,160
Now, it should also be noted that when training an autoencoder model where we have an image A and we

49
00:04:51,160 --> 00:05:02,020
have a reconstructed image, A prime, then our aim here will be to minimize the difference between

50
00:05:02,050 --> 00:05:08,800
A and A prime, so we could minimize the size of A minus A prime, for example its squared norm.

51
00:05:08,980 --> 00:05:16,750
Now, it turns out that in image generation, to get better results, instead of dealing with single fixed

52
00:05:16,750 --> 00:05:21,010
values like what we had here... let's get back to the top.

53
00:05:21,010 --> 00:05:24,490
You see, here we had a given value

54
00:05:24,520 --> 00:05:28,480
(let's take this off), a given value for smile, for skin tone, and so on

55
00:05:28,480 --> 00:05:29,390
and so forth.

56
00:05:29,420 --> 00:05:37,010
So as we're saying, instead of having a fixed value for each and every one of these features, what

57
00:05:37,010 --> 00:05:41,660
we'll do is we'll make use of a probability distribution.

58
00:05:42,260 --> 00:05:48,860
So here instead of having a value, let's say this is -0.6.

59
00:05:49,640 --> 00:05:57,420
We would have a probability distribution whose mean is at -0.6.

60
00:05:57,450 --> 00:06:00,330
Well, this looks more like -0.5.

61
00:06:00,330 --> 00:06:07,290
But here we suppose we're going from -0.6 to this probability distribution, which has mean -0.6 with

62
00:06:07,320 --> 00:06:09,310
a given variance.

63
00:06:09,330 --> 00:06:16,770
Now, for those of you who don't have a math background, what this essentially means is instead of

64
00:06:16,770 --> 00:06:23,250
picking a single value, the value -0.6, what we'll do is we'll pick

65
00:06:24,210 --> 00:06:28,320
some random value within this range.

66
00:06:28,320 --> 00:06:35,010
So instead of having -0.6, as we said, we're going to have a random value in this range, and

67
00:06:35,760 --> 00:06:42,340
values closest to -0.6 have a higher probability of being picked.

68
00:06:42,360 --> 00:06:55,530
So instead of having this, we could now have -0.3 or -0.55 or -0.75 and so on and so forth.

69
00:06:55,530 --> 00:06:58,290
So we have values which we can pick in this range.

70
00:06:59,110 --> 00:07:02,810
Now, if we see this other example here, where we have zero

71
00:07:02,830 --> 00:07:09,730
now turned into this probability distribution, we pick values in this range, -1 to 1.

72
00:07:09,730 --> 00:07:21,700
So here the variance, or the range of values in which we are allowed to pick a value, is larger

73
00:07:21,700 --> 00:07:23,290
than this other one.

74
00:07:24,400 --> 00:07:28,570
But still, values around this zero,

75
00:07:28,600 --> 00:07:33,400
that's the mean, have a higher probability of being picked.

76
00:07:33,400 --> 00:07:42,640
So here you would have a higher chance of picking 0.1 than of picking 0.9. To see that clearly,

77
00:07:42,670 --> 00:07:47,860
let's say this is 0.1 at this mark and this is 0.9 around here.

78
00:07:47,890 --> 00:07:55,480
You would find that if you link this 0.1 up to the curve here, you see that it has a higher density, hence

79
00:07:55,480 --> 00:08:01,310
a higher chance of being picked as compared to this one which has much lower chance of being picked.
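
To put numbers on this comparison, here is a small sketch that evaluates the bell curve at 0.1 and at 0.9. It assumes the curve is a normal (Gaussian) density with mean 0; the standard deviation of 0.5 is a made-up value chosen for illustration, since the lecture does not state one.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)


mu, sigma = 0.0, 0.5  # sigma is an assumed value for this example

density_near = normal_pdf(0.1, mu, sigma)  # height of the curve close to the mean
density_far = normal_pdf(0.9, mu, sigma)   # height of the curve far from the mean

# The curve is higher near the mean, so values there are more likely to be drawn.
print(density_near > density_far)
```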

80
00:08:01,310 --> 00:08:02,120
So,

81
00:08:02,820 --> 00:08:07,530
in a nutshell, instead of having this 0.5,

82
00:08:08,590 --> 00:08:14,110
we now have a mean value, which is 0.5,

83
00:08:14,970 --> 00:08:27,810
and a variance, which gives us the range of values from which we can pick the

84
00:08:27,810 --> 00:08:30,390
specific value for a given feature.

85
00:08:31,200 --> 00:08:34,350
And so as you could see, this one here

86
00:08:35,120 --> 00:08:40,910
has a smaller variance as compared to this one and to this one.

87
00:08:41,850 --> 00:08:52,170
And from this point we will denote this mean as mu, and the variance, which measures the spread around the mean,

88
00:08:52,560 --> 00:08:56,790
as sigma squared.

89
00:08:57,210 --> 00:09:05,610
This probabilistic approach to generating a latent vector, which previously was this vector we had

90
00:09:05,610 --> 00:09:07,330
here (scroll back up;

91
00:09:07,350 --> 00:09:15,090
previously it was this vector), is what leads us to the variational autoencoder.

92
00:09:15,690 --> 00:09:20,250
So you see that here we have our input image.

93
00:09:20,670 --> 00:09:28,620
It goes into the encoder, which produces mu and sigma squared.

94
00:09:28,650 --> 00:09:31,380
Or let's just say sigma: sigma is the standard deviation,

95
00:09:31,380 --> 00:09:34,890
sigma squared is the variance. So it produces mu and sigma.

96
00:09:35,040 --> 00:09:44,680
And then, from a latent vector sampled using mu and sigma, our decoder produces our reconstructed output image.

97
00:09:45,610 --> 00:09:51,370
Now note that in this case we have one, two, three, four, five, six positions, so mu would

98
00:09:51,370 --> 00:09:59,440
be a 6-D vector and sigma would be another 6-D vector, where, for this first position here, mu one

99
00:09:59,440 --> 00:10:07,420
and sigma one represent the mean and the standard deviation of this distribution.

100
00:10:07,720 --> 00:10:14,680
Now, it should be noted that the main benefit of a variational autoencoder is that it is capable of

101
00:10:14,680 --> 00:10:20,140
learning smooth latent state representations of the input data.

102
00:10:21,190 --> 00:10:27,880
Now, to better understand that statement, let's consider this output generated by an autoencoder

103
00:10:28,120 --> 00:10:33,670
and this other output generated by a variational autoencoder.

104
00:10:33,940 --> 00:10:41,170
You will notice that as we go from one digit to another, like let's say we're going from 6 to 8 here.

105
00:10:41,170 --> 00:10:43,440
You see, here we have a six.

106
00:10:43,450 --> 00:10:45,160
Well, it looks very much like a six.

107
00:10:45,160 --> 00:10:47,980
So let's take this one, which already looks very much like a six.

108
00:10:47,980 --> 00:10:48,880
This is a six.

109
00:10:48,880 --> 00:10:54,220
But here it's really confusing because we don't know exactly what this is.

110
00:10:54,250 --> 00:10:58,870
Now, this looks more like an eight, but it's not really very clear.

111
00:10:58,870 --> 00:11:06,550
And then here we start getting eight and eight, and, well, this too doesn't look very clear.

112
00:11:06,550 --> 00:11:13,030
But when you look at the output generated by the variational autoencoder,

113
00:11:13,060 --> 00:11:22,450
as we go from one digit to another, we can see here that we have a much smoother transition.

114
00:11:23,170 --> 00:11:29,440
And this is thanks to the fact that instead of working with fixed values at the level of our latent

115
00:11:29,440 --> 00:11:37,090
vectors, we're going for a probabilistic approach with the variational autoencoder. Because we're going

116
00:11:37,090 --> 00:11:39,670
in for this probabilistic approach,

117
00:11:39,910 --> 00:11:47,710
the training of our variational autoencoder is no longer straightforward, and this is simply because during

118
00:11:47,710 --> 00:11:51,220
the training we need to compute partial derivatives

119
00:11:51,220 --> 00:12:05,800
with respect to z here, and partial derivatives with respect to mu and with respect

120
00:12:05,800 --> 00:12:06,580
to sigma.

121
00:12:07,480 --> 00:12:15,490
But because the Z we have here is drawn from a normal distribution with mean mu and standard deviation

122
00:12:15,490 --> 00:12:21,520
sigma, we won't be able to compute these partial derivatives through the random sampling step.

123
00:12:22,270 --> 00:12:31,750
And so the idea now will be to convert this node that's here to one that is deterministic.

124
00:12:31,810 --> 00:12:37,690
You can see here we have this key random node and then deterministic nodes.

125
00:12:37,690 --> 00:12:39,160
So this one is deterministic.

126
00:12:39,160 --> 00:12:39,760
That's fine.

127
00:12:39,760 --> 00:12:42,550
This one, fine.

128
00:12:42,550 --> 00:12:43,690
Well, this is fine.

129
00:12:43,690 --> 00:12:50,830
Now the idea will be to convert this into one which is deterministic, such that we could compute these

130
00:12:50,830 --> 00:12:59,320
partial derivatives and hence train the model such that we could update the encoder and the decoder

131
00:12:59,320 --> 00:13:00,490
parameters.

132
00:13:01,030 --> 00:13:10,450
And so now this idea of converting this node from one which is random to one which is deterministic,

133
00:13:10,480 --> 00:13:14,590
is known as the reparameterization trick.

134
00:13:15,400 --> 00:13:22,420
So instead of having this, where we have... well, let's take this off, let's make it simple.

135
00:13:22,420 --> 00:13:29,290
So we have this, where we have the mean mu and the standard deviation sigma,

136
00:13:31,000 --> 00:13:34,810
where we pick a value at random in this range.

137
00:13:35,560 --> 00:13:41,140
We are instead going to define this epsilon, which is drawn from

138
00:13:41,300 --> 00:13:45,830
a normal distribution with mean zero.

139
00:13:45,830 --> 00:13:51,620
So our mean now will always be zero and then the standard deviation will be one.

140
00:13:51,620 --> 00:13:53,990
So we have negative one and one.

141
00:13:53,990 --> 00:13:58,880
So this epsilon here, as we've said, is drawn from this probability distribution.

142
00:13:58,880 --> 00:14:11,180
And then to obtain Z, unlike here where we obtain Z randomly from values surrounding the mean here,

143
00:14:11,180 --> 00:14:21,740
we'll take the mean plus the standard deviation times epsilon, a random value which is centered around

144
00:14:21,740 --> 00:14:23,300
zero.

145
00:14:24,110 --> 00:14:35,150
And so now we could compute these partial derivatives, of z with respect to mu and sigma, and hence train our

146
00:14:35,150 --> 00:14:37,540
variational autoencoder model.
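
As a rough illustration of the trick just described, the sketch below builds z as mu plus sigma times epsilon and then checks numerically that z responds smoothly to small changes in mu and sigma once epsilon is held fixed. The three-element mu and sigma vectors, the seed, and the finite-difference check are assumptions made for this example; a real model would rely on a deep learning framework's automatic differentiation.

```python
import random


def reparameterize(mu, sigma, eps):
    """Deterministic map z = mu + sigma * eps, applied position by position."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


rng = random.Random(0)          # fixed seed so the run is repeatable
mu = [-0.6, 0.0, 0.5]           # assumed encoder means
sigma = [0.1, 0.5, 0.2]         # assumed encoder standard deviations

# All the randomness lives in eps, drawn from a normal with mean 0 and std 1.
eps = [rng.gauss(0.0, 1.0) for _ in mu]
z = reparameterize(mu, sigma, eps)

# Finite-difference check on the first position, holding eps fixed:
# dz/dmu should be 1 and dz/dsigma should equal eps.
h = 1e-6
mu_shift = [mu[0] + h] + mu[1:]
sigma_shift = [sigma[0] + h] + sigma[1:]
dz_dmu = (reparameterize(mu_shift, sigma, eps)[0] - z[0]) / h
dz_dsigma = (reparameterize(mu, sigma_shift, eps)[0] - z[0]) / h

print(dz_dmu, dz_dsigma, eps[0])
```

Because epsilon is treated as an input rather than as something that depends on mu and sigma, gradients can flow back to the encoder outputs.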

147
00:14:37,550 --> 00:14:45,330
The next and final point we'll look at in this section is the variational autoencoder's loss. Now, from

148
00:14:45,330 --> 00:14:49,980
the variational autoencoder paper,

149
00:14:50,160 --> 00:14:54,570
the authors break up this loss into two main parts.

150
00:14:54,600 --> 00:14:57,360
The first part, let's take this off.

151
00:14:57,360 --> 00:15:07,590
The first part is the reconstruction loss, and this other part acts as a regularizer. For the reconstruction

152
00:15:07,590 --> 00:15:08,190
loss,

153
00:15:08,190 --> 00:15:16,860
we try to minimize the difference between x and x prime, or x hat.

154
00:15:16,860 --> 00:15:25,530
So we want the input and the reconstructed output to be similar.

155
00:15:25,980 --> 00:15:28,980
Here it is denoted as this.

156
00:15:29,010 --> 00:15:30,660
We try to minimize this.

157
00:15:30,660 --> 00:15:38,850
And then for the regularizer, we're computing the KL divergence between this distribution and

158
00:15:38,850 --> 00:15:40,590
this other distribution.

159
00:15:40,680 --> 00:15:48,300
Now, to understand what these distributions actually signify, we can take a look at this figure.

160
00:15:48,300 --> 00:16:01,950
So here we have this Q, or this distribution Q of Z given X, which happens to be a learned distribution,

161
00:16:01,950 --> 00:16:08,430
meaning that when we train this encoder model right here, this is our encoder model, when we

162
00:16:08,430 --> 00:16:10,140
train this encoder model,

163
00:16:10,560 --> 00:16:17,400
we shall in fact be learning this distribution, which, as we've said already, is a learned distribution.

164
00:16:17,910 --> 00:16:26,970
Nonetheless, we do not want this learned distribution to be very much different from the distribution

165
00:16:26,970 --> 00:16:28,410
P of Z.

166
00:16:28,410 --> 00:16:39,750
And so that's why we are going to minimize the distance between this learned

167
00:16:39,750 --> 00:16:44,760
distribution and the true prior distribution P of Z.

168
00:16:44,790 --> 00:16:51,030
Now, it should be noted that this KL divergence here is a tool which permits us to measure the distance

169
00:16:51,030 --> 00:16:52,830
between two distributions.

170
00:16:52,830 --> 00:17:00,060
And so if we can minimize this, then we'll reduce the distance between

171
00:17:00,060 --> 00:17:03,750
this distribution and this distribution P of Z.

172
00:17:03,750 --> 00:17:10,710
And getting back to the original paper, it should be noted that the reconstruction loss can be taken

173
00:17:10,710 --> 00:17:12,450
as the mean squared error,

174
00:17:12,480 --> 00:17:16,920
while this here will be our regularizer.

175
00:17:16,920 --> 00:17:23,940
And so this is what we obtain after computing the KL divergence between those two distributions.
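
The two parts of the loss can be sketched as follows. This assumes that the encoder outputs a mean and standard deviation per latent position (a diagonal Gaussian) and that the prior P of Z is a standard normal, in which case the KL divergence has the closed form -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2). The function names and the `beta` weighting factor are hypothetical and exist only for this illustration.

```python
import math


def mse(x, x_hat):
    """Reconstruction loss: mean squared error between input and output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu, sigma):
    """KL divergence between N(mu, sigma^2) per position and a standard normal.

    Closed form, summed over latent positions:
    -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    """
    return -0.5 * sum(
        1.0 + math.log(s * s) - m * m - s * s for m, s in zip(mu, sigma)
    )


def vae_loss(x, x_hat, mu, sigma, beta=1.0):
    """Reconstruction term plus the KL regularizer (beta is an assumed weight)."""
    return mse(x, x_hat) + beta * kl_to_standard_normal(mu, sigma)


# When the learned distribution already matches the prior, the regularizer vanishes.
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
# Moving the mean away from zero is penalized.
print(kl_to_standard_normal([1.0], [1.0]))
```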