1
00:00:00,110 --> 00:00:06,830
Hi guys, and welcome to this session in which we are going to write the code for generating images

2
00:00:06,830 --> 00:00:08,900
like this in TensorFlow.

3
00:00:09,320 --> 00:00:18,320
In the previous session we looked at the GAN loss function, and now we will see how to adapt this

4
00:00:18,320 --> 00:00:23,570
loss such that we could instead use the binary cross entropy loss.

5
00:00:24,080 --> 00:00:32,740
And then from there we'll go ahead to look at different methods to make our GAN training much more stable.

6
00:00:32,750 --> 00:00:40,970
You can see from here that this GAN training isn't a very stable process, since we have two adversaries

7
00:00:40,970 --> 00:00:45,710
where one is trying to maximize the loss and the other is trying to minimize the loss.

8
00:00:45,710 --> 00:00:54,770
So by nature, GANs are much more difficult to train as compared to other classical neural networks.

9
00:00:54,770 --> 00:01:01,590
Then, after looking at different methods to make this GAN training process more stable,

10
00:01:01,620 --> 00:01:04,560
we'll go ahead to train our GAN.

11
00:01:04,560 --> 00:01:09,980
From our previous session, we had the discriminator loss and the generator loss.

12
00:01:09,990 --> 00:01:14,450
Now we want to introduce the binary cross entropy loss.

13
00:01:14,460 --> 00:01:17,750
You'll notice that it looks quite similar to what we have here.

14
00:01:17,760 --> 00:01:21,720
Now let's start first with the discriminator. For the discriminator,

15
00:01:21,720 --> 00:01:30,960
it takes in this real image and then we expect it to output a one. See that we expect to have a one there.

16
00:01:30,960 --> 00:01:37,920
Now, if we have to compare this with binary cross entropy, then this

17
00:01:37,950 --> 00:01:50,370
y hat is going to be equal to D of x_i, which is practically what we have here.

18
00:01:50,370 --> 00:01:57,360
So this is our y hat, and this y hat is going to be this here.

19
00:01:57,480 --> 00:02:04,950
Now, since we want our y to be equal to one, it means that one minus y will be zero.

20
00:02:04,950 --> 00:02:10,530
So we will not consider this, but we'll consider only this part right here.

21
00:02:10,530 --> 00:02:16,260
Now, if we consider only this part, we are left with this expression, since y is equal to one.

22
00:02:16,260 --> 00:02:23,970
And then, in the case where this y hat equals zero, we'll

23
00:02:23,970 --> 00:02:30,150
have log of zero, which is negative infinity. With this minus sign, which we didn't have

24
00:02:30,180 --> 00:02:35,880
here but which is now here because we're dealing with the binary cross entropy, the negative of the negative

25
00:02:35,880 --> 00:02:38,580
infinity gives us positive infinity.

26
00:02:38,580 --> 00:02:45,510
So it means that for y hat equal to zero, we instead have positive infinity.

27
00:02:45,510 --> 00:02:55,590
So here, since we want y hat to be equal to one, our aim will be to minimize the

28
00:02:55,590 --> 00:03:03,120
binary cross entropy loss since the values we could get will range between zero and positive infinity.

29
00:03:03,150 --> 00:03:07,650
Unlike here, where the values were ranging from negative infinity to zero.

30
00:03:07,650 --> 00:03:10,050
So here it was a maximization problem.

31
00:03:10,050 --> 00:03:12,510
We had gradient ascent.

32
00:03:12,510 --> 00:03:15,840
But here we're trying to minimize this instead.

33
00:03:15,840 --> 00:03:21,810
So we minimize this because we want to obtain a zero, since it's

34
00:03:21,840 --> 00:03:29,040
when our y hat is equal to one that we get this output of zero.

35
00:03:29,040 --> 00:03:31,500
So that said, we have it for this first part.

36
00:03:31,530 --> 00:03:36,720
For the next part, when we're dealing with our fake data, which has been generated

37
00:03:36,720 --> 00:03:40,230
by our generator, we expect D to be equal to zero.

38
00:03:40,230 --> 00:03:47,430
And since our y in this case is equal to zero, this term

39
00:03:47,430 --> 00:03:49,920
is taken off and we're left with only this term.

40
00:03:49,920 --> 00:03:55,470
Now, for this term we would have one minus the expected value of D, which is zero.

41
00:03:55,500 --> 00:04:03,330
That's one times log of one minus y hat, where y hat is what the model predicts.

42
00:04:03,330 --> 00:04:11,070
So in the case where we have this y hat equal to zero, as expected, then we would

43
00:04:11,070 --> 00:04:13,590
have log of one, and that will give us zero.

44
00:04:13,590 --> 00:04:15,570
So we would have a zero here.

45
00:04:15,570 --> 00:04:22,380
Now, consider the other case, where our model instead predicts a one here.

46
00:04:22,470 --> 00:04:28,440
If the model instead predicts one, then we'll have log of one minus one, that is, log of zero.

47
00:04:28,440 --> 00:04:30,750
Log of zero is negative infinity.

48
00:04:30,780 --> 00:04:33,120
This negative sign here turns it into positive infinity.

49
00:04:33,120 --> 00:04:34,860
And so now we have positive infinity.

50
00:04:34,860 --> 00:04:43,770
And since our aim is for this y hat to be zero, it means that we will minimize this expression

51
00:04:43,770 --> 00:04:44,280
here,

52
00:04:44,280 --> 00:04:45,270
such that

53
00:04:45,990 --> 00:04:51,780
it takes the value of zero, or with the aim of having a value of zero.

54
00:04:52,320 --> 00:04:58,020
And again, unlike this original expression where we went from negative infinity to zero, now we're

55
00:04:58,020 --> 00:04:59,970
going from zero to positive infinity.

56
00:05:01,000 --> 00:05:07,960
And so this means that when working with the discriminator and making use of the binary cross entropy

57
00:05:07,960 --> 00:05:13,810
we would go through a simple gradient descent where we will be minimizing the loss.
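
As a rough sketch of what we've just said about the discriminator (this is not the code from the session, and the names `bce`, `discriminator_loss` and the clipping constant `eps` are assumptions made for illustration), here is the binary cross entropy in plain Python. In TensorFlow you would more likely reach for `tf.keras.losses.BinaryCrossentropy`, but writing it out shows why the loss sits between zero and positive infinity.

```python
import math

def bce(y, y_hat, eps=1e-7):
    """Binary cross entropy for one example: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    # Clip the prediction so that log(0) never occurs (eps is an assumed small constant).
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

def discriminator_loss(d_real, d_fake):
    """Average BCE with label 1 for real images and label 0 for generated ones."""
    real_term = sum(bce(1.0, p) for p in d_real) / len(d_real)
    fake_term = sum(bce(0.0, p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that is mostly right gets a loss close to zero ...
good = discriminator_loss(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
# ... and one that is mostly wrong gets a large positive loss.
bad = discriminator_loss(d_real=[0.01, 0.02], d_fake=[0.99, 0.98])
print(good, bad)
```

Because the loss can only shrink towards zero, plain gradient descent on it does the same job as gradient ascent on the original expression.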

58
00:05:14,540 --> 00:05:16,880
From here, we will move on to the generator.

59
00:05:16,880 --> 00:05:18,210
For the generator,

60
00:05:18,230 --> 00:05:24,830
we expect D to produce a one right here, so we expect to have a one here.

61
00:05:25,280 --> 00:05:30,470
And this means that this expression is taken off since we have one minus one, which is zero.

62
00:05:30,470 --> 00:05:32,390
So here we are left only with this.

63
00:05:32,390 --> 00:05:34,790
Now, in what we're left with, y is one.

64
00:05:34,790 --> 00:05:41,870
So it's just log of y hat, which is what the model will predict, or what the discriminator

65
00:05:41,870 --> 00:05:42,680
would predict.

66
00:05:42,680 --> 00:05:45,080
And so we have the negative. Anyway,

67
00:05:45,080 --> 00:05:45,860
let's just say negative.

68
00:05:45,860 --> 00:05:48,350
Let's take off the one over N and the sum.

69
00:05:48,350 --> 00:05:50,270
So we have this expression right here.

70
00:05:50,270 --> 00:05:56,030
And then in the case where the model actually predicts a one, so the model is expected to predict a

71
00:05:56,030 --> 00:05:57,770
one and it actually predicts a one.

72
00:05:57,770 --> 00:06:05,060
In that case, we would have a zero. In the case where the model predicts a zero when it's supposed

73
00:06:05,060 --> 00:06:06,140
to predict a one.

74
00:06:06,140 --> 00:06:08,090
In that case, we will have log of zero.

75
00:06:08,120 --> 00:06:09,290
That's negative infinity.

76
00:06:09,320 --> 00:06:10,010
The negative of negative

77
00:06:10,010 --> 00:06:11,480
infinity is positive infinity.

78
00:06:11,480 --> 00:06:17,130
And again, here our aim is to obtain this zero right here.

79
00:06:17,130 --> 00:06:22,320
So we will again be minimizing our loss.

80
00:06:22,870 --> 00:06:29,530
Now, given that we will be implementing the model, or the architecture, in this paper, that is, the DCGAN

81
00:06:29,530 --> 00:06:30,310
paper,

82
00:06:30,730 --> 00:06:33,580
we should note some of these guidelines here.

83
00:06:33,580 --> 00:06:36,940
We should replace any pooling layers with strided convolutions.

84
00:06:36,940 --> 00:06:42,400
So instead of using pooling, we use strided convolutions and fractional strided convolutions for the

85
00:06:42,400 --> 00:06:43,240
generator.

86
00:06:44,170 --> 00:06:50,020
Next, use batch norm in both the generator and the discriminator. Then remove fully connected hidden

87
00:06:50,020 --> 00:06:52,720
layers for deeper architectures.

88
00:06:52,720 --> 00:06:57,820
Basically here we're using the convolutional layers instead of the fully connected hidden layers.

89
00:06:58,180 --> 00:07:07,330
Use ReLU activations in the generator for all layers except for the output, which uses the tanh activation.

90
00:07:07,330 --> 00:07:11,080
Then use leaky ReLU activation in the discriminator for all layers.

91
00:07:11,080 --> 00:07:18,100
But before we move on to look at some of these details, we should note that there is this GitHub repo

92
00:07:18,130 --> 00:07:23,860
here by one of the authors of the paper. That's,

93
00:07:23,860 --> 00:07:25,210
there we go,

94
00:07:25,360 --> 00:07:29,020
this GitHub here by Soumith Chintala.

95
00:07:29,170 --> 00:07:37,690
And what he proposes here is this list of tips and tricks used to make GANs work.

96
00:07:37,690 --> 00:07:41,710
Now, as we said already, it's not that easy to make GANs work.

97
00:07:41,710 --> 00:07:51,460
So taking advantage of the experience of one of the authors of the DCGAN paper,

98
00:07:51,460 --> 00:07:57,580
not the original GAN paper, will be very interesting for us, since we will not get to make the

99
00:07:57,580 --> 00:08:03,250
same mistakes which maybe he had made before discovering all those different tricks.

100
00:08:03,250 --> 00:08:09,460
Now, that said, the very first thing you should note here is that this list is no longer maintained,

101
00:08:09,460 --> 00:08:11,890
and I'm not sure how relevant it is in 2020.

102
00:08:11,890 --> 00:08:14,350
So this has been around for a while.

103
00:08:14,650 --> 00:08:15,520
Let's see here.

104
00:08:15,520 --> 00:08:16,270
Six years.

105
00:08:16,300 --> 00:08:20,950
Anyway, we have the first one: normalize the inputs.

106
00:08:20,950 --> 00:08:23,420
So normalize the images between -1 and 1.

107
00:08:23,420 --> 00:08:25,880
Then use tanh as the last layer of the generator.

108
00:08:25,890 --> 00:08:27,230
Again, this was already in the paper.
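
To make the normalization tip concrete, here is a small sketch in plain Python, assuming 8-bit images with pixel values from 0 to 255 (the helper names `normalize_pixels` and `denormalize_pixels` are made up for this example). Since tanh outputs values between -1 and 1, the real images are scaled into that same range, and generated values can be mapped back for display.

```python
def normalize_pixels(pixels):
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]

def denormalize_pixels(values):
    """Map tanh-range values in [-1, 1] back to integer pixels in [0, 255]."""
    return [int(round((v + 1.0) * 127.5)) for v in values]

scaled = normalize_pixels([0, 128, 255])
restored = denormalize_pixels(scaled)
print(scaled, restored)
```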

109
00:08:27,260 --> 00:08:30,800
The next tip will be to modify the loss function.

110
00:08:30,800 --> 00:08:36,080
Now, it should be noted that we've already explained this, but maybe the transition from this previous

111
00:08:36,080 --> 00:08:40,190
loss function to this other loss function wasn't made very clear.

112
00:08:40,190 --> 00:08:43,130
Now let's get back to that. Recall, for our loss function,

113
00:08:43,130 --> 00:08:52,490
we had log of one minus D of G of z.

114
00:08:52,520 --> 00:08:54,110
This is for the generator, remember?

115
00:08:54,110 --> 00:08:59,330
So we had only this here, only this expression. In the original GAN paper,

116
00:08:59,330 --> 00:09:05,360
they expected the discriminator to produce a zero, although previously we had mentioned that we expect

117
00:09:05,360 --> 00:09:07,790
the discriminator to produce a one. Right from the beginning

118
00:09:07,790 --> 00:09:14,960
we've been speaking of this, but in the original paper what they actually wanted was the discriminator

119
00:09:14,960 --> 00:09:18,200
to produce a zero for the generator.

120
00:09:18,200 --> 00:09:27,620
Before we go on to look at how to compute this loss, we are going to take into consideration this modification

121
00:09:27,620 --> 00:09:30,170
on this original loss right here.

122
00:09:30,170 --> 00:09:34,490
And this modification comes in because of the following problem.

123
00:09:34,490 --> 00:09:42,740
When the training just starts, it's difficult to get this discriminator to output a one, unlike here, where

124
00:09:42,740 --> 00:09:46,370
it's easier for it to output a zero when it sees fake data.

125
00:09:46,370 --> 00:09:52,220
Because remember, let's get back to this and let's restart here.

126
00:09:52,580 --> 00:09:59,600
Let's restart this and pick another distribution, or let's pick this one, which is slightly

127
00:09:59,600 --> 00:10:00,530
more complicated.

128
00:10:00,530 --> 00:10:06,250
So when you just start with the training, the generated outputs, you see here, the generated

129
00:10:06,260 --> 00:10:10,390
outputs do not look very much like the real data.

130
00:10:10,410 --> 00:10:13,010
See, this here is very different from this.

131
00:10:13,010 --> 00:10:20,000
And so, because of this great difference at the start of the training, it is difficult for us to make the

132
00:10:20,000 --> 00:10:24,860
generator fool the discriminator into outputting a one right here.

133
00:10:24,860 --> 00:10:31,370
And also because of the fact that classifying whether an input is real or not is easier than generating

134
00:10:31,370 --> 00:10:32,660
new inputs,

135
00:10:32,690 --> 00:10:36,380
the generator here will experience vanishing gradients.

136
00:10:36,380 --> 00:10:43,970
And so instead of, as we've seen already, trying to minimize this, we can instead maximize the sum

137
00:10:44,000 --> 00:10:46,370
of log D of G of z.

138
00:10:46,400 --> 00:10:51,320
So instead of log of one minus D of G of z, we're going to have log of D of G of z.
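
Written out, the swap we've just described looks roughly like this. The one over N and the sum over the batch are the same ones we have been leaving out when speaking, and the symbol z_i for the i-th noise sample is an assumption of this sketch.

```latex
\min_G \; \frac{1}{N}\sum_{i=1}^{N} \log\bigl(1 - D(G(z_i))\bigr)
\quad\longrightarrow\quad
\max_G \; \frac{1}{N}\sum_{i=1}^{N} \log D(G(z_i))
```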

139
00:10:51,470 --> 00:10:58,970
The reason why it's preferable for us to use this expression, where we are maximizing this here instead of

140
00:10:58,970 --> 00:11:07,130
minimizing this other expression is simply because when we make use of this expression, we are being

141
00:11:07,130 --> 00:11:15,230
more lenient on the generator, especially at the beginning of the training. When we are minimizing

142
00:11:15,230 --> 00:11:18,080
the log of one minus D of G of z,

143
00:11:18,110 --> 00:11:22,430
if we were to make use of the binary cross entropy

144
00:11:22,440 --> 00:11:22,990
loss,

145
00:11:23,040 --> 00:11:27,930
we would have labels which are equal to zero.

146
00:11:28,140 --> 00:11:30,420
And so if our label equals zero,

147
00:11:30,450 --> 00:11:33,900
obviously this expression is left out and we are left only with this part.

148
00:11:33,930 --> 00:11:36,060
Now we're left with y equals zero.

149
00:11:36,060 --> 00:11:37,890
And then we have log of one minus y hat.

150
00:11:37,920 --> 00:11:39,820
Just similar to what we have here.

151
00:11:39,840 --> 00:11:48,810
But then the fact that at the very beginning we are expecting our discriminator to output y equals zero

152
00:11:48,810 --> 00:11:57,600
when it sees fake data from the generator is a problem, because it's a very easy task for the

153
00:11:57,600 --> 00:12:04,860
discriminator, especially because right here, at the very beginning, this here is obviously not going

154
00:12:04,860 --> 00:12:07,110
to look like the real data.

155
00:12:07,110 --> 00:12:15,090
And so the discriminator will find it very easy to predict that this is fake data, and because

156
00:12:15,090 --> 00:12:24,160
of this, the generator's weights wouldn't be updated any further to ensure that we could produce even

157
00:12:24,250 --> 00:12:27,450
more realistic-looking fake images.

158
00:12:27,460 --> 00:12:34,090
And so what we do is we flip the labels, and flipping the labels matches this expression. That

159
00:12:34,090 --> 00:12:37,150
is, let's take this off:

160
00:12:37,180 --> 00:12:42,070
we now have y equal to one.

161
00:12:42,340 --> 00:12:46,810
So instead of y equals zero, we now have y equal to one.

162
00:12:46,810 --> 00:12:51,710
So we are expecting the discriminator to output one when it sees fake data from the generator.

163
00:12:51,730 --> 00:12:56,140
Now, doing this, we have y equal to one, so obviously this is left out.

164
00:12:56,170 --> 00:12:59,710
You see that this now matches with maximizing this expression.

165
00:12:59,710 --> 00:13:07,180
And so with y equal to one, we are now telling the discriminator to output a one here when it sees fake

166
00:13:07,180 --> 00:13:07,780
data.

167
00:13:07,780 --> 00:13:15,190
And now this will permit the generator to be able to update its weights and so make the training much

168
00:13:15,190 --> 00:13:22,990
more stable as compared to when we're working with labels y which are equal to zero.

169
00:13:23,020 --> 00:13:27,430
You should also note that when we're talking about labels here, we're talking about these y_i's.

170
00:13:27,460 --> 00:13:32,820
And when we're talking about what the model predicts, we're talking about this y hat.

171
00:13:32,860 --> 00:13:35,950
Our next tip is one which we've discussed already.

172
00:13:35,950 --> 00:13:42,820
So we're told that in the GAN paper, the loss function to optimize is the minimization of log of one minus D, but

173
00:13:42,820 --> 00:13:47,350
in practice folks use max of log D.

174
00:13:47,890 --> 00:13:56,590
This is what we've seen already, where we had the minimization of log of one minus D of G of z.

175
00:13:56,620 --> 00:14:07,260
So we had log of one minus D of G of z right here.

176
00:14:07,270 --> 00:14:09,460
So we minimize this expression.

177
00:14:09,460 --> 00:14:18,910
And now, instead of this, we are maximizing the log of D of G of z.

178
00:14:19,450 --> 00:14:26,800
And the reason why we prefer to work this way, as we've said already, is because if you are expecting

179
00:14:26,800 --> 00:14:34,690
the discriminator to take in some fake data from your generator at the beginning and say that

180
00:14:35,510 --> 00:14:42,320
it is fake, then this is going to be a very easy task, especially at the beginning, since the fake

181
00:14:42,740 --> 00:14:47,540
data here is going to look very different from the real data.

182
00:14:47,540 --> 00:14:55,160
And so, because our aim here is to make this discriminator output a zero, it is going to be difficult

183
00:14:55,160 --> 00:15:03,170
for the generator to update its parameters such that the discriminator can start getting fooled.

184
00:15:03,170 --> 00:15:11,090
And so instead of this, as we've said already, we flip the labels and instead aim for the discriminator

185
00:15:11,090 --> 00:15:12,680
to output ones.
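
To see why flipping the labels helps, here is a small numeric sketch, assuming the discriminator ends in a sigmoid so that D of G of z equals sigmoid of a logit a (the function names are made up for illustration). It compares the gradient of the two generator losses with respect to that logit when the discriminator is confident that the image is fake.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the original generator loss, which is minimized."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)): the flipped-label generator loss, which is minimized."""
    return -(1.0 - sigmoid(a))

# a is the discriminator's logit on a fake image.
# A very negative logit means the discriminator is confident the image is fake,
# which is the typical situation at the start of training.
for a in (-6.0, 0.0, 6.0):
    print(a, round(grad_saturating(a), 4), round(grad_non_saturating(a), 4))
```

At a logit of -6 the original loss gives a gradient close to zero, while the flipped-label loss gives a gradient close to -1, so the generator still receives a useful signal exactly when its samples are worst.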

186
00:15:13,430 --> 00:15:15,270
For the next tip, we are

187
00:15:15,500 --> 00:15:19,220
told: don't sample from a uniform distribution.

188
00:15:19,220 --> 00:15:27,080
So for our generator noise, we're going to sample from a Gaussian distribution

189
00:15:27,080 --> 00:15:28,570
or normal distribution.

190
00:15:28,580 --> 00:15:29,060
Okay?
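
A minimal sketch of this tip, using only Python's standard library (the helper name `sample_latent`, the batch size and the latent dimension of 100 are assumptions for illustration). In TensorFlow the equivalent would typically be a call to `tf.random.normal` with the shape of the noise batch.

```python
import random

def sample_latent(batch_size, latent_dim, seed=None):
    """Draw a batch of noise vectors from a standard normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            for _ in range(batch_size)]

z = sample_latent(batch_size=4, latent_dim=100, seed=0)
print(len(z), len(z[0]))
```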

191
00:15:29,060 --> 00:15:35,810
So from there we are encouraged to use batch normalization and to avoid sparse gradients.

192
00:15:35,810 --> 00:15:42,110
And unlike in the paper, if we get back to the DCGAN paper and we check, let's get to this.

193
00:15:42,140 --> 00:15:49,040
We check here: we are told to use leaky ReLU activation in the discriminator for all layers and then use ReLU

194
00:15:49,070 --> 00:15:51,020
activation in the generator for all layers.

195
00:15:51,020 --> 00:15:56,720
But what this says here is that the leaky ReLU is good in both the G and the D.

196
00:15:56,750 --> 00:15:59,840
That's both the generator and the discriminator.

197
00:15:59,870 --> 00:16:06,290
Then one point you should note is the fact that the stability of the GAN game suffers if you have

198
00:16:06,290 --> 00:16:07,790
sparse gradients.

199
00:16:07,820 --> 00:16:09,470
Now, what do they mean by this?

200
00:16:09,800 --> 00:16:17,550
If you have a ReLU activation and some input, all negative inputs are sent to zero and

201
00:16:17,550 --> 00:16:19,590
all positives remain the same.

202
00:16:19,590 --> 00:16:29,280
Then for the leaky ReLU, all negatives will take up some value depending on what we pick

203
00:16:29,310 --> 00:16:30,090
the slope to be.

204
00:16:30,090 --> 00:16:37,770
If the slope is 0.2, for example, then an input of negative one will give us an output of -0.2.

205
00:16:38,400 --> 00:16:47,040
And an input of negative two will give us -0.4. For an input of, say, whatever negative

206
00:16:47,040 --> 00:16:51,690
value, we multiply it by 0.2 to get the output.

207
00:16:51,720 --> 00:16:54,960
That said, the positive section remains the same.

208
00:16:54,960 --> 00:17:01,740
But now, when we talk about sparse gradients here, they come from the fact that if we have ReLUs

209
00:17:01,740 --> 00:17:11,520
in our network, then we will tend to have many zeros, and the zeros will cause this sparsity in

210
00:17:11,520 --> 00:17:12,360
the gradient.

211
00:17:12,360 --> 00:17:18,480
And so, because we do not want to have this and because we want to train the GANs in a more stable

212
00:17:18,480 --> 00:17:22,620
manner, we'll make use of the leaky ReLU.
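
Here is a small sketch of the two activations and their gradients in plain Python, using the slope of 0.2 from the example above (the function names and sample inputs are chosen for illustration). In TensorFlow the layer itself is available as `tf.keras.layers.LeakyReLU`.

```python
def relu(x):
    """Negative inputs are sent to zero, positive inputs pass through."""
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.2):
    """Negative inputs are scaled by alpha instead of being zeroed."""
    return x if x > 0 else alpha * x

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.2):
    return 1.0 if x > 0 else alpha

inputs = [-2.0, -1.0, 0.5, 3.0]
outputs = [leaky_relu(x) for x in inputs]
# Count how many gradient entries are exactly zero for each activation.
zero_grads_relu = sum(1 for x in inputs if relu_grad(x) == 0.0)
zero_grads_leaky = sum(1 for x in inputs if leaky_relu_grad(x) == 0.0)
print(outputs, zero_grads_relu, zero_grads_leaky)
```

With ReLU, every negative input contributes a zero gradient, which is the sparsity the tip warns about. With the leaky ReLU, none of the gradient entries are zero.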

213
00:17:23,780 --> 00:17:30,000
Then for downsampling, we're told to use average pooling or Conv2d with stride.

214
00:17:30,020 --> 00:17:37,520
Then for upsampling, use PixelShuffle or ConvTranspose2d with stride.

215
00:17:37,700 --> 00:17:38,540
Okay.

216
00:17:38,540 --> 00:17:43,760
Then from here we're told to use soft and noisy labels.

217
00:17:43,760 --> 00:17:45,180
So what does this mean?

218
00:17:45,200 --> 00:17:52,730
This means that if we have a discriminator, and let's say we're training our discriminator, where we have

219
00:17:52,730 --> 00:17:59,390
an input from a generator, or we update the parameters of the generator,

220
00:17:59,390 --> 00:18:00,710
where we pass in

221
00:18:02,290 --> 00:18:06,190
our fake data from this output of the generator into our discriminator,

222
00:18:06,190 --> 00:18:11,400
then we have to compare the label with the output from the model.

223
00:18:11,410 --> 00:18:13,540
Let's put the output from the model in this color.

224
00:18:13,540 --> 00:18:17,020
Let's say the output from the model is 0.4, for example.

225
00:18:17,020 --> 00:18:23,160
So what we'll be comparing will be this correct label with the model's prediction.

226
00:18:23,170 --> 00:18:27,850
You see that? Now, what this says here is to apply label smoothing.

227
00:18:27,850 --> 00:18:33,730
So instead of taking one, we could take a random value around one.

228
00:18:33,730 --> 00:18:43,240
So instead of this, we could take, for example, say, 1.2 or 0.8 or 0.9 or whatever value just

229
00:18:43,240 --> 00:18:43,840
around one.

230
00:18:43,840 --> 00:18:50,950
So instead of having hard labels, some strict labels where either we have zero or we have one,

231
00:18:51,070 --> 00:18:53,770
we take values around them.

232
00:18:53,770 --> 00:18:56,050
So, let's change the color.

233
00:18:56,080 --> 00:19:01,090
We'll take values around zero and then values around one.

234
00:19:01,090 --> 00:19:01,600
You see that?

235
00:19:01,600 --> 00:19:06,710
So we use smooth labeling instead of hard labeling.

236
00:19:07,970 --> 00:19:17,710
Then we're also told to make the labels noisy for the discriminator.

237
00:19:17,720 --> 00:19:21,020
That is, occasionally flip the labels when training the discriminator.
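
Both label tips can be sketched in a few lines of plain Python. The ranges used here (0.7 to 1.2 around one, 0.0 to 0.3 around zero), the flip probability and the helper names `soft_labels` and `flip_labels` are assumptions for illustration, not values fixed by this session.

```python
import random

def soft_labels(is_real, batch_size, rng):
    """Sample soft targets: values around 1 for real data, around 0 for fake data."""
    low, high = (0.7, 1.2) if is_real else (0.0, 0.3)
    return [rng.uniform(low, high) for _ in range(batch_size)]

def flip_labels(labels, flip_prob, rng):
    """Occasionally replace a target y by 1 - y to make the labels noisy."""
    return [1.0 - y if rng.random() < flip_prob else y for y in labels]

rng = random.Random(0)
real_targets = soft_labels(True, 8, rng)
fake_targets = soft_labels(False, 8, rng)
# Only the discriminator's targets are flipped, and only occasionally.
noisy_real_targets = flip_labels(real_targets, flip_prob=0.05, rng=rng)
print(real_targets, fake_targets, noisy_real_targets)
```

These targets would then be passed to the binary cross entropy in place of the hard zeros and ones.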

238
00:19:21,350 --> 00:19:22,610
Okay, so that's fine.

239
00:19:22,610 --> 00:19:25,580
Now, next: use DCGAN when you can.

240
00:19:25,580 --> 00:19:25,970
It works.

241
00:19:25,970 --> 00:19:30,080
If you can't use DCGAN and no model is stable, use a hybrid model,

242
00:19:31,460 --> 00:19:38,150
like, for example, VAE plus GAN. Now, here we have been told that if we are to generate images,

243
00:19:38,150 --> 00:19:46,160
we should not use the original GANs, that is, we should not use the simple neural networks that

244
00:19:46,160 --> 00:19:47,540
are fully connected neural networks.

245
00:19:47,540 --> 00:19:50,990
We should go in for convolutional neural networks.

246
00:19:51,140 --> 00:19:52,070
Okay.

247
00:19:52,610 --> 00:19:56,150
The next is to use stability tricks from reinforcement learning.

248
00:19:56,150 --> 00:19:58,610
Now, we have yet to treat reinforcement learning,

249
00:19:58,610 --> 00:20:03,230
so we're going to skip this one. Now, from here: use the Adam optimizer.

250
00:20:03,230 --> 00:20:09,320
optim.Adam rules. So the Adam optimizer rules. And then: track failures early.

251
00:20:09,320 --> 00:20:18,680
So if you want to be able to train your GANs without regretting at the end that you haven't

252
00:20:18,680 --> 00:20:25,760
gotten the kind of results you expected, you should make sure your GAN

253
00:20:25,760 --> 00:20:28,010
isn't doing any one of these right here.

254
00:20:28,010 --> 00:20:35,060
So if you're training and the loss of the discriminator goes to zero, then it's a

255
00:20:35,070 --> 00:20:41,400
failure mode, because your discriminator is proving to be too good at doing its job.

256
00:20:41,400 --> 00:20:46,230
And then if the norms of the gradients are over 100, things are screwing up.

257
00:20:46,230 --> 00:20:54,630
When things are working, the loss has low variance and goes down over time versus having huge variance

258
00:20:54,630 --> 00:20:55,590
and spiking.

259
00:20:55,590 --> 00:21:01,620
So what they're saying here is if we have this, the loss for the discriminator, remember we have the

260
00:21:01,620 --> 00:21:03,090
discriminator and the generator.

261
00:21:03,720 --> 00:21:11,490
What we expect to have is something like this, and it should go down slowly over time, instead of having

262
00:21:11,490 --> 00:21:15,660
this kind of high variance and spiking.

263
00:21:15,810 --> 00:21:21,960
Now, if the loss of the generator steadily decreases, then it's fooling the D, that is, the discriminator,

264
00:21:21,960 --> 00:21:23,130
with garbage.

265
00:21:23,130 --> 00:21:30,660
And so this means that we do not expect the generator to be so good that during training its loss just

266
00:21:30,660 --> 00:21:33,230
drops steadily.

267
00:21:33,230 --> 00:21:37,820
And so, as we've said already, you should track all these failures early on.
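
One way to track these failures early is a small check that runs every few training steps. This is only a sketch: the helper name `training_warnings` and the threshold for a loss that is "near zero" are assumptions, while the gradient norm limit of 100 comes from the tip itself.

```python
def training_warnings(d_losses, grad_norm, d_loss_floor=1e-3, grad_norm_limit=100.0):
    """Return a list of warnings for the failure signs described above."""
    warnings = []
    # A discriminator loss that collapses to zero means D is winning too easily.
    if d_losses and d_losses[-1] < d_loss_floor:
        warnings.append("discriminator loss is near zero")
    # Very large gradient norms are a sign that training is going wrong.
    if grad_norm > grad_norm_limit:
        warnings.append("gradient norm is above the limit")
    return warnings

healthy = training_warnings(d_losses=[0.70, 0.65], grad_norm=3.0)
failing = training_warnings(d_losses=[0.50, 0.0001], grad_norm=250.0)
print(healthy, failing)
```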

268
00:21:37,820 --> 00:21:43,580
Now, next we have: don't balance loss via statistics unless you have a good reason to.

269
00:21:43,610 --> 00:21:45,770
Now, they say they've tried.

270
00:21:45,800 --> 00:21:47,870
It's hard and they've all tried it.

271
00:21:48,200 --> 00:21:49,430
Let's take this off.

272
00:21:49,430 --> 00:21:50,900
So what they're saying is:

273
00:21:51,730 --> 00:21:58,640
don't try to balance the training of the generator and the discriminator based on some loss value.

274
00:21:58,660 --> 00:22:05,950
So if you are to try this, you should have a principled approach to it rather than just intuition.

275
00:22:06,280 --> 00:22:08,410
Now, if you have labels, use them.

276
00:22:08,440 --> 00:22:15,610
Now, talking about labels, this means that if we have, say, our discriminator, and then we

277
00:22:15,610 --> 00:22:22,580
have our real data, and then maybe some dataset of fake data,

278
00:22:22,600 --> 00:22:29,630
then you could train your discriminator like the usual classifier in supervised learning.

279
00:22:29,650 --> 00:22:34,000
Then the next point is to add noise to the inputs and then decay over time.

280
00:22:34,000 --> 00:22:38,410
From here we have these tips where they're actually not sure,

281
00:22:38,410 --> 00:22:41,230
so we may just skip them, actually.

282
00:22:41,530 --> 00:22:43,150
This one is for conditional

283
00:22:43,150 --> 00:22:49,120
GANs; we will not take it into consideration. Use dropout in G in both the train and test phases.

284
00:22:49,120 --> 00:22:54,310
So provide noise in the form of dropout as this generally leads to better results.

285
00:22:54,320 --> 00:22:59,090
Many thanks to the authors: Soumith, Emily, Martin and Michael.

286
00:22:59,120 --> 00:23:05,030
Then, apart from the vanishing gradient problem, another very common problem will be that of mode

287
00:23:05,030 --> 00:23:12,770
collapse, where the generator produces the same outputs even after training for

288
00:23:12,770 --> 00:23:13,910
several epochs.

289
00:23:13,910 --> 00:23:20,060
And so now we're going to start with building our DCGAN while taking into consideration the tips

290
00:23:20,060 --> 00:23:21,350
and tricks which we've just seen.