1
00:00:11,590 --> 00:00:15,050
In this lecture, we are going to discuss activation functions.

2
00:00:15,640 --> 00:00:21,010
Previously, we learned that the sigmoid function allows us to build neural networks and it does several

3
00:00:21,010 --> 00:00:22,210
important things for us.

4
00:00:22,930 --> 00:00:26,410
First, it maps its inputs to the range from zero to one.

5
00:00:26,980 --> 00:00:29,930
This is nice because it mimics the biological neuron.

6
00:00:30,430 --> 00:00:34,780
So our neural network is literally a network of artificial neurons.

7
00:00:35,470 --> 00:00:39,530
Second, and this is more practical, it makes our neural network nonlinear.

8
00:00:40,060 --> 00:00:45,880
The function itself is non-linear and it also makes it so that we can't reduce the neural network into

9
00:00:45,880 --> 00:00:50,510
a simple linear equation. At the same time, in modern deep learning,

10
00:00:50,890 --> 00:00:55,900
we've discovered that there are some problems with the sigmoid and so it is no longer used as often,

11
00:00:55,900 --> 00:00:58,090
except in some specific cases.

12
00:01:03,130 --> 00:01:08,800
If you recall, something we discussed earlier was the importance of standardization. We don't want

13
00:01:08,800 --> 00:01:13,870
to have one input in the range of one million to five million and then another input going from zero

14
00:01:13,870 --> 00:01:15,980
to zero point zero zero zero one.

15
00:01:16,960 --> 00:01:22,720
Instead, we would like to have all of our data centered around zero and in approximately the same range.

16
00:01:23,380 --> 00:01:25,810
Now, the sigmoid is problematic in this regard.

17
00:01:26,410 --> 00:01:32,290
If you recall, the output of a sigmoid is always between zero and one and its middle value is therefore

18
00:01:32,290 --> 00:01:33,300
zero point five.

19
00:01:34,030 --> 00:01:37,870
Therefore, the output of a sigmoid can never be centered around zero.

20
00:01:38,770 --> 00:01:42,220
This goes back to our idea of the uniformity of a neural network.

21
00:01:42,700 --> 00:01:47,110
One layer of a neural network takes as input the output of the previous layer.

22
00:01:47,680 --> 00:01:52,660
So if the previous layer is outputting numbers centered around zero point five, that's not quite right.

23
00:01:54,120 --> 00:02:01,470
This is because this output is going to become the input to the next layer and the next layer wants

24
00:02:01,470 --> 00:02:04,040
to also see inputs which are standardized.

25
00:02:04,530 --> 00:02:10,440
Therefore we want both the inputs and the output of each layer to be standardized.

26
00:02:15,640 --> 00:02:20,770
Luckily, there is a solution to this, although it takes away from this idea that a neural network

27
00:02:20,770 --> 00:02:24,040
is made up of this idealized neuron simulation.

28
00:02:25,900 --> 00:02:31,240
Instead of using the sigmoid, we'll use a function that has the exact same shape as the sigmoid just

29
00:02:31,240 --> 00:02:32,330
centered around zero.

30
00:02:33,040 --> 00:02:36,640
This function is called the tanh, which is short for hyperbolic tangent.

31
00:02:37,390 --> 00:02:41,700
If you recall, there are hyperbolic versions of all the trigonometric functions.

32
00:02:42,010 --> 00:02:45,580
So there's a hyperbolic sine, a hyperbolic cosine, and so forth.

33
00:02:46,360 --> 00:02:52,510
Well, it turns out that the hyperbolic tangent, analogously to the trigonometric tangent, is just the

34
00:02:52,510 --> 00:02:55,330
hyperbolic sine over the hyperbolic cosine.

35
00:02:56,210 --> 00:02:58,130
And it has the equation you see here.

36
00:02:59,420 --> 00:03:06,140
The sigmoid has a similar equation: one divided by one plus the exponential of minus x. The major difference

37
00:03:06,140 --> 00:03:12,200
between the sigmoid and the tanh is that the sigmoid goes between zero and one, while the tanh goes

38
00:03:12,200 --> 00:03:13,970
between minus one and plus one.

39
00:03:14,870 --> 00:03:20,750
As an exercise, you may want to try and prove the relationship between the sigmoid and the tanh, in

40
00:03:20,750 --> 00:03:26,150
particular, that the tanh is just a scaled and vertically shifted version of the sigmoid.
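If you'd like to check that relationship numerically before proving it, here is a minimal sketch. It assumes the identity tanh(x) = 2 * sigmoid(2x) - 1, and the helper names `sigmoid` and `tanh_from_sigmoid` are just illustrative choices.

```python
import math

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    # Assumed identity: double the input, double the output, shift down by one.
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Compare against the standard library tanh at a few sample points.
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert math.isclose(tanh_from_sigmoid(x), math.tanh(x), abs_tol=1e-12)
    print(x, tanh_from_sigmoid(x), math.tanh(x))
```

A numerical check is not a proof, but it is a quick way to confirm you have the right scaling and shift before doing the algebra.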

41
00:03:31,320 --> 00:03:36,900
Now, although the tanh is a little better than the sigmoid, the story is not over. In modern deep

42
00:03:36,900 --> 00:03:37,240
learning,

43
00:03:37,260 --> 00:03:41,970
researchers have figured out that there's actually a problem with both of these activation functions.

44
00:03:42,660 --> 00:03:45,510
I like to tell this story as a bit of a cautionary tale.

45
00:03:46,140 --> 00:03:51,030
For a long time, researchers were very attached to the beauty of neural networks, as is.

46
00:03:51,540 --> 00:03:55,930
They liked this idea of a uniform neural network made up of idealized neurons.

47
00:03:56,460 --> 00:04:02,130
They liked this idea that the sigmoid, and hence also the tanh, were smooth, differentiable functions.

48
00:04:02,820 --> 00:04:05,460
It is mathematically convenient and beautiful.

49
00:04:06,330 --> 00:04:11,340
The problem is they don't work that well, but nobody wanted to break away from this classic way of

50
00:04:11,340 --> 00:04:12,120
doing things.

51
00:04:17,220 --> 00:04:21,810
The major problem with sigmoids and tanhs is called the vanishing gradient problem.

52
00:04:22,500 --> 00:04:25,220
Now this part requires a little bit of calculus knowledge.

53
00:04:25,230 --> 00:04:28,800
So if you're not comfortable with that, then feel free to skip ahead.

54
00:04:29,610 --> 00:04:33,000
Remember that our method of training a neural network is gradient descent.

55
00:04:33,780 --> 00:04:38,730
And obviously this involves finding the gradient of the cost with respect to the parameters.

56
00:04:39,420 --> 00:04:44,820
The problem is when you have a very deep neural network, your gradient has to propagate backwards throughout

57
00:04:44,820 --> 00:04:46,660
the neural network starting from the end.

58
00:04:47,370 --> 00:04:52,530
What happens is your output is made up of a bunch of composite functions, basically a sigmoid of

59
00:04:52,530 --> 00:04:54,200
a sigmoid of a sigmoid, and so on.

60
00:04:55,020 --> 00:04:58,830
When you take the gradient of composite functions, you get the chain rule.

61
00:05:00,140 --> 00:05:03,740
So composite functions become multiplication in the derivative.

62
00:05:09,000 --> 00:05:10,650
So what does this end up looking like?

63
00:05:11,280 --> 00:05:15,390
Well, the further you go back in the neural network, the more terms you have in the chain rule to

64
00:05:15,390 --> 00:05:16,040
multiply.

65
00:05:16,590 --> 00:05:20,070
So you're multiplying by the derivative of the sigmoid over and over again.

66
00:05:20,850 --> 00:05:22,250
What happens when you do that?

67
00:05:22,800 --> 00:05:25,320
Why is the derivative of the sigmoid a problem?

68
00:05:30,410 --> 00:05:34,770
Well, consider what the sigmoid actually looks like. Most of the sigmoid is flat.

69
00:05:34,970 --> 00:05:39,950
In other words, the derivative of the sigmoid is very nearly zero at most points.

70
00:05:40,550 --> 00:05:42,530
Only in the center is it non-zero.

71
00:05:43,220 --> 00:05:47,840
It also turns out that the maximum value of the derivative is only zero point two five.

72
00:05:48,290 --> 00:05:52,010
This is the highest possible value of the derivative of the sigmoid.
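To see where zero point two five comes from, here is a small sketch. It assumes the standard closed form for the sigmoid's derivative, sigmoid(x) * (1 - sigmoid(x)), and the function names are only illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid, written in terms of its own output.
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and falls off quickly on either side.
for x in [-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
```

At zero the sigmoid outputs one half, so the derivative is one half times one half. Away from the center, one of the two factors gets close to zero and the product collapses.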

73
00:05:57,040 --> 00:05:58,600
OK, so why is that a problem?

74
00:05:59,380 --> 00:06:02,370
Well, what happens when you multiply numbers that are very small?

75
00:06:03,100 --> 00:06:05,920
The answer is you get an even smaller number.

76
00:06:06,550 --> 00:06:11,470
Let's say you multiply zero point two five, the maximum possible value, by itself five times.

77
00:06:11,920 --> 00:06:17,890
The result is zero point two five to the power of five, which is approximately zero point zero zero

78
00:06:17,890 --> 00:06:18,270
one.

79
00:06:19,420 --> 00:06:24,310
What happens if, more realistically, you have a value like zero point one multiplied by itself

80
00:06:24,310 --> 00:06:31,840
five times? That's zero point one to the power of five, which is zero point zero zero zero zero one.

81
00:06:32,560 --> 00:06:37,780
In other words, this leads to the result that the further you go back in the neural network, the smaller

82
00:06:37,780 --> 00:06:38,920
the gradient becomes.

83
00:06:39,430 --> 00:06:41,500
We call this the vanishing gradient problem.
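The arithmetic above can be sketched in a few lines. This is a deliberate simplification: it assumes every layer contributes the same derivative term and it ignores the weights that also appear in the chain rule. The name `gradient_factor` is hypothetical.

```python
def gradient_factor(per_layer_derivative, num_layers):
    # Product of one identical derivative term per layer (weights ignored).
    return per_layer_derivative ** num_layers

# Best case for the sigmoid: every layer sits at the maximum derivative of 0.25.
print(gradient_factor(0.25, 5))  # 0.0009765625, roughly 0.001

# The factor shrinks geometrically as the network gets deeper.
for depth in (1, 5, 10, 20):
    print(depth, gradient_factor(0.25, depth), gradient_factor(0.1, depth))
```

Even in the best case the factor shrinks by at least four times per layer, which is why the early layers of a deep sigmoid network receive almost no gradient.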

84
00:06:46,540 --> 00:06:48,310
So how does this problem manifest?

85
00:06:49,000 --> 00:06:53,350
Well, take a look at this graph, which shows the magnitude of the gradient at each layer of the neural

86
00:06:53,350 --> 00:06:54,820
network as it is trained.

87
00:06:55,600 --> 00:06:59,140
You'll notice that the further you go back, the smaller the gradient gets.

88
00:06:59,890 --> 00:07:03,940
Remember that the training algorithm is to take small steps in the direction of the gradient.

89
00:07:04,510 --> 00:07:09,550
Well, if the gradient is nearly zero, that means the update to the weights is also nearly zero.

90
00:07:10,180 --> 00:07:15,690
The end result is that weights close to the input of the neural network are almost not trained at all.

91
00:07:17,280 --> 00:07:22,110
This was a problem in the olden days which prevented us from building very deep neural networks like

92
00:07:22,110 --> 00:07:24,990
we have today, which can have hundreds of layers.

93
00:07:30,110 --> 00:07:33,480
Back in the day, there was lots of theory around how to solve this problem.

94
00:07:34,310 --> 00:07:39,830
One method was called greedy layer-wise pretraining, which was invented by Geoffrey Hinton and his students.

95
00:07:40,640 --> 00:07:44,240
Since all the layers of the neural network could not be trained all at once,

96
00:07:44,420 --> 00:07:47,320
the idea was you could train each layer one at a time.

97
00:07:47,840 --> 00:07:51,940
So what you would do is train the first layer using some alternative loss function.

98
00:07:52,670 --> 00:07:56,750
By the way, if you want to know what that loss function is, it's basically an autoencoder.

99
00:07:56,840 --> 00:07:59,830
But hold that thought for now as it's not an important detail.

100
00:08:01,420 --> 00:08:06,310
Then once you were done training the first layer, you would add on a second layer and train that layer by

101
00:08:06,310 --> 00:08:08,960
itself, not touching the first layer anymore.

102
00:08:09,700 --> 00:08:14,400
Then you would add a third layer and train only the third layer, leaving the first two layers alone.

103
00:08:15,130 --> 00:08:20,200
Once you got to the last layer, all of the previous layers would already be trained to some extent.

104
00:08:25,270 --> 00:08:30,760
Now, despite all this theory, which generated lots of theoretical research into new models such as

105
00:08:30,760 --> 00:08:37,090
restricted Boltzmann machines, deep Boltzmann machines and so forth, today we no longer require such

106
00:08:37,090 --> 00:08:42,700
complicated models in order to train very deep neural networks, although that's not to say they're

107
00:08:42,700 --> 00:08:43,750
not worth learning about.

108
00:08:44,560 --> 00:08:46,570
In fact, the solution was simple.

109
00:08:47,020 --> 00:08:50,710
Just don't use activation functions that have vanishing gradients.

110
00:08:51,280 --> 00:08:56,580
So throw away these beautiful, smooth, differentiable functions like the tanh and the sigmoid.

111
00:08:57,220 --> 00:09:02,560
Instead, let's use this ugly looking, not completely differentiable function called the ReLU.

112
00:09:03,310 --> 00:09:06,030
ReLU is short for rectified linear unit.

113
00:09:06,640 --> 00:09:11,980
Basically it looks like a hockey stick and it has a corner at zero where the derivative is technically

114
00:09:11,980 --> 00:09:12,810
not defined.

115
00:09:13,630 --> 00:09:16,690
The fact that values greater than zero never have a zero

116
00:09:17,020 --> 00:09:20,020
gradient makes training neural networks a lot more efficient.

117
00:09:25,060 --> 00:09:30,430
Now, you might be wondering, wait a minute, if zero gradients are so bad, then why does the ReLU

118
00:09:30,430 --> 00:09:30,990
work at all?

119
00:09:31,630 --> 00:09:37,500
It appears that half the function, any input less than zero, has a derivative that is exactly zero.

120
00:09:37,780 --> 00:09:38,950
Never mind vanishing.

121
00:09:39,190 --> 00:09:40,990
The gradient has already vanished.

122
00:09:41,740 --> 00:09:46,570
Indeed, this is somewhat of a problem as it leads to a phenomenon called dead neurons.

123
00:09:47,170 --> 00:09:52,180
Dead neurons are neurons that always output zero because the weighted sum of their inputs is always

124
00:09:52,180 --> 00:09:53,450
less than or equal to zero.

125
00:09:54,340 --> 00:09:59,500
However, and this is important in deep learning, what we care about is experimental results.

126
00:09:59,890 --> 00:10:05,500
In other words, the most important thing is that it works, not how theoretically satisfying it is,

127
00:10:05,890 --> 00:10:09,680
because as we now know, that can lead to lots of wasted time.

128
00:10:10,390 --> 00:10:16,090
The fact that just the right side doesn't vanish seems to be good enough according to what we've observed.

129
00:10:21,140 --> 00:10:26,840
Of course, some researchers have tried to make modifications to the simple ReLU to solve this problem

130
00:10:26,840 --> 00:10:27,830
of dead neurons.

131
00:10:28,520 --> 00:10:34,550
One alternative is the leaky ReLU, which has a slope of less than one, like zero point one, for values

132
00:10:34,550 --> 00:10:35,240
less than zero.

133
00:10:35,960 --> 00:10:38,540
Importantly, this is still a nonlinear function.

134
00:10:38,900 --> 00:10:45,380
So we're able to learn nonlinear geometrical patterns. Using the leaky ReLU, your derivatives will

135
00:10:45,380 --> 00:10:48,710
always be positive, just like the sigmoid and the tanh.
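Here is a minimal sketch of the ReLU and the leaky ReLU side by side, along with their derivatives. The slope of zero point one is the example value from this lecture, the choice of derivative at exactly zero is just a convention, and the function names are illustrative.

```python
def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return max(0.0, x)

def relu_grad(x):
    # Undefined at exactly 0; returning 0.0 there is a common convention.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.1):
    # alpha is the slope on the negative side (0.1 as in the lecture).
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.1):
    # Never zero, so a unit with negative input still receives some gradient.
    return 1.0 if x > 0 else alpha

for x in [-2.0, -0.5, 0.5, 2.0]:
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

The only difference is on the left side: the ReLU's gradient there is exactly zero, which is what allows dead neurons, while the leaky ReLU keeps a small positive gradient.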

136
00:10:53,830 --> 00:10:59,710
There are other options as well, such as the ELU, or exponential linear unit, which has a more steadily

137
00:10:59,710 --> 00:11:01,690
decreasing value on the left side.

138
00:11:02,380 --> 00:11:07,720
The authors claim that this activation function speeds up learning and leads to higher accuracy than

139
00:11:07,720 --> 00:11:08,350
the ReLU.

140
00:11:09,250 --> 00:11:15,250
One interesting aspect of the ELU is that it allows its outputs to be negative, which goes back to

141
00:11:15,250 --> 00:11:18,490
the idea that we like the mean of the values to be close to zero.

142
00:11:23,560 --> 00:11:28,780
Another option which is very similar is the softplus activation, which, because you're taking the

143
00:11:28,780 --> 00:11:33,310
log of the exponent, looks very linear when the input is reasonably large.

144
00:11:34,870 --> 00:11:39,490
Now, for both of these previous activation functions, there is the vanishing gradient on the left

145
00:11:39,490 --> 00:11:45,130
side, but we've established that it's not so much of a problem since we know that the ReLU already

146
00:11:45,130 --> 00:11:47,890
works and it has gradients that are equal to zero.
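For reference, here is a sketch of the ELU and the softplus using their usual textbook definitions. The default `alpha=1.0` for the ELU is an assumption (the lecture does not give a value), and the softplus is written in a rearranged form purely to avoid overflow for large inputs.

```python
import math

def elu(x, alpha=1.0):
    # Identity for positive inputs; alpha * (exp(x) - 1) otherwise,
    # which levels off at -alpha as x becomes very negative.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def softplus(x):
    # log(1 + exp(x)), rearranged so exp() never sees a large positive argument.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, elu(x), softplus(x))
```

Both curves flatten out on the left, which is the vanishing gradient being discussed, and both look nearly linear once the input is reasonably large.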

147
00:11:49,760 --> 00:11:55,010
Also, although we initially stated that we would like the inputs at each layer to be centered around

148
00:11:55,010 --> 00:12:01,820
zero, we can see that the ReLU and softplus do not accomplish this. For the softplus and the ReLU, the minimum

149
00:12:01,820 --> 00:12:04,570
value is zero, while the maximum value is infinity.

150
00:12:05,150 --> 00:12:07,790
This definitely means they won't be centered around zero.

151
00:12:08,390 --> 00:12:11,180
So is the ReLU not a good choice in the end?

152
00:12:16,220 --> 00:12:22,070
Now, despite all this work to find alternatives to the ReLU activation, these days most people still

153
00:12:22,070 --> 00:12:25,090
use the ReLU as a reasonable default choice.

154
00:12:26,310 --> 00:12:31,320
It works well, and sometimes you'll find that using other alternatives, such as the leaky ReLU or the

155
00:12:31,320 --> 00:12:33,300
ELU, offers no benefit.

156
00:12:33,900 --> 00:12:37,850
Sometimes they do, which is why you always have to experiment for yourself.

157
00:12:38,850 --> 00:12:44,190
My motto, which a lot of my students are tired of hearing by this point, is that machine learning

158
00:12:44,190 --> 00:12:46,820
is experimentation and not philosophy.

159
00:12:47,550 --> 00:12:51,590
Never use your mind to try and predict the outcome of a computer program.

160
00:12:52,110 --> 00:12:58,590
If you have a computer, that is always the suboptimal course of action. Why not simply run the computer

161
00:12:58,590 --> 00:13:00,540
program with a computer?

162
00:13:01,380 --> 00:13:07,650
Your mind is not suitable for running computer programs, but computers are. Therefore, follow the rule:

163
00:13:08,040 --> 00:13:10,890
Don't use philosophy, use experimentation.

164
00:13:15,960 --> 00:13:21,480
Interestingly, some researchers have talked about the biological plausibility of the ReLU activation

165
00:13:21,480 --> 00:13:21,970
function.

166
00:13:22,680 --> 00:13:27,210
In fact, it may even be more biologically plausible than the sigmoid.

167
00:13:27,960 --> 00:13:32,900
To understand this, you have to understand a little more about how action potentials encode information.

168
00:13:33,720 --> 00:13:39,450
What is the difference between an action potential when you hear a quiet sound versus an action potential

169
00:13:39,660 --> 00:13:40,900
when you hear a loud sound?

170
00:13:41,580 --> 00:13:45,750
The answer is there is no difference. An action potential is just an action potential.

171
00:13:46,650 --> 00:13:50,130
The key really is in the frequency of action potentials.

172
00:13:50,700 --> 00:13:54,570
When you hear a very quiet sound, your neurons are only activated a little bit.

173
00:13:55,110 --> 00:13:58,230
So you'll get some action potentials, but they won't be very frequent.

174
00:13:58,920 --> 00:14:03,900
If you hear a very loud sound like, say you're at a concert or a party, then your neurons are getting

175
00:14:03,900 --> 00:14:05,160
lots of stimulation.

176
00:14:05,760 --> 00:14:10,560
The action potentials are going to be very frequent, in other words, closer together in time.

177
00:14:11,340 --> 00:14:17,670
So what we're saying is the more intense a stimulus is, the higher the frequency of the action potentials.

178
00:14:18,180 --> 00:14:19,800
We call this frequency coding.

179
00:14:24,850 --> 00:14:26,710
What does this mean in terms of the ReLU?

180
00:14:27,490 --> 00:14:32,900
Well, you can think of the ReLU as just encoding the action potential frequency itself.

181
00:14:33,460 --> 00:14:38,110
Of course, the minimum frequency is zero and it's not possible to have a negative frequency.

182
00:14:38,710 --> 00:14:41,590
That's why the minimum value of the ReLU is zero.

183
00:14:42,670 --> 00:14:48,940
Then, as the inputs into the neuron get larger because they are being stimulated more, the receiving

184
00:14:48,940 --> 00:14:52,570
neuron also gets stimulated more, increasing its frequency.

185
00:14:57,660 --> 00:15:03,060
Now, one thing to note is that we can go even deeper than this, although this is not part of mainstream

186
00:15:03,060 --> 00:15:04,050
deep learning quite yet.

187
00:15:04,920 --> 00:15:07,480
As a side note, if you don't want to listen to this part,

188
00:15:07,530 --> 00:15:08,490
it's not required.

189
00:15:08,870 --> 00:15:13,470
It's something that I think is interesting to those of us who are interested in modeling brains.

190
00:15:13,800 --> 00:15:17,980
But if you are a hardcore statistician, then perhaps you may think differently.

191
00:15:19,320 --> 00:15:25,290
What we know based on neuroscience is that, unlike the ReLU activation, action potential frequency

192
00:15:25,290 --> 00:15:28,290
is not linearly related to its input stimuli.

193
00:15:28,860 --> 00:15:31,110
Instead, we have a nonlinear relationship.

194
00:15:32,250 --> 00:15:36,290
It can be modeled using the log function or a root function like the square root.

195
00:15:37,320 --> 00:15:43,020
A good way to understand this is with sound: something that is twice as loud, according to measurements

196
00:15:43,020 --> 00:15:45,780
of sound intensity, does not sound twice as loud.

197
00:15:46,560 --> 00:15:48,890
This is the intuition behind the decibel scale.

198
00:15:49,830 --> 00:15:53,090
As you recall, the decibel scale is a logarithmic scale.

199
00:15:53,670 --> 00:15:57,840
For example, a 90 decibel sound would be like sitting inside a moving bus.

200
00:15:58,500 --> 00:16:01,620
A 100 decibel sound would be like standing by an electric saw.

201
00:16:02,220 --> 00:16:08,100
A 110 decibel sound would be like a loud orchestra. A 120 decibel

202
00:16:08,100 --> 00:16:10,320
sound would be like standing near a jet engine.

203
00:16:11,250 --> 00:16:17,280
A 130 decibel sound would be like being close to artillery fire, and this value is the threshold of

204
00:16:17,280 --> 00:16:17,770
pain.

205
00:16:18,270 --> 00:16:21,300
So when you hear a sound this loud, it actually hurts physically.

206
00:16:22,170 --> 00:16:27,840
To put that in perspective, 100 decibels is 10 times more powerful than 90 decibels.

207
00:16:28,380 --> 00:16:36,420
110 decibels is 100 times more powerful, 120 decibels is 1000 times more powerful and 130 decibels

208
00:16:36,570 --> 00:16:38,670
is 10000 times more powerful.
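Those ratios follow directly from the definition of the decibel: every 10 decibels is a factor of ten in power. A short sketch, where the helper name `power_ratio` and the 90 decibel reference are just choices made for this example:

```python
def power_ratio(db, reference_db=90.0):
    # A difference of 10 dB corresponds to a 10x difference in power.
    return 10.0 ** ((db - reference_db) / 10.0)

for db in (100, 110, 120, 130):
    print(db, "dB is", power_ratio(db), "times more powerful than 90 dB")
```

Going the other way, the decibel value grows with the logarithm of the power, which is the kind of compressive, nonlinear relationship being described here.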

209
00:16:43,780 --> 00:16:49,600
Some work has been done to experiment with activation functions that more accurately model the frequency

210
00:16:49,600 --> 00:16:53,370
characteristics of real neurons, such as the BRU.

211
00:16:53,800 --> 00:16:57,330
But as a whole, they haven't yet caught on in the deep learning community.

212
00:16:58,030 --> 00:16:59,680
I've attached the paper on the

213
00:16:59,680 --> 00:17:03,410
BRU to extra_reading.txt in case you want to check it out.

214
00:17:04,120 --> 00:17:09,960
The author reports that the BRU activation function led to better results than the ReLU and the ELU.

215
00:17:10,360 --> 00:17:13,410
So it may be something worth trying in your own project.
