1
00:00:11,590 --> 00:00:15,010
In this lecture, we are going to discuss activation functions.

2
00:00:15,610 --> 00:00:21,010
Previously, we learned that the sigmoid function allows us to build neural networks, and it does several

3
00:00:21,010 --> 00:00:22,150
important things for us.

4
00:00:22,900 --> 00:00:26,440
First, it maps its inputs to go from zero to one.

5
00:00:26,980 --> 00:00:29,950
This is nice because it mimics the biological neuron.

6
00:00:30,430 --> 00:00:34,750
So our neural network is literally a network of artificial neurons.

7
00:00:35,510 --> 00:00:37,450
Second, and this is more practical,

8
00:00:37,690 --> 00:00:39,520
it makes our neural network non-linear.

9
00:00:40,060 --> 00:00:41,740
The function itself is non-linear.

10
00:00:41,980 --> 00:00:47,320
And it also makes it so that we can't reduce the neural network into a simple, linear equation.

11
00:00:48,160 --> 00:00:52,930
At the same time, in modern deep learning, we've discovered that there are some problems with the

12
00:00:52,930 --> 00:00:53,400
sigmoid.

13
00:00:53,410 --> 00:00:58,060
And so it is no longer used as often, except in some specific cases.

14
00:01:03,160 --> 00:01:07,690
If you recall, something we discussed earlier was the importance of standardization.

15
00:01:08,290 --> 00:01:13,210
We don't want to have one input in the range one million to five million and then another input going

16
00:01:13,210 --> 00:01:15,940
from zero to zero point zero zero zero one.

17
00:01:16,990 --> 00:01:22,690
Instead, we would like to have all of our data centered around zero and in approximately the same range.

18
00:01:23,380 --> 00:01:25,790
Now the sigmoid is problematic in this regard.

19
00:01:26,410 --> 00:01:32,260
If you recall, the output of a sigmoid is always between zero and one, and its middle value is therefore

20
00:01:32,260 --> 00:01:33,250
zero point five.

21
00:01:34,030 --> 00:01:37,900
Therefore, the output of a sigmoid can never be centered around zero.

22
00:01:38,770 --> 00:01:42,190
This goes back to our idea of the uniformity of a neural network.

23
00:01:42,700 --> 00:01:47,110
One layer of a neural network takes as input the output of the previous layer.

24
00:01:47,650 --> 00:01:52,660
So if the previous layer is outputting numbers centered around zero point five, that's not quite right.

25
00:01:54,120 --> 00:02:01,470
This is because this output is going to become the input to the next layer, and the next layer wants

26
00:02:01,470 --> 00:02:03,990
to also see inputs which are standardized.

27
00:02:04,500 --> 00:02:10,380
Therefore, we want both the inputs and the outputs of each layer to be standardized.

28
00:02:15,670 --> 00:02:20,740
Luckily, there is a solution to this, although it takes away from this idea that a neural network

29
00:02:20,740 --> 00:02:24,010
is made up of this idealized neuron simulation.

30
00:02:25,930 --> 00:02:31,210
Instead of using the sigmoid, we'll use a function that has the exact same shape as the sigmoid, just

31
00:02:31,210 --> 00:02:32,320
centered around zero.

32
00:02:33,070 --> 00:02:36,610
This function is called the tanh, which is short for hyperbolic tangent.

33
00:02:37,420 --> 00:02:42,940
If you recall, there are hyperbolic versions of all the trigonometric functions, so there's a hyperbolic

34
00:02:42,940 --> 00:02:45,550
sine, hyperbolic cosine, and so forth.

35
00:02:46,360 --> 00:02:52,480
Well, it turns out that the hyperbolic tangent, analogously to the trigonometric tangent, is just the

36
00:02:52,480 --> 00:02:55,300
hyperbolic sine over the hyperbolic cosine.

37
00:02:56,270 --> 00:02:58,090
And it has the equation you see here.

38
00:02:59,390 --> 00:03:04,580
The sigmoid has a similar equation: one divided by one plus the exponential of minus x.

39
00:03:05,270 --> 00:03:10,760
The major difference between the sigmoid and the tanh is that the sigmoid goes between zero and one,

40
00:03:11,150 --> 00:03:13,910
while the tanh goes between minus one and plus one.

41
00:03:14,900 --> 00:03:21,440
As an exercise, you may want to try and prove the relationship between the sigmoid and the tanh, in particular

42
00:03:21,770 --> 00:03:26,120
that the tanh is just a scaled and vertically shifted version of the sigmoid.

43
00:03:31,320 --> 00:03:37,220
Now, although the tanh is a little better than the sigmoid, the story is not over. In modern deep learning,

44
00:03:37,230 --> 00:03:41,940
researchers have figured out that there's actually a problem with both of these activation functions.

45
00:03:42,660 --> 00:03:45,510
I like to tell this story as a bit of a cautionary tale.

46
00:03:46,140 --> 00:03:51,000
For a long time, researchers were very attached to the beauty of neural networks, as is.

47
00:03:51,540 --> 00:03:55,920
They liked this idea of a uniform neural network made up of idealized neurons.

48
00:03:56,460 --> 00:04:02,100
They liked this idea that the sigmoid, and hence also the tanh, are smooth, differentiable functions.

49
00:04:02,850 --> 00:04:05,490
It is mathematically convenient and beautiful.

50
00:04:06,330 --> 00:04:11,310
The problem is they don't work that well, but nobody wanted to break away from this classic way of

51
00:04:11,310 --> 00:04:12,090
doing things.

52
00:04:17,220 --> 00:04:21,810
The major problem with sigmoids and tanhs is called the vanishing gradient problem.

53
00:04:22,500 --> 00:04:27,270
Now this part requires a little bit of calculus knowledge, so if you're not comfortable with that,

54
00:04:27,480 --> 00:04:28,800
then feel free to skip ahead.

55
00:04:29,610 --> 00:04:33,000
Remember that our method of training a neural network is gradient descent.

56
00:04:33,780 --> 00:04:38,730
And obviously, this involves finding the gradient of the cost with respect to the parameters.

57
00:04:39,390 --> 00:04:44,070
The problem is, when you have a very deep neural network, your gradient has to propagate backwards

58
00:04:44,430 --> 00:04:46,620
throughout the neural network, starting from the end.

59
00:04:47,370 --> 00:04:52,560
What happens is your output is made up of a bunch of composite functions, basically a sigmoid of a

60
00:04:52,560 --> 00:04:54,180
sigmoid of a sigmoid and so on.

61
00:04:55,050 --> 00:04:58,800
When you take the gradient of composite functions, you get the chain rule.

62
00:05:00,170 --> 00:05:03,830
So composite functions become multiplication in the derivative.

63
00:05:09,030 --> 00:05:10,650
So what does this end up looking like?

64
00:05:11,280 --> 00:05:15,360
Well, the further you go back in the neural network, the more terms you have in the chain rule to

65
00:05:15,360 --> 00:05:20,100
multiply, so you're multiplying by the derivative of the sigmoid over and over again.

66
00:05:20,820 --> 00:05:22,230
What happens when you do that?

67
00:05:22,770 --> 00:05:25,290
Why is the derivative of the sigmoid a problem?

68
00:05:30,380 --> 00:05:32,780
Well, consider what the sigmoid actually looks like.

69
00:05:33,410 --> 00:05:34,790
Most of the sigmoid is flat.

70
00:05:35,000 --> 00:05:38,840
In other words, the derivative of the sigmoid is very nearly zero

71
00:05:38,870 --> 00:05:42,530
at most points. Only in the center is it nonzero.

72
00:05:43,250 --> 00:05:47,810
It also turns out that the maximum value of the derivative is only 0.25.

73
00:05:48,260 --> 00:05:51,980
This is the highest possible value of the derivative of the sigmoid.

74
00:05:57,100 --> 00:05:58,600
OK, so why is that a problem?

75
00:05:59,350 --> 00:06:02,380
Well, what happens when you multiply numbers that are very small?

76
00:06:03,100 --> 00:06:05,890
The answer is you get an even smaller number.

77
00:06:06,550 --> 00:06:11,450
Let's say you multiply 0.25, the maximum possible value, by itself five times.

78
00:06:11,890 --> 00:06:18,250
The result is 0.25 to the power five, which is approximately zero point zero zero one.

79
00:06:19,390 --> 00:06:24,550
What happens if, more realistically, you have a value like zero point one multiplied by itself five

80
00:06:24,550 --> 00:06:25,210
times?

81
00:06:25,750 --> 00:06:31,810
That's zero point one to the power five, which is zero point zero zero zero zero one.

82
00:06:32,590 --> 00:06:37,780
In other words, this leads to the result that the further you go back in the neural network, the smaller

83
00:06:37,780 --> 00:06:38,890
the gradient becomes.

84
00:06:39,430 --> 00:06:41,500
We call this the vanishing gradient problem.

85
00:06:46,570 --> 00:06:48,310
So how does this problem manifest?

86
00:06:49,000 --> 00:06:53,350
Well, take a look at this graph, which shows the magnitude of the gradient at each layer of the neural

87
00:06:53,350 --> 00:06:54,790
network as it is trained.

88
00:06:55,630 --> 00:06:59,090
You'll notice that the further you go back, the smaller the gradient gets.

89
00:06:59,920 --> 00:07:03,880
Remember that the training algorithm is to take small steps in the direction of the negative gradient.

90
00:07:04,510 --> 00:07:09,580
Well, if the gradient is nearly zero, that means the update to the weights is also nearly zero.

91
00:07:10,210 --> 00:07:15,670
The end result is that weights close to the input of the neural network are almost not trained at all.

92
00:07:17,280 --> 00:07:22,080
This was a problem in the olden days, which prevented us from building very deep neural networks like

93
00:07:22,080 --> 00:07:24,960
we have today, which can have hundreds of layers.

94
00:07:30,080 --> 00:07:33,470
Back in the day, there was lots of theory around how to solve this problem.

95
00:07:34,310 --> 00:07:39,200
One method was called greedy layer-wise pre-training, which was invented by Geoffrey Hinton and his

96
00:07:39,200 --> 00:07:44,210
students. Since all the layers of the neural network could not be trained all at once,

97
00:07:44,390 --> 00:07:47,300
the idea was you could train each layer one at a time.

98
00:07:47,840 --> 00:07:51,890
So what you would do is train the first layer using some alternative loss function.

99
00:07:52,670 --> 00:07:56,720
By the way, if you want to know what that loss function is, it's basically an autoencoder.

100
00:07:56,840 --> 00:07:59,810
But hold that thought for now, as it's not an important detail.

101
00:08:01,420 --> 00:08:06,130
Then once you were done training the first layer, you would add on a second layer and train that layer

102
00:08:06,130 --> 00:08:08,950
by itself, not touching the first layer anymore.

103
00:08:09,700 --> 00:08:14,350
Then you would add a third layer and train only the third layer, leaving the first two layers alone.

104
00:08:15,100 --> 00:08:20,170
Once you got to the last layer, all of the previous layers would already be trained to some extent.

105
00:08:25,240 --> 00:08:30,760
Now, despite all this theory, which generated lots of theoretical research into new models such as

106
00:08:30,760 --> 00:08:37,720
restricted Boltzmann machines, deep Boltzmann machines, and so forth, today we no longer require such complicated

107
00:08:37,720 --> 00:08:40,750
models in order to train very deep neural networks.

108
00:08:41,559 --> 00:08:43,750
Although that's not to say they're not worth learning about.

109
00:08:44,560 --> 00:08:50,650
In fact, the solution was simple: just don't use activation functions that have vanishing gradients.

110
00:08:51,280 --> 00:08:56,560
So throw away these beautiful, smooth, differentiable functions like the tanh and the sigmoid.

111
00:08:57,250 --> 00:09:02,560
Instead, let's use this ugly-looking, not completely differentiable function called the ReLU.

112
00:09:03,280 --> 00:09:06,010
ReLU is short for rectified linear unit.

113
00:09:06,640 --> 00:09:11,950
Basically, it looks like a hockey stick, and it has a corner at zero, where the derivative is technically

114
00:09:11,950 --> 00:09:12,760
not defined.

115
00:09:13,600 --> 00:09:18,970
The fact that values greater than zero never have a zero gradient makes training neural networks a

116
00:09:18,970 --> 00:09:19,960
lot more efficient.

117
00:09:25,090 --> 00:09:30,400
Now, you might be wondering, wait a minute, if zero gradients are so bad, then why does the ReLU

118
00:09:30,400 --> 00:09:31,000
work at all?

119
00:09:31,660 --> 00:09:37,480
It appears that half the function, any input less than zero, has a derivative that is exactly zero.

120
00:09:37,780 --> 00:09:40,960
Never mind vanishing, the gradient has already vanished.

121
00:09:41,770 --> 00:09:46,540
Indeed, this is somewhat of a problem as it leads to a phenomenon called dead neurons.

122
00:09:47,170 --> 00:09:52,450
Dead neurons are neurons that always output zero because the weighted sum of their inputs is always less

123
00:09:52,450 --> 00:09:53,410
than or equal to zero.

124
00:09:54,340 --> 00:09:59,470
However, and this is important in deep learning, what we care about is experimental results.

125
00:09:59,890 --> 00:10:05,440
In other words, the most important thing is that it works, not how theoretically satisfying it is,

126
00:10:05,890 --> 00:10:09,670
because as we now know, that can lead to lots of wasted time.

127
00:10:10,390 --> 00:10:16,060
The fact that just the right side doesn't vanish seems to be good enough, according to what we've observed.

128
00:10:21,200 --> 00:10:26,810
Of course, some researchers have tried to make modifications to the simple ReLU to solve this problem

129
00:10:26,810 --> 00:10:27,800
of dead neurons.

130
00:10:28,490 --> 00:10:34,820
One alternative is the leaky ReLU, which has a slope of less than one, like 0.1, for values less than

131
00:10:34,820 --> 00:10:35,270
zero.

132
00:10:35,990 --> 00:10:42,080
Importantly, this is still a nonlinear function, so we're able to learn nonlinear geometrical patterns.

133
00:10:42,980 --> 00:10:44,390
Using the leaky ReLU,

134
00:10:44,420 --> 00:10:48,650
your derivatives will always be positive, just like the sigmoid and the tanh.

135
00:10:53,770 --> 00:10:59,680
There are other options as well, such as the ELU, or exponential linear unit, which has a more steadily

136
00:10:59,680 --> 00:11:01,630
decreasing value on the left side.

137
00:11:02,350 --> 00:11:07,690
The authors claim that this activation function speeds up learning and leads to higher accuracy than

138
00:11:07,690 --> 00:11:08,380
the ReLU.

139
00:11:09,250 --> 00:11:15,370
One interesting aspect of the ELU is that it allows its outputs to be negative, which goes back to the

140
00:11:15,370 --> 00:11:18,490
idea that we like the mean of the values to be close to zero.

141
00:11:23,590 --> 00:11:29,140
Another option which is very similar is the softplus activation, which, because you're taking the log

142
00:11:29,140 --> 00:11:33,280
of the exponential, looks very linear when the input is reasonably large.

143
00:11:34,930 --> 00:11:39,490
Now, for both of these previous activation functions, there is the vanishing gradient on the left

144
00:11:39,490 --> 00:11:45,100
side, but we've established that it's not so much of a problem since we know that the ReLU already

145
00:11:45,100 --> 00:11:47,860
works and it has gradients that are equal to zero.

146
00:11:49,760 --> 00:11:54,980
Also, although we initially stated that we would like the inputs at each layer to be centered around

147
00:11:54,980 --> 00:12:01,160
zero, we can see that the ReLU and softplus do not accomplish this. For the softplus and the ReLU,

148
00:12:01,160 --> 00:12:04,580
the minimum value is zero, while the maximum value is infinity.

149
00:12:05,150 --> 00:12:07,760
This definitely means they won't be centered around zero.

150
00:12:08,390 --> 00:12:11,120
So is the ReLU not a good choice in the end?

151
00:12:16,250 --> 00:12:21,770
Now, despite all this work to find alternatives to the ReLU activation, these days most people

152
00:12:21,770 --> 00:12:25,060
still use the ReLU as a reasonable default choice.

153
00:12:26,310 --> 00:12:31,200
It works well, and sometimes you'll find that using other alternatives, such as the leaky ReLU or

154
00:12:31,200 --> 00:12:33,300
the ELU, offer no benefit.

155
00:12:33,900 --> 00:12:37,800
Sometimes they do, which is why you always have to experiment for yourself.

156
00:12:38,850 --> 00:12:44,370
My motto, which a lot of my students are tired of hearing by this point, is that machine learning is

157
00:12:44,370 --> 00:12:46,830
experimentation and not philosophy.

158
00:12:47,550 --> 00:12:51,540
Never use your mind to try and predict the outcome of a computer program

159
00:12:52,140 --> 00:12:58,560
if you have a computer. That is always the suboptimal course of action. Why not simply run the computer

160
00:12:58,560 --> 00:13:00,510
program with a computer?

161
00:13:01,350 --> 00:13:07,620
Your mind is not suitable for running computer programs, but computers are. Therefore, follow the rule:

162
00:13:07,980 --> 00:13:09,240
Don't use philosophy.

163
00:13:09,480 --> 00:13:10,860
Use experimentation.

164
00:13:15,990 --> 00:13:21,450
Interestingly, some researchers have talked about the biological plausibility of the ReLU activation

165
00:13:21,450 --> 00:13:21,960
function.

166
00:13:22,680 --> 00:13:27,180
In fact, it may even be more biologically plausible than the sigmoid.

167
00:13:27,960 --> 00:13:32,880
To understand this, you have to understand a little more about how action potentials encode information.

168
00:13:33,720 --> 00:13:35,950
What is the difference between an action potential

169
00:13:36,180 --> 00:13:42,240
when you hear a quiet sound versus an action potential when you hear a loud sound? The answer is

170
00:13:42,240 --> 00:13:43,290
there is no difference.

171
00:13:43,410 --> 00:13:45,750
An action potential is just an action potential.

172
00:13:46,650 --> 00:13:50,070
The key really is in the frequency of action potentials.

173
00:13:50,700 --> 00:13:56,010
When you hear a very quiet sound, your neurons are only activated a little bit, so you'll get some

174
00:13:56,010 --> 00:13:58,200
action potentials, but they won't be very frequent.

175
00:13:58,950 --> 00:14:03,870
If you hear a very loud sound like, say, you're at a concert or a party, then your neurons are getting

176
00:14:03,870 --> 00:14:05,130
lots of stimulation.

177
00:14:05,730 --> 00:14:08,160
The action potentials are going to be very frequent,

178
00:14:08,250 --> 00:14:10,530
in other words, closer together in time.

179
00:14:11,370 --> 00:14:17,670
So what we're saying is the more intense a stimulus is, the higher the frequency of the action potentials.

180
00:14:18,180 --> 00:14:19,770
We call this frequency coding.

181
00:14:24,820 --> 00:14:26,710
What does this mean in terms of the ReLU?

182
00:14:27,460 --> 00:14:32,830
Well, you can think of it as the ReLU just encoding the action potential frequency itself.

183
00:14:33,520 --> 00:14:35,590
Of course, the minimum frequency is zero.

184
00:14:35,770 --> 00:14:38,110
And it's not possible to have a negative frequency.

185
00:14:38,680 --> 00:14:41,590
That's why the minimum value of the ReLU is zero.

186
00:14:42,640 --> 00:14:48,940
Then, as the inputs into the neuron get larger because they are being stimulated more, the receiving

187
00:14:48,940 --> 00:14:52,570
neuron also gets stimulated more, increasing its frequency.

188
00:14:57,660 --> 00:15:03,060
Now, one thing to note is that we can go even deeper than this, although this is not part of mainstream

189
00:15:03,060 --> 00:15:07,470
deep learning quite yet. As a side note, if you don't want to listen to this part,

190
00:15:07,500 --> 00:15:08,430
it's not required.

191
00:15:08,880 --> 00:15:13,440
It's something that I think is interesting to those of us who are interested in modeling brains.

192
00:15:13,770 --> 00:15:17,970
But if you are a hardcore statistician, then perhaps you may think differently.

193
00:15:19,290 --> 00:15:25,290
What we know based on neuroscience is that, unlike the ReLU activation, action potential frequency

194
00:15:25,290 --> 00:15:28,260
is not linearly related to its input stimuli.

195
00:15:28,890 --> 00:15:31,050
Instead, we have a nonlinear relationship.

196
00:15:32,250 --> 00:15:36,290
It can be modeled using the log function or a root function like the square root.

197
00:15:37,380 --> 00:15:39,620
A good way to understand this is with sound.

198
00:15:40,260 --> 00:15:43,500
Something that is twice as loud does not sound twice as loud

199
00:15:43,830 --> 00:15:45,780
using measurements of sound intensity.

200
00:15:46,560 --> 00:15:48,900
This is the intuition behind the decibel scale.

201
00:15:49,860 --> 00:15:53,070
As you recall, the decibel scale is a logarithmic scale.

202
00:15:53,670 --> 00:15:57,810
For example, a 90 decibel sound would be like sitting inside a moving bus.

203
00:15:58,500 --> 00:16:01,620
A 100 decibel sound would be like standing by an electric saw.

204
00:16:02,110 --> 00:16:06,210
A 110 decibel sound would be like a loud orchestra.

205
00:16:06,720 --> 00:16:10,290
A 120 decibel sound would be like standing near a jet engine.

206
00:16:11,310 --> 00:16:17,280
A 130 decibel sound would be like being close to artillery fire, and this value is the threshold of

207
00:16:17,280 --> 00:16:17,760
pain.

208
00:16:18,300 --> 00:16:21,270
So when you hear a sound this loud, it actually hurts physically.

209
00:16:22,170 --> 00:16:27,780
To put that in perspective, 100 decibels is 10 times more powerful than 90 decibels.

210
00:16:28,350 --> 00:16:36,360
110 decibels is 100 times more powerful, 120 decibels is 1000 times more powerful, and 130 decibels

211
00:16:36,600 --> 00:16:38,670
is 10000 times more powerful.

212
00:16:43,780 --> 00:16:49,600
Some work has been done to experiment with activation functions that more accurately model the frequency

213
00:16:49,600 --> 00:16:53,380
characteristics of real neurons, such as the BRU.

214
00:16:53,800 --> 00:16:57,310
But as a whole, they haven't yet caught on in the deep learning community.

215
00:16:58,030 --> 00:17:01,380
I've attached the paper on the BRU to extra reading text,

216
00:17:02,080 --> 00:17:07,930
in case you want to check it out. The author reports that the BRU activation function led to better

217
00:17:07,930 --> 00:17:09,970
results than the ReLU and the ELU.

218
00:17:10,329 --> 00:17:13,390
So it may be something worth trying in your own project.

