1
00:00:11,700 --> 00:00:15,660
In this lecture we are going to discuss activation functions.

2
00:00:15,660 --> 00:00:21,000
Previously we learned that the sigmoid function allows us to build neural networks and it does several

3
00:00:21,000 --> 00:00:22,910
important things for us.

4
00:00:22,950 --> 00:00:27,010
First, it maps its inputs to the range from 0 to 1.

5
00:00:27,030 --> 00:00:30,450
This is nice because it mimics the biological neuron.

6
00:00:30,450 --> 00:00:35,500
So our neural network is literally a network of artificial neurons.

7
00:00:35,520 --> 00:00:37,770
Second, and this is more practical,

8
00:00:37,770 --> 00:00:39,930
it makes our neural network nonlinear.

9
00:00:40,110 --> 00:00:45,900
The function itself is non-linear and it also makes it so that we can't reduce the neural network into

10
00:00:45,900 --> 00:00:48,120
a simple linear equation.

11
00:00:48,240 --> 00:00:53,430
At the same time, in modern deep learning we've discovered that there are some problems with the sigmoid,

12
00:00:53,430 --> 00:00:58,080
and so it is no longer used as often, except in some specific cases.

13
00:01:03,200 --> 00:01:08,290
If you recall, something we discussed earlier was the importance of standardization.

14
00:01:08,360 --> 00:01:13,490
We don't want to have one input in the range 1 million to 5 million and then another input going from

15
00:01:13,490 --> 00:01:21,120
0 to 0.0001. Instead, we would like to have all of our data centered around zero and in

16
00:01:21,120 --> 00:01:23,250
approximately the same range.

17
00:01:23,430 --> 00:01:26,460
Now the sigmoid is problematic in this regard.

18
00:01:26,490 --> 00:01:32,490
If you recall, the output of a sigmoid is always between 0 and 1, and its middle value is therefore

19
00:01:32,490 --> 00:01:34,050
0.5.

20
00:01:34,080 --> 00:01:38,820
Therefore the output of a sigmoid can never be centered around zero.

21
00:01:38,820 --> 00:01:43,920
This goes back to our idea of the uniformity of a neural network. One layer of a neural network, well,

22
00:01:43,920 --> 00:01:47,630
it takes as input the output of the previous layer.

23
00:01:47,730 --> 00:01:54,150
So if the previous layer is outputting numbers centered around 0.5, that's not quite right.

24
00:01:54,160 --> 00:02:01,480
This is because this output is going to become the input to the next layer and the next layer wants

25
00:02:01,480 --> 00:02:04,060
to also see inputs which are standardized.

26
00:02:04,600 --> 00:02:10,450
Therefore, we want both the inputs and the outputs of each layer to be standardized.

27
00:02:15,740 --> 00:02:21,170
Luckily, there is a solution to this, although it takes away from this idea that a neural network is made

28
00:02:21,170 --> 00:02:29,130
up of this idealized neuron simulation. Instead of using the sigmoid, we'll use a function that has the

29
00:02:29,130 --> 00:02:33,050
exact same shape as the sigmoid, just centered around zero.

30
00:02:33,090 --> 00:02:39,330
This function is called the tanh, which is short for hyperbolic tangent. If you recall, there are hyperbolic

31
00:02:39,330 --> 00:02:42,060
versions of all the trigonometric functions.

32
00:02:42,060 --> 00:02:46,360
So there is a hyperbolic sine, a hyperbolic cosine, and so forth.

33
00:02:46,440 --> 00:02:52,470
Well, it turns out that the hyperbolic tangent, analogously to the trigonometric tangent, is just the

34
00:02:52,470 --> 00:03:00,520
hyperbolic sine over the hyperbolic cosine, and it has the equation you see here. The sigmoid has a similar

35
00:03:00,520 --> 00:03:05,030
equation: 1 divided by 1 plus the exponential of minus x.

36
00:03:05,320 --> 00:03:11,380
The major difference between the sigmoid and the tanh is that the sigmoid goes between 0 and 1 while

37
00:03:11,380 --> 00:03:16,360
the tanh goes between minus 1 and plus 1. As an exercise,

38
00:03:16,360 --> 00:03:22,090
you may want to try to prove the relationship between the sigmoid and the tanh, in particular that the

39
00:03:22,090 --> 00:03:26,140
tanh is just a scaled and vertically shifted version of the sigmoid.
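
Before attempting the proof, it can help to check the claim numerically. The sketch below assumes the specific relationship tanh(x) = 2 * sigmoid(2x) - 1, which the lecture does not state explicitly; the function names are made up for illustration, and a numerical check is of course not a proof.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x: float) -> float:
    """Assumed relationship: scale the input by 2, scale the output by 2, shift down by 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# Compare against the standard library tanh at a few sample points.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_from_sigmoid(x), math.tanh(x))
```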

40
00:03:31,380 --> 00:03:36,880
Now, although the tanh is a little better than the sigmoid, the story is not over. In modern deep

41
00:03:36,880 --> 00:03:37,250
learning,

42
00:03:37,260 --> 00:03:42,720
researchers have figured out that there is actually a problem with both of these activation functions.

43
00:03:42,750 --> 00:03:46,970
I like to tell this story as a bit of a cautionary tale. For a long time,

44
00:03:46,980 --> 00:03:53,340
researchers were very attached to the beauty of neural networks as is. They liked this idea of a uniform

45
00:03:53,340 --> 00:03:56,480
neural network made up of idealized neurons.

46
00:03:56,520 --> 00:04:02,770
They liked this idea that the sigmoid, and hence also the tanh, were smooth, differentiable functions.

47
00:04:02,940 --> 00:04:06,360
It is mathematically convenient and beautiful.

48
00:04:06,360 --> 00:04:11,580
The problem is they don't work that well, but nobody wanted to break away from this classic way of doing

49
00:04:11,580 --> 00:04:12,150
things.

50
00:04:17,280 --> 00:04:22,470
The major problem with the sigmoid and the tanh is called the vanishing gradient problem.

51
00:04:22,560 --> 00:04:25,230
Now this part requires a little bit of calculus knowledge.

52
00:04:25,230 --> 00:04:29,560
So if you're not comfortable with that then feel free to skip ahead.

53
00:04:29,670 --> 00:04:33,460
Remember that our method of training a neural network is gradient descent.

54
00:04:33,900 --> 00:04:39,420
And obviously this involves finding the gradient of the cost with respect to the parameters.

55
00:04:39,450 --> 00:04:44,820
The problem is, when you have a very deep neural network, your gradient has to propagate backwards throughout

56
00:04:44,820 --> 00:04:47,170
the neural network, starting from the end.

57
00:04:47,430 --> 00:04:53,010
What happens is your output is made up of a bunch of composite functions, basically a sigmoid of a sigmoid

58
00:04:53,010 --> 00:05:00,980
of a sigmoid and so on. When you take the gradient of composite functions, you get the chain rule, so composite

59
00:05:00,980 --> 00:05:09,030
functions become multiplication in the derivative.

60
00:05:09,080 --> 00:05:10,700
So what does this end up looking like?

61
00:05:11,330 --> 00:05:16,640
Well the further you go back in the neural network the more terms you have in the chain rule to multiply.

62
00:05:16,640 --> 00:05:19,130
So you're multiplying by the derivative of the sigmoid

63
00:05:19,130 --> 00:05:22,700
over and over again. What happens when you do that?

64
00:05:22,850 --> 00:05:25,310
Why is the derivative of the sigmoid a problem?

65
00:05:30,470 --> 00:05:35,090
Well, consider what the sigmoid actually looks like. Most of the sigmoid is flat.

66
00:05:35,090 --> 00:05:41,690
In other words, the derivative of the sigmoid is very nearly zero at most points. Only in the center is

67
00:05:41,690 --> 00:05:43,340
it non-zero.

68
00:05:43,340 --> 00:05:48,290
It also turns out that the maximum value of the derivative is only 0.25.

69
00:05:48,320 --> 00:05:57,160
This is the highest possible value of the derivative of the sigmoid.
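
To see where the 0.25 comes from, here is a small sketch that evaluates the sigmoid derivative on a grid. It assumes the standard identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); the grid range and the function names are arbitrary choices for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Evaluate the derivative on a grid from -10 to 10 in steps of 0.01.
grid = [i / 100.0 for i in range(-1000, 1001)]
max_grad = max(sigmoid_grad(x) for x in grid)

print(max_grad)           # 0.25, reached at x = 0
print(sigmoid_grad(5.0))  # far from the center, the derivative is already tiny
```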

70
00:05:57,240 --> 00:05:58,650
Okay, so why is that a problem?

71
00:05:59,430 --> 00:06:03,180
Well, what happens when you multiply numbers that are very small?

72
00:06:03,180 --> 00:06:06,630
The answer is you get an even smaller number.

73
00:06:06,630 --> 00:06:08,660
Let's say you multiply 0.25,

74
00:06:08,670 --> 00:06:16,440
the maximum possible value, by itself five times. The result is 0.25 to the power 5, which is approximately

75
00:06:16,620 --> 00:06:19,470
0.001.

76
00:06:19,500 --> 00:06:25,520
What happens if, more realistically, you have a value like 0.1 multiplied by itself five times?

77
00:06:25,800 --> 00:06:32,640
That's 0.1 to the power 5, which is 0.00001.

78
00:06:32,670 --> 00:06:37,770
In other words, this leads to the result that the further you go back in the neural network, the smaller

79
00:06:37,770 --> 00:06:39,510
the gradient becomes.

80
00:06:39,510 --> 00:06:46,500
We call this the vanishing gradient problem.
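
The arithmetic above can be reproduced in a few lines. This is a deliberately simplified sketch: it assumes every layer contributes the same derivative factor and ignores the weights, which also appear in the real chain rule product. The helper name is hypothetical.

```python
def gradient_scale(per_layer_derivative: float, num_layers: int) -> float:
    """Product of identical derivative factors, one per layer (weights ignored)."""
    return per_layer_derivative ** num_layers


print(gradient_scale(0.25, 5))  # 0.0009765625, roughly 0.001
print(gradient_scale(0.1, 5))   # roughly 0.00001

# The factor keeps shrinking as the network gets deeper.
for depth in (5, 10, 20, 50):
    print(depth, gradient_scale(0.25, depth))
```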

81
00:06:46,640 --> 00:06:49,060
So how does this problem manifest?

82
00:06:49,070 --> 00:06:53,350
Well, take a look at this graph, which shows the magnitude of the gradient at each layer of the neural

83
00:06:53,350 --> 00:06:54,950
network as it is trained.

84
00:06:55,700 --> 00:06:59,810
You'll notice that the further you go back the smaller the gradient gets.

85
00:06:59,960 --> 00:07:03,970
Remember that the training algorithm is to take small steps in the direction of the negative gradient.

86
00:07:04,580 --> 00:07:10,190
Well if the gradient is nearly zero that means the update to the weights is also nearly zero.

87
00:07:10,250 --> 00:07:15,320
The end result is that weights close to the input of the neural network are almost not trained at

88
00:07:15,320 --> 00:07:17,220
all.

89
00:07:17,350 --> 00:07:22,090
This was a problem in the olden days which prevented us from building very deep neural networks like

90
00:07:22,090 --> 00:07:25,000
we have today, which can have hundreds of layers.

91
00:07:30,140 --> 00:07:30,950
Back in the day,

92
00:07:30,950 --> 00:07:34,330
there was lots of theory around how to solve this problem.

93
00:07:34,370 --> 00:07:39,200
One method was called greedy layer-wise pretraining, which was invented by Geoffrey Hinton and his

94
00:07:39,200 --> 00:07:40,510
students.

95
00:07:40,700 --> 00:07:45,770
Since all the layers of the neural network could not be trained all at once, the idea was you could train

96
00:07:45,770 --> 00:07:52,330
each layer one at a time. So what you would do is train the first layer using some alternative loss function.

97
00:07:52,700 --> 00:07:57,800
By the way, if you want to know what that loss function is, it's basically an autoencoder, but hold that

98
00:07:57,800 --> 00:08:04,060
thought for now, as it's not an important detail. Then, once you were done training the first layer, you

99
00:08:04,060 --> 00:08:09,730
would add on a second layer and train that layer by itself not touching the first layer anymore.

100
00:08:09,730 --> 00:08:15,100
Then you would add a third layer and train only the third layer leaving the first two layers alone.

101
00:08:15,190 --> 00:08:25,320
Once you got to the last layer all of the previous layers would already be trained to some extent.

102
00:08:25,330 --> 00:08:31,300
Now, despite all this theory, which generated lots of theoretical research into new models such as restricted

103
00:08:31,330 --> 00:08:37,720
Boltzmann machines, deep Boltzmann machines, and so forth, today we no longer require such complicated

104
00:08:37,720 --> 00:08:43,060
models in order to train very deep neural networks, although that's not to say they're not worth

105
00:08:43,060 --> 00:08:44,650
learning about.

106
00:08:44,650 --> 00:08:50,740
In fact, the solution was simple: just don't use activation functions that have vanishing gradients.

107
00:08:51,340 --> 00:08:57,970
So throw away these beautiful, smooth, differentiable functions like the tanh and the sigmoid. Instead,

108
00:08:57,970 --> 00:09:04,030
let's use this ugly-looking, not completely differentiable function called the ReLU. The ReLU is

109
00:09:04,030 --> 00:09:06,530
short for rectified linear unit.

110
00:09:06,670 --> 00:09:11,980
Basically it looks like a hockey stick and it has a corner at zero where the derivative is technically

111
00:09:11,980 --> 00:09:13,540
not defined.

112
00:09:13,660 --> 00:09:18,880
The fact that values greater than zero never have a zero gradient makes training neural networks

113
00:09:18,910 --> 00:09:25,130
a lot more efficient.
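
As a minimal sketch of the hockey stick, here is the ReLU and its derivative in plain Python. The derivative at exactly zero is not defined, so returning 0 there is a convention assumed for this example, not something the lecture specifies.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs through, clips the rest to zero."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of the ReLU. At x == 0 it is undefined; 0.0 is an assumed convention."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), relu_grad(x))

# Multiplying the positive-side derivative across many layers does not shrink it.
print(relu_grad(3.0) ** 50)  # 1.0
```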

114
00:09:25,150 --> 00:09:26,910
Now you might be wondering: wait a minute.

115
00:09:27,130 --> 00:09:31,680
If zero gradients are so bad, then why does the ReLU work at all?

116
00:09:31,720 --> 00:09:37,840
It appears that half the function, any input less than zero, has a derivative that is exactly zero.

117
00:09:37,840 --> 00:09:41,860
Never mind vanishing, the gradient has already vanished.

118
00:09:41,860 --> 00:09:47,200
Indeed, this is somewhat of a problem, as it leads to a phenomenon called dead neurons.

119
00:09:47,200 --> 00:09:52,450
Dead neurons are neurons that always output zero because the weighted sum of their inputs is always less

120
00:09:52,450 --> 00:09:53,410
than or equal to zero.

121
00:09:54,400 --> 00:09:59,950
However, and this is important, in deep learning what we care about is experimental results.

122
00:09:59,950 --> 00:10:06,430
In other words, the most important thing is that it works, not how theoretically satisfying it is, because

123
00:10:06,460 --> 00:10:10,450
as we now know that can lead to lots of wasted time.

124
00:10:10,450 --> 00:10:21,190
The fact that just the right side doesn't vanish seems to be good enough according to what we've observed.

125
00:10:21,260 --> 00:10:26,810
Of course, some researchers have tried to make modifications to the simple ReLU to solve this problem

126
00:10:26,840 --> 00:10:28,550
of dead neurons.

127
00:10:28,550 --> 00:10:34,550
One alternative is the leaky ReLU, which has a slope of less than one, like 0.1, for values

128
00:10:34,550 --> 00:10:36,080
less than zero.

129
00:10:36,080 --> 00:10:42,140
Importantly, this is still a nonlinear function, so we're able to learn nonlinear geometrical patterns.

130
00:10:43,070 --> 00:10:46,580
Using the leaky ReLU, your derivatives will always be positive,

131
00:10:46,730 --> 00:10:48,710
just like the sigmoid and the tanh.
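
A short sketch of the leaky ReLU, using the 0.1 slope mentioned above as the default. The parameter name negative_slope is an assumption for this example, and the derivative at exactly zero is again a convention.

```python
def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """Leaky ReLU: identity for positive inputs, a small slope for negative inputs."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.1) -> float:
    """Derivative of the leaky ReLU; it is positive on both sides of zero."""
    return 1.0 if x > 0 else negative_slope


for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```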

132
00:10:53,840 --> 00:10:59,690
There are other options as well, such as the ELU, or exponential linear unit, which has a more steadily

133
00:10:59,690 --> 00:11:02,350
decreasing value on the left side.

134
00:11:02,420 --> 00:11:07,700
The authors claim that this activation function speeds up learning and leads to higher accuracy than

135
00:11:07,700 --> 00:11:09,320
the ReLU.

136
00:11:09,320 --> 00:11:15,380
One interesting aspect of the ELU is that it allows its outputs to be negative, which goes back to the

137
00:11:15,380 --> 00:11:18,500
idea that we like the mean of the values to be close to zero.
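
For reference, here is a sketch of the ELU using its usual definition: x for positive inputs and alpha * (exp(x) - 1) otherwise. The default alpha = 1.0 is an assumption for this example; the lecture does not give a value.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit: identity for x > 0, smooth saturation toward -alpha below."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


# Negative inputs give negative outputs, which can pull the mean activation toward zero.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, elu(x))
```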

138
00:11:23,680 --> 00:11:29,170
Another option, which is very similar, is the softplus activation, which, because you're taking the log

139
00:11:29,170 --> 00:11:36,310
of the exponential, looks very linear when the input is reasonably large. Now, for both of these previous

140
00:11:36,310 --> 00:11:37,640
activation functions,

141
00:11:37,660 --> 00:11:43,180
there is the vanishing gradient on the left side, but we've established that it's not so much of a problem,

142
00:11:43,510 --> 00:11:50,370
since we know that the ReLU already works and it has gradients that are equal to zero. Also,

143
00:11:50,380 --> 00:11:55,450
although we initially stated that we would like the inputs at each layer to be centered around zero,

144
00:11:55,960 --> 00:12:01,360
we can see that the ReLU and softplus do not accomplish this. For the softplus and the ReLU, the

145
00:12:01,360 --> 00:12:05,200
minimum value is zero while the maximum value is infinity.

146
00:12:05,200 --> 00:12:08,350
This definitely means they won't be centered around zero.
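
Here is a small sketch comparing the softplus, log(1 + exp(x)), with the ReLU. It illustrates both points: the softplus is nearly linear for large inputs, and neither function ever outputs a negative value. The sample inputs are arbitrary, and this naive form would overflow for very large x.

```python
import math


def softplus(x: float) -> float:
    """Softplus: log(1 + exp(x)). Naive form; math.exp overflows for x above roughly 709."""
    return math.log1p(math.exp(x))


def relu(x: float) -> float:
    """Rectified linear unit, for comparison."""
    return max(0.0, x)


samples = (-5.0, -1.0, 0.0, 1.0, 5.0, 20.0)
for x in samples:
    print(x, softplus(x), relu(x))

# Neither function produces negative outputs, so their mean cannot be zero
# unless every output is zero.
print(min(softplus(x) for x in samples) > 0.0)  # True
```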

147
00:12:08,440 --> 00:12:11,170
So is the ReLU not a good choice in the end?

148
00:12:16,310 --> 00:12:22,040
Now, despite all this work to find alternatives to the ReLU activation, these days most people still

149
00:12:22,040 --> 00:12:25,130
use the ReLU as a reasonable default choice.

150
00:12:26,400 --> 00:12:31,630
It works well, and sometimes you'll find that using other alternatives such as the leaky ReLU or the

151
00:12:31,630 --> 00:12:33,950
ELU offers no benefit.

152
00:12:33,960 --> 00:12:35,340
Sometimes they do.

153
00:12:35,340 --> 00:12:37,860
Which is why you always have to experiment for yourself.

154
00:12:38,900 --> 00:12:45,500
My motto which a lot of my students are tired of hearing by this point is that machine learning is experimentation

155
00:12:45,680 --> 00:12:47,400
and not philosophy.

156
00:12:47,600 --> 00:12:52,080
Never use your mind to try and predict the outcome of a computer program.

157
00:12:52,190 --> 00:12:58,580
If you have a computer, that is always the suboptimal course of action. Why not simply run the computer

158
00:12:58,580 --> 00:13:01,050
program with a computer?

159
00:13:01,430 --> 00:13:08,060
Your mind is not suitable for running computer programs, but computers are. Therefore, follow the rule:

160
00:13:08,060 --> 00:13:10,880
don't use philosophy, use experimentation.

161
00:13:16,050 --> 00:13:21,480
Interestingly, some researchers have talked about the biological plausibility of the ReLU activation

162
00:13:21,480 --> 00:13:22,710
function.

163
00:13:22,770 --> 00:13:29,670
In fact, it may even be more biologically plausible than the sigmoid. To understand this, you have to understand

164
00:13:29,730 --> 00:13:33,730
a little more about how action potentials encode information.

165
00:13:33,810 --> 00:13:36,130
What is the difference between an action potential

166
00:13:36,240 --> 00:13:39,720
when you hear a quiet sound versus an action potential

167
00:13:39,720 --> 00:13:41,320
when you hear a loud sound?

168
00:13:41,640 --> 00:13:43,530
The answer is there is no difference.

169
00:13:43,530 --> 00:13:46,710
An action potential is just an action potential.

170
00:13:46,710 --> 00:13:50,690
The key really is in the frequency of action potentials.

171
00:13:50,760 --> 00:13:56,340
When you hear a very quiet sound your neurons are only activated a little bit so you'll get some action

172
00:13:56,340 --> 00:13:58,200
potentials but they won't be very frequent.

173
00:13:59,010 --> 00:14:03,900
If you hear a very loud sound like say you're at a concert or a party then your neurons are getting

174
00:14:03,900 --> 00:14:05,640
lots of stimulation.

175
00:14:05,790 --> 00:14:08,340
The action potentials are going to be very frequent.

176
00:14:08,370 --> 00:14:11,360
In other words closer together in time.

177
00:14:11,400 --> 00:14:18,200
So what we're saying is the more intense a stimulus is the higher the frequency of the action potentials.

178
00:14:18,240 --> 00:14:19,800
We call this frequency coding.

179
00:14:24,880 --> 00:14:27,540
What does this mean in terms of the ReLU?

180
00:14:27,550 --> 00:14:33,520
Well, you can think of it as the ReLU just encoding the action potential frequency itself.

181
00:14:33,580 --> 00:14:38,680
Of course the minimum frequency is zero and it's not possible to have a negative frequency.

182
00:14:38,740 --> 00:14:42,490
That's why the minimum value of the ReLU is zero.

183
00:14:42,700 --> 00:14:49,360
Then as the inputs into the neuron get larger because they are being stimulated more the receiving neuron

184
00:14:49,420 --> 00:14:52,570
also gets stimulated more, increasing its frequency.

185
00:14:57,690 --> 00:14:57,980
Now,

186
00:14:57,990 --> 00:15:03,270
one thing to note is that we can go even deeper than this, although this is not part of mainstream deep

187
00:15:03,270 --> 00:15:04,980
learning quite yet.

188
00:15:05,010 --> 00:15:08,920
As a side note if you don't want to listen to this part it's not required.

189
00:15:08,970 --> 00:15:13,690
It's something that I think is interesting to those of us who are interested in modelling brains.

190
00:15:13,800 --> 00:15:20,410
But if you are a hardcore statistician, then perhaps you may think differently. What we know based on

191
00:15:20,410 --> 00:15:26,920
neuroscience is that, unlike the ReLU activation, action potential frequency is not linearly related

192
00:15:27,040 --> 00:15:28,950
to its input stimuli.

193
00:15:28,960 --> 00:15:31,910
Instead, we have a nonlinear relationship.

194
00:15:32,320 --> 00:15:38,460
It can be modeled using the log function or a root function like the square root. A good way to understand

195
00:15:38,460 --> 00:15:43,920
this is with sound. Something that is twice as loud does not sound twice as loud

196
00:15:43,920 --> 00:15:49,470
when loudness is measured by sound intensity. This is the intuition behind the decibel scale.

197
00:15:49,920 --> 00:15:53,700
As you recall the decibel scale is a logarithmic scale.

198
00:15:53,700 --> 00:15:59,850
For example, a 90 decibel sound would be like sitting inside a moving bus. A 100 decibel sound would be

199
00:15:59,850 --> 00:16:08,070
like standing by an electric saw. A 110 decibel sound would be like a loud orchestra. A 120 decibel

200
00:16:08,070 --> 00:16:10,310
sound would be like standing near a jet engine.

201
00:16:11,320 --> 00:16:17,280
A 130 decibel sound would be like being close to artillery fire, and this value is the threshold of

202
00:16:17,280 --> 00:16:18,220
pain.

203
00:16:18,330 --> 00:16:22,230
So when you hear a sound this loud it actually hurts physically.

204
00:16:22,230 --> 00:16:29,880
To put that in perspective, 100 decibels is 10 times more powerful than 90 decibels, 110 decibels is

205
00:16:29,880 --> 00:16:37,920
100 times more powerful, 120 decibels is 1,000 times more powerful, and 130 decibels is 10,000 times

206
00:16:37,920 --> 00:16:38,670
more powerful.
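
These ratios follow from the definition of the decibel, where every 10 decibel step corresponds to a factor of 10 in power. A small sketch, with a hypothetical helper name:

```python
def power_ratio(db_a: float, db_b: float) -> float:
    """How many times more powerful a sound at db_a is than one at db_b."""
    return 10.0 ** ((db_a - db_b) / 10.0)


# Compare each level from the lecture against the 90 decibel reference.
for level in (100, 110, 120, 130):
    print(level, power_ratio(level, 90))
```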

207
00:16:43,830 --> 00:16:49,590
Some work has been done to experiment with activation functions that more accurately model the frequency

208
00:16:49,590 --> 00:16:53,590
characteristics of real neurons, such as the BRU.

209
00:16:53,820 --> 00:16:58,120
But as a whole, they haven't yet caught on in the deep learning community.

210
00:16:58,110 --> 00:17:04,010
I've attached the paper on the BRU to extra_reading.txt in case you want to check it out.

211
00:17:04,170 --> 00:17:09,360
The author reports that the BRU activation function led to better results than the ReLU and the

212
00:17:09,360 --> 00:17:10,240
ELU.

213
00:17:10,380 --> 00:17:13,440
So it may be something worth trying in your own project.