1
00:00:01,160 --> 00:00:02,130
‫OK.

2
00:00:02,250 --> 00:00:06,500
In the last lecture we learned what gradient descent is.

3
00:00:06,990 --> 00:00:14,340
‫In this lecture we are going to see how to use this mathematical technique to find the optimum W's and

4
00:00:14,340 --> 00:00:17,000
‫B's for this.

5
00:00:17,040 --> 00:00:22,620
‫We first need to understand the error function which we discussed in the last lecture.

6
00:00:22,620 --> 00:00:28,200
‫Here are the five steps that we use to implement gradient descent.

7
00:00:28,200 --> 00:00:32,910
The first step is to give random values to all W's and B's in the system.

8
00:00:32,970 --> 00:00:39,960
‫Then we take one training example and put its x values as input to our system.

9
00:00:40,020 --> 00:00:46,610
‫We process through the entire network to get one predicted value.

10
00:00:46,830 --> 00:00:54,060
‫Now on this third step I told you that we measure the distance between predicted and the actual value

11
00:00:54,540 --> 00:00:57,980
‫using an error function.

12
00:00:57,990 --> 00:01:00,740
‫Let's see what this means.

13
00:01:00,750 --> 00:01:04,660
‫Suppose we predicted an output of zero point three.

14
00:01:05,000 --> 00:01:09,030
‫Whereas the actual value is zero.

15
00:01:09,030 --> 00:01:14,400
‫One way of calculating error of prediction could be just to subtract these two.

16
00:01:14,580 --> 00:01:21,510
‫That is finding out actual minus predicted which will be zero minus zero point three giving us minus

17
00:01:21,660 --> 00:01:30,250
zero point three. To remove this negative sign and focus only on the magnitude of this error,

18
00:01:30,270 --> 00:01:37,140
we can simply put an absolute function or a squared function on top of it.

19
00:01:38,010 --> 00:01:42,740
Basically meaning minus zero point three would become zero point three,

20
00:01:42,870 --> 00:01:46,650
or it would be squared and it will become zero point zero nine.
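
The two simple error measures described here can be sketched in a few lines of Python. The variable names are only illustrative; the values 0 and 0.3 are the ones used in the lecture.

```python
# Illustrative names; actual = 0 and predicted = 0.3 come from the lecture's example.
actual = 0.0
predicted = 0.3

raw_error = actual - predicted      # signed difference, negative here
absolute_error = abs(raw_error)     # keeps only the magnitude
squared_error = raw_error ** 2      # also positive, and punishes large misses more

print(raw_error, absolute_error, round(squared_error, 2))
```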

21
00:01:50,230 --> 00:01:52,030
These two are good measures of error,

22
00:01:52,990 --> 00:01:57,680
but they do not work well when we are doing classification with neural networks.

23
00:01:59,600 --> 00:02:02,870
For this purpose we use a different function.

24
00:02:06,430 --> 00:02:10,720
This function is called the cross-entropy loss function.

25
00:02:10,970 --> 00:02:14,000
‫It is represented by this formula.

26
00:02:14,160 --> 00:02:27,010
E is equal to minus y into log of y dash, minus (1 minus y) into log of (1 minus y dash). Here,

27
00:02:27,180 --> 00:02:36,010
y represents the actual value and y dash represents the predicted output value.
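
Written out as a formula, with the spoken "y dash" rendered as \hat{y} (a notational assumption), the error function reads:

```latex
E = -\,y \,\log(\hat{y}) \;-\; (1 - y)\,\log(1 - \hat{y})
```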

28
00:02:36,060 --> 00:02:42,090
I know this looks complex, much more complex than the two error functions that we saw in the last slide.

29
00:02:43,080 --> 00:02:52,200
But the reason for using this is that this function does not have local minima. That is, the graph of

30
00:02:52,200 --> 00:02:53,100
this function

31
00:02:53,130 --> 00:02:57,570
looks like this one on the left and not like this one on the right.

32
00:02:59,370 --> 00:03:06,910
If a function has local minima, our gradient descent won't work properly and it might stop here

33
00:03:07,200 --> 00:03:09,920
instead of finding the global minimum, which is here.

34
00:03:12,740 --> 00:03:14,730
If you don't understand the last comment,

35
00:03:14,840 --> 00:03:15,770
don't worry about it.

36
00:03:16,670 --> 00:03:20,570
The simple takeaway is that for classification problems,

37
00:03:20,570 --> 00:03:25,750
the error function to be used is this cross-entropy error function.

38
00:03:26,330 --> 00:03:30,130
We can take a look at this error function to build some intuition around it.

39
00:03:32,450 --> 00:03:38,370
‫As you know in classification problems the output value is either 0 or 1.

40
00:03:39,170 --> 00:03:47,470
So if the output value is 1, the second part of this function, that is 1 minus y,

41
00:03:47,570 --> 00:03:52,200
this entire part will become 0 because 1 minus 1 would be 0.

42
00:03:52,370 --> 00:03:59,990
If the actual output is 0, then the first term of this equation will become zero and only the second

43
00:03:59,990 --> 00:04:01,990
term will remain.

44
00:04:02,360 --> 00:04:09,740
So let's see. If the actual output is 1, for this error function to be minimum the function should be as

45
00:04:09,740 --> 00:04:11,080
close to zero as possible.

46
00:04:12,290 --> 00:04:14,880
Let's see: if y is equal to 1,

47
00:04:14,890 --> 00:04:25,330
the error is minus of 1 into log y dash, minus (1 minus 1) into log of (1 minus y dash). The second term becomes zero.

48
00:04:25,430 --> 00:04:35,220
So we are left with only minus of log y dash. So for this error to be small, minus log y dash should be

49
00:04:35,220 --> 00:04:35,640
small.

50
00:04:37,230 --> 00:04:47,040
This implies that log y dash should be large. This in turn implies that y dash should be large. Since our

51
00:04:47,040 --> 00:04:50,840
predicted output is between 0 and 1,

52
00:04:50,880 --> 00:04:56,910
y dash being large simply means that y dash should be as close to 1 as possible.
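
The reduction for the case y = 1 can be written as one line of algebra, again using \hat{y} for "y dash" as a notational assumption:

```latex
y = 1:\qquad
E = -1\cdot\log(\hat{y}) - (1 - 1)\,\log(1 - \hat{y}) = -\log(\hat{y}),
\qquad -\log(\hat{y}) \to 0 \ \text{as}\ \hat{y} \to 1
```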

53
00:04:58,780 --> 00:05:02,240
Similarly, if the actual value of the output is 0,

54
00:05:02,650 --> 00:05:12,250
the first term of this equation will be 0, so the error function remaining would be minus log of 1 minus

55
00:05:12,250 --> 00:05:22,760
y dash. For this error to be small, log of 1 minus y dash has to be large, implying that 1 minus y

56
00:05:22,760 --> 00:05:29,330
dash has to be large, implying that y dash should be as small as possible. Although I have not given you the

57
00:05:29,330 --> 00:05:35,090
mathematical justification for using this function, I guess with these particular examples you are

58
00:05:35,090 --> 00:05:44,210
getting a feel for how minimizing the error or loss function is trying to match the predicted output

59
00:05:44,210 --> 00:05:45,770
value to the actual value.
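
To see this intuition numerically, here is a minimal sketch. The function name `cross_entropy` and the choice of natural logarithm are assumptions for illustration; the lecture does not say which log base it uses.

```python
import math

def cross_entropy(y, y_hat):
    """Cross-entropy error for one example (natural log assumed)."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# When y = 1 the error shrinks as y_hat approaches 1;
# when y = 0 the error shrinks as y_hat approaches 0.
for y_hat in (0.1, 0.5, 0.9, 0.99):
    print(f"y_hat={y_hat}: E(y=1)={cross_entropy(1, y_hat):.4f}  "
          f"E(y=0)={cross_entropy(0, y_hat):.4f}")
```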

60
00:05:50,200 --> 00:05:57,530
So now you may have guessed the job of gradient descent is to find the minimum of this error function.

61
00:05:58,430 --> 00:06:06,020
That is, we will make small changes to the values of weights and biases in the direction where we get

62
00:06:06,110 --> 00:06:07,760
the maximum decrease in

63
00:06:07,760 --> 00:06:16,490
error. We will continue changing W's and B's till no further decrease in error is possible. This is

64
00:06:16,490 --> 00:06:18,470
how the process looks graphically.

65
00:06:21,040 --> 00:06:22,570
For ease of understanding,

66
00:06:22,660 --> 00:06:31,030
I have represented all the weights on one axis and all biases on another axis, and on the vertical axis

67
00:06:31,090 --> 00:06:38,220
we have the corresponding value of error. These values of error are calculated using the error function.

68
00:06:40,380 --> 00:06:40,830
Okay.

69
00:06:41,010 --> 00:06:45,570
‫So now let's revisit our steps to implement gradient descent.

70
00:06:46,620 --> 00:06:52,360
‫Again the first step is setting random initial values of W and B.

71
00:06:53,100 --> 00:06:58,230
Then we go forward to get the predicted output value.

72
00:06:58,230 --> 00:07:04,690
Then we put this predicted output value in our loss function to get the error of prediction.

73
00:07:04,710 --> 00:07:13,930
Now we have the error, W's and B's. Say we have a W value between 1 and 2, a bias value between 0 and

74
00:07:13,930 --> 00:07:18,480
minus 1, and an error value near 1.

75
00:07:18,490 --> 00:07:21,460
So we are roughly here on this graph.

76
00:07:24,130 --> 00:07:31,940
Now in the fourth step we do backward propagation to find the direction of movement on this graph,

77
00:07:33,320 --> 00:07:37,650
which means we find delta W and delta B,

78
00:07:37,730 --> 00:07:47,720
that is, the change in W's and B's that will take us to the minimum point. If you look at this graph,

79
00:07:47,720 --> 00:07:56,090
you can probably see that by decreasing the weight and increasing the bias values we will be moving

80
00:07:56,180 --> 00:08:00,660
closer to the lowest point.

81
00:08:00,680 --> 00:08:04,150
So basically we have initial W's and B's.

82
00:08:04,150 --> 00:08:15,950
We will be updating our W to W minus alpha times delta W, and we will be updating our B to B minus alpha

83
00:08:15,950 --> 00:08:22,230
times delta B. Here alpha is called the learning rate.
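
In symbols, the update rule just described is:

```latex
W \leftarrow W - \alpha\,\Delta W,
\qquad
B \leftarrow B - \alpha\,\Delta B
```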

84
00:08:22,700 --> 00:08:29,430
Basically delta W and delta B are unit steps that we calculate using calculus.

85
00:08:29,690 --> 00:08:35,740
Alpha is controlling the number of those steps we take in that direction.

86
00:08:35,960 --> 00:08:43,970
You can imagine the impact of large vs. small values of alpha. If alpha is large we are taking multiple

87
00:08:43,970 --> 00:08:47,180
steps in the direction of gradient descent.

88
00:08:47,180 --> 00:08:54,980
This means that we can reach the bottom faster, but the problem with a large alpha is that we can overshoot

89
00:08:54,980 --> 00:08:57,220
the minimum.

90
00:08:57,260 --> 00:09:05,570
Imagine you're very near to the bottom, but on the next iteration you take 50 steps instead of just one. In

91
00:09:05,570 --> 00:09:08,540
such a situation you will climb the curve on the other side.
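
The overshoot effect can be illustrated with a toy sketch. The bowl-shaped function f(w) = w squared, the helper name `descend`, and the two alpha values are assumptions chosen for illustration; they are not the lecture's error surface.

```python
def descend(alpha, w=1.0, steps=10):
    """Run plain gradient descent on f(w) = w**2 and return every point visited."""
    path = [w]
    for _ in range(steps):
        grad = 2 * w           # slope of w**2 at the current point
        w = w - alpha * grad   # same form of update rule as in the lecture
        path.append(w)
    return path

small = descend(alpha=0.1)   # moves steadily toward the minimum at w = 0
large = descend(alpha=1.1)   # every step jumps across the minimum and lands further away

print(abs(small[-1]), abs(large[-1]))
```

With the small learning rate the distance from the minimum shrinks at every step; with the large one the sign of w flips each time and the distance grows.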

92
00:09:10,420 --> 00:09:19,780
So a large learning rate can help in faster descent but can cause issues in the final stages of convergence.

93
00:09:19,930 --> 00:09:24,110
Therefore a moderate value of learning rate is to be used.

94
00:09:24,220 --> 00:09:29,470
You will see what value of learning rate is to be used in the practical section of this course.

95
00:09:31,150 --> 00:09:31,870
Very well.

96
00:09:31,960 --> 00:09:37,120
So the step to be taken in the direction of the descent is alpha times

97
00:09:37,190 --> 00:09:42,810
delta W and alpha times delta B. Now,

98
00:09:42,830 --> 00:09:47,030
how do we find delta W and delta B? Here,

99
00:09:47,150 --> 00:09:54,060
delta W is the change in weight and delta B is the change in bias.

100
00:09:54,290 --> 00:10:00,050
Basically we will change the initially set W's and B's in an effort to reduce error.

101
00:10:04,100 --> 00:10:08,480
‫Now let us see how to find Delta W and Delta B.

102
00:10:08,870 --> 00:10:15,800
These values are found by doing backward propagation, which means we will look back in the network to

103
00:10:15,800 --> 00:10:21,770
find out the instantaneous slope of the error with respect to each W and B.

104
00:10:24,510 --> 00:10:29,680
Let me take an example with a single neuron to show you how this happens.

105
00:10:29,970 --> 00:10:37,500
Otherwise the mathematics and calculus involved can get quite messy and is often overwhelming for some

106
00:10:37,500 --> 00:10:45,750
students. If you are comfortable with calculus you can look at the complete back propagation theory in

107
00:10:45,750 --> 00:10:48,170
the link shared in the description of this lecture.

108
00:10:49,940 --> 00:10:57,350
However, I think with this simple example you will get a solid intuition of how back propagation works.

109
00:10:59,450 --> 00:11:08,060
Here's a single neuron with two inputs, X1 and X2. It first calculates the linear part.

110
00:11:08,150 --> 00:11:17,270
That is, it will calculate the value of Z, which is equal to W1 X1 plus W2 X2 plus B1.

111
00:11:17,670 --> 00:11:20,920
It then applies a sigmoid on this value of Z.

112
00:11:24,920 --> 00:11:30,380
This sigmoid of Z is the predicted output of this neuron.

113
00:11:30,530 --> 00:11:37,670
We use this predicted output with the actual output to get the error of this particular training example.
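
The single neuron described here computes the following, where \sigma is the standard sigmoid and \hat{y} stands for "y dash" (notation assumed for illustration):

```latex
z = w_1 x_1 + w_2 x_2 + b_1,
\qquad
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
```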

114
00:11:40,860 --> 00:11:42,680
So let's start with step 1.

115
00:11:42,750 --> 00:11:48,140
Step 1 is we have to randomly initialize the values of weight and bias.

116
00:11:49,450 --> 00:11:52,320
We have two weights and one bias.

117
00:11:52,450 --> 00:11:58,320
We randomly initialize W1 with 2, W2 is equal to 3,

118
00:11:58,530 --> 00:11:59,380
and the bias

119
00:11:59,380 --> 00:12:00,950
value is equal to minus 4.

120
00:12:04,840 --> 00:12:08,290
‫Now the second step is forward propagation.

121
00:12:08,290 --> 00:12:16,150
That is, we will take one training example and put in the input values of that training example to get a

122
00:12:16,150 --> 00:12:19,090
predicted output.

123
00:12:19,090 --> 00:12:28,490
We have taken this training example in which the X1 value is 10, the X2 value is minus 4 and the output is 1.

124
00:12:28,540 --> 00:12:33,000
This y is the actual output and it is equal to 1.

125
00:12:33,130 --> 00:12:35,980
So we have the W1 value.

126
00:12:35,980 --> 00:12:39,870
We have X1, we have W2, X2 and B1.

127
00:12:39,940 --> 00:12:41,960
So we can calculate Z.

128
00:12:42,070 --> 00:12:48,100
We put in all these values to get a Z value of 4.

129
00:12:48,310 --> 00:12:52,530
We apply the activation function, that is the sigmoid function, on this value of Z

130
00:12:53,640 --> 00:13:02,790
to get the predicted output of this neuron. Sigmoid of Z, that is sigmoid of 4, gives a predicted output

131
00:13:02,790 --> 00:13:05,450
of 0.982.

132
00:13:06,240 --> 00:13:13,620
This predicted output value is the y dash value that we will use in the error function.

133
00:13:13,700 --> 00:13:18,240
You can see that this value is already very close to the actual output, which is 1.

134
00:13:19,010 --> 00:13:23,640
But let's see how we can improve this value.
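
The forward propagation step with the lecture's numbers can be reproduced with a short sketch. The helper name `sigmoid` and the variable names are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, b = 2.0, 3.0, -4.0      # initial weights and bias from the lecture
x1, x2, y = 10.0, -4.0, 1.0     # the single training example

z = w1 * x1 + w2 * x2 + b       # linear part: 20 - 12 - 4
y_hat = sigmoid(z)              # predicted output ("y dash")

print(z, round(y_hat, 3))
```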

135
00:13:23,660 --> 00:13:26,300
‫Now the third step is error calculation.

136
00:13:26,300 --> 00:13:28,480
‫We have the error function with us.

137
00:13:28,640 --> 00:13:30,730
We have the predicted output value,

138
00:13:30,770 --> 00:13:34,450
that is y dash, which is 0.982.

139
00:13:34,700 --> 00:13:41,690
And we have the actual output value for the training example as 1. We put these two values in this

140
00:13:41,750 --> 00:13:47,210
error function to get a final error value of 0.0079.
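
A quick check of this error value, as a sketch. The lecture does not state the logarithm base; the quoted 0.0079 appears to correspond to a base-10 logarithm, which is an inference rather than something the lecture says. The `log` parameter below is a hypothetical convenience for comparing both bases.

```python
import math

y, y_hat = 1.0, 0.982   # actual and predicted values from the lecture

def cross_entropy(y, y_hat, log=math.log):
    """Cross-entropy error; `log` selects the logarithm base (assumption)."""
    return -y * log(y_hat) - (1 - y) * log(1 - y_hat)

e_natural = cross_entropy(y, y_hat)               # natural logarithm
e_base10 = cross_entropy(y, y_hat, math.log10)    # base-10 logarithm

print(round(e_natural, 4), round(e_base10, 4))
```

Either base gives a small positive error that shrinks toward zero as the prediction approaches 1, so the choice does not change the intuition.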

141
00:13:51,490 --> 00:13:59,980
Now comes the fourth step, which is back propagation. The next few minutes are going to be a little heavy

142
00:13:59,980 --> 00:14:00,940
on mathematics.

143
00:14:00,970 --> 00:14:03,700
We will cover some basics of calculus here.

144
00:14:05,230 --> 00:14:09,030
If you're not comfortable with this part it is still okay.

145
00:14:09,070 --> 00:14:13,990
This is happening in the background and your software is handling this, but if you have some understanding

146
00:14:13,990 --> 00:14:22,360
of calculus, looking at this example will tell you how a neuron is doing back propagation. So do not worry

147
00:14:22,360 --> 00:14:28,030
if you do not understand this, because this is happening in the background and your software is handling

148
00:14:28,030 --> 00:14:28,850
this.

149
00:14:29,380 --> 00:14:36,900
It is good to have this intuition if you know a little bit of mathematics. So let's see how to do backward

150
00:14:36,900 --> 00:14:38,880
propagation.

151
00:14:39,000 --> 00:14:40,380
We are at the end.

152
00:14:40,380 --> 00:14:42,000
We have calculated E.

153
00:14:43,230 --> 00:14:47,400
The first step is finding out the slope of error

154
00:14:47,540 --> 00:14:54,680
with respect to the predicted output, that is y dash. This symbol here,

155
00:14:54,750 --> 00:15:02,030
del E by del y dash, simply means that we are finding the instantaneous slope of error with respect

156
00:15:02,030 --> 00:15:04,940
to y dash, keeping everything else constant.

157
00:15:06,140 --> 00:15:12,550
So if you know calculus you can find the derivative of this function with respect to y dash

158
00:15:12,630 --> 00:15:15,160
when y is equal to 1.

159
00:15:15,170 --> 00:15:20,800
This gives us an output of minus 1 by y dash.
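
For reference, the derivative being described can be written out as follows, assuming a natural logarithm in the error function and using \hat{y} for "y dash":

```latex
\frac{\partial E}{\partial \hat{y}}
  = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
\quad\Longrightarrow\quad
\left.\frac{\partial E}{\partial \hat{y}}\right|_{y = 1} = -\frac{1}{\hat{y}}
```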

160
00:15:20,900 --> 00:15:30,260
We go further back in our network and we find out the slope of our output function with respect to Z.

161
00:15:32,170 --> 00:15:35,230
The output function is a sigmoid function.

162
00:15:35,230 --> 00:15:43,210
The slope of the sigmoid function with respect to Z is this value: e to the power minus Z upon (1 plus

163
00:15:43,780 --> 00:15:52,660
e to the power minus Z) the whole squared. If you know differentiation you can differentiate this function

164
00:15:52,690 --> 00:15:58,750
with respect to Z and you will get this value of slope.
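
If you would rather check this slope numerically than differentiate by hand, here is a minimal sketch that compares the closed form quoted above with a finite-difference estimate. The helper names and the step size `h` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    """Closed form quoted in the lecture: e^-z / (1 + e^-z)^2."""
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

z = 4.0
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference estimate

print(sigmoid_slope(z), numeric)
```

The same slope can also be written as sigmoid(z) times (1 minus sigmoid(z)), which is the form most often seen in code.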

165
00:15:58,770 --> 00:16:09,570
Lastly we find the differential of Z with respect to W1, W2 and B. So Z was equal to W1 times X1 plus W2

166
00:16:09,570 --> 00:16:11,790
times X2 plus B1.

167
00:16:11,790 --> 00:16:19,560
So when we find the differential with respect to W1 we get X1, which is equal to 10 at this current

168
00:16:19,560 --> 00:16:33,090
point. For W2 we get X2, which is equal to minus 4, and for B we get a slope of 1.
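
These three slopes, for the training example used in the lecture, are:

```latex
\frac{\partial z}{\partial w_1} = x_1 = 10,
\qquad
\frac{\partial z}{\partial w_2} = x_2 = -4,
\qquad
\frac{\partial z}{\partial b_1} = 1
```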

169
00:16:33,160 --> 00:16:37,510
‫Next comes the process of combining all of this.

170
00:16:37,600 --> 00:16:45,190
We moved back in our network to find all these slopes, but the slope we are actually interested in is

171
00:16:46,120 --> 00:16:50,230
how does the error function change with respect to W1,

172
00:16:50,290 --> 00:16:57,100
how does it change with respect to W2, and how does it change with respect to B. To find the differential

173
00:16:57,100 --> 00:17:04,690
of E with respect to W1 we apply the chain rule, which means that if you want to find the differential of E with respect

174
00:17:04,690 --> 00:17:13,390
to W1, you can instead find the differential of E with respect to y dash, multiplied with the differential of y dash

175
00:17:13,740 --> 00:17:22,720
with respect to Z, multiplied with the differential of Z with respect to W1. We have calculated

176
00:17:22,870 --> 00:17:29,920
all these three values in our last slide. You can see them on the top here. We know the value of y dash, we know

177
00:17:29,920 --> 00:17:37,090
the value of Z for this particular training example. We can put in all these values and calculate this differential,

178
00:17:37,600 --> 00:17:40,930
and it comes out to be minus 0.186.

179
00:17:45,290 --> 00:17:56,780
We can do a similar exercise for W2 and for B also. The differential of E with respect to W2 comes

180
00:17:56,780 --> 00:18:03,860
out to be 0.0746 and the differential of E with respect to B comes out to be minus

181
00:18:04,040 --> 00:18:06,450
0.0186.
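
The chain rule product can be sketched as below, assuming a natural logarithm in the error function; all variable names are illustrative. Evaluated at full precision, this sketch gives roughly minus 0.180, 0.072 and minus 0.018, which is close to but not identical with the figures quoted in the lecture; the gap is presumably due to rounding of intermediate values on the slides.

```python
import math

w1, w2, b = 2.0, 3.0, -4.0      # initial values from the lecture
x1, x2, y = 10.0, -4.0, 1.0     # training example from the lecture

z = w1 * x1 + w2 * x2 + b
y_hat = 1.0 / (1.0 + math.exp(-z))

dE_dyhat = -1.0 / y_hat                                # slope of E w.r.t. y dash, valid for y = 1
dyhat_dz = math.exp(-z) / (1.0 + math.exp(-z)) ** 2    # slope of the sigmoid w.r.t. z
dz_dw1, dz_dw2, dz_db = x1, x2, 1.0                    # slopes of z w.r.t. w1, w2, b

# Chain rule: multiply the three slopes along the path back to each parameter.
delta_w1 = dE_dyhat * dyhat_dz * dz_dw1
delta_w2 = dE_dyhat * dyhat_dz * dz_dw2
delta_b = dE_dyhat * dyhat_dz * dz_db

print(delta_w1, delta_w2, delta_b)
```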

182
00:18:06,480 --> 00:18:15,420
Now these three differentials are the unit steps that we are going to take in the direction of our descent.

183
00:18:15,440 --> 00:18:23,840
These are the delta W1, delta W2 and delta B. We are going to use these delta values to

184
00:18:23,990 --> 00:18:26,210
update our weights and biases,

185
00:18:27,860 --> 00:18:35,390
so that we move in the direction where the final loss would be less than the loss that we had earlier.

186
00:18:35,390 --> 00:18:44,810
This brings us to the last step. The last step is we have to update W and B. The new W1 would be the previous

187
00:18:44,810 --> 00:18:54,800
W1 minus alpha times delta W1. The previous W1 was 2. Alpha we have taken as 5; we have taken

188
00:18:54,800 --> 00:19:04,490
a learning rate of 5 here, and we calculated delta W1 as minus 0.186. This updates our W1

189
00:19:04,490 --> 00:19:14,230
value to 2.93. Similarly we calculate the W2 value and it comes out to be 2.6, and

190
00:19:14,230 --> 00:19:22,150
we update the B value and it is now minus 3.9. You can compare the previous and new W1, W2 and

191
00:19:22,150 --> 00:19:31,300
B values. Earlier W1 was 2, now it is 2.93. Earlier W2 was 3, now it is

192
00:19:31,300 --> 00:19:34,640
2.6. Earlier B was minus 4,

193
00:19:34,690 --> 00:19:36,430
now it is minus 3.9.
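
The update step can be replayed with the delta values quoted in the lecture. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
alpha = 5.0                                              # learning rate used in the lecture
w1, w2, b = 2.0, 3.0, -4.0                               # previous weights and bias
delta_w1, delta_w2, delta_b = -0.186, 0.0746, -0.0186    # slopes quoted in the lecture

# Move each parameter against its slope, scaled by the learning rate.
w1 = w1 - alpha * delta_w1
w2 = w2 - alpha * delta_w2
b = b - alpha * delta_b

print(round(w1, 2), round(w2, 2), round(b, 2))
```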

194
00:19:39,430 --> 00:19:48,430
Now since we have updated our W and B values we have to go back to our step two. We have to reiterate.

195
00:19:49,300 --> 00:19:56,560
We have to do forward propagation again and we will calculate the predicted output again. So this is

196
00:19:56,560 --> 00:20:06,520
our training example: X1 is 10, X2 is minus 4, y is 1. We put in these values with our updated weights and bias.

197
00:20:07,590 --> 00:20:15,210
This time the Z value comes out to be 14.7. When we apply our activation function on

198
00:20:15,210 --> 00:20:23,730
this Z value we get the predicted output value, that is y dash, as 0.999. If you

199
00:20:23,730 --> 00:20:31,440
remember, last time we got a predicted value of 0.982, so clearly this is an improvement

200
00:20:31,800 --> 00:20:41,720
over the last values of W's and B's. This process is repeated several times till we get minimum error.
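
Putting the steps together, the repeated cycle of forward propagation, error calculation, backward propagation and update might look like the sketch below. It is a toy single-neuron loop on one training example, not a description of what any particular software tool does. The function name `train`, the natural logarithm, and the shortcut `y_hat - y` (the algebraic simplification of the first two chain rule factors for sigmoid plus cross-entropy) are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(x1, x2, y, w1, w2, b, alpha, iterations):
    """Repeat forward pass, error, backward pass and update on one example."""
    history = []
    for _ in range(iterations):
        y_hat = sigmoid(w1 * x1 + w2 * x2 + b)            # forward propagation
        # Clip only for the logarithm so a prediction of exactly 0 or 1 cannot crash it.
        p = min(max(y_hat, 1e-12), 1.0 - 1e-12)
        error = -y * math.log(p) - (1 - y) * math.log(1 - p)
        history.append((y_hat, error))
        dE_dz = y_hat - y                                  # dE/dy_hat times dy_hat/dz, simplified
        w1 -= alpha * dE_dz * x1                           # update step for each parameter
        w2 -= alpha * dE_dz * x2
        b -= alpha * dE_dz
    return w1, w2, b, history

w1, w2, b, history = train(10.0, -4.0, 1.0, 2.0, 3.0, -4.0, alpha=5.0, iterations=3)
for step, (y_hat, error) in enumerate(history, start=1):
    print(step, y_hat, error)
```

Each pass should show the predicted output moving closer to 1 and the error shrinking, which is the behaviour described in the lecture.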

201
00:20:44,370 --> 00:20:52,320
If we have a lot of neurons in our network, the same process is followed. In forward propagation we go

202
00:20:52,330 --> 00:21:00,510
to the end to find a predicted output value. We use that predicted output value to find the loss. Then

203
00:21:00,510 --> 00:21:09,360
we come back stepwise, do the differentials, and find the individual differential values of the error function,

204
00:21:11,580 --> 00:21:23,200
and then we update our weights and biases so that the final error is reduced. Again, I will repeat that I understand

205
00:21:23,320 --> 00:21:29,530
that this lecture was a little mathematics heavy, but if you have some background in calculus I am sure

206
00:21:29,560 --> 00:21:35,590
you would have understood. But if you do not have any background in calculus, I understand that you would

207
00:21:35,590 --> 00:21:43,200
be facing some difficulty in following all the things that I said. Try to listen to this lecture again

208
00:21:43,290 --> 00:21:51,570
if you are facing difficulty. If you are still unable to follow the concept here, do not worry. You can

209
00:21:51,570 --> 00:21:58,170
still implement a neural network in a software tool. All this mathematical calculation will be done by

210
00:21:58,170 --> 00:22:02,790
the software tool and you do not have to do anything on your own.

211
00:22:02,790 --> 00:22:04,410
That is the beauty of neural networks.

212
00:22:04,470 --> 00:22:12,570
If you have to do it by hand it will take a lot of pain, but with computers you can have millions of

213
00:22:12,570 --> 00:22:16,850
neurons and millions of features and your computer will still be able to solve it.

214
00:22:19,820 --> 00:22:24,830
So do focus on the practical lecture. That is where you will learn how to implement these neural networks

215
00:22:25,220 --> 00:22:26,060
in the software tool.