1
00:00:01,370 --> 00:00:07,440
‫In this lecture, we are going to understand the concept behind how neural networks actually learn.

2
00:00:08,860 --> 00:00:12,660
So till this lecture, we have covered what a neural network is.

3
00:00:13,800 --> 00:00:17,760
Now we are starting with how a neural network works.

4
00:00:20,910 --> 00:00:22,170
‫Here's a quick recap.

5
00:00:24,170 --> 00:00:28,980
A neural network is a network of cells. In our network,

6
00:00:29,220 --> 00:00:38,010
we are going to use sigmoid neurons, because these learn in a more controllable manner. In a sigmoid neuron,

7
00:00:38,640 --> 00:00:39,510
two things happen.

8
00:00:40,860 --> 00:00:47,950
‫We first multiply the input features with the weights that are represented by W and then add a bias.

9
00:00:48,140 --> 00:00:51,260
term, b. This value

10
00:00:51,600 --> 00:00:54,510
we name z.

11
00:00:56,010 --> 00:00:59,790
The second step is the application of the sigmoid, or logistic, function.

12
00:01:00,450 --> 00:01:03,570
That is, the cell calculates one upon one

13
00:01:03,570 --> 00:01:05,660
plus e raised to the power minus z.

14
00:01:07,140 --> 00:01:09,300
This value is the output of the cell

15
00:01:09,840 --> 00:01:12,180
and is always between zero and one.
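
The two steps inside a sigmoid neuron can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the input, weight and bias values are assumptions chosen for the example, not values from the lecture.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Output of one sigmoid neuron (hypothetical helper for illustration)."""
    # Step 1: multiply each input feature by its weight, sum, and add the bias b.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the sigmoid to z; this is the output of the cell.
    return sigmoid(z)


# Illustrative values only (assumed, not from the lecture).
a = neuron_output(inputs=[0.5, -1.0], weights=[0.8, 0.2], bias=0.1)
print(a)
```

Whatever finite weights and inputs are used, the printed value stays strictly between zero and one, which is why it can be passed on as an input to the next layer.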

16
00:01:13,800 --> 00:01:17,220
‫This output becomes the new input for the next layer.

17
00:01:18,480 --> 00:01:24,540
‫This continues till the last layer, till we get the final output as per our network.

18
00:01:26,940 --> 00:01:29,580
Now the problem we are solving is this.

19
00:01:30,750 --> 00:01:39,600
We want to find out the weights and biases of all the cells in the system so that the final output of this

20
00:01:39,600 --> 00:01:44,160
network is as close as possible to the actual value of the variable to be predicted.

21
00:01:47,040 --> 00:01:52,680
For better understanding, let us just count the number of variables we need to calculate

22
00:01:52,770 --> 00:02:00,780
for this small network. Here, we have two neurons in the hidden layer and one neuron in the output layer,

23
00:02:01,620 --> 00:02:04,920
and there are two input features, X1 and X2.

24
00:02:06,570 --> 00:02:16,980
So for the first neuron, we are trying to calculate W1 into X1 plus W2 into X2 plus B1, which is equal

25
00:02:16,980 --> 00:02:17,840
to Z1.

26
00:02:19,770 --> 00:02:23,520
This Z1 will be put into the activation function,

27
00:02:23,670 --> 00:02:25,000
that is, the sigmoid function,

28
00:02:25,470 --> 00:02:27,900
and that will be the output of this neuron.

29
00:02:30,090 --> 00:02:35,820
Let us say that the output of this neuron is represented by A1. For the second neuron,

30
00:02:36,270 --> 00:02:39,790
we have two new weights, W3 and W4.

31
00:02:40,470 --> 00:02:45,840
We calculate this value: W3 into X1 plus W4 into X2 plus B2,

32
00:02:46,110 --> 00:02:50,430
where B2 is the bias of this neuron, and this is equal to Z2.

33
00:02:51,660 --> 00:02:59,670
We apply the activation function on Z2 to get A2. These A1 and A2

34
00:02:59,820 --> 00:03:06,060
are the inputs to this final output neuron. For these two inputs,

35
00:03:06,510 --> 00:03:09,900
we need two new weights, W5 and W6.

36
00:03:11,100 --> 00:03:19,980
So the equation at this output neuron is W5 into A1 plus W6 into A2 plus B3, which gives Z3.

37
00:03:21,360 --> 00:03:28,350
When we apply the activation function on this Z3, we get the predicted output from this output neuron.
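
The three equations for this small network can be written out as one forward pass. This is a hedged sketch: the dictionary layout, the function name `forward`, and the parameter values are assumptions made for illustration, while the names W1 to W6, B1 to B3, Z1 to Z3, A1 and A2 follow the lecture.

```python
import math


def sigmoid(z):
    """Logistic activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, x2, p):
    """Forward pass through the 2-input, 2-hidden-neuron, 1-output network.

    p is a dict holding the weights W1..W6 and the biases B1..B3.
    """
    # First hidden neuron: Z1 = W1*X1 + W2*X2 + B1, then A1 = sigmoid(Z1).
    z1 = p["W1"] * x1 + p["W2"] * x2 + p["B1"]
    a1 = sigmoid(z1)
    # Second hidden neuron: Z2 = W3*X1 + W4*X2 + B2, then A2 = sigmoid(Z2).
    z2 = p["W3"] * x1 + p["W4"] * x2 + p["B2"]
    a2 = sigmoid(z2)
    # Output neuron: Z3 = W5*A1 + W6*A2 + B3; sigmoid(Z3) is the prediction.
    z3 = p["W5"] * a1 + p["W6"] * a2 + p["B3"]
    return sigmoid(z3)


# Arbitrary illustrative parameters (assumed, not from the lecture).
params = {
    "W1": 0.2, "W2": -0.4, "B1": 0.1,
    "W3": 0.7, "W4": 0.3, "B2": -0.2,
    "W5": 0.5, "W6": -0.6, "B3": 0.05,
}
y_hat = forward(1.0, 2.0, params)
print(y_hat)
```

Changing any of the nine entries in `params` changes the prediction, which is exactly why training has to find good values for all of them.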

38
00:03:31,350 --> 00:03:40,650
So if you look at the variables that we need to estimate: for weights, we have W1, W2, W3, W4, W5

39
00:03:40,650 --> 00:03:41,460
and W6.

40
00:03:41,910 --> 00:03:45,990
So we are estimating six weights. For biases,

41
00:03:46,080 --> 00:03:50,040
we have B1, B2 and B3: three biases.

42
00:03:51,180 --> 00:03:59,610
‫So for this small network we need to establish the values of nine variables to make this neural network

43
00:03:59,760 --> 00:04:00,900
‫ready for predictions.
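
The same count can be reproduced for any layer sizes. The helper below is a hypothetical sketch that assumes every neuron is connected to every output of the previous layer, as in the small network above.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the units per layer, inputs first, e.g. [2, 2, 1].
    """
    # One weight per connection between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias per neuron; the input features have no bias.
    biases = sum(layer_sizes[1:])
    return weights, biases


weights, biases = count_parameters([2, 2, 1])
print(weights, biases, weights + biases)  # 6 3 9
```

The count grows quickly with network size, which is one reason an efficient search technique is needed.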

44
00:04:04,980 --> 00:04:13,420
Now, how do we find out the values of the Ws and Bs? The technique followed for

45
00:04:13,470 --> 00:04:15,300
this is called gradient descent.

46
00:04:17,220 --> 00:04:22,260
Gradient descent is just another optimization technique to find the minimum of a function.

47
00:04:24,090 --> 00:04:29,730
There are other optimization techniques also, such as ordinary least squares, which is used in linear

48
00:04:29,730 --> 00:04:30,240
regression.

49
00:04:31,780 --> 00:04:40,170
But for a large number of features and complex relationships, gradient descent usually shows much better computational

50
00:04:40,170 --> 00:04:42,390
performance than most other techniques.

51
00:04:44,280 --> 00:04:50,640
This means that if you have a large number of input variables and a very complex relationship between

52
00:04:50,730 --> 00:04:59,590
input and output, gradient descent will train the model much faster than other optimization techniques.

53
00:05:01,550 --> 00:05:06,200
So let's first discuss the process followed in gradient descent in a stepwise manner.

54
00:05:10,070 --> 00:05:18,090
We start by assigning random weight and bias values to all the cells in our network.

55
00:05:20,660 --> 00:05:26,870
Since all the weight and bias values are available, that is, we have randomly assigned all weights

56
00:05:26,900 --> 00:05:27,560
and biases,

57
00:05:28,580 --> 00:05:30,670
our model is ready to give an output.

58
00:05:33,470 --> 00:05:37,280
‫The second step is we input one training example.

59
00:05:38,540 --> 00:05:45,830
‫We use the X values of the training example and calculate the final output of the network using these

60
00:05:45,830 --> 00:05:47,570
‫weights and bias values.

61
00:05:50,180 --> 00:05:59,510
The third step is that we compare the predicted values with the actual values, and we note the difference

62
00:05:59,570 --> 00:06:02,510
between these two using some error function.

63
00:06:02,640 --> 00:06:06,200
We will come back to this error function later.

64
00:06:07,250 --> 00:06:12,920
Remember that we have the actual Y value because this was a training observation.

65
00:06:14,150 --> 00:06:20,720
So these actual values are being used to give feedback to our network on how badly it is performing.

66
00:06:24,190 --> 00:06:34,510
The fourth step is that we try to find out those weights and biases by changing which we can reduce this error.

67
00:06:36,280 --> 00:06:40,030
Lastly, we update the values of the weights and biases

68
00:06:41,590 --> 00:06:44,270
and repeat this process from step two.

69
00:06:46,180 --> 00:06:48,570
This loop goes on till no

70
00:06:48,580 --> 00:06:51,370
further reduction in the error function can be achieved.
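
The loop described in these steps can be sketched end to end for the small 2-2-1 network. Several things here are assumptions and not the lecture's method: the squared error used as the error function, the toy OR-style dataset, the learning rate, the fixed number of passes used instead of "until no further reduction", and above all the way step four is done, which simply nudges each parameter a little to see how the error changes.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(params, x1, x2):
    """Forward pass; params is [W1, W2, W3, W4, W5, W6, B1, B2, B3]."""
    w1, w2, w3, w4, w5, w6, b1, b2, b3 = params
    a1 = sigmoid(w1 * x1 + w2 * x2 + b1)
    a2 = sigmoid(w3 * x1 + w4 * x2 + b2)
    return sigmoid(w5 * a1 + w6 * a2 + b3)


def error(params, example):
    """Squared error on one training example (an assumed error function)."""
    x1, x2, y = example
    return (predict(params, x1, x2) - y) ** 2


def total_error(params, data):
    return sum(error(params, ex) for ex in data)


def train(data, epochs=500, learning_rate=0.5, eps=1e-5, seed=0):
    rng = random.Random(seed)
    # Step 1: assign random initial values to all weights and biases.
    initial = [rng.uniform(-1.0, 1.0) for _ in range(9)]
    params = list(initial)
    for _ in range(epochs):
        for example in data:
            # Steps 2 and 3: run one training example forward and measure the error.
            base = error(params, example)
            # Step 4: nudge each parameter slightly to estimate its effect on the error.
            slopes = []
            for i in range(len(params)):
                nudged = list(params)
                nudged[i] += eps
                slopes.append((error(nudged, example) - base) / eps)
            # Step 5: move every parameter a little in the direction that lowers the error.
            params = [p - learning_rate * s for p, s in zip(params, slopes)]
    return initial, params


# Toy data (assumed): (x1, x2, actual y) for a logical OR.
data = [(0.0, 0.0, 0.0), (0.0, 1.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
initial, trained = train(data)
print(total_error(initial, data), total_error(trained, data))
```

The two printed numbers are the total error before and after training. Nudging every parameter separately is easy to read but slow for large networks.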

71
00:06:53,740 --> 00:06:59,530
In these steps, the first step is called initialization. Here,

72
00:06:59,710 --> 00:07:03,540
we just gave some random initial values to the weights and biases.

73
00:07:06,040 --> 00:07:09,310
‫The second step is called forward propagation.

74
00:07:10,690 --> 00:07:14,830
This is because in this step, we start with the input values,

75
00:07:15,910 --> 00:07:17,290
process them in layer one,

76
00:07:18,550 --> 00:07:21,730
then we take the output of layer one and process it

77
00:07:21,880 --> 00:07:22,630
in layer two,

78
00:07:22,840 --> 00:07:23,620
and so on,

79
00:07:24,190 --> 00:07:26,530
till we get one final predicted output.

80
00:07:28,810 --> 00:07:32,500
‫We are simply moving forward in terms of the layers of the network.

81
00:07:33,580 --> 00:07:35,020
‫So this is forward propagation.

82
00:07:37,360 --> 00:07:42,790
The third and fourth steps are called backward propagation. In these steps,

83
00:07:43,060 --> 00:07:50,650
we already have the final error function, and we look backwards in our network to find out which weights

84
00:07:51,160 --> 00:07:55,090
and biases have maximum impact on this error function.

85
00:07:56,920 --> 00:08:00,730
Once we establish which weights and biases have maximum impact,

86
00:08:01,510 --> 00:08:06,160
we update these weights and biases slightly to reduce the error.

87
00:08:07,510 --> 00:08:12,520
‫So this is the process we follow to implement gradient descent in neural networks.

88
00:08:13,810 --> 00:08:15,490
But we have still not discussed

89
00:08:15,760 --> 00:08:17,590
the concept behind gradient descent.

90
00:08:22,360 --> 00:08:28,150
Gradient descent is a mathematical technique which is used to find the minimum of a function.

91
00:08:29,980 --> 00:08:36,430
Let's see this example in the graph on the left. On the X axis,

92
00:08:36,670 --> 00:08:39,220
I have this variable. On the Y axis,

93
00:08:39,340 --> 00:08:41,350
I have a function applied on this variable.

94
00:08:43,180 --> 00:08:44,740
‫This is the plot of this function.

95
00:08:47,410 --> 00:08:55,210
Now, if you want to find out the value of X at which the function has its minimum value, there are two

96
00:08:55,210 --> 00:08:55,780
ways to do it.

97
00:08:57,220 --> 00:09:04,610
‫One is if you know the exact relationship between X and the function, you can use calculus to find

98
00:09:04,610 --> 00:09:06,100
‫the minimum of this function.
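
As a small worked example of the calculus route, assume a hypothetical function f(x) = (x - 3)^2 + 1, which is not the function plotted in the lecture. Setting the derivative to zero gives the minimum directly:

```latex
f(x) = (x - 3)^2 + 1, \qquad
f'(x) = 2(x - 3) = 0 \;\Rightarrow\; x = 3, \qquad
f''(x) = 2 > 0 \;\Rightarrow\; \text{a minimum, with } f(3) = 1.
```

This only works because the formula for f is known exactly.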

99
00:09:08,610 --> 00:09:14,350
‫But as you know, in our machine learning problems, we do not have this exact relationship.

100
00:09:16,090 --> 00:09:19,900
‫So we use a second technique, which is an iterative technique.

101
00:09:21,340 --> 00:09:22,210
In this technique,

102
00:09:22,570 --> 00:09:25,210
we start at a random point on this plot.

103
00:09:26,640 --> 00:09:36,730
Say we have this value of X and f(x). Now, instead of focusing on the whole graph, we focus only

104
00:09:36,730 --> 00:09:43,900
‫on this small part of the graph and try to find out what happens if we slightly increase the value of

105
00:09:43,900 --> 00:09:46,160
‫X or decrease the value of X.

106
00:09:48,040 --> 00:09:52,240
‫In other words, we are trying to find out which way is the slope.

107
00:09:53,380 --> 00:10:00,160
If the slope is negative, that is, like this, we increase the value of X a little bit,

108
00:10:01,120 --> 00:10:03,790
and then we will see that f(x) decreases.

109
00:10:04,720 --> 00:10:11,860
Similarly, if the slope is positive, we decrease the value of X, which will slightly decrease the

110
00:10:11,860 --> 00:10:13,290
value of f(x).

111
00:10:16,810 --> 00:10:19,330
We continue taking these small steps

112
00:10:19,570 --> 00:10:29,020
till we reach the final minimum point. When we are at this point, moving to either side only increases

113
00:10:29,020 --> 00:10:30,100
the value of the function.

114
00:10:31,690 --> 00:10:33,670
So we stop our process here.

115
00:10:35,560 --> 00:10:43,300
This iterative technique of finding the instantaneous slope, also known as the gradient, and slightly moving

116
00:10:43,300 --> 00:10:48,160
down that slope, that is, descent, is called gradient descent.
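
The one-variable version of this idea fits in a few lines. This is a minimal sketch under stated assumptions: the function f, the starting point, the step size (`learning_rate`) and the number of steps are all chosen for illustration, and the slope is estimated numerically and not taken from a formula.

```python
def f(x):
    """Example function (assumed): a bowl whose minimum is at x = 3."""
    return (x - 3.0) ** 2 + 1.0


def slope(func, x, h=1e-6):
    """Estimate the instantaneous slope of func at x from two nearby points."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def gradient_descent(func, x_start, learning_rate=0.1, steps=200):
    x = x_start
    for _ in range(steps):
        # Negative slope: x increases. Positive slope: x decreases.
        # Either way, the value of the function goes down slightly.
        x -= learning_rate * slope(func, x)
    return x


x_min = gradient_descent(f, x_start=-4.0)
print(x_min, f(x_min))
```

Subtracting the slope times a small step size covers both cases from the lecture in a single line, and the steps shrink on their own as the slope flattens near the minimum.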

117
00:10:53,260 --> 00:10:58,510
If you want to picture this, you can think of yourself being on top of a hill.

118
00:11:00,160 --> 00:11:03,640
‫You cannot see anything around you because it is dark and foggy.

119
00:11:04,870 --> 00:11:08,380
‫Now, you want to come down the hill as fast as possible.

120
00:11:09,100 --> 00:11:09,760
‫What do you do?

121
00:11:11,590 --> 00:11:17,460
‫Ideally, if you could see, you would spot the closest downhill point and run to it.

122
00:11:19,000 --> 00:11:26,260
But since you cannot see, and you know the gradient descent technique, you take a step in each direction,

123
00:11:27,430 --> 00:11:35,590
see which direction goes down the most, and then you move from your current position to that new position.

124
00:11:37,270 --> 00:11:42,970
Then you again put out your right foot, check which is the direction of the steepest slope, and move.

125
00:11:42,980 --> 00:11:45,640
You keep doing this,

126
00:11:46,090 --> 00:11:48,700
and eventually you will come downhill.
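
The hill picture can be turned into a literal sketch: try one step in each of four directions, move to whichever spot is lowest, and stop when no step goes down. Note that this is an illustration of the analogy only. It probes directions by trial and does not compute a slope, and the `height` function, starting point and step length are assumptions.

```python
def height(x, y):
    """Assumed hill shape: the lowest point is at (1, -2)."""
    return (x - 1.0) ** 2 + (y + 2.0) ** 2


def walk_downhill(x, y, step=0.1, max_steps=1000):
    # The four directions we can feel out with a foot.
    directions = [(step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)]
    for _ in range(max_steps):
        # Try a step in each direction and see which spot is lowest.
        candidates = [(x + dx, y + dy) for dx, dy in directions]
        best = min(candidates, key=lambda p: height(*p))
        # If no direction goes down, we are at the bottom (to within one step).
        if height(*best) >= height(x, y):
            break
        x, y = best
    return x, y


x, y = walk_downhill(5.0, 5.0)
print(x, y, height(x, y))
```

Because the step length is fixed, the walk ends within about one step of the lowest point.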

127
00:11:50,350 --> 00:11:52,810
‫This is the concept behind gradient descent.

128
00:11:54,390 --> 00:11:58,330
In the next lecture, we will merge the two ideas.

129
00:11:58,780 --> 00:12:03,270
The first is the process that neural networks use to implement gradient descent,

130
00:12:04,060 --> 00:12:07,660
and the second idea is what gradient descent is

131
00:12:07,750 --> 00:12:15,460
mathematically. We will merge these two and understand how gradient descent helps us reach the

132
00:12:15,460 --> 00:12:17,470
minimum in neural networks.