﻿1
00:00:00,740 --> 00:00:03,360
‫Hey, everyone, congratulations, first of all.

2
00:00:04,280 --> 00:00:09,890
‫If you understood whatever I taught you in the last few lectures, you understand neural networks in

3
00:00:09,890 --> 00:00:14,930
‫the way they are used for prediction purposes. In this lecture,

4
00:00:15,080 --> 00:00:17,360
‫I'm going to talk about several subtopics.

5
00:00:18,740 --> 00:00:24,250
‫These topics also came up in our previous lectures, but to maintain the flow,

6
00:00:25,130 --> 00:00:28,150
‫I did not discuss them in detail at that time.

7
00:00:29,850 --> 00:00:35,400
‫Also, this extra knowledge is often what is asked in interview questions.

8
00:00:36,820 --> 00:00:40,450
‫In fact, I'm going to cover this lecture in a question-and-answer format.

9
00:00:41,930 --> 00:00:44,830
‫So pay attention, as this is also very important.

10
00:00:47,140 --> 00:00:49,870
‫The first subtopic is activation functions.

11
00:00:51,360 --> 00:00:55,730
‫And the first question that we are going to discuss is, why do we use activation functions?

12
00:00:58,660 --> 00:01:02,380
‫Let's see what happens if we do not have any activation function.

13
00:01:02,650 --> 00:01:09,320
‫What is the output of a neuron? The output would be given by this equation: w1 x1 plus w2 x2

14
00:01:09,320 --> 00:01:13,840
‫plus b is equal to z, which is the output.

15
00:01:16,100 --> 00:01:20,300
‫This means that the output could be any real number with no boundaries.

16
00:01:21,610 --> 00:01:25,360
‫If you're solving a regression problem, then this may be acceptable.

17
00:01:26,380 --> 00:01:33,340
‫But when we are doing classification, that is, when we want an output of the yes/no or one/zero type,

18
00:01:34,780 --> 00:01:39,790
‫we need to transform the output z to get a zero/one type of output.

19
00:01:41,950 --> 00:01:51,130
‫Also, if there are only linear neurons in the whole network, you can only predict a linear relationship

20
00:01:51,280 --> 00:01:53,430
‫between input and output variables.
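A minimal sketch of why a network of only linear neurons stays linear, in plain Python. The weights and biases are arbitrary illustrative values, not taken from the lecture, and the function names are hypothetical. It shows that a hidden layer of two linear neurons followed by a linear output neuron gives exactly the same result as one linear neuron.

```python
def linear_neuron(inputs, weights, bias):
    """Neuron with no activation function: z = w1*x1 + w2*x2 + ... + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def two_linear_layers(x1, x2):
    # Hidden layer: two linear neurons (illustrative weights and biases).
    h1 = linear_neuron([x1, x2], [0.5, -1.0], 0.2)
    h2 = linear_neuron([x1, x2], [1.5, 2.0], -0.3)
    # Output layer: one linear neuron fed by the hidden outputs.
    return linear_neuron([h1, h2], [2.0, 1.0], 0.1)


def equivalent_single_neuron(x1, x2):
    # Expanding 2*h1 + 1*h2 + 0.1 by hand gives 2.5*x1 + 0*x2 + 0.2,
    # so the whole network collapses to a single linear neuron.
    return linear_neuron([x1, x2], [2.5, 0.0], 0.2)
```

For any pair of inputs, `two_linear_layers` and `equivalent_single_neuron` agree up to floating-point rounding, which is why a non-linear activation is needed to model non-linear patterns.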

21
00:01:54,900 --> 00:01:59,400
‫So the answer is we use an activation function for two reasons.

22
00:02:00,360 --> 00:02:04,740
‫First, to put boundary conditions on our output. For classification,

23
00:02:04,860 --> 00:02:06,120
‫the boundaries are obvious.

24
00:02:07,080 --> 00:02:12,610
‫For regression also, if the output has some boundary, you can use an activation function.

25
00:02:14,470 --> 00:02:21,400
‫The second reason is to introduce non-linearity so that we can find complex, nonlinear patterns

26
00:02:21,530 --> 00:02:21,910
‫as well.

27
00:02:25,270 --> 00:02:29,080
‫The next question is, what are the different types of activation functions?

28
00:02:31,040 --> 00:02:33,350
‫Earlier, we had discussed two activation functions.

29
00:02:34,100 --> 00:02:38,970
‫One was the step function, which is zero until a threshold value,

30
00:02:39,800 --> 00:02:43,100
‫and then it suddenly jumps to one at that threshold value.

31
00:02:45,170 --> 00:02:51,830
‫Then we discussed the sigmoid function, which is a continuous S-shaped curve with zero as the lower

32
00:02:51,830 --> 00:02:52,310
‫boundary.

33
00:02:53,030 --> 00:02:57,770
‫and one as the upper boundary. For most practical business purposes,

34
00:02:58,190 --> 00:03:05,330
‫the sigmoid function is good enough, but for rare scenarios where computation is an issue, we sometimes

35
00:03:05,330 --> 00:03:10,070
‫use these two other activation functions also because of their convergence efficiency.

36
00:03:10,860 --> 00:03:15,740
‫First is the hyperbolic tangent function, or tanh.

37
00:03:17,200 --> 00:03:23,110
‫The graph of this function is similar to the sigmoid in shape, but it has different boundaries.

38
00:03:24,040 --> 00:03:27,490
‫It has an upper boundary of one and a lower boundary of minus one.

39
00:03:29,010 --> 00:03:35,950
‫And because it is centered at zero, it almost always has better convergence efficiency than sigmoid.

40
00:03:38,200 --> 00:03:43,480
‫The second is ReLU, which is short for rectified linear unit.

41
00:03:45,310 --> 00:03:51,430
‫It is a very widely used function, especially in the hidden layers of regression neural networks.

42
00:03:53,080 --> 00:03:56,770
‫This is what this function looks like. For inputs up to zero,

43
00:03:57,010 --> 00:04:05,380
‫the function outputs zero, but for inputs above zero, the function outputs the same as its input.

44
00:04:05,860 --> 00:04:08,590
‫That is, f(x) is equal to x.

45
00:04:10,760 --> 00:04:15,080
‫So the lower bound is zero, but there is no upper bound.

46
00:04:16,460 --> 00:04:20,660
‫This function performs well because it is very cheap to compute.

47
00:04:22,220 --> 00:04:28,610
‫The reason for using this function in hidden layers is that this function introduces non-linearity

48
00:04:29,360 --> 00:04:30,320
‫in the hidden layers.

49
00:04:31,970 --> 00:04:39,320
‫However, on the output layer, it is rarely used because for classification, the right side of the

50
00:04:39,320 --> 00:04:43,340
‫function is not bounded and therefore it cannot be used.

51
00:04:44,780 --> 00:04:46,430
‫On the other hand, for regression,

52
00:04:46,940 --> 00:04:52,660
‫the left side of this function is bounded at zero, so negative values cannot be predicted and it is generally not used.

53
00:04:54,320 --> 00:04:57,430
‫So this function is good for activating hidden layers,

54
00:04:57,650 --> 00:04:59,990
‫but not for activating output layers.

55
00:05:03,140 --> 00:05:07,100
‫You can find a summary of all these activation functions on the next slide.

56
00:05:07,940 --> 00:05:09,290
‫This is for your reference.
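As a companion to the summary, here is a minimal sketch of the four activation functions discussed so far, in plain Python. The default threshold of zero for the step function, and the choice that the output is one exactly at the threshold, are assumptions made for illustration.

```python
import math


def step(z, threshold=0.0):
    """Zero below the threshold, one at or above it (threshold is assumed)."""
    return 1.0 if z >= threshold else 0.0


def sigmoid(z):
    """S-shaped curve bounded between 0 and 1."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    """S-shaped curve bounded between -1 and 1, centered at zero."""
    return math.tanh(z)


def relu(z):
    """Zero for inputs up to zero, the input itself above zero."""
    return max(0.0, z)
```

Sigmoid and tanh are bounded on both sides, while ReLU has a lower bound of zero and no upper bound, which matches the reasons given for using it in hidden layers rather than output layers.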

57
00:05:13,500 --> 00:05:16,890
‫This brings us to the next question, which we have already answered.

58
00:05:18,000 --> 00:05:22,290
‫Can hidden layers and output layers have different activation functions?

59
00:05:23,370 --> 00:05:24,330
‫The answer is yes.

60
00:05:24,960 --> 00:05:30,660
‫As I told you, we can implement ReLU in the hidden layers and a sigmoid in the output layer.

61
00:05:32,310 --> 00:05:35,250
‫Any such combination is allowed by our software tool.

62
00:05:38,250 --> 00:05:46,160
‫The next question is: what is multiclass classification, and is there any special activation function for

63
00:05:46,170 --> 00:05:47,700
‫multiclass classification?

64
00:05:49,830 --> 00:05:52,140
‫So first of all, what is multiclass classification?

65
00:05:53,010 --> 00:05:54,810
‫Suppose you are classifying into

66
00:05:54,960 --> 00:05:55,980
‫yes or no,

67
00:05:56,790 --> 00:05:57,450
‫or one

68
00:05:57,450 --> 00:05:57,990
‫or zero.

69
00:05:58,590 --> 00:06:01,740
‫This is binary classification because there are two classes only.

70
00:06:03,370 --> 00:06:05,130
‫But suppose we have more than two classes.

71
00:06:06,270 --> 00:06:12,060
‫Say we want to classify images into shirts, trousers, ties and socks.

72
00:06:12,720 --> 00:06:14,880
‫Now the output cannot be zero or one.

73
00:06:16,500 --> 00:06:23,660
‫We cannot simply assign zero for shirts, one for trousers, two for ties and three for socks.

74
00:06:24,390 --> 00:06:26,100
‫That would not give us the right answer.

75
00:06:27,600 --> 00:06:33,730
‫So we have to handle this situation in a slightly different manner. For such a situation,

76
00:06:34,140 --> 00:06:36,960
‫we have an activation function called softmax.

77
00:06:40,680 --> 00:06:43,900
‫This activation function works similarly to sigmoid,

78
00:06:44,460 --> 00:06:45,960
‫but has an additional step.

79
00:06:48,120 --> 00:06:56,280
‫So what we do in multiclass classification is usually keep as many output neurons as we have classes.

80
00:06:57,210 --> 00:07:06,420
‫So if we have three classes like shirts, trousers and socks, we keep three neurons at the output layer.

81
00:07:07,650 --> 00:07:14,820
‫You can see in this image these three output neurons correspond to three output classes.

82
00:07:16,890 --> 00:07:21,370
‫These three output neurons have the sigmoid activation function only.

83
00:07:22,950 --> 00:07:27,720
‫So the output of the first neuron would lie between zero and one,

84
00:07:28,260 --> 00:07:35,370
‫and we can say that the output value corresponds to the probability of whether it is a shirt or

85
00:07:35,370 --> 00:07:35,640
‫not.

86
00:07:37,380 --> 00:07:40,230
‫The second output would also be between zero and one.

87
00:07:40,350 --> 00:07:43,770
‫And we can say that it is the probability of whether it is a trouser or not.

88
00:07:44,820 --> 00:07:48,000
‫And the third output will also be between zero and one.

89
00:07:48,180 --> 00:07:52,170
‫And we can say that it is the probability of whether it is socks or not.

90
00:07:54,830 --> 00:08:02,990
‫But the thing is that the item can be only one of these classes; that is, the sum of probabilities should come out to be

91
00:08:02,990 --> 00:08:03,320
‫one.

92
00:08:04,430 --> 00:08:10,970
‫Either it is a shirt, or it is trousers, or it is socks. To implement this,

93
00:08:11,660 --> 00:08:20,030
‫we put an additional softmax layer, and we input the result of these three neurons into this

94
00:08:20,030 --> 00:08:20,650
‫softmax layer.

95
00:08:22,340 --> 00:08:29,120
‫This softmax layer takes the exponential of each value and divides it by the sum of the exponentials of all the values.

96
00:08:31,610 --> 00:08:35,930
‫Now this output can be considered as the probability of that class occurring.

97
00:08:36,560 --> 00:08:40,070
‫And the sum of all these probabilities will also be equal to one.

98
00:08:43,780 --> 00:08:50,590
‫So for multiclass classification, a softmax activation is often used on the output layer.
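A minimal sketch of softmax for the three-class example, in plain Python. In the standard definition, softmax is applied to the raw output scores of the neurons and exponentiates each score before dividing by the sum. The scores below are made-up illustrative numbers, and subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to one."""
    m = max(scores)  # subtract the max so math.exp cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["shirt", "trousers", "socks"]
raw_scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the three neurons

probs = softmax(raw_scores)
# The predicted class is the one with the highest probability.
predicted = classes[probs.index(max(probs))]
```

Each entry of `probs` lies between zero and one, the entries sum to one, and the class with the largest raw score also gets the largest probability.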

99
00:08:53,980 --> 00:08:56,200
‫That's all about the activation functions.

100
00:08:57,400 --> 00:08:59,740
‫The next topic is gradient descent.

101
00:09:01,750 --> 00:09:07,750
‫The question I want to discuss here is what is the difference between gradient descent and stochastic

102
00:09:07,750 --> 00:09:08,560
‫gradient descent?

103
00:09:10,480 --> 00:09:18,250
‫This is a very common question in the minds of students, because they find "stochastic" written in some

104
00:09:18,250 --> 00:09:18,790
‫places,

105
00:09:18,970 --> 00:09:21,340
‫and not written in other texts.

106
00:09:22,870 --> 00:09:24,160
‫Let me clarify this for you.

107
00:09:25,240 --> 00:09:32,380
‫What we discussed in our previous lectures was actually stochastic gradient descent, because I told

108
00:09:32,380 --> 00:09:39,460
‫you that we take each individual training record and update our weights and biases

109
00:09:39,730 --> 00:09:47,680
‫with each training record. When we run the whole forward and backward propagation for each individual

110
00:09:47,680 --> 00:09:51,860
‫training record, that is stochastic gradient descent.

111
00:09:54,490 --> 00:10:02,500
‫But if you run forward propagation for the entire training set in one go and find out the average error

112
00:10:02,500 --> 00:10:03,520
‫on the entire set,

113
00:10:03,850 --> 00:10:07,450
‫and then update the weights and biases during backward propagation,

114
00:10:08,260 --> 00:10:09,340
‫that is gradient descent.

115
00:10:11,450 --> 00:10:17,780
‫There is another variation in which we make small batches out of the training set and use these batches

116
00:10:17,870 --> 00:10:19,050
‫instead of the complete set.

117
00:10:20,420 --> 00:10:23,000
‫This one is called mini-batch gradient descent.

118
00:10:25,870 --> 00:10:36,400
‫The point is, stochastic gradient descent starts updating weights and biases rapidly, but it has

119
00:10:36,490 --> 00:10:37,840
‫difficulty converging.

120
00:10:40,250 --> 00:10:47,340
‫Whereas gradient descent is slow because in each pass, it has to go through the entire training set.

121
00:10:49,100 --> 00:10:52,990
‫But the good thing about gradient descent is it converges very well.

122
00:10:54,370 --> 00:10:59,650
‫So we have to select the optimization technique accordingly.
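A minimal sketch of the three variants on a toy problem, in plain Python. It fits a straight line y = w*x + b with squared error instead of a full neural network, and the data, learning rate, and epoch count are illustrative assumptions. The only thing that changes between the variants is the hypothetical `batch_size` argument: 1 gives stochastic gradient descent, the full set size gives gradient descent, and anything in between gives mini-batch gradient descent.

```python
import random


def train(xs, ys, batch_size, epochs=500, lr=0.05, seed=0):
    """Fit y = w*x + b; batch_size selects the gradient descent variant."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the records in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w*x + b - y)^2 over the current batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # One weight update per batch.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1, with no noise

sgd = train(xs, ys, batch_size=1)             # stochastic: update per record
mini = train(xs, ys, batch_size=2)            # mini-batch: update per small batch
full = train(xs, ys, batch_size=len(xs))      # gradient descent: update per full pass
```

On this noise-free toy data all three settings approach w = 2 and b = 1. With real, noisy data the trade-off described above shows up: per-record updates are frequent but jumpy, while full-set updates are smoother but each one costs a pass over all the data.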

123
00:11:00,820 --> 00:11:03,780
‫We will see this further in the practical part of this course.

124
00:11:06,290 --> 00:11:16,280
‫The last thing I want to discuss in this lecture is the epoch. In neural networks, an epoch is one cycle through the

125
00:11:16,280 --> 00:11:17,540
‫full training data.

126
00:11:20,030 --> 00:11:28,590
‫So when we say epoch is equal to five, it means we want the full training data to be fed five times.

127
00:11:29,510 --> 00:11:32,450
‫Note that an epoch is different from an iteration.

128
00:11:34,850 --> 00:11:37,910
‫Suppose you have 1000 training records.

129
00:11:39,590 --> 00:11:45,950
‫Now, if you train on the records one by one, like we do in stochastic gradient descent,

130
00:11:46,790 --> 00:11:50,570
‫you will have to iterate 1000 times.

131
00:11:53,350 --> 00:12:00,550
‫So iterations are the number of weight updates you perform in one full pass through the training set.

132
00:12:02,200 --> 00:12:11,620
‫On the other hand, if you have set Epoch to two, then this means that the 1000 training examples will

133
00:12:11,620 --> 00:12:18,400
‫be fed two times, either one by one, or all at the same time, or in mini-batches.
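The relationship between epochs and iterations can be sketched as a small calculation, in plain Python. The function names are hypothetical, and rounding up for a final partial batch is an assumption about how leftover records are handled.

```python
import math


def iterations_per_epoch(num_records, batch_size):
    """Weight updates needed to pass through the training set once."""
    # Round up so a final, smaller batch still counts as one iteration.
    return math.ceil(num_records / batch_size)


def total_updates(num_records, batch_size, epochs):
    """Weight updates over the whole training run."""
    return epochs * iterations_per_epoch(num_records, batch_size)
```

With 1000 training records, `iterations_per_epoch(1000, 1)` is 1000 for stochastic gradient descent and `iterations_per_epoch(1000, 1000)` is 1 for gradient descent on the full set. Setting the number of epochs to two doubles the total in either case.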

134
00:12:20,230 --> 00:12:26,410
‫The idea of using multiple epochs is to allow the network to revisit the same data and refine its weights.

135
00:12:28,600 --> 00:12:32,950
‫We will see how we can specify and use epochs in the practical lectures.

136
00:12:34,420 --> 00:12:36,730
‫That's all for this lecture. In the upcoming video,

137
00:12:36,760 --> 00:12:43,660
‫we will summarize the key parameters that you must know while implementing neural networks in the software.

