﻿1
00:00:00,720 --> 00:00:01,760
‫Hey everyone.

2
00:00:02,120 --> 00:00:03,370
‫Congratulations first of all.

3
00:00:04,280 --> 00:00:09,980
‫If you understood whatever I taught you in the last few lectures you understand neural networks in the

4
00:00:09,980 --> 00:00:14,210
‫way they are used for prediction purposes.

5
00:00:14,210 --> 00:00:21,430
‫In this lecture I'm going to talk about several subtopics. These subtopics came up in our lectures

6
00:00:21,440 --> 00:00:25,030
‫previously also, but to maintain the flow

7
00:00:25,130 --> 00:00:29,760
‫I did not discuss them in detail at that time.

8
00:00:29,850 --> 00:00:36,760
‫Also this extra knowledge is often what is asked in interview questions.

9
00:00:36,820 --> 00:00:43,700
‫In fact, I'm going to cover this lecture in a question-answer format, so keep your attention, as this

10
00:00:43,700 --> 00:00:44,830
‫is also very important.

11
00:00:47,100 --> 00:00:53,500
‫The first subtopic is activation functions, and the first question that we are going to discuss is

12
00:00:53,660 --> 00:00:58,640
‫why do we use activation functions?

13
00:00:58,650 --> 00:01:02,640
‫Let's see: if we do not have any activation function,

14
00:01:02,640 --> 00:01:04,810
‫what is the output of a neuron?

15
00:01:04,860 --> 00:01:16,090
‫The output would be given by this equation: w1 x1 plus w2 x2 plus b is equal to z, which is the output.

16
00:01:16,100 --> 00:01:22,990
‫This means that the output could be any real number with no boundaries. If you are solving a regression

17
00:01:22,990 --> 00:01:26,380
‫problem, then this may be acceptable.

18
00:01:26,380 --> 00:01:34,670
‫But when we are doing classification, that is, when we want an output of yes/no type or 1/0 type,

19
00:01:34,780 --> 00:01:41,780
‫we need to treat the output z to get a 0/1 type output.

20
00:01:41,950 --> 00:01:51,130
‫Also if there are only linear neurons in the whole network you can only predict a linear relationship

21
00:01:51,280 --> 00:01:53,440
‫between input and output variables.

22
00:01:54,930 --> 00:02:00,360
‫So the answer is we use an activation function for two reasons.

23
00:02:00,360 --> 00:02:04,860
‫First, to put boundary conditions on our output. For classification,

24
00:02:04,860 --> 00:02:08,420
‫the boundaries are obvious. For regression also,

25
00:02:08,500 --> 00:02:14,470
‫if you have some boundary, you can use an activation function.

26
00:02:14,470 --> 00:02:21,490
‫The second reason is to introduce nonlinearity so that we can find complex nonlinear patterns

27
00:02:21,490 --> 00:02:21,880
‫also.

28
00:02:25,270 --> 00:02:31,390
‫The next question is: what are the different types of activation functions? Earlier,

29
00:02:31,400 --> 00:02:34,080
‫we had discussed two activation functions.

30
00:02:34,100 --> 00:02:39,410
‫One was the step function, which is zero below the threshold value

31
00:02:39,800 --> 00:02:45,170
‫and then suddenly jumps to one at the threshold value.

32
00:02:45,170 --> 00:02:52,280
‫Then we discussed the sigmoid function, which is a continuous S-shaped curve with zero as the lower boundary

33
00:02:53,030 --> 00:03:01,220
‫and one as the upper boundary. For most practical business purposes, the sigmoid function is good enough, but

34
00:03:01,220 --> 00:03:07,280
‫for rare scenarios where computation is an issue we sometimes use these two other activation functions

35
00:03:07,280 --> 00:03:15,740
‫also, because of their convergence efficiency. The first is the hyperbolic tangent function, or tanh.

36
00:03:17,170 --> 00:03:23,640
‫The graph of this function is almost similar to the sigmoid in shape, but it has different boundaries.

37
00:03:24,040 --> 00:03:31,780
‫It has an upper boundary of one and a lower boundary of minus one, and because it is centered at zero, it

38
00:03:31,900 --> 00:03:38,060
‫almost always has better convergence efficiency than sigmoid.

39
00:03:38,200 --> 00:03:45,310
‫The second is ReLU, which is short for rectified linear unit.

40
00:03:45,310 --> 00:03:53,030
‫It is a very widely used function, especially in the inner layers of regression neural networks.

41
00:03:53,080 --> 00:04:03,550
‫This is what this function looks like. Up to 0, the function outputs 0, but after 0, the function outputs

42
00:04:03,880 --> 00:04:10,080
‫the same as the input, that is, f(x) is equal to x.

43
00:04:10,780 --> 00:04:15,740
‫So the lower bound is 0, but there is no upper bound.

44
00:04:16,750 --> 00:04:21,880
‫This function performs well because it is very easy to execute.

45
00:04:22,210 --> 00:04:28,570
‫The reason for using this function in hidden layers is that this function introduces nonlinearity

46
00:04:29,450 --> 00:04:31,920
‫in different layers.

47
00:04:31,990 --> 00:04:39,820
‫However, on the output layer it is rarely used, because for classification the right side of the function

48
00:04:39,910 --> 00:04:46,860
‫is not bounded, and therefore it cannot be used. On the other hand, for regression,

49
00:04:46,930 --> 00:04:52,700
‫the left side of this function is bounded, and therefore this function cannot be used.

50
00:04:54,340 --> 00:05:02,900
‫So this function is good for activating hidden layers but not for activating output layers.

51
00:05:03,140 --> 00:05:07,900
‫You can find a summary of all these activation functions in the next slide.

52
00:05:07,940 --> 00:05:13,390
‫This is for your reference.

53
00:05:13,470 --> 00:05:20,250
‫This brings us to the next question, which we have already answered: can hidden layers and output layers

54
00:05:20,460 --> 00:05:23,370
‫have different activation functions?

55
00:05:23,370 --> 00:05:24,710
‫The answer is yes.

56
00:05:24,960 --> 00:05:30,690
‫As I told you, we can implement ReLU in the hidden layers and sigmoid in the output layer.

57
00:05:32,310 --> 00:05:35,220
‫Any such combination is allowed by our software too.

58
00:05:38,240 --> 00:05:46,120
‫The next question is: what is multi-class classification, and is there any special activation function for

59
00:05:46,160 --> 00:05:49,850
‫multi-class classification?

60
00:05:49,850 --> 00:05:53,030
‫So first of all, what is multi-class classification?

61
00:05:53,030 --> 00:05:58,510
‫Suppose you are classifying into yes or no, or 1 or 0.

62
00:05:58,580 --> 00:06:02,960
‫This is binary classification because there are two classes only.

63
00:06:03,370 --> 00:06:05,810
‫But suppose we have more than two classes.

64
00:06:06,260 --> 00:06:12,560
‫Say we want to classify images into shirts, trousers, ties and socks.

65
00:06:12,710 --> 00:06:16,140
‫Now the output cannot be 0 or 1.

66
00:06:16,490 --> 00:06:25,010
‫We cannot simply assign 0 for shirts, 1 for trousers, 2 for ties and 3 for socks; that would not

67
00:06:25,010 --> 00:06:26,110
‫give us the right answer.

68
00:06:27,580 --> 00:06:34,110
‫So we have to handle the situation in a slightly different manner. For such a situation,

69
00:06:34,130 --> 00:06:36,950
‫we have an activation function called softmax.

70
00:06:40,650 --> 00:06:44,030
‫This activation function works similarly to sigmoid,

71
00:06:44,460 --> 00:06:48,150
‫but it has an additional step.

72
00:06:48,150 --> 00:06:57,190
‫So what we do in multi-class classification is we usually keep as many output neurons as we have classes.

73
00:06:57,210 --> 00:07:07,550
‫So if we have three classes, like shirts, trousers and socks, we keep three neurons at the output layer.

74
00:07:07,650 --> 00:07:17,630
‫You can see in this image that these three output neurons correspond to three output classes. These three

75
00:07:17,870 --> 00:07:22,110
‫output neurons have the sigmoid activation function only.

76
00:07:22,970 --> 00:07:31,940
‫So the output of the first neuron would also lie between 0 and 1, and we can say that the output value

77
00:07:31,940 --> 00:07:37,300
‫corresponds to the probability of whether it is a shirt or not.

78
00:07:37,400 --> 00:07:42,590
‫The second output would also be between 0 and 1, and we can say that it is the probability of whether

79
00:07:42,590 --> 00:07:44,640
‫it is a trouser or not.

80
00:07:44,750 --> 00:07:50,690
‫And the third output will also be between 0 and 1, and we can say that it is the probability of whether

81
00:07:50,690 --> 00:07:53,090
‫it is socks or not.

82
00:07:54,830 --> 00:08:02,990
‫But the thing is that the item can be only one of these, that is, the sum of probabilities should come out to be

83
00:08:02,990 --> 00:08:03,850
‫one.

84
00:08:04,430 --> 00:08:09,770
‫Either it is shirt or it is trouser or it is socks.

85
00:08:09,950 --> 00:08:19,010
‫To implement this, we put an additional layer of softmax, and we input the result of these three neurons

86
00:08:19,190 --> 00:08:22,040
‫into this softmax layer.

87
00:08:22,340 --> 00:08:30,670
‫This softmax layer just divides each of the values by the sum of all the values.

88
00:08:31,610 --> 00:08:37,790
‫Now this output can be considered as the probability of that class occurring, and the sum of all these

89
00:08:37,790 --> 00:08:40,070
‫probabilities will also be equal to one.

90
00:08:43,780 --> 00:08:52,460
‫So for multi-class classification, a softmax activation is often used on the output layer.

91
00:08:53,980 --> 00:08:57,400
‫That's all about the activation functions.

92
00:08:57,400 --> 00:09:01,620
‫The next topic is gradient descent.

93
00:09:01,750 --> 00:09:07,750
‫The question I want to discuss here is what is the difference between gradient descent and stochastic

94
00:09:07,750 --> 00:09:10,460
‫gradient descent.

95
00:09:10,480 --> 00:09:18,250
‫This is a very common question in the mind of students because they find stochastic written at some

96
00:09:18,250 --> 00:09:22,870
‫places and it is not written in some texts.

97
00:09:22,870 --> 00:09:24,840
‫Let me clarify this for you.

98
00:09:25,240 --> 00:09:32,620
‫What we discussed in our previous lectures was actually stochastic gradient descent because I told you

99
00:09:32,770 --> 00:09:41,080
‫that we take each individual training record and update our weights and biases with each training record.

100
00:09:43,080 --> 00:09:51,040
‫When we run the whole forward and backward propagation for each individual training record, that is stochastic

101
00:09:51,040 --> 00:09:51,880
‫gradient descent.

102
00:09:54,460 --> 00:10:02,500
‫But if you run forward propagation for the entire training set in one go and find out the average error

103
00:10:02,500 --> 00:10:11,300
‫on the entire set, and then update the weights and biases during backward propagation, that is gradient descent.

104
00:10:11,440 --> 00:10:17,740
‫There is another variation in which we make small batches out of the training set and use these batches

105
00:10:17,860 --> 00:10:19,790
‫instead of the complete data.

106
00:10:20,410 --> 00:10:25,830
‫This one is called mini-batch gradient descent.

107
00:10:25,880 --> 00:10:37,070
‫The point is, stochastic gradient descent starts updating weights and biases rapidly, but it finds difficulty

108
00:10:37,070 --> 00:10:46,760
‫in converging, whereas gradient descent is slow because in each pass it has to go through the entire

109
00:10:46,760 --> 00:10:47,380
‫training set.

110
00:10:49,100 --> 00:10:56,110
‫But the good thing about gradient descent is that it converges very well. So we have to accordingly select

111
00:10:56,680 --> 00:10:59,650
‫which optimization technique is to be used.

112
00:11:00,820 --> 00:11:03,820
‫We will see this further in the practical part of this course

113
00:11:06,370 --> 00:11:13,250
‫The last thing I want to discuss in this lecture is the epoch in neural networks.

114
00:11:13,290 --> 00:11:20,040
‫An epoch is one cycle through the full training data.

115
00:11:20,040 --> 00:11:29,180
‫So when we say epoch is equal to five, it means we want the full training data to be fed five times.

116
00:11:29,460 --> 00:11:34,850
‫Note that epoch is different from iterations.

117
00:11:34,860 --> 00:11:39,870
‫Suppose you have 1000 training records. Now,

118
00:11:39,960 --> 00:11:47,850
‫if you start reading the records one by one, like we do during stochastic gradient descent, you will have

119
00:11:47,850 --> 00:11:50,550
‫to iterate 1000 times.

120
00:11:53,370 --> 00:12:01,700
‫So iterations are the number of times you execute the process within one full training set.

121
00:12:02,220 --> 00:12:11,610
‫On the other hand, if you have set epoch to two, then this means that the 1000 training examples will

122
00:12:11,610 --> 00:12:19,960
‫be fed two times, either one by one, or all at the same time, or in mini-batches.

123
00:12:20,220 --> 00:12:28,210
‫The idea of using epoch is to allow the network to rejig its performance on the same data.

124
00:12:28,620 --> 00:12:34,380
‫We will see how we can specify and use epoch in our practical lectures.

125
00:12:34,410 --> 00:12:36,770
‫That's all in this lecture. In the upcoming video,

126
00:12:36,780 --> 00:12:43,680
‫we will summarize the key parameters that you must know while implementing neural networks in the software.

