1
00:00:01,130 --> 00:00:09,230
In this lecture, we will discuss the hyperparameters of a neural network architecture. In

2
00:00:09,230 --> 00:00:10,220
the lectures so far,

3
00:00:10,550 --> 00:00:18,350
you have seen that there are many hyperparameters in neural networks, and these hyperparameters

4
00:00:18,530 --> 00:00:22,880
‫give us the flexibility of creating several types of architectures.

5
00:00:25,100 --> 00:00:30,530
‫But the flexibility of neural networks is also one of their main drawbacks.

6
00:00:31,970 --> 00:00:35,420
We have to decide on so many hyperparameters in our model.

7
00:00:37,010 --> 00:00:44,960
Not only can we use any imaginable network architecture, but even in a simple multilayer perceptron,

8
00:00:45,500 --> 00:00:51,710
you can change the number of layers, the number of neurons per layer, the type of activation function

9
00:00:51,710 --> 00:00:56,540
to use in each layer, the weight initialization logic, and many more.

10
00:00:59,020 --> 00:01:05,800
Although a lot of exciting research is still going on in the field of hyperparameter tuning for neural

11
00:01:05,800 --> 00:01:06,340
networks,

12
00:01:07,030 --> 00:01:13,360
it still helps to have an idea of what values are reasonable for each hyperparameter,

13
00:01:13,930 --> 00:01:18,010
so you can build a quick prototype and restrict the search space.

14
00:01:20,540 --> 00:01:26,760
Here are a few guidelines for choosing the number of hidden layers and neurons in a multilayer perceptron.

15
00:01:29,100 --> 00:01:31,660
Let's first discuss the number of hidden layers.

16
00:01:32,580 --> 00:01:39,210
For most problems, you can just begin with a single hidden layer and you will get reasonable results.

17
00:01:40,890 --> 00:01:48,450
‫It has actually been shown that an MLP with just one hidden layer can model even the most complex functions,

18
00:01:49,020 --> 00:01:54,090
provided it has enough neurons. For a long time,

19
00:01:54,660 --> 00:02:00,650
these facts convinced researchers that there was no need to investigate any deeper neural networks.

20
00:02:01,740 --> 00:02:09,330
‫But it was later found that deep networks have a much higher parameter efficiency than shallow ones.

21
00:02:11,010 --> 00:02:18,870
They can model complex functions using exponentially fewer neurons than shallow nets, allowing them

22
00:02:18,870 --> 00:02:26,310
to reach much better performance with the same amount of training data. To understand why this happens,

23
00:02:26,700 --> 00:02:33,480
suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy-

24
00:02:33,480 --> 00:02:33,810
paste.

25
00:02:34,710 --> 00:02:42,540
You would have to draw each tree individually, branch per branch, leaf per leaf. If you could instead draw one

26
00:02:42,540 --> 00:02:50,490
leaf, copy-paste it to draw a branch, then copy-paste that branch to create a tree, and finally copy-

27
00:02:50,490 --> 00:02:52,560
paste this tree to make a forest,

28
00:02:53,750 --> 00:02:55,350
you would be finished in no time.

29
00:02:57,900 --> 00:03:01,230
Real-world data is often structured in such a hierarchical way,

30
00:03:01,710 --> 00:03:05,640
and deep neural networks automatically take advantage of this fact.

31
00:03:07,150 --> 00:03:12,840
Lower hidden layers model low-level structures; intermediate hidden layers

32
00:03:13,000 --> 00:03:19,680
combine these low-level structures to model intermediate-level structures; and the highest hidden layers

33
00:03:19,980 --> 00:03:21,030
and the output layer

34
00:03:21,210 --> 00:03:25,890
combine these intermediate structures to model high-level structures.

35
00:03:27,630 --> 00:03:35,430
Not only does this hierarchical structure help deep neural networks converge faster, it also improves

36
00:03:35,430 --> 00:03:38,150
their ability to generalize to new datasets.

37
00:03:38,240 --> 00:03:48,300
For example, if you've already trained a model to recognize faces in pictures and you now

38
00:03:48,300 --> 00:03:55,650
want to train a new neural network model to recognize hairstyles, then you can kick-start training

39
00:03:55,860 --> 00:04:04,170
by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of

40
00:04:04,170 --> 00:04:07,380
the first few layers of the new neural network,

41
00:04:08,220 --> 00:04:14,940
you can initialize them to the values of the weights and biases of the lower layers of the first network.

42
00:04:16,560 --> 00:04:23,670
‫This way, the network will not have to learn from scratch and it will only have to learn the higher

43
00:04:23,670 --> 00:04:24,630
‫level structures.

44
00:04:26,130 --> 00:04:28,200
‫This is called transfer learning.
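
The weight-reuse idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: the layer sizes, the helper names (init_layer, init_network, reuse_lower_layers), and the two toy networks are hypothetical and do not come from the lecture, and the "pretrained" network here is just randomly initialized to stand in for a trained one.

```python
import copy
import random

def init_layer(n_in, n_out, rng):
    """Create one dense layer with small random weights and zero biases."""
    return {
        "W": [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)],
        "b": [0.0] * n_out,
    }

def init_network(layer_sizes, seed):
    """Randomly initialize every layer of a fully connected network."""
    rng = random.Random(seed)
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

def reuse_lower_layers(pretrained, new_network, n_layers):
    """Copy the first n_layers of weights and biases into new_network."""
    for i in range(n_layers):
        src, dst = pretrained[i], new_network[i]
        if len(src["W"]) != len(dst["W"]) or len(src["b"]) != len(dst["b"]):
            raise ValueError(f"layer {i} shapes do not match")
        new_network[i] = copy.deepcopy(src)  # independent copy, not a shared reference
    return new_network

# Hypothetical sizes: 64 inputs, three hidden layers of 16, different output sizes.
face_net = init_network([64, 16, 16, 16, 2], seed=0)       # stands in for a trained network
hairstyle_net = init_network([64, 16, 16, 16, 5], seed=1)  # new task, new output layer
hairstyle_net = reuse_lower_layers(face_net, hairstyle_net, n_layers=2)
```

Only the two lowest layers are copied; the upper hidden layer and the output layer keep their random initialization, so they are the parts the new network still has to learn.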

45
00:04:29,910 --> 00:04:37,140
‫So in summary, for most problems, you can start with just one or two hidden layers and it will work

46
00:04:37,170 --> 00:04:41,310
just fine. For more complex problems,

47
00:04:41,610 --> 00:04:47,910
you can gradually ramp up the number of hidden layers until you start overfitting the training data.

48
00:04:51,840 --> 00:04:55,360
Next, we discuss the number of neurons per hidden layer.

49
00:04:56,630 --> 00:05:02,460
Obviously, the number of neurons in the input and output layers is determined by the type of input

50
00:05:02,670 --> 00:05:03,310
and output

51
00:05:03,330 --> 00:05:07,590
your task requires. For example, the Fashion

52
00:05:07,590 --> 00:05:13,050
MNIST dataset that we used required 28 by 28,

53
00:05:13,170 --> 00:05:18,600
that is, 784 input neurons and 10 output neurons.

54
00:05:20,100 --> 00:05:23,040
As for the hidden layers, earlier

55
00:05:23,970 --> 00:05:27,310
it was a common practice to size them to form a

56
00:05:27,510 --> 00:05:32,430
pyramid; that is, the first hidden layer had the largest number of neurons.

57
00:05:33,390 --> 00:05:41,100
So, for example, for the Fashion MNIST dataset with three hidden layers, you can have 300 neurons in

58
00:05:41,100 --> 00:05:49,290
the first hidden layer, 200 neurons in the second one, and 100 in the third, the rationale being

59
00:05:49,410 --> 00:05:55,770
that many low-level features can coalesce into far fewer higher-level features.

60
00:05:58,090 --> 00:06:05,140
However, this practice has been largely abandoned now, as it seems that simply using the same number

61
00:06:05,140 --> 00:06:12,530
of neurons in all hidden layers performs just as well in most cases, or maybe even better.

62
00:06:15,420 --> 00:06:22,120
Also, it has the advantage of having only one hyperparameter to tune instead of one per layer.

63
00:06:24,750 --> 00:06:30,900
So instead of having three hundred, two hundred and one hundred neurons in the three hidden layers, you

64
00:06:30,900 --> 00:06:34,200
‫can have 150 neurons in all three of them.
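
As a rough illustration of the two sizing schemes just described, the sketch below builds both layouts for a 784-input, 10-output network and counts their weights and biases. The helper names (uniform_mlp, count_parameters) are made up for this example, and the parameter counts say nothing about accuracy; they only show that the uniform layout is controlled by a single width value.

```python
def uniform_mlp(n_inputs, n_outputs, n_hidden_layers, width):
    """Layer sizes for an MLP whose hidden layers all share one width."""
    return [n_inputs] + [width] * n_hidden_layers + [n_outputs]

def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per neuron
    return total

pyramid = [784, 300, 200, 100, 10]   # one width to choose per hidden layer
uniform = uniform_mlp(784, 10, n_hidden_layers=3, width=150)  # a single width to tune

print(pyramid, count_parameters(pyramid))
print(uniform, count_parameters(uniform))
```

With the uniform layout, a hyperparameter search only has to vary width (and possibly n_hidden_layers) instead of three separate layer sizes.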

65
00:06:36,730 --> 00:06:43,330
‫If you think that the problem at hand is really complex, then you can try increasing the number of

66
00:06:43,330 --> 00:06:52,270
neurons gradually until the network starts overfitting. In general, increasing the depth of the network

67
00:06:52,750 --> 00:06:58,000
has a better effect on accuracy than increasing the number of neurons

68
00:06:58,020 --> 00:06:58,470
per layer.

69
00:06:59,720 --> 00:07:06,040
‫Another approach could be to pick a model with a large number of layers and large number of neurons

70
00:07:06,130 --> 00:07:11,860
per hidden layer and then use early stopping to prevent that model from overfitting.

71
00:07:14,690 --> 00:07:15,740
The next hyperparameter

72
00:07:15,770 --> 00:07:18,470
we are going to discuss is the learning rate. The learning

73
00:07:18,590 --> 00:07:21,530
rate is arguably the most important hyperparameter.

74
00:07:23,000 --> 00:07:27,830
In general, the optimal learning rate is about half of the maximum learning rate,

75
00:07:28,220 --> 00:07:32,150
that is, the learning rate above which the training algorithm diverges.

76
00:07:34,810 --> 00:07:39,670
So a simple approach for tuning the learning rate is to start with a large value

77
00:07:40,420 --> 00:07:41,950
that makes the algorithm diverge,

78
00:07:43,690 --> 00:07:45,960
then divide this value by three

79
00:07:46,150 --> 00:07:52,010
and try again, repeating this until the training algorithm stops diverging.

80
00:07:53,650 --> 00:07:58,930
‫At that point, you generally won't be too far from the optimal learning rate.
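
The divide-by-three procedure above can be sketched on a toy problem. Everything specific here is an assumption for illustration: the quadratic loss, its CURVATURE constant, the starting value of 10, and the helper names (diverges, find_learning_rate) are hypothetical. For this toy loss, plain gradient descent diverges above a learning rate of 2 / CURVATURE = 0.2, so the search can be checked by hand.

```python
import math

CURVATURE = 10.0  # hypothetical toy loss: loss(w) = 0.5 * CURVATURE * w**2

def loss(w):
    return 0.5 * CURVATURE * w * w

def diverges(learning_rate, steps=50, w0=1.0):
    """Run plain gradient descent and report whether the loss blew up."""
    w = w0
    start = loss(w0)
    for _ in range(steps):
        w -= learning_rate * CURVATURE * w  # gradient of the toy loss is CURVATURE * w
        current = loss(w)
        if not math.isfinite(current) or current > 1e6 * start:
            return True  # stop early once the loss has clearly exploded
    return loss(w) > start

def find_learning_rate(initial=10.0, factor=3.0, max_tries=20):
    """Start from a diverging value and divide by `factor` until training is stable."""
    lr = initial
    for _ in range(max_tries):
        if not diverges(lr):
            return lr
        lr /= factor
    raise RuntimeError("no non-diverging learning rate found")

lr = find_learning_rate()
print(f"first stable learning rate: {lr:.4f}")
```

On a real network, the divergence check would be a short training run that watches the training loss rather than a closed-form toy loss.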

81
00:07:59,740 --> 00:08:02,460
Then there is batch size. For batch size,

82
00:08:02,510 --> 00:08:12,010
the general rule of thumb is to keep the batch size lower than 32, because a small batch

83
00:08:12,010 --> 00:08:19,330
size ensures that each training iteration is very fast. On the lower end,

84
00:08:19,990 --> 00:08:28,300
try to keep the batch size above 20, because this helps take advantage of hardware and software optimizations,

85
00:08:28,990 --> 00:08:31,390
in particular for matrix multiplications,

86
00:08:31,810 --> 00:08:34,180
so this will also help in speeding up training.

87
00:08:35,770 --> 00:08:38,940
So a good range is between 20 and 32.

88
00:08:40,810 --> 00:08:44,170
Lastly, there is another hyperparameter called epochs,

89
00:08:44,530 --> 00:08:50,290
that is, the number of passes over the training data that we will do. Instead of tuning it,

90
00:08:50,590 --> 00:08:59,530
we would suggest that you use a large number for epochs and use the early stopping technique to prevent

91
00:08:59,590 --> 00:09:00,060
‫overfitting.
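
A minimal sketch of the early stopping logic mentioned above, assuming a patience-based rule: training stops once the validation loss has not improved for a set number of epochs. The function name, the patience value, and the simulated validation losses are all hypothetical and only illustrate the control flow, not any particular library's implementation.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss   # new best model; a real loop would save weights here
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1

# Simulated validation losses: they improve at first, then rise as the model overfits.
simulated = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.51, 0.53, 0.56, 0.60]
best, stopped = train_with_early_stopping(simulated, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

Because the loop stops on its own, the number of epochs can be set generously and the weights from the best epoch are the ones to keep.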

92
00:09:01,980 --> 00:09:04,360
So that's all about selecting hyperparameters.