1
00:00:01,120 --> 00:00:09,310
In this lecture we will discuss the hyperparameters of our neural network architecture. In the

2
00:00:09,310 --> 00:00:17,230
lectures so far you have seen that there are many hyperparameters in neural networks, and these

3
00:00:17,320 --> 00:00:26,020
hyperparameters give us the flexibility to create several types of architectures. But the flexibility

4
00:00:26,020 --> 00:00:31,960
‫of neural networks is also one of their main drawbacks.

5
00:00:31,960 --> 00:00:36,900
We have to decide on so many hyperparameters in our model.

6
00:00:37,000 --> 00:00:45,670
Not only can we use any imaginable network architecture, but even in a simple multilayer perceptron you

7
00:00:45,670 --> 00:00:47,380
can change the number of layers,

8
00:00:47,530 --> 00:00:55,000
the number of neurons per layer, the type of activation function to use in each layer, the weight initialization

9
00:00:55,000 --> 00:01:04,780
logic, and many more. Although a lot of exciting research is still going on in the field of hyperparameter

10
00:01:04,780 --> 00:01:06,830
tuning for neural networks,

11
00:01:07,030 --> 00:01:14,440
it still helps to have an idea of what values are reasonable for each hyperparameter, so you can

12
00:01:14,440 --> 00:01:20,900
‫build a quick prototype and restrict the search space.

13
00:01:21,030 --> 00:01:29,050
Here are a few guidelines for choosing the number of hidden layers and neurons in a multilayer perceptron.

14
00:01:29,100 --> 00:01:34,050
Let's first discuss the number of hidden layers. For most problems,

15
00:01:34,140 --> 00:01:40,820
you can just begin with a single hidden layer and you will get reasonable results.

16
00:01:40,890 --> 00:01:48,450
It has actually been shown that an MLP with just one hidden layer can model even the most complex functions,

17
00:01:49,020 --> 00:01:54,650
provided it has enough neurons. For a long time,

18
00:01:54,660 --> 00:02:02,130
this fact convinced researchers that there was no need to investigate deeper neural networks. But it

19
00:02:02,130 --> 00:02:10,490
‫was later found that deep networks have a much higher parameter efficiency than shallow ones.

20
00:02:11,000 --> 00:02:19,020
They can model complex functions using exponentially fewer neurons than shallow nets, allowing them to

21
00:02:19,020 --> 00:02:26,690
reach much better performance with the same amount of training data. To understand why this happens,

22
00:02:26,700 --> 00:02:33,480
suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy and

23
00:02:33,480 --> 00:02:34,700
paste.

24
00:02:34,710 --> 00:02:39,840
You have to draw each tree individually, branch by branch, leaf by leaf.

25
00:02:40,860 --> 00:02:48,300
If you could instead draw one leaf, copy and paste it to draw a branch, then copy and paste that branch to create

26
00:02:48,300 --> 00:02:53,650
a tree, and finally copy and paste this tree to make a forest,

27
00:02:53,760 --> 00:02:56,940
you would be finished in no time.

28
00:02:57,900 --> 00:03:03,750
Real-world data is often structured in such a hierarchical way, and deep neural networks automatically

29
00:03:03,750 --> 00:03:05,670
‫take advantage of this fact.

30
00:03:07,320 --> 00:03:12,950
Lower hidden layers model low-level structures. Intermediate hidden layers

31
00:03:13,000 --> 00:03:19,680
combine these low-level structures to model intermediate-level structures, and the highest hidden layers

32
00:03:19,980 --> 00:03:21,140
and the output layer

33
00:03:21,150 --> 00:03:27,590
combine these intermediate structures to model high-level structures.

34
00:03:27,620 --> 00:03:35,420
Not only does this hierarchical architecture help deep neural networks converge faster, it also improves

35
00:03:35,420 --> 00:03:39,470
their ability to generalize to new data.

36
00:03:40,280 --> 00:03:48,710
For example, if you have already trained a model to recognize faces in pictures and you now want to

37
00:03:48,710 --> 00:03:56,720
train a new neural network to recognize hairstyles, then you can kick-start training by reusing

38
00:03:56,720 --> 00:04:04,310
the lower layers of the first network. Instead of randomly initializing the weights and biases of the

39
00:04:04,310 --> 00:04:08,180
first few layers of the new neural network,

40
00:04:08,210 --> 00:04:16,510
you can initialize them to the values of the weights and biases of the lower layers of the first network.

41
00:04:16,550 --> 00:04:23,990
This way the network will not have to learn from scratch; it will only have to learn the higher-level

42
00:04:23,990 --> 00:04:26,150
‫structures.

43
00:04:26,150 --> 00:04:28,350
‫This is called transfer learning.
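The weight-reuse step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy network representation (a list of dicts), the layer sizes, and the names `build_mlp`, `reuse_lower_layers`, `face_net`, and `hair_net` are all hypothetical and not from the lecture, and no training is performed.

```python
import copy
import random

def build_mlp(layer_sizes, seed):
    """Build a toy MLP as a list of layers with random weights and zero biases."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
        layers.append({"weights": weights, "biases": [0.0] * n_out})
    return layers

def reuse_lower_layers(source_net, target_net, n_layers):
    """Copy the first n_layers of source_net into target_net (shapes must match)."""
    for i in range(n_layers):
        target_net[i] = copy.deepcopy(source_net[i])
    return target_net

# Hypothetical networks: identical lower layers, different output layers.
face_net = build_mlp([64, 32, 16, 2], seed=0)  # stands in for the already trained face model
hair_net = build_mlp([64, 32, 16, 5], seed=1)  # new model, randomly initialized

# Initialize the two lowest layers of the new network from the first network.
reuse_lower_layers(face_net, hair_net, n_layers=2)
```

After this call only the top layer of `hair_net` still has random weights, so training the new model mostly has to learn the higher-level structures.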

44
00:04:29,900 --> 00:04:37,490
So in summary, for most problems you can start with just one or two hidden layers and it will work just

45
00:04:37,490 --> 00:04:41,540
fine. For more complex problems,

46
00:04:41,600 --> 00:04:47,920
you can gradually ramp up the number of hidden layers until you start overfitting the training data.

47
00:04:51,840 --> 00:04:59,040
Next we discuss the number of neurons per hidden layer. Obviously, the number of neurons in the input

48
00:04:59,160 --> 00:05:03,330
and output layers is determined by the type of input and output

49
00:05:03,330 --> 00:05:13,170
your task requires. For example, the Fashion MNIST dataset that we used required 28 by 28,

50
00:05:13,170 --> 00:05:20,010
that is 784, input neurons and ten output neurons.

51
00:05:20,100 --> 00:05:28,860
As for the hidden layers, it used to be common practice to size them to form a pyramid.

52
00:05:29,250 --> 00:05:33,400
That is, the first hidden layer had the largest number of neurons.

53
00:05:33,540 --> 00:05:38,700
For example, for the Fashion MNIST dataset with three hidden layers,

54
00:05:38,730 --> 00:05:45,960
you could have 300 neurons in the first hidden layer, 200 neurons in the second one, and 100

55
00:05:45,960 --> 00:05:55,200
in the third, the rationale being that many low-level features can coalesce into far fewer higher-level

56
00:05:55,200 --> 00:06:04,840
features. However, this practice has been largely abandoned now, as it seems that simply using the same

57
00:06:04,840 --> 00:06:09,630
number of neurons in all hidden layers performs just as well

58
00:06:09,670 --> 00:06:12,520
in most cases, or maybe even better.
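To make the two sizing schemes concrete, here is a small sketch that counts the trainable parameters (weights plus biases) of a fully connected network. It assumes Fashion MNIST's 28 x 28 inputs and 10 classes; the uniform width of 150 and the helper name `count_parameters` are illustrative choices. It says nothing about accuracy, only about model size.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_inputs = 28 * 28   # 784 pixels per Fashion MNIST image
n_outputs = 10       # ten clothing classes

pyramid = [n_inputs, 300, 200, 100, n_outputs]  # shrinking hidden layers
uniform = [n_inputs, 150, 150, 150, n_outputs]  # same width everywhere (illustrative)

pyramid_params = count_parameters(pyramid)
uniform_params = count_parameters(uniform)
print(pyramid_params, uniform_params)  # 316810 164560
```

With the uniform layout there is a single width to tune, whereas the pyramid needs one value per hidden layer.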

59
00:06:15,420 --> 00:06:21,530
Also, it has the advantage of having only one hyperparameter to tune instead of one

60
00:06:21,540 --> 00:06:30,900
per layer. So instead of having 300, 200, and 100 neurons in the three hidden layers, you

61
00:06:30,900 --> 00:06:40,960
can have 150 neurons in all three of them. If you think that the problem at hand is really complex, then

62
00:06:41,290 --> 00:06:49,740
you can try increasing the number of neurons gradually until the network starts overfitting. In general,

63
00:06:50,250 --> 00:06:57,480
increasing the depth of the network improves accuracy more than increasing the number of

64
00:06:57,480 --> 00:06:58,200
neurons

65
00:06:58,220 --> 00:07:06,030
per layer. Another approach could be to pick a model with a large number of layers and a large number of neurons

66
00:07:06,100 --> 00:07:11,850
per hidden layer, and then use early stopping to prevent that model from overfitting.

67
00:07:14,690 --> 00:07:15,770
The next hyperparameter

68
00:07:15,770 --> 00:07:22,290
we are going to discuss is the learning rate. The learning rate is arguably the most important hyperparameter.

69
00:07:22,970 --> 00:07:28,980
In general, the optimal learning rate is about half of the maximum learning rate, that is, the learning

70
00:07:28,980 --> 00:07:32,150
rate above which the training algorithm diverges.

71
00:07:34,810 --> 00:07:41,560
So a simple approach for tuning the learning rate is to start with a large value that makes the algorithm

72
00:07:41,580 --> 00:07:50,770
diverge, then divide this value by three and try again, and repeat this until the training algorithm

73
00:07:50,890 --> 00:07:55,210
stops diverging. At that point,

74
00:07:55,210 --> 00:07:59,620
you generally won't be too far from the optimal learning rate.
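The divide-by-three search can be sketched on a toy problem. The example below assumes plain gradient descent on the loss f(w) = w**2 in place of a real network; the starting value of 10.0, the 50 steps, the divergence test, and the function names are all assumptions made for illustration.

```python
import math

def final_loss(learning_rate, n_steps=50, w0=1.0):
    """Run gradient descent on the toy loss f(w) = w**2 and return the final loss."""
    w = w0
    for _ in range(n_steps):
        w -= learning_rate * 2.0 * w  # the gradient of w**2 is 2*w
    return w * w

def diverges(learning_rate, w0=1.0):
    """Treat a run as diverged if the loss is not finite or ended above where it started."""
    loss = final_loss(learning_rate, w0=w0)
    return not math.isfinite(loss) or loss > w0 * w0

def find_learning_rate(start=10.0, factor=3.0, max_tries=20):
    """Start from a diverging learning rate and divide by `factor` until training stops diverging."""
    lr = start
    for _ in range(max_tries):
        if not diverges(lr):
            return lr
        lr /= factor
    raise RuntimeError("no non-diverging learning rate found")

lr = find_learning_rate()
print(lr)  # about 0.37, reached after dividing 10.0 by three, three times
```

On this toy loss the largest learning rate that does not diverge is 1.0 and the best one is 0.5, so the value found by the search lands reasonably close to the optimum.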

75
00:07:59,710 --> 00:08:02,500
Then there is batch size. For batch size,

76
00:08:02,510 --> 00:08:12,460
the general rule of thumb is to try to keep the batch size lower than 32, because a small batch size

77
00:08:12,670 --> 00:08:19,690
ensures that each training iteration is very fast. And on the lower end,

78
00:08:19,990 --> 00:08:28,300
try to keep the batch size above 20, because this helps take advantage of hardware and software optimizations,

79
00:08:28,990 --> 00:08:31,720
in particular for matrix multiplications.

80
00:08:31,810 --> 00:08:40,760
So this will also help in speeding up training. A good range is between 20 and 32.
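As a rough illustration of what the batch size controls, the sketch below splits a dataset into mini-batches and counts the training iterations needed for one pass over the data. The training-set size of 60,000 is an assumption (the size of the standard Fashion MNIST training split), and the helper names are hypothetical.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

def iter_batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

n_train = 60_000  # assumed training-set size
for batch_size in (20, 32):
    print(batch_size, iterations_per_epoch(n_train, batch_size))
# 20 3000
# 32 1875
```

A smaller batch makes each iteration cheaper but requires more iterations per pass, which is the trade-off behind the suggested range.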

81
00:08:40,780 --> 00:08:47,110
Lastly, there is another hyperparameter called epochs, that is, the number of passes over the training data that we

82
00:08:47,110 --> 00:08:50,560
will do. Instead of tuning it,

83
00:08:50,560 --> 00:08:59,530
we would suggest that you use a large number of epochs and use the early stopping technique to prevent

84
00:08:59,590 --> 00:09:04,360
overfitting. That's all about selecting hyperparameters.
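Early stopping can be sketched as a loop that watches the validation loss and stops once it has not improved for a set number of epochs. This is a minimal illustration, not the lecture's implementation: the `patience` value, the function name, and the simulated loss values are assumptions, and a real training loop would compute the validation loss after each epoch instead of reading it from a list.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch) given per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop before the epoch budget runs out
    return best_epoch, len(val_losses) - 1

# Simulated validation losses: improving at first, then drifting up (overfitting).
simulated = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60, 0.65]
best_epoch, stopped_epoch = train_with_early_stopping(simulated, patience=3)
print(best_epoch, stopped_epoch)  # 3 6
```

With a large epoch budget and a rule like this, training ends on its own, and the weights from `best_epoch` are the ones you would keep.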