1
00:00:11,080 --> 00:00:16,060
So in this lecture, we will be answering the question, how can I choose hyperparameters?

2
00:00:17,140 --> 00:00:21,240
Please note that while I've discussed this in the course already, sometimes people forget.

3
00:00:21,610 --> 00:00:25,330
So I'm making this lecture to explicitly remind you of this topic.

4
00:00:25,990 --> 00:00:29,290
Now, this is often a point of frustration amongst beginners.

5
00:00:29,860 --> 00:00:33,340
Beginners often feel like there must be some direct method of doing this.

6
00:00:33,820 --> 00:00:38,950
They think because this is science, that there must be an answer, that there must be some formula

7
00:00:38,950 --> 00:00:41,650
or some algorithm that will yield the correct result.

8
00:00:42,460 --> 00:00:44,440
Unfortunately, this is not the case.

9
00:00:44,830 --> 00:00:51,190
And as a responsible and mature scientist, you would do well to get comfortable with this reality.

10
00:00:51,190 --> 00:00:51,850
As a side note,

11
00:00:51,970 --> 00:00:55,300
some have said that machine learning is more like alchemy than science.

12
00:00:55,690 --> 00:01:00,760
I would tend to agree with this notion, and this is one of those areas that makes it apparent.

13
00:01:05,480 --> 00:01:10,640
So before we continue on, I want to mention that this applies to all the hyperparameters discussed

14
00:01:10,820 --> 00:01:11,810
in this course.

15
00:01:12,290 --> 00:01:17,360
For example, in the early parts of the course, our main concern is how do we choose the learning rate?

16
00:01:17,660 --> 00:01:20,900
What is the best optimizer? In the ANN section,

17
00:01:20,930 --> 00:01:24,200
our concerns may be how do we choose the number of hidden layers?

18
00:01:24,440 --> 00:01:26,270
How do we choose the number of hidden units?

19
00:01:26,540 --> 00:01:28,310
How do we choose the best activation?

20
00:01:28,730 --> 00:01:32,960
What dropout probability should I use? In the CNN section,

21
00:01:32,990 --> 00:01:36,050
our concerns may be what is the best filter size?

22
00:01:36,320 --> 00:01:38,360
What is the best number of feature maps?

23
00:01:38,840 --> 00:01:44,960
And these general themes repeat throughout as you learn about RNNs, NLP, recommenders, GANs, and

24
00:01:44,960 --> 00:01:45,650
so forth.

25
00:01:46,730 --> 00:01:51,410
So please keep in mind that this is not actually a separate question for each of these items.

26
00:01:51,800 --> 00:01:53,870
These are all actually the same question.

27
00:01:53,930 --> 00:01:58,160
Specifically, it's the question of how to choose hyperparameters.

28
00:02:02,730 --> 00:02:08,220
So the answer to this question, although many of you may find this disappointing, is my famous rule:

29
00:02:08,370 --> 00:02:11,940
Machine learning is experimentation, not philosophy.

30
00:02:12,720 --> 00:02:18,570
The problem is, many beginners start to look for deep philosophical reasons for choosing this number

31
00:02:18,570 --> 00:02:21,000
of hidden layers or that number of hidden units.

32
00:02:21,510 --> 00:02:27,420
In fact, the correct answer is to simply try random values and see what result you get.

33
00:02:28,020 --> 00:02:32,070
Obviously, the final choice should be the one that gives you the best results.

34
00:02:32,940 --> 00:02:36,060
The basic idea is that you must do experiments.

35
00:02:36,210 --> 00:02:37,890
There is simply no other option.
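
The experiment loop just described can be sketched in a few lines of Python. This is only a minimal sketch: `validation_score` is a hypothetical stand-in for training a model and measuring its validation performance, and the hyperparameter names and ranges are assumptions chosen for illustration, not values from the lecture.

```python
import random


def validation_score(num_hidden_units, dropout):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    # Toy surface that peaks near 128 units and dropout 0.3.
    # A real experiment would train and evaluate a model here.
    return -((num_hidden_units - 128) / 128) ** 2 - (dropout - 0.3) ** 2


def random_search(num_trials, seed=0):
    """Try random hyperparameter values and keep the best-scoring combination."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(num_trials):
        # Draw one random candidate (the ranges are illustrative assumptions).
        params = {
            "num_hidden_units": rng.choice([16, 32, 64, 128, 256, 512]),
            "dropout": rng.uniform(0.0, 0.8),
        }
        score = validation_score(**params)
        # The final choice is simply whichever candidate scored best.
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score


best_params, best_score = random_search(num_trials=50)
print(best_params, best_score)
```

In practice, the score would come from a held-out validation set rather than a formula, so each trial costs a full training run.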

36
00:02:42,500 --> 00:02:47,540
One obvious thing you can do if you want to be sure you're at least in the right range is to look up

37
00:02:47,540 --> 00:02:50,180
the papers in your field to see what they used.

38
00:02:50,570 --> 00:02:55,880
This is assuming that your data sets are similar, so that should give you a reasonable starting point

39
00:02:55,910 --> 00:02:57,230
for your random search.

40
00:02:57,770 --> 00:03:03,320
But do keep in mind that this adds a bias and could potentially prevent you from searching closer to

41
00:03:03,320 --> 00:03:04,370
the optimal point.

42
00:03:05,240 --> 00:03:09,440
This will also help you get a sense for the scales at which to try random values.

43
00:03:10,010 --> 00:03:14,930
For example, learning rates and many other hyperparameters are usually chosen on a log scale.

44
00:03:15,260 --> 00:03:20,720
For example, 0.1, 0.01, 0.001, and so forth.
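
One way to draw random values on a log scale is to sample the exponent uniformly and then raise 10 to that power. The sketch below assumes a search range of 0.0001 to 0.1 for the learning rate, and `sample_log_uniform` is a hypothetical helper name; neither comes from the lecture.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Sample a value between low and high, uniformly in log10 space."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


rng = random.Random(0)
# Each decade (1e-4 to 1e-3, 1e-3 to 1e-2, 1e-2 to 1e-1) is equally likely.
learning_rates = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(5)]
print(learning_rates)
```

Sampling uniformly between 0.0001 and 0.1 instead would put roughly 90% of the draws above 0.01, so the smaller scales would rarely be tried.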

45
00:03:21,920 --> 00:03:27,650
If you're interested in seeing more specifics, including actual code, to run a random search, I've

46
00:03:27,650 --> 00:03:33,170
already covered this in my in-depth series of deep learning courses, specifically Modern Deep Learning

47
00:03:33,170 --> 00:03:33,920
in Python.

48
00:03:34,880 --> 00:03:38,180
Do note, however, that most of it is just common sense.

49
00:03:38,600 --> 00:03:45,410
For example, if a value of 10 is bad and 100 is worse and 1,000 is even worse, you probably

50
00:03:45,410 --> 00:03:47,270
won't continue on in that direction.

51
00:03:48,110 --> 00:03:53,750
It may help if you made a plot of your model's performance versus the hyperparameter value, but again,

52
00:03:53,750 --> 00:03:55,820
this type of thing is just common sense.

53
00:04:00,420 --> 00:04:06,270
Now, many beginners still object to this. They say, "But Lazy Programmer, there must be some rules

54
00:04:06,270 --> 00:04:06,840
of thumb.

55
00:04:06,990 --> 00:04:09,780
Why can't you just give me some simple rules of thumb?"

56
00:04:10,440 --> 00:04:13,560
And the answer to that is that this is simply a bad approach.

57
00:04:14,010 --> 00:04:16,140
I'll illustrate this with several examples.

58
00:04:17,370 --> 00:04:22,650
As you recall, we've noted that Adam is often the default choice for optimization method.

59
00:04:23,250 --> 00:04:28,050
However, there was a paper that came out showing that plain SGD performed better.

60
00:04:28,830 --> 00:04:34,020
Thus, if you're the type of student that tries to follow only rules of thumb without actually doing

61
00:04:34,020 --> 00:04:37,830
the requisite work, you may have landed at a suboptimal result.

62
00:04:38,490 --> 00:04:41,010
This is what happens when you fail to experiment.

63
00:04:43,440 --> 00:04:48,690
Here's another very recent example from OpenAI, released in October 2021.

64
00:04:49,500 --> 00:04:55,020
In this paper called Grokking, the authors explore an interesting phenomenon regarding the ability of

65
00:04:55,020 --> 00:04:56,850
neural networks to generalize.

66
00:04:57,510 --> 00:05:02,820
Now the idea behind this paper is not that relevant for this discussion, but what is relevant is what

67
00:05:02,820 --> 00:05:03,870
the authors tried.

68
00:05:04,470 --> 00:05:09,480
Specifically, they tried different approaches, such as full versus stochastic gradient descent.

69
00:05:09,930 --> 00:05:11,370
They tried different learning rates.

70
00:05:11,670 --> 00:05:16,530
They tried dropout, and they even tried weight decay, which is a method that has somewhat fallen

71
00:05:16,530 --> 00:05:18,810
out of favor in the modern era of deep learning.

72
00:05:19,590 --> 00:05:25,980
So even researchers at one of the top AI companies in the world still must do these experiments, and

73
00:05:25,980 --> 00:05:28,080
even on simple things like weight decay.

74
00:05:28,950 --> 00:05:34,440
Furthermore, I lied a little bit when I said the idea behind this paper is not relevant for this discussion.

75
00:05:35,100 --> 00:05:42,120
In fact, what the paper demonstrates is that reality disagrees with some long-held notions about statistical

76
00:05:42,120 --> 00:05:42,570
learning.

77
00:05:43,290 --> 00:05:47,430
Specifically, they show that neural networks can in fact generalize well

78
00:05:47,700 --> 00:05:50,520
if you keep training beyond the point where they have overfit.

79
00:05:51,810 --> 00:05:56,640
So again, if you're the type of student that just wants to memorize some rules of thumb, you would

80
00:05:56,640 --> 00:06:00,120
again be proven wrong by those who did the actual experiments.

81
00:06:00,750 --> 00:06:06,210
Importantly, note that this paper came out several years after the release of this course.

82
00:06:07,110 --> 00:06:09,480
You should expect things to constantly change.

83
00:06:09,840 --> 00:06:14,220
You should expect any rules of thumb you've memorized to soon become irrelevant.

84
00:06:15,270 --> 00:06:17,580
Put more simply, memorization bad.

85
00:06:17,790 --> 00:06:19,260
Experimentation good.