1
00:00:11,690 --> 00:00:16,610
In this lecture we are going to talk a little bit more in-depth about training sets versus validation

2
00:00:16,610 --> 00:00:19,280
sets versus test sets.

3
00:00:19,290 --> 00:00:23,790
First let's think about why we would want to split our data up in the first place.

4
00:00:23,840 --> 00:00:29,210
As usual I'm going to return to my dumb as possible picture which is that machine learning is nothing

5
00:00:29,240 --> 00:00:31,140
but a geometry problem.

6
00:00:31,160 --> 00:00:35,100
Imagine that all the data I collected in the past looks like this.

7
00:00:35,150 --> 00:00:41,630
So I find a line that can discriminate between the two classes of data and I can conclude that my classifier

8
00:00:41,630 --> 00:00:43,280
has pretty good accuracy.

9
00:00:43,310 --> 00:00:43,700
Great.

10
00:00:44,420 --> 00:00:47,950
But remember, what's the point of building a classifier?

11
00:00:47,960 --> 00:00:55,090
It's to use it on unseen data that I want to classify in the future. So what happens when tomorrow's

12
00:00:55,090 --> 00:00:57,190
data looks like this?

13
00:00:57,190 --> 00:01:00,190
Now all of a sudden the classifier is no good.

14
00:01:00,190 --> 00:01:07,070
In fact, it's terrible. Well, that sucks, because this is actually the data I really care about classifying.

15
00:01:07,250 --> 00:01:09,450
I don't care about classifying past data.

16
00:01:09,490 --> 00:01:10,500
Why?

17
00:01:10,760 --> 00:01:12,600
Because I already know what the answer is.

18
00:01:12,620 --> 00:01:16,640
So there's no need to classify those data points anymore.

19
00:01:16,670 --> 00:01:19,750
Now you might think this example is a little extreme.

20
00:01:19,760 --> 00:01:25,220
Clearly the data we have seen so far in this course did not behave this way. When we get good training

21
00:01:25,220 --> 00:01:25,820
accuracy,

22
00:01:25,820 --> 00:01:32,060
we tend to get good test accuracy. But in fact, we'll see some data that behaves exactly like this later

23
00:01:32,060 --> 00:01:32,780
in the course.

24
00:01:37,960 --> 00:01:42,520
Here's another image that should help you understand why we need train and test sets.

25
00:01:42,520 --> 00:01:49,060
The problem is overfitting. If you have a complex enough model, you will get very good accuracy on almost

26
00:01:49,060 --> 00:01:55,030
any data set. But you can see here that although we fit very well to the training data, we have a poor

27
00:01:55,030 --> 00:02:00,340
fit on the test data, which in this case is the true function that we are trying to model.

28
00:02:00,340 --> 00:02:07,390
This is called overfitting, and it's the opposite of what we want, which is good generalization. There

29
00:02:07,400 --> 00:02:12,590
is a tradeoff that must happen here. Should you make your model more complex so that your training

30
00:02:12,590 --> 00:02:18,620
accuracy is improved, or should you make your model less complex so that there's a better fit to the

31
00:02:18,620 --> 00:02:20,140
test data?

32
00:02:20,150 --> 00:02:26,240
The problem is if your model is not complex enough then it will perform poorly on both the train and

33
00:02:26,240 --> 00:02:27,370
test data.

34
00:02:27,590 --> 00:02:33,560
So your job is to find the optimal amount of complexity such that it performs well on the test data.

35
00:02:33,560 --> 00:02:39,160
It can't be too complex but it also must be complex enough.
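
To make the complexity tradeoff concrete, here is a minimal sketch in Python. It is not the model from the picture: it assumes a k-nearest-neighbor regressor, a made-up true function sin(3x), and made-up noise, all chosen only for illustration. With k-nearest-neighbors, a smaller k means a more complex model. As in the picture, the test points come from the noiseless true function.

```python
import math
import random

def true_fn(x):
    # Hypothetical "true function" that we are trying to model.
    return math.sin(3 * x)

def knn_predict(train, x, k):
    # Average the targets of the k training points closest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, points, k):
    # Mean squared error of the k-NN model on the given points.
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in points]
    return sum(errors) / len(errors)

rng = random.Random(0)
train = []
for _ in range(30):
    x = rng.uniform(-1, 1)
    train.append((x, true_fn(x) + rng.gauss(0, 0.3)))  # noisy training targets

# Test points come from the noiseless true function on an even grid.
test = [(i / 50 - 1, true_fn(i / 50 - 1)) for i in range(101)]

results = {}
for k in (1, 5, 30):  # k=1: most complex, k=30: least complex (predicts the mean)
    results[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k=1 the model memorizes the training data, so its training error is zero, but that says nothing about the test error. With k=30 the model is too simple and does poorly on both. The job is to find the k in between that does best on data the model has not seen.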

36
00:02:39,190 --> 00:02:42,150
You may also know this as the bias-variance tradeoff.

37
00:02:42,310 --> 00:02:47,410
Although this concept is outside the scope of this course you're encouraged to check out the in-depth

38
00:02:47,410 --> 00:02:49,680
series if you're interested in learning more.

39
00:02:50,500 --> 00:02:56,200
So I hope that satisfies your intuition about why we need to have separate data for both training and

40
00:02:56,200 --> 00:03:01,810
testing.

41
00:03:01,860 --> 00:03:06,960
Now you may have heard that sometimes people split their data into three separate data sets: the train

42
00:03:06,960 --> 00:03:09,940
set, the validation set, and the test set.

43
00:03:09,960 --> 00:03:11,500
What's the difference?

44
00:03:11,520 --> 00:03:18,210
Why is it that sometimes we just have train and test, and sometimes we have train, validation, and test?

45
00:03:18,210 --> 00:03:24,570
It's really just semantics. When you have a train and test set only, your test set is more like a validation

46
00:03:24,570 --> 00:03:25,640
set.

47
00:03:25,740 --> 00:03:29,630
In fact the words test and validation are pretty much synonymous.

48
00:03:29,670 --> 00:03:32,340
They mean the same thing linguistically.

49
00:03:32,340 --> 00:03:37,380
What you are doing when you split your data into train and test is you are actually using your test

50
00:03:37,380 --> 00:03:41,500
set to validate that your model works on unseen data.

51
00:03:41,640 --> 00:03:46,860
So in fact sometimes when people split their data into two parts they'll call them train and validation

52
00:03:46,860 --> 00:03:49,680
sets rather than train and test sets.

53
00:03:49,710 --> 00:03:53,230
In this context the meaning of both is exactly the same.

54
00:03:53,250 --> 00:03:54,420
I hope you're not confused.

55
00:03:59,640 --> 00:04:05,340
So in what case will we want to distinguish between train, validation, and test? For this,

56
00:04:05,340 --> 00:04:11,340
I like to think of Kaggle contests, although it's well known that Kaggle contests are not a good representation

57
00:04:11,670 --> 00:04:14,480
of how you will do machine learning in the real world.

58
00:04:14,490 --> 00:04:18,550
In fact there have been multiple instances of cheating on the platform.

59
00:04:18,630 --> 00:04:24,900
In any case, the way that Kaggle works is this: you will have some data that Kaggle gives you to train

60
00:04:24,900 --> 00:04:26,160
your model on.

61
00:04:26,160 --> 00:04:29,280
These are typically labeled as train and test.

62
00:04:29,280 --> 00:04:34,710
Importantly, the test set does not come with any labels, so you can't compute your model's performance

63
00:04:34,800 --> 00:04:36,960
on the test set by yourself.

64
00:04:36,960 --> 00:04:43,100
This is unlike when we split the data into train and test in our class. All you can do with your test

65
00:04:43,100 --> 00:04:47,060
set is make predictions and submit them to Kaggle's servers.

66
00:04:47,390 --> 00:04:52,760
Then Kaggle will show you your test score along with the test scores from other participants.

67
00:04:52,970 --> 00:04:56,380
They will list these on what is called a leaderboard.

68
00:04:56,580 --> 00:05:00,210
This is the true test set because it's how Kaggle is testing your model.

69
00:05:05,430 --> 00:05:09,810
But if that's how Kaggle tests your model then how will you test your model.

70
00:05:09,810 --> 00:05:12,990
This is where the validation set comes into play.

71
00:05:13,020 --> 00:05:13,950
Remember that Kaggle,

72
00:05:13,950 --> 00:05:18,790
generally speaking, although this is not always the case, will give you just that train and test set.

73
00:05:19,470 --> 00:05:23,640
You want to perform well on the test set because if you win, then you'll make some money.

74
00:05:25,050 --> 00:05:31,290
So you have to make sure that your model performs well on unseen data, namely this test set for which

75
00:05:31,290 --> 00:05:34,880
Kaggle withholds the labels. In order to do that,

76
00:05:34,920 --> 00:05:40,710
you have to pretend to evaluate your model on unseen data by splitting the train set that Kaggle gives

77
00:05:40,710 --> 00:05:44,480
you into train and validation sets. For this data,

78
00:05:44,490 --> 00:05:49,140
you do have all the labels, and so you can calculate both the train and validation accuracy.
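
As a rough sketch of that split in Python: the function name `train_validation_split`, the 20% validation fraction, and the toy labeled rows below are all assumptions made for illustration, not something Kaggle or this course prescribes.

```python
import random

def train_validation_split(rows, val_fraction=0.2, seed=0):
    # Shuffle a copy so the caller's data is left untouched, then cut it in two.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Toy stand-in for the labeled train set: (feature, label) pairs.
labeled = [(i, i % 2) for i in range(100)]

train_set, val_set = train_validation_split(labeled, val_fraction=0.2)
print(len(train_set), len(val_set))
```

Because every row in both pieces still has its label, you can score the model on the validation piece yourself, which is exactly what you cannot do with the unlabeled test set.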

79
00:05:54,360 --> 00:05:55,070
Sometimes,

80
00:05:55,080 --> 00:06:00,480
although this isn't usually done in deep learning, you will split your data randomly into train and validation

81
00:06:00,480 --> 00:06:02,480
sets multiple times.

82
00:06:02,520 --> 00:06:04,670
We call this cross-validation.

83
00:06:04,740 --> 00:06:10,990
Here's the basic idea. Let's suppose you split your training data into five parts.

84
00:06:10,990 --> 00:06:17,200
Then you'll do a loop which runs five times. On each iteration of the loop, you'll consider four of the

85
00:06:17,200 --> 00:06:23,080
parts to be your train set and the remaining part to be your validation set. On each iteration,

86
00:06:23,080 --> 00:06:27,730
you will train a new model on the current train set and evaluate the model on the current validation

87
00:06:27,730 --> 00:06:28,730
set.

88
00:06:28,930 --> 00:06:31,840
When you're done you'll have five different validation scores.

89
00:06:31,900 --> 00:06:37,930
And from this you can calculate the mean, variance, and so forth to get some statistical measure of how

90
00:06:37,930 --> 00:06:41,190
well your model might perform on unseen data.

91
00:06:41,200 --> 00:06:46,270
The reason this isn't used for deep learning is that the data sets are usually so large and training

92
00:06:46,270 --> 00:06:51,940
takes so long that the variance is pretty small and just one validation set should give you a pretty

93
00:06:51,940 --> 00:06:59,000
good idea of your model's performance.
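
The five-part loop described above can be sketched in Python as follows. The data and the model are assumptions made up for illustration: two classes of one-dimensional points, and a nearest-centroid classifier standing in for whatever model you are actually training. The helper names are hypothetical too.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    # Shuffle the row indices once, then deal them into k disjoint parts.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(points):
    # "Training": remember the mean feature value of each class.
    means = {}
    for label in (0, 1):
        xs = [x for x, y in points if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def accuracy(means, points):
    # Predict the class whose centroid is closest, then count the hits.
    hits = sum(1 for x, y in points
               if min(means, key=lambda c: abs(x - means[c])) == y)
    return hits / len(points)

rng = random.Random(1)
data = [(rng.gauss(0, 1), 0) for _ in range(50)]
data += [(rng.gauss(3, 1), 1) for _ in range(50)]

folds = k_fold_indices(len(data), k=5)
scores = []
for i, val_idx in enumerate(folds):
    # Four parts form the train set, the remaining part is the validation set.
    train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
    model = fit_centroids([data[j] for j in train_idx])  # a new model each time
    scores.append(accuracy(model, [data[j] for j in val_idx]))

mean_score = statistics.mean(scores)
var_score = statistics.variance(scores)
print(scores, mean_score, var_score)
```

Each row ends up in the validation set exactly once, so the five scores together cover all of the training data.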

94
00:06:59,010 --> 00:07:01,270
There is a small caveat to the leaderboard.

95
00:07:01,380 --> 00:07:03,210
In fact there are two leaderboards.

96
00:07:03,210 --> 00:07:08,580
The public leaderboard and the private leaderboard. The public leaderboard is where your score on the

97
00:07:08,580 --> 00:07:09,980
test set is shown.

98
00:07:10,380 --> 00:07:15,870
The private leaderboard uses yet another dataset and this is the one on which the winner of the contest

99
00:07:15,900 --> 00:07:17,400
is selected.

100
00:07:17,400 --> 00:07:23,220
This is, of course, to prevent overfitting to the public leaderboard test set. The public leaderboard

101
00:07:23,250 --> 00:07:28,080
allows you to submit a solution multiple times, so you have some chance of overfitting to it.

102
00:07:29,260 --> 00:07:34,620
On the other hand you only know your private leaderboard placement once at the end of the contest.

103
00:07:34,750 --> 00:07:38,740
So in fact, to confuse you even more, now there is not just the train set,

104
00:07:38,740 --> 00:07:40,450
the validation set, and the test set,

105
00:07:40,780 --> 00:07:42,520
but there is yet another test set.

106
00:07:47,680 --> 00:07:51,130
Of course, Kaggle is not the real world. In the real world,

107
00:07:51,130 --> 00:07:54,810
your test set is the set of data for which you really want the answer

108
00:07:54,910 --> 00:07:58,250
but can't produce it using non-machine-learning techniques.

109
00:07:58,540 --> 00:08:03,610
If you could just write a basic computer program or implement a simple set of rules then you don't need

110
00:08:03,610 --> 00:08:04,950
machine learning.

111
00:08:04,990 --> 00:08:09,820
So when we are using machine learning it's because we really do not know the answer or we can't compute

112
00:08:09,820 --> 00:08:11,220
it practically.

113
00:08:11,410 --> 00:08:17,810
Some basic examples of this are fraud detection, spam filtering, or object detection. We really want these

114
00:08:17,810 --> 00:08:20,090
to work on data that we haven't seen before.

115
00:08:20,210 --> 00:08:27,580
But it's clear that we can't do that work manually.

116
00:08:27,580 --> 00:08:33,310
The next question I want to answer is: what is the validation set even for? You might be thinking,

117
00:08:33,310 --> 00:08:39,400
sure, I can evaluate the model to get some idea of how it will perform on unseen data, but then what?

118
00:08:39,850 --> 00:08:41,620
What do I do with that result?

119
00:08:41,620 --> 00:08:46,290
How is it actionable? Recall that in machine learning,

120
00:08:46,290 --> 00:08:48,210
we have a lot of choices.

121
00:08:48,240 --> 00:08:51,030
We call these choices hyperparameters.

122
00:08:51,030 --> 00:08:54,460
For example in gradient descent we have to choose the learning rate.

123
00:08:54,660 --> 00:09:00,170
We can't learn the learning rate. As you learn more about different variants of gradient descent,

124
00:09:00,230 --> 00:09:05,240
you will have to choose between those variants, and then among those variants there are yet more choices

125
00:09:05,240 --> 00:09:07,400
of hyperparameters.

126
00:09:07,610 --> 00:09:12,650
If you have a lot of input features you might want to choose just a subset of the features so that the

127
00:09:12,650 --> 00:09:17,060
model can learn from the important features rather than just the noise.

128
00:09:17,060 --> 00:09:20,260
You also have to choose between different model architectures.

129
00:09:20,480 --> 00:09:23,830
After you learn about basic linear models, you'll learn about ANNs,

130
00:09:23,850 --> 00:09:26,570
CNNs, RNNs, and so forth.

131
00:09:26,570 --> 00:09:31,150
You'll have to choose the number of layers in a network and the number of units in a layer.

132
00:09:31,220 --> 00:09:34,260
You'll also have to choose activation functions.

133
00:09:34,280 --> 00:09:37,450
There are so many choices you will feel overwhelmed.

134
00:09:37,460 --> 00:09:41,890
Unfortunately, it's not possible to choose these hyperparameters automatically.

135
00:09:41,930 --> 00:09:45,650
You can't just do gradient descent on model architectures.

136
00:09:45,650 --> 00:09:49,050
At this stage your approach amounts to just trial and error.

137
00:09:49,220 --> 00:09:52,880
And so how do you evaluate each of these different choices?

138
00:09:52,880 --> 00:09:59,430
The answer is by using the validation set. Typically you will pick the model that gives you the highest

139
00:09:59,430 --> 00:10:05,430
validation score, and that will be the model that you use on your test data, whether that's a Kaggle contest

140
00:10:05,460 --> 00:10:07,110
or pushing your model to production.
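
Here is a small sketch in Python of that trial-and-error loop, using the learning rate as the hyperparameter. Everything in it is an assumption for illustration: the toy data (y is roughly 2x), the one-weight model, the 50 gradient descent steps, and the list of candidate learning rates. Since the score here is an error, the best model is the one with the lowest validation score rather than the highest.

```python
import random

def fit_slope(train, lr, steps=50):
    # Gradient descent on mean squared error for the model y = w * x.
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= lr * grad
    return w

def mse(w, points):
    # Mean squared error of the model y = w * x on the given points.
    return sum((w * x - y) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(-1, 1)
    data.append((x, 2.0 * x + rng.gauss(0, 0.1)))

# Hold out the last 20 rows as the validation set.
train, val = data[:80], data[80:]

# Try each candidate learning rate and score it on the validation set only.
candidates = [0.001, 0.01, 0.1, 1.0, 5.0]
val_scores = {lr: mse(fit_slope(train, lr), val) for lr in candidates}

best_lr = min(val_scores, key=val_scores.get)
for lr in candidates:
    print(f"lr={lr:<6} validation MSE={val_scores[lr]:.4g}")
print("chosen learning rate:", best_lr)
```

Learning rates that are too small barely move the weight in 50 steps, and one that is too large makes gradient descent blow up. The validation score is what tells these cases apart, and the test data is never touched while choosing.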