1
00:00:00,530 --> 00:00:07,560
‫In this video we will learn how to split the available data into test and train sets. Then we will train

2
00:00:07,560 --> 00:00:13,080
‫the model on the training set and find the mean squared error on the test set.

3
00:00:15,920 --> 00:00:17,860
‫To split the data into test and train,

4
00:00:17,870 --> 00:00:24,620
‫I prefer to install this package over other methods. This package is called caTools.

5
00:00:24,830 --> 00:00:26,800
‫You know how to install a package.

6
00:00:27,200 --> 00:00:29,930
‫You can just write install.packages

7
00:00:32,990 --> 00:00:36,210
‫and within brackets and double quotation marks write

8
00:00:36,240 --> 00:00:39,010
‫caTools. The T of Tools is capital.

9
00:00:43,470 --> 00:00:48,420
‫Run this.

10
00:00:48,800 --> 00:00:52,370
‫You can see on the right that caTools is now available.

11
00:00:52,370 --> 00:00:55,510
‫We'll just tick this checkbox to load the package.
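
The install-and-load step described here can be sketched in R as follows. This is a minimal sketch that assumes an internet connection and a configured CRAN mirror. The `requireNamespace` guard is an addition that the video does not mention; it only skips the download when the package is already present. Calling `library` is the script equivalent of ticking the checkbox in the Packages pane.

```r
# Install caTools only if it is missing. The guard is an assumption;
# the video simply runs install.packages("caTools").
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools")
}

# Load the package for the current session.
library(caTools)
```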

12
00:00:56,770 --> 00:00:59,530
‫Now we are going to set a seed.

13
00:00:59,710 --> 00:01:06,370
‫The concept of setting a seed is that when splitting the data into test and train, I'll be doing it randomly,

14
00:01:06,970 --> 00:01:13,510
‫but if I set the seed at a particular value and you set the seed at the same value, we both will

15
00:01:13,510 --> 00:01:15,100
‫get the same split.

16
00:01:15,100 --> 00:01:21,610
‫That is, the observations I get in my training set are the same observations you will get in your

17
00:01:21,610 --> 00:01:27,100
‫training set. So we'll set the seed at 0.

18
00:01:27,150 --> 00:01:32,490
‫So we'll write set.seed, and within the brackets we will write 0.

19
00:01:37,220 --> 00:01:43,230
‫We'll run this, so the seed is set at 0.
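
The effect of the seed can be checked with a small sketch in base R. It uses `sample` on the numbers 1 to 10 purely as a stand-in for any random operation; the variable names are made up for illustration.

```r
# Draw a random permutation twice, resetting the seed before each draw.
set.seed(0)
first_draw <- sample(1:10)

set.seed(0)
second_draw <- sample(1:10)

# Same seed, same sequence of random numbers, so the two draws match.
print(identical(first_draw, second_draw))  # TRUE
```

Without the second `set.seed(0)` call, the two permutations would almost certainly differ.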

20
00:01:43,270 --> 00:01:49,240
‫Now we will split the data. We'll write split is equal to sample.split

21
00:01:54,380 --> 00:02:05,960
‫and within brackets we'll write df comma SplitRatio is equal to 0.8. The S and the R of SplitRatio

22
00:02:05,960 --> 00:02:17,000
‫are capital. Run this, and a new variable called split is created. It has a TRUE or FALSE value for each

23
00:02:17,000 --> 00:02:24,370
‫of the observations. We will assign the TRUE values to the training set, and the values that are FALSE we'll assign

24
00:02:24,370 --> 00:02:26,480
‫to the test set. So we'll write training_set;

25
00:02:31,700 --> 00:02:36,090
‫training_set is equal to subset,

26
00:02:39,570 --> 00:02:46,040
‫that is, a subset of df, so df comma split

27
00:02:48,720 --> 00:02:51,510
‫equal to equal to TRUE.

28
00:02:52,780 --> 00:02:59,460
‫So we're checking wherever the split value is TRUE; we take out that subset of df and put it into the

29
00:02:59,460 --> 00:03:08,950
‫training_set variable. You can see the training_set variable is also created. It has 378 observations. It

30
00:03:08,950 --> 00:03:15,730
‫will not take exactly 80 percent of the observations, but nearly. Whichever value you mention in the split

31
00:03:15,730 --> 00:03:22,870
‫ratio, you will have nearly that share of the observations. The remaining values we'll assign

32
00:03:22,870 --> 00:03:23,600
‫to the test set.

33
00:03:23,620 --> 00:03:25,330
‫So test_set

34
00:03:30,080 --> 00:03:36,930
‫is equal to subset, and within brackets df comma split equal to equal to FALSE.

35
00:03:43,030 --> 00:03:47,750
‫Run this, so the test_set variable is also created.
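
The split described here can be sketched as below, assuming caTools is installed. The data frame is a made-up stand-in for the house price data, and its column names other than `price` are hypothetical. One deliberate difference from the spoken command: as far as I understand the caTools documentation, `sample.split` expects a vector of labels and returns one flag per element, so this sketch passes the `price` column. Passing the whole data frame appears to yield one flag per column, which `subset` then recycles across the rows, and that may be why the training set in the video holds 378 rows rather than exactly 80 percent.

```r
library(caTools)

set.seed(0)

# Hypothetical stand-in for the house price data used in the video.
n <- 100
df <- data.frame(
  area  = runif(n, min = 50, max = 200),
  rooms = sample(1:6, n, replace = TRUE)
)
df$price <- 10 + 0.2 * df$area + 3 * df$rooms + rnorm(n, sd = 2)

# One TRUE/FALSE flag per row; about 80 percent of them are TRUE.
split <- sample.split(df$price, SplitRatio = 0.8)

# Rows flagged TRUE go to training, rows flagged FALSE go to test.
training_set <- subset(df, split == TRUE)
test_set     <- subset(df, split == FALSE)

print(nrow(training_set))
print(nrow(test_set))
```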

36
00:03:49,370 --> 00:03:53,470
‫Now let's run a linear model on the training data set.

37
00:03:53,750 --> 00:04:02,480
‫We know how to run a linear model. For that we'll create a variable lm_a, and this is equal

38
00:04:02,480 --> 00:04:05,810
‫to lm, and within brackets we'll write

39
00:04:05,910 --> 00:04:09,600
‫price tilde dot

40
00:04:12,810 --> 00:04:18,190
‫comma data is equal to training_set.

41
00:04:18,640 --> 00:04:21,530
‫We are not running this model on the complete data that we have.

42
00:04:21,610 --> 00:04:26,200
‫We are running it only on the 378 observations in the training set.

43
00:04:26,440 --> 00:04:28,610
‫So let's run this.

44
00:04:28,810 --> 00:04:31,400
‫The fitted model is stored in lm_a.

45
00:04:31,960 --> 00:04:37,180
‫If you want to look at the summary, you can write summary and within brackets lm_a.
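
A minimal sketch of the model-fitting step, in base R. The training data here is generated on the spot and its predictor names are hypothetical; only `price`, `lm_a`, and `training_set` come from the video. The formula `price ~ .` means "regress price on every other column".

```r
set.seed(0)

# Hypothetical training data; predictor names are made up for illustration.
n <- 80
training_set <- data.frame(
  area  = runif(n, min = 50, max = 200),
  rooms = sample(1:6, n, replace = TRUE)
)
training_set$price <- 10 + 0.2 * training_set$area +
  3 * training_set$rooms + rnorm(n, sd = 2)

# Fit a linear model of price on all remaining columns.
lm_a <- lm(price ~ ., data = training_set)

# Coefficients, standard errors, R-squared and related statistics.
print(summary(lm_a))
```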

46
00:04:38,510 --> 00:04:46,040
‫But here we are going to find the mean squared error on the training set and the test set.

47
00:04:46,440 --> 00:04:52,420
‫To find the mean squared errors, we first need to predict the values of price on the basis of

48
00:04:52,430 --> 00:04:55,880
‫this fitted model. Let's predict these values.

49
00:04:55,990 --> 00:05:01,370
‫We use a function called predict. The predict function takes two parameters.

50
00:05:01,370 --> 00:05:05,350
‫One is the model that we have fitted, which is lm_a.

51
00:05:05,600 --> 00:05:14,750
‫The other is the data to be used to predict the values. So we'll get the predicted values

52
00:05:15,230 --> 00:05:23,540
‫for the training set into a variable called train_a. We'll write train_a is equal

53
00:05:23,540 --> 00:05:33,980
‫to predict, and within brackets the first parameter will be lm_a, which is the fitted model,

54
00:05:34,270 --> 00:05:36,100
‫comma,

55
00:05:36,500 --> 00:05:37,330
‫then the data.

56
00:05:37,340 --> 00:05:45,070
‫This is the training data, so we'll write training_set.

57
00:05:45,150 --> 00:05:52,960
‫What this will do is take all the independent variables from this set, put them into this model,

58
00:05:53,320 --> 00:05:57,910
‫predict the value of the dependent variable, and store it in train_a.

59
00:05:59,050 --> 00:06:02,490
‫So let's run this.

60
00:06:02,800 --> 00:06:12,430
‫So we have train_a as another variable. We'll do the same thing for the test set also, just

61
00:06:12,640 --> 00:06:14,470
‫in place of train we'll write test,

62
00:06:19,750 --> 00:06:25,880
‫so we'll get the predicted values of house price for our test data also.
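
The two `predict` calls can be sketched with a tiny made-up data set in which price is exactly 2 times area plus 5, so the predictions are easy to check by hand. The `area` column and all numbers are hypothetical. The second argument of `predict` for a linear model is its `newdata` parameter, passed here by position as in the video.

```r
# Tiny hypothetical data where price = 2 * area + 5 exactly.
training_set <- data.frame(area = c(10, 20, 30, 40),
                           price = c(25, 45, 65, 85))
test_set <- data.frame(area = c(15, 50),
                       price = c(35, 105))

lm_a <- lm(price ~ ., data = training_set)

# Predictions for the rows the model was fitted on.
train_a <- predict(lm_a, training_set)

# Predictions for rows the model has not seen.
test_a <- predict(lm_a, test_set)

print(test_a)
```

The `price` column in the data passed to `predict` is ignored; only the predictor columns are used.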

63
00:06:26,020 --> 00:06:34,570
‫Now the mean squared error is the average of the squared differences between these predicted values and

64
00:06:34,570 --> 00:06:39,640
‫the actual values. So to get that average we'll write mean

65
00:06:43,520 --> 00:06:48,760
‫and within brackets we have to square the differences of these.

66
00:06:49,000 --> 00:06:55,310
‫So it is the difference of training_set dollar price,

67
00:06:55,420 --> 00:07:01,330
‫these are the actual values, minus the predicted values, which are train_a,

68
00:07:05,400 --> 00:07:07,240
‫and we want to square these values.

69
00:07:07,380 --> 00:07:09,900
‫So we'll put another bracket around this.

70
00:07:13,640 --> 00:07:25,490
‫I will square this difference and run this. So 20.66 is the mean squared error on the

71
00:07:25,490 --> 00:07:28,250
‫training data.

72
00:07:28,460 --> 00:07:35,420
‫So on average, the squared distance between the predicted values and the actual values on the training data

73
00:07:35,810 --> 00:07:38,090
‫is 20.66.

74
00:07:38,240 --> 00:07:39,170
‫Let's do this

75
00:07:39,250 --> 00:07:40,640
‫for the test set also.

76
00:07:44,100 --> 00:07:47,660
‫We'll use test_set dollar price

77
00:07:51,680 --> 00:07:53,880
‫minus test_a.

78
00:07:57,650 --> 00:07:57,860
‫So,

79
00:07:58,090 --> 00:08:06,700
‫since this test data is previously unseen, most probably our model will not work as well on this data.

80
00:08:06,700 --> 00:08:15,250
‫The mean squared error is 23.04, which means it is performing worse on the unseen data.

81
00:08:15,270 --> 00:08:17,640
‫This is as discussed in the theory lectures also.
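
The mean squared error calculation can be sketched on a few made-up numbers. The helper name `mse` is hypothetical; the video writes the `mean` expression inline with `training_set$price` and `train_a` (or `test_set$price` and `test_a`) in place of the two vectors below.

```r
# Mean squared error: the average of the squared gaps
# between actual and predicted values.
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Hypothetical actual prices and model predictions.
actual    <- c(24, 30, 18, 22)
predicted <- c(22, 33, 18, 20)

# Squared gaps are 4, 9, 0 and 4, so the result is 17 / 4 = 4.25.
print(mse(actual, predicted))
```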

82
00:08:19,470 --> 00:08:24,430
‫So this is how we split the data into a test set and a train set in

83
00:08:24,530 --> 00:08:32,140
‫R. We then ran the model on the training set, and using the model created on the training set we

84
00:08:32,150 --> 00:08:35,530
‫predicted the values of the test dependent variable.

85
00:08:35,640 --> 00:08:40,140
‫We then found the estimated error on this test data.

86
00:08:40,140 --> 00:08:43,860
‫This estimated error is to be used when we are comparing different models.
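
One way the test error might be used to compare models is sketched below. Everything in it is an assumption made for illustration: the data is generated, the predictor names are hypothetical, the two candidate models are arbitrary, and the split uses base R `sample` in place of `sample.split` so the sketch runs without extra packages.

```r
set.seed(0)

# Hypothetical data: price depends on both area and rooms.
n <- 200
df <- data.frame(
  area  = runif(n, min = 50, max = 200),
  rooms = sample(1:6, n, replace = TRUE)
)
df$price <- 10 + 0.2 * df$area + 3 * df$rooms + rnorm(n, sd = 2)

# Base R 80/20 split, used here instead of caTools::sample.split.
train_rows   <- sample(seq_len(n), size = round(0.8 * n))
training_set <- df[train_rows, ]
test_set     <- df[-train_rows, ]

# Two candidate models fitted on the same training set.
lm_full <- lm(price ~ ., data = training_set)
lm_area <- lm(price ~ area, data = training_set)

# Test-set mean squared error for each candidate.
mse_full <- mean((test_set$price - predict(lm_full, test_set))^2)
mse_area <- mean((test_set$price - predict(lm_area, test_set))^2)

print(c(full = mse_full, area_only = mse_area))
```

The model with the lower test error would be preferred, since that error is measured on observations the model did not see during fitting.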