WEBVTT

00:01.320 --> 00:08.040
In this session, we will discuss the third ensemble learning method, which is stacking.

00:08.070 --> 00:15.540
So far, we have discussed a lot of supervised learning methods. All of these supervised learning

00:15.540 --> 00:19.500
methods perform decently at their own level.

00:20.010 --> 00:28.230
But sometimes the problems are so complex that a single model cannot really capture all the patterns which are

00:28.230 --> 00:28.760
available.

00:29.600 --> 00:38.360
So for this particular situation, we have a solution called stacking. In this solution, what we actually

00:38.360 --> 00:48.890
do is we pick out some trained methods, some trained algorithms, and create a linear combination of

00:48.890 --> 00:50.000
these algorithms.

00:50.980 --> 01:00.760
The combination of these algorithms actually helps us in getting a better outlook, in getting a better

01:00.760 --> 01:02.630
performance from these models.

01:04.110 --> 01:07.690
So let us see what stacking helps us in doing.

01:07.710 --> 01:10.170
When do we actually choose it?

01:12.010 --> 01:21.070
So in case we want to decrease the variance, then we will use bagging. In case we want to decrease the bias,

01:21.310 --> 01:23.610
in that case, we will use boosting.

01:24.190 --> 01:31.040
Now, in case we want to improve the predictions, then we will use the stacking algorithm.

01:32.910 --> 01:41.760
So what is stacking? Stacked generalization, or stacking for short, is an ensemble machine learning algorithm

01:42.030 --> 01:48.120
that combines the predictions from multiple machine learning models on the same data set.

01:48.370 --> 01:51.720
So we will train multiple machine learning algorithms.

01:51.930 --> 01:52.710
So let us say

01:52.710 --> 01:55.070
I have one linear regression.

01:55.080 --> 02:00.660
I have one random forest, I have another, say, SVM model.

02:00.870 --> 02:08.460
And all of these models are performing decently on their own part of the data, but they are not able

02:08.460 --> 02:13.830
to capture one or the other part of the pattern that is present in the data.

02:14.310 --> 02:24.720
So in that case, we will combine the powers of all of these good performers and get a very strong performer,

02:24.990 --> 02:33.090
which is called a stacked model, which will be created by combining the powers of all of these algorithms.

02:34.080 --> 02:42.030
So this is used to explore a space of different models for the same problem, learning with different

02:42.030 --> 02:48.180
types of models which are capable of learning some part of the problem, but not the whole space of the

02:48.180 --> 02:48.660
problem.

02:49.770 --> 02:57.450
So what we will be doing here is we will take these different models and create a linear combination

02:57.450 --> 03:05.430
of these models so that the combination will perform better than the other models, than these individual models.

03:07.150 --> 03:14.710
So the models to be stacked should be non-linear in nature and the models are combined using a linear

03:14.710 --> 03:15.160
method.

03:15.190 --> 03:26.410
So we will be choosing several nonlinear models such as SVM, KNN, XGBoost, Random Forest

03:26.650 --> 03:29.100
and these nonlinear models.

03:29.260 --> 03:34.720
We will combine these nonlinear models using the linear method.

03:35.020 --> 03:41.200
That is, if we have a classification problem, then we will combine these using logistic regression.

03:41.500 --> 03:48.550
If it is a linear problem, that is, a regression problem, then we will combine these using a linear

03:48.550 --> 03:48.940
model.
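
NOTE
Editor's sketch, not part of the spoken lecture: a minimal example of the idea above,
assuming scikit-learn is available. The synthetic data set, the choice of three base models
and all parameter values are assumptions made only for illustration.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# Synthetic data stands in for the lecture's training set (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# First level: nonlinear base models of the kind named in the lecture.
base_models = [
    ("svm", SVC(random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]
# Second level: a linear method, here logistic regression for classification.
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(), cv=3)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print(f"stacked accuracy: {accuracy:.3f}")
```
For a regression problem, StackingRegressor with a LinearRegression final estimator would play the same role.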

03:50.810 --> 03:58.400
So this is the entire process. Let us say we have our training data, which is m cross n, that

03:58.670 --> 04:01.810
is, m is the number of rows and n is the number of columns.

04:02.090 --> 04:10.910
So we will take these four models or five models, any number of models of our choice, and train these

04:10.910 --> 04:13.010
models on this particular dataset.

04:13.820 --> 04:21.120
Now a new training set is created for the second-level model, consisting of the predictions from the first-level models.

04:21.290 --> 04:25.380
Now we will have some predictions from these models.

04:26.450 --> 04:30.410
So this model will, let us say, give y hat one.

04:30.410 --> 04:34.740
These will give y hat two, y hat three and y hat four.

04:35.290 --> 04:43.730
Now we will create a logistic or linear regression on top of these y hat

04:43.820 --> 04:54.320
values, which we will obtain from these models, and try to reduce the error on these by actually

04:54.320 --> 04:58.050
predicting the y values from these y hat values.

04:58.220 --> 05:06.110
So at the first level, we try to predict the values from the input, being this m cross n matrix, that

05:06.110 --> 05:07.940
is, this m cross n data.

05:08.240 --> 05:15.380
Now, for the second layer, the input will be the output which we

05:15.380 --> 05:17.360
will get from these four models.

05:18.570 --> 05:24.230
And then we will try to reduce the error from these four models.

05:24.570 --> 05:25.440
Now let us say

05:25.470 --> 05:31.980
it is still not performing well. Then we can introduce another layer, and that layer will actually

05:31.980 --> 05:38.610
be trained on the output of the second layer so we can have any number of layers present in the stacking.

05:39.570 --> 05:44.580
So based on these levels, we will get the final prediction.

05:46.580 --> 05:55.720
So what is the process? The process is that the initial training data X has m observations and n features,

05:55.730 --> 06:07.010
so it is m cross n. There are M different models that are trained on X by some method of training, like cross-validation,

06:07.010 --> 06:07.650
beforehand.

06:07.850 --> 06:11.250
So we have already trained some different models.

06:11.540 --> 06:16.090
Now each model provides the prediction for the outcome.

06:16.100 --> 06:16.410
Right.

06:16.550 --> 06:20.420
So we will have some y hat value from each of the models.

06:20.720 --> 06:27.920
Now, these are cast into a second-level training data set, which is now m cross M in size.

06:28.130 --> 06:31.500
That is, m number of rows and M number of columns.

06:31.760 --> 06:37.970
Now, the M predictions become the features for the second level, and the y value remains the same.
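
NOTE
Editor's sketch, not part of the spoken lecture: one way the second-level data could be
assembled, assuming NumPy and scikit-learn. The names m, n and M follow the lecture;
X_level2, the synthetic data and the three regressors are assumptions for illustration.
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
# Synthetic first-level data X with m rows and n columns (an assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
m, n = X.shape
models = [SVR(), KNeighborsRegressor(), RandomForestRegressor(n_estimators=50, random_state=0)]
M = len(models)
# Each column holds one model's out-of-fold y hat for every one of the m rows.
X_level2 = np.column_stack([cross_val_predict(model, X, y, cv=3) for model in models])
# The target y is unchanged; only the features are replaced by predictions.
meta_model = LinearRegression().fit(X_level2, y)
print(X.shape, X_level2.shape)
print("meta-model weights:", meta_model.coef_)
```
The printed shapes show the move from m cross n to m cross M; the weights are the linear combination of the base models.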

06:38.180 --> 06:44.520
So now we will apply the second-level model on top of it, and then take it further.

06:44.720 --> 06:50.180
Now what we can do is, if the output does not satisfy us, then we can apply another model on top of

06:50.180 --> 06:52.210
it so we can do it that way.

06:53.180 --> 06:59.510
Now, after the second level, which is a linear model, we can again apply a nonlinear model on top

06:59.510 --> 07:06.130
of it. So we can try different variants, try different combinations, and get a stacked model.

07:07.810 --> 07:18.190
Now, how do we do the sampling? Stacking will be using a similar idea to k-fold

07:18.190 --> 07:21.920
cross-validation to create out of sample predictions.

07:22.360 --> 07:29.740
So if we were to use predictions from models that are fit on all the training data, then we

07:29.740 --> 07:35.810
will not have any out-of-sample data to train the second level of the model.

07:36.280 --> 07:42.080
So for this, actually, we want to have an out-of-sample prediction.

07:42.220 --> 07:46.570
So that is why we don't want to be biased towards these models.

07:46.780 --> 07:56.140
So we will need to train the models on k minus one folds. So each model will be trained on, let us say,

07:56.140 --> 08:05.200
four folds, and the prediction will be made on the held-out fold, so that we will have an out-of-sample prediction.

08:06.460 --> 08:11.870
This will allow the predictions to be out-of-sample and useful for the training of the next layer.

08:12.070 --> 08:14.160
So let us see how this would happen.

08:15.230 --> 08:18.640
So let us say we have created three folds here.

08:19.120 --> 08:28.750
So what we can do is we can train on, let us say, the first two blocks, the first two folds, and we will train

08:28.990 --> 08:35.110
all the models on the first two folds and make a prediction on the third one.

08:35.620 --> 08:42.460
Now, again, we will train all of these models on the next two folds and make a prediction on this

08:42.460 --> 08:43.090
green one.

08:43.630 --> 08:51.220
Lastly, we will train the models on the green and red folds and make a prediction on the blue one.

08:51.670 --> 09:01.060
Now, the predictions which we will be having here from the models one, two, three will be

09:01.060 --> 09:05.980
f1, f2, f3. These are the y hat values which we will be getting from these models.

09:06.460 --> 09:08.710
Now we have these y hat values.

09:08.710 --> 09:16.540
These y hat values will actually act as the X values for the linear model, which we will be training

09:16.540 --> 09:17.530
in the next level.

09:17.950 --> 09:24.280
So these will go into the X values for the next model, and this will go into

09:24.280 --> 09:25.160
the y value.

09:25.510 --> 09:32.070
So we will train a linear model on top of this particular data and get the output.

09:32.290 --> 09:37.430
So like this, we will be able to get the stacking implemented.
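
NOTE
Editor's sketch, not part of the spoken lecture: the three-fold walkthrough above written out
by hand in plain Python, to show where each out-of-fold prediction comes from. The two toy
base models (fit_mean, fit_nearest), the helper names and the six data points are all
hypothetical and chosen only to keep the example small.
```python
# Toy base models (hypothetical): each is fit on training rows and returns a predict function.
def fit_mean(xs, ys):
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y
def fit_nearest(xs, ys):
    # 1-nearest-neighbour on a single numeric feature.
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]
def make_folds(n_rows, k):
    # Split row indices 0..n_rows-1 into k contiguous folds.
    fold_size, extra = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        stop = start + fold_size + (1 if i < extra else 0)
        folds.append(list(range(start, stop)))
        start = stop
    return folds
def out_of_fold(fit, xs, ys, folds):
    # Every row is predicted by a model that never saw that row during training.
    preds = [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in held_out:
            preds[i] = model(xs[i])
    return preds
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
folds = make_folds(len(xs), 3)
f1 = out_of_fold(fit_mean, xs, ys, folds)
f2 = out_of_fold(fit_nearest, xs, ys, folds)
# f1 and f2 become the X columns of the next level; y stays the same.
level2_X = list(zip(f1, f2))
level2_y = ys
print(folds)
print(level2_X)
```
A linear model would then be fit on level2_X against level2_y, as described in the lecture.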

09:37.780 --> 09:43.020
So we will have a look at the code of this particular implementation in the next session.