WEBVTT

00:01.030 --> 00:09.220
In this session, we will discuss the first ensemble learning method, which is bagging, which

00:09.220 --> 00:13.090
has been implemented in the form of random forest.

00:14.920 --> 00:16.090
So let us look at it.

00:17.180 --> 00:25.880
So first of all, let us discuss decision trees. Large decision trees have low bias

00:26.000 --> 00:30.020
and high variance, and they tend to overfit.

00:31.300 --> 00:39.550
While in the case of stumps, the stumps have high bias and low variance, and they tend to underfit.

00:40.950 --> 00:54.150
Now, with these we will be creating our random forest. Random forest is an implementation of bagging, and bagging

00:54.300 --> 00:57.700
helps us to reduce the variance.

00:57.960 --> 01:06.690
So how do we do that? We basically combine different stumps and we take the average of them so that

01:06.690 --> 01:09.650
we get a low-variance model.
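
NOTE
A minimal sketch of why averaging reduces variance, using only the Python standard library.
It assumes each weak model's prediction is the true value plus independent Gaussian noise;
the constants and variable names are illustrative and not from the talk. Real bagged trees
are correlated, so the reduction in practice is smaller than in this idealised case.
```python
import random
import statistics
rng = random.Random(0)
TRUTH, NOISE, N_MODELS, N_TRIALS = 10.0, 2.0, 25, 2000  # assumed toy values
# One high-variance model: the true value plus Gaussian noise.
single = [rng.gauss(TRUTH, NOISE) for _ in range(N_TRIALS)]
# Bagged model: the average of N_MODELS independent noisy predictions.
bagged = [
    statistics.fmean(rng.gauss(TRUTH, NOISE) for _ in range(N_MODELS))
    for _ in range(N_TRIALS)
]
var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print(f"single model variance:   {var_single:.2f}")
print(f"averaged model variance: {var_bagged:.2f}")
```
The averaged predictions should show a much smaller variance than the single-model predictions.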

01:10.170 --> 01:14.190
So let's see: what are we doing in the case of bagging?

01:14.490 --> 01:17.850
In the case of bagging, we reduce the variance.

01:18.390 --> 01:25.330
So we combine the stumps so that the variance is reduced in comparison to the decision tree.

01:25.350 --> 01:29.510
So in case of decision tree, we have high variance.

01:30.060 --> 01:37.160
So what we do is we basically use stumps so that we could have low variance in this particular situation.

01:39.220 --> 01:42.740
So what actually is a random forest?

01:43.510 --> 01:50.320
So we have discussed bagging, and we have discussed why we use stumps instead of decision

01:50.320 --> 01:53.860
trees, but we don't really know what random forest is.

01:55.350 --> 02:06.090
So a random forest is a group of small decision trees, that is, a group of stumps which we train, and using

02:06.090 --> 02:12.480
this group of stumps, we create a huge forest. We create a forest.

02:12.480 --> 02:13.410
What is a forest?

02:13.410 --> 02:16.320
Forest is something that contains a lot of trees.

02:16.710 --> 02:22.640
So it is the same thing here: we are creating a forest of decision stumps.

02:23.160 --> 02:29.720
Now, when we are creating the forest of decision stumps, we need to add randomness.

02:30.300 --> 02:38.220
So when we are saying random forest, it means that we are creating a forest and we have to add some

02:38.220 --> 02:39.490
randomness to it.

02:40.810 --> 02:50.440
Now, how are we introducing this randomness? This randomness is introduced by creating a subset

02:50.440 --> 02:54.940
of samples, which is taken for the training of each and every tree.

02:56.080 --> 03:03.160
So when we are creating a decision tree, what we do is, in the decision tree, we consider

03:03.160 --> 03:10.330
all the features which are present. Now, in the case of random forest, when we are creating the decision

03:10.330 --> 03:18.490
stumps, we will be taking small samples of the data instead of considering all the samples of data.

03:19.000 --> 03:27.240
So we will take only small samples of data so that we can get only a subset of the data points.

03:27.400 --> 03:30.970
That is only a subset of patterns from the data.

03:32.440 --> 03:40.240
The next thing we do is we also take a subset of the variables for training each tree.

03:40.870 --> 03:44.620
Now, what happens when we are taking a subset of variables?

03:44.620 --> 03:47.340
Because we are taking a subset of variables,

03:47.500 --> 03:56.440
instead of having all the features present for creating a tree, we now have different subsets

03:56.440 --> 03:58.960
of features for creating each decision tree.

03:59.860 --> 04:07.950
Now, imagine if we had used the complete set of features for creating the decision tree stumps.

04:08.170 --> 04:13.690
Then what would have happened is that each tree would be evaluating all the rules.

04:14.910 --> 04:22.080
And because it is evaluating all the rules on the basis of the same Gini gain or sum of squared errors,

04:22.300 --> 04:32.350
it will always pick the rules which have the highest Gini gain.

04:33.090 --> 04:40.830
So that is the reason why it would make the same first split, followed by the same next split.

04:41.040 --> 04:47.100
So the decision tree which is formed will be just the same, because there is no new feature

04:47.100 --> 04:54.090
or subset of features. When no new features are there and the old features

04:54.300 --> 05:01.440
are not excluded, it will create exactly the same decision tree, because there is no variation

05:01.440 --> 05:01.760
in it.

05:02.370 --> 05:07.560
So that is why we need to provide the variation in the form of randomness.

05:08.630 --> 05:15.650
Which is the reason why we are looking at a sample subset. So from the entire data, we will look at

05:15.650 --> 05:18.400
small samples of the data every time.

05:19.010 --> 05:26.020
And for creating the rules, we will not be creating the rules from all the variables.

05:26.030 --> 05:29.390
We will not be considering all the variables for creating the tree.

05:29.690 --> 05:34.140
We will be creating only rules from the small subset of the variables.

05:34.640 --> 05:40.910
Now, again, for each and every split which we will be creating, the rules are not considered from

05:40.910 --> 05:42.540
all n variables.

05:43.070 --> 05:44.090
So let us see.

05:44.270 --> 05:50.630
Initially, we have n variables from which we used to create our decision

05:50.630 --> 05:52.580
tree. The rules

05:52.640 --> 05:54.680
came from all the variables which we had.

05:55.840 --> 06:05.410
So now for creating stump one, stump two, stump three and stump four, I will be taking small subsets of

06:05.410 --> 06:06.900
these n variables.

06:07.240 --> 06:11.440
So now the subset can have any number of variables.

06:11.740 --> 06:16.740
So each model is created from any number of variables.

06:16.960 --> 06:25.300
After creating the subset, while creating this small stump out of these n variables, out of this subset of

06:25.540 --> 06:26.290
variables,

06:26.560 --> 06:30.640
at each split which I will be making, the rules will be picked

06:30.640 --> 06:35.890
not from all n variables, but again from a subset of these variables.

06:36.880 --> 06:37.480
So you see.

06:38.490 --> 06:46.590
The first subset we are taking is for creating the tree itself, for creating the stump itself, or

06:46.590 --> 06:55.530
the weak learner itself. So the weak learner itself will be created from a subset of the original number of

06:56.010 --> 07:02.760
features which we had. From these n variables, we now have a smaller subset of variables, and

07:02.760 --> 07:10.330
from this subset also, at every split which we will be making,

07:10.350 --> 07:15.240
again, we will not consider all the rules, but only a subset of these rules.

07:16.960 --> 07:26.710
So this will again add randomness to the stump or the weak learner which we will be using, as

07:26.710 --> 07:32.580
a random subset of features is selected, and it is different for each and every split.

07:32.920 --> 07:34.980
So this is what we know for now.

07:35.110 --> 07:38.620
That is, a subset of the samples will be taken.

07:38.800 --> 07:45.580
Then a subset of the variables will be considered for creating the tree, and while creating the tree

07:45.580 --> 07:46.780
from this subset,

07:47.800 --> 07:56.230
at every split, again, we will be considering a subset of the variables which we have now.

07:58.360 --> 08:03.030
So this is how we will be adding randomness to the decision trees.
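
NOTE
A rough sketch of the three sources of randomness described above, using only the Python
standard library. The toy sizes, the subset sizes and all variable names are assumptions
for illustration. The rows are drawn with replacement (a bootstrap sample), as bagging
usually does; the talk itself only says that a subset of samples is taken.
```python
import random
rng = random.Random(42)
n_rows, n_features = 10, 8  # assumed toy sizes
rows = list(range(n_rows))
features = [f"x{j}" for j in range(n_features)]
# 1) Subset of samples: draw row indices with replacement for one tree.
sample_rows = rng.choices(rows, k=n_rows)
# 2) Subset of variables: this tree only ever sees 5 of the 8 features.
tree_features = rng.sample(features, k=5)
# 3) At every split, only a random subset of the tree's features is tried.
split_features = [rng.sample(tree_features, k=2) for _ in range(3)]
print("rows used by this tree:    ", sorted(sample_rows))
print("features for this tree:    ", tree_features)
print("features tried per split:  ", split_features)
```
Repeating these steps with fresh random draws for every tree is what makes the trees differ from one another.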

08:09.420 --> 08:13.350
Now, what are the different types of parameters we have for random forest?

08:13.380 --> 08:15.990
What are the hyperparameters?

08:17.330 --> 08:26.060
First, n_estimators. What is n_estimators? This is an integer, which is the

08:26.060 --> 08:34.480
number of trees which we will be creating. So we can create 100 trees, 500

08:34.490 --> 08:37.730
trees, 1,000 trees, any number of trees.

08:38.060 --> 08:43.190
But the prescribed number is 500; 500 usually works

08:43.190 --> 08:50.090
well. In the case of random forest, we usually try 100, 500 and 1,000 and then narrow

08:50.090 --> 08:51.110
it down further.

08:52.880 --> 09:01.880
Next is maximum depth, max_depth. Maximum depth is again an integer; this shows the maximum depth of

09:01.880 --> 09:06.110
the model or the weak learner which we will be using.

09:06.290 --> 09:12.950
So for each weak learner, that is the maximum depth. Usually we keep it around two to five.

09:13.820 --> 09:15.500
Then we have max_samples.

09:16.860 --> 09:23.940
This is the number of samples to draw from X to train each base estimator. That is,

09:25.730 --> 09:31.820
the maximum number of samples: what is the maximum number of samples which we have to select

09:32.670 --> 09:42.330
for creating the estimator? Then min_impurity_decrease: that is, a node will be split if this split

09:42.330 --> 09:49.230
induces a decrease of the impurity greater than or equal to this value. That is,

09:51.230 --> 10:00.260
it is the minimum amount of impurity decrease, or you can also say information gain. The information

10:00.260 --> 10:07.460
gain which we were talking about: we said that if we have a fixed amount of information gain, at

10:07.460 --> 10:12.460
least a minimum amount of information gain, then only we will make a split.

10:12.710 --> 10:14.900
Otherwise we will not make a split.

10:15.920 --> 10:22.970
Then warm_start: if it is set to True, then it will allow us to reuse the solution of the

10:22.970 --> 10:30.830
previous call of the fit function, and it will allow us to add more estimators to the ensemble.

10:30.830 --> 10:34.400
Otherwise it will just fit a whole new forest.

10:36.410 --> 10:42.830
Then n_jobs is what defines the number of jobs which can be run in parallel.

10:43.070 --> 10:50.210
So instead of creating the trees sequentially, we can actually create them

10:50.210 --> 10:53.090
in parallel by using n_jobs.
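
NOTE
A hedged sketch of how the hyperparameters above might be passed to a random forest.
It assumes the talk refers to scikit-learn's RandomForestClassifier, whose parameter
names match the ones described; the synthetic data set and the specific values are
illustrative only. A smaller n_estimators than the suggested 500 is used to keep it fast.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Synthetic data, only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(
    n_estimators=100,           # number of trees
    max_depth=5,                # maximum depth of each tree
    max_samples=0.8,            # fraction of rows drawn to train each tree
    min_impurity_decrease=0.0,  # minimum impurity decrease required to split
    warm_start=True,            # reuse the previous fit and add more trees
    n_jobs=-1,                  # build trees in parallel on all cores
    random_state=0,
)
model.fit(X_train, y_train)
n_first = len(model.estimators_)
# With warm_start=True, raising n_estimators and refitting adds 50 new trees.
model.set_params(n_estimators=150)
model.fit(X_train, y_train)
n_second = len(model.estimators_)
accuracy = model.score(X_test, y_test)
print(n_first, n_second, round(accuracy, 3))
```
With warm_start=False, the second call to fit would instead train a whole new forest of 150 trees.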

10:56.680 --> 11:00.680
So this is the theory about random forest.

11:01.120 --> 11:04.900
Next, we will learn about the implementation of random forest.

11:08.450 --> 11:15.360
Now, at this point of time, you have a good knowledge of how a model should be generated.

11:15.800 --> 11:20.420
So now you can actually start to create your project.

11:21.710 --> 11:30.710
So to create your project, you can begin with the provided data set. You can now actually

11:30.710 --> 11:34.270
run the linear models, and after the linear models,

11:34.400 --> 11:38.450
you can start implementing decision tree and random forest.

11:38.570 --> 11:47.720
And as we go ahead, you will be implementing stacking and XGBoost on it and finally create a pipeline

11:47.870 --> 11:50.760
of the best models which you have generated.

11:51.710 --> 11:58.670
This is a project which will allow you to try all the aspects of machine learning,

11:58.880 --> 12:07.280
That is from data preparation to data analysis and also the implementation of various machine learning

12:07.280 --> 12:08.030
algorithms.

12:08.210 --> 12:09.590
And it will allow you

12:10.590 --> 12:17.990
to compare different machine learning algorithms with each other so that you can easily create the models

12:18.000 --> 12:23.220
when you want to create them in real life during your work.