WEBVTT

00:01.010 --> 00:09.650
Hi. In this session, we will revisit random forests and discuss the issues with decision trees,

00:09.830 --> 00:17.270
how we can implement a random forest, the different details of random forests, and what else we can

00:17.270 --> 00:19.220
actually gain from a random forest.

00:19.820 --> 00:21.900
So let us have a look at it now.

00:21.920 --> 00:24.050
What are the different issues with decision

00:24.050 --> 00:30.430
trees? Decision trees help in capturing the niche patterns in the data.

00:30.740 --> 00:39.000
That is, the model fits the training data very well, but may not do well with a newer dataset. Whenever

00:39.000 --> 00:40.700
we build a particular model,

00:40.970 --> 00:44.270
we want it to do well with future data as well.

00:44.270 --> 00:50.510
We don't just want to learn from the current data and predict the current values properly, but

00:50.510 --> 00:55.250
also to predict the new, unseen values better.

00:56.160 --> 01:05.190
Hence, we need the model to generalize; that is, we don't want the model to overfit.

01:06.400 --> 01:08.400
So how do we really handle this?

01:08.650 --> 01:16.050
So on one hand, we want to keep the decision tree's amazing ability to capture non-linearity in the

01:16.070 --> 01:23.260
data, and at the same time we do not want to be susceptible to noise or the niche patterns from the training

01:23.260 --> 01:23.680
data.

01:23.980 --> 01:30.340
That is the issue with the decision tree, because when the decision tree grows deeper, what will happen is that the

01:30.340 --> 01:37.000
decision tree will try to learn from the noisy data points or the noisy features which it has.

01:37.330 --> 01:39.260
But when we have a random forest,

01:39.280 --> 01:45.960
what will happen is that we will be averaging out the votes from different decision trees.

01:46.180 --> 01:52.660
So in that case, if there is a particular variable which has some noisy data, or in case there is

01:52.660 --> 01:56.920
some feature which is noisy and we don't really want that feature,

01:57.160 --> 02:02.530
what will happen is that the impact of those particular data points will get diminished.

02:03.560 --> 02:07.910
So it will, in turn, perform better than a single decision tree.

02:08.570 --> 02:15.710
So it is a very simple and powerful idea, which resulted in the popular algorithm called random

02:15.710 --> 02:21.760
forest, where we use the bagging algorithm [inaudible].

02:23.440 --> 02:32.050
Now, what do we have in a random forest? The random forest introduces two levels of randomness. So

02:32.260 --> 02:36.040
we know what "forest" stands for in random forest.

02:36.220 --> 02:39.940
The forest stands for the multiple trees.

02:40.150 --> 02:45.280
So in a random forest, we will have one hundred, five hundred or thousands of trees.

02:45.520 --> 02:53.020
But the randomness, which is in the name, is actually introduced at two levels.

02:54.150 --> 02:58.650
That is in the tree-building process, which will help in handling the noise in the data.

02:59.010 --> 03:01.320
Now, what are the two types of randomness?

03:01.530 --> 03:08.910
The first type of randomness is that it takes a random subset of all the observations in the training

03:08.910 --> 03:09.300
data.

03:10.370 --> 03:18.950
That is the first level: instead of having the entire data and considering all the data points,

03:19.160 --> 03:23.330
we will consider only a subset of the data points.

03:23.330 --> 03:30.390
That is, a sample of the data points will be used for creating one of the trees.

03:30.500 --> 03:32.410
So we will have hundreds of trees.

03:32.630 --> 03:36.210
So each tree will be created from a different subsample.

03:37.280 --> 03:43.370
The second one is that it will also take a random subset of all the variables from the

03:44.590 --> 03:45.400
training data.

03:45.970 --> 03:52.870
So now what will happen is that initially what we did was we took a sample of the entire data.

03:53.080 --> 03:58.480
Instead of taking all the data, we took only a part of the data for each tree to be constructed.

03:58.900 --> 04:05.770
The next randomness which we introduced was by taking a subsample of the

04:06.900 --> 04:15.510
variables also. So we have a subset of the rows and we have a subset of the columns

04:15.510 --> 04:21.420
also, so we are not considering all hundred or twenty variables which we have.

04:21.420 --> 04:28.020
We are considering only a subset of the variables at a time, so that the rules which will be generated

04:28.020 --> 04:34.800
will also be a subset. Not all rules will be created, because for all the rules to be

04:34.800 --> 04:36.910
created, all the variables have to be present.

04:36.940 --> 04:39.530
Right now we have a subset of variables.

04:39.690 --> 04:43.620
So now the rules which will be generated will also be fewer in number.

04:44.310 --> 04:48.000
Now this will help in creating different decision trees.

04:48.240 --> 04:55.920
If every tree had all the variables, then it would automatically find the same best split.

04:56.700 --> 05:04.290
And the best split would anyhow come out the same when the Gini value was calculated. So because we have taken a

05:04.290 --> 05:10.620
subset of variables, the Gini value will again be different for these different splits,

05:10.620 --> 05:14.050
which will be created by comparing different rules.
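
NOTE
The Gini value mentioned in this passage can be sketched in a few lines of plain Python.
This is a minimal illustration and not the speaker's notebook code: the function names and
the example labels are assumptions made for this note. A lower weighted Gini impurity means a purer split.
```python
from collections import Counter
#
def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
#
def split_gini(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
#
# Hypothetical labels: the same eight rows split by two different rules.
split_by_age = split_gini(["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"])
split_by_children = split_gini(["yes", "no", "yes", "no"], ["yes", "no", "no", "yes"])
print(split_by_age, split_by_children)   # 0.375 0.5
```
In this made-up case the age rule gives the lower value, so a tree that sees both variables would pick it; a tree that does not see age has to start from another rule.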

05:14.340 --> 05:19.470
So now the levels in each decision tree will again be different.

05:19.860 --> 05:27.390
So one decision tree might start with age as a rule and another decision tree might start with number

05:27.390 --> 05:32.490
of children as a rule, because now some variables might be missing in one or the other decision tree.

05:34.700 --> 05:41.960
So in order to understand this, let us assume that we have ten thousand observations and 100 variables

05:41.960 --> 05:47.620
in the dataset. In a random forest, many decision trees are built

05:47.630 --> 05:54.400
instead of just one single tree. For example, let us consider that 500 trees are being built.

05:54.440 --> 06:02.270
But now if we build these trees using the same ten thousand observations 500 times, we will end up with

06:02.270 --> 06:03.560
the same tree 500 times.

06:03.560 --> 06:08.380
So now we will subsample the data and we will subsample the variables.

06:08.660 --> 06:16.220
So instead of considering all the ten thousand rows of data, we will take, let us say, a thousand or five

06:16.220 --> 06:18.260
thousand rows of data for each tree.

06:18.470 --> 06:23.780
Or let us say we can take only two thousand rows of data for each tree. And for the variables,

06:23.780 --> 06:31.960
instead of all one hundred variables, we can take, let us say, twenty variables at each iteration.

06:32.360 --> 06:37.250
This is how we can actually create randomness in our trees.
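
NOTE
The two levels of randomness described here can be sketched with the standard library alone.
This is a minimal sketch under stated assumptions: the sizes come from the spoken example
(ten thousand rows, 100 variables, 500 trees, two thousand rows and twenty variables per draw),
while the function names and the fixed seed are hypothetical. No actual trees are trained.
```python
import random
#
N_ROWS, N_VARS = 10_000, 100               # dataset size from the spoken example
N_TREES = 500                              # number of trees in the forest
ROWS_PER_TREE, VARS_PER_NODE = 2_000, 20   # illustrative subsample sizes
#
def draw_tree_rows(rng):
    """First level: sample row indices with replacement for one tree."""
    return [rng.randrange(N_ROWS) for _ in range(ROWS_PER_TREE)]
#
def draw_node_variables(rng):
    """Second level: sample distinct candidate variables for one split."""
    return rng.sample(range(N_VARS), VARS_PER_NODE)
#
rng = random.Random(0)                     # fixed seed so the sketch is repeatable
forest_rows = [draw_tree_rows(rng) for _ in range(N_TREES)]
node_vars_a = draw_node_variables(rng)
node_vars_b = draw_node_variables(rng)
print(len(forest_rows), len(forest_rows[0]), len(node_vars_a))
```
Because every tree gets its own rows and every split its own candidate variables, two trees rarely end up with the same rules.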

06:39.270 --> 06:46.780
So this will ensure that every node will have a fresh, random subset of variables.

06:46.890 --> 06:52.820
And from this, the best rule which will be picked up will be different from the others.

06:53.200 --> 06:59.220
Now, what will happen is that earlier, the large decision tree which was being created would have

06:59.430 --> 07:05.820
a lot of noise, and it would have captured patterns which didn't really exist, because of the presence

07:05.820 --> 07:09.510
of the noisy data points.

07:09.640 --> 07:14.940
So now the forest does not capture that because, anyhow, we will have to take the average value

07:15.270 --> 07:22.290
or the majority vote. Now, because some noisy patterns will not be present in all the trees,

07:22.620 --> 07:29.700
the impact of the noisy variables will be diminished.

07:29.730 --> 07:35.940
So the noisy variables will not show up or will not have a high impact on the decisions which we are

07:35.940 --> 07:36.380
making.

07:36.630 --> 07:42.810
Hence, the performance of the random forest will be much, much better in comparison to that of the

07:43.050 --> 07:43.630
decision tree.

07:45.710 --> 07:54.020
So in short, the first randomness will remove the effect of the noisy observations and the second randomness

07:54.020 --> 07:57.240
will remove the effect of the noisy variables.

07:57.770 --> 08:04.190
So the final predictions made by the random forest model will be a majority vote of all the trees in

08:04.190 --> 08:08.090
the forest in the case of classification. And for regression,

08:08.090 --> 08:14.570
the prediction will be the average of the predictions made by the five hundred decision trees.
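
NOTE
How the forest combines its trees can be shown with a small standard-library sketch.
The function names and the five example predictions are hypothetical and chosen only
to illustrate majority voting for classification and averaging for regression.
```python
from collections import Counter
from statistics import mean
#
def majority_vote(tree_predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
#
def average_prediction(tree_predictions):
    """Regression: return the mean of the per-tree predictions."""
    return mean(tree_predictions)
#
# Hypothetical outputs of five trees for a single observation.
class_votes = [">50K", "<=50K", ">50K", ">50K", "<=50K"]
value_votes = [41.0, 39.5, 40.5, 42.0, 37.0]
print(majority_vote(class_votes))        # >50K
print(average_prediction(value_votes))   # 40.0
```
A single tree that was misled by noise is outvoted or averaged out by the others.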

08:15.930 --> 08:23.720
Now that we have discussed all this, let us go ahead with the implementation of random forest.

08:23.950 --> 08:25.840
Now, what do we have now?

08:25.860 --> 08:27.800
We will build the random forest model.

08:28.020 --> 08:34.280
So for this, we will use some hyperparameters of the decision trees, amongst others,

08:34.280 --> 08:38.250
since the random forest is ultimately a collection of decision trees.

08:38.730 --> 08:43.270
So we will use all the hyperparameters which we have discussed earlier.

08:43.650 --> 08:48.210
We have discussed certain hyperparameters which were present for the decision tree.

08:48.720 --> 08:55.950
So we will use all of these hyperparameters, and apart from that, we will have some additional

08:55.950 --> 09:01.910
hyperparameters, which are n_estimators, max_features and bootstrap.

09:02.130 --> 09:08.460
Now, what is n_estimators? n_estimators is the number of trees to be built in the forest.

09:10.090 --> 09:18.430
So the default value is ten in older scikit-learn versions (100 since version 0.22), and a good starting point can be a hundred. So we can set the

09:18.430 --> 09:25.880
value as a hundred or, let's say, one hundred, five hundred, a thousand, seven hundred, two hundred.

09:26.140 --> 09:32.320
These would be different values of n_estimators, and then we can decide after using the model

09:33.320 --> 09:39.100
which n_estimators value actually gives a better result.

09:41.450 --> 09:42.380
Next is.

09:43.500 --> 09:50.640
max_features. Now, max_features is the number of features being considered for selecting

09:50.640 --> 09:53.150
the best split at each node.

09:53.490 --> 09:58.340
So the value of this parameter should not exceed the total number of features available.

09:58.710 --> 10:05.160
So max_features should be kept as a subset of the total number of features.

10:05.170 --> 10:10.340
So initially, let's say we had two hundred features, right?

10:10.530 --> 10:13.200
So at maximum, the value could be two hundred.

10:13.380 --> 10:20.300
But we should keep the value somewhere between, say, fifty, one hundred and one fifty.

10:20.400 --> 10:26.790
We should try different values between these so that we could find out the optimal number of maximum

10:26.790 --> 10:27.390
features.

10:28.900 --> 10:37.000
Next is bootstrap. Now, bootstrap allows for sampling with replacement or without it, so it takes

10:37.000 --> 10:37.930
a Boolean value.

10:37.930 --> 10:42.090
If it is True, then the sampling will be done with replacement.

10:42.310 --> 10:45.870
And if it is False, then no bootstrap sample is drawn; in scikit-learn the whole dataset is used to build each tree.

10:45.880 --> 10:48.160
That is, each row would be taken only once.

10:48.520 --> 10:51.880
That is, a particular value will be used only once.

10:52.600 --> 10:57.220
So there will be no duplicate rows of data when we are sampling the data.
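
NOTE
The difference between drawing rows with and without replacement can be seen in a tiny
standard-library sketch. The ten row indices and the seed are hypothetical; this only
illustrates the sampling idea and does not reproduce scikit-learn's internals.
```python
import random
#
rng = random.Random(42)           # fixed seed so the sketch is repeatable
rows = list(range(10))            # ten hypothetical row indices
#
# With replacement: a row can be drawn more than once, others not at all.
with_replacement = rng.choices(rows, k=len(rows))
# Without replacement: every drawn row is distinct.
without_replacement = rng.sample(rows, k=len(rows))
#
print(sorted(with_replacement))
print(sorted(without_replacement))
```
The first list usually shows repeated indices and some missing ones, while the second always contains each index exactly once.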

10:58.510 --> 11:00.160
Now, let us begin.

11:00.460 --> 11:07.690
So what we will be doing is we will import the RandomForestClassifier from sklearn.ensemble.

11:08.850 --> 11:11.820
Next, we will create an object,

11:13.120 --> 11:16.240
clf, of the RandomForestClassifier.

11:20.120 --> 11:24.500
Next, what we will be doing is we will create a dictionary.

11:25.590 --> 11:32.430
Now, this dictionary has different values of the hyperparameters that we will be trying, to figure out the model

11:32.460 --> 11:33.910
giving the best performance.

11:34.170 --> 11:42.120
Now, apart from n_estimators, max_features and bootstrap, which are specific to the random forest, the rest

11:42.120 --> 11:47.670
of the hyperparameters which have been chosen are the hyperparameters which belong to the

11:47.880 --> 11:48.780
decision tree.

11:49.050 --> 11:52.040
So the decision tree is the base classifier.

11:52.050 --> 11:58.150
So we will be using the base classifier and we will be passing these hyperparameters to it.

11:58.440 --> 12:05.470
Now, you can choose either of the methods, that is, either RandomizedSearchCV or GridSearchCV.

12:05.820 --> 12:11.490
So it is completely your decision which method you want to use for this.

12:11.700 --> 12:14.010
And I have already explained RandomizedSearchCV

12:14.010 --> 12:16.410
and GridSearchCV separately.

12:16.620 --> 12:23.010
So you can go to that particular video and decide yourself which method you want to take.

12:26.260 --> 12:35.640
Next, we can simply get a list of parameter distributions. You can use this extensive list of parameter

12:35.640 --> 12:40.570
distributions or create your own list of parameter distributions.
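
NOTE
A parameter distribution of this kind is just a dictionary of candidate values. The sketch
below counts how many combinations a full grid would need and draws a few random ones, which
is the idea behind a randomized search. The keys are real scikit-learn parameter names, but
the candidate values and the helper name sample_candidates are assumptions, not the list used in the video.
```python
import math
import random
#
# Hypothetical search space; values are illustrative only.
param_dist = {
    "n_estimators": [100, 200, 500, 700, 1000],
    "max_features": [50, 100, 150],
    "bootstrap": [True, False],
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 5, 10],
}
#
# A full grid search would fit one model per combination.
n_combinations = math.prod(len(values) for values in param_dist.values())
#
def sample_candidates(space, n_iter, seed=0):
    """Randomized-search idea: draw n_iter random settings from the space."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_iter)]
#
candidates = sample_candidates(param_dist, n_iter=10)
print(n_combinations, len(candidates))   # 360 10
```
Even this small space has 360 combinations, which is why fitting only a sampled handful is attractive when each fit is slow.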

12:42.080 --> 12:51.470
I will not be running any training in these sessions because the model fitting takes a lot of time and that cannot

12:51.770 --> 12:54.100
be accommodated in the video recording.

12:54.290 --> 12:57.590
So I have simply skipped the training part.

12:57.860 --> 13:00.080
So you can simply run these trainings.

13:00.290 --> 13:04.180
And when the model is training, it might take a lot of time.

13:04.430 --> 13:06.730
So you need to be patient for that time.

13:07.490 --> 13:14.810
So for this particular parameter list, there are around sixty-nine thousand one hundred twenty combinations.

13:15.260 --> 13:21.410
You can see there are [inaudible] combinations present.

13:22.730 --> 13:27.520
So we will not be doing all that; we will be fitting only a small number of combinations.

13:28.220 --> 13:36.770
So what we can do is simply run the estimator with the parameter distributions.

13:38.740 --> 13:42.610
So let me remove this, this is not needed.

13:48.420 --> 13:57.000
So you can simply run the estimator and then get the best classifier from that, and after getting

13:57.000 --> 14:04.500
the best classifier, you can simply put the values of the best classifier inside the

14:04.500 --> 14:05.280
RandomForestClassifier.

14:06.190 --> 14:10.480
And after that, you can do the model fit.

14:11.770 --> 14:18.310
After doing the model fit, you can obtain the feature importances.

14:20.000 --> 14:25.010
What is feature importance? Feature importance is...

14:26.850 --> 14:34.410
In sklearn, one of the ways in which feature importance is described is by something called mean

14:34.410 --> 14:43.770
decrease in impurity. That is, the mean decrease in impurity is the total decrease in node impurity averaged

14:43.770 --> 14:46.650
over all the trees built in the random forest.

14:47.010 --> 14:56.580
So what this will be doing is this will basically allow us to compare all the trees so it will compare

14:56.580 --> 15:05.200
all the trees, and from all the trees that it has compared, it will find out the average of the impurity

15:05.200 --> 15:05.860
reduced.

15:06.090 --> 15:14.130
So for each and every feature, it will compare how much impurity this feature has helped to

15:14.150 --> 15:18.620
reduce, and then, based on the impurity reduced

15:19.820 --> 15:26.930
by a particular feature, it will be sorting them and providing the same as the feature importance.

15:27.290 --> 15:32.440
So if there is a big decrease in impurity, then the feature is important.

15:32.780 --> 15:35.910
If there is not, then the feature is not important.

15:36.170 --> 15:39.560
Now, in this particular notebook, what we are doing is

15:40.940 --> 15:44.900
obtaining the feature_importances_ attribute.

15:46.010 --> 15:51.110
So the use of this attribute will make sense only after the model has been fit.

15:51.470 --> 15:58.010
So we store the feature importances along with the corresponding feature names in a DataFrame and then sort

15:58.010 --> 16:00.370
them according to the importance.
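
NOTE
The averaging and sorting described here can be sketched without any library. This is a
simplified illustration of mean decrease in impurity, not scikit-learn's exact computation:
the three feature names and the per-tree numbers are hypothetical.
```python
from statistics import mean
#
# Hypothetical impurity decrease credited to each feature in three trees.
per_tree_decrease = {
    "age": [0.10, 0.14, 0.12],
    "education_num": [0.20, 0.18, 0.22],
    "capital_gain": [0.05, 0.07, 0.06],
}
#
# Average over the trees, then normalise so the importances sum to one.
mean_decrease = {name: mean(vals) for name, vals in per_tree_decrease.items()}
total = sum(mean_decrease.values())
importances = {name: value / total for name, value in mean_decrease.items()}
#
# Sort from most to least important, like the sorted table of importances.
ranking = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```
Keeping only the first few entries of such a ranking is the feature-reduction step discussed later in the session.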

16:00.800 --> 16:02.150
So here you can see.

16:06.350 --> 16:12.940
The different values. So you can see that the importance for marital status is zero point one two

16:13.010 --> 16:20.210
three, the importance for capital gain is zero point one [inaudible], and for education number it is zero point

16:20.210 --> 16:21.380
one five six six.

16:21.560 --> 16:30.140
So we can see the different importance values. We can observe that the marital status spouse column is

16:30.140 --> 16:35.600
identified as the most important variable, while the column

16:36.780 --> 16:45.960
marital status underscore separated is the least important. So random forest can also be used as

16:45.960 --> 16:47.980
a dimensionality reduction technique.

16:48.240 --> 16:54.870
So we have discussed lasso regression, where we actually used the feature importance

16:54.870 --> 16:55.890
obtained from lasso.

16:56.370 --> 17:04.440
Similarly, we can use the random forest to obtain the feature importance and then remove the

17:04.440 --> 17:10.160
columns which are of less importance from this particular list, which we have got.

17:10.590 --> 17:17.640
So in the case of high-dimensional data, we can run a random forest and get the features sorted

17:17.640 --> 17:24.840
by importance, and we can go ahead and choose the top fifteen or twenty or twenty-five features

17:25.050 --> 17:26.580
for further processing.

17:26.940 --> 17:33.240
Now, using this technique, we would not be losing much relevant information because most of the relevant information

17:33.240 --> 17:36.060
is present in these columns.

17:37.270 --> 17:43.040
So at the same time, we will be drastically reducing the number of features considered.

17:43.210 --> 17:50.650
So what we can do is consider only the top 15 or 20 columns and get rid

17:50.650 --> 17:51.040
of the

17:52.290 --> 18:00.690
rest of the columns. So if we do not use the random forest model as the final model, we can at

18:00.690 --> 18:04.770
least use the random forest for reducing those features.

18:04.950 --> 18:11.160
So what we can do as a process is, first of all, implement the linear model, that is, linear

18:11.160 --> 18:15.420
regression or logistic regression, and once we have implemented that,

18:17.190 --> 18:25.470
We will get to know what columns are important and what features are important, and then we can keep

18:25.470 --> 18:26.370
those features.

18:26.670 --> 18:33.300
Again, we can implement a decision tree and decide upon which features have higher importance and keep

18:33.310 --> 18:34.600
only those features.

18:34.890 --> 18:37.920
This will allow us to reduce the number of columns.

18:38.430 --> 18:40.950
It will allow us to reduce the number of features.

18:42.290 --> 18:48.980
Now, once we have reduced the number of features, then we can implement XGBoost or any other algorithm

18:48.980 --> 18:56.320
of our choice and compare the performance of random forest with XGBoost and SVM and naive Bayes.

18:56.360 --> 19:02.600
So all of these algorithms can be compared, and then we can find out which one has a better performance.

19:03.620 --> 19:10.790
In case we are not satisfied with any of these performances, then we can go ahead with using methods

19:10.790 --> 19:11.800
like stacking.

19:12.620 --> 19:21.000
Then we will have [inaudible], so we can use that one. But at least implement random forest or linear

19:21.170 --> 19:27.800
models to at least obtain the feature importance, so that you will have knowledge of what feature

19:27.800 --> 19:30.610
is actually important and what feature is not important.

19:31.160 --> 19:35.690
So it can be used as one of the methods for dimensionality reduction.

19:37.880 --> 19:48.500
Now, let us discuss partial dependence plots. So the random forest model is a black box, obviously, because

19:48.500 --> 19:55.050
we will not be able to understand how each and every model works inside this, how each and every tree is

19:55.130 --> 19:56.360
created inside this.

19:56.510 --> 19:58.970
And it will be really difficult to find that out.

19:59.540 --> 20:07.220
So in this, we do not get any coefficients for the variables, and hence we are unable to make any interpretation

20:07.220 --> 20:13.280
of the effect of different variables on the response, whereas in the case of linear regression or logistic regression,

20:13.490 --> 20:16.410
it is a normal model.

20:16.430 --> 20:18.080
It is not a black box model.

20:18.080 --> 20:19.910
We are able to see the coefficients.

20:20.280 --> 20:22.280
It is much more transparent.

20:22.850 --> 20:31.970
So what will happen is, in order to make this particular random forest interpretable and to make the

20:32.210 --> 20:37.760
interpretation of the different variables which are used in that random forest, we need to make a

20:37.760 --> 20:44.290
prediction on the entire data and average it for the variable, which we are interested in.

20:45.820 --> 20:53.320
So, for example, let's consider the case of education in this. So we can set the variable name as

20:53.320 --> 21:01.210
education number and get the predictions for X_train, that is, the predicted values.

21:02.790 --> 21:11.100
Next, what we are doing is we are creating a new DataFrame; that is, we are just simply concatenating the

21:11.100 --> 21:15.380
variable, that is, education number, and the corresponding response with it.

21:16.160 --> 21:22.290
Now we are simply drawing the scatter plot for the variable and the response.

21:22.950 --> 21:25.980
And we have not just for the discussion for this one.

21:27.140 --> 21:34.820
Now, you can see that when we plot the two columns, education number against the response, as shown in

21:34.820 --> 21:35.390
the plot below,

21:37.490 --> 21:41.900
it is not very informative in nature.

21:41.900 --> 21:47.810
That is, there are a lot of variations since the response contains the effect of other variables.

21:48.380 --> 21:55.030
Right, here we have a lot of dots and we cannot really make out what this really means.

21:55.490 --> 22:03.890
So we will plot a LOWESS trend line on top of this, which will give an approximate effect of the variable

22:03.890 --> 22:05.980
education number on the response.

22:06.320 --> 22:10.930
So we will basically average the response at each education number value.
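
NOTE
Averaging the response at each value of a variable can be sketched with the standard library.
This is a minimal illustration of the idea behind the smoothed curve, not the LOWESS smoother
itself: the function name, the education-number values and the predicted probabilities are hypothetical.
```python
from collections import defaultdict
from statistics import mean
#
def average_response_by_value(values, predictions):
    """Average the model's predictions at each distinct value of a variable."""
    groups = defaultdict(list)
    for value, prediction in zip(values, predictions):
        groups[value].append(prediction)
    return {value: mean(group) for value, group in sorted(groups.items())}
#
# Hypothetical education-number values and predicted probabilities.
education_num = [9, 9, 10, 13, 13, 13, 14]
predicted = [0.10, 0.20, 0.25, 0.40, 0.50, 0.60, 0.70]
curve = average_response_by_value(education_num, predicted)
print(curve)
```
Plotting the averaged values against the variable gives one point per value instead of a cloud of dots, which makes the trend easier to read.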

22:11.300 --> 22:18.620
So what we will be doing is we are simply importing statsmodels.api as sm, and we are using

22:18.890 --> 22:26.750
sm.nonparametric.lowess, which gives a smoothed curve of the data, and we are plotting it

22:26.960 --> 22:28.030
using plot.

22:30.000 --> 22:33.720
So now if you see, this is the curve which we get.

22:35.050 --> 22:44.020
Now, in this plot, we notice that as the education number value goes up, the chances of having income

22:44.020 --> 22:47.980
greater than fifty thousand dollars go up.

22:48.130 --> 22:55.120
So as the education goes up, the response also goes up.

22:55.750 --> 22:56.060
Right.

22:56.320 --> 23:00.610
The education number variable does not have much impact on the response

23:00.850 --> 23:03.200
till the value is around [inaudible].

23:03.250 --> 23:07.330
Until then, there is not much impact of education number on the response.

23:07.480 --> 23:14.230
But as the education number goes above 11, you can see the response value is increasing very much,

23:14.500 --> 23:14.820
right?

23:14.830 --> 23:21.940
There is a steep rise, indicating that with higher education, the probability of earning more than fifty thousand

23:21.940 --> 23:29.260
dollars increases. So this can be done for any other variable that we wish to check, although it does

23:29.260 --> 23:33.070
not make sense for dummy variables, since they have only two values.

23:33.100 --> 23:37.260
So we cannot really compare because there will be only two values. Here

23:37.270 --> 23:43.540
we have eight values present, so it is a clearer picture, but for a categorical variable

23:43.540 --> 23:47.800
we might not get a clear picture. But it will work really fine for the

23:48.790 --> 23:53.350
continuous variables. So we can create this partial dependence plot also.

23:55.060 --> 24:02.690
So you can create this partial dependence plot and use the feature importance obtained from the

24:02.800 --> 24:08.300
random forest, as it will help a lot with dimensionality reduction.

24:08.560 --> 24:11.110
So that is about the random forest.

24:11.320 --> 24:17.990
In the next session, we will be covering XGBoost, which is based on the boosting algorithm.

24:18.220 --> 24:21.340
So let us go ahead with that.