WEBVTT

00:01.200 --> 00:08.280
In this particular session, we will discuss the implementation of decision trees. This code

00:08.280 --> 00:14.820
file contains the implementation for random forest also, but that is something which we will be covering

00:15.060 --> 00:19.360
once we have already discussed the theory for random forest.

00:20.490 --> 00:25.290
So the first thing we will do is import the required libraries.

00:25.500 --> 00:32.040
That is, pandas as pd and NumPy as np, and from sklearn,

00:33.630 --> 00:36.310
the tree module: we will import tree from sklearn.

00:36.660 --> 00:39.970
We will import NumPy as np; this is not required,

00:39.990 --> 00:43.880
as it is repeated. Then we have sklearn.metrics.

00:44.310 --> 00:53.940
sklearn.metrics contains a lot of metrics which we can use for analyzing and for evaluating our

00:53.940 --> 00:54.470
models.

00:55.580 --> 01:04.580
So I would suggest you go to sklearn.metrics, check the documentation from sklearn, and see

01:04.580 --> 01:13.820
what metric is used for what purpose. I will do a separate video for different types of metrics, so

01:13.820 --> 01:15.030
don't worry about that.

01:15.500 --> 01:24.110
So for now, we will import the AUC score, and we will be using %matplotlib inline, which means

01:24.110 --> 01:25.010
that we want to

01:27.110 --> 01:31.400
print the plots within this notebook itself.
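
NOTE A minimal sketch of the imports described above. The transcript only says "the AUC score", so roc_auc_score is an assumption about which sklearn.metrics function is meant.
```python
import numpy as np
import pandas as pd
from sklearn import tree
# Assumption: "the AUC score" refers to roc_auc_score.
from sklearn.metrics import roc_auc_score
# In a Jupyter notebook you would also run the magic command
#   %matplotlib inline
# so that plots are drawn inside the notebook itself.
print(pd.__name__, np.__name__, tree.__name__, roc_auc_score.__name__)
```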

01:32.490 --> 01:41.460
So once we run this, we will import the ci_train data. The ci_train data will be coming from this particular

01:41.480 --> 01:41.820
file.

01:44.270 --> 01:46.070
So we will import the file,

01:47.280 --> 01:52.350
census income, and we will read this file using

01:53.470 --> 01:55.300
pd.read_csv.

01:56.370 --> 02:03.960
Now, the file can be placed in any folder, but currently this file is placed in the same

02:03.960 --> 02:12.180
folder as this Jupyter notebook, which is the reason why I have simply given the file name. In case

02:12.180 --> 02:15.000
the file was placed in any other directory or folder.

02:15.180 --> 02:18.570
then I would have written /data,

02:19.170 --> 02:21.600
/some folder name,

02:23.430 --> 02:34.200
and so on for the path. And this raw string prefix, r, will actually help in reading the characters

02:34.200 --> 02:42.340
of this path in their raw form, so that these values are not rendered as some other character.
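
NOTE A small sketch of why the r prefix matters for a file path. The folder and file names here are hypothetical, chosen only because they contain backslash sequences that Python would otherwise interpret.
```python
# Without the r prefix, "\n" becomes a newline and "\t" becomes a tab.
plain_path = "data\new_folder\train.csv"
# With the r prefix, every backslash is kept as a literal character.
raw_path = r"data\new_folder\train.csv"
print(repr(plain_path))
print(repr(raw_path))
# The raw path is what you would pass on, for example:
#   ci_train = pd.read_csv(raw_path)
```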

02:43.200 --> 02:48.800
So we will run this and import the data into ci_train.

02:50.480 --> 02:58.490
Now, in case you have test data, then you can combine the test data, as in the earlier model,

02:58.490 --> 02:59.670
as we have shown.

02:59.840 --> 03:10.120
So what you can simply do is you can combine the data in the same data frame and from the data frame

03:10.130 --> 03:12.790
itself, you can give it a name.

03:12.950 --> 03:15.080
So we'll see later how we can do that.

03:17.030 --> 03:20.300
So for now we have this data, ci_train.

03:21.180 --> 03:28.680
And we have called head, so we are checking the details which are present in this particular data frame,

03:29.040 --> 03:33.550
so we have age, which is numeric in nature.

03:33.930 --> 03:36.360
Then we have workclass.

03:37.740 --> 03:43.540
Workclass contains values like State-gov and Self-emp, so it tells us which class

03:44.100 --> 03:50.670
someone belongs to: whether someone is doing a government job or a private job, or is self-employed.

03:50.680 --> 03:55.970
So all these details are present in workclass. Then we have fnlwgt, or final weight.

03:56.460 --> 04:02.280
We don't really know what this value stands for, but we have this numeric value here.

04:03.120 --> 04:04.660
Next is education.

04:05.010 --> 04:11.990
Now, this education value has Bachelors, Bachelors, HS-grad, 11th, Bachelors.

04:12.150 --> 04:15.000
So these are different values which are present in the education.

04:15.300 --> 04:20.610
And beside this, we have another column, that is education

04:20.610 --> 04:21.210
number.

04:22.550 --> 04:28.760
This education number, again, has values 13, 13, 9, 7, 13, so

04:30.340 --> 04:36.490
it seems that these are numeric codes for the education level, but we will evaluate that later.

04:37.650 --> 04:46.110
Next, we have marital status. There are multiple statuses like Never-married, Married-civ-spouse,

04:46.440 --> 04:50.670
then Divorced. Then we have occupations.

04:50.670 --> 04:52.350
These occupations are

04:53.650 --> 05:05.470
Adm-clerical, Exec-managerial, Handlers-cleaners, Prof-specialty. Then we have relationship,

05:05.470 --> 05:13.160
that is, whether someone is not in a family, or someone is a husband, or someone is a wife.

05:13.450 --> 05:16.530
So these are the different relationship statuses of the people.

05:18.060 --> 05:25.140
Next, we have race, which is whether someone is White, Black, Asian, whatever it is. Then

05:25.140 --> 05:28.840
we have sex that is male, female, these values.

05:29.430 --> 05:37.560
Next, we have values like capital gain, capital loss, then hours per week, how many hours per week

05:37.560 --> 05:41.390
a person works, then the native country for that person.

05:41.670 --> 05:45.060
And finally, we have the y value, which we want to predict.

05:45.300 --> 05:51.630
That is, whether the person is earning less than or equal to 50K, or greater than 50K.

05:53.470 --> 05:58.840
So from this data set, you can see that y is the value which we want to predict.

05:59.770 --> 06:08.800
And the rest of the columns are actually the X values, which we want to use as the features or attributes

06:09.070 --> 06:12.750
or input values or independent variables, you can say.

06:13.060 --> 06:20.170
And y is the label or target, or you can also say the dependent variable.

06:20.200 --> 06:22.870
So these are the different terms which we use for y.

06:24.530 --> 06:29.930
Now we can look at the head value for the train dataset.

06:30.320 --> 06:36.560
So this is the train dataset again, just to have a look. Now, what we can do is,

06:39.360 --> 06:46.560
Here I am showing the values which have been created after the transformation.

06:46.860 --> 06:53.860
Now you can see the values have been converted into numeric values, which is the target here.

06:54.120 --> 07:01.830
So our main target is to convert all the data into numeric data, instead of it being

07:03.140 --> 07:05.210
categorical or any other kind of data.

07:05.330 --> 07:11.090
So all of this data has to be converted into numeric data. Now, what all can we do to convert

07:11.090 --> 07:12.370
this into numeric data?

07:12.380 --> 07:16.510
Because the algorithm cannot really understand these words.

07:16.910 --> 07:22.500
The algorithm does not know what Bachelors means, what Private means, what Husband means, or what

07:22.640 --> 07:23.130
White means.

07:23.300 --> 07:26.780
So all these things are nothing for the algorithm.

07:26.810 --> 07:31.130
It will only understand what 39, 0, 1, these numbers are.

07:31.160 --> 07:32.930
So it will only understand the numbers.

07:33.110 --> 07:38.840
So our main focus here is to convert all this data into numeric form.

07:38.890 --> 07:40.490
So now how will we do that?

07:41.690 --> 07:48.290
So let us have a look at each and every column. The first column which we have is age. Now this

07:48.290 --> 07:50.660
column is already in a numeric form.

07:51.500 --> 07:53.690
Next, we have workclass.

07:53.900 --> 07:57.950
So workclass is in a categorical form.

07:58.100 --> 08:02.380
So we will have to convert these categories into numbers.

08:02.480 --> 08:03.830
And how do we do that?

08:03.830 --> 08:08.600
We can do that by using one-hot encoding, or dummy creation.

08:10.120 --> 08:16.630
Next, we have fnlwgt. This, again, looks like a numeric variable, but in case it is not, we will

08:16.630 --> 08:22.450
convert it into numeric with errors set to coerce.
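
NOTE A minimal sketch of the numeric conversion mentioned above, assuming it refers to pd.to_numeric with errors="coerce". The sample values are made up for illustration.
```python
import pandas as pd
# Hypothetical raw column where one entry is not a valid number.
raw = pd.Series(["77516", "83311", "unknown"])
# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
fnlwgt = pd.to_numeric(raw, errors="coerce")
print(fnlwgt)
```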

08:23.110 --> 08:24.810
Next, we have education.

08:25.000 --> 08:28.680
Now, again, it looks like a categorical variable.

08:28.960 --> 08:33.850
So we will see what all categories are present here and here.

08:33.850 --> 08:35.530
We have education number.

08:35.740 --> 08:41.830
So we will try to compare the relationship between education and education number and find out if these

08:41.830 --> 08:42.930
are related or not.

08:44.630 --> 08:51.920
Next, we have marital status. This, again, looks like a categorical column, so we will have to convert

08:51.920 --> 08:58.430
this into one-hot encoding. Next we have occupation, which again would have to be converted into

08:58.430 --> 08:58.650
one-hot encoding.

08:58.830 --> 09:07.140
The same applies to columns like relationship, race, sex, then the column native country.

09:07.280 --> 09:13.100
So all of these columns, and the y column, have to be converted into a dummy form.

09:14.730 --> 09:21.730
So now let us do one thing: let us apply pandas profiling to it.

09:21.800 --> 09:29.610
So I have just applied pandas profiling by importing ProfileReport from pandas_profiling, and I

09:29.610 --> 09:35.590
have generated the report by passing ci_train into it.

09:35.880 --> 09:39.460
So this is the profile report which I have obtained from this.

09:39.750 --> 09:45.330
So here you can see that the number of variables is 15. Numeric variables are six,

09:45.330 --> 09:47.070
categorical variables are nine.

09:48.490 --> 09:58.570
Apart from this, the data set has twenty-four duplicate rows, capital gain has ninety-one percent zeros,

09:58.570 --> 10:01.510
and capital loss has ninety-five percent zeros.

10:02.520 --> 10:06.270
Next, you can see the type of values which are present in age.

10:07.140 --> 10:09.540
Here is the detail of the capital gain.

10:09.570 --> 10:14.130
Here are the details of the capital loss. For education,

10:16.610 --> 10:28.400
You can see that the top four categories contain most of the data, while these categories contain less

10:28.400 --> 10:30.060
than five percent of the data.

10:30.380 --> 10:37.270
So we can decide if we should keep the top five categories, or the top four categories only, and

10:37.640 --> 10:39.590
we can remove the other categories.

10:39.740 --> 10:47.540
So for this, we can actually run different variants, one variant with the top categories and another

10:47.540 --> 10:54.500
variant with all the categories present. And the next thing which we will be learning is getting the feature

10:54.500 --> 10:55.230
importance.

10:55.430 --> 11:03.890
So we were able to get feature importance from the logistic regression also, so we can use that and obtain

11:03.890 --> 11:07.800
the important features from that as well.

11:09.190 --> 11:13.190
So next, what we have is these details.

11:13.210 --> 11:15.790
So, again, we have education number.

11:17.480 --> 11:20.960
Here we have fnlwgt and hours per week.

11:23.050 --> 11:25.330
You can see there are four different values.

11:26.720 --> 11:36.080
So here are the different values, which are Married-civ-spouse, Never-married, Divorced, then Separated,

11:36.290 --> 11:40.240
then Widowed, Married-spouse-absent, Married-AF-spouse.

11:40.490 --> 11:48.770
So out of all of these, we can clearly select only the top three categories and keep other categories

11:48.770 --> 11:49.610
in others.

11:51.860 --> 12:00.090
Next, we have these values where the majority of people are actually from the United States.

12:00.350 --> 12:06.080
And next, there are a lot of countries for other people.

12:06.290 --> 12:15.260
So we can keep the first category, that is United States, and make another category as non-United States.
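
NOTE A minimal sketch of collapsing the less frequent categories into one bucket, as described for marital status and native country. The helper name keep_categories and the sample values are hypothetical.
```python
import pandas as pd
def keep_categories(values, keep, other="Other"):
    # Keep the listed categories and replace everything else with `other`.
    return values.where(values.isin(keep), other)
# Made-up sample of the native country column.
country = pd.Series(["United-States", "Mexico", "United-States", "India"])
grouped = keep_categories(country, ["United-States"], other="Non-United-States")
print(grouped.tolist())
```
The same helper could be reused for race or marital status by passing the top categories to keep.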

12:17.820 --> 12:27.870
Next, we have occupation, so for occupation, you can see that we have one, two, three, four,

12:27.870 --> 12:36.240
five, six, seven, eight, eight categories which have more than five percent of the data, so

12:36.240 --> 12:42.240
we can decide whether we want to keep the rest of the categories or put them into an others category,

12:42.480 --> 12:45.110
or whether we should keep those categories at all.

12:49.450 --> 12:57.850
Then we have race. For this, again, we have eighty-five percent White, nine point six percent

12:57.850 --> 13:00.080
Black and three point two percent Asian.

13:00.310 --> 13:07.720
So what we can do is we can simply have three categories, one being white, another being black, and

13:07.720 --> 13:11.590
the rest of the values could be put into an others category.

13:13.700 --> 13:21.890
Then we have relationship. In relationship, again, as these four seem to be the major categories,

13:22.100 --> 13:28.580
we can create these four major categories, and the rest we can put into an others category.

13:31.720 --> 13:36.380
Or, as there is only one category which would be left, we can also keep all of these.

13:36.670 --> 13:39.190
So that is what we can do for this.

13:41.040 --> 13:48.240
Then, because sex has only two values, we can easily create one variable for that.

13:49.350 --> 13:54.540
Next, we have workclass. For this workclass, again, we have

13:56.100 --> 14:01.920
around six prominent categories, and apart from that, all could be put into others.

14:03.910 --> 14:12.640
And then here we can see that, for the y values, less than or equal to 50K has almost twenty-four thousand values and

14:12.640 --> 14:15.410
greater than 50K has around seven thousand values.

14:15.640 --> 14:20.830
So here you can see that the ratio is quite different.

14:20.860 --> 14:27.580
That is almost seventy-six percent of the values in one class and twenty-four percent of the values in

14:27.580 --> 14:28.450
the other class.

14:28.660 --> 14:35.770
So when we have this kind of data, where one class has more values and the other has fewer

14:35.770 --> 14:36.320
values,

14:36.580 --> 14:39.940
this means that the classes are unbalanced.

14:40.720 --> 14:49.420
So when we have such unbalanced classes, what can happen is that, because there is a larger number

14:49.420 --> 14:58.030
of values with less than or equal to 50K, the training might get biased towards that class.

14:58.570 --> 15:07.450
And for all the values, it might start predicting that the value is actually less than or equal to 50K.

15:08.940 --> 15:16.770
So what we need to do here is, instead of keeping these values like this, we will be using a parameter,

15:16.770 --> 15:26.190
that is class_weight, which will actually help us in that it will allow us to give

15:26.190 --> 15:29.190
equal weightage to this particular class,

15:30.460 --> 15:35.880
so that it is not getting overpowered by the other class.

15:36.040 --> 15:41.530
It will give equal weight to the values of this class in comparison to that class.
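
NOTE A rough sketch of how balanced class weights even out the two classes. The formula n_samples / (n_classes * count) is the one scikit-learn documents for class_weight="balanced"; the counts are the rounded figures quoted in the session, not exact dataset counts.
```python
# Rounded class counts as quoted in the session (illustrative only).
counts = {"<=50K": 24000, ">50K": 7000}
n_samples = sum(counts.values())
n_classes = len(counts)
# Balanced weight for each class: n_samples / (n_classes * class_count).
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
for label, weight in weights.items():
    # After weighting, each class contributes the same total weight.
    print(label, round(weight, 3), round(weight * counts[label], 1))
```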

15:43.010 --> 15:52.500
Next is the correlation values. As there are no specific cells which are very dark in color,

15:52.970 --> 16:00.230
neither on the red side nor on the blue side, we can easily see that there is no specific correlation.

16:00.590 --> 16:04.590
And the same thing has been depicted above.

16:04.610 --> 16:10.930
There was no specific correlation depicted here and there were no columns which were rejected.

16:11.180 --> 16:15.680
So we are going to go with these columns and these details.

16:17.630 --> 16:24.710
So the next thing which we will be doing is comparing the education column with the education

16:24.710 --> 16:25.550
number column.

16:26.660 --> 16:37.570
So here you can see that the values for education and education number are actually correlated.

16:37.760 --> 16:48.950
That is, 1st-4th actually refers to education number two, 5th-6th

16:49.610 --> 16:51.520
is basically education

16:51.530 --> 16:54.980
number three, 7th-8th is education number four,

16:55.220 --> 16:56.600
9th is education

16:56.600 --> 16:59.150
number five, 10th is education

16:59.150 --> 17:01.460
number six, 11th is education

17:01.460 --> 17:05.660
number seven, 12th is education number eight, and so on.

17:07.370 --> 17:15.860
So what we can do is we can simply remove the column education and keep only education number in our

17:15.860 --> 17:16.580
dataset.

17:18.220 --> 17:26.200
So we will simply use ci_train.drop and remove the education column from this particular dataset.
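
NOTE A minimal sketch of checking that education and education number carry the same information before dropping one of them. The column names and the tiny sample frame are assumptions for illustration.
```python
import pandas as pd
# Tiny made-up sample; the pairs follow the values read out in the session.
ci_train = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
    "education_num": [13, 13, 9, 7, 13],
})
# If every education level maps to exactly one code, the columns are redundant.
codes_per_level = ci_train.groupby("education")["education_num"].nunique()
is_redundant = bool((codes_per_level == 1).all())
if is_redundant:
    ci_train = ci_train.drop(columns=["education"])
print(is_redundant, list(ci_train.columns))
```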

17:27.070 --> 17:30.410
Next, we will get the count of the y values.

17:30.470 --> 17:36.120
So we already know that the count of the values is actually unbalanced in nature.

17:36.340 --> 17:41.870
So we will have to take that into consideration while training our model.

17:42.160 --> 17:48.260
But for now, what we will be doing is converting the ci_train y column into a dummy column.

17:48.520 --> 17:56.590
So for that we will simply check whether ci_train y is equal to greater than 50K.

17:56.600 --> 18:01.560
So whenever the value is greater than 50K, it will put that as one.

18:01.780 --> 18:06.380
And whenever the condition is not satisfied, it will put that as zero.

18:08.420 --> 18:09.770
Because this is a condition.

18:10.960 --> 18:17.500
This condition will be evaluated, and when this condition is evaluated, it will give true or false.

18:18.070 --> 18:20.350
So when this condition is true, it will give

18:21.480 --> 18:28.800
true, and true, when converted to integer, gives one, and when false is converted to integer, it will give

18:29.160 --> 18:29.510
zero.
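
NOTE A minimal sketch of turning the target into 0 and 1 with a comparison followed by an integer cast. The column name y and the label strings are assumptions; the real file may format the labels differently.
```python
import pandas as pd
# Made-up target values for illustration.
ci_train = pd.DataFrame({"y": ["<=50K", ">50K", "<=50K", ">50K"]})
# The comparison gives True or False; astype(int) turns that into 1 or 0.
ci_train["y"] = (ci_train["y"] == ">50K").astype(int)
print(ci_train["y"].tolist())
```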

18:32.200 --> 18:39.660
Next, we are getting all the columns which are categorical columns by using select_dtypes on

18:39.670 --> 18:40.310
ci_train.

18:40.590 --> 18:44.020
So this will give us all the data which is of object type.

18:44.560 --> 18:47.020
And then we are selecting the columns out of that.

18:47.230 --> 18:49.530
So we get all the categorical columns.

18:49.690 --> 18:54.280
So these are the different categorical columns which are present in this particular dataset.
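
NOTE A minimal sketch of picking out the categorical columns with select_dtypes. The sample frame is made up, and the text columns are created with an explicit object dtype so the selection behaves the same across pandas versions.
```python
import pandas as pd
# Made-up sample; text columns are stored explicitly as object dtype.
ci_train = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": pd.Series(["State-gov", "Self-emp", "Private"], dtype="object"),
    "sex": pd.Series(["Male", "Male", "Female"], dtype="object"),
})
# Select the object-typed columns and take their names.
cat_cols = ci_train.select_dtypes(include="object").columns
print(list(cat_cols))
```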

18:56.170 --> 19:04.270
Next, we have created this particular loop, which will actually allow us to create a cutoff

19:04.660 --> 19:14.470
for the dummy column creation, so that we can actually get only some of the categories, only

19:14.470 --> 19:19.690
some categories converted into the dummy columns, and not all of them.

19:21.610 --> 19:27.820
So it is iterating over the different categorical columns and

19:28.970 --> 19:36.800
getting the values of the columns. Then it is running on ci_train and getting the different value

19:36.800 --> 19:41.090
counts, and from these value counts, I get the frequencies.

19:41.780 --> 19:49.970
Then, from the frequency data, it is getting the indexes where the frequency

19:49.970 --> 19:57.750
value is greater than five hundred, and it is iterating over each category in those.

19:57.780 --> 20:00.800
And so it is just creating a name,

20:00.800 --> 20:04.430
that is, the column plus the category name.

20:05.650 --> 20:12.830
And it is creating the column using this name. So what it does is it just creates the dummy

20:12.830 --> 20:15.010
column out of the categorical columns which we have.

20:15.010 --> 20:22.990
So we have workclass, marital status, occupation, relationship, race, sex and native country.

20:24.370 --> 20:33.850
So all of these columns have been converted from categorical into one-hot encoding,

20:33.850 --> 20:34.960
that is, into dummy columns.
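
NOTE A sketch of a dummy-creation loop with a frequency cutoff, along the lines described above. This is not the notebook's actual code: the sample data, the cutoff of 2 (the session uses 500), the underscore in the new column names, and dropping the original columns at the end are all assumptions.
```python
import pandas as pd
# Made-up sample with one rare race category.
ci_train = pd.DataFrame({
    "race": ["White"] * 6 + ["Black"] * 3 + ["Other"],
    "sex": ["Male"] * 5 + ["Female"] * 5,
})
cat_cols = ["race", "sex"]
CUTOFF = 2  # the session uses 500 on the full dataset
for col in cat_cols:
    freqs = ci_train[col].value_counts()
    # Only categories seen more often than the cutoff get a dummy column.
    for category in freqs.index[freqs > CUTOFF]:
        ci_train[col + "_" + category] = (ci_train[col] == category).astype(int)
# Assumption: the original text columns are removed once the dummies exist.
ci_train = ci_train.drop(columns=cat_cols)
print(sorted(ci_train.columns))
```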

20:37.130 --> 20:42.780
Next, we are just checking the shape of the data. So now we have thirty-nine columns in hand.

20:43.040 --> 20:51.200
So the next task, after converting all the columns from categorical columns to dummy columns, or one-hot

20:51.200 --> 20:54.630
encoding, is to check if there is any null value or not.

20:54.800 --> 21:01.560
So we will check the data by using .isna().sum(), so we get the null counts for all the columns.

21:01.580 --> 21:05.220
So here we can see that all the counts are actually zero.

21:05.450 --> 21:09.020
And here you can see that these are the columns which have been created.

21:09.020 --> 21:16.100
So we have relationship Husband, relationship Not-in-family, relationship Own-child,

21:16.760 --> 21:20.200
relationship Unmarried, race White, race Black.

21:20.510 --> 21:24.680
So these are the different columns which have been created using the dummy creation.
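
NOTE A minimal sketch of the shape and null checks mentioned above, on a made-up frame with hypothetical column names.
```python
import pandas as pd
# Made-up frame standing in for the encoded training data.
ci_train = pd.DataFrame({"age": [39, 50, 38], "race_White": [1, 1, 0], "y": [0, 0, 1]})
# shape gives (rows, columns); isna().sum() gives the null count per column.
n_rows, n_cols = ci_train.shape
null_counts = ci_train.isna().sum()
print(n_rows, n_cols)
print(null_counts)
```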

21:26.130 --> 21:28.260
Now we are simply getting the data

21:29.420 --> 21:38.960
into x_train. So x_train contains all the columns from ci_train except for the y column. That is why

21:38.960 --> 21:42.710
we drop the y column with axis equal to one.

21:43.880 --> 21:45.410
We can write this like

21:46.570 --> 21:50.710
axis equal to one, or just one; it will be the same thing.

21:51.100 --> 21:52.840
Next, we have y_train.

21:54.100 --> 22:02.140
So we are putting the y column into y_train. Now, there are different hyperparameters for decision

22:02.140 --> 22:05.390
trees, which we have already discussed.

22:05.620 --> 22:14.100
So we will be training for these different hyperparameters, and based on these different hyperparameters

22:14.140 --> 22:20.290
we will be getting the values of the hyperparameters which give us the best model.
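
NOTE A minimal sketch of splitting the frame into features and target. The names x_train and y_train follow the session; the sample frame and its column names are made up.
```python
import pandas as pd
ci_train = pd.DataFrame({"age": [39, 50, 38], "race_White": [1, 1, 0], "y": [0, 0, 1]})
# axis=1 tells drop to remove a column rather than a row.
x_train = ci_train.drop("y", axis=1)
y_train = ci_train["y"]
print(list(x_train.columns), y_train.tolist())
```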

22:22.690 --> 22:32.440
So we are implementing RandomizedSearchCV. We have already discussed the grid search. Now,

22:32.440 --> 22:41.860
in a GridSearchCV, what happens is that the grid search will give us the combinations.

22:41.860 --> 22:45.130
It will run all the combinations which are present in the grid.

22:45.380 --> 22:50.380
And then from all the combinations, it will select the best one.

22:50.560 --> 22:53.480
It will run all the combinations of the parameters.

22:54.040 --> 22:59.980
So if I'm giving these parameters, it will run all the combinations out of them and then give me the

22:59.980 --> 23:08.530
result. What RandomizedSearchCV does is this: because we don't want to put in so much time into running

23:08.530 --> 23:08.990
the code,

23:09.190 --> 23:15.000
instead of using GridSearchCV, we will use the randomized search.

23:15.430 --> 23:22.300
So what it will do is, let's say we have one hundred models which are being created using GridSearchCV.

23:22.450 --> 23:30.250
So instead of running those 100 models, it will randomly select 10 or 20 models, whatever number we

23:30.250 --> 23:32.440
define and run those.

23:33.760 --> 23:40.510
So in this way, we will actually run a random selection from those models.

23:40.810 --> 23:44.950
Now, the benefit is that it will save us a lot of time.

23:46.020 --> 23:53.850
But there is a drawback, that is, we might be losing out on a particular model.

23:54.390 --> 24:02.100
So there might be some model which would have been better than the one which we will receive by running

24:02.120 --> 24:03.000
the randomized

24:03.000 --> 24:11.370
search CV, if we had used GridSearchCV. But because we want to save time, we are using randomized

24:11.370 --> 24:11.800
search CV.
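
NOTE A plain-Python sketch of the difference between the two searches: a grid search tries every combination, while a randomized search tries only a random sample of them. The grid values here are hypothetical and this is not how scikit-learn implements the search internally.
```python
import itertools
import random
# Hypothetical small grid.
grid = {"max_depth": [3, 5, 7, 9], "min_samples_leaf": [5, 10, 15]}
# Grid search: every combination of the listed values.
all_combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
# Randomized search: only a fixed number of combinations, picked at random.
rng = random.Random(0)
sampled = rng.sample(all_combos, k=5)
print(len(all_combos), len(sampled))
```
The best combination may be among the ones that were not sampled, which is the drawback described above.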

24:12.030 --> 24:18.630
This is good when you are running a certain proof of concept, or you want to get a model very quickly;

24:18.840 --> 24:21.100
then you can use RandomizedSearchCV.

24:21.420 --> 24:30.080
But otherwise, if you don't want to do a quick thing but actually run extensively and get the best model,

24:30.300 --> 24:32.930
in that case you would run the grid

24:32.940 --> 24:42.360
search CV. There are different methods which will actually ease the running of GridSearchCV, but that

24:42.360 --> 24:45.750
is something which we will learn later,

24:45.900 --> 24:52.370
maybe in another topic, because I don't want to go so fast that you get confused.

24:52.560 --> 24:58.050
So for now, we will have a look at this RandomizedSearchCV.

24:58.200 --> 25:04.950
So if you don't really have any time and just want to have a one-time run,

25:05.220 --> 25:13.020
then you can use a train-test split. In case you have a moderate amount of time,

25:13.020 --> 25:14.670
you can use randomized

25:14.970 --> 25:15.840
search CV.

25:16.140 --> 25:23.850
But if you want to run the model extensively and try all the possible combinations, in that case you

25:23.850 --> 25:25.280
will use the grid

25:25.300 --> 25:25.730
search CV.

25:27.330 --> 25:28.690
So let us go for the code.

25:29.830 --> 25:30.660
So here.

25:32.170 --> 25:39.370
We have different parameters. These are the different parameters which we have, that is, class weight,

25:39.990 --> 25:46.330
criterion, maximum depth, minimum samples leaf, and minimum samples split.

25:46.590 --> 25:48.990
So these are the different parameters which we have.

25:49.180 --> 25:51.670
So we will be trying all the combinations.

25:53.820 --> 26:01.350
Another thing is, instead of having just one minimum samples leaf value, you could have had a larger

26:01.680 --> 26:02.960
set of values.

26:03.210 --> 26:06.480
So this is another choice which could work better.

26:06.720 --> 26:08.990
So it is completely dependent on you

26:09.000 --> 26:13.170
how many values, how many parameters you want to put in.

26:13.890 --> 26:22.350
Usually what I would do is try three values, or four or five values, and out of those values

26:22.500 --> 26:24.180
I will select the best ones.

26:24.330 --> 26:25.520
So how would I do it?

26:26.040 --> 26:27.780
Take the minimum samples leaf values.

26:27.780 --> 26:30.870
So I'll try five, ten,

26:32.340 --> 26:41.740
15, 20 and 25. Now, out of these values, let us say I get the value fifteen.

26:42.990 --> 26:52.020
So in that case, I will eliminate five and twenty-five, and then I will add another value.

26:52.020 --> 26:54.120
Here I will add the value 12,

26:55.140 --> 27:00.780
and the value 17 here, and I will eliminate the five

27:02.110 --> 27:03.280
and twenty-five.

27:05.650 --> 27:10.870
This way, I will actually get to know, and I will eliminate ten also,

27:12.590 --> 27:15.960
because if it was actually ten, then it would have chosen ten.

27:16.670 --> 27:22.860
So now I have these three values. Now, it depends on which direction the selection goes.

27:23.060 --> 27:29.280
If the best model comes out to be 15, then we are good. If the best model comes out

27:29.390 --> 27:29.990
to be 12,

27:30.620 --> 27:33.560
then what we can do is eliminate 17,

27:34.640 --> 27:41.240
and we can run something which is between 12 and 15, that is, 13 and 14.

27:42.710 --> 27:44.510
Then we can run something like this.

27:45.500 --> 27:48.920
This way, we will be able to narrow down the selections which we are making.

27:49.640 --> 27:51.530
So this is how you can run this.
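
NOTE A small sketch of the narrowing idea described above: keep the best value and add the midpoints towards its two neighbours. The helper name refine is hypothetical, and integer midpoints are an assumption that happens to reproduce the 12, 15, 17 example from the session.
```python
def refine(candidates, best):
    """Return the best value plus the midpoints towards its two neighbours."""
    i = candidates.index(best)
    if i == 0 or i == len(candidates) - 1:
        # A best value on the edge means the range itself should be extended.
        raise ValueError("best value is on the boundary; extend the range instead")
    lower = (candidates[i - 1] + best) // 2
    upper = (best + candidates[i + 1]) // 2
    return [lower, best, upper]
refined = refine([5, 10, 15, 20, 25], 15)
print(refined)
```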

27:52.000 --> 28:00.380
So here, what we are getting is that we will be running two class weights, two criteria,

28:01.380 --> 28:01.890
four

28:02.980 --> 28:04.570
maximum depth values,

28:06.470 --> 28:13.760
three minimum samples leaf values and three minimum samples split values. So the total number of runs which

28:13.760 --> 28:20.540
we will be having, the number of models which we will be having, will be two cross two cross four cross three cross

28:20.540 --> 28:21.020
three.

28:22.180 --> 28:29.440
So in total, I have 144 models. And another thing is that this is a decision tree and I will be

28:29.440 --> 28:31.980
running ten-fold cross-validation.

28:32.290 --> 28:33.960
So it will be, in total,

28:34.300 --> 28:34.750
so,

28:36.640 --> 28:43.870
there will be in total one thousand four hundred and forty models which will be created if I

28:43.870 --> 28:45.250
was running a grid

28:45.250 --> 28:46.110
search CV.
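
NOTE A quick check of the arithmetic above. Only the number of values per parameter (2, 2, 4, 3, 3) and the ten folds come from the session; the concrete values in the grid are hypothetical placeholders.
```python
from itertools import product
# Hypothetical values; only the count of values per parameter matters here.
param_grid = {
    "class_weight": [None, "balanced"],
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9],
    "min_samples_leaf": [5, 10, 15],
    "min_samples_split": [10, 20, 30],
}
n_folds = 10
n_candidates = len(list(product(*param_grid.values())))
n_fits_grid = n_candidates * n_folds
print(n_candidates, n_fits_grid)
```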

28:47.520 --> 28:55.170
But the good thing is I'm using RandomizedSearchCV. So now what will happen is, first things first, I will

28:55.170 --> 28:59.670
import DecisionTreeClassifier from sklearn.tree.

28:59.700 --> 29:05.100
From this, I will create an object of DecisionTreeClassifier as clf.

29:05.370 --> 29:06.900
And in my randomized

29:06.900 --> 29:14.520
search CV, I will give the details: that I want to run this clf model, the ten-fold cross-validation

29:14.580 --> 29:15.930
which I want to have.

29:16.200 --> 29:22.800
These are the parameters which I want to run this clf for, and this is the scoring method which I

29:22.800 --> 29:26.010
want to use. A random state I will also give.

29:26.460 --> 29:38.420
And n_iter: this n_iter means how many models I want to select from these 144 models.

29:39.120 --> 29:45.960
So if I say I want to select 10 models out of these 144 models, then it will select.

29:47.880 --> 29:54.240
ten models and it will run ten-fold cross-validation on these, so it will give

29:55.240 --> 30:01.900
ten into ten, that is, one hundred models will be built instead of one thousand four hundred and forty.

30:03.320 --> 30:13.720
So now you can understand how it is easing the run. Instead of having such a huge run,

30:13.760 --> 30:15.330
we will have a smaller run.

30:15.590 --> 30:21.230
That is, we will have to run only a hundred models instead of one thousand four hundred and forty.

30:21.950 --> 30:27.400
So running RandomizedSearchCV has its own pros and cons.

30:27.410 --> 30:34.190
So that is something that you can evaluate based on what kind of work you are doing and what exactly

30:34.190 --> 30:35.620
you want, and what you have promised.

30:37.050 --> 30:42.570
So the next thing which we will be doing: we have created the object of RandomizedSearchCV.

30:42.780 --> 30:46.560
So this is the object of RandomizedSearchCV, and we will fit it on

30:47.520 --> 30:54.450
x_train and y_train. When we fit on x_train and y_train, it will take some time to run, because

30:54.450 --> 30:59.790
we have so many parameters which we are running, and there are around a hundred models which are being

30:59.790 --> 31:01.030
created internally.

31:01.350 --> 31:03.120
So it will take some time.

31:03.120 --> 31:08.910
It will learn from this x_train and y_train, and after learning from this data,

31:10.550 --> 31:16.340
it will give the best estimator through random_search.best_estimator_.

31:17.340 --> 31:26.490
Now this is the best model which we have received from this and this best model is having the class

31:26.510 --> 31:27.770
weight as balanced.

31:27.980 --> 31:30.020
The criterion is gini.

31:30.230 --> 31:32.450
The maximum depth is five.

31:32.690 --> 31:34.970
The maximum number of features is none.

31:34.970 --> 31:36.660
The maximum leaf nodes is None.

31:36.830 --> 31:43.670
So these are the different parameters which it has chosen from this group of parameters which we have.
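
NOTE A minimal sketch of the randomized search described above, run on synthetic data so that it is self-contained. The parameter values, the synthetic dataset, the random_state values and the roc_auc scoring are assumptions; the notebook's actual settings are not shown in the transcript.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
# Synthetic, imbalanced stand-in for x_train and y_train.
x_train, y_train = make_classification(
    n_samples=300, n_features=8, weights=[0.76, 0.24], random_state=0
)
# Hypothetical search space with 2 * 2 * 4 * 3 * 3 = 144 combinations.
param_dist = {
    "class_weight": [None, "balanced"],
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9],
    "min_samples_leaf": [5, 10, 15],
    "min_samples_split": [10, 20, 30],
}
clf = DecisionTreeClassifier(random_state=0)
# n_iter=10 samples 10 of the 144 combinations; cv=10 fits each one 10 times.
random_search = RandomizedSearchCV(
    clf, param_distributions=param_dist, n_iter=10, cv=10,
    scoring="roc_auc", random_state=0,
)
random_search.fit(x_train, y_train)
best_model = random_search.best_estimator_
print(random_search.best_params_)
```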

31:45.550 --> 31:54.490
Now, you can also go through the documentation of the decision tree and see what all other hyperparameters

31:54.490 --> 31:54.880
there are.

31:54.910 --> 32:03.130
You can select and tune this model for those. These are all called hyperparameters, which we are tuning.

32:03.640 --> 32:06.320
And this is exactly what you need to do.

32:06.340 --> 32:11.020
You need to have patience and give it the time to train.

32:11.320 --> 32:15.290
And once you have the model in hand, you can make the predictions.

32:16.060 --> 32:20.890
So everything here is actually a task of patience.

32:20.900 --> 32:28.330
If you have enough patience and you know which hyperparameter to choose and how to deal with these

32:28.330 --> 32:34.600
hyperparameters, and how to check which value you need to use, that is exactly what you need to know here.

32:36.420 --> 32:46.350
Now, let's say I had the values five, ten and 15, and somehow I

32:46.350 --> 32:52.750
got the value 15. That means that the value is a boundary value.

32:53.010 --> 32:57.110
Now, the value could have been less than 15 or greater than 15.

32:57.780 --> 33:05.780
So in that case, in the next run, in the next training, what I will do is I will give some more values,

33:05.820 --> 33:15.750
say 25 here and let's say 20 also, so that I will get to know if the value was actually 15

33:15.960 --> 33:22.650
or it was something less than 15 or greater than 15, because I don't want to lose out on something just

33:22.650 --> 33:27.940
because I didn't run the random search or grid search again.

33:28.140 --> 33:34.830
So once you run this randomized search, you will get some parameters. Then you can modify the parameters

33:34.830 --> 33:41.690
and run these models individually, or change these hyperparameters and run a grid search on top

33:41.700 --> 33:45.860
of these, so that you can get a better result and improve your final

33:45.870 --> 33:46.890
model's score using that.
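
NOTE
The idea of re-running the search when the best value sits on the edge of the
tried range can be sketched with a small helper. The function name and the step
size are hypothetical and only illustrate the reasoning.
```python
def widen_if_on_edge(candidates, best, step):
    """Return new candidates that extend past `best` when it sits on an edge."""
    ordered = sorted(candidates)
    if best == ordered[-1]:
        # Best value is the largest one tried: probe larger values next.
        return [best, best + step, best + 2 * step]
    if best == ordered[0]:
        # Best value is the smallest one tried: probe smaller positive values.
        return [v for v in (best - 2 * step, best - step, best) if v > 0]
    # Best value is interior: the tried range already brackets it.
    return ordered
next_depths = widen_if_on_edge([5, 10, 15], best=15, step=5)
print(next_depths)  # [15, 20, 25]
```
The returned list could then be used as the next set of max_depth candidates in
a follow-up random search or grid search.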

33:48.220 --> 33:53.890
And RandomizedSearchCV will also help you to get a benchmark for your model.

33:55.210 --> 34:01.480
So you can decide, OK, for now, my model is performing with seventy five percent accuracy.

34:01.630 --> 34:06.600
So now, after tuning the hyperparameters, how much is it improving?

34:06.820 --> 34:08.820
Is it actually improving or not?

34:08.980 --> 34:15.910
And what is the accuracy value for this in comparison to other models, let's say logistic regression

34:15.910 --> 34:22.330
or random forest or XGBoost, which we will be learning in the coming sessions.

34:22.510 --> 34:25.600
So you will be able to compare these models.

34:26.780 --> 34:27.150
Right.

34:27.260 --> 34:33.770
So this is what we will be doing. This is how we will be training our models and working with them.

34:35.380 --> 34:43.390
So once we get the best estimator, you can use this particular function to generate a report from

34:43.390 --> 34:43.750
this.

34:43.930 --> 34:50.110
So I have generated one report, and you can see the score which I have is 0.891,

34:50.110 --> 34:52.570
which is the best score.

34:52.570 --> 34:56.740
And next I have 0.890 and 0.889.

34:57.010 --> 34:59.860
So the best one is 0.891.

34:59.860 --> 35:01.930
And these are the parameter values.
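
NOTE
A sketch of what such a report function might look like, assuming it reads the
rank_test_score, mean_test_score and params entries that scikit-learn stores in
cv_results_. The notebook's own function is not shown in the transcript, and the
parameter values in the demo dictionary are hypothetical.
```python
def report(results, n_top=3):
    """Print and return the n_top ranked candidates from a cv_results_-style dict."""
    rows = sorted(
        zip(results["rank_test_score"], results["mean_test_score"], results["params"]),
        key=lambda row: row[0],  # rank 1 is the best candidate
    )[:n_top]
    for rank, score, params in rows:
        print(f"Rank {rank}: mean validation score {score:.3f}, parameters {params}")
    return rows
# Hypothetical results that mirror the three scores read out in the video.
demo = {
    "rank_test_score": [2, 1, 3],
    "mean_test_score": [0.890, 0.891, 0.889],
    "params": [{"max_depth": 10}, {"max_depth": 5}, {"max_depth": 15}],
}
top = report(demo)
```
With a fitted search object, the call would be report(random_search.cv_results_).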

35:04.110 --> 35:10.890
Next, what I will be doing is this: I still have the best estimator, but I have not saved it anywhere,

35:10.890 --> 35:12.840
and my model is still not prepared.

35:13.200 --> 35:18.420
So what I will be doing is I will take random_search.best_estimator_.

35:18.840 --> 35:22.430
This is the model which has the best performance.

35:22.680 --> 35:26.550
I will take this best estimator and put it in a variable.

35:27.600 --> 35:34.590
This is my new variable, which will hold the new model, the best model out of all the models which random

35:34.590 --> 35:36.150
search has trained.

35:37.860 --> 35:47.430
After I get this tree, I will again fit my model on X_train and y_train using the new parameters,

35:47.430 --> 35:54.360
that is, these best parameters. With these best parameters, again, I will fit the model, and then

35:54.360 --> 35:59.760
the model which I get, that is, this tree, will be the model which has the best performance.
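
NOTE
A small sketch of storing the best estimator and fitting it again. The variable
name best_tree, the synthetic data and the tiny search space are assumptions
made so that the example runs on its own.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
X_train, y_train = make_classification(n_samples=200, n_features=6, random_state=1)
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": [2, 3, 5, 8]},
    n_iter=3,
    cv=3,
    random_state=1,
)
random_search.fit(X_train, y_train)
# Keep the winning estimator in its own variable (name assumed here).
best_tree = random_search.best_estimator_
# With the default refit=True this estimator is already fitted on all of X_train.
# Fitting again, as in the video, retrains it with the same hyperparameters.
best_tree.fit(X_train, y_train)
predictions = best_tree.predict(X_train)
```
After this step, best_tree is the model to use for predictions.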

36:01.580 --> 36:09.290
Now, I can actually visualize a decision tree, so for visualizing a decision tree, I can simply say

36:09.320 --> 36:13.130
open, then the name of my tree's .dot file,

36:15.180 --> 36:20.390
comma "w", which means that I want to write to this particular file.

36:21.370 --> 36:29.800
And this is the file that has been created. This is the new file which has been created, and I'm just

36:29.800 --> 36:33.520
exporting the model to

36:35.370 --> 36:42.110
this particular file, to this particular tree file, to this .dot file.

36:43.630 --> 36:52.130
So after that, I will simply close the file, and I can visualize this particular tree using

36:52.150 --> 36:53.380
webgraphviz.com.
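
NOTE
A hedged sketch of exporting a fitted tree to a .dot file with scikit-learn's
export_graphviz. The file name, the feature names and the synthetic data are
placeholders. The with statement closes the file automatically, which replaces
the explicit close step mentioned in the video.
```python
import os
import tempfile
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Placeholder output path in the system's temporary directory.
dot_path = os.path.join(tempfile.gettempdir(), "tree_sketch.dot")
with open(dot_path, "w") as f:
    # proportion=True shows samples as percentages and values as fractions.
    export_graphviz(clf, out_file=f, feature_names=feature_names,
                    class_names=["0", "1"], filled=True, proportion=True)
# Read the DOT text back; this is the text that gets pasted into a viewer.
with open(dot_path) as f:
    dot_text = f.read()
```
The contents of dot_text can be pasted into any Graphviz viewer to draw the tree.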

36:54.580 --> 36:58.870
In this, you can actually see what the decision tree looks like.

36:59.880 --> 37:02.160
So let us visualize that.

37:03.480 --> 37:11.100
So this is the file which has been generated, which has all the details of my decision tree, so you

37:11.100 --> 37:18.670
can see that it has details like the different rules which are used for this particular decision tree

37:18.740 --> 37:19.260
creation.

37:19.470 --> 37:21.880
So it has all the rules associated.

37:22.200 --> 37:25.430
So I'll just show you what exactly we will do.

37:25.430 --> 37:30.080
We just copy the entire text of this file, which we have.

37:30.330 --> 37:36.560
So I just copied this, and I just click Generate Graph, and it will generate a graph for me.

37:46.540 --> 37:52.780
And below it, you can see the graph which has been generated, so it is a very huge graph which has

37:52.780 --> 37:53.630
been generated.

37:54.040 --> 37:56.080
So this is what it looks like.

37:57.020 --> 38:05.180
So you can see it is a very huge graph. This is the root node, which is marital status, that is,

38:05.180 --> 38:08.870
Married-civ-spouse, less than or equal to 0.5.

38:08.870 --> 38:10.570
The gini value is 0.5.

38:10.820 --> 38:12.560
The samples are one hundred percent.

38:12.590 --> 38:13.700
These are the values.

38:13.700 --> 38:14.750
These are the classes.

38:15.080 --> 38:23.080
So here now, this is the branch where the spouse condition is true, and accordingly, capital gain,

38:23.090 --> 38:24.210
this is the next rule.

38:24.230 --> 38:25.690
This is the value of it.

38:26.090 --> 38:28.970
So here you can see the gini value was 0.5.

38:29.240 --> 38:36.530
Here the value has reduced to 0.294, and the number of samples is now 54 percent.

38:36.770 --> 38:41.480
And you can see that the values are 0.819 and 0.181,

38:41.480 --> 38:45.050
and it has predicted the class to be 0, and so on.

38:45.050 --> 38:51.370
It has just generated different class values, and thus it keeps on reducing the gini value accordingly.
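
NOTE
The gini value shown in each node can be reproduced from the class proportions
with the standard Gini impurity formula, one minus the sum of squared
proportions. The helper below is a sketch with a name chosen for illustration.
```python
def gini(proportions):
    """Gini impurity for class proportions that sum to 1."""
    return 1.0 - sum(p * p for p in proportions)
print(gini([0.5, 0.5]))  # 0.5, an evenly split node such as the root
print(gini([1.0, 0.0]))  # 0.0, a pure node
```
Calling gini([0.819, 0.181]) gives a value near 0.3, far below the root's 0.5,
which is the drop described for the second node.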

38:51.680 --> 38:57.080
So you can generate your own decision tree using this

38:58.260 --> 38:59.760
and analyze the same.

39:00.390 --> 39:00.930
Thank you.