WEBVTT

00:01.400 --> 00:08.360
Let us see how we can do feature selection, so the first thing which we will be doing is we will import

00:08.360 --> 00:09.710
NumPy and pandas.

00:12.540 --> 00:14.070
Now, after importing,

00:15.510 --> 00:21.720
we have to do the basic preprocessing, which is removing the null values. So the first thing

00:21.720 --> 00:27.990
which we will do is handle the null values: wherever a null value is present, we would

00:27.990 --> 00:36.810
either remove the entire row or impute the mean value, the median value, or the mode value

00:37.260 --> 00:39.210
in those places where null values

00:39.210 --> 00:39.990
were observed.
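
NOTE
Editor's sketch of the two options just described: dropping rows with nulls, or imputing the mean, median, or mode.
The toy frame and its column names are assumptions for illustration, not the lecture's data.
```python
import numpy as np
import pandas as pd
# Toy frame with gaps; the values are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "balance": [100.0, 250.0, np.nan, 50.0],
    "job": ["admin", None, "admin", "services"],
})
# Option 1: drop every row that contains at least one null.
dropped = df.dropna()
# Option 2: impute instead of dropping.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())              # mean
filled["balance"] = filled["balance"].fillna(filled["balance"].median())  # median
filled["job"] = filled["job"].fillna(filled["job"].mode()[0])            # mode (most frequent)
print(dropped.shape, int(filled.isna().sum().sum()))
```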

00:41.860 --> 00:48.160
The second thing which we will do is convert the categorical variables into one-hot encoding,

00:48.430 --> 00:51.190
or, as we can also say, dummy variables.

00:52.430 --> 00:58.400
And the ordinal variables we can convert into numeric values.
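
NOTE
Editor's sketch of both conversions: dummy (one-hot) columns for a nominal variable and a rank mapping for an ordinal one.
The sample values and the primary < secondary < tertiary ordering are assumptions for illustration.
```python
import pandas as pd
df = pd.DataFrame({
    "marital": ["single", "married", "divorced", "married"],
    "education": ["primary", "tertiary", "secondary", "primary"],
})
# Nominal variable: one 0/1 dummy column per category.
dummies = pd.get_dummies(df["marital"], prefix="marital", dtype=int)
# Ordinal variable: map each level to its rank (this ordering is an assumption).
education_rank = {"primary": 1, "secondary": 2, "tertiary": 3}
ranked = df["education"].map(education_rank).rename("education_rank")
encoded = pd.concat([dummies, ranked], axis=1)
print(encoded)
```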

01:00.250 --> 01:06.790
Now, let us look at the data. I am importing the bank-full data, which is a bank-related dataset.

01:08.100 --> 01:16.650
So in this particular data, we have these columns: age, job, marital status, education, default, balance,

01:16.650 --> 01:23.010
housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome,

01:23.040 --> 01:31.940
and y. Now, these are only a few columns which we have here; in a real-life dataset,

01:32.220 --> 01:35.460
there would be a lot more columns.

01:35.640 --> 01:41.570
There could be around three hundred, or a thousand, or two thousand, or five thousand columns.

01:42.660 --> 01:50.460
Now, to find out only 10 or 20 such columns, which would actually give a good amount of information,

01:50.460 --> 01:51.790
is a very difficult task.

01:52.200 --> 01:55.890
So how do we do it? We do it using feature selection.

01:57.440 --> 02:03.390
So for feature selection, the first thing which we will be going with is pandas profiling.

02:04.190 --> 02:10.410
So we will import ProfileReport from pandas_profiling. In pandas profiling,

02:10.940 --> 02:13.870
we have a ProfileReport function.

02:14.720 --> 02:19.510
So we will run this ProfileReport function on the data frame.

02:20.270 --> 02:24.710
And we can also export this report, which we have generated to a particular file.

02:24.980 --> 02:29.120
So here I am just writing it to a report file, and it is a...

02:30.910 --> 02:39.280
So this is the file which it has generated. Depending on the size of the data, this file generation might take a longer

02:39.280 --> 02:43.950
time period because it will be analyzing all the columns in the data.

02:43.960 --> 02:46.060
So it may take a lot of time.

02:47.980 --> 02:55.270
So let us start analyzing the data, let us see what all we have found out in this particular report.

02:55.900 --> 03:02.650
So the first thing that we have found is that the number of variables is 17 and there are

03:03.130 --> 03:06.200
around forty-five thousand observations.

03:06.550 --> 03:10.710
There are no missing cells and there are no duplicate rows.

03:11.710 --> 03:19.090
So if there were any missing values, then we would have had to impute the values or remove the rows

03:19.090 --> 03:25.210
of data based on how many values are actually missing. Then, for the duplicate rows, in case

03:25.210 --> 03:27.700
there were any duplicate rows, then we would have deleted

03:27.710 --> 03:28.540
those rows also.
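
NOTE
Editor's sketch: a pandas-only approximation of the overview figures read from the report (variables, observations, missing cells, duplicate rows).
It is not the profiling library itself, and the toy frame is an assumption for illustration.
```python
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "age": [30, 45, 30, 52],
    "balance": [0.0, 120.5, 0.0, np.nan],
    "housing": ["yes", "no", "yes", "no"],
})
overview = {
    "variables": df.shape[1],
    "observations": df.shape[0],
    "missing_cells": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
}
print(overview)
# If duplicates exist, keep only the first copy of each row.
deduplicated = df.drop_duplicates()
```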

03:29.960 --> 03:37.860
Then this is just the memory, how much memory this takes. Then it tells us about the variable types,

03:37.880 --> 03:39.100
what all variable types there are.

03:39.860 --> 03:42.650
So it would tell us about what are the numeric variables.

03:42.650 --> 03:48.800
So there are seven numeric variables, six categorical variables, and four boolean variables.

03:50.460 --> 03:58.580
Then the next thing which we get here is how many date variables there are, and then how many variables are

03:58.590 --> 03:59.300
rejected.

03:59.850 --> 04:07.980
So it will automatically provide us with the number of columns which it will suggest to reject.

04:08.670 --> 04:15.390
So these columns could be rejected on the basis of if two columns have a lot of correlation between

04:15.390 --> 04:24.220
them, or if the columns have high multicollinearity, or if the columns have a huge amount of null values.

04:24.420 --> 04:31.820
So on those bases, it suggests which columns pandas profiling would recommend rejecting.

04:33.300 --> 04:39.600
So because this is a very small dataset, it has not provided any rejections, but it has provided a

04:39.600 --> 04:46.500
few warnings, such as the balance column having seven point eight percent zero values.

04:47.190 --> 04:55.320
So in case the balance values had around 80 percent or 90 percent zero values, then we could have

04:55.320 --> 04:57.480
thought of removing that particular column.

04:59.350 --> 05:06.550
Then here we have the previous column, which has around eighty-one percent zero values, so we can think of removing

05:06.550 --> 05:11.500
this particular column because it has a very high amount of zero values.
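
NOTE
Editor's sketch of the zero-value check behind these warnings: compute the share of zeros per column and flag the mostly-zero ones.
The 80 percent cut-off follows the figure mentioned above; the toy values are assumptions for illustration.
```python
import pandas as pd
df = pd.DataFrame({
    "balance": [0, 150, -20, 900, 40, 10, 75, 300, 5, 60],
    "previous": [0, 0, 0, 2, 0, 0, 0, 0, 0, 0],
})
# Fraction of zeros in each column.
zero_share = (df == 0).mean()
# Flag columns where at least 80% of the values are zero as removal candidates.
mostly_zero = zero_share[zero_share >= 0.8].index.tolist()
print(zero_share.to_dict(), mostly_zero)
```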

05:13.710 --> 05:21.070
Next, it gives us details about each and every variable, that is how many unique values are there?

05:21.450 --> 05:25.690
What are the distinct number of values and the mean value?

05:25.710 --> 05:30.220
The minimum value, the maximum value, and how many zeros are present.

05:30.420 --> 05:33.120
So all the details are provided here.

05:33.270 --> 05:37.770
So based on these details, we can actually find out what we need to do for them.

05:37.950 --> 05:43.470
So if there are any missing values, we can again figure out how we need to impute these values.

05:43.680 --> 05:47.960
And for those, we will again get the idea by seeing the plot.

05:49.840 --> 05:56.710
So in case the plot is kind of symmetric in nature, if it is a normal plot, then we can use any of

05:56.710 --> 06:00.610
the mean or the median, or in case the plot is skewed,

06:01.330 --> 06:04.660
then we would use the median.
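
NOTE
Editor's sketch of the rule just stated: impute with the mean when the distribution is roughly symmetric, and with the median when it is skewed.
The helper name impute_numeric and the skew_limit of 1.0 are hypothetical choices, not values from the lecture.
```python
import numpy as np
import pandas as pd
def impute_numeric(series: pd.Series, skew_limit: float = 1.0) -> pd.Series:
    """Fill nulls with the mean if roughly symmetric, else with the median."""
    # skew_limit is a hypothetical rule of thumb, not a value from the lecture.
    if abs(series.skew()) < skew_limit:
        return series.fillna(series.mean())
    return series.fillna(series.median())
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 100.0, np.nan])
symmetric_fill = impute_numeric(symmetric).iloc[-1]   # filled with the mean
skewed_fill = impute_numeric(skewed).iloc[-1]         # filled with the median
print(symmetric_fill, skewed_fill)
```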

06:05.020 --> 06:07.630
So these are a few things which we will have to take care of.

06:08.410 --> 06:14.340
Now, here you can see that seven point eight percent of the values are zeros; that has been highlighted here.

06:14.800 --> 06:22.630
And you can see the values are ranging from around minus eight thousand to around one hundred and two

06:22.630 --> 06:23.100
thousand.

06:23.980 --> 06:26.850
So this is the range of this particular data.

06:28.770 --> 06:34.530
So like this, we can see different things like here in categorical variables, you can see that there

06:34.530 --> 06:42.840
is a very low amount of data for the telephone while there is a high amount of data for cellular.

06:43.940 --> 06:52.620
So we can have only one category, cellular or not cellular, instead of having three categories.

06:52.790 --> 06:56.240
So this is how we can decide how many categories we need to have.

06:56.630 --> 06:58.670
So here we have data about the.

06:59.810 --> 07:02.420
And here we have data about education.

07:04.300 --> 07:10.960
Here we have the job categories. So here you can see that there are other values, where around 40 percent

07:10.960 --> 07:19.210
of the values are others, then 21 percent of the values are for blue-collar, 20 percent of the values for management,

07:19.660 --> 07:22.720
and 16 percent of the values for technician.

07:25.070 --> 07:28.370
So here are the further details of it.

07:28.790 --> 07:37.130
So here we can see that around five percent data is for retired and all of the above categories have

07:37.130 --> 07:39.150
more than five percent data in them.

07:39.740 --> 07:46.760
So what we can do is decide a threshold value, say that we should have at least three point

07:46.760 --> 07:53.370
five percent of the data in a category; only those categories will we consider for creating the dummies.

07:53.810 --> 08:00.590
So in that case, we can ignore these categories and create the dummies for only these columns,

08:01.580 --> 08:02.780
only these categories.
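
NOTE
Editor's sketch of the thresholding idea: create dummy columns only for categories holding at least 3.5 percent of the rows.
The category counts below are made up for illustration; rows in rare categories simply get zeros in every dummy column.
```python
import pandas as pd
jobs = pd.Series(
    ["blue-collar"] * 45 + ["management"] * 40 + ["technician"] * 12
    + ["student"] * 2 + ["unknown"] * 1,
    name="job",
)
threshold = 0.035  # keep categories holding at least 3.5% of the rows
share = jobs.value_counts(normalize=True)
frequent = share[share >= threshold].index
# Rare categories become null first, so get_dummies creates no column for them.
dummies = pd.get_dummies(jobs.where(jobs.isin(frequent)), prefix="job", dtype=int)
print(sorted(dummies.columns))
```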

08:04.700 --> 08:06.070
Here are more details.

08:13.580 --> 08:15.860
Then we have the correlation matrix.

08:16.910 --> 08:19.280
So this shows what the correlation is.

08:20.230 --> 08:27.670
So in this, there is a very nominal correlation; there is not much information here. In Spearman's,

08:27.670 --> 08:32.590
again, you can see that there is a very, very high correlation between pdays

08:33.670 --> 08:35.350
and previous.

08:36.940 --> 08:42.190
So these are a few things that we have learned from here: pdays and previous are very highly correlated.
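
NOTE
Editor's sketch of how the two correlation views can be computed directly in pandas: Pearson (linear) and Spearman (rank-based).
The toy pdays, previous, and age values are invented for illustration and are not taken from the bank dataset.
```python
import pandas as pd
df = pd.DataFrame({
    "pdays": [-1, -1, 90, 180, 365, -1],
    "previous": [0, 0, 1, 3, 7, 0],
    "age": [41, 29, 35, 52, 33, 47],
})
pearson = df.corr(method="pearson")     # linear correlation
spearman = df.corr(method="spearman")   # rank-based correlation
print(spearman.round(2))
```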

08:42.760 --> 08:46.900
Then there is a meaningful relation in other values.

08:48.890 --> 08:56.330
Then you can see there are no missing values, and this is how the first few rows and the last few rows

08:56.330 --> 08:56.900
look like.

08:57.820 --> 09:03.370
So these are a few things which we get from this profile report, and similarly, we can learn a lot

09:03.370 --> 09:07.250
from these reports and we can decide what all we need to do for them.

09:08.110 --> 09:10.150
Now, the next things which we have is.

09:11.800 --> 09:17.380
So based on this report, we can decide what we need to keep and what we need to remove. Now, here is another

09:17.560 --> 09:22.310
report which I have generated for the breast cancer dataset.

09:22.660 --> 09:25.920
So here we have around thirty-three columns in all,

09:27.470 --> 09:31.820
out of which the last column contains only not-a-number values.

09:32.880 --> 09:34.560
So here is the profile of the.

09:36.820 --> 09:45.910
So it has actually told us what all columns we can reject, and it has rejected 11 columns for

09:45.910 --> 09:54.010
us. So you can see how important this is. It has told us that area_worst and area_mean are highly

09:54.010 --> 09:54.830
correlated.

09:55.120 --> 10:00.190
And then concave points_worst is, again, highly correlated.

10:00.370 --> 10:06.610
So it has rejected such columns for us, so we can directly remove one of these columns.

10:06.790 --> 10:13.150
So when it is giving area_mean and area_worst, then we can remove one of them from this

10:13.150 --> 10:14.100
particular data.

10:15.450 --> 10:25.710
So that is how we can use this particular report, and the other details, again, work the same way, so

10:25.710 --> 10:32.250
we can again decide what we need to do, what values we can use for imputing those values, and if

10:32.250 --> 10:36.840
there are any high values, then how we need to take care of them.

10:36.850 --> 10:40.890
So all those details are provided to us by pandas profiling.

10:43.100 --> 10:46.910
Now, moving forward, we have this data

10:47.860 --> 10:52.270
again, and what we are doing is we are just getting the predictor values.

10:53.370 --> 11:00.070
From the data, I am just getting the X values, the predictors, so I am slicing this

11:00.090 --> 11:02.790
data for all the rows

11:04.300 --> 11:13.060
and from the second-indexed column up to the last column. This means that it will ignore column zero, ignore

11:13.060 --> 11:17.890
column one, and it will begin from the mean column and go up to

11:21.010 --> 11:27.750
fractal_dimension_worst. It will ignore this minus one, that is, it goes to one less than the last one.

11:28.300 --> 11:31.090
And then I have decided what y is here.

11:31.090 --> 11:34.130
My y value is the diagnosis value.

11:34.510 --> 11:42.880
So I have assigned that using iloc, with all the rows and only the first-indexed column,

11:42.880 --> 11:44.220
which is the diagnosis column.
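
NOTE
Editor's sketch of the slicing described here, on a tiny stand-in frame with the same layout: id, diagnosis, feature columns, and an empty last column.
The variable name data and the sample values are assumptions for illustration.
```python
import numpy as np
import pandas as pd
# Tiny stand-in: id, diagnosis, two feature columns, and an all-NaN last column.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.9, 12.4, 20.1],
    "fractal_dimension_worst": [0.12, 0.08, 0.09],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})
X = data.iloc[:, 2:-1]   # all rows; columns from index 2 up to, but not including, the last
y = data.iloc[:, 1]      # all rows; only the column at index 1 (diagnosis)
print(list(X.columns), y.name)
```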

11:45.830 --> 11:47.720
And this is the head of the data.

11:48.960 --> 11:57.750
Now, based on this, I have retrieved the correlation matrix. This corr function gives this particular

11:57.750 --> 11:59.040
correlation matrix.

11:59.310 --> 12:02.880
This is similar to this matrix which we have here.

12:03.450 --> 12:11.310
So this matrix actually tells us what all columns are kind of related to each other and the columns

12:11.310 --> 12:13.430
which are highly related to each other.

12:13.480 --> 12:22.940
We will be removing them because they will kind of increase the impact of their votes.

12:23.820 --> 12:30.690
So we don't want to have a similar kind of viewpoint from two different columns because one column is

12:30.690 --> 12:32.420
enough to provide its viewpoint.

12:32.850 --> 12:38.550
So we will keep only one of the columns for one type of relationship.

12:40.100 --> 12:47.750
So this is why we are just creating this particular function, this particular function will find the

12:47.750 --> 12:52.070
uncorrelated features from all the features which we have.

12:53.030 --> 12:57.140
And based on that, it will return the columns which we need to keep.

12:58.240 --> 13:05.590
So what this does is it basically checks the correlation values: it creates a correlation matrix and

13:05.590 --> 13:12.010
based on the correlation matrix, it checks each and every value in the correlation matrix.

13:12.190 --> 13:14.910
So this is the correlation matrix which has been generated.

13:15.610 --> 13:18.460
So it has a lot of values, negative and positive.

13:18.460 --> 13:21.460
All the values will be ranging from minus one to plus one.

13:22.700 --> 13:30.020
The values which are near to minus one and the values which are near to plus one are highly correlated,

13:30.170 --> 13:35.120
and the values which are near to zero are either not correlated or only very weakly correlated.

13:36.050 --> 13:42.320
So we want to remove those particular columns, which have a very high correlation.

13:42.530 --> 13:49.400
So that is why we will be removing the columns which have correlation, more than zero point nine.

13:50.360 --> 13:57.140
So that is why we are giving the threshold value as zero point nine, which is the reason why we are checking

13:57.470 --> 13:59.230
the value of the correlation.

13:59.280 --> 14:05.450
And if the correlation value is above zero point nine, then we are removing it.

14:05.450 --> 14:11.540
Otherwise we are keeping it and we are appending it to the keep list.

14:12.230 --> 14:17.660
And then this list actually provides us the list of all the columns which we need to keep.

14:17.900 --> 14:20.960
So we have selected these many columns which we should keep.
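
NOTE
Editor's sketch of a correlation filter along the lines described: build the correlation matrix, compare each value with 0.9, and append uncorrelated columns to a keep list.
The function name uncorrelated_features, the greedy keep-the-first order, and the synthetic columns are assumptions; the lecture's own function is not shown in the transcript.
```python
import numpy as np
import pandas as pd
def uncorrelated_features(X: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return column names to keep, skipping any column too correlated with one already kept."""
    corr = X.corr().abs()
    keep = []
    for col in corr.columns:
        # Keep the column only if it is not highly correlated with any kept column.
        if all(corr.loc[col, kept] <= threshold for kept in keep):
            keep.append(col)
    return keep
rng = np.random.default_rng(0)
area = rng.normal(size=200)
X = pd.DataFrame({
    "area_mean": area,
    "area_worst": area * 1.2 + rng.normal(scale=0.05, size=200),  # near-duplicate of area_mean
    "texture_mean": rng.normal(size=200),
})
kept_columns = uncorrelated_features(X)
print(kept_columns)
```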

14:22.770 --> 14:30.450
Now, similarly, we have another method, which is VIF, the variance inflation factor, which is

14:30.900 --> 14:38.500
a technique to estimate the severity of multicollinearity among the independent variables. So it will find

14:38.520 --> 14:40.530
out such variables.

14:40.530 --> 14:45.570
Similarly, just the way we are doing this, just the way we are finding variables which we need to

14:45.570 --> 14:50.250
remove, VIF also helps us find those variables which we need to remove.

14:50.610 --> 14:55.050
So what we are doing is we are again keeping the output value as diagnosis

14:56.060 --> 14:58.910
and all of the other values as the predictors.

15:00.400 --> 15:08.560
And here we will be calculating the variance inflation factor using the variance_inflation_factor

15:08.560 --> 15:17.710
method, which we have imported from statsmodels.stats.outliers_influence, from the statsmodels library.

15:18.040 --> 15:21.850
And we will just pass in the predictor values.

15:22.420 --> 15:27.850
And the i value is just each and every column which we are iterating upon.

15:28.540 --> 15:33.400
And we are getting all the features which we want to have in this.

15:33.610 --> 15:39.670
So we are calculating the factor for all the columns which are present, so when we print

15:39.670 --> 15:44.990
this, we get all the features and their respective variance inflation factor.

15:45.250 --> 15:49.920
So this factor should be less than ten.

15:49.960 --> 15:53.400
This is what counts as a good value.

15:53.650 --> 15:56.850
So the VIF value should be less than 10.

15:57.160 --> 15:58.860
So that is what we are doing.

15:58.870 --> 16:05.310
We have given the threshold value to be ten.

16:05.650 --> 16:10.840
So based on that, we can find out what features we need to keep.
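
NOTE
Editor's sketch of what a variance inflation factor measures: VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others.
This is a hand-rolled NumPy stand-in that includes an intercept; it is not the statsmodels function used in the lecture, and vif_table and the synthetic columns are hypothetical.
```python
import numpy as np
import pandas as pd
def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) from regressing that column on all the others."""
    vifs = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])   # add an intercept term
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
# a_plus_b is almost a linear combination of a and b, so every VIF is inflated.
X = pd.DataFrame({"a": a, "b": b, "a_plus_b": a + b + rng.normal(scale=0.1, size=300)})
vif = vif_table(X)
print(vif.round(1))
```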

16:11.380 --> 16:16.280
So here is another method implementation which you can directly use.

16:16.510 --> 16:23.530
So I've just given the predictors inside this and the threshold value as ten, and based on the threshold

16:23.530 --> 16:32.380
value, it is again calculating the variance inflation factor using the same library and it is just

16:33.100 --> 16:37.960
removing the column name from the entire column list.

16:39.130 --> 16:44.710
by checking if the value is greater than the threshold value, and giving us the

16:45.870 --> 16:50.830
final details of what we need to keep and what we need to remove.

16:51.060 --> 16:56.370
So this is the final detail which I have received. This data

16:56.370 --> 17:03.180
frame has the seven columns which I should keep from this entire dataset of thirty-three columns.
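
NOTE
Editor's sketch of threshold-based VIF pruning: keep dropping the column with the largest VIF until every remaining VIF is at or below ten.
Dropping one column per pass is an assumed variant, and drop_high_vif is a hypothetical name; the VIFs are read from the diagonal of the inverse correlation matrix rather than from statsmodels.
```python
import numpy as np
import pandas as pd
def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Repeatedly drop the column with the largest VIF until every VIF is at or below threshold."""
    kept = X.copy()
    while kept.shape[1] > 1:
        # The diagonal of the inverse correlation matrix equals the VIF of each column.
        vif = pd.Series(np.diag(np.linalg.inv(kept.corr().to_numpy())), index=kept.columns)
        if vif.max() <= threshold:
            break
        kept = kept.drop(columns=vif.idxmax())
    return kept
rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
X = pd.DataFrame({"a": a, "b": b, "a_plus_b": a + b + rng.normal(scale=0.1, size=300)})
remaining = list(drop_high_vif(X).columns)
print(remaining)
```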

17:05.620 --> 17:12.400
Now, these are a few methods how we can actually select features and this is what we should learn.

17:12.770 --> 17:15.660
Now, there are a few more methods, like

17:15.790 --> 17:22.810
feature selection using Lasso, which we will be learning later when you will be learning

17:23.080 --> 17:24.020
linear regression.

17:24.310 --> 17:28.840
And again, there is another feature selection method, which uses random forest.

17:29.140 --> 17:31.390
So this is something that we will be learning

17:31.390 --> 17:33.670
when we learn about random forest.

17:34.000 --> 17:42.520
So till then you can use these three methods to find out what all the columns you should keep and what

17:42.640 --> 17:43.900
columns you should remove.
