WEBVTT

00:00.150 --> 00:00.880
Hello, everyone.

00:01.260 --> 00:06.930
So in this particular session, we will discuss the implementation of the junk property classification

00:06.930 --> 00:07.390
problem.

00:07.920 --> 00:16.560
So this particular solution is slightly tricky because here we need to decide what we want to achieve

00:17.820 --> 00:18.220
here.

00:18.600 --> 00:27.000
We have to balance out the situation of identifying the junk property properly.

00:28.290 --> 00:37.710
So there would be certain cases where the junk property would be identified as a junk property, and there

00:37.710 --> 00:46.430
would be certain cases where the good property might be classified as junk property.

00:46.440 --> 00:47.130
by our model.

00:48.060 --> 00:56.430
So you need to decide, based on your understanding and your business requirement, whether you want to

00:56.730 --> 01:04.510
classify all the junk properties properly or whether you need to keep all the non-junk properties.

01:04.920 --> 01:14.970
So let's say it is important for me to make sure that no junk property is classified as good; then

01:15.360 --> 01:22.740
there would be cases where some properties which are not junk would still be classified

01:22.740 --> 01:31.320
as junk properties, because here I would keep my threshold such that some non-junk properties would

01:31.320 --> 01:39.780
be classified as junk, but it would make sure that no junk property is skipped or left.

01:40.530 --> 01:43.960
So we need to be clear about what we want to get out of this.

01:44.800 --> 01:53.970
Another scenario could be, let's say it's fine if I have a few junk properties listed, but I don't

01:53.970 --> 01:56.260
want to give up on any good property.

01:57.210 --> 02:06.270
So in that case, I will keep my threshold such that the good properties are all intact and they are

02:06.290 --> 02:07.340
tagged correctly.

02:07.590 --> 02:12.170
I will then have some junk properties listed as well as the good properties.

02:13.260 --> 02:16.720
So you need to balance this out.
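
NOTE
A minimal sketch of the trade-off just described, assuming the model outputs a junk probability per property. The labels, probabilities, and the helper name precision_recall are made up for illustration and are not from the course notebook.
```python
# Hypothetical labels (1 = junk, 0 = good) and predicted junk probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1, 0.55, 0.8]
def precision_recall(y_true, y_prob, threshold):
    """Return (precision, recall) for the junk class at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
# A low threshold catches every junk property but also flags some good ones.
low = precision_recall(y_true, y_prob, 0.3)
# A high threshold never flags a good property but lets some junk through.
high = precision_recall(y_true, y_prob, 0.7)
```
Which end of this trade-off to prefer depends on the business requirement, as discussed above.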

02:17.610 --> 02:18.770
Now let's look at the implementation.

02:19.800 --> 02:25.500
So here, I have simply imported all the libraries, and this is the data which

02:25.500 --> 02:26.010
I have.

02:26.010 --> 02:28.740
I have the junk flag, which is zero or one.

02:29.160 --> 02:37.770
Then there are different columns, such as the state, price indexes, and the agency.

02:38.310 --> 02:45.860
Then the rating, then we have expedited listing, price index four, price index one.

02:45.870 --> 02:52.440
These are different price indexes, which would pertain to different criteria; we

02:52.440 --> 02:53.880
are not domain experts.

02:53.880 --> 02:55.460
So we don't know much about it.

02:55.830 --> 03:03.420
So we will have to analyze these columns and then find out what is important and what is not. Next, what

03:03.420 --> 03:10.410
we have are the zip code, the insurance premium, then the type and architecture columns.

03:10.410 --> 03:13.260
And there are a lot of codes associated.

03:13.860 --> 03:20.700
So in total, we have this much data, that is, sixty-two thousand rows and thirty-one columns

03:20.700 --> 03:21.060
out there.

03:21.630 --> 03:25.460
So these are all the columns which are present and this is the description of those.

03:27.120 --> 03:31.890
So from these you can easily identify outliers if there are any.

03:32.280 --> 03:35.810
So look at the percentiles of price index nine.

03:35.820 --> 03:39.500
The twenty-fifth percentile is fifty-four hundred.

03:39.900 --> 03:41.760
Then we have the fiftieth percentile as

03:42.060 --> 03:44.650
sixty-seven hundred, and the seventy-fifth percentile as

03:44.660 --> 03:45.720
seventy-nine hundred.

03:45.990 --> 03:51.220
But the maximum value is very high in comparison to the seventy-fifth percentile.

03:51.220 --> 03:56.970
So this surely shows that there are a few outliers in the price index.

03:57.000 --> 04:06.000
Then you can see there are many zip codes available, and after that there are different values.

04:06.030 --> 04:09.020
So again, look at the normalized population.

04:09.030 --> 04:13.440
Again, there is a huge gap that is there are certain outliers present here.

04:15.240 --> 04:16.710
So let's look for those.

04:17.070 --> 04:23.220
So in the data types, we have these columns, the state and all of these

04:26.730 --> 04:27.040
here.

04:27.060 --> 04:34.110
We have price index four, price index one, price index six, which should be numeric, but are actually

04:34.320 --> 04:34.880
object.

04:35.130 --> 04:42.720
So we need to make sure that we convert all these objects into numeric types. Then we will decide

04:42.870 --> 04:43.970
the cutoff.

04:44.250 --> 04:46.080
So these are different kinds of values.

04:46.410 --> 04:51.060
So we have decided the cutoff value to be five percent.

04:51.060 --> 04:58.710
So any category which has fewer values than the cutoff of three thousand one

04:58.710 --> 04:59.880
hundred,

04:59.970 --> 05:06.330
we will ignore that particular subcategory. Next, there are certain numerical columns, as we have already

05:06.330 --> 05:14.100
seen: price index four, one, six, price index three. These are all numeric types, but they are stored as object.

05:14.410 --> 05:17.880
So we will be changing them into a numeric type.

05:19.110 --> 05:27.360
So we are converting these columns into numeric columns by using to_numeric with the errors option, which would basically

05:27.360 --> 05:37.080
take care of the warnings. So we have made the transformations, and then we're converting these categorical

05:37.080 --> 05:43.870
variables into numeric form using the cutoff which we have decided, and the same has been done.

05:44.130 --> 05:46.840
So now all our data looks like numeric data.
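
NOTE
A rough sketch of the two clean-up steps just described, written in plain Python so it stands alone. The helper names to_number and group_rare, the "other" label, and the sample values are assumptions for illustration. In pandas, the numeric step would typically be pd.to_numeric(column, errors="coerce"). Here rare categories are merged into one bucket, which is only one way to "ignore" them.
```python
from collections import Counter
def to_number(value):
    """Convert a raw cell to float, or None when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None
def group_rare(values, cutoff_fraction=0.05, other="other"):
    """Replace categories seen in fewer than cutoff_fraction of the rows."""
    min_count = cutoff_fraction * len(values)
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]
# Hypothetical columns, not the course data.
price_index = ["5400", "6700", "?", "7900"]
state = ["A"] * 60 + ["B"] * 37 + ["C"] * 3
numeric = [to_number(v) for v in price_index]
grouped = Counter(group_rare(state))
```
With sixty-two thousand rows, a five percent cutoff works out to the three thousand one hundred mentioned above.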

05:47.040 --> 05:50.850
So once we have this numeric data, we are going to build our model.

05:51.360 --> 06:00.510
Now, on top of this, what else you can do is you can find correlation values, you can apply logistic

06:00.510 --> 06:04.110
regression and you can find out which features are important.

06:04.380 --> 06:06.250
So all these things can be done.

06:06.270 --> 06:07.890
So that is completely up to you.

06:07.890 --> 06:09.810
How you take this up.

06:10.020 --> 06:14.280
If you want to apply pandas profiling, you can apply that.

06:14.280 --> 06:19.230
Also, this is just one of the implementations, so I won't be showing everything.

06:19.470 --> 06:25.410
But I have already given suggestions on how you can create models, so you can use those suggestions.

06:28.500 --> 06:37.530
So this is just a method defined to create the report of the mean validation score and everything.

06:37.530 --> 06:44.350
Then we are running any grid search or random search; this is one wrapper which I have created here.

06:44.790 --> 06:54.360
This particular wrapper will fit a grid search, and after fitting the grid search,

06:54.360 --> 07:00.330
it will give me the desired information and metrics, which I will be using when I make the next step.

07:00.360 --> 07:04.050
So here you can see there are different metrics which we have given.

07:05.280 --> 07:10.230
We will be giving different metric values, and using those different metric values,

07:10.310 --> 07:13.080
You can use any particular metric as a result.

07:13.440 --> 07:22.080
So here what we have done is, we have given the stratified k-fold, and using this stratified k-fold, I'm giving

07:22.380 --> 07:28.740
it as the cv. Next, I'm applying GridSearchCV.

07:28.770 --> 07:31.490
This method will take any model.

07:31.860 --> 07:33.520
It will apply the parameters.

07:33.540 --> 07:36.000
It will take all these values which we usually give.

07:37.170 --> 07:45.630
And for scoring, it will use this scorer, and it will use the refit score, which has been given

07:45.630 --> 07:45.910
here.

07:46.200 --> 07:51.960
So basically, it will be scoring the models based on these scorers.

07:52.270 --> 07:58.460
And then I want to refit this particular model to get the final model, the best model out of it.

07:58.680 --> 08:01.830
It will do that using the refit score.

08:01.860 --> 08:05.830
The refit score, which I have as the recall score, has been set here.

08:06.150 --> 08:07.440
So why have I done this?

08:07.440 --> 08:15.420
Because I am assuming that, for me, if any junk property comes in my listing, then I will have

08:15.420 --> 08:18.170
to face certain losses.

08:18.480 --> 08:25.500
So that is why I want to make sure that I identify these properties correctly.

08:25.650 --> 08:28.220
So I want to increase my recall.

08:28.470 --> 08:31.980
So that is why I'm giving my refit score as recall here.
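
NOTE
A simplified, standard-library sketch of the idea behind the wrapper: score every parameter combination on several metrics, then pick the winner by one chosen refit metric. The names grid_search and threshold_model and the toy data are hypothetical. Unlike the real setup described here, this sketch does no cross-validation. In scikit-learn the same idea is usually expressed through the scoring and refit arguments of GridSearchCV.
```python
from itertools import product
def _count(y_true, y_pred, t_val, p_val):
    return sum(1 for t, p in zip(y_true, y_pred) if t == t_val and p == p_val)
def precision(y_true, y_pred):
    tp, fp = _count(y_true, y_pred, 1, 1), _count(y_true, y_pred, 0, 1)
    return tp / (tp + fp) if tp + fp else 0.0
def recall(y_true, y_pred):
    tp, fn = _count(y_true, y_pred, 1, 1), _count(y_true, y_pred, 1, 0)
    return tp / (tp + fn) if tp + fn else 0.0
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
def grid_search(predict, grid, scorers, refit, X, y):
    """Score every parameter combination; pick the best by the refit metric."""
    results = []
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        y_pred = predict(X, **params)
        scores = {name: fn(y, y_pred) for name, fn in scorers.items()}
        results.append((params, scores))
    return max(results, key=lambda r: r[1][refit]), results
# Hypothetical stand-in for a model: flag junk when the score reaches the threshold.
def threshold_model(X, threshold):
    return [1 if x >= threshold else 0 for x in X]
X = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1, 0.55, 0.8]
y = [1, 1, 1, 0, 0, 0, 0, 1]
scorers = {"precision": precision, "recall": recall, "accuracy": accuracy}
grid = {"threshold": [0.3, 0.5, 0.7]}
(best_params, best_scores), all_results = grid_search(threshold_model, grid, scorers, "recall", X, y)
(prec_params, prec_scores), _ = grid_search(threshold_model, grid, scorers, "precision", X, y)
```
Changing only the refit metric changes which parameters win, which is why the refit choice should follow the business goal.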

08:34.020 --> 08:40.740
So then this will apply the random search or the grid search and make the predictions

08:40.740 --> 08:46.980
using predict, and later it will give the confusion matrix and all of the details back to you.

08:48.510 --> 08:53.210
So what we are doing here is we are deciding our different values.

08:53.430 --> 08:57.030
So here what I am doing is I have created certain scorers.

08:57.030 --> 09:04.130
So here is one scorer, which is taking precision score, recall score, accuracy score, and F-beta score.

09:04.350 --> 09:07.970
So I'm checking all of these scores.

09:07.990 --> 09:12.990
The first thing which I am trying to do here is, because I'm not aware of what beta value

09:12.990 --> 09:17.940
I should be using, I'm just trying different beta values.

09:18.300 --> 09:21.610
You can skip this entire thing and go with the original flow.

09:21.680 --> 09:28.060
Also, there is not much difference, but this is just different trials which have been done.

09:28.560 --> 09:37.350
So here what I am doing is I'm taking the F-beta scorer, and based on the F-beta score, I'm

09:37.350 --> 09:39.660
trying to find out the beta value.

09:39.900 --> 09:45.540
So I have just run all of these models using this grid search wrapper again and again, because it's

09:45.540 --> 09:45.930
a wrapper.

09:45.930 --> 09:53.630
And what it allows me to find out is the best beta, because it is fitting

09:53.640 --> 09:58.170
the model again and again based on the best-performing beta.

10:00.040 --> 10:07.210
It just simply fits it, and then it tells me that the best parameter for the F-beta score is class weight this

10:07.210 --> 10:11.650
and beta this, and then I have displayed the confusion

10:11.990 --> 10:15.280
matrix. So it gives me the confusion matrix results. From this,

10:15.400 --> 10:22.900
you can identify the false negatives, true positives, and false positives.

10:22.900 --> 10:26.110
All of those you can identify easily and then decide.
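
NOTE
A small sketch of how the confusion matrix counts feed the F-beta score, assuming binary labels with 1 for junk. The function names and toy predictions are hypothetical. The formula is the standard one: (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP).
```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = junk."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp
def fbeta(y_true, y_pred, beta):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    _, fp, fn, tp = confusion_counts(y_true, y_pred)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0
# Hypothetical predictions: precise (no false positives) but missing half the junk.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
scores = {beta: fbeta(y_true, y_pred, beta) for beta in (0.5, 1, 2)}
```
Because this toy model has perfect precision but low recall, its score drops as beta grows, which is the behavior you want when missed junk properties are the costly error.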

10:26.140 --> 10:28.570
So here these are the predictions.

10:30.240 --> 10:32.380
Well, these are the wrong predictions.

10:35.310 --> 10:45.620
OK, so you can see this is my data, so this seems fine. Next,

10:46.030 --> 10:53.830
again, there are different combinations, different values coming out. Then for that, I have tried values from

10:53.830 --> 10:55.360
two hundred to two thousand.

10:56.020 --> 11:00.190
And further, I have just tried different values.

11:02.050 --> 11:06.450
Next, I have simply implemented the extra trees classifier.

11:06.640 --> 11:08.800
I'm using that to classify.

11:08.800 --> 11:11.890
I'm just finding out different values again.

11:14.050 --> 11:16.200
Now this is completely up to you.

11:16.210 --> 11:21.610
I came to a conclusion that I will use two hundred as my value here.

11:21.730 --> 11:23.530
I'm going with the F-beta score.

11:23.800 --> 11:28.280
So this is how I'm finding out which F-beta score I will be using.

11:28.690 --> 11:34.450
Now you can completely stick with recall, because recall is the actual metric which will be giving

11:34.450 --> 11:35.460
us the best result.

11:35.920 --> 11:38.620
So we will be using recall at the end.

11:39.880 --> 11:47.020
So again, I'm just creating the scorer again, using the scorer, and after that I'm giving the

11:47.020 --> 11:54.640
class weight and estimators, then fitting the model. After fitting the model,

11:54.970 --> 11:57.580
here I am trying for precision scores.

11:57.580 --> 12:05.500
So this one gives me the best precision score. Here I'm trying for recall; all I have to do is

12:05.980 --> 12:09.590
I will just give the recall score here to the refit.

12:09.790 --> 12:12.760
So it will give me the best results based on the recall score. Here

12:12.880 --> 12:16.180
I've given the precision score in the refit.

12:16.180 --> 12:20.070
So this is giving me results based on the precision score.

12:20.260 --> 12:23.380
So how you want to use it is completely up to you.

12:23.390 --> 12:27.150
You just need to put this precision score inside this scorer.

12:28.540 --> 12:34.580
So next, similarly, the F-beta score has been done here.

12:35.200 --> 12:38.690
So next, how you will be doing it is completely up to you.

12:39.220 --> 12:46.090
So here what I am doing is, let me show you, this is the feature importance.

12:46.330 --> 12:50.110
So this will basically give me the importance of my features.

12:51.340 --> 12:52.720
Let's see what it is doing.

12:53.140 --> 12:57.190
So for this, I have used another method.

12:57.190 --> 12:58.610
So what is this method?

12:58.610 --> 13:00.070
Let me explain this to you.

13:00.550 --> 13:09.880
So the basis of this particular method is that I am having some columns, so I have, let's say, around

13:09.880 --> 13:11.820
60 columns already present.

13:12.400 --> 13:16.480
So in these 60 columns I want to find the feature importance.

13:17.680 --> 13:20.680
Now, when I'm finding out the feature importance,

13:21.110 --> 13:28.160
I'm not sure how many top features are actually important and how many

13:28.240 --> 13:30.160
features are actually unimportant.

13:31.050 --> 13:34.810
OK, there is no line to define that this is good and this is bad.

13:35.350 --> 13:41.980
Obviously, something that is completely random is surely a bad feature.

13:42.010 --> 13:49.540
So if I have something like an ID, then surely that would be a random variable and that is why

13:49.540 --> 13:50.760
it is unimportant.

13:51.460 --> 13:58.530
So I'm doing the same thing here: I'm introducing a random variable, named random.

14:00.100 --> 14:07.690
Now I will train my model by including this variable also, and then I will finally find out the feature

14:07.690 --> 14:09.100
importance of my variables.

14:09.140 --> 14:12.640
Now I have some features.

14:12.890 --> 14:16.740
Now there are 60 features or so, including this random variable.

14:17.170 --> 14:25.690
So there will be some variables, say twenty variables, which will be above this random variable in

14:25.690 --> 14:26.700
the feature importance

14:26.710 --> 14:33.850
chart. So these are actually important features because they do not have random behavior, and whatever

14:33.850 --> 14:40.960
comes below this random variable in the feature importance chart, which I will be generating, those

14:40.960 --> 14:45.790
are completely random variables because they are having importance less than the random variable.

14:47.080 --> 14:51.400
So these are surely unimportant for me.

14:51.790 --> 14:53.880
So I will completely get rid of those.

14:54.040 --> 14:56.200
So that is what I will be using here.

14:58.000 --> 14:59.440
So what we do here is I'm

14:59.620 --> 15:02.290
fitting the model, finding out the importance.

15:02.540 --> 15:08.770
So when I find out the importance, these are different variables, see price indexes, normalized population,

15:08.770 --> 15:09.750
all of these are there.

15:10.090 --> 15:16.540
And further, I have this column, random, which was a completely random variable; this particular

15:16.540 --> 15:18.560
column has completely random values.

15:18.880 --> 15:21.940
So then you see, insurance premium and these others,

15:21.940 --> 15:28.690
these are having importance less than this random variable.

15:29.740 --> 15:37.990
So this means that these variables are surely having no importance in my model-building process, so

15:37.990 --> 15:44.650
I can get rid of all the variables below this random variable.

15:47.410 --> 15:53.980
So here this is one solution which we have found, and I have just gotten rid of all those

15:53.980 --> 16:01.090
columns. I simply kept only the columns for which the importance value is greater than the importance

16:01.090 --> 16:01.900
value of my random variable.

16:03.610 --> 16:04.870
So I kept only those.
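
NOTE
A minimal sketch of the random-baseline filter just described, assuming the importances have already been read from a fitted model into a dictionary. The function name keep_above_random and every importance value below are made up for illustration. Only the ordering mirrors what is shown on screen, with insurance premium falling below the random column.
```python
def keep_above_random(importances, random_name="random"):
    """Keep only features whose importance exceeds the random baseline column."""
    baseline = importances[random_name]
    kept = [name for name, value in importances.items()
            if name != random_name and value > baseline]
    # Most important first, so the result reads like the importance chart.
    return sorted(kept, key=lambda name: importances[name], reverse=True)
# Hypothetical importances, as a fitted tree ensemble might report them.
importances = {
    "price_index_9": 0.31,
    "normalized_population": 0.22,
    "random": 0.08,
    "insurance_premium": 0.05,
    "zip_code": 0.03,
}
selected = keep_above_random(importances)
```
Since the baseline column is itself random, its importance can shift between runs, so features sitting close to it may be worth rechecking with a different seed.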

16:05.630 --> 16:13.720
Then I have simply trained my model one by one in a sequential manner, using the sequential training

16:14.140 --> 16:17.260
by finding out the learning rate and class weight.

16:18.910 --> 16:24.280
It gives me the best learning rate, the best number of estimators, and the best class weight,

16:24.280 --> 16:30.240
and then further, it gives me the maximum depth, which would be the best.

16:30.910 --> 16:37.890
So I keep on doing that, and finally I get the best out of these.

16:38.140 --> 16:40.110
So that is exactly what I have done.

16:41.440 --> 16:43.490
So this was the implementation.

16:43.510 --> 16:45.320
This is my implementation.

16:45.640 --> 16:51.190
You can try your own gradient boosting, or you can apply any other algorithm.

16:51.430 --> 16:53.790
You can try other models also for this.

16:54.160 --> 16:57.410
So that might also perform really well.

16:57.910 --> 17:00.580
You can go ahead and try neural networks as well.

17:00.910 --> 17:03.520
So you can try different combinations.

17:03.520 --> 17:04.750
You can try stacking.

17:04.780 --> 17:08.100
You can try a random forest.

17:08.350 --> 17:11.810
So that is completely up to you how you want to make your model.

17:11.870 --> 17:13.510
This is just one of the samples.

17:13.750 --> 17:18.100
I hope you understood the problem and this particular solution.

17:18.430 --> 17:19.000
Thank you.