WEBVTT

00:00.300 --> 00:06.160
OK, now let's have a look at the solution for the safe driver classification problem.

00:06.660 --> 00:10.360
So, about the data science field:

00:10.380 --> 00:19.950
it has basically gained strength, and we have a lot of statistical libraries which can help us to compute

00:19.950 --> 00:22.230
regression or classification of data.

00:22.770 --> 00:30.390
Now, actually, the data scientists who work in insurance companies,

00:30.900 --> 00:35.060
they were called actuaries a long time ago.

00:35.550 --> 00:45.810
So they used to collect data from different sources and analyze that, along with claim data, to identify fraudulent

00:45.810 --> 00:50.630
transactions, and that led them to classify or adjust the premiums.

00:51.660 --> 00:59.180
So if anything, the data science technology of today has given them far more tools to perform their analysis

00:59.190 --> 01:00.010
very easily.

01:00.900 --> 01:09.090
So the data here has a few ordinal, categorical columns that need to be handled and digitized properly.

01:11.130 --> 01:11.530
Here,

01:11.700 --> 01:23.640
our goal is to predict a binary outcome: one, which indicates a safe driver, and zero, which indicates

01:23.640 --> 01:26.680
that the driver's data needs to be reviewed.

01:27.750 --> 01:38.610
So if the driver is safe, then it means that this particular accident could be a genuine accident. If

01:39.060 --> 01:40.680
the driver is not safe,

01:40.680 --> 01:47.070
then that could mean that there could be some problem with this particular data or there could be some

01:47.280 --> 01:48.610
flaws in the data.

01:50.370 --> 01:55.080
So this could be a fraudulent insurance claim.

01:56.130 --> 02:03.660
So we will look at the continuous variables, and we will fill in the missing data with the mean or median

02:03.660 --> 02:11.890
in order to not skew the results. Now, after cleaning up the data and filling the missing values,

02:11.910 --> 02:19.410
we will look at the features and their correlations so that we can drop highly correlated features, which

02:19.410 --> 02:20.850
may impact our results.

02:21.870 --> 02:30.020
So what we are doing here is we are importing all the important Python libraries and packages: importing

02:30.030 --> 02:32.400
pandas and NumPy for data frames,

02:32.700 --> 02:38.670
Matplotlib for visualization, and sklearn for statistical algorithms, as and when needed.

02:41.040 --> 02:47.020
So these are all the different imports which we have done here.

02:47.340 --> 02:52.250
What we are doing is we are getting the safe driver data set from this Excel sheet.

02:53.730 --> 02:57.780
And next, we are finding out the data

02:58.140 --> 03:04.440
where the safe driver value is less than zero, that is, the value is less than zero, and finding the sum of it,

03:04.740 --> 03:09.510
checking if there are any columns which have negative values.
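
NOTE
A minimal sketch of the negative-value check described above, assuming pandas.
The tiny frame and its column names are hypothetical stand-ins for the safe driver sheet.
```python
import pandas as pd
# Hypothetical stand-in for the safe driver data; column names are assumptions.
df = pd.DataFrame({
    "target": [1, 0, 1, 1],
    "engine_hp": [140, 150, -5, 120],
    "miles_driven_annually": [12000.0, -1.0, 9000.0, None],
})
# Compare every numeric cell with zero, then sum the True flags per column.
negative_counts = (df.select_dtypes("number") < 0).sum()
print(negative_counts)
```
A missing value compares as False here, so it is not counted as negative.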

03:11.640 --> 03:14.030
So these are different values.

03:14.050 --> 03:20.800
So for that, the ID has this value, target has this value, and these are different values, and they

03:20.820 --> 03:23.760
don't really look very useful.

03:24.300 --> 03:29.020
So next, we are checking if there are any null values in this particular data set.

03:29.050 --> 03:37.150
So here, this is the driver info.

03:37.470 --> 03:40.010
So these are all the columns that are present here.

03:40.040 --> 03:44.800
Here you can see there are a lot of int64 columns.

03:45.030 --> 03:52.520
One, two, three, four, five, six, seven, eight, nine columns which are of object type.

03:52.980 --> 03:56.470
Then there are seven int64 columns and one float column.

03:57.990 --> 04:02.180
Apart from that, these do not have any null values.

04:02.230 --> 04:05.130
So we are actually good on that particular front

04:08.460 --> 04:09.100
here.

04:09.390 --> 04:14.210
Only the count of miles driven annually and its bucket is lower.

04:16.530 --> 04:23.040
Now let us describe this particular data and see the numerical variables.

04:23.610 --> 04:30.290
So ID has values which could be dropped. For target,

04:30.730 --> 04:35.200
you can see that from 25 percent to 50 percent

04:35.220 --> 04:43.710
there are a lot of zeros, but after 50 percent, you can see most of the values have ones in

04:43.710 --> 04:43.890
them.

04:45.030 --> 04:51.600
Next, we have engine HP, where you can see there are a few outliers.

04:53.610 --> 04:55.170
Credit history looks good.

04:57.210 --> 05:00.510
Years of experience, again, looks fine.

05:03.990 --> 05:06.160
This is also fine.

05:06.180 --> 05:12.360
There seems to be a strange increase in the miles driven annually.

05:12.540 --> 05:20.130
So this looks like there is some outlier present in miles driven annually. Size of the family

05:20.130 --> 05:20.760
looks fine.

05:23.560 --> 05:31.240
So now we will have a look at the credit history. So this is basically the credit history.

05:31.270 --> 05:33.390
So this is the plot for the credit history.

05:34.990 --> 05:36.670
So here you can see.

05:46.470 --> 05:55.570
Now we will check and see if we have an imbalanced class. So the number of true claims is twenty-one thousand

05:55.570 --> 05:58.840
three ninety-six, and the total number of records is 30,000.

05:59.200 --> 06:02.840
So the percentage of true claims is almost 71 percent.
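
NOTE
The arithmetic behind the 71 percent figure can be checked directly.
This sketch uses only the two counts quoted in the walkthrough; the variable names are made up for illustration.
```python
true_claims = 21396    # count quoted in the walkthrough
total_records = 30000  # count quoted in the walkthrough
# Share of each class as a percentage of all records.
true_share = 100 * true_claims / total_records
false_share = 100 - true_share
print(f"true claims: {true_share:.2f}%, false claims: {false_share:.2f}%")
```
With these counts the split works out to roughly 71 percent against 29 percent.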

06:04.510 --> 06:12.190
So this means that our dataset is imbalanced: 29 percent of the data will be having false claims

06:12.190 --> 06:15.300
and 71 percent of the data will be having true claims.

06:15.580 --> 06:18.240
So we will have to use some weighted classes here.

06:20.500 --> 06:25.240
So we will balance later using this technique.

06:25.240 --> 06:33.940
And the dataset contains several categorical columns, and those that end with "bucket" need to be either

06:33.940 --> 06:36.640
dropped or converted to numerical values.

06:36.680 --> 06:39.580
So if we need to get rid of the categorical data,

06:39.580 --> 06:44.080
then we will drop those columns or we will convert them into dummy columns, as we have already discussed

06:44.080 --> 06:45.130
how we actually do it.

06:46.750 --> 06:52.240
So these are the different categorical columns, which are gender, marital status, vehicle type, age bracket,

06:52.240 --> 06:58.450
engine HP bucket, years of experience bucket, miles driven annually bucket, credit history bucket, and state.

06:59.380 --> 07:06.570
So among these categorical variables, we will retain gender, marital status and age bracket.

07:07.830 --> 07:14.860
Regarding the engine HP bucket, years of experience bucket, miles driven bucket, and credit history bucket,

07:15.190 --> 07:20.680
they have a corresponding continuous variable, as you can see from these values.

07:28.920 --> 07:31.100
These have continuous values.

07:33.660 --> 07:42.450
So what we do here is we will keep the age bracket, as there is no continuous variable present for

07:42.450 --> 07:42.940
age.

07:45.030 --> 07:51.710
So now we can split the data by state and analyze each state differently.

07:53.520 --> 07:58.980
So for now, we will drop this state column and analyze the data across the entire nation.

08:02.300 --> 08:09.490
So we are dropping these columns, which are the engine HP bucket, years of experience bucket, miles driven bucket, and credit history

08:09.500 --> 08:10.280
bucket, for now.

08:12.290 --> 08:15.080
And we will analyze this data.

08:15.110 --> 08:20.090
So here you can see this is the target, one zero one zero one, then gender,

08:20.100 --> 08:27.260
F and M, engine HP, credit history, years of experience, annual claims, marital status, vehicle type, miles driven,

08:27.770 --> 08:30.620
size of family, age bracket, and state.

08:33.040 --> 08:35.500
Now,

08:41.210 --> 08:49.220
we take the median for the vehicle type truck, as all the NaN values are for trucks only.

08:50.090 --> 08:56.120
So basically what it means is that if we are talking about a particular type of vehicle,

08:57.020 --> 09:04.070
then for that we can take the median. Say, if we are talking about trucks, then the median of the trucks

09:04.370 --> 09:06.240
could be the value for the truck.

09:06.530 --> 09:10.430
Similarly for cars, the median of the cars would be the value for the car.

09:10.700 --> 09:16.820
So if you want to impute values, if you have null values, then you would impute them by the type

09:16.820 --> 09:17.840
of vehicle.

09:19.160 --> 09:26.390
So for trucks we have taken the median of trucks, and for cars we have taken the median of cars, and imputed that.

09:29.480 --> 09:31.200
So these are the values.

09:31.220 --> 09:32.930
So we have, for cars,

09:32.990 --> 09:34.400
the target as one, and for trucks too.

09:35.060 --> 09:38.750
These are the values for engine HP: car is 140.

09:38.760 --> 09:40.200
Truck is 150.

09:40.240 --> 09:43.350
Utility is 130, van is 120.

09:43.670 --> 09:48.640
So you can see that the engine HP is also varying with the vehicle type.

09:49.580 --> 09:55.120
When we talk about the credit history, it is also varying with the vehicle type.

09:55.700 --> 09:58.580
When we talk about car and truck, it is almost similar.

09:59.150 --> 10:02.810
For utility and van, it is also similar amongst these.

10:04.040 --> 10:11.660
When we talk about years of experience, car and truck have almost seven, and for utility and van, it

10:11.660 --> 10:12.410
is almost seven.

10:12.410 --> 10:16.670
So you can see there is a little similarity between these two.

10:17.990 --> 10:22.940
So when we talk about car and truck, the values are a little similar for these.

10:24.590 --> 10:30.070
When we talk about utility and van, the values for these two are similar.

10:30.440 --> 10:32.950
So you can see the relationship being created here.

10:34.310 --> 10:41.200
So we will replace the NaN values in miles driven annually with the median value for truck.

10:41.930 --> 10:48.590
There may be a better way to fill these missing values, but we have just a few NaNs out of

10:48.590 --> 10:54.080
some 30,000-plus rows, which is less than zero point three percent, so filling with the median does

10:54.080 --> 10:55.960
not make much difference.

10:55.970 --> 10:57.300
So we can do that very simply.
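
NOTE
A minimal sketch of filling missing miles with the median of the same vehicle type, assuming pandas.
The rows and the column names vehicle_type and miles_driven_annually are hypothetical, not the real sheet headers.
```python
import numpy as np
import pandas as pd
# Toy rows; the column names are assumptions, not the real sheet's headers.
df = pd.DataFrame({
    "vehicle_type": ["Car", "Car", "Truck", "Truck", "Truck", "Van"],
    "miles_driven_annually": [10000.0, 12000.0, 20000.0, np.nan, 24000.0, 9000.0],
})
# Median of each vehicle type, broadcast back to every row of that type.
group_median = df.groupby("vehicle_type")["miles_driven_annually"].transform("median")
# Fill only the missing cells with the median of their own vehicle type.
df["miles_driven_annually"] = df["miles_driven_annually"].fillna(group_median)
print(df)
```
Using transform keeps the result aligned with the original rows, so fillna touches only the gaps.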

10:59.270 --> 11:01.040
Next, we will check for null values.

11:01.040 --> 11:02.480
If there are any other null values,

11:02.480 --> 11:04.850
then we will impute them again.

11:07.880 --> 11:18.080
So here we don't have any null values; all the values have been imputed. So when we have a look at the

11:18.080 --> 11:24.860
feature values, the ranges of these different values are far apart from each other.

11:28.790 --> 11:36.740
Right, so you can see that the values are ranging a lot: this has one, this has 695.

11:37.070 --> 11:39.980
This is seven one one three one four seven.

11:40.010 --> 11:40.750
This is four.

11:41.030 --> 11:44.320
So these are not scaled accordingly.

11:45.170 --> 11:51.940
So we will be using the preprocessing package from sklearn to scale all these columns accordingly.

11:52.640 --> 11:58.490
And after scaling, you can see we have scaled these columns and we still have these categorical columns

11:58.490 --> 11:59.840
which are still present here.
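
NOTE
A small sketch of scaling only the numeric columns while the categorical ones stay as they are.
StandardScaler is an assumption; the walkthrough only names the sklearn preprocessing package. Column names are hypothetical.
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({
    "engine_hp": [140.0, 150.0, 130.0, 120.0],
    "credit_history": [695.0, 711.0, 640.0, 720.0],
    "gender": ["F", "M", "M", "F"],
})
numeric_cols = ["engine_hp", "credit_history"]
# StandardScaler is an assumption; it rescales each column to mean 0 and unit variance.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
print(df)
```
Any other scaler from the same package, for example MinMaxScaler, could be swapped in the same way.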

12:01.550 --> 12:09.260
So next, we are using data visualization for finding out the distribution of the features and also

12:09.260 --> 12:11.310
the correlation between different features.

12:13.190 --> 12:18.950
Now, after that, we can drop one or two features based on the distribution or correlation accordingly.

12:20.750 --> 12:24.050
So for now, we are dropping the target.

12:25.730 --> 12:33.710
OK, and we are creating the scatterplot for the bivariate relationships.

12:35.120 --> 12:46.670
So here we have different columns, that is, engine HP, credit history, years of experience, annual

12:46.670 --> 12:49.490
claims, miles driven.

12:49.820 --> 12:52.520
And you can see how the relationship is present.

12:54.720 --> 12:58.200
So these are almost uniform.

13:01.210 --> 13:03.070
This is uniform, this is uniform.

13:03.500 --> 13:04.550
This is also uniform.

13:06.180 --> 13:12.240
These also show no relationship here.

13:12.370 --> 13:22.020
You can see that between credit history and engine HP, there is some relationship.

13:22.870 --> 13:29.550
So these lines are actually showing relationships, but we don't really know the magnitude of them.

13:29.560 --> 13:31.710
So we can't really take a decision as of now.

13:32.500 --> 13:39.310
So we will create the heat map, or find out the correlation coefficient, for this. We have still got an idea

13:39.310 --> 13:41.350
that these are actually correlated,

13:41.650 --> 13:43.520
but the magnitude we still need to find.

13:45.670 --> 13:49.650
So here we have the magnitude of the correlation coefficient.

13:51.460 --> 14:01.750
So here you can see that the target and engine HP has minus 22 096, credit history has zero point

14:01.750 --> 14:02.730
zero zero two one.

14:03.040 --> 14:10.410
So the features are not highly correlated with the target, so we can keep the remaining features as

14:10.410 --> 14:10.810
they are.
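
NOTE
A minimal sketch of reading correlation magnitudes against the target, assuming pandas.
The numbers and column names below are invented for illustration and are not the values shown on screen.
```python
import pandas as pd
df = pd.DataFrame({
    "target": [1, 0, 1, 1, 0, 1],
    "engine_hp": [140, 150, 130, 120, 160, 135],
    "credit_history": [695, 711, 640, 720, 600, 680],
})
# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
# Correlation of every feature with the target, strongest first by magnitude.
target_corr = corr["target"].drop("target").sort_values(key=abs, ascending=False)
print(target_corr)
```
The same corr frame is what a heat map would draw, so the plot and the numbers stay consistent.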

14:14.570 --> 14:21.320
So we are keeping all of these features. Now, next, what we are doing is we will have a look at the

14:21.320 --> 14:28.340
relationship of the dependent variable with the categorical variables. So far we looked at the numerical variables,

14:28.340 --> 14:31.910
but now we will take into consideration the categorical variables.

14:32.660 --> 14:35.360
So we will plot the box plots

14:35.370 --> 14:40.090
against the target variable. So these are the box plots.

14:40.370 --> 14:43.140
So here you can see not much difference.

14:43.160 --> 14:44.700
Again, not much difference.

14:45.200 --> 14:48.680
This, again, has not much difference.

14:50.390 --> 14:51.710
This is again, similar.

14:52.040 --> 14:58.490
This is, again, similar. The only differing one is the size of the family.

14:58.520 --> 15:02.630
So with the size of the family, the target is slightly varying.

15:05.220 --> 15:14.520
So as the size of the family is increasing, the target is actually changing. You can also see that the

15:14.520 --> 15:22.380
box plot indicates that there are some outliers in engine HP, a lot of outliers, actually, in credit

15:22.380 --> 15:24.030
history and miles driven.

15:24.330 --> 15:32.550
So here in miles driven, again, there are outliers too, but we need to keep the outliers unless

15:32.550 --> 15:36.240
they affect our results and take another look at them later.

15:36.270 --> 15:37.790
So we will take care of it later.

15:39.660 --> 15:42.870
Then we separate our feature set from the label, or target.

15:42.870 --> 15:49.710
We convert all of them into numeric variables and split the features into our training and test data.

15:49.710 --> 15:56.220
So we will convert the categorical features into numeric by giving weights to each value.

15:56.700 --> 15:58.120
So here we have gender.

15:58.140 --> 16:00.390
We are assigning one to F and two to M.

16:01.050 --> 16:05.080
For marital status, one to single and two to married. For vehicle type,

16:05.130 --> 16:09.180
we are using the label encoder, and for age bracket again we are using the label encoder.
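
NOTE
A short sketch of the encoding just described: hand-picked integer codes for the two-valued columns and a label encoder for the rest.
The column names and the category spellings (F, M, Single, Married, Car, Truck, Van) are assumptions.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
    "gender": ["F", "M", "M"],
    "marital_status": ["Single", "Married", "Single"],
    "vehicle_type": ["Car", "Truck", "Van"],
})
# Manual ordinal codes as described: F and Single get 1, M and Married get 2.
df["gender"] = df["gender"].map({"F": 1, "M": 2})
df["marital_status"] = df["marital_status"].map({"Single": 1, "Married": 2})
# LabelEncoder assigns integer codes in sorted order of the category names.
df["vehicle_type"] = LabelEncoder().fit_transform(df["vehicle_type"])
print(df)
```
Each column stays a single column this way, which is the dimensionality argument made in the walkthrough.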

16:11.550 --> 16:17.610
Now we are not using dummies or a one-hot encoder because these create sparse matrices and increase the

16:17.610 --> 16:20.840
dimensionality. By giving one or two for

16:21.180 --> 16:25.710
those details, we are giving a higher weight to married by assigning a value of two.

16:27.390 --> 16:30.660
Similarly, we are giving a higher weight to M here.

16:30.720 --> 16:31.010
I'm.

16:35.460 --> 16:45.200
Now, let us look at the head of the data frame we have now. After converting these using label

16:45.240 --> 16:49.230
encoders and numeric labels, this is the final data which we have here.

16:51.520 --> 16:58.720
Now, we will drop the target column from the training data frame, as that is the label, and we will

16:58.720 --> 17:02.200
create a new data frame, y, which will have the target values.

17:03.070 --> 17:09.670
Now we will use dummies to resolve the remaining categorical data types into numeric values.

17:11.320 --> 17:20.000
And next, as we have already found, the target labels are imbalanced, with 70 percent failure,

17:20.020 --> 17:28.030
that is, a bad driver, where the target value is one, and 30 percent success, where the person is a good

17:28.030 --> 17:30.700
driver and the target is zero.

17:32.230 --> 17:36.880
So now we will balance the classes using SMOTE.

17:37.280 --> 17:45.790
OK, so we are simply importing SMOTE from imblearn, that is, imblearn.over_sampling.

17:45.790 --> 17:48.490
So we are simply sampling the data for the.

17:50.530 --> 17:55.110
So after sampling the data, we basically create an equal dataset here.

17:55.570 --> 18:01.000
So the length of the oversampled data is forty two thousand seven hundred and ninety two.

18:02.230 --> 18:09.540
The number of negative class rows in the oversampled data is twenty one thousand three ninety six

18:09.550 --> 18:14.890
and the number of positive class rows in the oversampled data is twenty one thousand three ninety six.

18:15.040 --> 18:16.990
So now we have created equal.

18:17.320 --> 18:20.350
Basically we have created a balance between the classes.

18:20.680 --> 18:29.320
So we have oversampled one of the classes so that we could have the same number of rows of data for

18:29.680 --> 18:30.380
each class.

18:31.330 --> 18:33.970
This is what SMOTE has given us.
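
NOTE
A simplified, SMOTE-style sketch in plain NumPy, not the imblearn implementation used in the walkthrough.
It interpolates between two randomly chosen minority rows, whereas real SMOTE picks among the k nearest neighbours. All data here is synthetic.
```python
import numpy as np
rng = np.random.default_rng(0)
# Toy imbalanced data: 6 majority rows (label 1) and 3 minority rows (label 0).
X = rng.normal(size=(9, 2))
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0])
minority = X[y == 0]
# Number of synthetic minority rows needed to match the majority count.
n_new = int((y == 1).sum() - (y == 0).sum())
synthetic = []
for _ in range(n_new):
    i, j = rng.choice(len(minority), size=2, replace=False)
    gap = rng.random()
    # New point lies on the segment between two distinct minority rows.
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
X_bal = np.vstack([X, np.array(synthetic)])
y_bal = np.concatenate([y, np.zeros(n_new, dtype=int)])
print(len(y_bal), (y_bal == 1).sum(), (y_bal == 0).sum())
```
The same bookkeeping explains the figures quoted above: 21,396 rows in each class gives 42,792 rows in total.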

18:36.430 --> 18:42.550
So we will have to find out how significant our features are in predicting our label.

18:42.880 --> 18:47.300
So we will use the feature importance method from the random forest classifier.

18:48.190 --> 18:53.140
So after that, we will need the importance of the features using a.

18:54.550 --> 19:04.330
Again, you can add a random column, with random values in it, and find out the feature importance

19:04.330 --> 19:13.690
for all the features. Whatever columns are falling under or below the value of the random

19:13.690 --> 19:14.070
column,

19:14.470 --> 19:23.730
you can remove those columns and keep whatever is above the random column in the feature importance

19:23.740 --> 19:24.110
list.

19:24.430 --> 19:32.130
So this is the quickest and most stable way of doing so.
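
NOTE
A minimal sketch of the random-column benchmark described above, assuming scikit-learn's RandomForestClassifier.
The data is synthetic and the column names signal and random are made up for illustration.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
X = pd.DataFrame({
    "signal": signal,               # this column fully determines the label
    "random": rng.normal(size=n),   # pure-noise benchmark column
})
y = (signal > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
# Keep only features that score strictly above the noise benchmark.
kept = importances[importances > importances["random"]].index.tolist()
print(importances, kept)
```
Anything that cannot beat a column of noise is a candidate for removal; the benchmark column itself is dropped afterwards.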

19:32.140 --> 19:37.480
What we do here is we are fitting the random forest, and from the random forest we are finding out

19:37.480 --> 19:42.280
the feature importances, and you can see these ones are having very low importances.

19:42.280 --> 19:43.450
That is almost zero.

19:43.720 --> 19:46.050
And these are highly important.

19:47.650 --> 19:50.200
So these are my different variable importances.

19:50.200 --> 19:57.850
So credit history is high, miles driven annually is high, then years of experience is a little lower, gender is lower

19:57.850 --> 20:03.640
than that, then age bracket, vehicle type, marital status, size of family, and annual claims.

20:07.020 --> 20:16.320
What we can do for the rest is run the different models on top of it. You can run the logistic

20:16.320 --> 20:25.470
regression, then run the random forest, decision tree and so on, and find out the best model that

20:25.470 --> 20:29.490
you get out of it and then create the confusion matrix accordingly.

20:30.060 --> 20:31.290
From the confusion matrix,

20:31.620 --> 20:34.270
again, you can plot the ROC curve here.

20:34.290 --> 20:36.410
I have plotted the ROC curve here.

20:36.720 --> 20:41.430
So here you can see this is the ROC curve which has been generated.

20:41.940 --> 20:45.260
So I get zero point six six out of it.

20:45.510 --> 20:49.480
So our confusion matrix based on the decision tree does not look good.

20:49.860 --> 20:53.620
So it is showing a high number of false positives and false negatives.

20:53.850 --> 21:03.110
So next, we are building a support vector classifier, and from the support vector classifier, yet again,

21:03.120 --> 21:10.390
we plot the ROC curve itself, which again shows a lower performance in comparison to the previous one.

21:10.920 --> 21:17.060
So the SVM classifier returns worse results than the decision tree.

21:17.100 --> 21:22.300
So we will try the stochastic gradient descent classifier. This is the SGD classifier which we have.

21:23.070 --> 21:27.630
And similarly, we can try different algorithms.

21:27.630 --> 21:33.200
And from all these algorithms you can build and find out the best one.

21:33.230 --> 21:35.340
Find out which one works the best.

21:36.890 --> 21:38.220
And use that one.
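
NOTE
A small sketch of trying several classifiers and keeping the best one, assuming scikit-learn.
The data comes from make_classification, so the scores say nothing about the safe driver data; the model list is only an example.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic stand-in data; not the safe driver dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
# Fit each candidate and record its accuracy on the held-out split.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test) for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores, best)
```
Accuracy is used here for brevity; precision, recall, or ROC AUC can be compared with the same loop.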

21:38.250 --> 21:45.900
So here I have all the precision and recall summaries for all of those, and then the comparison of all of

21:45.900 --> 21:46.260
these.

21:46.530 --> 21:50.860
And you can see that gradient boosting is performing pretty well here.

21:51.600 --> 21:54.120
So that is something which I will be using.

21:54.360 --> 21:58.270
And here again, you can see gradient boosting is actually performing pretty well.

21:59.010 --> 22:01.650
So that is the one which we will be picking up.

22:03.210 --> 22:07.650
So we are using gradient boosting in this particular solution.

22:07.950 --> 22:11.220
But you can again use any combination.

22:11.220 --> 22:20.430
You can use gradient boosting as your model, and maybe you can use a decision tree also, and create a stack out of

22:20.430 --> 22:24.760
these and use them. So it's completely up to you how you want to solve the problem.

22:25.170 --> 22:27.630
This is just one of the solutions.