WEBVTT

00:00.960 --> 00:07.470
In this session, we will discuss the implementation of KNN, so let's have a look at

00:07.470 --> 00:07.800
this.

00:08.500 --> 00:13.770
So for this particular implementation, I have taken the Pima Indians

00:13.780 --> 00:14.980
Diabetes dataset.

00:15.720 --> 00:18.950
So let us start with importing the required libraries.

00:19.500 --> 00:24.460
So the important libraries are NumPy, pandas, and Matplotlib.

00:28.060 --> 00:36.280
Next, we will load the dataset. For loading the dataset, we will again use pd.read_csv, which

00:36.280 --> 00:41.750
will allow us to read CSV files.

00:41.980 --> 00:47.140
This is a CSV file, and there is no special delimiter.

00:47.530 --> 00:51.400
So that is why we are not giving any delimiter here. And here

00:51.400 --> 00:53.800
we have the data frame.

00:54.660 --> 00:57.860
Now we will be printing the first five rows of the data.

00:58.620 --> 01:03.920
So here we have the first five rows of the data frame, which we have printed using .head().

01:04.940 --> 01:16.370
Now, the columns are pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes

01:16.370 --> 01:20.450
pedigree function, age, and outcome.

01:22.210 --> 01:30.670
Here, outcome 1 represents that the person has diabetes, and outcome 0 represents that the person does

01:30.670 --> 01:32.080
not have diabetes.

01:33.010 --> 01:35.950
So now we will check the shape of the data frame.

01:36.310 --> 01:39.370
So the shape of the data frame is

01:39.370 --> 01:41.210
(768, 9).

01:41.470 --> 01:45.970
This we have obtained using the shape attribute.

01:46.330 --> 01:51.120
So .shape gives the shape of the data frame, that is, (768, 9).

01:52.000 --> 01:58.960
So we can see that we have 768 rows and 9 columns.
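
NOTE
A minimal sketch of the loading steps described above, not the instructor's actual notebook.
The inline CSV text and its values are made up for illustration; the column names follow the
transcript, and the variable name df is an assumption.
```python
import io
import pandas as pd
# Made-up three-row stand-in for the real CSV file (values are illustrative only).
csv_text = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,30,80,31.5,0.45,41,1\n"
    "1,90,64,25,0,24.8,0.30,29,0\n"
    "4,150,78,33,120,35.1,0.60,52,1\n"
)
# read_csv splits on commas by default, so no delimiter argument is needed.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())    # first five rows (only three exist here)
print(df.shape)     # (rows, columns)
```
With the real file, the path to the CSV would replace the StringIO object.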

01:59.230 --> 02:08.020
And out of these nine columns, the first eight columns are the features, that is, these first eight columns

02:08.140 --> 02:15.400
are the features, and the last column is the target, or the label value, which we want to predict.

02:17.800 --> 02:27.530
So we will create NumPy arrays for features and targets. The first array will be X and the second

02:27.580 --> 02:30.040
will be y. X will contain the

02:31.470 --> 02:37.170
feature values, that is, from pregnancies to age, and y will contain the outcomes.

02:38.190 --> 02:45.990
How do we do that? We simply take the data frame, and if we drop the outcome from it, we will get

02:45.990 --> 02:48.010
all of these columns except the outcome.

02:48.600 --> 02:50.060
So that is what we do here.

02:50.550 --> 03:03.330
We call .drop with the column name and the axis number. Axis 0 stands for rows and axis 1 stands

03:03.330 --> 03:04.590
for columns.

03:04.860 --> 03:08.880
And then we get the values using .values.

03:09.060 --> 03:13.140
So we are getting all the values from the data frame.

03:15.010 --> 03:23.050
That is after dropping the outcome from this. Now, y will have only the outcome.

03:23.710 --> 03:31.630
So we are taking the outcome column's values. We could have taken the data frame itself also, by doing the

03:31.930 --> 03:35.700
drop of outcome with axis 1, and the

03:37.370 --> 03:40.830
column with the outcome, but this is another way.

03:40.880 --> 03:42.800
So there are two ways we can do it.

03:43.040 --> 03:48.200
So it is completely up to you whether you want to get a NumPy array or a data frame.
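
NOTE
A small sketch of the two ways of building features and targets mentioned above. The tiny
data frame and its values are made up; only the Outcome column name comes from the transcript.
```python
import pandas as pd
# Tiny made-up frame with two feature columns and the Outcome label.
df = pd.DataFrame({"Glucose": [148, 85, 183], "Age": [50, 31, 32], "Outcome": [1, 0, 1]})
# axis=1 drops a column; axis=0 would drop a row label instead.
X = df.drop("Outcome", axis=1).values    # NumPy array of feature values
y = df["Outcome"].values                 # NumPy array of labels
# Alternative: leave out .values to keep pandas objects.
X_df = df.drop("Outcome", axis=1)
y_series = df["Outcome"]
print(X.shape, y.shape)
```
Either form can be passed to scikit-learn estimators.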

03:53.980 --> 03:58.300
Now we will split the data into training and testing datasets.

03:59.310 --> 04:07.020
So let us split it using train_test_split. We will import train_test_split from

04:07.020 --> 04:07.500
sklearn.model_selection.

04:08.070 --> 04:14.760
So from model_selection we are importing train_test_split, and we will split the data.

04:15.210 --> 04:20.520
Now, here I am splitting the data with test_size equal to 0.4.

04:21.710 --> 04:30.260
And here I am setting stratify to y. We have given random_state to be 42.

04:30.290 --> 04:33.740
You can give any random state based on your convenience.

04:34.380 --> 04:37.810
And these are the values which we will be getting from this.

04:38.330 --> 04:50.600
So X will be divided into X_train and X_test, and y will be divided into y_train and y_test.
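
NOTE
A hedged sketch of the split described above, using made-up data instead of the diabetes
file. The test size of 0.4 and random_state of 42 follow the transcript; stratify=y is how
the garbled caption is read here, so treat it as an assumption.
```python
import numpy as np
from sklearn.model_selection import train_test_split
# Made-up data: 100 samples, 8 features, 35 positives and 65 negatives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = np.array([1] * 35 + [0] * 65)
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())    # share of positives in each part
```
Changing random_state changes which rows land in each part, not the sizes.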

04:53.180 --> 04:56.970
Now, let's create a classifier using the k-nearest neighbors algorithm.

04:57.440 --> 05:04.320
So first, let us observe the accuracy for different values of k. Now, what is k?

05:04.340 --> 05:07.070
The number of nearest neighbors.

05:07.760 --> 05:16.700
OK, so what we will do is import KNeighborsClassifier from sklearn.neighbors.

05:18.030 --> 05:24.360
Now we will set up arrays to store the training and the testing accuracy. So we have created this

05:24.360 --> 05:25.350
array, neighbors,

05:26.560 --> 05:30.010
which has values from one to nine.

05:31.100 --> 05:32.600
Now, we are creating

05:34.330 --> 05:39.220
one array as train_accuracy and another array as test_accuracy.

05:40.890 --> 05:42.870
Now we are running a loop

05:44.110 --> 05:49.060
on the values of k, that is, on the values of the

05:50.630 --> 05:56.180
neighbors, which will be enumerated. So basically, on all the neighbors, we are

05:57.390 --> 05:58.780
running this particular loop.

06:00.040 --> 06:03.940
Now we are creating an object of the KNN classifier.

06:04.830 --> 06:13.110
And in this object, we are putting the number of neighbors as k. So each time, it will keep running the loop

06:13.440 --> 06:20.380
and it will change the number of neighbors, from one upward. All the values will be taken in this case.

06:20.730 --> 06:27.120
Now, we are fitting the model using knn.fit and giving it X_train and y_train.

06:27.510 --> 06:27.900
So

06:27.900 --> 06:31.230
it learns from the X_train and y_train values.

06:32.770 --> 06:41.320
And after the fit is done, we can get knn.score. So for the training score, we are running it on the

06:41.320 --> 06:50.090
values from X_train and y_train, and we are getting the training accuracy from this. And for testing accuracy,

06:50.410 --> 06:53.980
we are running it on X_test and y_test.

06:55.170 --> 07:02.760
Now, from this, we have obtained these two arrays, which have the training accuracy and testing accuracy.
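
NOTE
A sketch of the accuracy loop described above, run on synthetic data from make_classification
rather than the diabetes file. The exact range of k is an assumption, and the array names are
taken from how the captions read.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Synthetic stand-in data, not the diabetes dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
neighbors = np.arange(1, 9)    # assumed range: k = 1, 2, ..., 8
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)    # a fresh classifier for each k
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)
best_k = neighbors[np.argmax(test_accuracy)]     # k with the highest testing accuracy
print(best_k, test_accuracy.max())
```
Plotting neighbors against both arrays gives the two-line chart discussed next.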

07:04.690 --> 07:13.170
So now we will create the plot to actually visualize what we have obtained from the training accuracy and

07:13.180 --> 07:14.830
the testing accuracy.

07:16.690 --> 07:22.660
Now we can see that for this plot, the title we have given is k-nearest neighbors, varying number of neighbors.

07:23.140 --> 07:25.620
The plot has two lines.

07:25.930 --> 07:32.820
So the first line is for the testing accuracy and the second line is for the training accuracy.

07:34.820 --> 07:42.320
And we have created a legend and provided the x label and y label. When we show the plot, we are

07:42.320 --> 07:45.400
able to see these two lines generated.

07:47.530 --> 07:56.020
Now, as you can see, the training accuracy is decreasing as the number of neighbors increases.

07:57.630 --> 08:02.160
And the testing accuracy is actually increasing slowly.

08:03.500 --> 08:06.050
And if we compare the values,

08:07.000 --> 08:16.210
these values increase up to number seven. Number seven is the point where these accuracies are closest

08:16.420 --> 08:19.100
and the testing accuracy is highest.

08:19.390 --> 08:24.280
And after seven, the testing accuracy actually starts to go down.

08:24.730 --> 08:30.010
So this is the reason why we will select seven as the number of neighbors.

08:31.930 --> 08:39.730
So we can observe that we get maximum testing accuracy for k equal to seven. So we will create the k-nearest

08:39.730 --> 08:42.710
neighbors classifier with the number of neighbors as seven.

08:42.940 --> 08:44.340
So how will we do that?

08:44.350 --> 08:47.290
We will create another object of KNeighborsClassifier.

08:48.700 --> 08:54.770
And this time, we will provide the number of neighbors as seven in this object.

08:54.790 --> 08:59.540
We will provide X_train and y_train and then fit the model.

09:00.130 --> 09:04.170
Now, when we fit the model, the model will learn from this data.

09:06.030 --> 09:10.980
And after learning from this data, we get the KNN score.

09:12.180 --> 09:17.220
This KNN score comes out to be 0.730.

09:18.850 --> 09:26.980
Now, next is creating a confusion matrix. A confusion matrix is a table that is often used

09:26.980 --> 09:34.000
to describe the performance of a classification model on a set of test data for which the true values

09:34.000 --> 09:35.170
are already known?

09:35.920 --> 09:43.100
So scikit-learn provides the facility to calculate the confusion matrix using the confusion_matrix method.

09:43.120 --> 09:45.160
So that is what we will be using here.

09:45.940 --> 09:48.240
So for the confusion matrix,

09:48.250 --> 09:52.210
we are importing confusion_matrix from sklearn.metrics.

09:52.450 --> 10:01.390
Remember, sklearn.metrics holds all the different types of metrics which we can use for evaluating

10:01.390 --> 10:02.310
our models.

10:02.560 --> 10:07.990
So always keep exploring these metrics and comparing different metrics.

10:08.230 --> 10:15.910
Use a lot of metrics, compare the models using these metrics, and see which metric actually works well

10:15.910 --> 10:16.270
for you.

10:19.140 --> 10:21.180
Then let us get the predictions.

10:21.230 --> 10:29.220
So for the predictions, we are simply using knn.predict on the X_test data, and then we have the predictions

10:29.220 --> 10:30.310
for the test data.

10:30.700 --> 10:39.610
Now, we already have the y_test values, and now we have generated the y_pred values by using knn.predict.

10:39.960 --> 10:43.800
So we are predicting the values on the X_test values.

10:44.400 --> 10:51.870
We already have the X_test values, so we predict the values for these X_test values using the model which we

10:51.870 --> 10:55.560
have generated and we get the predicted values.

10:55.980 --> 11:05.400
Now we compare these y_pred values with the y_test values, which we already have, in the confusion

11:05.400 --> 11:06.060
matrix.

11:07.990 --> 11:12.280
So this confusion matrix gives this array,

11:13.560 --> 11:17.100
which is basically true negatives as 165,

11:18.460 --> 11:20.920
true positives as 60,

11:22.120 --> 11:28.120
false positives as 36, and false negatives as 47.

11:30.550 --> 11:40.090
We can also obtain the confusion matrix using the pandas crosstab method. So we can simply say pd.crosstab

11:40.870 --> 11:51.100
and provide y_test here, then y_pred. In the rows we will give the true values, and in the columns

11:51.250 --> 11:53.200
we will provide the predicted values.

11:53.980 --> 11:56.500
So this way we will get predicted

11:57.690 --> 12:08.760
0 and 1 against true 0 and 1. So when true is zero, it is false; when true is one, it is true.

12:10.290 --> 12:17.090
This is true, this is false, and so we can decide it accordingly.
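
NOTE
A toy sketch of the two ways of getting a confusion matrix mentioned above. The ten labels
are made up for illustration and are not the tutorial's test data; the row and column names
passed to crosstab are assumptions.
```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
# Made-up true and predicted labels.
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])
cm = confusion_matrix(y_test, y_pred)
# For binary 0/1 labels the layout is [[TN, FP], [FN, TP]]: rows are true, columns are predicted.
tn, fp, fn, tp = cm.ravel()
print(cm)
# The same counts as a labelled pandas table.
table = pd.crosstab(y_test, y_pred, rownames=["True"], colnames=["Predicted"])
print(table)
```
Both outputs put the true class on the rows and the predicted class on the columns.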

12:20.150 --> 12:28.640
So we have the classification report, which is another metric. It is a textual summary of the precision,

12:28.640 --> 12:36.890
recall, and F1 score for each and every class. We can import it from sklearn.metrics.

12:37.980 --> 12:45.900
Now we will generate the classification report on y_test and y_pred. Here you can see that the precision

12:46.350 --> 12:52.950
is 0.78 for non-diabetic and 0.62 for diabetic.

12:52.950 --> 12:57.360
The recall value is 0.82 and 0.56.

12:57.690 --> 13:04.380
The F1 score is 0.80 and 0.59. The support value is 201 and

13:04.380 --> 13:05.370
107.

13:07.460 --> 13:16.190
The averages at the micro, macro, and weighted levels are around 0.70 and 0.73, so these are different

13:16.190 --> 13:16.760
values.
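
NOTE
A toy sketch of the classification report, with the per-class numbers also computed one by
one so the definitions are visible. The labels are made up and are not the tutorial's data.
```python
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score
# Made-up true and predicted labels: TP = 3, FP = 1, FN = 1, TN = 5.
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(classification_report(y_test, y_pred))
# The same quantities for the positive class (label 1).
precision = precision_score(y_test, y_pred)    # TP / (TP + FP)
recall = recall_score(y_test, y_pred)          # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)                  # harmonic mean of precision and recall
print(precision, recall, f1)
```
Support in the report is simply the number of true samples in each class.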

13:17.770 --> 13:27.640
Now we can find the ROC curve also for this. So what is the ROC curve? Again, the ROC curve is the plot

13:27.640 --> 13:32.800
of the true positive rate with respect to the false positive rate.

13:33.010 --> 13:38.180
We have discussed this during the time when we discussed the logistic regression.

13:38.890 --> 13:45.820
So it is a plot of the true positive rate against the false positive rate for the different possible

13:46.090 --> 13:51.860
points for a diagnostic test. An ROC curve demonstrates several things.

13:52.180 --> 13:53.260
What are those things?

13:53.500 --> 14:01.390
Number one, the trade-off between sensitivity and specificity, that is, any increase in sensitivity will

14:01.390 --> 14:05.380
be accompanied by a decrease in specificity.

14:07.530 --> 14:17.790
And the closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate

14:17.790 --> 14:19.950
the test is.

14:21.320 --> 14:28.250
Now, the closer the curve comes to the 45-degree diagonal, the less accurate the test is.

14:29.270 --> 14:33.170
So we want to move it towards the top left corner.

14:33.830 --> 14:37.970
Now, the area under the curve is a measure of the accuracy.

14:41.310 --> 14:48.800
So now we will be making the predictions. We will be predicting the probabilities using the X_test

14:48.840 --> 14:52.680
data, and we get the probabilities as y_pred_proba.

14:53.820 --> 15:01.130
Now, let us create the ROC curve. So again, we will import roc_curve from sklearn.metrics.

15:02.210 --> 15:03.800
Now, here we will have the

15:05.150 --> 15:10.630
FPR, that is, the false positive rate, the true positive rate, and the threshold values.

15:11.810 --> 15:20.480
So we will generate the ROC curve for y_test with the y predicted probabilities,

15:21.290 --> 15:24.080
and we will create the plot for the same.

15:24.980 --> 15:27.820
So this is the plot which we have created.

15:28.010 --> 15:34.210
And here you can see that these values are good enough.

15:34.220 --> 15:42.410
So this is one value which is a nice value, where we have it close to the left side and also

15:42.420 --> 15:43.510
towards the top.

15:46.600 --> 15:52.460
Now, if you want the area under the curve, we can find it using the ROC AUC score.

15:53.470 --> 15:56.790
That is coming out to be zero point seventy four five.
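
NOTE
A sketch of the ROC steps described above on synthetic data, not the diabetes file. The
variable name y_pred_proba and the choice of column 1 for the positive class are assumptions;
the plotting call is left out so the block runs without a display.
```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
# Column 1 of predict_proba holds the predicted probability of class 1.
y_pred_proba = knn.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)    # area under the ROC curve
print(auc)
```
Plotting fpr on the x axis against tpr on the y axis draws the curve itself.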

15:58.740 --> 16:07.050
Now, this is one method, which is using train_test_split, but we have another method, which is

16:07.050 --> 16:09.730
a very good method that is cross-validation.

16:09.960 --> 16:13.260
So let us implement this using cross-validation as well.

16:14.010 --> 16:15.400
So how would we do that?

16:15.840 --> 16:20.730
So we will tune the hyperparameters, and we will use cross-validation.

16:21.000 --> 16:23.400
Now, again, what is cross-validation?

16:23.400 --> 16:27.690
The trained model's performance is dependent on how the data is split.

16:27.990 --> 16:35.430
So instead of using the hold-out method, we will use cross-validation. What is cross-validation?

16:35.640 --> 16:42.150
Cross-validation is a technique to evaluate predictive models by partitioning the original sample into

16:42.150 --> 16:47.400
a training set to train the model and a test set to evaluate it.

16:47.760 --> 16:55.320
So we will create k folds, and then every time, each one will be selected as the testing fold

16:55.320 --> 16:58.800
and the other folds will be selected as the training folds.

16:58.950 --> 17:01.230
And that is how we will get those values.

17:01.230 --> 17:05.730
And then we can average them out and find the final validation score.
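
NOTE
A minimal sketch of k-fold cross-validation as described above, on synthetic data. Using
cross_val_score with five folds and seven neighbors is an illustrative choice, not something
the captions show at this point.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
knn = KNeighborsClassifier(n_neighbors=7)
# Each of the 5 folds is used once for testing while the other 4 are used for training.
scores = cross_val_score(knn, X, y, cv=5)
print(scores)           # one accuracy per fold
print(scores.mean())    # averaged validation score
```
The mean of the fold scores is the single number used to compare settings.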

17:07.060 --> 17:15.670
So we will be doing hyperparameter tuning for this. Now, we have already selected the value of k, but now we want

17:15.670 --> 17:20.320
to find out the optimal value of k using hyperparameter tuning.

17:23.130 --> 17:30.030
So for this, we will try some different hyperparameter values and then fit them all separately

17:30.030 --> 17:34.440
to the model, and then we will choose the best one out of these.

17:34.560 --> 17:39.660
So we will use CV for this, which is already built into GridSearchCV.

17:40.080 --> 17:41.450
So how does it work?

17:42.450 --> 17:49.380
We will create the param_grid. In this param_grid, we are giving n_neighbors as the parameter,

17:49.620 --> 17:53.370
and the values are ranging from one to 50.

17:55.300 --> 18:04.150
Now we are creating an object of KNeighborsClassifier, and we are creating an object of GridSearchCV, which

18:04.150 --> 18:12.550
has the model, which is knn, the param_grid, and cv, that is, the cross-validation fold number.

18:12.730 --> 18:14.200
That is five folds.

18:16.050 --> 18:20.330
Now, we are fitting the data into it, so we fit it on X and y.

18:22.270 --> 18:27.190
After fitting on X and y, we have found out that the best score

18:28.410 --> 18:36.630
is 0.7578 and the best parameter is n_neighbors as 14.

18:37.410 --> 18:45.300
So this is what we have obtained, that is, we have found almost 76 percent accuracy using the

18:45.300 --> 18:47.130
number of neighbors as 14.

18:47.340 --> 18:50.470
And earlier, what we had found out was seven.

18:50.730 --> 18:59.280
So here we have improved the accuracy by almost three percent using GridSearchCV and by selecting

18:59.280 --> 19:03.210
a higher number of neighbors.
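
NOTE
A sketch of the grid search described above, run on synthetic data, so the best score and
best k printed here will differ from the values quoted in the captions. The names param_grid
and knn_cv are assumptions, and the upper bound of the range is read as exclusive.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
# Candidate values of n_neighbors: 1, 2, ..., 49.
param_grid = {"n_neighbors": np.arange(1, 50)}
knn = KNeighborsClassifier()
# Every candidate is scored with 5-fold cross-validation.
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X, y)
print(knn_cv.best_score_)     # mean cross-validated accuracy of the best candidate
print(knn_cv.best_params_)    # the n_neighbors value that achieved it
```
The fitted object can then predict with the best setting through knn_cv.predict.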

19:04.340 --> 19:08.710
So next, we will learn about the next algorithm in the next session.