WEBVTT

00:01.000 --> 00:08.710
Hi. In this particular session, we will go over the logistic regression code. So first of all, we

00:08.710 --> 00:12.430
will import the important libraries.

00:12.670 --> 00:16.330
So the first library which we will import is pandas profiling.

00:16.750 --> 00:21.240
And after that, we will import NumPy and pandas as required.

00:22.580 --> 00:28.390
And I have defined the report function as well, which we have used earlier also.

00:28.580 --> 00:37.490
So this report function will generate the report of the GridSearchCV or RandomizedSearchCV which we perform

00:37.730 --> 00:46.880
and retrieve the best models out of it. So we can simply pass the CV object to it and specify the

00:46.880 --> 00:49.110
number of models which we want to obtain.

00:49.370 --> 00:56.040
So based on the number which we provide, it will return that many top models.

00:56.090 --> 01:01.490
If you provide five, we will get the top five models. If you provide a hundred, then it will

01:01.970 --> 01:04.200
retrieve the top hundred models out of it.
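
NOTE
A minimal sketch of a report helper like the one described above. The lecture's own
function is not shown, so the name, signature and return value here are assumptions.
It reads a dictionary laid out like the cv_results_ attribute of a fitted
GridSearchCV or RandomizedSearchCV and returns the n_top best-ranked candidates.
```python
import numpy as np
def report(results, n_top=3):
    """Return (rank, mean score, params) for the n_top best candidates."""
    ranks = np.asarray(results["rank_test_score"])
    order = np.argsort(ranks, kind="stable")[:n_top]  # rank 1 is the best model
    top = []
    for i in order:
        top.append((int(ranks[i]), float(results["mean_test_score"][i]), results["params"][i]))
    return top
# Hypothetical results in the same layout as a fitted search's cv_results_.
fake_results = {
    "rank_test_score": [2, 1, 3],
    "mean_test_score": [0.81, 0.86, 0.79],
    "params": [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}],
}
for rank, score, params in report(fake_results, n_top=2):
    print(f"rank {rank}: mean score {score:.2f}, params {params}")
```
With a real search object, one would pass search.cv_results_ as the first argument.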

01:04.520 --> 01:08.000
So without losing time,

01:08.030 --> 01:11.410
let us start with the model details.

01:11.750 --> 01:14.150
So in this particular example,

01:15.030 --> 01:23.130
I have taken up a dataset which comes in two parts. Earlier, what we were doing is

01:23.130 --> 01:30.510
we had a single dataset, and in that one dataset

01:30.510 --> 01:33.810
we had the data for training and for testing also.

01:34.020 --> 01:37.610
So what we used to do was take that particular data

01:37.800 --> 01:44.970
and split it into two parts: one training dataset and another testing dataset.

01:45.420 --> 01:51.690
Now, from the training dataset, we train the model, and from the testing dataset, we then evaluate

01:51.690 --> 01:52.160
the model.

01:52.980 --> 02:01.870
Now, suppose you don't have that one dataset, but you have two datasets explicitly provided.

02:02.670 --> 02:11.070
Then things change, because when we had only one dataset, it was more convenient

02:11.070 --> 02:11.610
for us.

02:12.210 --> 02:14.850
We would simply do our data transformation.

02:14.850 --> 02:20.280
We could simply perform the entire data analysis on that single dataset.

02:20.460 --> 02:27.960
And once we were done with the data transformation, we would split the data. Now, because we were

02:27.960 --> 02:31.440
splitting the data from one particular data frame only,

02:31.650 --> 02:40.710
we did not get any issues regarding data consistency, because the columns created will

02:40.710 --> 02:43.080
be the same in both datasets.

02:43.320 --> 02:49.140
The transformations which we would have done, the dummy columns which we would have created, will always

02:49.140 --> 02:51.740
be the same in that particular data.

02:52.620 --> 02:59.520
But now, because we have two different datasets, there can be a lack of consistency.

03:00.150 --> 03:06.900
Now, in this case, whatever changes we make to the first data frame, the same changes we will have

03:06.900 --> 03:13.860
to make on the second data frame. That is, in the training dataset and in the testing dataset, the changes

03:13.860 --> 03:21.570
will have to be the same. If the changes are not the same, then let's say I have built my model on the

03:21.570 --> 03:28.560
training data on which I have performed all the transformations, and now I want to test my model on

03:28.560 --> 03:35.220
the testing data. If the training and the testing datasets

03:35.220 --> 03:36.900
are not in sync with each other,

03:37.170 --> 03:44.670
then my model will not be able to understand which columns we are talking about. If there are more

03:44.670 --> 03:51.020
columns or fewer columns in the testing data, then the model will ask you:

03:51.030 --> 03:53.140
why is there a column missing?

03:53.610 --> 04:00.840
I have been trained on a dataset which had a particular set of columns, and now you are giving me another

04:00.840 --> 04:04.520
dataset which has 17 or 20 columns.

04:04.650 --> 04:07.890
So how am I supposed to understand this particular dataset?

04:09.140 --> 04:13.980
And then it will throw an error. So, to solve this problem,

04:14.510 --> 04:16.200
we will see how we can do that.
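
NOTE
A small illustration of the mismatch described above, assuming two toy data frames
with a hypothetical "region" column. Creating dummy columns separately on each frame
gives different column sets, which is what a trained model cannot handle.
```python
import pandas as pd
# Hypothetical toy frames; the column name and values are for illustration only.
train = pd.DataFrame({"region": ["North", "South", "East"]})
test = pd.DataFrame({"region": ["North", "West"]})
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(sorted(train_dummies.columns))  # dummy columns seen during training
print(sorted(test_dummies.columns))   # a different set for the test data
missing_in_test = set(train_dummies.columns) - set(test_dummies.columns)
extra_in_test = set(test_dummies.columns) - set(train_dummies.columns)
print(missing_in_test, extra_in_test)
```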

04:16.520 --> 04:22.820
So for this particular situation, what we will do is combine both the training and testing

04:22.820 --> 04:27.380
datasets, do our transformations, and do our analysis.

04:27.560 --> 04:32.030
We will perform our data transformation on the combined dataset.

04:32.300 --> 04:38.480
And then once we are done with the entire data transformation and we are satisfied with it,

04:38.750 --> 04:46.100
we will split the dataset back into the training dataset and the testing dataset.

04:47.190 --> 04:54.960
Now, once we have made the transformations and split the data frames back, the data frames which

04:54.960 --> 05:05.160
we have will be in sync with each other and will have the same columns. And now,

05:05.280 --> 05:08.400
when the model is trained on the training dataset

05:09.290 --> 05:13.100
and is evaluated on the test dataset,

05:13.460 --> 05:15.890
it will understand what the columns mean,

05:17.020 --> 05:25.180
because now there is no inconsistency between the columns of the datasets. So this will give you a picture

05:25.390 --> 05:33.020
of how the transformation needs to be done in case of any production data or when you are working in

05:33.070 --> 05:34.060
a real-life situation.

05:35.200 --> 05:43.200
Because in a real-life situation, you will have to make the transformations and the conversions

05:43.810 --> 05:45.730
such that both datasets are in sync.

05:46.810 --> 05:54.370
In example questions or in different tutorials, if you use a single dataset and then split it and

05:54.370 --> 05:57.600
then evaluate the model, it will be very simple in nature.

05:57.820 --> 06:01.500
But this is exactly what you will face in real life.

06:02.110 --> 06:05.290
So you need to understand this very clearly.

06:06.340 --> 06:12.400
So without wasting any time, let me go ahead and explain this to you. So we have taken the datasets:

06:12.400 --> 06:19.750
the first dataset is our train dataset and the second dataset is the test dataset.

06:19.780 --> 06:22.080
This RG stands for revenue grid.

06:22.870 --> 06:26.080
Now, we will have a look at the data in some time.

06:26.260 --> 06:30.520
So please be patient and let us read the files.

06:30.550 --> 06:38.200
So we will be reading the first file by using pd.read_csv, and pd.read_csv gives me

06:38.740 --> 06:41.310
the data frame bd_train.

06:41.950 --> 06:47.770
Similarly, I'll use read_csv for the test file and create a data frame,

06:47.770 --> 06:49.300
bd_test.

06:50.530 --> 06:54.820
Now, I will create a new column.

06:54.850 --> 06:58.450
Now, why am I adding this column? Let us see the data first.

06:58.480 --> 07:02.290
OK, so let us look at the head of bd_train.

07:02.920 --> 07:05.690
Now, what do we have in the bd_train dataset?

07:05.950 --> 07:13.210
We have these columns: reference number, children, age band, status, occupation, occupation partner,

07:13.210 --> 07:19.150
home status, family income, self-employed, self-employed partner, and a lot more columns.

07:19.490 --> 07:26.710
And finally, we have the region and the revenue grid value.

07:27.800 --> 07:32.660
OK, this is the revenue grid value, and we don't have the data column yet.

07:33.710 --> 07:37.820
This has not been added till now. Next, in the testing data,

07:38.450 --> 07:43.760
we have all the columns, but the revenue grid will not be present.

07:44.180 --> 07:50.210
So in the training dataset, we will have all the columns, but in the

07:50.210 --> 07:57.380
test dataset, we will have the target column missing, because it is expecting us to find out the target

07:57.410 --> 07:57.880
values.

07:58.040 --> 07:59.840
So the target will be missing here.

08:00.980 --> 08:09.290
So what we do is create a new revenue grid column in the testing dataset and fill it with null values;

08:09.290 --> 08:11.390
we fill it with not-a-number values.

08:11.810 --> 08:20.330
And in bd_train we add a new column, which is data, in which we fill "train", and in bd_test again

08:20.330 --> 08:24.870
we create another column named data, and we fill "test".

08:26.230 --> 08:28.150
After that, we will

08:29.480 --> 08:37.500
get the column names from bd_train and use those columns to select from

08:37.550 --> 08:43.340
bd_test. What it does is get all the column names from bd_train and pick those

08:43.340 --> 08:49.820
columns from bd_test, in the same order, so that we don't have any extra columns in the

08:50.510 --> 08:51.980
bd_test dataset.

08:52.520 --> 09:01.040
Now, after that, we are concatenating the bd_train and bd_test datasets with axis equal to zero, which means

09:01.280 --> 09:08.480
we will have all the train data on the top, followed by the test data.

09:08.750 --> 09:10.490
Now let us see what it looks like.

09:10.490 --> 09:12.740
And we have put this into bd_all.
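
NOTE
The combining steps above can be sketched as follows. The two small frames are
hypothetical stand-ins for the files read with pd.read_csv, and the column name
revenue_grid is an assumption; the real column names may differ.
```python
import numpy as np
import pandas as pd
# Hypothetical miniature stand-ins for the train and test files.
bd_train = pd.DataFrame({"children": ["Zero", "1"], "region": ["North", "South"], "revenue_grid": [2, 1]})
bd_test = pd.DataFrame({"region": ["West"], "children": ["2"]})  # no target, different column order
bd_test["revenue_grid"] = np.nan     # placeholder target for the test rows
bd_train["data"] = "train"           # identifier used later to split the data back
bd_test["data"] = "test"
bd_test = bd_test[bd_train.columns]  # same columns, in the same order as bd_train
bd_all = pd.concat([bd_train, bd_test], axis=0)  # train rows on top, test rows below
print(bd_all)
```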

09:13.040 --> 09:14.930
So what do we have in bd_train

09:14.930 --> 09:15.860
and bd_test?

09:16.980 --> 09:25.890
So in bd_train, we have the entire data which was present in bd_train earlier, and additionally,

09:25.890 --> 09:31.610
we have created a data column, which has the label "train" written in it.

09:32.340 --> 09:38.080
Apart from that, in bd_test, we have all the same columns which were already present.

09:38.490 --> 09:44.850
We have created one revenue grid column, which has not-a-number values in it, and a data column,

09:44.850 --> 09:47.200
which has the value "test" in it.

09:47.490 --> 09:48.990
Why have we done this?

09:49.230 --> 09:53.760
Because we have combined the data.

09:53.760 --> 10:01.860
We have combined the data for bd_train and bd_test, and we need to have an identifier for

10:01.860 --> 10:05.930
which data belongs to bd_train and which data belongs to

10:05.940 --> 10:06.620
bd_test.

10:06.840 --> 10:11.100
So whenever we check the data column, we will check for the value.

10:11.470 --> 10:17.310
So if the value "train" is present in the data column, that means that the row belongs to bd_train.

10:17.700 --> 10:25.500
And if the value "test" is present in the data column, it will mean that the row belongs to the bd_test

10:26.160 --> 10:26.850
data frame.

10:27.900 --> 10:34.980
And the value for revenue grid will be null in bd_test, while revenue grid

10:35.320 --> 10:38.520
will have actual values in bd_train.

10:39.570 --> 10:43.110
Now, let us have a look at the combined dataset.

10:45.180 --> 10:50.040
So we call bd_all.head().

10:51.910 --> 11:00.130
So this is the head of the data. It contains all the columns and it has the same data. If you

11:00.130 --> 11:01.440
have a look at the tail,

11:02.760 --> 11:05.220
you can see that it has the

11:07.180 --> 11:08.050
test data.

11:09.380 --> 11:18.890
Basically, the rows for bd_train are present at the top, and below

11:18.890 --> 11:22.880
that the rows for bd_test have been pasted.

11:24.580 --> 11:31.240
Now, you can see here that we are checking the revenue grid values.

11:32.020 --> 11:36.310
So for all the values of revenue grid, we are checking the value counts.

11:36.670 --> 11:44.470
So from the value counts, we get to know that there are 7,261 rows which have

11:44.470 --> 11:49.780
the value 2 and 863 rows which have the value 1.

11:51.100 --> 11:58.480
Now we will find out the ratio. You can see that class 2 actually has

12:00.030 --> 12:04.480
8.41 times as many values as class 1.

12:05.160 --> 12:11.360
So this means that the target is imbalanced; it is an imbalanced class.

12:11.550 --> 12:18.900
So when it is an imbalanced class, we will have to provide certain weights to these classes so

12:18.900 --> 12:27.090
that these classes become balanced and we are able to get a better result out of it, because if we

12:27.090 --> 12:33.960
do not balance out the classes, then what will happen is that the model will train and it will try

12:33.960 --> 12:43.750
to predict everything as 2, because almost 8.4 times as many values are present as 2.

12:44.610 --> 12:51.210
So what will happen is you will get about 89 percent accuracy, even if all the values which are

12:51.210 --> 12:58.200
predicted are predicted as 2. Say the model starts predicting everything as 2; then also you will get

12:58.200 --> 13:00.150
about 89 percent accuracy from this model.

13:01.070 --> 13:02.970
So we don't want that to happen.

13:03.110 --> 13:06.980
That is the reason why we will balance out these classes.
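
NOTE
A rough sketch of the imbalance check, using illustrative label counts chosen to give
a ratio of about 8.41. It also shows the accuracy of always predicting the majority
class and the "balanced" class weights in the scikit-learn sense, which are
n_samples / (n_classes * class_count). The variable names are assumptions.
```python
import pandas as pd
# Illustrative labels: 7261 rows of class 2 and 863 rows of class 1.
y = pd.Series([2] * 7261 + [1] * 863)
counts = y.value_counts()
ratio = counts[2] / counts[1]
baseline_accuracy = counts.max() / counts.sum()  # accuracy of always predicting class 2
weights = len(y) / (counts.size * counts)        # the rarer class gets the larger weight
print(round(ratio, 2), round(baseline_accuracy, 3))
print(weights.round(2).to_dict())
```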

13:08.760 --> 13:10.180
So let us go further.

13:10.230 --> 13:16.530
Now let's check the data. Now all the learnings that you have had will come into the

13:16.530 --> 13:17.010
picture.

13:17.160 --> 13:20.100
So you will be doing several things.

13:20.300 --> 13:27.330
Now, what I will be doing would be some shortcut methods, but you will be implementing this

13:27.630 --> 13:29.510
code on the data on your own.

13:29.760 --> 13:37.930
You will be implementing it with all the data analysis, with all the numbers. You can apply pandas

13:37.960 --> 13:38.520
profiling

13:38.520 --> 13:42.360
on top of it. You will find out the different correlation values.

13:42.540 --> 13:49.770
You will find out the columns which need to be removed based on the correlation matrix or the VIF values.

13:49.950 --> 13:54.300
So you will be applying all those things which we have learned until now.

13:55.330 --> 14:02.200
And while you are applying logistic regression, we will get to know how we can decide upon

14:02.200 --> 14:07.810
the feature importance also, like we have learned about the coefficients

14:08.940 --> 14:15.030
in the case of linear regression. Similarly, the coefficients can be obtained in the case of logistic

14:15.030 --> 14:15.540
regression.

14:15.780 --> 14:21.750
So based on the coefficients, you can actually decide which columns you need to keep and which columns

14:21.750 --> 14:22.570
you need to remove.

14:22.770 --> 14:25.730
So we will be applying all the learnings here.

14:25.890 --> 14:33.030
So whatever I am applying is a different thing, but you need to apply all the learnings so that you

14:33.030 --> 14:34.210
get a better picture.

14:34.590 --> 14:40.140
And during these sessions, I will be using several different datasets.

14:41.200 --> 14:50.290
And I could have used the same dataset for all the algorithms, but I decided to pick different

14:50.290 --> 14:57.820
datasets and apply different methods of implementation so that you get to know what all

14:57.820 --> 14:59.230
you can do with these.

15:00.410 --> 15:08.300
So try to implement all the code on your own with all the learnings which I have given you, with all

15:08.300 --> 15:15.290
the things which I have taught you, so that you will actually learn a lot from this. And please know

15:15.300 --> 15:22.520
that you will learn a lot if you practice this. Unless you type in the code and

15:22.520 --> 15:27.860
do everything on your own, it will be a little difficult to get hold of all of this.

15:28.010 --> 15:35.090
And it will be really fun once you start implementing these, because there are a lot of things to run

15:35.090 --> 15:41.050
and a lot of things to learn from this. And as you get towards the end of the code,

15:42.070 --> 15:48.700
you will actually see the accuracies, you will see how well the models are performing, and then

15:48.700 --> 15:55.390
you will want to improve the models. That is the fun of machine learning and data science:

15:55.390 --> 15:59.620
you will actually feel the fun of improving the models.

16:00.070 --> 16:02.270
It does not even take that much effort.

16:02.650 --> 16:05.190
All the models are already implemented.

16:06.270 --> 16:15.480
All you need to do is use them: just type in the code, select the model, select

16:15.480 --> 16:19.980
the type of scoring method you want to use, and run it.

16:20.850 --> 16:21.260
Right.

16:21.360 --> 16:23.570
So it is really fun.

16:23.580 --> 16:27.840
Just try it out, just go through it and you will really enjoy it.

16:28.020 --> 16:35.810
So without losing any more time, let us start with the code.

16:35.940 --> 16:43.770
So here I am zipping all the column names with the data types and the number of unique values.

16:46.690 --> 16:50.120
Now, what does this number of unique values give me?

16:50.320 --> 16:55.090
It will give me the number of distinct values which are present in each column.
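
NOTE
A minimal sketch of zipping column names, data types and distinct-value counts, using
a hypothetical toy frame in place of the combined data.
```python
import pandas as pd
# Hypothetical toy frame standing in for the combined data.
df = pd.DataFrame({"ref_no": [101, 102, 103], "children": ["Zero", "1", "1"], "income": [1.5, 2.5, 2.5]})
summary = list(zip(df.columns, df.dtypes, df.nunique()))
for name, dtype, n_unique in summary:
    print(name, dtype, n_unique)
```
A column whose distinct count equals the number of rows, like ref_no here, behaves like an identifier.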

16:57.600 --> 17:03.570
So in the case of the reference number, just as a guideline:

17:04.490 --> 17:13.210
the reference number, application number, roll number of a person, SSN, all of these numbers, which

17:13.210 --> 17:19.510
are basically identifiers, basically something which proves the identity of a person,

17:20.670 --> 17:26.070
will not really impact how the model is working.

17:27.160 --> 17:35.290
If I change the reference number, or let's say I change your loan application number, would it really impact

17:35.290 --> 17:38.840
whether the loan is approved or not?

17:39.520 --> 17:46.480
Not really, right? The loan approval will depend on your salary, your number of dependents, how much

17:46.960 --> 17:48.130
income you have.

17:48.370 --> 17:48.720
Right.

17:48.880 --> 17:53.740
The application number alone would not impact it. If the application number was some other number,

17:53.750 --> 17:57.470
still, your application would have been approved or rejected just the same.

17:57.730 --> 18:05.950
So we can completely get rid of this particular column because it has no significance in machine learning.

18:07.610 --> 18:10.340
Next, we have the children column.

18:11.510 --> 18:15.330
Now, the children column has the type object.

18:16.620 --> 18:22.620
Now, this children column has the object data type, which rings a bell for me, because children

18:23.650 --> 18:26.060
has to be a numeric value.

18:26.290 --> 18:27.750
Now let us get back to it.

18:29.450 --> 18:36.770
So here, once I get to it, I see that all of the values are numeric, but only this "Zero" value is

18:36.770 --> 18:39.700
causing this column to be of the object data type.

18:40.280 --> 18:47.480
So what I will do is replace this string "Zero" with the numeric 0 and convert this

18:47.480 --> 18:50.360
entire column to numeric. Right.

18:50.780 --> 18:52.430
Next is age band.

18:52.880 --> 18:57.300
Now let us see: each age band spans about five years.

18:57.920 --> 19:00.740
All the bands have a difference of about five.

19:00.980 --> 19:08.450
So what I can do is simply take a mean age and create a new column which will have the mean

19:08.450 --> 19:10.730
value of 41 and 45,

19:10.730 --> 19:11.850
of 45 and 50, and so on.

19:12.050 --> 19:15.140
So here I will have 43.

19:15.140 --> 19:18.500
Here I will have 47.5, and so on.

19:18.800 --> 19:25.160
So instead of having this age band, which is not mathematically appropriate, I can simply have the

19:25.160 --> 19:26.340
mean value replace it.
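
NOTE
A small sketch of replacing an age band by its mean, assuming the bands are stored as
strings of the form "low-high". The labels below are hypothetical; the real column
may contain other formats that need extra cleaning.
```python
import pandas as pd
# Hypothetical band labels.
age_band = pd.Series(["41-45", "46-50", "41-45"])
bounds = age_band.str.split("-", expand=True).astype(float)  # lower and upper bound
mean_age = bounds.mean(axis=1)
print(mean_age.tolist())
```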

19:27.710 --> 19:32.570
Next is status. What kind of values does it have?

19:32.810 --> 19:37.950
It has the object type, and there are five different values present.

19:38.150 --> 19:45.380
So again, we will analyze what proportion of values are present and how we should convert that

19:45.380 --> 19:48.570
into one-hot encoding or dummy variables.

19:48.770 --> 19:52.410
So that is something which we will have to do. Now, again, occupation.

19:52.430 --> 19:55.070
So again, there are nine values in occupation.

19:55.280 --> 20:01.820
So we have to decide if we need to keep all nine values or whether we need only five or

20:01.820 --> 20:08.390
four occupation types. Then occupation partner goes the same way: whether you want to keep all nine or a few

20:08.390 --> 20:08.930
of them.

20:09.200 --> 20:11.030
The same thing applies to home status.

20:11.300 --> 20:17.200
Then family income is showing as object but should be numeric.

20:17.810 --> 20:19.930
We don't know why it is showing like that.

20:19.940 --> 20:23.250
So you can see that we have several ranges in family income.

20:24.940 --> 20:32.110
So, again, what we can do is keep a mean family income here. Instead of having the entire

20:32.110 --> 20:34.100
range, let us keep a mean family income.

20:34.270 --> 20:35.920
Now, what is the range for this?

20:36.160 --> 20:42.730
The family income here is ranging from 27,500 to 30,000.

20:42.730 --> 20:48.950
So the gap is 2,500 here.

20:49.530 --> 20:51.890
Here again, the gap is 2,500.

20:52.450 --> 20:57.100
Here we have 4,000 to 8,000.

20:57.110 --> 21:00.290
So here the gap is around 4,000.

21:00.460 --> 21:03.700
So now the gaps are inconsistent.

21:04.770 --> 21:13.420
So because the gaps are inconsistent, we cannot simply take a mean value. So what we can do is convert

21:13.450 --> 21:19.410
this family income into a family income min and a family income max.

21:20.380 --> 21:26.710
This is one thing which we can do. Another thing which we can do is keep a family income level,

21:27.250 --> 21:28.870
because these are different levels.

21:29.150 --> 21:31.210
These are different levels of family income.

21:31.390 --> 21:37.990
So maybe we can do something like keep 4,000 to 8,000 as income

21:37.990 --> 21:45.340
level one, then 8,000 to, let us say, 11,000 as income level two, 11,000 to

21:45.910 --> 21:53.080
14,000 as income level three, and so on, so that we can decide what family income level should

21:53.080 --> 21:53.500
be given.

21:53.680 --> 21:56.260
We can convert this into an ordinal variable.

21:57.470 --> 22:01.380
Right, so that is something which we can decide: how we need to take care of it.

22:01.630 --> 22:08.020
Or maybe you can create two or three different columns out of it, one being family income low, then family

22:08.020 --> 22:11.500
income max, then another being family income mean.

22:11.860 --> 22:20.920
Then we can add a family income level, and then, by finding out the coefficients using Lasso,

22:21.040 --> 22:26.110
we can actually determine which type of variable actually helps us.

22:26.770 --> 22:28.450
So that is another thing which we can do.
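
NOTE
One way the income ranges might be turned into low, max, mean and ordinal level
columns. This is only a sketch: the "low-high" labels and the new column names are
hypothetical, and the real range labels may need extra cleaning first.
```python
import pandas as pd
# Hypothetical range labels written as "low-high".
family_income = pd.Series(["4000-8000", "8000-11000", "27500-30000"])
bounds = family_income.str.split("-", expand=True).astype(float)
income = pd.DataFrame({"income_low": bounds[0], "income_max": bounds[1]})
income["income_mean"] = (income["income_low"] + income["income_max"]) / 2
# Ordinal level: rank the bands by their lower bound (1 is the lowest band).
income["income_level"] = income["income_low"].rank(method="dense").astype(int)
print(income)
```
All four candidate columns can then be offered to a penalised model to see which ones keep non-zero coefficients.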

22:29.700 --> 22:37.680
Right, then we have self-employed. It is a binary value, so it will be converted into zero and one.

22:37.950 --> 22:40.890
Self-employed partner will also be converted into zero and one.

22:41.070 --> 22:45.360
So like this, we can determine how we need to change the values.

22:45.610 --> 22:49.350
The region value, again, will change into columns for the different regions.

22:49.890 --> 22:58.260
Gender will convert into binary. The data value will stay intact because we need this value to split the

22:58.260 --> 22:58.980
data back.

22:59.280 --> 23:01.740
So we need not touch this particular column.

23:05.620 --> 23:12.720
So here we have the data. Here you can see there are different float columns, which are fine.

23:13.030 --> 23:14.920
We only need to take care of the

23:15.890 --> 23:20.990
object data type columns, so we can determine how the transformation needs to be done.

23:21.320 --> 23:30.630
So this is bd_all, and for region we can see these are the different value counts for the regions.

23:30.980 --> 23:36.440
So from these region values, you can determine how many region types you want to consider.

23:36.920 --> 23:44.690
Now, what we can drop is reference number and post code, and we still have to see what this post code and post

23:44.690 --> 23:45.080
area are.

23:46.420 --> 23:50.470
Post code and post area have some values.

23:51.430 --> 23:52.780
Post code.

23:54.310 --> 24:02.460
Here we have post code and post area, which are object columns. They have a lot of cardinality;

24:02.470 --> 24:06.570
they have a very high cardinality for the object data type.

24:06.850 --> 24:10.710
So this means that they have a lot of irrelevant data.

24:10.720 --> 24:16.960
We cannot convert ten thousand values into ten thousand dummy values.

24:16.960 --> 24:17.170
Right.

24:17.200 --> 24:22.990
We cannot create ten thousand dummy columns, so we can simply get rid of this particular column.

24:23.290 --> 24:25.420
And the same thing applies to the post

24:25.420 --> 24:25.910
area.

24:29.700 --> 24:31.360
So the next thing is children.

24:31.620 --> 24:39.780
So for children, we will convert "Zero" to 0 and "4+" to 4, and then convert this entire column

24:39.780 --> 24:40.850
to numeric.

24:41.780 --> 24:47.780
For age band, we will convert this into dummies. For status, occupation, occupation partner,

24:47.780 --> 24:51.130
home status, and family income, we will convert these into dummies.

24:52.830 --> 25:01.230
Self-employed, TV area, region, gender, all will be converted into dummies, and a revenue grid of 1 will

25:01.230 --> 25:03.100
be converted to 1, otherwise 0.

25:05.240 --> 25:12.110
So the same thing we are doing here. We are using the drop function to drop these particular columns,

25:12.110 --> 25:18.320
which we provide in a list. axis equal to 1 means we want to work with the columns.

25:18.330 --> 25:25.640
So we are removing these columns, and inplace means that we want to remove these columns from bd_all

25:25.640 --> 25:26.400
completely.
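
NOTE
A minimal sketch of the drop step, assuming hypothetical column names for the
identifier and the two high-cardinality text columns.
```python
import pandas as pd
# Hypothetical column names; the real frame has many more columns.
bd_all = pd.DataFrame({"ref_no": [1, 2, 3], "post_code": ["A1", "B2", "C3"], "post_area": ["A", "B", "C"], "children": [0, 1, 2]})
print(bd_all.nunique())  # as many distinct values as rows: poor candidates for dummies
bd_all.drop(["ref_no", "post_code", "post_area"], axis=1, inplace=True)  # axis=1 refers to columns
print(list(bd_all.columns))
```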

25:27.900 --> 25:34.470
Then, for bd_all children, we are using np.where. Wherever the children value is "Zero", we

25:34.470 --> 25:35.430
are putting 0.

25:36.030 --> 25:42.330
Otherwise we are keeping the original value of bd_all children.

25:44.850 --> 25:52.020
Then, for the last value, we are changing "4+" to 4. Otherwise, we are keeping the

25:52.020 --> 25:56.890
values of children. Again, for bd_all children,

25:57.330 --> 26:03.540
we are converting it to numeric with errors equal to coerce, so that if there is any value that cannot be converted,

26:03.540 --> 26:06.540
it will be converted into a null value.
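
NOTE
The children clean-up can be sketched like this. The values are hypothetical and
include one unexpected entry to show what the coercion does.
```python
import numpy as np
import pandas as pd
# Hypothetical values, including one entry that cannot be converted.
bd_all = pd.DataFrame({"children": ["Zero", "1", "4+", "2", "unknown"]})
bd_all["children"] = np.where(bd_all["children"] == "Zero", 0, bd_all["children"])
bd_all["children"] = np.where(bd_all["children"] == "4+", 4, bd_all["children"])
bd_all["children"] = pd.to_numeric(bd_all["children"], errors="coerce")  # bad values become NaN
print(bd_all["children"].tolist())
```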

26:07.230 --> 26:12.480
Next, we have the revenue grid value. We check whether the revenue grid value is equal to 1.

26:13.260 --> 26:18.040
We are converting the result into integer form.

26:18.570 --> 26:21.830
So for value 1, it will give True and convert to 1. For value

26:21.910 --> 26:25.970
2, it will give False and convert it into 0.
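
NOTE
A short sketch of the target conversion, assuming a hypothetical series with the
values 1 and 2 for the train rows and NaN for the test rows.
```python
import numpy as np
import pandas as pd
# Hypothetical target values.
revenue_grid = pd.Series([1, 2, 2, np.nan])
target = (revenue_grid == 1).astype(int)  # True becomes 1, everything else becomes 0
print(target.tolist())
```
Note that the NaN placeholders of the test rows also become 0 here, which is harmless only because that column is dropped from the test frame later.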

26:28.460 --> 26:31.620
Next, we will be handling the categorical variables.

26:31.640 --> 26:36.140
We will get the categorical variables using select_dtypes,

26:37.380 --> 26:45.180
giving the data type to be object. So this will give all the object data type columns as the categorical variables.

26:46.260 --> 26:46.710
Then,

26:47.680 --> 26:53.110
for the categorical variables, we slice up to minus one. Why are we giving it minus one?

26:53.110 --> 26:55.210
Because we don't want to transform the data column.

26:56.050 --> 27:03.640
So this minus one will just consider all the categorical columns up to region and ignore the data column,

27:04.090 --> 27:06.250
and then it will create the dummies

27:07.680 --> 27:13.870
for all of these columns. So we are converting age band, status, occupation, occupation partner,

27:13.870 --> 27:20.920
home status, family income, self-employed, self-employed partner, TV area, gender and region

27:20.920 --> 27:22.240
into dummy columns.

27:22.420 --> 27:26.140
And after conversion, we have ninety-six columns.
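
NOTE
A sketch of selecting the categorical columns, skipping the last one, and creating
dummies. The frame is hypothetical, and pd.get_dummies is used here as an assumption;
the lecture's exact dummy-creation code is not shown in the transcript.
```python
import pandas as pd
# Hypothetical combined frame; "data" is the last text column and must stay untouched.
bd_all = pd.DataFrame({"children": [0, 1, 2], "gender": ["F", "M", "F"], "region": ["North", "South", "North"], "data": ["train", "train", "test"]})
# "string" is included as an assumption so the selection also works on newer pandas versions.
cat_vars = bd_all.select_dtypes(include=["object", "string"]).columns
cat_vars = cat_vars[:-1]  # drop the last entry, the "data" identifier
bd_all = pd.get_dummies(bd_all, columns=list(cat_vars), dtype=int)
print(list(bd_all.columns))
```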

27:27.150 --> 27:34.070
And now we are checking if there is any null value. In the children column, using

27:34.080 --> 27:36.780
isnull and sum, we see that in the children column

27:37.770 --> 27:43.150
there are some null values. Otherwise, there are no other null values.

27:43.380 --> 27:50.480
So what we will do is handle wherever the value is null. So with bd_all.loc, where bd_all children is null,

27:50.490 --> 27:53.280
it is checking wherever the children value is null

27:55.040 --> 28:00.590
and selecting the children column from this data, wherever the children value

28:00.590 --> 28:00.740
is null.

28:01.400 --> 28:03.620
And in that column, it is putting the

28:05.040 --> 28:13.820
mean of bd_all children for the train rows, that is, wherever the value in the data column is "train".

28:14.160 --> 28:19.110
So from that training data, it is picking the children column and taking the mean.

28:19.650 --> 28:25.380
So it is not considering the testing data for the mean of the values.

28:25.620 --> 28:30.900
So it does not take the mean from the entire dataset, but only from the training dataset.

28:34.720 --> 28:39.610
Because the testing dataset might have different outliers and different values, we don't want

28:39.610 --> 28:42.070
to get into that. We will simply take the

28:43.130 --> 28:44.780
mean of the training data only.
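
NOTE
A minimal sketch of filling missing children values with the mean of the train rows
only, on a hypothetical combined frame.
```python
import numpy as np
import pandas as pd
# Hypothetical combined frame with one missing value in the test rows.
bd_all = pd.DataFrame({"children": [0.0, 2.0, 4.0, np.nan, 10.0], "data": ["train", "train", "train", "test", "test"]})
train_mean = bd_all.loc[bd_all["data"] == "train", "children"].mean()  # train rows only
bd_all.loc[bd_all["children"].isnull(), "children"] = train_mean
print(bd_all["children"].tolist())
```
The test value of 10.0 does not influence the fill value, which is the point of using only the train rows.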

28:46.080 --> 28:51.270
Now, once we have done all the transformations,

28:51.270 --> 29:00.300
what we will be doing is this: into bd_train, we will take from bd_all all the columns where the

29:00.390 --> 29:03.110
bd_all data column has the value "train".

29:03.450 --> 29:11.440
This will select all the rows which have the value "train" in the data column and put them into the bd_train

29:11.490 --> 29:18.000
dataset, and then it will delete the data column of bd_train.

29:18.210 --> 29:24.080
So it has created the bd_train data frame and removed the data column from it.

29:25.050 --> 29:32.160
Now, from the testing data frame, it will remove the revenue grid column and the data column as well, once it

29:32.160 --> 29:33.270
has copied the

29:34.520 --> 29:39.770
data from bd_all wherever the data column value is "test".

29:40.870 --> 29:48.220
So we have simply selected the rows by the value in the data column and put them into the respective data frames.
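
NOTE
The split back into train and test frames can be sketched as follows, with a
hypothetical transformed frame and placeholder column names.
```python
import pandas as pd
# Hypothetical transformed frame.
bd_all = pd.DataFrame({"children": [0, 1, 2], "revenue_grid": [1, 0, 0], "data": ["train", "train", "test"]})
bd_train = bd_all[bd_all["data"] == "train"].copy()  # copy so bd_all itself is not modified
del bd_train["data"]
bd_test = bd_all[bd_all["data"] == "test"].copy()
bd_test.drop(["revenue_grid", "data"], axis=1, inplace=True)
print(bd_train.shape, bd_test.shape)
```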

29:49.090 --> 29:57.220
Now, one more thing is, in case you are not able to understand any line of food, you can simply try

29:57.220 --> 29:58.570
to run it separately.

29:58.570 --> 30:03.010
Just try to run it piece by piece, try to run it piece by piece.

30:03.010 --> 30:06.860
That would be the ordering data is equal to plain gives.

30:07.330 --> 30:10.980
Now, what will be all three in you?

30:11.740 --> 30:17.440
So then you will get to know that in the brain I'm creating this data frame, that I'm putting all these

30:17.440 --> 30:17.880
values.

30:18.310 --> 30:26.620
So in case you get stuck and you will destroy this method and do it step by step and then any piece

30:26.620 --> 30:33.730
of code food, when you get stuck, just run the thing or try to create the logic step by step from

30:33.730 --> 30:38.680
the very small piece of junk and then put the entire picture.

30:39.940 --> 30:47.140
So that is the best practice which we have in any programming language, because initially you

30:47.140 --> 30:48.850
would not know what you want to create.

30:48.850 --> 30:54.380
Initially you would only know of the entire thing or the entirety of it.

30:54.610 --> 30:58.960
But the logic building grows from the very minimal things.

30:59.170 --> 31:06.550
So first of all, you will get A, then you will get B, and once you have both A and B, then only you

31:06.550 --> 31:07.150
will add them.

31:07.150 --> 31:10.480
So that is the practice which you will follow.

31:11.640 --> 31:16.870
Now we have all this data; the train and test data frames have been generated.

31:17.070 --> 31:24.150
So the next thing which we will be doing is we will import the logistic regression model from linear_model.

31:25.490 --> 31:33.410
Now we will take this code from the sklearn documentation, and these are the parameters which we will

31:33.410 --> 31:36.430
be having. The first parameter is class weight.

31:37.100 --> 31:42.070
Next is penalty, where l1 is what is used for lasso and l2 for ridge.

31:42.290 --> 31:49.720
So we are selecting from both of the regularization penalties, and the best one will come out on its own.

31:49.940 --> 31:56.820
So never think that lasso will always give a better result than ridge, or that ridge will perform better than

31:57.620 --> 32:01.610
lasso. Try both the scenarios and then see what actually performs well.

32:04.090 --> 32:13.800
Then the next one is the C value. From linspace, we will get values from 0.01

32:13.810 --> 32:14.840
to 1000.

32:15.130 --> 32:24.040
We are getting 10 such values, and we are creating the logistic regression model, the object

32:24.040 --> 32:28.840
of the logistic regression model, and we are giving fit_intercept equal to True.

32:28.870 --> 32:34.460
fit_intercept equal to True means that we want to get the beta naught value.

32:34.490 --> 32:38.230
Also, we will be getting the coefficient values,

32:38.230 --> 32:39.460
beta 1, beta 2

32:39.460 --> 32:45.000
and beta 3. fit_intercept equal to True means that we want to get the beta naught value as well.
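
NOTE
A minimal sketch of the parameter grid and model object described above. The dictionary name params is an assumption, and solver="liblinear" is also an assumption added here because it is one scikit-learn solver that accepts both the l1 and the l2 penalty; the recording does not say which solver was used.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
# Candidate settings to search over, following the values mentioned in the session.
params = {
    "class_weight": ["balanced", None],
    "penalty": ["l1", "l2"],           # l1 is the lasso-style penalty, l2 the ridge-style one
    "C": np.linspace(0.01, 1000, 10),  # 10 evenly spaced values from 0.01 to 1000
}
# fit_intercept=True asks the model to estimate the intercept (beta naught)
# in addition to one coefficient per feature.
model = LogisticRegression(fit_intercept=True, solver="liblinear")
print(params["C"])
```
In scikit-learn, C is the inverse of the regularization strength, so the small end of this range corresponds to the strongest penalty.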

32:46.630 --> 32:50.470
So for this, we will apply GridSearchCV.

32:51.590 --> 32:58.670
We will learn how to apply RandomizedSearchCV later; that is a different implementation.

32:58.700 --> 33:01.040
So for now, we are working on GridSearchCV.

33:01.190 --> 33:06.230
In the next session, maybe, you can learn about RandomizedSearchCV and then see the implementation.

33:06.560 --> 33:09.930
Here we are applying GridSearchCV, so we give the model.

33:10.670 --> 33:11.750
This is the model name.

33:12.230 --> 33:17.870
Then we give the parameter grid, which means all of these parameters which we have. We give

33:17.870 --> 33:23.060
the cv value as 5, because we want to have five folds in the cross validation, and we give the

33:23.150 --> 33:25.500
scoring metric as roc_auc.

33:27.170 --> 33:34.630
In the class weight, because we already know that the class is imbalanced, that is, an 8.41

33:34.640 --> 33:35.690
ratio,

33:35.810 --> 33:40.940
in that case we can also give another value.

33:41.150 --> 33:46.760
That is, another set of values, so that it can take a value from this to this also.

33:47.940 --> 33:54.850
So the class weight can be given in the form of a dictionary. So we can give one type as balanced or None,

33:54.900 --> 33:57.060
and the second one would be by class label.

33:57.210 --> 34:02.940
So we can simply say, for label 1, the weight is

34:06.490 --> 34:08.650
1, and for 0

34:09.990 --> 34:16.410
the weight is 8.41. So this is something that you can actually verify, what the weight should

34:16.410 --> 34:18.000
be, and then accordingly give it.

34:18.300 --> 34:20.580
So here, what do we have?

34:22.620 --> 34:30.570
Here, class 0 has 8.41 as the weight, and class 1 has

34:31.720 --> 34:33.700
1 as the weight, so that is it.
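
NOTE
A minimal sketch of passing class weights as a dictionary, assuming toy labels with roughly the 8.41 imbalance mentioned in the session. Which class should carry the larger weight is exactly what the speaker asks you to verify; this sketch derives it from the class counts and follows the common convention of giving the larger weight to the rarer class, which is an assumption rather than something stated in the recording.
```python
import numpy as np
# Hypothetical labels: 841 rows of class 0 and 100 rows of class 1.
y = np.array([0] * 841 + [1] * 100)
counts = np.bincount(y)              # number of rows per class label
ratio = counts.max() / counts.min()  # how many times larger the majority class is
minority = int(counts.argmin())
majority = int(counts.argmax())
# Dictionary form of class_weight: class label mapped to its weight.
manual_weights = {majority: 1.0, minority: ratio}
# The grid can then try the built-in options alongside the manual dictionary.
class_weight_options = ["balanced", None, manual_weights]
print(manual_weights)
```
Checking the counts first, as above, is one way to confirm which label is the rare one before fixing the weights.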

34:35.360 --> 34:41.220
So we can give the weights accordingly, and after that, we will create x_train and y_train.

34:41.540 --> 34:49.510
So here, in x_train, we are dropping the Revenue Grid column from the train dataset.

34:49.880 --> 34:57.560
And y_train, that is the y value, is the data frame which will have only the column

34:57.590 --> 34:59.810
that is the target column.

34:59.810 --> 35:08.900
So x_train will contain the features or attributes, or we can say X values, and y_train will contain all the y values.

35:08.900 --> 35:12.290
That is one single column which will have the values of the revenue.

35:13.600 --> 35:14.140
Now

35:15.330 --> 35:24.750
we will fit the grid search, which means that we are training the models on these X and Y values, and

35:24.760 --> 35:28.770
after the fitting is done, you will get something like this.

35:29.940 --> 35:37.380
This is the entire definition of the grid search. From the grid search, you can use grid_search.best_estimator_

35:37.890 --> 35:47.220
to obtain the best model which has been generated, and you can put it into this new model, which is named

35:47.220 --> 35:48.060
logr.

35:48.090 --> 35:52.010
It will save the details of the best model which we have created.

35:52.320 --> 35:58.410
Now we can create the report using the grid search cv_results_, and we are giving the number of results we want.

35:58.440 --> 36:03.330
So we want five results, so it will give us the five best ranking models.
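
NOTE
A minimal sketch of the grid search steps described above, run on synthetic data because the course dataset is not part of this transcript. The names x_train, y_train, grid_search and logr are assumed spellings of what is spoken in the recording, the grid is shortened to keep the run fast, and the loop at the end is only a hypothetical stand-in for the speaker's own report helper.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Synthetic, imbalanced stand-in for x_train / y_train.
x_train, y_train = make_classification(
    n_samples=300, n_features=6, weights=[0.9, 0.1], random_state=0
)
params = {"penalty": ["l1", "l2"], "C": [0.01, 1.0, 100.0], "class_weight": ["balanced", None]}
model = LogisticRegression(fit_intercept=True, solver="liblinear")
# Five-fold cross validation, scored by area under the ROC curve.
grid_search = GridSearchCV(model, param_grid=params, cv=5, scoring="roc_auc")
grid_search.fit(x_train, y_train)
# With the default refit=True, best_estimator_ is already refit on all of x_train.
logr = grid_search.best_estimator_
# Stand-in for the report helper: list the five best-ranked candidates.
results = grid_search.cv_results_
top = np.argsort(results["rank_test_score"])[:5]
for i in top:
    print(results["rank_test_score"][i], round(results["mean_test_score"][i], 3), results["params"][i])
```
Because of the default refit behaviour, fitting logr again on the same training data, as the session does later, should be harmless but is not strictly required.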

36:03.540 --> 36:10.300
So from here you can see that the mean validation score, or the accuracy measure that we are looking for, the

36:10.330 --> 36:13.480
AUC score, is coming out as 0.954.

36:13.800 --> 36:21.120
And here the C value is 0.01, the class weight is balanced, and the penalty is l1.

36:22.480 --> 36:30.550
Now, the next thing is, we have got the best estimator in logr. This logr contains the details

36:30.550 --> 36:36.680
of the best estimator, that is, these values, the model which has actually given us the best result.

36:36.970 --> 36:43.780
Now we will fit the model again, because now we have the grid search object, which contains a list of

36:43.780 --> 36:47.590
all the models which we have trained. We don't want all those models.

36:47.590 --> 36:49.700
We only want that one best model.

36:49.900 --> 36:56.830
So we will take the best model here, the attributes of the best model, and fit it on the training

36:56.830 --> 36:57.360
dataset.

36:58.090 --> 37:02.360
Now, after training, we will generate the cutoff.

37:02.470 --> 37:10.190
Now, how can we create the cutoff value? For generating the cutoff value, we have this particular logic.

37:10.390 --> 37:17.350
So first, we are creating a linspace from 0.01 to 0.99, and

37:17.350 --> 37:19.860
we are generating 99 such values.

37:20.110 --> 37:24.850
So this will give us values from 0.01 to 0.99.

37:26.160 --> 37:33.630
Now, for all of these values, what we are doing is we are making predictions using the model which

37:33.630 --> 37:40.290
we have trained. logr is the model which we have trained, and predict_proba is the

37:40.290 --> 37:47.820
function which gives the probability as a prediction. If you use only

37:47.820 --> 37:51.920
predict, it will give the exact hard class value, 0 or 1.

37:52.200 --> 37:58.700
But if you want to find out the probability as a prediction, then you will use predict_proba.

37:59.070 --> 38:00.960
So what will predict_proba take?

38:01.080 --> 38:06.090
It will take the X values of the existing dataset or the evaluation dataset,

38:06.090 --> 38:07.800
whichever you decide.

38:07.800 --> 38:15.120
You have the X values for which you already have the y values in place; you will use those to actually

38:15.120 --> 38:16.320
create all these things.

38:16.500 --> 38:23.180
So I am using x_train, and from it I am getting the predictions in train_score.

38:23.340 --> 38:28.830
So these are the predictions which I am getting from the logr model, which I have generated.

38:29.190 --> 38:34.890
And y_train is the data which contains the actual y values.
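
NOTE
A minimal sketch contrasting predict with predict_proba, using synthetic data and a default LogisticRegression as a stand-in for the tuned model. The names logr, train_score and real are assumed spellings of what is spoken in the recording.
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# Synthetic stand-in for x_train / y_train.
x_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)
logr = LogisticRegression().fit(x_train, y_train)
# predict returns the hard class for each row: 0 or 1.
hard_classes = logr.predict(x_train)
# predict_proba returns one column per class: P(class 0) and P(class 1).
proba = logr.predict_proba(x_train)
# Keep only the probability of the positive class as the score.
train_score = proba[:, 1]
# The actual outcomes that the scores will be compared against.
real = y_train
print(hard_classes[:5], train_score[:5])
```
Each row of proba sums to 1, so the second column alone is enough to compare against a cutoff.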

38:35.910 --> 38:38.820
Right. From these actual values,

38:38.820 --> 38:45.130
I create a new data frame which has the actual values, which is named real.

38:45.570 --> 38:51.600
Now I am checking train_score, that is, my training score value:

38:51.600 --> 38:54.980
wherever the training score value is greater than 0.2.

38:55.530 --> 38:58.270
I am checking this particular condition.

38:58.290 --> 38:59.280
So what will happen?

38:59.730 --> 39:05.420
This has the probability values, and probability values can range from zero to one.

39:05.850 --> 39:10.560
So the probability value could have been 0.1 or 0.2 or 0.9.

39:11.190 --> 39:16.970
So wherever the probability value was 0.2 or 0.19,

39:17.100 --> 39:22.740
for that it has given True, and wherever the value was greater than 0.2, it has given

39:22.740 --> 39:23.220
False.

39:24.550 --> 39:29.620
No, wherever the value was greater than 0.2, it has given True; that was my mistake.

39:29.860 --> 39:35.340
So wherever the value is greater than 0.2, it has given True there.

39:35.470 --> 39:42.250
And wherever the value is less than 0.2 or equal to 0.2, it has given False.

39:43.390 --> 39:43.760
Right.

39:43.960 --> 39:52.810
So it will have the maximum values as True, because the cutoff value which we have provided is 0.2.

39:53.710 --> 40:00.010
When the cutoff value is lower, this condition will result

40:00.010 --> 40:08.080
in the maximum values being True, and when the cutoff value is higher, it will result in the maximum values being False.
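
NOTE
A minimal sketch of the comparison described above, using a handful of made-up probabilities in place of the real training scores. It shows that the comparison is strictly greater than, and that a lower cutoff marks more rows as True.
```python
import numpy as np
# Hypothetical predicted probabilities for five rows.
train_score = np.array([0.10, 0.19, 0.20, 0.21, 0.90])
# Strictly greater than 0.2: the values 0.19 and 0.20 both give False.
print((train_score > 0.2).tolist())  # [False, False, False, True, True]
# A lower cutoff marks more rows as True, a higher cutoff marks fewer.
n_true_low = int((train_score > 0.05).sum())
n_true_high = int((train_score > 0.5).sum())
print(n_true_low, n_true_high)  # 5 1
```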

40:08.770 --> 40:16.690
Right. Now, this is a method which actually allows us to find out the

40:16.690 --> 40:18.040
best cutoff value.

40:18.370 --> 40:27.820
If we talk about logistic regression or any other method, any other sklearn model, it will by default

40:27.820 --> 40:31.450
assume the cutoff value to be 0.5.

40:32.650 --> 40:38.600
But if you want to decide upon a better cutoff value, you can use this method.

40:39.520 --> 40:41.510
So what do we have here?

40:41.890 --> 40:44.110
We have this KS_all.

40:44.140 --> 40:50.760
This is a blank list which we have, and we are running the loop for different cutoff values.

40:50.770 --> 40:52.200
What are these values?

40:52.210 --> 40:55.930
The cutoffs, 0.01 to 0.99.

40:57.250 --> 41:06.040
So for each cutoff, it will compare it to the training score which we have obtained, the predictions

41:06.040 --> 41:15.700
which we have obtained, and set the predicted value. So we will get either 0 or 1, based

41:15.700 --> 41:19.720
on the training score compared to the cutoff values which we have.

41:21.250 --> 41:27.940
So when the cutoff value is 0.01, most of the values will be True; when the value is

41:28.030 --> 41:29.150
0.99,

41:29.320 --> 41:31.860
most of the values will be False.

41:33.000 --> 41:34.360
Now, think about that.

41:34.950 --> 41:39.210
We have these true positive, true negative, false positive, false negative.

41:39.270 --> 41:47.670
This is the implementation of finding out how many true positives are there, how many true negatives are there,

41:47.670 --> 41:49.250
and how many false values there are.

41:49.560 --> 41:56.190
So it is as simple as finding out where the predicted value is equal to 1 and the real value is equal to 1.

41:57.400 --> 42:02.240
So if both of these values are equal to 1, then it gives True.

42:02.950 --> 42:10.450
So it adds all these values. Based on these conditions, it is just adding up all the values and

42:10.450 --> 42:16.420
getting you all the true positives, true negatives, false positives and false negatives. Then P gives all the

42:16.420 --> 42:20.200
positive values, which is equal to true positive plus false negative.

42:20.800 --> 42:26.270
You can pause this and try to go through this slowly also.

42:26.530 --> 42:34.150
So basically, it is whatever true positives are present and whatever false negatives are present. False negative

42:34.150 --> 42:38.650
means that you have predicted it to be negative, but it was actually positive.

42:38.670 --> 42:40.480
That is why it is a false prediction.

42:40.480 --> 42:40.790
Right.

42:41.230 --> 42:43.630
So it is a positive value.

42:43.960 --> 42:47.050
Similarly, we have all the negative values.

42:48.020 --> 42:49.040
Now, this is the

42:50.580 --> 42:58.590
formula for the KS calculation, which is true positive divided by the positive values, minus false positive

42:58.590 --> 43:07.260
divided by the negative values, and it is just simply appending all the KS values to this KS_all list,

43:07.410 --> 43:11.940
so that you have the KS values for all the cutoff values.

43:13.080 --> 43:19.020
And based on these KS values, you can decide the cutoff based on the maximum.

43:20.290 --> 43:27.730
So it is checking where KS_all is equal to the maximum of KS_all and getting the index of it.

43:29.140 --> 43:37.420
So the index gives me the value as 0.5, so the cutoff which I will be considering is

43:37.420 --> 43:38.670
0.5.
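
NOTE
A minimal sketch of the cutoff search described above, on made-up scores and labels. The names cutoffs, KS_all and my_cutoff are assumed spellings, and the statistic is the one the speaker describes: true positives over all actual positives, minus false positives over all actual negatives.
```python
import numpy as np
# Hypothetical scores and true labels standing in for train_score and real.
train_score = np.array([0.05, 0.10, 0.20, 0.305, 0.355, 0.60, 0.80, 0.90])
real = np.array([0, 0, 0, 0, 1, 1, 1, 1])
cutoffs = np.linspace(0.01, 0.99, 99)  # 0.01, 0.02, ..., 0.99
KS_all = []
for cutoff in cutoffs:
    predicted = (train_score > cutoff).astype(int)
    TP = ((predicted == 1) & (real == 1)).sum()  # predicted 1, actually 1
    TN = ((predicted == 0) & (real == 0)).sum()  # predicted 0, actually 0
    FP = ((predicted == 1) & (real == 0)).sum()  # predicted 1, actually 0
    FN = ((predicted == 0) & (real == 1)).sum()  # predicted 0, actually 1
    P = TP + FN  # all actual positives
    N = TN + FP  # all actual negatives
    KS_all.append(TP / P - FP / N)
# First cutoff at which the KS value reaches its maximum.
my_cutoff = cutoffs[int(np.argmax(KS_all))]
print(round(float(my_cutoff), 2), max(KS_all))
```
np.argmax returns the first position of the maximum, so when several cutoffs tie, this sketch keeps the smallest of them; that tie-breaking rule is a choice made here, not one stated in the session.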

43:39.540 --> 43:48.300
Right. Now, logr.intercept_ gives me the value of the intercept, which is zero. I am zipping

43:48.300 --> 43:55.170
the columns with the coefficient values, so it gives me all the coefficient values.

43:55.320 --> 44:00.030
And you can see that there are several coefficients which have got the value zero.

44:00.210 --> 44:08.570
So this means that these columns can actually be removed from the training of my future models.

44:08.730 --> 44:18.060
So when I will be training my decision tree, random forest or XGBoost or any other model, I

44:18.060 --> 44:26.580
can very easily get rid of these columns, because these columns have no impact in the calculation of

44:26.580 --> 44:27.960
my target class.

44:29.040 --> 44:31.740
So I can easily get rid of these columns.
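
NOTE
A minimal sketch of reading the intercept and pairing each column with its coefficient, on synthetic data with made-up column names f0 to f7. The l1 penalty with a small C is used here only as an assumption to illustrate how coefficients can end up at exactly zero; the values in the session come from its own grid search.
```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# Synthetic data in which only two of the eight features carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=2, n_redundant=0, random_state=0
)
x_train = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])  # hypothetical column names
# A strong l1 penalty (small C) tends to push weak coefficients to exactly zero.
logr = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(x_train, y)
print(logr.intercept_)
# Pair each column name with its coefficient.
coefs = list(zip(x_train.columns, logr.coef_[0]))
# Columns whose coefficient is exactly zero in this fitted model.
zero_cols = [name for name, value in coefs if value == 0]
print(coefs)
print(zero_cols)
```
A zero coefficient says the column was not used by this particular linear model; whether it is also safe to drop for tree-based models is worth confirming by comparing validation scores with and without it.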

44:33.070 --> 44:40.630
Now, let us come to prediction. We can simply submit these values; let us say you want to participate

44:40.630 --> 44:46.470
in any competition, or you want to create the output for the dataset to make predictions.

44:46.600 --> 44:50.640
So what you can do is simply use logr.predict_proba.

44:50.950 --> 45:00.550
Give the input X values, that is, the X data frame, and get the output values from it.

45:00.580 --> 45:07.450
These are the y values which have been predicted, and you can simply create a new data frame of these

45:08.660 --> 45:13.500
values and put it into a CSV using to_csv.

45:13.510 --> 45:17.740
And this is the CSV name which you have used.

45:18.160 --> 45:21.220
Now, in case you want to provide the hard classes

45:21.220 --> 45:29.410
and not really the probability, then you can simply compare it with the cutoff value and apply astype

45:29.410 --> 45:35.600
int, so it will convert it into 0 and 1, and then you can create a file out of it.
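
NOTE
A minimal sketch of exporting predictions, using made-up probabilities in place of the output of predict_proba on the test features. The file names, the column names and the variable names test_score and my_cutoff are all hypothetical.
```python
import numpy as np
import pandas as pd
# Hypothetical probabilities, standing in for logr.predict_proba(x_test)[:, 1].
test_score = np.array([0.12, 0.47, 0.51, 0.93])
my_cutoff = 0.5  # the cutoff value chosen in the session
# Option 1: submit the probabilities themselves.
pd.DataFrame({"score": test_score}).to_csv("probabilities.csv", index=False)
# Option 2: compare with the cutoff, then convert True/False to 1/0.
hard_classes = (test_score > my_cutoff).astype(int)
pd.DataFrame({"prediction": hard_classes}).to_csv("classes.csv", index=False)
print(hard_classes.tolist())  # [0, 0, 1, 1]
```
Passing index=False keeps the row index out of the file, so the CSV contains only the prediction column.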

45:36.370 --> 45:46.030
So in case you will be using a manual method for moving this

45:46.150 --> 45:48.110
model into production,

45:48.280 --> 45:55.480
in that case you can simply export the CSV files to get the output of the model which you have

45:56.080 --> 45:56.680
created,

45:56.680 --> 46:01.230
getting all the predictions from the model.

46:01.360 --> 46:08.260
So I hope you understood this. In case you have not understood any of the things which I have told you,

46:08.260 --> 46:15.790
please go through the video again and again, because this is a very important concept, and

46:15.790 --> 46:22.540
I will try to clarify these things again and again over time in my further videos.

46:22.660 --> 46:26.390
So if there is something that you have not understood,

46:26.590 --> 46:29.020
you can learn it from those videos.

46:29.020 --> 46:34.660
And in case you still don't understand, then you can get back to me and I can explain those again

46:34.660 --> 46:34.840
too.

46:36.090 --> 46:40.560
So I hope this would have helped you. In the next session,

46:40.590 --> 46:45.770
we will talk about the different types of metrics which we have.

46:46.050 --> 46:52.230
And by going through those metrics, you will get to know what metrics you can use and where you should

46:52.230 --> 46:52.920
use what.

46:53.940 --> 46:54.790
Thank you.
