WEBVTT

00:01.560 --> 00:07.110
In this session, we will learn how to derive some insights from the numerical data.

00:09.330 --> 00:14.460
So the first step would be importing the pandas library.

00:17.320 --> 00:20.500
After importing the library, I would import the file.

00:20.540 --> 00:23.590
So this is the file which we will be using.

00:25.250 --> 00:34.700
And this is the file import, so I'm reading the file with the read_csv function and this is the file

00:34.700 --> 00:39.290
path, and my file has a semicolon as the delimiter.

00:39.290 --> 00:41.960
So I'm using the delimiter argument here.

00:41.960 --> 00:44.360
Suppose I do not use this delimiter.

00:44.360 --> 00:46.190
So let me show what happens.

00:47.550 --> 00:49.110
So if I run this.

00:50.360 --> 00:51.080
And.

00:52.280 --> 00:53.480
Get df.head.

00:53.780 --> 01:00.140
So here you can see the data comes in a single column, but if I add the delimiter,

01:01.940 --> 01:08.690
Then it will be shown in different columns, it will be read properly, so please make sure that the

01:08.690 --> 01:11.870
delimiter which you are selecting is correct.
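
NOTE
A minimal sketch of the delimiter point above, assuming pandas is installed.
The three sample rows and their column names are hypothetical stand-ins for the file used in the session.
```python
import io
import pandas as pd
# Hypothetical sample standing in for the semicolon-delimited file in the session.
raw = "age;job;balance\n58;management;2143\n44;technician;29\n33;entrepreneur;2\n"
# Without a delimiter, read_csv assumes commas, so every row lands in a single column.
wrong = pd.read_csv(io.StringIO(raw))
# Passing sep=";" splits each row into its proper columns.
right = pd.read_csv(io.StringIO(raw), sep=";")
print(wrong.shape)  # (3, 1)
print(right.shape)  # (3, 3)
print(right.head())
```
With a real file, the path would replace io.StringIO(raw); the sep argument stays the same.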

01:12.710 --> 01:21.150
Now, the next thing which we can do is we can describe the details of the numerical values.

01:21.440 --> 01:24.170
So these are the numeric columns in my data.

01:25.590 --> 01:31.710
So I can find out the count, the mean value, the standard deviation, the minimum value, twenty

01:31.710 --> 01:38.070
five percentile, 50 percentile, 75 percentile and the maximum value of my numerical columns.
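
NOTE
A small sketch of describe on numeric columns, assuming pandas is installed.
The four rows are made up for illustration and are not taken from the session's data.
```python
import pandas as pd
# Hypothetical rows: two numeric columns and one text column.
df = pd.DataFrame({"age": [58, 44, 33, 47], "balance": [2143, 29, 2, 1506], "job": ["management", "technician", "entrepreneur", "blue-collar"]})
# By default, describe() summarizes only the numeric columns.
summary = df.describe()
print(summary)
# The rows of the summary are count, mean, std, min, 25%, 50%, 75% and max.
print(list(summary.index))
```
The text column job is left out of the summary, which matches the idea that describe reports on the numerical columns.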

01:39.640 --> 01:42.670
The next thing which I can do is I can call df.info.

01:42.910 --> 01:46.920
So this will give me the information about my DataFrame.

01:47.110 --> 01:54.400
So it tells me that my data frame has forty five thousand two hundred eleven entries ranging from zero

01:54.400 --> 01:56.860
to forty five thousand two hundred ten.

01:56.860 --> 02:07.150
And then it tells me that the total columns are 17, and in these data columns, these are my column names.

02:07.720 --> 02:12.810
These are the non-null counts, which means I do not have any null values in my data.

02:13.450 --> 02:17.040
And it also tells me the data type of the columns.

02:17.290 --> 02:21.720
So by this I can find out which values I need to modify.
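
NOTE
A hedged sketch of what info reports, assuming pandas is installed.
The tiny frame is hypothetical, and the object dtype is set by hand so the result does not depend on the pandas version.
```python
import io
import pandas as pd
# Hypothetical rows; dtype=object is forced so the month column is reported as object.
df = pd.DataFrame({"age": [58, 44, 33], "month": pd.Series(["may", "jun", "jul"], dtype=object)})
# info() prints the row range, the column names, the non-null counts and the data types.
df.info()
# The same report can be captured as text through the buf parameter.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
# dtypes shows only the data type of each column.
print(df.dtypes)
```
Columns reported as object are the ones to look at when deciding what needs to be converted to numbers.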

02:22.030 --> 02:26.730
So here you can see that the month column has its data type as object.

02:27.400 --> 02:29.710
So let us see what my month column has.

02:29.920 --> 02:34.150
So the month column has values such as May, June, July, like this.

02:34.420 --> 02:39.820
So what we can do is we can create another mapping.

02:39.830 --> 02:42.850
So instead of January we can put one.

02:42.850 --> 02:45.120
Instead of February, we can put two, and so on.

02:45.130 --> 02:49.840
Similarly, we can change the values of the month to get the values in a numerical form.
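
NOTE
One possible way to write the month mapping described above, assuming pandas is installed.
The lowercase three-letter labels and the month_num column name are assumptions; the real column may spell the months differently.
```python
import pandas as pd
# Hypothetical month labels; adjust the dictionary keys to match the real column.
df = pd.DataFrame({"month": ["may", "jun", "jul", "jan", "feb"]})
month_map = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6, "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
# map() looks up each label in the dictionary; labels missing from it become NaN.
df["month_num"] = df["month"].map(month_map)
print(df)
```
Keeping the original column next to the new one makes it easy to check the mapping before dropping the text version.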

02:51.260 --> 02:59.960
And other columns are like job, marital, education, default, housing, loan,

03:02.230 --> 03:11.290
poutcome and y. These are the values which we can then convert as well. Now, regarding the contact column,

03:11.290 --> 03:14.590
we will actually have to see what kind of value it holds.

03:14.950 --> 03:18.540
So if you see here, the contact column holds object.

03:19.060 --> 03:25.720
So we will have to find out what kind of value it holds and what number of values are having value as

03:25.720 --> 03:26.220
unknown.

03:26.470 --> 03:31.470
And then we will have to make our decision if we need to keep the contact column or not.

03:31.780 --> 03:37.090
And in case we are keeping the contact column, then whether there is any particular value which we can

03:37.090 --> 03:37.950
impute or not.

03:38.230 --> 03:40.780
So these are certain decisions which you will have to make.

03:42.420 --> 03:48.810
Now, next, we have df.columns, which will allow us to get the names of the columns, and we have a function,

03:48.810 --> 03:55.960
nunique, which tells us the number of unique values for each and every column.

03:56.220 --> 04:01.190
So for the age column, I have seventy seven unique numbers, which is just fine.

04:01.410 --> 04:03.840
And job is a categorical column.

04:04.140 --> 04:13.160
I have 12 unique job types, so when I have 12 unique job types, it means that I have 12 dummy columns

04:13.250 --> 04:14.460
that need to be created.

04:15.000 --> 04:23.010
Now, when I will be converting these 12 job types into dummies, this will mean that I will be adding

04:23.010 --> 04:26.670
at least 11 columns to this entire dataset.

04:27.120 --> 04:34.020
Now, this is something which I will have to decide: whether there are a certain number of job

04:34.020 --> 04:39.850
types which are important and certain job types which I can actually remove from my dataset.

04:40.140 --> 04:43.140
So that is something which I will have to make a decision upon.
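
NOTE
A rough sketch of how unique counts translate into dummy columns, assuming pandas is installed.
The rows are hypothetical and use three job types instead of twelve to keep the output short.
```python
import pandas as pd
# Hypothetical rows with three distinct job types.
df = pd.DataFrame({"age": [58, 44, 33, 47], "job": ["management", "technician", "management", "student"]})
# nunique() gives the number of distinct values in every column.
print(df.nunique())
# One dummy column per category: job is replaced by three new columns.
dummies = pd.get_dummies(df, columns=["job"])
# drop_first=True leaves out one category, giving one column fewer.
dropped = pd.get_dummies(df, columns=["job"], drop_first=True)
print(dummies.columns.tolist())
print(dropped.columns.tolist())
```
With 12 job types the same calls would create 12 dummy columns, or 11 with drop_first=True, which is why the number of categories matters before encoding.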

04:43.500 --> 04:46.960
Now, again, we have this marital column, which is fine.

04:46.980 --> 04:49.290
We can create three dummy values for this.

04:49.440 --> 04:53.520
For education, again, we can make four dummy values. For default,

04:53.520 --> 04:55.770
again, the dummy values can be created.

04:55.980 --> 05:00.660
Housing, loan, contact: for all of these, dummy values can be created.

05:02.550 --> 05:03.960
Then we have.

05:05.650 --> 05:11.380
The day, campaign, duration, so all of these values

05:12.700 --> 05:19.960
are numerical in nature, so this is fine, we can handle these accordingly. In case we had any particular

05:19.960 --> 05:23.350
column which had a lot of categories,

05:23.740 --> 05:31.480
then in that case, we would have had to bin those categories into certain levels or certain

05:31.480 --> 05:39.940
groups, or maybe we would have selected only five or seven top groups so that we could reduce the number

05:39.940 --> 05:40.530
of columns.

05:40.840 --> 05:47.340
But again, how we will do the feature selection is another topic, which we will discuss further.

05:47.500 --> 05:51.670
But this is just to give you an insight into how we will be tackling the data.

05:55.070 --> 06:04.310
Now we have df.describe, so we can use describe directly, and I can use df.age.describe also; this would

06:04.310 --> 06:06.680
just give me the details about my age column.

06:07.140 --> 06:09.410
Now, what we can do is we can get, again,

06:09.410 --> 06:14.750
the column names, the data types and the number of unique values together.

06:16.240 --> 06:25.420
This will actually give me a broad insight if I want to convert a particular column from categorical

06:25.420 --> 06:26.890
to numerical.

06:26.920 --> 06:29.210
and how many categories I have.

06:29.210 --> 06:33.010
So here you can see that I have 12 categories, three categories.

06:33.020 --> 06:35.290
Four categories, two categories.

06:35.290 --> 06:37.390
We are checking only for the object types.

06:38.310 --> 06:45.750
For the data types, it is the object types which we have to go and work on. For the int64 types, we don't

06:45.750 --> 06:49.900
have to do anything specific because they are already numerical.
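
NOTE
A sketch of viewing data types and unique counts side by side, assuming pandas is installed.
The frame and the overview and n_unique names are hypothetical; the object dtype is set by hand so the filter behaves the same across pandas versions.
```python
import pandas as pd
# Hypothetical rows; the text columns are forced to the object dtype.
df = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "job": pd.Series(["management", "technician", "management", "student"], dtype=object),
    "housing": pd.Series(["yes", "no", "yes", "yes"], dtype=object),
})
# One row per column: its data type and its number of unique values.
overview = pd.DataFrame({"dtype": df.dtypes.astype(str), "n_unique": df.nunique()})
# Keep only the object columns, since these are the ones that need encoding.
object_cols = overview[overview["dtype"] == "object"]
print(object_cols)
```
The n_unique value for each object column is the number of categories that would have to be turned into dummy columns.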

06:51.390 --> 06:53.040
Yet again, it is two.

06:54.110 --> 06:54.620
Three.

06:55.780 --> 06:56.390
Twelve.

06:58.440 --> 06:58.920
Four.

07:00.040 --> 07:08.500
And so this is perfectly fine, we can go ahead with this. Now here we can use certain aggregate methods

07:09.130 --> 07:12.590
to create values or to impute values.

07:12.850 --> 07:18.820
So here we have df.age.mean and df.age.median, which I can find out

07:20.800 --> 07:26.690
using the .mean and .median functions. So similarly, we have many aggregate functions available.

07:27.100 --> 07:30.670
So these are the names of several aggregate functions.

07:32.050 --> 07:40.050
They are count, sum, mean, mean absolute deviation, arithmetic median, minimum, maximum, mode, absolute

07:40.050 --> 07:45.810
value, product, standard deviation, so we can use all of these, any of them, whichever

07:45.810 --> 07:46.470
is required.
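
NOTE
A short sketch of a few of the aggregate functions named above, assuming pandas is installed.
The five ages are made up for illustration.
```python
import pandas as pd
# Hypothetical ages.
age = pd.Series([58, 44, 33, 47, 33], name="age")
# Individual aggregates are available as methods on the Series.
print(age.mean(), age.median(), age.min(), age.max(), age.std())
# mode() returns a Series because several values can tie for most frequent.
print(age.mode().tolist())
# agg() runs several aggregates in one call and returns them as a labelled Series.
stats = age.agg(["count", "sum", "mean", "median"])
print(stats)
```
The mean absolute deviation from the list is left out here because Series.mad was removed in pandas 2.0.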

07:47.070 --> 07:51.710
Now, suppose you want to find out the value counts of a particular column.

07:52.590 --> 08:01.050
So here we have the count of rows which belong to each and every category in one particular feature or input

08:01.050 --> 08:02.130
variable or column.

08:02.730 --> 08:05.540
So this is the value count for job.

08:06.210 --> 08:14.670
So in the job column, or we can also call it a feature or input variable or attribute,

08:15.060 --> 08:27.300
we have these 12 values, namely blue-collar, management, technician, admin, services, retired,

08:27.300 --> 08:30.860
self-employed, entrepreneur, unemployed, housemaid, student

08:30.930 --> 08:37.400
and unknown. Now if we see, the majority of the values are from the top few categories.

08:37.650 --> 08:43.800
So we have a few values which have a larger number of rows present with them.

08:44.070 --> 08:48.570
And there are certain categories which do not have much data in them.

08:49.050 --> 08:52.320
So what we can do is we can drop those. Now here,

08:52.320 --> 08:59.300
we don't really get a view of how much percentage of data is actually represented by

08:59.310 --> 09:05.370
the blue-collar people, or how much percentage of the data is represented by students.

09:05.730 --> 09:12.090
So what we can do is we can get the value count and along with it we can normalize the data.

09:12.570 --> 09:17.850
So when we normalize the data, what happens is we get a percentage view of this.

09:19.000 --> 09:26.860
So you can see that 21 percent of the data is represented by blue-collar, 20 percent of the data is represented by

09:26.860 --> 09:34.210
management, 16 percent is technician, 11 percent is admin, nine percent is services, then

09:34.210 --> 09:35.860
five percent is retired.

09:36.070 --> 09:38.590
And then three percent, three percent.

09:38.620 --> 09:41.080
These are self-employed, entrepreneur, unemployed.
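
NOTE
A minimal sketch of the percentage view, assuming pandas is installed.
The ten job labels are hypothetical and chosen so the shares come out as round numbers.
```python
import pandas as pd
# Hypothetical job column: 5 blue-collar, 4 management, 1 student.
job = pd.Series(["blue-collar"] * 5 + ["management"] * 4 + ["student"], name="job")
# Plain counts per category.
counts = job.value_counts()
# normalize=True returns each count divided by the total, so the values sum to 1.
shares = job.value_counts(normalize=True)
print(counts)
print((shares * 100).round(1))
```
Multiplying the shares by 100 gives the percentages that are read out in the session.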

09:41.290 --> 09:48.790
So what we can do is whatever is less than, say, five percent or three percent or two percent, that

09:48.790 --> 09:49.590
is your choice.

09:49.600 --> 09:56.620
What is the threshold you want to apply for reducing the number of features? We can decide on top of that.

09:56.770 --> 10:03.440
So let us say I want to keep only data which represents at least five percent of my population.

10:03.820 --> 10:12.460
So what I can do is I can keep retired, services, admin, technician, management and blue-collar

10:12.460 --> 10:14.140
as my dummy columns.

10:14.530 --> 10:21.250
And I can remove self-employed, entrepreneur, unemployed, housemaid, student and unknown from my dummy

10:21.250 --> 10:27.550
columns, so that the data which I have would have a good representation, because that is something

10:27.550 --> 10:29.470
which is actually going to help out

10:29.860 --> 10:36.610
while I am actually trying to train my model. Some data, which is actually less in number, will

10:36.610 --> 10:42.860
not be able to give that much weightage to the representation.
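
NOTE
One way the five percent rule could be applied, assuming pandas is installed.
The data, the threshold variable and the choice to blank out rare categories before encoding are illustrative assumptions, not the session's exact code.
```python
import pandas as pd
# Hypothetical job column: 12 blue-collar, 12 management, 1 student (4 percent).
job = pd.Series(["blue-collar"] * 12 + ["management"] * 12 + ["student"], name="job")
shares = job.value_counts(normalize=True)
# Assumed cut-off; the session stresses that this is a judgment call.
threshold = 0.05
# Categories that represent at least the threshold share of the rows.
keep = shares[shares >= threshold].index
# Rare categories are set to missing, so get_dummies creates no column for them.
filtered = job.where(job.isin(keep))
dummies = pd.get_dummies(filtered, prefix="job")
print(dummies.columns.tolist())
```
If a rare category matters for the problem, such as student in a student-loan setting, it can simply be added back to keep before encoding.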

10:43.060 --> 10:50.710
Now again, that is completely your choice and is something which is derived from the type of problem

10:50.710 --> 10:57.400
that we have. Let us say that we have something very special for students.

10:57.770 --> 11:03.720
We are talking about loans and we have a special category of student loans.

11:04.300 --> 11:15.310
So in that case, someone who is a student becomes a very important person in the problem solving

11:15.310 --> 11:16.290
which we are doing.

11:16.660 --> 11:22.180
So in that case, we will keep student no matter how small the percentage is.

11:22.930 --> 11:28.810
So that is something that you will have to keep in mind and something which you will be having as a

11:28.810 --> 11:37.420
good idea while deciding if you should be removing a particular column or if you should be removing that

11:37.420 --> 11:45.370
particular category while converting the job column into different categories,

11:46.780 --> 11:54.340
or into the dummies. So a similar thing would be applied to different types of features. So for different

11:54.340 --> 12:03.100
features like we have job, marital status, education, housing, loan and so forth, for these also the same

12:03.100 --> 12:04.130
thing would be applied.

12:04.450 --> 12:08.200
So let's say I have this column contact and

12:09.280 --> 12:18.610
ninety nine percent of my data is represented by a constant, let us say value X. Then what I can do is I

12:18.610 --> 12:24.730
can completely get rid of that one percent of my data because it doesn't really have very good representation.

12:24.910 --> 12:28.380
And almost all the values are actually constant in nature.

12:28.810 --> 12:31.090
Or, let us say, there are these categories:

12:31.090 --> 12:31.900
yes, no

12:31.900 --> 12:36.460
and maybe, and there is only one percent of people who are opting for

12:36.460 --> 12:42.620
maybe. Then we can get rid of that maybe and not keep it as a category.

12:43.010 --> 12:43.480
It's a.

12:44.720 --> 12:46.330
So we can do something like that.

12:49.550 --> 12:51.770
Now, the next thing which we have is.

12:52.700 --> 12:56.410
We can check if there is any null value in our data.

12:56.810 --> 13:03.590
So I can simply say df.isnull, which will find out if there is any null value.

13:03.920 --> 13:08.690
And then I sum the values to get the number of null values.

13:08.990 --> 13:12.500
So we get that there is no null value in this particular dataset.

13:12.830 --> 13:17.240
So here what I'm doing is I'm adding certain null values in the age column.

13:18.180 --> 13:23.400
And I'm adding certain NaN values in the balance column and creating a new DataFrame.

13:25.730 --> 13:31.820
So this is my new DataFrame, which has null values in the balance column and the age column.
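
NOTE
A sketch of counting nulls and of building a copy with missing values for practice, assuming pandas is installed.
The rows and the positions that are blanked out are hypothetical; the session does not show how its null values were injected.
```python
import pandas as pd
# Hypothetical rows with no missing values.
df = pd.DataFrame({"age": [58, 44, 33, 47], "balance": [2143.0, 29.0, 2.0, 1506.0]})
# isnull() marks missing cells as True; sum() then counts them per column.
print(df.isnull().sum())
# mask() replaces the values where the condition is True with NaN.
df_missing = df.assign(
    age=df["age"].mask(df.index == 1),
    balance=df["balance"].mask(df.index.isin([0, 2])),
)
print(df_missing.isnull().sum())
```
The original df is left untouched, so the clean and the modified frames can be compared side by side.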

13:32.150 --> 13:35.960
So now what I'm doing is I'm checking if a certain column is null.

13:37.480 --> 13:41.110
Then I can find out the value counts in balance.

13:43.080 --> 13:47.410
So here I am finding out the value counts in balance.

13:47.910 --> 13:53.340
You can see that it is showing forty thousand, fifty thousand, thirty thousand.

13:53.340 --> 14:01.410
All these values are coming up, but it is not actually showing any value for not-a-number. We actually

14:01.410 --> 14:01.770
have

14:01.770 --> 14:05.760
this not-a-number here, but its value count is not coming up.

14:07.810 --> 14:14.470
So what we can do here is there is one option, which is dropna.

14:16.010 --> 14:23.870
OK, so if we run this, we have this df.balance, which is just the same thing, and we are getting

14:23.870 --> 14:30.260
the value counts. Along with the value counts we can select to have normalize so that we are able to

14:30.260 --> 14:32.930
see a percentage view of the data. And

14:34.170 --> 14:40.820
when we say dropna equal to false, which is by default true, what will happen is, when we say

14:40.950 --> 14:45.920
dropna equal to false, then it will not remove the NaN category.

14:46.230 --> 14:49.480
So here it was dropna as true.

14:49.740 --> 14:53.800
So that's the reason why it was not showing the null values here.

14:53.970 --> 14:57.330
But now here we are keeping dropna equal to false.

14:58.610 --> 15:01.480
So it is showing the not-a-number and the percentage of it.

15:03.890 --> 15:10.110
So now here you can see that there are actually 10 percent of data, which is not a number.

15:11.880 --> 15:15.700
So we will have to impute these values and handle them accordingly.
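
NOTE
A minimal sketch of the dropna behaviour of value_counts, assuming pandas is installed.
The ten balance values are hypothetical, with two of them missing.
```python
import numpy as np
import pandas as pd
# Hypothetical balances: five zeros, two 100s, one 50 and two missing values.
balance = pd.Series([0.0, 0.0, 100.0, np.nan, 0.0, np.nan, 100.0, 0.0, 50.0, 0.0], name="balance")
# Default dropna=True: missing values are silently left out of the counts.
default_counts = balance.value_counts()
# dropna=False keeps NaN as its own category; normalize=True turns counts into shares.
with_nan = balance.value_counts(dropna=False, normalize=True)
# Share of rows that are missing.
nan_share = with_nan[with_nan.index.isna()].iloc[0]
print(default_counts)
print(with_nan)
```
With the default setting only eight of the ten rows are counted, so the shares would be computed over the non-missing rows alone.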

15:17.710 --> 15:24.160
So if you have dropna equal to true, then you will not get the full picture of what data

15:24.160 --> 15:24.640
you have.

15:25.720 --> 15:33.370
So always make sure that dropna is equal to false when we are checking

15:33.370 --> 15:38.920
if a particular value has to be dropped or not, or if there is any null value or not.

15:39.100 --> 15:43.450
So these things can be tracked by having this dropna

15:43.450 --> 15:44.520
equal to false.

15:46.440 --> 15:53.190
Always go through each and every column one by one, so that you have a better picture of what kind of

15:53.190 --> 15:58.710
data you have. Data transformation and feature selection

15:58.980 --> 16:02.940
and working with the data is a very long process.

16:03.300 --> 16:11.340
In a normal project, it takes a lot of time, almost 60, 70 percent of the time goes into the data

16:11.340 --> 16:12.090
preparation.

16:12.270 --> 16:17.970
And after that, almost 10, 20, 30 percent of time goes into the model.

16:19.260 --> 16:26.100
So it is always recommended to put in a good effort with the data preparation and feature selection

16:26.100 --> 16:30.870
and feature preparation so that the model that you will be training would be trained well.

16:31.900 --> 16:40.090
Because it is always said that garbage in, garbage out, which means that if in your model you do not

16:40.090 --> 16:47.800
have a good quality of data, then no matter how good your model is, the output will not be

16:47.800 --> 16:48.760
satisfactory.

16:50.340 --> 16:54.520
So always make sure that the data is prepared well.

16:58.030 --> 17:00.490
So here we are.

17:01.790 --> 17:08.470
Seeing the help on the value_counts function so you can see what parameters we have here

17:10.240 --> 17:16.660
and how we can use them. So if you want to look at any other function, you can simply apply help on it

17:16.660 --> 17:20.920
and it will show you the entire documentation of the function.
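
NOTE
A small sketch of looking up documentation from Python, assuming pandas is installed.
Calling help on pd.Series.value_counts is one option; the session may have used a different call to open the same documentation.
```python
import pandas as pd
# help() prints the full docstring, including parameters such as normalize and dropna.
help(pd.Series.value_counts)
# The same text is available programmatically through __doc__.
doc = pd.Series.value_counts.__doc__
print("dropna" in doc)
```
The same pattern works for any other function, for example help(pd.crosstab).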

17:23.380 --> 17:29.020
Then we have columns, so let us check the columns of the DataFrame.

17:30.050 --> 17:34.970
And here, we have marital, so we have

17:36.080 --> 17:42.290
twenty seven thousand values as married, twelve thousand as single and five

17:43.240 --> 17:45.820
thousand as divorced. Now,

17:47.130 --> 17:54.990
by visualizing the data, we are just looking at the data as a single column right now, but what we

17:54.990 --> 18:03.220
can do is we can also compare data based on multiple columns and analyze multiple columns simultaneously.

18:03.630 --> 18:07.890
So how we can do that is by creating a crosstab.

18:10.010 --> 18:17.890
So for creating a crosstab, what we can do is we can use pd.crosstab and we can put the column

18:17.900 --> 18:18.830
names into it.

18:20.290 --> 18:28.150
So I am creating a crosstab between default and housing, so this actually gives me the number of values

18:28.150 --> 18:28.960
which I have.

18:29.260 --> 18:35.770
So for default when the value is no, and for housing when the value is no,

18:36.160 --> 18:39.330
there are nineteen thousand seven hundred and one people.

18:41.150 --> 18:46.630
When default is yes and housing is no, there are three hundred and eighty rows of people.

18:47.740 --> 18:49.840
When default and housing both are

18:49.870 --> 18:52.930
yes, there are four hundred and thirty five rows of data.

18:53.110 --> 18:59.860
So this gives a clearer picture that actually there are more people who have default as no.

19:00.370 --> 19:04.140
And there is a smaller number of people who have default as yes.

19:04.570 --> 19:15.880
And if we see, the people who have housing have a larger number of defaults in comparison

19:15.880 --> 19:18.150
to the people who don't have housing.

19:20.650 --> 19:23.340
Now, here we are creating another crosstab.

19:24.910 --> 19:32.170
So this gives a more hybrid version of it, it just gives the addition of these two values, so it gives

19:32.170 --> 19:35.640
nineteen thousand seven hundred and one plus twenty four thousand here.

19:35.950 --> 19:39.890
And these two values are added here and these two values are added here.

19:39.970 --> 19:43.620
So it gives a complete value count too.
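
NOTE
A sketch of a crosstab with and without totals, assuming pandas is installed.
The six rows are hypothetical, and using margins=True for the second table is an assumption about how the totals in the session were produced.
```python
import pandas as pd
# Hypothetical default and housing flags.
df = pd.DataFrame({"default": ["no", "no", "yes", "no", "yes", "no"], "housing": ["no", "yes", "no", "yes", "yes", "no"]})
# Rows are the default values, columns are the housing values, cells are row counts.
table = pd.crosstab(df["default"], df["housing"])
# margins=True adds an "All" row and an "All" column holding the sums.
with_totals = pd.crosstab(df["default"], df["housing"], margins=True)
print(table)
print(with_totals)
```
The "All" entries are the sums of the cells next to them, which is the addition described for the second table.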

19:46.070 --> 19:58.650
So we are just getting the value counts. And here, select_dtypes: it allows us to get a particular data type

19:59.390 --> 20:01.170
from the entire DataFrame.

20:01.680 --> 20:03.620
So here I am selecting object.

20:04.010 --> 20:08.860
So it will allow me to get all the object data types from here.

20:09.260 --> 20:11.660
So I have job, marital, education.

20:11.660 --> 20:15.770
So all these columns which have the data type as object have been selected.

20:16.810 --> 20:20.470
Now, here, what I'm doing is I'm getting the categorical columns

20:22.070 --> 20:29.270
which have the data type as object, so I got all the column names where the data type is object.
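
NOTE
A sketch of picking out the object columns and listing their value counts, assuming pandas is installed.
The rows are hypothetical, and the object dtype is set by hand so the selection behaves the same across pandas versions.
```python
import pandas as pd
# Hypothetical rows; the text columns are forced to the object dtype.
df = pd.DataFrame({
    "age": [58, 44, 33],
    "job": pd.Series(["management", "technician", "student"], dtype=object),
    "marital": pd.Series(["married", "single", "single"], dtype=object),
})
# select_dtypes keeps only the columns of the requested data type.
objects = df.select_dtypes(include="object")
# The names of the categorical (object) columns.
cat_cols = objects.columns.tolist()
# Value counts for every categorical column, including missing values if any.
for col in cat_cols:
    print(df[col].value_counts(dropna=False))
```
The cat_cols list can then drive a loop that plots or encodes each categorical column in turn.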

20:30.340 --> 20:38.170
And here I'm just plotting the data and the value counts for all the categorical data.

20:38.680 --> 20:44.800
So this will actually allow me to visualize which of these columns I need to keep, what sort of threshold

20:44.800 --> 20:53.470
I should have for removing particular columns, or for removing a particular category while converting

20:53.470 --> 20:54.550
them into dummies.

20:54.580 --> 20:59.500
So all these things can be analyzed from this entire plot, which we have got.

21:00.100 --> 21:03.130
Then we have this groupby option, which we have seen.

21:03.490 --> 21:11.460
So it allows us to group and view the mean value so we can apply any aggregate function on top of it.

21:11.710 --> 21:13.300
So I'm applying mean here.

21:14.810 --> 21:16.130
Yes, I mean.

21:18.340 --> 21:25.720
Here I am again applying mean and viewing only a few columns instead of viewing all the columns, and here

21:25.720 --> 21:30.130
what we can do is we can apply different aggregate functions also.

21:32.480 --> 21:34.610
These are the different aggregate functions.
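
NOTE
A sketch of groupby with one and with several aggregate functions, assuming pandas is installed.
The rows and the choice of marital as the grouping column are hypothetical; the session does not say which column it groups by.
```python
import pandas as pd
# Hypothetical rows.
df = pd.DataFrame({"marital": ["married", "single", "married", "single"], "age": [58, 30, 42, 26], "balance": [2000, 100, 1000, 300]})
# Mean of a few selected columns within each group.
means = df.groupby("marital")[["age", "balance"]].mean()
# agg() applies several aggregate functions to one column at once.
multi = df.groupby("marital")["balance"].agg(["mean", "min", "max"])
print(means)
print(multi)
```
Selecting the columns before calling mean keeps text columns out of the aggregation.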

21:37.610 --> 21:42.290
Next, we will have a look at data visualization.