WEBVTT

00:02.140 --> 00:09.760
Till now, we have discussed different Python libraries like NumPy and pandas, and we have

00:09.760 --> 00:14.190
also discussed how we can modify the data.

00:15.040 --> 00:22.720
Now, as we know all of these, we actually need to understand how we would prepare the data using the

00:22.720 --> 00:24.840
tools like NumPy and pandas.

00:25.750 --> 00:34.320
So the first thing which we need to understand is the different types of variables.

00:35.200 --> 00:38.220
So there could be three types of input variables.

00:38.530 --> 00:40.820
One is categorical variable.

00:41.020 --> 00:46.150
Next is the quantitative (numerical) variable, and the last one is the ordinal variable.

00:47.320 --> 00:56.740
Categorical variables hold the values that can be organized into categories, and these are not numerical

00:56.740 --> 00:57.290
in nature.

00:58.150 --> 01:02.710
Some of the examples of categorical variables could be.

01:07.380 --> 01:17.040
City and gender of a person. The next type of input variable is numerical.

01:18.180 --> 01:27.420
These could be age or salary, something which is numerical in nature and on which we can perform any mathematical

01:27.420 --> 01:31.260
operation. Both of these are called numerical variables.

01:33.090 --> 01:43.740
The third type is the ordinal variable, or the variable with a natural order. It is the variable on top of

01:43.740 --> 01:45.900
which we can apply a sequence.

01:46.770 --> 01:54.870
So here we have this ordinal variable, rating, which has the values good, average, great,

01:54.870 --> 02:02.520
bad, pathetic, great, and good. These values look like categorical variables because

02:02.520 --> 02:10.140
these are in textual form, but they actually have a sequence present in them. We know that great would

02:10.140 --> 02:16.130
be the best rating, followed by good, then average, then bad.

02:16.620 --> 02:18.690
In the end there would be pathetic.

02:19.730 --> 02:24.400
While in the case of city, we do not have any particular order.

02:25.190 --> 02:29.830
We do not know how we can compare Delhi, Mumbai, Chennai, and Kolkata.

02:30.200 --> 02:40.880
So that is the reason why city is a categorical variable while rating is an ordinal one. Gender, again,

02:40.880 --> 02:41.930
is a categorical variable.

02:45.940 --> 02:48.580
Now, let us see how we can handle the data.

02:49.840 --> 02:54.750
So for handling data, we first have to think about the numerical values.

02:56.120 --> 02:58.730
So let us say we have this kind of data.

02:59.840 --> 03:04.290
Now, here we have age and salary, both being numerical in nature.

03:05.150 --> 03:08.480
So we do not need to perform any operation on top of these.

03:09.560 --> 03:17.390
The next is experience. Although the values are numerical in nature, they have some text present

03:17.390 --> 03:22.370
in them, so if you think of it, these will be stored as strings.

03:23.450 --> 03:28.130
So we will have to convert the experience into a numerical form.

03:28.490 --> 03:37.910
So we will have to remove the word "year" from this "1 year" so that only 1 remains in a numerical

03:37.910 --> 03:38.210
form.

03:41.810 --> 03:53.090
Next, we will have to remove this plus sign and "years" from "0+ years". Here, we can either write zero

03:53.090 --> 03:58.670
in place of "0+ years", or we can write one wherever we have "0+ years".
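
NOTE
A minimal sketch of the experience clean-up described above, in plain Python. The helper name parse_experience, the plus_as parameter, and the sample strings are assumptions made for illustration; the lesson only says that "0+ years" may be written as either 0 or 1.
```python
import re
def parse_experience(text, plus_as=0):
    """Turn strings such as '1 year' or '0+ years' into a number (hypothetical helper)."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*(\+?)", text)
    if match is None:
        raise ValueError(f"cannot parse experience: {text!r}")
    value = float(match.group(1))
    # '0+ years' can be stored as 0 or as 1; plus_as picks which one.
    if match.group(2) == "+" and value == 0:
        return float(plus_as)
    return value
experience = ["1 year", "0+ years", "3 years", "2.5 years"]
cleaned = [parse_experience(x) for x in experience]
cleaned_as_one = [parse_experience(x, plus_as=1) for x in experience]
print(cleaned, cleaned_as_one)
```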

04:01.180 --> 04:08.160
Next is categorical data. In the case of categorical data, we need to convert these to numerical form using one-hot

04:08.170 --> 04:10.900
encoding. Let us see how we can do that.

04:16.970 --> 04:24.080
City is a categorical variable and so is gender. Let us see how we can handle this.

04:25.040 --> 04:28.250
So here we have the city here.

04:28.270 --> 04:31.480
We have Delhi, Mumbai, Chennai, Kolkata.

04:32.390 --> 04:35.410
This is the city column, which we originally had.

04:36.050 --> 04:43.310
So when we want to convert this into a numerical form, what we can do is we can instead of having one

04:43.310 --> 04:50.450
column for city, we can have four columns for the cities as we have four cities present in the city

04:50.450 --> 04:50.750
column.

04:51.560 --> 04:59.360
So now we have these four columns, Delhi, Mumbai, Chennai, and Kolkata. The next thing what we can do

04:59.360 --> 05:02.420
is we can put values inside this.

05:09.570 --> 05:11.920
Now, let us try to fill this particular table.

05:14.140 --> 05:21.810
Now, what I have done is, wherever the city was Delhi, in the Delhi column I have put one, and

05:21.880 --> 05:23.920
in all the other places I have put zero.

05:25.270 --> 05:25.990
This way.

05:28.650 --> 05:32.730
This way, all the columns will have the values zero and one,

05:33.910 --> 05:36.440
which are easily interpretable by the machine.

05:37.520 --> 05:42.300
Now, what we can do is let us try to notice a pattern between these.

05:42.730 --> 05:49.620
So wherever the city is Delhi, the number against Delhi is one, and all the other values are zero.

05:50.230 --> 05:54.400
Wherever the city is Mumbai, the value against Mumbai is one.

05:54.550 --> 05:56.800
And all the other values are zero.

05:57.760 --> 06:02.890
And wherever the city is Chennai, the value for Chennai is one.

06:02.890 --> 06:04.690
And all the other values are zero.

06:05.350 --> 06:07.270
Now for Kolkata.

06:07.460 --> 06:13.390
Again, all the values are zero except for the value under the Kolkata column.

06:14.770 --> 06:21.280
Now, instead of having these four columns, I can have only these three columns.

06:25.260 --> 06:28.290
If I keep only these three columns.

06:29.790 --> 06:33.550
Then I will be showing the same information.

06:34.170 --> 06:43.110
What will happen is, whenever the values in the three columns are one, zero, zero, it will mean that we are referring

06:43.110 --> 06:46.210
to Delhi. Whenever the values are zero, one, zero,

06:46.230 --> 06:50.970
we are referring to Mumbai. Whenever the values are zero, zero, one, we are referring to Chennai.

06:51.150 --> 06:54.930
And whenever all the values are zero, we are referring to Kolkata.

06:55.290 --> 07:04.260
So there is no need to have a column named Kolkata. This will in turn reduce the number of columns

07:04.260 --> 07:11.550
which we are having, because the high number of columns also leads to complexity in the calculation

07:11.550 --> 07:12.420
of the

07:12.420 --> 07:13.080
model.

07:14.220 --> 07:22.890
So it is always suggested to have fewer columns, and removing the Kolkata column is actually

07:23.010 --> 07:28.160
saving us one column and also not hampering the information.

07:28.440 --> 07:35.130
We still have the same amount of information intact, but we are just not having one extra column.
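
NOTE
A minimal sketch of the city encoding described above, assuming pandas is used. The sample rows are made up for illustration. pandas.get_dummies builds one 0/1 column per city, and the Kolkata column is then dropped by hand so that a row of all zeros stands for Kolkata, as in the lesson.
```python
import pandas as pd
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai", "Kolkata", "Delhi"]})
# One column per city, with 1 where the row matches that city and 0 elsewhere.
one_hot = pd.get_dummies(df["city"], dtype=int)
# Dropping one column loses nothing: all zeros now means Kolkata.
reduced = one_hot.drop(columns="Kolkata")[["Delhi", "Mumbai", "Chennai"]]
print(reduced)
```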

07:38.980 --> 07:46.960
Now, the next is categorical data again, so let us see another example. Let us look at this particular

07:46.960 --> 07:47.290
data.

07:47.680 --> 07:50.110
So here we have a date.

07:50.590 --> 07:55.060
So under date, we have the day, the month, and the year.

07:55.720 --> 08:03.160
Now, when we have this particular date, it might not show any particular information because the machine

08:03.160 --> 08:11.770
would not be able to understand what a date stands for. So we can convert a single date column

08:11.890 --> 08:21.970
into three different columns, being day, month, and year, and for each of day, month, and year we can fill in the

08:21.970 --> 08:23.020
values of the date.

08:23.230 --> 08:29.260
So for this particular date, the day will be one. For this particular date,

08:29.470 --> 08:31.930
the day will be two. Here,

08:31.930 --> 08:33.280
the day will be five.

08:34.330 --> 08:36.040
Here the day will be eight.

08:37.560 --> 08:45.720
And the month will again be 12, 1, 7, 3, and the year will be 2020, 2002,

08:45.730 --> 08:47.820
2020, and 1982.

08:49.110 --> 08:56.070
So this is how we will be able to convert the date, which is a categorical column, and we are not

08:56.070 --> 09:03.540
able to understand what the values stand for, but the day, month, and year columns will be able to express the information

09:03.540 --> 09:04.880
more clearly.
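
NOTE
A minimal sketch of splitting a date column into day, month, and year with pandas. The four dates reuse the values read out in the lesson; the day-first string format is an assumption, since the on-screen table is not part of this transcript.
```python
import pandas as pd
df = pd.DataFrame({"date": ["01-12-2020", "02-01-2002", "05-07-2020", "08-03-1982"]})
# Parse the text into real timestamps; the format string says day comes first.
parsed = pd.to_datetime(df["date"], format="%d-%m-%Y")
df["day"] = parsed.dt.day
df["month"] = parsed.dt.month
df["year"] = parsed.dt.year
# The original text column is no longer needed once the three parts exist.
df = df.drop(columns="date")
print(df)
```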

09:06.240 --> 09:13.050
Next are the ordinal columns. Whenever we have ordinal columns, what we can do is

09:17.680 --> 09:20.350
We can convert these to numerical columns.

09:21.510 --> 09:27.540
So let us try to convert this. When we have a categorical column which is an ordinal column

09:27.540 --> 09:34.020
actually, we can convert it into a numeric form, and instead of having five columns here, we

09:34.020 --> 09:39.480
can simply have one single column showing the rating in a numerical form.

09:40.580 --> 09:47.480
Great is depicted by five, good is depicted by four, average is depicted by three, bad by

09:47.480 --> 09:49.370
two, and pathetic by one.
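
NOTE
A minimal sketch of the rating conversion described above, assuming pandas. The mapping follows the lesson (great 5, good 4, average 3, bad 2, pathetic 1); the sample rows and the column name rating_num are made up for illustration.
```python
import pandas as pd
df = pd.DataFrame({"rating": ["good", "average", "great", "bad", "pathetic"]})
# The dictionary encodes the natural order of the ratings as numbers.
rating_order = {"pathetic": 1, "bad": 2, "average": 3, "good": 4, "great": 5}
df["rating_num"] = df["rating"].map(rating_order)
print(df)
```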

09:52.540 --> 10:00.100
Here we have another example of a categorical variable, where we have gender, and gender has two values,

10:00.100 --> 10:01.310
male and female.

10:01.600 --> 10:07.110
So instead of having a gender column, we can instead have a male column.

10:07.390 --> 10:14.140
And wherever the value of male is one, it will refer to male, and wherever the value is zero,

10:14.230 --> 10:15.580
it will refer to female.
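
NOTE
A minimal sketch of replacing the gender column with a single male column, assuming pandas and lowercase labels in the data. The sample rows are made up for illustration.
```python
import pandas as pd
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
# Two categories need only one 0/1 column: 1 means male, 0 means female.
df["male"] = (df["gender"] == "male").astype(int)
df = df.drop(columns="gender")
print(df)
```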

10:18.060 --> 10:25.350
Here are the numerical values. Here we are simply removing "year" from these values and converting

10:25.350 --> 10:32.550
them into integer or floating-point values so that we get the values in a numerical form.

10:36.380 --> 10:39.440
Let us discuss various problems with variables.

10:41.100 --> 10:47.670
When we are working with different variables and different features, we need to make sure that the

10:47.670 --> 10:52.230
features which we have are representing a good amount of data.

10:52.410 --> 10:57.300
And also they should not have any irrelevant information.

10:58.020 --> 11:06.270
For example, let us say we have some irrelevant columns, like if we are talking about a loan approval

11:06.270 --> 11:06.950
application.

11:07.620 --> 11:15.540
Now, the application number is a completely irrelevant column in this context because the application

11:15.540 --> 11:17.310
number will not make

11:17.550 --> 11:24.000
any difference in the approval or rejection of the application.

11:25.100 --> 11:26.910
Next is categorical columns.

11:27.590 --> 11:33.860
Now, let us say we have a categorical column which has a lot of categories.

11:34.320 --> 11:41.000
So in this particular situation, when there are a lot of categories, we cannot really convert these using

11:41.000 --> 11:42.040
one-hot encoding.

11:42.890 --> 11:54.200
So in this particular case, we can select either the top seven or top 10 categories or we can divide

11:54.200 --> 11:56.950
the categories into subgroups.

11:57.380 --> 11:58.310
So let us say

11:58.310 --> 12:00.970
we have a column named city or state.

12:01.550 --> 12:09.440
So what we can do is, if we have 50 cities, then we can maybe choose the most influential top

12:09.440 --> 12:18.110
five or top seven cities, or we can group the cities into metropolitan cities and so on, so that we

12:18.110 --> 12:22.520
have two or three groups of cities in which multiple cities can lie.
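
NOTE
A minimal sketch of keeping only the top categories, assuming pandas. "Most influential" is approximated here by "most frequent", which is an assumption; the city list, the top_n value, and the "Other" label are made up for illustration.
```python
import pandas as pd
cities = pd.Series(["Delhi"] * 5 + ["Mumbai"] * 4 + ["Chennai"] * 3 + ["Pune"] * 2 + ["Jaipur"])
top_n = 3
# Names of the top_n most frequent cities.
top = cities.value_counts().nlargest(top_n).index
# Keep the top cities as they are and fold every other city into one group.
grouped = cities.where(cities.isin(top), other="Other")
print(grouped.value_counts())
```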

12:23.480 --> 12:27.220
Then the next thing is the use of anomaly detection.

12:28.100 --> 12:35.330
There could be certain outliers or there could be some kind of data which is not really important in

12:35.330 --> 12:35.780
nature.

12:36.170 --> 12:45.020
So what we can do is we can use anomaly detection, or we can use outlier detection using box plots

12:45.230 --> 12:48.740
and get rid of the outliers. Instead of the outliers,

12:48.740 --> 12:50.750
we can impute certain values.

12:50.960 --> 12:58.130
We can replace these values with the mean of the data or the median of the data or the mode of the data

12:58.130 --> 12:59.780
in case it is a categorical variable.
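
NOTE
A minimal sketch of box-plot style outlier handling, assuming pandas. It uses the usual 1.5 times interquartile range rule behind box-plot whiskers and replaces outliers with the median of the remaining values; the salary figures are made up, and using the median of the non-outlier values (rather than of all values) is a choice made here, not something the lesson specifies.
```python
import pandas as pd
salary = pd.Series([30, 32, 35, 31, 33, 34, 500], dtype=float)
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
# Values beyond the whiskers of a box plot are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (salary < lower) | (salary > upper)
# Impute each outlier with the median of the non-outlier values.
median = salary[~is_outlier].median()
cleaned = salary.mask(is_outlier, median)
print(cleaned.tolist())
```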

13:01.630 --> 13:10.250
In case there are any missing values, in that particular case we can add some other data.

13:10.600 --> 13:16.570
So in case of missing values, if there are a few values which are missing, then we can take a mean

13:16.570 --> 13:24.130
or median of the data and include those values in place of the missing values. In case the missing values

13:24.130 --> 13:25.510
are huge in number,

13:25.800 --> 13:28.920
let us say 90 percent of our data is missing.

13:29.140 --> 13:36.510
In that case, we can directly get rid of the column itself because imputing values will not be fruitful

13:36.520 --> 13:41.010
for us because we would be creating false values in that situation.
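
NOTE
A minimal sketch of the missing-value handling described above, assuming pandas and NumPy. The 50 percent cut-off in max_missing is an assumption (the lesson only gives 90 percent as an example of too much missing data), and the column names and values are made up for illustration.
```python
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "age": [25, np.nan, 35, 40, 30],
    "notes": [np.nan, np.nan, np.nan, np.nan, "x"],
})
max_missing = 0.5  # assumed cut-off for dropping a column
# Share of missing values per column.
missing_share = df.isna().mean()
# Columns that are mostly empty are dropped instead of imputed.
df = df.drop(columns=missing_share[missing_share > max_missing].index)
# The few remaining gaps are filled with the column median.
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```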

13:42.720 --> 13:51.750
Next is noise. Noise is a modification of the original value. This noise can

13:51.750 --> 13:58.210
actually look like normal input data, but it has some kind of fault in it and is very hard to detect.

13:58.680 --> 14:02.370
So to avoid noise, what we can do is.

14:03.360 --> 14:08.790
We can make sure that the data from our sources is being pulled correctly.

14:09.480 --> 14:12.690
That is the only solution which we can have for noise reduction.