WEBVTT

00:00.300 --> 00:10.450
OK, so now let's have a look at the other solution, the solution for the k-means clustering project.

00:10.740 --> 00:15.530
So this project is basically named Teen Market Segmentation.

00:15.840 --> 00:20.750
So we will be using k-means clustering for this particular project.

00:20.760 --> 00:23.280
You can use any other clustering algorithm.

00:25.050 --> 00:34.700
So here, basically, the dataset represents a random sample of thirty thousand

00:34.710 --> 00:43.500
U.S. high school students who had profiles on a well-known SNS. So, to protect the users' anonymity,

00:43.500 --> 00:51.510
the SNS will remain unnamed, and the data was sampled evenly across four different high school graduation

00:51.510 --> 00:59.460
years, representing the senior, junior, sophomore and freshman classes at the time of the data collection.

01:00.450 --> 01:08.880
Now, this particular dataset contains 40 variables like gender, age range, basketball, football,

01:08.880 --> 01:09.360
soccer.

01:09.360 --> 01:17.460
All these are different criteria in which the students might have an interest. The final dataset basically

01:17.460 --> 01:24.330
indicates, for each person, how many times each word has appeared in their SNS profile.

01:24.840 --> 01:30.510
So now what we will be doing is the first thing which we do is we import all the important libraries

01:30.510 --> 01:31.710
and we load the dataset.

01:31.740 --> 01:42.390
And so here you can see the dataset contains the graduation year, the gender, age, number of friends, and

01:42.390 --> 01:48.430
the words in which the person has an interest, and how many times each one appears.

01:48.630 --> 01:51.780
So you can see here the number of friends is sixty-nine.

01:52.950 --> 01:57.100
If you go further, you can see here this comes four times.

01:57.960 --> 02:01.410
So these are the different kinds of words which are coming in.

02:01.800 --> 02:12.220
Now let's get some summary statistics, so you can see that the graduation year ranges from 2006 to 2009.

02:13.080 --> 02:22.230
The age of the students ranges from three years to one hundred and six years, and the number of friends ranges from

02:22.230 --> 02:24.560
zero to 830, and so on.

02:24.990 --> 02:28.490
And you can see the standard deviation of different words.

02:29.050 --> 02:34.650
So here you can see the mean of each word, which you can see here; for example, for basketball

02:34.650 --> 02:36.000
it is 0.26.

02:37.350 --> 02:40.650
Then you can see these are a little less popular.

02:40.650 --> 02:42.600
Basketball seems to be highly popular.

02:42.630 --> 02:43.740
So is football.

02:44.340 --> 02:52.560
Then, if you look, this one is a little less popular.

02:52.580 --> 03:00.030
And then there are kisses and these are less popular and so on.

03:01.560 --> 03:04.510
So you can see how much these words are popular.

03:04.830 --> 03:06.760
So this is basketball, this one.

03:06.790 --> 03:08.760
So this is point two six.

03:09.190 --> 03:17.970
Then you can see something which is more popular, like dance, at zero point four two,

03:19.230 --> 03:26.820
then zero point seven three for music, which is very, very popular, and zero point four six for god, also very

03:26.820 --> 03:27.440
popular.

03:27.690 --> 03:35.550
So you can see these are different words which are so popular amongst these students and that's about

03:35.550 --> 03:35.670
it.

03:35.970 --> 03:41.990
So next, what we will be looking at is the gender distribution.

03:42.240 --> 03:51.480
So there are a total of twenty-seven thousand rows of data, out of which there are two different unique

03:51.480 --> 03:54.820
values, and female is the majority here.

03:55.900 --> 03:57.260
Let's look at the missing values.

03:57.770 --> 04:03.660
So here you can see there are not many missing values overall, but there are around 5000 missing values

04:03.660 --> 04:04.410
for age.

04:04.440 --> 04:09.240
So what you can do is simply take the mean of the ages and fill in those values.

04:09.900 --> 04:14.750
So here there are about five thousand records with missing ages.

04:14.790 --> 04:21.150
Also concerning is the fact that the minimum and maximum values seem to be infeasible, in that there's

04:21.600 --> 04:22.940
a minimum of three years.

04:22.960 --> 04:28.620
So it's not possible that a three-year-old would be using an SNS, or that someone aged one hundred

04:28.620 --> 04:30.420
and six would be attending high school.

04:30.780 --> 04:38.460
So what we simply do is take the number of male and female candidates; you can see around twenty-two

04:38.460 --> 04:42.060
thousand females, and there are only 5000 males there.

04:43.260 --> 04:48.310
And here more than two thousand gender values are missing.

04:48.690 --> 04:53.780
So the not-a-number count is twenty-seven hundred.

04:54.120 --> 04:58.680
So here you can see that there are twenty-two thousand females, 5000 males,

04:59.010 --> 04:59.520
and twenty-

04:59.710 --> 05:06.640
seven hundred missing values, so what we are going to do is fill all the null values

05:06.640 --> 05:07.730
with no gender.

05:08.260 --> 05:13.930
So we are saying that they have not disclosed their gender and we are not actually putting that as the

05:13.930 --> 05:20.490
majority value, because then it would simply be female, since females are already the majority.

05:20.500 --> 05:21.750
So we don't want to do that.

05:23.050 --> 05:28.620
So we put the gender as "not disclosed" wherever the value is not a number.
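
NOTE
A minimal sketch of the fill step just described, assuming a pandas DataFrame with a column called "gender"; the toy rows, the "F"/"M" labels and the exact "not disclosed" label are illustrative assumptions, not taken from the course notebook.
```python
import numpy as np
import pandas as pd
# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({"gender": ["F", "M", np.nan, "F", np.nan]})
# Use an explicit category for missing values instead of the majority class.
df["gender"] = df["gender"].fillna("not disclosed")
counts = df["gender"].value_counts()
print(counts)
```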

05:30.040 --> 05:36.580
Next, what we do is we group the data based on the graduation year and the ages.

05:36.850 --> 05:38.490
And we take the mean.

05:39.190 --> 05:40.990
So you can see, where the graduation

05:40.990 --> 05:42.400
year is 2006,

05:42.820 --> 05:47.490
the mean age is around 19, then 18, 17, 16.

05:47.680 --> 05:53.680
So based on the graduation year, it seems that these are reasonable ages which could be imputed.

05:53.950 --> 05:57.970
And this is a good way to actually find out what age a person should be.

05:58.360 --> 06:06.760
So what we will do is we will group the data by the graduation year and we will fill in the data based

06:06.760 --> 06:09.350
on the mean value from these values.
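
NOTE
A minimal sketch of group-wise mean imputation as described above, assuming columns named "gradyear" and "age"; the column names and the toy values are assumptions for illustration only.
```python
import numpy as np
import pandas as pd
# Hypothetical toy data: two graduation years, one missing age in each.
df = pd.DataFrame({
    "gradyear": [2006, 2006, 2006, 2007, 2007, 2007],
    "age": [19.0, 19.2, np.nan, 18.0, np.nan, 18.4],
})
# Mean age per graduation year, aligned to the original rows (NaN is skipped).
group_mean = df.groupby("gradyear")["age"].transform("mean")
# Fill only the missing ages with the mean of their own graduation year.
df["age"] = df["age"].fillna(group_mean)
print(df)
```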

06:10.750 --> 06:19.330
So what we have done is impute the values, and similarly, we have found that no null values are

06:19.340 --> 06:20.140
present now.

06:20.140 --> 06:22.480
So we don't have any null values present.

06:23.230 --> 06:26.130
Next, we will look at the outliers.

06:26.440 --> 06:32.770
So here you can see the original age range contains values from three to one hundred and six, which

06:32.770 --> 06:41.140
is unrealistic, because someone aged three or one hundred and six would not attend high school. A reasonable age range

06:41.140 --> 06:45.280
for people attending high school would be from 13 to 21.

06:45.580 --> 06:52.180
The rest should be treated as outliers, keeping the age of student going to high school in mind so

06:52.180 --> 06:54.330
we will detect the outlier values.

06:54.330 --> 06:56.660
So we are generating the box plot.

06:56.710 --> 07:02.590
So here you can see that these are the actual correct values, the ones in the middle and all other

07:02.590 --> 07:04.440
values are actually outliers.

07:04.750 --> 07:12.670
So what we will do now is find the Q1 and Q3 values, and accordingly we will find out the different

07:13.210 --> 07:15.790
ages for the outlier detection.

07:17.080 --> 07:25.320
And here you can see that the twenty-fifth percentile is 16 and the maximum is 21.

07:25.330 --> 07:28.810
So what we are doing is we are putting the values here.

07:28.870 --> 07:31.200
We have applied this condition.

07:31.660 --> 07:43.720
So what we have done is keep the data where the age is greater than Q1 minus one point five times the IQR

07:44.290 --> 07:48.640
and the age is less than Q3 plus one point five times the IQR.

07:48.790 --> 07:51.190
So we have kept only these particular ages.
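
NOTE
A minimal sketch of the IQR rule just described (keep ages above Q1 - 1.5*IQR and below Q3 + 1.5*IQR), assuming a column named "age"; the toy values are assumptions and not the course data.
```python
import pandas as pd
# Hypothetical toy ages with two implausible values (3 and 106).
df = pd.DataFrame({"age": [3.0, 15.0, 16.0, 17.0, 18.0, 19.0, 106.0]})
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
# Keep only rows strictly inside the lower and upper fences.
mask = (df["age"] > q1 - 1.5 * iqr) & (df["age"] < q3 + 1.5 * iqr)
filtered = df[mask]
print(filtered["age"].min(), filtered["age"].max(), len(filtered))
```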

07:51.700 --> 07:56.820
So from this data now we will check the latest data, which we now have.

07:57.070 --> 08:07.420
So now we have reduced the number of rows to nearly twenty-nine thousand, and now we have the minimum

08:07.420 --> 08:11.740
age as thirteen point seven one and the maximum age as 21 years.

08:11.750 --> 08:12.040
Now

08:15.190 --> 08:23.650
Now you can see there are no outliers. Now, regarding data

08:23.650 --> 08:32.260
preprocessing: a common practice employed prior to any analysis using distance calculations is to normalize

08:32.260 --> 08:34.870
or z-score standardize the features.

08:35.110 --> 08:45.940
So we cannot have values with different ranges when we are using an algorithm which uses distance as

08:45.940 --> 08:47.200
its base metric.

08:48.160 --> 08:56.280
So all the distance-based algorithms will need to have standard scaling applied.

08:56.350 --> 08:59.320
So here we will be applying scaling.

08:59.680 --> 09:02.220
So we will be applying StandardScaler here.

09:03.790 --> 09:11.560
So the process of z-score standardization rescales the features so that they have a mean of zero and a standard

09:11.560 --> 09:12.500
deviation of one.

09:12.880 --> 09:19.190
So this transformation changes the interpretation of data in a way that may be more useful.

09:19.510 --> 09:21.820
So that is what we will be doing.
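
NOTE
A minimal sketch of z-score standardization with scikit-learn's StandardScaler, assuming two numeric columns named "friends" and "basketball"; the column choice and toy values are assumptions for illustration.
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Hypothetical toy data with columns on very different ranges.
df = pd.DataFrame({"friends": [0, 10, 69, 830], "basketball": [0, 1, 0, 4]})
cols = ["friends", "basketball"]
# fit_transform subtracts each column's mean and divides by its standard deviation.
scaled = StandardScaler().fit_transform(df[cols])
scaled_df = pd.DataFrame(scaled, columns=cols, index=df.index)
print(scaled_df.round(3))
```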

09:23.740 --> 09:30.580
So here we are taking out the column names which have values which would be having different ranges.

09:32.630 --> 09:41.230
And we are applying StandardScaler on them, and now we have these scaled values present with us,

09:43.520 --> 09:49.580
so the next thing to be done is to convert the object dtype to numerical.

09:49.610 --> 09:51.950
That is the categorical variables to numerical.

09:52.550 --> 09:55.080
So we will be doing the same.

09:55.100 --> 10:00.800
So we will take the gender and convert it into one, two and three, for the three different categories.

10:01.970 --> 10:04.290
So now we have this particular data.

10:04.310 --> 10:10.590
After the transformation, the next thing which could be done is applying the k-means model.

10:11.930 --> 10:16.090
So we have applied the k-means model and we have fit the model.

10:16.430 --> 10:21.550
Next thing which we will be doing is we are running for a different number of clusters.

10:21.710 --> 10:31.100
So we are running it for a cluster range from one to 20, and we are plotting the elbow curve.

10:31.100 --> 10:36.820
We are plotting the elbow curve to see which number of clusters is actually helpful for us.

10:37.250 --> 10:43.100
Now, the location of a bend in the plot is generally considered an indicator of the appropriate number

10:43.100 --> 10:43.800
of clusters.
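
NOTE
A minimal sketch of the elbow method with scikit-learn's KMeans, assuming synthetic 2-D data with three well-separated groups; the data, the range of 1 to 7 clusters (the walkthrough uses 1 to 20) and the random seeds are assumptions for illustration.
```python
import numpy as np
from sklearn.cluster import KMeans
# Hypothetical synthetic data: three blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
# Record the within-cluster sum of squares (inertia) for each k.
ks = list(range(1, 8))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)
# The bend is where inertia stops dropping sharply as k grows.
for k, value in zip(ks, inertias):
    print(k, round(value, 1))
```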

10:44.270 --> 10:50.450
Here we have this bend at about five, which is here.

10:52.050 --> 11:00.210
So we will keep that number of clusters; we find k to be five, so we will fit k-means

11:00.210 --> 11:04.710
with k equal to five, and these are the details which we obtain.

11:04.710 --> 11:07.210
And you can interpret the cluster sizes.

11:07.230 --> 11:14.490
You can find out the silhouette score and evaluate the model accordingly.
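
NOTE
A minimal sketch of fitting k-means with k = 5, reading the cluster sizes and computing the silhouette score with scikit-learn; the synthetic data and seeds are assumptions and stand in for the scaled student features.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Hypothetical synthetic data: five blobs of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
labels = model.labels_
# Number of points assigned to each cluster.
sizes = np.bincount(labels, minlength=5)
# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
score = silhouette_score(X, labels)
print(sizes, round(score, 3))
```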

11:14.880 --> 11:24.540
You can also do this using any other algorithm of your choice and see how good the values are that you can obtain

11:24.540 --> 11:25.140
out of this.

11:25.290 --> 11:26.610
That is completely up to you.

11:26.640 --> 11:28.690
This is just one of the implementations.

11:29.250 --> 11:29.850
Thank you.