WEBVTT

00:01.150 --> 00:07.690
In this session, we will discuss the next unsupervised learning algorithm, which is k-means

00:07.690 --> 00:08.230
clustering.

00:09.300 --> 00:17.760
Now, when I explained the KNN algorithm to you, I specifically said that it is one

00:17.760 --> 00:21.960
algorithm which is often confused with k-means.

00:21.970 --> 00:32.520
So now we will go through KNN and k-means together so that we can actually draw a line between

00:32.520 --> 00:33.790
these two algorithms.

00:35.090 --> 00:41.930
So the very first thing which you need to know is what k-means is, so that we can draw a line between

00:41.930 --> 00:46.960
k-means and KNN and you don't get confused between them.

00:50.170 --> 00:57.420
So what is k-means? The names look similar, but the concept is quite different.

01:00.120 --> 01:07.930
So in the case of k-means clustering, k is the number of clusters which we will be creating.

01:08.520 --> 01:19.890
So k-means clustering takes the number of clusters, k, and randomly initializes k centroids

01:20.160 --> 01:21.720
in the feature space.

01:22.970 --> 01:30.830
And at the time of initialization, these k points are not yet the actual centroids of the data.

01:32.480 --> 01:38.330
So the data points are then assigned to the nearest of the k points.

01:39.370 --> 01:48.460
The centroids are then moved so that they are at the center of their currently assigned clusters. So

01:48.460 --> 01:59.020
the very first difference between k-means and KNN is that the k in k-means stands for the number of

01:59.020 --> 01:59.800
clusters,

02:00.920 --> 02:09.340
while the k in k-nearest neighbors stands for the number of neighbors which we are looking at.

02:11.110 --> 02:20.410
The second and most important difference is that KNN is a supervised learning algorithm, that is, we

02:20.410 --> 02:32.410
use it for making predictions, while the k-means algorithm is used for grouping or clustering the

02:32.410 --> 02:32.980
data.

02:33.190 --> 02:37.390
That is, it is an unsupervised learning algorithm.
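
NOTE
Editor's illustration, not part of the spoken lecture. A minimal sketch of the two meanings of k,
assuming hypothetical helper names (knn_predict, kmeans_assign) and made-up toy data.
In KNN, k counts the labelled neighbours that vote; in k-means, k counts the centroids and no labels are used.
```python
# Illustrative sketch only; the function names and toy data are assumptions.
from collections import Counter
import math
def knn_predict(train_points, train_labels, query, k):
    """KNN (supervised): k is the number of labelled neighbours that vote."""
    order = sorted(range(len(train_points)), key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
def kmeans_assign(points, centroids):
    """One k-means assignment step (unsupervised): k is the number of centroids."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c])) for p in points]
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labels = ["small", "small", "large", "large"]
# KNN needs the labels to predict a class for a new point.
prediction = knn_predict(points, labels, (2.0, 1.5), k=3)
# k-means only needs the points; it returns a cluster index per point.
clusters = kmeans_assign(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```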

02:39.290 --> 02:40.310
Let us look at an example.

02:41.230 --> 02:48.760
So here we have the data. Let us say we have these points, and out of these points we want

02:48.760 --> 02:55.180
to find two clusters. What we will do is place two centroid points in it.

02:56.220 --> 03:04.800
Now, we have placed these two centroids. These two centroids could have been placed anywhere;

03:05.730 --> 03:08.580
they are randomly placed in this particular space.

03:10.590 --> 03:19.800
Now, once we have placed these centroids, we then assign the data points: the points which

03:19.800 --> 03:26.880
are nearest to this centroid go to this particular cluster, and the points which are nearest to the

03:26.880 --> 03:31.020
second centroid go to the cluster which belongs to that centroid.

03:33.270 --> 03:36.210
So now the points have been assigned.

03:37.380 --> 03:45.570
So the red points belong to the cluster with the red X as its centroid, and the blue points belong to the

03:45.570 --> 03:49.380
cluster with the blue X as its centroid.

03:50.220 --> 03:52.480
Now if you look at the data.

03:52.710 --> 03:56.380
Each X doesn't really look like it is at the center of its cluster.

03:58.020 --> 04:06.900
So now the centroids will actually move towards the center of the clusters which these centroids have

04:06.900 --> 04:07.540
created.

04:08.400 --> 04:14.420
So now the centroids are moved to the center of the clusters.

04:14.580 --> 04:21.600
So the red centroid is brought closer to the red cluster and the blue centroid is brought closer to the blue

04:21.600 --> 04:22.140
cluster.

04:23.550 --> 04:35.540
Now, once this has been done, the points will again be reassigned so that the points assigned

04:35.540 --> 04:37.530
to the blue cluster

04:38.870 --> 04:46.910
have the blue centroid as their nearest centroid, and the points assigned to the red cluster are nearest to

04:46.910 --> 04:48.410
the red centroid.

04:49.070 --> 04:56.090
Now, again, once this has been done, the red centroid will be moved to the center of the red data which

04:56.090 --> 05:00.900
we have, and the blue centroid will be moved towards the center of the blue data.

05:01.610 --> 05:09.590
And then again, the points will be reassigned based on their distance from the centroids.

05:10.580 --> 05:19.430
This will keep on repeating until the points are no longer reassigned, that is, until the centroids stop moving.
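
NOTE
Editor's illustration, not part of the spoken lecture. The assign, move, repeat loop described above
can be sketched as follows. The function name kmeans, the toy points and the starting centroids are assumptions;
the starting centroids are fixed rather than random so that the run is reproducible.
```python
# Minimal sketch; the function name, toy data and starting centroids are assumptions.
import math
def kmeans(points, centroids, max_iter=100):
    """Alternate the assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its centroid
        # Stop once an update leaves every centroid where it was.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
```
On this toy data the first three points end up in cluster 0 and the last three in cluster 1.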

05:20.180 --> 05:22.010
So this is what the k-means

05:22.010 --> 05:25.770
algorithm is. Here are the things to notice.

05:26.000 --> 05:35.120
Number one, it will try to create circular clusters because it is looking for the area around the center.

05:36.200 --> 05:44.810
The second thing is that no matter what kind of distribution the data has, it will still try to create clusters.

05:46.070 --> 05:54.830
The next point is that whenever the centroids are placed, based on the

05:54.830 --> 06:01.910
placement of these centroids, there could be different clusters generated in

06:01.910 --> 06:02.460
the end.

06:02.990 --> 06:07.640
So it is not a sure thing that we will get these two clusters.

06:07.940 --> 06:15.060
The clusters depend on the placement of the centroids at the initial point of time.

06:15.470 --> 06:23.930
So after every initialization, we might get different clusters on different runs of the algorithm.
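
NOTE
Editor's illustration, not part of the spoken lecture. A small sketch of sensitivity to initialization,
assuming hypothetical helpers kmeans and inertia and four made-up points on the corners of a wide rectangle.
Both starting positions are already stable, but they give different clusters, and one is much worse.
```python
# Minimal sketch; kmeans(), inertia() and the four toy points are assumptions.
import math
def kmeans(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, move centroids to the mean, repeat."""
    for _ in range(max_iter):
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c])) for p in points]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
# Start A: one centroid on the left pair, one on the right pair.
centroids_a, labels_a = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])
# Start B: one centroid between the bottom points, one between the top points.
centroids_b, labels_b = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])
inertia_a = inertia(points, centroids_a, labels_a)
inertia_b = inertia(points, centroids_b, labels_b)
```
Start A groups the points left versus right, while start B groups them bottom versus top with a much larger inertia.
A common remedy is to run the algorithm from several starts and keep the run with the lowest inertia.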

06:26.840 --> 06:34.610
Let us look at the assumptions which we have. The first assumption is that k-means assumes the variance

06:34.610 --> 06:38.340
of the distribution of each attribute to be spherical.

06:38.690 --> 06:41.840
That is, it tries to create spherical

06:43.430 --> 06:53.210
clusters. The second thing is that all variables have the same variance, that is, all of them will be having the

06:53.210 --> 07:01.060
same variance, and here again we will have to make sure that the data has been scaled properly.

07:01.070 --> 07:06.520
Otherwise, the shape of the clusters which we get might not be as expected.
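
NOTE
Editor's illustration, not part of the spoken lecture. One common way to scale the data is z-score
standardization, sketched here with a hypothetical helper standardize and made-up rows of (age, income).
Without scaling, the income column would dominate every distance that k-means computes.
```python
# Minimal sketch; standardize() and the (age, income) rows are assumptions.
import statistics
def standardize(rows):
    """Rescale every column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]  # a constant column would give 0 here
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]
rows = [(30.0, 20000.0), (40.0, 60000.0), (50.0, 100000.0)]
scaled = standardize(rows)
```
After scaling, both columns cover the same range, so neither attribute outweighs the other in the distance.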

07:07.820 --> 07:17.150
Next, the prior probability for all k clusters is the same, that is, each cluster has a roughly equal

07:17.150 --> 07:18.490
number of observations.

07:18.890 --> 07:26.060
So whenever these clusters are created, they are assumed to have a roughly equal number of observations.

07:26.360 --> 07:28.020
And apart from that,

07:29.240 --> 07:40.160
outliers will not be treated separately; outliers will also be included in these clusters, so the position

07:40.160 --> 07:45.830
or placement of an outlier can actually impact the creation of these clusters.
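
NOTE
Editor's illustration, not part of the spoken lecture. Because a centroid is the mean of its points,
a single outlier can drag it away from the rest of the cluster. The helper name centroid and the
points below are assumptions chosen to make the effect obvious.
```python
# Minimal sketch; centroid() and the toy points are assumptions.
def centroid(points):
    """Mean of the points, computed per coordinate."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
without_outlier = centroid(cluster)
# Adding one far-away point moves the centroid well outside the original three points.
with_outlier = centroid(cluster + [(102.0, 102.0)])
```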

07:48.090 --> 07:56.400
So here you can see that this data is a uniform mixture and there is no actual cluster present in the

07:56.400 --> 08:05.910
data, but because we have placed two centroids, the centroids had to make clusters, and thus

08:06.030 --> 08:10.310
it has created two clusters in non-clustered data.

08:12.380 --> 08:20.570
Next, here you can see that the clusters are expected to be of the same size, that is, the clusters

08:20.780 --> 08:25.850
will have a similar number of data points in them.

08:26.120 --> 08:32.970
So here, because this one is concentrated and this one is more scattered,

08:32.990 --> 08:37.420
that is why it looks like they have different sizes of data.

08:37.430 --> 08:42.230
But actually they have the same number of data points present in the clusters.

08:45.180 --> 08:50.080
Next, k-means assigns points to spherical clusters.

08:50.100 --> 08:52.260
So here we have a spherical cluster with a ring around it.

08:52.530 --> 08:54.000
So what will happen?

08:54.330 --> 08:58.140
We would have expected that it would create two clusters:

08:58.380 --> 09:04.250
one cluster would be the outer ring and another cluster would be the inner area, the inner circle.

09:04.560 --> 09:12.630
But actually it has created spherical clusters, assigning this half to one cluster and the other half

09:12.630 --> 09:17.570
to the other cluster, which is not a good clustering.
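
NOTE
Editor's illustration, not part of the spoken lecture. A sketch of the ring example with made-up data:
an inner circle of 8 points and an outer ring of 16 points. The helper names kmeans and circle are assumptions.
Since every point goes to its nearest centroid, two clusters are always split by a straight line,
and a straight line cannot separate a ring from the circle inside it.
```python
# Minimal sketch; kmeans(), circle() and the ring data are assumptions.
import math
def kmeans(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, move centroids to the mean, repeat."""
    for _ in range(max_iter):
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c])) for p in points]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
def circle(radius, n):
    """n evenly spaced points on a circle around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n), radius * math.sin(2 * math.pi * i / n)) for i in range(n)]
inner, outer = circle(1.0, 8), circle(5.0, 16)
points = inner + outer
truth = ["inner"] * len(inner) + ["outer"] * len(outer)
_, labels = kmeans(points, centroids=[(-1.0, 0.0), (1.0, 0.0)])
# Clusters that contain both inner-circle and outer-ring points.
mixed = [c for c in (0, 1) if {t for t, label in zip(truth, labels) if label == c} == {"inner", "outer"}]
```
At least one cluster ends up mixing inner and outer points, which matches the half-and-half split seen on the slide.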

09:19.930 --> 09:23.890
So what are the pros of this particular algorithm?

09:24.250 --> 09:26.050
Number one, it is simple.

09:26.530 --> 09:28.960
Number two, it is flexible in nature.

09:29.350 --> 09:32.630
Number three, it is suitable for a large dataset.

09:32.890 --> 09:36.670
And lastly, it detects spherical clusters very well.

09:36.670 --> 09:43.440
It will be able to detect such clusters very nicely, and also these clusters, which are spherical in nature,

09:43.450 --> 09:44.200
very nicely.

09:44.230 --> 09:46.630
So this one, again, it has detected very well.

09:46.960 --> 09:55.990
But where we have some data which is not really in spherical form, or here, you can see that these

09:55.990 --> 09:57.820
are actually outliers.

09:57.820 --> 10:02.740
And there are only two clusters: this cluster and this cluster.

10:02.950 --> 10:05.610
So there are actually only two clusters present.

10:05.980 --> 10:11.210
So it was not able to determine that, because we have already provided the number of clusters that we

10:11.210 --> 10:11.380
need.

10:12.130 --> 10:18.580
So no matter whether extra clusters are present or not, it had to create one extra cluster instead

10:18.580 --> 10:22.180
of creating just the two clusters which are actually present in this data.

10:24.480 --> 10:32.340
So what are the cons? It is sensitive to the initial centroid location, so based on the initial location

10:32.340 --> 10:38.170
of the centroids, the resulting clusters can change. Then, it is sensitive to outliers.

10:38.170 --> 10:42.470
That is, here we have these outlier values present.

10:42.780 --> 10:52.290
So it actually tried to include them in a cluster and hence ended up creating another cluster and adding

10:52.290 --> 10:53.490
them into that cluster.

10:57.050 --> 11:01.920
Next, it always creates spherical clusters, which we have seen here.

11:02.630 --> 11:04.820
And lastly,

11:06.130 --> 11:14.230
it is not applicable to categorical data, because categorical data does not have meaningful differences

11:14.230 --> 11:18.550
in the distances, so we cannot really use it for categorical data.
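
NOTE
Editor's illustration, not part of the spoken lecture. One way to see the problem with categorical data,
assuming a hypothetical integer encoding of three colours: the distances come from the arbitrary codes,
not from the categories themselves, so the means and distances k-means relies on carry no real meaning.
```python
# Minimal sketch; the colour codes and distance() helper are assumptions.
codes = {"red": 0, "green": 1, "blue": 2}
def distance(a, b):
    """Distance between two categories under the arbitrary integer encoding."""
    return abs(codes[a] - codes[b])
red_green = distance("red", "green")
red_blue = distance("red", "blue")
# Red now looks twice as far from blue as from green, only because of the chosen codes.
```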

11:19.240 --> 11:24.520
Next, we will see a visualization of k-means clustering.

11:26.530 --> 11:32.470
So this is one website where we can actually visualize k-means clustering.

11:33.010 --> 11:34.420
So here.

11:35.390 --> 11:44.000
We can select the initial centroids, so let us select the farthest point to be the initial centroid

11:44.000 --> 11:48.850
here, and select the type of data.

11:51.090 --> 11:52.330
Up in my.

11:53.630 --> 12:05.800
So here we expect the number of clusters to be four, where we have one, two, three, and four clusters. So

12:05.810 --> 12:09.320
let us start; let us add centroids to it.

12:10.460 --> 12:17.270
One centroid, one more centroid, and another centroid, and let us start training this.

12:22.090 --> 12:23.380
So it will

12:24.430 --> 12:26.880
assign the points to these centroids.

12:28.070 --> 12:35.840
Now we will update the centroids; we will update the centroid locations to the center of the data.

12:37.490 --> 12:44.540
So it has updated the centroid locations, and based on the centroid locations, now we will reassign the

12:44.540 --> 12:45.170
points.

12:46.750 --> 12:55.180
The points have been reassigned to the nearest centroids, and now we will again update the centroid

12:55.180 --> 12:55.720
locations.

12:57.850 --> 13:00.170
Now, the centroid locations have been updated.

13:00.190 --> 13:02.080
Now we will reassign the points.

13:03.140 --> 13:09.200
So if we keep on doing it, we'll just do this again and again until each centroid is actually at the center

13:09.200 --> 13:10.520
of all of its data points.

13:11.060 --> 13:19.310
So it has created four clusters, the number provided by us because we knew that there are four clusters,

13:19.520 --> 13:24.950
but because it wanted to create spherical clusters, it was not able to determine

13:26.100 --> 13:33.510
the location of the clusters properly. So let us have a look at it again; let us restart this.

13:33.990 --> 13:44.610
So now let us place the centroids randomly, and this time we will have the DBSCAN Rings as the data.

13:44.640 --> 13:46.800
So these are the different clusters.

13:46.810 --> 13:54.300
These are the data points. Let us add centroids: one, two, three.

13:54.310 --> 13:55.890
Let us have four centroids.

13:58.160 --> 14:01.650
Or let us just have three centroids for this, and let us go.

14:02.210 --> 14:06.920
So it has assigned the points based on the centroids. Now,

14:06.920 --> 14:11.870
again, we will update the centroid locations, so it will bring the centroids to the center of the data.

14:14.480 --> 14:21.170
Now, again, we will reassign the points, so the points which have been misassigned, like this

14:21.170 --> 14:27.440
red one, these red points, will be moved to the correct clusters.

14:27.710 --> 14:29.950
So it will reassign the points.

14:29.960 --> 14:33.020
Now, the red points have been turned into green and blue.

14:33.770 --> 14:36.200
Now, again, we will update the centroid locations.

14:38.460 --> 14:41.970
Now, again, when we reassign the points, these points will be corrected.

14:49.570 --> 14:58.720
So you can see it is always trying to create circular clusters, and it will stop when the centroids actually reach

14:58.870 --> 15:05.030
the center of their data. Here, all the clusters have an almost equal number of data points.

15:05.590 --> 15:07.690
So this is where it has stopped.

15:07.960 --> 15:13.420
And you can see that there is an almost equal number of data points present in all three clusters.

15:14.050 --> 15:19.130
So you can try different types of clusters here.

15:19.150 --> 15:24.970
You can try different types of mixtures and then see what actually comes out of it.

15:29.290 --> 15:31.840
So let us try the Pimpled Smiley.

15:33.360 --> 15:40.830
Now, look at this Pimpled Smiley. There are these outliers present, and it has actually

15:40.830 --> 15:45.270
tried to assign these outliers to the clusters as well.

15:45.270 --> 15:48.170
So this is what we don't really want to have here.

15:48.450 --> 15:55.700
And the creation of circular clusters has actually impacted the cluster formation.

15:55.860 --> 16:02.000
So we want to have another variant or some other algorithm which might perform better than this.

16:02.400 --> 16:04.230
So it will perform better.

16:05.220 --> 16:13.020
k-means will perform better if we have spherical clusters, so we can use it more appropriately

16:13.020 --> 16:20.420
if we have spherical clusters and data points which are spherical in nature.