WEBVTT

00:01.290 --> 00:07.710
In this session, we will discuss the first unsupervised machine learning algorithm,

00:08.010 --> 00:10.560
which is hierarchical clustering.

00:12.210 --> 00:20.040
Hierarchical clustering is one of the popular and easy to understand clustering techniques, and it

00:20.040 --> 00:22.830
is also known as agglomerative clustering.

00:24.040 --> 00:33.310
So in this particular technique, what we do is we initially find out the distance between each data

00:33.310 --> 00:38.290
point, and each data point is considered a cluster in itself.

00:39.640 --> 00:48.550
Now, at each iteration, similar clusters merge with other clusters until one cluster or K clusters

00:48.550 --> 00:49.040
are formed.

00:49.450 --> 00:51.100
So what we will do is:

00:52.610 --> 00:55.100
we will simply have

00:58.500 --> 01:00.660
different data points.

01:01.960 --> 01:07.110
So each data point will initially be considered as a cluster.

01:08.490 --> 01:16.980
Then we will create a matrix which holds the distance of each data point from every other data point.

01:18.260 --> 01:22.670
Then we will combine the nearest data points

01:23.810 --> 01:26.180
into one particular group.

01:27.290 --> 01:34.040
Then once we have these merged data points, they will again be combined into

01:35.600 --> 01:43.310
another group, and so on. So it will keep on combining different groups till the required number of

01:43.310 --> 01:45.700
clusters is reached.
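
NOTE
The steps described so far can be sketched in a few lines. This is a minimal sketch, not the implementation used in the course: it assumes Euclidean distance and single linkage (cluster distance = distance between the two closest members), neither of which the session specifies, and the name agglomerate is hypothetical.
```python
import math
# Assumptions (not from the session): Euclidean distance and single linkage.
def agglomerate(points, k):
    """Merge the two nearest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster
    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)
    while len(clusters) > k:
        # Find the closest pair of clusters in the current set.
        _, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = sorted(clusters[i] + clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters
# Illustrative points only.
points = [(1.0, 1.0), (1.5, 1.0), (8.0, 8.0), (8.5, 8.0), (20.0, 20.0)]
result = agglomerate(points, 3)
print(result)
```
Each cluster is a list of point indices. Recomputing every pairwise distance in every pass keeps the sketch short, but it is far slower than a real library routine.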

01:46.280 --> 01:55.290
So what it does here is create a hierarchy of clusters and present the hierarchy as a dendrogram, a hierarchical

01:55.340 --> 01:55.910
structure.

01:57.080 --> 02:04.670
Now, this method does not require the number of clusters to be specified at the beginning because in

02:04.670 --> 02:10.910
the beginning all the data points will be considered as individual clusters.

02:11.210 --> 02:18.790
So we will be starting the process and going to the point where there is only one cluster left.

02:20.060 --> 02:28.130
So all the points will slowly and gradually be combined into one single cluster. Now we can

02:28.310 --> 02:36.560
subdivide or stop at any point, whenever we have achieved the number of clusters we want.

02:38.080 --> 02:42.650
So the distance, or connectivity, between the observations is measured.

02:42.940 --> 02:49.010
So we will find out the distance between each and every data point.

02:49.180 --> 02:55.580
So to find out each and every data point's distance, we will need to have them scaled.

02:55.960 --> 03:03.430
We cannot have age ranging from zero to a hundred and amount ranging over a far wider scale, because that

03:03.430 --> 03:07.660
will give a very large difference in the scale.

03:08.680 --> 03:16.150
When we have a large difference in the scale, the distance between the ages will

03:16.150 --> 03:23.650
be dwarfed by the distance present in the amount.

03:23.920 --> 03:25.090
So what will happen?

03:25.210 --> 03:27.780
Because amount will have the larger distances,

03:27.970 --> 03:35.680
all the data points which are closer in amount will be grouped together, and there

03:35.680 --> 03:37.600
will be almost no impact of

03:38.840 --> 03:40.070
the age.
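
NOTE
A small sketch of why scaling matters before distances are computed. The (age, amount) values are made up for illustration, and standardization (zero mean, unit standard deviation) is only one possible scaling choice; the session does not name a method.
```python
import math
import statistics
# Hypothetical (age, amount) records; the values are illustrative only.
rows = [(25, 50_000), (60, 51_000), (26, 90_000)]
def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]
raw_01 = math.dist(rows[0], rows[1])  # 35 years apart, amounts nearly equal
raw_02 = math.dist(rows[0], rows[2])  # 1 year apart, amounts far apart
scaled = standardize(rows)
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])
print(raw_01, raw_02)        # the amount column dominates both raw distances
print(scaled_01, scaled_02)  # after scaling, the age gap carries comparable weight
```
On the raw values, the 35-year age gap barely registers next to the difference in amount; after scaling, both columns contribute on a similar footing.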

03:41.480 --> 03:50.410
So in this technique we apply a bottom-up approach: each observation starts in its own cluster,

03:50.750 --> 03:55.640
and the similarity, or the distance, between each pair of clusters is computed.

03:57.120 --> 04:05.390
And then we merge the two most similar ones at each iteration, until there is only one cluster left.

04:06.410 --> 04:13.670
Now, we are recomputing the distance between each and every cluster after every iteration.

04:14.900 --> 04:24.200
So you can see that this is computationally very expensive, because there will be a lot of computation

04:24.200 --> 04:25.680
which will be required for this.

04:27.660 --> 04:34.890
And because such a large amount of computation is required, for this particular reason

04:35.190 --> 04:43.940
it is feasible to use agglomerative clustering, or hierarchical clustering, only when we have a small dataset.

04:44.610 --> 04:46.570
If we have a large dataset,

04:46.590 --> 04:49.830
it will take a lot of time to compute this.
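
NOTE
To give a rough feel for the cost: the initial distance matrix alone needs one distance per pair of points, which is n * (n - 1) / 2 values for n points. The helper name pair_count below is hypothetical, and this counts only the first pass, not the recomputation after each merge.
```python
def pair_count(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2
for n in (100, 10_000, 1_000_000):
    print(n, pair_count(n))
```
The count grows with the square of the dataset size, which is why the technique suits small datasets.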

04:51.490 --> 04:58.240
Now, a tree that shows how clusters are merged or split hierarchically:

04:58.570 --> 05:00.550
this is what the dendrogram is.

05:01.530 --> 05:08.280
Each node on the tree is a cluster; that is, each subtree here is a cluster.

05:08.640 --> 05:13.760
If we go deeper into it, there are sub-clusters internally present.

05:14.340 --> 05:19.590
So let us say we want to have two clusters; then we can simply divide it here.

05:19.590 --> 05:23.880
And one cluster would be this one and another cluster would be this one.

05:24.940 --> 05:31.750
In case I want to have three clusters, then I can divide it here and now I have three clusters.

05:33.360 --> 05:37.410
Similarly, if I want to have four clusters, I can divide it here.

05:38.320 --> 05:42.040
And then I will have one, two, three,

05:43.090 --> 05:44.580
four clusters in here.

05:44.890 --> 05:47.440
So this is how we find out the clusters.
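
NOTE
Cutting the tree at different heights can be sketched by replaying only the earliest merges. The merge history below is hypothetical, as is the function name cut_tree and the convention that each newly formed cluster takes the next free id; it is not the dendrogram shown on screen.
```python
# Hypothetical merge history for 5 points, in the order the merges happened.
# Each entry joins two cluster ids; the new cluster gets the next free id
# (5, 6, 7, ...), a numbering convention assumed here for illustration.
merges = [(0, 1), (2, 3), (5, 6), (4, 7)]
def cut_tree(n_points, merges, k):
    """Replay the first n_points - k merges and return the resulting clusters."""
    members = {i: [i] for i in range(n_points)}  # start: one cluster per point
    next_id = n_points
    for a, b in merges[: n_points - k]:
        members[next_id] = sorted(members.pop(a) + members.pop(b))
        next_id += 1
    return sorted(members.values())
for k in (2, 3, 4):
    print(k, cut_tree(5, merges, k))
```
The tree is built once; choosing two, three, or four clusters afterwards only changes how many merges are replayed.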

05:50.560 --> 05:54.510
Now, let's talk about a few properties of this clustering technique.

05:55.570 --> 06:05.310
The first one is that it is computationally heavy, that is if we have a small dataset, then it will

06:05.320 --> 06:06.580
work really fine.

06:07.000 --> 06:14.130
But if we have a very large dataset, then we will not recommend someone to use hierarchical clustering.

06:15.280 --> 06:19.010
Next, it gives the same clusters every time.

06:19.240 --> 06:24.610
So we have several different clustering algorithms which will give different clusters.

06:24.850 --> 06:32.010
But hierarchical clustering is one method which will always give the same clusters every time.

06:33.170 --> 06:40.070
So there is no need to provide the number of clusters at the beginning because it will create the clusters

06:40.070 --> 06:47.120
in the same way. At that time, or after the clusters have been created, we can decide what number

06:47.120 --> 06:48.110
of clusters we need.

06:49.250 --> 06:55.800
Lastly, it has difficulty handling clusters of different sizes and irregular shapes.

06:56.060 --> 07:05.090
So if there is data where we have clusters with different size or irregular shape, then it will be

07:05.090 --> 07:06.800
a little difficult to handle.

07:08.710 --> 07:14.450
That is because it will always try to combine clusters in pairs.

07:14.590 --> 07:19.840
So that is why it will usually end up having clusters of similar sizes.

07:20.620 --> 07:25.660
So this is about hierarchical clustering. How do we actually implement this?

07:25.870 --> 07:29.110
We will have a look at this in the next session.

07:29.390 --> 07:33.270
Then we will implement this particular code for hierarchical clustering.