WEBVTT

00:01.220 --> 00:08.660
In this session, we will work on DBSCAN and anomaly detection. So let us, first of all, implement

00:08.780 --> 00:09.620
DBSCAN.

00:10.160 --> 00:17.410
So for that, we will import the libraries and StandardScaler for scaling the data.

00:18.200 --> 00:20.900
Then we will import KMeans.

00:21.200 --> 00:27.230
We will import seaborn as sns, matplotlib and NumPy.

00:28.630 --> 00:32.590
Now we will import the data set, which is the moons data set.

00:32.920 --> 00:40.090
This is a different type of data set, on which k-means is usually not able to perform really well.

00:40.540 --> 00:42.580
So let us have a look at this.

00:44.950 --> 00:54.970
So here we have the data set, which I am reading using read_csv, and it has this particular data

00:55.210 --> 00:59.090
which has X values and Y values.

00:59.260 --> 01:07.300
So let me plot these X and Y values using seaborn, so you can see there are two half moons present.

01:07.690 --> 01:15.450
So if we think of it visually, then we can see two clusters present.

01:15.460 --> 01:17.680
One is the upper half of the cluster.

01:18.780 --> 01:26.060
That is the first semicircle, and the second cluster is the second semicircle, which we have plotted.

01:27.200 --> 01:34.280
So with these two clusters, let us see how each clustering algorithm is going to perform.

01:34.760 --> 01:39.550
So the first algorithm which we are applying is agglomerative clustering.

01:40.070 --> 01:45.980
So here we are already providing the number of clusters to it because we already know that there are

01:46.220 --> 01:50.120
two clusters, and we are generating the plot for this.

01:50.420 --> 01:54.520
So agglomerative clustering is not able to find the clusters properly.

01:54.800 --> 02:02.810
It is not able to distinguish that this piece of the cluster is actually a part of the lower half moon

02:02.810 --> 02:04.550
and not a part of the upper half.

02:05.240 --> 02:06.820
So this doesn't perform well.

02:08.000 --> 02:15.410
Similarly, we will apply k-means on top of it. How we apply k-means is

02:15.410 --> 02:21.530
code we have already discussed in the last few sessions: how we implement k-means and how we

02:21.530 --> 02:23.800
implement agglomerative clustering.

02:24.020 --> 02:26.050
So I will not discuss this again.

02:26.300 --> 02:29.570
So it is just the same algorithm implemented.

02:29.570 --> 02:32.110
Only the data set is different.

02:32.480 --> 02:34.580
So you should be able to understand this.

02:36.130 --> 02:42.640
So we will apply k-means clustering, and this is what k-means clustering does. Now, because k-means

02:42.640 --> 02:49.150
clustering always tries to create spherical clusters, it divides these half moons into two parts

02:49.390 --> 02:56.710
and creates one cluster on the left side and one on the right side, and it is not able to distinguish

02:56.710 --> 02:58.240
between the two actual clusters.

02:58.570 --> 03:05.920
Now, each time I run this again, it may give different clusters, while in the case of agglomerative clustering,

03:06.340 --> 03:10.240
even if I run it a hundred times, it will still give me the same clusters.
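
NOTE A minimal sketch of the comparison just described, not the session's actual notebook. It assumes scikit-learn's make_moons as a stand-in for the moons CSV; the sample size, noise level and random_state are illustrative choices.
```python
# Sketch only: make_moons stands in for the session's moons CSV (an assumption).
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
# Two interleaving half moons; sample size and noise level are illustrative.
X, y_true = make_moons(n_samples=2000, noise=0.05, random_state=0)
# Both algorithms are told up front that there are two clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
# Agglomerative clustering has no random initialisation, so a rerun repeats itself.
agglo_rerun = AgglomerativeClustering(n_clusters=2).fit_predict(X)
# Agreement with the true moons (1.0 would be a perfect split).
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
agglo_ari = adjusted_rand_score(y_true, agglo_labels)
print("k-means ARI:", round(kmeans_ari, 3))
print("agglomerative ARI:", round(agglo_ari, 3))
print("agglomerative rerun identical:", np.array_equal(agglo_labels, agglo_rerun))
```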

03:12.010 --> 03:19.840
So now let's run DBSCAN. For implementing DBSCAN, we need to provide two values, that is, the

03:19.840 --> 03:23.800
epsilon value and the minimum samples value.

03:24.010 --> 03:25.810
So here we have the

03:27.470 --> 03:32.840
library which we will be importing: we will import sklearn.cluster, and we will import DBSCAN

03:32.840 --> 03:33.210
from it.

03:34.010 --> 03:43.580
Now we will create the DBSCAN object, in which we will provide the epsilon value. Now,

03:43.580 --> 03:47.080
based on the epsilon value, as we have seen in the visualisations,

03:47.360 --> 03:56.480
the larger the size of epsilon, the faster it will create the neighborhood and the larger

03:56.480 --> 03:57.590
the neighborhood will be.

03:58.070 --> 04:03.200
If we have a small epsilon value, then the neighborhood created will be smaller.

04:03.230 --> 04:04.800
Now, that is set to a certain number.

04:05.540 --> 04:08.600
Next, we are setting the minimum samples to 30.

04:08.610 --> 04:11.620
So we are saying: create a cluster

04:11.930 --> 04:19.580
only if there are at least 30 points present in it. Then we are giving it the distance metric, to

04:19.580 --> 04:22.930
be Euclidean, and then we are fitting the data to it.

04:25.100 --> 04:27.370
Now we are creating the cluster.

04:27.560 --> 04:31.340
So after running this, we are fitting the data.

04:31.350 --> 04:37.010
So this has already been fit to the data which we have provided, and we can basically get the labels from

04:37.010 --> 04:38.820
the DBSCAN object.
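
NOTE A hedged sketch of the fitting step just described. min_samples=30 and the Euclidean metric follow the session; eps=0.15 and the make_moons stand-in data are assumptions, since the session's epsilon value and CSV are not given here.
```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Stand-in for the session's moons data (an assumption).
X, _ = make_moons(n_samples=2000, noise=0.05, random_state=0)
# eps=0.15 is an assumed value; min_samples and metric follow the session.
dbscan = DBSCAN(eps=0.15, min_samples=30, metric="euclidean")
dbscan.fit(X)
# One label per point; -1 marks points that belong to no cluster.
labels = dbscan.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters found:", n_clusters)
print("points labelled -1 (noise):", n_noise)
```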

04:39.380 --> 04:42.890
So here are the labels in my column.

04:43.250 --> 04:51.200
So now we will plot the data using sns.lmplot, and we will give X and Y values, and the hue will

04:51.200 --> 04:55.310
be the cluster, that is, the color will be assigned based on the clusters.

04:55.790 --> 05:03.820
Now here you can see that it was easily able to classify, basically cluster those two types of data,

05:03.830 --> 05:06.410
those two clusters of data properly.

05:06.410 --> 05:11.000
And we have received good clusters.

05:12.210 --> 05:20.370
Now, you can also see that it has created these blue dots and these blue dots are nothing but the points

05:20.370 --> 05:27.210
which have not been included in any of the clusters, which you can identify from the cluster label.

05:27.840 --> 05:35.100
Whenever the cluster label comes out to be minus one, it means that these points are not part of any

05:35.100 --> 05:35.640
cluster.

05:36.770 --> 05:44.720
And the next clusters are zero and one, so we have got two clusters created: cluster zero as the orange one, cluster

05:44.720 --> 05:50.600
one as the green one, and the rest of the points are the ones which are not included in any of the

05:50.600 --> 05:51.200
clusters.

05:52.500 --> 05:55.080
So next, we will find out the value counts.

05:55.390 --> 06:01.470
So here you can see that in the first cluster there are around 999 values,

06:01.470 --> 06:05.330
and in the second cluster there are 995 values present.

06:07.120 --> 06:14.350
Now, let us implement anomaly detection with DBSCAN. Anomaly detection is the practice in which

06:15.130 --> 06:19.690
we want to isolate the points which are actually outliers in the data.

06:20.230 --> 06:28.520
We want to capture those data points which are not like the other data points.

06:29.200 --> 06:30.660
So how do we do that?

06:30.970 --> 06:34.180
So we will be able to implement that using DBSCAN.

06:34.360 --> 06:36.940
So it has already been implemented here.

06:36.940 --> 06:43.300
So you can see these blue points, which we have left here, are actually the outliers and these are

06:43.300 --> 06:45.520
actually the anomalies in the data.

06:45.670 --> 06:51.050
That is, they are not like the other data points which we have in these clusters.

06:51.310 --> 06:55.210
Similarly, we have this wholesale customer data set.

06:55.480 --> 07:01.390
So in this dataset, we have data about milk and grocery and different grocery items.

07:01.510 --> 07:10.470
So we want to check, for milk and grocery, how these values are associated, who are good customers,

07:10.470 --> 07:20.110
who are bad customers, which customers are buying more of the products from us, and which customers

07:20.110 --> 07:22.350
are buying fewer products from us.

07:22.690 --> 07:25.060
So we will simply plot this.

07:25.060 --> 07:30.520
So here, using these values, you can see the plot which has been generated, looks something like

07:30.520 --> 07:30.790
this.

07:32.220 --> 07:40.140
Now, from this visualization, we can easily see that the points which are scattered outside this dark

07:40.140 --> 07:45.340
area are the outliers, or the anomaly points.

07:45.630 --> 07:53.040
So all of these points are actually the anomaly points and they should not be a part of this particular

07:53.040 --> 07:53.550
cluster.

07:53.940 --> 07:56.040
So let us implement the same.

07:56.050 --> 08:03.040
So now we will implement DBSCAN for this, and we will run it for different epsilon values.

08:03.270 --> 08:10.550
So here we are running this for different epsilon values ranging from 0.5 to 5.

08:11.100 --> 08:17.880
And we are running DBSCAN for this and giving the minimum samples value as 20.

08:18.450 --> 08:21.530
Then we're fitting the grocery dataset to it.

08:22.170 --> 08:27.620
Now we will see the values for Epsilon and the outlier data.

08:28.860 --> 08:35.340
So here you can see these are the different epsilon values, and this is the percentage of data which

08:35.340 --> 08:38.520
is considered as outliers.

08:42.410 --> 08:50.810
Now, using this particular data, we are calculating these outliers by

08:51.590 --> 08:59.450
rounding the value, that is, the sum of the outliers divided by the length of the data.

08:59.720 --> 09:02.340
So this gives the percentage of outliers.
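
NOTE A rough sketch of the epsilon sweep described here, under stated assumptions: the data is a synthetic stand-in for the scaled wholesale customer columns, and the 50-step grid from np.linspace is assumed. Only the 0.5 to 5 range and min_samples=20 come from the session.
```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Synthetic stand-in for the two spending columns: a dense group plus scattered points.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=1.0, size=(400, 2))
scattered = rng.uniform(low=-8.0, high=8.0, size=(40, 2))
X = StandardScaler().fit_transform(np.vstack([dense, scattered]))
# Assumed grid of epsilon values between 0.5 and 5; min_samples=20 follows the session.
eps_values = np.linspace(0.5, 5, 50)
outlier_percent = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=20).fit(X).labels_
    # Share of points labelled -1, as a percentage rounded to two decimals.
    percent = round(100 * np.sum(labels == -1) / len(labels), 2)
    outlier_percent.append(percent)
# Pick the epsilon whose outlier share is closest to a target, for example 5 percent.
target = 5.0
best_index = int(np.argmin(np.abs(np.array(outlier_percent) - target)))
print("eps:", round(float(eps_values[best_index]), 3), "outliers %:", outlier_percent[best_index])
```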

09:02.550 --> 09:12.590
So let us say I want to find out the top one percent or top two percent of my customers, to whom I

09:12.590 --> 09:19.460
want to give a particular discount so that they buy more products from us, then I can do that from

09:19.460 --> 09:20.690
this outlier data.

09:21.110 --> 09:27.800
So what I can do is find that the top two percent is present at the epsilon

09:27.800 --> 09:32.420
values 1.23, 1.32 and 1.41.

09:32.660 --> 09:39.140
So I can use these epsilon values, or, for five percent of my data, I can use the epsilon value

09:39.140 --> 09:40.480
0.7751.

09:40.670 --> 09:48.680
So from all of these epsilon values, I can actually find out the percentage of the outliers.

09:48.980 --> 09:54.440
And based on the outliers which I want to see, say I want five percent of outlier data or six percent

09:54.440 --> 09:55.340
of outlier data,

09:55.550 --> 10:02.750
I can find out those data points and I can isolate those data points and then try to get them for my

10:02.750 --> 10:08.210
next campaign or the special offer which I have for them.
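
NOTE A small sketch of isolating the outlier rows once an epsilon has been chosen. The data is synthetic, the column names Milk and Grocery are assumed, and eps=0.77 is simply one of the values mentioned in the session; rows labelled -1 are the ones to pull out.
```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Synthetic stand-in data; the column names "Milk" and "Grocery" are assumed.
rng = np.random.default_rng(0)
values = np.vstack([rng.normal(0.0, 1.0, size=(400, 2)), rng.uniform(-8.0, 8.0, size=(40, 2))])
df = pd.DataFrame(values, columns=["Milk", "Grocery"])
scaled = StandardScaler().fit_transform(df[["Milk", "Grocery"]])
# eps=0.77 is one value mentioned in the session; min_samples=20 as in the sweep.
df["cluster"] = DBSCAN(eps=0.77, min_samples=20).fit(scaled).labels_
# Rows labelled -1 belong to no cluster; these are the customers to isolate.
outliers = df[df["cluster"] == -1]
print(df["cluster"].value_counts())
print("outlier rows:", len(outliers))
```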

10:09.390 --> 10:16.800
So this is what I can do using anomaly detection. Similarly, there may be certain transactions which

10:16.800 --> 10:24.510
are actually outside the usual transactions which a person makes. For example, say

10:24.510 --> 10:27.650
we are talking about credit card transactions.

10:27.840 --> 10:31.740
So there could be certain transactions which are slightly out of range.

10:31.920 --> 10:36.670
But I want to find out those transactions which are highly different from the actual data.

10:36.870 --> 10:44.670
So maybe I can consider the top five percent of the outliers and then call the people and tell them that

10:44.910 --> 10:48.030
these are the transactions which are happening on your card.

10:48.600 --> 10:53.090
So is there any kind of problem which you are facing, or have you lost your credit card?

10:53.100 --> 10:56.470
So these kinds of questions can be asked to those people.

10:56.470 --> 10:56.870
It's.

10:58.820 --> 11:05.270
So here, what I have done is I have used the epsilon value 0.77, which gives

11:08.230 --> 11:11.780
the percentage of outliers as 5.68.

11:12.160 --> 11:17.300
So for this I'm finding out and creating the clusters here.

11:17.500 --> 11:20.370
So here you can see the cluster which has been created.

11:20.380 --> 11:27.940
So these are the outlier values, which are coloured in blue and have the label minus one, and the other

11:27.940 --> 11:33.690
cluster, the actual cluster, is coloured orange and labelled as zero.

11:34.030 --> 11:40.240
So this is what we have gained from DBSCAN, and how we can implement anomaly detection and clustering

11:40.240 --> 11:41.120
on top of this.

11:41.500 --> 11:49.140
Next, we will discuss a very important technique, which is dimensionality reduction.

11:49.300 --> 11:52.090
So I hope you will learn a lot from that.

11:52.390 --> 11:52.860
Thank you.