WEBVTT

00:01.080 --> 00:09.540
Hey, so let us use support vector machines and try to implement a spam classifier.

00:10.170 --> 00:17.890
So for this particular classification, I have imported the spam data.

00:18.540 --> 00:24.920
So it has multiple values; these are the messages.

00:25.650 --> 00:34.670
So these messages contain spam data and also good, important messages.

00:35.160 --> 00:38.280
And it has one more variable, which is the target.

00:39.510 --> 00:45.870
And the target says whether the message is spam or ham.

00:50.700 --> 00:57.050
So we will discuss the different libraries which we have imported, we will explore the data set,

00:57.060 --> 01:05.490
we will look at the distribution of spam and not spam data, then we would apply text analytics and

01:05.640 --> 01:06.640
feature engineering.

01:07.170 --> 01:09.660
Finally, we will apply predictive analysis.

01:10.570 --> 01:11.920
For this particular data.

01:13.170 --> 01:20.060
So these are the libraries which I have imported. The libraries are NumPy, pandas, and Matplotlib.

01:21.680 --> 01:24.290
Then I have imported Counter from collections.

01:26.070 --> 01:35.720
I have imported feature_extraction, model_selection, naive_bayes, metrics, and svm from the scikit-learn

01:35.730 --> 01:36.240
Library.

01:37.590 --> 01:41.550
And I have imported Image from IPython.display.

01:42.760 --> 01:50.350
I have also imported warnings so that I can filter the warnings and ignore the warnings which

01:50.370 --> 01:52.960
are created during this run.

01:55.080 --> 02:06.510
And this is the magic statement which allows the plots to be generated inside the Jupyter notebook itself.

02:09.220 --> 02:12.490
So the first step is to import the data.

02:12.760 --> 02:20.410
So here we are importing spam.csv. This data consists of different messages.

02:21.750 --> 02:25.740
So we will, first of all, look at the distribution of the data.

02:26.370 --> 02:34.700
So here you can see that we have taken out the value count of the first column.

02:36.400 --> 02:46.930
So the values come to around five thousand and around eight hundred. You can see that out of five thousand

02:47.320 --> 02:49.930
messages, there are around eight hundred spam messages.

02:50.470 --> 02:58.900
So we have taken the counts and called plot, and we have plotted a bar chart on top of it.

02:59.170 --> 03:04.030
And we have provided the colors of the bar chart which we want to plot, that is blue and orange.

03:04.600 --> 03:06.760
And we have given the title for the plot.

03:09.380 --> 03:11.420
Here is the bar chart that we have.

03:12.330 --> 03:18.600
Now, next, we will plot the pie chart of this particular data.

03:19.380 --> 03:23.940
So, again, we have taken the same data, that is the counts, and we have plotted it.

03:24.270 --> 03:32.190
So here the plot contains 87 percent non-spam messages and 13 percent spam messages.

03:33.380 --> 03:41.120
Now, we would apply text analytics on top of this data, so we want to find the frequencies of the

03:41.120 --> 03:47.440
words in the spam and non-spam messages, so the words of the messages will be the model features.

03:47.660 --> 03:53.660
So we will be converting these messages into different count vectors.

03:54.050 --> 04:02.030
And these count vectors will have frequencies of each and every word, which we will be getting after

04:02.030 --> 04:09.290
filtering out the stop words or the punctuation, or after applying lemmatization or stemming.

04:09.410 --> 04:17.030
So whatever treatment we want to apply on top of this data, we can apply that treatment, keep

04:17.030 --> 04:23.060
the parts of the data which we want, and then create the tokens out of it.

04:25.120 --> 04:28.450
So here we are using Counter.

04:29.480 --> 04:39.800
And we are joining the message data wherever the column v1

04:39.800 --> 04:40.610
value is non-spam.

04:41.420 --> 04:43.370
We are getting the data from it

04:44.460 --> 04:47.820
and then joining those in count1.

04:49.840 --> 04:53.720
Then we are creating a data frame out of it.

04:55.230 --> 05:04.480
Next, in the data frame, we are renaming the columns as words in non-spam and the count of them.

05:05.280 --> 05:12.780
Next, we are creating count2 for v1 where the value is spam.

05:12.780 --> 05:20.760
Wherever the data is spam, we are getting the data and finding out the 20 most common words. From

05:20.760 --> 05:22.410
the 20 most common words,

05:22.410 --> 05:27.330
we are again creating a data frame and renaming the

05:28.730 --> 05:35.900
columns. So we have created these two data frames, df1 and df2, with the 20 most common

05:35.900 --> 05:40.190
words from the non-spam and the spam data.
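
NOTE
A minimal sketch of the word counting just described, in plain Python.
The messages list and the top_words helper are hypothetical stand-ins,
not the notebook's own data or code; the video builds data frames on top.
```python
from collections import Counter
# Hypothetical toy messages; the notebook reads the real ones from the CSV file.
messages = [
    ("ham", "are you coming to the party"),
    ("ham", "i will call you later"),
    ("spam", "call now to claim your free prize"),
    ("spam", "free entry call now"),
]
def top_words(label, n=20):
    # Join every message with this label into one string, split on whitespace,
    # and let Counter tally how often each word occurs.
    text = " ".join(msg for lab, msg in messages if lab == label)
    return Counter(text.split()).most_common(n)
ham_top = top_words("ham")
spam_top = top_words("spam")
print(ham_top[:3])
print(spam_top[:3])
```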

05:42.040 --> 05:49.510
I have printed df1 and df2 for you, so here you can see that the words in non-spam

05:49.510 --> 05:50.960
are: to, you,

05:51.010 --> 05:57.040
I, the, and, i, in, u, is, my, me, for, that.

05:57.190 --> 06:04.390
So these are the different words which are most common in the non-spam data, while in the spam the most common

06:04.390 --> 06:05.020
ones are:

06:05.020 --> 06:07.480
to, you, call, for.

06:08.480 --> 06:10.340
So these are the words from the spam data.

06:11.530 --> 06:12.550
Now you can see.

06:13.710 --> 06:22.170
that the most common words which we have are words which are actually stop words, so this means

06:22.320 --> 06:29.760
that it is very important to remove these words from this dataset so that we can actually get the words

06:29.760 --> 06:34.200
which are impacting the spam and ham classification.

06:36.310 --> 06:45.960
So let us plot which words are most common. So here we have the plot for df1.

06:46.000 --> 06:52.180
So these are the words which are most common in non-spam, and these are the words which are most common in spam.

06:53.840 --> 07:05.060
Now, you can see that in the case of spam, mostly someone is being asked to call, so let us see

07:05.440 --> 07:06.890
how this actually has an impact.

07:07.280 --> 07:12.350
Now we can see that the majority of frequent words in both classes are stop words.

07:12.680 --> 07:16.760
So stop words refer to the most common words in a language.

07:16.960 --> 07:19.970
There is no single universal list of stop words.

07:20.270 --> 07:24.880
So we will try to create a list of stop words and to remove those.

07:25.250 --> 07:27.200
So the first thing which we do is.

07:28.340 --> 07:34.640
We remove the stop words to improve the analytics, because most of the frequent words are stop words, so they will always

07:34.730 --> 07:36.790
interfere with what we are trying to do.

07:37.100 --> 07:45.290
So we are using feature_extraction.text.CountVectorizer, and in the CountVectorizer we are

07:45.530 --> 07:54.910
setting stop_words equal to 'english', which means that in the CountVectorizer we want to remove the stop words.

07:54.920 --> 07:57.290
So these are the words that you want to remove.

07:57.590 --> 08:08.420
So now we are fitting and transforming this particular data, that is v2, and getting X. So

08:09.290 --> 08:17.600
X has a shape of (5572, 8404), which means that we have created more

08:17.600 --> 08:20.750
than 8,400 new features, that is,

08:20.870 --> 08:22.940
8,400 words.

08:24.310 --> 08:33.580
The new feature j in row i is equal to one if the word w_j appears in text example i; it is zero

08:33.580 --> 08:40.840
if not. So if a particular word is present in the text, then that particular word will be present in

08:40.840 --> 08:42.010
the feature.

08:42.250 --> 08:46.200
If the word is not present in the text, then it will not be present.

08:47.300 --> 08:47.930
So.

08:50.430 --> 08:59.100
Here you can see that X is a sparse matrix, so that means that it saves space wherever required.
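
NOTE
A rough plain-Python sketch of the idea behind the count vectors: drop the
stop words, build a vocabulary, and count each word per message. This is not
how CountVectorizer is implemented; STOP_WORDS and docs are hypothetical, and
the result here is a dense list of lists rather than a sparse matrix.
```python
from collections import Counter
# Hypothetical, deliberately tiny stop list; real lists are much longer.
STOP_WORDS = {"the", "to", "you", "a", "is", "i"}
docs = ["call to claim the prize", "i will call you", "the prize is a prize"]
def tokenize(doc):
    # Lowercase, split on whitespace, and drop the stop words.
    return [w for w in doc.lower().split() if w not in STOP_WORDS]
# One column per remaining word, in sorted order.
vocab = sorted({w for doc in docs for w in tokenize(doc)})
# One row per message: how often each vocabulary word occurs in it.
rows = [[Counter(tokenize(doc))[w] for w in vocab] for doc in docs]
print(vocab)
print(rows)
```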

08:59.880 --> 09:02.180
Next, we will apply predictive analysis.

09:02.370 --> 09:06.920
So our goal is to predict if a message is spam or not.

09:07.170 --> 09:14.050
So we assume that it is much worse to misclassify non-spam than to misclassify spam.

09:14.220 --> 09:22.170
So if there is a message which is not spam and we classify it as spam, then it would be a more difficult

09:22.170 --> 09:26.440
situation because in that case we might miss out on the important message.

09:26.760 --> 09:33.600
So that is why we will give priority to a non-spam message in comparison to a spam message.

09:35.880 --> 09:43.020
So now we will transform the variable, spam or non-spam, into a binary variable, so we set

09:43.020 --> 09:46.020
spam to be one and ham to be zero.

09:47.190 --> 09:55.170
So we will simply apply a map, and whatever value in the column v1 represents spam, it

09:55.170 --> 09:57.810
will change it to one, and if it is ham, it will change it to zero.

09:57.820 --> 10:03.320
And then we are simply applying the train-test split on top of it.

10:05.350 --> 10:09.910
So here you can see the shapes. So X_train has a shape of

10:09.940 --> 10:14.230
3733 by 8404.

10:15.890 --> 10:20.030
The shape of X_test is 1839

10:21.230 --> 10:22.540
by 8404.
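
NOTE
A small sketch of the label mapping and of the split sizes quoted above.
The labels list is hypothetical, and the test fraction of 0.33 is an
assumption: it is not stated in the video, but rounding the test size up
reproduces the 3733 and 1839 rows mentioned.
```python
import math
label_to_int = {"spam": 1, "ham": 0}
labels = ["ham", "spam", "ham", "ham"]  # hypothetical target values
y = [label_to_int[lab] for lab in labels]
# Assumed test fraction; the row count 5572 is the one quoted for X.
n_rows, test_fraction = 5572, 0.33
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test
print(y, n_train, n_test)
```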

10:24.720 --> 10:32.460
Now we will apply multinomial naive Bayes, and we will be using the regularization parameter alpha.

10:33.580 --> 10:38.130
So we will evaluate the accuracy, recall and precision for the models.

10:39.440 --> 10:40.010
So.

10:41.250 --> 10:50.190
We have got score_train, that is, we are putting in zero values for score_train and for the different

10:50.190 --> 10:57.860
scoring values. And for alpha in the list of alphas, that is, values starting from one over one hundred thousand,

10:58.200 --> 11:01.950
we are generating the values

11:02.890 --> 11:03.880
in fixed steps.

11:06.190 --> 11:15.450
So bayes is the object which we are creating from the naive Bayes model, that is multinomial naive

11:15.470 --> 11:18.600
Bayes, and we are applying fit to it.

11:19.620 --> 11:29.900
Now, for score_train, we are getting the score, so bayes.score gets us the X_train

11:30.360 --> 11:36.630
and y_train score using the bayes object; then we are scoring

11:37.650 --> 11:39.060
the testing data.

11:40.010 --> 11:49.310
Then we are putting the metrics, that is, recall_score into recall_test and precision_score into

11:49.310 --> 11:50.690
precision_test.

11:52.900 --> 11:59.920
This is done for each and every value. Now we will see the first ten learning models, which are being created

11:59.920 --> 12:01.400
for different alpha values.

12:01.750 --> 12:06.240
So these are different values for which we have created the models.

12:06.430 --> 12:11.510
And you can see these are the train accuracies and test accuracies which we have obtained.

12:12.100 --> 12:16.060
So we select the model with the most test precision.

12:16.600 --> 12:21.220
OK, so we will select the model which has the highest test precision.

12:22.760 --> 12:31.430
So the best model comes out to be the one with the alpha value 15.73, and the train

12:31.580 --> 12:34.190
accuracy is 0.97.

12:34.430 --> 12:37.100
The test accuracy is 0.96.

12:37.310 --> 12:41.990
The test recall is 0.77, and the test precision is one.

12:42.560 --> 12:49.970
Now, here we have used this particular model because it does not produce any false positives.

12:50.360 --> 12:57.460
So this is what we are focusing on, because we don't want any non-spam

12:57.470 --> 13:01.010
messages to be going into the spam folder.

13:01.190 --> 13:04.910
So we are focusing on one hundred percent precision.

13:05.180 --> 13:10.190
So we are getting the test precision from the top models.

13:10.370 --> 13:15.510
So these are the different test precisions; these are the best ones.

13:15.830 --> 13:21.360
So we will select the model which has the maximum test accuracy out of these.

13:21.800 --> 13:29.240
So the one which comes out with the maximum test accuracy, we are selecting it using idxmax.

13:30.870 --> 13:36.360
So it is the model with alpha value 15.730010.

13:36.630 --> 13:43.780
So this is the model which actually has the maximum test precision and the corresponding maximum test

13:43.830 --> 13:44.490
accuracy.
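
NOTE
A minimal sketch of the two-step selection described above: first keep the
models with the highest test precision, then take the most accurate of those.
The results table is hypothetical, and the video does this with a data frame
and idxmax rather than with plain Python.
```python
# Hypothetical results table: one dict per fitted model.
results = [
    {"alpha": 0.1, "test_accuracy": 0.975, "test_precision": 0.96},
    {"alpha": 10.0, "test_accuracy": 0.962, "test_precision": 1.0},
    {"alpha": 15.0, "test_accuracy": 0.969, "test_precision": 1.0},
    {"alpha": 19.0, "test_accuracy": 0.958, "test_precision": 1.0},
]
# Step 1: keep only the models with the highest test precision.
best_precision = max(r["test_precision"] for r in results)
candidates = [r for r in results if r["test_precision"] == best_precision]
# Step 2: among those, take the one with the highest test accuracy.
best = max(candidates, key=lambda r: r["test_accuracy"])
print(best)
```
Note that the most accurate model overall (alpha 0.1 here) is skipped on
purpose, because it sends some non-spam to the spam folder.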

13:46.460 --> 13:52.610
Now, next, we will generate the confusion matrix of this, so we can generate the confusion matrix using

13:52.610 --> 14:00.290
metrics.confusion_matrix, and we can simply apply the column names which we have.

14:00.290 --> 14:06.730
That is, the columns will be predicted zero and predicted one, and the index would be actual zero

14:06.740 --> 14:07.610
and actual one.

14:09.000 --> 14:10.680
So these are the values which we get.

14:10.980 --> 14:18.720
So here you can see that something which is actually zero and predicted as one is zero; that is what

14:18.720 --> 14:24.910
we were focusing on, that a non-spam message should not go into the spam folder.

14:26.520 --> 14:33.320
So here we have actually misclassified 56 spam messages, that is, something which is actually

14:33.330 --> 14:41.760
spam and predicted as non-spam, but that is not a problem because our main target was this particular zero,

14:41.760 --> 14:42.980
which we have obtained.
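
NOTE
A short sketch of how precision, recall and accuracy follow from the four
cells of this confusion matrix. The 0 false positives and 56 false negatives
are the counts read out in the video; the true negative and true positive
counts are assumptions, chosen so that the rows add up to the 1839 test
messages and the results land close to the recall and accuracy quoted earlier.
```python
# Counts laid out like the confusion matrix in the video.
# fp and fn come from the video; tn and tp are assumed for illustration.
tn, fp = 1587, 0   # actual 0: predicted 0, predicted 1
fn, tp = 56, 196   # actual 1: predicted 0, predicted 1
# Precision: of everything flagged as spam, how much really was spam.
precision = tp / (tp + fp)
# Recall: of all real spam, how much was caught.
recall = tp / (tp + fn)
# Accuracy: share of all messages classified correctly.
accuracy = (tp + tn) / (tn + fp + fn + tp)
print(precision, recall, accuracy)
```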

14:43.500 --> 14:45.840
So we are going to apply

14:46.850 --> 14:53.990
the regularization parameter C to it. So for this, we are again creating the list

14:53.990 --> 15:01.070
of Cs. So here we are generating a list of C values from 500 to 2000.

15:02.290 --> 15:09.760
And score_train is again blank, score_test is again blank, so we are creating different zero

15:09.790 --> 15:10.830
values for this.

15:11.530 --> 15:13.930
Now we are running SVM.

15:15.840 --> 15:24.420
with different C values. So here again, we are applying svc.fit, and then we are scoring these

15:24.420 --> 15:30.360
different SVCs and finding out the recall score and the precision score from this.

15:32.530 --> 15:36.010
And these are the values which we have generated from this.

15:36.220 --> 15:40.660
Now, again, we are looking for the test precision to be one.

15:42.670 --> 15:50.600
So we are looking at the model with the most test precision, so that is 0.995.

15:51.370 --> 15:56.390
So this is the best model, which produces almost no false positives.

15:56.920 --> 15:59.500
So this was our goal.

15:59.770 --> 16:03.430
So that is why we are taking this particular model.

16:05.750 --> 16:11.110
Here are the models with test precision equal to one, which we don't really have as of now.

16:12.840 --> 16:21.510
So we can check that there are very few false positives, so we can get the list of the best models.

16:21.720 --> 16:28.230
So this is the best model, with 99.5 percent test precision, and let us generate

16:28.230 --> 16:34.290
the confusion matrix. So the confusion matrix can be obtained from metrics.confusion_matrix

16:36.980 --> 16:40.520
using y_test with respect to svc.predict.

16:42.070 --> 16:50.410
And we can get the predicted values, so here we have actual zero predicted as one only once, and

16:51.250 --> 16:54.540
the spams classified as non-spam are thirty-seven.

16:54.790 --> 17:00.160
So we have misclassified thirty seven spams and misclassified only one non spam.

17:00.640 --> 17:02.260
So as a conclusion.

17:03.780 --> 17:12.440
We can say that the best model is the one which is having ninety eight point three percent accuracy.

17:15.950 --> 17:19.930
And it classifies almost every non-spam message correctly.

17:20.370 --> 17:28.730
So this is what we have obtained from this, so you can try different regularisation parameters and

17:28.730 --> 17:36.140
try your own versions of this code. Try using GridSearchCV or RandomizedSearchCV

17:36.140 --> 17:46.400
for simplifying these implementations so that you do not waste time on running several loops

17:46.400 --> 17:48.440
of code, which we have done here.

17:48.710 --> 17:51.440
So you can try all that using

17:51.450 --> 17:52.080
GridSearchCV.
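
NOTE
A hedged sketch of what the GridSearchCV suggestion could look like with
scikit-learn, replacing the manual loop over alpha values. The toy texts,
labels and the alpha grid are hypothetical and far smaller than the real
data; wrapping CountVectorizer and MultinomialNB in a Pipeline is a design
choice made here, not something shown in the video.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Hypothetical toy data: 1 = spam, 0 = ham.
texts = [
    "free prize call now", "win cash call now", "claim your free prize",
    "free entry win cash", "call now free entry", "win a free prize now",
    "see you at lunch", "are you coming home", "i will call you later",
    "thanks for the lunch", "see you later at home", "coming home for lunch",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The vectorizer sits inside the pipeline so each fold builds its own vocabulary.
pipe = Pipeline([
    ("vec", CountVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
# Candidate alpha values are illustrative, not the grid from the video.
grid = {"nb__alpha": [0.01, 0.1, 1.0, 10.0]}
# Score by precision, since false positives are what the video wants to avoid.
search = GridSearchCV(pipe, grid, scoring="precision", cv=3)
search.fit(texts, labels)
print(search.best_params_)
```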