WEBVTT

00:00.930 --> 00:04.390
So let us begin with the solution for this particular problem.

00:04.410 --> 00:13.830
So in this problem we have very highly imbalanced data: a lot of the transactions are

00:13.830 --> 00:18.980
normal transactions and very few transactions are fraud.

00:19.350 --> 00:23.800
But we are trying to focus on fraud transaction detection.

00:23.820 --> 00:26.910
We want to find out the fraud immediately.

00:27.090 --> 00:35.400
And that is the reason why we will be focusing more on sensitivity and precision in this particular

00:35.400 --> 00:36.230
implementation.

00:36.240 --> 00:37.770
I will be using Naive Bayes.

00:38.130 --> 00:39.870
You can use any other algorithm.

00:40.110 --> 00:41.930
You can stack algorithms together.

00:41.970 --> 00:43.790
The process would remain the same.

00:44.070 --> 00:50.880
You just need to try different algorithms, compare them, find out which algorithm was the best,

00:51.120 --> 00:53.940
and then fine-tune that particular algorithm.

00:55.740 --> 01:05.040
So here we import the required libraries, and after importing those libraries we see the data here:

01:05.040 --> 01:09.630
the dataset contains column names such as V1, V2, V3 and so on.

01:09.640 --> 01:13.950
So we don't really know which variable depicts

01:13.950 --> 01:16.520
what about the transaction.

01:16.890 --> 01:22.280
We just know that these are the variables and that they have certain values associated with them.

01:23.580 --> 01:26.980
And you can see these values are also scaled in nature.

01:27.000 --> 01:34.860
So we cannot determine anything specific from them, from the top level at least. We have a few columns

01:34.860 --> 01:43.710
such as Time and Amount, plus Class, which is the target variable that we have, and Amount is the amount

01:43.710 --> 01:46.940
for which the transaction has happened, or something of that sort.

01:49.020 --> 01:52.140
Here we have the values.

01:52.200 --> 01:57.690
So Time has continuous values.

01:58.070 --> 02:07.410
Other than that, all of the values usually range from some negative value, say minus

02:07.410 --> 02:11.010
five, to some positive value of five or ten.

02:11.310 --> 02:14.060
So these basically range from about minus ten to plus ten.

02:14.070 --> 02:17.770
So I feel that these are actually already standardized in nature.

02:18.420 --> 02:22.230
So we don't need to perform any scaling on these columns in this particular dataset.

02:23.910 --> 02:30.690
Now, if you look at this data, we have 0.2 percent fraud transactions, and 99.8

02:30.690 --> 02:33.920
percent of the transactions are genuine in nature.
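
NOTE
The class imbalance described above can be checked with a few lines of plain Python.
This is only a minimal sketch: the labels list is made-up illustrative data
(0 = genuine, 1 = fraud), not the course dataset, and class_percentages is a
hypothetical helper name.
```python
from collections import Counter
def class_percentages(labels):
    """Return the share of each class label as a percentage of all rows."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100.0 * n / total for label, n in counts.items()}
# Illustrative data only: 998 genuine rows (0) and 2 fraud rows (1).
labels = [0] * 998 + [1] * 2
shares = class_percentages(labels)
```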

02:37.440 --> 02:47.730
Let us plot the time to see if there is any particular trend associated with the time in the dataset, so we are

02:47.740 --> 02:48.710
plotting Time.

02:49.110 --> 02:53.020
So here we have the transaction and this is the time which we have.

02:53.370 --> 02:59.280
So based on the time you can see these are the genuine transactions and these are the fraud transactions.

02:59.640 --> 03:08.790
So you can see there are specific spikes in the transactions, and they occur mainly on the lower

03:10.320 --> 03:11.070
side.

03:11.250 --> 03:15.620
So they literally happen at particular points of time.

03:15.630 --> 03:18.740
Only at those times are these transactions more frequent.

03:19.380 --> 03:27.720
So the time feature shows that the rate of transactions picks up during the day, but the number

03:27.720 --> 03:32.510
of transactions has an almost similar dependence on the time of the day

03:32.940 --> 03:38.190
for both the classes: for both the classes it is going down and going up, then going down and

03:38.190 --> 03:41.060
going up; similarly, down and up and down.

03:41.280 --> 03:49.940
So the trend remains the same, but during the day the number of transactions increases.

03:51.180 --> 04:00.450
OK, so this feature does not give much predictive value to us, but we will test this later.

04:00.450 --> 04:10.250
For now, we will drop the raw Time column from our dataset and keep only the hour-based time as one of the features.

04:12.570 --> 04:16.330
Next, we will check the Amount feature.

04:17.550 --> 04:24.360
So for Amount, you can see the genuine transactions range up to higher values as well.

04:25.140 --> 04:33.810
But the fraud transactions are actually for smaller amounts, so the fraud occurred

04:34.800 --> 04:38.970
in a larger number of small transactions, small-amount transactions.

04:41.020 --> 04:46.840
So this shows that all transaction amounts greater than 10,000 are in the genuine class only.

04:47.320 --> 04:47.730
All right.

04:49.210 --> 04:56.380
Also, this Amount feature is not on the same scale as the principal components, so we will standardize the values

04:56.380 --> 05:02.710
of the Amount feature using the standard scaler, because this is the only column which was not standardized in nature,

05:02.710 --> 05:05.910
along with Time: this Amount.

05:07.600 --> 05:14.560
So we have standardized this Amount column, and we have found out that the transactions are usually of

05:14.710 --> 05:15.430
small amounts.
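
NOTE
Standardizing a column means subtracting its mean and dividing by its standard
deviation. The sketch below shows that idea in plain Python rather than with the
scaler used in the video. The amounts list is illustrative data and standardize
is a hypothetical helper name; the population standard deviation is used here
as an assumption.
```python
from statistics import mean, pstdev
def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]
# Illustrative amounts only, not taken from the course dataset.
amounts = [10.0, 20.0, 30.0, 40.0, 1000.0]
scaled_amounts = standardize(amounts)
```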

05:18.230 --> 05:27.110
Next, we will check the correlation and shapes of the 25 principal components, so for each feature we

05:27.110 --> 05:31.820
will be plotting the distribution.

05:32.180 --> 05:40.450
So here we are plotting the genuine and fraud transactions, one plot for each feature.

05:40.940 --> 05:47.680
So if you look at these plots, they are all of the same kind.

05:48.020 --> 05:52.210
So we are basically finding out what actually overlaps.

05:52.430 --> 05:58.640
So for the features where the genuine and fraud transactions are almost overlapping, like this one and

05:58.640 --> 05:59.120
this one,

05:59.450 --> 06:05.730
this shows that the feature is not very relevant to us, because it is not giving any specific separation.

06:06.140 --> 06:11.750
But when we look here, this shows that the genuine transactions are a little different

06:11.750 --> 06:13.450
from the fraud transactions.

06:13.460 --> 06:17.780
So we can see there is a slight difference between the types of transactions.

06:18.590 --> 06:24.890
So we will keep only the features which give us some difference between them.

06:24.920 --> 06:28.640
So again, for this particular feature, you can see there is some difference in the pattern.

06:28.830 --> 06:31.010
We will keep only those types of features.

06:32.450 --> 06:33.550
So let's go further.

06:34.820 --> 06:40.440
These are all the different features, so you can easily find out what is overlapping and what is not.

06:41.180 --> 06:45.660
So for some of the features, both the classes have a similar distribution.

06:45.950 --> 06:50.630
So we don't expect them to contribute towards the classifying power of the model.

06:51.210 --> 06:59.250
OK, so it is best to drop those and reduce the complexity, and hence we will reduce the chances of overfitting.

06:59.270 --> 07:06.880
We will check this assumption, and we will also validate later

07:07.520 --> 07:13.370
whether they actually have some importance or some input towards the classification.

07:14.600 --> 07:21.890
So now we have dropped certain columns, certain features, and now we will split the data into train and

07:21.900 --> 07:26.510
test sets, that is, 20 percent test and 80 percent training.

07:27.350 --> 07:34.390
Now we have split the data, and we have set stratify.

07:34.430 --> 07:43.280
What stratify does is that it will basically give an equal distribution.

07:43.790 --> 07:56.360
So in my test set there will also be an equal percentage of fraud transactions as there is in the

07:56.600 --> 07:57.470
training set.

07:57.770 --> 08:04.570
So if two percent of my transactions are fraud, say, then in the training data there will be two percent fraud

08:04.790 --> 08:08.960
transactions, and in the test data as well we will have two percent fraud transactions.
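
NOTE
The stratified 80/20 split described above can be sketched without any library:
split each class separately so that both parts keep the same class percentages.
This is an illustration of the idea, not the splitting function used in the
video. The labels list is made-up data and stratified_split is a hypothetical
helper name.
```python
import random
def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)  # shuffle within the class before cutting
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx
# Illustrative labels only: 2 percent fraud (1), 98 percent genuine (0).
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels)
```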

08:11.660 --> 08:17.620
So here we have created one function which will give us the predictions in the form of a confusion matrix,

08:17.960 --> 08:22.010
and this particular function will give us the scores.
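
NOTE
The scores discussed in the rest of the video all come from the four cells of
the confusion matrix. A minimal plain-Python sketch follows, treating 1 (fraud)
as the positive class. The function names and the sample predictions are
hypothetical, chosen only for illustration.
```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 (fraud) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn
def scores(y_true, y_pred):
    """Return recall, precision, F1 and accuracy from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    both = precision + recall
    f1 = 2 * precision * recall / both if both else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}
# Illustrative predictions only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
result = scores(y_true, y_pred)
```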

08:23.570 --> 08:30.020
So now we are implementing Gaussian Naive Bayes, and we are training the model and everything.

08:31.400 --> 08:35.790
After training the model, we can look at the different cases.

08:36.470 --> 08:42.120
So the first case is when we do not drop anything, and we get the results out of it.

08:42.500 --> 08:44.960
So this is what we get on the train set.

08:48.180 --> 08:49.650
This is the confusion matrix.

08:50.350 --> 08:57.020
Let's have a look at the ROC AUC score: you can see that it is 96.3 percent.

08:57.840 --> 09:02.750
And let's look at the recall, which is 0.84.

09:04.200 --> 09:06.390
And you can see the precision is quite low.

09:09.290 --> 09:17.120
And the F1 score is also here. The next case is when we drop some of the principal components

09:17.120 --> 09:18.760
that have similar distributions.

09:19.070 --> 09:20.690
So then we have a look at this.

09:20.690 --> 09:27.350
You can see the precision has improved.

09:28.310 --> 09:34.760
The precision was five percent, but now the precision is eight percent, which is not much of an improvement.

09:34.760 --> 09:36.740
But still, there is a slight improvement.

09:37.010 --> 09:41.610
The F1 score has also improved here; before, it was 10 percent.

09:41.630 --> 09:47.990
So there is a slight improvement, and the accuracy is 0.98, which is just about the same.

09:49.160 --> 09:53.750
And we can see an improvement in this score here as well.

09:55.250 --> 09:57.990
So here we have got some improvement.

09:58.010 --> 09:58.280
Right.

09:58.580 --> 10:00.850
So next, let us have a look at the next case.

10:00.860 --> 10:04.020
I mean, we drop some principal components and also Time.

10:04.040 --> 10:06.390
So we are dropping Time as well in this case.

10:07.070 --> 10:13.480
So what we check here is whether the recall score has improved and whether the accuracy has improved.

10:13.970 --> 10:20.360
The F1 score was 0.154 and it is still about 0.15, and the accuracy is

10:20.360 --> 10:22.330
about the same as before.

10:22.350 --> 10:25.390
It is still the same, and the precision is also the same.

10:25.550 --> 10:28.940
So we can see that it's not helping us.

10:28.980 --> 10:31.210
The Time feature is not helping us much.

10:31.460 --> 10:33.230
So we will remove Time as well.

10:34.960 --> 10:41.230
The next case is when we drop the principal components, drop Time, and also drop

10:41.230 --> 10:42.850
the scaled amount.

10:43.360 --> 10:50.380
So now if we look, the recall is 0.877; it's still the same.

10:50.920 --> 10:52.270
The precision is

10:52.270 --> 10:57.970
0.086, an improvement, and then we have an F1 score of 0.157,

10:58.720 --> 11:00.260
still a little improvement.

11:00.260 --> 11:02.680
The accuracy is 0.98 here.

11:02.680 --> 11:10.050
The ROC AUC was 0.9611 there, and here it is

11:10.060 --> 11:10.300
0.9613.

11:10.660 --> 11:13.000
Not much difference, but maybe a little bit.

11:13.000 --> 11:20.500
So we can see case four gives us better model sensitivity and precision as compared to the first case.

11:20.510 --> 11:24.240
So dropping some of the redundant features will help us.

11:24.250 --> 11:24.580
Right.

11:25.020 --> 11:27.850
So this is the improved one.

11:28.120 --> 11:33.370
So here we can see that we have got a good result out of this entire thing.

11:34.680 --> 11:42.160
So what we can do is simply train this again, this time with logistic regression.

11:42.400 --> 11:45.940
So we run the same checks again, as we defined earlier.

11:45.970 --> 11:51.380
But now let's check the scores for logistic regression instead.

11:51.460 --> 11:53.280
With logistic regression,

11:53.290 --> 11:55.770
so when we run logistic regression,

11:56.380 --> 11:57.280
These are the results.

11:57.580 --> 12:07.530
So you can see the ROC AUC has improved to 0.97,

12:07.540 --> 12:14.470
and the F1 score has also improved, but the recall has declined.

12:15.910 --> 12:22.660
So as we see, by learning from the fully imbalanced dataset, this logistic regression is performing

12:22.660 --> 12:26.740
very poorly, because the recall is very low, and we don't want this low recall.

12:26.950 --> 12:28.510
It's not learning anything.

12:28.870 --> 12:31.000
It is just a hunch which it is having.

12:31.270 --> 12:35.440
So now let us try to balance out the classes and then train.

12:35.450 --> 12:42.790
Right, so here we will get the indexes for the fraud and genuine classes and then undersample

12:42.790 --> 12:43.780
the entire data.

12:44.020 --> 12:51.120
When we undersample the entire data, we sample it down to the number of fraud transactions.

12:51.940 --> 12:58.330
So now the genuine transactions are reduced to match that number

12:58.330 --> 12:59.120
of transactions.
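
NOTE
Random undersampling, as described above, keeps every fraud row and draws an
equally sized random sample from the genuine rows. The sketch below works on
row indexes in plain Python. The labels list is illustrative data and
undersample_indices is a hypothetical helper name, not the code from the video.
```python
import random
def undersample_indices(labels, minority=1, seed=0):
    """Keep every minority row and an equally sized random sample of the rest."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    sampled_majority = rng.sample(majority_idx, len(minority_idx))
    return sorted(minority_idx + sampled_majority)
# Illustrative labels only: 5 fraud rows (1) and 995 genuine rows (0).
labels = [1] * 5 + [0] * 995
balanced_idx = undersample_indices(labels)
```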

13:00.610 --> 13:09.130
So when we run this particular model with the 80/20 train-test split, we see that the ROC AUC is

13:09.130 --> 13:15.580
0.98, which is a huge improvement, and the recall is 0.95, which is again

13:15.580 --> 13:16.660
a huge improvement.

13:16.870 --> 13:20.860
So we can see, as expected, it has performed very well.

13:23.000 --> 13:28.820
Now, next, we will be checking the performance of this particular model on

13:28.820 --> 13:31.740
the entire dataset, that is, the skewed data.

13:32.090 --> 13:37.700
So we will run this and check the predictions, and we can see that the recall is still

13:37.700 --> 13:39.830
0.91 and the precision is 0.04.

13:40.520 --> 13:44.110
So we can say that it is a good model.

13:46.260 --> 13:54.180
Now, we will compare the scores for this model and the Naive Bayes model which we have created, and we can

13:54.180 --> 13:58.350
see here the recall score is 0.9.

13:58.770 --> 14:03.480
So we can see that these are almost all good models.

14:03.480 --> 14:11.310
And the logistic regression which we created later on, the undersampled one, gave us a much better result

14:11.310 --> 14:12.750
in comparison to the other one.

14:14.580 --> 14:22.680
So no doubt logistic regression gives better model sensitivity, but the positive predictive

14:22.680 --> 14:25.170
value for Naive Bayes is more than double.

14:25.500 --> 14:28.780
So Naive Bayes is performing better there.

14:29.130 --> 14:35.730
So you can compare and you can find out which one works better for you.

14:35.730 --> 14:42.050
If you are only concerned about the recall, then yes, logistic regression is performing better.

14:42.270 --> 14:48.600
But if you are looking at both recall and precision, then we should be picking the Naive Bayes solution.

14:49.110 --> 14:50.790
This is just one of the solutions.

14:50.790 --> 14:58.170
You can think of any solution and you can try different combinations and create your own solution for.