WEBVTT

00:01.280 --> 00:08.210
In this session, we will discuss ensemble learning. We have already discussed decision trees.

00:09.380 --> 00:13.520
And we know that decision trees have certain drawbacks.

00:14.590 --> 00:17.240
They are really prone to overfitting.

00:18.710 --> 00:23.810
Hence, we need a better model, which will not really overfit.

00:25.250 --> 00:31.150
So let us have a look at ensemble and try to understand what ensemble can do for us.

00:34.150 --> 00:40.270
Ensemble learning combines multiple weak algorithms to form a strong one.

00:41.760 --> 00:48.760
Using ensemble methods allows us to produce better predictions compared to a single model.

00:49.260 --> 00:57.480
So in ensemble learning, we are combining several weak algorithms, or small weak models,

00:57.690 --> 00:59.310
to form a strong one.

01:01.600 --> 01:08.370
And it will allow us to produce better predictions compared to a single model.

01:09.220 --> 01:17.050
So instead of creating a large decision tree, which is very prone to overfitting, we

01:17.050 --> 01:19.750
can create smaller decision trees

01:19.900 --> 01:28.090
and pool them together in such a way that we have a strong machine learning model which will

01:28.270 --> 01:29.890
not really overfit.

01:30.970 --> 01:31.950
So let us move forward.

01:33.720 --> 01:39.180
Now, there is this one concept: the bias and variance trade-off.

01:39.900 --> 01:44.310
Now let us try to understand what the bias and variance trade-off is.

01:45.500 --> 01:53.810
We have a lot of bias when the model underfits the data.

01:56.330 --> 02:04.790
Then the model does not learn enough: it learns only from one particular pattern and ignores

02:04.790 --> 02:11.190
another important pattern, and it is going to be biased towards that particular pattern.

02:12.430 --> 02:21.610
So let's say we have a loan approval dataset in which one feature tells us that

02:21.790 --> 02:24.280
the loan approval value should be

02:24.490 --> 02:32.520
yes, while there is another feature which tells us that the value can be no as well.

02:33.400 --> 02:45.010
Now, what happens is that the model learns only from the first feature and ignores the second feature, which

02:45.010 --> 02:52.600
is the reason why the model usually makes the prediction of always approving the loan.

02:54.150 --> 03:03.470
This kind of situation leads to a biased result; that is, the model will always predict that the

03:03.470 --> 03:08.910
loan is to be approved, when actually the loan should have been

03:10.160 --> 03:11.450
rejected in this case.

03:12.700 --> 03:24.370
So this is called bias: when a model tries to capture only one type of pattern, it is highly biased, and

03:24.370 --> 03:33.610
this means that the model is able to learn only one type of pattern from the data.

03:35.820 --> 03:46.050
Another situation is when the model has high variance. A model will have high variance if it

03:46.050 --> 03:49.710
tries to capture all the patterns from the data.

03:50.530 --> 03:57.710
If a model tries to learn from all of the patterns in the data, and even learns small complexities

03:57.730 --> 04:03.870
from the data, then the model does not really stay generalized enough.

04:04.330 --> 04:12.120
So now the model is not generalized enough; that is, instead of creating a straight line for the

04:12.170 --> 04:12.640
regression,

04:12.790 --> 04:21.520
it draws a zigzag line to capture each and every data point, which leads to overfitting.

04:22.620 --> 04:30.510
So this is called high variance: when the model tries to learn too much from the data points.

04:32.360 --> 04:40.640
Now, what happens is that when the model starts to learn, the bias decreases.

04:41.620 --> 04:51.730
So here, the orange line depicts the bias: as the model starts to learn from the data, the

04:51.730 --> 04:54.370
bias starts to decrease.

04:55.690 --> 05:05.410
And when the model starts to learn extra things, it starts to overfit, and the variance

05:05.410 --> 05:06.220
increases.

05:07.470 --> 05:15.490
So because of this, when the bias is decreasing, the variance is actually increasing.

05:16.200 --> 05:21.420
Similarly, when the variance is decreasing, the bias is increasing.

05:22.430 --> 05:31.280
So an optimal model would be one where there is neither high bias nor high variance; a point

05:31.610 --> 05:37.310
in between these will be an optimal balance of bias and variance.

05:38.330 --> 05:48.790
Hence, the test error will be minimal when the bias and variance are both at an optimal level.

05:50.210 --> 05:57.560
This is what the bias-variance trade-off is: when the bias decreases, the variance starts to increase,

05:57.740 --> 06:06.170
and when the variance decreases, the bias starts to increase. We want to find an optimal balance

06:06.350 --> 06:14.900
between the bias and variance so that we can make good predictions and have a generalized model which

06:14.900 --> 06:22.310
does not have a bias towards a few features and does not try to capture extra information or even

06:22.310 --> 06:24.580
noise from the data.
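
NOTE
Editor's sketch, not part of the spoken lecture. The trade-off described above can be estimated numerically by retraining two extreme models on many random training sets and measuring, at one query point, how far the average prediction sits from the truth (squared bias) and how much the predictions scatter (variance). The target function, noise level, query point and both toy models are assumptions chosen only for illustration.
```python
import random
import statistics
#
def true_f(x):
    # Hypothetical ground-truth function, chosen only for this illustration.
    return x * x
#
def make_training_set(rng, n=20, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys
#
def predict_mean(xs, ys, x0):
    # Very rigid model: ignores x0 and always predicts the average target.
    return statistics.fmean(ys)
#
def predict_1nn(xs, ys, x0):
    # Very flexible model: copies the target of the single nearest point, noise included.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]
#
def bias2_and_variance(predict, x0=0.9, trials=500, seed=0):
    # Retrain on a fresh random training set each trial and record the prediction at x0.
    rng = random.Random(seed)
    preds = [predict(*make_training_set(rng), x0) for _ in range(trials)]
    bias2 = (statistics.fmean(preds) - true_f(x0)) ** 2
    return bias2, statistics.pvariance(preds)
#
rigid_bias2, rigid_var = bias2_and_variance(predict_mean)
flexible_bias2, flexible_var = bias2_and_variance(predict_1nn)
print("rigid model:    bias^2 = %.3f, variance = %.3f" % (rigid_bias2, rigid_var))
print("flexible model: bias^2 = %.3f, variance = %.3f" % (flexible_bias2, flexible_var))
```
Under these assumptions the rigid model should show the larger squared bias and the flexible model the larger variance, which is the pattern the lecture describes.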

06:28.420 --> 06:34.180
Now, this image shows clearly what bias and variance look like.

06:35.190 --> 06:43.890
So here in this image, you can see that the actual target value is the red one, but because the model

06:43.890 --> 06:51.480
is biased, it tries to predict something else and it completely misses the target.

06:52.200 --> 06:54.090
This is called high bias.

06:55.870 --> 07:00.170
That is, the model is not able to identify the correct target.

07:01.490 --> 07:11.450
Now, next is high variance. High variance is when the model actually knows what the target is, but

07:11.450 --> 07:18.950
it is trying so hard to capture the noise and different complexities that it does not hit the target

07:19.100 --> 07:21.250
but gets scattered around it.

07:22.210 --> 07:27.430
So the predicted value always keeps deviating from the actual target value.

07:27.740 --> 07:33.970
Sometimes it will be less than the target value and sometimes it will be greater than the target value, but it will

07:33.970 --> 07:36.740
rarely hit the actual target value.

07:36.760 --> 07:39.850
It will rarely predict the actual value.

07:40.330 --> 07:41.920
This is called high variance.

07:43.730 --> 07:54.870
And the optimal solution will be low variance and low bias, when the answers will be on target and will

07:54.870 --> 07:59.020
not have a lot of variance, that is they will be to the point.

08:00.470 --> 08:05.540
So what we want to achieve is low bias and low variance.

08:07.930 --> 08:21.160
Now, let us discuss ensemble learning methods. We talked about combining weak models together to create

08:21.160 --> 08:25.950
a stronger model, but we still don't know what ensemble learning is doing internally.

08:26.620 --> 08:33.550
So let us try to learn more about ensemble learning through the ensemble learning methods that

08:33.550 --> 08:34.030
we have.

08:38.500 --> 08:42.250
So bagging is the first ensemble method.

08:43.230 --> 08:51.070
What bagging does is create small models: model one, model two, model

08:51.090 --> 08:52.740
three, model four, model five.

08:53.070 --> 09:00.140
These are small models which will be created randomly.

09:00.360 --> 09:05.970
We will train these models individually and then have them vote.

09:06.860 --> 09:14.480
And from there, we will take the majority vote as the answer, or the average of the values

09:14.480 --> 09:18.090
as the answer to the problem.

09:18.440 --> 09:28.580
So the predicted value will be the average or the majority vote from these five independently created models,

09:28.760 --> 09:32.110
because these models are completely unrelated to each other.

09:32.270 --> 09:39.320
So we can generate these models in parallel and then simply combine them by taking a majority vote.

09:43.100 --> 09:44.380
Next is bagging.

09:45.340 --> 09:53.830
Bagging consists of building different parallel models. So in the case of bagging, we will create different

09:53.830 --> 10:03.220
parallel models, and each of these models has a different set of input samples, which has to be a unique

10:03.220 --> 10:03.460
one.

10:04.150 --> 10:06.520
Now, think about a simple decision tree.

10:07.210 --> 10:14.590
Now, when we are creating a decision tree, suppose I provide the same input values, the same output

10:14.590 --> 10:17.900
values, and the same parameters which I am giving to it,

10:17.920 --> 10:19.400
that is, the max depth of the tree.

10:19.660 --> 10:22.060
If all of these are given the same way,

10:23.200 --> 10:26.360
then the models which will be generated will be the same.

10:28.590 --> 10:37.150
So if the input data is the same for every decision tree, then the model which will be generated will be

10:37.160 --> 10:40.390
the same as the other decision trees.

10:40.920 --> 10:47.730
So if we are taking a vote, whether a majority vote or the average of the results

10:47.940 --> 10:52.130
from these decision trees, then it will be of no benefit.

10:54.030 --> 11:01.410
It is something like this: let's say we have five friends and all five friends have the same opinion.

11:02.400 --> 11:08.730
Then the answer which we will be getting would be the same from all five friends. So there is no

11:08.730 --> 11:14.390
need to ask all five friends to give us their answers if the answers are always going to be the

11:14.390 --> 11:15.170
same.

11:15.540 --> 11:22.440
So we want to have friends with differing opinions so that they can give us different ideas and different

11:22.440 --> 11:25.770
ideologies so that we can have a better outcome.

11:26.970 --> 11:33.840
This is the reason why we will be building these models with different datasets.

11:35.260 --> 11:43.150
We will be creating these models by selecting the data randomly, sampling the data at random,

11:43.570 --> 11:48.460
so that the models which are generated are unique models.

11:50.790 --> 11:57.720
And then the final result is generated by taking the average of these models' predictions

11:57.990 --> 12:00.690
or by the majority vote among them.
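
NOTE
Editor's sketch, not part of the spoken lecture. The bagging procedure described above can be written in a few lines: each model is trained on its own random sample drawn with replacement (a bootstrap sample), and the predictions are averaged. The base model (1-nearest-neighbour regression), the synthetic data and all function names are assumptions made for brevity; the lecture itself uses decision trees.
```python
import random
import statistics
#
def bootstrap_sample(rng, xs, ys):
    # Draw len(xs) rows with replacement, so each model sees a different sample.
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]
#
def fit_1nn(xs, ys):
    # Base model (an assumption for brevity): 1-nearest-neighbour regression.
    def predict(x0):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
        return ys[nearest]
    return predict
#
def fit_bagged(rng, xs, ys, n_models=25):
    models = [fit_1nn(*bootstrap_sample(rng, xs, ys)) for _ in range(n_models)]
    # Regression: average the predictions. For classification, take the majority vote instead.
    return lambda x0: statistics.fmean(m(x0) for m in models)
#
def prediction_variance(fit, trials=300, n=20, x0=0.5, seed=1):
    # Variance of the prediction at x0 across many independently drawn training sets.
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
        preds.append(fit(rng, xs, ys)(x0))
    return statistics.pvariance(preds)
#
single_var = prediction_variance(lambda rng, xs, ys: fit_1nn(xs, ys))
bagged_var = prediction_variance(fit_bagged)
print("single model variance: %.3f, bagged variance: %.3f" % (single_var, bagged_var))
```
With these settings the bagged predictor should vary noticeably less from one training set to the next than a single base model does.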

12:02.840 --> 12:11.820
Bagging is also useful when we want to create a model by decreasing the variance while keeping the bias

12:11.830 --> 12:21.530
the same. So bagging is usually used when we have models which are highly overfitted in nature.

12:22.220 --> 12:31.010
So when a model is highly overfitted in nature, then we want to decrease the variance error, which

12:31.010 --> 12:39.400
is why we will be using the bagging method, which decreases the variance while keeping the bias intact.

12:41.450 --> 12:48.770
It works this way because bagging is a kind of averaging technique, so it will take all the different

12:48.770 --> 12:55.850
values which it will get and all the variance which will be present in the different models, it will

12:55.850 --> 12:59.300
average that out and get a single target value.

13:01.450 --> 13:01.930
So.

13:03.010 --> 13:10.180
When we have high variance, the data points are scattered, but the bias is perfectly fine.

13:10.420 --> 13:17.710
So in this case, we will take an average of these values which will allow us to get the accurate answer.

13:18.370 --> 13:24.910
That is why we use the bagging method: it averages out and reduces the variance without

13:25.210 --> 13:27.220
affecting the bias of the model.

13:30.150 --> 13:38.790
But it does not help much with models which have high bias. Now suppose we apply the bagging method

13:39.060 --> 13:41.050
where we have high bias.

13:41.220 --> 13:47.280
So in that case, averaging the values at this particular location will not actually improve the

13:47.280 --> 13:54.870
accuracy, because the accuracy will improve only when the values move towards the target, that is, when the value

13:54.980 --> 13:56.880
of the bias reduces.

13:57.030 --> 14:05.070
So as bagging does not reduce the bias, it cannot be applied to this kind of problem. It can be

14:05.070 --> 14:13.020
applied to a problem where we have a high variance issue, not where we have a high bias issue.
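
NOTE
Editor's sketch, not part of the spoken lecture. The point above, that averaging removes scatter but not a systematic offset, can be checked with simulated predictions. The target value, the offset and the noise levels are arbitrary assumptions.
```python
import random
import statistics
#
rng = random.Random(0)
target = 10.0
# High variance, low bias: predictions scattered widely around the target.
high_variance = [target + rng.gauss(0.0, 3.0) for _ in range(200)]
# High bias, low variance: predictions tightly clustered around the wrong value.
high_bias = [target + 4.0 + rng.gauss(0.0, 0.2) for _ in range(200)]
#
avg_error_high_variance = abs(statistics.fmean(high_variance) - target)
avg_error_high_bias = abs(statistics.fmean(high_bias) - target)
print("error after averaging, high-variance models: %.2f" % avg_error_high_variance)
print("error after averaging, high-bias models:     %.2f" % avg_error_high_bias)
```
Averaging pulls the scattered predictions close to the target, while the biased predictions stay roughly four units away however many are averaged.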

14:15.820 --> 14:18.640
The next type of method is boosting.

14:19.850 --> 14:25.090
So what is boosting? Here, instead, we are creating sequential models.

14:25.320 --> 14:31.580
So first we will create model one, then model two, after that model three, then model four,

14:31.580 --> 14:32.510
then model five.

14:34.220 --> 14:40.670
Now, all of these models will get the X values as input.

14:42.680 --> 14:49.700
The first model will get X and Y as input, and it will try to predict the value Y.

14:51.560 --> 15:00.560
But the model being a weak model, it will not be able to identify all the patterns and it will simply

15:00.800 --> 15:03.650
predict something which is not really close to Y.

15:04.640 --> 15:10.060
So the prediction which will be made by this model one, let it be Y1.

15:11.120 --> 15:19.640
So now the error which we will have from model one will be Y minus Y1, which is the

15:19.640 --> 15:22.760
actual value, minus the predicted value.

15:23.150 --> 15:24.180
So this is the error.

15:24.830 --> 15:31.400
So for the next model, we will be getting the X values as input again.

15:31.670 --> 15:39.680
But this time, instead of predicting Y, we will be predicting the error value, which

15:39.680 --> 15:41.880
is the difference of Y and Y1.

15:42.650 --> 15:49.280
So now we will try to improve this model by simply predicting the difference between the actual value

15:49.490 --> 15:52.460
and the value which we are getting from this particular model.

15:53.480 --> 15:56.060
So we will try to improve that.

15:56.880 --> 16:01.080
So what will happen is that from model two, we will get another prediction, Y2,

16:01.260 --> 16:04.620
and that prediction will somewhat improve the

16:05.660 --> 16:12.950
value further. So now, what is the error which is left? It is Y minus

16:12.950 --> 16:14.310
(Y1 plus Y2).

16:14.810 --> 16:18.500
So there is a slightly smaller amount of error left.

16:19.040 --> 16:25.160
Similarly, model three will try to predict this value, Y minus (Y1 plus Y2).

16:26.090 --> 16:29.390
Then Model three will also learn a little more.

16:29.690 --> 16:35.210
A new pattern will be learned by model three, and this model will again improve the

16:36.420 --> 16:43.530
prediction, and now the error will be further reduced to Y minus (Y1 plus Y2 plus

16:43.530 --> 16:43.790
Y3).

16:45.850 --> 16:54.970
This same process will keep going, and finally we will have a model for which Y minus (Y1

16:54.970 --> 17:02.170
plus Y2 plus Y3 plus Y4 and so on), adding all the values, until this entire term becomes zero.

17:04.780 --> 17:11.470
Now, this entire term will shrink towards zero because we keep improving the model, one after the other.
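
NOTE
Editor's sketch, not part of the spoken lecture. The sequential scheme above (model one predicts Y, each later model predicts the error that is still left) can be written as a short loop. The weak learner here is a depth-one regression tree; the synthetic data and the names fit_stump and boost are assumptions for illustration, and practical boosting libraries add refinements such as a learning rate.
```python
import random
#
def fit_stump(xs, residuals):
    # Weak learner: a depth-1 regression tree (one threshold, two leaf means).
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm
#
def boost(xs, ys, n_models=5):
    models, residuals, history = [], list(ys), []
    for _ in range(n_models):
        model = fit_stump(xs, residuals)  # model k is trained on what is still unexplained
        models.append(model)
        residuals = [r - model(x) for x, r in zip(xs, residuals)]  # Y - (Y1 + ... + Yk)
        history.append(sum(r * r for r in residuals) / len(residuals))
    predict = lambda x: sum(m(x) for m in models)  # final prediction is Y1 + Y2 + ...
    return predict, history
#
rng = random.Random(0)
xs = [i / 50 for i in range(50)]
ys = [x * x + rng.gauss(0.0, 0.05) for x in xs]
predict, history = boost(xs, ys)
print(["%.4f" % mse for mse in history])
```
The printed list is the training error after each added model; it should never go up, which is also why boosting can drift towards overfitting if it is run for too many rounds.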

17:12.380 --> 17:18.330
Now let us see the difference between bagging and boosting. In the case of bagging, the models are created in parallel.

17:19.480 --> 17:22.060
In boosting, the models are created sequentially.

17:23.310 --> 17:27.960
In this boosting model, each model is trying to improve the previous model.

17:29.430 --> 17:36.480
In bagging, these models are independent and do not have any relation to each other. In bagging,

17:36.660 --> 17:39.050
we take the average or the majority vote.

17:42.400 --> 17:50.260
In boosting, we try to predict the error of the previous model, so each model is trying to predict

17:50.260 --> 17:52.540
the error of the previous model.

17:54.220 --> 18:00.800
What does bagging do? Bagging tries to reduce the variance of the smaller models.

18:01.480 --> 18:09.010
And what does boosting do? Boosting will try to reduce the bias of the models. Let us learn more

18:09.010 --> 18:09.790
about boosting.

18:10.270 --> 18:16.240
So boosting consists of building different sequential models one after another.

18:18.090 --> 18:24.910
Each model has the same X as input, and the first model predicts the value Y.

18:25.710 --> 18:34.680
Then after that, each model predicts the error value left from the previous model until the error is

18:34.920 --> 18:35.420
zero.

18:36.510 --> 18:40.290
Now, this particular error value will be driven down towards zero.

18:40.320 --> 18:47.350
Now, this method is actually used to decrease the bias and build a strong predictive model.

18:48.060 --> 18:51.450
So it may sometimes overfit on the training data.

18:51.660 --> 18:52.500
Now, look at it.

18:52.680 --> 18:55.410
We are trying to reduce the error more and more.

18:55.410 --> 19:04.110
And so it will keep creating more models so that it can bring the error value down to

19:04.350 --> 19:04.450
zero.

19:05.310 --> 19:11.910
But while we are reducing the error value to zero, we are somehow moving towards overfitting.

19:14.090 --> 19:14.570
So.

19:15.710 --> 19:16.910
In this particular

19:19.320 --> 19:23.580
figure, you can see we have high bias. So when we are using the

19:24.640 --> 19:33.530
boosting method, it tries to improve the value of the prediction slowly so that it moves towards the target.

19:37.600 --> 19:43.810
So for each iteration, boosting updates the weights of the samples,

19:45.150 --> 19:51.990
so that samples that are misclassified by the ensemble can have a higher weight.

19:52.290 --> 19:59.100
So what will happen when we are making some predictions and there are some values which are misclassified

19:59.100 --> 19:59.880
by this model?

20:00.090 --> 20:06.690
So the next model will make sure that the values which were misclassified by Model one have a higher

20:06.690 --> 20:13.920
weight in this particular situation, so that this model will actually try to improve those wrong predictions.

20:15.930 --> 20:22.610
So the samples that are misclassified by the ensemble can have a higher weight and therefore a higher

20:22.620 --> 20:27.020
probability of being selected for training the new classifier.
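
NOTE
Editor's sketch, not part of the spoken lecture. The lecture does not name the exact rule used to raise the weights of misclassified samples, so the update below uses the standard AdaBoost formula as an assumption. The function name reweight and the ten-sample example are hypothetical.
```python
import math
import random
#
def reweight(weights, misclassified):
    # AdaBoost-style update (an assumption; the talk does not name the exact rule).
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1.0 - err) / err)  # requires 0 < err < 1
    raw = [w * math.exp(alpha if wrong else -alpha) for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return [w / total for w in raw], alpha
#
weights = [0.1] * 10
misclassified = [False] * 8 + [True] * 2  # model 1 got the last two samples wrong
new_weights, alpha = reweight(weights, misclassified)
print([round(w, 4) for w in new_weights])
# Higher weight means a higher chance of being drawn into the next model's training sample.
resampled = random.Random(0).choices(range(10), weights=new_weights, k=10)
print(resampled)
```
In this example each misclassified sample moves from a weight of 0.1 to 0.25 and each correctly classified one drops to 0.0625, so the two hard samples together carry half of the total weight for the next model.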

20:28.050 --> 20:28.400
OK.

20:30.610 --> 20:38.850
Now, bagging will mainly focus on getting an ensemble model with less variance than its components.

20:39.460 --> 20:45.460
So bagging will try to reduce the variance, while with boosting and stacking,

20:45.580 --> 20:51.840
we mainly try to produce a stronger model which is less biased than its components.

20:52.480 --> 20:56.410
And if the variance can also be reduced, then it will try to reduce that as well.

20:56.890 --> 21:02.920
So boosting will try to reduce both the bias and the variance, but mainly the bias.

21:03.070 --> 21:07.510
While bagging will try to reduce the variance while keeping the

21:09.110 --> 21:10.370
bias the same.

21:12.060 --> 21:16.800
Now, we just mentioned an algorithm called stacking here, but we don't know what it is.

21:16.830 --> 21:18.580
So let's discuss stacking.

21:19.230 --> 21:20.460
So what is stacking?

21:20.760 --> 21:27.780
Stacking allows us to create a linear combination of multiple nonlinear models.

21:29.470 --> 21:36.040
So what are nonlinear models and what are linear models? We discussed both linear regression and logistic

21:36.040 --> 21:36.620
regression.

21:36.790 --> 21:40.710
So both of these models are actually linear models.

21:41.020 --> 21:50.770
And when we talk about decision trees, random forests, or the bagging and boosting algorithms,

21:50.980 --> 21:54.640
these algorithms are called nonlinear models.

21:54.910 --> 22:02.000
Any model which is trying to create a nonlinear relationship is called a nonlinear model.

22:02.380 --> 22:08.820
So stacking creates a hierarchy of models using the outputs from the previous models.

22:10.290 --> 22:14.190
So stacking will try to combine different models.

22:15.390 --> 22:19.160
How it does that, we will look at after some time.
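
NOTE
Editor's sketch, not part of the spoken lecture. One common form of the stacking idea described above: two nonlinear base models are trained first, and a linear meta-model then learns how to weight their outputs on held-out data. The choice of k-nearest-neighbour base models, the no-intercept linear meta-model and the synthetic data are assumptions for illustration.
```python
import random
#
def fit_knn(xs, ys, k):
    # Nonlinear base model: average the targets of the k nearest training points.
    def predict(x0):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
        return sum(ys[i] for i in nearest) / k
    return predict
#
def fit_linear_meta(p1, p2, ys):
    # Least squares for y ~ w1*p1 + w2*p2 (no intercept), via the 2x2 normal equations.
    a = sum(u * u for u in p1)
    b = sum(u * v for u, v in zip(p1, p2))
    d = sum(v * v for v in p2)
    e = sum(u * y for u, y in zip(p1, ys))
    f = sum(v * y for v, y in zip(p2, ys))
    det = a * d - b * b
    return (e * d - b * f) / det, (a * f - b * e) / det
#
def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)
#
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [x * x + rng.gauss(0.0, 0.1) for x in xs]
# Level 0: base models are trained on the first half of the data.
base = [fit_knn(xs[:100], ys[:100], k) for k in (1, 15)]
# Level 1: the meta-model is trained on base-model outputs for the held-out second half.
p1 = [base[0](x) for x in xs[100:]]
p2 = [base[1](x) for x in xs[100:]]
w1, w2 = fit_linear_meta(p1, p2, ys[100:])
stacked = [w1 * u + w2 * v for u, v in zip(p1, p2)]
print("weights:", round(w1, 3), round(w2, 3))
print("MSE  1-NN: %.4f  15-NN: %.4f  stacked: %.4f"
      % (mse(p1, ys[100:]), mse(p2, ys[100:]), mse(stacked, ys[100:])))
```
The stacked error is measured on the same rows the meta-model was fitted on, so it is optimistic; a fair comparison would need a third, untouched split.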

22:22.450 --> 22:29.890
So first of all, we need to understand a few important things before we actually dig into the bagging

22:29.890 --> 22:31.210
and boosting models.

22:32.650 --> 22:37.660
So let us understand what base models are and what weak learners are.

22:38.570 --> 22:47.420
So we are talking about the models which are being used here, these small models, which we have just

22:47.420 --> 22:54.440
discussed in the case of bagging and boosting. These are called the base models, or weak learners.

22:54.680 --> 22:59.360
Now, what are these? They are the building blocks for designing more complex models.

22:59.990 --> 23:07.280
And they do not perform well because they have either high bias or too much variance, because these

23:07.280 --> 23:10.000
are very basic and very small models.

23:10.160 --> 23:16.880
So they will not be having any complex nature and hence they will either have a very high bias or they

23:16.880 --> 23:18.730
will have very high variance.

23:19.250 --> 23:27.350
So the ensemble method will try to reduce the bias or the variance of such weak learners by combining several

23:27.350 --> 23:27.720
of them.

23:28.430 --> 23:35.510
So we will take a lot of small models which will have either high bias or high variance, and

23:35.510 --> 23:41.670
then we will try to reduce the variance or reduce the bias by using the boosting or bagging method.

23:42.990 --> 23:51.210
Now, these small models will be helpful because they will help in creating stronger models, or ensemble

23:51.210 --> 23:54.330
models, that achieve better performance.

23:54.510 --> 24:01.710
So when we create a single strong model, what happens is that the model might overfit, or it might try to

24:01.710 --> 24:04.250
capture overly complex patterns from the data.

24:04.380 --> 24:07.980
But we do not want to capture such complex patterns.

24:08.130 --> 24:16.290
And hence what we do is create weak learners, or base models, and then we try to combine

24:16.290 --> 24:23.670
them using the ensemble learning methods, which allow us to reduce the bias or variance which

24:23.670 --> 24:28.640
those small models have, without causing the problems

24:28.650 --> 24:33.550
which we were having when creating such complex models, which did not work that well.

24:35.080 --> 24:45.430
So we have a better option of using ensemble learning, which allows us to use small models which are

24:45.430 --> 24:54.280
very less complex in nature and easy to create and simply combine them so that we can get strong models.

24:58.220 --> 25:07.730
Now, what models can be weak learners? A linear model which has very high regularization can be

25:07.750 --> 25:15.410
a weak learner. For example, a linear model which has a very strong regularization penalty applied

25:15.410 --> 25:23.250
to it can work as a weak learner. Another weak learner is a linear model with a subset of variables.

25:23.390 --> 25:31.160
So a linear model which does not use all the available features, but is created from only a small subset of

25:31.160 --> 25:33.810
variables, can be used as a weak learner.

25:34.760 --> 25:40.170
Or we can use a decision tree which is very shallow or is just a stump.

25:40.370 --> 25:42.040
That is, a tree with a depth of one.

25:42.590 --> 25:43.910
So we can use that.

25:47.270 --> 25:54.900
Now, why do we use weak learners? We use weak learners because they cannot learn the noise

25:54.900 --> 25:57.550
in the data, and hence they cannot overfit.

25:58.430 --> 26:03.640
Now, a combination of these weak learners can capture a general pattern.

26:04.010 --> 26:05.930
So let us try to understand this.

26:07.210 --> 26:16.030
So let us consider a decision tree which we are creating. Suppose this decision tree is a strong decision

26:16.030 --> 26:21.530
tree and it has a depth of, let us say, six or seven, or even 12.

26:21.760 --> 26:30.370
So this decision tree will have learned from a lot of features and it will have also learned from the

26:30.520 --> 26:32.560
noise which is present in the data.

26:33.720 --> 26:43.710
Now, because the decision tree has a high depth, it will have a certain bias or a certain variance

26:43.710 --> 26:50.690
to it; some amount of what it has learned will have come from the noise.

26:51.090 --> 26:56.190
Now, the noise will cause the model to make wrong predictions.

26:57.550 --> 27:06.280
But when we create small weak learners, what will happen is that these small weak models will not know about

27:06.280 --> 27:08.090
the complexities of the data.

27:08.500 --> 27:12.300
They will not have learned the complex patterns of the data.

27:12.460 --> 27:17.500
So the weak learners will have learned small, generalized patterns.

27:18.340 --> 27:24.940
Now, when we have a hundred or five hundred such weak learners, they all would have learned something

27:24.940 --> 27:33.460
different and yet something which is very similar. Now, because the models have learned things which

27:33.460 --> 27:34.710
are similar also.

27:34.930 --> 27:35.350
So.

27:36.540 --> 27:44.220
It will allow us to create a generalized model, because they all would have learned the

27:44.230 --> 27:52.770
prominent features from the data but neglected the impact of the noise which would have come into the

27:52.980 --> 27:54.900
final decision if they were larger.

27:55.740 --> 28:03.090
So a very large decision tree might have got impacted due to some extra features, some noisy features

28:03.300 --> 28:05.510
or due to some noisy data.

28:05.730 --> 28:12.920
Some of these smaller decision trees may have been impacted, but not all of them would be.

28:13.110 --> 28:20.070
So because of that, because we will be taking an average or a majority

28:20.070 --> 28:20.550
vote,

28:20.610 --> 28:29.340
those small wrong learnings which these models have picked up will be largely neglected.

28:30.480 --> 28:32.880
So because they will be neglected,

28:33.060 --> 28:37.740
the combination which will be learned would be a general pattern.

28:38.640 --> 28:43.890
So as a result, it will not have captured the noise.

28:43.920 --> 28:50.430
It will not have learned the noise, and it will be largely ignoring the noise.

28:51.210 --> 28:58.320
So this is the reason why we use weak learners, because the weak learners will not learn a lot from

28:58.320 --> 29:00.440
the noise or a lot from the features.

29:00.600 --> 29:08.760
And when we combine them together, they kind of diminish the impact of the noisy features or noisy

29:08.940 --> 29:09.770
variables.
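
NOTE
Editor's sketch, not part of the spoken lecture. The claim above, that individual mistakes wash out in a majority vote, can be quantified under one idealized assumption: the weak learners err independently of each other. Each learner is assumed to be right 60% of the time; real learners trained on overlapping data are correlated, so the gain is smaller in practice.
```python
import math
#
def majority_vote_accuracy(p, n):
    # Probability that more than half of n independent voters are right,
    # when each one is right with probability p (n is assumed to be odd).
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))
#
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```
The accuracy of the vote rises steadily with the number of independent weak learners, even though each one on its own is only slightly better than a coin flip.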

29:11.540 --> 29:18.710
In the next session, we will begin with the implementation of bagging and boosting algorithms, so we will

29:18.710 --> 29:23.420
learn one example of a bagging algorithm and another example of a boosting algorithm.

29:23.630 --> 29:26.360
So let us look at that in the next session.