WEBVTT

00:01.300 --> 00:09.200
Hi. In the last session, we discussed the boosting algorithm, and we also discussed AdaBoost.

00:09.460 --> 00:12.240
So let me summarize AdaBoost.

00:12.610 --> 00:15.940
It works on a similar method as discussed above.

00:15.940 --> 00:21.850
It fits a sequence of weak learners on differently weighted training data.

00:22.870 --> 00:29.620
So basically, we will be creating small rules and then combining all of those small rules together.

00:30.010 --> 00:37.030
Now it starts by predicting on the original data set and gives an equal weightage to all the observations.

00:37.300 --> 00:46.030
When a particular observation is misclassified, it gives a higher weight, or more attention, to that particular

00:46.930 --> 00:49.560
value, to those particular rows of data.

00:49.900 --> 01:01.030
And then it will iterate again and continue to learn to correct those rows of data which were incorrectly

01:01.030 --> 01:01.600
predicted.

01:02.700 --> 01:11.290
Now, this will keep on continuing until a limit is reached in the number of models or the accuracy.

01:11.520 --> 01:19.500
So if we reach the maximum number of models, or if we reach the limit of accuracy, at that point

01:19.500 --> 01:28.260
AdaBoost stops. Now, mostly we use decision tree stumps with AdaBoost, but we can also

01:28.260 --> 01:38.560
use other machine learning algorithms as base learners if they accept weights on the training data set. So we can use the AdaBoost

01:38.580 --> 01:42.210
algorithm for both classification and regression.
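
NOTE
The reweighting idea summarised above can be sketched as follows. This is a minimal
illustration, not the lecturer's code: it assumes the standard discrete AdaBoost
update, which the lecture does not spell out, and the function name is hypothetical.
```python
import math
def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style round: raise the weight of misclassified rows, then renormalise."""
    # Weighted error rate of the current weak learner.
    err = sum(w for w, m in zip(weights, misclassified) if m) / sum(weights)
    # Learner importance (alpha); assumes 0 < err < 1.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified rows are scaled up, correctly classified rows are scaled down.
    scaled = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled], alpha
weights = [0.25, 0.25, 0.25, 0.25]  # equal weightage to start with
new_weights, alpha = adaboost_reweight(weights, [False, False, False, True])
```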

01:42.720 --> 01:49.260
So this is about AdaBoost, and AdaBoost is one of the very basic algorithms which we have

01:49.530 --> 01:51.820
for the boosting algorithms.

01:52.020 --> 01:55.200
It's like a very basic go-to for the boosting algorithms.

01:55.950 --> 01:59.580
So now we will go ahead with the mathematics behind

02:01.070 --> 02:04.820
the boosting algorithms and try to understand that.

02:06.010 --> 02:14.380
Now, the math behind the boosting algorithm mainly consists of three important steps. The first

02:14.380 --> 02:15.410
step:

02:16.500 --> 02:24.960
an initial model, F0, is defined to predict the target variable y. F0 is our very

02:24.960 --> 02:35.000
first model, the very first weak learner, and this model will be associated with a residual.

02:35.190 --> 02:36.560
That is the error.

02:36.930 --> 02:38.990
It is also known as the residual.

02:39.330 --> 02:45.750
So it will be y minus F0, because y is what we were trying to predict

02:45.930 --> 02:53.760
and F0 is the prediction which we have got from this particular first model, the zeroth model.

02:54.060 --> 02:55.410
That is the very beginning.

02:56.250 --> 02:59.640
Now a new model, h1, will be generated.

02:59.880 --> 03:04.470
Now this model is fit to the residual from the previous step.

03:04.680 --> 03:08.190
So now the target value will actually be y minus

03:08.200 --> 03:08.820
F0.

03:09.920 --> 03:19.130
So now, F0 and h1 are combined to give F1; that is, the F1 model is created by combining

03:19.130 --> 03:27.350
the h1 model and the F0 model. The h1 model, h2 model, h3 model, these are all

03:27.350 --> 03:35.640
the weak learner models after the very first model, F0. So F0 is our very first model.

03:35.660 --> 03:44.300
So if we have a look at the diagram, then this model will be the F0 model and all the other next

03:44.300 --> 03:50.140
models will be h1, h2, h3, h4.

03:50.330 --> 03:55.980
Those are intermediate models, and the combination of these models,

03:56.150 --> 04:05.810
so the first model plus the second model, that is F0 plus h1, will be F1.

04:06.990 --> 04:20.550
Now, F0 plus h1 plus h2 will be F2. Similarly, F0 plus h1 plus h2 plus h3

04:20.760 --> 04:22.540
will be the F3 model.

04:22.710 --> 04:25.090
So this is how the naming convention would go.

04:25.740 --> 04:27.200
So let us go further.

04:28.380 --> 04:29.640
So here you can see

04:31.720 --> 04:40.630
we have F0 and h1, which combine to give F1, the boosted version of F0. The mean

04:40.630 --> 04:46.420
squared error from F1 will be lower than that from

04:47.080 --> 04:48.350
F0. Why?

04:48.640 --> 04:54.640
Because F0 has one decision stump and h1 has another decision stump.

04:54.880 --> 05:02.110
So F0 will have predicted some value; F0 would have given some residual value, that is y minus

05:02.110 --> 05:06.580
F0. Now, h1 will also have predicted some value.

05:07.180 --> 05:10.990
h1 should also have given some improvement in the prediction.

05:11.200 --> 05:15.590
So that improvement, let us say, is some other value.

05:15.760 --> 05:23.800
So when we combine this with this, the resulting residual, the resultant error value will be lower

05:23.800 --> 05:26.610
in comparison to the previous value.

05:26.800 --> 05:31.350
If it is not lower, then we will not consider that particular stump.

05:31.570 --> 05:37.160
So we will not consider that particular model if it actually increased the error,

05:37.390 --> 05:42.850
because that is completely opposite to the target of the modelling process which we are following.

05:44.150 --> 05:51.410
So now, to improve the performance of the F1 model, which we just created here by including one model,

05:51.710 --> 05:55.000
one decision stump, into the original decision stump,

05:55.310 --> 06:04.580
we will again add another model, another decision stump, to this particular

06:04.580 --> 06:09.390
model, which already has its own residual, the residual of F1.

06:09.770 --> 06:17.540
So now we will create a new model, F2. This new model, F2, will be the combination of the original

06:17.540 --> 06:19.310
F1 plus h2.

06:19.520 --> 06:20.620
So what would this be?

06:20.630 --> 06:23.480
It would be F0 plus h1 plus h2.

06:23.750 --> 06:28.100
So now again, we will expect the error to reduce further.

06:28.250 --> 06:34.040
So like this, we add another one, and then another one, h3 and h4 and so on.

06:34.290 --> 06:40.580
So like this, we are expecting the residual to decrease slowly and gradually.

06:41.570 --> 06:47.980
We do not want the residual to decrease very drastically.

06:48.200 --> 06:51.560
We want it to reduce slowly and gradually.

06:51.560 --> 06:55.990
We want each stump to learn only some part of the pattern.

06:56.420 --> 07:02.290
We don't want this stump to be a large stump, or we don't want it to be a large decision

07:02.310 --> 07:08.570
tree. We want it to be only a small weak learner so that it could improve slowly.

07:09.680 --> 07:12.920
Now, this can be done for m iterations.

07:12.970 --> 07:21.640
So after m iterations, the final model, F of m, will be a combination of F of m minus one plus h of m,

07:22.040 --> 07:28.310
and this would be continued until the residual has been minimized as much as possible.
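
NOTE
The recursion F_m = F_(m-1) + h_m described above can be sketched as follows. This is
an illustrative sketch under assumptions, not the lecturer's code: F_0 is taken to be
the mean of y, each h_m is a one-split stump on a single feature, and the
learning_rate shrinkage, the helper names and the toy data are all hypothetical.
```python
def fit_stump(xs, targets):
    """Best single-split regression stump on 1-D data, chosen by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean
def boost(xs, ys, rounds, learning_rate=0.5):
    """F_0 is the mean of y; each h_m is a stump fit to the residual y - F_(m-1)."""
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(xs)
    mse_history = [sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)]
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        # F_m = F_(m-1) + learning_rate * h_m
        preds = [p + learning_rate * h(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, mse_history
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
preds, mse_history = boost(xs, ys, rounds=5)
```
In this sketch the training error in mse_history never goes up from one round to the
next, which matches the slow and gradual reduction of the residual described above.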

07:29.620 --> 07:37.960
Now, here, the additive learners do not disturb the functions created in the previous steps, so F2 will

07:37.960 --> 07:45.490
not impact h2 or F1, and F1 will not impact h1 or F0. These are independent in this

07:45.490 --> 07:46.370
particular nature.

07:46.390 --> 07:48.410
The only thing is we are just combining.

07:48.730 --> 07:51.430
We are just adding something on top.

07:53.130 --> 07:54.120
So instead,

07:55.310 --> 07:59.960
they impart information of their own to bring down the residuals.

08:04.580 --> 08:11.900
Now, let us consider that the prediction, y hat, at iteration t is capital F of t.

08:13.470 --> 08:23.580
So this will be equal to a function of x. So the prediction at any particular iteration is capital F of t, which

08:23.580 --> 08:32.610
is the final combined model, that is, at any iteration, the combination of F0 and h1, h2 and so on.

08:33.840 --> 08:43.470
Meanwhile, the small weak learners are each called small f of i. These will be added

08:43.470 --> 08:44.460
from zero to

08:45.510 --> 08:53.550
whatever iteration we are at. So we are saying that we will have small f0, small f1, small

08:53.550 --> 08:58.870
f2 and so on, to give us a larger capital F value, something like that.

08:59.370 --> 09:03.180
So these will be the small weak learners which we will be adding,

09:03.190 --> 09:07.750
and this is the final strong learner which we will be getting.

09:08.310 --> 09:12.310
So the loss value, the value of the error.

09:12.570 --> 09:14.520
So how do we calculate errors?

09:14.730 --> 09:20.520
The errors are simply calculated as the actual value minus the predicted value.

09:21.870 --> 09:29.760
So the predicted value will be capital F of t of x, so the loss value will be this particular

09:29.760 --> 09:33.010
value J. J is also known as the loss function.

09:33.300 --> 09:36.020
So loss function or cost function.

09:36.210 --> 09:39.360
So the cost will be the summation of the entire loss.

09:41.250 --> 09:48.890
So loss: this is the loss function. The loss function is, at any point of time, what is the error?

09:49.850 --> 09:58.640
So the error will be y minus F of t, like this, where F of t of x is the predicted value at any

09:58.640 --> 09:59.370
point of time.

09:59.840 --> 10:06.800
So when we subtract the predicted value from the target value, it is called

10:08.040 --> 10:09.750
the error. And when we

10:10.900 --> 10:13.340
do a summation of these errors,

10:13.780 --> 10:14.760
it is called the

10:16.220 --> 10:16.880
cost.

10:17.920 --> 10:25.900
So this is the cost function. Now we are trying to find out the change in cost with respect to the model:

10:25.900 --> 10:28.560
at every step, how much the cost is changing.

10:29.620 --> 10:35.890
So how we need to improve the model at each and every iteration, that is what we are trying to find

10:35.890 --> 10:36.030
out.

10:36.040 --> 10:44.980
So it will be del J differentiated with respect to F of t of x, because here the only thing which is

10:44.980 --> 10:47.890
changing is F, the function F.

10:48.960 --> 10:57.000
Right? x and y are already constant, so the only thing that changes is F of t. So we will differentiate with

10:57.000 --> 10:58.770
respect to F of t.

10:59.160 --> 11:08.940
So del J by del F of t of x comes out as the summation of the loss function of y_i and F of t of x_i,

11:10.190 --> 11:15.080
differentiated with respect to F of t of x. Now,

11:16.520 --> 11:18.610
the change which should be brought

11:19.600 --> 11:23.770
to a particular model is this particular value.

11:24.930 --> 11:33.700
So the next model which would be created: the F at any particular iteration is F

11:33.700 --> 11:38.130
of t right now, and F of t plus one,

11:38.790 --> 11:47.160
the next model, the next complete model, will actually be the sum of this current model plus the

11:47.160 --> 11:49.500
change which we want to bring to that model.

11:50.420 --> 11:59.750
So small f of t plus one is equal to minus nu into del J by del F of t, which is this particular

11:59.750 --> 12:08.970
del, the change which we want to bring to our model with respect to the entire function, into a constant

12:09.050 --> 12:14.000
nu, that is the learning rate, how fast we want to grow our model.

12:14.330 --> 12:19.150
So this will be the next weak learner which needs to be added.

12:19.940 --> 12:28.760
So the next final model will actually be the previous model, the previous final model with all

12:28.760 --> 12:35.550
the layers, plus the new model, the new weak learner stump which has been added to it.

12:36.200 --> 12:38.150
So this is capital F of t plus one.

12:38.360 --> 12:39.970
That is the final model

12:41.140 --> 12:48.860
at iteration t plus one, and this is the final model at iteration t,

12:50.150 --> 12:53.570
to which we add another stump, small f of t plus one.

12:55.350 --> 12:59.530
So you can see that this is the final model as of now.

12:59.850 --> 13:07.080
Now, when we add another small model to it, when another stump is added to it, the capital

13:07.080 --> 13:16.300
F of t plus two will actually be equal to capital F of t plus one plus small f of t plus two.

13:16.590 --> 13:25.080
So every time we add a small stump to this particular model, the strong model will keep

13:25.080 --> 13:25.920
on growing.

13:27.300 --> 13:33.310
So let us apply regression to it. When we apply regression to this particular model,

13:33.600 --> 13:39.270
the loss function for regression is the sum of squared errors.

13:40.560 --> 13:48.330
So the loss function will be y_i minus F of t of x_i, whole squared: the square of the error.

13:49.230 --> 13:56.400
Now, when we differentiate it with respect to F of t (with the usual factor of one half in the loss), we will get negative of y_i minus F of t.

13:57.900 --> 14:03.180
This, multiplied by minus nu, will be equivalent to small f of t plus one.

14:04.760 --> 14:12.450
Small f of t plus one is nothing but the change which we will be making to the previous one.

14:13.230 --> 14:20.670
So this multiplied with the learning rate gives the next improved model, the next stump, which we

14:20.670 --> 14:21.500
will be adding.

14:21.900 --> 14:24.690
So the next stump will be giving us

14:25.790 --> 14:35.090
this particular learning rate into y_i minus F of t of x_i. So this is the improvement which we will

14:35.090 --> 14:37.260
be bringing to our model.
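
NOTE
The regression step above can be checked numerically. This is a small sketch under
assumptions: the loss is written as half the squared error so that the derivative is
exactly -(y - F), and the function names and sample numbers are illustrative only.
```python
def squared_loss(y, f):
    """Half squared error, so the derivative carries no factor of two."""
    return 0.5 * (y - f) ** 2
def squared_loss_gradient(y, f):
    """dL/dF = -(y - F)."""
    return -(y - f)
def next_weak_learner_target(y, f, learning_rate):
    """f_(t+1) = -learning_rate * dL/dF = learning_rate * (y - F_t)."""
    return -learning_rate * squared_loss_gradient(y, f)
y, f_t, eps = 3.0, 2.2, 1e-6
# Central finite difference as an independent check of the analytic gradient.
numeric = (squared_loss(y, f_t + eps) - squared_loss(y, f_t - eps)) / (2 * eps)
update = next_weak_learner_target(y, f_t, learning_rate=0.1)
```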

14:38.190 --> 14:40.160
Now, let us have a look further.

14:40.430 --> 14:45.570
Now, let us say we are working with classification.

14:45.920 --> 14:47.500
This was the case of regression.

14:47.510 --> 14:55.040
So the loss function for regression is the difference between the values, the actual value minus the

14:55.040 --> 14:59.810
predicted value, whole squared, to give the loss for the regression problem.

15:00.140 --> 15:08.120
Now, for the classification problem, the model's predicted probability p will be one upon

15:08.120 --> 15:10.550
one plus e to the power minus F of t.

15:11.830 --> 15:23.140
So the loss function for this will be equal to minus of y_i into log of p, plus one minus y_i into

15:23.260 --> 15:25.170
log of one minus p.

15:25.450 --> 15:32.470
You remember the likelihood equation which we had. The likelihood equation stated that it is p to the power

15:32.810 --> 15:36.010
y_i, multiplied by one minus p

15:40.810 --> 15:43.460
to the power one minus y_i.

15:43.960 --> 15:50.710
So when we take the log of this entire thing, we get y_i into log of p

15:53.230 --> 15:57.040
plus one minus y_i into log of one minus p.

15:57.370 --> 16:06.250
So when we solve this equation for the loss, we get log of one plus e to the power F of t of x, minus y_i into F

16:06.250 --> 16:06.670
of t.

16:07.390 --> 16:13.070
Now, when we differentiate this with respect to F of t of x, what do we get?

16:13.330 --> 16:14.230
We get

16:15.400 --> 16:26.260
minus of y_i minus one upon one plus e to the power minus F of t, which is again in the form of minus

16:26.260 --> 16:29.860
of y_i minus p.

16:31.530 --> 16:36.090
That is because this term is p. If we multiply this by minus nu,

16:38.440 --> 16:44.950
what do we get? We get nu into y_i minus p for this next step.

16:45.220 --> 16:47.560
So what is the next step for the model?

16:47.580 --> 16:49.900
What is the next weak learner which we will be getting?

16:50.230 --> 16:51.340
The next decision

16:51.340 --> 16:55.690
stump will be small f of t plus one, equal to the learning

16:55.690 --> 16:57.800
rate into y_i minus p.

16:57.940 --> 17:01.180
Now, if we compare, does this not look similar?

17:01.210 --> 17:02.760
This is just a similar thing, right?

17:03.010 --> 17:06.410
No matter whether it is regression or classification,

17:06.600 --> 17:10.000
the improvement which we need to bring is just the same.

17:11.860 --> 17:19.660
So what are the different issues with GBMs? Now, let us consider the different issues with GBM. So these

17:19.660 --> 17:21.890
are the different values.

17:21.940 --> 17:24.000
This is the model which we have generated.

17:24.190 --> 17:25.900
So this is how we are improving.

17:26.050 --> 17:30.210
So what we will be doing is we will be having a weak learner.

17:30.220 --> 17:34.420
We will create a weak learner, small f zero.

17:36.400 --> 17:36.880
So.

17:38.200 --> 17:48.520
This will be small f of zero. We will add another one, small f of one, to it, which will combine

17:48.520 --> 17:51.070
to form capital F of one.

17:52.490 --> 17:57.250
Right, then we will have let me draw this for you.

17:58.870 --> 18:08.560
So here you can see that initially we have the weak learner, which is small f0. We add another model, another

18:08.560 --> 18:15.460
weak learner stump, small f1, to it, and the model at that particular iteration, which is a combination of

18:15.460 --> 18:18.490
these two models, is capital F1.

18:19.400 --> 18:27.830
Now, when we add to the capital F1 another stump, which is small f2, we get the capital F2 model,

18:28.580 --> 18:35.450
which is the combination of all three models; that is the capital

18:35.450 --> 18:36.320
F2 model.

18:36.530 --> 18:39.650
Then we add another weak learner, another small f3.

18:39.920 --> 18:43.670
We get the capital F3 model, and it goes on the same way.

18:48.960 --> 18:56.310
So here you can see that we have this small f0 model; we add another small f1 model to it

18:56.580 --> 19:05.370
and we get the capital F1. When we add capital F1 and small f2, we get capital F2, and so on.

19:07.290 --> 19:12.450
So this is the value of small f of t plus one.

19:12.690 --> 19:20.640
So the next model will actually be the current model plus a small change given by the

19:22.170 --> 19:25.320
function which we have, a small change in the loss function,

19:26.240 --> 19:29.290
which will give the next model's weak learner.

19:32.990 --> 19:41.740
So this is what GBM is. Now, what is GBM? GBM is the gradient boosting machine, which is what we have discussed

19:41.810 --> 19:42.090
now.

19:42.800 --> 19:45.500
So what is the issue with this?

19:45.800 --> 19:50.630
The first issue is that the loss function does not consider model complexity.

19:51.690 --> 19:58.470
Here we have not considered whether our model, the weak learner which we have considered, is actually weak

19:58.470 --> 20:05.370
or not. What would happen in case this weak learner actually has a higher depth and is

20:05.370 --> 20:07.000
not really a weak learner?

20:08.700 --> 20:16.070
It tends to overfit, as it does not know where to stop, because we don't know where this small f of t plus

20:16.170 --> 20:20.480
one should stop being created. So we don't really know where this should stop.

20:21.240 --> 20:28.280
And next, it is not generalized in nature, so it will not know when it actually starts to overfit.

20:29.680 --> 20:37.300
So for this, we have another solution, which is XGBoost, which is extreme gradient boosting

20:37.300 --> 20:37.780
machines.

20:39.420 --> 20:50.030
Now, XGBoost was introduced in 2014. XGBoost has been widely lauded as the holy grail of machine learning

20:50.040 --> 20:51.590
algorithms and competitions.

20:51.840 --> 20:59.400
So in case you have visited the website Kaggle, you will have seen that there are a lot of competitions

20:59.400 --> 21:07.320
which are being held for machine learning, for beginner to intermediate level people, who actually

21:07.320 --> 21:12.040
participate in different competitions and showcase their skills.

21:12.600 --> 21:17.710
So the go-to algorithm for them is XGBoost.

21:18.120 --> 21:25.950
So if you see, a large share of the winning models would have been trained on XGBoost in case of any of these or

21:25.950 --> 21:27.430
any such competitions.

21:28.170 --> 21:35.820
So from predicting ad click-through rates to classifying high energy physics events, XGBoost has proven

21:36.060 --> 21:39.510
its mettle in terms of performance and speed.

21:39.870 --> 21:47.910
So XGBoost has very good performance and a high speed for predicting and for training as well.

21:49.020 --> 21:57.330
Now, the execution speed: generally XGBoost is fast, really fast, when compared to other implementations

21:57.330 --> 21:58.490
of gradient boosting.

21:58.710 --> 22:04.310
But the more recently introduced LightGBM is often faster than XGBoost.

22:04.620 --> 22:10.770
So there is another variant of boosting, which is LightGBM, which is often faster than XGBoost.

22:11.250 --> 22:21.690
Now the model performance: XGBoost dominates structured or tabular data sets on classification

22:21.690 --> 22:23.570
and regression predictive modeling problems.

22:24.000 --> 22:31.230
So the evidence is that it is the go-to algorithm for competition winners on the competitive data science

22:31.230 --> 22:31.750
platform Kaggle.

22:32.730 --> 22:40.920
So in case you are not looking for an interpretable model and you don't want any explanations from the

22:40.920 --> 22:47.430
model, you don't want to understand how the model is working and you are happy with having a black box

22:47.430 --> 22:47.900
model,

22:48.180 --> 22:52.440
and the only thing that you want is good performance,

22:53.220 --> 22:57.360
then you can straightaway go to the XGBoost algorithm.

22:59.660 --> 23:07.130
And you can straightaway skip the linear models and random forests and decision trees. And basically we're

23:07.130 --> 23:09.280
going to move forward to the implementation.

23:10.010 --> 23:10.610
Now,

23:13.050 --> 23:22.110
what algorithm does XGBoost use? So the XGBoost library implements the gradient boosting decision

23:22.350 --> 23:29.190
tree algorithm, so it is applying the gradient boosting algorithm on small decision tree stumps.

23:29.970 --> 23:37.460
The algorithm goes by a lot of different names, such as gradient boosting, multiple additive regression trees,

23:37.680 --> 23:41.730
stochastic gradient boosting or gradient boosting machines.

23:42.360 --> 23:49.710
Boosting is an ensemble technique, which we have already learned about, where new models are added to correct

23:49.710 --> 23:51.960
the errors made by the existing ones.

23:53.350 --> 24:01.420
Models are added sequentially until no further improvement can be made. A popular example is the AdaBoost

24:01.420 --> 24:07.450
algorithm, which we just discussed, and this weights the data points that are hard to predict.

24:07.480 --> 24:14.290
So basically, the AdaBoost algorithm which we have learned was that we will have certain data points,

24:14.530 --> 24:19.250
and for the data points which were not classified correctly by the previous model,

24:19.540 --> 24:25.480
the new model will give more weight to these particular data points and try to predict them

24:25.480 --> 24:27.310
better. In case

24:27.670 --> 24:32.350
it has some more residuals which are not predicted accurately,

24:32.380 --> 24:37.790
then again, it will give higher weights to them and then the next model will try to improve on these.

24:39.130 --> 24:46.450
So gradient boosting is an approach where new models are created that predict the residuals or errors

24:46.450 --> 24:51.590
of prior models and are then added together to make the final prediction.

24:51.820 --> 24:58.210
It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding

24:58.210 --> 25:04.030
the new model, which we have just discussed: this gradient, which we are calculating.

25:05.370 --> 25:12.600
This differentiation of the cost function, which we are calculating, is actually the gradient which

25:12.600 --> 25:15.570
we are applying for the creation of the next model.

25:16.970 --> 25:19.200
This is the next model which we are creating.

25:19.460 --> 25:25.940
So this is the gradient which has been applied for creation of this next weak learner. That is the basis

25:25.940 --> 25:27.320
of gradient boosting.

25:29.610 --> 25:34.100
This approach supports both regression and classification predictive modeling.

25:34.770 --> 25:42.630
So hence we will be using it for both of them. Now XGBoost: what is the difference between the GBM

25:42.630 --> 25:45.470
and the XGBoost algorithm?

25:45.630 --> 25:51.230
So both XGBoost and GBM follow the principle of gradient boosting.

25:51.600 --> 26:00.360
However, the difference is in the modeling details: specifically, XGBoost uses a more regularized model formalization

26:00.360 --> 26:04.360
to control overfitting, which gives it a better performance.

26:04.620 --> 26:12.570
This is what we discussed here, that GBM tends to overfit as it does not know where to stop

26:12.750 --> 26:18.290
and it does not consider the model complexity and it is not generalized in nature.

26:18.300 --> 26:21.390
So to consider the model complexity,

26:22.840 --> 26:31.720
we have introduced more regularisation in case of XGBoost. So in case of XGBoost, we have

26:31.720 --> 26:36.340
used a more regularized model formalization to control overfitting.

26:36.560 --> 26:44.680
So instead of the training loss alone, we have added a regularisation term to the objective function

26:44.680 --> 26:46.150
for this particular model.

26:47.240 --> 26:55.350
So the regularisation term controls the complexity of the model, which helps us avoid overfitting.

26:55.850 --> 26:59.660
This sounds a bit abstract, so let's consider the following problem.

26:59.840 --> 27:05.960
In the following picture, you are asked to fit visually a step function given the input data points

27:05.960 --> 27:08.070
on the upper left corner of the image.

27:08.300 --> 27:11.360
Now, which solution among the three do you think is the best fit?

27:11.390 --> 27:13.850
So you need to find out which one is a better fit.

27:14.450 --> 27:15.950
Now, these are the images.

27:17.320 --> 27:18.610
So what is a better fit?

27:21.800 --> 27:24.500
So if you see these are the data points.

27:25.960 --> 27:35.020
Now, for these data points, if we create these kinds of splits, then what will happen is it will keep

27:35.020 --> 27:35.290
on

27:36.480 --> 27:42.390
growing the decision tree; it will need a lot of decision trees. These are the number of stumps which

27:42.390 --> 27:49.470
will be needed, and you can see that it will tend to overfit. Here the

27:51.080 --> 27:54.070
value of the regularisation term is high.

27:55.420 --> 28:04.420
Here, the function is very complex in nature, that is, it is trying to overfit the values, it is

28:04.420 --> 28:10.720
trying to learn everything, it is trying to learn all the points, which is why when you look at this

28:10.720 --> 28:16.590
particular dataset, at this particular split, it has made a split at the wrong position.

28:17.470 --> 28:23.590
It should have made a split at this particular position, which has been corrected in this particular

28:23.980 --> 28:24.550
diagram.

28:25.000 --> 28:31.710
Here you can see this line is also incorrect and this line is also incorrect. Here,

28:31.720 --> 28:37.900
both the lines are completely in sync with the data, although there will be a very slight error present.

28:38.170 --> 28:43.150
But at least it has learned the generalized model, which was actually the expectation here.

28:43.630 --> 28:49.510
So here there is a balance between the loss function and the regularisation function.

28:51.310 --> 28:52.840
So let's go further.

28:53.260 --> 28:55.660
So this is the objective function.

28:56.970 --> 29:04.560
This objective function contains the traditional loss function, the old loss function which we

29:04.560 --> 29:07.500
had from the GBM, right?

29:07.560 --> 29:09.940
This is what we remember from the GBM.

29:11.160 --> 29:17.070
This is the loss function from the GBM, so this is still present in XGBoost.

29:18.930 --> 29:26.850
They have added another regularisation function on top of it in the case of XGBoost. So extreme gradient boosting

29:26.850 --> 29:37.320
is a version, an upgraded version, of GBM where we have considered the complexities of the models so

29:37.320 --> 29:43.500
that we can actually regularize these models, and we have added a regularization term.

29:44.610 --> 29:46.580
Now, let us see how this actually works.

29:50.150 --> 29:59.270
Now, in the XGBoost package, at the t-th step it is asked to find the tree small f of t that will minimize the objective

29:59.270 --> 29:59.750
function.

30:00.620 --> 30:10.070
So we have this loss of capital F of t minus one plus the small f of t which is being created, and we have this loss function,

30:11.490 --> 30:17.420
this particular regularisation function, and this is the entire objective function. So this is the loss

30:17.430 --> 30:22.560
function and this is the regularisation function. The regularisation is essential.

30:22.590 --> 30:27.000
Now, we have been learning regularisation from the very first model.

30:27.690 --> 30:35.520
You should remember, regularisation has been implemented in case of linear models in the form of ridge and

30:35.520 --> 30:37.000
lasso regression.

30:37.320 --> 30:40.920
These are two regularization methods which we have been using.

30:41.460 --> 30:50.100
Then we have applied certain parameters, or stopping points, to decision trees, which actually allow

30:50.100 --> 30:53.760
regularization of the decision trees. Random

30:53.760 --> 31:02.040
forest, again, is not really regularized, as we could see, but it is using the bagging mechanism on the small

31:02.340 --> 31:02.900
models.

31:03.120 --> 31:09.800
So it is again being regularized on the basis of averaging out the values, and it is reducing the variance

31:09.810 --> 31:10.030
of it.

31:10.630 --> 31:16.610
Now boosting has the main concept of improving the bias.

31:16.740 --> 31:25.380
So instead of having the high bias which was already present, it is trying to decrease the bias in

31:25.380 --> 31:27.240
this case of XGBoost.

31:27.570 --> 31:28.020
So.

31:29.160 --> 31:32.100
We have been using regularization ever since.

31:32.340 --> 31:38.050
So here the regularization is essential to prevent overfitting to the training set.

31:38.280 --> 31:42.230
So when we apply regularisation, this overfitting will not happen.

31:45.150 --> 31:51.740
So without any regularisation, the tree will split until it can predict the training set perfectly,

31:52.020 --> 31:59.430
so it will keep on creating new f of t models, new small f of t models, and combining them with the previous

31:59.430 --> 32:04.260
complete strong learner until the time it gets the exact value.

32:04.620 --> 32:07.020
That is what we have learned in the beginning.

32:07.020 --> 32:07.290
Right.

32:07.530 --> 32:12.170
But we need to know where we should stop so that we do not overfit.

32:12.570 --> 32:19.290
So this will usually mean that the tree has lost generalization and will not do well on new

32:19.290 --> 32:20.050
test data.

32:20.280 --> 32:25.950
So the model may be able to predict this training data; it will predict this training data.

32:26.730 --> 32:30.870
But in case I give it any other point, any other data point.

32:32.140 --> 32:34.600
Then it will not be able to predict that properly.

32:35.900 --> 32:42.160
So in XGBoost the regularisation function shows the model complexity.

32:43.190 --> 32:47.870
So now we will consider model complexity in case of XGBoost.

32:49.930 --> 32:56.460
So this is the function which we have; this is the regularization function which we have for the

32:56.470 --> 32:57.610
model complexity.

32:57.730 --> 33:02.720
So there are these different components which are present in this model complexity.

33:02.890 --> 33:06.310
So we need to know what this model complexity actually means.

33:07.510 --> 33:11.260
So here T is the number of leaves in the tree.

33:12.610 --> 33:22.290
So we are regularizing the number of leaves in the tree, so we don't want to have a lot of leaves,

33:22.960 --> 33:26.750
so this will actually reduce the size of the tree.

33:27.280 --> 33:35.200
So when we have regularisation on the number of leaves of the tree, it will actually stop the tree

33:35.200 --> 33:36.790
from growing very large.

33:38.680 --> 33:46.150
Next, we have this w, which is the score of the leaf, so every leaf will have a certain

33:46.150 --> 33:46.750
weight.

33:46.900 --> 33:58.450
So for every leaf we have a weight associated; this weight is actually what allows the next model to predict

33:58.600 --> 33:59.670
the values better.

33:59.890 --> 34:06.640
So we have been discussing from the very start that AdaBoost tries to give more weightage

34:06.940 --> 34:14.770
to a particular node or to a particular data point so that that particular data point is classified

34:14.770 --> 34:15.810
better the next time.

34:16.390 --> 34:19.420
So this is what the weight will be doing.

34:19.420 --> 34:22.240
This weight will be giving the score to the leaf.

34:22.990 --> 34:28.350
Now, this lambda is the leaf weight penalty parameter.

34:29.230 --> 34:34.810
So lambda will try to penalize the weight of the

34:36.110 --> 34:42.690
leaf, so we don't want a particular weak learner to learn a lot.

34:42.830 --> 34:49.680
We want to increase the precision or increase the performance slowly and gradually.

34:49.970 --> 34:58.220
So what we will be doing is we will be penalizing the leaf weight so that it does not learn a lot at one

34:58.220 --> 34:59.450
particular point of time.

34:59.840 --> 35:03.630
We are trying to stop it from learning a lot.

35:05.190 --> 35:14.040
And gamma is the tree size penalty parameter, which again works along with the T value so that the tree

35:14.040 --> 35:15.860
size does not grow very large.

35:16.020 --> 35:24.690
So in the end, gamma is actually regulating the size of the tree, while w and lambda are actually helping

35:24.960 --> 35:29.190
in reducing the score or the weight of the different leaves.
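
NOTE
For reference, the regularization term discussed in these cues is usually written as below.
This is a sketch in the standard XGBoost notation (T leaves, leaf scores w_j, penalties gamma and lambda); it is an assumption that the slide uses the same symbols.
```latex
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
```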

35:30.840 --> 35:35.440
So how does XGBoost optimize the above objective function?

35:35.670 --> 35:44.850
So we have this objective function, so this particular objective function when this is differentiated,

35:45.180 --> 35:53.640
so it will actually give us these particular terms: g will be the first derivative of

35:53.640 --> 35:58.290
this value, and h will be the second derivative of these values.

35:59.740 --> 36:03.920
The first derivative of the loss function and the second derivative of the loss function.

36:04.270 --> 36:08.020
So here q is the value here.

36:08.020 --> 36:14.860
The q value which we have is the function which maps the input features to the leaf nodes in

36:14.860 --> 36:17.110
the tree in our loss function.

36:17.410 --> 36:24.340
This objective function is much easier to work with because it is now giving a score that we can use

36:24.340 --> 36:26.630
to determine how we structure the tree.

36:26.860 --> 36:35.980
So all these values which we have, that is, T, lambda, gamma, and w, will help in

36:35.980 --> 36:39.710
creating a tree structure which is regularized in nature.

36:39.820 --> 36:48.790
So the value of T and the value of gamma will try to reduce the size of the tree, and the value of

36:48.790 --> 36:56.730
w and lambda will try to reduce the leaf weight so that it does not overfit and it does not overlearn,

36:56.980 --> 37:00.310
so that the size of the stump is smaller here.
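
NOTE
The scoring idea above can be sketched in a few lines of Python. This is a minimal illustration, not the library's code: the helper names leaf_weight and split_gain are hypothetical, and it assumes squared-error loss (g = prediction - target, h = 1) with the standard closed forms w* = -G / (H + lambda) and a split gain that subtracts gamma for the extra leaf.
```python
def leaf_weight(G, H, lam):
    """Optimal leaf score for summed gradients G and summed Hessians H."""
    return -G / (H + lam)
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain from splitting one leaf into left and right children; gamma is the cost of the extra leaf."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma
# Toy data under squared-error loss, starting from a prediction of 0.0 for every row.
targets = [1.0, 1.2, 3.0, 3.4]
g = [0.0 - t for t in targets]  # first derivatives
h = [1.0] * len(targets)        # second derivatives
root_weight = leaf_weight(sum(g), sum(h), lam=1.0)
gain_without_penalty = split_gain(sum(g[:2]), 2.0, sum(g[2:]), 2.0, lam=1.0, gamma=0.0)
gain_with_penalty = split_gain(sum(g[:2]), 2.0, sum(g[2:]), 2.0, lam=1.0, gamma=0.5)
print(root_weight, gain_without_penalty, gain_with_penalty)
```
In this toy case a larger gamma turns a positive gain into a negative one, so the split would be rejected and the tree stays smaller.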

37:01.720 --> 37:08.440
The entire mathematics is not really necessary for you to understand. The main things which you need

37:08.440 --> 37:17.430
to think of are that the values of T and gamma involve the size of the tree, and lambda and w actually involve

37:17.440 --> 37:19.600
the weight of the leaf node.

37:19.870 --> 37:27.130
And the objective function is the loss function plus the model complexity, which has been added to the

37:27.370 --> 37:27.970
GBM.

37:28.090 --> 37:36.100
So in case you have created a GBM model and it is overfitting, then the next model which you will be

37:36.100 --> 37:37.780
going to would be XGBoost.

37:38.860 --> 37:46.360
Because XGBoost will overfit less than GBM, and it will be reducing the model complexity; hence it will

37:46.360 --> 37:47.050
overfit less.

37:47.860 --> 37:52.240
So this is how you can decide actually which model you need to pick up next.

37:53.460 --> 37:59.040
So when we are discussing all these mathematics, you don't really need to remember everything from the

37:59.040 --> 38:03.220
mathematics, but you need to know how these things are generated.

38:03.990 --> 38:10.800
So when you know how these things are generated and how the model is being created, it will actually

38:10.800 --> 38:17.600
give you an indication of how the model is actually being generated, how everything is working,

38:18.000 --> 38:25.080
so that in case something does not go well, then you know which hyperparameter you need to

38:26.070 --> 38:28.700
tune and what changes you need to make.

38:28.710 --> 38:30.120
That is the main objective.

38:30.120 --> 38:30.690
The motive of

38:30.690 --> 38:38.370
introducing the mathematics of any model to you is that it will allow you to understand how the model

38:38.370 --> 38:38.760
works.

38:39.600 --> 38:44.880
And I will be going a little overboard with the mathematics.

38:44.880 --> 38:49.020
I will be going a little more in-depth with the mathematics.

38:49.350 --> 38:54.580
But you need to understand the concepts behind it; that is the main objective here.

38:54.600 --> 38:58.270
It is fine if you don't understand the entire mathematics of it.

38:58.530 --> 39:05.400
The main task here is to understand how this entire thing works and to understand what the importance

39:05.400 --> 39:12.030
of the different hyperparameters is, which we have in the different algorithms, because that is exactly

39:12.030 --> 39:12.750
what you will have.

39:14.340 --> 39:26.640
So I have created a different video on the steps for hyperparameter tuning and for the different metrics.

39:26.790 --> 39:34.500
So you can refer to all those guideline videos so that you can actually get to know what are the main

39:34.500 --> 39:36.490
things which you need to focus on.

39:37.230 --> 39:43.620
So for the entire process, I've created one guideline video which you can watch and understand the

39:43.620 --> 39:44.670
entire process.

39:49.270 --> 39:52.120
So next, what we will be discussing.

39:54.580 --> 39:57.970
is the model complexity for the...

39:59.560 --> 40:07.840
So here you can see that in the case of XGBoost we have the regularization function, this regularization method,

40:07.840 --> 40:18.630
where we have this gamma T and the lambda w terms, so there are different ways of defining

40:18.640 --> 40:19.940
the model complexity.

40:20.230 --> 40:23.930
So this is one method of defining the model complexity.

40:23.960 --> 40:26.690
This is another method of defining the model complexity.

40:26.920 --> 40:32.650
So all of the model complexities work in the same way, though a little differently from each other.

40:32.980 --> 40:37.280
But the main task here is that the regularization has to happen.

40:37.510 --> 40:44.750
Now, the regularization is one part most tree packages treat less carefully or simply ignore.

40:45.100 --> 40:50.600
This was because the traditional treatment of tree learning only emphasized improving the

40:50.610 --> 40:52.300
impurity, while the complexity

40:52.300 --> 40:56.300
control was left to heuristics. By defining it formally,

40:56.320 --> 41:01.110
we can get a better idea of what we are learning, and it works well in practice.

41:01.300 --> 41:02.410
So that is why.

41:03.650 --> 41:11.870
We have provided different definitions and explained how this model complexity works so that you can

41:11.870 --> 41:17.810
actually compare the models that you are generating, compare their model complexities, and see

41:18.320 --> 41:20.200
which one is actually better.

41:20.210 --> 41:23.970
If you understand the mathematics, you will know how things actually work.

41:24.560 --> 41:34.820
So we have gone, as you will see, from a very simple model to a complex

41:34.820 --> 41:35.300
model.

41:35.450 --> 41:43.400
But we could have directly jumped to the complex models, that is, XGBoost and random forest, and we would have

41:43.400 --> 41:49.150
been done with it, because these are the only models which you will be using very frequently.

41:49.790 --> 41:57.350
But the main objective here was to make you understand each and every step and each and every model

41:57.590 --> 42:01.810
because each model has some mathematics behind it.

42:02.000 --> 42:10.730
And the mathematics from the simpler model allows you to understand the complex model much better.

42:11.300 --> 42:16.430
So that is why we have discussed all of these models and we will be learning about different models

42:16.430 --> 42:19.960
further so that you will be able to understand how things work.

42:21.440 --> 42:31.910
So here is one link for GBM versus XGBoost versus LightGBM, so you can go through this particular

42:31.920 --> 42:37.030
code and see how these compare and how these actually work.

42:38.700 --> 42:45.870
Now, let us discuss the unique features of XGBoost. So XGBoost is a popular implementation of

42:45.870 --> 42:46.900
gradient boosting.

42:47.100 --> 42:49.100
So these are the important features.

42:49.110 --> 42:51.060
The first one being regularisation.

42:51.270 --> 42:57.980
So XGBoost has an option to penalize complex models through both L1 and L2 regularization.

42:57.990 --> 43:00.910
So the regularization helps in preventing overfitting.

43:01.170 --> 43:07.830
So again, when you apply the regularization, then you can get the importance. Next is handling sparse

43:08.310 --> 43:14.910
data. Missing values or data processing steps like one-hot encoding make the data sparse.

43:15.510 --> 43:21.800
Because what does one-hot encoding do? One-hot encoding will be giving zero values and one

43:21.810 --> 43:22.900
values to the data.

43:23.280 --> 43:29.250
So when you have a lot of zero-one values, there will be a lot of one values, there will be a lot

43:29.250 --> 43:30.180
of zero values.

43:30.540 --> 43:36.810
These zero values actually create the sparse data when we have a lot of zero values.
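
NOTE
To make the sparsity point concrete, here is a small sketch in plain Python. The one_hot helper and the colors column are made up for illustration; the point is only that most cells of a one-hot encoded column are zero.
```python
def one_hot(values):
    """One-hot encode a list of category labels into rows of 0/1 dummies."""
    categories = sorted(set(values))
    return [[1 if value == category else 0 for category in categories] for value in values]
colors = ["red", "green", "blue", "green", "red", "yellow"]
encoded = one_hot(colors)
cells = [cell for row in encoded for cell in row]
zero_fraction = cells.count(0) / len(cells)  # share of cells that are zero
print(encoded[0], zero_fraction)
```
With four categories, every row has one 1 and three 0s, and the zero share grows as the number of categories grows.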

43:38.000 --> 43:47.870
So XGBoost incorporates a sparsity-aware split finding algorithm to handle different types of sparsity patterns

43:47.870 --> 43:48.480
in the data.

43:48.620 --> 43:56.390
So it takes care of the data in case there is such one-hot encoded data, that is, dummies

43:56.390 --> 43:59.430
which have been created from the categorical columns.

43:59.640 --> 44:02.820
So it works well on the categorical columns as well.

44:03.330 --> 44:10.520
Then we have the weighted quantile sketch. Most existing tree-based algorithms can find the split

44:10.520 --> 44:13.410
points when the data points are of equal weight.

44:13.700 --> 44:20.720
So if the different data points are of equal weight, then they can still find the splitting points.

44:21.410 --> 44:24.780
However, they are not equipped to handle weighted data.

44:24.980 --> 44:31.970
So in case there are some kind of weighted classes, that is, imbalanced classes, then they might

44:31.970 --> 44:32.890
not do that well.

44:33.200 --> 44:39.500
So XGBoost has a distributed weighted quantile sketch algorithm to effectively handle weighted data.

44:40.570 --> 44:41.320
Next.

44:44.430 --> 44:53.730
So XGBoost can actually handle the data of imbalanced classes very well. So next, more unique features.

44:53.760 --> 44:56.760
So again, we have the block structure for parallel learning.

44:56.970 --> 45:00.960
Now in other models, there is only one job which can be run at a time.

45:01.350 --> 45:09.090
That is, we can run only one thread for those jobs, but in the case of XGBoost, we can actually run

45:09.090 --> 45:10.630
several jobs together.

45:10.950 --> 45:16.320
So for faster computing, XGBoost can make use of multiple cores on the CPU.

45:16.620 --> 45:22.710
This is possible because of the block structure in its system design. Data is sorted and stored in in-memory

45:22.710 --> 45:26.130
units called blocks. Unlike in other algorithms,

45:26.160 --> 45:32.020
this enables the data layout to be reused by subsequent iterations instead of computing it again.

45:32.340 --> 45:39.570
So this feature also proves useful for steps like split finding and column subsampling, so you can use

45:39.570 --> 45:44.250
multiple jobs to get the work done better and faster.

45:44.830 --> 45:52.710
Then we have cache awareness. In XGBoost, non-continuous memory access is required to get the gradient

45:52.710 --> 45:54.030
statistics by row index.

45:54.540 --> 45:59.520
Hence XGBoost has been designed to make optimal use of hardware.

45:59.850 --> 46:06.410
This is done by allocating internal buffers in each thread where the gradient statistics can be stored.

46:06.420 --> 46:11.550
So it is aware of the cache and makes use of the hardware very efficiently.

46:11.970 --> 46:19.260
Then it has out-of-core computing, which is similar to parallel learning, so it will just use the cores and use

46:19.260 --> 46:21.940
the available disk space more efficiently.

46:22.230 --> 46:29.260
So basically, it has multiple jobs and the ability to use the hardware more efficiently.

46:29.550 --> 46:35.790
So that is why it will run faster in comparison to any other model; it runs a little faster in comparison

46:35.790 --> 46:39.050
to other models or other variants of boosting itself.

46:40.420 --> 46:48.010
So this is about XGBoost. Next, we will learn about some important hyperparameters, so

46:48.430 --> 46:53.990
these are the hyperparameters that we have. One is eta,

46:54.010 --> 46:55.740
that is, the shrinkage parameter.

46:56.770 --> 47:06.100
So each new tree that is added has its weight shrunk by this parameter, preventing overfitting, but

47:06.100 --> 47:10.290
at the cost of increasing number of rounds needed for convergence.

47:10.540 --> 47:14.270
So as you decrease the value.

47:14.560 --> 47:23.560
So when we decrease the value, what happens is that it will make sure that each new model's contribution shrinks.

47:23.950 --> 47:27.750
That is, the weight which is provided by each new model is reduced.

47:28.030 --> 47:36.850
So it will prevent overfitting if you have a lower eta value, but it will also slow down the learning

47:36.850 --> 47:37.370
process.

47:37.540 --> 47:45.000
So make sure you are not reducing or increasing eta too far, because then it will be very difficult

47:45.010 --> 47:46.830
for you to find a good model.

47:47.260 --> 47:49.490
It will take a lot of time to train.
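
NOTE
The trade-off between shrinkage and the number of rounds can be seen in a toy sketch. This is not XGBoost itself: the weak learner here is just the mean residual, and rounds_to_converge is a hypothetical helper. It only illustrates that a smaller eta needs more boosting rounds to reach the same fit.
```python
def rounds_to_converge(targets, eta, tol=0.01, max_rounds=1000):
    """Boost a constant weak learner (the mean residual), shrinking each step by eta."""
    target_mean = sum(targets) / len(targets)
    prediction = 0.0
    for round_number in range(1, max_rounds + 1):
        residual_mean = sum(t - prediction for t in targets) / len(targets)
        prediction += eta * residual_mean  # each new learner's contribution is scaled by eta
        if abs(target_mean - prediction) < tol:
            return round_number
    return max_rounds
targets = [2.0, 4.0, 6.0]
fast = rounds_to_converge(targets, eta=0.3)
slow = rounds_to_converge(targets, eta=0.05)
print(fast, slow)
```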

47:52.060 --> 47:58.960
Then we have the gamma value, which is the tree size penalty. Then we have max_depth, the maximum

47:58.960 --> 48:05.740
depth of each tree. Then min_child_weight is the minimum weight that a node can have.

48:05.740 --> 48:10.120
If the minimum is not met, then a particular split will not occur.

48:10.150 --> 48:17.340
So that is roughly the minimum number of values which should be present in the child.

48:17.890 --> 48:20.770
Then subsample gives the opportunity to perform bagging.

48:21.490 --> 48:29.590
So if we want to have only a sample of the data being used for each tree, then we can use subsample.

48:29.980 --> 48:33.500
Then colsample_bytree allows us to perform feature bagging.

48:33.520 --> 48:40.150
So basically, if you want to select a subset of the variables or subset of the features, then we can

48:40.150 --> 48:41.600
use colsample_bytree.

48:41.770 --> 48:48.460
So subsample and colsample_bytree are introducing the randomness which we had in random forest to

48:48.820 --> 48:50.130
XGBoost as well.

48:51.140 --> 49:01.650
Then we have the lambda value, which is the L2 leaf node penalty, so it penalizes the weight of the leaf

49:01.700 --> 49:01.970
node.
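
NOTE
The hyperparameters listed in these cues map onto a parameter dictionary like the one sketched below. The key names are the ones used by the native XGBoost parameter list; the values are placeholders chosen for illustration, not recommendations from the lecture.
```python
params = {
    "eta": 0.1,               # shrinkage applied to each new tree
    "gamma": 1.0,             # tree size penalty: minimum loss reduction needed to split
    "max_depth": 4,           # maximum depth of each tree
    "min_child_weight": 1.0,  # minimum weight a child node must have for a split to occur
    "subsample": 0.8,         # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,  # fraction of features sampled for each tree
    "lambda": 1.0,            # L2 penalty on the leaf weights
}
for name, value in params.items():
    print(name, value)
```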

49:03.490 --> 49:08.910
So this is about XGBoost. So this is the entire theory which is associated with extreme

49:08.950 --> 49:17.110
gradient boosting. The main task for you is to understand these different parameters and get familiar with the

49:17.860 --> 49:19.200
two things which are present.

49:19.540 --> 49:23.890
One is the loss function and another is the model complexity.

49:24.250 --> 49:26.800
You don't need to get into the mathematics.

49:27.070 --> 49:35.800
I'm repeating this again because in no interview session will you be asked about these terms

49:35.800 --> 49:39.780
or about the very core mathematics.

49:40.000 --> 49:44.600
But what you would be asked about is how one model is better than the other.

49:44.860 --> 49:48.290
How is XGBoost better than gradient boosting?

49:48.520 --> 49:55.120
So then you can simply say that it is because of the model complexity and you can explain that the model

49:55.120 --> 50:01.010
complexity is controlling the size of the tree and the weight of the leaves.

50:01.240 --> 50:04.470
So that is something which is important for you to understand.

50:05.130 --> 50:11.890
The main thrust of the concept is what you need to learn instead of mugging up the entire mathematics,

50:12.130 --> 50:14.580
because that is not the objective.

50:15.700 --> 50:23.140
One should know how something works and not mug up all the mathematics behind it, so that is exactly

50:23.140 --> 50:24.430
what we are focusing on.

50:24.440 --> 50:29.710
That is exactly what is required in the organizations and on the field.

50:29.920 --> 50:33.340
So that is something that we are focusing on throughout the entire course.

50:34.330 --> 50:41.780
So in the next session, we will go towards the implementation of XGBoost and see how we can implement

50:41.800 --> 50:41.900
it.
