WEBVTT

00:01.210 --> 00:02.310
Hi there.

00:02.360 --> 00:08.440
Now, we have learned how we can create different rules for categorical and numerical variables,

00:08.740 --> 00:12.680
and we have also seen how decision trees are created.

00:13.270 --> 00:21.970
We know that the decision trees are based on the different rules which we have, and these rules only

00:21.970 --> 00:26.740
define what decision node has to be placed next.

00:27.190 --> 00:34.120
So now we will learn the next most important criterion, which is the splitting of data.

00:35.260 --> 00:46.150
So each split that is made would be aligned to an axis, with each feature representing an axis.

00:46.420 --> 00:47.550
So what does this mean?

00:48.610 --> 00:55.450
So here this means that each split made would be aligned to an axis, where each feature is an axis itself.

00:55.930 --> 01:02.470
So let us have a look at this particular data where we have values or we can consider these values being

01:02.470 --> 01:03.430
different classes.

01:03.940 --> 01:06.010
So these classes

01:07.180 --> 01:12.910
are present in four different regions, so the first class is present here.

01:12.960 --> 01:14.120
This is the second class.

01:14.140 --> 01:15.220
This is the third class.

01:15.220 --> 01:16.550
And this is the fourth class.

01:17.140 --> 01:24.340
Now, if we want to subdivide and try to find out which class is present where, if we want to

01:24.340 --> 01:26.980
make such a decision, then how can we do that?

01:26.980 --> 01:33.510
So we can do that by creating rules. We can create rules by, let's say,

01:33.730 --> 01:35.350
having one rule as follows.

01:36.460 --> 01:44.650
With this being the x-axis and this being the y-axis, we can have the first rule as x greater than

01:45.690 --> 01:53.250
0.5. When x is greater than 0.5, then this particular line would be selected,

01:53.250 --> 02:00.630
which means that if x is greater than 0.5, then it would probably be a part of class one or

02:00.630 --> 02:01.290
class three.

02:02.470 --> 02:08.830
Now, after we have decided that x is greater than 0.5, we will have another condition,

02:09.010 --> 02:11.410
that y is less than

02:13.060 --> 02:13.970
0.5.

02:14.770 --> 02:21.440
So for the tree, where we have x greater than 0.5 and y less than 0.5,

02:21.670 --> 02:23.620
that would mean that the class is one.

02:25.190 --> 02:32.270
Now, remember the condition we created at x equal to 0.5: x is greater than

02:32.270 --> 02:32.610
0.5.

02:32.780 --> 02:38.300
So for the yes branch of that condition, the condition which we just discussed would be present.

02:39.410 --> 02:48.500
And on the other end, we will have x less than 0.5. So now for x less than 0.5,

02:48.510 --> 02:50.410
there are two types of conditions present.

02:51.380 --> 03:00.260
So now we will again have another condition, that is, x is greater than -1.1.

03:01.650 --> 03:07.030
So if x is greater than -1.1, and this was the previous condition,

03:07.140 --> 03:14.950
then again the same situation would come up, where we will have to create a split on the y-axis.

03:15.930 --> 03:24.180
So based on the y-axis, we will again check whether the value of y is greater than -0.5.

03:24.490 --> 03:28.860
If the value is greater than -0.5, then it belongs to class four.

03:29.010 --> 03:33.960
If it is less than 0.5, then there is a good chance that it could be class two.

03:35.180 --> 03:36.400
So we will get to this.

03:37.460 --> 03:45.920
Now, again, there would be another split at this particular point, and it is being done on the y-axis again.

03:46.160 --> 03:54.140
So it will check if the value is between -1.2 and 0.5.

03:54.440 --> 03:55.940
And it is less than.

03:57.630 --> 03:58.500
Zero point.

03:59.440 --> 04:00.400
Two or three.

04:00.850 --> 04:06.160
And based on that, it will decide that this would be class two.

04:07.180 --> 04:16.330
And finally, there is the condition on whether the value of x is less than or greater than

04:16.330 --> 04:17.540
-1.2.

04:17.770 --> 04:23.530
So based on the negative branch, that section would be assigned, with all the values being class four.

04:25.090 --> 04:33.040
So you can see that each split which we are making is in concordance with the x- and y-axes, so each

04:33.040 --> 04:37.810
rule which we create here will belong to either the x-axis or the y-axis.

04:38.730 --> 04:45.040
So let us see how we can actually select different rules, as there could be several rules present.

04:45.300 --> 04:52.140
I could have had a rule for -1.5, or I could have had a rule for +1.5, too.

04:52.380 --> 04:57.230
But these rules will actually not give us any good results.

04:57.360 --> 05:01.380
So that is why we have to choose these rules which have been highlighted here.

05:02.290 --> 05:06.980
So how will we actually get to know which rule is actually useful?

05:07.600 --> 05:12.980
So for that, we will look for a split which gives the most homogeneous child nodes.

05:13.600 --> 05:18.720
So when we say most homogeneous child node, we don't really know what homogeneous means.

05:19.300 --> 05:21.870
Let us say we don't know the meaning of homogeneous.

05:22.060 --> 05:27.070
So let us try to think about what it actually means, and let us formulate it.

05:27.970 --> 05:35.680
So the different ideas for rule selection are the Gini index, entropy and deviance.

05:35.980 --> 05:40.350
These are three formulas which are used for creating the rules and selecting the rules.

05:41.530 --> 05:47.280
Now, we still don't know how they actually work, but we will look at that in some time.

05:48.070 --> 05:56.050
So the Gini index formula is one minus the summation of the squared probabilities. The formula for entropy is

05:56.290 --> 06:01.180
minus the summation of the probability times the log of the probability.
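
NOTE
A minimal sketch of the two impurity formulas just described, assuming a node is summarised as a list of class counts. The function names gini and entropy are illustrative and not from the lecture, and a base-2 logarithm is assumed for entropy.
```python
import math
def gini(counts):
    """Gini index: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)
def entropy(counts):
    """Entropy: minus the sum of p * log2(p) over classes with p > 0."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)
# A node with 6 points of one class and 6 of another is maximally mixed.
print(gini([6, 6]))     # 0.5
print(entropy([6, 6]))  # 1.0
```
Both measures are zero for a node that holds a single class and grow as the classes become more evenly mixed.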

06:01.870 --> 06:07.360
So similarly, we can have these formulas in hand and then use them. How we will use them,

06:07.370 --> 06:09.120
we will look at in some time.

06:09.340 --> 06:16.630
So for that, just wait for a moment, and let us first of all understand theoretically what this means.

06:18.870 --> 06:21.480
So let's say we have this entire dataset.

06:22.720 --> 06:31.090
This data contains four classes; these four classes are blue, green, red and yellow.

06:32.140 --> 06:38.440
Now, out of these four classes, I have an equal amount of data present in all four classes.

06:39.580 --> 06:45.970
Now, the problem here is that I want to decide which class is which.

06:47.460 --> 06:48.900
So let us try to do that.

06:49.590 --> 06:56.980
So for this, I need to create a split in this particular data so that I will be able to recognize which

06:57.090 --> 06:57.900
class is which.

06:58.760 --> 07:05.870
If someone were to place a point there and ask me which class that particular

07:05.870 --> 07:09.800
point belongs to, I might not be able to answer based on this particular image.

07:11.160 --> 07:14.190
So I need to make a split on this

07:15.180 --> 07:15.810
image.

07:17.060 --> 07:19.090
So let us make the split.

07:20.020 --> 07:24.290
So the first split that we will be making is the horizontal split.

07:24.760 --> 07:30.430
So when we make this horizontal split, you can see that data has been divided into two portions.

07:30.430 --> 07:35.130
One is the top portion and the other is the bottom portion. In the top portion,

07:35.290 --> 07:45.310
we have some of the red points, a few blue points and all the yellow points present, while in the

07:45.310 --> 07:46.960
bottom split we have

07:47.990 --> 07:52.430
some of the red points, all the green points and some of the blue points.

07:53.540 --> 07:58.120
So here we don't have a clear majority of any points present.

07:59.780 --> 08:05.630
Now, if we have a look at split two, you can see we have made this vertical split, which

08:05.750 --> 08:13.670
divides the data into two spaces, where the left-hand side space has only two types of points, which are

08:13.670 --> 08:20.540
blue and green, and that too all of the blue and green points, while on the right side we have all

08:20.540 --> 08:22.580
the red and yellow points.

08:23.450 --> 08:31.280
So once this split is done, if we try to make another split, then what will happen is this.

08:32.580 --> 08:33.690
Let us make a note of this.

08:36.320 --> 08:41.600
So for this particular split, the next split which we will have to make is at this point,

08:43.110 --> 08:48.900
which will divide the blue and green points, and the next split which will be created would be at this

08:48.900 --> 08:52.380
point, which will divide the red and the yellow points.

08:52.980 --> 08:58.860
Now, if we talk about the first split and we make another split here, then the split would have to

08:58.860 --> 09:03.730
be made at this particular point, and another split at this point.

09:04.440 --> 09:08.800
Similarly, one split would be made at this place, and another split as well.

09:09.570 --> 09:16.740
So now if you see, we have to make only one, two, three splits in the case of the second split,

09:17.430 --> 09:25.260
while we have had to make one, two, three, four, five splits in the case of split one.

09:26.070 --> 09:33.050
So it requires a smaller number of splits in the case of split two in comparison to split one.

09:33.240 --> 09:40.890
And also the data is being divided in such a way that we have all the data points of one class present

09:40.890 --> 09:41.990
in one position.

09:43.230 --> 09:50.590
So it is better to make split two in comparison to split one because it divides the data into

09:50.610 --> 09:52.640
more homogeneous regions.

09:53.220 --> 10:01.230
So the data in split two is more homogeneous because we get only two classes on the left side and only

10:01.230 --> 10:02.730
two classes on the right side.

10:02.940 --> 10:08.980
While in the case of split one, we have three classes at the top and another three classes at the bottom.

10:09.510 --> 10:13.880
So hence it is less confusing in case of split two.

10:14.100 --> 10:20.520
If I have to make a decision about this at this particular point, then I can easily say on the basis

10:20.520 --> 10:24.240
of this split which we have made that, at this place,

10:24.240 --> 10:31.260
it would be either a blue point or a green point. But if I take the same point in the first split,

10:31.950 --> 10:37.220
then I will have to say that the point could be blue or yellow or red.

10:37.560 --> 10:41.100
So the confusion which we already had still exists.

10:41.400 --> 10:48.270
We were able to eliminate only one class using this particular split out of the four classes which we

10:48.270 --> 10:53.510
had, while we were able to eliminate two classes with just one split in split two.

10:54.640 --> 11:00.250
So that is the reason why we will choose split two. Now let us discuss what information gain is.

11:01.500 --> 11:08.940
Now for information gain, let us try to find out the probability first. So for finding out the probability,

11:08.940 --> 11:10.980
let us look at this particular data.

11:11.010 --> 11:14.360
So here we have 12 points of data.

11:14.730 --> 11:19.540
So out of these 12 points of data, we are making a split.

11:19.560 --> 11:23.280
and dividing this into two different sub-nodes.

11:23.730 --> 11:30.750
So the first split is creating this particular division and the second split is making this division.

11:31.720 --> 11:37.650
Now, let us try to find out the probability for the parent. For finding the probability for the parent,

11:37.870 --> 11:40.730
let us consider the green points.

11:41.110 --> 11:44.620
So what is the probability of getting a green point out of this parent node?

11:44.770 --> 11:48.420
It will be six divided by the total of 12.

11:48.730 --> 11:50.410
That is 0.5.

11:52.000 --> 11:59.590
Now, let us have a look at the probabilities of both the splits. So for split one, at the left-hand side,

11:59.770 --> 12:04.330
the probability of getting a green point will be three by the total

12:04.330 --> 12:12.010
seven, which is equal to about 0.43. On the right-hand side, the probability will be three divided by five, which is

12:12.010 --> 12:13.270
0.6.

12:14.660 --> 12:21.600
For the second split, the probability of getting a green point on the left is four divided by five, which is

12:21.620 --> 12:29.520
0.8. Here, on the right, the probability is two divided by seven, which is about 0.28.

12:30.350 --> 12:34.240
Now, let us try to find out the probability of getting a red point out of this block.

12:34.580 --> 12:40.690
So that would be five divided by seven, which will be about 71 percent.

12:41.420 --> 12:48.560
So you can see that when we are trying to pick out values from the first split, the first split is less

12:48.560 --> 12:52.310
homogeneous in nature because it has more of a mixture,

12:52.970 --> 13:01.040
while in the case of the second split, the values are more biased towards a particular class,

13:01.190 --> 13:08.030
and hence we can easily make out that this is a green class and this one

13:08.030 --> 13:09.250
is a red class.

13:09.890 --> 13:15.680
So when we pick out a point from this particular portion, it will give out a green point in the

13:15.680 --> 13:16.310
majority of cases.

13:16.580 --> 13:19.760
And when we pick out points from this particular portion,

13:19.850 --> 13:22.510
it will give out more red points.

13:22.790 --> 13:26.750
So hence, it being more homogeneous in nature,

13:26.960 --> 13:33.290
we will consider this split as our actual and correct split for the next splitting.

13:35.120 --> 13:38.690
So let us find out the information gain using the Gini index.

13:38.930 --> 13:41.310
So what was the formula for the Gini index?

13:41.320 --> 13:45.520
The formula for the Gini index was one minus the summation of the squared probabilities.

13:45.530 --> 13:51.150
You can use any of the indexes: you can use Gini, you can use entropy, you can use any

13:51.170 --> 13:51.520
measure.

13:51.740 --> 13:54.760
But for now, we're explaining it using the Gini index.

13:56.220 --> 13:59.780
So let us calculate the value for the parent node.

14:00.720 --> 14:05.610
This node will have a Gini index of one minus

14:07.080 --> 14:13.590
six by 12, all squared, plus six by 12, all squared.

14:13.800 --> 14:18.990
So we will be taking the proportion of each class in consideration here.

14:19.200 --> 14:25.950
So it will be one minus six by 12 squared for the green class, plus six by 12 squared for the red

14:25.980 --> 14:26.370
class.

14:26.550 --> 14:30.270
And when we solve this, we will get 0.5.

14:31.780 --> 14:35.540
Now, let us get the Gini values for the first split.

14:35.770 --> 14:36.920
This is the first split.

14:37.120 --> 14:39.580
So the Gini index for the left-hand side would be one minus

14:40.710 --> 14:41.880
four by seven squared

14:42.890 --> 14:46.340
plus three by seven squared.

14:46.370 --> 14:49.000
So it comes out to be 0.49.

14:49.910 --> 14:56.630
Now let us calculate it for the right-hand side of the first split. It will be one minus two by five squared

14:57.870 --> 14:59.940
plus three by five

15:01.090 --> 15:03.760
squared, and it comes out to be 0.48.

15:04.960 --> 15:08.260
Now we will calculate the weighted average of the Gini.

15:08.860 --> 15:14.350
Now, how will we calculate the weighted average? The weighted average will be calculated by taking

15:14.350 --> 15:15.910
the left-hand side first.

15:16.090 --> 15:18.110
So how many points we have here?

15:18.130 --> 15:21.510
We have seven points here and in total we have 12 points.

15:21.820 --> 15:29.300
So it will be seven by 12 into the Gini of the left-hand side, plus, for the five points here,

15:29.500 --> 15:35.180
five divided by the total of 12 into the Gini of the right-hand side.

15:35.320 --> 15:41.970
So it comes out to be 0.486, and the gain comes out to be 0.5, the Gini

15:42.010 --> 15:47.860
of the parent, minus the weighted Gini of the children, which is 0.014.

15:49.460 --> 15:52.290
Now, let us calculate the Gini for the second split.

15:54.550 --> 16:02.520
The Gini of the second split comes out as follows. For the left-hand side, the Gini will be one minus four

16:02.710 --> 16:03.220
green

16:04.350 --> 16:08.850
divided by the total five, the whole square of this, plus

16:10.340 --> 16:17.560
one red divided by the total five, the whole square of this. So it comes out to be

16:17.570 --> 16:17.920
0.320.

16:18.560 --> 16:21.620
Now for the right-hand side, again, one minus

16:22.560 --> 16:32.850
five divided by the total seven, squared, plus two divided by the total seven, the whole square of

16:32.850 --> 16:36.250
this. So it comes out to be 0.4082.

16:36.900 --> 16:39.090
Now, again, we will take the weighted average.

16:39.360 --> 16:42.430
So this contains five out of 12.

16:42.960 --> 16:51.420
So five by 12 into the Gini of the left-hand side, 0.320, plus the

16:52.600 --> 17:02.040
seven out of the 12 into 0.4082. Now, when we take the sum of this,

17:02.040 --> 17:02.170
we get

17:02.500 --> 17:03.940
0.371.

17:05.500 --> 17:13.000
Now, the gain value will be 0.5 minus 0.371, so it comes out to be

17:13.000 --> 17:18.710
0.1285. Now, when we compare 0.014 with 0.1285,

17:19.000 --> 17:25.170
we can easily see that the gain for the second split is more in comparison to the first split.

17:25.360 --> 17:32.020
So we will consider the second split for the splitting of this particular parent.
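
NOTE
The gain arithmetic above can be checked with a short sketch. It assumes each node is written as (green, red) counts read off the spoken fractions: split one is 3 green and 4 red on the left and 3 green and 2 red on the right, and split two is 4 green and 1 red on the left and 2 green and 5 red on the right. The helper names are illustrative, not from the lecture.
```python
def gini(counts):
    """Gini index: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)
def gini_gain(parent, children):
    """Parent Gini minus the size-weighted average Gini of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted
parent = [6, 6]                # (green, red) counts in the parent node
split_one = [[3, 4], [3, 2]]   # left has 7 points, right has 5
split_two = [[4, 1], [2, 5]]   # left has 5 points, right has 7
print(round(gini_gain(parent, split_one), 4))  # 0.0143
print(round(gini_gain(parent, split_two), 4))  # 0.1286
```
The small difference from the spoken 0.1285 comes only from rounding the intermediate values; the ranking of the two splits is the same.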

17:33.130 --> 17:37.540
So we will split a particular node and then we will get these.

17:37.570 --> 17:44.080
So once we have split this, we will again consider how we want

17:44.080 --> 17:49.020
to make the split and which is the best split and again, split this particular node.

17:49.390 --> 17:55.510
So it will keep going until we reach one of those conditions which we have already

17:55.510 --> 17:58.360
set for stopping the splitting.

17:59.880 --> 18:03.420
Now, let us have a look at this particular image.

18:03.630 --> 18:09.550
So here we have the data, where we have these lines of separation.

18:10.110 --> 18:15.230
So let us try to create the rules, or the decision tree, for this.

18:15.630 --> 18:22.480
So the first split which we can create is x1 less than 3.5.

18:23.310 --> 18:27.210
So if x1 is less than 3.5, then we get this particular region.

18:28.720 --> 18:32.390
And if x1 is greater than 3.5, then we get this particular region.

18:32.410 --> 18:34.720
So let us consider x1 less than 3.5.

18:35.710 --> 18:37.970
So if x1 is less than 3.5,

18:37.990 --> 18:40.010
that is, in this particular region,

18:40.300 --> 18:47.620
then the second condition which we need to check is whether x2 is less than 7.5 or not. If

18:47.620 --> 18:53.790
x2 is less than 7.5, then it comes out to be X; otherwise it comes out to be zero.

18:55.520 --> 19:02.240
Now, let us consider x1 greater than 3.5. So if x1 is greater than 3.5,

19:02.540 --> 19:06.070
then there could be this particular area which we want to divide.

19:06.910 --> 19:11.960
Now, for this, we need to check if the value is between 4 and 7.5.

19:12.110 --> 19:18.410
If the value is between 4 and 7.5, then we will again check if the value of x1 is less

19:18.410 --> 19:18.980
than seven.

19:20.060 --> 19:25.460
If the value is less than 7 and the value is between 4 and 7.5, then we get this

19:25.460 --> 19:32.140
particular area, which means it is X, and for all other conditions, the marker is zero.

19:32.990 --> 19:35.090
So this is how we will create a decision tree.
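
NOTE
The rules just walked through can be sketched as nested conditions. This is an illustration only: it assumes the value checked against 4 and 7.5 is x2 (the talk does not name the variable), it treats all thresholds as strict, and the function name classify is hypothetical.
```python
def classify(x1, x2):
    """Return "X" or "0" by following the axis-aligned rules from the walkthrough."""
    if x1 < 3.5:
        # Left region: only the x2 threshold matters.
        return "X" if x2 < 7.5 else "0"
    # Right region: X only inside the box 4 < x2 < 7.5 with x1 < 7.
    if 4 < x2 < 7.5 and x1 < 7:
        return "X"
    return "0"
print(classify(2, 5))  # X
print(classify(2, 9))  # 0
print(classify(5, 6))  # X
print(classify(8, 6))  # 0
```
Every condition compares a single feature with a constant, which is what makes each split axis-aligned.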

19:36.000 --> 19:41.960
Now, once we have got these homogeneous spaces, that is, this particular space, which is completely

19:41.960 --> 19:49.050
homogeneous and has all the points of one class, there is no point in creating a further split, because the Gini

19:49.140 --> 19:53.240
index of this particular area will be the same.

19:53.270 --> 19:57.000
The purity of this particular area will still be one hundred percent.

19:57.270 --> 20:00.910
And again, if we make another split, it will still be one hundred percent.

20:00.920 --> 20:03.690
So there will be no information gain present.

20:03.920 --> 20:09.800
So unless and until we have some minimum information gain, we will not make a further split.

20:10.910 --> 20:18.140
Because if we keep on making further splits, the tree will keep on growing and it will not support

20:18.140 --> 20:24.630
anything, it will just keep increasing the size of the tree without giving much information.

20:27.320 --> 20:32.810
Now, let us talk about regression trees. Regression trees are what we use in the case of regression.

20:33.020 --> 20:40.810
So in this case, we use the mean or median of the records at a leaf node to find out the value,

20:40.820 --> 20:43.040
the average value as the predicted value.

20:43.430 --> 20:47.000
Now, let us try to learn about what regression trees are.

20:48.130 --> 20:56.440
So for the selection of rules in classification scenarios, we were using the Gini index or entropy, but in the case

20:56.440 --> 21:00.430
of regression, we use the sum of squared error.

21:00.640 --> 21:01.920
So how do we do that?

21:02.110 --> 21:07.180
So we will find out the sum of squared error of the different types of split.

21:08.220 --> 21:13.940
And then find the split which has the minimum sum of squared error.

21:15.320 --> 21:17.880
So let us consider this particular split.

21:18.530 --> 21:26.330
So this is the original parent node. The parent node contains the values 5, 6, 4, 6, 11, 12,

21:26.330 --> 21:27.320
13 and 12.

21:28.010 --> 21:33.920
Now, if we predict the values, the predicted value will be the average or the mean of these

21:33.920 --> 21:34.380
values.

21:34.730 --> 21:38.510
So the average or mean of these values comes out to be about 8.6.

21:39.200 --> 21:44.690
Now, if we subtract this 8.6 from the target values, we will get these error

21:44.690 --> 21:45.160
values.

21:45.920 --> 21:50.270
Now we will find out the sum of squared error as described in the

21:52.550 --> 21:58.970
definition of the regression tree. So let us square these values. On squaring these values,

21:58.970 --> 22:02.610
we get 13.17, 6.19 and so on.

22:02.630 --> 22:05.090
So these are the different squared values of the errors.

22:06.800 --> 22:15.440
And then finally, we add all of these squared values and get the sum of squared error. So what

22:15.440 --> 22:21.890
the sum of squared error means is this: we find out the errors, then square all the error values

22:21.890 --> 22:25.550
and then find the sum of these squared error values.

22:27.140 --> 22:32.660
Now we need to make certain splits, and by making the splits, we will have to find out which one has

22:32.660 --> 22:34.900
the lowest sum of squared error.

22:35.890 --> 22:42.700
So this is the first type of split. This first split divides the data such that the target has the values

22:42.700 --> 22:49.340
5, 6, 12, 12 on the left-hand side and 11, 6, 13, 4 on the right-hand side.

22:49.600 --> 22:53.510
So here the predicted value will be the mean of these target values.

22:53.650 --> 22:59.320
So for the left side, the mean value is 8.75, and for the right side the mean value

22:59.320 --> 23:00.250
is 8.5.

23:01.030 --> 23:06.760
Now we will find out the error by finding the difference between the target and the predicted values.

23:07.180 --> 23:10.720
And then we will square the values and then find the

23:11.790 --> 23:14.640
squared error, the sum of squared error.

23:14.850 --> 23:20.350
So now we add the sums of squared errors of the left and right sides of this particular split.

23:20.550 --> 23:25.350
So we get 42.75 plus 53, which is 95.75.

23:26.220 --> 23:32.010
And the original sum of squared error of the parent node was 95.87.

23:32.160 --> 23:36.510
So hence we can see that there is not much improvement by doing this particular split.

23:37.050 --> 23:40.500
We have not really gained much from this particular split.

23:41.720 --> 23:47.510
Now there is another split which divides the data such that the target values are 5, 6,

23:47.510 --> 23:53.210
6, 4 on the left-hand side and 11, 12, 13 and 12 on the right-hand side.

23:53.510 --> 23:58.420
Now, again, we will find out the predicted values using the average of these target values.

23:58.580 --> 24:01.640
So the average of these would be 5.25, and for

24:01.640 --> 24:03.350
these it would be 12.

24:04.390 --> 24:09.460
Now, again, we will find the error values by finding the difference between the target and the predicted

24:09.460 --> 24:09.970
values.

24:10.970 --> 24:17.460
Then we will square these error values and find the sum of these squared values.

24:17.750 --> 24:25.130
Now, when we add the sums of squared errors, we get 2.75 plus 2, which is

24:25.130 --> 24:25.820
4.75.

24:26.330 --> 24:29.420
Now, let us compare these sums of squared errors.

24:29.570 --> 24:33.250
So the second split has an SSE of 4.75.

24:33.590 --> 24:36.570
The first split had an SSE of 95.75,

24:36.810 --> 24:42.020
and the parent node had an SSE of 95.87.

24:42.140 --> 24:48.770
Hence, if we are to choose between different splits, we will decide upon this second split, which reduces

24:49.010 --> 24:53.060
the sum of squared error by almost 91.
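
NOTE
A short sketch that reproduces the sum-of-squared-error comparison. It assumes the first split groups the targets as 5, 6, 12, 12 and 11, 6, 13, 4, which is the grouping consistent with the stated means of 8.75 and 8.5. The function names sse and split_sse are illustrative, not from the lecture.
```python
def sse(values):
    """Sum of squared errors around the mean, which is the node's prediction."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)
def split_sse(left, right):
    """Total SSE of a split: SSE of the left child plus SSE of the right child."""
    return sse(left) + sse(right)
parent = [5, 6, 4, 6, 11, 12, 13, 12]
print(sse(parent))                                 # 95.875
print(split_sse([5, 6, 12, 12], [11, 6, 13, 4]))   # 95.75
print(split_sse([5, 6, 6, 4], [11, 12, 13, 12]))   # 4.75
```
The second split leaves an SSE of 4.75 against 95.875 for the parent, a reduction of about 91, so it is the one to keep.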

24:55.240 --> 25:02.530
So this is how we decide which split we should be choosing. Now, let us have a look at this particular

25:02.530 --> 25:02.930
image.

25:03.040 --> 25:05.990
This blue line actually shows

25:06.010 --> 25:14.400
how we have created different decision nodes and created splits so that we have a decision tree in place.

25:15.240 --> 25:22.720
Now, this decision tree is able to predict the training data very well, and all the values are predicted

25:22.720 --> 25:29.130
very finely. But when we are actually trying to predict certain values, such as the value at this

25:29.440 --> 25:33.790
point, the value at this point or the value at this point,

25:33.940 --> 25:38.020
it actually gives a lot of error, because for this point it will

25:38.900 --> 25:43.920
predict this value; for this point, it will predict this value; and for this point,

25:43.970 --> 25:45.290
it will predict this value.

25:45.590 --> 25:46.400
So now,

25:47.490 --> 25:51.330
if you see, the error which we have is higher

25:52.270 --> 25:58.990
in these cases. Instead, we could have simply solved this problem by using linear regression.

25:59.560 --> 26:05.380
So if we had applied linear regression, then we would have created a straight line and it would

26:05.380 --> 26:07.270
have predicted these values correctly,

26:09.020 --> 26:10.670
or with minimal error.

26:11.580 --> 26:20.460
So this is the reason why we should always start with the very basic models, always implement the linear

26:20.460 --> 26:27.120
regression or logistic regression first to get the benchmark value, and then slowly and gradually

26:27.360 --> 26:32.400
get to the next models and see what is the level of improvement we are getting.

26:33.150 --> 26:40.590
If we are not getting much improvement using the higher level of algorithms, then we should always

26:40.590 --> 26:47.220
use the simpler model, that is, either linear regression or whichever model gives better accuracy.

26:47.970 --> 26:54.960
So that is how we should decide which model we should be choosing. Never go straight for a complex model;

26:55.170 --> 27:01.780
always decide by first having a simpler model which gives a good result to us.

27:02.520 --> 27:09.220
So that is the reason why, instead of having a decision tree, it would have been better to decide and select

27:09.220 --> 27:10.820
the linear regression here.

27:11.580 --> 27:17.550
But this does not mean that we will always select linear regression or logistic regression. Always

27:17.550 --> 27:20.340
go for the model which gives you better performance.

27:24.120 --> 27:29.970
Now, we have been creating this decision tree. This decision tree would have been better if we

27:29.970 --> 27:35.220
had just stopped at an earlier point of time. But how would we know where we should stop?

27:35.700 --> 27:43.440
So to stop, we should select from these five conditions. Any of these five conditions

27:43.440 --> 27:45.780
would help us to decide when we should stop.

27:46.140 --> 27:49.430
The first condition being when the node becomes homogeneous.

27:49.710 --> 27:57.300
That is when all the records in this particular node belong to the same class.

27:57.790 --> 28:04.500
When all the data becomes homogeneous, then we should stop splitting, because if the nodes are already

28:04.500 --> 28:09.950
homogeneous, then on splitting, there will be no information gain.

28:11.240 --> 28:20.990
Next is when a maximum depth of splitting is reached. So we should decide on that value, a maximum

28:20.990 --> 28:29.630
depth value, and based on the maximum value, we can stop at a depth of five, seven, 12 or 15, which is

28:29.630 --> 28:32.060
considered a good depth for a decision

28:32.090 --> 28:36.800
tree, and we should not allow the tree to grow a lot.

28:38.170 --> 28:44.950
Next is when the maximum number of leaf nodes that we allocate as a limit is reached.

28:45.100 --> 28:52.150
So if we do not know what depth we should be having, then we should at least set a limit to the number

28:52.150 --> 28:54.630
of leaf nodes which can be created.

28:55.380 --> 29:01.720
The next is when the leaf node has fewer observations than the lower limit of observations.

29:01.930 --> 29:07.420
So let's say I have one hundred rows of data and I am making a particular split.

29:09.460 --> 29:20.050
Now, for this particular split, let's say I already have a node having four values

29:20.050 --> 29:24.040
of one class and five values of another class.

29:24.970 --> 29:33.580
Now, if I make a split, then it will be giving me one node with four values

29:33.580 --> 29:36.550
and another node with five values of the other class.

29:37.520 --> 29:44.450
So this kind of split will be giving me one node which has less than five values.

29:45.420 --> 29:50.250
which is a very low number of values for a node out of one hundred rows of data.

29:50.970 --> 29:58.740
Now, I could have set the minimum number of leaf node observations to 20 in case I had a larger amount

29:58.740 --> 29:59.190
of data.

29:59.340 --> 30:03.810
But when I have only a hundred rows of data, five would be the lower limit.

30:04.400 --> 30:08.780
Let's say we have a thousand points in the entire data.

30:08.940 --> 30:13.150
Then in that case, we could have set 10 or 20 as the lower limit.

30:13.380 --> 30:20.030
So we have to decide on the basis of how much data we actually have so that we don't end up with a leaf

30:20.040 --> 30:25.920
node with a very minimal amount of data, which does not actually count for a good majority.

30:27.620 --> 30:32.690
Then there is the case where the split results in at least one leaf node having fewer values than the lower limit.

30:32.870 --> 30:41.210
So if either the node has fewer values than the lower limit, or the split after this particular node will result

30:41.210 --> 30:45.080
in fewer values than the lower limit, then we should stop the splitting.
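
NOTE
The stopping criteria described above can be sketched as a single check, kept inside a comment block so the cue timing is untouched.
This is a minimal illustration, not the lecturer's code: the function name should_stop, its argument names and the default limits are assumptions chosen for the example.
```python
def should_stop(labels, depth, n_leaves, max_depth=5, max_leaf_nodes=20,
                min_samples_leaf=5, child_sizes=None):
    """Return the reason to stop splitting a node, or None to keep splitting."""
    # 1. Homogeneous node: a further split gives no information gain.
    if len(set(labels)) <= 1:
        return "homogeneous"
    # 2. The chosen maximum depth has been reached.
    if depth >= max_depth:
        return "max depth reached"
    # 3. The chosen maximum number of leaf nodes has been reached.
    if n_leaves >= max_leaf_nodes:
        return "max leaf nodes reached"
    # 4. The node itself already holds fewer observations than the lower limit.
    if len(labels) < min_samples_leaf:
        return "node below minimum observations"
    # 5. The proposed split would create a child below the lower limit.
    if child_sizes is not None and min(child_sizes) < min_samples_leaf:
        return "split would create a leaf below minimum observations"
    return None
# The node from the lecture: four rows of one class and five of another.
node = ["a"] * 4 + ["b"] * 5
reason = should_stop(node, depth=2, n_leaves=3, child_sizes=[4, 5])
print(reason)
```
With a lower limit of five, the split into four and five rows is rejected because one child would hold only four rows.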

30:46.750 --> 30:53.470
So there are several hyperparameters which we have. The first hyperparameter is criterion, which

30:53.680 --> 30:59.500
defines which method we should be using, which is either gini or entropy.

30:59.830 --> 31:04.630
Its function is to measure the quality of a split. The supported criteria

31:04.630 --> 31:09.280
are gini for the Gini impurity and entropy for the information gain.
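
NOTE
To make the two criteria concrete, here is a small sketch of how Gini impurity, entropy and information gain can be computed from the class labels in a node.
The helper names (class_proportions, gini, entropy, information_gain) are illustrative assumptions, not names used in the lecture.
```python
import math
from collections import Counter
def class_proportions(labels):
    """Fraction of the node's rows that belong to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))
def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    return 0.0 - sum(p * math.log2(p) for p in class_proportions(labels))
def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted
mixed = ["yes"] * 5 + ["no"] * 5
pure = ["yes"] * 10
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
print(information_gain(mixed, [["yes"] * 5, ["no"] * 5]))
```
A pure node scores zero on both measures, which is why splitting a homogeneous node gives no information gain.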

31:10.180 --> 31:11.920
Now another is the splitter.

31:12.130 --> 31:16.400
With the splitter, we choose either the best split or a random split.

31:16.600 --> 31:21.040
So which split you should be choosing is what the splitter decides.

31:21.060 --> 31:25.260
So by default, the value is best and we should always go for the best.

31:26.110 --> 31:32.830
Next is maximum depth, so maximum depth is the maximum depth of the tree which we want to have.

31:33.160 --> 31:39.970
So if no value for maximum depth is chosen, then the nodes are expanded

31:39.970 --> 31:48.850
till the end of the entire tree, till all the leaves are pure or till they all contain fewer than the

31:48.850 --> 31:49.740
minimum samples needed for a split.

31:50.800 --> 31:58.030
So we should define a maximum depth or we should define the minimum samples split.

31:59.670 --> 32:07.620
Next is minimum samples split. Minimum samples split has a default value of two, so it will not allow a node

32:07.620 --> 32:12.470
to be split if it has fewer than two values.

32:12.690 --> 32:19.050
So the minimum number of samples required to split an internal node is two.

32:19.050 --> 32:25.060
Next is minimum samples leaf, that is, the minimum number of samples required to be at a leaf node.

32:25.360 --> 32:29.120
So there has to be at least one value at the leaf node.

32:29.520 --> 32:32.580
Now, next is maximum features.

32:32.730 --> 32:38.910
What is the maximum number of features we should consider when looking for the best split? Now we could

32:38.910 --> 32:46.050
either choose auto, or we can choose different values such as square

32:46.050 --> 32:46.840
root or log

32:46.870 --> 32:53.730
two. Now, maximum features allows us to select the maximum number of features, because we cannot really use

32:53.730 --> 33:01.110
all the features all the time, because then it will create a very huge bucket of rules.

33:01.230 --> 33:04.730
So that is the reason why we use max features.

33:05.100 --> 33:07.350
Next is maximum leaf nodes.

33:07.530 --> 33:12.060
So it allows us to grow a tree with a maximum number of leaf nodes.

33:12.300 --> 33:17.270
That is the maximum number of leaf nodes which could be created.
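
NOTE
The hyperparameters listed so far appear to correspond to the constructor arguments of scikit-learn's DecisionTreeClassifier; that mapping is an assumption, since the lecture does not name the library at this point.
The sketch below sets them explicitly on the iris data set, which is used only as a convenient stand-in. The values shown are illustrative, not recommendations.
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    criterion="gini",        # or "entropy"
    splitter="best",         # or "random"
    max_depth=5,             # None grows the tree until the leaves are pure
    min_samples_split=2,     # minimum rows needed to split an internal node
    min_samples_leaf=1,      # minimum rows required at a leaf node
    max_features="sqrt",     # features considered per split; "log2" is another option
    max_leaf_nodes=20,       # cap on the number of leaf nodes
    random_state=0,          # makes the feature sub-sampling repeatable
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```
The fitted tree's depth and leaf count stay within the limits that were set, whichever limit is hit first.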

33:17.880 --> 33:27.780
And lastly, we have class weight, which allows us to define if the classes are equally present in the

33:27.780 --> 33:28.770
data or not.

33:29.910 --> 33:31.830
So how does class weight help?

33:32.160 --> 33:35.250
So in case, let us say we have a table.

33:35.490 --> 33:37.150
We have two classes,

33:37.410 --> 33:38.430
yes and no.

33:38.820 --> 33:43.380
And there are a thousand rows of data, out of which five hundred rows have

33:44.280 --> 33:44.820
yes

33:44.820 --> 33:48.030
and five hundred rows have no as their target value.

33:48.420 --> 33:53.190
So in this case, we have an equal amount of data present for both the classes.

33:54.180 --> 34:01.590
So hence, we don't need to provide any weights to it. But in case, let us say,

34:01.610 --> 34:09.270
we have a scenario where one class has only 10 percent of the data and the other class has 90 percent of the

34:09.270 --> 34:18.000
data, then we will have to provide a weight to the smaller class because it has a very small amount of

34:18.000 --> 34:18.390
data.

34:18.780 --> 34:26.550
So what we will do is give the higher weight to the class which has less data.

34:26.670 --> 34:33.930
That is the one which has only 10 percent of the data, such that each point of that particular class

34:34.140 --> 34:43.890
is considered as nine times its actual count, so that there is a balance

34:43.890 --> 34:45.840
created between the classes.

34:46.530 --> 34:49.920
So we will be providing that in the class weight.
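
NOTE
The nine-times figure above can be checked with a short calculation. The sketch uses the inverse-frequency formula n_samples / (n_classes * class_count).
The function name balanced_class_weights and the 10/90 toy labels are assumptions made for this illustration.
```python
from collections import Counter
def balanced_class_weights(labels):
    """Weight each class inversely to how often it appears in the labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
# One class holds 10 percent of the rows and the other holds 90 percent.
labels = ["yes"] * 10 + ["no"] * 90
weights = balanced_class_weights(labels)
# How many times heavier one minority-class row counts than one majority-class row.
ratio = weights["yes"] / weights["no"]
print(weights, ratio)
```
Each row of the minority class ends up counting nine times as much as a row of the majority class, which balances the two classes.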

34:52.140 --> 34:56.460
So this is how we will create the maximum split.

34:58.520 --> 35:05.650
Next is overfitting. Next is to define overfitting, so the decision tree overfits

35:05.930 --> 35:12.410
when we have very high depth, or when there are certain observations which are noisy in nature and

35:12.910 --> 35:19.230
we have considered those observations or if we have some features which are not really required and

35:19.250 --> 35:20.620
they are noisy in nature.

35:20.810 --> 35:30.250
So in those situations, overfitting can happen, and thus we have other algorithms which can actually help in preventing

35:30.290 --> 35:34.560
overfitting in decision trees, which we will be learning later.

35:35.540 --> 35:42.710
For now, we will be moving towards the code of the decision tree and how we can actually implement decision

35:42.710 --> 35:43.110
trees.