WEBVTT

00:01.320 --> 00:09.210
Now that we have discussed different unsupervised learning methods for clustering, next comes

00:09.210 --> 00:14.340
the turn of a very important technique, which is dimensionality reduction.

00:14.640 --> 00:23.760
Dimensionality reduction helps us reduce the number of columns and select features appropriately.

00:24.000 --> 00:26.980
Now, feature selection and

00:27.060 --> 00:28.110
dimensionality reduction

00:28.440 --> 00:37.290
do the same thing, but in a different way, because in the case of feature selection, we already have

00:37.290 --> 00:46.320
several columns and several attributes, but we only have to select some of them and drop other columns

00:46.320 --> 00:51.330
from the dataset, while in the case of dimensionality reduction,

00:51.600 --> 00:59.790
we completely transform the dataset from, let's say, 80 columns or a hundred columns or three hundred

00:59.790 --> 01:09.010
columns into an altogether different dataset with a smaller number of columns, say 11 columns.

01:09.750 --> 01:18.090
Now, the amount of information which is present in the original dataset with eighty or a hundred

01:18.090 --> 01:26.910
or two hundred columns, will be similar to the amount of information which we will get after the transformation

01:26.910 --> 01:28.810
using dimensionality reduction.

01:29.640 --> 01:36.600
So in feature selection, we select a few features from the entire dataset.

01:36.840 --> 01:45.930
While in dimensionality reduction we convert the data into a fresh new dataset where we have nearly the same

01:45.930 --> 01:47.370
amount of information.

01:47.670 --> 01:56.070
The information which we have in the original data is largely intact, but the number of columns is now reduced.
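
NOTE
A minimal sketch of the difference described above, in plain Python with no libraries.
The toy rows, the column meanings, and the weights (0.5, 0.3, 0.2) are hypothetical and only
illustrate the idea: feature selection keeps original columns, while dimensionality reduction
builds new columns that blend all of them. PCA would compute such weights from the data.
```python
# Toy dataset: each row is (age, salary_in_thousands, years_employed). All values are made up.
rows = [
    (25, 30.0, 2),
    (35, 52.0, 10),
    (45, 71.0, 21),
    (55, 88.0, 30),
]
# Feature selection: keep some original columns unchanged and drop the rest.
def select_features(data, keep):
    return [tuple(row[i] for i in keep) for row in data]
# Dimensionality reduction: build NEW columns, each a weighted mix of all original columns.
# The weights passed in below are hypothetical; PCA would derive them from the data.
def project(data, weight_vectors):
    return [
        tuple(sum(w * x for w, x in zip(weights, row)) for weights in weight_vectors)
        for row in data
    ]
selected = select_features(rows, keep=[0, 1])
reduced = project(rows, weight_vectors=[(0.5, 0.3, 0.2)])
print(selected[0])  # the original age and salary values survive as they were
print(reduced[0])   # a single new column that blends all three originals
```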

01:57.030 --> 02:06.810
Now, because the number of columns has reduced, it is less complex to use them, and it also reduces

02:06.840 --> 02:10.740
the time consumed for training different models.

02:12.510 --> 02:13.890
So let us look further.

02:16.110 --> 02:26.040
So what is feature engineering? Feature engineering is the process of creating features so that we

02:26.040 --> 02:30.360
can use them for machine learning algorithm training.

02:31.110 --> 02:38.700
So feature selection is one part of feature engineering where we actually select different features

02:38.700 --> 02:41.130
from all the features which are already present.

02:42.650 --> 02:49.430
Now, Abraham Lincoln is often quoted as saying: give me six hours to chop down a tree and I will spend the first

02:49.430 --> 02:58.250
four sharpening the axe. This quote has great relevance in machine learning too. When

02:58.250 --> 03:05.600
it comes to building different machine learning models, most of the time needs to be spent on the data

03:05.600 --> 03:12.920
processing and feature engineering stage, where we actually transform the data and create new columns

03:13.130 --> 03:16.940
and select good columns from all the columns which we have.

03:17.960 --> 03:25.910
So what is this curse of dimensionality? Incomplete data with few features led to the development of a

03:25.910 --> 03:26.700
common myth.

03:27.440 --> 03:35.210
So when we have incomplete data and when we have few features, then we think that our model is not

03:35.210 --> 03:40.360
working well because we have less data and fewer features.

03:40.670 --> 03:41.980
But that is not true.

03:42.320 --> 03:45.170
Having fewer features is not a problem.

03:45.530 --> 03:52.400
Having features which do not carry good-quality information is the problem.

03:53.060 --> 04:00.440
So, supposedly, having more features and more data will always improve the accuracy of solving the machine learning

04:00.440 --> 04:00.950
problem.

04:01.130 --> 04:03.170
So this is the myth which is there.

04:03.350 --> 04:11.510
But in reality, this is not the case. Having a lot of features with very few data points

04:11.720 --> 04:19.560
when building a model often leads to a low-accuracy model, even with many features.

04:20.000 --> 04:22.310
This is called the curse of dimensionality.

04:22.520 --> 04:29.170
That is, an increase in the number of features results in a decrease in model accuracy.

04:30.230 --> 04:32.810
So increasing the number of features

04:34.320 --> 04:42.510
increases the model complexity. To be more precise, the model complexity increases exponentially.
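
NOTE
One way to picture the exponential growth mentioned here, as a rough sketch rather than a
formal definition: if every feature is split into 10 intervals, the feature space has 10 ** d
cells for d features, so a fixed dataset is spread ever more thinly. The function names, the
bin count, and the dataset size of 1000 are assumptions made for this illustration.
```python
# Split every feature into `bins` intervals; the feature space then has bins ** d cells.
def cells_in_feature_space(num_features, bins=10):
    return bins ** num_features
# Average number of data points available per cell for a fixed dataset size.
def points_per_cell(num_points, num_features, bins=10):
    return num_points / cells_in_feature_space(num_features, bins)
num_points = 1000  # hypothetical dataset size
for d in (1, 2, 3, 6):
    # Each extra feature multiplies the number of cells by `bins`.
    print(d, cells_in_feature_space(d), points_per_cell(num_points, d))
```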

04:42.930 --> 04:50.610
So basically we don't want to have so many features, because we want to have a smaller feature space,

04:50.790 --> 04:58.170
some good features which will be able to explain the problem, which will be able to represent the actual

04:58.170 --> 05:05.130
information, instead of having hundreds of irrelevant columns which don't really help

05:05.130 --> 05:10.410
us much and only increase the complexity and increase the calculation time.

05:10.440 --> 05:11.550
We don't want that.

05:11.550 --> 05:16.020
We only want a few good-quality features.

05:17.380 --> 05:19.940
So how can we do that?

05:20.050 --> 05:24.670
So there are two ways to stay away from this curse of dimensionality. One:

05:25.730 --> 05:33.020
Add more data to the problem. Second, reduce the number of features in the data. Now, adding more

05:33.020 --> 05:38.840
data to the problem is often not possible because we have a limited amount of data.

05:39.050 --> 05:43.910
And if we had more data, then we would have already used it.

05:44.120 --> 05:46.450
We have only a limited amount of data.

05:46.460 --> 05:51.590
But the feature space is so large that we don't know how we can reduce it.

05:51.860 --> 05:56.030
So adding data may not be possible in many scenarios.

05:56.270 --> 06:00.090
Hence, reducing the number of features is preferable.

06:00.290 --> 06:05.180
So for that we have a technique which is known as dimensionality reduction.

06:06.130 --> 06:17.740
So what does feature engineering do? Feature engineering allows us to identify the influential features among all

06:17.740 --> 06:23.270
the available features, so that the identified features are used to train the model.

06:24.040 --> 06:31.450
So identifying the influential features does not have to mean picking the features

06:31.990 --> 06:33.430
in an analytical way.

06:33.610 --> 06:39.880
So it doesn't have to mean that we have analyzed the data and found some features in

06:39.920 --> 06:41.560
an analytical way.

06:41.770 --> 06:48.780
It can be done in a different way also, where we actually do some conversion and get the

06:48.790 --> 06:50.570
meaningful features out of that.

06:50.950 --> 06:58.510
So without really applying the different methods which we have been using till now, we can actually

06:58.510 --> 07:06.040
reduce the size of data by using dimensionality reduction without applying any analytical knowledge,

07:06.040 --> 07:14.020
without applying different concepts like correlation values, finding out multicollinearity or VIF values,

07:14.020 --> 07:15.790
or applying [inaudible].

07:15.790 --> 07:18.260
No, none of that is really required.

07:18.820 --> 07:22.680
What we can actually do is basically use dimensionality reduction.

07:23.990 --> 07:31.580
One famous approach for dimensionality reduction is principal component analysis, so what we will

07:31.580 --> 07:35.600
be doing is we will be using PCA for now.

07:35.720 --> 07:37.800
So what is PCA exactly?

07:38.330 --> 07:44.150
So let's imagine a simple problem where we have a single goal.

07:44.150 --> 07:47.400
We are recording the motion of the spring here.

07:47.630 --> 07:49.880
This is one spring or pendulum.

07:50.880 --> 07:58.020
And this moves along one single direction, the direction which I am pointing out right

07:58.020 --> 07:58.300
now.

07:58.680 --> 08:07.950
Now, if someone is not really aware of where they should observe this particular

08:07.950 --> 08:13.110
motion from, what they will do is place different

08:13.110 --> 08:14.760
cameras at different angles.

08:15.030 --> 08:19.710
And placing cameras at all these angles will not really help.

08:20.920 --> 08:27.640
So if someone knows where the cameras have to be, then they will place three cameras and actually be

08:27.640 --> 08:29.310
able to capture the motion.

08:29.740 --> 08:37.300
But if someone is not aware of the exact direction of motion, then they might place a number

08:37.300 --> 08:42.790
of cameras to record the movement, at least three cameras.

08:43.000 --> 08:50.050
And we will place some additional cameras in case we don't know which cameras we need.

08:50.380 --> 08:57.460
So we will keep on adding cameras to capture this motion, but we will not be able to capture the motion

08:57.460 --> 09:02.550
entirely, or we will need many more cameras to capture the motion.

09:02.560 --> 09:08.440
This happens if we don't really know that we need to place these cameras at 90-degree angles to each other.

09:09.190 --> 09:16.900
So what we actually get from this is that cameras at 90-degree angles

09:16.900 --> 09:22.320
will be able to capture more information in comparison to cameras placed at arbitrary locations.

09:23.610 --> 09:26.970
So the same thing is what we will be doing in PCA.

09:27.900 --> 09:35.040
So PCA is a technique that transforms the old features, that is, the cameras, into a new

09:35.040 --> 09:42.450
dimensional space and represents them as a set of new orthogonal variables. Orthogonal variables, or

09:42.600 --> 09:46.530
orthogonal features, are the variables which are perpendicular to each other.

09:46.890 --> 09:50.280
So we already have some dimensional space.

09:50.460 --> 09:57.390
We already had a feature space which had multiple dimensions, but all of these dimensions were capturing

09:57.390 --> 09:59.190
the data in different ways.

09:59.700 --> 10:06.900
But because these dimensions were not really at 90-degree angles, they were not able to capture

10:06.900 --> 10:08.100
much of the information.

10:08.520 --> 10:09.630
Now, what did we do?

10:11.020 --> 10:19.300
We applied PCA. Using PCA, we created another dimensional space where

10:19.330 --> 10:22.060
all the dimensions were at 90 degrees to each other.

10:22.780 --> 10:29.500
Now, these dimensions at 90-degree angles were able to capture more information than the original dimensions

10:29.500 --> 10:31.040
were actually able to capture.

10:31.540 --> 10:38.170
These orthogonal features are called principal components.

10:38.500 --> 10:46.240
So these orthogonal variables will be able to observe the problem with a reduced set of features.

10:46.540 --> 10:50.350
Now, in practice, our data is like the motion of the pendulum.

10:50.590 --> 10:55.090
If we had complete knowledge of the system, we would require a smaller number of features.

10:55.420 --> 11:02.410
Now we have to observe the system using a set of features which will convey maximum information if they

11:02.410 --> 11:03.870
are orthogonal in nature.

11:04.090 --> 11:08.410
So the maximum information will be captured using orthogonal features.

11:08.890 --> 11:12.030
So this is done using principal component analysis.

11:12.250 --> 11:21.550
The new set of features which are produced after the transformation are linearly uncorrelated, as they

11:21.550 --> 11:22.470
are orthogonal.

11:22.720 --> 11:30.550
So the features which are created will be linearly uncorrelated, that is, the features will be linearly independent

11:30.550 --> 11:31.150
in nature.
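
NOTE
A small sketch of the claim above for the two-column case, in plain Python. For two columns
the first principal direction can be found with a closed-form rotation angle, so no linear
algebra library is needed. The data values and the helper names (mean, covariance, pca_2d)
are assumptions for illustration; a real project would more likely use a library routine.
The two new columns should come out with a covariance of (numerically) zero.
```python
import math
def mean(values):
    return sum(values) / len(values)
def covariance(xs, ys):
    # Sample covariance; covariance(xs, xs) is the sample variance of xs.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
def pca_2d(xs, ys):
    """Return (pc1_scores, pc2_scores) for two-column data."""
    mx, my = mean(xs), mean(ys)
    a, b, c = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)
    # Angle of the leading eigenvector of the symmetric covariance matrix [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2 * b, a - c)
    u1 = (math.cos(theta), math.sin(theta))    # first principal direction
    u2 = (-math.sin(theta), math.cos(theta))   # perpendicular to u1
    centered = [(x - mx, y - my) for x, y in zip(xs, ys)]
    pc1 = [px * u1[0] + py * u1[1] for px, py in centered]
    pc2 = [px * u2[0] + py * u2[1] for px, py in centered]
    return pc1, pc2
# Made-up, clearly correlated two-column data.
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
pc1, pc2 = pca_2d(xs, ys)
print(covariance(xs, ys))    # original columns are correlated
print(covariance(pc1, pc2))  # new columns: covariance is zero up to rounding error
print(covariance(pc1, pc1), covariance(pc2, pc2))  # first component carries more variance
```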

11:31.390 --> 11:38.860
So the thing which we were trying to do earlier with all those columns is actually being given to us

11:39.040 --> 11:45.850
with a smaller number of columns, basically in terms of these principal components, or the orthogonal

11:45.850 --> 11:49.630
features, which principal component analysis will be giving to us.

11:50.880 --> 12:00.120
And on top of that, getting these features is already good enough, but we are also getting these features

12:00.120 --> 12:04.640
in sorted order, and what is this sorted order?

12:04.650 --> 12:08.310
The sorted order is as follows. We will get principal components.

12:08.310 --> 12:09.090
So let's say

12:09.090 --> 12:14.220
we get ten principal components out of all the hundred columns which we had.

12:14.460 --> 12:19.920
Now these principal components will be sorted in order: the first principal component will be

12:19.920 --> 12:27.000
able to capture the maximum amount of information, then the second component will capture another amount

12:27.000 --> 12:33.030
of information, which will be the next largest in quantity, and so on in decreasing order.

12:33.210 --> 12:38.880
So what we can do is select the top three, top five, or top seven principal components, which will

12:38.880 --> 12:44.820
be able to capture most of the information, the amount of information we actually want them to capture.

12:45.360 --> 12:47.330
So what will it do?

12:47.610 --> 12:53.490
The first principal component alone will explain a very large portion of the variance in the data.

12:54.410 --> 13:02.000
The second principal component will explain less than the first component, but more than all the other

13:02.000 --> 13:07.530
components. The last principal component will explain only a small amount of variation in the data.

13:07.820 --> 13:15.710
So we retain the top principal components such that they together explain most of the data.

13:17.020 --> 13:26.440
So in most analytical problems, explaining 90 to 99 percent of the data is considered very high, so

13:26.530 --> 13:34.330
we will convert the data into principal components and then select the top few principal components,

13:34.330 --> 13:37.010
which will be able to explain most of the data.

13:37.570 --> 13:39.690
Now, what does it actually look like?

13:40.000 --> 13:42.210
So we will have some data.

13:42.230 --> 13:47.620
Let's say we have one hundred columns and they explain a hundred percent of the data, because

13:47.800 --> 13:48.700
that is the same data.

13:48.730 --> 13:51.790
So a hundred percent of the information is in those hundred columns.

13:52.250 --> 14:00.760
Now, we have applied principal component analysis and we have received these ten principal components.

14:02.490 --> 14:09.360
So each of these principal components will be able to explain some variance. What is this variance?

14:09.600 --> 14:12.490
This variance is the amount of information.

14:12.840 --> 14:17.280
So the first component here explains 40 percent of the information.

14:17.640 --> 14:20.370
Next component explains almost 20 percent.

14:20.700 --> 14:24.030
Next component explains a little less and so on.

14:24.330 --> 14:28.110
So now we will try to get these components.

14:28.120 --> 14:29.960
So I will pick the first component.

14:30.390 --> 14:36.290
It gave me 40 percent. Then I consider another component along with it.

14:36.570 --> 14:44.220
That is, I converted a hundred columns and I got these 10 columns out of those hundred columns after

14:44.220 --> 14:45.110
transformation.

14:45.420 --> 14:50.820
And now when I'm considering one column, I get 40 percent of the information.

14:51.180 --> 14:55.680
When I consider two columns, I get almost 60 percent of the information.

14:56.040 --> 15:01.370
When I consider these three columns, I get almost 70 percent of the information.

15:01.740 --> 15:08.550
And like this, when I go up to the seventh column, the seventh principal component, together they are able to explain

15:08.970 --> 15:13.180
more than 90 percent of the information, which is exactly what we want.

15:13.470 --> 15:21.750
So I take these seven components out of the ten components which were created for me, and now I have

15:21.750 --> 15:29.070
captured most of the information which was provided by one hundred columns earlier. After transformation and

15:29.070 --> 15:37.410
selecting these principal components, I have reduced the number of columns from one hundred to seven

15:37.410 --> 15:37.710
here.
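
NOTE
The selection step described above can be sketched as a running total over the sorted
components. The percentages below are hypothetical, chosen only to resemble the example
in the lecture (40 percent for the first component, seven components to pass 90 percent);
the function name components_needed is also an assumption. Whole-number percentages are
used so the arithmetic is exact.
```python
from itertools import accumulate
# Hypothetical share of variance explained by each of ten principal components,
# already sorted from largest to smallest, in percent.
explained_pct = [40, 19, 11, 8, 6, 4, 3, 3, 3, 3]
def components_needed(ratios_pct, target_pct):
    """Smallest number of leading components whose cumulative share reaches target_pct."""
    for count, running_total in enumerate(accumulate(ratios_pct), start=1):
        if running_total >= target_pct:
            return count
    return len(ratios_pct)
cumulative = list(accumulate(explained_pct))
print(cumulative)  # [40, 59, 70, 78, 84, 88, 91, 94, 97, 100]
print(components_needed(explained_pct, target_pct=90))  # 7
```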

15:38.250 --> 15:43.260
So now I will use these seven columns for my model generation.

15:44.520 --> 15:53.760
There is only one drawback which I have here, which is that these columns will not have any label associated with them,

15:54.210 --> 15:58.170
so these columns will not give me the column name.

15:58.440 --> 16:06.690
Earlier, I had column names like age, salary, the number of children, the number of houses

16:06.690 --> 16:08.640
they own, and different dimensions.

16:08.850 --> 16:12.000
But now these components have the information.

16:12.270 --> 16:15.720
But I cannot say that this is the age criteria.

16:15.750 --> 16:17.450
This is the salary criteria.

16:17.610 --> 16:23.880
No, because the age, the salary, and all those details have been somehow distributed between these seven

16:23.880 --> 16:24.450
components.

16:24.720 --> 16:30.610
The information is present here, but I don't know how it will be represented.

16:30.690 --> 16:35.940
It is somehow mixed and created and combined and presented in this way.

16:37.160 --> 16:45.680
So this is a very useful and very nice implementation, so it is something which has done wonders for

16:45.680 --> 16:45.900
us.

16:46.160 --> 16:52.460
So now, instead of working with all those features, if we are not really concerned with all the

16:52.460 --> 16:58.030
features, we can simply convert them into principal components and use them.

16:58.280 --> 17:03.920
But the only drawback which we will have is we will not be able to explain the relationships.

17:05.180 --> 17:13.000
If you want a black-box model and you want to create that black-box model with less complexity, just

17:13.010 --> 17:14.210
get the data, do

17:14.210 --> 17:23.540
PCA, that is, apply principal component analysis, get the components, and apply the regression or classification

17:23.540 --> 17:24.890
algorithm on top of it.

17:24.890 --> 17:25.950
And you're good to go.

17:27.290 --> 17:34.400
No need to do different things like data transformation and then feature selection; that

17:34.400 --> 17:35.510
all is not required.

17:35.900 --> 17:43.880
So this is something which is really helpful, but it comes with its own caveat, that you will not

17:43.880 --> 17:47.730
get to know what is explained inside this.

17:47.750 --> 17:50.780
You just get the features, use them, and stay happy.

17:52.190 --> 17:58.490
So here are a few terms like covariance. What is covariance?

17:58.490 --> 18:00.200
We have already discussed this.

18:00.440 --> 18:03.860
So covariance is shown here.

18:03.860 --> 18:06.470
You can see the value is decreasing.

18:06.800 --> 18:10.310
So this has a large negative covariance here.

18:10.310 --> 18:16.160
We have near-zero covariance, and here we have a large positive covariance.

18:16.430 --> 18:18.350
I hope you already know this.

18:19.160 --> 18:24.600
When a greater value of one variable mainly corresponds to a lesser value of the other,

18:25.610 --> 18:28.730
the covariance is negative.

18:29.060 --> 18:37.280
The sign of the covariance shows the tendency in the relationship between the variables. The magnitude

18:37.280 --> 18:41.960
of the covariance is not easy to interpret because it is not normalized.

18:41.960 --> 18:44.540
Hence it depends on the magnitudes of the variables.

18:44.900 --> 18:49.530
The normalized version of the covariance is called the correlation coefficient.

18:49.850 --> 18:54.410
It shows, by its magnitude, the strength of the linear relationship.

18:54.410 --> 19:02.660
So covariance provides the direction of the relationship, while correlation additionally provides the strength

19:02.660 --> 19:03.590
of the relationship.
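
NOTE
A short sketch of the point above, in plain Python: the covariance changes when a variable
is expressed in a different unit, while the correlation coefficient stays the same. The
data values and variable names are made up for illustration, and the helper functions are
written out here rather than taken from a library.
```python
import math
def mean(values):
    return sum(values) / len(values)
def covariance(xs, ys):
    # Sample covariance; covariance(xs, xs) is the sample variance of xs.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
def correlation(xs, ys):
    # Covariance divided by both standard deviations; always between -1 and 1.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))
# Made-up data in which the second variable falls as the first one grows.
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
salary_thousands = [90.0, 82.0, 71.0, 65.0, 52.0]
salary_units = [s * 1000 for s in salary_thousands]  # same data, different unit
# The sign is negative in both cases, but the covariance is 1000 times larger in the second.
print(covariance(distance_km, salary_thousands), covariance(distance_km, salary_units))
# The correlation is the same in both cases and close to -1.
print(correlation(distance_km, salary_thousands), correlation(distance_km, salary_units))
```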

19:07.460 --> 19:14.240
So now for the details of principal components, that is, how do we transform a given set of

19:14.240 --> 19:18.720
features into new features such that they are orthogonal?

19:19.340 --> 19:22.600
So the answer for this is the eigenvectors of the covariance matrix.

19:22.940 --> 19:26.740
So we know that the eigenvectors of a symmetric matrix, such as the covariance matrix, are orthogonal to each other.

19:27.020 --> 19:34.010
So transforming our features in the direction of the eigenvectors will also make them orthogonal. Before

19:34.010 --> 19:35.390
transforming the matrix,

19:35.420 --> 19:42.080
it is always recommended to normalize, so always normalize the data and then apply

19:43.640 --> 19:44.120
PCA.

19:44.540 --> 19:50.120
So if the matrix is not normalized, our transformation will always be in favor of the feature with the

19:50.120 --> 19:51.740
largest scale of values.

19:52.040 --> 19:56.230
This is why PCA is sensitive to the relative scaling of the original variables.

19:56.450 --> 20:01.790
So it is very important to normalize the data and then apply PCA on it.
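
NOTE
A rough two-column illustration of this sensitivity to scale, in plain Python. The age and
salary values are made up, and the helpers (standardize, first_direction) are hypothetical
names for this sketch. Without normalization the first principal direction points almost
entirely along the column with the larger numbers; after standardizing each column to mean 0
and standard deviation 1, both columns contribute equally.
```python
import math
def mean(values):
    return sum(values) / len(values)
def covariance(xs, ys):
    # Sample covariance; covariance(xs, xs) is the sample variance of xs.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
def standardize(values):
    # Rescale to mean 0 and standard deviation 1.
    m = mean(values)
    sd = math.sqrt(covariance(values, values))
    return [(v - m) / sd for v in values]
def first_direction(xs, ys):
    """Unit vector of the first principal component for two-column data."""
    a, b, c = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)
    # Angle of the leading eigenvector of the covariance matrix [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2 * b, a - c)
    return math.cos(theta), math.sin(theta)
# Made-up data: age in years, salary in plain currency units (much larger numbers).
age = [25.0, 32.0, 41.0, 47.0, 58.0]
salary = [30000.0, 42000.0, 39000.0, 61000.0, 58000.0]
raw = first_direction(age, salary)
scaled = first_direction(standardize(age), standardize(salary))
print(raw)     # almost entirely along the salary axis
print(scaled)  # both columns carry equal weight
```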

20:02.000 --> 20:03.910
Now, why do we need this?

20:03.920 --> 20:11.030
We need this to remove multicollinearity, to reduce the number of features, and to be able to visualize

20:11.030 --> 20:11.860
the data.

20:13.210 --> 20:16.930
Here you can see one diagram.

20:16.960 --> 20:26.980
So here we have converted high-dimensional data into just two dimensions, where we can easily find

20:27.010 --> 20:27.640
clusters.

20:29.410 --> 20:35.470
So we are able to visualize the clustering because the data has been transformed into two dimensions.

20:36.910 --> 20:43.800
So this is about PCA, and I hope you will be able to use PCA very nicely.

20:44.020 --> 20:51.130
So we will discuss the coding for dimensionality reduction using PCA in the next session.

20:51.130 --> 20:54.970
So I hope it will be a great learning experience.

20:55.300 --> 21:02.920
And with this, we end the sessions for the general topics which we have in machine learning.

21:03.160 --> 21:11.650
Apart from this, I will be adding another video for the metrics, which we will be using for the clustering

21:11.650 --> 21:12.440
algorithms.

21:12.880 --> 21:15.160
So thank you very much.

21:15.970 --> 21:22.330
The next thing which you have to do is work on the project and I will provide proper guidelines on the

21:22.330 --> 21:22.930
project.

21:22.930 --> 21:29.560
How you need to work on the project and I hope you have been working on the assignments and the assessments

21:30.010 --> 21:32.040
and performing really well on it.

21:32.620 --> 21:41.170
So have a great time learning, and I hope you will learn a lot and grow a lot in the machine learning domain

21:41.500 --> 21:44.740
and have a great future in the same.

21:45.400 --> 21:46.030
Thank you.