WEBVTT

00:01.560 --> 00:08.000
Hello, I hope you have learned a lot from these supervised learning and unsupervised learning sessions.

00:09.030 --> 00:13.380
The supervised learning project consists of two projects.

00:14.280 --> 00:20.150
The first project is a regression project and the second one is a classification project.

00:21.090 --> 00:31.530
This particular review will give you an outline of how you should work on the project in case of supervised

00:31.530 --> 00:32.010
learning.

00:32.190 --> 00:35.480
As we already know, we have two types of variables.

00:36.030 --> 00:43.020
The first is the value which we want to predict, and the second is the set of values using

00:43.020 --> 00:49.140
which we will learn certain patterns about that value and predict it.

00:49.860 --> 00:51.810
So what is the data which we have here?

00:56.800 --> 01:01.660
Here we have this entire dataset, which is about real estate.

01:03.380 --> 01:12.110
The target here is the price column, and here you can see we have around twenty-one thousand rows of

01:12.110 --> 01:16.160
data. In these twenty-one thousand rows of data,

01:16.170 --> 01:21.520
we have various details along with the price, which is the value you need to predict.

01:22.280 --> 01:24.810
As you can see, this is a continuous value.

01:24.830 --> 01:27.590
So this should be a regression problem.

01:28.940 --> 01:33.320
Now, there are a lot of columns which are present in this particular dataset.

01:34.700 --> 01:43.620
The columns include the ID, the number of bedrooms, the number of bathrooms, the area of the house, the waterfront

01:43.640 --> 01:48.060
indicator (whether the house is on the waterfront), the condition level of the house,

01:48.290 --> 01:50.300
the grade of the house,

01:50.570 --> 01:55.490
when it was built, when it was renovated, the zip code, and so on.

01:56.450 --> 02:00.020
Using all of these details, you need to create

02:01.140 --> 02:09.030
features and then use them. As you can see, there are no categorical columns present, so you need

02:09.030 --> 02:16.500
not convert anything into dummy variables, but you can consider the year of renovation and the

02:18.080 --> 02:25.100
zip code as categorical columns, or convert the year of renovation and look for some relationship involving

02:25.730 --> 02:27.220
the renovation.

02:27.530 --> 02:32.990
Similarly, you can find out some details about this date, what this date signifies.

02:33.320 --> 02:35.510
So this is what you can do.
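
NOTE
Editor's sketch, not part of the narration: one possible way to derive features from the
renovation year, the zip code, and the sale date, as suggested above. The field names
(date, yr_built, yr_renovated, zipcode), the ISO date format, and the convention that
0 means "never renovated" are all assumptions; check them against the real dataset.
```python
from datetime import datetime
# Hypothetical field names and date format; adjust to the actual columns.
def engineer_features(row):
    """Derive illustrative features from one raw record given as a dict."""
    sold = datetime.strptime(row["date"], "%Y-%m-%d")
    renovated = row["yr_renovated"] > 0  # assumes 0 means "never renovated"
    last_work = row["yr_renovated"] if renovated else row["yr_built"]
    return {
        "sale_year": sold.year,
        "sale_month": sold.month,
        "was_renovated": int(renovated),
        "years_since_work": sold.year - last_work,
        "zipcode": str(row["zipcode"]),  # treat as a category, not a number
    }
# Made-up record for illustration only.
example = {"date": "2014-10-13", "yr_built": 1955, "yr_renovated": 1991, "zipcode": 98125}
features = engineer_features(example)
print(features)
```
Keeping the zip code as a string is one way to stop a model from reading it as a quantity.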

02:39.640 --> 02:46.810
And similarly, these are some ideas. Now, based on your learning, you will know better what you need

02:46.810 --> 02:47.930
to do with these ideas.

02:49.150 --> 02:56.410
So now once you have this entire dataset, you will follow the process, which has been clearly defined

02:56.410 --> 02:57.930
and described to you.

02:59.780 --> 03:06.650
So you will look for outliers and missing data, and in case there are outliers or

03:06.650 --> 03:11.340
missing data present, you will try to impute values for them.

03:11.540 --> 03:16.260
Which values you should impute, you will have learned during data preparation.
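
NOTE
Editor's sketch, not part of the narration: the narration does not say which values to
impute or how to detect outliers, so median imputation and the 1.5 x IQR rule used here
are assumed, commonly used choices. The sample values are made up for illustration.
```python
from statistics import median, quantiles
def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]
def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]
# Toy column with two missing entries and one suspicious value.
bedrooms = [3, 2, None, 4, 3, 33, 3, None, 2, 4]
filled = impute_median(bedrooms)
outliers = iqr_outliers(filled)
print(filled, outliers)
```
The median is used rather than the mean because a single extreme value shifts it far less.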

03:17.030 --> 03:23.060
Similarly, you can create new columns, and using the techniques which we have discussed,

03:23.300 --> 03:27.260
you can remove columns and do feature selection.

03:28.800 --> 03:35.040
Then you will be applying different algorithms and comparing the performances.

03:36.150 --> 03:43.000
When comparing performances, note that you don't need to improve the model a lot.

03:43.950 --> 03:49.080
Model improvement is a domain where you need to know that even

03:50.040 --> 03:58.450
if the accuracy you see at one point in time is about eighty-eight percent using a linear model

03:58.620 --> 04:00.750
and, on using a random forest,

04:00.750 --> 04:09.360
let us say the accuracy changes to eighty-eight point one, then it is not recommended

04:09.360 --> 04:17.520
to use a random forest because we are not gaining a lot in terms of accuracy, but we are increasing

04:17.520 --> 04:18.840
the complexity of the model.

04:20.110 --> 04:27.520
So always weigh how much you are increasing the complexity of the model against how much you are

04:27.520 --> 04:29.050
gaining in terms of accuracy.

04:29.800 --> 04:32.050
Same thing goes for stacking.

04:32.560 --> 04:40.450
If your random forest, or whichever model you are using, is giving you an accuracy of eighty-eight point

04:40.450 --> 04:45.630
six percent and a stacking model gives you eighty-eight point seven percent,

04:45.970 --> 04:53.500
then go ahead and use the old model, the random forest or whatever model you have used, instead

04:53.500 --> 04:59.350
of going for the stacking model, because again, the stacking model will be a highly complex model,

04:59.530 --> 05:01.270
which you don't want to use.
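
NOTE
Editor's sketch, not part of the narration: one way to turn the complexity versus accuracy
advice above into a rule. The two scores are the ones quoted in the narration; the
complexity ranks and the 0.5 point minimum gain are hypothetical values chosen for
illustration, and pick_model is an illustrative helper, not a library function.
```python
def pick_model(candidates, min_gain=0.5):
    """Keep the simplest model unless a more complex one gains at least min_gain points."""
    ordered = sorted(candidates, key=lambda c: c["complexity"])
    chosen = ordered[0]
    for cand in ordered[1:]:
        # Only accept extra complexity when the improvement is worth it.
        if cand["score"] - chosen["score"] >= min_gain:
            chosen = cand
    return chosen
candidates = [
    {"name": "random forest", "complexity": 2, "score": 88.6},
    {"name": "stacking", "complexity": 3, "score": 88.7},
]
best = pick_model(candidates)
print(best["name"])
```
With a gain of only about 0.1 points, the rule keeps the random forest over the stacking model.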

05:02.580 --> 05:10.410
OK, so these are a few things which you need to keep in mind. Also remember, there is no great

05:10.410 --> 05:11.380
or good model.

05:11.820 --> 05:15.700
A model should have good accuracy,

05:15.700 --> 05:21.090
and your target here is to learn how to implement the different algorithms.

05:21.720 --> 05:25.870
So do not stick to any specific algorithm.

05:26.100 --> 05:33.840
What I would recommend to you is to figure out the dataset and implement all the algorithms which you

05:33.840 --> 05:37.470
have learned and improve each and every algorithm.

05:38.770 --> 05:45.790
By this, you will learn how to fine tune different models and you will also learn how you can compare

05:45.790 --> 05:46.870
different models.

05:50.440 --> 05:59.440
During this entire course, I have used a lot of data sets, and this gives you an additional opportunity

05:59.740 --> 06:03.300
to use those datasets for your own learning.

06:03.880 --> 06:09.510
So as a project, you can implement several algorithms on this particular dataset.

06:09.700 --> 06:16.060
And after you are done with the project, as further learning, you can pick any of the datasets which I

06:16.060 --> 06:24.040
have shared and then start exploring them, start working on them, and implementing different algorithms

06:24.040 --> 06:25.660
on top of those datasets.

06:26.520 --> 06:33.610
That is the reason why I have shared so many datasets, so that at the end of the course you will have enough

06:33.610 --> 06:35.020
data to practice on.

06:35.900 --> 06:45.860
So I hope you will work hard on this project and improve the accuracy of the models and create great

06:45.860 --> 06:49.970
models and have a bright future using this learning.

06:50.660 --> 06:51.230
Thank you.
