WEBVTT

00:02.380 --> 00:09.060
Let us implement the code. The first thing which we will be doing is to import the pandas

00:09.060 --> 00:10.140
library.

00:12.050 --> 00:17.290
So after importing the libraries, we will read the data.

00:18.560 --> 00:24.100
We have done this earlier also, so we will read the CSV file by using

00:24.110 --> 00:24.730
pd.read_csv.

00:25.040 --> 00:28.140
And this is the real estate data set, which we will be using.

00:28.970 --> 00:34.790
So using the head function, I will be retrieving the top 10 rows of the data.

00:35.390 --> 00:37.470
So these are the top 10 rows of the data.
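
NOTE
A minimal sketch of the loading step, assuming pandas is installed. The file name is not given
in the recording, so an in-memory three-row sample (hypothetical) stands in for the real CSV,
and the variable name real_estate is an assumption.
```python
import io
import pandas as pd
# Hypothetical sample standing in for the real estate CSV file.
csv_text = (
    "No,X2 house age,Y house price of unit area\n"
    "1,32.0,37.9\n"
    "2,19.5,42.2\n"
    "3,13.3,47.3\n"
)
# pd.read_csv accepts a file path or any file-like object.
real_estate = pd.read_csv(io.StringIO(csv_text))
# head(10) returns at most the first 10 rows.
top_rows = real_estate.head(10)
print(top_rows)
```
With the real file, the StringIO object would be replaced by the path to the CSV.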

00:37.790 --> 00:39.560
The columns are No,

00:40.610 --> 00:49.330
X1 transaction date, X2 house age, X3 distance to the nearest MRT station, then X4, which

00:49.330 --> 00:55.870
is number of convenience stores, X5 latitude, X6, which is longitude, and Y, which

00:55.870 --> 00:57.440
is the house price of unit area.

00:59.130 --> 01:07.800
So it is self-explanatory that we have these several X values, which are six feature columns or six

01:07.800 --> 01:13.250
attributes, using which we want to find the value of Y.

01:14.230 --> 01:16.510
So this is the Y value, which is the target value.

01:16.540 --> 01:21.460
This is the value that we want to predict, and these are the X values which we already have.

01:22.080 --> 01:29.230
Now, while training the model, we will provide all of this data to the model to learn

01:29.230 --> 01:38.380
from, and then we will provide it unseen data so that the model can actually predict the

01:38.380 --> 01:38.980
values.

01:41.220 --> 01:50.910
So first of all, we check the summary of this data so that we can see the values of the columns. You

01:50.910 --> 01:52.850
can see all of these are numerical.

01:53.010 --> 01:56.790
So we don't really have to apply any data transformations here.

01:57.850 --> 02:02.610
So the count of values for No is 414.

02:04.090 --> 02:09.190
The minimum value is 1 and the maximum value is 414.

02:10.330 --> 02:13.900
Then we have the next column, X1, which is transaction date.

02:15.190 --> 02:17.770
So here the values are again

02:18.760 --> 02:23.960
similar: the transaction date has values such as 2012.917.

02:24.250 --> 02:25.740
These are in numeric form.

02:26.560 --> 02:34.240
So what we can do is transform this transaction date to give three different columns:

02:34.510 --> 02:41.050
one with the year, a second one with the month, and a third one with the day of the month.
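
NOTE
A rough sketch of that transformation, which the recording describes but does not show.
It assumes the integer part of the value is the year and the fractional part is the elapsed
share of that year; the data set may encode dates differently. The helper name is hypothetical.
```python
import pandas as pd
def split_decimal_year(value):
    """Assumed encoding: integer part = year, fraction = elapsed share of that year."""
    year = int(value)
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year + 1, month=1, day=1)
    # Move from 1 January by the fractional share of the year's length.
    stamp = start + (end - start) * (value - year)
    return stamp.year, stamp.month, stamp.day
dates = pd.DataFrame({"X1 transaction date": [2012.917, 2013.0, 2013.5]})
rows = dates["X1 transaction date"].map(split_decimal_year).tolist()
date_parts = pd.DataFrame(rows, columns=["year", "month", "day"], index=dates.index)
# Attach the three new columns next to the original decimal value.
dates = dates.join(date_parts)
print(dates)
```
Keeping the original column alongside the derived ones makes it easy to check the conversion by eye.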

02:43.050 --> 02:51.780
Next, we have house age. House age is again in numeric form, which is perfectly fine. Then we have

02:51.780 --> 02:52.500
distance.

02:53.730 --> 02:59.790
Then we have number of convenience stores, which is, again, in numeric form. So as you see, all

02:59.790 --> 03:03.600
of the columns are in numeric form, so we don't have to do much to them.

03:05.780 --> 03:06.960
So let's go further.

03:07.250 --> 03:14.370
Let us check how many null values are present. To remove the null values, we can simply call

03:14.390 --> 03:16.310
real_estate.dropna(),

03:17.720 --> 03:23.960
which will allow us to drop all the null values. You can see that there were no null values, so

03:23.960 --> 03:27.170
there are still 414 rows in the data.
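
NOTE
A small sketch of counting and dropping null values, assuming pandas and NumPy. The tiny
data frame is invented for illustration and deliberately contains one missing value.
```python
import numpy as np
import pandas as pd
# Invented sample with one missing house age.
real_estate = pd.DataFrame({
    "X2 house age": [32.0, np.nan, 13.3],
    "Y house price of unit area": [37.9, 42.2, 47.3],
})
# Number of null values in each column.
null_counts = real_estate.isnull().sum()
# dropna returns a new frame without the rows that contain nulls.
cleaned = real_estate.dropna()
print(null_counts)
print(len(real_estate), "rows before,", len(cleaned), "rows after")
```
If the row count is unchanged after dropna, as in the lecture, the data had no null values.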

03:30.010 --> 03:34.840
Next, we will check the mean. For the house age mean

03:36.480 --> 03:43.560
value, we take the X2 house age values from the real estate data

03:43.560 --> 03:52.400
frame and find the mean of them. Then we can simply print the mean age of the house.

03:52.770 --> 03:57.570
So the mean house age is 17.71 years.

03:58.670 --> 04:05.400
Next, we have mean convenience stores. The mean number of convenience stores is calculated from the

04:05.660 --> 04:11.230
real estate data frame, then the column name, and then we apply the mean function to it.

04:11.810 --> 04:15.980
So we get the mean number of convenience stores as 4.09.
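
NOTE
A short sketch of the two mean calculations, assuming pandas. The values are invented, so the
printed means differ from the 17.71 and 4.09 quoted for the real data.
```python
import pandas as pd
# Invented sample values for the two columns discussed above.
real_estate = pd.DataFrame({
    "X2 house age": [10.0, 20.0, 30.0],
    "X4 number of convenience stores": [2, 4, 9],
})
# Select the column by name, then apply the mean function to it.
mean_house_age = real_estate["X2 house age"].mean()
mean_stores = real_estate["X4 number of convenience stores"].mean()
print(f"Mean house age: {mean_house_age:.2f} years")
print(f"Mean number of convenience stores: {mean_stores:.2f}")
```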

04:18.690 --> 04:23.920
Now, let us make a few plots out of this data.

04:24.330 --> 04:27.740
So first we will create a regression plot.

04:28.620 --> 04:37.800
So here we are creating a regression plot of the data with the X2 house age value as the X value,

04:38.190 --> 04:41.620
in comparison with the Y value.

04:41.850 --> 04:45.960
So we are comparing the house age with the Y value.

04:47.050 --> 04:52.360
So you can see that the data is only slightly

04:53.330 --> 04:59.880
in a straight line, and there is not a very high correlation between the values.

05:00.110 --> 05:06.970
So let us see what the data says. You can see we are finding the correlation value.

05:07.550 --> 05:14.960
We are importing stats from SciPy and we are finding the Pearson coefficient and the p-value.

05:17.430 --> 05:24.780
The Pearson coefficient is nothing but the correlation value, so we are running stats.pearsonr

05:25.990 --> 05:31.240
and we are passing the X2 house age and the Y house price to it.

05:32.150 --> 05:39.680
So we have found that the correlation between the house age and the house price, the Pearson coefficient

05:40.250 --> 05:42.910
value, is -0.21.

05:43.160 --> 05:47.930
So you can see the correlation value is negative, which is also visible from the graph.

05:48.950 --> 05:55.190
Because it is going in the negative direction, but it is not going in a very straight line; it

05:55.190 --> 05:57.630
is not going in a complete diagonal line.

05:57.980 --> 06:03.590
So it shows that the relationship is a weak relationship, and the coefficient also has a

06:04.880 --> 06:10.730
low value. So it is not a very strong relationship; it is a value of -0.21.

06:11.150 --> 06:20.030
Next we have the p-value, which is displayed as 1.56 in scientific notation, which is, again, not a very high p-value.
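
NOTE
A minimal sketch of the correlation check with scipy.stats.pearsonr, assuming SciPy and NumPy.
The six data points are invented, so the coefficient and p-value differ from the lecture's.
```python
import numpy as np
from scipy import stats
# Invented sample: price tends to fall as house age rises.
house_age = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
house_price = np.array([50.0, 42.0, 47.0, 38.0, 41.0, 33.0])
# pearsonr returns the correlation coefficient and the two-sided p-value.
pearson_coef, p_value = stats.pearsonr(house_age, house_price)
print("Pearson coefficient:", pearson_coef)
print("p-value:", p_value)
```
A negative coefficient matches a downward-sloping regression plot; very small p-values are
printed in scientific notation, for example with a trailing e-05.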

06:23.050 --> 06:30.400
So now we will check again for the X3 column, X3 being the distance to the nearest MRT station,

06:30.610 --> 06:36.630
with respect to Y. We will find the correlation value and the p-value for this.

06:37.000 --> 06:42.460
Now, here, the correlation value is -0.67, which is, again, not a very high

06:42.460 --> 06:48.370
correlation, and the p-value, displayed as 4.6 in scientific notation, is also an acceptable value.

06:50.460 --> 06:58.320
Next, we are comparing the X4 column with Y. For this, the correlation value is

06:58.320 --> 07:01.140
0.571, which is, again, not a very high value.

07:02.360 --> 07:07.060
And the p-value is displayed as 3.4 in scientific notation, so it is also acceptable.

07:08.830 --> 07:15.520
Next, we have a correlation between the latitude and the house price.

07:16.570 --> 07:21.810
So on comparison, the correlation coefficient is, again, not very high.

07:24.280 --> 07:32.830
Next, we have the relation with longitude. The longitude correlation coefficient is also not

07:32.830 --> 07:41.160
very high. We would have removed this column only if this value had been greater than 0.9;

07:41.320 --> 07:43.600
only then would we remove this particular column.

07:44.020 --> 07:46.660
Now, I have done this one by one.

07:46.660 --> 07:50.340
I have found out the correlation coefficient one by one.

07:50.800 --> 07:59.500
You could have simply applied pandas profiling for finding multicollinearity and finding

07:59.500 --> 08:01.020
the details of the data.

08:01.030 --> 08:05.160
But we have already done that and we have already learned about this method,

08:05.440 --> 08:08.100
so I'm showing you a different method here.

08:09.500 --> 08:16.580
Apart from this, you can apply the function which we built in the correlation class,

08:16.880 --> 08:22.970
where we discussed that function, which would allow us to find the correlation

08:22.980 --> 08:30.950
coefficients for all the columns and then allow us to remove the columns which had a very high correlation

08:30.950 --> 08:31.520
coefficient.
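
NOTE
A hedged sketch of checking all columns at once, assuming pandas and NumPy. The helper
drop_highly_correlated is hypothetical and is not the function built earlier in the course;
it drops one column from each pair whose absolute correlation exceeds the 0.9 threshold.
```python
import numpy as np
import pandas as pd
def drop_highly_correlated(frame, threshold=0.9):
    """Hypothetical helper: drop the later column of each highly correlated pair."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop
# Invented sample: column b is almost exactly 2 * a, column c is unrelated.
features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced, dropped = drop_highly_correlated(features)
print("Dropped:", dropped)
print(reduced.columns.tolist())
```
Which column of a pair to drop is a design choice; this sketch keeps the first one encountered.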

08:36.450 --> 08:44.730
So now let's go ahead and apply the linear regression on top of it, so for applying the linear regression,

08:44.730 --> 08:50.810
we will first have to import the model from the model library.

08:51.330 --> 08:53.760
So the library is sklearn.

08:54.450 --> 09:01.900
The sklearn library is a very extensive library, which has a lot of models in it.

09:02.310 --> 09:08.160
So models like regression, decision trees, whatever

09:08.170 --> 09:10.800
we will be learning, we will be learning

09:11.070 --> 09:14.490
using the sklearn implementations of them.

09:15.270 --> 09:17.620
So we are importing LinearRegression from sklearn.

09:19.370 --> 09:23.540
So the first thing which we do is we create an object

09:24.560 --> 09:25.760
of the linear model.

09:27.500 --> 09:32.260
And after creating this particular object, we will get the data.

09:33.250 --> 09:37.660
So the data which we have is the X value and the Y value.

09:38.560 --> 09:43.040
So we will need to have the X values and the Y data frame separated.

09:43.810 --> 09:52.490
So I'm getting the value of X2 house age in my X data frame.

09:53.110 --> 09:58.750
This is because I want to implement a simple linear regression and I want to implement it on the basis

09:58.750 --> 10:00.010
of just one column.

10:00.010 --> 10:06.280
Later, we will see how we implement it for multiple columns, but for now we are implementing it

10:06.280 --> 10:07.520
for just one column.

10:07.930 --> 10:14.470
When we implement it using only one column, the equation which we will be getting

10:14.740 --> 10:22.300
would be something like y = mx + c, or y = β0 + β1 × x1.

10:22.630 --> 10:25.320
So this is what we are trying to implement as of now.

10:26.720 --> 10:34.110
So this is the X data frame and this is the Y data frame. The X data frame will have the X values and the Y data frame

10:34.160 --> 10:38.870
will have the Y values. The next thing which we do is we train the model.

10:39.110 --> 10:44.030
So to train the model, we simply say lm.fit, and inside it

10:44.030 --> 10:48.110
we provide the X and Y columns.

10:49.070 --> 10:53.960
We provide only the X and Y columns, so based on these X and Y columns,

10:56.920 --> 11:03.280
we will be training our model. When we run this line of code, the

11:03.280 --> 11:06.670
model trains and learns from the data.

11:07.660 --> 11:15.160
After training and learning from the data, we can use lm.predict to actually make predictions

11:15.760 --> 11:17.520
on the data.

11:17.800 --> 11:21.460
So I'm giving the column on which I want to make predictions.

11:21.610 --> 11:25.530
I'm giving the new X values, and I can give all the X values

11:25.540 --> 11:29.810
also; I can give any X values which are from the same column.

11:30.520 --> 11:34.590
So we have to make sure the column that we are using is the same here.

11:34.930 --> 11:43.080
Whatever type of input we have given in fit, the same type of input has to be given in predict.

11:43.660 --> 11:50.440
If we have given some training data for X, then the testing data for X has to be given here.

11:51.470 --> 11:58.760
So the same format would be expected here, and once we call lm.predict, it will predict the

11:58.910 --> 12:03.470
respective values for X and give them back in Yhat.
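
NOTE
A minimal sketch of the fit and predict steps with scikit-learn, assuming pandas is also
installed. The five rows are invented, and the names X, Y, lm and Yhat are assumptions
based on how they sound in the recording.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression
# Invented sample with one feature column and the target column.
real_estate = pd.DataFrame({
    "X2 house age": [5.0, 10.0, 15.0, 20.0, 25.0],
    "Y house price of unit area": [48.0, 44.0, 41.0, 35.0, 32.0],
})
# Double brackets keep X two-dimensional, which fit and predict expect.
X = real_estate[["X2 house age"]]
Y = real_estate["Y house price of unit area"]
lm = LinearRegression()
lm.fit(X, Y)
# Predictions for the training rows.
Yhat = lm.predict(X)
# New inputs must have the same column layout as the training input.
new_X = pd.DataFrame({"X2 house age": [12.0, 40.0]})
new_predictions = lm.predict(new_X)
print(Yhat)
print(new_predictions)
```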

12:05.640 --> 12:07.980
So these are the values which we have predicted.

12:08.460 --> 12:10.510
These are the different values which we have predicted.

12:10.980 --> 12:18.600
Now we have three functions available, three attributes available, which are lm.intercept_,

12:19.260 --> 12:24.120
lm.coef_ and lm.score.

12:24.840 --> 12:31.880
lm.intercept_ will provide us the value of the intercept, which is the value of β0.

12:33.080 --> 12:39.450
lm.coef_ will provide the value of the coefficient, which is β1 in this case.

12:39.800 --> 12:46.900
Remember, the equation which we are generating as of now is y = β0 + β1

12:46.910 --> 12:48.090
× x1.

12:48.380 --> 12:52.760
So this is β0 and this is β1.

12:54.610 --> 13:02.210
And when we call lm.score, we get the R squared value. The score is 0.044.
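
NOTE
A small sketch tying intercept_, coef_ and score back to the equation y = β0 + β1 × x1,
assuming scikit-learn and NumPy. The data is invented to lie exactly on y = 1 + 2x, so it
says nothing about the real estate data. Note that score is a method, not an attribute.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
# Invented data lying exactly on the line y = 1 + 2x.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
lm = LinearRegression().fit(x, y)
beta_0 = lm.intercept_   # intercept
beta_1 = lm.coef_[0]     # coefficient of the single feature
r_squared = lm.score(x, y)  # R squared on the data passed in
# Rebuilding the predictions by hand from the equation.
manual = beta_0 + beta_1 * x[:, 0]
print(beta_0, beta_1, r_squared)
```
The hand-built values agree with lm.predict(x), which is all predict does for a linear model.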

13:05.100 --> 13:11.880
Now, we will plot Yhat. Yhat is nothing but the

13:12.180 --> 13:15.090
values of the data which we have predicted.

13:15.450 --> 13:21.980
So we are plotting the X values and the Yhat values, and these are the values which we have.

13:22.640 --> 13:24.280
This is the plot for the same.

13:24.810 --> 13:29.300
Now we will compare the actual values versus the predicted values.

13:29.910 --> 13:31.530
So on the basis of that,

13:33.220 --> 13:40.850
we are giving the real estate Y house price of unit area, and we are plotting another distribution

13:40.850 --> 13:43.130
plot of the Yhat values.

13:43.150 --> 13:47.440
So in the first one, we are giving the

13:48.800 --> 13:54.980
actual values; in the second one, we are giving the Yhat values. These are two different plots.

13:55.700 --> 14:01.150
This is the first plot and this is the second plot, which we are plotting in the same figure.

14:03.010 --> 14:11.010
We have also given the labels, so we have the legend, which has a label for the actual values.

14:11.290 --> 14:15.490
So you can see these are the actual values and these are the fitted values.

14:15.500 --> 14:19.600
These are the values which we have predicted.

14:20.610 --> 14:26.010
So you can see there is a lot of difference between the actual values and the fitted values.
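
NOTE
A rough sketch of drawing actual and fitted distributions in one figure. The recording does not
show which plotting call is used, so plain matplotlib step histograms stand in here as an
assumption, and both samples are randomly generated rather than taken from the model.
```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(0)
# Invented stand-ins: fitted values are far less spread out than actual values.
actual = rng.normal(38.0, 13.0, size=200)
fitted = rng.normal(38.0, 3.0, size=200)
fig, ax = plt.subplots()
# Two distributions drawn on the same axes, each with its own label.
ax.hist(actual, bins=20, density=True, histtype="step", label="Actual values")
ax.hist(fitted, bins=20, density=True, histtype="step", label="Fitted values")
ax.set_xlabel("House price of unit area")
legend = ax.legend()
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```
When the two outlines differ a lot, as with the one-column model, the fit is poor.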

14:27.460 --> 14:34.150
So the next thing which we will be doing is we will try to fit a model on the basis of all the columns

14:34.150 --> 14:34.990
which we have.

14:36.390 --> 14:37.950
So we will create

14:39.410 --> 14:47.870
a new data frame with all the required columns, that is X1 transaction date, X2, X3, X4,

14:49.110 --> 14:50.880
X5 and X6.

14:52.340 --> 14:56.180
On the basis of this, we will run lm.fit.

14:57.620 --> 15:00.190
And we will predict the values for Z.

15:01.330 --> 15:09.020
So when we predict the values for Z, we get the new Yhat, and we will draw the plot for the same.

15:09.340 --> 15:13.080
So now you can see the values are somewhat similar.

15:13.090 --> 15:14.560
The values have improved.

15:14.560 --> 15:17.450
The model has improved.

15:17.470 --> 15:17.740
Now.
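
NOTE
A hedged sketch comparing a one-column model with a multi-column model, assuming scikit-learn,
pandas and NumPy. The data is randomly generated with made-up effect sizes, and only three of
the feature columns are simulated; Z follows the naming used in the lecture.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
n = 100
# Invented feature data frame; the real Z holds columns X1 to X6.
Z = pd.DataFrame({
    "X2 house age": rng.uniform(0, 40, n),
    "X3 distance to the nearest MRT station": rng.uniform(50, 5000, n),
    "X4 number of convenience stores": rng.integers(0, 11, n),
})
# Made-up relationship plus noise, only for illustration.
Y = (45 - 0.3 * Z["X2 house age"]
     - 0.005 * Z["X3 distance to the nearest MRT station"]
     + 1.2 * Z["X4 number of convenience stores"]
     + rng.normal(0, 3, n))
single = LinearRegression().fit(Z[["X2 house age"]], Y)
multi = LinearRegression().fit(Z, Y)
r2_single = single.score(Z[["X2 house age"]], Y)
r2_multi = multi.score(Z, Y)
print("R squared, one column:", r2_single)
print("R squared, all columns:", r2_multi)
```
On the training data, adding columns never lowers R squared, which is one reason a held-out
test set gives a fairer comparison.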

15:19.450 --> 15:27.300
Now, the next thing which we will be doing is we will split the data into two different splits.

15:27.940 --> 15:35.860
You remember we discussed about the first method, which is the hold out method, where we split the

15:35.860 --> 15:37.510
data into two parts.

15:37.870 --> 15:41.650
One is the training data and another one is the testing data.

15:41.920 --> 15:48.040
So here we will import train_test_split from model_selection.

15:48.460 --> 15:53.110
The train_test_split method will allow us to split the data into two parts.

15:53.860 --> 15:55.240
So in this method

15:56.630 --> 16:03.560
we provide the Z value and the Y value. Z is the new data frame, which consists of columns X1

16:03.560 --> 16:04.640
to X6.

16:05.120 --> 16:06.440
And Y is

16:06.440 --> 16:10.220
the data frame which contains the actual Y values.

16:12.050 --> 16:21.650
And here we have the test_size, which defines what proportion of the entire data we want to have

16:21.650 --> 16:23.150
as test data.

16:23.510 --> 16:30.010
So here I'm saying I want 0.3 as my test size, which means I want 30 percent of my data

16:30.010 --> 16:31.370
to be test data.

16:31.580 --> 16:36.160
So the remaining 70 percent of the data will become my training data.

16:37.660 --> 16:45.730
I also provide the random_state, a fixed random_state value, so that each time it gives the same

16:45.730 --> 16:49.570
split; the values which are placed in each set remain the same.

16:50.200 --> 16:50.650
So.

16:51.990 --> 16:53.910
Here I am printing y_test.

16:55.120 --> 16:57.730
Here you can see the x_train values.

17:01.740 --> 17:03.420
These are the x_train values.

17:06.090 --> 17:11.940
Based on the split: when we are making the split, we are giving Z and Y, and after the split it

17:11.940 --> 17:21.000
gives four values: the X training value, the X value on which we have to train; then

17:21.000 --> 17:28.590
the X value on which we want to test; then the Y value on which we want to train; and the Y value on

17:28.590 --> 17:29.720
which we want to test.

17:29.910 --> 17:37.500
So make sure the values which you are using for training are x_train and y_train, and the values that you

17:37.500 --> 17:41.310
are using for testing are x_test and y_test.
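
NOTE
A minimal sketch of the hold-out split with train_test_split, assuming scikit-learn, pandas
and NumPy. The data is randomly generated, the column names x1 and x2 are placeholders, and
random_state=0 is an assumed value; any fixed integer makes the split repeatable.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
rng = np.random.default_rng(0)
# Invented features and target with 100 rows.
Z = pd.DataFrame({"x1": rng.uniform(0, 40, 100), "x2": rng.uniform(0, 10, 100)})
Y = 40 - 0.3 * Z["x1"] + 1.5 * Z["x2"] + rng.normal(0, 2, 100)
# 30 percent of the rows go to the test set, the remaining 70 percent to training.
x_train, x_test, y_train, y_test = train_test_split(
    Z, Y, test_size=0.3, random_state=0
)
# Train only on the training part, then evaluate only on the test part.
lm = LinearRegression().fit(x_train, y_train)
test_r2 = lm.score(x_test, y_test)
print(len(x_train), len(x_test), test_r2)
```
The return order is X train, X test, Y train, Y test, so unpacking in that order matters.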

17:44.900 --> 17:47.520
So here I am comparing the values.

17:47.580 --> 17:54.740
Here I am comparing the values of the real estate Y column with the Yhat values.

18:03.670 --> 18:10.390
They come out to be almost similar, so you can see the model has improved.

18:13.530 --> 18:15.840
Here is the list of the coefficients.

18:18.100 --> 18:19.540
So this is the intercept.

18:21.160 --> 18:27.000
And these are the coefficient values with respect to each column.

18:27.610 --> 18:32.670
So for the first column, this is the coefficient. For X2 house age,

18:33.370 --> 18:36.380
this is the coefficient. For X3 distance,

18:37.060 --> 18:40.480
this is the coefficient. For X4 number of convenience stores,

18:40.750 --> 18:43.660
this is the coefficient. For X5 latitude,

18:44.540 --> 18:47.680
this is the coefficient. For X6 longitude,

18:47.890 --> 18:49.530
this is the coefficient.
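
NOTE
A short sketch of pairing each coefficient with its column name, assuming scikit-learn and
pandas. The six rows and the two feature columns are invented for illustration.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression
# Invented sample with two of the feature columns.
Z = pd.DataFrame({
    "X2 house age": [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "X4 number of convenience stores": [9, 7, 8, 3, 4, 1],
})
Y = [50.0, 44.0, 45.0, 35.0, 36.0, 30.0]
lm = LinearRegression().fit(Z, Y)
# coef_ follows the column order of Z, so the column names can serve as labels.
coef_by_column = pd.Series(lm.coef_, index=Z.columns)
print("Intercept:", lm.intercept_)
print(coef_by_column)
```
There is one intercept for the whole model and one coefficient per feature column.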

18:52.930 --> 18:55.240
Now, these values,

18:56.300 --> 19:05.300
these values are what guide the linear regression. Now for that, we will have a look at two variants

19:05.300 --> 19:06.650
of linear regression.

19:07.430 --> 19:10.670
So let us look at those in the next session.