WEBVTT

00:02.040 --> 00:09.810
In this session, we will discuss the solution to the used car price prediction problem. So

00:09.810 --> 00:15.750
Craigslist, which is the source of the dataset that we have, is the world's largest collection of used vehicles

00:15.750 --> 00:16.530
for sale.

00:16.890 --> 00:24.780
And this particular dataset includes every used vehicle entry within the United States which is present

00:24.780 --> 00:25.570
on Craigslist.

00:26.370 --> 00:33.640
So initially we import the libraries, which are NumPy, pandas, seaborn, Matplotlib

00:33.640 --> 00:34.930
and scikit-learn, and turn warnings off.

00:36.240 --> 00:43.380
So as you can see, this particular dataset has the id, url, region, region_url and many other

00:43.380 --> 00:46.880
object type data columns.

00:47.130 --> 00:54.810
So we will be converting these data columns into dummy variables, and such data columns which don't really

00:54.810 --> 00:58.950
have any relevance to the price we will be dropping.

01:00.840 --> 01:06.680
So the dataset contains around 5.39 lakh (about 539,000) rows of data.

01:07.350 --> 01:15.600
And these are the columns. As you can see, the columns url, region_url,

01:17.280 --> 01:22.150
image_url: these columns have no relevance to us.

01:22.170 --> 01:31.110
So we will be removing these columns, and the other object types we will be analyzing and then seeing

01:31.110 --> 01:35.360
if we will have to convert them into dummy variables.

01:37.860 --> 01:42.520
So the first thing which we can do is have a look at the value counts.

01:42.810 --> 01:51.030
So if you see, the id column has one as the value count, which means that these are all unique values.

01:51.330 --> 01:53.670
So we will be getting rid of this one.

01:54.270 --> 02:00.270
Similarly, the url is again unique, so we can get rid of it.
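
NOTE
The uniqueness check described above can be sketched in plain Python.
This is a minimal, dependency-free illustration on made-up rows, not the
notebook's code; in pandas the same counts would come from Series.value_counts().
The helper names and the toy data are assumptions.
```python
from collections import Counter
# Hypothetical toy rows standing in for the listings table.
rows = [
    {"id": 1, "region": "denver"},
    {"id": 2, "region": "denver"},
    {"id": 3, "region": "austin"},
]
def value_counts(rows, column):
    """Count how often each value occurs in one column, most frequent first."""
    return Counter(row[column] for row in rows).most_common()
def is_all_unique(rows, column):
    """True when every value occurs exactly once, so the column is only an identifier."""
    return all(count == 1 for _, count in value_counts(rows, column))
id_is_unique = is_all_unique(rows, "id")
region_counts = value_counts(rows, "region")
```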

02:00.990 --> 02:02.500
Next comes the region.

02:02.550 --> 02:06.350
So there are a total of four hundred and three regions.

02:06.720 --> 02:15.780
So we will have to find out a few regions which we need to keep and get rid of all others.

02:16.080 --> 02:24.690
So if you see, the top five regions have a high number of values, that is around four thousand or three

02:24.690 --> 02:31.410
thousand, and the ones towards the bottom have around thirty-eight or thirty-

02:31.410 --> 02:32.910
seven car listings.

02:34.020 --> 02:36.670
So we will be selecting only the top few.

02:38.070 --> 02:44.850
Next comes the region_url. So again, we will be getting rid of these region URLs because we cannot

02:45.150 --> 02:50.480
derive much from this. Next, if we look at the price,

02:50.730 --> 02:59.010
you can see that for zero as a price, we have around forty-four thousand rows of data, which

02:59.250 --> 03:01.020
cannot really be possible.

03:01.410 --> 03:09.500
And towards the bottom, again, a lot of cars have a very low price.

03:10.530 --> 03:14.870
So there are certain prices which have only one value available.

03:15.150 --> 03:20.330
But again, these are not a problem for us because these are integer type.

03:20.580 --> 03:28.290
So the only thing which we need to worry about here is this particular dataset where we have zero as

03:28.290 --> 03:30.180
the value of the price.

03:30.420 --> 03:32.440
So that is something which is not possible.

03:32.460 --> 03:34.240
So this is something which we need to be careful about.

03:35.400 --> 03:36.950
So while you are taking care of this,

03:37.290 --> 03:47.010
what you can do is use the mean value or the median value of the particular region where it belongs.

03:47.280 --> 03:55.110
Or you can derive other methods, or find different patterns between the regions or the type

03:55.110 --> 04:01.600
of cars, and then find out the median or mean value accordingly and put that here.

04:02.820 --> 04:06.530
So that is completely up to you, how you want to use these values.

04:06.960 --> 04:11.330
So because this is a large number, you can automatically take a mean.

04:12.630 --> 04:20.400
So I would suggest you take a mean for the different regions and then fill those values into these blank

04:20.400 --> 04:20.850
values.
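
NOTE
A minimal sketch of the region-mean idea suggested above, in plain Python on
made-up rows. It assumes a zero price means "missing" and that every region
has at least one non-zero price; the function name and data are hypothetical.
With pandas, a groupby on region followed by transform("mean") would play the same role.
```python
from collections import defaultdict
from statistics import mean
rows = [
    {"region": "denver", "price": 8000},
    {"region": "denver", "price": 0},
    {"region": "denver", "price": 12000},
    {"region": "austin", "price": 5000},
    {"region": "austin", "price": 0},
]
def fill_zero_prices(rows):
    """Replace zero prices with the mean of the non-zero prices in the same region."""
    by_region = defaultdict(list)
    for row in rows:
        if row["price"] > 0:
            by_region[row["region"]].append(row["price"])
    # Assumption: each region has at least one non-zero price to average.
    region_mean = {region: mean(prices) for region, prices in by_region.items()}
    return [
        {**row, "price": region_mean[row["region"]]} if row["price"] == 0 else dict(row)
        for row in rows
    ]
filled = fill_zero_prices(rows)
```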

04:23.610 --> 04:25.210
Next, we have year.

04:26.460 --> 04:30.710
So these are the years when these cars were manufactured.

04:31.020 --> 04:36.900
So here you can see that there are a few cars which are very old.

04:37.170 --> 04:42.860
And here we have a lot of cars which are from very recent years.

04:42.870 --> 04:50.070
So what you can do is, for the cars which are very old, you can basically create a separate

04:50.070 --> 04:58.650
column, or maybe say older than 50 years or older than 40 years, and create a single category out of it

04:58.860 --> 04:59.850
instead of keeping

05:00.130 --> 05:01.060
these separate.

05:03.410 --> 05:13.160
Similarly, here we have the manufacturers of these cars, so you can see that there

05:13.160 --> 05:19.790
are a few manufacturers which have a lot of cars listed here, while some manufacturers have a very

05:19.790 --> 05:24.900
small number of cars listed, like Datsun, Harley-Davidson, Alfa Romeo.

05:25.070 --> 05:28.680
So what you can do is you can put these into an "other" category,

05:28.680 --> 05:34.100
but only if these categories don't really make a huge impact on the prices.

05:34.640 --> 05:38.480
But you can see that this actually makes an impact.

05:38.480 --> 05:45.670
Like if we have Ferrari, Aston Martin, these cars are actually having a very high price.

05:45.950 --> 05:49.220
So this will actually have a huge impact on the prices.

05:49.430 --> 05:56.540
So we cannot really convert them directly without having a good analysis of this.

05:56.660 --> 06:01.430
So we will find out and divide these into different categories.

06:01.430 --> 06:08.030
Or you can divide these manufacturers into different levels which are associated with the price.

06:08.960 --> 06:10.910
So this is another thing that you can do.
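
NOTE
One way to picture the "other" bucket discussed above, as a plain Python sketch.
Rare manufacturers are merged, while names that are known to move the price
(the transcript mentions Ferrari) are kept. The threshold, the keep set and the
sample list are assumptions for illustration.
```python
from collections import Counter
makers = ["ford", "ford", "ford", "toyota", "toyota", "ferrari", "alfa-romeo"]
def group_rare(values, min_count, keep=()):
    """Map values seen fewer than min_count times to 'other', except names in keep."""
    counts = Counter(values)
    return [v if counts[v] >= min_count or v in keep else "other" for v in values]
# 'ferrari' is rare here but kept, because a premium brand carries price signal.
grouped = group_rare(makers, min_count=2, keep={"ferrari"})
```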

06:15.870 --> 06:26.140
Next, we have the models. Now, these models are actually very unique things, and these are very specific

06:26.140 --> 06:28.350
names for a particular car.

06:28.350 --> 06:36.510
And here, if you see, we have this Silverado 1500, and this name has been split.

06:36.750 --> 06:42.740
And we have another six thousand rows with the name 1500 and another 5,000 rows with the name

06:42.750 --> 06:43.460
Silverado.

06:43.830 --> 06:50.100
So which means that these are kind of duplicated, and there is not really good information present

06:50.100 --> 06:56.030
here, because the same car has been listed in three different categories inside this.

06:56.310 --> 07:03.720
So what we can do is we can actually decide if we need to keep this model or not or if we need to convert

07:03.720 --> 07:07.890
this into a different type of category or how we need to deal with it.

07:07.920 --> 07:09.870
This is something that we will have to find.

07:14.570 --> 07:23.210
Next are the conditions. Because these are very few in number, we can take all these conditions

07:23.210 --> 07:33.050
and rate them and convert them into an ordinal variable, which looks like the best way out. Then

07:33.050 --> 07:38.160
we have the number of cylinders, which can directly be converted into a numeric value.
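
NOTE
A small sketch of the two conversions just described, in plain Python.
The rank order of the conditions and the text form "6 cylinders" are
assumptions made for illustration; the notebook's actual mapping is not shown
in the session.
```python
import re
# Assumed ordering from worst to best; the exact ranks are an illustrative choice.
CONDITION_RANK = {"salvage": 0, "fair": 1, "good": 2, "excellent": 3, "like new": 4, "new": 5}
def encode_condition(label):
    """Return the ordinal rank of a condition label, or None if it is unknown."""
    return CONDITION_RANK.get(label)
def parse_cylinders(text):
    """Extract the cylinder count from text such as '6 cylinders'; None if absent."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None
```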

07:40.070 --> 07:41.510
Then we have fuel.

07:41.510 --> 07:49.940
This can be converted into a categorical dummy variable, and then we have the odometer, which is basically

07:49.940 --> 07:51.380
an integer type.

07:51.380 --> 07:54.890
So we don't really need to think much about it.

07:57.500 --> 07:59.680
Then comes the title status.

08:00.330 --> 08:03.700
So this needs to be converted into dummy variables.

08:04.100 --> 08:07.150
Then we have the transmission again.

08:07.160 --> 08:09.530
This would be converted into dummy variables.

08:10.310 --> 08:19.670
After that, we have the VIN, which is basically the identifying detail of the car, and which seems to be largely unique,

08:19.670 --> 08:25.470
as there are around 1.81 lakh (about 181,000) different values associated with it.

08:25.640 --> 08:29.300
So the best way out would be getting rid of this particular column.

08:32.190 --> 08:38.040
Next, we have the type of drive, which can be directly converted into a dummy variable.

08:40.520 --> 08:49.040
Size can again be converted into a dummy variable, and the type of the car can again be converted into a dummy variable,

08:50.600 --> 08:51.960
then paint color.

08:51.990 --> 08:59.690
We can maybe decide the top four colors and based on the top four colors, we can see if we need to

08:59.690 --> 09:01.190
convert them into dummy variables.

09:01.190 --> 09:06.410
And we can actually get rid of the few towards the bottom because there are a lot of colors.

09:06.410 --> 09:13.820
We don't want to create a lot of columns out of just colors because that would hardly have any impact

09:13.820 --> 09:14.540
on the pricing.

09:16.250 --> 09:19.210
Next, we have image_url, which we can directly get rid of.

09:20.630 --> 09:25.620
After that, we have this huge text column, which is description.

09:26.030 --> 09:28.550
So there are two ways.

09:28.920 --> 09:35.120
First would be you can create a model by getting rid of this column entirely.

09:35.430 --> 09:44.900
And the second would be you can follow the first method and additionally create count vectors or TF-IDF

09:44.900 --> 09:51.500
vectors out of this description column and add them as additional features to your model.

09:54.090 --> 10:02.940
Next comes the state, and if we look at the state, there are a lot of values in the state, so we

10:02.940 --> 10:09.110
will have to kind of get rid of this column because it does not have much information in it.

10:09.690 --> 10:14.970
So we will have to find out if there is a specific pattern associated with prices

10:14.970 --> 10:15.730
for the state.

10:15.960 --> 10:18.390
If not, then we can directly get rid of this.

10:20.270 --> 10:25.790
Next comes the latitude and longitude, which we can directly get rid of, because that would again

10:25.790 --> 10:27.170
be similar to the state.

10:31.070 --> 10:36.830
So what we are doing here is we are dropping id, url, region_url, VIN, image_url,

10:36.830 --> 10:43.390
description, latitude, longitude, county and region, and this is something which I have done.

10:43.580 --> 10:49.940
So I also told you about what all you can do other than what I am actually implementing here so you

10:49.940 --> 10:51.380
can do it accordingly.

10:51.390 --> 10:55.030
But this is just my implementation.

10:57.880 --> 11:05.290
Next, this is the final list of columns that we have. So we have price, year, manufacturer, model,

11:05.680 --> 11:14.560
condition, cylinders, fuel, odometer, title status, transmission, drive, size, type, paint color and state.

11:17.030 --> 11:20.540
Now, let's describe this and then see the summary.

11:21.050 --> 11:29.060
So we have price, and you can see the minimum price is zero, and then we have twenty-five percent of the

11:29.060 --> 11:31.650
values which are priced around four.

11:32.150 --> 11:37.480
Then we have the year of manufacturing.

11:38.750 --> 11:41.600
After that we have the odometer values.

11:41.870 --> 11:45.020
So the first thing that we are going to be doing is handling outliers.

11:45.380 --> 11:53.000
Now, as you have already seen, the prices have a lot of outliers present because there are a lot

11:53.000 --> 11:55.340
of old cars which are also present.

11:55.910 --> 12:04.040
So what we will be doing is we will find out the outliers for the target variable

12:04.040 --> 12:04.520
itself.

12:05.270 --> 12:07.700
And the target variable here is price.

12:07.710 --> 12:08.010
Right.

12:08.030 --> 12:17.420
So we will be finding the outliers there and removing them so that the model's

12:17.420 --> 12:24.110
values can be accurate, because these outliers present in the price itself, that is the target itself,

12:24.440 --> 12:31.460
will bring in a lot of difference in the values of the mean and the standard deviation which we have in the

12:31.460 --> 12:31.940
values.

12:32.930 --> 12:36.680
So the first thing which we will be doing is removing outliers from the price.

12:37.550 --> 12:42.490
So the difference between the 75 percent value and the maximum value is very large.

12:42.710 --> 12:47.490
So we will remove the 10 percent values at the ends of the distribution.

12:47.810 --> 12:53.690
So what we are doing is we are finding the two cutoffs, that is, finding out the percentile

12:53.690 --> 12:54.350
values.

12:54.530 --> 13:03.260
And we will be dropping all the rows that lie outside these two values, which we have just found

13:03.260 --> 13:03.500
out.
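
NOTE
The percentile trimming described above can be sketched like this in plain Python.
The 10th and 90th percentile cutoffs are an assumption (the session only says
"10 percent values"), and the price list is made up. In pandas the cutoffs
would typically come from Series.quantile().
```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)
def trim_outliers(values, low_q=10, high_q=90):
    """Keep only the values between the two percentile cutoffs (inclusive)."""
    low, high = percentile(values, low_q), percentile(values, high_q)
    return [v for v in values if low <= v <= high]
# Hypothetical prices: one zero listing and one absurdly high listing.
prices = [0, 3000, 4000, 6000, 8000, 9500, 12000, 15000, 21000, 25000, 3500000]
kept = trim_outliers(prices)
```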

13:06.760 --> 13:09.950
So now we will have a look at the odometer column.

13:10.340 --> 13:14.480
So these are the sorted values of the odometer.

13:15.850 --> 13:23.020
So here you can see the first value is zero and next comes eight thousand, which means that there

13:23.020 --> 13:29.340
is a lot of difference between these values, and there are a lot of NaN (not a number) values as well.

13:36.470 --> 13:43.250
These are the different values which we have, so there are NaN values and only one zero

13:43.250 --> 13:43.650
value.

13:43.910 --> 13:49.100
So what we will be doing is we will find out whichever rows have NaN.

13:49.430 --> 13:58.310
So here we have around 80,000 values which have a NaN value, and we will create the scatter plot

13:58.310 --> 13:59.360
for the odometer.

13:59.380 --> 14:08.050
So here you can see there is one outlier value and all of the values are actually scattered here near

14:08.060 --> 14:08.540
zero.

14:14.380 --> 14:17.050
So this is the maximum value which we have.

14:19.510 --> 14:29.320
So what we are doing here is we will simply find out the values, we will drop this maximum value, that is,

14:29.320 --> 14:33.030
we will drop the particular row which has that value.

14:34.450 --> 14:38.160
So we will basically be getting rid of this particular data point,

14:38.530 --> 14:44.080
and the first data point, which has the value zero present in it.

14:47.530 --> 14:50.660
So now when we plot the scatter plot,

14:50.890 --> 14:53.590
this is how the distribution comes out to be.

14:54.760 --> 14:58.240
So regarding the values which are not

14:58.280 --> 14:58.780
null,

14:59.140 --> 15:05.780
for them, there are some around and above three hundred thousand.

15:05.800 --> 15:11.180
These values can be considered outliers because most of the values are present here.

15:12.160 --> 15:16.220
These are most of the values and these are the other values which are present here.

15:16.450 --> 15:25.360
So we are considering the values above three hundred thousand as the outlier values, and then we again

15:25.370 --> 15:26.510
create the scatter plot.

15:26.530 --> 15:28.580
So here you can see the distribution again.

15:34.090 --> 15:41.710
So this is some more analysis, which has been done on the odometer, with the same being done for the year as

15:41.710 --> 15:41.920
well.

15:42.340 --> 15:49.630
So here we have the value of year and here we have the value of price.

15:51.030 --> 15:59.560
So with the distribution of the price, you can see that there is not much relationship between year and price

15:59.560 --> 16:06.380
in the data, because it is kind of a uniformly distributed price and year distribution.

16:06.640 --> 16:14.440
So if you see here, we have the year 1940, below which the prices are actually not really

16:14.440 --> 16:18.430
much, and there are only a few data points present.

16:18.640 --> 16:20.420
So this looks like an outlier.

16:20.440 --> 16:23.650
Otherwise all the data looks uniformly distributed.

16:23.920 --> 16:31.450
So what we will be doing is we will be getting rid of the particular rows where the year of manufacturing

16:31.450 --> 16:34.540
is less than 1940.

16:38.140 --> 16:38.540
Next,

16:38.810 --> 16:43.040
what we are doing is we will handle the null values which are present here.

16:43.060 --> 16:48.790
So I am simply finding out the percentage of null values in each column.

16:49.510 --> 16:56.500
So here I'm just finding out the number of null values and I'm checking which columns they are present

16:56.500 --> 16:56.830
in.

16:57.190 --> 17:00.580
And I have simply found out the percentage.
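
NOTE
A dependency-free sketch of the null percentage report described above.
None stands in for a missing value, and the rows and column names are made up.
In pandas the usual route is DataFrame.isnull().sum() divided by the row count.
```python
rows = [
    {"size": None, "condition": "good", "fuel": "gas"},
    {"size": None, "condition": None, "fuel": "gas"},
    {"size": "compact", "condition": "fair", "fuel": "diesel"},
    {"size": None, "condition": "good", "fuel": "gas"},
]
def null_percentages(rows):
    """Return {column: percent of rows where the value is None}, largest first."""
    columns = rows[0].keys()
    percents = {c: 100 * sum(r[c] is None for r in rows) / len(rows) for c in columns}
    return dict(sorted(percents.items(), key=lambda item: item[1], reverse=True))
report = null_percentages(rows)
```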

17:00.610 --> 17:10.420
So here we have the column name, that is size, condition, cylinders, paint color, drive, type, manufacturer, model,

17:10.420 --> 17:14.890
transmission, fuel, title status, then null values, the number of null values which are present, and

17:14.890 --> 17:17.680
the percentage value.

17:18.070 --> 17:25.530
So here, if you see, the size column has around 64 percent of values which are actually null. The

17:25.540 --> 17:32.370
condition column has five percent null values, cylinders has 33 percent null values, paint color

17:32.380 --> 17:33.700
has twenty-four percent null values.

17:33.700 --> 17:40.210
The drive column has twenty-one percent, and type has 18 percent values which are null. Apart from that,

17:40.210 --> 17:44.140
all other columns have very few values which are actually null.

17:46.270 --> 17:52.960
So these are the value counts for condition, which again has a huge number of values.

17:53.770 --> 17:55.870
So these are the different value counts.

17:55.870 --> 17:59.000
You can see excellent has a very large number present.

17:59.040 --> 18:05.640
Then we have good, then we have like new, then we have fair, and we have new and salvage.

18:07.570 --> 18:15.490
Now, the missing values in the condition can be found using odometer as mileage affects the condition

18:15.490 --> 18:16.120
of the car.

18:17.080 --> 18:24.430
So what we are trying to do here is finding out the mean value of the odometer reading grouped by

18:24.430 --> 18:25.260
condition.

18:26.830 --> 18:34.000
So here we have the excellent odometer mean, the good odometer mean, the like new odometer mean, the salvage odometer

18:34.000 --> 18:35.970
mean and the fair odometer mean.

18:35.980 --> 18:38.830
These are the mean values which we have just found out.

18:39.100 --> 18:42.040
And these are the values: for like new,

18:42.040 --> 18:45.120
we have eighty-seven thousand; for excellent,

18:45.130 --> 18:48.340
we have a little higher value; for good,

18:48.340 --> 18:55.900
we have another value; then we have fair, and then we have salvage.

18:56.620 --> 18:59.920
So these are the different mean values for the odometer.

19:01.960 --> 19:10.060
Now using these values, we can actually use the odometer reading to fill in the missing values in

19:10.060 --> 19:10.670
the condition.

19:11.410 --> 19:18.770
So what we do here is we are simply checking the year, so wherever the year is greater than 2019,

19:19.390 --> 19:30.970
there, for the condition, we are basically putting new. Then wherever the value is less than

19:30.970 --> 19:34.660
or equal to the mean value of like new,

19:35.030 --> 19:45.490
we are simply putting like new. Then wherever the value is greater than the fair odometer mean,

19:45.850 --> 19:49.500
we are putting fair, since it is greater than the fair odometer mean.

19:49.510 --> 19:55.610
And this is basically a condition which you can analyze to find out what has been done here.

19:55.900 --> 19:59.530
So based on that, we have similarly rated excellent, good and salvage.
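
NOTE
A simplified sketch of filling the condition from the odometer, in plain Python.
It is not the exact threshold chain used in the session: here a missing
condition becomes "new" for very recent cars and otherwise takes the condition
whose mean odometer reading is closest. The year cutoff, the helper names and
the rows are assumptions.
```python
from collections import defaultdict
from statistics import mean
def condition_means(rows):
    """Mean odometer reading for each known condition."""
    groups = defaultdict(list)
    for row in rows:
        if row["condition"] is not None:
            groups[row["condition"]].append(row["odometer"])
    return {cond: mean(vals) for cond, vals in groups.items()}
def impute_condition(row, means, newest_year=2019):
    """Fill a missing condition: recent cars are 'new', else the closest group mean wins."""
    if row["condition"] is not None:
        return row["condition"]
    if row["year"] > newest_year:
        return "new"
    return min(means, key=lambda cond: abs(means[cond] - row["odometer"]))
rows = [
    {"year": 2017, "odometer": 30000, "condition": "like new"},
    {"year": 2012, "odometer": 90000, "condition": "good"},
    {"year": 2010, "odometer": 110000, "condition": "good"},
    {"year": 2004, "odometer": 180000, "condition": "fair"},
    {"year": 2020, "odometer": 500, "condition": None},
    {"year": 2011, "odometer": 95000, "condition": None},
]
means = condition_means(rows)
filled = [impute_condition(row, means) for row in rows]
```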

20:00.670 --> 20:06.440
Then what we are doing here is we are finding out the values again, the percentage values.

20:06.730 --> 20:11.220
So here again, you can see this is 64, which we have just seen.

20:11.470 --> 20:15.120
So we are dropping the null rows for all the columns which have less than five percent null values.

20:15.170 --> 20:17.650
So here we have this manufacturer.

20:17.920 --> 20:20.590
So everything below this comes under the five percent.

20:21.220 --> 20:28.360
So we are removing the null value rows from title status, fuel, transmission, model and manufacturer.

20:30.520 --> 20:32.260
Next, we are also dropping the columns

20:32.710 --> 20:39.100
with more than 30 percent null values, but cylinders can be an important feature, so we are not dropping

20:39.100 --> 20:40.010
the cylinders column.

20:40.480 --> 20:46.890
OK, so we have simply dropped the size column from this. Next are the remaining null values.

20:48.280 --> 20:50.620
So we are just finding the values again.

20:51.640 --> 20:54.020
So cylinders has 33 percent null values.

20:54.040 --> 21:00.090
We have paint color, drive and type with null values now, and in the other columns we don't really have any null values.

21:01.450 --> 21:09.430
So we are finding out the paint color values and we are simply filling them up using the forward fill (ffill) method.

21:09.850 --> 21:15.620
Similarly, we are finding out the values for drive and filling them using the forward fill method, and

21:15.640 --> 21:17.460
the same is applied to type as well.

21:19.990 --> 21:28.330
This leaves us with cylinders having 11 null values and drive having two null values, which we can actually

21:28.330 --> 21:30.720
get rid of because that is a very small number.
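
NOTE
Forward fill, as used above, copies the last seen value into each gap.
A minimal plain Python sketch on a made-up column follows; pandas offers the
same behavior through Series.ffill(). Note that a gap at the very start has
nothing before it and stays empty, which is one way a few nulls can remain.
```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled
# Hypothetical paint color column with gaps.
paint = [None, "white", None, None, "black", None]
filled = forward_fill(paint)
```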

21:34.390 --> 21:42.250
Now, let's explore this data a little bit. Now we're importing SciPy and importing stats from

21:42.250 --> 21:44.590
SciPy, and we're creating a plot.

21:46.690 --> 21:55.660
If you see, we have price and we have year here, so you can see price and year don't really have much

21:56.080 --> 21:57.020
association.

21:57.040 --> 22:00.120
There is actually a uniform distribution present here.

22:03.040 --> 22:07.730
So we can find out these different relationships which are present here.

22:07.750 --> 22:10.590
So here we have the odometer, and the odometer plot which comes out.

22:10.600 --> 22:15.790
Year and odometer here again would not have much relevance, and price and year would not have much

22:15.790 --> 22:17.140
relevance here.

22:17.140 --> 22:23.480
Year and price are simply distributed, but when we talk about odometer and price,

22:23.710 --> 22:33.150
basically, as the odometer value increases, the price changes as well.

22:33.940 --> 22:38.560
So as the odometer value is increasing, the price is actually

22:39.620 --> 22:40.250
decreasing.

22:44.350 --> 22:50.080
So this is the relationship which we have and these are the details which we are left with. Now, for

22:50.080 --> 22:55.610
all the object columns, we will simply be converting them into dummy columns.
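
NOTE
Dummy columns turn one categorical column into one 0/1 flag per category.
A plain Python sketch on a made-up fuel column is shown here; with pandas the
usual tool is pd.get_dummies(). The function name is hypothetical.
```python
def one_hot(values):
    """Return one dict per value with a 0/1 flag for every distinct category."""
    categories = sorted(set(values))
    return [{cat: int(value == cat) for cat in categories} for value in values]
fuel = ["gas", "diesel", "gas", "hybrid"]
encoded = one_hot(fuel)
```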

22:55.960 --> 23:04.180
So here you can see these are the different state values, which we have a lot in number and different

23:04.180 --> 23:05.560
values associated as well.

23:06.040 --> 23:09.330
So now we are comparing the condition with the price.

23:09.340 --> 23:13.430
So these are different conditions and these are different prices.

23:13.450 --> 23:19.170
So as you can see, for new the price is high and for fair the price is the least.

23:20.680 --> 23:26.690
So clearly the condition new has the highest price, as one would expect.

23:27.040 --> 23:32.410
And now we will create a categorical plot between the cylinders and the price.

23:32.710 --> 23:35.170
And this is the violin plot which we are creating.

23:36.670 --> 23:39.340
So this is the violin plot.

23:39.730 --> 23:44.740
So this shows the distribution of the number of cylinders and the price.

23:47.200 --> 23:54.680
So here you can see, for 12 cylinders, the price mean is at 10,000.

23:55.360 --> 24:00.250
Then we have eight cylinders, which has the mean at a little greater than 10,000.

24:02.890 --> 24:07.750
Next, we are comparing the fuel with price.

24:08.540 --> 24:10.130
So here

24:10.160 --> 24:13.060
you can see, if the fuel type is diesel,

24:13.780 --> 24:15.610
the price is a little higher.

24:15.620 --> 24:21.310
When the fuel type is gas, the prices are lower, and for hybrid fuel,

24:21.340 --> 24:24.860
again, the prices are a little lower.

24:26.320 --> 24:33.850
So this shows what the price range would be for the majority of each type of car, based on the fuel

24:33.850 --> 24:34.270
itself.

24:34.930 --> 24:46.460
So for gas, it is between 5 and 17K, for diesel it is from 12 to 20K, for hybrid

24:46.480 --> 24:48.900
it is from 7 to 15K.

24:50.080 --> 24:54.830
So for hybrid, it is from 7 to 15K,

24:54.830 --> 24:57.850
and for other it is 11 to 20K.

24:59.560 --> 25:01.480
And electric has its own price bracket.

25:04.590 --> 25:11.550
So next we are creating a categorical plot between the title status and the price. So for

25:11.550 --> 25:14.430
this one, you can see the price is a little higher,

25:14.460 --> 25:21.510
while when we talk about lien, the lien title status has the highest price mean.

25:22.620 --> 25:27.030
So these are the different distributions for the prices based on the title status.

25:29.530 --> 25:35.260
Next, we have the distribution between the transmission and price, which clearly shows that the prices

25:35.260 --> 25:39.030
for automatic and manual transmission are almost similar.

25:39.040 --> 25:43.750
But when we talk about the other type of transmission, it has a very high price.

25:47.020 --> 25:58.090
Next, we have four-wheel drive and the different types of drive, so there is not much difference between

25:58.090 --> 25:58.870
both of these.

25:59.230 --> 26:02.450
But the third one, this particular one, is a little different.

26:02.470 --> 26:03.810
These two are a lot alike.

26:03.830 --> 26:11.490
So next, what we are doing is plotting the type of vehicle against the price.

26:11.920 --> 26:16.660
So here you can see different types of vehicle have a lot of different prices.

26:17.020 --> 26:21.880
The price of truck and pickup is somewhat similar.

26:22.180 --> 26:26.770
When we talk about sedan and minivan, they are a lot similar.

26:27.010 --> 26:28.780
So these are the different types of distribution.

26:30.910 --> 26:36.280
So an important observation can be obtained from the above regarding the price brackets for each

26:36.280 --> 26:37.010
type of vehicle.

26:37.450 --> 26:44.280
So we can find out the price window for each vehicle type from this particular distribution.

26:46.360 --> 26:49.180
And here you see the distribution for colors.

26:49.570 --> 26:57.340
There is not much variation, but still we have a little price distribution, which we can get out of

26:57.340 --> 27:00.860
this based on the quartile values from these.

27:02.920 --> 27:07.690
Then we have a manufacturer which again gives a huge difference between these values.

27:07.960 --> 27:13.930
So we can see manufacturer is actually giving a lot of difference in the prices which we have.

27:14.680 --> 27:21.440
So next, what we are doing is creating the model. You can see the dummy variables are prepared,

27:21.490 --> 27:25.210
so we are simply converting the labels into encoded values.

27:25.870 --> 27:30.250
And next, we are simply doing a train/test split of the data.

27:33.840 --> 27:38.450
And we are applying the random forest on top of it.

27:39.300 --> 27:46.740
So when we apply the random forest, we are simply finding out the mean absolute error, then mean squared error and

27:46.740 --> 27:48.070
root mean squared error.

27:48.360 --> 27:57.540
And what we get out of this is a mean absolute error of 1610, a mean squared error of

27:57.540 --> 28:04.080
around 6.58 million, then a root mean squared error of 2566.

28:04.290 --> 28:10.680
And if you talk about the accuracy, it is actually eighty-six percent, which is a decent accuracy

28:10.980 --> 28:14.850
given the model and the simplicity of the model which we have taken here.
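
NOTE
The three error metrics quoted above can be computed by hand, as in this plain
Python sketch on made-up prices (not the model's real predictions).
RMSE is the square root of MSE, which is why an RMSE of 2566 goes together
with an MSE of roughly 6.58 million.
```python
from math import sqrt
def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length lists of numbers."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, sqrt(mse)
# Hypothetical actual and predicted prices.
actual = [10000, 15000, 8000, 20000]
predicted = [11000, 14000, 8000, 18000]
mae, mse, rmse = regression_errors(actual, predicted)
```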

28:16.860 --> 28:23.910
So this is one implementation of this model. As another implementation, what you can do is you can

28:23.910 --> 28:32.250
simply take the description column and convert it into TF-IDF vectors, as it will provide a lot

28:32.250 --> 28:38.830
more information about the car, although the information might already be present in the other columns.

28:38.850 --> 28:45.840
So that is something which you need to be careful of because then it can introduce a little bias in

28:45.840 --> 28:46.400
the data.

28:46.860 --> 28:50.010
So you can try that as well.
