WEBVTT

00:01.660 --> 00:08.620
In this section, we will be learning model pipeline generation. So first of all, let's have a look at a

00:08.620 --> 00:12.110
normal piece of code with which we would have created a model.

00:12.550 --> 00:15.820
So here we have some textual data.

00:16.300 --> 00:22.560
And this text data is the same SMS spam collection data which we have already used.

00:23.350 --> 00:28.420
Now, we already know that this data contains the target value and the message.

00:29.880 --> 00:38.990
So how we would have implemented it earlier is this: we would have imported

00:39.000 --> 00:45.720
train_test_split, and we would have imported MultinomialNB, the TfidfVectorizer, stopwords,

00:45.720 --> 00:48.090
word_tokenize, and

00:48.090 --> 00:49.090
WordNetLemmatizer.

00:49.410 --> 00:54.410
And then we would have created a WordNetLemmatizer object called lemma.

00:56.170 --> 01:04.350
Next, we get the set of stop words for the English language, and this becomes our stop word set.

01:05.020 --> 01:11.200
Now we define a function, split_into_lemmas, which basically takes the message and converts

01:11.200 --> 01:12.900
the message to lowercase.

01:13.240 --> 01:17.920
Then it tokenizes the words of the message.

01:18.190 --> 01:22.090
And after that, we create a blank list called

01:22.420 --> 01:22.870
words_sans_stop.

01:23.880 --> 01:30.330
Now, what we do is iterate over all the words which are present in the message.

01:31.470 --> 01:39.180
And in case the word belongs to the stop words, we do not do anything, but if the word does

01:39.180 --> 01:46.500
not belong to the stop words, then we append the word to the blank list which we already created

01:46.500 --> 01:46.830
here.

01:47.980 --> 01:57.790
Now, after this particular loop has run, it iterates with lemma, which is the

01:57.790 --> 02:05.380
WordNetLemmatizer. It will lemmatize each and every word that does not belong to the

02:06.500 --> 02:07.880
stop words.

02:10.820 --> 02:17.420
So it will lemmatize all the words which we get after removing all the stop words from

02:17.420 --> 02:17.780
here.

02:18.590 --> 02:27.080
So basically, what it has done is convert the entire message into lowercase and then

02:27.470 --> 02:29.420
collect all the words

02:30.740 --> 02:38.470
in this particular list which did not belong to the stop words. So if a word did belong to the stop words,

02:38.750 --> 02:40.180
then it did not do anything.

02:40.370 --> 02:47.900
And if the word did not belong to the stop words, then it added the word to words_sans_stop, and

02:48.170 --> 02:56.200
we get words_sans_stop, which is a list of all the words with the stop words removed.

02:57.350 --> 03:03.160
Now, out of this list, it lemmatizes all of these words.

03:03.740 --> 03:10.100
So now we have retrieved the words, removing the stop words, and now we lemmatize these words, and

03:10.100 --> 03:12.890
this is what is returned by split_into_lemmas.
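
NOTE
The function described above can be sketched roughly as follows. This is a minimal, hedged illustration and not the lecture's exact code: MY_STOP, TOY_LEMMAS and toy_lemmatize are hypothetical stand-ins for NLTK's English stop words and WordNetLemmatizer, and a regular expression stands in for word_tokenize, so that the sketch runs without downloading any NLTK data.
```python
import re
# Hypothetical stand-ins for NLTK's stop word list and WordNet lemmatizer.
MY_STOP = {"the", "a", "an", "is", "are", "to", "and", "of"}
TOY_LEMMAS = {"cats": "cat", "prizes": "prize", "messages": "message"}
def toy_lemmatize(word):
    """Look the word up in the toy table; unknown words pass through unchanged."""
    return TOY_LEMMAS.get(word, word)
def split_into_lemmas(message):
    """Lowercase, tokenize, drop stop words, then lemmatize what is left."""
    message = message.lower()
    words = re.findall(r"[a-z0-9']+", message)  # crude tokenizer stand-in
    words_sans_stop = []
    for word in words:
        if word in MY_STOP:
            continue  # stop words are skipped
        words_sans_stop.append(word)
    return [toy_lemmatize(word) for word in words_sans_stop]
lemmas = split_into_lemmas("The cats are FREE to claim prizes")
print(lemmas)
```
With the real NLTK resources, only the three stand-ins would change; the flow of lowercase, tokenize, filter and lemmatize stays the same.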

03:14.020 --> 03:19.420
And then we have created a train-test split out of the data which we had.

03:20.640 --> 03:23.260
And applied the TfidfVectorizer.

03:23.700 --> 03:31.140
Now this vectorizer will analyze all the rows of data using split_into_lemmas, and

03:31.140 --> 03:38.220
it will make sure that the minimum document frequency of any word is twenty and the maximum document

03:38.220 --> 03:40.300
frequency of any word is three thousand.

03:40.620 --> 03:45.490
That is, for a word to be a part of the TF-IDF vector,

03:45.630 --> 03:54.270
it has to be a part of at least 20 documents.

03:54.750 --> 04:02.710
And it must be a part of at most three thousand documents. If it belongs to fewer than 20 documents,

04:02.730 --> 04:09.630
that means that it is a very rare word, and in case it belongs to more than three thousand documents,

04:09.630 --> 04:12.540
it means that it is a very frequently occurring word.
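
NOTE
The document frequency rule above can be illustrated in plain Python. This is only a sketch of the idea, not how TfidfVectorizer is implemented: the helper name filter_by_document_frequency and the toy documents are made up, and the thresholds are lowered from 20 and 3000 to 2 and 3 so the effect is visible on four documents. In scikit-learn, integer min_df and max_df values are likewise treated as absolute document counts.
```python
from collections import Counter
def filter_by_document_frequency(documents, min_df, max_df):
    """Keep terms that appear in at least min_df and at most max_df documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_df)
# Toy tokenized documents, invented for illustration.
docs = [["free", "prize", "now"], ["free", "call", "now"], ["free", "lunch"], ["free", "call"]]
vocabulary = filter_by_document_frequency(docs, min_df=2, max_df=3)
print(vocabulary)
```
Here "free" is dropped as too frequent (4 documents), "prize" and "lunch" are dropped as too rare (1 document each), and "call" and "now" survive.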

04:14.790 --> 04:23.600
Now we fit the TF-IDF on the training data and also transform the training data with it.

04:24.830 --> 04:32.630
And after that, we apply the classifier, or any classifier which we want, and after applying the classifier,

04:32.630 --> 04:41.070
we fit the classifier and find the probabilities from it, and then we can use it to find out

04:41.090 --> 04:44.020
the classes of the particular data.

04:45.290 --> 04:51.830
Now, this is what we used to do, but now that we are aware of pipelines, we will be creating

04:51.830 --> 04:52.620
pipelines.

04:53.180 --> 04:58.750
So how would we implement this using a pipeline? To create a pipeline,

04:59.060 --> 05:02.870
first of all, we import Pipeline from sklearn.pipeline.

05:03.950 --> 05:06.800
This Pipeline allows us to

05:08.590 --> 05:12.870
provide a list of tuples.

05:13.920 --> 05:24.690
So basically, this list contains all the different actions, that is, a couple of

05:24.690 --> 05:27.730
actions which we want the pipeline to perform.

05:28.290 --> 05:29.700
What are these actions?

05:30.000 --> 05:33.140
So the first action is the TF-IDF.

05:33.600 --> 05:35.180
This is the name of the action.

05:35.490 --> 05:39.760
And after that, we provide the declaration of the action.

05:40.080 --> 05:43.030
So here we declare what action this has to be.

05:43.470 --> 05:44.570
So what is the action?

05:44.850 --> 05:47.370
The action is the TfidfVectorizer.

05:47.730 --> 05:51.980
What does it have to do? It has to analyze using split_into_lemmas.

05:51.990 --> 05:53.730
This is the minimum document frequency.

05:53.760 --> 05:55.440
This is the maximum document frequency.

05:55.890 --> 05:58.710
Then the next action is the classifier.

05:59.070 --> 05:59.800
What is it?

06:00.300 --> 06:01.970
It is the multinomial naive

06:01.980 --> 06:02.790
Bayes classifier.

06:03.000 --> 06:12.270
So now we have specified that it has to analyze the data using the TfidfVectorizer and then use this particular

06:12.270 --> 06:12.880
classifier.

06:13.260 --> 06:14.550
So this is our pipeline.

06:14.580 --> 06:17.050
This is a complete pipeline which we have created.

06:17.280 --> 06:19.340
Now, how will we use this pipeline?

06:19.350 --> 06:26.130
We use this pipeline by simply calling pipe.fit and giving it the training data.

06:28.120 --> 06:35.560
When we call pipe.fit, it will basically run this particular TfidfVectorizer on

06:35.560 --> 06:44.410
this data, learn from it, and apply the classifier on top of it. So it will run the TF-IDF

06:44.420 --> 06:48.310
vectorizer on the data and then the classifier on it.

06:48.520 --> 06:54.190
And then it will fit this data using this classifier.

06:54.460 --> 06:56.710
And then the pipe will be ready.

06:57.100 --> 07:02.510
Now the pipe actually knows what the model is and what the model parameters are.

07:03.250 --> 07:10.780
So next time, when we have to predict the values, we simply use pipe.predict_proba,

07:10.780 --> 07:16.450
and we get the probabilities of the predictions that we want to make.
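
NOTE
A minimal sketch of the pipeline just described, under stated assumptions: the messages and targets are toy data invented here, the step names "tfidf" and "classifier" are illustrative, and the vectorizer uses its default analyzer with min_df=1 instead of the lecture's split_into_lemmas analyzer with min_df=20 and max_df=3000, because those thresholds would remove every word from four short messages.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Toy messages and labels, invented for illustration (1 = spam, 0 = ham).
messages = ["win a free prize now", "free cash call now", "are we meeting today", "see you at lunch today"]
targets = [1, 1, 0, 0]
# Each step is a (name, estimator) tuple; steps run in the order listed.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=1)),
    ("classifier", MultinomialNB()),
])
pipe.fit(messages, targets)  # fits the vectorizer, transforms, then fits the classifier
probabilities = pipe.predict_proba(["free prize today"])  # one row, one column per class
print(probabilities.shape)
```
After the single call to fit, the same pipe object can score raw text directly, since the vectorizer step is applied automatically before the classifier.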

07:17.350 --> 07:22.440
So we just have to declare the pipe.

07:22.540 --> 07:24.950
We have to tell the pipeline what its steps are.

07:25.330 --> 07:33.850
Then we have to fit the pipe once, and then we can use it a hundred million times using

07:34.660 --> 07:35.110
pipe.predict_proba.

07:37.070 --> 07:41.960
Now, this is about just one type of feature; it has just text data.

07:42.260 --> 07:49.550
Now, let us say we have a particular problem where we have numerical data and categorical data. Now

07:49.550 --> 07:51.770
that data will have to be dealt with

07:52.900 --> 07:53.450
together.

07:53.470 --> 08:01.310
So in that case, we cannot directly apply the TfidfVectorizer; that needs a lot more detailed transformation.

08:01.380 --> 08:04.120
So we need to provide that also.

08:04.330 --> 08:05.630
So how will we do that?

08:06.130 --> 08:09.330
So let us say we will be creating a feature union.

08:09.610 --> 08:13.520
For this one, we will be using the Existing Base data.

08:13.810 --> 08:19.580
So this data contains reference number, children, age band, status, occupation,

08:19.890 --> 08:21.290
and a number of other values.

08:21.640 --> 08:23.770
So it has 32 columns.

08:25.720 --> 08:32.590
Now, out of this, we want to predict, that is, we want to classify, and we want to find out the revenue grid.

08:32.590 --> 08:41.050
But now here, what will we have to do? As a simple problem, what we used to do is we would have,

08:41.620 --> 08:42.370
first of all,

08:43.440 --> 08:47.320
converted all the numeric columns into numeric form.

08:47.550 --> 08:49.950
So there is age, there is family income.

08:49.980 --> 08:52.800
These would all have to be converted into numeric form.

08:53.350 --> 08:56.510
Next, we have these categorical columns.

08:56.520 --> 09:00.370
These categorical columns will again have to be converted into categories.

09:00.660 --> 09:03.000
So these are a few things which we would have done.

09:03.210 --> 09:07.280
But now we have to implement this using a pipe. Now,

09:07.290 --> 09:08.560
how would we do that?

09:09.210 --> 09:10.290
So for that.

09:11.280 --> 09:13.390
We will get the details.

09:13.410 --> 09:15.240
So we did nunique.

09:16.800 --> 09:26.270
Then we got the data types. Now we get two more classes, that is, BaseEstimator and

09:26.760 --> 09:27.330
TransformerMixin.

09:28.310 --> 09:37.010
So what we do is create one class. This class is for selecting the particular variables.

09:37.520 --> 09:41.410
It allows us to select a particular variable type.

09:41.660 --> 09:48.470
So what are we putting into it? It takes BaseEstimator and TransformerMixin

09:48.470 --> 09:50.030
as its parent classes.

09:51.260 --> 09:59.030
Now it requires a little initialization. As initialization, it needs the variable type, that is, which

09:59.030 --> 10:00.830
variable type you are looking for.

10:00.860 --> 10:05.840
So in the data set, you could have numeric values, ordinal values, categorical values.

10:05.850 --> 10:07.550
So any type could be present.

10:08.060 --> 10:14.750
So among these variables, we want to have the numeric type or the object type.

10:15.020 --> 10:17.390
These are the only two types which we will be having.

10:18.080 --> 10:21.860
So we are giving either numeric or object.

10:22.520 --> 10:28.850
And out of these, we will have some ignore variables. What is an ignore variable? An ignore variable is a variable

10:28.850 --> 10:30.150
that you want to get rid of.

10:30.440 --> 10:36.770
So any variable which you would have wanted to drop initially, you can provide as an ignore variable.

10:36.950 --> 10:43.670
For example, reference number: I don't want to have the reference number in my dataset, so I can simply

10:43.670 --> 10:45.550
put it as an ignore variable.

10:46.340 --> 10:51.730
Next, I provide different details, like what I want to do in fit and what I want to return in

10:51.770 --> 10:52.400
transform.

10:52.610 --> 10:58.460
So in the case of transform, I want to return X with the data types in self's variable type selected,

10:58.470 --> 11:01.580
and then drop self's ignore variables.

11:01.790 --> 11:07.080
So what do we want to do? We want to select the data types, that is,

11:07.130 --> 11:09.790
we want to select the data type which we provided here.

11:10.220 --> 11:17.210
So from whatever data we are provided while running the pipe, we want to select those particular

11:17.210 --> 11:17.840
columns.

11:19.690 --> 11:26.530
And when we select all these columns, we also want to drop these ignore variable columns along axis one.

11:27.430 --> 11:28.620
this is what we want to do.

11:28.810 --> 11:30.370
So we have declared the same.
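
NOTE
A hedged sketch of the selector class described above. The names VarTypeSelector, var_type and ignore_var are assumptions reconstructed from the spoken description, and the toy data frame is invented. Two details are additions for robustness rather than something stated in the lecture: fit records n_features_in_ so that scikit-learn treats the selector as fitted, and the string column is built with dtype=object explicitly so that selecting "object" behaves the same across pandas versions.
```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
class VarTypeSelector(BaseEstimator, TransformerMixin):
    """Select columns of the requested dtypes, then drop the columns to ignore."""
    def __init__(self, var_type, ignore_var):
        self.var_type = var_type
        self.ignore_var = ignore_var
    def fit(self, X, y=None):
        self.n_features_in_ = X.shape[1]  # nothing is learned; marks the object as fitted
        return self
    def transform(self, X):
        return X.select_dtypes(self.var_type).drop(self.ignore_var, axis=1)
# Toy data, invented for illustration.
df = pd.DataFrame({
    "ref_no": [1, 2, 3],
    "age": [30, 41, 52],
    "status": pd.Series(["single", "partner", "single"], dtype=object),
})
selector = VarTypeSelector(var_type=["int64", "float64"], ignore_var=["ref_no"])
numeric = selector.fit_transform(df)
print(list(numeric.columns))
```
Note that every name in ignore_var must be among the selected columns, otherwise drop raises a KeyError; this matches the usage in the lecture, where the reference number is ignored in the numeric selection.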

11:31.390 --> 11:38.950
Next, we want to get dummies. What does this get dummies pipe do? So this get dummies pipe is a class

11:39.280 --> 11:41.830
which will help us to create dummy variables.

11:42.520 --> 11:46.340
So what do we need in this? Again, what do we want?

11:46.360 --> 11:47.530
We want the frequency cutoff.

11:48.280 --> 11:54.050
That is the frequency cutoff which we used to provide while converting a particular column into dummy columns.

11:54.850 --> 11:57.550
Then we want the variable

11:57.550 --> 11:59.080
category dictionary.

12:00.380 --> 12:04.580
That is, a dictionary of the variables and their categories.

12:05.650 --> 12:07.910
Now, next, what do we want to do?

12:08.170 --> 12:10.180
We want to fit. What is the fit?

12:10.480 --> 12:16.690
We want to get the columns, and from the columns, we want to get the value counts of the different columns,

12:16.850 --> 12:19.240
the values of the categorical variables.

12:19.660 --> 12:26.230
And based on the value counts, we want to find out if the category has a count greater than the frequency

12:26.440 --> 12:28.590
cutoff which we have provided here or not.

12:28.960 --> 12:34.180
If it is greater, then we put it in the variable category dictionary.

12:34.180 --> 12:35.170
Otherwise we won't.

12:35.740 --> 12:36.720
So what is this?

12:36.730 --> 12:38.640
This is the variable category dictionary.

12:38.650 --> 12:45.820
So this will have the values of the columns which we actually want to convert into dummy variables.

12:46.040 --> 12:46.370
Right.

12:46.630 --> 12:48.130
This is the fitting.

12:48.130 --> 12:52.350
But now, what is the transforming part? In the transforming part,

12:52.360 --> 12:55.840
we have this variable category dictionary's keys.

12:56.020 --> 12:57.670
We have the key values.

12:57.970 --> 13:05.860
Now, based on this, we get the column names from this, and from each column we get the categories, and

13:06.250 --> 13:11.550
from these categories, we basically combine the column and category and convert it into a dummy variable,

13:11.800 --> 13:12.550
a dummy column.

13:13.030 --> 13:15.550
So this is what we are doing in this get

13:15.550 --> 13:16.510
dummies pipe.
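
NOTE
A hedged sketch of the dummy-creating class described above. The names GetDummiesPipe, freq_cutoff and var_cat_dict_ are assumptions based on the spoken description, and the toy column is invented. fit records, per column, the categories whose count exceeds the cutoff; transform builds one 0/1 column per recorded column and category pair. Details the lecture does not spell out, such as whether one category per column is dropped, are left out here.
```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
class GetDummiesPipe(BaseEstimator, TransformerMixin):
    """Create 0/1 dummy columns for categories seen more often than freq_cutoff."""
    def __init__(self, freq_cutoff=0):
        self.freq_cutoff = freq_cutoff
    def fit(self, X, y=None):
        self.var_cat_dict_ = {}
        for col in X.columns:
            counts = X[col].value_counts()
            # keep only the categories that occur more often than the cutoff
            self.var_cat_dict_[col] = counts.index[counts > self.freq_cutoff].tolist()
        return self
    def transform(self, X):
        dummies = {}
        for col, categories in self.var_cat_dict_.items():
            for cat in categories:
                dummies[f"{col}_{cat}"] = (X[col] == cat).astype(int)
        return pd.DataFrame(dummies, index=X.index)
# Toy data, invented for illustration.
df = pd.DataFrame({"status": ["single", "single", "partner", "single", "widowed"]})
dummy_frame = GetDummiesPipe(freq_cutoff=1).fit_transform(df)
print(dummy_frame)
```
Because the categories are learned in fit, new data passed to transform always yields the same dummy columns, even if it contains unseen or missing categories.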

13:18.270 --> 13:25.440
Now, what do we have? This class will help us select the variable type, and this particular one will

13:25.440 --> 13:30.270
help in converting a particular column into dummy columns.

13:30.480 --> 13:36.180
Now, what do we want to do? We want to create a pipeline and create a feature union, because

13:36.180 --> 13:40.710
what we have is all the numeric columns generated here, and here

13:40.710 --> 13:42.320
we have all the object type columns.

13:43.050 --> 13:48.480
Now, we want to combine both of these, because these will be present in two different data frames as of

13:48.480 --> 13:48.690
now.

13:48.960 --> 13:53.140
So now we want to combine them, and we will combine them using FeatureUnion.

13:53.640 --> 14:01.460
So once we have imported Pipeline and FeatureUnion, we now import LogisticRegression and

14:01.620 --> 14:02.030
train_test_split.

14:02.880 --> 14:10.380
After that, we simply split the data set accordingly, that is, into the train and test values of X and y.

14:14.060 --> 14:16.550
And now we will create the pipelines.

14:17.510 --> 14:23.460
So the first pipe which we will be creating is the categorical pipeline.

14:24.680 --> 14:26.810
So what is this categorical pipeline?

14:27.170 --> 14:28.820
What does this pipeline contain?

14:30.530 --> 14:32.090
What are we doing here?

14:32.690 --> 14:35.580
We are using the variable type selector.

14:35.590 --> 14:37.310
So what is it doing?

14:37.580 --> 14:44.060
It is selecting the object type and ignoring post code and post area.

14:45.190 --> 14:54.710
Next, what are we doing? We are getting the dummies, and for dummies we are fixing the pipe's

14:54.710 --> 14:57.360
frequency cutoff value at one hundred.

14:57.700 --> 15:02.950
So based on the frequency cutoff value, we are creating the dummy variables.

15:03.850 --> 15:05.470
Now you can go back.

15:06.850 --> 15:08.350
to the get dummies pipe.

15:11.270 --> 15:19.640
So here in the get dummies pipe, you can see you have this initialization with self and the frequency

15:19.850 --> 15:25.640
cutoff value, and it has different values like the variable category dictionary and all these values.

15:25.920 --> 15:32.370
So based on this, it will basically run this particular cat pipe. This cat pipe

15:32.390 --> 15:38.850
will, first of all, run this variable type selector, get the categorical variables, then push all

15:38.850 --> 15:46.370
of these categorical variables which it has into this get dummies pipe, into this particular function.

15:46.550 --> 15:49.850
And this function will generate the dummy variables.

15:50.780 --> 15:53.260
Then we will create the next pipeline.

15:53.300 --> 16:01.120
This next pipeline takes the features; it is made up of two tuples, as you can see.

16:01.460 --> 16:05.170
This is the first tuple and this is the second tuple.

16:05.420 --> 16:08.030
What does the first tuple do? This first

16:08.030 --> 16:13.850
tuple is the feature union, so it applies the FeatureUnion function.

16:15.080 --> 16:22.640
On what is it applying the FeatureUnion function? It is for running the pipelines. So in the feature union,

16:22.640 --> 16:26.630
first of all, it will run the cat pipe.

16:28.750 --> 16:35.730
While running the cat pipe, it will again get the categorical variables: using the variable type selector,

16:35.740 --> 16:40.200
it will get the object variables and remove the post code and post area.

16:40.750 --> 16:44.230
Then it will get the num pipe.

16:44.230 --> 16:46.470
It will run this particular tuple.

16:46.480 --> 16:53.540
This tuple has the variable type selector for int64 and float64, and it will remove the reference

16:53.540 --> 16:55.570
number as the ignore variable.

16:55.960 --> 16:59.890
So now what we have generated, we have two values here.

17:00.460 --> 17:07.120
The first data frame consists of the output from this one, and the second data frame contains the output

17:07.120 --> 17:07.880
from this one.

17:08.410 --> 17:13.730
This particular one contains the output from this particular cat pipe.

17:14.020 --> 17:16.420
What does the cat pipe give? The cat pipe gives

17:16.420 --> 17:25.360
the object variables converted into dummies with a frequency cutoff of one hundred, ignoring these two columns.

17:25.960 --> 17:35.890
Similarly, the num pipe contains the int64 and float64 type columns, ignoring the reference

17:35.890 --> 17:36.640
number column.

17:37.750 --> 17:46.690
Now, once these two have run, then after that the feature union will be done. So first the inner part

17:46.690 --> 17:49.820
will be done; after that, the outer pipe will be done.

17:50.110 --> 17:57.550
So now the feature union will be done, and the result of the feature union will be passed into logistic regression,

17:57.880 --> 18:00.200
and the logistic regression will be run.

18:00.550 --> 18:05.350
Now, we can use this pipe two, which is a cumulative pipeline.

18:05.680 --> 18:09.340
This pipeline actually contains all the things which we are doing.

18:09.790 --> 18:17.500
So we fit this pipeline on the training data, and then we predict

18:17.500 --> 18:20.920
probabilities using this particular pipeline only.
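
NOTE
The combined pipeline described above might look roughly like this. It is a sketch under stated assumptions, not the lecture's code: the class and step names are reconstructed from the spoken description, the data frame and target are tiny invented stand-ins for the real 32-column data set, and the frequency cutoff is lowered from 100 to 1 to suit eight rows. The two helper classes are repeated in compact form so the block runs on its own.
```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
class VarTypeSelector(BaseEstimator, TransformerMixin):
    """Select columns of the requested dtypes, then drop the columns to ignore."""
    def __init__(self, var_type, ignore_var):
        self.var_type = var_type
        self.ignore_var = ignore_var
    def fit(self, X, y=None):
        self.n_features_in_ = X.shape[1]  # marks the selector as fitted
        return self
    def transform(self, X):
        return X.select_dtypes(self.var_type).drop(self.ignore_var, axis=1)
class GetDummiesPipe(BaseEstimator, TransformerMixin):
    """Create 0/1 dummy columns for categories seen more often than freq_cutoff."""
    def __init__(self, freq_cutoff=0):
        self.freq_cutoff = freq_cutoff
    def fit(self, X, y=None):
        self.var_cat_dict_ = {}
        for col in X.columns:
            counts = X[col].value_counts()
            self.var_cat_dict_[col] = counts.index[counts > self.freq_cutoff].tolist()
        return self
    def transform(self, X):
        cols = {f"{c}_{k}": (X[c] == k).astype(int) for c, ks in self.var_cat_dict_.items() for k in ks}
        return pd.DataFrame(cols, index=X.index)
# Toy data and target, invented for illustration.
data = pd.DataFrame({
    "ref_no": list(range(1, 9)),
    "age": [25, 32, 47, 51, 38, 29, 60, 44],
    "income": [20.0, 35.5, 50.0, 62.0, 41.0, 27.5, 80.0, 55.0],
    "status": pd.Series(["single", "partner"] * 4, dtype=object),
    "post_code": pd.Series(["A1", "B2", "C3", "D4", "E5", "F6", "G7", "H8"], dtype=object),
})
target = [0, 0, 1, 1, 0, 0, 1, 1]
# Inner pipes: one for categorical columns, one for numeric columns.
cat_pipe = Pipeline([
    ("cat_vars", VarTypeSelector(["object"], ["post_code"])),
    ("dummies", GetDummiesPipe(freq_cutoff=1)),
])
num_pipe = Pipeline([("num_vars", VarTypeSelector(["int64", "float64"], ["ref_no"]))])
# Outer pipe: the feature union stacks both outputs side by side, then the model is fitted.
pipe2 = Pipeline([
    ("features", FeatureUnion([("cat_pipe", cat_pipe), ("num_pipe", num_pipe)])),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe2.fit(data, target)
probabilities = pipe2.predict_proba(data)
print(probabilities.shape)
```
Adding a text branch, as discussed later in the lecture, would amount to one more (name, pipeline) tuple inside the FeatureUnion list, provided that branch first selects the text column.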

18:21.430 --> 18:23.680
So this is how we can use these pipelines.

18:23.950 --> 18:31.210
Now, let's say that instead of having just categorical and numerical variables, we also had some textual

18:31.210 --> 18:31.740
data.

18:31.790 --> 18:38.320
So then what we could have done is create another pipeline, just like the pipeline

18:38.320 --> 18:41.500
which we created for the textual data,

18:42.680 --> 18:48.980
that is, pipe one. Then we could have included this pipe one

18:55.480 --> 19:01.780
as a new pipe here. So along with the cat pipe, we would have needed another pipe one, like

19:01.780 --> 19:08.890
the one which we created for the text data, and while doing the feature union internally, we could

19:08.890 --> 19:10.680
have added another tuple,

19:10.900 --> 19:13.900
say, with the name pipe one,

19:17.490 --> 19:19.660
a comma, and the details of the pipe.

19:22.290 --> 19:26.290
So then what will happen is it will run the cat pipe,

19:26.310 --> 19:29.070
it will run this particular variable type selector,

19:29.080 --> 19:30.630
then it will run pipe one,

19:30.900 --> 19:35.640
do the feature union, and after that run the logistic regression.

19:35.970 --> 19:41.810
So this is one example which will cover all types of variables, all types of data

19:41.820 --> 19:43.380
which you would ever get.

19:43.890 --> 19:47.730
So this is what you can do for creating a pipeline.

19:48.330 --> 19:52.800
Now let us say that you want to run this pipeline.

19:52.800 --> 19:55.140
You want to save this particular pipeline.

19:55.380 --> 20:01.320
Now, till now what we have learned is that we run this pipe again and again, again and again,

20:01.530 --> 20:03.140
and then use this pipe.

20:03.810 --> 20:07.070
But that is not a good working option.

20:07.380 --> 20:13.860
What you will be doing is, let's say I make certain transformations in my data set. After doing the

20:13.860 --> 20:16.030
transformations, at any point of time,

20:16.350 --> 20:18.510
say you want to save your work. Then

20:18.510 --> 20:23.310
what you can do is save the data frame into a CSV file.

20:23.490 --> 20:29.130
And next time, when you want to start your work again, you don't need to run the entire piece of code

20:29.130 --> 20:29.490
again.

20:29.490 --> 20:34.620
You can simply import this CSV file, which you exported earlier.
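
NOTE
Saving intermediate work to CSV and reading it back can be sketched as follows. The data frame and the file name transformed_data.csv are invented for illustration, and a temporary directory is used so that nothing is left on disk.
```python
import os
import tempfile
import pandas as pd
# Toy transformed data, invented for illustration.
frame = pd.DataFrame({"age": [30, 41], "status": ["single", "partner"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "transformed_data.csv")  # hypothetical file name
    frame.to_csv(csv_path, index=False)  # index=False avoids writing an extra index column
    restored = pd.read_csv(csv_path)
print(restored)
```
CSV stores only values, so column dtypes are inferred again on load; anything fitted, such as a pipeline, needs a different mechanism.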

20:36.290 --> 20:43.760
Secondly, when you generate a new model or any pipeline, you can simply use this particular piece

20:43.760 --> 20:54.220
of code, which is joblib.dump, to dump the pipe, that is, to save the pipeline or the model, into a

20:54.420 --> 20:54.930
pickle file.

20:55.280 --> 20:57.470
A pickle file is a reloadable file.

20:57.710 --> 21:01.580
So what you will be doing is simply saving it into this file.

21:01.760 --> 21:07.870
And whenever you want to load a particular model, you can simply load the model by stating

21:09.050 --> 21:17.330
my model, then opening the model file, and then using joblib.load for the model. And once you load

21:17.330 --> 21:24.920
this, you can directly load this model into a pipe or into any variable, and this variable will hold your

21:25.340 --> 21:26.150
model in it.

21:27.360 --> 21:34.250
Now, once you have your model, you can simply use .predict_proba or .predict to run the

21:34.290 --> 21:35.120
model itself.
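
NOTE
A minimal sketch of saving and reloading a fitted pipeline with joblib. The toy messages, the step names and the file name my_model.pkl are assumptions made for illustration, and the file is written to a temporary directory. joblib.load accepts either a file path, as used here, or an open file object.
```python
import os
import tempfile
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Toy messages and labels, invented for illustration (1 = spam, 0 = ham).
messages = ["win a free prize now", "free cash call now", "are we meeting today", "see you at lunch today"]
targets = [1, 1, 0, 0]
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("classifier", MultinomialNB())])
pipe.fit(messages, targets)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "my_model.pkl")  # hypothetical file name
    joblib.dump(pipe, model_path)  # save the fitted pipeline
    loaded_pipe = joblib.load(model_path)  # load it back into any variable
# The loaded pipeline predicts without being fitted again.
same = bool((loaded_pipe.predict_proba(messages) == pipe.predict_proba(messages)).all())
print(same)
```
A saved pipeline should be loaded with compatible library versions and only from trusted sources, since pickle-based files can execute code on load.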

21:36.600 --> 21:43.950
So this is how you can create and run pipelines, and this is the entire idea of pipelines. So you

21:43.950 --> 21:49.740
can create different pipelines using this particular method and use them in future.