WEBVTT

00:01.460 --> 00:08.210
In this session, we will begin with linear regression. Linear regression is the very first and very

00:08.210 --> 00:13.310
basic model which we will be using for making predictions.

00:14.090 --> 00:21.920
Why this is used, how it is used and how we will actually create the model is what we will be learning

00:21.920 --> 00:23.240
in this particular session.

00:24.080 --> 00:31.130
Please note that we will be going through all the things which we have done until now regarding data preparation

00:31.340 --> 00:38.700
again so that you will get a better view of how we will work on a real life problem.

00:39.500 --> 00:47.870
So even if you did not understand how the data preparation is done and how it is actually implemented, we will

00:47.870 --> 00:49.240
be doing that now.

00:50.610 --> 00:51.570
So let us begin.

00:54.180 --> 01:01.230
So till now, what I expect you to have learned is that you would know the Python basics, which we have

01:01.230 --> 01:06.410
discussed, and you would know how you can read the data.

01:07.630 --> 01:14.590
We have discussed the basics, which would help you to create some functions and methods which we will

01:14.590 --> 01:24.550
be using while preparing the model. Then, reading data will actually help us to import the dataset from

01:24.550 --> 01:26.140
various CSV files.

01:27.800 --> 01:34.700
After that, we will need to look at different properties of data. For looking at different properties

01:34.700 --> 01:35.460
of the data set,

01:35.720 --> 01:41.450
the summary functions like info and describe will be really helpful.

01:42.260 --> 01:45.440
We will have to understand about different types of features.

01:45.690 --> 01:52.760
For example, the features could be numeric type, categorical type or ordinal type, and another category

01:52.760 --> 01:54.010
being the text data.

01:54.620 --> 02:01.360
Now, out of all of these data types, we will have to convert these data types into a numeric form.

02:01.910 --> 02:10.580
Hence we will need to know how to convert categorical or ordinal data into the numeric data type via

02:10.730 --> 02:18.350
one-hot encoding, that is, dummy creation, or via converting the ordinal data into a numeric value.

02:18.680 --> 02:26.620
Or we will have to convert a particular data type, let us say a date, to the month, day and year format.

02:27.170 --> 02:34.850
Another thing which we will have to know is how we can convert the text data into a count vector

02:34.850 --> 02:36.320
or the TF-IDF vector.
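
NOTE
A minimal sketch of the two conversions described above, dummy creation (one-hot encoding)
and ordinal-to-numeric mapping, written in plain Python. The city and size values and the
helper name one_hot are hypothetical and are not taken from the course dataset.
```python
# Hypothetical sample values; not taken from the course dataset.
cities = ["Delhi", "Mumbai", "Delhi", "Pune"]
sizes = ["small", "large", "medium", "small"]
def one_hot(values):
    """Dummy creation: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]
# Ordinal data has a natural order, so each level maps to a rank.
size_rank = {"small": 0, "medium": 1, "large": 2}
city_dummies = one_hot(cities)
size_codes = [size_rank[s] for s in sizes]
print(city_dummies[0])
print(size_codes)
```
In practice a library helper such as pandas.get_dummies would usually do the dummy creation;
the plain version is only meant to show what the conversion produces.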

02:36.470 --> 02:43.700
So these are a few things which we have actually discussed while learning about Python, and so we will

02:43.700 --> 02:45.890
know how we can convert these.

02:46.100 --> 02:52.520
But why it is needed and how we will be doing it on the actual data is something that we will be learning

02:52.760 --> 03:00.170
through the projects, which we will be doing by working on each and every method and each and every

03:00.170 --> 03:01.790
algorithm which we will be learning.

03:03.480 --> 03:07.060
Next is we will have to create certain subsets of data.

03:07.470 --> 03:14.940
Then we will have to work on data cleaning. Now, although the data which we will be working with would

03:14.940 --> 03:22.080
be kind of modified and we will not have to do a lot of modifications, it will give you an insight

03:22.260 --> 03:24.610
into what the actual data will look like.

03:25.680 --> 03:29.190
Then another thing is how we will analyze the data.

03:30.180 --> 03:36.390
We can analyze the data numerically and visually, which, again, we have learned and covered in

03:36.390 --> 03:37.790
the data preprocessing module.

03:38.280 --> 03:45.570
So we will need all of these learnings to actually implement the machine learning algorithm.

03:47.050 --> 03:56.200
So in case these are not clear to you, you can go ahead and look at the videos again and try to get

03:56.410 --> 04:02.660
familiar with these topics so that you will be able to understand what we are doing more thoroughly.

04:06.700 --> 04:08.920
So let us talk about linear regression.

04:10.190 --> 04:18.330
In our introduction to Machine Learning, we have discussed the three types of machine learning problems.

04:19.220 --> 04:26.480
The first one being supervised learning, the second one being unsupervised learning, and the third

04:26.480 --> 04:29.630
one being reinforcement learning.

04:30.960 --> 04:36.930
The current modules which we will be covering and the initial algorithms which we will be learning

04:37.140 --> 04:39.890
will be a part of supervised learning.

04:41.920 --> 04:45.430
So let us discuss the supervised model first.

04:48.110 --> 04:57.320
The supervised model is the model which has the input data and output data both provided.

04:58.580 --> 05:01.550
So what will we be doing in a supervised problem?

05:02.490 --> 05:05.260
So the problem will be something like this.

05:06.780 --> 05:13.770
So here we will have the ID, age, amount, salary, dependents, sex and children.

05:14.280 --> 05:16.940
These are different X values.

05:17.400 --> 05:21.180
These are different input values which we have been given.

05:22.530 --> 05:32.760
These input values are also known as features, attributes or input values or independent values, because

05:32.910 --> 05:40.950
we expect these values to be independent of each other and not have any relationship amongst each

05:40.950 --> 05:41.160
other.

05:45.280 --> 05:53.470
We will have these columns, the age, amount and so on, which would be the input values or input features

05:53.470 --> 05:56.640
or attributes or independent columns.

05:57.490 --> 06:01.600
Now, apart from these columns, we are also given another input.

06:01.990 --> 06:03.970
That is the interest rate.

06:04.910 --> 06:10.430
Now, this is actually the output value or the target value, which we are expecting.

06:11.210 --> 06:20.030
So as an input to a supervised learning model, we provide the input values and the output value both

06:20.390 --> 06:29.240
to the model so that the model can learn the relationship between the independent columns and the dependent

06:29.240 --> 06:38.270
column, and it can formulate a linear equation or any kind of equation or connection between different

06:38.660 --> 06:43.360
columns so that it can create the relationship between these.

06:44.210 --> 06:51.020
So this relationship which we will be creating will be used to find different patterns from this particular

06:51.020 --> 06:51.430
data.

06:53.070 --> 06:58.860
So based on this, we will be applying the supervised learning algorithm.

06:59.810 --> 07:05.950
Now, when we talk about linear regression, linear regression is again a supervised learning model,

07:05.960 --> 07:13.520
and it is the simplest supervised learning model and it is trying to fit the data to a straight line

07:13.520 --> 07:18.620
and create an equation. We will see how this actually works out in some time.
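
NOTE
A minimal sketch of what fitting the data to a straight line can look like, assuming a single
feature and ordinary least squares. The toy numbers and the helper name fit_line are hypothetical;
the lecture does not specify which fitting method or library is used.
```python
# Hypothetical toy data: x is a single feature, y is the target.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
slope, intercept = fit_line(x, y)
# Predict the target for a new, unseen x value using the fitted equation.
prediction = slope * 6.0 + intercept
print(slope, intercept, prediction)
```
The toy data lies exactly on y = 2x + 1, so the fitted equation recovers that line; real data
would leave some error between the line and the points.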

07:19.130 --> 07:23.580
So let us, first of all, see what unsupervised learning is.

07:23.870 --> 07:31.400
So in case of supervised learning, we have the target value also available along with the training

07:31.400 --> 07:31.790
data.

07:33.040 --> 07:37.270
But in case of unsupervised learning, no target value is provided.

07:38.380 --> 07:43.770
In case of supervised learning, we try to find out different patterns from this data,

07:44.410 --> 07:51.610
find the equation or establish a relationship between the data so that we can predict a particular

07:51.610 --> 07:51.970
value.

07:52.180 --> 07:57.430
So here, I mean, we are supposed to predict the interest rate, which is a continuous value.

07:57.460 --> 08:00.030
We want to predict a particular value here.

08:00.430 --> 08:07.420
But in case of unsupervised learning, the focus is on grouping similar entities by finding structures

08:07.420 --> 08:08.070
in the data.

08:08.320 --> 08:14.110
We are not concerned with finding out something and predicting something for the future.

08:14.330 --> 08:19.170
We just want to group the data with similar entities together.

08:21.150 --> 08:28.860
And in case of unsupervised learning, we can also find anomalies, that is, we can use it to find something

08:28.860 --> 08:36.060
which is out of the ordinary, something which is not usual, something unusual. And another implementation

08:36.060 --> 08:38.820
of unsupervised learning is dimensionality reduction.

08:39.540 --> 08:42.890
That we will discuss in depth separately.

08:44.130 --> 08:48.210
So now let us look at different types of business problems.

08:49.190 --> 08:57.020
So there could be a business problem, such as the user might be looking for planning. There could

08:57.020 --> 09:06.620
be an organization which is looking to improve its sales, reduce the cost or increase the quality.

09:06.920 --> 09:11.810
So they might be looking at some particular aspect of the business problem.

09:12.380 --> 09:18.200
Now, based on these, they might be looking for predicting a particular value.

09:19.550 --> 09:25.930
Now, when we are predicting a particular value, we might try to predict a continuous value

09:26.180 --> 09:28.620
or a categorical value.

09:30.840 --> 09:38.490
A continuous value would be a value such as the age of a person or height of a person or weight of a

09:38.490 --> 09:41.910
person or the temperature on a particular day.

09:42.330 --> 09:46.200
So these would be the prediction of continuous values.

09:47.090 --> 09:53.840
Categorical prediction could be something like if a particular loan should be approved or not, or if

09:53.840 --> 09:59.690
a particular animal is a cat or dog or if someone is happy or unhappy.

10:00.050 --> 10:02.970
So these kinds of things are categorical predictions.

10:03.470 --> 10:06.360
So these kinds of predictions can be made.

10:06.590 --> 10:14.180
One is the continuous value and the other is a categorical value. Now, for continuous values and categorical

10:14.180 --> 10:20.070
values, when we are trying to predict these values, either continuous or categorical,

10:20.330 --> 10:22.880
these are called supervised learning problems.

10:24.600 --> 10:29.360
That is, we are trying to predict a particular value here.

10:31.170 --> 10:36.990
So these are called supervised learning problems, and in case of supervised learning problems,

10:38.210 --> 10:41.270
we have to find out a relationship.

10:42.400 --> 10:50.410
We have to find out the relationship between the output value Y and the input value X.

10:52.580 --> 11:00.020
So we have to create a function of X which would allow us to predict the Y value.

11:02.180 --> 11:12.130
And Y-hat is the value which comes out after predicting from the X. So the Y value is the expected value and

11:12.140 --> 11:15.260
Y-hat is the value which we have actually predicted.

11:15.950 --> 11:16.970
Now, what is y?

11:16.970 --> 11:18.160
And what is X?

11:20.260 --> 11:27.370
So here, when we are trying to predict the interest rate of a particular loan, then the interest rate

11:27.370 --> 11:36.340
is a continuous value and the interest rate will be the Y value, which is the output value, the label or

11:36.340 --> 11:44.260
the target value. And what is our input value

11:44.530 --> 11:49.210
or feature or attribute or independent variable?

11:50.260 --> 11:53.020
These are the X values.

11:53.200 --> 11:54.700
What are the X values here?

11:55.090 --> 12:02.450
ID, age, amount, salary, dependents, sex, children.

12:03.310 --> 12:11.860
So whatever values we are using to predict this particular value, these are called independent values

12:11.860 --> 12:14.140
or features or attributes.

12:14.950 --> 12:23.500
And the value which we are trying to predict is called the target or output value or the dependent value.
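
NOTE
A minimal sketch of separating the independent values (X) from the dependent value (y), using the
column names mentioned on the slide. The two rows and their numbers are hypothetical, and the
snake_case key interest_rate is an assumed spelling of the interest rate column.
```python
# Hypothetical rows using the column names from the slide.
rows = [
    {"id": 1, "age": 34, "amount": 5000, "salary": 42000,
     "dependents": 2, "sex": "F", "children": 1, "interest_rate": 11.5},
    {"id": 2, "age": 51, "amount": 12000, "salary": 78000,
     "dependents": 3, "sex": "M", "children": 2, "interest_rate": 9.2},
]
target = "interest_rate"
# X holds the features (independent values); y holds the target (dependent value).
X = [{k: v for k, v in row.items() if k != target} for row in rows]
y = [row[target] for row in rows]
print(list(X[0]))
print(y)
```
A supervised model is trained on X and y together, which is what distinguishes it from the
unsupervised case where no y is available.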

12:28.400 --> 12:34.850
Now, let us discuss different types of problems. So there is one problem where the bank is facing

12:35.210 --> 12:37.730
loss due to loan defaulters.

12:38.600 --> 12:48.830
So in this case, the predictors, that is, the features or attributes or the input values, are customer

12:48.830 --> 12:56.630
details, credit history and loan applications. These are different criteria based on which we want to

12:56.630 --> 13:02.190
find out if a particular person will default on the loan or not.

13:02.390 --> 13:08.930
That is, we are trying to find out a target or a label or a dependent variable.

13:08.960 --> 13:10.550
What is this dependent variable?

13:10.880 --> 13:15.700
We want to find out if a customer will default on the loan or not.

13:15.830 --> 13:18.400
And what will be the value of the target?

13:18.410 --> 13:21.960
The value of target will be yes or no.

13:22.990 --> 13:24.970
That is, zero or one.

13:26.980 --> 13:34.930
So this is called a classification problem, where we are trying to classify something into a few classes.

13:35.200 --> 13:40.330
So when we are trying to classify into classes, this is a classification problem.

13:41.610 --> 13:49.240
Now, the next problem is that flight prices keep fluctuating based on the demand and the holidays.

13:49.500 --> 13:57.480
So here, what are the predictors which will actually help us find out the target value, the

13:57.480 --> 13:59.120
features or attributes?

13:59.400 --> 14:07.500
These are the season, nearest holidays, the month, origin and destination.

14:07.680 --> 14:11.730
So these things will actually decide if the flight prices will be high or low.

14:12.600 --> 14:15.630
Now, the target value, that is, the

14:17.100 --> 14:24.480
value which we are trying to find out, the output value, is the revised price of the flight, the

14:24.510 --> 14:26.060
new flight value.

14:26.310 --> 14:28.930
So this is a continuous value because it is a price.

14:29.170 --> 14:30.780
Price is a continuous value.

14:31.080 --> 14:33.190
So this is a regression problem.

14:34.970 --> 14:41.240
So these are a few things that you need to keep in mind, although I will keep reminding you all of

14:41.240 --> 14:47.780
these things while we will be working on different problems so that it slowly, gradually feeds into

14:47.780 --> 14:48.690
your mind.

14:49.010 --> 14:52.310
But just keep these things in your mind:

14:54.700 --> 15:02.560
what classification problems are, what regression problems are and how we are deciding what the

15:02.560 --> 15:04.240
predictors are and what the target is.

15:04.720 --> 15:10.570
Now, when we are working on these problems, what happens is that when we are currently working on

15:10.570 --> 15:17.560
these problems, the values of the predictors and the target will actually be given to us. We will be given

15:17.560 --> 15:20.770
a proper dataset where we will be given

15:21.930 --> 15:29.310
the predictor values and the target values, and we will just have to perform the transformation,

15:29.670 --> 15:34.230
and after performing the transformation, we will train our model.

15:35.480 --> 15:42.920
But in a real life situation, you will actually have to decide upon different predictors, you will

15:42.920 --> 15:50.810
actually have to get the data from different sources, and while using data from these different sources,

15:51.050 --> 15:57.050
you will be understanding which data is actually important and which data is not important.

15:57.980 --> 16:04.520
So for that, actually, the data transformation and the feature selection, which we have done, will

16:04.520 --> 16:13.910
be useful, because right now the predictors which will be given to you will be a narrowed-down list of all the

16:13.910 --> 16:14.810
features available.

16:15.820 --> 16:21.790
But actually, when the organization will be reaching out to a data scientist, they will be handing

16:21.790 --> 16:23.000
over a lot of data.

16:24.010 --> 16:27.930
There would be hundreds and thousands of columns of data.

16:29.560 --> 16:37.360
Then comes the situation when you will have to analyze each and every column of data and find out if

16:37.360 --> 16:39.660
the column is actually relevant or not.

16:40.630 --> 16:47.970
Most of the columns will be having less detail or they will be highly correlated with another column,

16:48.850 --> 16:53.230
which will help you in reducing the number of columns initially.

16:54.040 --> 17:02.830
And after deciding which columns are correlated and removing columns by using the practices which

17:02.830 --> 17:11.140
we have discussed under feature selection, like using a correlation matrix or [inaudible], or by

17:11.140 --> 17:17.800
using pandas profiling and removing the columns using the suggestions provided by that, you will

17:17.800 --> 17:21.070
be able to reduce the number of columns initially.
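
NOTE
A minimal sketch of the correlation-based column reduction mentioned above, assuming Pearson
correlation and an arbitrary cut-off of 0.95. The column names, values and threshold are
hypothetical; the course's own feature selection code may differ.
```python
from itertools import combinations
from math import sqrt
# Hypothetical numeric columns; "amount_usd" nearly duplicates "amount".
columns = {
    "age": [25, 40, 31, 58, 46],
    "amount": [1000, 3000, 2000, 2500, 5000],
    "amount_usd": [12, 37, 24, 30, 61],
}
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)
threshold = 0.95  # assumed cut-off; tune per project
# Every pair of columns whose absolute correlation exceeds the threshold.
correlated_pairs = [
    (p, q) for p, q in combinations(columns, 2)
    if abs(pearson(columns[p], columns[q])) > threshold
]
print(correlated_pairs)
```
For each flagged pair, one of the two columns is a candidate for removal, since it carries
almost the same information as the other.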

17:22.280 --> 17:30.290
And after that, we will be discussing different practices which will allow you to remove a larger

17:30.290 --> 17:31.370
number of columns.

17:32.690 --> 17:38.660
So we will be learning those practices while we are learning different models so that you will also

17:38.660 --> 17:40.340
get hands-on practice with that.

17:45.760 --> 17:48.250
Now, next is.

17:50.450 --> 17:55.100
These are the words which I have been talking about very frequently.

17:57.140 --> 18:06.170
So what is the Y value? The Y value is the output value, which is the label, the target or the dependent

18:06.170 --> 18:06.480
value.

18:07.070 --> 18:14.660
It is called the dependent value because we are expecting it to be dependent on the independent values

18:14.660 --> 18:15.980
or the X values.

18:17.210 --> 18:17.960
That is,

18:18.970 --> 18:28.240
we are expecting the interest rate to be dependent on the age, amount, salary, the number of dependents,

18:28.240 --> 18:31.990
the sex of the person and the number of children a person has.

18:33.160 --> 18:40.350
So that is why this is called a dependent variable, and all of these X values are called the independent

18:40.350 --> 18:42.260
variables. I'll get back to the slide.

18:43.120 --> 18:44.890
Now, what are these X values?

18:44.890 --> 18:51.310
These X values are also called features, attributes, inputs or independent values.

18:54.270 --> 19:03.030
Next thing is, while we are going to be training our model, we will be formulating a function of X so

19:03.030 --> 19:08.370
that the function of X will be equivalent to the value of Y.

19:09.270 --> 19:17.760
But because this function will not always be an ideal scenario and a perfect function, there will

19:17.760 --> 19:22.060
always be a slight amount of error present.

19:23.340 --> 19:27.030
Now, this slight amount of error is of two types.

19:27.330 --> 19:32.580
One is the reducible error and another one is the irreducible error.

19:33.390 --> 19:42.270
The reducible error is the error, which can be reduced by improving the data quality, the feature

19:42.270 --> 19:46.530
selection, and by training the model properly.

19:48.490 --> 19:55.980
The irreducible error, however, cannot be reduced, so we will be targeting reducing the reducible error.

20:01.550 --> 20:02.150
Now.

20:03.520 --> 20:10.810
When we are talking about data science and machine learning, the target is all about the business

20:10.810 --> 20:16.830
problem solving which we will be doing. So we will be identifying the problem initially.

20:17.050 --> 20:18.610
We will be identifying

20:18.610 --> 20:24.820
what the problem is: if we need to find out the interest rate, or if we need to find out if someone will

20:24.820 --> 20:26.590
be defaulting on the loan or not.

20:26.770 --> 20:30.880
So what is the actual problem which we need to find a solution for?

20:32.160 --> 20:37.860
And once we have the problem in hand, then we will categorize the problem.

20:38.760 --> 20:48.420
We will categorize the problem in the sense of if the problem is a regression problem or it is a classification

20:48.420 --> 20:56.340
problem or it is an unsupervised problem where we don't want to

20:57.440 --> 20:59.990
find out the actual value,

21:00.000 --> 21:07.910
where we don't want to predict something, but we actually want to only classify something, or group

21:07.910 --> 21:13.040
something, to be precise, or cluster something based on the items which are present.

21:13.370 --> 21:21.350
So let us say I want to identify a group of customers, the customers which will actually be buying

21:21.350 --> 21:22.000
my product.

21:23.300 --> 21:30.800
So in that case, I don't want to classify the customers; I just want to cluster

21:30.800 --> 21:33.170
the customers into different groups.

21:33.380 --> 21:40.010
I want to group the customers in such a way that I can know that this group of customers will be

21:41.120 --> 21:44.070
more interested in, let's say, coffee.

21:44.180 --> 21:49.870
And there is another group of people who will be more interested in tea. So while doing my marketing,

21:50.030 --> 21:56.660
I will be targeting the people who are interested in coffee for my coffee product, and the people interested in tea,

21:56.660 --> 21:59.410
I will be targeting them for my tea product.

22:00.240 --> 22:06.930
So here we will be applying unsupervised learning, whereas when I'm trying to find

22:06.930 --> 22:13.050
out the weather for tomorrow or the temperature for tomorrow, then I will be using the supervised learning

22:13.050 --> 22:13.350
approach.

22:15.630 --> 22:21.000
So what if we make a mistake in predictions?

22:22.750 --> 22:30.100
Here we have this Y value and we have those X values and we are formulating a function here.

22:31.970 --> 22:35.830
So there will always be some amount of error present.

22:37.680 --> 22:42.000
And the sum of the errors is actually the cost.

22:43.070 --> 22:48.010
The sum of all the errors is called the cost of the model.

22:49.280 --> 22:56.930
So if we make any mistakes, then we will have certain cost value, which we all will want to reduce

22:57.170 --> 23:02.180
And how do we reduce the cost value? We reduce the cost value by using [inaudible].

23:03.410 --> 23:10.670
So our main target is to formulate a function and while formulating the function, we will have certain

23:10.670 --> 23:15.940
error value which we will be trying to reduce as much as possible.
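
NOTE
A minimal sketch of the cost as a sum of errors, as described above. Squaring each error before
summing is an assumption made here so that positive and negative errors do not cancel; the
lecture only says "the sum of all the errors". The numbers are hypothetical.
```python
# Hypothetical actual targets and model predictions.
y_actual = [10.0, 12.0, 9.0]
y_predicted = [11.0, 11.5, 9.0]
# Error per row: actual minus predicted.
errors = [a - p for a, p in zip(y_actual, y_predicted)]
# Assumption: squared errors, so positive and negative errors do not cancel.
cost = sum(e ** 2 for e in errors)
print(errors)
print(cost)
```
A lower cost means the function of X is closer to the actual Y values, which is the quantity
training tries to reduce.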

23:18.930 --> 23:25.460
So let's go ahead and learn about regression. We will be discussing that in the next session.