WEBVTT

00:01.510 --> 00:03.160
What do data scientists do?

00:04.500 --> 00:13.260
As we have discussed, data science is the field of finding out solutions to a particular problem by

00:13.380 --> 00:14.850
analyzing the data.

00:15.870 --> 00:24.230
So in this context, what happens is that an organization is looking for solutions for its problems.

00:25.080 --> 00:30.370
These solutions have to be fact-based and they have to be real-time in nature.

00:31.620 --> 00:37.920
These solutions could be something which is related to, let's say, an organization facing problems

00:37.930 --> 00:42.000
with its sales or it is facing problems with its marketing.

00:42.510 --> 00:50.100
So the organization will reach out to a data scientist, and then the data scientist will work out and execute

00:50.100 --> 00:51.380
a particular process.

00:52.410 --> 01:01.920
So the tasks which would be assigned to a data scientist would be, first of all, the identification

01:01.920 --> 01:03.870
of the data analytics problem.

01:05.070 --> 01:12.510
So although the organization has come to you with a problem, saying that there is some issue with the

01:12.510 --> 01:14.920
sales, we are not making enough profit.

01:15.060 --> 01:17.340
The marketing is not going right.

01:18.390 --> 01:26.970
We still need to analyze the data to find out if actually the issue is with the sales or the marketing

01:27.180 --> 01:29.310
or the pricing is not set right.

01:30.220 --> 01:32.500
So what is the actual issue?

01:32.620 --> 01:39.490
Is it because of some specific market fluctuation, which is constant for all of the products or all

01:39.490 --> 01:45.710
of the organizations, or is it something which is very specific to this particular organization?

01:46.210 --> 01:54.520
So all these things have to be analyzed and the data scientist will try to find out the actual underlying

01:54.520 --> 01:55.020
problem.

01:56.720 --> 02:04.790
So once the data scientist has the problem in hand, the question which he wants to solve in hand,

02:05.060 --> 02:10.440
then the next task is to determine the correct data sets and the variables.

02:11.600 --> 02:17.150
So let us say the problem is with the sales, and the sales prices are not set

02:17.150 --> 02:17.580
right.

02:17.810 --> 02:23.090
Then there would be no use in taking the customer segments into consideration.

02:23.900 --> 02:32.060
Or let us say the problem is in customer segmentation; then the prices or the sales details would

02:32.060 --> 02:33.350
not be very useful.

02:34.550 --> 02:42.710
So the data scientist has to collect all the information which is present and find out which information

02:42.710 --> 02:45.320
is actually relevant to the problem.

02:46.040 --> 02:53.330
So the data scientist collects the data and finds out the different variables, finds out what data is

02:53.330 --> 02:55.700
available for solving the problem.

02:57.300 --> 03:04.980
Once this large data is collected, this data is in a raw form. This data set could be present

03:04.980 --> 03:07.500
in a structured form or an unstructured form.

03:07.680 --> 03:16.170
A structured form would be something like a database or a comma-delimited file format or an Excel file.

03:16.440 --> 03:24.960
While unstructured data could be audio, video or some images; all of these would be unstructured

03:24.960 --> 03:26.940
data or some text.

03:27.150 --> 03:29.010
All these would be unstructured data.

03:30.090 --> 03:36.930
So all of these data sets would be collected from different sources and then this data would have to

03:36.930 --> 03:45.390
be cleaned and validated to make sure that the data is accurate, complete and uniform in nature. In

03:45.390 --> 03:52.470
case some data is missing, then the data scientist would try to collect the missing data or would try

03:52.470 --> 03:56.540
to impute values as close as possible to the missing data.

03:58.010 --> 04:05.270
If the data scientist is not able to collect that information, then they would probably drop that

04:05.270 --> 04:09.680
particular piece of information based on how much information they have.
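
NOTE
A minimal sketch of the two options just described, imputing a missing value or dropping it. The sales figures and the function names are hypothetical, and mean imputation is only one of several possible ways to fill a gap.
```python
from statistics import mean
# Hypothetical monthly sales figures; None marks a missing value.
sales = [120.0, None, 135.0, 150.0, None, 128.0]
def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]
def drop_missing(values):
    """Alternative: simply drop the entries that are missing."""
    return [v for v in values if v is not None]
imputed = impute_with_mean(sales)
cleaned = drop_missing(sales)
print(imputed)
print(cleaned)
```
Which option is reasonable depends on how much data is missing and how much remains after dropping.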

04:11.470 --> 04:18.670
After that, the data scientist will try to find out which model works best for this particular problem.

04:18.970 --> 04:23.430
So the data is fit to a particular model.

04:23.440 --> 04:26.140
An algorithm is applied on top of this data.

04:27.280 --> 04:37.210
And then the data is analyzed and different patterns and trends are found out of it. Once these patterns

04:37.210 --> 04:43.060
and trends are found, the next task is the interpretation of the data to discover solutions and opportunities.

04:44.140 --> 04:52.000
So once we have the patterns in hand, we will try to find out the different solutions, like let us

04:52.000 --> 04:58.000
say that there is some particular segment of customers which is really very interested in the product;

04:58.240 --> 05:06.850
then instead of doing marketing on all other customers, we can target that particular customer segment,

05:06.850 --> 05:13.930
which we are expecting to be highly interested and save on resources which could be used in some of

05:13.930 --> 05:14.060
the.

05:16.530 --> 05:22.950
Then communicating these findings to stakeholders using visualization and other means is the last task

05:23.220 --> 05:24.720
which a data scientist will do.

05:25.110 --> 05:31.650
So all these things would be done by a data scientist: from analyzing the problem to finding out the

05:31.650 --> 05:38.070
data and then cleaning the data and validating the data and then applying different models to find out

05:38.070 --> 05:44.780
patterns and trends and then interpreting and analyzing those patterns and trends, then trying to find

05:44.790 --> 05:51.060
out different solutions and finally communicating these solutions to the stakeholders is what a data

05:51.060 --> 05:51.960
scientist does.

06:01.480 --> 06:08.350
Now, here are a few implementations of machine learning, like Netflix. In Netflix,

06:08.380 --> 06:16.780
you would have seen the recommendation system, where Netflix suggests different movies or shows which

06:16.780 --> 06:18.130
you might be interested in.

06:18.370 --> 06:26.770
Similarly, Amazon, Google Maps, Alexa, fraud detection, then detecting different objects,

06:26.770 --> 06:34.480
speech recognition, then self-driving cars, and then finding the best matching picture and searching

06:34.480 --> 06:36.080
based on the picture.

06:36.280 --> 06:39.910
All these things are implementations of machine learning.

06:44.410 --> 06:46.850
Now, what is machine learning exactly?

06:47.590 --> 06:55.120
We have talked about the different implementations, we have talked about what a data scientist does, how they find out

06:55.120 --> 06:57.690
the problems, how they find the solution.

06:58.030 --> 07:03.400
We have discussed what data science is and why data science is required. Now,

07:03.400 --> 07:09.610
although we have all these things in mind, we still need to answer the question: what is machine

07:09.610 --> 07:10.030
learning?

07:11.570 --> 07:20.810
So although we have the problem in hand, we need to apply certain algorithms which would be helping

07:20.810 --> 07:23.720
us to find the solution to the problem.

07:24.950 --> 07:31.250
Now, for these algorithms, one thing that could have been done is that someone could have written complete,

07:31.250 --> 07:41.690
full-fledged code to consider different cases. Like, I can have different if-else conditions to decide

07:41.840 --> 07:49.100
if someone should prepare the dinner or not, or if someone should order the dinner.

07:49.790 --> 07:56.080
So this thing would have been written as a simple if-else condition saying: are you feeling hungry?

07:56.090 --> 08:02.630
If yes, then we will check if we have proper ingredients of the food at home or not.

08:02.840 --> 08:10.190
If we have food ingredients, then is the ingredient something which you would like to eat or not?

08:10.340 --> 08:16.130
If this is something that you would like to eat, then are you really feeling like cooking or not?

08:16.230 --> 08:19.880
And if you're feeling like cooking, then you would go for cooking.

08:19.880 --> 08:20.980
Otherwise you'll order.

08:20.990 --> 08:22.770
All right.

08:22.910 --> 08:24.440
So this could be written

08:24.440 --> 08:26.500
just like that, as an if-else condition also.

08:26.780 --> 08:28.970
But we don't really want to do that.
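
NOTE
The hand-written rules just described could look roughly like the sketch below. The function name, the parameter names and the "nothing" result for someone who is not hungry are assumptions made for illustration.
```python
def decide_dinner(hungry: bool, has_ingredients: bool, likes_ingredients: bool, feels_like_cooking: bool) -> str:
    """Hand-written rules following the questions asked in the lecture."""
    if not hungry:
        return "nothing"  # assumption: no dinner decision is needed
    if has_ingredients and likes_ingredients and feels_like_cooking:
        return "cook"
    return "order"
cook_case = decide_dinner(True, True, True, True)
order_case = decide_dinner(True, True, True, False)
print(cook_case, order_case)
```
Every rule here was typed by a person, which is exactly what the machine learning approach tries to avoid.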

08:30.280 --> 08:40.270
We want to provide different scenarios from the previous month's or the previous year's data, with all the conditions

08:40.360 --> 08:44.710
and all the values of are you feeling tired, are you feeling hungry?

08:45.370 --> 08:51.250
Like the data would contain different columns with values, like if someone is feeling hungry or not.

08:52.290 --> 09:00.630
Is someone feeling energetic enough to cook or not, then if someone wants to eat outside or not, if

09:00.630 --> 09:03.160
someone has ingredients at home or not?

09:03.420 --> 09:10.590
All these questions, and the values for all of these, would be provided in the data, and based

09:10.590 --> 09:16.320
on these values, the decision that person made on each and every day would be given.

09:17.400 --> 09:25.080
And we would let the machine analyze all these decisions and find out different patterns, something

09:25.080 --> 09:30.930
like: if someone is feeling energetic, and the ingredients are fine with the person and they like those

09:30.930 --> 09:33.580
ingredients, then they would go for cooking.

09:34.140 --> 09:40.710
This is one pattern which we have found. Another pattern would be like, let's say someone is not feeling

09:40.710 --> 09:41.580
energetic enough.

09:41.580 --> 09:43.530
So they would probably order the food.

09:44.130 --> 09:45.900
So these are the things.

09:45.930 --> 09:51.240
These are the rules which we don't want to give to the machine; we want the machine to find out these

09:51.240 --> 09:52.330
patterns from the data.

09:54.040 --> 10:02.530
So machine learning is all about providing this data to the machine and expecting the machine to learn

10:02.530 --> 10:06.390
from this data to make its decisions in future.

10:07.210 --> 10:14.020
So we will give it that one month of data regarding whether someone cooked the food or ordered online, and based

10:14.020 --> 10:18.370
on that data the machine would learn, find out patterns from the data

10:18.580 --> 10:20.530
and make decisions in the future.
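
NOTE
A minimal sketch of learning the cook-or-order decision from past data instead of writing the rules by hand. The history rows are invented, and the "model" is just a count of past decisions per combination of feature values, which is far simpler than a real learning algorithm.
```python
from collections import Counter, defaultdict
# Hypothetical history: (feeling_energetic, has_ingredients) and the decision made that day.
history = [
    ((True, True), "cook"),
    ((True, True), "cook"),
    ((True, False), "order"),
    ((False, True), "order"),
    ((False, True), "order"),
    ((False, False), "order"),
    ((True, True), "order"),
]
def train(rows):
    """Count the decisions seen for each combination of feature values."""
    counts = defaultdict(Counter)
    for features, decision in rows:
        counts[features][decision] += 1
    return counts
def predict(model, features, default="order"):
    """Return the most frequent past decision for these feature values."""
    if features not in model:
        return default
    return model[features].most_common(1)[0][0]
model = train(history)
energetic_with_food = predict(model, (True, True))
tired_with_food = predict(model, (False, True))
print(energetic_with_food, tired_with_food)
```
No rule about energy or ingredients was written explicitly; the pattern comes from the counts in the data.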

10:22.000 --> 10:26.020
OK, so machine learning is an application of artificial intelligence

10:27.620 --> 10:36.530
that provides systems the ability to automatically learn and improve from experience without being

10:36.530 --> 10:43.070
explicitly programmed. So without being explicitly programmed, without writing an explicit if-else

10:43.100 --> 10:43.380
block,

10:43.610 --> 10:50.150
I just want that machine to look at the data and learn different patterns from it and find out if someone

10:50.150 --> 10:52.430
would be cooking the food or ordering online.

10:54.590 --> 11:01.520
Machine learning focuses on the development of computer programs that can access the data and use it

11:01.730 --> 11:03.650
to learn for themselves.

11:08.940 --> 11:18.180
Now, where is machine learning used? The heavily hyped self-driving car, the online recommendation

11:18.180 --> 11:26.450
systems like Amazon and Netflix, knowing what the customer is saying about you on Twitter, that is,

11:26.890 --> 11:34.020
sentiment analysis. We can apply sentiment analysis on textual data which is present on Facebook

11:34.020 --> 11:41.490
or Twitter or on any other feedback forum, and find out what exactly our customers are feeling.

11:41.710 --> 11:43.830
We can find out the good reviews.

11:43.950 --> 11:46.410
We can find out about the bad reviews.

11:47.420 --> 11:53.360
From the bad reviews, you can find out what is the actual problem which the people are facing and

11:53.360 --> 11:54.630
then try to solve them.

11:56.350 --> 12:00.670
Then fraud detection, all of these are machine learning implementations.

12:04.320 --> 12:08.200
Now, the next question is, what are different types of machine learning?

12:09.420 --> 12:12.600
There are three types of machine learning problems.

12:12.960 --> 12:15.250
One is supervised learning.

12:15.960 --> 12:21.030
Second is unsupervised learning and third is reinforcement learning.

12:22.050 --> 12:26.930
Supervised learning is where we want to predict something.

12:27.600 --> 12:31.950
We want to give certain values to the machine.

12:31.960 --> 12:34.350
And we want to predict something for future.

12:34.620 --> 12:37.980
Let us say we can give some weather related data to the machine.

12:38.220 --> 12:44.520
We can provide the temperature, the season, the day of the month, all these details and the temperature

12:44.520 --> 12:45.660
of that particular day.

12:46.320 --> 12:49.610
And we can give, let's say, the past three years' data.

12:50.490 --> 12:59.940
And then based on the temperature and all these details, we want the machine to learn from this data

12:59.940 --> 13:02.450
and predict the temperature for tomorrow.

13:03.860 --> 13:12.440
Or we can give a machine several images of different animals and let it know that this one is a cat and

13:12.440 --> 13:16.550
this one is a dog and this one is, let's say, a chimpanzee.

13:16.820 --> 13:25.750
And we want the machine to recognize or predict which is which in case any unseen image is provided to it.

13:26.000 --> 13:30.600
And we want it to predict that this image is of a cat and this image is of a dog.

13:31.250 --> 13:37.490
So in case of supervised learning, we want to either predict some continuous value, which is called

13:37.490 --> 13:45.380
regression, like let's say we want to find out temperature, height, weight, prices. All of these

13:45.380 --> 13:46.810
things are continuous values.

13:47.090 --> 13:50.570
So if we want to predict continuous values, these are called

13:51.740 --> 13:58.940
regression problems within supervised learning. We are finding out continuous values

13:58.940 --> 14:09.740
in regression problems and in case we are trying to find out any classes like different categories of

14:09.740 --> 14:10.270
values.

14:10.400 --> 14:11.630
So those are called

14:13.840 --> 14:19.320
classification problems, when we are finding out if something is a cat or a dog, OK?

14:21.700 --> 14:28.840
So in case of supervised learning problems, we clearly define the output value.

14:29.990 --> 14:38.900
We provide the input values and also the outcome. Like, we provide the data of previous years

14:38.900 --> 14:47.080
or previous months: the season, what the climate was, how much humidity there was then.

14:47.090 --> 14:48.890
All those details would be provided.

14:49.100 --> 14:55.190
And along with that, the expected output would also be provided, that is, the temperature on that day would

14:55.190 --> 15:02.630
also be provided, so that the machine can find out, let us say, a kind of equation between the features,

15:02.630 --> 15:09.950
all the things like the humidity and climate and all of those things, and their relationship

15:09.950 --> 15:14.430
with the temperature, and create an equation out of it.
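
NOTE
A minimal sketch of "creating an equation" from inputs and expected outputs, assuming a single feature (humidity) and a straight-line relationship fitted by ordinary least squares. The numbers and the name fit_line are invented for illustration.
```python
# Hypothetical past observations: humidity in percent and temperature in degrees Celsius.
humidity = [30.0, 40.0, 50.0, 60.0]
temperature = [35.0, 32.0, 29.0, 26.0]
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
slope, intercept = fit_line(humidity, temperature)
# Use the learned equation on an unseen humidity value.
predicted = slope * 70.0 + intercept
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```
Because the predicted value is a continuous number, this is a regression problem in the sense used above.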

15:14.930 --> 15:21.150
And similarly, create an equation out of the decisions which we were trying to make about the classes.

15:21.350 --> 15:22.430
So all these things

15:24.120 --> 15:32.790
come under supervised learning, while unsupervised learning is simply finding out different patterns

15:32.790 --> 15:41.440
in the data and to categorize something or to basically group something or to segment something.

15:41.610 --> 15:46.340
So in case of unsupervised learning, we don't provide the output values.

15:47.280 --> 15:50.840
We only provide the input values.

15:50.910 --> 15:53.070
Let us say

15:53.280 --> 15:55.430
we are talking about a particular cat.

15:55.710 --> 16:00.090
So we would say that we are having a certain

16:01.290 --> 16:08.940
animal, which has whiskers and it has ears and it says meow.

16:10.410 --> 16:21.900
And we have another animal, which barks, and it has long hair and a long mouth. So all these details

16:21.900 --> 16:27.840
we will have, we will give, but we will not tell the machine that this one is a cat and this one

16:27.840 --> 16:28.400
is a dog.

16:29.220 --> 16:38.190
We will only provide the related features about these two classes, but we will not name those and we

16:38.190 --> 16:41.730
will not expect the machine to also recognize these two.

16:41.940 --> 16:45.950
We will only expect it to segregate these into two different groups.

16:46.890 --> 16:51.930
So that is what unsupervised learning does: it does not predict any particular value.

16:52.080 --> 17:00.360
It just groups the different classes and just segregates the items.
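
NOTE
A minimal sketch of grouping without labels, assuming one numeric feature (body weight) and a tiny two-cluster k-means started from the smallest and largest values. The weights are invented, and the input is assumed to contain at least two distinct values.
```python
# Hypothetical body weights in kg; no labels are given.
weights = [3.8, 4.1, 4.5, 20.0, 22.5, 25.0]
def kmeans_1d(values, iterations=10):
    """Tiny 2-cluster k-means on one feature, started from the min and max values."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer of the two centers.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group.
        centers = [sum(g) / len(g) for g in groups]
    return groups
light_group, heavy_group = kmeans_1d(weights)
print(light_group)
print(heavy_group)
```
The code never says which group is "cat" and which is "dog"; it only separates the items into two groups.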

17:02.190 --> 17:09.980
While the third is reinforcement learning. It is a reward-based learning, which is basically like, let's

17:10.000 --> 17:14.440
say, I have a particular robot which is trying to learn to walk.

17:15.210 --> 17:21.810
So what I would tell it is that if you're going towards the wall, if you would hit something, turn right

17:21.810 --> 17:23.830
or turn left or go back.

17:24.960 --> 17:31.170
So each time, I would tell it that this is something wrong which it has done; I would tell it

17:31.310 --> 17:33.180
that it is a negative thing.

17:34.650 --> 17:41.790
And if it is going towards the wall and it makes a left turn, I would say it is a positive

17:41.790 --> 17:42.090
thing.

17:42.270 --> 17:43.540
You did not hit the wall.

17:43.560 --> 17:45.080
So it is a reward for you.

17:45.900 --> 17:53.110
So we will provide a reward system so that the robot does not hit the wall again.

17:53.610 --> 17:58.130
So this is what reinforcement learning is, which is completely based on the reward system.
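
NOTE
A minimal sketch of reward-based learning for a robot facing a wall, assuming only two actions and a single situation. This is a bandit-style value update, a simplified relative of Q-learning; the action names, rewards, learning rate and exploration rate are all invented for illustration.
```python
import random
random.seed(0)
actions = ["go_forward", "turn_left"]
# Hitting the wall is punished, avoiding it is rewarded.
rewards = {"go_forward": -1.0, "turn_left": 1.0}
q_values = {a: 0.0 for a in actions}
alpha, epsilon = 0.5, 0.1  # learning rate and exploration rate
for _ in range(50):
    if random.random() < epsilon:
        action = random.choice(actions)  # occasionally explore
    else:
        action = max(q_values, key=q_values.get)  # otherwise act greedily
    # Move the estimate for this action towards the reward it received.
    q_values[action] += alpha * (rewards[action] - q_values[action])
best_action = max(q_values, key=q_values.get)
print(best_action)
```
The robot is never told "turn left at a wall"; it only receives positive or negative feedback after acting.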

18:03.040 --> 18:11.470
Now, here is a more descriptive view of supervised learning. So in supervised learning,

18:11.470 --> 18:18.520
there are two types: classification and regression. In case of regression, we are trying to find out, identify and

18:18.520 --> 18:25.740
predict continuous values like prices or weight or height of a person, while in classification

18:25.750 --> 18:31.590
we are just trying to classify items into different categories, say cats and dogs.

18:31.780 --> 18:41.320
So in case of supervised learning, we will provide data about different classes and we will also provide

18:41.320 --> 18:43.240
the labels, that these are cats,

18:44.570 --> 18:51.500
so that when different images are provided to the machine, it will be able to find out that this

18:51.500 --> 18:53.190
is a cat and this is not.
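
NOTE
A minimal sketch of classification with labels, assuming two made-up numeric features per animal instead of real images and using a 1-nearest-neighbour rule, which is only one of many possible classifiers.
```python
# Hypothetical labelled examples: (weight in kg, ear length in cm) and the label.
training = [
    ((4.0, 6.0), "cat"),
    ((4.5, 6.5), "cat"),
    ((20.0, 10.0), "dog"),
    ((25.0, 12.0), "dog"),
]
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
def classify(sample):
    """Return the label of the closest training example (1-nearest neighbour)."""
    features, label = min(training, key=lambda row: squared_distance(row[0], sample))
    return label
small_animal = classify((5.0, 7.0))
large_animal = classify((22.0, 11.0))
print(small_animal, large_animal)
```
The labels supplied with the training examples are what make this supervised learning.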

18:56.300 --> 19:03.080
While in case of unsupervised learning, we do not provide any labels. In case of supervised learning,

19:03.090 --> 19:09.130
we provided these labels, cat, but in case of unsupervised learning, we are not providing any labels.

19:09.140 --> 19:12.200
We are just providing different features about these animals.

19:12.410 --> 19:18.980
And we are expecting the machine to find out the features and group them into different groups.

19:20.280 --> 19:27.600
So this has two types of problems. One is clustering, where it will segregate these plus signs, these multiplication

19:27.600 --> 19:35.040
signs and these circles into different clusters, and the other is anomaly detection, where it is trying to find

19:35.040 --> 19:37.950
out some abnormal behavior out of the normal.

19:42.910 --> 19:49.810
Now, what does a machine learning engineer do, the task of the machine learning engineer is to explore

19:49.810 --> 19:52.470
the data to find out actionable insight.

19:53.140 --> 20:01.570
They model the data using different machine learning algorithms to predict the outcome and then

20:01.570 --> 20:03.820
report the outcome to the management.

20:07.380 --> 20:14.340
This is the entire cycle, where we fetch the data, then we clean the data.

20:15.700 --> 20:20.190
After that, we prepare the data and then train a model.

20:21.380 --> 20:27.890
After training the model, we evaluate the model. If the model is not good enough, we again train the

20:27.890 --> 20:28.310
model.

20:29.500 --> 20:36.310
And keep doing that again and again until we are completely satisfied with the model, and then we deploy

20:36.310 --> 20:39.610
the model to production and then monitor the

20:41.350 --> 20:43.660
outputs which are given by the model.

20:48.280 --> 20:51.320
We will be discussing this entire process.

20:51.340 --> 20:54.960
We will be discussing different machine learning algorithms in future.

20:55.180 --> 21:00.290
So it is fine if you don't really understand much because this is just the beginning.

21:00.640 --> 21:07.150
So if you even understand some very basics about what I have told you, that should be enough, because

21:07.150 --> 21:09.730
we will discuss everything in detail in future.

21:12.380 --> 21:17.900
These are the different big giants who are investing in machine learning, like Google, Amazon, Nvidia,

21:19.010 --> 21:23.420
will, Intel, Microsoft, Salesforce, IBM, Facebook.

21:27.320 --> 21:33.620
And this is the format of the traditional models and the machine learning model.

21:34.690 --> 21:43.950
So in the traditional model, we provide the data and we write a handcrafted model, which is a complete

21:43.950 --> 21:46.150
if-else, complete model

21:46.180 --> 21:52.270
that we write by hand, and then we feed it to the computer and get the results.

21:53.300 --> 22:00.290
While in case of machine learning, there are two phases. One is the learning phase, where the machine

22:00.500 --> 22:04.910
will learn from the sample data and from the expected result.

22:06.090 --> 22:13.130
And after that, after the learning, we will create a model; this learning phase will result in a model,

22:13.440 --> 22:21.630
and then this model is fed with the new data and is used to make different predictions.

22:22.350 --> 22:29.460
So first of all, we will feed the input and output data to the model, to the machine, and we will

22:29.460 --> 22:32.150
train, or we will create, a new model.

22:32.190 --> 22:39.000
We will try to learn different patterns from this particular data which we get and create a model

22:39.000 --> 22:39.660
out of it.

22:39.900 --> 22:45.840
And then we will use this model to make predictions on the unseen data.

22:46.830 --> 22:48.870
This is what machine learning is.

22:53.240 --> 23:00.050
When we talk about machine learning, when we talk about the supervised learning, to be precise, we

23:00.050 --> 23:02.030
have two types of values.

23:03.310 --> 23:12.040
One is the input value. These input values are the different X values, and these are

23:12.040 --> 23:15.590
called features, input or independent values.

23:16.270 --> 23:21.720
These are called independent values because they are given as inputs and do not depend on the value we want to predict.

23:22.660 --> 23:27.960
Like, let us say, we are trying to predict if someone will default on a loan or not.

23:29.600 --> 23:38.570
So for this particular problem, we will need the information on a person's complete salary, on the

23:38.570 --> 23:43.440
number of dependents on the person, whether the person has any other loan or not.

23:43.820 --> 23:49.400
Then the age of the person, the education of the person and all the details about the person.

23:50.480 --> 23:53.680
These details are treated as independent inputs.

23:55.820 --> 23:59.030
So these values are called X.

24:01.210 --> 24:07.870
All these values which I just quoted are known as X. These are the values from which we want to find

24:07.870 --> 24:08.930
out certain facts.

24:10.290 --> 24:15.550
And the value which we want to find out is if someone would default on a loan.

24:15.810 --> 24:17.100
Yes or no.

24:17.130 --> 24:19.870
So the answer would be either yes or no.

24:20.280 --> 24:22.680
This yes or no is called Y,

24:24.030 --> 24:28.760
the label, or the target value, or the dependent value,

24:30.170 --> 24:39.710
because this yes or no is dependent on different X values, on the values of the amount of loan that

24:39.710 --> 24:46.580
person has taken, the salary of the person or the age of the person, the number of dependents on the

24:46.580 --> 24:47.090
person.

24:47.240 --> 24:53.780
All these things, these yes or no, would be dependent on all these things which I just listed.

24:54.320 --> 25:02.560
So we are trying to find out a function of these X values which would be equivalent to these Y values.

25:03.320 --> 25:08.270
So Y is a function of X plus some error value.

25:08.930 --> 25:11.290
This error value would always be present.

25:11.660 --> 25:13.520
It is not completely reducible.

25:13.850 --> 25:16.860
So there are two types of errors which are present.

25:16.880 --> 25:23.850
One is reducible error, which can be reduced by improving the model quality, by training the model

25:23.870 --> 25:26.570
properly, by providing good quality of data.

25:26.840 --> 25:34.340
This error could be reduced and there is another error, which is present, which is called irreducible

25:34.340 --> 25:36.290
error, which would always remain.
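
NOTE
Written as equations, the relationship described above is commonly stated as follows. The symbols \hat{f} (the fitted model) and \varepsilon (the error term, assumed to have zero mean) are notation introduced here, not taken from the lecture. The second line holds for a fixed input X and a fixed fitted model.
```latex
\[ Y = f(X) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0 \]
\[ \mathbb{E}\big[(Y - \hat{f}(X))^2\big]
   = \underbrace{\big(f(X) - \hat{f}(X)\big)^2}_{\text{reducible}}
   + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}} \]
```
A better model or better data can shrink the first term, while the second term remains whatever model is used.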

25:39.090 --> 25:45.600
The traditional modeling methods used to have biased rules, while the machine learning models

25:45.750 --> 25:50.220
find a general relationship that reduces the error by finding the unknown function.

25:50.640 --> 25:56.860
So the machine learning models will try to find out this general relationship,

25:57.060 --> 26:05.670
this function of X, such that it will be able to reduce the error to the maximum extent possible.

26:09.980 --> 26:15.990
So this is the process again, so we will provide some input data from this input data.

26:16.010 --> 26:21.680
We will find out different features, different columns which are actually important, which will help

26:21.680 --> 26:25.890
us in finding out if someone will be defaulting on the loan or not.

26:26.330 --> 26:28.600
And then we will be building the model.

26:28.610 --> 26:30.100
We will be training the model.

26:30.290 --> 26:34.550
And once we have the model in hand, then we will make predictions.

26:36.370 --> 26:46.210
It is always more important to have better features, a good amount of good-quality data in hand, in comparison

26:46.210 --> 26:53.110
to a better algorithm, because the better the data that you have, the better the machine learning

26:53.110 --> 26:54.190
model would perform.

26:55.000 --> 27:02.710
If the data quality is not good, if the information which we are trying to find the patterns in is

27:02.710 --> 27:10.390
not correct, then no matter how good the machine learning algorithm is, it will not be able to reduce

27:10.390 --> 27:11.410
the errors much.

27:12.340 --> 27:20.260
So always work on feature engineering, always work on finding out the features, which is very, very,

27:20.260 --> 27:21.300
very important.