WEBVTT

00:01.200 --> 00:04.170
In this session, we will discuss statistics.

00:05.160 --> 00:09.030
Machine learning is deeply based on statistics.

00:12.500 --> 00:23.390
Statistics is the science and the art of learning from data. It is the collection, analysis and interpretation

00:23.390 --> 00:24.140
of data.

00:24.380 --> 00:29.780
It is the effective communication and presentation of results, relying on the data.

00:30.860 --> 00:39.920
So as part of statistics, we will try to understand how statistics is used to understand the data and

00:39.920 --> 00:42.860
how it could be used to support machine learning.

00:45.650 --> 00:46.100
Here.

00:46.430 --> 00:49.730
You can see that we have a complete population.

00:50.540 --> 00:54.530
The population is the entire dataset that we have.

00:55.280 --> 00:58.760
Let us say we are talking about the entire country.

00:59.860 --> 01:05.770
So the population of the country would be the complete census of the country.

01:06.040 --> 01:10.450
All the people present in the country would be a part of the population.

01:11.260 --> 01:18.760
But when we are conducting a particular study, we will not be conducting the survey on the entire population,

01:18.970 --> 01:24.310
but will rather take a small sample from the population.

01:25.420 --> 01:34.000
But while we are taking samples from the population, the sample would have to be a good representation

01:34.270 --> 01:35.620
of the population.

01:36.130 --> 01:45.220
So the sample has to be selected randomly from the population so that it can represent the viewpoint

01:45.220 --> 01:46.780
of the population correctly.

01:48.790 --> 01:55.330
We use the samples and analyze them to derive inferences

01:55.540 --> 01:55.860
about

01:57.120 --> 01:58.680
the population.

01:59.880 --> 02:00.900
This is the

02:01.920 --> 02:06.150
relationship between the population and the sample.

02:08.900 --> 02:09.170
Now.

02:09.170 --> 02:11.600
What is a population and what is a sample?

02:12.440 --> 02:16.670
A population includes all the elements from a set of data.

02:17.360 --> 02:23.930
While a sample consists of one or more observations drawn from the population.

02:26.460 --> 02:33.360
A measurable characteristic of a population, such as the mean or the standard deviation, is called a parameter.

02:34.050 --> 02:40.560
So all the measurable characteristics of a population are called parameters, while the measurable

02:40.560 --> 02:44.040
characteristics of the sample are called statistics.
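
NOTE
A minimal Python sketch of the parameter versus statistic distinction described above.
The population is made-up data (random ages) and the sample size of 200 is an arbitrary
assumption; only the standard library is used.
```python
import random
import statistics
# Hypothetical population: ages of 10,000 people (made-up data for illustration).
rng = random.Random(42)
population = [rng.randint(1, 90) for _ in range(10_000)]
# Parameters: measured on the entire population.
population_mean = statistics.mean(population)
population_std = statistics.pstdev(population)
# Statistics: measured on a random sample drawn from the population.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)
print(f"parameter: mean={population_mean:.2f}, std={population_std:.2f}")
print(f"statistic: mean={sample_mean:.2f}, std={sample_std:.2f}")
```
The two printed lines should be close but not identical, which is the point made in the cues that follow.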

02:45.490 --> 02:49.180
Reports on the population are a true representation of the opinion.

02:49.420 --> 02:56.740
So if we are trying to find out the mean of the population, then the mean of the population is a

02:56.740 --> 02:58.900
true representation of the opinion.

02:59.560 --> 03:08.320
While if we find out the mean of the sample, then it will have a margin of error and a confidence

03:08.320 --> 03:08.920
interval.

03:09.280 --> 03:13.060
The value will not be exactly the same as that of the population.

03:13.330 --> 03:22.360
There is a possibility of error and there is a confidence interval or interval in which the value of

03:22.360 --> 03:25.420
the mean of a sample can range.
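
NOTE
A hedged sketch of how a margin of error and a confidence interval could be computed for a
sample mean. It assumes made-up height data, a sample of 400, and the common normal
approximation (multiplier 1.96 for a 95% level); the lecture itself does not fix a method.
```python
import math
import random
import statistics
rng = random.Random(0)
# Hypothetical population of 50,000 heights in cm (made-up data).
population = [rng.gauss(170, 10) for _ in range(50_000)]
sample = rng.sample(population, 400)
sample_mean = statistics.mean(sample)
# Standard error of the mean, estimated from the sample itself.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
# 1.96 is the normal-approximation multiplier for a 95% confidence level.
margin_of_error = 1.96 * standard_error
interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print(f"sample mean = {sample_mean:.2f} +/- {margin_of_error:.2f}")
print(f"95% interval = ({interval[0]:.2f}, {interval[1]:.2f})")
```
A larger sample shrinks the standard error, so the interval becomes narrower.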

03:26.350 --> 03:32.590
So the mean of the sample will lie in a range around the mean of the population.

03:33.430 --> 03:39.010
So we cannot say that the mean of the sample will be exactly the same as that of the population.

03:39.250 --> 03:47.260
It will differ a little, and this is why we apply different methods to the samples so that we can

03:47.260 --> 03:50.560
find out the values of the population accurately.

03:50.830 --> 03:53.350
And this entire thing is called statistics.

03:53.530 --> 03:58.180
The entire study of these characteristics is called statistics.

04:00.490 --> 04:06.730
So while we are taking the samples, we have to apply sampling techniques.

04:07.870 --> 04:11.320
So there are two types of sampling techniques which are available.

04:11.890 --> 04:17.770
One is probability sampling, and the other is non-probability sampling.

04:18.970 --> 04:27.250
Probability sampling involves random selection, allowing you to make statistical inferences about

04:27.250 --> 04:28.210
the whole group.

04:28.930 --> 04:38.170
So in case of probability sampling, the sampling is done by random selection of elements from the population.

04:40.100 --> 04:40.580
While

04:40.790 --> 04:50.420
in the case of non-probability sampling, a non-random selection based on convenience or other criteria,

04:50.540 --> 04:54.200
allowing you to easily collect the initial data, is used.

04:55.190 --> 05:03.860
So non-probability sampling allows a convenient selection of values, while probability

05:03.860 --> 05:08.540
sampling is used for random selection.

05:08.720 --> 05:15.260
So whenever we are making a random selection it is called probability sampling and whenever the selection

05:15.260 --> 05:20.150
is based on convenience, it is called non-probability sampling.

05:22.660 --> 05:26.410
Now there are different types of probability sampling.

05:27.520 --> 05:35.770
One is simple random sampling, where we take random values from the entire set.

05:36.010 --> 05:39.610
We pick out random elements from the entire set.

05:40.180 --> 05:42.640
These are called simple random samples.
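
NOTE
A minimal sketch of simple random sampling in Python. The values 1 to 9, the sample size
of 3 and the seed are assumptions chosen only for illustration.
```python
import random
rng = random.Random(7)
population = list(range(1, 10))  # the values 1 to 9
# Random.sample draws without replacement, and every element is equally likely to be picked.
simple_random_sample = rng.sample(population, 3)
print(simple_random_sample)
```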

05:43.690 --> 05:52.810
Systematic sampling is where we systematically pick a particular sequence of values from the entire

05:52.810 --> 05:53.260
data.

05:53.440 --> 06:00.130
So here, as you can see, we have values from 1 to 9, and we are picking out all the multiples of three

06:01.270 --> 06:05.200
to get the samples from this dataset.
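
NOTE
The systematic pick described above (every third value from 1 to 9) can be sketched with a
Python slice. The fixed starting point is an assumption that matches the slide; in practice
the starting point is often chosen at random.
```python
population = list(range(1, 10))  # the values 1 to 9
step = 3
# Start at index step - 1 so that the picks are the multiples of three.
systematic_sample = population[step - 1::step]
print(systematic_sample)  # [3, 6, 9]
```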

06:06.810 --> 06:15.330
In stratified sampling, the values are first divided into clear groups, or strata.

06:15.660 --> 06:23.490
Different groups have been formed from this data, so a group of old and young men has been taken

06:23.490 --> 06:24.210
separately.

06:24.600 --> 06:31.740
Another group of women has been taken separately, and one group of children has been taken.

06:32.040 --> 06:40.830
And then from each of these groups, people are picked at random. This is what stratified

06:40.830 --> 06:48.330
sampling is. In cluster sampling, we create clusters out of the entire data.

06:48.330 --> 06:56.220
We create different clusters from the data and we just pick out one cluster from this entire data.

06:56.220 --> 07:04.680
We pick out random clusters from the entire set of clusters, and we use them as samples.
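
NOTE
A hedged sketch contrasting stratified sampling with cluster sampling. The group names,
town names, group sizes and sample sizes are all hypothetical, made up for illustration.
```python
import random
rng = random.Random(1)
# Hypothetical strata: groups that differ from one another (labels are made up).
strata = {
    "men": [f"man_{i}" for i in range(20)],
    "women": [f"woman_{i}" for i in range(20)],
    "children": [f"child_{i}" for i in range(20)],
}
# Stratified sampling: draw a random sample from every stratum.
stratified_sample = [p for members in strata.values() for p in rng.sample(members, 3)]
# Hypothetical clusters: towns, each assumed to be a small mix of the whole population.
clusters = {f"town_{t}": [f"town_{t}_person_{i}" for i in range(10)] for t in range(5)}
# Cluster sampling: pick whole clusters at random and keep all of their members.
chosen_clusters = rng.sample(sorted(clusters), 2)
cluster_sample = [p for name in chosen_clusters for p in clusters[name]]
print(len(stratified_sample), chosen_clusters, len(cluster_sample))
```
Stratified sampling takes a few members from every group, while cluster sampling takes every member from a few groups.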