WEBVTT

00:01.880 --> 00:08.690
Now, the next topic in descriptive statistics is the measures of variability,

00:09.950 --> 00:13.130
which are also called the measures of spread.

00:14.410 --> 00:20.680
A measure of spread tells us how far the values are spread out.

00:22.040 --> 00:26.810
So here, if you see we have these values ranging from 1 to 100.

00:27.350 --> 00:36.230
So here we want to show how these values are spread out as they go from 1 to 100. If

00:36.230 --> 00:44.120
most of the values lie between 1 and 12, and there are very few values between 12 and 100,

00:44.360 --> 00:49.970
then this kind of pattern is shown by the measures of variability.

00:50.810 --> 00:54.550
They show how much the values vary in between.

00:56.460 --> 00:56.880
So.

00:57.990 --> 01:06.300
We have the mean, median and mode, which can provide information about the central point of the data, but they

01:06.300 --> 01:08.930
do not tell us how the data varies.

01:10.150 --> 01:18.250
So we use the range, the interquartile range, the variance and the standard deviation as the measures of variability.

01:19.900 --> 01:24.340
While the mean, median and mode are the measures of central tendency.

01:28.250 --> 01:29.840
Now let us discuss

01:31.090 --> 01:33.310
the quartiles and the interquartile range.

01:34.520 --> 01:35.870
So let us see.

01:35.870 --> 01:37.430
We have these values.

01:38.210 --> 01:41.780
We have data ranging from 3 to 19.

01:43.160 --> 01:48.960
So these values range from 3 to 19, and there are a total of 15 values present.

01:50.100 --> 01:56.100
So quartiles are basically the data divided into four quarters.

01:57.240 --> 02:03.780
So the data from 3 to 7 is the first quarter, which is 25% of the values.

02:04.960 --> 02:09.760
Then 8 to 11 is the next 25% of the values.

02:10.860 --> 02:18.150
Then 12 to 15 is the next 25% of the values, and then 17 onwards is the last

02:19.860 --> 02:28.770
25% of the values. The values from 3 to 11 make up 50% of the values,

02:29.010 --> 02:34.800
while 3 to 17 contains 75% of the values.

02:35.340 --> 02:40.160
So each quarter represents 25% of the entire data,

02:41.280 --> 02:42.210
Which is shown here.

02:42.780 --> 02:45.810
So each quarter will have 25% of the data.

02:46.910 --> 02:53.150
And the lines which divide this data are called quartiles.

02:54.050 --> 02:56.720
So this one becomes Q1.

02:57.110 --> 03:01.430
This becomes Q2, which is also the median of these values.

03:02.030 --> 03:07.790
And the third division, between the lower 75% and the upper 25% of the data,

03:07.790 --> 03:10.370
becomes the third quartile,

03:10.610 --> 03:11.570
which is Q3.

03:13.340 --> 03:20.540
Now interquartile range is the difference between Q3 and Q1.

03:21.910 --> 03:28.300
So the interquartile range for this particular data would be 17 minus 7, which is 10.

03:30.520 --> 03:34.870
Now to find out if a value is an outlier or not.

03:35.290 --> 03:37.630
We have the simple formula present.

03:38.050 --> 03:48.850
So the lower outlier limit is calculated as Q1, which is 7, minus 1.5 times the interquartile range.

03:49.930 --> 03:53.160
And so it will be something far below

03:53.160 --> 03:53.650
Q1.

03:55.320 --> 04:05.120
And the next outlier limit, which is above the range, would be Q3 plus 1.5 times the interquartile

04:05.130 --> 04:05.520
range.

04:05.760 --> 04:10.590
So 17 plus 15 would be 32.

04:10.920 --> 04:19.920
So any value above 32 will be an outlier, and any value below minus 8 will be an outlier.
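
NOTE
The arithmetic above can be sketched in Python. The quartile values 7 and 17 are the ones quoted in this lesson; the function name iqr_fences is made up for illustration and is not part of the course material.
```python
def iqr_fences(q1, q3, k=1.5):
    """Return (iqr, lower_limit, upper_limit) for the given quartiles."""
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr
# Q1 = 7 and Q3 = 17 as quoted in the lesson.
iqr, lower, upper = iqr_fences(7, 17)
print(iqr, lower, upper)  # 10 -8.0 32.0
```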

04:21.270 --> 04:23.940
So this is what quartiles and the interquartile range are.

04:24.690 --> 04:27.150
Now, here, let us see this data.

04:27.360 --> 04:28.850
And here we have a box

04:28.860 --> 04:29.550
plot.

04:31.010 --> 04:33.500
This kind of diagram is called a box plot.

04:34.220 --> 04:36.890
So here, this again shows the same thing.

04:38.070 --> 04:39.690
So this lower whisker

04:40.170 --> 04:43.470
usually shows the lowest value.

04:45.900 --> 04:53.970
And this second line, which is the beginning of the box, shows the Q one line.

04:55.040 --> 04:59.990
Which means that this much area contains 25% of the data.

05:01.440 --> 05:07.170
The next portion is from the Q1 line to the Q2 line.

05:08.110 --> 05:14.020
This is the line for Q2, and this is also called the median.

05:14.230 --> 05:22.270
So this much area contains 50% of the data, and the area above this line is the next 50% of the

05:22.270 --> 05:27.520
data. And from this middle line to the end of the box is the

05:28.560 --> 05:31.370
range from 50% to 75% of the data.

05:33.000 --> 05:34.350
And the line above

05:34.650 --> 05:40.550
the end of the box covers the values above 75% of the data.

05:41.470 --> 05:45.760
The upper whisker shows the maximum value.

05:46.810 --> 05:49.990
In case there is any outlier present,

05:50.290 --> 05:57.550
then the upper whisker and the lower whisker will not show the minimum and maximum values, but

05:57.550 --> 05:59.740
instead they will show the

06:01.270 --> 06:02.340
outlier limit.

06:03.130 --> 06:05.590
So this line would then go only

06:05.590 --> 06:06.060
up to

06:06.580 --> 06:14.770
Q1 minus 1.5 times the IQR, and any value beyond this line, like this particular value which is depicted by this star,

06:15.040 --> 06:16.750
is the outlier value.

06:18.160 --> 06:21.160
So here the lower quartile value,

06:21.160 --> 06:27.100
Q1, is 7. This is 7, which is Q1, and Q2 is 8.5,

06:28.100 --> 06:29.420
Which is also the median.

06:30.460 --> 06:33.140
Then Q3 is nine.

06:35.120 --> 06:40.850
Then the lower whisker is Q1 minus 1.5 times the interquartile range, which is at 4.

06:42.570 --> 06:49.530
Because there is data present below that, we are showing it as an outlier.

06:50.610 --> 06:57.750
Now the upper whisker would be calculated as Q3 plus 1.5 times the interquartile range, which is 9

06:57.750 --> 06:58.320
plus 3,

06:58.420 --> 07:01.530
which is 12. So this is the maximum one.

07:02.750 --> 07:08.900
If there is no data point at 12, then the highest point less than 12 would be considered as the

07:09.140 --> 07:09.910
upper whisker.

07:10.760 --> 07:14.810
This means the two whiskers of a box plot can be uneven in length.

07:16.200 --> 07:22.680
So this whisker would either be the maximum value or Q3 plus 1.5 times the IQR.

07:23.900 --> 07:27.050
And this one would be either the minimum value or.

07:27.500 --> 07:29.860
Q1 minus 1.5 times the IQR.
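
NOTE
A minimal sketch of the whisker rule described above, assuming the common convention that each whisker stops at the most extreme data point still inside the 1.5 times IQR limits. Q1 = 7 and Q3 = 9 come from the lesson; the sample values and the name whisker_ends are hypothetical.
```python
def whisker_ends(data, q1, q3, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    # Each whisker ends at the most extreme data point that is still inside the limits.
    return min(inside), max(inside), outliers
# Hypothetical values, chosen only so that one point falls below the lower limit of 4.
sample = [2, 6, 7, 8, 8.5, 9, 9, 10, 11]
print(whisker_ends(sample, q1=7, q3=9))  # (6, 11, [2])
```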

07:32.190 --> 07:39.810
Now that we have learned about the different measures of central tendency, let us go ahead and have

07:39.810 --> 07:45.240
a look at how we can get to know about the spread of data.

07:46.200 --> 07:47.610
Or you can also see this.

07:49.110 --> 07:58.950
Now, the measures of central tendency let us know what the central values are for the data which

07:58.950 --> 07:59.340
we have.

08:00.150 --> 08:04.650
But that does not give us an insight into how the data is distributed.

08:05.550 --> 08:06.730
So that is what we will get

08:10.680 --> 08:11.430
for that.

08:11.970 --> 08:15.240
Let us first have a look at what are the values.

08:16.080 --> 08:21.300
The first one is variance, and the second one is standard deviation.

08:22.260 --> 08:33.580
The variance is calculated as one upon N multiplied by the sum of

08:34.980 --> 08:39.240
the squares of the differences between the mean

08:39.240 --> 08:39.570
and

08:42.690 --> 08:48.330
each value. That is, these are summed together and divided by N to find the variance.

08:48.960 --> 09:00.750
Now, we have taken the mean as the reference value for calculating the standard deviation, and the standard deviation

09:00.750 --> 09:04.170
is nothing but the square root of the variance.

09:05.220 --> 09:08.460
So if you look at the formula, it is just the same.

09:10.470 --> 09:19.710
It is the mean minus that particular number, or you can say the number minus the mean.

09:19.710 --> 09:23.220
You can say it either way because it is being squared.

09:23.640 --> 09:24.990
So it doesn't make a difference.

09:25.770 --> 09:29.760
And further, you're summing all the values up and dividing by N.

09:31.260 --> 09:39.690
Similarly, for the standard deviation you're doing the same thing but taking the square root of the

09:39.930 --> 09:40.830
variance here.
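
NOTE
Written out, the population formulas described here are as follows, with x_i for the individual values, \mu for their mean and N for the number of values. The symbols are assumed, since the slide itself is not part of this transcript.
```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}
```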

09:44.490 --> 09:46.200
It may seem too complicated for now.

09:46.440 --> 09:49.150
But as we go forward, you'll understand.

09:49.440 --> 09:52.910
So let us try to understand why we have chosen this formula.

09:52.920 --> 09:54.900
How did this formula get formulated?

09:55.590 --> 09:56.730
So let's understand that.

09:59.700 --> 10:01.680
So this is just an example.

10:01.860 --> 10:04.770
So let's say we are considering these numbers.

10:05.640 --> 10:11.760
These are the set of numbers which we have, and we are trying to find out how this particular data

10:11.760 --> 10:12.480
is spread out.

10:14.190 --> 10:19.590
So when we are looking at the spread of this particular data, I can clearly see that the numbers are

10:19.590 --> 10:21.240
ranging from 1 to 16.

10:21.900 --> 10:28.250
There are some numbers like 1 and 2, which are on the lower side.

10:28.260 --> 10:34.080
Then I have numbers like seven, 12, 15, 16.

10:36.540 --> 10:38.610
Now these are few numbers.

10:38.730 --> 10:43.700
So you can see that there is a slight difference between these and the mean.

10:43.710 --> 10:50.100
So if I calculate the mean of this particular list of numbers, the mean comes out to be 8.44.

10:50.790 --> 10:54.870
So you can see these are a few numbers which are actually very near to the mean.

10:55.200 --> 11:01.490
But apart from that, the other numbers are spread out across the dataset.

11:01.500 --> 11:05.550
So there are very low numbers also and very high numbers are also present.

11:06.000 --> 11:13.530
But because these numbers are spread out, that is why the mean is coming out as 8.4.

11:14.790 --> 11:20.100
Now, what should we use to find out how the data is spread?

11:21.300 --> 11:23.610
So let us look at this

11:23.610 --> 11:30.360
in Excel. We will do the calculations in Excel so that you will get an intuition of what we are doing and

11:30.360 --> 11:31.380
why we are doing it.

11:34.140 --> 11:36.540
Now, these were the numbers which we were looking at.

11:36.960 --> 11:39.770
So let us, first of all, find out the mean of these numbers.

11:39.840 --> 11:41.280
The mean will come out when

11:51.360 --> 11:53.400
I'm taking an average of these numbers.

11:54.780 --> 11:56.520
So this is the mean of these numbers.

11:57.240 --> 12:00.630
Now we need to find out the spread of these numbers.

12:00.630 --> 12:07.260
So I need to compare these numbers with a particular number to actually find out how the data is being

12:07.380 --> 12:07.920
changed.

12:08.760 --> 12:16.520
I need to have a number with which I can compare these so that I can get an idea of what is this thing?

12:17.010 --> 12:26.100
So what I'm doing is I'll be comparing these with the first number, the last number, and the mean of

12:26.100 --> 12:26.580
the numbers.

12:27.000 --> 12:29.790
And please note that these numbers are sorted in ascending order.

12:29.790 --> 12:31.280
So the first number and the last number

12:31.610 --> 12:36.860
actually have a logical meaning: the first and the last number are actually the min and

12:36.860 --> 12:43.640
max of the list of numbers which we have. Let us first find out the deviation from the first number.

12:43.640 --> 12:50.870
So the deviation from the first number will be this minus this.

12:52.750 --> 12:56.140
Now, when we are finding the deviation, we want to keep

12:59.920 --> 13:01.480
this number constant

13:04.840 --> 13:08.140
and I keep finding out the values.

13:08.470 --> 13:20.020
So here you can see this is basically 1 minus 2, 1 minus 4, 1 minus 7, 1 minus 7,

13:20.410 --> 13:25.720
1 minus 12, 1 minus 12, 1 minus 15, 1 minus 16.

13:25.960 --> 13:27.820
These are the values which you're getting here.

13:30.250 --> 13:32.830
Now, I will again take the average of these numbers.

13:33.580 --> 13:35.860
Let me just copy the formulas here.

13:36.460 --> 13:38.430
And here I get the average of these numbers.

13:38.440 --> 13:41.110
I copy this formula here as well,

13:44.080 --> 13:47.170
so that once we have the values, we get the results accordingly.

13:48.130 --> 13:52.300
Now, this is the deviation from the first number.

13:53.170 --> 13:56.830
Now, similarly, I will calculate the deviation from the last number.

13:58.030 --> 14:06.730
So the deviation will be this number minus each and every number in the list.

14:07.720 --> 14:10.360
So for that, I will have to fix this number again.

14:10.360 --> 14:11.500
So I put a dollar here.

14:13.360 --> 14:19.390
Now, here, I'm also giving you a slight insight into how we can use formulas in Excel.

14:19.390 --> 14:23.140
So you can take that as an additional point here.

14:23.440 --> 14:28.060
You will get to know how you can perform these calculations using Excel.

14:28.060 --> 14:35.500
Also, we will be learning how we can do this by hand and how we can do this using Python.

14:35.920 --> 14:37.150
So in the demo.

14:37.900 --> 14:47.410
So here again we have 16 minus 1, 16 minus 2, 16 minus 4 and so on, and 16 minus 16 at the last.

14:47.920 --> 14:50.320
So these are the values which we have here and here.

14:50.320 --> 14:56.460
You can see that the deviation from the first number is -7.44.

14:56.470 --> 14:58.290
Here we have 7.55.

14:58.600 --> 15:05.590
If you take the mod of these numbers, that is, you ignore the sign, then you can see that these values are somewhat

15:06.250 --> 15:09.220
similar, these values are near to each other as of now.

15:10.060 --> 15:15.730
So it doesn't really make a difference if we are calculating this based on the first number, last number

15:15.730 --> 15:16.180
and so on.

15:16.420 --> 15:18.910
So let's do it on the basis of the mean.

15:18.910 --> 15:31.300
So here I will take the mean, kept as constant, minus again all the numbers.

15:31.870 --> 15:35.620
So let me just keep doing it.

15:37.210 --> 15:47.360
So here you can see if I subtract these numbers from the mean, then these numbers again have not changed

15:47.380 --> 15:47.830
much.

15:48.400 --> 15:54.610
So here, with respect to the mean, the deviation from the mean is coming out to be zero.

15:55.000 --> 15:59.590
Now, again, if I take the mod of these numbers, then it might change a little bit.

16:02.560 --> 16:05.380
Now let us do the same for the mean.

16:05.380 --> 16:10.630
Now, if you see, we have subtracted the values from the mean. Now, when we are calculating the deviation from the

16:10.630 --> 16:10.990
mean,

16:11.410 --> 16:19.330
it is coming out to be zero, which is quite understandable because we are finding out the deviation

16:19.330 --> 16:20.740
from the mean of the same values.

16:20.740 --> 16:27.820
So it will be balanced out, because we are calculating in the positive direction and in the negative direction

16:27.820 --> 16:28.150
as well.

16:28.930 --> 16:36.040
So what we can do is, instead of taking these values as they are, we can take the square of these values so that

16:36.610 --> 16:40.270
the impact of the sign is not considered.

16:40.600 --> 16:48.270
So we take the square of these values, so it will be equal to this multiplied by itself.

16:49.490 --> 16:54.010
So these are the values, and we will do the same for all the numbers.

16:55.000 --> 17:03.070
Now, because we have taken the square of these deviations from the mean, what happens is the value

17:03.070 --> 17:04.040
is increased

17:04.040 --> 17:04.600
in scale.

17:04.960 --> 17:13.090
Now we want to bring it back to the normal scale so that we can compare the deviation with the values

17:13.090 --> 17:14.260
which we have here.

17:14.980 --> 17:15.910
So let's do that.

17:15.970 --> 17:17.560
So we'll do a square root

17:20.380 --> 17:21.820
of this particular number.

17:23.290 --> 17:27.650
So it comes out to be 5.2305.
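
NOTE
The spreadsheet steps above can be sketched in Python. The nine values are inferred from the subtractions read out in the demo, and they reproduce the quoted mean of 8.44; the variable names are made up for illustration.
```python
from math import sqrt
# Values inferred from the subtractions read out in the demo.
data = [1, 2, 4, 7, 7, 12, 12, 15, 16]
n = len(data)
mean = sum(data) / n
avg_dev_from_first = sum(data[0] - x for x in data) / n
avg_dev_from_last = sum(data[-1] - x for x in data) / n
# Positive and negative deviations from the mean cancel out.
avg_dev_from_mean = sum(mean - x for x in data) / n
# Squaring removes the sign; the square root brings the result back to the original scale.
root_mean_sq_dev = sqrt(sum((mean - x) ** 2 for x in data) / n)
print(round(mean, 2))                  # 8.44
print(round(avg_dev_from_first, 2))    # -7.44
print(round(avg_dev_from_last, 2))     # 7.56
print(abs(avg_dev_from_mean) < 1e-9)   # True
print(round(root_mean_sq_dev, 4))      # 5.2305
```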

17:28.360 --> 17:37.570
Now, if you calculate the variance and standard deviation

17:37.870 --> 17:39.450
directly for these numbers,

17:39.460 --> 17:43.300
it will come out to be similar to this. Let us do that as well.

17:43.420 --> 17:46.690
So let me find the values for you.

17:50.290 --> 17:54.910
So this is variance and I'm taking variance for these numbers.

17:56.260 --> 17:58.030
Similarly, I'm calculating the

17:58.390 --> 17:59.710
standard deviation

18:03.100 --> 18:04.660
for these numbers again.

18:07.180 --> 18:10.300
So it is coming out with the exact same number.

18:11.020 --> 18:17.650
Now, I have done the same calculation for two different data sets as well.

18:18.220 --> 18:24.520
So let's have a look at those data sets. Everything else remains the same; how we do the calculation is

18:24.520 --> 18:25.630
also remaining the same.

18:25.960 --> 18:27.070
So let us look at that.

18:28.240 --> 18:34.570
So here you can see that we have considered values which have a similar scale.

18:34.960 --> 18:38.950
So here you can see the values are distributed evenly.

18:38.980 --> 18:41.760
We can see that the values are not changing

18:42.060 --> 18:47.200
much. Here we have 1, 2, 4, 7, 12, 15, 16, which is a nice spread.

18:47.800 --> 18:50.620
There is no outlier present in the data as well.

18:50.920 --> 18:57.260
So we are getting the average as 8.44, and the deviation from the first number as -7.4.

18:57.410 --> 18:59.710
Then from the last number it is 7.5.

19:00.280 --> 19:02.110
This is actually zero.

19:02.260 --> 19:07.270
And when we are taking the square root of the squared deviation from the mean, it is coming out to be

19:07.270 --> 19:08.620
5.230.

19:09.250 --> 19:15.070
Now, I have changed these numbers a bit and here I have all these numbers.

19:15.280 --> 19:23.050
Apart from that, I have included this 59, which is an outlier towards the upper limit of the values.

19:23.050 --> 19:26.470
So basically I have increased the maximum value a lot.

19:27.550 --> 19:31.870
So what happens now is that the average,

19:31.930 --> 19:36.850
the mean value, turns out to be 13.22 instead of 8.44.

19:37.450 --> 19:40.780
Because we have changed just one single value.

19:41.050 --> 19:43.450
It has modified the mean so much.

19:44.710 --> 19:49.390
Instead, if you had looked at the median, the median value would not have changed.

19:50.320 --> 19:51.850
The mean value has changed.

19:52.990 --> 19:59.080
Now, similarly, when we look at the deviation from the first number, it comes out to be -12.

19:59.350 --> 20:04.780
The deviation from the last number comes out to be 45 point something, while the deviation from the

20:04.780 --> 20:10.380
mean comes out to be zero, and the squared deviation from the mean comes out to be 16.79.
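
NOTE
A small sketch of the comparison being made here, using Python's statistics module. Both lists are inferred from the figures quoted in the demo (only the largest value differs), and pstdev is the population standard deviation, which divides by n as in the manual calculation. The demo shows 16.79 where rounding gives 16.8.
```python
from statistics import mean, median, pstdev
# Data sets inferred from the figures quoted in the demo; only the largest value differs.
original = [1, 2, 4, 7, 7, 12, 12, 15, 16]
with_outlier = [1, 2, 4, 7, 7, 12, 12, 15, 59]
for name, values in (("original", original), ("with outlier", with_outlier)):
    print(name, round(mean(values), 2), median(values), round(pstdev(values), 2))
# original 8.44 7 5.23
# with outlier 13.22 7 16.8
```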

20:10.930 --> 20:15.130
This shows that there is a change in the values.

20:16.300 --> 20:23.590
Now, if you look further here, the deviation from the first number and the last number remains somewhat

20:23.590 --> 20:24.100
the same.

20:24.580 --> 20:31.240
But when we introduce an outlier, the deviations from the first number and the last number have a huge difference.

20:32.020 --> 20:36.310
Same thing happens here when I introduce an outlier towards the minimum value.

20:36.640 --> 20:40.780
The deviations from the first number and the last number have a huge difference.

20:41.740 --> 20:45.910
While the deviation from the mean remains zero everywhere.

20:48.100 --> 20:50.770
In all the cases, the deviation from the mean remains the same.

20:51.490 --> 20:57.520
But because there is an impact from the positive and negative values, that is something which needs

20:57.520 --> 20:59.110
to be taken care of.

20:59.110 --> 21:02.290
Otherwise we will not get good information out of it.

21:02.980 --> 21:06.390
You can clearly see that the mean is changing a lot here.

21:06.550 --> 21:11.110
It was 8.44. We introduced an outlier towards the maximum value.

21:11.110 --> 21:12.460
It changed to 13.

21:12.880 --> 21:16.270
We introduced an outlier towards the negative side of it.

21:16.300 --> 21:18.220
It changed to 2.3.

21:18.220 --> 21:28.240
So the mean is highly influenced by a single observation, and so are the deviations from the first number and

21:28.240 --> 21:29.170
the last number.

21:31.620 --> 21:36.500
While the squared deviation from the mean, if you see, is 20 here.

21:37.090 --> 21:40.020
Here it is 16 and here it is five.

21:40.470 --> 21:45.900
So these values are giving us a good idea of how the data is spread.

21:47.250 --> 21:54.630
So that is why we use the standard deviation to see how the values are spread.

21:56.580 --> 22:02.490
The calculation which we have done here is exactly the same, which is to be done for the calculation

22:02.490 --> 22:04.440
of standard deviation and variance.

22:04.830 --> 22:10.320
The value which we have calculated by doing the square root is the standard deviation.

22:10.560 --> 22:16.740
And if we had kept it as it is without taking the square root, that is, if we had found the deviation from the

22:16.740 --> 22:17.100
mean,

22:17.430 --> 22:24.570
that is, the mean value minus each and every number,

22:24.870 --> 22:28.680
squared it, taken the sum of those squared deviations from the mean,

22:29.040 --> 22:33.180
and divided after doing the sum, we would have the variance.

22:33.510 --> 22:41.820
So we are basically taking the difference, squaring all these deviations from the mean, and

22:41.820 --> 22:49.950
then taking an average of that, basically dividing by the total number of values which we have.

22:50.250 --> 22:51.990
And that is the variance.

22:52.200 --> 22:55.830
And then when we take a square root of it, it is the standard deviation.

22:55.840 --> 22:57.720
That is exactly what needs to be done.

22:57.750 --> 22:58.920
That is the bottom line.

22:59.100 --> 23:05.530
This is the best way to remember this because this actually tells you why we are doing this.

23:07.080 --> 23:09.060
So this is the formula for variance.

23:09.750 --> 23:16.350
And here you can clearly see that, in probability theory, the variance is the average of the squared

23:16.350 --> 23:19.770
differences from the mean. Informally,

23:19.890 --> 23:25.740
it measures how far a set of numbers is spread out from their average value. You have seen that

23:25.740 --> 23:27.690
this changes a lot

23:28.350 --> 23:32.940
when we are checking on the basis of the first number or the last number.

23:32.940 --> 23:37.260
But the mean gives us the most reliable values.

23:37.470 --> 23:42.300
That is why we use this formula for variance and standard deviation.

23:43.680 --> 23:46.440
Now, if you look, there are two formulas here.

23:46.950 --> 23:52.170
The first is for the population and the second one is for the sample.

23:52.590 --> 23:58.380
Similarly, we have here one for the population and one for the sample.

23:58.620 --> 24:06.150
The only difference between the formula for the population and the sample is that the values

24:06.150 --> 24:11.640
are divided by n minus 1 instead of n when we are calculating the sample variance.

24:13.950 --> 24:19.440
Now, why this is done, you will get to know in the next section, where we are talking about Bessel's correction.

24:24.570 --> 24:33.840
Now, why have we divided by n minus 1 in the sample standard deviation, instead of dividing by n, which is

24:33.840 --> 24:35.310
done in case of population?

24:35.640 --> 24:38.550
So this is done because of Bessel's correction.

24:38.790 --> 24:45.600
So Bessel's correction refers to the n minus 1 found in several formulas, including the sample variance

24:45.600 --> 24:47.760
and sample standard deviation formulas.

24:48.090 --> 24:56.970
This correction is made to correct for the fact that these sample statistics tend to underestimate the

24:56.970 --> 24:59.430
actual parameters found in the population.

24:59.700 --> 25:07.050
Now, because the standard deviation and the variance which we are calculating for the sample are kind

25:07.050 --> 25:14.220
of not considering the entire picture, they might be a little lower than the actual variance

25:14.220 --> 25:17.670
or actual standard deviation of the population.

25:17.880 --> 25:27.810
So what we do is we reduce the n value for the sample, so that when we reduce the denominator, the entire

25:27.810 --> 25:32.670
value of the variance and the standard deviation kind of increases a little.

25:32.940 --> 25:33.540
So it

25:34.780 --> 25:40.900
balances out and makes it closer to the population standard deviation and the population variance.

25:42.210 --> 25:45.090
So this is why we use Bessel's correction.
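
NOTE
To see the effect of dividing by n minus 1, here is a short sketch with Python's statistics module, where pvariance and pstdev divide by n and variance and stdev divide by n minus 1. The data is the list inferred from the earlier demo, treated here as if it were a sample.
```python
from statistics import pstdev, pvariance, stdev, variance
# Values inferred from the earlier demo; here they are treated as a sample.
data = [1, 2, 4, 7, 7, 12, 12, 15, 16]
print(round(pvariance(data), 2), round(pstdev(data), 2))  # divide by n:     27.36 5.23
print(round(variance(data), 2), round(stdev(data), 2))    # divide by n - 1: 30.78 5.55
```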
