WEBVTT

00:01.570 --> 00:07.930
Hi. In this session, we will discuss the next type of chi-square test, which is the chi-square

00:07.930 --> 00:10.150
test for independence.

00:11.570 --> 00:20.030
Now, in the case of the chi-square test for goodness of fit, we saw that for a particular variable, a

00:20.030 --> 00:26.750
set of values that is categorical in nature, we tried to find out the proportions of these values

00:26.960 --> 00:36.660
and tried to deduce things about the population distribution using the chi-square test of goodness of fit.

00:37.830 --> 00:45.630
Now, what do we do in the case of the test for independence? The test for independence can be used and

00:45.630 --> 00:47.630
interpreted in two different ways.

00:48.420 --> 00:55.170
The first way is testing hypotheses about the relationship between two variables within a population.

00:56.480 --> 01:02.700
That is, initially we were talking about one single variable, that is, one categorical variable, and

01:02.730 --> 01:03.650
that categorical

01:03.680 --> 01:05.820
variable had multiple values in it.

01:05.990 --> 01:12.390
Let us say the eye color, being brown, green, blue, and other.

01:12.800 --> 01:14.990
So this was about a single variable.

01:15.620 --> 01:20.740
But here, in the chi-square test of independence, we are talking about two variables.

01:21.880 --> 01:28.660
So you see the difference: goodness of fit was about just one particular frequency distribution,

01:28.660 --> 01:34.690
that is, about the one categorical variable which we were talking about, but here we have two variables.

01:36.160 --> 01:42.400
And the other reading is about differences between proportions for two or more populations. So one

01:42.400 --> 01:48.030
is about the relationship between the variables, and the other is about the differences between the populations.

01:48.910 --> 01:54.830
Now, if you try to analyze these two statements, they are kind of similar.

01:54.850 --> 02:02.860
It is only that one is trying to find out if two variables have some relationship or not, and the other is trying

02:02.860 --> 02:07.660
to find out if the populations have some differences or not.

02:08.500 --> 02:09.970
So it's one and the same thing.

02:09.970 --> 02:16.420
If we are able to find out whether two populations or variables have some similarity or differences

02:16.420 --> 02:16.790
or not,

02:16.810 --> 02:21.070
that can be treated as just one null hypothesis.

02:22.100 --> 02:25.920
So we don't need two separate solutions here.

02:26.240 --> 02:28.850
We need just one procedure to solve the problem.

02:30.270 --> 02:37.230
So the only difference you need to take care of here is that the goodness-of-fit test is regarding one

02:37.230 --> 02:38.520
categorical variable,

02:39.210 --> 02:49.560
while the test for independence is regarding two or more populations, or two variables present in one

02:49.560 --> 02:50.110
population.

02:50.130 --> 02:54.930
So either there could be two populations, and we want to find out if these two populations are different

02:54.930 --> 02:55.310
or not,

02:55.830 --> 02:59.220
or we could have two variables in the same population,

02:59.370 --> 03:05.140
and we want to find out if there is any relationship between these two variables.

03:05.690 --> 03:10.200
This could be something like the number of hours someone is studying

03:11.460 --> 03:18.420
with respect to whether they are scoring well or not. So here we have the same population, but we want to

03:18.420 --> 03:25.620
find out if the number of hours a person studies has some relationship with the marks, with the

03:25.620 --> 03:26.970
score they are getting.

03:28.380 --> 03:29.910
So let us go further.

03:32.080 --> 03:38.170
Now, although the two versions of the test of independence appear to be different, they are equivalent

03:38.170 --> 03:40.170
and they are interchangeable in nature.

03:41.710 --> 03:50.530
The first version of the test emphasizes the relationship between chi-square and a correlation, because

03:50.740 --> 03:54.090
both procedures examine the relationship between two variables.

03:54.130 --> 03:57.810
So here we have chi-square alongside correlation.

03:57.880 --> 04:03.340
It emphasizes the link between chi-square and correlation, that is, whether we are able to relate the variables to each other or not.

04:04.180 --> 04:10.860
It determines whether there is an association between categorical variables, and is also known as the

04:10.860 --> 04:17.190
chi-square test of association. So the test of independence is also known as the chi-square test of association.

04:17.440 --> 04:21.220
So we are checking whether the variables are associated or independent.

04:23.300 --> 04:26.780
We are just trying to find out the relationship between two variables.

04:28.160 --> 04:37.010
So a very small chi-square statistic means that the observed data fits your expected data extremely well.

04:37.280 --> 04:40.430
In other words, since the expected counts assume independence, there is no evidence of a relationship between the two variables.

04:40.850 --> 04:43.060
If the chi-square value is very small,

04:43.070 --> 04:46.160
it will mean that the data is consistent with the variables being independent.

04:46.400 --> 04:51.910
If the chi-square value is very large, then it will mean that the data does not fit the independence assumption very well.

04:51.920 --> 04:54.620
That is, the variables are unlikely to be independent.

04:54.620 --> 05:00.350
That is, those two variables which we are trying to compare, and for which we are trying to find an association,

05:00.350 --> 05:01.940
are related

05:02.840 --> 05:04.570
and not independent of each other.

05:06.130 --> 05:14.470
Right, so the subscript c here is the degrees of freedom, and the formula remains the same. If you

05:14.470 --> 05:20.700
see, the formula is exactly what we had for goodness of fit.

05:21.550 --> 05:22.650
Just have a look.

05:22.650 --> 05:22.840
You

05:25.010 --> 05:32.370
see, the observed value minus the expected value, whole squared, divided by expected. What is the formula here?

05:33.080 --> 05:37.350
The formula is observed minus expected, whole squared, divided by expected.

05:37.850 --> 05:40.100
So the formula is exactly the same.
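
NOTE
Editorial sketch, not part of the spoken lecture: the formula described above,
written out in LaTeX. It assumes the slide uses the subscript c for the degrees
of freedom, O_i for the observed count and E_i for the expected count in cell i;
the index i is an assumption introduced here for illustration.
```latex
\chi^2_c = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```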

05:40.130 --> 05:41.120
The problem is.

05:42.980 --> 05:44.700
Let us go ahead now.

05:44.900 --> 05:49.660
This chi-square value which we have follows the chi-square distribution.

05:50.540 --> 05:56.210
This is the distribution plotted, and these are the different curves which are generated from

05:56.210 --> 05:56.900
this distribution.

05:57.590 --> 06:06.590
Now, what do these different colors signify? These are different curves for different values

06:06.590 --> 06:07.070
of the

06:10.700 --> 06:11.760
degrees of freedom.

06:12.230 --> 06:17.600
So here we have different degrees of freedom, and you can see that for different degrees of freedom

06:17.630 --> 06:19.490
we have different chi-square curves.

06:21.670 --> 06:30.850
And as the degrees of freedom increase, the curve starts to resemble a normal distribution. So as

06:30.850 --> 06:37.180
the degrees of freedom increase, the curve will move towards being a normal distribution.
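
NOTE
Editorial sketch, not part of the spoken lecture: one way to see the curve
moving towards a normal shape is its skewness. A chi-square distribution with
k degrees of freedom has skewness sqrt(8 / k), while a normal distribution has
skewness 0. The function name chi_square_skewness is hypothetical.
```python
import math
#
def chi_square_skewness(k: int) -> float:
    """Skewness of a chi-square distribution with k degrees of freedom."""
    return math.sqrt(8.0 / k)
#
# Skewness shrinks towards 0 (the value for a normal curve) as k grows.
skewness_by_df = {k: round(chi_square_skewness(k), 3) for k in (1, 2, 5, 10, 50, 100)}
print(skewness_by_df)
```
The printed values fall steadily as the degrees of freedom rise, which matches the curves becoming more symmetric.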

06:39.960 --> 06:41.860
So let us have an example here.

06:42.240 --> 06:49.270
So here we have a public opinion poll which surveyed a simple random sample of a thousand voters.

06:49.330 --> 06:52.070
So the number of people which we have here is a thousand.

06:52.380 --> 06:56.300
And the people who responded were classified by gender.

06:56.310 --> 06:58.770
That is, either they were male or female.

06:59.490 --> 07:05.260
Now, this is our first variable, which is a categorical variable: whether a person is male or female.

07:05.910 --> 07:07.350
What is the next variable?

07:07.650 --> 07:09.030
The next is the

07:10.850 --> 07:16.640
voting preference of a person, that is, whether a person is Republican, Democrat, or independent.

07:17.690 --> 07:23.630
So this is, again, a categorical variable, that is, whether a person is Republican, Democrat, or independent,

07:23.840 --> 07:26.810
and the other one is the gender of the person.

07:27.410 --> 07:31.860
So these are the two variables which we have. In chi-square goodness of fit,

07:32.240 --> 07:37.020
we had one, right? Now, for this,

07:37.070 --> 07:38.570
this is the table which we have.

07:38.580 --> 07:40.340
These are the values that we have.

07:40.350 --> 07:45.530
That is, there are four hundred males and six hundred females which were taken in the sample.

07:46.540 --> 07:54.050
And out of these, 450 were Republican, 450 Democrat, and a hundred were independent.

07:54.460 --> 07:56.260
Now this is the data which we have.

07:57.950 --> 08:02.520
Now we want to find out if there is a particular gender gap.

08:03.140 --> 08:08.130
Do men's voting preferences differ significantly from women's preferences?

08:08.330 --> 08:15.080
So we are trying to compare whether the voting preferences of males and females are different, or whether there is any

08:15.080 --> 08:16.740
relationship between the two.

08:17.480 --> 08:26.360
Now, we could have seen that directly if the numbers of males and females had been similar. If the numbers

08:26.360 --> 08:32.060
of males and females had been five hundred and five hundred, then we could have seen if the values

08:32.060 --> 08:33.400
were significantly different.

08:33.770 --> 08:39.620
But here, the problem is that because it is a random sample, the number of males is less than the

08:39.620 --> 08:40.490
number of females.

08:41.590 --> 08:47.380
So we cannot really determine anything directly from this data which we have, and that is why we will be doing

08:47.380 --> 08:48.810
the chi-square test.

08:49.400 --> 08:50.610
Now, how do we do that?

08:52.690 --> 08:58.960
So we will be doing that by performing a series of steps.

08:59.590 --> 09:02.670
OK, now what is the null hypothesis here?

09:03.430 --> 09:08.970
The null hypothesis is that the gender and voting preferences are independent.

09:09.310 --> 09:15.970
That is, there is no relationship between gender and voting preferences.

09:17.110 --> 09:24.490
That is, males do not particularly vote Republican, or females do not particularly vote independent.

09:24.760 --> 09:25.630
So these are

09:26.880 --> 09:32.210
different patterns which could have been there, but there is no such dependency, right?

09:32.580 --> 09:34.050
They are completely independent.

09:35.160 --> 09:40.380
The alternative hypothesis is that gender and voting preferences are not independent, that is,

09:40.410 --> 09:46.230
the two are associated, so that voting preferences actually differ

09:46.230 --> 09:46.970
by gender.

09:47.370 --> 09:54.990
If a particular gender votes more, then there is a chance of one particular party winning.

09:56.190 --> 09:59.330
Right, but we want to find out if that is true or not.

10:00.840 --> 10:03.900
So we have created this null and alternative hypothesis.

10:05.020 --> 10:12.820
Now, what are the degrees of freedom? Here, the degrees of freedom will be the product of the

10:13.000 --> 10:17.830
degrees of freedom of the rows and the degrees of freedom of the columns.

10:18.010 --> 10:18.660
What is it?

10:20.200 --> 10:21.700
Here we have two values.

10:22.930 --> 10:25.510
So here, the degrees of freedom will be one.

10:26.560 --> 10:33.490
Here we have three values, so here the degrees of freedom will be two. The total degrees of freedom will

10:33.490 --> 10:34.450
be two.

10:36.200 --> 10:39.230
Right, so how do we find out the degree of freedom?

10:39.260 --> 10:42.420
So the first thing which we will be finding out will be degrees of freedom.

10:42.800 --> 10:48.500
Now, here r is the number of levels for one categorical variable, and c is the number of levels for the

10:48.500 --> 10:50.110
other categorical variable.

10:50.840 --> 10:52.910
This is how we find out the degrees of freedom.
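
NOTE
Editorial sketch, not part of the spoken lecture: the degrees-of-freedom rule
just described, (r - 1) * (c - 1), applied to the gender (2 levels) by voting
preference (3 levels) table. The function name degrees_of_freedom is hypothetical.
```python
def degrees_of_freedom(r: int, c: int) -> int:
    """Degrees of freedom for a table with r row levels and c column levels."""
    return (r - 1) * (c - 1)
#
# Gender has 2 levels; voting preference has 3 levels.
df = degrees_of_freedom(2, 3)
print(df)  # prints 2
```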

10:53.420 --> 10:56.100
The next value is the expected frequency.

10:56.100 --> 10:58.580
That is, what is the expected frequency value?

10:59.760 --> 11:00.950
How do we find that out?

11:01.260 --> 11:07.380
The expected frequency counts are computed separately for each level of one categorical variable at each

11:07.380 --> 11:08.820
level of the other categorical variable.

11:09.510 --> 11:15.110
So we will be finding out the expected value for this, for this, for this.

11:15.120 --> 11:18.870
So for each and every cell which we have here, we will be finding the

11:19.840 --> 11:21.220
expected value.

11:22.720 --> 11:24.650
Now, how do we find that out?

11:25.000 --> 11:31.630
It will be the total number of values in the row multiplied by the total

11:31.630 --> 11:34.310
number of values in the column, divided by the total number.

11:35.170 --> 11:35.430
Right.

11:35.850 --> 11:42.290
So n sub r is the total number of sample observations at level r of variable A, and

11:42.310 --> 11:50.020
n sub c is the total number of sample observations at level c of variable B, and n is the total sample

11:50.020 --> 11:50.530
size.

11:52.980 --> 11:53.460
OK.

11:54.690 --> 11:56.370
So this will be nothing but the following.

11:57.240 --> 12:00.850
For this cell, what will be the value, for male versus Republican?

12:00.870 --> 12:05.340
The value will be 450 into 400, divided by a thousand.

12:05.700 --> 12:06.970
What will it be for this one?

12:07.170 --> 12:12.210
This will again be 450 into 400, divided by a thousand.
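
NOTE
Editorial sketch, not part of the spoken lecture: the expected count for every
cell, computed as (row total * column total) / n from the totals stated in the
poll example. The variable names are hypothetical.
```python
# Row and column totals as given in the lecture's poll example.
row_totals = {"male": 400, "female": 600}
col_totals = {"republican": 450, "democrat": 450, "independent": 100}
n = 1000
#
# Expected count for each cell: (row total * column total) / n.
expected = {
    r: {c: row_totals[r] * col_totals[c] / n for c in col_totals}
    for r in row_totals
}
for r, row in expected.items():
    print(r, row)
```
For male versus Republican this gives 450 into 400 divided by a thousand, that is, 180.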

12:14.880 --> 12:20.310
Right. Now, what is O sub r c? O sub r c is the observed frequency.

12:20.550 --> 12:21.510
So what will be the observed

12:21.510 --> 12:21.930
frequency?

12:22.200 --> 12:24.220
For male versus Republican,

12:24.300 --> 12:26.860
it is two hundred. For female and Republican,

12:26.880 --> 12:29.280
it is 250. For female and Democrat,

12:29.280 --> 12:33.810
it is three hundred. For male and Democrat, it is 150. For male and independent,

12:33.810 --> 12:36.740
it is 50. For female and independent, it is 50.

12:36.870 --> 12:39.120
So this is how we find the observed values.

12:41.950 --> 12:49.270
OK, now we are doing the calculations. The degrees of freedom come out

12:49.270 --> 12:54.670
to be two. How? The number of rows minus one, times the number of columns minus one.

12:56.500 --> 13:06.700
That is, two minus one is one, and three minus one is two; one times two is equal to two. Then these

13:06.700 --> 13:10.350
are the different expected values.

13:10.930 --> 13:17.230
So you can go back to the values here and compare how these values are calculated.

13:17.920 --> 13:25.030
And similarly, we have the observed values already present, and based on the expected value and the

13:25.030 --> 13:28.000
observed value, we can find the chi-square statistic.

13:28.570 --> 13:30.430
So what will be the chi-square statistic?

13:30.430 --> 13:38.140
The chi-square statistic will be the observed value minus the expected value, squared, divided by the expected value.

13:38.470 --> 13:45.430
And then we take the sum of all of these, all of these chi-square contributions, over each and

13:45.430 --> 13:46.990
every row and column cell.

13:48.860 --> 13:54.020
So it comes out to be 16.2. You can pause this video

13:54.930 --> 14:01.220
and verify the values; do the calculations on your own.
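
NOTE
Editorial sketch, not part of the spoken lecture: a way to check the 16.2 by
hand, summing (observed - expected)^2 / expected over all six cells of the poll
table. The variable names are hypothetical; the counts are the ones read out above.
```python
# Observed counts from the poll: rows are gender, columns are voting preference.
observed = {
    "male": {"republican": 200, "democrat": 150, "independent": 50},
    "female": {"republican": 250, "democrat": 300, "independent": 50},
}
row_totals = {r: sum(row.values()) for r, row in observed.items()}
col_totals = {c: sum(observed[r][c] for r in observed) for c in observed["male"]}
n = sum(row_totals.values())
#
chi_square = 0.0
for r in observed:
    for c in observed[r]:
        e = row_totals[r] * col_totals[c] / n  # expected count for this cell
        chi_square += (observed[r][c] - e) ** 2 / e
print(round(chi_square, 1))  # prints 16.2
```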

14:03.270 --> 14:04.110
And then continue.

14:05.540 --> 14:09.600
So these are the values, so you can calculate these values accordingly.

14:09.950 --> 14:15.830
Now here, the degrees of freedom are two and the chi-square value is 16.2.

14:17.280 --> 14:26.640
So for this, if we go to the chi-square table, then for degrees of freedom two, 16.2 will come

14:26.640 --> 14:27.030
where?

14:30.090 --> 14:35.550
This is the row for degrees of freedom two, and the value is sixteen point something.

14:36.420 --> 14:41.040
Now you can see that the value sixteen point something is very large.

14:45.040 --> 14:47.530
Here also, you can see, for degrees of freedom two,

14:48.670 --> 14:58.150
16.2 falls way beyond the critical value at the 0.01 significance level, which means that we have to reject the

14:58.150 --> 14:59.170
null hypothesis.
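
NOTE
Editorial sketch, not part of the spoken lecture: instead of a printed table,
the decision can be checked numerically. For exactly 2 degrees of freedom the
chi-square upper-tail probability has the closed form exp(-x / 2); this shortcut
does not hold for other degrees of freedom. The 0.01 level is the one used in
the lecture; the variable names are hypothetical.
```python
import math
#
chi_square = 16.2  # test statistic from the poll example
alpha = 0.01       # significance level
#
# With 2 degrees of freedom, P(X > x) = exp(-x / 2).
p_value = math.exp(-chi_square / 2)
# Critical value at level alpha for 2 degrees of freedom: -2 * ln(alpha).
critical_value = -2 * math.log(alpha)
reject_null = chi_square > critical_value
print(f"p-value = {p_value:.6f}, critical value = {critical_value:.2f}, reject = {reject_null}")
```
The statistic exceeds the critical value of about 9.21, so the null hypothesis of independence is rejected at the 0.01 level.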

14:59.590 --> 15:01.630
So what was the null hypothesis?

15:03.310 --> 15:09.730
The null hypothesis was that gender and voting preferences are independent, but because we have rejected

15:09.730 --> 15:15.670
the null hypothesis, we can conclude that there is a relationship between gender and voting preferences.

15:16.630 --> 15:23.410
So this means that if a particular gender votes more than the other gender, then there is a chance

15:23.410 --> 15:26.090
for a particular party to win, right?

15:26.290 --> 15:30.960
So this is what we have in the chi-square test of independence.

15:31.780 --> 15:38.050
So next we will be discussing ANOVA, which is another kind of test, which we will be discussing

15:38.050 --> 15:39.340
in the next session.

15:39.820 --> 15:40.230
Thank you.