WEBVTT

00:04.200 --> 00:09.000
To implement data quality tests, we need to have a tool that can perform these tests.

00:09.040 --> 00:15.280
One tool that has gained more popularity over the last few years is soda, which is a data quality framework.

00:15.320 --> 00:21.200
We will use Soda Core, which is the open source Python based library, and using Soda Core we can define

00:21.240 --> 00:23.520
the data quality checks in YAML format.

00:23.560 --> 00:27.000
Now to have soda available we will need to install it using Pip.

00:27.040 --> 00:32.640
We can do this by going back to the requirements.txt file that we introduced in the beginning of this

00:32.640 --> 00:37.640
course, and adding two additional packages that will be part of our updated Docker image.

00:37.680 --> 00:39.200
These are the following two.

00:39.400 --> 00:43.440
The first one is for data quality, and this is the one that we're currently discussing.

00:43.560 --> 00:50.040
And the other one is Pytest, which we will go more into once we do the functional testing part of this

00:50.040 --> 00:50.680
course.

00:50.720 --> 00:59.530
Notice also how for Soda Core, I had to specify Postgres since our data is inside the Postgres database,

00:59.690 --> 01:02.810
we need to use the so-called process connector.

01:02.850 --> 01:08.770
If we were using another database like redshift for example, we would need to use the appropriate connector.

01:08.810 --> 01:14.730
Now that we have updated and saved the requirements.txt to update the Docker image, there are some

01:14.770 --> 01:16.970
additional steps which we need to take.

01:16.970 --> 01:21.410
So first we need to build the Docker image with updated tag.

01:21.450 --> 01:28.050
Remember that the first version tag was 1.0.0, and we can remember this by going to the Docker Hub

01:28.050 --> 01:30.730
repository and seeing 1.0.0.

01:31.690 --> 01:38.370
So let's build the YouTube API image using the 1.0.1 tag.

01:46.690 --> 01:50.370
Next we will need to push the image to Docker Hub.

01:50.650 --> 01:51.890
So first we log in.

01:56.410 --> 01:58.220
We are prompted for a for passwords.

01:58.660 --> 02:03.820
Make sure you have the saved somewhere and it is easily accessible.

02:05.620 --> 02:12.260
Once you get login succeeded, we can now push the updated image with the new tag.

02:19.180 --> 02:23.780
So from here, let's go briefly on Docker Hub and see if the tag was updated.

02:26.860 --> 02:32.860
As you can see, we just pushed the new image with the new tag less than a minute ago.

02:34.580 --> 02:41.580
As a final step, run docker compose pull to use the latest image that we just pushed and rebuild and

02:41.580 --> 02:47.940
restart the containers using the dash dash force dash recreate option of Docker compose.

02:47.940 --> 02:49.340
So this will look as follows.

02:49.780 --> 03:02.590
First, run the docker compose pull and then run docker compose up minus d dash dash force dash recreate.

03:14.030 --> 03:17.110
Let's run docker PS to make sure that we are using the new tag.

03:18.190 --> 03:22.270
And as you can see, we are using the new 1.0.1 tag.

03:23.030 --> 03:29.470
If for some reason the image tag is not updated, try closing vs code and rerun the docker compose up

03:29.470 --> 03:32.750
command and the docker PS command as well.

03:33.150 --> 03:37.790
Now to check whether we have soda code installed, we go inside one of the containers.

03:38.190 --> 03:40.950
Let's say we go inside the airflow worker.

03:41.110 --> 03:43.310
So we use the docker exec command.

03:44.790 --> 03:53.230
And now that we're inside the container, we type the pip show and the package in our case soda core

03:53.550 --> 03:54.350
Postgres.

03:54.350 --> 03:54.440
Progress.

03:57.680 --> 04:03.000
And as you can see, the package was found with this particular version that we have.

04:03.200 --> 04:03.560
Great.

04:03.560 --> 04:07.480
So at this point you confirm that we have Postgres installed.

04:07.520 --> 04:09.720
Now that we have it installed we can also run.

04:09.720 --> 04:11.200
So dash dash help.

04:12.880 --> 04:14.760
Here we get the commands that we can run.

04:14.800 --> 04:18.480
The next thing we should do is to test the connection to our database.

04:18.960 --> 04:22.360
As you can see, one of the options that we have here is to test connection.

04:22.720 --> 04:29.320
This tests the connection that soda has with Postgres as per the soda docs for Postgres connector.

04:29.440 --> 04:35.480
To test the connection, we need to create a YAML file called configuration.yaml, and the YAML file

04:35.480 --> 04:41.440
will contain all of these parameters for the connection to the Postgres database.

04:41.560 --> 04:47.240
Now, as we have said in one of the previous sections, we will store this YAML file under the include

04:47.280 --> 04:50.360
directory as per airflow best practices.

04:50.360 --> 04:53.690
And we will also create a subfolder which we will would call shoulder.

04:53.690 --> 04:54.810
So let's do that right now.

04:54.810 --> 04:56.610
First we create a sub folder soda.

04:57.010 --> 05:00.570
And in here we will have the configuration.yaml.

05:03.850 --> 05:06.930
The information for the configuration where you have in the env.

05:07.250 --> 05:09.850
So we don't need to stay hard coding these values.

05:09.850 --> 05:15.290
And the schema will be dynamic variable depending on whether we are working with the staging or the

05:15.290 --> 05:16.010
core layer.

05:17.010 --> 05:23.530
So if we have to look at what parameters we have in the documentation and change them according to our

05:23.530 --> 05:25.610
parameters, we will have the following.

05:25.650 --> 05:31.570
Just to go over this briefly, the one thing that we will define is the data source name, which I have

05:31.570 --> 05:34.130
named as PG underscore data source.

05:34.170 --> 05:37.450
The type of connection is with the Postgres connector.

05:37.450 --> 05:39.370
So we will use the Postgres.

05:39.370 --> 05:42.650
And here these we will all get from the env.

05:42.690 --> 05:47.570
As we said the schema is a dynamic variable which can either be core or staging.

05:48.010 --> 05:51.700
With the configuration set up we can now test the that's the connection.

05:51.700 --> 05:56.340
As you can see, the test connection command is given here.

05:56.340 --> 06:00.900
If we had to adjust for our scope, the command would look as follows.

06:01.140 --> 06:05.100
So here we're defining the soda test connection command.

06:05.140 --> 06:12.380
As we see in the soda shell we need to define using the dash D the data source which we said we would

06:12.380 --> 06:15.020
define as PG underscore data source.

06:15.020 --> 06:18.620
And here we define the part of the configuration file.

06:18.980 --> 06:23.140
Also we use dash capital V for a verbose output.

06:23.140 --> 06:27.740
So let's run this command and we should get a valid response.

06:29.340 --> 06:30.460
So I didn't save this.

06:30.460 --> 06:32.420
So let me first save the configuration.

06:33.180 --> 06:34.380
And let's run this again.

06:35.580 --> 06:36.580
And there we have it.

06:37.100 --> 06:40.460
The connection PG underscore data source is valid.

06:41.180 --> 06:41.820
So that's great.

06:41.820 --> 06:44.380
We have now verified the connection is working fine.

06:44.700 --> 06:45.300
At this point.

06:45.300 --> 06:50.590
We can now define the tests that we would like to apply in a separate YAML file which should be called

06:51.190 --> 06:51.630
YAML.

06:52.590 --> 06:53.910
So let me create it right now.

06:55.990 --> 07:00.870
In terms of checks for that come to mind for both staging and core layers are the following.

07:00.910 --> 07:04.790
We first need to check if there are null values for the video ID column.

07:04.910 --> 07:07.990
There shouldn't be any because this is a primary key.

07:08.190 --> 07:12.870
We should also check if there are duplicates, which again should not be the case for video ID column.

07:12.910 --> 07:19.070
We should also ensure that for each video, the views count is less than or equal to the number of likes,

07:19.070 --> 07:23.790
and likewise the views count is less than or equal to the number of comments.

07:23.790 --> 07:29.310
This is because, as you can imagine, the number of views has to be greater than or equal to the number

07:29.310 --> 07:34.510
of likes and comments as to like and comment a video, you would first have to actually see the video

07:34.750 --> 07:36.310
going back to the soda docs.

07:36.910 --> 07:39.670
In order to run these data quality checks.

07:40.150 --> 07:42.910
We have examples here of how we can do this.

07:43.310 --> 07:46.990
So we should always follow the standard of using checks for.

07:47.470 --> 07:49.680
And then here you can specify the table name.

07:49.680 --> 07:52.320
In our case this will be YouTube underscore API.

07:52.720 --> 07:57.560
So let's go back to VS code and start running our checks.

08:00.200 --> 08:05.320
To check for nulls soda has an inbuilt check which is called missing count.

08:05.360 --> 08:12.360
What you need to do is to define the missing count check and also specify the column that we will be

08:12.360 --> 08:13.920
checking here.

08:13.920 --> 08:17.520
We also specify the counts that we expect.

08:17.520 --> 08:20.880
In our case we expect zero null values.

08:21.040 --> 08:23.800
We can also do the same thing for the duplicates.

08:23.840 --> 08:32.160
Again, the check is inbuilt in soda and we can define the duplicate underscore count with the column

08:32.160 --> 08:32.800
in question.

08:33.680 --> 08:39.200
And now for the last two checks we have to build a custom query which we can as follows.

08:39.200 --> 08:42.120
So first we define the result variable of the query.

08:42.120 --> 08:46.730
And we set it to a value which we want which in our case will be would be zero.

08:46.930 --> 08:49.970
I will soon explain where the zero is coming from.

08:50.370 --> 08:52.730
The variable name will be as follows.

08:53.250 --> 08:57.690
So the likes count greater than video views is zero.

08:57.930 --> 09:04.410
This is I guess, self-explanatory, but what I want to show here is that the likes count.

09:04.450 --> 09:08.610
There are no instances where this is greater than the video views.

09:08.650 --> 09:14.770
We can now define the name parameter, and the name is basically a description of what the query will

09:14.770 --> 09:15.130
do.

09:15.130 --> 09:17.010
So we can write something as follows.

09:17.050 --> 09:24.810
Check for likes count column greater than video views.

09:25.010 --> 09:28.770
From here, what is left is to write the actual SQL query for the code.

09:28.810 --> 09:32.130
To do this, we copy the variable.

09:32.130 --> 09:38.610
Again, we write the keyword query, and from here we set up the SQL query.

09:39.810 --> 09:41.450
The query will look as such.

09:45.300 --> 09:52.500
So again, I will reiterate what this query is doing is that we are checking for the count of videos

09:52.500 --> 09:53.580
that has the likes.

09:53.580 --> 09:55.940
Count is greater than the video views.

09:56.100 --> 09:59.580
Again this should be zero which is what we defined here.

09:59.860 --> 10:05.860
So we expect that this count is zero and no rows will be returned for this check.

10:06.100 --> 10:11.020
Now the same thing can be done for the comments count.

10:11.460 --> 10:17.020
So I will simply copy the check I have here, but this time for the comments.

10:22.180 --> 10:29.020
Now that we have both the Configuration.yaml and the HTML, we can run a pseudo scan command to see

10:29.020 --> 10:30.500
if our tests pass.

10:30.540 --> 10:36.740
The command will look as follows and I will explain what it does.

10:37.140 --> 10:39.860
So again we're using the soda scan.

10:40.180 --> 10:43.230
This is done for the checks the data quality checks.

10:43.270 --> 10:48.910
We are defining the data source, the path of the configuration.

10:49.350 --> 10:51.190
The dash V.

10:51.590 --> 10:56.150
The small dash V specifies the schema, which in our case we're going to run.

10:56.150 --> 10:58.150
This checks for the core layer.

10:58.310 --> 11:01.950
And here we're also defining the location of the checks.

11:02.910 --> 11:03.830
Let's run this.

11:06.030 --> 11:08.350
So it seems I have a couple of errors here.

11:08.830 --> 11:12.910
The first thing that I'm noticing is that I misspelled missing count.

11:12.950 --> 11:16.190
This should be with another I duplicate count.

11:16.350 --> 11:18.190
Okay, let's try this again.

11:18.390 --> 11:24.550
So from what we can see here, there is some error with these last two checks.

11:25.470 --> 11:30.830
It could be because of the indentation that is caused with the colon.

11:30.910 --> 11:33.150
So let me just do this.

11:33.790 --> 11:35.390
Let's save and let's run again.

11:37.110 --> 11:38.390
Finally success.

11:38.390 --> 11:41.640
As you can see, four on four checks have passed.

11:42.600 --> 11:45.520
You also get this command that says it all is good.

11:45.520 --> 11:48.400
No failures, warnings or errors this time.

11:49.000 --> 11:51.680
And let's say we want to test our logic.

11:51.680 --> 11:53.960
And let's say we will set this to one.

11:54.120 --> 12:00.440
In this case, we will get a failure because we cannot have any videos with the comments count greater

12:00.440 --> 12:01.520
than the video views.

12:01.520 --> 12:04.280
So let's run this again and see what the output is.

12:04.480 --> 12:10.280
We might also want to include a minus v, a capital V for a verbose output.

12:10.840 --> 12:15.440
And as you can see, the test that we manipulated failed.

12:15.440 --> 12:19.800
And as you can see, the actual value should have been zero as it was before.

12:20.240 --> 12:21.880
So let's change this back.

12:22.560 --> 12:25.800
And we are done with the tests using soda.

12:26.200 --> 12:30.840
At this point, we know that we have a valid connection and that the data quality tests are working

12:30.840 --> 12:31.760
as expected.

12:32.320 --> 12:37.800
The final step is to integrate all of this in airflow, which is what we will do in the next lecture.