WEBVTT

00:04.160 --> 00:10.240
One further change that we need to make before we start the CI/CD section is to revisit the

00:10.240 --> 00:11.240
DAGs we built.

00:11.280 --> 00:16.440
Currently, we have a problem with the logic of our DAGs, as we don't have a guarantee that they will

00:16.440 --> 00:18.520
run in the order that we expect them to.

00:18.720 --> 00:21.240
Since scheduling is solely time-based,

00:21.360 --> 00:27.720
if the first DAG, which is this one here and is set to run at 2 p.m., fails or is delayed, the second

00:27.720 --> 00:32.000
DAG, which is set to run at 3 p.m., will still run, and it will fail as well.

00:32.200 --> 00:35.960
Same story for the third DAG, which currently runs at 4 p.m.

00:36.560 --> 00:43.060
This is also inefficient: if the first DAG takes only five minutes to run, we have to wait another 55

00:43.060 --> 00:49.360
minutes before the second DAG runs, and another hour before the last DAG runs.
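
NOTE
The wait described above can be worked out in a few lines of Python. This is a rough sketch: the 2, 3 and 4 p.m. start times come from the video, while the five-minute run time for every DAG, the arbitrary date and the helper name idle_minutes are assumptions made only for illustration.
```python
from datetime import datetime, timedelta
# Fixed start times from the video: 2 p.m., 3 p.m. and 4 p.m. (the date is arbitrary).
SCHEDULE = [datetime(2024, 1, 1, hour) for hour in (14, 15, 16)]
# Assumption: every DAG takes five minutes, as in the spoken example.
RUN_TIME = timedelta(minutes=5)
def idle_minutes(schedule, run_time):
    """Minutes spent idle between the end of one DAG and the start of the next."""
    return [int((nxt - (cur + run_time)).total_seconds() // 60) for cur, nxt in zip(schedule, schedule[1:])]
gaps = idle_minutes(SCHEDULE, RUN_TIME)
# End-to-end duration with fixed start times versus back-to-back triggering
# (trigger overhead is ignored in this sketch).
time_based_total = int((SCHEDULE[-1] + RUN_TIME - SCHEDULE[0]).total_seconds() // 60)
chained_total = int((RUN_TIME * len(SCHEDULE)).total_seconds() // 60)
print(gaps, time_based_total, chained_total)  # [55, 55] 125 15
```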

00:49.680 --> 00:56.080
A simple change we could make is to join all the tasks in our three DAGs into one consolidated DAG.

00:56.280 --> 01:02.350
This is a completely fair approach in terms of DAG architecture and would avoid the issue we currently

01:02.350 --> 01:02.790
face.

01:03.310 --> 01:09.830
However, I will leave the architecture as it is right now, with three DAGs, so we can go over the topic of DAG

01:09.830 --> 01:14.350
dependencies, a challenge you will certainly face in the workplace.

01:14.470 --> 01:21.830
So in our case, what we can do is keep the 2 p.m. schedule only on the first DAG, the one with the DAG ID

01:21.870 --> 01:25.910
produce_json, while the other two DAGs,

01:26.110 --> 01:33.950
update_db and data_quality, are each triggered once the preceding DAG finishes successfully.

01:34.190 --> 01:37.790
And this can be done using Airflow's

01:37.830 --> 01:39.190
TriggerDagRunOperator.

01:39.550 --> 01:46.390
The TriggerDagRunOperator takes a minimum of two arguments: the ID of the DAG to trigger, which is set using

01:46.390 --> 01:52.110
this parameter here, trigger_dag_id, and the task_id, which is inherited from BaseOperator.

01:52.830 --> 01:59.830
So with the TriggerDagRunOperator we have a chain of dependencies between DAGs, which by default triggers

01:59.830 --> 02:01.140
the subsequent DAG

02:01.540 --> 02:05.580
only if all the tasks before it in the previous DAG have completed successfully.

02:05.860 --> 02:13.220
If a task fails, then the TriggerDagRunOperator task won't run because the dependency chain is broken.

02:13.940 --> 02:16.580
Therefore, the next DAG will not be triggered.

02:16.900 --> 02:23.860
This is exactly what we want since, for example, if the produce_json DAG fails, we wouldn't

02:23.860 --> 02:31.140
want the update_db DAG to run, as there would be no JSON file with which to update the data in the data warehouse.

02:31.380 --> 02:37.380
In terms of changes to the DAGs script, I have made the changes already, and we can see the differences

02:37.380 --> 02:38.540
by using git.

02:38.940 --> 02:44.340
The very first change we need to make is to import the TriggerDagRunOperator.

02:44.820 --> 02:47.980
This is done with this import statement over here.

02:48.660 --> 02:55.740
Going down to the first DAG definition, we will keep the schedule for the first DAG at 2 p.m. Instead

02:55.740 --> 02:58.100
of reusing the same dag variable,

02:58.140 --> 03:04.050
we can now distinguish between the three DAGs and write dag_produce for the first DAG,

03:04.290 --> 03:07.930
dag_update and dag_quality,

03:07.970 --> 03:11.530
all of these replacing the common dag variable.

03:12.570 --> 03:14.290
Going back up to the first DAG,

03:14.570 --> 03:20.490
the biggest change happens here, where we introduce the first task using the

03:20.530 --> 03:21.130
TriggerDagRunOperator.

03:21.490 --> 03:28.290
In this first DAG we are adding this additional task, trigger_update_db, to the

03:28.290 --> 03:29.770
task dependency chain.

03:29.810 --> 03:36.170
Its purpose is to trigger the next DAG, which is the update_db DAG.

03:36.330 --> 03:42.090
We define this with the trigger_dag_id parameter here.

03:42.090 --> 03:50.010
And again, to reiterate, we also remove the schedule for the last two DAGs by setting the schedule equal

03:50.050 --> 03:53.770
to None for both DAG number two and number three.

03:55.210 --> 04:03.410
Repeating this step for DAG two, we define a trigger_data_quality task, which is this

04:03.410 --> 04:09.410
one over here, and which is set to run after the update_core task.

04:09.570 --> 04:14.370
And like that we also have a dependency between DAG 2 and DAG 3.

04:14.730 --> 04:21.850
If we go back to the Airflow UI and refresh, we can see that the schedules now

04:21.850 --> 04:24.210
conform to the changes that we just made.

04:24.410 --> 04:30.370
Now, if we manually trigger the first DAG in the sequence, we should see the others being triggered

04:30.370 --> 04:31.290
automatically.

04:31.290 --> 04:32.810
So let's do that right now.

04:33.450 --> 04:35.730
Let's trigger the first DAG in the sequence.

04:37.930 --> 04:40.770
Here we can see that it is currently running.

04:43.250 --> 04:46.770
Let's give it some time to actually run and let's go back here.

04:46.770 --> 04:53.610
So we should see the update_db DAG being triggered once the first DAG finishes.

04:53.890 --> 05:02.080
As you can see, produce_json has finished successfully and update_db was run. The same goes, of course, for

05:02.080 --> 05:02.800
data_quality.

05:02.800 --> 05:08.920
You can see that once this was run, the third DAG, data_quality, was also executed.

05:10.640 --> 05:12.640
So if we go to update_db,

05:12.760 --> 05:15.600
we will see that this was triggered just now.

05:17.760 --> 05:19.800
And the same for data_quality.

05:23.000 --> 05:30.440
Now that we have made this change to the DAGs script, we also can't forget to update the unit test script

05:30.440 --> 05:30.720
here.

05:30.720 --> 05:35.040
For produce_json, we no longer have four tasks; we now have five.

05:35.320 --> 05:38.640
And the same goes for update_db.

05:39.000 --> 05:40.880
We now have three tasks.

05:41.560 --> 05:46.480
The data_quality DAG was untouched in terms of the number of tasks.

05:46.840 --> 05:48.320
So that will stay true.
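
NOTE
The updated task counts could be checked with a helper along these lines. This is only a sketch: the counts of five and three come from the video, while the helper name task_count_mismatches and the stand-in objects are hypothetical, and the course's actual unit test script is not shown here. The stand-ins expose a .tasks list in the same way an Airflow DAG does, so the sketch runs without Airflow installed.
```python
from types import SimpleNamespace
# Task counts stated in the video after the change; data_quality is left out
# because its count is unchanged and not stated.
EXPECTED_TASK_COUNTS = {"produce_json": 5, "update_db": 3}
def task_count_mismatches(dags, expected):
    """Return {dag_id: (expected, actual)} for every DAG whose task count differs."""
    return {dag_id: (count, len(dags[dag_id].tasks)) for dag_id, count in expected.items() if len(dags[dag_id].tasks) != count}
# Stand-in objects with a .tasks list, used so the sketch runs on its own.
fake_dags = {
    "produce_json": SimpleNamespace(tasks=["t"] * 5),
    "update_db": SimpleNamespace(tasks=["t"] * 2),  # deliberately one task short
}
print(task_count_mismatches(fake_dags, EXPECTED_TASK_COUNTS))  # {'update_db': (3, 2)}
```
In a real test, the dags mapping of an Airflow DagBag could probably be passed in place of fake_dags, with an assertion that the returned dictionary is empty.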

05:48.520 --> 05:54.160
Now that this change has been applied, we can commit and push to GitHub so we can finally start the

05:54.160 --> 05:55.320
CI/CD section.

05:55.320 --> 05:56.320
I will see you then.
