WEBVTT

00:04.360 --> 00:09.440
Now that we have covered the theory of Airflow, we can go over the Docker Compose file.

00:09.480 --> 00:15.640
Docker Compose uses a YAML file that allows you to define and run multiple Docker containers with a single

00:15.640 --> 00:16.160
command.

00:16.480 --> 00:21.000
Airflow provides a Docker Compose file that you can customize for your needs.

00:21.040 --> 00:29.480
Simply go to Running Airflow in Docker in the How-to Guides and scroll down to Fetching docker-compose.yaml.

00:30.160 --> 00:34.760
You can click here, and this will be the Docker Compose file we will be using.

00:34.920 --> 00:40.280
I will leave a link to this page in the appendix of this section. Going back to VS Code.

00:41.480 --> 00:46.120
In our case, I have already created a Docker Compose file, which is the one you see here.

00:46.800 --> 00:52.760
I copied the contents of the base Airflow compose file and made some changes, which we will now go over.

00:53.080 --> 00:58.040
As you can appreciate, this docker compose file is quite large, and although we will go through the

00:58.040 --> 01:03.550
more important parts of it, I would still recommend that you go through it on your own time as this

01:03.550 --> 01:04.830
will further your learning.

01:04.910 --> 01:10.190
So to recap, in this lecture, I will only mention the most important aspects of the Docker Compose file

01:10.550 --> 01:14.110
or the parts that we have changed from the default file provided.

01:14.150 --> 01:19.950
So going back up, we can start off by going through the airflow-common part.

01:20.110 --> 01:25.790
This airflow-common part relates to common variables that will be available to all Airflow containers.

01:26.190 --> 01:30.830
These Airflow containers are the webserver, the scheduler, and the triggerer.

01:31.190 --> 01:37.590
By default, this does not include the non-Airflow containers, like the ones for Redis or Postgres.

01:37.790 --> 01:42.870
The first variable that we have here is the image, which is expecting a Docker image.

01:43.750 --> 01:47.590
We have already built and pushed a Docker image in one of the previous lectures.

01:47.870 --> 01:49.510
So here we will reference it.

01:49.550 --> 01:54.870
One thing that you notice immediately is that we are not referencing the image directly, but are

01:54.910 --> 01:57.830
using the dollar sign curly bracket syntax.

01:57.990 --> 02:02.470
Using this syntax, we can reference the environment variables from the .env file.

02:02.590 --> 02:05.340
The advantages of this we have already covered.

02:05.340 --> 02:11.300
But to reiterate, this makes the code more flexible and reusable, as you can use the same variables

02:11.300 --> 02:18.900
in different environments like testing, dev, or prod, and by using the .env, you don't expose sensitive variables

02:18.900 --> 02:19.820
in your code.

02:19.860 --> 02:21.500
Now we need to pause for a second.

02:21.540 --> 02:29.540
Here we are referencing this image tag variable that is in the Docker Compose file, but is not in the .env

02:29.700 --> 02:31.060
environment variable file.

02:31.340 --> 02:33.100
So let's correct this right now.

02:33.140 --> 02:41.740
Let's set the image tag equal to the image tag version of 1.0.0, which we had defined in an earlier

02:41.740 --> 02:42.300
lecture.

02:42.660 --> 02:47.420
Let's save this and go back to the Docker Compose file. In order to use the .env,

02:47.580 --> 02:52.820
we need to reference the .env file under the env_file parameter.

02:52.860 --> 02:59.020
This tells Docker Compose to use the .env, which is found in the same directory as the Docker Compose file.

02:59.780 --> 03:04.740
If for some reason you don't have them in the same directory, you would need to change the relative

03:04.810 --> 03:06.210
path of the .env.
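
NOTE
Editor's sketch of the pattern described above, not the course's exact file.
The variable name IMAGE_TAG and the image repository are assumptions made for
illustration; the .env file would then carry a line such as IMAGE_TAG=1.0.0.
```yaml
x-airflow-common:
  &airflow-common
  # The tag is resolved from the .env value; the repository name is a placeholder
  image: my-dockerhub-user/my-airflow-image:${IMAGE_TAG}
  # Path is relative to the compose file; change it if the .env lives elsewhere
  env_file:
    - .env
```
Keeping the tag in the .env means the same compose file can point at a
different image version per environment without being edited.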

03:06.330 --> 03:07.770
Moving on to the airflow-common

03:07.770 --> 03:10.450
environment variables. For the executor,

03:10.490 --> 03:16.410
we have already mentioned this, but we are confirming here that we will be using the Celery executor.

03:16.930 --> 03:22.010
Now we have a number of connection URIs which we have already gone over in the lecture.

03:22.330 --> 03:24.290
And here is where we apply the theory.

03:24.690 --> 03:28.770
This URI for example is for the metadata database.

03:29.290 --> 03:35.090
And this URI is for the result backend, which stores the results of tasks executed by the workers.

03:35.210 --> 03:41.970
In both instances we are referencing database connection parameters. We have a number of these.

03:41.970 --> 03:48.290
You can see them here, for example the metadata database username, password, and so on.

03:48.290 --> 03:52.650
And the same thing for the backend username, backend password, and so on.

03:52.810 --> 03:57.610
Here we also reference the Fernet key from the .env, and going a bit down,

03:57.650 --> 04:02.290
we also have some true or false settings which you can go over in your own time.

04:02.730 --> 04:06.890
Scrolling further down, we get to the airflow connections and variable settings.

04:07.610 --> 04:12.290
These variables don't come by default in the docker compose, so you need to add them as shown here.

04:12.650 --> 04:18.490
The first one is the Airflow connection to the database that will store the YouTube API data.

04:18.490 --> 04:25.450
Connections can be written as URIs using the syntax AIRFLOW_CONN_, which is the first part here, and

04:25.450 --> 04:27.970
then the connection ID. As for the URI itself,

04:28.170 --> 04:35.090
since we will use Postgres, postgresql will be the connection type, and the rest of the URI will reference

04:35.090 --> 04:37.810
the connection parameters for the database.

04:38.730 --> 04:43.970
The other variables that we will need to introduce will be used in the DAG scripts, and are defined

04:43.970 --> 04:48.050
using the AIRFLOW_VAR_ prefix followed by the variable name.

04:48.490 --> 04:49.530
So here we have an example.

04:49.530 --> 04:55.850
For the API key, we have the AIRFLOW_VAR_ prefix, and then we are referencing the variable name, which in this

04:55.850 --> 04:57.130
case is the API key.

04:57.970 --> 04:59.490
Same goes for the channel handle.
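
NOTE
Editor's sketch of the two naming conventions described above. The connection
ID, the variable names and every ${...} placeholder are hypothetical; only the
AIRFLOW_CONN_ and AIRFLOW_VAR_ prefixes come from Airflow itself.
```yaml
x-airflow-common:
  environment:
    # AIRFLOW_CONN_<CONNECTION_ID>: a connection written as a URI
    AIRFLOW_CONN_POSTGRES_DB_YT_ELT: postgresql://${ELT_DATABASE_USERNAME}:${ELT_DATABASE_PASSWORD}@${POSTGRES_CONN_HOST}:${POSTGRES_CONN_PORT}/${ELT_DATABASE_NAME}
    # AIRFLOW_VAR_<NAME>: an Airflow Variable that DAG scripts can read
    AIRFLOW_VAR_API_KEY: ${API_KEY}
    AIRFLOW_VAR_CHANNEL_HANDLE: ${CHANNEL_HANDLE}
```
With names like these, a DAG script would look up the variables as api_key and
channel_handle, and the connection by the ID postgres_db_yt_elt.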

05:00.250 --> 05:02.890
Now, these final five variables that you see here,

05:02.930 --> 05:08.160
we don't need to define them at this point in time, but they will be needed when we start working

05:08.160 --> 05:10.600
on the integration and data quality tests.

05:11.080 --> 05:15.200
So I will simply define them here for when they will be needed at a later stage.

05:16.040 --> 05:17.040
Moving on.

05:17.080 --> 05:19.880
We now go to the volumes parameter.

05:20.520 --> 05:25.680
Before we explain what we are doing here, it would be better that we are all aligned on what volumes

05:25.680 --> 05:27.120
are and why we use them.

05:27.120 --> 05:32.000
So volumes are a powerful tool in Docker and they are primarily used for data persistence.

05:32.080 --> 05:36.160
When a Docker container is stopped or removed, the data stored in a volume is preserved.

05:36.440 --> 05:39.120
We will see this in action when we go to the Postgres service.

05:39.960 --> 05:43.680
Volumes, however, are also used for mounting, which is what we see here.

05:44.160 --> 05:48.680
We went over these directories in the theory lecture, along with the files they will contain.

05:49.040 --> 05:53.680
Here we are specifying that we want the contents of our local directories, which are on the left-hand

05:53.720 --> 05:59.240
side, to be mapped into the Docker containers, at the paths on the right-hand side.

05:59.520 --> 06:03.360
So let's take for example the content that is under the data folder.

06:03.760 --> 06:05.680
We are referencing this on line 97.

06:05.680 --> 06:12.310
Here we are saying that on the left-hand side is our local data directory, which contains the ELT data

06:12.510 --> 06:13.670
from the YouTube API.

06:14.710 --> 06:21.630
We want it to go inside the Docker containers at the path that you see here on the right-hand side.

06:21.830 --> 06:26.950
This is what mounting is in the context of Docker, and you can imagine how powerful this feature is.

06:27.230 --> 06:31.110
That covers most of the changes we made in the airflow-common part of the compose file.
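
NOTE
Editor's sketch of the mounting syntax described above. The container-side
paths are assumptions based on the /opt/airflow layout of the official Airflow
image, and the data entry is a guess at how the course file maps that folder.
```yaml
x-airflow-common:
  volumes:
    # <local path on the left>:<path inside the container on the right>
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./data:/opt/airflow/data
```
A file saved under the local data folder then shows up inside every container
that uses these common settings, without rebuilding the image.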

06:31.310 --> 06:36.590
So now we can move on to the actual services which translate to the containers that will be spun up.

06:36.630 --> 06:42.070
If we go further down, you will see the services that we will introduce.

06:42.350 --> 06:44.350
And the first one is Postgres.

06:44.590 --> 06:46.550
Here we are specifying a number of things.

06:47.270 --> 06:50.510
We are saying the container should be named postgres.

06:50.670 --> 06:56.430
If not, Docker Compose will assign a default name composed of the directory you are in, which in

06:56.430 --> 07:01.990
my case is YouTube ELT, and the service name, which is also postgres.

07:02.630 --> 07:10.820
We are using Postgres version 13, and we specify the .env file under the env_file parameter.

07:11.180 --> 07:17.060
We do this because the Postgres container is not a part of the airflow-common containers, so we

07:17.060 --> 07:18.620
need to explicitly define it here.

07:19.340 --> 07:25.460
For environment variables, we also specify the root user and password for the Postgres connection.

07:25.460 --> 07:28.100
Now for ports: here we have two ports.

07:28.420 --> 07:30.300
Both are 5432.

07:30.580 --> 07:36.220
The 5432 on the left-hand side is the port on your laptop, while the 5432 on the right

07:36.220 --> 07:38.820
hand side is the port inside the Docker container.

07:38.980 --> 07:43.180
So here we are mapping the laptop port to the Docker port.

07:43.300 --> 07:46.620
Both will be 5432 for Postgres, which is the default.
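
NOTE
Editor's sketch of the Postgres service settings covered so far. The values
follow the narration; the exact layout of the course file is an assumption.
```yaml
services:
  postgres:
    image: postgres:13
    container_name: postgres
    env_file:
      - .env
    ports:
      # "<port on your laptop>:<port inside the container>"
      - "5432:5432"
```
If port 5432 is already taken on your machine, only the left-hand number needs
to change; the container side stays at the Postgres default.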

07:46.660 --> 07:47.940
Now for volumes.

07:47.940 --> 07:53.140
We have already described what they are and we already saw how they work for mounting local directories

07:53.140 --> 07:55.140
to directories inside the Docker container.

07:55.180 --> 07:58.820
What we haven't seen is how volumes are used for data persistence.

07:58.980 --> 08:04.100
And in this example we will get to see both the persistence and also mounting.

08:04.500 --> 08:10.820
The first volume we have defined here ensures data persistence through a named volume, meaning that

08:10.820 --> 08:14.610
a named volume, which is this one

08:14.610 --> 08:20.770
here, will be created on our laptops and will contain all the data related to the Postgres databases

08:20.770 --> 08:26.690
we have, be it the metadata, data related to the celery executions or the data from the YouTube API.

08:27.050 --> 08:31.530
And this data will be stored in the var directory in the Postgres container.

08:32.690 --> 08:38.370
By defining this named volume, we therefore ensure that the Postgres data will be saved even if the

08:38.370 --> 08:40.010
Docker containers are dropped.

08:40.290 --> 08:45.490
The second volume, which you see over here is referencing an initialization shell script.

08:45.730 --> 08:50.570
This will be used to create the three users and their three associated Postgres databases.

08:50.770 --> 08:59.090
So what this volume is actually doing is mounting this shell script over here from our laptop inside

08:59.290 --> 09:03.330
the Docker container, at this path on the right-hand side.
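
NOTE
Editor's sketch of the two volume entries described above. The local script
path and file name are hypothetical. /docker-entrypoint-initdb.d is the
directory the official postgres image scans for initialization scripts, and it
only runs them when the data directory is still empty.
```yaml
services:
  postgres:
    volumes:
      # Named volume: the data survives removal of the container
      - postgres-db-volume:/var/lib/postgresql/data
      # Bind mount: makes the local init script visible inside the container
      - ./docker/postgres/init-multiple-databases.sh:/docker-entrypoint-initdb.d/init-multiple-databases.sh
# Named volumes are declared once at the top level of the file
volumes:
  postgres-db-volume:
```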

09:03.330 --> 09:09.130
So let me pull up the shell script, which is found here. This script I have already created,

09:09.130 --> 09:13.960
and we won't spend too much time on it, but we can go over it briefly so we are aligned on what it

09:13.960 --> 09:14.320
does.

09:14.440 --> 09:18.160
So the first line of the script is known as the shebang.

09:18.200 --> 09:23.240
And what we are effectively saying is that we will use the bash shell to run the script.

09:23.360 --> 09:30.640
The set -e flag exits on any error, and set -u treats unset variables as errors.

09:30.840 --> 09:36.760
Next we are defining a function that creates a user and a database and takes three arguments: the database name,

09:37.160 --> 09:39.240
the username, and the password.

09:39.280 --> 09:45.960
Then it uses psql to create a user and database, granting all privileges to the user.

09:46.040 --> 09:51.000
As a side note, psql is the command line tool used to interact with Postgres databases.

09:51.240 --> 09:57.320
Here we also see that the function is called three times, once for each individual database that we have.

09:57.360 --> 10:00.200
And ultimately we are sending a success message.

10:00.320 --> 10:06.960
You can see these messages in the echo commands, both inside the function and outside the function.

10:07.000 --> 10:12.040
A final note on this initialization script is that we could have created the three databases using a

10:12.040 --> 10:13.120
different approach.

10:13.160 --> 10:19.070
We could have initialized three separate containers with each database having a dedicated container.

10:19.350 --> 10:22.190
So if we had to go back to the Docker Compose file,

10:22.470 --> 10:30.350
we would have created three Postgres containers, and each container would have one instance of the database

10:30.350 --> 10:31.190
respectively.

10:31.430 --> 10:37.070
This would result in very good isolation as all the databases are separated in different containers.

10:37.470 --> 10:43.190
However, it is resource intensive, and for small loads like in our case, there is no need for it.

10:43.230 --> 10:47.950
Before we move on to the health check, I also wanted to mention that the named volume that you see

10:47.990 --> 10:54.030
here is also referenced at the end of the Docker Compose file, which you see here.

10:54.350 --> 11:01.430
Now scrolling back up, the final part of the service is the health check and the restart policy. For

11:01.430 --> 11:01.990
the health check,

11:01.990 --> 11:06.270
we are monitoring the health of the container to ensure it's running correctly.

11:06.590 --> 11:09.670
In this case, we are testing the connection to the metadata database.

11:10.670 --> 11:11.990
This is what you're seeing here.

11:12.350 --> 11:15.940
And for the restart policy, which is this part over here,

11:16.460 --> 11:21.260
we are saying that if the container stops, it will always be restarted automatically. For the other

11:21.260 --> 11:22.940
services in the Docker Compose file,

11:22.980 --> 11:25.060
we can keep most of the default settings.
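
NOTE
Editor's sketch of a health check and restart policy for the Postgres service.
pg_isready is the standard Postgres utility for checking whether the server
accepts connections; the user name and the timing values are assumptions.
```yaml
services:
  postgres:
    healthcheck:
      # Passes once the server accepts connections for this user
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    # Bring the container back up automatically whenever it stops
    restart: always
```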

11:25.500 --> 11:30.900
Just be aware that we have Redis, which is the message broker used by the Celery executor.

11:31.100 --> 11:34.660
Its function is to forward messages from the scheduler to the worker.

11:35.340 --> 11:42.340
We also have the web server, which is the UI that we will be able to access using this URL on port

11:42.380 --> 11:43.180
8080.

11:44.940 --> 11:50.900
Going down we also have the scheduler, which as we know is the orchestrator that triggers tasks to

11:50.940 --> 11:53.180
run based on our schedule.

11:54.660 --> 11:57.060
And the worker that executes these tasks.

11:57.340 --> 12:00.060
As a side note, here we are defining one worker.

12:00.420 --> 12:05.140
But you could have multiple workers depending on how many tasks you need to run concurrently.

12:05.260 --> 12:10.220
In terms of Docker compose, you can increase the number of workers by defining another container.

12:10.300 --> 12:15.860
So for example, if you wanted to have two workers, you could have one worker service named airflow

12:15.900 --> 12:20.220
worker 1 with these properties, and the other would be airflow

12:20.260 --> 12:23.380
worker 2 with the same or slightly different properties.
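
NOTE
Editor's sketch of the two-worker idea described above. The service names are
hypothetical, and the fragment assumes the airflow-common anchor is defined
earlier in the same file, as in the compose file Airflow provides.
```yaml
services:
  airflow-worker-1:
    <<: *airflow-common
    command: celery worker
  airflow-worker-2:
    <<: *airflow-common
    command: celery worker
```
Both workers listen on the same Redis broker, so tasks queued by the scheduler
are shared between them.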

12:24.540 --> 12:25.900
Moving on to the triggerer.

12:26.060 --> 12:31.100
As we have said in the previous lecture, we don't have deferrable tasks, so we can switch it off

12:31.100 --> 12:32.260
for our use case.

12:32.300 --> 12:34.180
And as you can see, it is commented out.

12:35.300 --> 12:42.060
Now the airflow init container only runs at the start, and it is an initialization container which

12:42.100 --> 12:47.980
ensures the environment is properly initialized and set up for the airflow containers.

12:48.540 --> 12:51.180
If we go a bit down, we will see that in the environment

12:51.220 --> 12:54.060
here, we are specifying two environment variables.

12:54.660 --> 12:59.860
These are the username and the password that we will use to access the airflow UI.

13:00.380 --> 13:05.620
The Airflow CLI container is designed to provide a command line interface for interacting with Airflow,

13:05.660 --> 13:09.820
which will be useful for running airflow commands directly from within the Docker environment.

13:11.060 --> 13:18.810
Now the flower service references Flower, which is a monitoring tool for Celery that gives insights

13:18.810 --> 13:21.770
into task execution and resource management.

13:22.490 --> 13:27.210
Feel free to explore this monitoring tool on your own time, but it will not be covered in this course,

13:27.490 --> 13:29.770
so we will switch it off in the compose file.

13:29.930 --> 13:34.210
So like that, we have covered the docker compose that we will be using at a high level.

13:34.250 --> 13:39.410
A piece of advice I can give you relating to named volumes is to be sure that before you set up the

13:39.410 --> 13:45.290
containers, the variables and the credentials defined in the .env are set and final.

13:45.490 --> 13:48.650
Let's go through an example of why this is important.

13:48.810 --> 13:53.890
Let's say you spin up Docker containers with a set of Postgres credentials using the docker compose

13:53.930 --> 13:54.690
up command.

13:56.690 --> 14:03.410
This will create, in our case, the named volume in the Docker Compose file, which is this one, postgres-db-volume.

14:03.610 --> 14:06.490
As a side note, we will be doing this exercise in the next section.

14:06.730 --> 14:09.490
For now, I just want you to understand the scenario.

14:09.890 --> 14:14.330
After we have spun up the containers we then want to stop Docker containers.

14:14.330 --> 14:16.010
So we do docker compose down.

14:16.210 --> 14:23.360
After doing that, for some reason we change the password for the metadata database, and then we try

14:23.360 --> 14:25.200
to spin up Docker containers again.

14:25.600 --> 14:31.520
This will result in an error since the Postgres volume has the previous credentials stored.

14:32.200 --> 14:36.560
The new credentials will conflict and the containers will not be spun up.

14:36.720 --> 14:41.560
If you find yourself in this scenario for this course, I would recommend that you revert back to the

14:41.560 --> 14:48.560
previous password or, if you have changed the credentials and face this issue, delete the volume and spin up

14:48.560 --> 14:50.160
Docker containers again.

14:50.480 --> 14:56.320
Deleting the volume implies you will lose all the data stored in all the databases. To delete volumes,

14:56.320 --> 14:59.840
you can go to the Docker Desktop application,

14:59.880 --> 15:04.280
and in the Volumes section, find the volume in question and click delete.

15:04.280 --> 15:05.680
So let's do that right now.

15:05.680 --> 15:08.040
So this is the Docker desktop.

15:08.240 --> 15:15.440
And to look at your volumes, you can go to the Volumes section. Here I have another volume.

15:15.600 --> 15:17.120
This is from another project.

15:17.360 --> 15:18.320
Just ignore it.

15:18.960 --> 15:21.630
But in our case, once we have actually spun up the containers,

15:21.670 --> 15:25.030
we will get the volume that you see here.

15:25.550 --> 15:29.710
And it is a simple task of clicking here and pressing delete.

15:31.150 --> 15:33.870
Now you can also do this deletion through the CLI.

15:34.670 --> 15:37.750
First you run the command to list Docker volumes.

15:39.110 --> 15:41.390
The command is docker volume list.

15:43.070 --> 15:46.150
You will see the volume that I just showed you.

15:46.150 --> 15:53.510
And then to actually delete the volume you would do docker volume rm and the volume name.

15:53.830 --> 15:55.630
So in our case it would be.

15:57.590 --> 16:02.910
And as I mentioned, you will only do this in the scenario where you're facing issues spinning up the

16:03.110 --> 16:07.670
containers because you made a change to the credentials in the .env.

16:09.230 --> 16:11.630
So that's all that you need to know for docker compose.

16:11.830 --> 16:17.430
I will see you in the next lecture, where we go over the spinning up of Docker containers and exploration

16:17.430 --> 16:18.790
of the containers themselves.
