WEBVTT

00:04.920 --> 00:09.720
Now that you have everything set up and ready to go, I will explain the project outline in terms of

00:09.720 --> 00:10.560
milestones.

00:12.200 --> 00:14.240
The first milestone is getting the data.

00:15.240 --> 00:20.080
In this course, we will be getting the data from YouTube, and in order to extract this data, we will

00:20.080 --> 00:21.640
be using the YouTube API.

00:22.520 --> 00:26.760
The data we will be using is from a popular channel, that of MrBeast.

00:27.240 --> 00:29.240
I think most of you have heard of him.

00:29.400 --> 00:30.920
If not, no issue at all.

00:31.360 --> 00:33.680
Who he is is irrelevant to us as data engineers.

00:33.840 --> 00:40.080
We are just interested in the data which we will clean and test to make sure it is of good quality and

00:40.080 --> 00:45.600
load to a storage system from where our colleagues downstream of us, like data analysts or data scientists,

00:45.840 --> 00:47.600
can use it for their analysis.

00:47.640 --> 00:52.760
In our project, we will be using Python to build the scripts to interact with the YouTube API.

00:53.360 --> 00:59.640
The second milestone relates to the loading and transformation part. Transformation means that we will

00:59.640 --> 01:06.860
apply changes to the data to either make it more readable, or extract just a portion of the information

01:06.860 --> 01:08.620
from the raw extracted data.

01:09.620 --> 01:15.700
Loading means saving the data to a storage system, which is often called a data warehouse.

01:16.660 --> 01:22.780
There are other variations of storage system names, like data lake or, more recently, data lakehouse.

01:23.100 --> 01:28.540
Don't be concerned about these different names, as they all serve a broadly similar purpose.

01:28.580 --> 01:32.420
Just know that in our case, we will be creating a data warehouse.

01:32.460 --> 01:39.420
The first and second milestones combined are what is known as ETL or, more accurately in our case,

01:39.460 --> 01:39.980
ELT.

01:40.420 --> 01:45.780
The choice between ELT and ETL depends on the order of loading and transformation.

01:46.540 --> 01:52.580
In our case, we will first load the data to the data warehouse and from there apply transformations.

01:52.780 --> 01:56.020
So, in summary, we will be performing ELT.

01:56.420 --> 02:01.100
Traditionally, the ETL (extract, transform and load) process was used.

02:01.100 --> 02:06.700
So the data, once extracted, would be transformed and then loaded into the data warehouse.

02:06.740 --> 02:11.060
However, in recent years there has been a shift from ETL to ELT.

02:11.300 --> 02:16.970
I won't go into too much detail about this, but just know that this change is driven by advancements

02:16.970 --> 02:17.730
in technology.

02:17.770 --> 02:24.250
Modern data storage solutions can handle large volumes of raw data efficiently, making ELT more scalable

02:24.570 --> 02:26.170
and flexible than ETL.

02:26.290 --> 02:31.890
In our case, we will be using Postgres as our data warehouse, and we will use Python to apply the

02:31.890 --> 02:32.970
transformations.

02:32.970 --> 02:38.450
Once we have the transformed data loaded into our warehouse, we can move on to our third milestone, which

02:38.450 --> 02:41.490
is to perform what are called data quality checks.

02:41.530 --> 02:47.930
Data quality tests are essential to ensure that the data given to the downstream users, like data analysts

02:47.930 --> 02:50.690
and data scientists, is accurate and reliable.

02:50.730 --> 02:56.970
The most common data quality issues I can think of are missing data, duplicates and incorrect formats.

02:57.170 --> 03:02.930
As data engineers, it is imperative that you do these checks so the downstream users are confident

03:02.930 --> 03:09.010
that they are receiving reliable and accurate data, which will be used for analysis and decision making

03:09.010 --> 03:09.930
in the business.

03:09.930 --> 03:13.090
In our case, we will be using a tool called Soda.

03:13.370 --> 03:17.170
The fourth milestone is then functional and end-to-end testing.

03:17.630 --> 03:23.110
Testing is a software development best practice and a skill set that will set you apart from other

03:23.110 --> 03:23.950
data engineers.

03:23.990 --> 03:29.270
For now, just know that we will go over unit tests, integration tests and end-to-end tests.

03:29.270 --> 03:34.910
The details of these tests will be explained in the relevant section. For testing, we will be using

03:34.910 --> 03:36.110
a combination of tools.

03:36.150 --> 03:42.190
We will use two frameworks called pytest and unittest for unit and integration testing, and Airflow's

03:42.230 --> 03:44.870
testing features for the end-to-end tests.

03:44.910 --> 03:51.190
The fifth and final milestone is the implementation of continuous integration and continuous deployment

03:51.190 --> 03:52.790
or, in short, CI/CD.

03:52.950 --> 03:58.310
For CI/CD, we will touch on many topics, like deployment using Docker and automated testing.

03:58.350 --> 04:04.670
For now, just know that CI/CD is important because it helps keep your data pipelines running smoothly

04:04.670 --> 04:06.750
when you apply changes to your codebase.

04:06.790 --> 04:09.830
We will discuss this in more detail when we get to that section.

04:09.830 --> 04:13.430
And in terms of tools for CI/CD, we will use GitHub Actions.

04:13.430 --> 04:17.630
So once we conclude this fifth and final milestone, we will be ready.

04:17.950 --> 04:20.870
Okay, that's all for now and I'll see you in the next lecture.