WEBVTT

00:04.200 --> 00:09.120
Now that we have all the functions to extract the data using the API, we need to think about how we

00:09.120 --> 00:11.280
will load data into the data warehouse.

00:11.920 --> 00:15.680
In truth, there are various approaches depending on your data architecture design.

00:16.160 --> 00:22.720
You could gather all the raw data stored as CSV or JSON or in any other format, and upload these to

00:22.760 --> 00:24.600
a data lake bucket in the cloud.

00:25.200 --> 00:30.520
You could also decide not to save your data in a raw format, and push the data directly to your data

00:30.520 --> 00:35.760
warehouse, which is a common pattern in streaming, where you need immediate access to the data.

00:35.760 --> 00:39.640
In our case, considering the volume of data, which is a few hundred rows,

00:40.120 --> 00:43.520
and to keep things simple, we will use the first approach.

00:43.720 --> 00:48.720
But instead of storing in the cloud or some data lake, which is what you would do in real life, we

00:48.720 --> 00:51.880
will simply store to a path on our local machine.

00:51.920 --> 00:57.840
To do so, we just need to create a function that takes the data we just extracted and saves it as a

00:57.840 --> 01:01.000
JSON at a specific file path on our machine.

01:01.320 --> 01:02.880
So let's define a function.

01:02.880 --> 01:04.840
Let's call it save_to_json.

01:04.840 --> 01:11.370
So save_to_json takes as input the output data from the previous function,

01:11.690 --> 01:13.850
so extracted_data.

01:13.890 --> 01:17.570
Now we need to have a file path where we will save the JSON files.

01:18.210 --> 01:20.970
This can be in our current directory under a folder.

01:21.170 --> 01:22.770
And let's call it data.

01:23.170 --> 01:26.250
So let's create a new folder and name it data.

01:26.410 --> 01:32.010
Since we will run the ELT pipeline once a day, we can also specify in the JSON file name the day the

01:32.010 --> 01:33.090
file was generated.

01:33.570 --> 01:38.570
For this, we first need to import the datetime module at the top of the script.

01:38.810 --> 01:44.330
So this would look as follows: from datetime import date.

01:44.370 --> 01:45.570
Now going back down.

01:48.090 --> 01:49.930
We define the file path.

01:50.690 --> 01:54.810
So file_path equals, and we start an f-string.

01:55.690 --> 02:00.730
We're taking the current directory using the data folder we just created.

02:01.450 --> 02:05.250
And the file name will have the structure of YouTube data.

02:07.330 --> 02:13.790
Taking into account the day that the script is run and it will be of type JSON.

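NOTE
A minimal sketch of the dated file path described above. The helper name
build_file_path and the exact file name pattern (youtube_data_<date>.json)
are assumptions for illustration; the lecture only says the name starts with
"YouTube data", includes the run date, and ends in .json.
```python
from datetime import date
def build_file_path(run_date):
    # Formatting a date inside an f-string yields its ISO form, e.g. 2024-01-05.
    # The "youtube_data_" prefix is an assumed spelling of the file name.
    return f"./data/youtube_data_{run_date}.json"
# One file per daily run: the path changes with the current date.
file_path = build_file_path(date.today())
print(file_path)
```
Taking the date as an argument is only a convenience for checking the result;
writing the f-string inline with date.today() produces the same path.
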
02:14.110 --> 02:17.710
We can now use what is known as a context manager to write the file.

02:17.750 --> 02:23.750
Context managers are an efficient way of opening, writing and closing files, as they take care of resource

02:23.750 --> 02:24.430
management.

02:25.150 --> 02:29.950
The most common way to use a context manager, which is what we'll be doing, is the with statement.

02:30.150 --> 02:36.310
So we define the with open which needs the path where we will store the file and the action we want

02:36.310 --> 02:36.670
to do.

02:36.790 --> 02:38.750
In our case we want to do a write.

02:38.750 --> 02:40.870
So we would set the mode to W.

02:41.150 --> 02:44.190
If we wanted to read the file we would set the mode to R.

02:44.310 --> 02:48.750
So let's write what we just said in the with statement.

02:49.550 --> 02:55.230
With open we set the file path as the first input.

02:55.710 --> 02:57.030
We said we want to write.

02:57.390 --> 03:08.430
So we use W, and we will also specify the encoding, which we set with encoding equals "utf-8",

03:08.630 --> 03:11.910
which simply ensures that the file can handle special characters.

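NOTE
A small sketch of the with open pattern just described, showing write mode
and read mode. The temporary folder and the file name example.json are
assumptions, used so the sketch does not touch real project files.
```python
import os
import tempfile
# Assumed scratch location so nothing in the project is overwritten.
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "example.json")
# Mode "w" opens the file for writing, creating or truncating it.
with open(file_path, "w", encoding="utf-8") as json_file:
    json_file.write('{"title": "café"}')
# The with block closes the file automatically when it ends.
file_was_closed = json_file.closed
# Mode "r" opens the same file for reading.
with open(file_path, "r", encoding="utf-8") as json_file:
    contents = json_file.read()
print(file_was_closed, contents)
```
Passing encoding="utf-8" on both the write and the read keeps special
characters intact regardless of the platform default encoding.
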
03:15.480 --> 03:21.920
Then, since we need to save data as JSON, we use json.dump, which takes as arguments the Python data

03:22.400 --> 03:25.240
represented by the variable extracted_data.

03:26.480 --> 03:28.960
We start writing this down, so json.dump.

03:29.200 --> 03:31.160
We said we will take the extracted data.

03:32.040 --> 03:36.800
Also, we will take the file object to which the JSON data will be written.

03:37.440 --> 03:42.440
This is represented by what we just defined, json_file.

03:42.480 --> 03:48.880
We can also define the indent, which is usually set to four spaces as it makes the output pretty

03:48.880 --> 03:50.920
printed and therefore easier to read.

03:51.040 --> 03:53.080
Let's set the indent to four.

03:53.640 --> 04:02.240
And lastly, we also include the argument ensure_ascii equals False, which allows the JSON to include non-ASCII

04:02.280 --> 04:03.040
characters.

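NOTE
A short sketch of what indent=4 and ensure_ascii=False change in json.dump.
The sample record is hypothetical, and io.StringIO stands in for the real
file object so the two outputs can be compared in memory.
```python
import io
import json
# Hypothetical sample; the real extracted_data comes from the API calls.
extracted_data = [{"video_id": "abc123", "title": "Café tour"}]
# With ensure_ascii=False, non-ASCII characters are written as-is.
readable = io.StringIO()
json.dump(extracted_data, readable, indent=4, ensure_ascii=False)
# With the default ensure_ascii=True, they are escaped as \uXXXX sequences.
escaped = io.StringIO()
json.dump(extracted_data, escaped, indent=4)
print(readable.getvalue())
print(escaped.getvalue())
```
Both outputs load back to the same Python data; the difference is only in
how readable the stored file is.
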
04:03.080 --> 04:06.400
What we have now in this video_stats

04:06.440 --> 04:10.000
Python script is the whole extract part of our pipeline.

04:10.760 --> 04:16.440
If we were to run it, we would end up with the JSON file for today's date under the data directory.

04:16.600 --> 04:20.900
Before we get the file, let's make one final change at the bottom of the script,

04:20.900 --> 04:24.020
in the __name__ == "__main__" part.

04:24.420 --> 04:28.980
So the change is to simply call the function we just created.

04:31.140 --> 04:35.900
And we do that here, passing in the video_data variable.

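NOTE
Putting the steps together, the function and the call at the bottom of the
script might look roughly like this. It is a sketch, not the lecture's exact
code: the output_dir parameter, the os.makedirs call, the returned path, the
file name pattern and the sample video_data are all assumptions added so the
example runs on its own.
```python
import json
import os
import tempfile
from datetime import date
def save_to_json(extracted_data, output_dir="./data"):
    # output_dir is a hypothetical parameter; the lecture hard-codes the
    # data folder in the f-string and creates that folder by hand.
    os.makedirs(output_dir, exist_ok=True)
    # Assumed file name pattern: a fixed prefix plus the run date.
    file_path = f"{output_dir}/youtube_data_{date.today()}.json"
    with open(file_path, "w", encoding="utf-8") as json_file:
        json.dump(extracted_data, json_file, indent=4, ensure_ascii=False)
    return file_path
if __name__ == "__main__":
    # Stand-in for the video_data built by the extraction functions.
    video_data = [{"video_id": "abc123", "title": "Example video"}]
    # A temporary folder is used here; drop the second argument to write
    # to ./data as in the lecture.
    print(save_to_json(video_data, tempfile.mkdtemp()))
```
Because the date is part of the file name, a second run on the same day
overwrites that day's file rather than creating a new one.
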
04:36.020 --> 04:37.220
So let's press run.

04:37.220 --> 04:42.940
And we should ultimately end up with the final JSON containing the video variable values for this day

04:42.980 --> 04:44.140
at the specific time.

04:44.980 --> 04:49.540
This might take a while since we're looping through all the videos, but we should ultimately see the

04:49.540 --> 04:50.460
JSON file.

04:50.500 --> 04:52.540
And there it is, in this data directory over here.

04:53.420 --> 04:56.140
And as you can see, Git picked up a change.

04:56.900 --> 05:03.660
And if we open the JSON file for today's date, we will see the contents of all the videos that

05:03.660 --> 05:06.260
are present on MrBeast's YouTube channel.

05:06.260 --> 05:07.580
So that's it for this section.

05:07.580 --> 05:12.580
We finally have the main script that will extract the data from YouTube using the YouTube API.

05:13.140 --> 05:18.660
And in the next lectures we will go over how Docker will play an important role in our architecture.

05:18.980 --> 05:19.980
I will see you then.
