WEBVTT

00:04.200 --> 00:09.560
So far we have covered database connections, data inserts, updates and deletes.

00:09.600 --> 00:15.280
Now we can discuss data transformations as we know by now, to populate the core layer table, there

00:15.280 --> 00:19.440
needs to be some transformations applied to the staging layer table.

00:19.440 --> 00:25.800
In terms of differences between the raw and refined layers, it boils down to the video duration column.

00:25.840 --> 00:32.440
The YouTube API gives the duration of a video in the ISO 8601 format, as per the documentation, which

00:32.440 --> 00:34.840
I will link in the appendix of this section.

00:35.600 --> 00:40.200
From the documentation, you can see that in its most generic format.

00:40.520 --> 00:43.160
And again here we are talking about the duration.

00:43.800 --> 00:50.800
The duration column itself is represented in this format where P indicates the start of the duration

00:50.800 --> 00:51.800
specification.

00:52.040 --> 00:53.800
D stands for days.

00:54.120 --> 01:03.830
T implies that there is an hour value represented by H, M Mm represents minutes and S represents seconds.

01:04.430 --> 01:10.590
So a transformation function needs to be defined to translate this format into a format which we understand

01:10.590 --> 01:12.550
such as hours minutes and seconds.

01:12.830 --> 01:19.470
Now another column in the table is the categorization of a video, which means if it's a normal YouTube

01:19.470 --> 01:21.670
video or a YouTube shorts video.

01:21.910 --> 01:27.750
Unfortunately, the YouTube API is of recording of this lecture does not have a parameter that specifies

01:27.750 --> 01:28.710
the video type.

01:28.750 --> 01:34.750
And to make matters a bit more complicated, I've seen discussions where they will increase the YouTube

01:34.750 --> 01:42.190
shorts maximum time from 1 minute to 3 minutes to simplify this and avoid any future changes impacting

01:42.190 --> 01:42.990
our logic.

01:43.030 --> 01:49.750
We will create another column called Video Type and any video that is one minute or shorter.

01:49.910 --> 01:51.870
We will give the value of shorts.

01:51.990 --> 01:56.270
If the video is over one minute then we classify as normal.

01:56.310 --> 02:02.660
So let's first start by converting the ISO 8601 format into a more recognizable format.

02:02.700 --> 02:07.100
Let's go back to VSCode and create a data transformation script.

02:11.740 --> 02:18.700
And we can define the first function as parse duration, which will take only one argument.

02:18.740 --> 02:24.300
This is the duration string that is in the Iso8601 string format.

02:24.860 --> 02:31.140
So what we will do in this function is break apart this string by splitting it in such a way that we

02:31.140 --> 02:32.780
extract the values that we want.

02:32.780 --> 02:38.820
So let's first remove the strings which are not needed, which are the p and the T strings.

02:38.820 --> 02:42.620
We will do this by using the replace method for Python strings.

02:42.860 --> 02:46.460
So we can write duration underscore string.

02:47.860 --> 02:53.580
But replace can replace p with nothing.

02:54.220 --> 03:00.280
We can concatenate to also replace the tea with nothing.

03:00.760 --> 03:05.800
So here we have an example of method chaining where we have two replace methods.

03:05.840 --> 03:10.240
Return one after the other, separated only by this full stop.

03:10.480 --> 03:16.960
Side note I want to add to this is that the string will always be there and needs to be removed, but

03:17.000 --> 03:19.360
the string may or may not be there.

03:19.400 --> 03:24.800
However, the replace method will still work even if the T is not present.

03:24.880 --> 03:29.680
So let's take for example the value that I have here.

03:29.720 --> 03:33.280
So far we have removed the p and the t.

03:33.520 --> 03:35.760
So what remains is this value.

03:35.960 --> 03:41.880
Now we need to build some generic logic that will take into consideration the components and the respective

03:41.880 --> 03:42.560
values.

03:42.560 --> 03:53.000
So in this example what we have left is one here for one day, two for two hours, 30 for 30 minutes

03:53.000 --> 03:55.400
and 45 for 45 seconds.

03:55.920 --> 04:03.470
So now let's define the components to look for using a list and initialize their values to zero using

04:03.510 --> 04:04.750
a values dictionary.

04:04.790 --> 04:06.790
We can build this as follows.

04:16.230 --> 04:23.350
A note on the components list is that it's important that the order of the components is kept like that,

04:23.350 --> 04:28.990
since we will be splitting the duration string in each loop using the previous results at each split.

04:29.030 --> 04:35.190
After having these two variables defined, we can now write a for loop, so we will loop through each

04:35.190 --> 04:42.550
component in the components list, and if the component is in the duration string, we split and extract

04:42.550 --> 04:45.030
the integer value of the component in question.

04:45.070 --> 04:46.550
The function will look like this.

04:46.630 --> 04:58.860
So for component in components, if the component is present in the duration string, In the value duration

04:58.860 --> 05:04.060
string, and we split it at that particular component.

05:05.260 --> 05:12.660
The values between which we just defined will also be overwritten to the specific value that we have

05:12.660 --> 05:13.940
in the duration string.

05:13.980 --> 05:18.460
And like that, we have the values dictionary populated with the correct values.

05:18.700 --> 05:24.300
Now before we can continue, we need to import the Timedelta class from the DateTime module.

05:24.460 --> 05:26.540
So let's do that at the top of the script.

05:28.220 --> 05:35.340
And now we just need to define a Timedelta object based on the values of dictionary values as such.

05:35.700 --> 05:39.540
So duration equals timedelta.

05:42.380 --> 05:44.700
Base equals the value.

05:44.940 --> 05:49.620
Here we are taking each respective value from the values dictionary.

05:49.980 --> 05:51.260
So for this is ready.

05:51.380 --> 05:54.780
And now we have to do also for hours minutes and seconds.

06:03.010 --> 06:10.050
The time delta function we used here will give a time delta object, which in Python represents a duration

06:10.050 --> 06:12.610
or the difference between two dates or times.

06:12.610 --> 06:15.490
And finally we simply return time Delta object.

06:17.490 --> 06:22.930
Now this function which we have just created will use it in the main transformation function that we

06:22.930 --> 06:26.130
can define as transform underscore data.

06:27.930 --> 06:33.730
In this function we'll use the pass duration function we just built to take the current duration column

06:33.770 --> 06:40.730
on the stage layer table, which is the ISO 8601 format, to a more readable time format in the core

06:40.730 --> 06:46.530
layer, and also take care of the second transformation, which will define a new column that we will

06:46.530 --> 06:52.930
call Mediatype that will be used as a categorization column and will have values of either normal or

06:53.130 --> 06:53.850
as shorts.

06:53.850 --> 07:00.750
So let's continue building this transform data function by adding row as an argument again.

07:00.790 --> 07:06.950
Here, row represents one row in the staging table, which we will loop through in the main script that

07:06.950 --> 07:08.790
will be covered in a coming lecture.

07:08.830 --> 07:17.150
We first define a variable duration time delta that will have the time delta duration using the pass

07:17.150 --> 07:18.910
duration function we just built.

07:23.630 --> 07:29.550
We can then update the duration field in the row directly as a time object by writing.

07:34.030 --> 07:37.590
To use the DateTime here, we need to import it as well.

07:37.630 --> 07:39.310
Same as we did for Delta.

07:45.950 --> 07:53.430
So to understand what is happening here, let's first understand what DateTime dot mean is DateTime

07:53.430 --> 07:59.460
Daytime min is the earliest possible daytime, which if I iterate it down, would be this value.

08:00.940 --> 08:07.460
Next, we are adding the time delta variable, which we got from the previous line, and we use the

08:07.500 --> 08:12.100
time method to extract the time component of the daytime.

08:12.140 --> 08:19.620
We can now define the video type column based on the duration converted from time delta to seconds,

08:19.620 --> 08:27.300
using the total seconds method, where we apply the logic of less than or equal to one minute that categorizes

08:27.340 --> 08:30.500
as shorts, else it will be categorized as normal.

08:30.780 --> 08:32.020
So this will look like.

08:40.220 --> 08:42.140
And finally we just return the row.

08:44.340 --> 08:47.460
So that's it for the transformation script.

08:47.460 --> 08:49.460
So I will see you in the next lecture.
