WEBVTT

00:04.320 --> 00:09.840
Continuing from the last lecture, we can now create another script that handles common database actions

00:09.840 --> 00:12.360
like inserts, updates and deletes.

00:12.560 --> 00:17.440
So for this, let's create a new script and call it data underscore modification.

00:18.400 --> 00:21.360
Again we do it under the Data warehouse folder.

00:24.400 --> 00:30.920
I will first include any libraries and variables that we need for the script, which relates mostly

00:30.920 --> 00:36.320
to logging and the global table name for both staging and core schemas.

00:36.680 --> 00:41.040
Now let's put our focus on the insert functionality.

00:41.080 --> 00:50.440
Let's do this by first creating a function called insert rows that will insert the API data, which

00:50.440 --> 00:54.320
we load it from the JSON file into the staging and core tables.

00:54.680 --> 01:01.120
This function will take the cursor and connection objects, the schema to differentiate between staging

01:01.120 --> 01:07.070
and core, And a variable rho, which is a dictionary containing the data to be inserted.

01:11.270 --> 01:14.750
So now we can define a try and accept clause.

01:15.150 --> 01:19.310
And in the try we can specify an if else statement.

01:19.430 --> 01:25.430
In the if block we will define the insert related to the staging schema, while the else block will

01:25.430 --> 01:27.950
contain operations related to the schema.

01:30.910 --> 01:36.110
We also define a video ID variable that will be used for logging purposes.

01:38.190 --> 01:42.550
Now that we have done this, we can specify the SQL that will be executed.

01:42.750 --> 01:46.350
This will be done using the insert into SQL command.

01:54.710 --> 02:03.270
So obviously since we are in the if block for the staging schema, this value will be staging and the

02:03.370 --> 02:06.330
table will be YouTube underscore API.

02:07.170 --> 02:13.130
Now the insert into command takes the names of the columns in the staging schema, which are the following

02:13.170 --> 02:13.730
seven.

02:14.250 --> 02:20.690
Now that we have defined the columns where we will insert the data, we have to actually define the

02:20.730 --> 02:22.730
data itself that will be inserted.

02:22.730 --> 02:26.810
And this we can do using the values statement.

02:31.050 --> 02:37.010
When specifying values, it is a good practice to use named placeholders to remove any doubt of what

02:37.010 --> 02:39.570
data is being inserted into which columns.

02:39.570 --> 02:43.330
So the basic syntax will be something like values.

02:44.090 --> 02:49.170
Then we define the percentage and the placeholder name.

02:50.610 --> 02:53.370
We also define the s at the end.

02:53.810 --> 03:00.970
So as we are all on the same page, what we are inserting here under the names placeholder is the actual

03:00.970 --> 03:04.450
name that we have inside the JSON data.

03:04.680 --> 03:11.320
So if we go on one of these examples, the name will be this video underscore ID.

03:11.360 --> 03:14.000
It will be actually the key name that we have here.

03:14.000 --> 03:18.720
So what we can do is we can go back to the data modification script.

03:18.720 --> 03:22.600
And we can change this to video underscore ID.

03:22.840 --> 03:26.280
We can do the same for the other keys as follows.

03:26.560 --> 03:32.320
So all of these key names are what we see here in the actual JSON data file.

03:32.360 --> 03:37.480
Finally, we specify the row variable as the second argument of the execute command.

03:37.520 --> 03:40.880
As every insert as we will do, will contain a specific row.

03:41.560 --> 03:47.680
We will later develop Python code to loop through each row of videos, but for now, let's continue

03:47.680 --> 03:49.000
this insert function.

03:49.280 --> 03:52.600
So far we have completed the if block for the staging schema.

03:52.800 --> 03:57.720
Now we can put our focus on the block that will be for the core schema.

03:58.280 --> 03:59.360
So now we can write.

03:59.400 --> 04:03.560
Else we can define the video ID for logging purposes.

04:03.880 --> 04:11.430
But notice how in this case, the name is video ID, which is the column name as we defined it in the

04:11.430 --> 04:12.950
table of the staging layer.

04:13.110 --> 04:14.430
So these have to match.

04:15.510 --> 04:18.030
Now we repeat what we did in the if block.

04:18.030 --> 04:21.310
But using the placeholder names as we defined in staging layer.

04:21.510 --> 04:26.550
We do this because we are no longer reading from the JSON like we did for staging for the core layer.

04:26.590 --> 04:30.510
We read data from the previous layer, which is the staging layer.

04:31.070 --> 04:35.990
So the placeholder names must match the names of the columns on the staging table.

04:36.030 --> 04:43.830
Now I have the command already written, so I will just paste it here and you can notice how the command

04:43.870 --> 04:44.870
is very similar.

04:44.990 --> 04:52.710
We just have the values part that is different as this would represent the columns in the core table

04:52.710 --> 04:53.270
schema.

04:53.550 --> 04:55.870
Finally, we can commit our changes.

04:55.910 --> 04:58.550
Since we are changing the contents of the database.

04:58.830 --> 05:00.830
And also we can log our insert.

05:08.890 --> 05:10.130
If there is an exception.

05:10.130 --> 05:11.650
We also look this as another.

05:26.450 --> 05:27.010
Like this.

05:27.010 --> 05:28.770
We have built the insert function.

05:29.050 --> 05:35.130
However, we need to take into account that most of the variables we defined can be updated, especially

05:35.130 --> 05:41.850
variables like the video views, the likes count and the comments count as they are values that will

05:41.850 --> 05:46.090
grow day by day as more people watch and interact with the YouTube videos.

05:46.210 --> 05:52.450
So now we need to define a function which we can name update rows that will have parameters similar

05:52.450 --> 05:53.770
to the insert function.

06:00.890 --> 06:07.450
Again, for the staging layer, we base the row dictionary keys on the JSON file, while the core layer

06:07.450 --> 06:10.400
is based on the column specified in the staging layer.

06:10.440 --> 06:12.440
Knowing this, we can write the following.

06:13.640 --> 06:21.480
So here you can see that for the staging schema, the variable names are based on the JSON keys that

06:21.480 --> 06:23.120
we have JSON file here.

06:23.120 --> 06:28.680
And the variables here are based on the columns of the staging table.

06:28.720 --> 06:33.720
We again need to define an SQL command to execute using the cursor object.

06:33.720 --> 06:39.280
And in this case we use the update SQL command where we define the table to update.

06:39.320 --> 06:40.640
We can do this as follows.

06:47.280 --> 06:54.280
Now we use set by specifying the columns where changes can occur based on the row dictionary value.

06:54.560 --> 07:00.920
For the columns, we will consider the video title, which could potentially change, and the other

07:00.920 --> 07:06.080
three variables which will definitely change, which are the video views, likes count and the comments

07:06.080 --> 07:06.520
count.

07:07.120 --> 07:13.910
And also very important we need to specify a Where clause depending on the video ID and upload date,

07:13.950 --> 07:15.230
which won't change.

07:29.350 --> 07:35.190
So to summarize, what we are saying here is that for a particular video which has a specific video

07:35.190 --> 07:42.910
ID and upload date, the next time we run this ELT process, update the values of the video title,

07:42.950 --> 07:49.750
video views, likes and comments count based on the new values which we get from the latest JSON data

07:49.790 --> 07:50.190
file.

07:50.350 --> 07:55.310
This update function will need to be run for both staging and core layers.

07:55.350 --> 08:01.950
Finally, we commit log the update and for some reason we have an error in the try block for any of

08:01.950 --> 08:02.790
the video updates.

08:02.830 --> 08:05.310
We raise an exception like this.

08:05.310 --> 08:07.910
The update function is ready.

08:07.950 --> 08:13.100
Now there could be instances where the channel we are extracting data from the video.

08:13.140 --> 08:17.700
We need to cater for this scenario as it might lead to incorrect data in the tables.

08:18.060 --> 08:24.420
Let's define a delete rows function that's different from the insert and update functions.

08:24.660 --> 08:30.060
Takes a different arguments as default arguments, which is the ids to delete.

08:30.060 --> 08:36.140
So def delete rows here on schema.

08:36.140 --> 08:45.860
And as we said, the IDs to delete this last variable will contain the video ID or IDs.

08:45.860 --> 08:48.100
If there is more than one video to delete.

08:48.140 --> 08:54.340
In order to be read properly in the SQL that we will define soon, we first need to convert this variable

08:54.340 --> 09:01.380
to a proper SQL format string, and we can do this using the Python string join method as follows.

09:01.540 --> 09:07.700
So first we define the try block IDs to delete will be equal to.

09:17.040 --> 09:19.640
Note we have changed the format of the string.

09:20.000 --> 09:26.680
We can move on to the SQL delete statement that takes the table in question, and using the Where clause,

09:26.760 --> 09:34.000
deletes the row or rows depending if the IDs to delete variable has one or more IDs.

09:34.560 --> 09:37.240
So now we can write execute.

09:46.360 --> 09:52.720
Finally, as we did in the insert and update functions, we commit log deletion.

09:52.720 --> 09:58.760
And if for some reason we have an error in the try block for any of the video deletions, we raise it

09:58.760 --> 09:59.720
as an exception.

10:01.480 --> 10:05.160
And that's all we have to do for the data modification scripts.

10:05.720 --> 10:09.680
In the next lecture we will discuss data transformations.