WEBVTT

00:04.320 --> 00:10.080
Now to have a connection set up, we can create the schemas which will represent the raw and refined

00:10.080 --> 00:13.160
layers and the associated tables for each schema.

00:13.200 --> 00:16.520
First thing we should do is define the naming conventions.

00:16.560 --> 00:21.080
The raw layer will be represented by the staging schema, and the refined layer will be represented

00:21.080 --> 00:22.120
by the core schema.

00:22.240 --> 00:24.520
This was already in the presentation.

00:24.560 --> 00:31.080
Each schema in our case will have one table, which we will name YouTube underscore API for both the

00:31.080 --> 00:32.560
raw and refined layers.

00:32.760 --> 00:36.880
So let's first create the global table name at the top of our script.

00:40.160 --> 00:45.560
Now we can build the function create schema which has the schema variable as a parameter.

00:46.800 --> 00:48.920
Function create schema.

00:49.600 --> 00:50.720
The schema parameter.

00:51.560 --> 00:56.040
First thing we do is we define a character and connection variables.

00:59.520 --> 01:06.640
And here we can use the get underscore underscore character function that we just created, and now

01:06.680 --> 01:09.400
we specify what SQL we want to run.

01:09.600 --> 01:14.000
In our case, we want to create a schema so we can define the following SQL.

01:15.760 --> 01:21.800
We use f strings with the SQL being create schema if not exists.

01:22.480 --> 01:24.040
And then the schema name.

01:25.920 --> 01:31.760
Notice how we use the if not exists, which would only create the schema if it doesn't exist already.

01:32.320 --> 01:37.520
And to actually run the SQL we need to use the cursor object.

01:37.680 --> 01:43.200
So execute schema underscore SQL.

01:43.640 --> 01:47.760
Since we are making changes to the database we need to commit these changes.

01:47.880 --> 01:50.360
And to do this we use the convert method.

01:51.440 --> 01:53.160
So con dot commit.

01:54.520 --> 01:58.840
Finally it's important that we close the connection and cursor objects.

01:59.120 --> 02:04.880
So we can call the close command and query function that we also defined previously.

02:08.080 --> 02:11.590
So like this we have defined the create schema function.

02:12.030 --> 02:15.270
So now we need to create another function for the tables.

02:15.270 --> 02:19.310
As we said already we will have the table names with the same for both schemas.

02:19.710 --> 02:24.870
However, since we will later do transformations between each layer, each table will have different

02:24.870 --> 02:25.590
columns.

02:25.630 --> 02:29.190
Let's define a create table function.

02:29.190 --> 02:30.950
So create table.

02:33.550 --> 02:39.470
Which will have the SQL code to create the tables depending on whether it's the raw layer or refined

02:39.470 --> 02:40.310
layer table.

02:40.390 --> 02:43.390
So let's define schema as an argument here again.

02:45.510 --> 02:51.190
Same as before we use the get quantum cursor function.

02:51.670 --> 02:56.750
And now we can define an if else statement depending on the schema.

02:56.870 --> 02:59.030
And here we define the column names.

02:59.310 --> 03:03.590
So if the schema is equal to staging.

03:06.230 --> 03:10.750
The SQL to create the table will look as follows.

03:15.070 --> 03:18.910
Notice how now we are defining a create table, not create schema.

03:19.110 --> 03:23.110
We are also defining the if not exists SQL code.

03:24.430 --> 03:28.150
And here we can define the schema that table.

03:31.230 --> 03:37.990
The actual column names will be the video ID that will be of type varchar which will have a length of

03:38.110 --> 03:40.270
11, and it will be a primary key.

03:40.350 --> 03:46.270
Then there's also the video title which will be of data type text, the upload date that will be of

03:46.310 --> 03:49.030
data type, timestamp duration will be varchar.

03:49.030 --> 03:53.390
And finally the video views like counts and comment counts.

03:53.790 --> 03:58.670
I already have the SQL for the columns, so I will just paste it here.

03:58.790 --> 04:01.910
Notice how depending on the column we have different data types.

04:01.910 --> 04:04.510
So the video ID is of type varchar.

04:04.750 --> 04:09.310
But column like video views will be of type integer.

04:09.430 --> 04:10.590
Same for a date.

04:10.630 --> 04:13.870
Since it's a date it will be of type timestamp.

04:15.990 --> 04:19.660
So like this we have the if part of the if statement.

04:19.660 --> 04:26.860
And now we need to define the else part, which will basically be for the core schema.

04:27.060 --> 04:30.380
Again, I have the SQL ready, so I will just paste it here.

04:32.140 --> 04:35.900
We notice how the columns are almost the same in terms of differences.

04:35.900 --> 04:43.020
It only boils down to the duration and the video type columns, which will be explained in more detail

04:43.020 --> 04:44.100
in a later lecture.

04:44.460 --> 04:45.860
Export table creation.

04:45.900 --> 04:49.580
We can run the commands same as for the schema creation.

04:49.740 --> 04:55.780
So we can go back here and we can copy these commands.

04:57.820 --> 05:05.060
Obviously we need to change from schema underscore SQL to the actual table SQL variable that we have

05:05.100 --> 05:06.060
in this function.

05:06.900 --> 05:11.300
So like that we have defined also the create table function.

05:12.140 --> 05:18.580
One last function that we should create is to get all the video IDs in either the staging or the correlation

05:18.580 --> 05:19.260
tables.

05:19.460 --> 05:25.100
This will be helpful when we come to loop through the rows of data inside the tables so we can first

05:25.140 --> 05:35.540
define get underscore video underscore IDs function that will take the cursor and the schema as arguments.

05:35.980 --> 05:41.660
We can then perform an execute statement to get all the video IDs in the table in question, by using

05:41.660 --> 05:44.460
the fetch all methods of the cursor object.

05:45.020 --> 05:47.420
So here execute.

05:55.780 --> 05:59.380
And as we said, we will get the IDs using the fetch all method.

06:01.420 --> 06:06.340
In our case this will give a list of dictionaries where the key is always this variable.

06:06.340 --> 06:11.220
Here the video id and the value will be a video id value.

06:11.220 --> 06:15.860
So an example of what IDs would look like is something like this.

06:16.220 --> 06:20.100
We have the video ID key that is constant.

06:20.340 --> 06:23.220
And then the actual value of the id.

06:23.260 --> 06:27.930
Notice how we are defining a dictionary in each case.

06:28.330 --> 06:32.490
Now, in our case, we just want the actual value of each row.

06:32.930 --> 06:39.410
So what we can do is to write a list comprehension and return the new list as follows.

06:40.770 --> 06:44.050
So we define video underscore write these variable.

06:49.930 --> 06:57.810
What we're doing here is we are going inside the video ID key and extracting the value for each row.

06:58.010 --> 07:00.890
And then finally we can return the video IDs.

07:02.450 --> 07:03.450
This we can delete.

07:03.690 --> 07:11.330
And an example of how this will look like when we come to return is this would have a list of what the

07:11.330 --> 07:13.130
values for the video IDs.

07:13.330 --> 07:19.770
So now we have the functions ready to create the table and schemas, and also a list of all video IDs

07:19.770 --> 07:20.530
in the table.

07:20.810 --> 07:27.330
In the next lecture we will look into getting the data from the JSON into the actual database tables.
