WEBVTT

00:00.280 --> 00:04.840
So before we ingest the documentation, we first need to download it.

00:04.960 --> 00:09.040
And in order to do so, we'll use the Tavily Map and Tavily Extract APIs.

00:09.400 --> 00:16.720
Tavily Map is going to discover and map out the LangChain documentation website and the URLs

00:16.720 --> 00:20.160
that we want to scrape and extract information from.

00:20.560 --> 00:27.640
Tavily Extract is going to extract the data of those pages, which is going to hold all the information

00:27.640 --> 00:29.400
about the LangChain documentation.

00:29.880 --> 00:30.240
All right.

00:30.240 --> 00:33.080
Let me go and open this notebook in Google Colab.

00:33.080 --> 00:35.600
So I'm going to use the Google Colab runtime.

00:36.160 --> 00:42.960
And now I want to make sure that the runtime environment is set to Python 3.

00:43.640 --> 00:44.040
Cool.

00:44.520 --> 00:47.840
So we first want to start by installing the dependencies.

00:48.280 --> 00:52.240
So I'm going to run this cell over here, which is going to install langchain-tavily.

00:53.120 --> 00:57.440
And this is the Tavily LangChain integration, which is highly maintained.

00:58.080 --> 01:04.180
The reason why I am mentioning this is that, first of all, it's not obvious that a LangChain integration

01:04.180 --> 01:05.340
is highly maintained.

01:05.580 --> 01:12.580
And second, if you're using a third party with a LangChain integration, and that third-party LangChain integration

01:12.580 --> 01:18.740
is not highly maintained, then what may happen is that the third party API might change and it would

01:18.780 --> 01:20.740
break the LangChain integration.

01:20.980 --> 01:26.820
So this is very, very important when using LangChain third parties, especially when you go to production.

01:27.380 --> 01:34.300
Anyway, I'm going to install langchain-tavily, and I'm going to install certifi, and certifi is simply to make

01:34.340 --> 01:41.300
API calls with a valid certificate which can be verified, which is defensive programming that lets us

01:41.300 --> 01:43.380
send lots of requests to the API.

01:44.500 --> 01:44.820
Right.

01:44.860 --> 01:46.340
We're then going to install Rich.

01:46.340 --> 01:48.980
So this is to display some cool text for logging.

01:49.420 --> 01:51.620
And actually we're not going to use pandas.

01:51.660 --> 01:54.620
It got in here because I wrote a bunch of this code.

01:56.340 --> 01:56.940
All right.

01:56.940 --> 02:00.100
So now I want to run the import cell.

02:00.140 --> 02:02.740
So I want to import here a bunch of stuff.

02:02.860 --> 02:08.510
I want to import asyncio because we're going to make concurrent requests to Tavily Extract.

02:08.710 --> 02:13.710
We want to use the os package in case we want to set up some environment variables.

02:14.070 --> 02:22.230
We want to import ssl because we want to provide a valid SSL context, some typing, and we want to import

02:22.230 --> 02:24.230
certifi for the certificate.

02:24.390 --> 02:32.630
And here, from langchain_tavily, we want to import the TavilyExtract object and the TavilyMap object.

02:33.390 --> 02:37.350
And after that we're importing some stuff from Rich.

02:37.350 --> 02:41.190
So this is simply to display logs nicely.

02:41.630 --> 02:44.710
And then we want to configure the SSL context.

02:44.830 --> 02:50.630
And I don't want to elaborate on this snippet, but at the end of the day it's simply setting up some

02:50.630 --> 02:51.950
environment variables.

02:51.950 --> 02:57.750
So every time we make a request to a third party, which is going to be Tavily, we're

02:57.750 --> 03:03.870
going to set a valid SSL context, and we're not going to get blocked for not having a valid SSL

03:03.910 --> 03:04.830
certificate.

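NOTE
A minimal sketch of the kind of SSL setup described above, not the notebook's exact cell.
The environment variable names SSL_CERT_FILE and REQUESTS_CA_BUNDLE are assumptions about what the cell sets;
the fallback to the system trust store when certifi is missing is also an addition for illustration.
```python
import os
import ssl
try:
    import certifi  # third-party CA bundle installed earlier in the notebook
    ca_file = certifi.where()
except ImportError:
    ca_file = None  # assumption: fall back to the system trust store
# Build a context that verifies server certificates and hostnames.
ssl_context = ssl.create_default_context(cafile=ca_file)
if ca_file is not None:
    # Assumed variable names: conventional settings read by OpenSSL and requests.
    os.environ["SSL_CERT_FILE"] = ca_file
    os.environ["REQUESTS_CA_BUNDLE"] = ca_file
```
The context created this way rejects servers whose certificate cannot be verified, which is the defensive behavior the narration refers to.
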
03:04.830 --> 03:05.230
All right.

03:05.290 --> 03:08.210
So now we want to go and get the API key.

03:09.930 --> 03:14.450
And if you don't have an API key you can go and click this plus button here.

03:14.450 --> 03:19.930
And just name your API key, create it and copy its value.

03:20.330 --> 03:23.530
So let me copy the value of my API key here.

03:24.410 --> 03:29.090
And you should go and take it and put it right over here.

03:29.370 --> 03:32.450
And then run the cell to set up the environment variable.

03:32.890 --> 03:36.530
And by the way, if you're worrying about my API key then don't.

03:36.570 --> 03:41.650
I'm not exposing it because it's going to be revoked as soon as I finish filming this video.

03:42.490 --> 03:42.850
All right.

03:42.850 --> 03:46.850
So you go and run this cell and initialize your environment variable.

03:47.730 --> 03:48.330
So let's go.

03:48.330 --> 03:54.690
Now let's initialize an object of TavilyMap, which is going to interact with the Tavily Map API.

03:55.250 --> 03:58.810
And this API is going to receive a URL as input.

03:58.810 --> 04:07.030
And it's going to traverse the website like a graph and explore all the paths to intelligently

04:07.030 --> 04:13.550
discover and generate a comprehensive sitemap, which is a list of URLs.

04:13.710 --> 04:17.830
So this is the output we're going to receive once we map the website.

04:18.070 --> 04:24.590
And we're going to use every element of that list, which is a URL to scrape and extract information

04:24.590 --> 04:25.150
from.

04:25.150 --> 04:28.550
So this is going to be fed into our RAG pipeline.

04:28.950 --> 04:32.750
And I'm going to cover some of the arguments that this API receives.

04:32.750 --> 04:39.070
So it's going to receive max_depth, which defines how far from the base URL the crawler can explore.

04:39.550 --> 04:47.070
We're also going to give it max_breadth, which is the max number of links to follow per level of the

04:47.070 --> 04:47.550
tree.

04:47.590 --> 04:48.710
So per page.

04:48.710 --> 04:51.910
And we're going to give it the limit of 500.

04:51.910 --> 04:55.590
So this means that we don't want to get more than 500 URLs.

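NOTE
To make the three arguments concrete, here is a toy breadth-first mapper over a hypothetical in-memory site.
It is only an illustration of what depth, breadth and limit mean; it is not Tavily's implementation,
and SITE, map_site and the example paths are made up for this sketch.
```python
from collections import deque
# Hypothetical in-memory "website": page -> links found on that page.
SITE = {
    "/docs": ["/docs/a", "/docs/b", "/docs/c"],
    "/docs/a": ["/docs/a/1", "/docs/a/2"],
    "/docs/b": ["/docs/b/1"],
    "/docs/c": [],
    "/docs/a/1": ["/docs/a/1/x"],
    "/docs/a/2": [],
    "/docs/b/1": [],
    "/docs/a/1/x": [],
}
def map_site(start, max_depth, max_breadth, limit):
    """Breadth-first walk that returns at most `limit` discovered URLs."""
    seen = {start}
    found = [start]
    queue = deque([(start, 0)])
    while queue and len(found) < limit:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links more than max_depth hops from the start
        for link in SITE.get(page, [])[:max_breadth]:  # cap the links followed per page
            if link in seen:
                continue
            seen.add(link)
            found.append(link)
            queue.append((link, depth + 1))
            if len(found) >= limit:
                break
    return found
sitemap = map_site("/docs", max_depth=2, max_breadth=2, limit=500)
```
With max_breadth=2 the third link on the start page is never followed, and with max_depth=2 the page three hops away is never reached.
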
04:58.630 --> 05:00.550
Let me go and run this block.

05:02.390 --> 05:14.410
And let's now continue to invoke this API. The base URL we want to give it is python.langchain.com/docs/introduction.

05:14.410 --> 05:16.610
So this is going to be our starting point.

05:17.130 --> 05:24.210
And then we are going to invoke this API and to call it by using the invoke method.

05:24.410 --> 05:28.410
And we can use invoke because we imported it from LangChain.

05:29.210 --> 05:35.970
So this TavilyMap object is actually a LangChain tool, like we saw earlier in the videos, that can be used

05:35.970 --> 05:37.010
by an agent.

05:37.450 --> 05:43.130
And the way we invoke tools is by using the invoke method or ainvoke.

05:43.650 --> 05:51.010
And here, what's actually happening is that Tavily wrapped their SDK with a LangChain tool.

05:51.050 --> 05:55.130
So this is what we're actually getting by using the langchain-tavily package.

05:55.170 --> 06:00.010
Of course we can use it by itself, but I wanted to show you the LangChain integration.

06:00.530 --> 06:05.450
So here we're going to invoke this map object.

06:05.450 --> 06:08.170
And we're going to get back the sitemap here.

06:08.890 --> 06:09.410
All right.

06:09.460 --> 06:13.300
So after that we're going to iterate over the URLs that we get.

06:13.340 --> 06:15.500
And we're going to print them nicely.

06:15.700 --> 06:18.740
So let me play the cell and let me show you what I get.

06:18.780 --> 06:24.700
And it's important to note that the results may change depending on when you're watching this video.

06:24.700 --> 06:27.860
Because LangChain might, and probably will, add

06:27.980 --> 06:31.900
or change some of their documentation by the time you're seeing this video.

06:32.180 --> 06:37.380
So the important thing to note is that with this you're going to get the up-to-date sitemap of the

06:37.380 --> 06:38.460
LangChain documentation.

06:41.500 --> 06:46.300
And you can see now we're displaying the first 50 URLs.

06:46.340 --> 06:49.300
Let me go and click on one random URL.

06:51.300 --> 06:52.860
And let's see what we get.

06:56.700 --> 07:00.180
And we can see it's a valid documentation page.

07:01.260 --> 07:03.620
All right let's go back to the notebook.

07:03.660 --> 07:07.460
And now we want to select a random page.

07:07.460 --> 07:09.260
And we want to scrape it.

07:09.380 --> 07:16.720
So for that we will initialize the TavilyExtract object, and I'm not initializing it with any arguments.

07:16.720 --> 07:19.200
And this is going to do the scraping for us.

07:19.200 --> 07:24.760
And it's going to give us the page content as Markdown.

07:25.120 --> 07:26.840
Let me run this cell.

07:27.440 --> 07:34.960
And now let's go and extract and scrape the content of a certain page.

07:35.280 --> 07:37.920
And let's walk through this code here.

07:38.120 --> 07:40.440
So I want to select some sample URLs.

07:40.440 --> 07:47.840
And here I selected a list that is currently containing one URL, which is the 21st element of this

07:47.840 --> 07:48.280
list.

07:48.480 --> 07:51.400
And you can put as many elements here as you want.

07:51.920 --> 07:55.760
Then we're going to use TavilyExtract's ainvoke.

07:55.760 --> 08:01.960
We want to use the asynchronous function and we want to go and await it.

08:02.080 --> 08:07.600
And the reason we're doing it is that later, when we use it in our ingestion file, we'll

08:07.640 --> 08:09.800
want to extract everything concurrently.

08:10.200 --> 08:15.620
And we're going to give it the input, which is going to be a dictionary with the key urls.

08:15.900 --> 08:22.140
And the value here is going to be the list of URLs we want to scrape and extract.

08:22.660 --> 08:30.020
And after we await this coroutine here, we're going to get back a dictionary of results.

08:30.020 --> 08:32.300
And we want to get the results

08:32.500 --> 08:33.500
key here.

08:33.620 --> 08:36.460
So this is where Tavily is going to send us the output.

08:36.900 --> 08:37.380
Cool.

08:37.580 --> 08:46.580
So we get back here a list of documents where each document is going to contain a URL, which is the

08:46.580 --> 08:52.140
original URL that we scraped, and the raw content which is the content of that page.

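NOTE
A rough sketch of the await-and-read-results pattern just described, runnable without network access.
FakeExtract is a hypothetical stand-in for the extract tool and returns canned data;
the raw_content key name is an assumption based on the "raw content" field mentioned in the narration.
In a notebook you would await the call directly instead of using asyncio.run.
```python
import asyncio
class FakeExtract:
    """Hypothetical stand-in for the extract tool; returns canned data, no network."""
    async def ainvoke(self, tool_input):
        return {
            "results": [
                {"url": url, "raw_content": f"# Page at {url}"}
                for url in tool_input["urls"]
            ]
        }
async def extract_sample(extractor, urls):
    # The input is a dictionary whose urls key holds the list of pages to scrape.
    response = await extractor.ainvoke({"urls": urls})
    # The scraped documents live under the results key.
    return response["results"]
docs = asyncio.run(extract_sample(FakeExtract(), ["https://example.com/docs/page"]))
for doc in docs:
    print(doc["url"], len(doc["raw_content"]))
```
Swapping FakeExtract for the real tool should leave extract_sample unchanged, as long as the response has the shape assumed here.
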
08:52.420 --> 08:54.500
And then we want to display it nicely.

08:54.780 --> 08:56.620
So this is what's happening in the cell.

08:56.660 --> 08:58.220
So let's go and run it.

09:00.180 --> 09:04.460
And here we can see the scraped content displayed nicely.

09:12.540 --> 09:13.620
Alrighty.

09:13.880 --> 09:21.520
So now I want to show you how to batch process and concurrently extract many pages.

09:21.800 --> 09:29.360
And this is a must when we want to handle scale, because we want to do everything as efficiently as we

09:29.360 --> 09:29.880
can.

09:30.120 --> 09:36.960
And if we have lots of documents to scrape and extract, then we don't want to do it sequentially because

09:36.960 --> 09:38.440
it's going to take forever.

09:38.760 --> 09:47.480
So what we're going to do now is we're going to use the Tavily Extract API, and we're going to do it in batches,

09:47.480 --> 09:49.480
which are going to run concurrently.

09:50.040 --> 09:53.560
So I have here this chunk_urls function.

09:53.680 --> 09:57.440
And this function is going to receive the original list of URLs.

09:57.720 --> 10:01.320
And it's going to receive the chunk size, which defaults to three.

10:01.640 --> 10:07.640
And it's going to return a list containing lists of URLs.

10:07.920 --> 10:12.680
So we're going to split that big list of URLs.

10:12.720 --> 10:17.330
We're going to split it into a bunch of sublists which are going to be your batches.

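NOTE
A minimal sketch of a chunking helper like the one described, assuming the name chunk_urls and a default chunk size of three.
The example.com URLs are placeholders, not the real sitemap.
```python
from typing import List
def chunk_urls(urls: List[str], chunk_size: int = 3) -> List[List[str]]:
    """Split `urls` into consecutive sublists of at most `chunk_size` items."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
urls = [f"https://example.com/page-{n}" for n in range(9)]
batches = chunk_urls(urls)
```
Nine URLs give three batches of three; when the list does not divide evenly, the last batch is simply shorter.
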
10:17.970 --> 10:20.770
Here I have a coroutine, an async function.

10:20.810 --> 10:24.730
extract_batch, which is going to receive a list of URLs.

10:24.730 --> 10:26.450
So this is going to be one batch.

10:26.650 --> 10:28.610
It's going to receive the batch number.

10:28.610 --> 10:29.850
So this is for logging.

10:29.850 --> 10:32.530
So we can see what exactly is being processed here.

10:32.930 --> 10:36.570
So we're going to use TavilyExtract's ainvoke like we did before.

10:36.730 --> 10:39.770
So this time it's going to work on our batch of URLs.

10:39.770 --> 10:46.530
It's going to return the result for this batch. And the way we're going to process all the URLs here, and

10:46.530 --> 10:49.090
here we're selecting nine URLs to process.

10:49.290 --> 10:51.210
We're going to split it into batches.

10:51.210 --> 10:53.650
So we're going to get three batches of three.

10:53.970 --> 11:00.450
And we're then going to create tasks which are coroutines which are going to run concurrently.

11:00.450 --> 11:04.090
And this is what we're going to do with asyncio.gather here.

11:04.090 --> 11:10.290
And we're going to send here the list of tasks where each task is a coroutine which is going to process

11:10.330 --> 11:11.650
a different batch here.

11:12.010 --> 11:20.830
And once this finishes executing we'll get the batch results, and it's going to wait for all the batches

11:20.870 --> 11:21.750
to finish.

11:22.110 --> 11:27.430
And once we do that, we simply go and extract everything into one list.

11:27.430 --> 11:29.670
And that's what we're going to log.

11:29.670 --> 11:31.670
And we want to return eventually.

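NOTE
The batching flow above can be sketched as follows, with a stub in place of the real API call.
extract_batch here only sleeps and returns fake documents; extract_all and the batch field are hypothetical names added for illustration.
```python
import asyncio
import random
async def extract_batch(urls, batch_num):
    """Stub batch extractor: sleeps instead of calling an API, then returns fake docs."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return [{"url": u, "raw_content": f"content of {u}", "batch": batch_num} for u in urls]
async def extract_all(urls, chunk_size=3):
    # Split the full list into batches of chunk_size URLs.
    batches = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
    # One coroutine per batch; gather runs them concurrently.
    tasks = [extract_batch(batch, n) for n, batch in enumerate(batches, start=1)]
    batch_results = await asyncio.gather(*tasks)  # results come back in task order
    # Flatten the per-batch lists into one list of documents.
    return [doc for batch in batch_results for doc in batch]
all_docs = asyncio.run(extract_all([f"https://example.com/page-{n}" for n in range(9)]))
```
asyncio.gather waits for every batch and returns results in the order the tasks were passed in, so the flattened list keeps the original URL order even when batches finish out of order.
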
11:32.070 --> 11:39.390
So this logic over here is the preparation of taking the LangChain documentation and downloading it or

11:39.390 --> 11:40.750
scraping it concurrently.

11:41.230 --> 11:45.590
So let me go and run this code block and show you the output.

11:45.870 --> 11:53.190
And we can see that we started executing all the tasks, which are coroutines, simultaneously.

11:53.190 --> 11:57.830
So you can see here the prints are very organized: batch one, batch two, then batch three.

11:58.150 --> 12:00.430
But the finish order is different.

12:00.470 --> 12:03.430
We first finish batch two, then we finish batch one.

12:03.470 --> 12:05.350
Then we finish batch three, right?

12:05.390 --> 12:08.870
So this is just to show you how everything is running asynchronously.

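NOTE
A small illustration of why the start order and the finish order can differ, using fixed sleep times instead of real requests.
The work coroutine and the delays are made up so that batch 2 finishes first, matching the run shown in the video.
```python
import asyncio
finished = []
async def work(name, delay):
    print(f"start {name}")
    await asyncio.sleep(delay)  # stands in for a network call of varying length
    finished.append(name)
    return name
async def main():
    return await asyncio.gather(
        work("batch 1", 0.2),
        work("batch 2", 0.05),
        work("batch 3", 0.3),
    )
returned = asyncio.run(main())
print("finish order:", finished)
print("returned order:", returned)
```
The coroutines start in the order they were listed, finish according to how long each one waits, and gather still returns their results in the listed order.
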
12:09.030 --> 12:14.590
And if we were to have more batches like we're going to have in the ingestion, which we're going to

12:14.590 --> 12:18.270
do next, we can see it even more explicitly.
