WEBVTT

00:00.240 --> 00:00.880
Hey there.

00:00.920 --> 00:01.800
Eden here.

00:01.800 --> 00:03.840
And this video is optional.

00:04.400 --> 00:12.000
I wanted to introduce you to another powerful way to scrape and extract content using the toolkit.

00:12.320 --> 00:19.240
And like I showed you in the previous video, the main recommendation for most cases is to use Tavily Crawl

00:19.240 --> 00:20.720
because it's super simple.

00:20.720 --> 00:27.280
You just give it a URL and it automatically maps the entire site, scrapes everything in the sitemap,

00:27.280 --> 00:32.400
and even lets you filter exactly what you need using natural language instructions.

00:32.400 --> 00:36.080
So, for example, we can point it to a documentation site and tell it,

00:36.080 --> 00:41.640
"just give me everything about agents," and it will do all the filtering and scraping for us.

00:42.160 --> 00:46.160
However, sometimes we want to have even more control.

00:46.400 --> 00:53.000
Maybe we want to customize every step of the process, or we want to go really deep into certain parts

00:53.000 --> 00:53.920
of the website.

00:53.920 --> 01:01.030
So in this video, I'll show you how to use Tavily Map and Tavily Extract for our use case of extracting

01:01.030 --> 01:02.590
the LangChain documentation,

01:02.590 --> 01:07.670
in case we want a bit more flexibility and control over the entire scraping process.

01:08.230 --> 01:15.110
In this video, I'll be covering some batch processing strategies, and I'll be going a bit deeper into

01:15.110 --> 01:18.550
scraping with third parties and handling rate limiting.

01:18.790 --> 01:19.830
So feel free to skip it.

01:19.830 --> 01:21.670
This is a bit of an advanced video.

01:21.670 --> 01:23.950
And this is again optional.

01:24.630 --> 01:32.830
So we're first going to map the sitemap of the documentation and get all of the URLs that the documentation

01:32.830 --> 01:33.270
has.

01:33.550 --> 01:39.590
And then, for each documentation URL, we're going to scrape it with Tavily Extract.

01:39.710 --> 01:44.270
Now we're going to run the extraction and scraping concurrently.

01:44.270 --> 01:46.990
So the solution we built here does scale.

01:47.310 --> 01:52.550
And the code I'll be showing here is going to be very similar to the code I showed you in the previous

01:52.550 --> 01:56.910
videos where we covered Tavily Map and Tavily Extract.

01:56.910 --> 02:01.430
So just a few logistics because it's going to be a pretty long video.

02:01.710 --> 02:07.550
I highly suggest you first watch it and only then try to do it yourself.

02:07.870 --> 02:10.750
And of course, you don't need to write everything from zero.

02:10.790 --> 02:13.910
You can simply copy the snippets that I'll be using.

02:13.910 --> 02:15.910
You can copy it from the repository.

02:16.190 --> 02:19.870
So this is how I recommend you watch this video.

02:21.470 --> 02:21.910
All right.

02:21.910 --> 02:25.510
Let's start by writing some logs before we begin.

02:25.710 --> 02:28.470
So let's log by writing log header.

02:28.670 --> 02:33.070
And we're going to call it with the string documentation ingestion pipeline.

02:33.070 --> 02:35.230
So this is going to be a nice header.

02:35.670 --> 02:38.790
Then we want to log info for Tavily Map:

02:38.790 --> 02:41.550
"Starting to map documentation structure from",

02:41.550 --> 02:43.670
and this is the LangChain URL.

02:43.910 --> 02:51.350
And let me write it in purple. By the way, feel free to stop and write it yourself, or to copy-paste

02:51.350 --> 02:57.990
the code from the reference code that I'm going to provide in the video's resources, with all the code

02:58.030 --> 03:00.100
for these couple of videos.

03:00.500 --> 03:00.980
All right.

03:00.980 --> 03:05.580
So now we want to invoke the map object.

03:05.700 --> 03:13.300
And I remind you, the invoke here is actually using the map tool, which is a wrapper around the API to

03:13.340 --> 03:15.500
turn it into a LangChain tool.

03:15.980 --> 03:16.540
All right.

03:16.580 --> 03:22.220
So the result of this function execution we're going to get in the variable sitemap.

03:22.460 --> 03:28.740
And the actual list of URLs is going to be in this variable which is going to be a dictionary.

03:28.740 --> 03:32.740
And it's going to be in the key of results like I showed you in the previous video.

03:33.100 --> 03:40.540
So when we log the results of how many URLs we found, we're going to access the results key here.
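
NOTE
A minimal sketch of reading the map result, assuming (as described above) that it is a dictionary whose "results" key holds the list of URLs. The site_map value is a hand-written stand-in, not real API output, and the example.com URLs are placeholders.
```python
# Hypothetical stand-in for the dictionary returned by the map call.
site_map = {
    "results": [
        "https://example.com/docs/intro",
        "https://example.com/docs/agents",
        "https://example.com/docs/tools",
    ]
}
# The discovered URLs live under the "results" key.
urls = site_map["results"]
print(f"Found {len(urls)} URLs")
```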

03:41.060 --> 03:41.540
Cool.

03:41.700 --> 03:48.580
So let me put a breakpoint here and let me run this in debug mode so we can see the objects and the

03:48.580 --> 03:49.180
results.

03:50.380 --> 03:52.660
So now we're going to run it.

03:53.580 --> 04:00.010
And we can see now the header log, "documentation ingestion pipeline", and Tavily Map starting.

04:02.410 --> 04:04.450
Let's wait a couple of seconds.

04:04.610 --> 04:07.370
And now we can see the sitemap variable.

04:07.650 --> 04:08.770
Let's open it up.

04:08.890 --> 04:11.130
It has here a results key.

04:11.370 --> 04:15.250
And here we can see a bunch of URLs of the documentation here.

04:16.930 --> 04:21.050
So we can see, we can even go down a bit.

04:21.570 --> 04:25.770
We can see we get here 100 and we can see the rest below.

04:25.810 --> 04:28.410
Let me just go and finish this execution here.

04:28.570 --> 04:33.610
And we can see here the last log, that we managed to find 500 URLs.

04:36.170 --> 04:44.050
So for each URL that we saw, we want to run and execute a request to Tavily Extract to

04:44.090 --> 04:47.250
get the content of the documentation.

04:47.490 --> 04:52.610
The way we're going to do this is using some batch processing techniques.

04:52.850 --> 04:56.570
So the API also supports batch processing.

04:56.800 --> 05:03.200
So the extract API can also receive a list of URLs in every API call.

05:03.320 --> 05:06.720
So we don't need to make one API call per URL.

05:06.760 --> 05:11.880
We can make one API call for several URLs that we want to extract.

05:12.120 --> 05:20.400
So what we want to do here is, after we discover all the URLs in the LangChain documentation, we want

05:20.400 --> 05:21.880
to start batching them.

05:22.080 --> 05:26.640
From that large list of URLs,

05:26.640 --> 05:33.520
we want to create batches of URLs, and each batch is going to be one extract request.

05:33.880 --> 05:37.480
The batches we're going to fire up concurrently.

05:37.480 --> 05:39.920
So they're going to execute simultaneously.

05:40.120 --> 05:45.920
So we get here two levels of parallelism. That was hard to pronounce.

05:46.440 --> 05:51.440
So the API supports parallel processing in the API layer.

05:51.600 --> 05:53.320
So we don't need to worry about that.

05:53.320 --> 05:59.840
We just need to adhere to their documentation and not send too many URLs, and that is going

05:59.840 --> 06:02.800
to be controlled by the batch size that we're going to run.

06:03.080 --> 06:11.000
And the second layer of parallel processing is us firing up those requests asynchronously.

06:11.720 --> 06:18.360
They are the classic example of I/O-bound operations: waiting for API calls to complete.

06:18.560 --> 06:22.120
So this technique is going to get us the documentation in no time.

06:22.280 --> 06:28.360
And trust me, before doing it this way, when I downloaded it manually and didn't do it concurrently,

06:28.360 --> 06:30.960
it took me a bunch of hours to do it.

06:31.200 --> 06:31.720
All right.

06:31.720 --> 06:39.040
Let me now paste in this function chunk URLs, which is going to receive the list of URLs and a chunk

06:39.040 --> 06:39.480
size.

06:39.720 --> 06:46.680
And it's going to return a list where each element of that list is going to be a list of URLs.

06:46.920 --> 06:51.600
So those lists are going to be the batches of URLs that we're going to be using.

06:51.800 --> 06:54.960
We covered it in the previous video in the notebook.

06:54.960 --> 06:57.390
So I'm not going to elaborate on that.

06:57.430 --> 06:59.070
It's pretty basic Python here.
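
NOTE
A minimal sketch of the chunking helper described above. The name chunk_urls and its signature are assumptions based on the narration, and the example.com URLs are placeholders.
```python
from typing import List
def chunk_urls(urls: List[str], chunk_size: int) -> List[List[str]]:
    """Split a flat list of URLs into batches of at most chunk_size items."""
    return [urls[i : i + chunk_size] for i in range(0, len(urls), chunk_size)]
# Placeholder URLs for illustration only.
urls = [f"https://example.com/docs/page-{n}" for n in range(45)]
batches = chunk_urls(urls, chunk_size=20)
# The last batch holds the remainder.
print([len(batch) for batch in batches])  # prints [20, 20, 5]
```
With 500 URLs and a chunk size of 20, the same helper yields 25 batches.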

06:59.350 --> 07:03.150
So now we have the ability to get the batches here.

07:03.710 --> 07:07.030
And let's go now to our main function.

07:07.390 --> 07:13.030
And we want to execute and run the chunk function with the sitemap.

07:13.470 --> 07:16.670
And we want to give it the chunk size of 20.

07:16.990 --> 07:19.550
And we can play around with the chunk size.

07:19.590 --> 07:25.710
Now I remind you, don't make it too big, because if you make it too big then the API won't accept it

07:25.710 --> 07:27.910
because you'll be sending too many URLs.

07:28.710 --> 07:33.070
And after we're done chunking, we can log it that we chunked it up.

07:33.550 --> 07:39.630
So let me run this and let me show you the logs and the intermediate results that we're getting now.

07:40.030 --> 07:42.870
So let me just put a breakpoint over here.

07:43.550 --> 07:45.270
Let me run this in debug.

07:50.070 --> 07:57.900
And right now we're going to first map the documentation and then we're going to batch it.

07:57.940 --> 07:59.300
We're going to get the batches.

07:59.700 --> 08:01.460
So let's wait and see.

08:01.940 --> 08:02.540
Cool.

08:02.540 --> 08:04.220
So let's check out the batches.

08:04.540 --> 08:11.700
And we can see we have here a list where every element in this list is a list itself.

08:12.420 --> 08:19.340
So every list here is going to contain 20 elements, where each element is going to be a URL.

08:20.180 --> 08:24.700
Except maybe the last one, which is going to be the remainder of the URLs.

08:25.820 --> 08:32.020
All right so let me continue this and we get here 25 batches.

08:32.740 --> 08:33.580
Alrighty.

08:33.620 --> 08:41.060
So now I want to create a function which is called extract batch which is going to receive a batch which

08:41.060 --> 08:42.380
is a list of URLs.

08:42.700 --> 08:45.580
And it's simply going to make one API call for it.

08:46.060 --> 08:50.300
And it's going to add some padding of logging and exception handling.

08:50.700 --> 08:52.180
So let's go and do that.

08:54.140 --> 08:54.700
All right.

08:54.700 --> 08:58.260
So we have this coroutine named Extract batch.

08:58.500 --> 09:02.100
It receives a list of URLs and the batch number.

09:02.100 --> 09:04.940
I forgot to tell you that it also receives the batch number.

09:04.940 --> 09:09.660
And we need this batch number for observability which we'll see in the logs.

09:09.940 --> 09:11.900
We first start by logging that.

09:11.940 --> 09:17.380
We're starting to process the batch number and the number of URLs that it has.

09:17.940 --> 09:23.580
We're then going to await Tavily Extract with the ainvoke method.

09:23.740 --> 09:31.580
So this is going to run the Tavily Extract functionality concurrently to call the extract API.

09:32.060 --> 09:36.100
And the input to this tool is going to be a dictionary.

09:36.100 --> 09:40.580
And this dictionary is going to be in the format that Tavily expects.

09:40.660 --> 09:44.340
So they expect to get in their API, the URLs field.

09:44.340 --> 09:49.420
And that value is going to be the list of URLs, which is going to be our batch here.

09:49.740 --> 09:51.620
So we're then going to run it.

09:51.940 --> 09:55.610
We're going to await it because this is a non-blocking operation.

09:55.610 --> 09:56.770
This is I/O-bound.

09:57.090 --> 10:03.090
And after that, if there are no exceptions, we're going to log that we succeeded, and we're going

10:03.090 --> 10:08.770
to log the number of URLs that we managed to extract and to return those URLs.
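
NOTE
A rough sketch of the batch coroutine described above, under stated assumptions. FakeExtractTool is a hypothetical offline stand-in for the real extract tool, the response shape (a "results" list of url/raw_content items) is assumed, and print stands in for the logging helpers. Re-raising after logging is one possible choice, not necessarily what the video's code does.
```python
import asyncio
from typing import Any, Dict, List
class FakeExtractTool:
    """Hypothetical stand-in for the extract tool so the sketch runs offline."""
    async def ainvoke(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        await asyncio.sleep(0.01)  # simulate waiting on the network
        return {"results": [{"url": u, "raw_content": f"content of {u}"} for u in payload["urls"]]}
extract_tool = FakeExtractTool()
async def extract_batch(urls: List[str], batch_num: int) -> List[Dict[str, Any]]:
    """Make one extract call for a whole batch, with logging and error handling."""
    print(f"Batch {batch_num}: extracting {len(urls)} URLs")
    try:
        # The input is a dictionary whose "urls" field holds the batch.
        response = await extract_tool.ainvoke({"urls": urls})
    except Exception as exc:
        print(f"Batch {batch_num}: failed with {exc}")
        raise
    results = response.get("results", [])
    print(f"Batch {batch_num}: extracted {len(results)} URLs")
    return results
pages = asyncio.run(extract_batch(["https://example.com/a", "https://example.com/b"], 1))
```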

10:08.930 --> 10:09.410
Cool.

10:09.450 --> 10:17.250
So now that we have this helper function that simply wraps around the extract tool, we want to run

10:17.250 --> 10:18.770
those batches concurrently.

10:19.210 --> 10:24.890
So I'm going to implement a new function which I'm going to call async extract.

10:25.290 --> 10:30.610
And this coroutine is going to concurrently extract all the URLs.

10:30.650 --> 10:39.610
And it's going to do this by creating coroutines for each batch and executing all those things concurrently.

10:39.970 --> 10:43.890
So this is what this coroutine itself is going to be doing.

10:44.210 --> 10:47.050
So you can see we started by first logging.

10:47.450 --> 10:50.410
And now let's create the coroutines.

10:50.690 --> 10:54.760
So they'll be stored in a variable which is called tasks.

10:54.960 --> 10:57.640
And tasks is going to hold coroutines.

10:57.960 --> 11:02.000
So we want to enumerate over the URL batches.

11:02.200 --> 11:07.040
So for each URL batch, we want to assign it a number,

11:07.040 --> 11:08.360
so we can keep track of it.

11:08.800 --> 11:14.760
And we want to create a coroutine that is going to use Tavily Extract.

11:14.760 --> 11:21.360
So you can see over here in the syntax, extract batch, which is going to receive the batch itself along

11:21.360 --> 11:23.960
with the batch number here.

11:24.280 --> 11:31.840
Now if you're not familiar with async execution, then this right over here is not really executing

11:31.960 --> 11:35.760
the coroutines yet, because we did not await those expressions here.

11:36.080 --> 11:41.560
So once we have all of them ready, we'll have here a list of coroutines.

11:41.880 --> 11:44.160
Then we are going to gather.

11:44.160 --> 11:48.360
So this is going to await every coroutine in this list over here.

11:48.360 --> 11:51.920
And all those coroutines are going to execute asynchronously.

11:52.200 --> 11:55.640
And we're going to wait over here until all of them finish.

11:55.800 --> 11:59.800
And once all of them finish, we'll have the entire documentation downloaded.

11:59.800 --> 12:03.800
And it's going to be stored in the results variable over here.
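
NOTE
A small sketch of the point above, assuming "gather" refers to asyncio.gather: calling a coroutine function only creates a coroutine object, and nothing runs until it is awaited. The extract_batch here is a hypothetical stand-in that just records when it starts.
```python
import asyncio
from typing import List
started: List[int] = []  # records which batches have actually begun running
async def extract_batch(batch: List[str], batch_num: int) -> int:
    """Hypothetical stand-in for the real extraction coroutine."""
    started.append(batch_num)
    await asyncio.sleep(0.01)  # simulate waiting on an API call
    return len(batch)
async def main() -> List[int]:
    url_batches = [["a", "b"], ["c", "d"], ["e"]]
    # Calling the coroutine function only builds coroutine objects; nothing runs yet.
    tasks = [extract_batch(batch, i + 1) for i, batch in enumerate(url_batches)]
    print(f"Started before gather: {started}")
    # gather schedules every coroutine and waits until all of them have finished.
    return await asyncio.gather(*tasks)
results = asyncio.run(main())
print(f"Started after gather: {sorted(started)}")
print(f"URLs per batch: {results}")
```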

12:04.520 --> 12:06.320
And I'll create two variables:

12:06.360 --> 12:12.560
all pages, which will collect all successfully extracted documents, and failed batches,

12:12.560 --> 12:16.320
a counter for tracking how many batches have failed.

12:16.680 --> 12:25.520
And we're going to iterate over each element of the results list which is going to be either a dictionary

12:25.520 --> 12:28.720
containing the extracted content in data format.

12:28.720 --> 12:35.560
So it's going to be a dictionary holding the URL key and the raw content, or an exception, an error

12:35.560 --> 12:38.640
object that indicates that the batch has failed.

12:38.720 --> 12:44.280
So if it's going to be an exception, we want to log an error that the batch has failed and to increase

12:44.280 --> 12:45.040
the counter.

12:45.560 --> 12:53.950
And if it's not an instance of exception, then it means that we have a valid result, and for each

12:53.950 --> 13:01.150
element in that batch result, then we're going to have the URL, which is the source of where it came

13:01.150 --> 13:01.510
from.

13:01.510 --> 13:03.390
And we're going to have the raw content.

13:03.590 --> 13:07.430
And from that we want to create a LangChain document.

13:07.630 --> 13:10.910
So the LangChain document is going to have the page content.

13:10.910 --> 13:13.350
And in its metadata field,

13:13.350 --> 13:15.390
we want to give it the source key,

13:15.430 --> 13:17.510
holding the original URL.

13:17.750 --> 13:23.230
So that's how we're going to keep track of which content came from which URL here.
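
NOTE
A hedged sketch of the result handling described above. The Document dataclass is a minimal stand-in for the LangChain Document class so the example runs without dependencies, the results list is hand-written sample data, and the nested "results" list with url and raw_content keys is an assumed shape.
```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
@dataclass
class Document:
    """Minimal stand-in for the LangChain Document class."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
# Hand-written example of gathered results: one successful batch and one failure.
results: List[Any] = [
    {"results": [
        {"url": "https://example.com/a", "raw_content": "Page A"},
        {"url": "https://example.com/b", "raw_content": "Page B"},
    ]},
    RuntimeError("simulated failure"),
]
all_pages: List[Document] = []
failed_batches = 0
for batch_num, batch_result in enumerate(results, start=1):
    if isinstance(batch_result, Exception):
        # A failed batch shows up as an exception object instead of data.
        print(f"Batch {batch_num} failed: {batch_result}")
        failed_batches += 1
        continue
    for page in batch_result["results"]:
        # Keep the original URL in metadata so each page can be traced back.
        all_pages.append(Document(page_content=page["raw_content"], metadata={"source": page["url"]}))
print(f"Extracted {len(all_pages)} pages, {failed_batches} failed batches")
```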

13:23.750 --> 13:25.750
So this is pretty much it.

13:26.110 --> 13:31.510
Now I know this is a bit confusing right now because you don't see the objects yourselves.

13:31.550 --> 13:33.430
We're going to soon debug this and see.

13:33.670 --> 13:39.790
But right over here on the right side I created an example of what the output might look like.

13:39.790 --> 13:45.910
So feel free to pause for a moment and check out the structure of the results here.

13:46.070 --> 13:48.110
So nothing fancy here is happening.

13:48.110 --> 13:51.340
Simply some data manipulation and data extraction here.

13:52.100 --> 13:52.380
Cool.

13:52.420 --> 13:58.820
So after we're done with the extraction, we want to log the success of the function and log if there

13:58.820 --> 14:00.340
are any failures in our batches.

14:00.540 --> 14:07.020
And we want to return the all pages variable, which is going to collect a list of LangChain documents

14:07.020 --> 14:09.900
containing the LangChain documentation content.

14:10.420 --> 14:11.140
Alrighty.

14:11.180 --> 14:16.020
So now let's go in our main function and let's add this extraction here.

14:16.220 --> 14:19.220
So I'm going to await async extract.

14:19.220 --> 14:21.660
I'm going to give it the URL batches.

14:21.700 --> 14:27.780
And I'm going to save everything in the all docs variable here which is supposed to be the flattened

14:27.780 --> 14:28.220
list.

14:28.740 --> 14:32.660
I've already put here a breakpoint and run this in debug.

14:32.980 --> 14:35.060
So right now you're seeing the execution.

14:35.060 --> 14:38.620
And it stopped right over here in line 129.

14:38.900 --> 14:44.100
And you can see in the logs now that we have 25 batches that we're going to execute.

14:44.420 --> 14:50.010
So let me go and continue this and let me show you what we see in the log here.

14:50.010 --> 14:52.450
So it's going to run now the async extract.

14:54.050 --> 14:55.610
Let me go and play this.

14:56.490 --> 14:56.930
All right.

14:56.930 --> 15:01.130
So now we can see we fired up the execution of all the batches.

15:01.290 --> 15:04.690
And we can see that we're starting to stream the results.

15:04.690 --> 15:07.850
And notice that there isn't any order for the results.

15:07.850 --> 15:09.690
It's first come, first served.

15:09.690 --> 15:12.210
So whatever finishes we get that result.

15:12.250 --> 15:13.770
We can see it in the log here.
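
NOTE
A small sketch of the ordering behavior, assuming the batches are awaited with asyncio.gather. The log lines appear in completion order, while gather itself returns the results in the order the coroutines were passed in. The extract_batch stand-in and its delays are made up for illustration.
```python
import asyncio
from typing import List
finished: List[int] = []  # batch numbers in the order they complete
async def extract_batch(batch_num: int, delay: float) -> int:
    """Hypothetical stand-in: sleeps to simulate an API call of varying length."""
    await asyncio.sleep(delay)
    finished.append(batch_num)
    print(f"Batch {batch_num} finished")
    return batch_num
async def main() -> List[int]:
    delays = [0.15, 0.05, 0.10]  # batch 1 is the slowest, batch 2 the fastest
    return await asyncio.gather(*(extract_batch(i + 1, d) for i, d in enumerate(delays)))
results = asyncio.run(main())
print(f"Completion order: {finished}")
print(f"Order returned by gather: {results}")
```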

15:15.130 --> 15:20.050
And let me just paste in the snippets from the next video.

15:20.090 --> 15:27.050
You can ignore it for now, and I simply want to paste it in so we can put a breakpoint and examine

15:27.050 --> 15:28.050
the documents.

15:28.050 --> 15:36.490
So ignore this right now. Let me just go and rerun everything, and let me change the breakpoint to put

15:36.490 --> 15:37.090
it here.

15:37.490 --> 15:40.090
And now let's rerun everything.

15:40.290 --> 15:42.290
So I want to rerun everything in debug.

15:42.290 --> 15:46.090
And I want to show you the intermediate values here.

15:50.370 --> 15:53.250
So right now we are executing the batches.

15:53.850 --> 15:55.770
We're starting to get the results.

16:05.410 --> 16:06.770
And we finished.

16:06.770 --> 16:14.130
And if we check out the all docs here, we can see that we're getting the LangChain documents.

16:14.530 --> 16:23.050
So here we have the LangChain documents with the source, which is the original URL, and the data itself.

16:25.010 --> 16:27.090
So here we see a page not found.

16:27.090 --> 16:31.730
So this means that we might have gotten a wrong URL.

16:31.730 --> 16:34.010
Or maybe this URL does not exist anymore.

16:34.530 --> 16:36.810
And let me check out another document.

16:36.810 --> 16:42.650
And this is an example of the LangChain Expression Language documentation here.

16:43.010 --> 16:44.930
So looks good.

16:44.970 --> 16:47.570
And right now we're ready to continue.