WEBVTT

00:00.440 --> 00:07.680
Let's now use Tavily Crawl to crawl the LangChain documentation and pull out all of its pages.

00:08.200 --> 00:16.160
So web crawling refers to the automated process of browsing a website by following hyperlinks, clicking

00:16.160 --> 00:21.920
them, going from one page to the next, and uncovering more and more related content.

00:22.400 --> 00:29.080
And for agents and autonomous agents, crawling is a key capability, especially when we're trying to

00:29.080 --> 00:34.640
access deeper layers of the web that aren't easily reachable through standard search.

00:34.880 --> 00:40.040
Alrighty, let's go to the code and let me show you how easy it is to fetch the documentation.

00:41.040 --> 00:43.880
And this is our boilerplate code right now.

00:44.040 --> 00:48.440
If __name__ equals "__main__", then we want to use asyncio.run.

00:48.440 --> 00:52.440
And then we want to run the main coroutine we have here.

00:52.760 --> 00:55.720
And let's start with adding some logs here.

00:56.000 --> 01:00.320
So I want to log that the documentation ingestion has started.

01:00.720 --> 01:03.440
And then that we're using Tavily Crawl

01:03.480 --> 01:04.880
to start crawling

01:04.920 --> 01:14.360
the documentation at python.langchain.com. And I remind you, log_header and log_info are in the logger.py file.
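
NOTE
A minimal sketch of the entry point and logging described above. The bodies of
log_header and log_info are hypothetical stand-ins (the course keeps its own
versions in logger.py), and the log messages are illustrative only.
```python
import asyncio
import logging
#
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("ingestion")
#
def log_header(message: str) -> None:
    # Hypothetical stand-in for the helper that lives in logger.py.
    logger.info("=== %s ===", message)
#
def log_info(message: str) -> None:
    # Hypothetical stand-in for the helper that lives in logger.py.
    logger.info(message)
#
async def main() -> str:
    # Announce the pipeline before any crawling happens.
    log_header("DOCUMENTATION INGESTION PIPELINE")
    log_info("Using Tavily Crawl to start crawling the documentation")
    return "started"
#
if __name__ == "__main__":
    # Run the main coroutine on a fresh event loop.
    asyncio.run(main())
```
Running the file should print the two log lines, which confirms the setup before the crawl call is added.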

01:15.280 --> 01:15.760
Cool.

01:15.800 --> 01:20.160
Let's go and run it just to see the logs are printing and we're all set.

01:23.600 --> 01:25.080
So the code is executing.

01:25.080 --> 01:29.760
We're initializing all the clients and let's go and see the logs.

01:35.000 --> 01:37.520
So let's now do the actual crawling.

01:37.720 --> 01:41.400
And to do that we're going to use the Tavily Crawl object.

01:41.400 --> 01:43.760
And I remind you, this is a LangChain tool

01:43.760 --> 01:46.320
now, because we're using LangChain Tavily.

01:46.560 --> 01:48.440
And we want to invoke that.

01:48.440 --> 01:52.760
And we want to provide the URL to be python.langchain.com.

01:53.160 --> 02:00.680
We provide it with max_depth, which is going to define how far from the base URL the crawler can

02:00.680 --> 02:01.280
explore.

02:01.640 --> 02:03.240
And the default is one.

02:03.240 --> 02:08.010
And this has to be an integer and the maximum at the moment is five.

02:08.730 --> 02:09.170
Cool.

02:09.210 --> 02:15.810
So I want to elaborate a bit on max_depth, because this is an important argument when it comes to crawling.

02:16.370 --> 02:20.410
So this is from the best practices for crawling.

02:20.770 --> 02:24.330
And it discusses here the excessive depth problem.

02:24.530 --> 02:31.370
So it's obvious: if we're going to have a higher number in max_depth, then runtime is going to be longer.

02:31.410 --> 02:33.970
This is a no-brainer. In the worst case,

02:33.970 --> 02:39.090
in certain website topologies, it can even be exponentially slower.
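
NOTE
A toy illustration of why depth can get expensive. This is not how Tavily crawls;
it is a plain breadth-first walk over a synthetic site (built by the hypothetical
helper make_tree) where every page links to three new pages.
```python
from collections import deque
#
def pages_within_depth(links: dict[str, list[str]], start: str, max_depth: int) -> int:
    """Count pages reachable from start by following at most max_depth links."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # Do not follow links beyond the depth limit.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return len(seen)
#
def make_tree(branching: int, levels: int) -> dict[str, list[str]]:
    """Build a synthetic site where every page links to `branching` new pages."""
    links: dict[str, list[str]] = {}
    frontier = ["/"]
    for _ in range(levels):
        next_frontier = []
        for page in frontier:
            children = [f"{page}{i}/" for i in range(branching)]
            links[page] = children
            next_frontier.extend(children)
        frontier = next_frontier
    return links
#
site = make_tree(branching=3, levels=5)
for depth in range(1, 6):
    # Prints 4, 13, 40, 121, 364 pages for depths 1 through 5.
    print(depth, pages_within_depth(site, "/", depth))
```
On this made-up topology each extra level roughly triples the work, which is the worst case the best practices page warns about.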

02:39.770 --> 02:47.290
We usually want to start with a depth of one or two, and only after we review the results

02:47.290 --> 02:49.850
do we increase it, if we need more depth.

02:50.130 --> 02:57.010
And this is the smart approach because, first, it's going to give us faster iterations, since runtime

02:57.010 --> 03:00.170
is going to be shorter with a lower max_depth.

03:00.890 --> 03:07.010
And second, it's also going to be cheaper because we are consuming less of the resources.

03:07.290 --> 03:10.170
So this is the correct approach: start small.

03:10.170 --> 03:12.050
And we're actually going to do this in the video.

03:12.250 --> 03:18.090
I got to the number five after a bunch of iterations, and figured out that this should be the depth we

03:18.090 --> 03:21.970
want to crawl to get the most documents from the LangChain documentation.

03:22.370 --> 03:31.370
And we're going to give extract_depth equal to advanced. Advanced extraction retrieves more data,

03:31.410 --> 03:38.330
including tables and embedded content, with a higher success rate, but it may increase the latency.
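
NOTE
A hedged sketch of the arguments discussed so far. build_crawl_request is a
hypothetical helper, the "basic" default for extract_depth is an assumption, and the
full URL is assumed from the spoken domain. The depth limit of five is the one quoted
in the video and may change. The actual tool call is left as a comment because it
needs the langchain-tavily package and an API key.
```python
from typing import Any
#
MAX_ALLOWED_DEPTH = 5  # Maximum mentioned in the video; may change over time.
#
def build_crawl_request(url: str, max_depth: int = 1, extract_depth: str = "basic") -> dict[str, Any]:
    """Validate the arguments and return the payload for the crawl tool."""
    if isinstance(max_depth, bool) or not isinstance(max_depth, int):
        raise TypeError("max_depth must be an integer")
    if not 1 <= max_depth <= MAX_ALLOWED_DEPTH:
        raise ValueError(f"max_depth must be between 1 and {MAX_ALLOWED_DEPTH}")
    if extract_depth not in ("basic", "advanced"):
        raise ValueError("extract_depth must be 'basic' or 'advanced'")
    return {"url": url, "max_depth": max_depth, "extract_depth": extract_depth}
#
request = build_crawl_request("https://python.langchain.com/", max_depth=5, extract_depth="advanced")
print(request)
# Assumed usage with langchain-tavily installed and TAVILY_API_KEY set:
#   from langchain_tavily import TavilyCrawl
#   result = TavilyCrawl().invoke(request)
```
Validating locally is optional, but it catches an out-of-range depth before any credits are spent.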

03:39.250 --> 03:47.490
Now, when we invoke the crawler, we're invoking a website traversal tool that explores hundreds of

03:47.490 --> 03:54.090
paths in parallel with built-in extraction and intelligent discovery, which we're going to see in action.

03:54.290 --> 03:59.610
So just for quick iterations, let me change max_depth to be one right now.

03:59.850 --> 04:04.890
So we're actually going to crawl fewer links here in the LangChain documentation.

04:05.130 --> 04:11.220
And let me run this in debug here, and let me show you the result that we get when invoking this tool.

04:14.540 --> 04:17.820
So we can see now we are starting the crawling.

04:18.060 --> 04:19.860
And we get immediately a result.

04:19.860 --> 04:22.260
And this ran under one second here.

04:22.540 --> 04:24.020
And let me show you the result.

04:24.020 --> 04:26.860
Here we have the base URL where we started.

04:27.060 --> 04:29.620
And we have this results key over here.

04:29.820 --> 04:34.540
And it has a list of all the pages that we crawled.

04:34.900 --> 04:41.740
Each crawled page has a URL, which is the URL of the page, and the raw content.

04:41.740 --> 04:43.940
So let me show you, for example, the page here.

04:44.220 --> 04:46.660
This is the page from the documentation.

04:46.660 --> 04:49.820
And you can see the raw content which is scraped for us.

04:50.180 --> 04:55.940
And I'm super excited about this because crawling is a pain.

04:56.180 --> 04:57.540
It's a tedious task.

04:57.540 --> 05:05.740
It has a lot of room for bugs. Tons of things can go wrong in crawling: rate limiting, bot protection,

05:06.020 --> 05:11.180
dynamically rendered pages, and so many other things.

05:11.180 --> 05:17.780
And actually we tried to do this manually in earlier versions of this course and it was really a pain.

05:17.820 --> 05:23.660
So many students had issues and it ran differently on different machines, and sometimes students didn't

05:23.660 --> 05:24.980
get the correct pages.

05:25.180 --> 05:26.460
And it was a real pain.

05:26.580 --> 05:30.020
And that's why I'm super excited about using Tavily Crawl here.

05:30.020 --> 05:36.860
And just as a side note, as a software engineer, if crawling is not going to be my main business logic,

05:37.180 --> 05:42.860
then I would like to offload it into a third party that knows how to do it much better than me.

05:43.020 --> 05:44.540
So that's in general my approach.

05:44.540 --> 05:49.020
I don't want to waste any time debugging and trying to do this myself.

05:49.340 --> 05:49.740
All right.

05:49.780 --> 05:52.940
Notice that we got 18 results here.

05:52.980 --> 05:56.140
Let's go and change max_depth to be 2.

05:56.300 --> 05:59.380
And let's check out the results that we get.

06:00.580 --> 06:02.660
And let me go and rerun this.

06:09.500 --> 06:11.060
And we got a result.

06:11.700 --> 06:13.260
And we can see.

06:13.260 --> 06:14.870
Let's go and check the results

06:14.910 --> 06:15.510
key here.

06:15.830 --> 06:20.830
And we can see that we now have 75 pages that were scraped.

06:21.230 --> 06:21.870
Pretty cool.

06:22.150 --> 06:25.870
Now let's go and change it to the maximum which is going to be five.

06:25.910 --> 06:29.710
And this is the maximum depth that Tavily offers at the moment.

06:34.670 --> 06:41.590
Now we can expect that this will take longer, because we are going to scrape and crawl many more pages.

06:42.350 --> 06:44.150
And let me fast forward it a bit.

06:52.630 --> 06:54.390
And we got here a result.

06:54.430 --> 06:56.670
Now let's check out what we have here.

06:56.670 --> 07:02.830
It took 26 seconds and we got here 251 results.

07:03.710 --> 07:04.310
Alrighty.

07:04.350 --> 07:06.630
Let me show you something very cool here.

07:06.630 --> 07:09.950
And this is an argument which is called instructions.

07:10.350 --> 07:15.750
And here I'm going to provide natural language which is going to be used by the crawler.

07:15.750 --> 07:21.950
And this is going to specify during the mapping process when the crawler is going to map out the pages,

07:21.950 --> 07:26.830
it's going to tell the crawler which page to scrape and which page not to scrape.

07:26.950 --> 07:34.070
And that's a filtering mechanism that is going to help us get much more accurate and much more precise

07:34.070 --> 07:36.550
results in case we're looking for a specific field.

07:36.830 --> 07:40.710
So for example, here I want to search for content on AI agents.

07:41.430 --> 07:43.630
And let me now run this in debug.

07:44.510 --> 07:48.110
And let me now show you what results we get here.

07:53.150 --> 07:55.230
So let's go now and examine it.

07:55.390 --> 07:58.310
We can see it took 30 seconds; I fast-forwarded it.

07:58.430 --> 08:03.390
And we can see now that we have 23 pages that were scraped.

08:03.750 --> 08:04.870
Now, let me show you something cool.

08:04.870 --> 08:08.070
Let me show you the URLs of the pages we scraped.

08:10.750 --> 08:17.430
And just from the slugs of the URL, we can see that it's documentation on AI agents.

08:17.830 --> 08:23.030
We can see that we have information about AI agents over here, and we don't have something which is

08:23.030 --> 08:24.230
not AI agents.
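
NOTE
A small sketch of the instructions argument and of the URL check done in the
debugger. The exact wording of the instruction, the full URL, and the sample response
are assumptions: the response is made-up data that follows the shape shown earlier
(a base URL plus a results list of pages with a URL and raw content), and the key
spellings base_url and raw_content are assumed. url_slugs is a hypothetical helper.
```python
from urllib.parse import urlparse
#
# Payload with a natural-language filter for the mapping step.
request = {
    "url": "https://python.langchain.com/",
    "max_depth": 5,
    "extract_depth": "advanced",
    "instructions": "content on ai agents",
}
#
# Made-up response in the shape seen in the debugger (not real crawl output).
sample_result = {
    "base_url": "https://example.com/docs/",
    "results": [
        {"url": "https://example.com/docs/agents/overview/", "raw_content": "..."},
        {"url": "https://example.com/docs/agents/tools", "raw_content": "..."},
    ],
}
#
def url_slugs(result: dict) -> list[str]:
    """Return the last path segment of every crawled URL."""
    return [urlparse(page["url"]).path.rstrip("/").split("/")[-1] for page in result["results"]]
#
print(url_slugs(sample_result))  # ['overview', 'tools']
```
Scanning the slugs like this is a quick sanity check that the filter kept only pages on the requested topic.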

08:24.750 --> 08:30.510
Now, I know we are in a RAG section here, but just a quick note that this capability is actually very

08:30.510 --> 08:33.990
useful when we're implementing AI agents.

08:34.670 --> 08:38.510
And one important note on the argument of instructions.

08:38.870 --> 08:43.510
So it really matters which instructions we give Tavily.

08:43.710 --> 08:48.150
If our instructions are going to be bad then we're going to get bad results.

08:48.390 --> 08:56.670
And we need to understand that when Tavily uses this argument, it actually uses this to help map out

08:56.710 --> 08:58.910
the URLs to crawl or to not crawl.

08:59.190 --> 09:04.550
So it's really a filter that helps decide which URLs to crawl and extract.

09:04.750 --> 09:06.790
And that's what we should have in mind here.

09:06.790 --> 09:08.270
That should be the argument.

09:08.390 --> 09:11.510
So we shouldn't put here questions or something like that.

09:11.510 --> 09:17.830
We should put here an instruction that is going to help Tavily decide whether to crawl that page or

09:17.830 --> 09:18.990
not to crawl that page.

09:19.630 --> 09:25.560
And if you're interested in more nuances and how to get the best out of crawl, then I highly recommend

09:25.560 --> 09:31.840
you check out the best practices for crawl page that I've posted here, which has a lot of cool tips

09:31.840 --> 09:34.120
and a lot of cool ways to use this API.

09:34.280 --> 09:38.520
And by the way, remember when we talked earlier about the max_depth argument?

09:38.840 --> 09:45.920
So because we're going to use instructions, we could make max_depth higher, because instructions

09:45.920 --> 09:49.200
are going to help the crawler to skip irrelevant pages.

09:49.520 --> 09:50.120
All right.

09:50.120 --> 09:51.360
So let me minimize it.

09:51.360 --> 09:52.960
And let's go back to the code.

09:53.040 --> 09:58.480
And now, from each result, I want to create a LangChain document.

09:58.800 --> 10:03.760
So we're going to iterate through all the results in the results key here.

10:04.720 --> 10:09.600
And we want to create a Document object, a LangChain document.

10:10.240 --> 10:14.880
And it's going to have the page_content, which is going to be the result's

10:14.880 --> 10:21.160
raw_content field, like we saw earlier in debug. And in the metadata,

10:21.160 --> 10:25.640
I want to put a dictionary with the key of source.

10:26.040 --> 10:30.760
And here I'm going to put the result's url key.

10:31.120 --> 10:38.520
And we're going to use this metadata when we retrieve the information, to know exactly where the

10:38.520 --> 10:39.920
information came from.
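
NOTE
A sketch of the conversion step under stated assumptions. The Document class below
is a minimal stand-in so the example runs without LangChain installed; in the real
pipeline it would be imported from langchain_core.documents. results_to_documents is a
hypothetical helper, the key spelling raw_content is assumed, and the sample result is
made-up data.
```python
from dataclasses import dataclass, field
#
@dataclass
class Document:
    # Minimal stand-in so the sketch runs without LangChain installed.
    # In the real pipeline: from langchain_core.documents import Document
    page_content: str
    metadata: dict = field(default_factory=dict)
#
def results_to_documents(crawl_result: dict) -> list[Document]:
    """Turn each crawled page into a Document that remembers its source URL."""
    return [
        Document(page_content=page["raw_content"], metadata={"source": page["url"]})
        for page in crawl_result["results"]
    ]
#
# Made-up crawl output, not real data.
sample_result = {
    "results": [
        {"url": "https://example.com/docs/a", "raw_content": "Page A text"},
        {"url": "https://example.com/docs/b", "raw_content": "Page B text"},
    ],
}
all_docs = results_to_documents(sample_result)
print(all_docs[0].metadata)  # {'source': 'https://example.com/docs/a'}
```
Keeping the URL under the source key means every chunk produced later can still point back to the page it came from.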

10:39.920 --> 10:41.760
So this is very important.

10:41.800 --> 10:44.880
That helps with explainability,

10:44.920 --> 10:50.160
to explain to the user why our RAG application answered the way it did.

10:50.160 --> 10:53.680
And it really helps to create trust in the system.

10:53.680 --> 10:56.560
So this is going to take the information from Tavily.

10:56.560 --> 11:00.800
And each result we're going to convert into a LangChain document,

11:01.000 --> 11:03.120
so we can split it later.

11:03.640 --> 11:05.360
And let me run this in debug.

11:05.360 --> 11:11.920
And by the way, notice I changed max_depth to be one, simply for faster iterations, because we're

11:11.920 --> 11:14.600
going to be scraping and crawling fewer pages.

11:14.600 --> 11:19.160
So it's going to be easier on the credits as well, and it's going to improve latency.

11:19.480 --> 11:22.400
So let me show you all_docs here.

11:22.400 --> 11:31.650
And we can see now we have LangChain documents with the page content and with the metadata of our URL.

11:32.970 --> 11:33.610
All right.

11:33.610 --> 11:37.730
So let me tell you what's going to happen in the next couple of videos.

11:37.970 --> 11:46.250
In the video after this one, I'm going to show you how to crawl the LangChain documentation using Tavily Map and

11:46.250 --> 11:52.690
Tavily Extract, which is taking this process and breaking it down and making it a bit more granular

11:52.690 --> 11:54.530
if we want a bit more control.

11:54.690 --> 11:56.170
This video is optional.

11:56.170 --> 11:58.050
You don't have to watch it.

11:58.050 --> 12:03.330
This is simply to show you if you want more control, what you can do, and it also helps you appreciate

12:03.330 --> 12:05.050
this crawl endpoint.

12:05.610 --> 12:06.170
All right.

12:06.170 --> 12:10.290
So just to recap what we have so far in the pipeline:

12:10.290 --> 12:13.170
we took the documentation, we used Tavily Crawl,

12:13.170 --> 12:17.290
and we loaded the LangChain documentation into LangChain documents.

12:17.290 --> 12:20.130
And now comes the text splitting part.

12:20.170 --> 12:22.010
So we want to take each document.

12:22.010 --> 12:26.890
And we want to chunk it up into smaller pieces so we can index them in the vector store.