WEBVTT

00:00.080 --> 00:04.640
In this section, we're going to take a closer look at robots.txt.

00:04.670 --> 00:12.950
Why sites have a robots.txt file, what it is good for, and why we should try to follow it to avoid

00:12.950 --> 00:14.780
getting banned from a site.

00:15.380 --> 00:24.320
So we have a couple of examples of robots.txt files here from GitHub, Google, Craigslist, Facebook,

00:24.320 --> 00:25.280
and LinkedIn.

00:25.400 --> 00:34.550
And basically the robots.txt file tells web crawlers what they can and cannot do.

00:34.760 --> 00:37.780
So here we have Facebook's robots.txt.

00:37.790 --> 00:45.440
Well, first of all, they have a notice saying that crawling Facebook is prohibited unless

00:45.440 --> 00:47.180
you have written permission.

00:48.050 --> 00:49.160
So there's that.

00:49.160 --> 00:54.010
And then they have written out specific rules for the different bots.

00:54.020 --> 01:02.090
So there's one for Applebot, and then there's one for Baiduspider, and then we have one

01:02.090 --> 01:05.510
for Bing, and then there's one for Google as well.

01:06.860 --> 01:12.980
And then there are various rules in here, for example Disallow, which means that Googlebot is

01:12.980 --> 01:23.930
not allowed to scrape any directories, files, or pages under the listed folder, or scrape anything with

01:23.930 --> 01:27.740
album.php or checkpoint, and so on.

01:28.220 --> 01:29.930
So that's one example.

01:29.930 --> 01:32.840
Then we also have Craigslist in here.

01:32.840 --> 01:39.950
They have a star for the user agent, which means that no user agent is allowed to go to these

01:39.950 --> 01:40.970
directories.

01:41.970 --> 01:50.040
And so you can't scrape anything under a path like the suggest one here, or flag, and so

01:50.040 --> 01:50.550
on.

01:50.880 --> 01:54.330
Let's see if we have some other interesting rules here.

01:54.510 --> 01:57.120
Here is the Google robots.txt.

01:57.120 --> 02:04.410
And in here we can see that, for all user agents, they disallow /search, but they actually

02:04.410 --> 02:08.010
allow /search/about and /search/static.
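
NOTE
The Allow and Disallow rules just described can be sketched in TypeScript, since the course works in Node.js.
This is a minimal sketch, not how any particular parser works: the Rule type and the isAllowed function are
hypothetical names, wildcards (* and $) are ignored, and it assumes the convention that the longest matching
path wins, with Allow winning a tie.
```typescript
// Hypothetical type and names, for illustration only.
type Rule = { kind: "allow" | "disallow"; path: string };
// The rules read out in the video for "User-agent: *" in Google's robots.txt.
const rules: Rule[] = [
  { kind: "disallow", path: "/search" },
  { kind: "allow", path: "/search/about" },
  { kind: "allow", path: "/search/static" },
];
// Assumed precedence: longest matching prefix wins, Allow wins a tie, no match means allowed.
function isAllowed(ruleList: Rule[], urlPath: string): boolean {
  let best: Rule | undefined;
  for (const rule of ruleList) {
    // An empty path matches nothing; otherwise a rule matches when the URL path starts with it.
    if (rule.path === "" || !urlPath.startsWith(rule.path)) continue;
    const longer = best === undefined || rule.path.length > best.path.length;
    const tieAllow = best !== undefined && rule.path.length === best.path.length && rule.kind === "allow";
    if (longer || tieAllow) best = rule;
  }
  return best === undefined || best.kind === "allow";
}
```
With these rules, isAllowed(rules, "/search/about") is true, while isAllowed(rules, "/search?q=robots") is false.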

02:08.550 --> 02:18.450
Another thing you might also see inside a robots.txt is the interval you have to leave between requests to

02:18.450 --> 02:18.960
a site.

02:18.960 --> 02:23.010
So let me see if I can find any examples of that in here.

02:23.190 --> 02:25.920
I don't think I could find any.

02:26.850 --> 02:34.200
But I think there is an example in the parser we're going to use inside Node.js here.

02:34.440 --> 02:37.110
There's one here that says Crawl-delay: 1.

02:37.110 --> 02:43.740
So that means that you have a delay of one second for each page that you're scraping inside of a specific

02:43.740 --> 02:44.430
domain.

02:44.430 --> 02:52.650
So you can't just hit up a lot of pages on a certain domain in very little time or else they are not

02:52.650 --> 02:54.480
going to like that.

02:54.480 --> 02:56.880
And maybe they will ban you.
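
NOTE
The crawl delay described above can be sketched in TypeScript as follows. This is only an illustration under
stated assumptions: parseCrawlDelay and msToWait are hypothetical helpers, not part of the parser used in the
course, and the sketch ignores user-agent groups by reading the first Crawl-delay line it finds.
```typescript
// Hypothetical helper: returns the first Crawl-delay value in seconds, or undefined if there is none.
function parseCrawlDelay(robotsTxt: string): number | undefined {
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = /^\s*crawl-delay\s*:\s*(\d+(?:\.\d+)?)/i.exec(line);
    if (match) return Number(match[1]);
  }
  return undefined;
}
// Hypothetical helper: milliseconds still to wait before the next request to the same domain.
function msToWait(lastRequestMs: number, nowMs: number, delaySeconds: number): number {
  return Math.max(0, lastRequestMs + delaySeconds * 1000 - nowMs);
}
```
A crawler could call msToWait with the time of its last request to a domain and sleep for the result before
sending the next one.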

02:58.240 --> 03:02.770
We noticed that they had specific rules for different user agents.

03:02.800 --> 03:07.390
They define that up here and say which user agent can do what.

03:07.420 --> 03:11.890
But usually they have the same rules for all the different user agents.

03:13.690 --> 03:18.880
So the user agent string is just a string that we can define inside of our crawler.

03:19.210 --> 03:26.560
And the last thing is that it is, of course, voluntary on your side, in the crawler, whether you actually want

03:26.560 --> 03:34.780
to respect robots.txt. It's not a technical requirement when you build your web crawler that you have

03:34.780 --> 03:36.820
to respect robots.txt.

03:36.820 --> 03:46.030
But if you don't, then maybe the site is going to block your IP or ban you from crawling the site

03:46.030 --> 03:46.870
at all.

03:48.010 --> 03:54.850
So if you want to establish, I guess you can call it, a long-term relationship of crawling a certain

03:54.850 --> 03:59.620
site, then you should probably follow the robots.txt.

04:02.240 --> 04:11.180
So now in the next section, we're going to look at how we can programmatically parse and follow robots.txt

04:11.180 --> 04:19.040
rules so we can test whether our crawler is allowed to scrape a certain site, and then we can let our

04:19.040 --> 04:25.220
crawler proceed and get the HTML if it is allowed to fetch a certain page.