WEBVTT

00:00.350 --> 00:09.110
Okay, so in this section we're going to build out a robots parser from scratch on a real site,

00:09.140 --> 00:14.750
not just the usage example that we had from the GitHub, but we're going to build it up using either

00:14.750 --> 00:21.830
something like Facebook, getting the robots.txt from Facebook, or Craigslist, or another site.

00:22.310 --> 00:24.710
I'll find out what we're going to use in the end.

00:25.070 --> 00:28.670
So let's first import the robots parser.

00:28.850 --> 00:33.920
So we say const robotsParser = require('robots-parser').

00:34.730 --> 00:44.180
And in here, we also need to have some requests going on, because we

00:44.180 --> 00:47.840
need to use request to download the robots.txt file.

00:47.840 --> 00:53.330
So we need to add the request-promise package.

00:53.330 --> 00:55.040
So we say yarn add request

00:55.250 --> 00:57.110
request-promise.

01:01.560 --> 01:03.900
And let's see.

01:04.050 --> 01:06.510
So we say const request = require

01:06.990 --> 01:09.480
request-promise.

01:11.210 --> 01:13.310
And now what else do we need?

01:13.340 --> 01:18.560
Then we need to find a URL to download the robots.txt file from.

01:18.800 --> 01:21.830
So for this example.

01:22.670 --> 01:28.580
I decided to use a site that's maybe not too popular for some people.

01:29.290 --> 01:32.030
Um, it's called Textfiles.com.

01:32.030 --> 01:32.770
So.

01:32.780 --> 01:38.120
Okay, this is just a brief intro about the site, even though it doesn't matter what site it is

01:38.150 --> 01:46.250
we're scraping, but it's just a site I had a lot of fun with in my younger days looking at old files

01:46.250 --> 01:51.470
from the old BBSes of 1980 to 1995.

01:51.500 --> 02:01.460
They're pretty funny files to read, written by people back in the old Usenet and BBS days.

02:01.460 --> 02:07.100
So yeah, as you can see, it's just lots of text files that you have at the end.

02:08.450 --> 02:18.020
But anyway, it's not so much about the site, it is more about the robots.txt file and the fact

02:18.020 --> 02:22.160
that Textfiles.com does not have a robots.txt file.

02:22.160 --> 02:25.220
Actually you can see they don't have one.

02:25.430 --> 02:31.250
But the mirror does, I believe, the German mirror here,

02:31.640 --> 02:37.160
textfiles.meulie.net, or however you say it.

02:37.400 --> 02:44.390
So if I go on to robots.txt here, I get lots of rules that I can use.

02:44.810 --> 02:53.390
I also had other pages in mind such as Wikipedia, but I'm just worried that some of you might make

02:53.390 --> 02:56.720
mistakes and in the end you might get banned.

02:56.990 --> 02:59.060
Actually, um.

03:00.630 --> 03:02.830
So that's not that cool.

03:02.850 --> 03:08.880
Also, some of you guys have been blocked on Craigslist.

03:08.880 --> 03:17.220
So that's why in the end, I decided just to go with textfiles.com, or this one,

03:17.220 --> 03:24.030
the mirror, textfiles.meulie.net, or however you say it. I'm going to put a link for it so you guys can

03:24.030 --> 03:25.380
just use it.

03:26.230 --> 03:26.880
And.

03:26.890 --> 03:27.370
Okay.

03:27.370 --> 03:32.530
Anyway, enough about the story of the site that I chose and why.

03:32.830 --> 03:36.370
Let's put in the URL for the robots.txt file here.

03:37.260 --> 03:39.750
So here we have the robots URL.
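
NOTE
A minimal sketch of where a robots.txt URL comes from: the file sits at the root of the host, so it can be derived from any page URL with Node's built-in URL class. The helper name robotsUrlFor and the example.com address are placeholders, not taken from the lesson.

NOTE
```javascript
// Hypothetical helper: derive the robots.txt URL for whatever host a page is on.
function robotsUrlFor(pageUrl) {
  // A leading slash resolves against the origin, dropping any path or query.
  return new URL('/robots.txt', pageUrl).href;
}
// example.com is a placeholder host, not the site used in the lesson.
console.log(robotsUrlFor('https://example.com/100/index.html'));
```

NOTE
Hard-coding the URL in a constant, as the lesson does, works just as well when only one site is involved.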

03:42.620 --> 03:50.460
And now we need to make a request and get the robots.txt file so we can say async function.

03:50.480 --> 03:57.260
Let's call it getRobotsTxt, and we pass in a robotsUrl.

03:59.660 --> 04:05.210
And then inside here, we make a request where we just get the robots.txt file.

04:05.390 --> 04:12.700
Remember that request is just getting a file, something similar to curl.

04:12.710 --> 04:20.570
If you know curl already, you can run it inside the console to just download files or similar to Postman

04:20.570 --> 04:22.850
which is making GET requests.

04:22.850 --> 04:30.950
That is also just getting the file without parsing any JavaScript or doing anything advanced like a browser

04:30.950 --> 04:31.670
would do.

04:32.090 --> 04:41.420
But we're also going to look at an example where I get the robots.txt with Puppeteer, or the Chrome browser

04:41.420 --> 04:43.010
in an automated way.

04:43.730 --> 04:50.930
Um, so okay, I did a little sidetrack here because I just wanted to explain to you guys what request

04:50.930 --> 04:52.010
is all about.

04:53.610 --> 04:55.650
So we say request

04:57.380 --> 05:02.210
.get and paste in the robots URL.

05:03.420 --> 05:09.900
And now we basically have the robots.txt file here because request is just getting a file.

05:10.350 --> 05:13.230
For an HTML page we would just get the HTML.

05:13.260 --> 05:17.280
But now since we just have a text file, it's just getting the text file.
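
NOTE
A minimal sketch of the download step described here. The HTTP client is passed in as a parameter so the function can be tried without network access. In the lesson it would be the request object from request-promise, whose get call resolves to the response body as a string. The fakeRequest stub and the example.com URL are hypothetical.

NOTE
```javascript
// `request` is expected to behave like request-promise: get(url) resolves to the body text.
async function getRobotsTxt(robotsUrl, request) {
  // No HTML parsing and no JavaScript execution: just the raw text of the file.
  const robotsText = await request.get(robotsUrl);
  return robotsText;
}
// Hypothetical stand-in client so the sketch runs offline.
const fakeRequest = {
  get: async (url) => `# fetched from ${url}\nUser-agent: *\nCrawl-delay: 10\n`,
};
getRobotsTxt('https://example.com/robots.txt', fakeRequest).then((text) => console.log(text));
```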

05:18.120 --> 05:21.960
And, well, then we can just

05:22.780 --> 05:36.010
return robotsText here, and we can also just put it inside the robots parser, so we can say const parser,

05:36.400 --> 05:37.660
or robots.

05:39.270 --> 05:42.420
const robots, and we can say robotsParser.

05:43.030 --> 05:45.530
And then we can call it like this.

05:45.550 --> 05:53.650
And in here you put in the URL so that it knows what directory you have this robots.txt file lying in,

05:53.650 --> 06:01.090
which is what I mentioned before, how the rules are different depending on what directory it is inside.

06:01.860 --> 06:04.350
And then we have the contents itself.

06:04.350 --> 06:10.560
So that's why we just put in this robotsText, which request is getting for us.

06:10.560 --> 06:13.710
So that's the text file directly from the site.

06:14.880 --> 06:15.600
Okay.

06:16.390 --> 06:22.470
Now let's try and see if we can actually make this work and test out some rules.

06:22.480 --> 06:26.200
So let's first take a look at the robots.txt file here.

06:26.500 --> 06:28.990
It doesn't have that many rules.

06:28.990 --> 06:33.310
It does have a crawl delay of 10 seconds.

06:33.310 --> 06:34.630
So keep that in mind.

06:34.630 --> 06:41.170
It actually wants you to wait for 10 seconds before you make another request on this site.

06:41.380 --> 06:46.300
And it has various user agents that it has disallowed.

06:49.550 --> 06:50.960
And there's one here.

06:51.000 --> 06:55.250
It also specifically says it wants a crawl delay of 10 seconds.

06:57.040 --> 07:06.190
Okay, so let's try and do a console.log here of robots, where we can say isAllowed.

07:07.030 --> 07:10.930
And we could check out if we are allowed to.

07:10.930 --> 07:15.130
Let's try and find some URLs here.

07:16.960 --> 07:19.960
If we are allowed to.

07:20.490 --> 07:21.080
Let's see.

07:21.100 --> 07:23.170
Read my favorite 100.

07:23.830 --> 07:26.170
Let's see if we are allowed to get that one.

07:27.530 --> 07:34.910
Looks like the HTML went kind of weird there, but let's see if we are allowed to scrape this site here.

07:35.600 --> 07:36.650
As.

07:37.910 --> 07:41.030
Let's call it Stefanbot.

07:41.660 --> 07:42.890
Or you can call it.

07:43.470 --> 07:44.760
Your own name.

07:45.180 --> 07:45.900
But.

07:46.410 --> 07:55.020
And then let's try and see if we are allowed when we are not supposed to be, when we have one of these

07:55.020 --> 07:57.360
user agents, which they don't like.

07:57.360 --> 08:03.510
When you have a disallow with just a slash, it basically means don't scrape anything.

08:03.510 --> 08:05.490
We don't want you to scrape anything.
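
NOTE
To make the meaning of a bare slash concrete, here is a deliberately simplified sketch of prefix matching. It is not how robots-parser is implemented, and it ignores Allow lines and wildcards. The function name isPathDisallowed is hypothetical.

NOTE
```javascript
// Simplified illustration only: a path is blocked if it starts with any Disallow value.
function isPathDisallowed(path, disallowRules) {
  // An empty Disallow value blocks nothing, so skip empty strings.
  return disallowRules.some((rule) => rule !== '' && path.startsWith(rule));
}
console.log(isPathDisallowed('/100/index.html', ['/'])); // true: every path starts with "/"
console.log(isPathDisallowed('/100/index.html', ['/private/'])); // false: different prefix
console.log(isPathDisallowed('/100/index.html', [''])); // false: an empty rule blocks nothing
```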

08:06.300 --> 08:10.380
Um, and well, if they're lucky, maybe the bot respects it.

08:12.520 --> 08:13.420
Now, let's see.

08:13.420 --> 08:20.370
So now we test this out. This one at the bottom should be false, because that's rogerbot.

08:20.380 --> 08:25.150
And they say rogerbot here should not be allowed to scrape anything.

08:25.570 --> 08:35.200
And also Stefanbot should be allowed, because it's just the directory 100 and they fortunately don't

08:35.200 --> 08:38.050
have any rules against Stefanbot in here.

08:38.500 --> 08:40.660
So let's test that out.

08:40.690 --> 08:46.720
So make sure to save it and make sure to call this function as well.

08:47.610 --> 08:53.970
So we call it here and then paste in the URL that we have up here.

08:53.970 --> 08:55.080
robotsUrl.

08:57.880 --> 08:59.200
Okay, so that's it.

08:59.230 --> 09:02.410
Now let's try and run the code and see if it works.

09:05.720 --> 09:08.410
Okay, so everything looks good here.

09:08.420 --> 09:11.870
We got a true first, which

09:11.870 --> 09:18.950
I also guessed would be true, because this user agent is allowed, and then a false because this user

09:18.950 --> 09:21.140
agent is not allowed to scrape at all.
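
NOTE
A sketch of the check that was just run, assuming robots is the object returned by robotsParser(robotsUrl, robotsText). isAllowed(url, userAgent) is the robots-parser call used in the lesson. The helper name checkUserAgents, the example.com URL, the user-agent names, and the stub that stands in for the parsed file are all assumptions, chosen so the sketch runs without the package installed.

NOTE
```javascript
// Ask the parsed robots.txt whether each user agent may fetch pageUrl.
function checkUserAgents(robots, pageUrl, userAgents) {
  return userAgents.map((userAgent) => ({
    userAgent,
    allowed: robots.isAllowed(pageUrl, userAgent),
  }));
}
// Hypothetical stub mimicking a file that blocks one bot entirely and nobody else.
const stubRobots = {
  isAllowed: (url, userAgent) => userAgent !== 'blocked-bot',
};
console.log(checkUserAgents(stubRobots, 'https://example.com/100/', ['my-bot', 'blocked-bot']));
```

NOTE
With the real package, the first argument would be the result of robotsParser(robotsUrl, robotsText) instead of the stub.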

09:21.620 --> 09:23.150
So that's um.

09:23.420 --> 09:32.810
That's really how you get a robots.txt file with request and then how you pass it

09:32.810 --> 09:37.760
in to the robots parser and check what the rules allow.

09:39.240 --> 09:45.060
We can also try and check what the crawl delay should be, with getCrawlDelay.

09:47.030 --> 09:54.470
And it should be ten, because in the rules here for any user agent it says Crawl-delay: 10.

09:54.950 --> 10:01.100
So remember that if you're trying to scrape this site, they want you to have a delay of 10 seconds for every

10:01.100 --> 10:01.790
request.
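
NOTE
A minimal sketch of honoring that delay between requests. The delay would come from robots.getCrawlDelay(userAgent) in robots-parser. Here it is passed in as a number of seconds, and fetchPage is a hypothetical callback so the loop can be tried without network access.

NOTE
```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Fetch each URL in turn, pausing crawlDelaySeconds between consecutive requests.
async function crawlPolitely(urls, crawlDelaySeconds, fetchPage) {
  const results = [];
  for (let i = 0; i < urls.length; i += 1) {
    if (i > 0) await sleep(crawlDelaySeconds * 1000); // no wait before the first request
    results.push(await fetchPage(urls[i]));
  }
  return results;
}
```

NOTE
A call might look like crawlPolitely(urls, 10, (url) => request.get(url)). With a delay of 10 seconds, three URLs would take at least 20 seconds.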

10:07.850 --> 10:08.840
Let's try and see it again.

10:08.840 --> 10:10.550
So it says 10 seconds.

10:11.480 --> 10:19.020
Okay, so now in the next section, I think we are going to actually try and crawl the site and we'll

10:19.070 --> 10:27.110
basically have a web crawler which is going to just check for all the links on this site and go to all

10:27.110 --> 10:30.200
of the files and basically get them.

10:30.200 --> 10:33.770
I think, let's try and see if we can do that in the next section.