WEBVTT

00:00.110 --> 00:07.880
Now I'm just going to copy the example that they have inside the GitHub repository for the package.

00:07.880 --> 00:12.560
Just so we can get a short intro to how this parser works.

00:13.190 --> 00:20.540
So I'm just going to paste that in here, and then we can talk a bit about how this all works.

00:20.540 --> 00:27.890
First, we have the import of the package up here with require, and then we have where we call the

00:27.890 --> 00:34.310
robots parser here, where we put in the URL of the robots.txt file.

00:36.410 --> 00:42.410
And then we pass in the contents of this robots.txt file.

00:42.410 --> 00:46.990
They made it into an array, which they then join with a newline.
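
NOTE
A minimal sketch, not taken from the video, of the array-and-join idea described above.
The rules are hypothetical; the point is only that joining the lines with a newline
gives the same text a robots.txt file would contain.
```javascript
// Hypothetical rules, written one directive per array element.
const lines = [
  'User-agent: *',
  'Disallow: /dir/',
  'Allow: /dir/test.html',
  'Sitemap: http://www.example.com/sitemap.xml',
];
// Joining with a newline produces the text of a robots.txt file.
const robotsTxt = lines.join('\n');
console.log(robotsTxt);
```
A string built this way can stand in for the file contents until the real file is downloaded.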

00:47.000 --> 00:53.030
But you can also just pass in the text of the file directly, and I'm going to show you ways that we can

00:53.030 --> 00:57.350
get this text file automatically from the pages that we are crawling.

00:58.640 --> 01:02.170
And don't worry, we are going to build this up from scratch as well.

01:02.180 --> 01:03.230
All of this.

01:03.830 --> 01:07.910
But I'm just going to go briefly through how this works.

01:08.860 --> 01:13.330
And so here we have the rules from the robots.txt file.

01:14.220 --> 01:21.990
And then down here we can call this robots parser and check programmatically

01:21.990 --> 01:27.900
whether we are allowed to parse, or I mean scrape, certain pages on the domain.

01:28.590 --> 01:38.040
So for example, this will return false because test.html is listed under this disallow here.

01:38.370 --> 01:45.210
So let's try wrapping console.log around this statement we have here.

01:48.970 --> 01:52.120
And say node robot parser.

01:53.840 --> 02:01.340
And then it comes up with an error: robots.getPreferredHost is

02:01.340 --> 02:02.670
not a function.

02:02.690 --> 02:09.350
And I'm not sure why they have this function in the example when it doesn't work.

02:09.920 --> 02:17.120
But since it's not a commonly used feature of robots.txt anyway, I'm just going to remove

02:17.120 --> 02:17.540
it.

02:19.340 --> 02:21.710
Problem solved.

02:21.740 --> 02:23.660
Now let's try and run it again.

02:24.570 --> 02:26.820
And we can see it returns false here.

02:27.180 --> 02:30.870
And okay, so one more thing.

02:30.870 --> 02:39.510
So we had the allow and disallow rules on various directories, and it also has a star on the

02:39.510 --> 02:42.150
user agent here. And in here,

02:42.150 --> 02:47.910
you can also pass in, as a second argument, the user agent that you are using.

02:48.150 --> 02:51.840
In this case, they're just giving it the name Sams-Bot.

02:51.870 --> 02:54.750
You can call your crawler whatever you want.

02:55.530 --> 03:01.710
But it's also there to test whether there are rules for a specific user agent.

03:03.570 --> 03:05.010
Now let's see.

03:05.010 --> 03:10.530
There is another property here, which is sitemap.

03:11.250 --> 03:17.100
A sitemap is sort of like a directory of all the pages on a site.

03:17.780 --> 03:20.490
And we have one example with Google.

03:20.520 --> 03:25.200
They actually have a sitemap here at the bottom of their robots.txt.

03:25.710 --> 03:30.000
So if I go ahead and go into that XML file.

03:31.000 --> 03:35.320
They have this directory of the various pages that they have.

03:35.500 --> 03:39.190
If we go into maybe this one here.

03:39.790 --> 03:47.710
And go into another page here, it's basically a list of the different URLs they have that your web crawler

03:47.710 --> 03:50.680
can visit and start crawling or scraping.
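
NOTE
A rough sketch, not shown in the video, of pulling the list of URLs out of a sitemap.
The XML content and the helper name extractLocs are made up for illustration, and the
regex is an assumption that is good enough for a sketch; a real crawler would more
likely download the file and use a proper XML parser.
```javascript
// Hypothetical sitemap content; a real one would be downloaded from the site.
const sitemapXml = [
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
  '  <url><loc>http://www.example.com/</loc></url>',
  '  <url><loc>http://www.example.com/dir/test.html</loc></url>',
  '</urlset>',
].join('\n');
// Collect the text inside every <loc> element (illustrative helper).
function extractLocs(xml) {
  return Array.from(xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g), (m) => m[1]);
}
const sitemapUrls = extractLocs(sitemapXml);
console.log(sitemapUrls);
```
Each entry in sitemapUrls is a page the crawler could check against the robots.txt rules before visiting.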

03:52.140 --> 03:59.010
And there are also other ways to crawl a website, simply by going to a website and looking for links

03:59.010 --> 04:02.040
on the website and then visiting those links.

04:02.040 --> 04:04.170
We're also going to look at that method.
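
NOTE
A minimal sketch of the link-following idea, assuming the page HTML is already in a string.
The page URL, the HTML, and the helper name extractLinks are hypothetical, and this is not
necessarily how the course builds it later. The regex is fragile on real HTML and only
illustrates the approach.
```javascript
// Hypothetical page address and content; a real crawler would fetch these.
const pageUrl = 'http://www.example.com/dir/index.html';
const html = '<a href="/about.html">About</a> <a href="test.html">Test</a>';
// Find href values and resolve each one against the page URL
// with the URL class that is built into Node.
function extractLinks(markup, baseUrl) {
  const matches = markup.matchAll(/href="([^"]+)"/g);
  return Array.from(matches, (m) => new URL(m[1], baseUrl).href);
}
const links = extractLinks(html, pageUrl);
console.log(links);
```
Resolving against the page URL turns both root-relative and page-relative links into absolute URLs the crawler can visit next.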

04:06.760 --> 04:11.140
So that's a short intro to robots-parser.

04:11.710 --> 04:19.450
The first argument is the URL to the robots.txt file, but it doesn't actually download the robots.txt file.

04:19.450 --> 04:27.070
It just uses this URL when it is checking which directories we are allowed into.

04:27.580 --> 04:29.740
Because if you had your.

04:29.770 --> 04:35.980
Well, if you had your robots.txt file somewhere else, maybe you could have it inside this directory.

04:35.980 --> 04:41.980
Then the way the rules come out is going to be a little different because you have your robots.txt in

04:41.980 --> 04:43.510
a different directory.

04:43.780 --> 04:49.110
So it's not that it's going to automatically download the file at the URL that you pass in.

04:49.120 --> 04:52.390
It's to know how to parse the rules.

04:52.570 --> 04:54.550
We can make a little test here.

04:54.550 --> 04:59.290
So we write out the dir here, and then let's make a new disallow rule here.

04:59.290 --> 05:00.970
So I write disallow.

05:02.210 --> 05:05.330
And we can say hello.html.

05:06.190 --> 05:10.810
So I just added a new disallow rule here for hello.html.

05:11.260 --> 05:19.850
And then down here, let's check whether we can get hello.html from the base directory.

05:19.870 --> 05:23.020
So it's not inside the directory here.

05:23.230 --> 05:25.750
But let's try and see what it says here.

05:28.320 --> 05:30.180
And now it says true.

05:30.180 --> 05:40.320
So that means it is allowed to get hello.html from the base directory here, even though it is saying

05:40.320 --> 05:46.590
that it's not allowed in the robots.txt under the directory here.

05:47.440 --> 05:56.230
But let's say we check dir/hello.html and then run it again.

05:56.350 --> 05:58.180
Now it says it's false.

05:58.180 --> 06:02.200
So now it's not allowed to crawl or scrape this page.

06:02.680 --> 06:12.190
But if I put a slash in front of hello.html, it means that any page ending in hello.html is

06:12.190 --> 06:13.090
not allowed.

06:15.300 --> 06:17.970
So there's various rules here for that.

06:17.970 --> 06:26.070
And so now it's not allowed in either the base dir or the subdirectory here.

06:26.820 --> 06:33.210
So there's a lot of rules here to keep an eye out for and that's why we have this robots.txt parser

06:33.210 --> 06:34.380
to do it for us.

06:34.800 --> 06:40.800
It's just something I wanted to point out for you to keep in mind while you are building this.

06:42.000 --> 06:46.830
Now let's go on to building our own example from scratch.

06:47.600 --> 06:52.760
Where we are going to parse an actual robots.txt file from a real site.