WEBVTT

00:00.500 --> 00:01.550
Hello, everyone.

00:01.550 --> 00:07.970
In this video I'm going to talk about how to scrape a page that is using pagination.

00:08.710 --> 00:09.330
If that's.

00:09.330 --> 00:11.030
If that's how you pronounce it.

00:11.040 --> 00:12.450
Pagination.

00:12.600 --> 00:15.150
Anyway, what is pagination?

00:15.900 --> 00:24.440
I hope I'm pronouncing it the right way, since that's how I'm going to say it throughout this video. Anyway,

00:24.450 --> 00:25.940
what is pagination?

00:25.950 --> 00:35.520
So pagination is when a site divides its content up into several pages instead of having one long page.

00:35.730 --> 00:41.400
Sometimes sites use one long page with infinite scrolling, like Facebook, Instagram, and so on.

00:42.230 --> 00:47.940
But here, instead, it's divided up into several small pages.

00:47.960 --> 00:56.570
So, for example, here you can see in Craigslist, we can click on next and we go to another page or

00:56.570 --> 01:00.890
click previous and we go to the previous page.

01:02.730 --> 01:05.160
Same thing with eBay, for example.

01:05.160 --> 01:07.800
We can go and search for something.

01:09.410 --> 01:12.110
Like an Orange Pi, and

01:13.560 --> 01:16.200
Down below in the very bottom.

01:18.540 --> 01:21.480
At the very, very, very bottom.

01:21.630 --> 01:24.460
We have small pages you can go to.

01:24.480 --> 01:26.670
So that is pagination.

01:27.750 --> 01:33.420
And a lot of you guys have been asking me, how do I scrape a page with pagination?

01:33.420 --> 01:36.030
Well, actually, it's not that hard.

01:36.030 --> 01:39.570
It is pretty simple once you know how it works.

01:40.020 --> 01:49.700
So notice how, when I click next here, there's an extra query parameter added to the URL that says

01:49.860 --> 01:50.880
120.

01:51.420 --> 01:54.720
When I click next again, it says 240.

01:55.920 --> 02:02.790
When I click next again, it says 360, so it's just adding 120 for each page.

02:02.790 --> 02:06.000
And if I set it to zero here, guess what?

02:06.000 --> 02:07.950
I get to the start page again.
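
NOTE
A minimal sketch of the offset pattern just described, where each page adds 120 to a number in the URL.
The base URL and the parameter name "s" are assumptions for illustration; the video does not spell them out.
```javascript
// Hypothetical listing URL; the real address is not given in the video.
const BASE_URL = "https://example.com/search/vol";
// Build the URL for one page: offset 0 is the first page, 120 the second, and so on.
function offsetUrl(base, offset) {
  const url = new URL(base);
  url.searchParams.set("s", String(offset)); // "s" is an assumed parameter name
  return url.toString();
}
console.log(offsetUrl(BASE_URL, 0));
console.log(offsetUrl(BASE_URL, 120));
```
Setting the offset back to 0 gives the address of the start page again.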

02:09.210 --> 02:12.510
And it's similar on eBay.

02:12.680 --> 02:21.660
Here I clicked for the next page, and we see I get to here, where it says page number two, and here

02:21.660 --> 02:24.950
it is set to show 200 items.

02:24.960 --> 02:27.660
So if I set page number three here.

02:29.090 --> 02:30.740
I get to the third page.
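
NOTE
A rough sketch of the page-number style, where the URL carries a page number and a page size instead of an offset.
The parameter names "page" and "per_page" and the example.com address are hypothetical, not eBay's real ones.
```javascript
// Build a URL for a given page number and number of items per page.
function pageUrl(base, page, perPage) {
  const url = new URL(base);
  url.searchParams.set("page", String(page)); // assumed parameter name
  url.searchParams.set("per_page", String(perPage)); // assumed parameter name
  return url.toString();
}
console.log(pageUrl("https://example.com/search?q=orange", 3, 200));
```
Changing only the page argument moves between pages while the search query stays the same.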

02:31.070 --> 02:36.680
So that is how pagination works on most websites.

02:36.680 --> 02:43.410
There's a URL, and you can change a number up here to get to another page.

02:43.430 --> 02:50.300
Also, we have an Airbnb scraper section where we actually learn how to scrape

02:50.330 --> 02:57.230
Airbnb, which also uses pagination on a certain section of its site that we get to.

02:57.890 --> 03:05.210
But here's a really simple example of how to scrape the Craigslist page, because that's a really simple

03:05.210 --> 03:07.520
site that doesn't use any JavaScript.

03:07.640 --> 03:12.680
So we're only going to learn and focus on how to scrape a site with pagination.

03:14.910 --> 03:15.870
Now, let's see.

03:15.900 --> 03:21.180
So we get this URL here with the zero at the end.

03:22.450 --> 03:26.380
And I'm going to make a new project here.

03:29.240 --> 03:29.800
Dah dah dah.

03:30.030 --> 03:32.060
I'll call it pagination.

03:33.490 --> 03:38.080
And I go inside pagination and I initialize npm.

03:39.610 --> 03:41.350
npm init --yes.

03:41.350 --> 03:42.940
So we can add some packages.

03:42.940 --> 03:47.980
We are going to need request and request-promise.

03:48.670 --> 03:49.330
And then.

03:49.330 --> 03:49.900
Cheerio.

03:53.350 --> 04:01.990
And once that is done, we are going to open up the project inside our editor.

04:02.720 --> 04:11.120
So here I have the project running and now I'm going to put my browser to the right of the window or

04:11.120 --> 04:12.140
the screen.

04:13.350 --> 04:15.590
So here I have my project.

04:15.600 --> 04:23.730
I just started it up with npm init, adding the request, request-promise, and cheerio packages.

04:24.850 --> 04:26.470
So let's make a file.

04:26.470 --> 04:29.080
Let's call it index, as always.

04:29.840 --> 04:36.800
And let's import the modules or packages we just installed before.

04:36.800 --> 04:39.650
So request equals require

04:39.650 --> 04:42.110
request-promise.

04:43.210 --> 04:44.170
And.

04:44.170 --> 04:45.130
Cheerio.

04:46.360 --> 04:47.710
equals require

04:48.220 --> 04:49.240
Cheerio.

04:50.940 --> 04:55.170
And I'm just going to close the tab here so you can see more of my screen.

04:56.880 --> 05:02.130
Now, the first thing we need to do is we'll make a scrape function.

05:03.170 --> 05:05.630
An async function, scrape.

05:08.410 --> 05:10.870
Because we like to use async await.

05:10.900 --> 05:14.230
It's a very nice clean syntax to use.

05:14.740 --> 05:18.550
And then I call it down here in the global scope.

05:19.060 --> 05:23.680
So we have this URL, this one.

05:23.830 --> 05:31.030
And you can see what we're going to do is we're going to make a for loop and we're going to go through

05:31.030 --> 05:33.730
the first page, which is this one at zero.

05:33.760 --> 05:36.910
Then the next page is 120.

05:37.270 --> 05:39.310
The next page again is 240.

05:39.310 --> 05:48.250
So we make a for loop that starts at zero and adds 120 each time until we get to let's see how many

05:48.250 --> 05:49.420
pages there are.

05:49.450 --> 05:50.650
360.

05:51.010 --> 05:51.910
So.

05:53.090 --> 05:54.320
Let's get to it.

05:54.860 --> 06:01.640
I just cut the URL there, but let's see how it looks with the for loop.

06:01.850 --> 06:03.260
So for.

06:04.310 --> 06:05.960
And let's use let instead.

06:05.990 --> 06:07.820
That's the modern way to do it.

06:08.660 --> 06:18.560
We start at zero and we continue while index is equal to, or I mean, while index is

06:19.750 --> 06:25.600
less than or equal to 360, which is the last page on Craigslist.

06:25.600 --> 06:32.980
And for each time we run this loop, we are going to add 120 to the index.
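
NOTE
A small sketch of the loop described here: start at 0, add 120 each time, and stop after 360.
The helper name pageOffsets is made up for illustration; the numbers 120 and 360 are the ones seen in the video.
```javascript
const PAGE_SIZE = 120; // step between pages, as seen in the URL
const LAST_OFFSET = 360; // offset of the last page at the time of recording
// Collect every offset the for loop will visit.
function pageOffsets(step, last) {
  const offsets = [];
  for (let index = 0; index <= last; index += step) {
    offsets.push(index);
  }
  return offsets;
}
console.log(pageOffsets(PAGE_SIZE, LAST_OFFSET)); // [ 0, 120, 240, 360 ]
```
Using "less than or equal" in the condition is what makes the loop include the last page at 360.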

06:35.030 --> 06:37.100
Now, let's see if I can.

06:38.110 --> 06:39.820
Make it so you can see all the text.

06:40.420 --> 06:47.110
And then we still have the URL here, so we need the HTML from request.

06:47.110 --> 06:50.020
So we say request.get

06:50.500 --> 06:51.910
and paste in the URL.

06:52.420 --> 06:57.680
Now, obviously I'm going to make the text a little smaller so we can see.

06:58.240 --> 07:05.500
And we need to put in a variable here, which is the number, the index.

07:07.230 --> 07:11.460
So I'm just going to put a plus index here.

07:13.190 --> 07:17.090
So first time we run this for loop, it's going to be zero.

07:17.690 --> 07:22.400
And the next time we run it, it's 120 and so on and so on.

07:22.640 --> 07:24.650
240, 360.

07:25.550 --> 07:26.570
And.

07:27.550 --> 07:34.240
And remember, we need to have an await in front of the request.get.

07:35.290 --> 07:37.300
Because it is an asynchronous call.
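
NOTE
A hedged sketch of awaiting each page inside the for loop. To keep it runnable without network access,
fetchHtml is a hypothetical stand-in for request.get from request-promise, and the URL is a placeholder.
```javascript
// fetchHtml: any function that takes a URL and returns a promise of that page's HTML.
async function scrapeAll(baseUrl, fetchHtml, step = 120, last = 360) {
  const pages = [];
  for (let index = 0; index <= last; index += step) {
    // Without await, html would be a pending promise instead of a string.
    const html = await fetchHtml(baseUrl + index);
    pages.push({ index, html });
  }
  return pages;
}
// Fake fetcher used only for this sketch; swap in request.get for real scraping.
const fakeFetch = async (url) => "<html>" + url + "</html>";
const done = scrapeAll("https://example.com/search/vol?s=", fakeFetch).then((pages) => {
  console.log(pages.map((p) => p.index));
  return pages;
});
```
Because of the await, the pages are requested one after another, in order, rather than all at once.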

07:38.650 --> 07:44.920
And yeah, so we make the Cheerio.

07:45.980 --> 07:48.260
selector.

07:48.260 --> 07:51.380
We can select inside here from the HTML.

07:52.120 --> 07:56.380
And that also gets an await; that's also an asynchronous function.

07:56.800 --> 08:01.750
And now we can select items inside of the HTML.

08:03.180 --> 08:11.730
This is just like the original Craigslist example we did before, and we're just focusing on how

08:11.730 --> 08:13.590
to do pagination right now.

08:14.160 --> 08:18.450
So let me just see really fast how we can get the title.

08:20.190 --> 08:24.180
So the title we can get by saying.

08:25.620 --> 08:26.370
Um.

08:26.370 --> 08:27.960
Result title.

08:27.990 --> 08:28.920
Get the result.

08:28.950 --> 08:30.240
Title Class.

08:31.360 --> 08:32.200
So let's see.

08:32.200 --> 08:38.110
So for each of the result title classes we go.

08:39.100 --> 08:42.820
We have an index and an element.

08:46.340 --> 08:49.220
With each of these, we can do a console.log

08:50.570 --> 08:52.490
Where we get the element text.

08:56.880 --> 08:58.140
Oops, I forgot.

09:03.180 --> 09:05.220
I forgot to have a.

09:07.690 --> 09:08.980
Parentheses here.

09:09.970 --> 09:10.750
Let me see.

09:10.960 --> 09:11.830
There we go.

09:11.830 --> 09:16.620
So now we get all the titles of whatever we are looking for.

09:16.630 --> 09:20.080
I think this is volunteers on Craigslist.

09:20.470 --> 09:25.120
And so that's how we get the titles.

09:25.150 --> 09:26.950
We can just copy this.

09:29.910 --> 09:31.560
And paste it in here.

09:32.640 --> 09:42.360
So result title, for each, we have index and element, and we console.log the title of the result title class.

09:43.390 --> 09:50.680
And let's also just do a console.log here: "at page number",

09:53.370 --> 09:54.480
Index.

09:57.870 --> 10:00.870
Okay, so let's see if it works.

10:05.040 --> 10:07.620
There we go: at page number zero.

10:07.650 --> 10:09.660
Page number 120.

10:10.680 --> 10:13.080
Uh, page 240.

10:13.770 --> 10:15.330
Page 360.

10:15.600 --> 10:19.110
And let's see, that's the end of the pages.

10:19.110 --> 10:20.010
So.

10:21.420 --> 10:25.290
If we go to the end of these volunteer jobs.

10:26.970 --> 10:34.140
We should have this ad post here of someone looking for someone to clean their home, which we also

10:34.140 --> 10:37.800
see here at the end at page number 360.

10:40.240 --> 10:45.040
So that's how you do pagination, how you scrape a site with pagination.

10:45.070 --> 10:51.010
Of course, Craigslist is a really simple site because it's a static website, but I didn't want to

10:51.010 --> 10:53.050
make this into a complex example.

10:53.940 --> 10:58.330
Basically all sites that have pagination work this way.

10:58.350 --> 11:06.690
There's a certain URL you get and you have to watch how the URL looks as you click on the pages in the

11:06.690 --> 11:11.670
site and you're going to figure out there's some sort of pattern to it.

11:11.700 --> 11:12.990
Some pages only.

11:12.990 --> 11:15.630
Just do a one, two, three, four.

11:16.170 --> 11:24.240
Other pages do something like Craigslist, where they start at zero, then 120, 240, and

11:24.240 --> 11:24.870
so on.

11:27.030 --> 11:30.210
Or like in the eBay example we saw before.

11:33.830 --> 11:37.700
That's also a classic pagination.

11:38.410 --> 11:46.720
Where we have the page number here at three, and then we can just set it to one to go to page one.

11:46.720 --> 11:54.550
So basically you just make a for loop, usually a for loop and set some numbers and decide how much

11:54.550 --> 11:56.200
you want to increment it by.

11:56.200 --> 11:59.470
And then you just scrape the page on every iteration of the for loop.

12:00.470 --> 12:03.620
That's how you scrape a page with pagination.

12:03.620 --> 12:10.190
So I hope you could use this for something and let me know if you have any other questions.
