WEBVTT

00:00.990 --> 00:01.450
Okay.

00:01.470 --> 00:07.350
In this section, I'm going to show you how we build out a loop that Puppeteer can go through to

00:07.350 --> 00:11.270
visit all of the different URLs that we get from this first page.

00:11.280 --> 00:14.670
All of the URL properties that we get in our array.

00:14.790 --> 00:21.210
So how do we build out a loop that Puppeteer can go through and visit each page so we can scrape out

00:21.210 --> 00:24.930
the job description text for each page?

00:24.930 --> 00:30.480
So first of all, let's rename this function here to be something called

00:31.290 --> 00:32.430
scrapeListings.

00:33.060 --> 00:39.390
And then instead of having this scrapeListings, we are going to have a new main function.

00:39.390 --> 00:41.400
So async function main.

00:42.120 --> 00:50.400
And then in here, let's cut these two browser and page instantiations into the main function instead.

00:51.080 --> 00:52.190
Just like that.

00:52.580 --> 01:01.190
And then from here, we are going to call it and say const listings = await scrapeListings.

01:01.190 --> 01:07.280
And into scrapeListings, we're then going to pass the page variable so the listings function

01:07.280 --> 01:08.120
can use it.

01:09.620 --> 01:12.710
And then also make sure to have it as a parameter in here.

01:14.400 --> 01:17.400
So now we created a new main function.

01:17.400 --> 01:19.830
Make sure to also call that down here.

01:20.610 --> 01:23.340
We renamed this function to be scrapeListings instead.

01:23.340 --> 01:28.290
We pass in the page variable so we can use it inside of this function.

01:28.840 --> 01:35.080
And we instantiate the Puppeteer browser out here in main so that we can use it in other functions that

01:35.080 --> 01:35.820
we build out.

01:35.830 --> 01:38.410
For example, this loop that we're going to make now.

01:40.380 --> 01:47.280
So then we end up with this listings, which is our listings array, which we also need to remember to

01:47.280 --> 01:48.660
return from this function.

01:48.660 --> 01:50.490
So we say return.

01:51.420 --> 01:52.730
It's called results here.

01:52.730 --> 01:55.880
Let's rename it to be listings.

01:57.010 --> 02:00.100
So it returns the listings array from this function.

02:00.340 --> 02:04.060
Let's call the console log and put it down here.

02:04.180 --> 02:08.680
So we make sure that we are getting the right data from this function call.

02:09.900 --> 02:16.260
So then we have restructured our code a little bit so that we can keep on using the Puppeteer browser

02:16.260 --> 02:17.850
in other functions.
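
NOTE
A rough sketch of how the restructured file could look at this point. The URL and the
"a.listing" selector are placeholders, not the ones from the video, and page.$$eval is
swapped in here so the sketch does not have to assume a particular HTML parser.
```javascript
// Assumption: placeholder URL, stands in for the real listings page.
const LISTINGS_URL = "https://example.org/jobs";
// Takes the Puppeteer page from main() and returns the listings array.
async function scrapeListings(page) {
  await page.goto(LISTINGS_URL);
  // Assumption: every job link matches the hypothetical selector "a.listing".
  const listings = await page.$$eval("a.listing", (links) =>
    links.map((a) => ({ title: a.textContent.trim(), url: a.href }))
  );
  return listings;
}
// The browser and page live in main() so other functions can share them.
async function main() {
  const puppeteer = require("puppeteer"); // required here so scrapeListings can be tried without it
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  const listings = await scrapeListings(page);
  console.log(listings);
  await browser.close();
}
```
Calling main() at the bottom of the file starts the run.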

02:19.780 --> 02:25.720
Now we need to build out our new function for our loop to go through all of these different URLs.

02:26.530 --> 02:31.420
So let's call it const listingsWithJobDescriptions.

02:33.990 --> 02:38.160
And we can say await scrapeJobDescriptions.

02:39.090 --> 02:44.580
So this function needs to take in the current listings URLs, or the current listings

02:44.610 --> 02:47.010
array of URLs we made before.

02:47.010 --> 02:54.300
So we pass in the listings and then let's also pass in the page because we need to navigate around inside

02:54.300 --> 02:55.170
Puppeteer here.

02:57.560 --> 03:02.270
Okay, so with that, let's write out our new function with the loop.

03:02.970 --> 03:10.110
So async function scrapeJobDescriptions, and we have the listings and the page.

03:11.210 --> 03:19.670
Now in here, we need to have this loop for each of the URLs, so we can say for var i, we start at zero

03:19.670 --> 03:26.570
on the array zero index and we go until we get to the end of this listings array.

03:27.920 --> 03:34.400
So listings.length, and then we just go through each of them one by one.

03:36.020 --> 03:41.090
And then we can simply do something similar to what we did just for one page up here.

03:41.330 --> 03:43.280
We can say await

03:44.180 --> 03:52.370
page.goto, with listings at the i index, and take the url property.

03:52.790 --> 03:59.060
So then we go through each of these listings' job description URLs and visit the page inside Puppeteer

03:59.060 --> 03:59.510
here.

04:00.780 --> 04:07.320
And then we can extract the HTML content by saying something like we did up here where we say const

04:07.530 --> 04:08.190
html

04:10.560 --> 04:15.810
equals await page.content().

04:17.180 --> 04:20.360
And just like that, we get the HTML from the page.
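
NOTE
The loop described here could look roughly like this. Keeping the raw HTML on each listing
is an assumption made for the sketch; extracting the description text is a later step.
```javascript
// `page` is the Puppeteer page created in main(); `listings` is the array of { url, ... } objects.
async function scrapeJobDescriptions(listings, page) {
  for (let i = 0; i < listings.length; i++) {
    // Wait for this navigation to finish before moving on to the next URL.
    await page.goto(listings[i].url);
    const html = await page.content();
    // Assumption: store the HTML on the listing so a later step can parse it.
    listings[i].html = html;
  }
  return listings;
}
```
In main this would be called as: const listingsWithJobDescriptions = await scrapeJobDescriptions(listings, page);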

04:21.610 --> 04:29.140
Now, some of you might have noticed that I'm using this older kind of for loop instead of

04:29.140 --> 04:32.200
a newer kind of forEach loop.

04:32.200 --> 04:34.300
So instead of something like this.

04:35.270 --> 04:37.640
For each listing.

04:38.840 --> 04:42.800
And then have the code in here instead of a regular old for loop.

04:42.980 --> 04:50.150
The reason why is that forEach doesn't wait for async callbacks, so it sort of does things concurrently.

04:50.920 --> 04:53.420
And that doesn't work very well with Puppeteer.

04:53.440 --> 04:57.490
You need to do things one by one, one page at a time.

04:57.490 --> 04:59.380
A single page can only visit one URL at a time.

04:59.380 --> 05:02.800
You can only sort of do things in serial with a Puppeteer page.

05:02.800 --> 05:06.340
It doesn't really handle concurrent requests you make to it.

05:06.790 --> 05:12.700
So basically what I mean is that if you try to do a forEach loop with Puppeteer, it's not going

05:12.700 --> 05:14.250
to work out so well for you.

05:14.260 --> 05:17.200
So that's why I'm using this kind of loop here.

05:17.620 --> 05:20.080
Just so you keep that in mind.

05:20.080 --> 05:23.710
If you're wondering why I'm using this old kind of for loop.
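
NOTE
A small demo of the difference, without Puppeteer. fakeVisit is a hypothetical stand-in
for page.goto that just waits a few milliseconds and records when it starts and ends.
```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Hypothetical stand-in for a page visit: logs start, waits, logs end.
async function fakeVisit(url, log) {
  log.push("start " + url);
  await sleep(10);
  log.push("end " + url);
}
// forEach ignores the promise returned by an async callback,
// so every visit starts before any of them has finished.
async function withForEach(urls) {
  const log = [];
  urls.forEach(async (url) => {
    await fakeVisit(url, log);
  });
  await sleep(50); // crude wait, only so the demo can collect the log
  return log;
}
// A plain for loop with await finishes one visit before starting the next.
async function withForLoop(urls) {
  const log = [];
  for (let i = 0; i < urls.length; i++) {
    await fakeVisit(urls[i], log);
  }
  return log;
}
```
With two URLs, withForEach logs both starts before either end, while withForLoop logs start and end for one URL at a time.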

05:24.280 --> 05:31.540
Anyway, now let's try and see if Puppeteer is actually visiting these sites.

05:31.540 --> 05:35.400
And then in the next section, we're going to actually extract the data.

05:35.410 --> 05:41.590
So let's first try and run it with node index.js and see if it visits all the sites.

05:44.850 --> 05:45.600
So there we go.

05:45.630 --> 05:53.490
I can already see that it is visiting all the different job descriptions and I'm just going to stop

05:53.490 --> 06:00.360
my loop now and make sure when you're running this that you don't run it for a long while because it

06:00.360 --> 06:05.430
is making quite a lot of requests in a short time.

06:05.430 --> 06:11.970
So we need to have some sort of way to limit the rate of requests that we're making so we don't make

06:11.970 --> 06:14.370
too many page visits.

06:14.370 --> 06:21.210
Otherwise, Craigslist might ban you from their site because you're making too many requests or putting

06:21.210 --> 06:23.040
too much load on their page.

06:24.260 --> 06:30.560
Now it is kind of unlikely that you're going to do this with Puppeteer because it is sort of limited

06:30.560 --> 06:33.410
in how many requests it can make per second.

06:33.410 --> 06:36.380
It is not that many after all.

06:36.440 --> 06:42.920
But in the next section, I'm still going to show you how we can limit the rate of requests and make

06:42.920 --> 06:47.900
sure we don't get blocked when we are scraping lots of pages on a website.
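
NOTE
One simple way to slow the loop down is to pause between page visits. This is only a
sketch and not necessarily the approach used in the next section; visitWithDelay, wait
and delayMs are hypothetical names.
```javascript
// Resolves after the given number of milliseconds.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Hypothetical helper: visits each listing in turn and pauses between page loads.
async function visitWithDelay(listings, page, delayMs) {
  for (let i = 0; i < listings.length; i++) {
    await page.goto(listings[i].url);
    // No need to pause after the last page.
    if (i < listings.length - 1) {
      await wait(delayMs);
    }
  }
}
```
For example, await visitWithDelay(listings, page, 1000) would leave about one second between visits.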

06:48.440 --> 06:50.570
So I'll see you in the next section.
