WEBVTT

00:06.830 --> 00:15.290
Okay, so apologies that I went a bit ahead in the last lecture and started writing out the call to

00:15.290 --> 00:16.340
this function.

00:16.520 --> 00:23.990
But at least we got to see how the extractItems function runs and the results it returns.

00:23.990 --> 00:27.290
So it's returning this node list of all the boxes, right?

00:27.830 --> 00:34.790
And now it's time for us to write the function itself, scrapeInfiniteScrollItems.

00:35.210 --> 00:44.030
So async function scrapeInfiniteScrollItems, and we pass in the page, the extractItems function,

00:44.630 --> 00:54.440
the target item count, and a delay which we set by default to 1000 milliseconds.

00:54.440 --> 00:56.600
So that's one second, right?

01:01.100 --> 01:02.180
And.

01:07.510 --> 01:16.840
Now let's define an array of the items first, and then we add a try/catch clause.

01:17.590 --> 01:19.900
So catch error.

01:22.010 --> 01:23.230
console.error.

01:23.240 --> 01:23.810
Error.

01:26.130 --> 01:27.240
And.

01:28.530 --> 01:30.270
Then what?

01:30.300 --> 01:34.620
Let me explain first what the basic algorithm here is.

01:35.010 --> 01:45.640
We check how many boxes are in the DOM, and if it's below 100, we keep scrolling.

01:45.660 --> 01:47.760
So here we see.

01:47.760 --> 01:51.300
Well, now there are, I think, 20 boxes.

01:51.300 --> 01:52.170
So we check.

01:52.170 --> 01:53.550
Okay, that's below 100.

01:53.550 --> 01:54.930
So we keep scrolling.

01:55.110 --> 01:56.790
Then we scroll down more.

01:56.820 --> 01:58.710
Now there are 30 boxes.

01:58.720 --> 02:00.120
Well, that's below 100.

02:00.120 --> 02:04.350
So we keep scrolling and then we get 40, and so on.

02:04.350 --> 02:05.610
You get the idea.

02:05.790 --> 02:11.310
That's the target item count, which we set to 100 here and pass in.
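
NOTE
A minimal sketch of the loop described above, separated from Puppeteer so the idea stands on its own.
The names scrollUntil, countBoxes, and scrollDown are hypothetical and are not part of the lecture code or of any library.
```javascript
// Hypothetical helper: keep scrolling until enough boxes are present.
// countBoxes and scrollDown are assumed callbacks, not Puppeteer APIs.
async function scrollUntil(targetItemCount, countBoxes, scrollDown) {
  let count = await countBoxes();
  while (count < targetItemCount) {
    // Still below the target, so scroll and count again.
    await scrollDown();
    count = await countBoxes();
  }
  return count;
}
```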

02:12.730 --> 02:15.620
And so let's see, let's see.

02:15.630 --> 02:24.420
We define the items array here, and then we do a while on items.length.

02:24.780 --> 02:30.010
So items is all of the boxes we're getting from this page here.

02:30.280 --> 02:42.250
So while the box count is less than targetItemCount, which is set to 100.

02:43.740 --> 02:48.480
Then we need to scroll down on the page some more.

02:49.080 --> 02:53.760
So we scrape the items again on the page.

02:53.760 --> 02:56.730
So we say await page.evaluate.

02:58.220 --> 03:05.180
We pass in the extractItems function, the one that gets all the boxes.

03:06.190 --> 03:07.660
Then we see that.

03:07.750 --> 03:08.340
Let's see.

03:08.350 --> 03:11.710
So then we need to scroll down on the page.

03:11.710 --> 03:15.550
And how do you do that inside of Puppeteer?

03:15.820 --> 03:20.440
Well, Puppeteer doesn't really have a function to scroll down on pages.

03:20.440 --> 03:26.920
What they point you to do instead is to use something called window

03:28.600 --> 03:31.060
scrollTo.

03:31.870 --> 03:36.130
And it takes in an X argument and a Y argument.

03:36.130 --> 03:42.100
So if you say 0, 10, it's going to scroll to 10 pixels from the top.

03:43.660 --> 03:51.370
And then we have something called document.body.scrollHeight.

03:51.910 --> 03:58.360
So that's the height of the body element, which is basically the whole page.

03:59.140 --> 04:02.080
And if we refresh the page here.

04:04.430 --> 04:06.110
And write document.

04:06.980 --> 04:11.570
body.scrollHeight.

04:12.890 --> 04:17.090
We can see that it is 1495 now.

04:20.120 --> 04:27.830
But if I scroll down a bit, it's going to extend the page and put more boxes on.

04:28.100 --> 04:32.320
So now the height is 2775.

04:32.330 --> 04:34.960
So it's longer now, and so on.

04:34.970 --> 04:38.900
If you keep scrolling down, we see it gets longer and longer.

04:39.230 --> 04:47.300
So now how do you actually scroll down using JavaScript in the page?

04:47.840 --> 04:53.540
The way to do it is we write window.scrollTo, then we have the X argument.

04:53.540 --> 04:59.900
We just set it to zero, and for Y we pass in document.body.scrollHeight.

05:01.360 --> 05:08.590
So that's just going to scroll to the bottom of the page and that means we are going to get more boxes

05:08.590 --> 05:10.150
loaded into the DOM.

05:10.450 --> 05:16.960
So if I press that, it goes to the bottom of the page and more boxes get loaded.
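
NOTE
A small sketch of the scroll-to-bottom call just demonstrated in the browser console.
The wrapper scrollToBottom and its win parameter are hypothetical; in the page itself you would simply run window.scrollTo(0, document.body.scrollHeight).
```javascript
// Scroll a window-like object to the bottom of its document.
// `win` defaults to the global object (window in a browser); passing it in is only for illustration.
function scrollToBottom(win = globalThis) {
  // X stays at 0; Y is the full height of the body, i.e. the bottom of the page.
  win.scrollTo(0, win.document.body.scrollHeight);
}
```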

05:18.500 --> 05:22.880
You can see here in the elements, we have 40 boxes.

05:22.880 --> 05:28.340
Now, if I run this function again, it loads more boxes.

05:28.370 --> 05:29.420
See the scroll bar?

05:29.420 --> 05:37.370
It went up here and in the elements, we now have 50 boxes.

05:38.090 --> 05:44.810
So that's basically what we're going to do inside of the Puppeteer commands we have over

05:44.810 --> 05:45.470
here.

05:45.890 --> 05:52.160
So we type in await page.evaluate and say:

05:54.280 --> 05:55.210
window.

05:55.570 --> 05:58.450
scrollTo(0,

05:58.480 --> 06:02.110
document.body.scrollHeight).

06:05.890 --> 06:09.010
And then for good measure.

06:09.010 --> 06:13.570
It's a good idea to wait and see if we actually executed this function.

06:13.570 --> 06:25.330
So we say await page.waitForFunction, and then we pass in some function to check if our scroll

06:25.330 --> 06:28.720
height is bigger than our previous height we had.

06:28.720 --> 06:31.840
So we put some backticks here.

06:34.220 --> 06:38.270
Notice that I'm using backticks here so I can interpolate a variable.

06:38.270 --> 06:41.750
So document.body.scrollHeight.

06:42.830 --> 06:46.190
Check if it's greater than my previous height.

06:47.880 --> 06:49.530
previousHeight.

06:51.000 --> 06:51.870
And.

06:51.870 --> 06:55.830
Well, if that's the case, then we're good to go.
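
NOTE
A sketch of what the backticks are doing here: the template literal bakes the current value of previousHeight into a string,
and that string is what page.waitForFunction evaluates inside the page. The helper name heightGrewExpression is hypothetical.
```javascript
// Build the expression string that page.waitForFunction would evaluate in the page.
function heightGrewExpression(previousHeight) {
  return `document.body.scrollHeight > ${previousHeight}`;
}
// With the height seen earlier in the console:
console.log(heightGrewExpression(1495)); // prints "document.body.scrollHeight > 1495"
```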

06:57.430 --> 07:01.300
And of course we need to save the previous height in a variable.

07:01.300 --> 07:03.670
So let's go ahead and do that.

07:06.020 --> 07:10.250
So let's define previousHeight up here.

07:13.700 --> 07:16.370
And we say previous height.

07:18.180 --> 07:27.000
is equal to await page.evaluate of document.body.scrollHeight.

07:31.080 --> 07:33.720
Now, let me zoom out a bit.

07:35.540 --> 07:44.060
And then the final thing we can do is wait for a delay to finish, so we can say

07:44.060 --> 07:47.020
await page.waitFor(scrollDelay).

07:47.030 --> 07:52.070
So that's just going to wait for the X number of milliseconds we set up here.

07:53.180 --> 07:58.790
And well, if you're doing this on a live site like Facebook or Instagram, maybe you want to put this

07:58.790 --> 08:02.270
delay to a random value within some range that you set.

08:02.270 --> 08:08.420
So it's not always one second, because if you always set it to exactly one second, they might be able

08:08.420 --> 08:13.790
to detect that, and you get banned.
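
NOTE
A hedged sketch of the random-delay idea. The helper randomDelay, the sleep function, and the 800 to 1500 ms range are all assumptions for illustration, not part of the lecture code.
```javascript
// Hypothetical helper: pick a whole number of milliseconds between minMs and maxMs (inclusive).
function randomDelay(minMs, maxMs) {
  return Math.floor(Math.random() * (maxMs - minMs + 1)) + minMs;
}
// Plain promise-based pause, independent of any Puppeteer API.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Example use inside the scroll loop: pause somewhere between 800 and 1500 ms.
// await sleep(randomDelay(800, 1500));
```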

08:13.790 --> 08:18.170
In the end, after all this, we return the items.

08:20.800 --> 08:26.500
And I also made a mistake here by defining items up here.

08:26.890 --> 08:31.360
I only need to define it outside of the try catch clause up here.
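
NOTE
A sketch of how the whole function might look once the pieces dictated above are put together, with items defined outside the try/catch.
One assumption: page.waitFor, as used in the lecture, is to my knowledge no longer available in recent Puppeteer releases, so this sketch swaps in a plain setTimeout-based sleep.
```javascript
// Promise-based pause used in place of page.waitFor (an assumption, see the note above).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
async function scrapeInfiniteScrollItems(page, extractItems, targetItemCount, scrollDelay = 1000) {
  let items = []; // defined outside try/catch so it can be returned afterwards
  try {
    while (items.length < targetItemCount) {
      // Collect the boxes currently in the DOM.
      items = await page.evaluate(extractItems);
      // Remember the page height, scroll to the bottom, then wait for the page to grow.
      const previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await sleep(scrollDelay);
    }
  } catch (error) {
    console.error(error);
  }
  return items;
}
```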

08:31.840 --> 08:33.910
And, um.

08:33.910 --> 08:36.910
Yeah, let's see if it works.

08:48.260 --> 08:57.500
So node index.js. Now it's going to open the browser and start doing what we set it up to do.

09:11.740 --> 09:12.760
So let's see.

09:12.790 --> 09:13.720
Let's see.

09:20.200 --> 09:22.590
It looks like it is stuck right now.

09:22.600 --> 09:25.060
Let's see what could be wrong here.

09:27.460 --> 09:28.240
Um.

09:28.240 --> 09:29.200
Yeah, I need.

09:29.200 --> 09:30.340
I missed.

09:30.340 --> 09:31.600
I put in.

09:31.630 --> 09:34.450
I missed a parenthesis.

09:34.450 --> 09:35.080
Here.

09:36.250 --> 09:37.870
Let's try again.

09:51.850 --> 09:54.910
scrollDelay is not defined.

09:55.090 --> 10:01.480
Oh yeah, I need to call it scrollDelay up here as well.

10:01.510 --> 10:02.800
Sorry for that.

10:14.720 --> 10:17.840
Okay, so it looks like it's working like it's supposed to.

10:17.840 --> 10:25.990
Now we can see that the item count increases once every second. Once it gets up to 100,

10:26.000 --> 10:27.140
it should stop.

10:27.140 --> 10:31.850
And then we should have a printout of all of the items.

10:32.510 --> 10:39.590
So just like that, ladies and gentlemen, that's how we can keep on scrolling on a page and scrape

10:39.590 --> 10:40.970
all of the elements.

10:41.540 --> 10:44.330
So keep in mind the errors I made.

10:44.330 --> 10:51.170
I was missing a parenthesis here, and I needed to call this scrollDelay instead of delay.

10:51.950 --> 11:01.010
And with that being said, that is how you build a scraper for an infinite-scroll type of application.
