WEBVTT

00:02.150 --> 00:10.340
Okay, so today someone asked me, how can you scrape Facebook?

00:10.340 --> 00:11.360
Well, Facebook.

00:11.360 --> 00:19.010
After the Cambridge Analytica scandal, they had to revise their API a lot, and a lot of endpoints

00:19.010 --> 00:20.840
were closed off.

00:21.020 --> 00:29.180
And so I found an article by a guy who writes about how he tried to scrape the desktop version of

00:29.180 --> 00:33.230
Facebook and that ended up with him being banned.

00:33.260 --> 00:40.460
He tried to scrape it using Puppeteer automation, and it ended up with him being banned.

00:40.880 --> 00:47.510
So then he tried to disable JavaScript because Facebook obviously has some kind of JavaScript on their

00:47.510 --> 00:49.970
page which is detecting

00:49.970 --> 00:56.360
whether you're acting like a human or just scrolling through and getting all of the data.

00:56.750 --> 01:03.900
And so then he tried to disable JavaScript, and that ended up with, well, the Facebook page not

01:03.900 --> 01:04.680
working.

01:04.890 --> 01:13.680
Then he got redirected to mobile.facebook.com, which is a simpler version of the Facebook desktop

01:13.680 --> 01:14.400
page.

01:15.710 --> 01:21.800
And well, this is where I kind of disagree with him, because he says here it doesn't use any JavaScript.

01:22.460 --> 01:25.880
However, when you scroll down to the bottom, it updates the page.

01:25.880 --> 01:29.870
So there is some kind of JavaScript going on here.

01:29.870 --> 01:32.060
And also, if we...

01:34.740 --> 01:41.910
If we reload the page with the network tab open, we can see a lot of JavaScript being loaded.

01:42.060 --> 01:49.080
And if I try to use the extension I have over here, which disables JavaScript,

01:49.560 --> 01:51.150
if I try to do that,

01:51.180 --> 01:56.250
well, the page is not going to load because it needs to have JavaScript enabled.

01:56.250 --> 02:03.990
So I don't know, maybe he had an earlier version that didn't use JavaScript or something, but apparently

02:03.990 --> 02:07.650
they use JavaScript even on mobile.facebook.com now.

02:07.800 --> 02:16.170
However, I still think the detection method inside the mobile version

02:16.170 --> 02:20.870
is probably less effective than the one in their desktop version.

02:20.880 --> 02:28.140
So if you want to scrape Facebook, I would still suggest that you use the mobile version of it.

02:29.460 --> 02:29.940
Okay.

02:29.940 --> 02:36.460
However, we are not going to be scraping Facebook in this example because, first of all, it would

02:36.460 --> 02:42.850
require you to make a throwaway account for Facebook, which is kind of a hassle for people.

02:42.850 --> 02:52.270
And second of all, I think it's better that we use this example we have here.

02:52.270 --> 02:59.650
So I found a guy who made a demo page for scraping pages with infinite scroll.

02:59.650 --> 03:06.400
So infinite scroll means that you scroll down to the bottom of the page and new elements get loaded

03:06.400 --> 03:07.600
automatically.

03:07.990 --> 03:10.750
So you can see in the HTML here.

03:11.840 --> 03:14.000
We have 20 boxes here.

03:14.770 --> 03:20.260
Now, if I scroll down to the bottom, the JavaScript will just add ten more elements.

03:20.260 --> 03:24.790
So that's just some JavaScript inside of the HTML, but it's emulating quite well

03:24.790 --> 03:29.110
what Facebook and Instagram and other pages like this are doing.

03:29.440 --> 03:36.040
So we're going to go through this example here and try to build a scraper that scrapes this page, one that

03:36.040 --> 03:40.240
scrolls down and scrapes 100 boxes at a time.

03:40.780 --> 03:48.400
And I'm going to link to both articles: this one, where he is showing how to build

03:48.400 --> 03:49.060
the scraper.

03:49.060 --> 03:54.220
I'm going to build the same scraper just with a little variation, not much, but a little.

03:54.520 --> 04:01.450
And then I'm also going to link to this article, which I disagree with a bit, but you can read

04:01.450 --> 04:03.610
up on it anyway.

04:03.700 --> 04:09.640
Now in the next lecture we are going to start building our scraper of this page and go through it step

04:09.640 --> 04:10.330
by step.

04:10.330 --> 04:12.130
So see you in the next lecture.
