WEBVTT

00:00.680 --> 00:01.610
Hey, everyone.

00:01.610 --> 00:06.680
Today I had a question from a student, Sean Smith.

00:06.860 --> 00:12.620
He's asking, how can you scrape a table that has multiple pages and uses the same URL?

00:12.740 --> 00:20.390
So, a website where the URL doesn't change and the page doesn't refresh, as you see, when you change the page

00:20.390 --> 00:21.710
or change the.

00:21.740 --> 00:24.010
Yeah, change the page inside the table.

00:24.020 --> 00:26.390
How can you scrape a site like that?

00:26.540 --> 00:34.700
Well, Sean, when you say that it has the same URL and it doesn't refresh, then

00:34.700 --> 00:43.910
that sounds like it is using something called async JSON or async JavaScript, or AJAX as it's called, which

00:43.910 --> 00:50.690
means that the site is usually using an API in the back end to get the data from.

00:50.900 --> 00:58.340
So the JavaScript is fetching from a back end, usually a REST back end, and that is a good thing for

00:58.340 --> 01:06.240
us as web scrapers, because that means that we can just get the data in a really computer-readable format,

01:06.240 --> 01:07.680
in JSON format.

01:08.430 --> 01:10.920
So this is the site that Sean mentioned.

01:10.920 --> 01:16.080
And one of the things that I mentioned in the beginning of the course is that you should always check

01:16.080 --> 01:20.400
if there is a back end in the site that you're trying to scrape.

01:20.430 --> 01:24.990
Check first if there is a back end, an API.

01:25.020 --> 01:27.840
That's because that's going to make your job really easy.

01:27.840 --> 01:34.140
When you try to get data, it's going to be so easy for you if you can find a REST API to get your data.

01:34.290 --> 01:38.430
It's going to save you from actually building out a web scraper in the first place.

01:38.820 --> 01:42.390
The second step is for when the site doesn't have an API.

01:42.420 --> 01:45.570
The second step is to use Request in Node.js.

01:45.840 --> 01:49.080
The third step is for when Request in Node.js doesn't work.

01:49.110 --> 01:57.360
If the site requires JavaScript but doesn't have a REST API, then you use Puppeteer, which is a fully

01:57.360 --> 01:58.770
automated browser.

01:59.470 --> 02:03.150
So REST API, Node.js Request, Puppeteer,

02:03.160 --> 02:05.470
in that order.

02:05.470 --> 02:12.280
So that's based on how difficult it is to build and how prone it is to crashing. Puppeteer is a lot more prone

02:12.280 --> 02:19.750
to crashing than Node.js Request and uses more resources, and calling an API is so simple, isn't

02:20.080 --> 02:20.650
it?

02:21.250 --> 02:22.660
It's so simple.
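
NOTE
The order described above can be sketched as a small decision helper. This is
only an illustration: the function name, the option names and the returned
labels are hypothetical and are not part of any library.
```javascript
// Pick the simplest tool that can get the data, in the order given above.
// The option names and return labels are hypothetical, chosen for this sketch.
function chooseScrapingApproach({ hasApi, needsJavaScript }) {
  // 1. Call the back-end API directly when one exists.
  if (hasApi) return 'rest-api';
  // 2. Otherwise use a plain HTTP request from Node.js if the page works without JavaScript.
  if (!needsJavaScript) return 'http-request';
  // 3. Fall back to an automated browser such as Puppeteer as a last resort.
  return 'puppeteer';
}
console.log(chooseScrapingApproach({ hasApi: false, needsJavaScript: true }));
```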

02:22.660 --> 02:27.670
Anyway, let's go ahead and see how we find the REST API for this site.

02:27.670 --> 02:32.170
It's really simple, so we just go and press F12.

02:32.260 --> 02:39.280
So open up the developer tools in Chrome, and I have the network tab selected here so I can see all the

02:39.280 --> 02:41.890
network requests the site is making.

02:42.190 --> 02:47.200
So then we go down here to all the stats on this table we have here.

02:47.290 --> 02:50.170
There's an arrow here to go on to the next page.

02:50.170 --> 02:56.530
So I click on that, and now we see it made a request to the REST API.

02:56.560 --> 03:02.050
So it says here football API, and what do you see?

03:02.080 --> 03:03.580
What do we see here?

03:03.580 --> 03:05.950
Well, we have lots of data.

03:06.250 --> 03:08.530
We have all the data that we need.

03:08.530 --> 03:11.620
Actually, it's even more than what is inside this table.

03:11.620 --> 03:17.290
Sometimes an added bonus of finding a REST API is that you get even more data than you can actually

03:17.290 --> 03:18.880
see on the site.
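
NOTE
A minimal sketch of walking through the table pages by calling the API
directly, once its URL has been found in the network tab. The endpoint URL and
the "page" query parameter are assumptions for illustration; copy the real URL
and parameter names from the request shown in the browser. The sketch relies
on the global fetch that ships with Node.js 18 and later.
```javascript
// Hypothetical endpoint; copy the real request URL from the network tab.
const API_URL = 'https://example.com/football/api/stats';
// Build the URL for one page. The "page" parameter name is an assumption.
function buildPageUrl(baseUrl, page) {
  const url = new URL(baseUrl);
  url.searchParams.set('page', String(page));
  return url.toString();
}
// Fetch pages 0..pageCount-1 one after another and collect the parsed JSON.
// fetchFn defaults to the global fetch; passing a stub makes this easy to try offline.
async function fetchAllPages(baseUrl, pageCount, fetchFn = fetch) {
  const pages = [];
  for (let page = 0; page < pageCount; page += 1) {
    const response = await fetchFn(buildPageUrl(baseUrl, page));
    if (!response.ok) {
      throw new Error(`Request for page ${page} failed with status ${response.status}`);
    }
    pages.push(await response.json());
  }
  return pages;
}
```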

03:19.180 --> 03:22.570
So anyway, we can see here we have

03:23.960 --> 03:26.570
all the REST, or the JSON, data.

03:26.700 --> 03:28.460
I have this online JSON viewer.

03:28.460 --> 03:33.620
I put the data in here, then I can see it in a nice tree if I want to.

03:33.650 --> 03:34.100
We can.

03:34.100 --> 03:35.660
We can look around here.

03:36.390 --> 03:45.390
I can also paste it inside Visual Studio Code and save it as a JSON file, so we can save it as football.json.

03:46.670 --> 03:52.700
And now when I save it, it's going to format it all automatically because I am using the Prettier plugin

03:52.700 --> 04:00.500
inside Visual Studio Code, and then I can see all the data we have here inside the JSON.

04:00.950 --> 04:02.710
There's even the age.

04:02.720 --> 04:06.070
I think that's the footballer's age.

04:06.080 --> 04:11.630
Actually, we have the first and last name, even formatted as a first and last name.

04:11.630 --> 04:15.110
How much more can you ask for if you're doing web scraping?

04:15.110 --> 04:20.120
I mean, if you had to do web scraping, you would have to manipulate the strings and all that stuff.

04:20.120 --> 04:23.390
And yeah, there's just so much here.

04:23.720 --> 04:26.600
Name, goals, value.

04:26.750 --> 04:30.320
And there's the rank, 17, which is actually

04:31.210 --> 04:34.330
what we see on this page right here.

04:35.600 --> 04:42.110
Here's this rank, 17, and there's the stat, which was also mentioned inside the JSON.

04:42.110 --> 04:43.910
So there you go.
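
NOTE
A small sketch of turning a response like the one just shown into flat rows.
The sample data is made up, and every field name in it (players, first, last,
age, rank, value) is an assumption; the real names should be read from the
saved JSON file.
```javascript
// Made-up sample shaped like one page of results; all field names are assumptions.
const sampleJson = '{"players":[{"rank":17,"value":5,"age":27,"name":{"first":"Ada","last":"Example"}}]}';
// Turn the raw response into flat rows that are easy to print or save.
function toRows(payload) {
  return payload.players.map((player) => ({
    rank: player.rank,
    name: `${player.name.first} ${player.name.last}`,
    age: player.age,
    value: player.value,
  }));
}
const rows = toRows(JSON.parse(sampleJson));
console.log(rows);
```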

04:43.910 --> 04:48.170
That's how we get lots of data from a website

04:48.710 --> 04:55.880
if they have an API that you can easily access. Sometimes it's not as easy as this, and sometimes, in

04:55.880 --> 05:02.120
these cases, if the site does require JavaScript and you can't find an API no matter what, then you

05:02.120 --> 05:04.850
have to resort to Puppeteer.

05:04.940 --> 05:10.910
But if you can get away with not using JavaScript to get the data, use Node.js Request.

05:11.000 --> 05:16.730
Anyway, that's it for now and I hope you got something out of this video and you can use it for getting

05:16.730 --> 05:18.440
your data from websites.
