WEBVTT

00:00.350 --> 00:01.580
Hello, everyone.

00:01.580 --> 00:09.230
In this section, I want to talk about what you should always check before you even begin making

00:09.230 --> 00:10.370
a web scraper.

00:10.940 --> 00:19.880
Now, I actually want to avoid writing a web scraper, and that's kind of odd, since now I have a course

00:19.880 --> 00:23.690
about web scrapers and they're fun enough to write.

00:23.690 --> 00:32.720
But inherently a web scraper is sort of a hack. What I mean by that is that a web scraper

00:32.720 --> 00:41.660
is taking a website and sort of parsing it into data that you can use, like machine-readable data,

00:41.690 --> 00:46.820
like titles, images, not something for humans to consume.

00:46.820 --> 00:47.420
Right?

00:48.580 --> 00:54.490
So inherently we are taking things from a format that's not meant for computers and trying to pass it

00:54.490 --> 00:55.750
into a computer.

00:56.410 --> 01:02.800
What I'm getting to is that before you even start to make a web scraper for a site, you should always

01:02.800 --> 01:07.450
check if the site has an API that you can use instead.

01:08.050 --> 01:13.750
Always, always check if the site has an API that you can use instead.

01:14.200 --> 01:22.810
That can be either a totally public API that the site advertises for you to use, or, if you do

01:22.810 --> 01:32.470
a little reverse engineering on the site's source, an API that you find there and can use.

01:32.680 --> 01:40.480
We do an example of this using the Nordstrom website where we sort of reverse engineer how to use the

01:40.480 --> 01:46.690
API to get results for their items instead of scraping the site.

01:46.990 --> 01:55.520
And I can tell you, you're going to save so much time and trouble if you can use an API instead

01:55.520 --> 02:03.740
of having to scrape lots of HTML or even launch an automated web browser such as Puppeteer.

02:04.990 --> 02:07.720
Here is the order of things that you should try.

02:07.720 --> 02:11.980
Always try to get an API first.

02:12.930 --> 02:19.620
If you can't get an API from the site, if it doesn't have one, not even when you look inside the

02:19.620 --> 02:25.650
source of the website, then try using request only.

02:26.280 --> 02:32.460
And here is the reason why you want to go with request next, after an API.

02:32.490 --> 02:40.920
If you can't get an API, go for request, because request uses far fewer resources than Puppeteer

02:40.920 --> 02:49.830
or something like Nightmare.js, Selenium and so on. Automated browser solutions like

02:49.830 --> 02:53.790
Puppeteer, Selenium, Nightmare.js and so on

02:53.850 --> 02:56.940
use a lot of CPU and a lot of memory.

02:56.970 --> 02:58.800
They're prone to crashing.

02:58.800 --> 03:02.860
They're a lot more complex than using request.

03:02.880 --> 03:04.320
Request is just

03:04.910 --> 03:12.470
taking a URL and getting the HTML back without running any JavaScript and stuff like that.

03:13.490 --> 03:14.480
However,

03:15.640 --> 03:17.200
if you have a site

03:17.880 --> 03:19.920
that doesn't have a public API

03:19.950 --> 03:26.130
you can access right away, or it's not easy for you to reverse engineer how to use the API,

03:27.000 --> 03:35.570
and the site requires JavaScript to render, which means you can't use request, because

03:35.880 --> 03:43.680
request doesn't do JavaScript rendering, it's just fetching the data at the URL,

03:44.610 --> 03:53.610
then the last option is to use an automated browser such as Puppeteer, Selenium, Nightmare.js and so

03:53.610 --> 03:54.120
on.

03:54.450 --> 03:56.640
That's the very last option.
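
NOTE
The order described above can be sketched as a small JavaScript function.
This is only an illustration: the function name and its two flags are
hypothetical and are not part of request, Puppeteer, or any other library.
```javascript
// Hypothetical helper: pick the cheapest approach that can work for a site.
function chooseStrategy({ hasUsableApi, needsJavaScript }) {
  // 1. An API (public or reverse engineered) always comes first.
  if (hasUsableApi) return 'api';
  // 2. Plain HTTP requests are enough when the HTML already holds the data.
  if (!needsJavaScript) return 'request';
  // 3. An automated browser is the last resort.
  return 'browser';
}
// Example: no API, but the page renders without JavaScript.
console.log(chooseStrategy({ hasUsableApi: false, needsJavaScript: false })); // prints "request"
```
The checks run from cheapest to most expensive, so the browser option is
only reached when both earlier options have been ruled out.
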

03:56.640 --> 04:05.160
And unfortunately I see a lot of my students go right to the very last option that I would go to, which is

04:05.160 --> 04:06.300
something like Puppeteer.

04:06.300 --> 04:14.430
And they ask me, Hey Stefan, how do I make Puppeteer go to a form and type in something and search

04:14.430 --> 04:15.150
something for me?

04:15.150 --> 04:15.960
And I'm like.

04:16.820 --> 04:17.270
Um.

04:17.570 --> 04:20.120
What do you want to scrape, first of all?

04:20.330 --> 04:24.140
Well, this site here. And I go and look at the site and I see:

04:24.140 --> 04:27.410
Well, that doesn't require JavaScript to be rendered.

04:27.410 --> 04:29.270
You could use request with that.

04:29.450 --> 04:38.390
And request takes a lot less memory and CPU than Puppeteer, and it's a lot less prone to crashing.

04:38.390 --> 04:45.170
And it's so much simpler to use than having to use something like Puppeteer.

04:46.250 --> 04:52.310
That being said, there's always a time and place for it, but it's the very last option that I would

04:52.310 --> 04:52.910
use.
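
NOTE
One rough way to tell whether a site needs JavaScript to render is to look
at the raw HTML a plain request returns and see if the data is already in
it. The sketch below assumes that idea; the helper name and both HTML
strings are made up, and in practice rawHtml would come from a plain HTTP
request to the page.
```javascript
// Hypothetical helper: does the raw HTML already contain the text we need?
function htmlContainsData(rawHtml, expectedText) {
  return rawHtml.toLowerCase().includes(expectedText.toLowerCase());
}
// Two made-up responses: one rendered on the server, one filled in by JavaScript.
const serverRendered = '<ul><li class="title">Blue Jacket</li></ul>';
const clientRendered = '<div id="app"></div><script src="bundle.js"></script>';
console.log(htmlContainsData(serverRendered, 'Blue Jacket')); // prints true
console.log(htmlContainsData(clientRendered, 'Blue Jacket')); // prints false
```
If the text you can see in the browser is missing from the raw HTML, the
page is probably built by JavaScript, and only then is an automated
browser worth considering.
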

04:54.070 --> 04:55.630
And so that's it.

04:55.630 --> 05:03.160
So if you want to scrape something like Yelp, then don't go and write a Puppeteer scraper for it,

05:03.160 --> 05:07.330
because Yelp, for example, has a public API that you can use.

05:08.920 --> 05:09.550
Same thing.

05:09.550 --> 05:15.850
We have a case study on Nordstrom, for example, where we reverse engineer the site and find out they have an

05:15.850 --> 05:18.820
API we can use instead of building a scraper.

05:19.800 --> 05:26.060
So, yeah, even though I have a web scraping course here, you're watching a web scraping course.

05:26.070 --> 05:32.310
I also teach you how to avoid actually writing a web scraper, which is perhaps the best solution.

05:32.520 --> 05:39.420
Because once you have an API, you get nice and clean data that you can use right away in a nice JSON

05:39.420 --> 05:44.400
format, and that's just going to make your life so much easier.
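
NOTE
To illustrate why JSON from an API is easier to work with, here is a small
sketch. The response body is made up for this example and is not the
actual format of Nordstrom's, Yelp's, or any other real API.
```javascript
// Made-up JSON body, shaped like what a product search API might return.
const responseBody = '{"products":[{"title":"Blue Jacket","price":120},{"title":"Red Scarf","price":35}]}';
// One call turns the text into ordinary objects; no HTML selectors needed.
const { products } = JSON.parse(responseBody);
const titles = products.map((product) => product.title);
console.log(titles); // prints [ 'Blue Jacket', 'Red Scarf' ]
```
Getting the same titles out of HTML would mean finding the right elements
first, and that code tends to break whenever the page layout changes.
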

05:44.700 --> 05:51.390
That being said, there's always going to be a time and place to use something like request and parse the HTML

05:51.390 --> 05:55.050
or something like Puppeteer and parse the HTML.

05:55.050 --> 05:57.420
And of course I'll also show you how to do that.

05:57.660 --> 06:03.060
But keep in mind: API first, then request,

06:03.690 --> 06:08.340
then Puppeteer or another automated browser solution.

06:09.980 --> 06:12.070
So that's it for this lesson.

06:12.080 --> 06:14.210
I hope you got something out of it.

06:14.210 --> 06:17.210
Remember the order of things to try.

06:17.330 --> 06:19.700
And you'll be.

06:19.700 --> 06:20.240
You'll.

06:20.240 --> 06:20.930
You'll be good.

06:20.930 --> 06:24.260
You'll be on your way to becoming a web scraping master.

06:24.260 --> 06:26.420
So see you in the next section.
