WEBVTT

00:02.030 --> 00:09.980
Today I got a question: how do you handle bad network connectivity when you try to run the scraper?

00:10.010 --> 00:17.390
If you live somewhere with a bad Internet connection, or if the server is a flaky one that

00:17.390 --> 00:22.730
sometimes drops the connection, then this is what you want to do.

00:22.760 --> 00:30.380
You want to get a package called requestretry, which works just like the request

00:30.380 --> 00:31.970
or request-promise packages we used before,

00:31.970 --> 00:38.320
except that it retries the request whenever the connection is dropped.

00:38.330 --> 00:44.330
So you want to get requestretry.

00:46.690 --> 00:56.980
So we'll just run yarn add requestretry or npm install requestretry to add this package, and then

00:56.980 --> 01:01.000
just make a request like this.

01:03.490 --> 01:06.940
Just like we did before, but calling requestretry.

01:07.420 --> 01:10.630
Then I will comment out the one we used before.

01:12.160 --> 01:13.150
Like so.

01:14.560 --> 01:23.290
Now, requestretry by default gives you a full response object with several properties,

01:23.320 --> 01:30.490
one of them being the body, which is the HTML string we actually want here.

01:32.380 --> 01:39.700
That means that if we want to use the code as it is now, we would have to write .body

01:39.730 --> 01:42.760
to make sure we load the HTML string.

01:43.090 --> 01:52.630
But since I think it's silly to edit every place where we do that, we can use the defaults option

01:52.960 --> 02:08.770
and set fullResponse to false, which makes it return only the HTML string.

02:09.280 --> 02:14.680
Now let me find the requestretry

02:14.710 --> 02:17.290
documentation. Here it is, if I'm correct.

02:17.410 --> 02:19.960
So fullResponse is by default

02:19.990 --> 02:20.410
true.

02:20.440 --> 02:23.950
We set it to false, which means we only get the HTML string.

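NOTE
A minimal sketch of the setup described above, assuming the requestretry package is installed. The helper name makeRequest is hypothetical and not from the video; defaults, fullResponse, maxAttempts and retryDelay are options listed in the requestretry documentation.
```javascript
// Hypothetical helper: takes the requestretry module and returns a request
// function preconfigured with the options discussed in the video.
function makeRequest(requestretry) {
  return requestretry.defaults({
    fullResponse: false, // resolve with the body (the HTML string) only
    maxAttempts: 5, // the package default, written out for clarity
    retryDelay: 5000, // milliseconds between attempts (package default)
  });
}
```
Possible usage, inside an async function: const request = makeRequest(require("requestretry")); const html = await request(url);
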
02:24.520 --> 02:30.280
I'm going to link to the GitHub repository so you can read up on the documentation for this package.

02:30.310 --> 02:31.150
Of course.

02:32.680 --> 02:37.360
But let's just try it and see how this goes.

02:37.750 --> 02:42.760
So now we have request set to requestretry.

02:43.090 --> 02:51.490
I have tested this out with a tool called clumsy, which drops some packets.

02:51.760 --> 03:00.940
What I see if I run it with the normal request that we used before is that sometimes the connection

03:00.940 --> 03:08.900
drops and it doesn't get the data at all, whereas requestretry is going

03:08.900 --> 03:10.820
to retry the request.

03:11.030 --> 03:17.990
By default, it makes up to five attempts, with a delay of five seconds between them.

03:17.990 --> 03:24.050
So it's going to wait five seconds and then try again, up to five times.

03:24.050 --> 03:26.570
But if you want to make it try more,

03:26.600 --> 03:29.210
you can also set maxAttempts to something else.

03:29.210 --> 03:31.490
You can set it to 1, or 10,

03:31.520 --> 03:32.750
I don't know.

03:33.350 --> 03:42.290
You can also set the delay, retryDelay, to something longer or shorter, depending on what you find

03:42.320 --> 03:43.610
is the best option.

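NOTE
A rough sketch of the retry idea behind maxAttempts and retryDelay. This is not requestretry's actual implementation; fetchWithRetry and fetchOnce are hypothetical names used only for illustration.
```javascript
// fetchOnce is a hypothetical async function that throws when the
// connection is dropped and resolves with the HTML string otherwise.
async function fetchWithRetry(fetchOnce, { maxAttempts = 5, retryDelay = 5000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fetchOnce(); // success: stop retrying
    } catch (error) {
      lastError = error; // remember the failure
      if (attempt < maxAttempts) {
        // wait retryDelay milliseconds before the next attempt
        await new Promise((resolve) => setTimeout(resolve, retryDelay));
      }
    }
  }
  throw lastError; // every attempt failed
}
```
A larger maxAttempts tolerates a worse connection, at the cost of waiting longer before giving up.
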
03:45.350 --> 03:56.450
Now, you could also turn this loop into a sort of sequential, one-at-a-time loop, meaning that

03:56.450 --> 03:59.700
you would only run one request at a time.

03:59.720 --> 04:05.950
You wouldn't run multiple requests at once, depending on how your connection is.

04:06.280 --> 04:10.630
I think that should be your last resort if you want to make that work.

04:11.110 --> 04:19.540
It's going to be really slow if you do it one job at a time instead of running multiple jobs at once.

04:19.540 --> 04:28.750
But if you want to see how you could do it, I have added the code to the GitHub repository

04:28.750 --> 04:33.730
for the Craigslist scraper, in the branch handling network errors.

04:34.600 --> 04:44.950
In there, inside the index.js file, you can see a function I call scrape description

04:44.950 --> 04:49.120
not concurrent, which uses a for loop instead.

04:49.270 --> 04:51.730
And the rest of the code is the same.

04:51.730 --> 04:58.480
But since it's a for loop with an await inside, it's going to run just one request at a time.

04:58.480 --> 05:02.140
But like I said before, it's going to be super slow.

05:02.140 --> 05:08.660
So I recommend that only as the last option if requestretry doesn't work out for you.

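NOTE
A small sketch of the one-request-at-a-time loop described above, next to a concurrent version for comparison. The function names and the fetchHtml parameter are hypothetical; the code in the course repository may look different.
```javascript
// fetchHtml is any async function that takes a URL and resolves with an
// HTML string, for example a preconfigured requestretry function.
async function scrapeSequentially(urls, fetchHtml) {
  const pages = [];
  for (const url of urls) {
    // await inside the loop: the next request starts only after this one ends
    pages.push(await fetchHtml(url));
  }
  return pages;
}
// For comparison: the concurrent version starts every request at once.
async function scrapeConcurrently(urls, fetchHtml) {
  return Promise.all(urls.map((url) => fetchHtml(url)));
}
```
Both return the pages in the same order as the URLs; the sequential version is slower but puts less load on a weak connection.
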
05:09.980 --> 05:15.710
Okay, so that was how to handle bad network connectivity in five minutes.

05:15.710 --> 05:22.460
I hope this helped you a bit and that you can get further in the wonderful world of web scraping.
