WEBVTT

00:01.310 --> 00:02.180
Okay, everyone.

00:02.180 --> 00:09.740
So I think it makes sense that I draw things out a little when I talk about the disadvantages

00:09.740 --> 00:16.400
and advantages of the timed method and the on-demand method.

00:16.490 --> 00:17.990
So here we go.

00:17.990 --> 00:23.300
So let's say we have the timed method here.

00:30.070 --> 00:32.110
So the timed method, right.

00:32.470 --> 00:39.640
So that method means that let's say we have 1000 users.

00:40.300 --> 00:43.540
So we have 1000 users.

00:47.220 --> 00:51.120
Lots of happy guys here, 1000 guys.

00:51.540 --> 00:53.960
And they want your data.

00:53.970 --> 00:57.900
They want your data that you have on your scraping website.

00:57.900 --> 01:00.330
So my scraping.

01:03.800 --> 01:04.670
Site.

01:08.460 --> 01:15.840
And obviously when you have 1000 users, you're going to get 1000 requests, right?

01:15.870 --> 01:23.640
So each of these lines is a request from a user going to your site requesting this scraped data.

01:24.300 --> 01:29.940
And, excuse me, this is actually the on-demand method

01:29.980 --> 01:31.050
that we're showing here.

01:31.050 --> 01:32.910
So that's the on-demand method.

01:34.140 --> 01:42.870
So there are 1000 users going to your scraping site, or your site exposing an API, or just the data

01:42.870 --> 01:44.370
from the site you're scraping.

01:44.640 --> 01:50.970
So that means if you have a scraper behind this API or website, you have a scraper here.

01:54.380 --> 02:01.370
And suddenly this scraper is also going to get 1000 requests.

02:01.370 --> 02:09.500
And if you have 1000 users at the same time and this scraper is being run on demand

02:09.500 --> 02:12.080
every time there's a request here.

02:12.290 --> 02:13.970
Well, yeah.

02:13.970 --> 02:20.150
So you're going to have 1000 requests here and this scraper is going to make 1000 requests to whatever

02:20.150 --> 02:28.790
site you're scraping, for example, IMDb, if that's what you're doing, and suddenly your scraper

02:28.790 --> 02:32.720
is going to make 1000 requests as well to this site.

02:33.410 --> 02:41.930
And yeah, obviously that's going to result in a ban just like that.

02:42.470 --> 02:48.680
And so the scraper is going to get banned because you have 1000 requests here.

02:49.570 --> 02:50.190
Right.

02:53.170 --> 02:59.680
And obviously your users are happy, at least for a while, because they're getting the newest

02:59.680 --> 03:02.080
data from the scraper.

03:02.320 --> 03:09.490
But not for long, because your scraper here is going to get banned for making

03:09.490 --> 03:11.980
1000 requests to this site.

03:11.980 --> 03:16.480
So IMDb or whoever you're scraping is going to go like, Whoa, what is this?

03:16.510 --> 03:17.590
What is going on here?

03:17.590 --> 03:21.760
This guy is making 1000 requests to our site.

03:21.880 --> 03:24.340
Uh, he's going to get banned now.

03:25.540 --> 03:27.740
So that's the on-demand method.

03:27.760 --> 03:34.390
This one here, where you just scrape as you get a request, and obviously you can see that doesn't go well

03:34.390 --> 03:37.270
when you're scaling and when you have lots of users.

03:37.630 --> 03:44.350
But if you want the newest data that's on the site, then okay, it's fine, but you can't scale it

03:44.350 --> 03:46.870
up to having 1000 users.
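
NOTE A rough sketch of the on-demand method described above, not the course's own code.
TargetSite, BAN_THRESHOLD and handle_on_demand are hypothetical names, and the ban
threshold of 100 is an assumption for illustration only. It shows that 1000 user
requests turn into 1000 upstream requests and that the toy target bans the scraper.
```python
BAN_THRESHOLD = 100  # hypothetical: real sites do not publish their limits
class TargetSite:
    """Toy model of the scraped site: bans a client after too many requests."""
    def __init__(self):
        self.requests_seen = 0
        self.banned = False
    def get_page(self):
        self.requests_seen += 1
        if self.requests_seen > BAN_THRESHOLD:
            self.banned = True
        if self.banned:
            raise RuntimeError("banned by target site")
        return "<html>fresh data</html>"
def handle_on_demand(target):
    """On-demand method: every user request triggers one upstream scrape."""
    return target.get_page()
target = TargetSite()
served = 0
for _ in range(1000):  # 1000 users, one request each
    try:
        handle_on_demand(target)
        served += 1
    except RuntimeError:
        pass  # the user gets no data once the scraper is banned
print(served, target.requests_seen, target.banned)  # 100 1000 True
```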

03:47.380 --> 03:51.040
So that's where the timed method is a lot better.

03:51.130 --> 03:52.480
So let's see again.

03:52.480 --> 03:54.490
So we have 1000 users, right?

03:59.130 --> 04:00.930
Happy guys again here.

04:02.820 --> 04:07.230
And they make a request to your scraping site.

04:13.740 --> 04:16.290
My scraping site here.

04:21.300 --> 04:22.920
So see lots of requests.

04:22.950 --> 04:23.280
Dada.

04:23.280 --> 04:23.550
Dada.

04:23.550 --> 04:23.820
Dada.

04:23.820 --> 04:24.330
Dada.

04:24.840 --> 04:25.290
Right.

04:25.290 --> 04:28.560
So 1000 requests here from your users to your site.

04:29.370 --> 04:30.990
And, um.

04:31.140 --> 04:32.160
Yeah, so.

04:32.160 --> 04:37.020
But then your site is just going to go and look at a database.

04:37.620 --> 04:38.730
So a.

04:39.640 --> 04:41.440
Some kind of database.

04:43.890 --> 04:54.150
And this could be MongoDB, this could be a MySQL database, any

04:54.150 --> 04:55.250
kind of database.

04:55.260 --> 04:57.180
You're just returning the

04:58.070 --> 05:06.410
database values back to your users, basically, but this database is instead being filled by the

05:06.410 --> 05:06.860
scraper

05:06.860 --> 05:09.590
you're running in a separate location over here.

05:09.830 --> 05:12.950
So you just have a scraper that you run here.

05:14.920 --> 05:16.810
At an interval of

05:19.060 --> 05:19.900
run.

05:21.200 --> 05:26.660
For example, run every hour.

05:29.570 --> 05:30.500
And.

05:31.580 --> 05:36.230
This scraper then goes and looks at, for example, the site we're scraping, like IMDb.

05:41.540 --> 05:50.600
So suddenly your site is only making one request every hour to IMDb and IMDb is not going to care about

05:50.600 --> 05:51.350
that at all.

05:51.350 --> 05:53.150
They wouldn't care about that.

05:53.300 --> 05:59.000
And then once your scraper has run, every hour, it saves the data to the database.

05:59.240 --> 06:03.950
And yeah, so that's how suddenly you can

06:04.820 --> 06:10.880
have lots of users and scale up your scraping site, or whatever you're running.

06:11.390 --> 06:16.910
But obviously the data that we are getting here is one hour old at most.

06:16.910 --> 06:20.660
So your users are not guaranteed to see the newest data.
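
NOTE A minimal sketch of the timed method described above, under stated assumptions.
A plain dict stands in for the database (MongoDB, MySQL, or similar), and
fetch_from_target, run_scraper_once, handle_user_request and run_forever are
hypothetical names. No real HTTP request is made. The point is that 1000 user
requests cost only one request to the scraped site per interval.
```python
import time
# Hypothetical stand-ins: a dict plays the database, and fetch_from_target()
# plays the real HTTP request to the site being scraped (for example IMDb).
database = {}
upstream_requests = 0
def fetch_from_target():
    """Pretend to scrape the target site and count how often we hit it."""
    global upstream_requests
    upstream_requests += 1
    return {"title": "Example Movie", "rating": 8.1}
def run_scraper_once():
    """One scheduled run: scrape, then save the result with a timestamp."""
    database["latest"] = {"data": fetch_from_target(), "scraped_at": time.time()}
def handle_user_request():
    """API handler: only reads the database, never touches the target site."""
    return database["latest"]["data"]
def run_forever(interval_seconds=3600):
    """Scheduler loop (not called here): scrape once per interval."""
    while True:
        run_scraper_once()
        time.sleep(interval_seconds)
# One scheduled run, then 1000 user requests served from the database.
run_scraper_once()
responses = [handle_user_request() for _ in range(1000)]
print(len(responses), upstream_requests)  # 1000 1
```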

06:22.010 --> 06:29.810
If you want to use this on-demand scraping method instead, you would have to have some kind of limit

06:29.810 --> 06:40.850
on this layer here and maybe say, hey, you can only run so many scrapers at a time, and

06:40.850 --> 06:44.190
obviously it's not the best method to scale anyway.
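
NOTE One possible way to put a limit on the on-demand layer, as suggested above.
This is a sketch only: MAX_CONCURRENT_SCRAPES, scrape_target and
handle_on_demand_request are hypothetical names, and the limit of 5 is an assumption.
It uses a semaphore from the Python standard library to cap how many scrapes run at once.
```python
import threading
MAX_CONCURRENT_SCRAPES = 5  # hypothetical limit, tune it for the target site
scrape_slots = threading.BoundedSemaphore(MAX_CONCURRENT_SCRAPES)
def scrape_target():
    """Placeholder for the real on-demand scrape."""
    return {"title": "Example Movie"}
def handle_on_demand_request():
    """Scrape only if a slot is free; otherwise ask the user to retry."""
    if not scrape_slots.acquire(blocking=False):
        return {"error": "too many scrapes running, try again later"}
    try:
        return scrape_target()
    finally:
        scrape_slots.release()  # always free the slot, even if scraping fails
```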

06:44.210 --> 06:49.220
So for a lot of use cases, this is fine, I think.

06:49.400 --> 06:49.820
Um.

06:50.840 --> 06:52.310
For personal use cases,

06:52.310 --> 06:57.710
maybe you would prefer just to have the newest data from the website.

06:57.980 --> 07:02.210
So keep that in mind and

07:03.300 --> 07:08.400
try to work out which method best fits the project you're trying to do.

07:08.850 --> 07:10.440
One thing to keep in mind.

07:10.470 --> 07:18.450
Maybe you're clever now, or you think you're clever, and think, okay, well, hey, why don't

07:18.450 --> 07:21.640
we just run the scraper on the website?

07:21.660 --> 07:29.730
I mean, on the client side instead of having it on our API?

07:30.120 --> 07:32.420
Yeah, that's not going to go so well.

07:32.430 --> 07:41.190
I had one student asking me about this, and basically all modern browsers have something called CORS,

07:41.220 --> 07:49.860
which restricts cross-origin requests to other sites unless those sites allow them.

07:49.860 --> 07:56.460
So if they go to your site, my scraping site dot com or whatever,

07:57.640 --> 08:01.720
and the client is suddenly making a request to IMDb,

08:01.960 --> 08:05.200
the Chrome browser is not going to allow that.

08:05.440 --> 08:07.960
So client-side scraping

08:08.870 --> 08:11.000
is not really possible.

08:21.580 --> 08:29.480
It's not really possible because of the CORS restrictions that all modern browsers enforce.

08:29.500 --> 08:33.970
So I would not say that this is a feasible solution.

08:34.210 --> 08:41.380
Unless, of course, your users are willing to install, for example, a Chrome extension

08:41.380 --> 08:47.920
that disables the CORS checks, which I don't think is going to happen.

08:49.710 --> 09:00.990
One way to perhaps avoid this issue,

09:01.500 --> 09:04.800
you might think, could be to make a proxy.

09:04.800 --> 09:13.020
So the client thinks that it's making a request to your API, but your API is sending the request further

09:13.020 --> 09:21.210
on to IMDb. But you're going to run into the same issue again, which is that you have a ton of

09:21.210 --> 09:24.900
requests from your API to IMDb.

09:25.350 --> 09:30.150
You just took out the scraping there and put it on the client side instead.

09:30.540 --> 09:36.420
So, client-side scraping: I don't recommend it.

09:36.780 --> 09:41.550
Please don't do it unless you're doing it for personal projects or something like that.

09:42.480 --> 09:47.940
Stick it at the API level instead and try to use the timed method.

09:48.300 --> 09:53.910
I think that's the best sort of solution for most cases.

09:54.090 --> 09:59.910
Anyway, I hope that you got something out of this and in the next sections we are going to look at

09:59.910 --> 10:07.480
creating an actual API, saving the data to a MongoDB database from our scraper and so on.

10:07.500 --> 10:09.300
So see you in the next sections.
