WEBVTT

00:00.430 --> 00:07.600
So today I had a question from a student in the Web scraping course, John, who is asking me.

00:07.600 --> 00:15.280
He's trying to scrape a website, but when he tries to get the URL of an image using jQuery and request,

00:15.310 --> 00:21.460
he gets back this base64 URL, which doesn't really make any sense.

00:21.460 --> 00:29.920
It's not the full image that's encoded in base64, it just seems like it's some random base64 he gets

00:29.920 --> 00:36.100
back compared to when he looks inside of the actual site using Chrome.

00:36.790 --> 00:43.000
So if we go inside Chrome developer tools and let's say that we want to extract the images from this

00:43.000 --> 00:50.710
news site, abc.net.au, and we try to look at the images here that we want to get.

00:51.630 --> 00:57.540
Now we see here we have the image which is inside this featured card.

00:58.410 --> 01:04.380
Um, and there is a source here with the image URL.

01:04.410 --> 01:06.240
Let's try and click on that.

01:08.140 --> 01:09.070
And get it selected.

01:09.070 --> 01:09.360
Okay.

01:09.370 --> 01:13.510
I've selected it now, and that shows the image.

01:13.510 --> 01:19.960
So that's what John, the student from the course, is trying to achieve. He's trying to get all of the

01:19.960 --> 01:24.310
images we have on this news site.

01:24.310 --> 01:30.370
But the problem, John writes, is that when he tries to do it with request, he gets

01:30.370 --> 01:35.020
back these base64 URLs instead, which he can't use for anything.

01:35.020 --> 01:37.690
You can't even decode this

01:37.690 --> 01:38.260
base64.

01:38.260 --> 01:40.570
It doesn't make sense, basically.

01:41.080 --> 01:42.850
So what is happening here?

01:42.850 --> 01:47.290
Well, sometimes a website acts a little differently

01:47.920 --> 01:55.270
when you don't have JavaScript enabled, or if the user agent is different from the one you

01:55.270 --> 01:56.560
have inside Chrome.

01:57.970 --> 02:00.460
So there's two things you can do.

02:00.880 --> 02:03.850
Well, there are actually more than two things you can do.

02:04.820 --> 02:12.430
Um, the first thing you can try is to change or set the User-Agent header in request.

02:12.580 --> 02:20.260
So you add this options object, then you add the headers property, and then you set the user agent.

02:20.980 --> 02:23.790
Now, what should you set as the user agent string?

02:23.800 --> 02:28.450
Well, if you Google something like "what is my user agent",

02:28.480 --> 02:35.650
then it just comes up right here from Google, and you can just copy that string and then paste it inside

02:35.650 --> 02:35.920
here.

02:35.920 --> 02:38.830
And then you should have the same user agent.
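
NOTE
A minimal sketch of the options object described above, assuming the request package is installed.
The URL and the User-Agent string below are placeholders, not values from the video.
Paste in the string your own browser reports. buildRequestOptions is a hypothetical helper name.
```javascript
// Hypothetical helper: builds the options object that request() accepts.
function buildRequestOptions(url, userAgent) {
  return {
    url: url,
    headers: {
      // Send the same User-Agent header that the browser sends.
      'User-Agent': userAgent,
    },
  };
}
// Placeholder values, to be replaced with the real site and browser string.
const options = buildRequestOptions(
  'https://example.com/news',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
console.log(options.headers['User-Agent']);
// With the request package this would be used as:
//   request(options, (error, response, html) => { /* parse html here */ });
```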

02:39.040 --> 02:42.130
Now, in this case, it didn't make a difference.

02:42.130 --> 02:45.010
Sometimes it works, but in this case it didn't.

02:45.190 --> 02:53.530
So what you can do then is inspect what data you are actually getting back from the

02:53.530 --> 02:55.900
news site, or whatever site you're trying to scrape.

02:56.410 --> 02:58.900
So let's go ahead and try to do that now.

02:59.140 --> 03:08.530
So to do that, I'm going to take the HTML we get with request, the result variable here, and save

03:08.530 --> 03:11.620
it into a file using the fs module.

03:11.950 --> 03:20.960
So let's go in here and say const fs = require('fs'), and that's just the file system module that we can

03:20.960 --> 03:22.970
all use inside of Node.js.

03:23.420 --> 03:29.930
So here we can say something like fs.writeFile, and I just use writeFileSync because it's simpler

03:29.930 --> 03:30.650
to write.

03:30.740 --> 03:34.880
And then I can say something like ./index.html.

03:36.290 --> 03:41.870
And then we pass in the data, which is going to be the result from the request we made.

03:41.870 --> 03:46.030
So that's just HTML that we fetch using request.
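
NOTE
A rough sketch of the file-writing step described above, using only Node.js built-ins.
The result string is a stand-in for the HTML that request returns, and saveHtml is a hypothetical helper name.
The video writes to ./index.html; a temp folder is assumed here so no existing file gets overwritten.
```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
// Hypothetical helper: dump the fetched HTML to disk so it can be opened in a browser.
function saveHtml(html, filePath) {
  fs.writeFileSync(filePath, html);
  return filePath;
}
// Stand-in for the HTML string that the request call would hand back.
const result = '<html><body><img src="data:image/gif;base64,AAAA" data-src="https://example.com/a.jpg"></body></html>';
// Write the stand-in HTML to index.html inside the system temp folder.
const savedPath = saveHtml(result, path.join(os.tmpdir(), 'index.html'));
console.log('Saved fetched HTML to', savedPath);
```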

03:46.040 --> 03:51.020
And then afterwards we pass it into cheerio so that we can select elements.

03:51.230 --> 03:57.410
But anyway, now let's try and run the file again and see what kind of HTML we are actually getting

03:57.410 --> 04:04.700
back, and see how it is different compared to what we see inside of our Chrome browser when we look

04:04.730 --> 04:06.740
at the live version on the internet.

04:07.760 --> 04:12.350
So let's try and run the file with node index.js.

04:14.130 --> 04:20.340
And then out here, I can now see that the index.html file has been created.

04:20.340 --> 04:23.610
Let's try and run it again just so you can see how it's been created.

04:24.180 --> 04:30.540
I run the index.js file again, and now we can see the HTML file is being created by the

04:30.630 --> 04:31.920
fs module.

04:32.340 --> 04:36.570
So let's try and inspect this HTML a little bit.

04:36.930 --> 04:40.500
You can inspect it inside VS Code,

04:41.160 --> 04:48.390
but it doesn't really tidy up the HTML. It's raw HTML we are seeing here.

04:48.390 --> 04:51.270
It's not formatted very well.

04:51.270 --> 04:58.320
So we can also try to reveal it in our file explorer and open it up with our browser.

04:58.320 --> 05:04.980
In this case, I'm going to use Chrome, and this is how the page looks when I load it up inside

05:04.980 --> 05:05.760
of Chrome.

05:06.910 --> 05:10.450
Now, this is where the images are supposed to be.

05:10.450 --> 05:13.840
And I guess something is missing here.

05:13.840 --> 05:16.500
It could be JavaScript, it could be something else.

05:16.510 --> 05:20.590
Or maybe it's just not fetching the images for some reason.

05:21.790 --> 05:22.630
Um, yeah.

05:22.630 --> 05:28.960
So it seems like some things might be blocked because of CORS, or maybe some JavaScript was

05:28.960 --> 05:33.310
supposed to fetch the images, whatever.

05:33.310 --> 05:41.980
But we can still try to inspect where this image was by using the element selector,

05:41.980 --> 05:49.000
this cursor we have up here, and we can click where the image was supposed to be.

05:49.000 --> 05:53.770
So I click on this div here and go one step further inside.

05:53.770 --> 05:57.520
And I can see already here there is an image.

05:58.400 --> 06:05.210
So here's the image, and let's see how it compares to the original site. I can see that

06:05.360 --> 06:12.860
when I click here, there is this same image again, but there is a difference now.

06:13.220 --> 06:20.630
So if I try to see them side by side here, a crowd of people wave their hands while attending an open

06:20.630 --> 06:24.140
air church rally, the same one as we have here.

06:24.880 --> 06:35.470
We can see that the src here has a URL, a good usable URL for the image, but inside the HTML fetched

06:35.470 --> 06:43.750
with request it has this data:image base64 encoding I talked about before, which we can't use.

06:45.060 --> 06:53.190
But if we look closer here, we can see that the data-src attribute actually has the proper URL.

06:53.400 --> 06:56.400
Now, I don't know why this happens.

06:56.640 --> 06:59.280
Um, we can also see some noscript going on here.

06:59.280 --> 07:03.840
So they tried to make it work if you don't have JavaScript enabled.

07:04.500 --> 07:04.800
Um.

07:06.110 --> 07:07.760
But something went wrong here.

07:07.760 --> 07:09.650
Maybe because it's a local file.

07:09.680 --> 07:13.370
We can also try disabling JavaScript on this page.

07:14.750 --> 07:18.350
So let's try and disable JavaScript

07:19.770 --> 07:24.900
on the page and see how the HTML looks if you do that.

07:24.900 --> 07:29.880
So I have this Quick JavaScript Switcher. If I click on that

07:31.340 --> 07:32.690
And disable it.

07:35.540 --> 07:36.770
Let's try and see.

07:38.200 --> 07:41.800
So now let's disable it. I'm going to pin it here so I can see that it's disabled.

07:41.800 --> 07:43.240
It's red now.

07:43.270 --> 07:44.380
Now it's green.

07:44.410 --> 07:46.210
JavaScript is enabled.

07:46.210 --> 07:47.020
Now it's red.

07:47.050 --> 07:48.430
It's disabled now.

07:48.730 --> 07:55.030
And you can see, well, the page still looks the same, but JavaScript has been disabled and this is

07:55.030 --> 08:02.200
the same way as request will get the page, because JavaScript is not executed by request in Node.js.

08:02.530 --> 08:11.890
So if we take a look at this image now, we can see the src is still the same, and

08:11.980 --> 08:18.130
so it still looks the same as our regular JavaScript-enabled version of Chrome.

08:18.220 --> 08:26.260
But for some reason or another, the HTML actually fetched by Node.js still looks different.

08:26.290 --> 08:26.880
Why?

08:26.890 --> 08:27.820
I don't know.

08:28.510 --> 08:38.180
But we can see here that the data-src attribute actually has the URL of the image, so we can still

08:38.180 --> 08:43.310
use this fetched HTML instead of using Puppeteer.

08:44.590 --> 08:49.330
If all else fails, if you can't get this image

08:50.120 --> 08:57.470
for some reason through Node.js request, you can fall back to Puppeteer, but I would strongly recommend

08:57.470 --> 09:03.230
avoiding Puppeteer as much as possible because it does take more resources and is more prone to

09:03.230 --> 09:05.000
crashing and so on and so on.

09:05.150 --> 09:11.210
Of course it still works fine if you have to use it at some point, but try to use Node.js request as

09:11.210 --> 09:12.210
much as possible.

09:12.230 --> 09:18.410
Now, after all that blabbering, let's try and see if we can actually make this work and fetch some images.

09:18.590 --> 09:23.750
So let's try and go for the data-src attribute instead in here.

09:24.260 --> 09:33.470
So inside of our Node.js index file, instead of going for the src attribute, which worked inside the Chrome browser,

09:33.470 --> 09:38.030
we're going to use the data-src attribute instead.
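
NOTE
A small sketch of the attribute switch described above, written without dependencies so it runs on its own.
It assumes the attribute is spelled data-src, and pickImageUrl and the sample URLs are hypothetical.
With cheerio, the attributes object of each matched img element (el.attribs) could be passed in.
```javascript
// Hypothetical helper: choose a usable URL from an img element's attributes.
// `attribs` is a plain object mapping attribute names to values.
function pickImageUrl(attribs) {
  // Prefer data-src, which held the real URL in the fetched HTML.
  if (attribs['data-src']) {
    return attribs['data-src'];
  }
  // Fall back to src unless it is an inline base64 placeholder.
  const src = attribs.src || '';
  if (src === '' || src.startsWith('data:')) {
    return null;
  }
  return src;
}
// Made-up attribute sets standing in for the img elements on the page.
const images = [
  { src: 'data:image/gif;base64,AAAA', 'data-src': 'https://example.com/crowd.jpg' },
  { src: 'https://example.com/logo.png' },
  { src: 'data:image/gif;base64,AAAA' },
];
// Keep only the images that ended up with a usable URL.
const urls = images.map((attribs) => pickImageUrl(attribs)).filter(Boolean);
console.log(urls);
// With cheerio this could look like:
//   $('img').each((i, el) => console.log(pickImageUrl(el.attribs)));
```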

09:39.490 --> 09:49.090
Let's run node index.js, and here we can see all of the actual URLs of the images that we want.

09:52.780 --> 09:53.320
Here we go.

09:53.320 --> 09:54.220
So we can see

09:54.220 --> 09:58.330
now we can get the images just like we wanted to.
