WEBVTT

00:01.130 --> 00:09.590
Okay, so on to the juicy part, which is actually scraping the data we want from these description

00:09.590 --> 00:10.580
pages.

00:11.570 --> 00:13.340
Let's see how we can do that.

00:13.340 --> 00:18.470
So we actually need to do pretty much the same thing we do up in the index page.

00:18.470 --> 00:22.100
We're just going to get some different selectors instead.

00:22.460 --> 00:27.380
So I'm just going to be lazy and copy this.

00:28.930 --> 00:31.480
From here down to here.

00:32.620 --> 00:37.840
And then we have this nice jQuery selector we can use for the page.

00:38.290 --> 00:46.870
Let's try and run it, just to open the browser so we can look in the HTML and find our selector to

00:46.870 --> 00:47.890
get the data.

00:53.260 --> 00:56.650
So we got a description page up here.

00:56.860 --> 00:59.740
Oh, I closed the window.

00:59.750 --> 01:01.660
I was not supposed to do that.

01:09.510 --> 01:11.880
So we got a page here.

01:12.030 --> 01:17.130
I'm just going to open it in a new tab so that

01:19.250 --> 01:21.440
it's not going to load a new page

01:21.440 --> 01:30.140
instead while I'm looking for a selector. I'm going to open the good old developer tools here and

01:30.740 --> 01:32.030
select an element.

01:32.030 --> 01:39.440
And we're just going to go straight for the money, guys, and select the currency and the value we got

01:39.440 --> 01:40.070
here.

01:41.090 --> 01:48.950
And so I'm just going to right click this one, say copy, copy selector, then we're going to paste

01:48.950 --> 01:54.800
it inside of the console and we can see the selector here.

01:54.800 --> 02:02.120
So there are no real defining classes or IDs I can use to select exactly

02:02.120 --> 02:07.340
this element. It's just random classes you see here, random names.

02:07.550 --> 02:17.030
But I think the most efficient way to do this is to get rid of all the random class names and we can

02:17.030 --> 02:25.710
still use this long list of parent-child selectors and go through the DOM tree to actually get our value.

02:25.980 --> 02:30.360
I think that's a simpler way to do it,

02:30.360 --> 02:37.710
rather than taking all of the HTML and using something like regular expressions to look for something

02:37.710 --> 02:39.780
that looks like this text here.

02:39.960 --> 02:47.910
I think the simpler way to do it is to just go and look for this tree and find exactly the

02:47.910 --> 02:50.040
child element you want to get.

02:50.280 --> 02:52.110
So let me show you how we do that.

02:52.140 --> 02:58.770
We delete all of these weird names you see here with the dot before it.

02:58.770 --> 03:01.200
So this one, delete that.

03:01.700 --> 03:07.820
And you also want to delete this one because they're randomly generated for each of the descriptions.

03:07.820 --> 03:11.450
So we don't want to select by class names.

03:11.450 --> 03:14.000
We just want to select by elements instead.

03:14.360 --> 03:19.610
So make sure to delete all of these class names with the dot in front.

03:19.610 --> 03:20.660
This one.

03:20.810 --> 03:21.830
This one.

03:23.380 --> 03:24.580
And this one.

03:24.580 --> 03:29.440
And you should end up with something that looks similar to this

03:30.970 --> 03:31.960
selector here.

03:31.960 --> 03:38.530
Just a lot of divs with the child selector here, and some spans at the end.
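
NOTE
A minimal sketch of the clean-up described above, not the lecture's own code. The helper name stripClassNames and the example selector are made up for illustration; the real selector comes from "Copy selector" in the developer tools.
```javascript
// Hypothetical helper: removes every ".className" token from a copied selector,
// leaving element names, ids and the ">" child combinators in place.
function stripClassNames(selector) {
  return selector
    .replace(/\.[A-Za-z0-9_-]+/g, "") // drop each ".class" token
    .replace(/\s+/g, " ") // collapse any leftover whitespace
    .trim();
}
// Made-up input that only imitates the shape of a copied selector.
const copied = "#room > div._abc123 > div > div._x9z > span._q1 > span";
console.log(stripClassNames(copied));
```
This simple pattern assumes every class is attached to an element name, as in div._abc123. A selector part that is only a class would be removed entirely and leave a dangling ">".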

03:39.100 --> 03:41.050
And we copy that.

03:43.560 --> 03:49.830
And let's just see if our jQuery, I mean, our CSS selector is working.

03:49.830 --> 03:54.720
So we just do this and we can see it's working fine.

03:54.720 --> 04:01.170
I'm getting the price in here, so I'm just going to, well, copy this

04:01.950 --> 04:12.780
From the console and paste it into our script description page URL function we have here and let's call

04:12.780 --> 04:15.930
it price per night.

04:19.270 --> 04:29.170
And then let's do a good old fashioned console log and let's see how that's turning out.
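
NOTE
A rough sketch of the step just described, under stated assumptions: $ is a jQuery-style function (for example from cheerio.load(html)), and priceSelector stands in for the cleaned-up selector copied from the console. The function name getPricePerNight is hypothetical.
```javascript
// Assumption: `$` behaves like jQuery, so $(selector).text() returns the
// combined text of the matches, or "" when nothing matches.
function getPricePerNight($, priceSelector) {
  const pricePerNight = $(priceSelector).text().trim();
  // Log it so each description page shows its price (or a blank) in the terminal.
  console.log(pricePerNight);
  return pricePerNight;
}
```
An empty string here means the element was missing or had no text yet when the HTML was captured.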

04:29.170 --> 04:33.220
So now we should be getting prices for each of these rooms, right?

04:33.250 --> 04:34.510
Price per night.

04:37.760 --> 04:38.570
Let's see.

04:43.650 --> 04:45.880
Well, now it's showing dollars over here.

04:45.880 --> 04:47.440
I don't know why.

04:47.710 --> 04:48.670
That's weird.

04:49.420 --> 04:50.800
That's really weird.

04:51.400 --> 04:55.570
Anyway, you can see we got the first price of this one.

04:55.570 --> 04:57.340
We got this one also.

04:57.760 --> 04:58.960
The other one was blank.

04:58.990 --> 05:02.050
You notice that here we get the price.

05:03.960 --> 05:05.670
And here we got the price.

05:05.670 --> 05:07.380
There was one that was blank.

05:08.160 --> 05:10.260
And here we got the price.

05:11.640 --> 05:12.780
And here's a blank.

05:12.810 --> 05:13.560
Another blank.

05:13.560 --> 05:14.870
But it was 350.

05:14.880 --> 05:15.900
You saw that briefly.

05:15.900 --> 05:16.890
Here's another blank.

05:16.890 --> 05:17.520
Right.

05:17.640 --> 05:19.710
So why are we getting blanks here?

05:19.710 --> 05:25.560
Well, the reason we're getting blanks is because the page has not been fully loaded yet.

05:26.970 --> 05:27.780
Now.

05:28.740 --> 05:29.280
Excuse me.

05:29.280 --> 05:30.180
Sorry for that.

05:30.450 --> 05:32.940
And now Puppeteer has something.

05:32.940 --> 05:36.000
I'm just going to go away from this.

05:36.600 --> 05:40.950
Puppeteer has something that is called waitForSelector.

05:41.370 --> 05:45.810
So you could use something like await puppeteer.

05:46.690 --> 05:48.930
Sorry, I mean await

05:48.930 --> 05:58.530
page.waitForSelector, and you pass in the CSS selector like this inside.

05:59.490 --> 06:08.340
But unfortunately, we can't do that because the element is actually being created,

06:08.340 --> 06:14.190
but it's not being filled out until later by the asynchronous JavaScript.
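
NOTE
A small sketch of why waiting for the selector is not enough here, assuming page is a Puppeteer Page and priceSelector is the cleaned-up selector. The function name readPriceAfterSelector is hypothetical.
```javascript
// Assumption: `page` is a Puppeteer Page that has already navigated to a description page.
async function readPriceAfterSelector(page, priceSelector) {
  // Resolves as soon as a matching element exists in the DOM,
  // even if the page's JavaScript has not filled in its text yet.
  await page.waitForSelector(priceSelector);
  // $eval runs the callback inside the page against the first matching element.
  return page.$eval(priceSelector, (el) => el.textContent.trim());
}
```
Because the element is created before its text arrives, this can still return an empty string, which matches the blanks seen in the output.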

06:15.090 --> 06:17.580
So what can we do instead?

06:17.610 --> 06:28.440
Well, we can do a trick here where we pass an option into page.goto.

06:28.440 --> 06:40.740
We pass in an option saying waitUntil, so it waits until the network has no more than two connections

06:40.740 --> 06:42.990
for at least half a second.

06:44.230 --> 06:50.200
Let me just copy that straight from the Puppeteer documentation, so you can see how it looks.

06:50.500 --> 06:57.010
So waitUntil networkidle2 is going to consider navigation to be finished when there are no

06:57.010 --> 07:00.760
more than two network connections for at least half a second.

07:01.150 --> 07:04.390
You can also use networkidle0.

07:04.390 --> 07:10.900
That's when there are no more than zero network connections for at least 500 milliseconds.

07:10.900 --> 07:18.820
But networkidle2 is a little faster, and it's going to work fine for what we are getting.
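
NOTE
A minimal sketch of the waitUntil option discussed above, assuming page is a Puppeteer Page and url is one description page URL. The function name openDescriptionPage is hypothetical, and returning the HTML with page.content() is an assumption about what happens next.
```javascript
// Assumption: `page` is a Puppeteer Page, `url` points at a description page.
async function openDescriptionPage(page, url) {
  // "networkidle2": navigation counts as finished once there have been no more
  // than 2 network connections for at least 500 ms. "networkidle0" is the
  // stricter variant that waits for 0 connections.
  await page.goto(url, { waitUntil: "networkidle2" });
  // Hand back the rendered HTML so a selector can be run against it.
  return page.content();
}
```
Swapping in "networkidle0" only changes the option string. It waits longer, so the faster networkidle2 is the reasonable default here.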

07:22.620 --> 07:26.670
So that's going to make it only consider the page to be loaded once

07:26.670 --> 07:31.320
we're not fetching lots of data using JavaScript

07:32.960 --> 07:35.300
inside of Chromium, of course.

07:35.990 --> 07:42.770
Now, let's see if we get lots of blank prices now or if we are actually getting some good values.

07:43.220 --> 07:43.820
So.

07:43.820 --> 07:47.180
720 And.

07:48.080 --> 07:49.400
900.

07:55.810 --> 07:59.500
So it looks more promising now, right?

08:00.820 --> 08:08.640
So it's going to be a little slower, but, of course, you get a lot better data out of the

08:08.650 --> 08:09.490
result.

08:10.900 --> 08:18.100
Okay, everyone, now in the next lecture, we are going to get the rest of the values we have inside

08:18.100 --> 08:19.210
of the rooms.
