WEBVTT

00:00.200 --> 00:01.310
Hood element.

00:02.660 --> 00:05.630
The text and that should be it.

00:05.660 --> 00:10.940
Now we can just add it as a property onto our result object.

00:11.060 --> 00:14.270
And then let's try and run it inside Node.

00:17.950 --> 00:19.440
And let's see.

00:19.450 --> 00:25.180
So some of the job listings don't have a neighborhood on them, but that's okay.

00:25.180 --> 00:27.940
We just get an empty string instead.

00:28.330 --> 00:32.650
But if I go up, I can see already that some of the data is filled out.

00:32.650 --> 00:36.520
We have Los Altos here, San Jose South and so on.

00:37.030 --> 00:40.210
Now, I hope you got to the same point here as I did,

00:40.210 --> 00:42.070
if you did try to do the exercise.

00:42.070 --> 00:48.130
But there's one more thing I'm going to add that I haven't talked about yet, which

00:48.130 --> 00:50.410
is to clean up our data a little bit.

00:51.610 --> 00:55.810
And this is something you typically have to do when you're doing web scraping.

00:55.810 --> 00:58.690
Sometimes you have to denoise the data

00:58.690 --> 01:00.370
you're scraping a little bit.

01:00.550 --> 01:06.670
In this case, I mean that there is a space in front of the neighborhood and we have these parentheses

01:06.670 --> 01:07.240
here.

01:07.570 --> 01:11.320
So how are we going to remove this noise from our data?

01:11.320 --> 01:17.240
I'm going to show you now, so I'm going to copy this string here as an example and go into the console

01:17.240 --> 01:19.960
and show you how we can denoise it.

01:20.560 --> 01:23.320
So let's make an example here.

01:24.040 --> 01:26.560
And it's going to be this one with Los Altos.

01:26.830 --> 01:34.570
And to remove the space in the front and the back, we can use the trim function inside of JavaScript.

01:35.500 --> 01:38.440
Oh, it's called example, not examples.

01:39.100 --> 01:44.430
So by using trim, we remove the space in front here.

01:44.440 --> 01:47.230
So we just get the string without spaces.
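
NOTE A minimal sketch of the trim step, assuming the scraped text looks like " (Los Altos)" with a leading space. The exact sample string is an assumption for illustration, not taken from the course code.
```javascript
// Hypothetical sample value, assumed to match what the scraper returned.
const example = " (Los Altos)";
// trim() removes whitespace from both ends and leaves the parentheses alone.
const trimmed = example.trim();
console.log(trimmed); // (Los Altos)
```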

01:48.520 --> 01:53.020
But you also need to remove the parentheses we have here at the beginning and the end.

01:53.060 --> 01:54.430
Now, how do we do that?

01:54.460 --> 01:57.910
Well, we can use something called replace instead.

01:58.090 --> 02:03.670
So I'm going to call another function, chaining the functions, and say replace.

02:04.420 --> 02:09.310
And then I will put in the character it should find and the one it should replace it with.

02:09.340 --> 02:12.700
So it's going to replace it with an empty string.

02:13.760 --> 02:18.470
Now, if I run it again, I can see that the front parenthesis has been removed.

02:18.920 --> 02:25.010
Now we also need to get the back parenthesis, the one here, so I can just call the replace function

02:25.010 --> 02:25.670
again.

02:28.170 --> 02:29.100
Just like that.

02:29.100 --> 02:35.700
And now I have much cleaner data without the spaces or the parentheses.
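
NOTE The chained calls described above could look roughly like this. The sample string and the variable names are assumptions. With a string pattern, replace() only swaps the first match, which is enough here because each parenthesis appears once.
```javascript
// Hypothetical sample value with a leading space and surrounding parentheses.
const rawExample = " (Los Altos)";
const cleanedExample = rawExample
  .trim()            // drop the whitespace at both ends
  .replace("(", "")  // remove the front parenthesis
  .replace(")", ""); // remove the back parenthesis
console.log(cleanedExample); // Los Altos
```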

02:37.280 --> 02:39.440
So I'm just going to copy this.

02:40.300 --> 02:41.540
Let's copy that.

02:41.560 --> 02:42.550
I will just.

02:43.430 --> 02:49.370
Take it from here and then we'll just put it next after the text here.

02:50.730 --> 02:51.960
And there we go.

02:51.960 --> 02:54.300
So now I have these chained functions here.

02:54.300 --> 02:59.130
First I get the text of the hood element, then the trim and then the replace.

02:59.370 --> 03:04.110
So now I should have much cleaner data from the website instead.

03:05.550 --> 03:08.580
And now let's try and run it and see how it looks.

03:13.850 --> 03:17.810
And we can see that some of them still do not have a neighborhood.

03:17.810 --> 03:24.410
But up here we can see some of them have Los Altos without any spaces or parentheses.
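
NOTE A rough sketch of how the same chain behaves across several listings, including one with no neighborhood. The helper name cleanNeighborhood and the sample values are hypothetical. An empty string passes through the chain unchanged, so listings without a neighborhood stay empty.
```javascript
// Hypothetical helper wrapping the same trim-and-replace chain.
function cleanNeighborhood(rawText) {
  return rawText.trim().replace("(", "").replace(")", "");
}
// Assumed sample values: two listings with a neighborhood and one without.
const rawHoods = [" (Los Altos)", "", " (San Jose South)"];
const cleanedHoods = rawHoods.map(cleanNeighborhood);
console.log(cleanedHoods);
```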

03:24.590 --> 03:33.740
So nice clean data is what we usually want to get instead of leaving it up to others in the later

03:33.740 --> 03:35.480
layers, or until

03:35.600 --> 03:37.940
we are doing something with our data, maybe.

03:38.780 --> 03:40.820
Anyway, that's it for now.

03:40.820 --> 03:47.780
And in the next section we are going to look at how we are going to go through all of these different

03:47.780 --> 03:56.690
job URLs and go into their job description and get the job description and perhaps also the compensation

03:56.690 --> 03:57.800
that they are filling out.

03:58.040 --> 04:00.710
So I'll see you in the next section.
