WEBVTT

00:00.260 --> 00:05.780
So now I think it's a good idea for me to show you where we actually run this scraper that we're

00:05.780 --> 00:11.630
going to build because we're not going to run our scraper inside of Chrome developer tools.

00:11.630 --> 00:16.880
We're going to run it as a server-side script inside of NodeJS.

00:17.680 --> 00:24.730
So that we can basically deploy to any server anywhere or run it at any time without having a browser

00:24.730 --> 00:32.440
open and manually pasting or typing our commands into the console tab inside of Chrome developer tools.

00:33.630 --> 00:38.370
So in order to get started on that, you need to open up Visual Studio Code.

00:38.520 --> 00:43.860
Hopefully you already have it installed and make sure you also have NodeJS installed.

00:44.600 --> 00:47.570
So in here, go ahead and make a new directory.

00:47.570 --> 00:50.390
We can call it scraping intro.

00:52.570 --> 00:56.110
And then go inside the scraping intro directory.

00:57.310 --> 01:01.490
And in here you need to run npm init dash dash

01:01.510 --> 01:01.860
yes.

01:01.870 --> 01:05.770
So you can get npm packages and use them from Node.

01:09.160 --> 01:12.910
And then you need to add two packages.

01:12.910 --> 01:19.150
You can either use npm install or you can use yarn add, which I prefer to use.

01:19.510 --> 01:26.710
Then you can say yarn add request and request-promise and then cheerio.

01:29.780 --> 01:34.070
So let me first tell you about these two packages before we move on.

01:34.790 --> 01:43.010
The request package and request-promise are two packages that we use to simply download a website.

01:43.100 --> 01:47.350
It just downloads a website, similar to something like curl.

01:47.360 --> 01:54.060
If you know the curl command for just downloading things, or something like wget from Linux.

01:54.080 --> 01:57.380
It's just a simple downloader for things.

01:57.380 --> 02:00.740
It doesn't parse the site like our browser does.

02:00.770 --> 02:02.570
It just downloads the site.

02:03.800 --> 02:11.120
And cheerio is something we use to be able to use jQuery selectors inside of NodeJS.

02:12.510 --> 02:17.940
So I think it's a good idea just to show you in practice how all of this works.

02:17.970 --> 02:24.780
So go ahead and open the new folder you just created, scraping intro, inside of Visual Studio Code.

02:33.560 --> 02:40.700
Now that you've opened the folder, go ahead and create a new file and let's call it index.js.

02:41.530 --> 02:48.040
And in here we need to import the two modules that we installed using yarn add before.

02:48.100 --> 02:50.380
So we write const

02:51.120 --> 02:52.290
request,

02:52.440 --> 02:53.640
require,

02:54.290 --> 02:56.040
request-promise.

02:58.110 --> 03:01.800
And then say const cheerio, require,

03:02.830 --> 03:03.760
cheerio.

03:05.260 --> 03:11.830
And then we are going to have a main function that we then execute when we run the script so we can

03:11.830 --> 03:14.020
call the async function main.

03:15.580 --> 03:20.020
And then we call main down here below.

03:22.460 --> 03:28.370
And now inside of main, we do a const where we get the HTML of the site.

03:28.490 --> 03:33.770
So we say await request dot get.

03:34.070 --> 03:39.410
And now make sure that you use the URL I provided earlier in the course.

03:44.330 --> 03:48.350
So this is making a get request for this URL.

03:48.380 --> 03:51.230
Now there are different HTTP methods you can use.

03:51.230 --> 03:56.300
You can use something like post or put or update and so on.

03:56.330 --> 04:02.120
The one we use most of the time, just to get a website, is dot get.

04:02.270 --> 04:10.640
So just like you would visit a website, you just do a request dot get to get a site and then we can

04:10.640 --> 04:15.020
well, we can also save the HTML so we can see the HTML.

04:15.020 --> 04:24.620
So we can say const fs, to save a file, require fs, and then we can say fs dot

04:24.650 --> 04:27.050
writeFileSync, and say

04:29.770 --> 04:33.530
test.html and put in the HTML.

04:33.550 --> 04:38.320
Then you can look at what we actually get from this site if we run this script.
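
NOTE
A minimal sketch of the save step described above, not the instructor's exact file. The html string is a stand-in for whatever await request.get(url) resolves to, and writing to the OS temp directory is an assumption made so the sketch can run anywhere; the lesson writes test.html next to index.js.
```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");
// Assumption: stand-in for the string that `await request.get(url)` resolves to.
const html = "<html><body><h1>Hello scraper</h1></body></html>";
// Assumption: the OS temp directory is used here; the lesson saves "test.html" beside index.js.
const outputPath = path.join(os.tmpdir(), "test.html");
// Write the downloaded markup to disk so it can be opened and inspected.
fs.writeFileSync(outputPath, html);
// Read the file back to confirm what was saved.
const saved = fs.readFileSync(outputPath, "utf8");
console.log(saved);
```
Opening the saved file in an editor shows the raw markup the server sent, which is what the selectors later run against.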

04:39.270 --> 04:42.120
Now let's run it inside of the terminal.

04:42.120 --> 04:44.310
So I'm going to go into new terminal.

04:45.170 --> 04:49.100
And then I will run node index.js.

04:50.920 --> 04:56.860
And now you can see it has downloaded the HTML and it has it right in here.

04:57.160 --> 04:58.090
So that is the.

04:58.450 --> 05:02.680
That is the HTML that I have on my site over here.

05:02.710 --> 05:07.840
So basically, it's just getting the page, downloading the HTML.

05:08.170 --> 05:15.370
But how do we also get the data or the text that we have inside here, just like we did with our console

05:15.370 --> 05:16.270
log here?

05:16.750 --> 05:20.020
Well, that is where cheerio comes in handy.

05:20.170 --> 05:22.750
So we can write something like.

05:23.450 --> 05:25.550
Const dollar sign.

05:25.800 --> 05:26.930
Cheerio.

05:27.110 --> 05:27.710
Wait.

05:27.760 --> 05:28.820
Cheerio.

05:29.040 --> 05:32.090
Load and then pass in the HTML.

05:32.450 --> 05:38.420
And now we can select all of the different elements on this site, just like we did inside of a Chrome

05:38.420 --> 05:39.230
browser.

05:39.350 --> 05:41.270
So we can say H1.

05:42.350 --> 05:45.470
Here with the dollar sign dot text.

05:46.430 --> 05:48.860
And we can also save it to a variable.

05:48.860 --> 05:58.490
So the text I have here and then we say console log the text, and then we should see the text of this

05:58.490 --> 05:59.720
H1 element.

05:59.750 --> 06:03.740
If I do a node index.js. And there we go.

06:03.740 --> 06:07.910
Now we see the text of this element that we scraped from the site.

06:08.150 --> 06:13.880
And then of course, there are endless possibilities for what you can do with this data.

06:14.030 --> 06:20.000
Sometimes people want to save it to a database, a MongoDB database or something like that.

06:21.070 --> 06:24.800
Or well, they just want to save it to a CSV file.

06:24.820 --> 06:33.160
There are lots of things that you can do with this data and use it for, but that is basically from

06:33.160 --> 06:36.550
top to bottom of how we build a web scraper.

06:36.580 --> 06:42.640
Of course, things get more complex along the way, but I'm going to take you through different projects

06:42.640 --> 06:44.950
where you can see how we do things.

06:47.600 --> 06:53.900
So the reason why we're doing it inside of NodeJS in the end, even though we actually did it inside

06:53.900 --> 07:00.590
the Chrome developer tools first, is because, well, we are going to run this on a server every

07:00.590 --> 07:04.010
once in a while at a timed interval, maybe once an hour.

07:04.370 --> 07:11.120
If not, then you have to go back to your computer once every hour and paste in the text and code

07:11.120 --> 07:12.350
in the browser.

07:12.500 --> 07:17.480
You're not going to be able to navigate around or go to.

07:17.510 --> 07:24.260
It's not so feasible to write large chunks of code inside of this Chrome developer console.

07:24.950 --> 07:29.900
So that's why we're doing it inside of request or NodeJS.

07:29.930 --> 07:37.160
Also, if you have to save it to a MongoDB database, I mean, you can't really do that inside the console

07:37.160 --> 07:38.540
tab in here.

07:38.810 --> 07:41.150
So that's why we test out things.

07:41.150 --> 07:46.010
We try to work out what works inside the Chrome developer tools.

07:46.010 --> 07:52.640
And then once we get, for example, a selector up and working here, which we see is working in

07:52.640 --> 08:01.610
the console, then we go into our NodeJS code and write it out and execute it and make an actual scraper.

08:02.440 --> 08:08.080
So that's a short intro from top to bottom of how you actually build a scraper.

08:08.110 --> 08:11.520
You just built your first NodeJS web scraper now.

08:11.530 --> 08:12.850
Congratulations.

08:13.540 --> 08:19.750
Now I'm going to take you on to the next lessons where we go through the different CSS selectors

08:19.750 --> 08:21.940
that you can use to select elements.

08:22.210 --> 08:29.080
And yeah, so I'll see you in the next section and I hope you have got something out of this so far.

08:29.110 --> 08:31.840
I hope you've built your first scraper already.
