WEBVTT

00:01.530 --> 00:04.830
So now we created a selector inside of Chrome tools.

00:04.830 --> 00:09.770
We tested that it worked on the website and got all of the data inside the table.

00:09.780 --> 00:14.400
Now we need to move on to NodeJS and get our data inside NodeJS.

00:14.580 --> 00:18.810
So let's first make a folder where we can have our project.

00:18.810 --> 00:27.570
In this case, I'm going to call it Scraping Tables with NodeJS.

00:28.570 --> 00:29.380
Just like that.

00:29.380 --> 00:31.720
And then let's go inside the directory.

00:35.820 --> 00:41.820
And then let's initialize NPM so that we can get some NPM packages that we're going to need for our

00:41.820 --> 00:42.450
project.

00:42.450 --> 00:45.300
So npm init --yes.

00:46.970 --> 00:52.130
And then let's go and add the cheerio and request-promise packages.

00:52.130 --> 00:58.190
So I'm going to use yarn because yarn is faster than NPM, I think, but you can also use NPM if you

00:58.190 --> 00:59.720
prefer to use NPM.

01:00.170 --> 01:02.540
So yarn add request,

01:02.750 --> 01:06.770
request-promise, and then cheerio.

01:12.480 --> 01:13.040
There we go.

01:13.050 --> 01:16.080
Now, all of the packages have been added to the project.

01:16.110 --> 01:19.650
Now let's open it up inside Visual Studio Code.

01:21.780 --> 01:25.620
So here I have the project opened up in Visual Studio Code.

01:25.650 --> 01:32.630
We have just the node_modules folder, which has the packages that we added, the cheerio and request packages.

01:32.640 --> 01:37.950
And then we have the package.json and yarn.lock files, which keep track of the packages that we added.

01:38.070 --> 01:42.570
Let's create the first file now. Let's call it index.js.

01:44.880 --> 01:47.730
And in here we're going to write our web scraper.

01:48.500 --> 01:53.920
So let's first import the packages that we installed with yarn before into the code.

01:53.930 --> 01:57.410
So const request require.

01:57.440 --> 01:59.180
request-promise.

02:00.520 --> 02:01.300
And then.

02:01.300 --> 02:02.170
Cheerio.

02:02.910 --> 02:03.800
Require.

02:03.810 --> 02:04.920
Cheerio.

02:06.190 --> 02:10.580
Okay, so now let's write out the main function where we have our code inside.

02:10.600 --> 02:18.340
So I'm going to use async because it's a nice clean syntax you get when you have async requests

02:18.340 --> 02:19.630
inside NodeJS.

02:19.780 --> 02:23.590
So I'm going to use async function and let's call it main.

02:24.400 --> 02:28.900
And in here we have our main code to write the web scraper inside.

02:29.200 --> 02:35.140
Then I execute the function down here at the bottom of the code to run it.
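
NOTE
A minimal sketch of the async main pattern described above. The fakeRequest helper and the example.com URL are assumptions for illustration only; in the lesson the awaited call is request.get from request-promise.
```javascript
// Hypothetical stand-in for an async HTTP call; it resolves with a fake HTML string.
function fakeRequest(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`<html>response for ${url}</html>`), 10);
  });
}
// The main function holds the scraper code; await pauses here until the promise resolves.
async function main() {
  const result = await fakeRequest('https://example.com/table');
  console.log(result);
  return result;
}
// Execute the function at the bottom of the file.
const done = main();
```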

02:36.010 --> 02:42.460
So let's first make the request for the website to get the HTML of this page so we can do something

02:42.460 --> 02:52.840
like const result = await, and we can say request.get, and then we type in the page for the table

02:52.840 --> 02:53.890
that we want to scrape.

02:53.890 --> 02:57.130
In this case https coding with Stefan.

02:58.090 --> 03:00.910
And then we say table example.

03:02.530 --> 03:05.920
Okay, so this just gets the HTML from the page.

03:05.920 --> 03:10.150
It's sort of like curl, if you know curl, or wget,

03:10.180 --> 03:14.410
if you're from the Linux world. It simply just gets the web page.

03:14.410 --> 03:16.840
It's just an HTTP request we're doing.

03:16.960 --> 03:19.390
So we can't parse the site yet.

03:19.390 --> 03:24.070
We can't select elements from it yet, but that's what we're going to use Cheerio for.

03:24.520 --> 03:31.360
So Cheerio is sort of like the jQuery interaction that we did on the page before with the extension.

03:31.750 --> 03:38.460
So we can simply say const and then we assign it to a dollar sign variable just for easy readability.

03:38.470 --> 03:44.770
So we say const $ = cheerio.load, and we pass in the HTML string that we get.

03:44.890 --> 03:51.160
In this case the HTML string is inside the result variable that we got from request.

03:52.930 --> 03:54.130
So there we go.

03:54.130 --> 03:57.320
Now we can select elements on the page.

03:57.340 --> 04:03.160
In this case, let's use the selector that we have from our Chrome tools and paste it in here.

04:08.020 --> 04:13.300
So we can actually just copy this code that we have inside Chrome tools, the one where we get a line

04:13.300 --> 04:19.510
for each name and just copy that and let's just paste it inside the code here in NodeJS.

04:23.430 --> 04:29.670
And now let's try and run the code inside NodeJS and see if we get the same result as we did in Chrome

04:29.670 --> 04:31.830
Tools with the name on each line.

04:32.070 --> 04:38.790
So I'm going to go into the terminal in Visual Studio Code, new terminal here, and we're going to

04:38.790 --> 04:39.420
say.

04:40.200 --> 04:42.660
node index.js.

04:45.110 --> 04:45.860
And there we go.

04:45.860 --> 04:50.630
We can see all of the names just like we have inside of Chrome tools.

04:52.340 --> 04:52.640
Okay.

04:52.640 --> 04:56.600
So let's do a quick recap on what this code is doing here.

04:56.720 --> 05:05.330
So first we import the packages request and cheerio. Request is simply doing an HTTP request of a given

05:05.330 --> 05:05.810
URL.

05:05.840 --> 05:07.820
It just simply fetches the data.

05:07.850 --> 05:09.230
It doesn't do anything else.

05:09.230 --> 05:16.940
But Cheerio is enabling us to select elements and parse the HTML so we can select elements on it.

05:16.940 --> 05:20.630
In this case, we want to get the table data, and that's what we can use

05:20.660 --> 05:21.890
Cheerio for.

05:22.130 --> 05:29.450
It loads the HTML page, parses it, and then we get this jQuery selector as a result, just like we

05:29.450 --> 05:31.160
have inside of Chrome tools.

05:33.190 --> 05:39.700
And we can simply paste in the same code that we have from the Chrome tools and run it inside NodeJS.

05:43.080 --> 05:49.470
Now, the reason why we want to have this inside of NodeJS is because then we can make an API that people

05:49.470 --> 05:53.520
can request from to get scraped results.

05:53.550 --> 05:58.890
We can do it in an automated way instead of pasting in code in Chrome tools.

05:58.920 --> 06:05.070
We can save our data inside CSV files or we can save it in a database and so on.
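
NOTE
One possible sketch of saving scraped rows to a CSV file using only Node built-ins. The rows, the column names, and the scraped-table.csv file name are hypothetical; the lesson itself does not show this step.
```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
// Hypothetical scraped rows; real rows would come from the cheerio selection.
const rows = [
  { name: 'Alice', country: 'Denmark' },
  { name: 'Bob, Jr.', country: 'Sweden' },
];
// Quote a field when it contains a comma, a quote, or a line break, doubling any quotes inside it.
function escapeCsv(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}
// Build a header line followed by one line per record.
function toCsv(records, columns) {
  const header = columns.join(',');
  const lines = records.map((record) => columns.map((column) => escapeCsv(record[column])).join(','));
  return [header, ...lines].join('\n') + '\n';
}
const csv = toCsv(rows, ['name', 'country']);
// Write to the temp directory so the sketch does not touch the project folder.
const outputPath = path.join(os.tmpdir(), 'scraped-table.csv');
fs.writeFileSync(outputPath, csv);
```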

06:05.100 --> 06:10.890
There are a lot more possibilities when we run the scraping process inside NodeJS rather

06:10.890 --> 06:13.110
than just inside the Chrome browser.

06:14.290 --> 06:14.560
Okay.

06:14.560 --> 06:21.250
So now in the next section, let's move on to how we can put this data inside of the data structure

06:21.250 --> 06:24.160
that we talked about in the earlier lessons.
