WEBVTT

00:00.380 --> 00:06.440
Okay, everyone. In this lecture, we're just going to build a really simple scraper, and we're

00:06.440 --> 00:11.470
going to save the scraped data to a MongoDB database.

00:11.480 --> 00:13.940
Specifically, we're going to use the free

00:15.690 --> 00:19.240
MongoDB database that's called mLab.

00:19.260 --> 00:23.400
So let's take a look at how this works.

00:24.180 --> 00:27.770
So let's build a folder first for our scraper.

00:27.780 --> 00:30.210
Let's call it the Reddit scraper.

00:31.140 --> 00:38.650
And then inside of this Reddit scraper folder, as usual, we are going to run npm init --yes.

00:40.890 --> 00:45.300
And then we're going to open the folder inside of Visual Studio Code.

00:48.970 --> 00:49.930
So.

01:00.660 --> 01:06.930
So now I have a basically empty folder with the package.json file inside.

01:07.170 --> 01:14.760
And the first thing we need to do is add cheerio, request, and request-promise to build

01:14.760 --> 01:16.920
our super simple scraper.

01:16.930 --> 01:20.400
And then we need to add Mongoose.

01:22.070 --> 01:28.760
So let me just try and enlarge this text so you guys can see what I'm actually typing.

01:28.760 --> 01:29.030
So.

01:29.060 --> 01:29.690
Cheerio.

01:29.720 --> 01:30.380
Request.

01:30.380 --> 01:32.360
Request-promise, Mongoose.

01:33.050 --> 01:42.860
And of course I need to type in yarn add, and you can also type npm install if you prefer to use npm.

01:46.780 --> 01:51.980
So now I've added the packages, and you can see them here.

01:52.000 --> 01:55.660
So let's write our index.js file.

01:57.630 --> 01:59.220
And we type in.

01:59.990 --> 02:01.760
Mongoose.

02:06.030 --> 02:06.600
And.

02:06.600 --> 02:07.290
Cheerio.

02:16.370 --> 02:17.870
And request.

02:22.850 --> 02:23.930
Just like that.

02:24.050 --> 02:30.440
And then we will write out our main function to do the scraping inside.

02:33.770 --> 02:39.410
And we're going to scrape Reddit, but it's going to be a super simple scraper.

02:39.410 --> 02:43.380
So in case you don't know Reddit... why is it so slow?

02:43.400 --> 02:52.220
Okay, so we're just going to get the headlines of each of these articles and um, yeah, that's going

02:52.220 --> 02:53.060
to be it for now.

02:53.060 --> 02:57.200
And then we're going to save this data to our MongoDB database.

02:57.200 --> 02:59.030
So let's see.

02:59.030 --> 03:05.540
So each of these headlines here is an h2 tag, so it's easy for us to select it.

03:05.660 --> 03:08.360
So let's go ahead and try to do that.

03:08.540 --> 03:14.300
So const html equals await request.get.

03:17.220 --> 03:18.240
Reddit.

03:21.760 --> 03:23.020
Then we have.

03:23.560 --> 03:24.340
Cheerio.

03:26.440 --> 03:35.350
So we load in the HTML so we can parse it, and then we simply select all the titles.

03:36.950 --> 03:44.810
By doing an h2 selection, and then we can go through each of these titles.

03:47.360 --> 03:50.900
And do something with them.

03:57.240 --> 03:57.540
Uh.

04:00.540 --> 04:01.620
Mac keyboards.

04:02.280 --> 04:02.940
Um.

04:05.060 --> 04:07.280
So let's just type console.

04:07.390 --> 04:09.440
Now I want to save the title.

04:09.710 --> 04:11.450
So we have the title.

04:12.930 --> 04:15.690
Which is the element.

04:17.220 --> 04:18.030
Text.

04:18.420 --> 04:24.330
And then I just want to see if we are getting the title of all of these Reddit articles.

04:24.810 --> 04:26.370
So that is it.

04:26.370 --> 04:28.800
Then we call the main function here.
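
NOTE
A minimal sketch of what index.js might look like at this point. The lecture does not show the file, so the details are assumptions: the request variable is taken to come from the request-promise package, and the URL is taken to be https://www.reddit.com. The collectTitles helper is a hypothetical name, added so the h2 selection can be tried on its own without a network call.
```javascript
// Hypothetical helper: collects the text of every <h2> in a loaded cheerio document.
function collectTitles($) {
  const titles = [];
  $("h2").each((index, element) => {
    // $(element).text() returns the text inside the tag; trim() drops surrounding whitespace.
    titles.push($(element).text().trim());
  });
  return titles;
}
async function main() {
  // Required here rather than at the top so collectTitles works without the packages installed.
  const cheerio = require("cheerio");
  const request = require("request-promise");
  // Assumed URL; request-promise resolves with the response body as a string.
  const html = await request.get("https://www.reddit.com");
  const $ = cheerio.load(html);
  for (const title of collectTitles($)) {
    console.log(title);
  }
}
// main(); // uncomment to scrape the live site (needs the packages and network access)
```
With the last line uncommented, node index.js should print one title per line, provided the page still serves its headlines as h2 elements.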

04:33.010 --> 04:35.560
Let's try and run it.

04:45.410 --> 04:52.640
Okay, so now we get all of the titles of these different Reddit articles.

04:54.100 --> 04:58.870
Um, so you can see, for example, this one with Vladimir Putin in it.

04:58.870 --> 05:02.500
You see the same article in here.

05:02.920 --> 05:09.160
So now we get that. That is a super simple scraper that we were able to build here.

05:09.370 --> 05:19.180
But, um, now we need to be able to save it to our MongoDB database, and that's going to happen

05:19.180 --> 05:20.950
in the next lecture.
