WEBVTT

00:01.600 --> 00:02.160
Okay.

00:02.160 --> 00:11.360
Now, in the previous lecture, we covered how to find hidden paths using a list.

00:11.360 --> 00:12.160
Okay.

00:12.200 --> 00:17.920
Sometimes it happens that you don't have a list and you want to get some very useful information from

00:17.920 --> 00:18.760
a website.

00:18.800 --> 00:20.000
Something like this.

00:20.040 --> 00:20.640
Okay.

00:20.680 --> 00:28.480
So I'm using Mutillidae because I am allowed to use my tools on this,

00:28.480 --> 00:33.440
and I'm not allowed to use them on other websites; doing that is not good.

00:33.480 --> 00:34.440
Okay.

00:34.480 --> 00:46.560
So here we need to create a program that extracts some useful data from this website.

00:46.560 --> 00:56.240
For example, here you see when I right-click here and go to Inspect, I see the HTML source code of

00:56.240 --> 00:57.960
this website or web page.

00:59.080 --> 01:05.050
And you see all the HTML code from top to bottom.

01:05.810 --> 01:11.370
And in here also we have some very informative data, some useful data.

01:11.370 --> 01:14.450
For example, I searched for href.

01:14.490 --> 01:18.890
That is a link, okay, a link to other pages.

01:19.090 --> 01:27.050
Here you see we have links for CSS files and also for JavaScript.

01:27.330 --> 01:31.810
If I hit Enter a few times, let me see here.

01:31.810 --> 01:36.850
Now you see that we get a link for index.php.

01:36.850 --> 01:45.770
And we also have a link for the login page, and a lot of other things that we really need.

01:45.770 --> 01:52.970
And here now I can use a regex and also my program to extract all these links and put them in a file,

01:52.970 --> 01:57.570
or print them into the terminal so I can use them.

01:57.570 --> 01:58.370
Okay.

01:58.410 --> 02:04.980
So let's come here. I am going to create a new file here.

02:06.500 --> 02:12.460
I'm going to name this crawler2.py

02:12.500 --> 02:13.100
this time.

02:13.940 --> 02:17.620
And also I need some of this code from here.

02:17.620 --> 02:21.660
So I copy that and then I use Ctrl+V to paste it here.

02:22.220 --> 02:26.260
And then I remove this because I don't need to open anything.

02:26.260 --> 02:29.900
And also I need to have this one and also the link.

02:30.140 --> 02:30.580
Okay.

02:31.540 --> 02:36.940
So now I'm trying to access that web page.

02:37.140 --> 02:38.780
So very easy.

02:38.820 --> 02:47.260
We just write: response is going to be equal to request, okay.

02:49.420 --> 02:53.900
Request, actually, because we have a function called request.

02:53.900 --> 02:57.340
And then in here I'm going to give the URL.

02:57.340 --> 03:01.260
So the URL is already provided right here.

03:01.340 --> 03:02.940
So I give it here.

03:03.180 --> 03:04.740
Now I have the response.

03:05.260 --> 03:09.230
And if I just print the response.

03:09.710 --> 03:13.870
So you will see that I have something like this.

03:17.110 --> 03:19.070
I use 2 this time and hit Enter.

03:19.350 --> 03:21.510
I have a response code of 200.

03:21.550 --> 03:23.630
That means I am able to open this.

03:24.150 --> 03:26.830
What if I

03:27.030 --> 03:29.190
want to see the content of this?

03:29.390 --> 03:32.590
So I have content.

03:32.590 --> 03:34.590
I get response.content.

03:35.630 --> 03:41.870
Now you see that it gives me all the HTML code of this web page.
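
NOTE
A minimal sketch of the fetch step described above, not the lecture's actual code.
It assumes the request helper from the previous lecture wraps the third-party
requests library; the fetch name and its get parameter are hypothetical, added
only so the sketch can be tried without touching a real site.
```python
def fetch(url, get=None):
    """Return (status_code, content_bytes) for url."""
    if get is None:
        # Assumption: requests is installed and is what the lecture's helper uses.
        import requests
        get = requests.get
    # A timeout keeps the call from hanging forever on an unresponsive host.
    response = get(url, timeout=10)
    # status_code is 200 when the page opened; content holds the raw HTML as bytes.
    return response.status_code, response.content
```
Calling fetch(url) with the URL of a host you are allowed to test should give
back the status code and the same bytes that response.content shows.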

03:42.550 --> 03:43.190
Perfect.

03:44.110 --> 03:47.110
Here I have all of that.

03:47.110 --> 03:52.590
But the most important thing is the link that you see here.

03:52.750 --> 03:55.390
I want to access all the links.

03:55.670 --> 04:02.910
So to do that I need to import another module, which is called re, or regex.

04:02.950 --> 04:03.350
Okay.

04:04.550 --> 04:09.920
So I think in one of the videos we covered regex.

04:10.920 --> 04:12.800
You may know how to use it.

04:12.800 --> 04:13.160
Okay.

04:13.720 --> 04:22.240
So to get only the href values, or links, I'm going to create a variable.

04:22.240 --> 04:24.520
I'm going to name it http links.

04:24.600 --> 04:35.960
It is going to be equal to, and I'm going to use this, okay, re.findall.

04:36.280 --> 04:41.880
And in here I'm going to provide a regex to get all the links.

04:42.160 --> 04:49.080
So the first thing that I want is to separate href.

04:49.080 --> 04:56.760
And also some of the columns. You see we have href here and then we have a comma.

04:56.760 --> 04:58.280
And also we have something here.

04:58.280 --> 05:01.040
And then we have not comma.

05:01.080 --> 05:02.800
We have this double quote okay.

05:02.840 --> 05:05.360
And also we have double quote here.

05:05.360 --> 05:09.200
So we need to provide something very useful like that.

05:09.560 --> 05:09.800
So.

05:12.240 --> 05:15.160
The first thing is that I want to group these.

05:15.200 --> 05:15.560
Okay.

05:15.960 --> 05:19.320
So I'm going to search for href.

05:19.360 --> 05:23.400
And then this sign, which is the equals sign, and then a double quote.

05:23.440 --> 05:26.040
This is one of the things that I want.

05:26.080 --> 05:31.920
And also after that I want to search for any character, or all characters.

05:31.960 --> 05:32.920
This is the other group.

05:33.240 --> 05:38.320
And in here I'm going to search for everything okay.

05:40.080 --> 05:44.840
And here I'm going to give that double quote at the end.

05:44.880 --> 05:49.480
So now you see that we have href and also this one.

05:49.480 --> 05:51.760
And also we have everything which is inside that.

05:51.760 --> 05:53.920
And then we have that double quote here.

05:54.160 --> 05:57.240
So this will come and search for everything.

05:57.240 --> 05:58.880
But it is greedy okay.

05:58.920 --> 06:05.160
It will include everything from top to bottom, up to the last quotation mark.

06:05.200 --> 06:07.600
If we have any quotation mark inside this,

06:07.640 --> 06:10.200
it is not going to stop at that, okay.

06:10.240 --> 06:12.040
It is going to move on up to the end.

06:12.850 --> 06:16.850
And to remove that, we need to use a question mark.

06:17.410 --> 06:21.250
This will stop this from being greedy.

06:21.250 --> 06:24.490
And also the end like this.
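
NOTE
A small sketch of the greedy versus non-greedy behaviour described above.
The sample HTML string and the exact pattern are assumptions made for
illustration; the pattern typed in the lecture may be grouped differently.
```python
import re
# Hypothetical sample standing in for the page source.
html = '<a href="index.php">Home</a> <a href="login.php">Login</a>'
# Greedy: .* runs on to the LAST double quote, so both links end up in one match.
greedy = re.findall(r'href="(.*)"', html)
# Non-greedy: .*? stops at the FIRST double quote after href=", one match per link.
lazy = re.findall(r'href="(.*?)"', html)
print(greedy)
print(lazy)
```
The only difference between the two patterns is the question mark after the star.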

06:24.650 --> 06:30.610
So this is the findall method.

06:30.610 --> 06:36.010
And also, because it is regex, I don't need to print response.content.

06:36.050 --> 06:42.010
Here I am going to give response.content as the second argument.

06:42.010 --> 06:48.370
So let's use response.content.

06:49.210 --> 06:53.610
Now instead of printing response.content, I'm going to print only the links.

06:59.490 --> 07:00.010
Okay.

07:01.090 --> 07:02.370
Use Ctrl+S.

07:02.370 --> 07:07.810
And also let me come here and re-execute this.

07:07.810 --> 07:15.820
Now you see that we are having a beautiful error here: cannot use a string pattern on a bytes-like object.

07:15.940 --> 07:16.340
Okay.

07:18.780 --> 07:19.460
Perfect.

07:19.500 --> 07:27.380
We have this one, and you may know that this response.content is not a string here, okay.

07:27.460 --> 07:28.500
It is a bytes-like object.

07:28.500 --> 07:36.620
And we are not able to use a string pattern with a bytes-like object, okay.

07:36.660 --> 07:40.860
So we have this. If I just use

07:43.580 --> 07:50.860
str here, now it is changed to a string, okay.

07:52.020 --> 07:54.740
And now both are strings.

07:54.740 --> 07:57.420
And a string with a string works properly.
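
NOTE
A sketch of the error and the fix described above, using made-up bytes in
place of response.content. Option 1 mirrors the str() fix from the lecture;
options 2 and 3 are alternatives that are not from the lecture and are shown
only for comparison.
```python
import re
# Hypothetical bytes standing in for response.content.
content = b'<a href="index.php">Home</a> <a href="login.php">Login</a>'
pattern = r'href="(.*?)"'
# A str pattern against bytes raises TypeError.
try:
    re.findall(pattern, content)
    error = None
except TypeError as exc:
    error = str(exc)
# Option 1 (the lecture's fix): str() gives the printable b'...' form of the
# bytes, which still contains the href="..." parts, so the pattern matches.
links_str = re.findall(pattern, str(content))
# Option 2 (alternative): decode the bytes into real text first.
links_decoded = re.findall(pattern, content.decode("utf-8", errors="replace"))
# Option 3 (alternative): keep the bytes and use a bytes pattern instead.
links_bytes = re.findall(rb'href="(.*?)"', content)
print(error)
print(links_str, links_decoded, links_bytes)
```
Options 1 and 2 return a list of strings; option 3 returns a list of bytes.
Decoding may be the cleaner choice when the extracted text is used later,
because str() keeps escape sequences such as \n as literal characters.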

07:57.580 --> 08:01.380
And now you see I only have a list, okay.

08:01.420 --> 08:07.020
A list of all links that exist in that page.

08:07.060 --> 08:08.980
And it is a list okay.

08:09.020 --> 08:10.260
So perfect.

08:10.540 --> 08:15.660
And now I'm going to improve this in the next lecture.
