WEBVTT

0
00:00.480 --> 00:01.230
In this lesson,

1
00:01.230 --> 00:06.230
we're going to get started using a library called Beautiful Soup to parse an HTML

2
00:06.930 --> 00:08.280
file. Parsing

3
00:08.280 --> 00:13.280
the HTML file is the first step to extracting the data contained in a website. To

4
00:13.620 --> 00:17.160
get started, the first thing I want you to do is to head over to

5
00:17.160 --> 00:21.480
the course resources and download the starting project for today.

6
00:21.930 --> 00:26.010
It's called bs4-start and once you've extracted it and opened it in

7
00:26.010 --> 00:30.600
PyCharm, then I want you to take a look inside. First, we've got an empty

8
00:30.600 --> 00:31.800
main.py file

9
00:31.830 --> 00:36.810
which we're going to write in order to use Beautiful Soup and extract the data

10
00:36.810 --> 00:41.670
that we want. And the data is going to come from this website.html.

11
00:42.270 --> 00:46.590
Now, a really nice thing about PyCharm is when you have an HTML file,

12
00:46.620 --> 00:49.620
you can always just click on your favorite browser icon here

13
00:50.160 --> 00:52.770
and it will open up the website as it is.

14
00:53.430 --> 00:58.430
And you can see that this is a simplified version of our HTML CV page that we

15
00:59.130 --> 01:02.190
built on day 41. Now,

16
01:02.220 --> 01:05.550
if you've skipped days 41 to 44

17
01:05.580 --> 01:08.070
because you already know HTML and CSS,

18
01:08.640 --> 01:11.190
then have a quick read through the HTML.

19
01:11.520 --> 01:15.420
It's a very simple document with a bunch of different HTML tags,

20
01:15.840 --> 01:18.030
attributes like class and ID,

21
01:18.570 --> 01:23.190
and also a list. I've kept this as simple as possible

22
01:23.220 --> 01:25.560
just so that it's easy for us to work through it

23
01:25.890 --> 01:27.600
when we're trying to get hold of things

24
01:27.630 --> 01:30.840
using Beautiful Soup. In the course resources,

25
01:30.900 --> 01:34.380
I've also got a link to the Beautiful Soup documentation.

26
01:34.860 --> 01:39.120
So this is where you can find out everything that you can do with Beautiful 

27
01:39.120 --> 01:43.920
Soup. But I wanna walk you through some of the most commonly used components.

28
01:44.610 --> 01:45.360
Beautiful Soup,

29
01:45.360 --> 01:50.360
as they say, is a Python library for pulling data out of HTML and XML files.

30
01:51.420 --> 01:56.190
So HTML and XML are both structural languages and they're responsible for

31
01:56.190 --> 02:00.960
structuring data like the data in a website using these tags.

32
02:01.680 --> 02:03.330
And the great thing about Beautiful Soup

33
02:03.380 --> 02:07.850
<v 1>is it's super easy to use, and it can save you hours or</v>

34
02:07.910 --> 02:12.910
<v 0>days of work to get hold of the data that you want from a particular website.</v>

35
02:13.520 --> 02:18.410
And you can see the documentation's also been translated by kind users into some

36
02:18.440 --> 02:21.320
other languages as well. So if you find it easier,

37
02:21.590 --> 02:24.170
then these languages might help you as well. Now,

38
02:24.170 --> 02:29.170
the first thing I have to do here in my main.py is to actually get hold of

39
02:30.290 --> 02:32.270
this particular file.

40
02:32.840 --> 02:35.600
Now you might remember from previous lessons,

41
02:35.660 --> 02:38.240
how we open a file in Python.

42
02:38.900 --> 02:43.900
So have a quick think and see if you can figure out how to get a hold of the

43
02:44.330 --> 02:49.330
content in this HTML file as a string or as a piece of text.

44
02:50.060 --> 02:53.840
Pause the video and see if you can complete this challenge. Just as a hint,

45
02:53.960 --> 02:57.380
you might need the keyword with and the 

46
02:57.380 --> 02:58.330
<v 1>function open.</v>

47
03:01.030 --> 03:04.660
<v 0>All right. So the way we do this is we say with open,</v>

48
03:05.020 --> 03:09.010
and then we provide the name of the file, which is website.html.

49
03:09.520 --> 03:13.540
And we can open this as an alias name which we'll call file,

50
03:14.140 --> 03:18.580
and now we have access to this file and we can say file.read().

51
03:19.240 --> 03:21.010
Now, once we've read this file,

52
03:21.040 --> 03:24.730
we can save this to a variable

53
03:24.730 --> 03:29.590
which we'll call contents. And once we've gotten hold of these contents,

54
03:29.770 --> 03:32.740
then we can start using Beautiful Soup.
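
NOTE
A minimal sketch of the file-reading step just described. The real website.html
comes from the course resources, so SAMPLE_HTML and the temporary folder below
are stand-ins (assumptions) that let the snippet run on its own.
```python
import tempfile
from pathlib import Path
# Assumption: a tiny stand-in for the course's website.html.
SAMPLE_HTML = "<html><head><title>Sample CV</title></head><body><h1>Hello</h1></body></html>"
# Write the stand-in into a temporary folder so the sketch is self-contained.
html_path = Path(tempfile.mkdtemp()) / "website.html"
html_path.write_text(SAMPLE_HTML, encoding="utf-8")
# The step from the lesson: open the file under the alias `file`, then read it into `contents`.
with open(html_path, encoding="utf-8") as file:
    contents = file.read()
print(type(contents).__name__)  # the whole document as one piece of text
```
In the actual project, open("website.html") on its own should be enough, since
main.py sits in the same project as the HTML file.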

55
03:33.460 --> 03:36.940
So as always, we start with the import,

56
03:37.120 --> 03:42.120
so the name of the library that we're going to install is called bs4

57
03:43.600 --> 03:45.850
and this is Beautiful Soup version four

58
03:45.970 --> 03:49.210
which is currently the latest version of Beautiful Soup.

59
03:49.870 --> 03:53.620
And from this particular package, we want to import the class

60
03:53.620 --> 03:57.130
which is called BeautifulSoup. Now,

61
03:57.160 --> 04:00.460
if you downloaded this project from the course resources,

62
04:00.820 --> 04:05.820
you should see that there's no errors underlining this line because bs4 has

63
04:06.430 --> 04:09.610
already been installed. If you see some squiggly underlines,

64
04:09.670 --> 04:13.660
just click on the red light bulb and install the required module

65
04:13.750 --> 04:17.230
which is called bs4. Now,

66
04:17.230 --> 04:20.380
once we've got hold of our BeautifulSoup class

67
04:20.500 --> 04:24.220
then we're ready to make soup. In order to make soup,

68
04:24.280 --> 04:29.280
we use our BeautifulSoup class and we create a new object from that class

69
04:29.770 --> 04:34.360
and we pass in a string. So this is the markup,

70
04:34.720 --> 04:39.460
and that's the same M that you find in HTML and XML.

71
04:39.760 --> 04:44.760
It's the hypertext markup language and the extensible markup language.

72
04:47.170 --> 04:51.580
So the markup refers to basically all of this.

73
04:52.570 --> 04:55.480
We've already gotten hold of it through this contents

74
04:55.720 --> 04:57.520
so let's go ahead and pass that in.

75
04:58.300 --> 05:02.680
So now that we've specified what it is we want to use to create our soup,

76
05:03.370 --> 05:06.340
the next thing we have to provide is the parser.

77
05:06.850 --> 05:09.370
This is going to help the BeautifulSoup module

78
05:09.430 --> 05:14.430
understand what language this particular content is structured in.

79
05:15.370 --> 05:18.490
As they mentioned, it can parse HTML and XML

80
05:18.700 --> 05:22.660
so we have to tell it what particular type of document we've got.

81
05:23.260 --> 05:26.920
And the easiest way is just to use the Python

82
05:27.040 --> 05:28.420
html.parser.

83
05:29.650 --> 05:34.030
So after we've passed in the text that we want to turn into soup,

84
05:34.600 --> 05:39.600
we're going to add the parser as html.parser.

85
05:41.740 --> 05:45.940
And this is going to help Beautiful Soup understand these contents.
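
NOTE
A minimal sketch of the import and the "make soup" line described above. The
markup string is a stand-in (an assumption) for the contents read from
website.html. On PyPI the package is published as beautifulsoup4, and it is
imported under the name bs4.
```python
from bs4 import BeautifulSoup
# Assumption: stand-in markup; in the lesson this string comes from reading website.html.
contents = "<html><head><title>Sample CV</title></head><body><h1>Hello</h1></body></html>"
# First argument: the markup. Second argument: the name of the parser to use.
soup = BeautifulSoup(contents, "html.parser")
print(type(soup).__name__)
```
Naming the parser explicitly keeps the result the same from one machine to the
next, because Beautiful Soup would otherwise pick whichever parser is installed.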

86
05:47.200 --> 05:50.590
Now, depending on the website that you're working with, occasionally,

87
05:50.590 --> 05:55.390
you might need to use lxml's parser. And to do that,

88
05:55.480 --> 05:59.990
all you have to do is say import lxml

89
06:00.290 --> 06:05.290
and then install this particular package, and here in a string

90
06:06.290 --> 06:10.730
instead of using html.parser, you can use lxml.

91
06:11.450 --> 06:16.400
And this is basically just a different way of parsing or understanding the

92
06:16.400 --> 06:18.380
content that you're passing to Beautiful Soup.

93
06:18.890 --> 06:23.750
And I found that with certain websites the html.parser might not work

94
06:23.750 --> 06:26.990
and you might get an error that tells you something about your parser not

95
06:26.990 --> 06:31.070
working. So then you might consider using lxml instead.
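
NOTE
One possible way to switch parsers, sketched under assumptions: make_soup is a
hypothetical helper, not something from the lesson. It asks for the lxml parser
and falls back to html.parser when lxml is not installed. Having the lxml
package installed is what matters here; Beautiful Soup selects it from the
string "lxml".
```python
from bs4 import BeautifulSoup, FeatureNotFound
def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to Python's built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not available.
        return BeautifulSoup(markup, "html.parser")
# Assumption: stand-in markup instead of a real website.
soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```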

96
06:32.450 --> 06:36.710
So this one line of code basically completes our parsing.

97
06:37.280 --> 06:42.280
And this soup is now an object that allows us to tap into various parts of the

98
06:43.550 --> 06:47.270
website, but using Python code. For example,

99
06:47.270 --> 06:51.470
if I wanted this title tag out of this whole website,

100
06:51.770 --> 06:55.610
all I have to do is say soup.title.

101
06:57.680 --> 07:01.760
And now if I print this soup.title,

102
07:02.360 --> 07:07.360
you can see that we'll get the title tag being printed out in its entirety.

103
07:09.710 --> 07:14.710
Once Beautiful Soup has made sense of this website by parsing the HTML,

104
07:17.300 --> 07:19.460
we can now tap into that object

105
07:19.520 --> 07:23.510
which is the HTML code as if it were a Python object.

106
07:23.840 --> 07:25.730
So we can tap into the title,

107
07:26.000 --> 07:30.200
but we can dig even deeper. Instead of just tapping into the title,

108
07:30.500 --> 07:35.500
we can also get hold of other things like the title.name,

109
07:36.200 --> 07:41.200
and this is going to give us the name of that particular title tag.

110
07:41.960 --> 07:44.900
So remember this gave us the title tag,

111
07:44.930 --> 07:49.930
so all of this, and this next stage drilling even deeper into the name of this

112
07:50.900 --> 07:54.200
tag, you'll see that it gives us title.

113
07:54.710 --> 07:57.590
So the name of this title tag is called title,

114
07:58.250 --> 08:01.460
and we can also get hold of the string

115
08:01.700 --> 08:06.700
which is contained in the title tag by simply using .title.string.

116
08:07.280 --> 08:11.840
And you can see this is the actual string that's inside that title tag.
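
NOTE
A minimal sketch of the three lookups just described: the title tag, its name,
and the string inside it. The markup is a stand-in (an assumption), so the
printed title text will differ from the one in the course's website.html.
```python
from bs4 import BeautifulSoup
# Assumption: stand-in markup with a single title tag.
contents = "<html><head><title>Sample CV</title></head><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(contents, "html.parser")
title_tag = soup.title          # the whole tag, including its opening and closing parts
title_name = soup.title.name    # the name of the tag
title_text = soup.title.string  # the text contained in the tag
print(title_tag)
print(title_name)
print(title_text)
```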

117
08:13.280 --> 08:16.130
If we think about it, this entire soup

118
08:16.190 --> 08:19.520
object now represents our HTML code.

119
08:20.030 --> 08:24.440
So I can also actually just print out the entire soup object.

120
08:24.950 --> 08:29.180
And you can see that this is basically just all HTML.

121
08:29.990 --> 08:34.100
And if you want to, there's even a method called prettify

122
08:34.460 --> 08:37.820
which will indent your soup HTML code.

123
08:38.120 --> 08:43.120
So now, compare this, where everything's all on one line, with this prettified

124
08:43.910 --> 08:47.900
version where everything's all indented properly and easier to read.
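
NOTE
A minimal sketch comparing the plain printout with the prettified one. The
one-line markup is a stand-in (an assumption) for the course's website.html.
```python
from bs4 import BeautifulSoup
# Assumption: stand-in markup written on a single line.
contents = "<html><head><title>Sample CV</title></head><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(contents, "html.parser")
# Printing the soup object shows the markup as it was parsed.
print(soup)
# prettify() returns a string with one tag per line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```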

125
08:49.130 --> 08:53.780
In addition to getting the title tag, we can also get hold of,

126
08:53.810 --> 08:56.700
for example, the a tag.

127
08:57.060 --> 09:02.060
So this is going to give us the first anchor tag that it finds in our website,

128
09:02.700 --> 09:05.220
which happens to be this one right here.

129
09:06.180 --> 09:11.180
And we can swap that with maybe the first li or the first paragraph.
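
NOTE
A minimal sketch of getting the first matching tag by name. The markup is a
stand-in (an assumption) with two of each tag, to show that only the first one
comes back.
```python
from bs4 import BeautifulSoup
# Assumption: stand-in markup with two paragraphs, two links and two list items.
contents = (
    "<body><p>First paragraph</p><p>Second paragraph</p>"
    '<a href="https://example.com/one">First link</a>'
    '<a href="https://example.com/two">Second link</a>'
    "<ul><li>First item</li><li>Second item</li></ul></body>"
)
soup = BeautifulSoup(contents, "html.parser")
first_anchor = soup.a    # first anchor tag in the document
first_item = soup.li     # first list item
first_paragraph = soup.p # first paragraph
print(first_anchor)
print(first_item)
print(first_paragraph)
```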

130
09:14.910 --> 09:18.450
Essentially what we're doing with Beautiful Soup is we're just drilling down

131
09:18.480 --> 09:23.370
into this HTML file, finding the HTML tags that we're interested in,

132
09:23.850 --> 09:26.040
and then getting hold of

133
09:26.100 --> 09:30.660
either the name of the tag or the actual text of the tag.

134
09:31.470 --> 09:36.470
But what if we wanted all of the paragraphs or all of the anchor tags in our

135
09:37.380 --> 09:41.880
website, how would we do that? In the next lesson,

136
09:42.030 --> 09:47.030
we're going to dive deeper into searching through websites for all of the

137
09:47.190 --> 09:49.500
components that we're looking for. For example,

138
09:49.740 --> 09:52.410
all of the P tags or all of the anchor tags.

139
09:52.770 --> 09:57.770
And we're going to see how we can refine our search and specify exactly what it

140
09:58.140 --> 10:02.610
is that we want. So for all of that and more, I'll see you in the next lesson.