WEBVTT

0
00:00.210 --> 00:02.190
Now there are two things you'll notice here.

1
00:02.580 --> 00:06.900
One is we're only getting hold of the first, for example,

2
00:06.900 --> 00:11.730
p tag or a tag, but we're not getting hold of any of the other ones.

3
00:12.120 --> 00:14.970
So what if we wanted to get all of the anchor tags,

4
00:15.000 --> 00:18.420
all of the paragraphs in our website? Well,

5
00:18.450 --> 00:23.450
then we can use a function that comes with Beautiful Soup called find_all.

6
00:24.570 --> 00:27.990
This is probably one of the most commonly used methods when it comes to

7
00:27.990 --> 00:28.823
Beautiful Soup.

8
00:29.730 --> 00:34.730
And here we can search by a bunch of things. We could search by name,

9
00:34.950 --> 00:35.783
so we can say

10
00:35.790 --> 00:40.790
find all of the tags where the tag name is equal to a.

11
00:43.080 --> 00:46.050
So this is going to give us all of the anchor tags,

12
00:46.190 --> 00:47.023
<v 1>right?</v>

13
00:49.220 --> 00:50.840
<v 0>And if I print that,</v>

14
00:51.140 --> 00:56.140
you can see that it gives us a list and it gives us all three of the links that

15
00:57.650 --> 00:59.450
exist in our website.
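
NOTE
A minimal sketch of the find_all call described here. The lesson's actual page is not shown in this transcript, so the HTML below is a made-up stand-in with three links, and the example.com URLs are placeholders.
```python
from bs4 import BeautifulSoup
# Hypothetical page: two paragraphs holding three anchor tags in total.
html = (
    '<body>'
    '<p>Visit <a href="https://example.com/one">One</a></p>'
    '<p>Also <a href="https://example.com/two">Two</a> and '
    '<a href="https://example.com/three">Three</a></p>'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")
# find_all returns a list-like result holding every tag with the given name.
all_anchor_tags = soup.find_all(name="a")
all_paragraphs = soup.find_all(name="p")
print(all_anchor_tags)
print(all_paragraphs)
```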

16
01:00.350 --> 01:03.290
And if I change this to p for example,

17
01:03.470 --> 01:08.470
then it'll find all of the paragraphs and we can change this to basically any of

18
01:09.650 --> 01:13.850
the tag names in our website. Now,

19
01:13.850 --> 01:16.280
what if we wanted to drill a little bit deeper?

20
01:16.820 --> 01:19.190
We've got a list of all the anchor tags,

21
01:19.490 --> 01:23.870
but what if I only wanted the text in that anchor tag?

22
01:23.870 --> 01:28.730
So I just wanted this part. Well, how would I get hold of all of them? Well,

23
01:28.760 --> 01:31.730
firstly, we would probably need a for loop.

24
01:32.030 --> 01:37.030
So we could say for tag in all anchor tags and we can loop through all of those

25
01:39.560 --> 01:44.560
anchor tags and use a method called tag.getText.
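
NOTE
A rough sketch of the loop being described, assuming a made-up snippet of HTML with two links. The list name link_texts is only for illustration and is not from the lesson.
```python
from bs4 import BeautifulSoup
# Hypothetical HTML standing in for the lesson's page.
html = (
    '<p><a href="https://example.com/one">One</a> '
    '<a href="https://example.com/two">Two</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
all_anchor_tags = soup.find_all(name="a")
link_texts = []
for tag in all_anchor_tags:
    # getText() returns only the text inside the tag, without the markup.
    link_texts.append(tag.getText())
print(link_texts)
```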

26
01:46.220 --> 01:49.040
And now if I go ahead and print this,

27
01:49.130 --> 01:54.130
you can see that it's basically going to print out the text that is

28
01:55.490 --> 01:59.810
in all three of the anchor tags that it found. Now,

29
01:59.810 --> 02:02.420
what if I didn't want to get the text,

30
02:02.510 --> 02:07.340
but instead I wanted to get hold of the actual href,

31
02:07.370 --> 02:11.870
so the link, right? So let's print our all_anchor_tags again.

32
02:12.650 --> 02:13.483
<v 1>All right.</v>

33
02:15.800 --> 02:20.150
<v 0>And you can see that there is an attribute called href</v>

34
02:20.420 --> 02:25.190
which stores the actual link that the tag goes to.

35
02:25.700 --> 02:30.200
So very often, you'll want to isolate that link. So how would you do that?

36
02:30.800 --> 02:31.280
Well,

37
02:31.280 --> 02:36.280
you can tap into each of the tags and you can use another method called get.

38
02:38.390 --> 02:42.140
And here you can get the value of any of the attributes.
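
NOTE
A small sketch of reading an attribute with get, assuming the same kind of made-up HTML as before. The example.com links and the name link_urls are placeholders, not the lesson's real values.
```python
from bs4 import BeautifulSoup
# Hypothetical HTML standing in for the lesson's page.
html = (
    '<p><a href="https://example.com/one">One</a> '
    '<a href="https://example.com/two">Two</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
link_urls = []
for tag in soup.find_all(name="a"):
    # get() looks up the value of a single attribute on the tag.
    link_urls.append(tag.get("href"))
print(link_urls)
```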

39
02:42.590 --> 02:46.250
So if I pass in href here and I print this,

40
02:46.940 --> 02:51.200
then it's going to give me all of the links and it's not going to give me

41
02:51.200 --> 02:54.080
anything else. It's basically just stripped out the link

42
02:54.200 --> 02:59.090
which is what I'm interested in. Similarly, when we use find_all

43
02:59.920 --> 03:04.920
we can also find things by their attribute, so at the moment we're searching by

44
03:05.410 --> 03:08.950
the tag name, but we can also get hold of things

45
03:09.040 --> 03:11.860
by the attribute name. For example,

46
03:11.860 --> 03:16.780
if I wanted to get hold of this item, I can of course search for an h1.

47
03:17.080 --> 03:19.660
But what if I had lots of h1s? Well,

48
03:19.660 --> 03:24.430
then I could isolate it by this ID. So I could say,

49
03:27.210 --> 03:27.990
<v 1>soup</v>

50
03:27.990 --> 03:32.990
<v 0>.find_all which will give me a list of all of the items that match the search</v>

51
03:34.680 --> 03:35.513
query,

52
03:35.760 --> 03:40.760
or I can use the find method to only find the first item that matches the query.

53
03:42.900 --> 03:45.450
In my case, there's only one thing I'm looking for.

54
03:45.690 --> 03:50.610
And this particular tag has a name of h1

55
03:51.270 --> 03:56.270
but it's also got an ID of name.

56
03:58.440 --> 03:59.280
As you can see,

57
03:59.520 --> 04:03.510
this ID is equal to name and it's also an h1 tag.

58
04:04.110 --> 04:07.290
So this will give us that particular element.
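
NOTE
A minimal sketch of searching by tag name and ID together. The HTML is a made-up example with two h1 tags, only one of which has id="name", and the heading text is invented.
```python
from bs4 import BeautifulSoup
# Hypothetical HTML: the second h1 has no id, so it should not match.
html = '<h1 id="name">Jane Doe</h1><h1>Another heading</h1>'
soup = BeautifulSoup(html, "html.parser")
# find returns only the first tag that matches both the name and the id.
heading = soup.find(name="h1", id="name")
# find_all with the same query returns every match as a list.
all_matches = soup.find_all(name="h1", id="name")
print(heading)
```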

59
04:07.380 --> 04:12.380
So if I print out this heading and let's comment out everything else,

60
04:16.920 --> 04:21.600
then now you can see I've just isolated that one h1.

61
04:22.260 --> 04:25.830
And this also means if I just add another h1 here,

62
04:26.130 --> 04:26.963
<v 1>...</v>

63
04:29.220 --> 04:31.080
<v 0>and I run this code again,</v>

64
04:31.350 --> 04:36.150
that is not going to show up because I've said it has to have a name of h1

65
04:36.570 --> 04:40.020
and an ID that matches this particular value.

66
04:41.670 --> 04:42.900
Now, as you can imagine,

67
04:42.900 --> 04:47.040
you can also do the same thing with the class attribute.

68
04:47.640 --> 04:49.110
So we can say,

69
04:52.920 --> 04:53.430
<v 1>...</v>

70
04:53.430 --> 04:56.520
<v 0>soup.find because again, I'm only looking for one.</v>

71
04:57.240 --> 05:01.290
And the thing that I'm looking for has a name

72
05:01.680 --> 05:04.290
which is an h3

73
05:04.380 --> 05:05.213
<v 1>...</v>

74
05:07.140 --> 05:12.140
<v 0>but it's also got a class that's equal to heading.</v>

75
05:13.200 --> 05:17.220
So I'm just going to copy that and paste that in here. Now,

76
05:17.250 --> 05:22.250
one of the things you'll get here is an error because this class keyword is a

77
05:23.700 --> 05:25.710
reserved keyword in Python.

78
05:26.190 --> 05:29.100
And what that means is that it's a special word

79
05:29.370 --> 05:34.170
which can only be used for creating classes. Now, in this case,

80
05:34.200 --> 05:38.100
we're definitely not creating a class or an object here. Instead,

81
05:38.100 --> 05:39.990
we're trying to tap into an attribute.

82
05:40.470 --> 05:43.410
So in order to not clash with the class keyword,

83
05:43.680 --> 05:46.980
this argument is actually called class_.

84
05:48.330 --> 05:52.920
Now it's going to look for all of the h3s where the class attribute is

85
05:52.920 --> 05:54.360
equal to heading.
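
NOTE
A short sketch of the class_ keyword argument, assuming a made-up page with one h3 that has class="heading" and one that does not. The trailing underscore is what avoids the clash with Python's class keyword.
```python
from bs4 import BeautifulSoup
# Hypothetical HTML: only the first h3 carries the class we search for.
html = '<h3 class="heading">Books</h3><h3>Other</h3>'
soup = BeautifulSoup(html, "html.parser")
# class_ (with an underscore) filters on the HTML class attribute.
section_heading = soup.find(name="h3", class_="heading")
print(section_heading)
```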

86
05:54.970 --> 05:59.970
Let's go ahead and print this section_heading and you should see now we'll get that 

87
06:01.220 --> 06:05.810
h3 with the class of heading show up. And again,

88
06:05.810 --> 06:09.080
if we wanted to get hold of the text

89
06:09.110 --> 06:14.060
that's contained in that h3, then we simply use the getText method,

90
06:14.600 --> 06:19.490
or if we want to know the name of that particular tag,

91
06:19.550 --> 06:21.080
then we can say .name.
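
NOTE
A quick sketch of reading the text and the tag name from a found element. The HTML and the variable names heading_text and heading_tag_name are assumptions for illustration.
```python
from bs4 import BeautifulSoup
# Hypothetical HTML with a single h3 heading.
html = '<h3 class="heading">Books and Teaching</h3>'
soup = BeautifulSoup(html, "html.parser")
section_heading = soup.find(name="h3", class_="heading")
# getText() gives the text content; .name gives the tag's name as a string.
heading_text = section_heading.getText()
heading_tag_name = section_heading.name
print(heading_text)
print(heading_tag_name)
```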

92
06:23.920 --> 06:24.370
<v 1>Okay.</v>

93
06:24.370 --> 06:29.230
<v 0>And if we want to get hold of the value of an attribute, for example,</v>

94
06:29.260 --> 06:32.680
get the class value,

95
06:32.830 --> 06:36.430
then we can do something like this. Now,

96
06:36.460 --> 06:41.460
while that's a pretty good way of selecting elements from the entire website,

97
06:42.130 --> 06:46.360
there's certain cases where it might not work. For example,

98
06:46.990 --> 06:51.100
at the moment here, we've got our three anchor tags.

99
06:51.640 --> 06:55.300
If we wanted to get hold of a specific anchor tag,

100
06:55.540 --> 06:59.950
let's say we wanted this anchor tag, then what do we do?

101
07:00.340 --> 07:00.640
Well,

102
07:00.640 --> 07:05.640
then we could just simply find all of the anchor tags and then find the first

103
07:06.820 --> 07:09.280
one. But as you can imagine,

104
07:09.310 --> 07:12.400
this is an incredibly simple website.

105
07:12.700 --> 07:15.490
Most websites will have thousands

106
07:15.520 --> 07:19.000
if not tens of thousands of links. In that situation,

107
07:19.180 --> 07:24.180
it's really hard to know which particular link you want from the list of all of

108
07:24.880 --> 07:25.713
the anchor tags.

109
07:26.320 --> 07:31.320
So we want to have a way where we can drill down into a particular element.

110
07:32.440 --> 07:35.770
What's unique about this particular anchor tag? Well,

111
07:35.800 --> 07:40.800
it sits inside a strong tag and it sits inside an emphasis tag and it sits

112
07:41.830 --> 07:46.390
inside a paragraph tag, which itself is in the body.

113
07:47.080 --> 07:52.080
We can narrow it down using these steps. In our current website,

114
07:52.540 --> 07:57.340
nowhere else is there an anchor tag that sits inside a paragraph tag.

115
07:57.940 --> 08:02.940
And you'll remember from our previous lessons on CSS that you can use CSS

116
08:03.670 --> 08:08.670
selectors in order to narrow down on a particular element in order to specify

117
08:09.580 --> 08:13.600
its style. And if we were to write CSS code,

118
08:15.370 --> 08:17.680
then it would look something like this.

119
08:19.000 --> 08:24.000
So we would select first the paragraph and then we would select the anchor tag

120
08:24.670 --> 08:29.670
and then we can specify what the style should be

121
08:32.200 --> 08:32.830
<v 1>.</v>

122
08:32.830 --> 08:34.690
<v 0>Now, when we're using Beautiful Soup,</v>

123
08:34.840 --> 08:37.990
we can also use the CSS selectors.

124
08:38.620 --> 08:43.270
I can get hold of that company URL by simply saying soup,

125
08:43.720 --> 08:48.370
and instead of using find or find_all, I'm going to use select_one.

126
08:49.210 --> 08:54.210
There's select and select_one. Select_one will give us the first matching

127
08:54.430 --> 08:58.620
item and select will give us all of the matching items in a list.

128
08:59.280 --> 09:04.050
Now we get to specify the selector as a string. And again,

129
09:04.080 --> 09:07.020
I'm going to use the same selector that I showed you before.

130
09:07.350 --> 09:11.820
So we're looking for an a tag which sits inside a p tag.

131
09:12.240 --> 09:17.040
And this string is the CSS selector. So you can write anything in here

132
09:17.040 --> 09:21.810
really. This means that we'll be able to get that anchor tag.

133
09:22.080 --> 09:27.080
And then once I've gotten hold of the company URL and print it out,

134
09:27.720 --> 09:31.290
you can see it's that exact anchor tag that we wanted.
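
NOTE
A sketch of select_one with the "p a" selector, assuming a made-up page where one link is nested inside a paragraph and another sits in a list. The example.com URLs are placeholders.
```python
from bs4 import BeautifulSoup
# Hypothetical HTML: only the first link is a descendant of a p tag.
html = (
    '<p><em><strong><a href="https://example.com/company">Company</a>'
    '</strong></em></p>'
    '<ul><li><a href="https://example.com/other">Other</a></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
# "p a" is a CSS descendant selector: an a tag anywhere inside a p tag.
company_url = soup.select_one(selector="p a")
print(company_url)
```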

135
09:32.610 --> 09:36.270
We don't have to just stick to the HTML selectors.

136
09:36.300 --> 09:40.890
You can also use the class or the ID in your CSS selector.

137
09:41.280 --> 09:44.220
So remember, to select on an ID,

138
09:44.520 --> 09:47.550
we use the pound sign. So let's say

139
09:47.880 --> 09:51.720
we want to get hold of this h1, which has an ID of name,

140
09:51.990 --> 09:53.820
we can say #name,

141
09:54.180 --> 09:58.740
and now this is going to be equal to my name.

142
09:59.160 --> 10:01.740
And if I now run it, you can see that last one,

143
10:01.950 --> 10:06.570
the element that was picked out is the h1 with the ID of name.
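
NOTE
A minimal sketch of selecting by ID with a CSS selector, assuming a made-up h1 with id="name". The heading text is invented.
```python
from bs4 import BeautifulSoup
# Hypothetical HTML: one h1 with the id and one without.
html = '<h1 id="name">Jane Doe</h1><h1>Another heading</h1>'
soup = BeautifulSoup(html, "html.parser")
# The pound sign selects on the id attribute, just as in a stylesheet.
name = soup.select_one(selector="#name")
print(name)
```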

144
10:07.650 --> 10:12.210
And finally you can use a CSS selector to select an element by class.

145
10:12.390 --> 10:16.680
So for example, here, we've got heading and here we've got heading as well.

146
10:17.160 --> 10:22.160
So if we want to select all of the elements that have a class of heading,

147
10:22.980 --> 10:26.940
then we could say soup.select so this will give us a list

148
10:27.060 --> 10:32.060
and then the selector is the first item that goes into the method.

149
10:32.520 --> 10:37.470
So similar to this, we can have this keyword argument there or we can delete it.

150
10:38.070 --> 10:43.070
And this selector will be using the .heading in order to select the element

151
10:45.060 --> 10:47.130
that has a class of heading.

152
10:49.410 --> 10:53.430
And this is now going to be a list if we print it out.
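
NOTE
A sketch of selecting every element with a given class, assuming a made-up page where an h1 and an h3 both use class="heading".
```python
from bs4 import BeautifulSoup
# Hypothetical HTML: two elements share the class, one paragraph does not.
html = (
    '<h1 class="heading">Title</h1>'
    '<p>Some text</p>'
    '<h3 class="heading">Section</h3>'
)
soup = BeautifulSoup(html, "html.parser")
# select returns a list of all matches; the selector is the first argument.
headings = soup.select(".heading")
heading_names = [tag.name for tag in headings]
print(headings)
```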

153
10:55.620 --> 10:56.580
Right here.

154
10:56.630 --> 10:58.530
<v 2>So,</v>

155
10:58.880 --> 11:03.500
<v 0>you can use everything that you've learned about CSS selectors to select a</v>

156
11:03.500 --> 11:06.560
particular item out of an HTML file.

157
11:07.040 --> 11:11.330
And this is usually really useful because a lot of these elements will be nested

158
11:11.330 --> 11:14.600
inside divs and the div will have an ID

159
11:14.870 --> 11:19.190
and then all you have to do is to narrow down on the div and then narrow down on the

160
11:19.190 --> 11:20.023
element you want.

161
11:20.240 --> 11:25.240
So you can basically drill through using CSS selectors to get to any item you

162
11:26.180 --> 11:27.260
want on the page.
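
NOTE
A rough sketch of drilling down through a div with an ID. The div IDs "sidebar" and "main" and the example.com URLs are hypothetical and do not come from the lesson's page.
```python
from bs4 import BeautifulSoup
# Hypothetical HTML: two divs, each containing one link.
html = (
    '<div id="sidebar"><a href="https://example.com/ad">Ad</a></div>'
    '<div id="main"><a href="https://example.com/article">Article</a></div>'
)
soup = BeautifulSoup(html, "html.parser")
# Narrow down on the div by its id first, then on the a tag inside it.
main_link = soup.select_one("div#main a")
print(main_link.get("href"))
```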

163
11:28.460 --> 11:33.460
Now that we've looked at how to find various items from HTML using Beautiful

164
11:33.980 --> 11:37.610
Soup, in the next lesson, I've got a quiz for you

165
11:37.910 --> 11:42.910
to have a go and have some practice at selecting and finding elements

166
11:43.850 --> 11:46.760
from an HTML file using Beautiful Soup.

167
11:47.240 --> 11:50.330
So for all of that and more, head over to the next lesson.