WEBVTT

0
00:00.360 --> 00:01.290
In this lesson,

1
00:01.290 --> 00:03.390
I have another super quick challenge for you

2
00:03.390 --> 00:06.660
so you can practice setting up the Selenium webdriver

3
00:06.660 --> 00:09.720
in a blank project and scraping a different piece

4
00:09.720 --> 00:11.730
of data from a website.

5
00:11.730 --> 00:15.000
This time we're going to work with the Wikipedia main page.

6
00:15.000 --> 00:17.400
So you're going to head over to the Course Resources

7
00:17.400 --> 00:19.650
and find the link to this page,

8
00:19.650 --> 00:21.033
or you can just type it in.

9
00:21.960 --> 00:25.950
Back in our project, I'm going to create a new file

10
00:25.950 --> 00:28.883
and I'm going to call this interaction.py.

11
00:32.520 --> 00:36.510
Now, in this new Python file, we're going to interact

12
00:36.510 --> 00:38.850
with this Wikipedia webpage.

13
00:38.850 --> 00:41.850
And as a challenge to you, the first thing I want you

14
00:41.850 --> 00:45.000
to do is to figure out how you can get a hold

15
00:45.000 --> 00:47.190
of this particular number

16
00:47.190 --> 00:51.120
and print it out inside our interaction.py.

17
00:51.120 --> 00:54.421
Remember that you'll need to import Selenium

18
00:54.421 --> 00:59.010
and also use the webdriver to get hold of this page

19
00:59.010 --> 01:01.410
and then find this particular number

20
01:01.410 --> 01:03.090
and finally print it out.

21
01:03.090 --> 01:04.890
And then when you're ready to run it, all you have

22
01:04.890 --> 01:06.360
to do is right-click

23
01:06.360 --> 01:09.870
and then Run this interaction.py and it'll work

24
01:09.870 --> 01:13.140
and you should see the outcome being printed

25
01:13.140 --> 01:14.280
in your console.

26
01:14.280 --> 01:16.833
So pause the video now and give that a go.

27
01:19.920 --> 01:22.110
Alright, so here's the solution.

28
01:22.110 --> 01:25.680
First, we're going to go into the Selenium package,

29
01:25.680 --> 01:28.050
which we've already installed into this project

30
01:28.050 --> 01:30.300
so we don't have to install it again.

31
01:30.300 --> 01:33.450
And then we're going to import the webdriver.

32
01:33.450 --> 01:37.770
Now, using the webdriver, we're going to create a new driver

33
01:37.770 --> 01:40.023
from the Chrome browser,

34
01:43.620 --> 01:48.120
and this is what we write to initialize a new Chrome driver.

35
01:48.120 --> 01:52.770
Now once we've created our driver,

36
01:52.770 --> 01:57.270
we can use the driver to navigate to our webpage,

37
01:57.270 --> 01:59.760
which is done using get().

38
01:59.760 --> 02:03.663
And this is the URL, which we'll copy and paste into here.
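
NOTE
The steps so far can be sketched roughly as follows. This assumes Selenium 4.6 or later, where webdriver.Chrome() can locate a matching driver without extra arguments. The WIKIPEDIA_URL value and the open_main_page helper are assumptions for illustration; the lesson only says to copy the link from the Course Resources.
```python
# Assumed URL of the English Wikipedia main page (not spelled out in the lesson).
WIKIPEDIA_URL = "https://en.wikipedia.org/wiki/Main_Page"
def open_main_page(url=WIKIPEDIA_URL):
    """Start Chrome, load the page, and return the driver (hypothetical helper)."""
    # Imported inside the function so that defining it does not require Selenium.
    from selenium import webdriver
    driver = webdriver.Chrome()  # Initialize a new Chrome driver.
    driver.get(url)  # Navigate to the page.
    return driver
# Usage (opens a real browser window): driver = open_main_page()
```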

39
02:04.620 --> 02:07.860
And once we've gotten hold of this page, then we're going

40
02:07.860 --> 02:11.310
to try to narrow down on this particular element.

41
02:11.310 --> 02:14.190
So let's go ahead and Inspect it.

42
02:14.190 --> 02:17.520
And you can see that it's inside an anchor tag

43
02:17.520 --> 02:20.040
with no particular identifiers.

44
02:20.040 --> 02:23.880
There's no id, there's no name, there's no class.

45
02:23.880 --> 02:27.780
But this anchor tag lives in a div that has an id.

46
02:27.780 --> 02:31.980
So this articlecount is going to be a unique identifier

47
02:31.980 --> 02:35.940
for the div that holds this particular anchor tag.

48
02:35.940 --> 02:38.100
So we can narrow in on this anchor tag

49
02:38.100 --> 02:40.650
using our CSS selectors.

50
02:40.650 --> 02:44.970
So we can say driver.find_element(By.CSS_SELECTOR...)

51
02:44.970 --> 02:48.030
make sure that it's element, not elements.

52
02:48.030 --> 02:50.820
And then inside here we're going to put our selector,

53
02:50.820 --> 02:55.170
which is first the id of articlecount,

54
02:55.170 --> 02:58.950
and that is going to be preceded by a pound sign.

55
02:58.950 --> 03:01.860
And then inside that div with that id, we're looking

56
03:01.860 --> 03:04.203
for the first anchor tag.

57
03:05.520 --> 03:07.230
Now notice that inside that div,

58
03:07.230 --> 03:09.540
there are actually two anchor tags.

59
03:09.540 --> 03:13.530
But by using this find_element(By.CSS_SELECTOR...),

60
03:13.530 --> 03:15.510
it's only going to give us the first one

61
03:15.510 --> 03:17.940
that matches this selector.
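
NOTE
A small sketch of the difference between find_element and find_elements with the selector described above. The first_and_all_anchors helper is a hypothetical name; it expects a driver that has already loaded the page, and it assumes the page still has a div with the id articlecount.
```python
# "#articlecount a": the pound sign selects by id, the space means "any <a> inside it".
ARTICLE_COUNT_SELECTOR = "#articlecount a"
def first_and_all_anchors(driver):
    """Return the first matching anchor and the list of all matching anchors."""
    # Imported inside the function so that defining it does not require Selenium.
    from selenium.webdriver.common.by import By
    # find_element returns only the first match (and raises if there is none).
    first = driver.find_element(By.CSS_SELECTOR, ARTICLE_COUNT_SELECTOR)
    # find_elements returns every match as a list (empty if there is none).
    every = driver.find_elements(By.CSS_SELECTOR, ARTICLE_COUNT_SELECTOR)
    return first, every
```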

62
03:17.940 --> 03:18.897
So this is going

63
03:18.897 --> 03:23.343
to be our article_count.

64
03:25.050 --> 03:28.230
And now what we want to do is we want to print

65
03:28.230 --> 03:31.830
the article_count.text.
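
NOTE
Put together, interaction.py might look something like the sketch below. It assumes Selenium 4.6 or later and the English Wikipedia main page URL; get_article_count is a hypothetical wrapper, and the quit() call is an addition that the lesson does not show. The page layout may have changed since the recording, so the selector may need adjusting.
```python
# Assumed URL of the English Wikipedia main page (not spelled out in the lesson).
WIKIPEDIA_URL = "https://en.wikipedia.org/wiki/Main_Page"
def get_article_count(url=WIKIPEDIA_URL):
    """Open the page in Chrome and return the article count as text."""
    # Imported inside the function so that defining it does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # First anchor tag inside the div whose id is articlecount.
        article_count = driver.find_element(By.CSS_SELECTOR, "#articlecount a")
        return article_count.text
    finally:
        driver.quit()  # Close the browser even if the lookup fails.
# Usage (opens a real browser window): print(get_article_count())
```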

66
03:31.830 --> 03:33.960
So now let's go ahead and right-click

67
03:33.960 --> 03:36.630
and Run our interaction.py.

68
03:36.630 --> 03:39.840
It should open up our browser to this page.

69
03:39.840 --> 03:43.293
And now it should have found and printed out that number.

70
03:44.130 --> 03:46.410
So this is what we've been doing so far,

71
03:46.410 --> 03:49.182
creating our driver, opening webpages,

72
03:49.182 --> 03:51.666
and then finding specific elements

73
03:51.666 --> 03:54.540
and printing some sort of property,

74
03:54.540 --> 03:58.020
but the next step is to actually form some sort

75
03:58.020 --> 04:00.510
of interaction with the webpage.

76
04:00.510 --> 04:02.280
For example, clicking on a link,

77
04:02.280 --> 04:05.400
or typing something into the search bar,

78
04:05.400 --> 04:08.070
because after all, when we're working with websites,

79
04:08.070 --> 04:11.040
we'll often need to interact with them in order

80
04:11.040 --> 04:12.600
to navigate to new pages

81
04:12.600 --> 04:16.590
and get hold of specific pieces of information

82
04:16.590 --> 04:18.300
that we're interested in.

83
04:18.300 --> 04:21.090
And that's what I'm going to show you in the next lesson.

84
04:21.090 --> 04:22.190
So I'll see you there.