WEBVTT

0
00:00.690 --> 00:01.860
Now here's a question.

1
00:02.310 --> 00:07.310
Have you ever gotten into a situation where it's evening time and you wanna

2
00:07.920 --> 00:10.710
watch a movie, but you don't know what to watch?

3
00:11.160 --> 00:15.000
Like there's just no ideas coming to your mind and everybody's sitting around

4
00:15.000 --> 00:16.500
the TV, like, what do we do?

5
00:17.400 --> 00:22.400
So what if we scraped a list of the top 100 movies of all time and you pick one

6
00:25.320 --> 00:26.153
from there?

7
00:26.700 --> 00:31.700
So I recently came across this article on Empire where they list the 100

8
00:32.040 --> 00:34.590
greatest movies of all time.

9
00:35.220 --> 00:40.220
And you can see that it goes from 100 all the way down to one.

10
00:41.820 --> 00:45.960
These are supposed to be the top rated movies that have ever been produced

11
00:46.440 --> 00:51.440
and I always wanted to watch through all 100 of them and check off each one as I

12
00:51.810 --> 00:56.790
go along. So this is what we're going to be doing for our final project today.

13
00:57.180 --> 01:02.040
You're going to be scraping from this website the 100 movies

14
01:02.430 --> 01:06.570
and you are going to be using Python code to create a text file called movies

15
01:06.570 --> 01:11.040
.txt that lists them in order, starting from one.

16
01:11.760 --> 01:16.760
Notice how each of these lines is just the title of one of these movies.

17
01:17.730 --> 01:20.160
Essentially, it's this part that you want.

18
01:20.730 --> 01:23.520
And because the listing starts from 100,

19
01:23.790 --> 01:27.090
you have to figure out how to flip it the other way so that you get it

20
01:27.090 --> 01:30.270
starting from one. But essentially this is the goal.

21
01:30.450 --> 01:32.910
And this is the project for today.

22
01:32.970 --> 01:35.190
You're going to have to use what you learned about web scraping,

23
01:35.460 --> 01:38.580
but also other things that you've learned along the way about Python.

24
01:39.330 --> 01:42.540
Pause the video now and try to complete this challenge.

25
01:45.230 --> 01:45.770
<v 1>Okay.</v>

26
01:45.770 --> 01:49.340
<v 0>So here I've created a brand new blank project,</v>

27
01:49.370 --> 01:52.670
which I've called top100-movies, but of course, that doesn't matter.

28
01:53.210 --> 01:56.900
But the main.py is where we're going to be doing all the serious stuff.

29
01:57.200 --> 02:01.160
So the first thing we'll need is to copy the URL.

30
02:01.670 --> 02:05.870
And at the time when you're doing this project, this URL might change.

31
02:06.110 --> 02:08.090
So don't get it from here. Instead,

32
02:08.360 --> 02:11.480
go to the course resources and copy the URL from there.

33
02:12.380 --> 02:13.790
Once you've got the URL,

34
02:13.820 --> 02:18.020
we're going to paste it into our main.py and save it as a constant.

35
02:18.710 --> 02:21.980
In addition, we're going to have to import all the things that we need,

36
02:22.010 --> 02:23.480
including requests,

37
02:23.870 --> 02:27.500
and also the bs4 module

38
02:28.790 --> 02:31.280
where we can get hold of our Beautiful Soup.

39
02:33.980 --> 02:37.640
I'm going to need to install both of these because I don't have them yet.

40
02:38.510 --> 02:42.440
So I'm going to click on the red squiggly line, install requests,

41
02:43.070 --> 02:47.330
and also install beautiful soup. Once that's all done,

42
02:47.390 --> 02:52.390
then my red squiggly lines should be gone and I can now start using my requests

43
02:52.880 --> 02:53.870
module to

44
02:53.870 --> 02:58.870
make a get request to this particular URL. And I'll save the response

45
02:59.100 --> 03:01.750
I get back as the response.

46
03:03.280 --> 03:05.200
The actual HTML,

47
03:05.200 --> 03:10.200
so the website HTML, is under response.text.

48
03:11.560 --> 03:14.440
So this is going to be the raw HTML text,

49
03:14.800 --> 03:18.820
and this is what I'm going to get Beautiful Soup to parse.

50
03:19.480 --> 03:24.010
So now we make soup and we're going to use Beautiful Soup and we're going to

51
03:24.010 --> 03:26.110
pass in our website HTML

52
03:26.530 --> 03:29.170
and also the html.parser,

53
03:29.620 --> 03:33.220
which is the parser that we're going to use to parse through this website.
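
NOTE
A minimal sketch of the fetch-and-parse step just described, assuming the requests and beautifulsoup4 packages are installed. The URL is a placeholder (the real one comes from the course resources), fetch_html is a hypothetical helper name, and SAMPLE_HTML is made-up markup so the sketch runs without a network connection.
```python
import requests
from bs4 import BeautifulSoup
# Placeholder only: copy the real URL from the course resources.
URL = "https://example.com/best-movies"
def fetch_html(url: str) -> str:
    """Make a GET request and return the raw HTML text of the page."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on a 4xx/5xx response
    return response.text
# Made-up stand-in for the page, so nothing is downloaded here.
SAMPLE_HTML = "<html><body><h3 class='title'>1) Example Movie</h3></body></html>"
# "html.parser" is the parser built into Python's standard library.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.prettify())
```
To parse the live page instead, pass fetch_html(URL) as the first argument to BeautifulSoup.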

54
03:34.360 --> 03:35.710
Now that we've made soup,

55
03:35.740 --> 03:40.720
let's go ahead and print out our soup in a prettified format so that we can just

56
03:40.720 --> 03:41.980
see what it looks like.

57
03:42.160 --> 03:46.660
So let's run our main.py and we can see that we've got basically that

58
03:46.660 --> 03:49.750
entire website's HTML being printed out

59
03:49.750 --> 03:54.750
here. Now comes the part where we need to use our Google Chrome inspector.

60
03:56.020 --> 04:00.700
The part that we want from this website is just these lines. Now,

61
04:00.730 --> 04:02.500
of course, if we didn't know how to code,

62
04:02.740 --> 04:06.490
we would be here copying and pasting for hours on end,

63
04:06.910 --> 04:11.890
and we would die of boredom or we'd get repetitive strain injury from doing so

64
04:11.890 --> 04:15.760
much copying and pasting. But because we know how to code, we know better than that.

65
04:15.910 --> 04:19.510
So let's get hold of the part that we want. Let's right-

66
04:19.510 --> 04:22.270
click and click inspect. Now,

67
04:22.270 --> 04:27.220
we can see that this lives inside an h3 with the class of title.

68
04:27.790 --> 04:31.630
Now let's just check against one of the other pieces that we want and just make

69
04:31.630 --> 04:33.880
sure that they've got the same structure.

70
04:34.330 --> 04:38.470
So this is also inside an h3 with the class of title.

71
04:38.950 --> 04:43.180
So basically as long as we can scrape this entire page and get all of the 

72
04:43.210 --> 04:48.040
h3 elements with the class of title and get the text that's contained inside the

73
04:48.070 --> 04:52.120
h3 element, then we're golden. Let's go ahead and do that.

74
04:52.600 --> 04:55.120
So instead of printing our soup.prettify,

75
04:55.150 --> 04:57.550
I'm going to tap into soup and I'm going to say

76
04:57.550 --> 05:02.550
find_all. The thing that I want to find has the tag name of h3 and it

77
05:04.360 --> 05:07.300
has the class of title.

78
05:08.080 --> 05:11.710
So that all came from our inspections right here.

79
05:12.400 --> 05:15.940
This should get us a list of all of the 

80
05:15.970 --> 05:17.620
h3 elements with this class

81
05:18.160 --> 05:23.110
and we should be able to save that into a variable called all_movies.
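
NOTE
A small sketch of the find_all call described above, run against made-up markup rather than the real page (SAMPLE_HTML and its titles are assumptions for illustration). It shows that only h3 tags carrying the title class are returned.
```python
from bs4 import BeautifulSoup
# Made-up markup: two matching h3 tags plus two elements that should be ignored.
SAMPLE_HTML = """
<h3 class="title">3) Third Film</h3>
<h3 class="other">Not a movie</h3>
<h3 class="title">2) Second Film</h3>
<p class="title">Not an h3</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# class_ has a trailing underscore because class is a reserved word in Python.
all_movies = soup.find_all(name="h3", class_="title")
print(all_movies)
```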

82
05:23.980 --> 05:28.450
Let's print all_movies and see what we get.

83
05:30.190 --> 05:33.430
Now, we've got a list of all of our h3s,

84
05:33.970 --> 05:38.970
and we're now going to go one step further and fetch the text from within these

85
05:41.380 --> 05:45.340
h3 elements. We do that using the

86
05:45.640 --> 05:49.000
getText method. But we can't do that on the list

87
05:49.030 --> 05:51.820
so we're going to use a list comprehension.

88
05:52.390 --> 05:57.390
So we're going to say the movie_titles is equal to a new list

89
05:58.850 --> 05:59.780
and in this list,

90
05:59.930 --> 06:04.930
each item is going to be formed from a movie in the all_movies list.

91
06:09.710 --> 06:13.970
And this new item is going to be created by taking each of the movies in the

92
06:13.970 --> 06:17.900
list and then calling getText on it. Now,

93
06:17.900 --> 06:21.380
if I go ahead and print my movie_titles

94
06:21.410 --> 06:25.940
instead of all_movies, then this is what we get.

95
06:25.940 --> 06:29.930
We get all of the titles of all 100 movies.
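
NOTE
A sketch of the list comprehension described above, again on made-up markup (the two sample titles are assumptions). getText is called on each tag in turn because it is a method of a single tag, not of the whole result list.
```python
from bs4 import BeautifulSoup
SAMPLE_HTML = '<h3 class="title">2) Second Film</h3><h3 class="title">1) First Film</h3>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
all_movies = soup.find_all(name="h3", class_="title")
# Build a new list holding just the text inside each h3 tag.
movie_titles = [movie.getText() for movie in all_movies]
print(movie_titles)
```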

96
06:32.120 --> 06:37.120
Now we want to reverse this list so that we can put it into a text file starting

97
06:38.060 --> 06:40.490
from 1, going down to 100.

98
06:41.000 --> 06:43.070
So there's a couple of ways that we could do this.

99
06:43.130 --> 06:46.520
One is we can use the Python slice operator.

100
06:46.910 --> 06:49.100
So we add a set of square brackets

101
06:49.520 --> 06:52.220
and then we add a ::-1,

102
06:54.650 --> 06:58.400
and this will reverse the order. And as always,

103
06:58.760 --> 07:03.260
you can find out this information either by Googling or through what you've

104
07:03.260 --> 07:07.700
learned before in previous lessons. So this is the syntax that we're using,

105
07:08.180 --> 07:12.380
which comes from the slice operator where we have a start,

106
07:12.680 --> 07:15.620
a stop, and a step. So in this case,

107
07:15.800 --> 07:18.410
the start is at the very beginning of the list,

108
07:18.470 --> 07:20.330
the stop is at the very end of the list so

109
07:20.360 --> 07:22.940
we don't have to specify those because they're the defaults,

110
07:23.360 --> 07:28.360
and then the step is basically -1 and this syntax will reverse that list.
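
NOTE
A short sketch of the slice-based reversal, using a three-item stand-in list (the titles are made up) in place of the 100 scraped titles.
```python
movie_titles = ["3) Third Film", "2) Second Film", "1) First Film"]
# Start and stop are left out, so the slice covers the whole list; a step of -1 walks it backwards.
movies = movie_titles[::-1]
print(movies)
# The slice builds a new list, so the original order is untouched.
print(movie_titles)
```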

111
07:31.040 --> 07:32.480
Now, alternatively, you can,

112
07:32.480 --> 07:37.480
of course also use a for loop where you create some sort of n in a range and the

113
07:40.370 --> 07:44.150
range again, can take a start, stop and step.

114
07:44.630 --> 07:46.820
So you could start at the end,

115
07:46.850 --> 07:50.930
so the length of our movie_titles-1

116
07:51.020 --> 07:55.370
because remember lists in Python start numbering from zero.

117
07:55.730 --> 08:00.680
So the very last item is actually at the total number minus one.

118
08:01.370 --> 08:05.690
And next is going to be the end, which is going to be zero, and finally

119
08:05.690 --> 08:09.770
it's going to be the step, which is minus one each time.

120
08:09.890 --> 08:13.790
So this time we start off from the very end of the list,

121
08:13.790 --> 08:18.620
go back to the beginning, stepping by minus one each time. And this way,

122
08:18.620 --> 08:19.880
if we print n,

123
08:19.940 --> 08:24.940
you can see that this is going to give us basically all the way from 99 down to

124
08:26.090 --> 08:26.923
1.

125
08:27.590 --> 08:32.590
And we can use that to tap into our movie titles and get hold of each of the items

126
08:35.330 --> 08:36.620
at index n.
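
NOTE
A sketch of the range-based alternative on a three-item stand-in list (the titles are made up). Because range leaves out its stop value, a stop of 0 never reaches index 0, so the loop here uses a stop of -1.
```python
movie_titles = ["3) Third Film", "2) Second Film", "1) First Film"]
last_index = len(movie_titles) - 1
# With a stop of 0 the indices end at 1, so the item at index 0 is skipped.
print(list(range(last_index, 0, -1)))
# With a stop of -1 the loop runs all the way down to index 0.
movies = []
for n in range(last_index, -1, -1):
    movies.append(movie_titles[n])
print(movies)
```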

127
08:41.110 --> 08:45.580
And because range stops just before its stop value,

128
08:45.610 --> 08:50.110
we actually have to put -1 as the stop if we want to get the very final one,

129
08:50.110 --> 08:55.000
which is number 100. And you'll notice also that there's a

130
08:55.200 --> 08:58.110
bit of a typo here for number 93,

131
08:58.530 --> 09:00.990
and this is actually not our fault at all.

132
09:00.990 --> 09:04.380
It's, in fact, in the original Empire post.

133
09:04.440 --> 09:09.440
They actually screwed up and this should be number 93. Because we're scraping

134
09:10.650 --> 09:12.150
data we can't really be picky.

135
09:12.480 --> 09:15.090
We're just going to end up with what we end up with.

136
09:15.870 --> 09:20.870
So I'm going to choose the method where we actually take movie_titles

137
09:20.910 --> 09:25.910
and then we use the slice to get hold of it in reverse.

138
09:27.090 --> 09:29.460
And I'll call that the movies.

139
09:30.690 --> 09:35.640
And now we can create our new text file. So with open

140
09:35.730 --> 09:38.670
we're going to create a new file called movies.txt,

141
09:39.090 --> 09:44.090
and we of course have to change the mode to write mode so that we can actually

142
09:44.490 --> 09:47.370
create this file. And then

143
09:47.430 --> 09:51.810
because this file movies.txt doesn't exist, once this line runs

144
09:51.810 --> 09:55.230
it's going to create that file and then we're going to write to it.

145
09:55.230 --> 09:57.060
So we're going to say file.write

146
09:57.510 --> 10:02.510
and we're going to write each of the lines in our list of movies.

147
10:03.060 --> 10:05.160
We can again use a for loop

148
10:05.460 --> 10:08.370
so for movie in movies,

149
10:08.850 --> 10:13.850
let's go ahead and write the name of the movie

150
10:16.800 --> 10:20.040
and then let's add a newline character,

151
10:20.040 --> 10:25.040
so \n, so that we get each of the movies onto its own line.
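
NOTE
A sketch of the file-writing step, wrapped in a hypothetical helper named write_movies so the output path can be chosen by the caller. The explicit utf-8 encoding is an addition not mentioned in the lesson; it is there so titles with accented characters are written the same way on every platform.
```python
def write_movies(movies: list[str], path: str) -> None:
    """Write one title per line; mode "w" creates the file or overwrites an existing one."""
    with open(path, mode="w", encoding="utf-8") as file:
        for movie in movies:
            # The trailing newline puts each movie on its own line.
            file.write(f"{movie}\n")
```
Calling write_movies(movies, "movies.txt") with the reversed list would produce the file described in the lesson.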

152
10:26.250 --> 10:28.710
Now, finally, if I run this code,

153
10:28.830 --> 10:32.790
then it's going to create our movies.txt. And if I take a look at it,

154
10:32.820 --> 10:37.820
you can see it's now got all of the 100 movies listed on here from 1 down to

155
10:38.250 --> 10:39.083
100.

156
10:39.900 --> 10:43.710
And now you can go through this list and delete the ones that you've already

157
10:43.710 --> 10:47.970
watched and then continue watching through the rest of the list.

158
10:48.660 --> 10:52.470
I hope you had fun trying out web scraping by yourself in this project.

159
10:52.740 --> 10:56.100
So that's all for today. Take a look at what you've done.

160
10:56.130 --> 11:00.600
If there's anything confusing about the class or ID or HTML,

161
11:00.750 --> 11:04.410
then be sure to review some of the lessons in the previous four days.

162
11:05.400 --> 11:06.420
That's all for today.