WEBVTT

0
00:00.510 --> 00:05.250
Now that we've learned how to do web scraping with requests and Beautiful Soup,

1
00:05.880 --> 00:10.880
it's time to step back for a moment and have a think about what we're allowed to

2
00:11.490 --> 00:16.490
do and what might not be a good idea when we're scraping data from other

3
00:16.710 --> 00:20.640
websites. Because after all, we don't own that data, right?

4
00:21.450 --> 00:26.370
When you think about services like Google or Bing or any other search engine,

5
00:26.790 --> 00:31.790
essentially what they're doing is they're constantly scraping data from all of the

6
00:32.940 --> 00:35.280
websites that are listed on the internet.

7
00:35.850 --> 00:39.780
And that's how they manage to get the information about what's on each page

8
00:40.110 --> 00:43.740
and for it to show up for users who use their search service.

9
00:44.490 --> 00:49.490
Now we have to step back for a moment and think about what is the law on web

10
00:49.920 --> 00:52.530
scraping? What is legal and what is illegal?

11
00:53.250 --> 00:55.530
Even as we were looking at Hacker News

12
00:55.530 --> 00:59.850
just now I noticed that one of the articles in fact talks about the Genius

13
00:59.880 --> 01:04.170
lawsuit with Google. And in terms of recent history,

14
01:04.170 --> 01:08.310
there are two really famous cases, one of which is Genius suing Google

15
01:08.640 --> 01:12.810
because they're saying that Google is scraping the song lyrics from their

16
01:12.810 --> 01:17.340
website and they're actually displaying it without taking people to Genius.

17
01:17.910 --> 01:21.270
So for example, if we're looking at the lyrics for Code Monkey,

18
01:21.930 --> 01:26.930
you can see that Google automatically shows the lyrics straight inside of

19
01:26.940 --> 01:27.773
Google.

20
01:27.810 --> 01:32.760
That means that a user can potentially just get all the information they need,

21
01:33.120 --> 01:34.410
say all of the song

22
01:34.410 --> 01:39.410
lyrics to this song, without ever needing to visit the website where this lyric

23
01:39.960 --> 01:40.793
might come from.

24
01:41.430 --> 01:45.600
And that lyric might've been compiled by somebody on Genius.

25
01:46.170 --> 01:50.190
Genius has a lyric annotation website. And of course,

26
01:50.280 --> 01:51.510
as with all websites,

27
01:51.510 --> 01:56.510
they rely on users actually visiting their website to make money or to show ads.

28
01:57.630 --> 02:00.570
And if Google simply just shows it in their search results,

29
02:00.870 --> 02:03.930
then this can be a problem for websites like Genius.

30
02:04.470 --> 02:09.470
So Genius sued Google over this and actually ended up losing the lawsuit.

31
02:10.560 --> 02:15.560
Another really famous example of a lawsuit over scraping is hiQ versus

32
02:17.130 --> 02:22.080
LinkedIn. So hiQ was scraping data from LinkedIn to use commercially.

33
02:22.620 --> 02:27.360
So LinkedIn tried to stop them, it went to court, and LinkedIn ended up losing.

34
02:28.050 --> 02:29.250
Based on these lawsuits,

35
02:29.280 --> 02:33.960
we have a little bit of a better idea of what is legal when it comes to web

36
02:33.960 --> 02:36.180
scraping and what is not legal.

37
02:36.780 --> 02:41.460
The law actually seems to favor web scraping in the sense that you're allowed to

38
02:41.460 --> 02:43.350
scrape a website's data

39
02:43.920 --> 02:47.430
as long as you think about a couple of things.

40
02:48.090 --> 02:53.090
A lot of people have been writing about web scraping being legal based on the

41
02:53.370 --> 02:55.860
LinkedIn versus hiQ case.

42
02:56.250 --> 03:01.060
But the important thing to remember is that this is not a blanket sort of,

43
03:01.300 --> 03:04.690
you can do whatever you want, scraping any website's data.

44
03:05.320 --> 03:10.320
It only means that data that is publicly available and not copyrighted is

45
03:11.560 --> 03:16.390
probably legal for companies to scrape. Now,

46
03:16.420 --> 03:18.400
if you are using this data privately

47
03:18.400 --> 03:22.510
like we are creating some sort of service for ourselves, then it doesn't really

48
03:22.510 --> 03:24.160
matter. You're just a user.

49
03:24.820 --> 03:28.150
The difficulty comes when you're trying to commercialize that data,

50
03:28.150 --> 03:32.590
when you set up a business and your business kind of involves somebody else's

51
03:32.590 --> 03:35.890
data. That is a bit of a gray area. Now,

52
03:35.950 --> 03:38.170
the things that we definitely know are

53
03:38.170 --> 03:40.720
that you can't commercialize copyrighted content.

54
03:40.990 --> 03:45.790
So if you scrape data from YouTube and you scraped the video data,

55
03:45.820 --> 03:50.820
you can't just use that video on your own website. That is still not allowed

56
03:51.040 --> 03:56.040
because that video is copyrighted and it's created by a YouTube user and the

57
03:56.650 --> 04:00.820
copyright belongs to that user, not to you. So this is still illegal.

58
04:01.600 --> 04:05.620
This might also apply to other things like a Medium blog post that somebody else

59
04:05.620 --> 04:09.310
wrote or a piece of music that's being hosted on Spotify.

60
04:09.760 --> 04:12.430
So copyrighted content you can't commercialize.

61
04:13.030 --> 04:17.110
The second thing is that you can't scrape data that's behind authentication.

62
04:17.470 --> 04:21.100
So if you have to log into Facebook in order to scrape the data,

63
04:21.310 --> 04:22.810
then that's pretty much illegal.

64
04:23.380 --> 04:27.460
And the reason for this is when you sign up as a user to any of these services

65
04:27.460 --> 04:30.400
like Facebook or Twitter or Instagram,

66
04:30.820 --> 04:35.020
there's a policy in there that you are agreeing to when you sign up that says

67
04:35.050 --> 04:39.610
I agree to not use this data that I obtained on this website commercially.

68
04:40.180 --> 04:43.120
But the data that is not behind authentication,

69
04:43.420 --> 04:46.720
so any website that you can access as it is,

70
04:47.140 --> 04:51.490
they can't bind you to a policy because you haven't agreed to anything.

71
04:51.970 --> 04:56.260
So if the website has data that's just out there in the open that you can access

72
04:56.260 --> 05:00.430
without logging in and the content is not something that can be copyrighted,

73
05:00.670 --> 05:03.640
then it is fair game legally. Now,

74
05:03.670 --> 05:07.780
just because it's legal doesn't mean that you can actually do it.

75
05:08.350 --> 05:13.350
A lot of websites will use CAPTCHA or reCAPTCHA in order to prevent bots like

76
05:13.750 --> 05:18.610
our Python code from getting data from their websites. Every single time

77
05:18.640 --> 05:21.850
you're completing one of these CAPTCHAs, it's testing

78
05:21.850 --> 05:24.100
to see whether you're actually a real human

79
05:24.340 --> 05:28.840
or if you're just a bit of code that is trying to access their data. CAPTCHA was the

80
05:28.840 --> 05:33.340
old version where you had to type in some squiggly letters and reCAPTCHA is the

81
05:33.340 --> 05:36.130
new version where you just have to tick a checkbox.

82
05:36.460 --> 05:38.560
And it's actually really interesting how it works.

83
05:39.400 --> 05:43.210
It looks at things like how your mouse approaches the checkbox,

84
05:43.210 --> 05:47.020
how you maybe quiver a little bit before you actually check it

85
05:47.260 --> 05:51.280
and other things like your cookies and the stored data that they have on you.

86
05:51.970 --> 05:56.170
Essentially, this service is used by websites to prevent people

87
05:56.230 --> 06:00.590
from scraping their data using a bot. The other thing to remember is that,

88
06:00.830 --> 06:03.560
you know, if you get sued by somebody like LinkedIn

89
06:03.560 --> 06:08.420
because you're using their data and you're building a business on it

90
06:08.450 --> 06:10.970
like hiQ is, then you can

91
06:10.970 --> 06:14.060
at any moment be hit with a really expensive lawsuit

92
06:14.450 --> 06:18.620
and you are going to have to pay a lot of money to lawyer up in order to contest

93
06:18.620 --> 06:20.420
this and actually to fight them in court.

94
06:20.930 --> 06:25.400
Unless you have the money to lawyer up and fight a company like LinkedIn,

95
06:26.000 --> 06:29.810
it's really important to know what the implications of web scraping are,

96
06:29.930 --> 06:33.590
especially when you're selling that data as a part of your business.

97
06:34.250 --> 06:37.040
But in addition to the sort of legal side of things,

98
06:37.190 --> 06:40.940
the other part that you should really think about is the ethics of web scraping.

99
06:41.390 --> 06:44.810
This is basically putting aside what is legal and what is illegal,

100
06:45.020 --> 06:46.640
but more thinking about what is right

101
06:46.640 --> 06:51.640
and what's wrong because let's say that you've built a website and you've got

102
06:51.770 --> 06:56.000
some sort of bot that's constantly scraping it for data, data that you know,

103
06:56.300 --> 06:58.730
has been generated by your own users

104
06:58.970 --> 07:03.950
that's really precious and that you might even charge for, then,

105
07:03.980 --> 07:07.400
is it really right for somebody to do that?

106
07:07.940 --> 07:11.780
So I often follow the rule where if I don't want something to happen to me,

107
07:11.840 --> 07:16.160
I try to not do that to others. In terms of the ethics, a couple of things

108
07:16.170 --> 07:18.470
I would recommend abiding by is

109
07:18.770 --> 07:21.800
if you come across a website and they have a public API

110
07:21.860 --> 07:26.210
which we've already learned about and we know how to use, then always

111
07:26.210 --> 07:30.770
always go for the API. If it requires an application, then apply for it.

112
07:31.100 --> 07:35.570
Don't just go ahead and try to take their data when there's already a route for

113
07:35.570 --> 07:37.310
you to use and access their data.

114
07:38.480 --> 07:42.590
The second thing is to respect the website owner, because you know,

115
07:42.590 --> 07:46.550
you don't want somebody to access your website a million times a second,

116
07:46.610 --> 07:49.520
potentially making your website go down

117
07:49.670 --> 07:51.680
or it could count as a DDoS

118
07:51.680 --> 07:55.310
attack where it affects other users using the website.

119
07:56.090 --> 07:57.290
When you are on a website,

120
07:57.590 --> 08:02.360
they actually provide a way for you to tell what it is that you can scrape and

121
08:02.360 --> 08:02.810
what it is

122
08:02.810 --> 08:07.810
you can't. At the very end of the URL, after the .com or .co.uk,

123
08:08.930 --> 08:13.220
if you put a forward slash and put robots.txt, you can see

124
08:13.220 --> 08:18.220
this is the advice that they give to any bots that are potentially scraping

125
08:18.260 --> 08:19.093
their website.

126
08:19.610 --> 08:24.610
User agent is the person who is scraping, the person or the bot that's scraping,

127
08:25.280 --> 08:27.890
and it tells you what are the things that it disallows.

128
08:28.220 --> 08:32.690
So it doesn't want you to access the /vote?, /reply?, 

129
08:32.690 --> 08:35.300
/submitted?, /threads?.

130
08:35.600 --> 08:39.950
So basically any of these endpoints are ones that they don't really want you to

131
08:39.950 --> 08:41.840
use. For example,

132
08:41.840 --> 08:45.050
here I've accessed the /reply?

133
08:45.380 --> 08:48.890
which is a way to log in and reply to a particular comment.

134
08:49.280 --> 08:51.740
Now that really shouldn't be a bot kind of action

135
08:51.740 --> 08:54.230
because then it means the data that's generated

136
08:54.530 --> 08:57.690
or the replies on here will be automated, right?

137
08:57.690 --> 09:01.980
You actually want humans to comment and reply on the articles rather than some

138
09:01.980 --> 09:02.813
sort of robot.

139
09:03.660 --> 09:07.590
So these are the paths that they don't want you to access as a bot.

140
09:08.040 --> 09:10.890
And finally, it even tells you a crawl-delay.

141
09:10.920 --> 09:15.630
So this is the number of seconds that you should leave between each time you hit

142
09:15.630 --> 09:16.463
up the website.
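
NOTE
The robots.txt rules described above can also be checked from Python.
This is a minimal sketch using the standard library's urllib.robotparser.
The sample text only mirrors the disallowed paths read out in the video;
the Crawl-delay value of 30 and the two example URLs are assumptions made
for illustration, not values taken from the transcript.
```python
from urllib.robotparser import RobotFileParser
# Illustrative robots.txt text. The disallowed paths are the ones read out in
# the video; the Crawl-delay value of 30 is an assumption for this sketch.
SAMPLE_ROBOTS_TXT = """\
User-Agent: *
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
Crawl-delay: 30
"""
def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule checker without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser
rules = load_rules(SAMPLE_ROBOTS_TXT)
# Hypothetical URLs: one under a disallowed path, one that no rule mentions.
reply_allowed = rules.can_fetch("*", "https://news.ycombinator.com/reply?id=1")
news_allowed = rules.can_fetch("*", "https://news.ycombinator.com/news")
# Seconds the site asks bots to wait between requests (None if not specified).
delay_seconds = rules.crawl_delay("*")
print(reply_allowed, news_allowed, delay_seconds)
```
To check a live site instead of sample text, the same parser offers
set_url() followed by read(), which downloads the site's robots.txt once.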

143
09:17.250 --> 09:22.200
If we're writing Python code and we're using Beautiful Soup and requests to

144
09:22.200 --> 09:24.210
scrape data from YCombinator,

145
09:24.480 --> 09:28.590
we could potentially get that code to run every fraction of a second, right?

146
09:28.590 --> 09:33.450
I could just write a for loop and just get this to keep scraping again and again

147
09:33.450 --> 09:34.283
and again.

148
09:34.350 --> 09:38.880
But that means that you're adding a lot of extra traffic and a lot of extra

149
09:38.880 --> 09:43.560
demand on their servers which could potentially mean that real users,

150
09:43.560 --> 09:48.560
real humans who want to access their website might not be able to do it at a fast

151
09:48.780 --> 09:51.090
speed. So this is the reason why

152
09:51.120 --> 09:53.880
when a lot of people are accessing the same website,

153
09:53.910 --> 09:58.910
say when a new ticket has been released for Glastonbury or some sort of big

154
09:59.100 --> 10:01.950
concert, that the website can go down.

155
10:02.010 --> 10:05.430
It's because a lot of servers can't cope with so much demand.

156
10:05.850 --> 10:08.070
And when that demand is coming from a for loop,

157
10:08.340 --> 10:12.150
then you can imagine that you're just adding a lot of extra work onto the web

158
10:12.150 --> 10:15.480
server. So always respect their crawl-delay

159
10:15.480 --> 10:20.190
if you see one in the robots.txt, and even if you don't see one,

160
10:20.280 --> 10:24.450
just try to limit your rate so that you don't max out their server.

161
10:24.840 --> 10:27.450
I recommend not scraping more than once a minute.
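
NOTE
The rate-limiting advice above could be sketched roughly like this.
The function name fetch_politely and its parameters are hypothetical;
the 60-second default simply follows the "once a minute" suggestion.
The fetch function is passed in, so the sketch itself makes no requests.
```python
import time
from typing import Callable, Iterable, List
def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```
With the requests library from earlier lessons, fetch could be something
like lambda url: requests.get(url).text, and delay_seconds could be set to
the site's crawl-delay whenever its robots.txt provides one.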

162
10:28.200 --> 10:32.340
Y Combinator's robots.txt is actually quite permissive.

163
10:32.370 --> 10:35.430
It allows you to do pretty much anything you want,

164
10:35.760 --> 10:37.950
but that's not true for all websites.

165
10:38.160 --> 10:40.260
If you look at the robots.txt for LinkedIn,

166
10:40.590 --> 10:43.770
you can see that they really don't want anyone to scrape it.

167
10:43.770 --> 10:45.450
There is a bit of legal jargon,

168
10:45.480 --> 10:49.470
there's a lot more disallows that you can see, right?

169
10:49.950 --> 10:53.820
This is probably not a website where I would scrape their data and try to build a

170
10:53.820 --> 10:54.690
company around.

171
10:55.620 --> 10:59.940
So remember that this is a piece of text that the website owners have written

172
11:00.180 --> 11:04.920
for you to look at to see what you can and can't do with their website.

173
11:05.280 --> 11:06.990
So before you scrape a website,

174
11:07.320 --> 11:12.320
always go to the root of their URL and check out their robots.txt and follow

175
11:14.490 --> 11:18.420
the ethical codes of conduct when you're trying to commercialize a project.

176
11:18.810 --> 11:22.770
So this is just a quick tip on the law and ethics of web scraping

177
11:22.980 --> 11:24.960
just so that you don't get into trouble in the future.