WEBVTT

0
00:01.220 --> 00:07.760
Welcome back to yet another awesome lecture on encoding. And I don't want to harp on too much about URL

1
00:07.940 --> 00:10.940
encoding for the reason that this course is about forms.

2
00:10.940 --> 00:17.240
But it's good to know because sometimes in forms, especially with a GET request, characters are appended

3
00:17.240 --> 00:20.630
to the URL and it's going to be doing funky things, sometimes.

4
00:20.630 --> 00:24.200
And that's why we've kind of taken this tangent to learn about encoding.

5
00:24.200 --> 00:27.170
And this lecture in particular is quite advanced.

6
00:27.200 --> 00:32.510
There's a ton of information I want to get through, so please forgive me if I'm going too fast.

7
00:32.540 --> 00:34.790
If anything is unclear, please ask on Q&amp;A.

8
00:35.000 --> 00:36.440
But let me ask you this.

9
00:36.470 --> 00:38.260
What is a Web address?

10
00:38.270 --> 00:39.830
What is it used for?

11
00:40.880 --> 00:41.600
Well, that's right.

12
00:41.630 --> 00:47.540
A Web address is used to point to a resource on the Web, such as a Web page.

13
00:47.540 --> 00:50.870
You can think of a web address as directions.

14
00:51.470 --> 00:55.040
It tells your browser where to go to fetch a resource.

15
00:56.690 --> 00:59.540
And currently web addresses are expressed,

16
00:59.570 --> 01:06.910
they're defined, they're written using Uniform Resource Identifiers or URIs.

17
01:06.920 --> 01:13.820
And we've already seen that these URIs are governed by certain rules and these rules are defined in

18
01:13.820 --> 01:18.200
a document called the RFC 3986.

19
01:18.350 --> 01:25.310
And the long and the short of it, is that a URI is defined as a sequence of characters chosen from the

20
01:25.310 --> 01:27.860
US ASCII character set.

21
01:27.860 --> 01:32.540
And the key word here is the ASCII character set.

22
01:32.930 --> 01:39.590
This essentially restricts web addresses to a small number of characters, basically just upper and

23
01:39.590 --> 01:45.840
lower case letters of the English alphabet, European numerals, and a small number of symbols.

24
01:45.860 --> 01:53.090
Well, as I'm sure you can agree with me, my students, times have changed, and users' expectations and

25
01:53.090 --> 01:59.700
the use of the Internet have moved on since then, and there's now a growing need to enable use of characters

26
01:59.700 --> 02:02.190
from any language in web addresses.

27
02:02.190 --> 02:02.700
Why?

28
02:02.700 --> 02:08.220
Well, a web address in your own language and alphabet is just easier to create, memorize, transcribe,

29
02:08.220 --> 02:10.500
interpret, guess, and relate to.

30
02:11.040 --> 02:16.950
So it doesn't really make sense to restrict web addresses to the ASCII character set, does it?

31
02:17.040 --> 02:18.390
Well, no, it doesn't.

32
02:18.780 --> 02:20.370
But I wish things were simple.

33
02:20.400 --> 02:28.050
But unfortunately, when it comes to coding and development, things are complex and there is not one

34
02:28.050 --> 02:31.770
unified spec for URLs or URIs.

35
02:32.190 --> 02:34.170
They are found in many different places.

36
02:34.170 --> 02:40.650
Different organizations have attempted to write rules on how they should be governed.

37
02:40.650 --> 02:45.540
I don't want to get into all these different specs and what these organizations say.

38
02:45.570 --> 02:51.470
The point I'm trying to make is that over the years there have been lots of changes.

39
02:51.480 --> 02:53.430
"What kind of changes, Clyde?"

40
02:53.730 --> 02:54.960
Well, that's a good question.

41
02:54.990 --> 03:00.870
Originally, and I'm talking way back now in the 90s, everything was defined as a URL. Everything

42
03:00.870 --> 03:04.950
that you wrote in an address bar was a Uniform Resource Locator.

43
03:05.700 --> 03:10.920
But the term was later changed to URI in 2005.

44
03:11.340 --> 03:16.590
Later, the RFC 3987, not 6, remember 3986

45
03:16.590 --> 03:18.240
that defines a URI,

46
03:19.100 --> 03:27.590
but the RFC 3987 defined an IRI, and said that IRIs can be used instead of URIs. 

47
03:28.360 --> 03:34.480
But do you notice still that we've got separate definitions - a URI and an IRI?

48
03:34.840 --> 03:37.810
Don't worry, I'm going to be talking more about IRIs shortly.

49
03:37.810 --> 03:42.250
But for now, note that this RFC 3987 is important.

50
03:43.030 --> 03:51.520
It's important because the W3C has accepted this spec, which means all browsers need to conform to

51
03:51.520 --> 03:51.940
it.

52
03:52.090 --> 03:56.770
And remember, I said there were all the other organizations that have attempted to define URLs.

53
03:56.930 --> 04:00.220
One of them is this WHATWG consortium. 

54
04:00.460 --> 04:10.330
They've produced their own URL spec, basically mixing ideas from URIs and URLs and IRIs with a strong

55
04:10.330 --> 04:11.980
focus on browsers.

56
04:11.990 --> 04:13.660
And it kind of makes sense, right?

57
04:13.660 --> 04:15.390
Because how confusing is it ...

58
04:15.400 --> 04:18.490
there are so many different specs around with different definitions.

59
04:18.490 --> 04:19.630
It doesn't make sense.

60
04:19.630 --> 04:21.130
It's actually just confusing.

61
04:21.130 --> 04:22.630
So what else can I say about this

62
04:22.660 --> 04:23.860
WHATWG consortium?

63
04:23.860 --> 04:25.600
What did they try and achieve?

64
04:25.780 --> 04:29.650
Well, one of the goals was to align RFC 3986,

65
04:29.650 --> 04:35.380
remember, that defines the URI, and RFC 3987, which defines IRIs.

66
04:35.410 --> 04:37.780
And what's cool about the WHATWG is that

67
04:37.780 --> 04:40.670
it's very liberal in what a URL can accept.

68
04:40.720 --> 04:46.970
In fact, they say that a URL should be able to handle non-ASCII characters, which makes sense.

69
04:46.970 --> 04:53.900
And I guess unsurprisingly, they say that URLs should be specified as UTF-8, which you and I know

70
04:53.900 --> 04:56.840
can contain more than enough characters.

71
04:56.840 --> 05:01.040
So it really would be ideal if the spec became mandatory.

72
05:01.040 --> 05:06.440
But as I mentioned, it's RFC 3987, which rules the roost at the moment.

73
05:07.920 --> 05:08.580
Okay, cool.

74
05:08.580 --> 05:09.210
That's fine.

75
05:09.210 --> 05:15.260
But if you're anything like me, you love seeing examples, so let's look at a URL.

76
05:16.180 --> 05:19.150
Don't worry about what the international characters mean there.

77
05:19.180 --> 05:21.370
It's just some Japanese characters I put there.

78
05:22.240 --> 05:26.080
I just want us to talk about how this URL will be encoded.

79
05:26.090 --> 05:28.130
I want to talk about what this means.

80
05:28.160 --> 05:30.360
Well, firstly, this is not a URL,

81
05:30.370 --> 05:31.480
strictly speaking.

82
05:31.810 --> 05:35.910
This is known as an Internationalized Resource Identifier.

83
05:35.920 --> 05:37.870
An IRI.

84
05:37.900 --> 05:39.040
And why is this important?

85
05:39.040 --> 05:44.590
Well, this is important because, as I mentioned, a URI supports only ASCII character encoding.

86
05:44.590 --> 05:48.040
Remember, that's defined in RFC 3986.

87
05:48.040 --> 05:55.570
An IRI, on the other hand, fully supports international characters and good news for us is that 

88
05:55.570 --> 05:58.990
UTF-8 is the most popular encoding used for IRIs.

89
05:59.020 --> 06:03.880
We're going to be talking a lot more about this example URL that I've put up there shortly.

90
06:04.670 --> 06:10.040
But for now, and I've mentioned this before, important for us is that the W3C, which is basically

91
06:10.040 --> 06:16.610
the World Wide Web Consortium, has accepted RFC 3987, which defines an IRI.

92
06:16.640 --> 06:18.260
And why is this important?

93
06:18.290 --> 06:26.990
Well, it's important because various document formats, specifications and browsers support IRIs.

94
06:27.110 --> 06:27.410
Okay.

95
06:27.680 --> 06:31.220
So various document specs and browsers already support IRIs.

96
06:31.310 --> 06:38.660
But the problem is that not many protocols allow IRIs to pass through unchanged.

97
06:38.660 --> 06:46.410
And the protocol that we're familiar with when it comes to building sites and apps is HTTP or HTTPS.

98
06:46.430 --> 06:52.280
So if an IRI can't pass through a protocol, as is, what do we do?

99
06:53.180 --> 06:54.290
Well, that's a great question.

100
06:54.290 --> 06:59.450
And how an IRI works depends on where the non-ASCII character is located.

101
06:59.780 --> 07:03.500
Is it located in the domain name, the path?

102
07:03.740 --> 07:09.620
And this, my dear students, has created a lot of confusion, even amongst developers.

103
07:09.620 --> 07:10.430
Trust me.

104
07:10.760 --> 07:14.600
So what I'm about to share with you is super, super interesting, and it's going to put you ahead of

105
07:14.600 --> 07:15.300
the pack.

106
07:15.370 --> 07:17.300
But let me not get ahead of myself.

107
07:17.480 --> 07:19.220
Let's look at the URL again.

108
07:19.400 --> 07:21.280
It's exactly the same one we had before.

109
07:21.290 --> 07:22.470
And let's break this up.

110
07:22.490 --> 07:26.120
We already know that this HTTP is what?

111
07:26.950 --> 07:27.540
That's right.

112
07:27.550 --> 07:29.710
It's known as the scheme or the schema.

113
07:30.040 --> 07:32.710
It contains information about the scheme to be used.

114
07:32.860 --> 07:34.540
And this is what's important.

115
07:34.570 --> 07:38.770
Non-ASCII characters are not allowed in the scheme.

116
07:38.770 --> 07:43.660
So that's step one, and it's pretty obvious we don't want funky characters there.

117
07:44.020 --> 07:46.010
We've just got to keep it plain and simple.

118
07:46.030 --> 07:48.610
The next part is known as the ...

119
07:49.150 --> 07:49.780
that's right,

120
07:49.780 --> 07:51.250
it's known as the domain name.

121
07:51.250 --> 07:55.890
And the remainder of the URL is known as the path.

122
07:55.900 --> 08:01.960
And the path indicates the actual location of the resource you're trying to point to from the server

123
08:01.960 --> 08:02.590
root.

124
08:02.650 --> 08:02.890
Okay.

125
08:02.920 --> 08:03.820
Have you got it?

126
08:03.910 --> 08:05.160
Memorize this picture.

127
08:05.170 --> 08:10.840
I want us to now talk about what happens with those international characters in the domain name versus

128
08:10.840 --> 08:12.700
what happens to them in the path.
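
NOTE
Editor's sketch, not part of the lecture: a minimal Python illustration of
the three parts just described (scheme, domain name, path). The address
below is hypothetical, made up only for this example, and the standard
library's urlsplit is used purely to show where each part begins and ends.
```python
from urllib.parse import urlsplit
# Hypothetical IRI: the host and path are invented for illustration.
iri = "http://日本語.example/dir1/日"
parts = urlsplit(iri)
scheme = parts.scheme    # "http": the scheme must stay plain ASCII
host = parts.hostname    # the domain name, which may hold non-ASCII characters
path = parts.path        # the path, which may also hold non-ASCII characters
print(scheme, host, path)
```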

129
08:13.960 --> 08:17.980
Let's first and very briefly, discuss the domain name.

130
08:18.490 --> 08:21.250
Remember, that's that middle portion, the domain name.

131
08:22.360 --> 08:24.790
What happens to domain names?

132
08:24.820 --> 08:31.300
Well, what's interesting about domain names is that they are managed by domain name registration companies

133
08:31.300 --> 08:36.280
that are spread around the world. And the Internet Engineering Task Force,

134
08:36.760 --> 08:42.910
back in the early 2000s, they produced a spec that governs how multilingual domain names should be dealt

135
08:42.910 --> 08:43.300
with.

136
08:43.300 --> 08:47.140
And if you're very interested, you can read all of those specs.

137
08:47.140 --> 08:52.960
But the long and the short of it, is that the domain name registry defines the list of characters that

138
08:52.960 --> 08:57.220
people can request to be used in their country for top level domains.

139
08:57.220 --> 09:02.860
And what's really cool is that these organizations have agreed to certain kind of formats.

140
09:03.070 --> 09:10.630
And if a person requests a domain name using non-ASCII characters like those Japanese symbols you just

141
09:10.630 --> 09:17.590
saw, then these symbols, these characters, will get converted over to "punycode".
(record scratch sound)

142
09:18.550 --> 09:20.470
Wait a second!

143
09:20.980 --> 09:22.750
What is punycode?

144
09:23.630 --> 09:28.520
Well, don't stress my dear students. I don't want to get too much into it, but it just allows for the encoding

145
09:28.550 --> 09:34.870
of characters in the host name - the domain name - that should in theory only support ASCII characters.

146
09:34.880 --> 09:37.250
That's all that punycode allows for.

147
09:37.340 --> 09:42.920
And there's certain rules around punycode. There are certain rules that define how that conversion should

148
09:42.920 --> 09:44.870
take place and the format of it.

149
09:44.870 --> 09:49.790
And all these domain registrar companies around the world have agreed to this format.
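
NOTE
Editor's sketch, not part of the lecture: one way to see the Punycode
conversion is Python's built-in "idna" codec. Assumptions to keep in mind:
the host name is hypothetical, and this codec follows the older IDNA 2003
rules, so a real browser may treat some characters differently.
```python
# Hypothetical host name, used only for illustration.
host = "日本語.example"
# The "idna" codec turns each non-ASCII label into an ASCII "xn--" label
# (Punycode) and leaves labels that are already ASCII untouched.
ascii_host = host.encode("idna").decode("ascii")
# The conversion is reversible, which is how the readable form can be shown again.
round_trip = ascii_host.encode("ascii").decode("idna")
print(ascii_host)
print(round_trip)
```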

150
09:49.790 --> 09:57.080
And in theory, what's cool is that punycode could be used to allow for host names that use emojis.

151
09:57.110 --> 09:58.310
How cool would that be?

152
10:00.400 --> 10:06.370
But emojis are not a widely supported standard as of yet, so there's only a limited subset of top level

153
10:06.370 --> 10:08.890
domains that support emojis currently.

154
10:09.070 --> 10:10.270
But you never know.

155
10:10.300 --> 10:12.580
Things do change. Anyway, 

156
10:12.610 --> 10:14.170
let's hop back into the lecture.

157
10:14.170 --> 10:20.470
So we've discussed domain names ... you can kind of view them as having their own set of rules.

158
10:20.470 --> 10:25.540
And any non-ASCII character in a domain name has to be converted over to punycode.

159
10:25.930 --> 10:26.800
Have you got it?

160
10:27.130 --> 10:29.740
We really are doing a lot, so please stick with me.

161
10:29.770 --> 10:31.180
We are almost, almost done.

162
10:31.210 --> 10:36.610
So I've spoken about domain names, but now I quickly want to talk about how the path is dealt with,

163
10:36.610 --> 10:39.850
because the path is what we care about when building forms, right?

164
10:39.850 --> 10:45.190
Remember, with a GET request, the form data is appended to the URL, in the query string after the path.

165
10:45.190 --> 10:48.130
So this is really what concerns us, not really the domain name.
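
NOTE
Editor's sketch, not part of the lecture: roughly what happens to a form
field when it is submitted with GET. The field name "q" and the address
are hypothetical. Python's urlencode is used here only as a stand-in for
what the browser does, and it assumes UTF-8, which is its default.
```python
from urllib.parse import urlencode
# Hypothetical form field and action URL, for illustration only.
form_data = {"q": "日"}
# urlencode percent-encodes each value using its UTF-8 bytes.
query = urlencode(form_data)
url = "http://example.com/search?" + query
print(url)  # http://example.com/search?q=%E6%97%A5
```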

166
10:48.130 --> 10:54.340
And to remind you, we've got our URL and I only want to now talk about the path.

167
10:57.130 --> 11:01.030
Remember what we just said when dealing with domain names, that there are

168
11:01.060 --> 11:07.270
domain registration companies spread all over the world and they've all agreed to accept domain names

169
11:07.270 --> 11:10.990
in a particular form with a particular encoding.

170
11:10.990 --> 11:12.910
And that encoding was ASCII-based

171
11:12.930 --> 11:13.710
punycode.

172
11:13.720 --> 11:17.490
But path names are more complicated.

173
11:17.500 --> 11:18.340
Why?

174
11:18.370 --> 11:25.540
Well, just because path names can identify resources located on many different kinds of platforms whose

175
11:25.540 --> 11:30.970
file systems do and will continue to use many different encodings.

176
11:30.970 --> 11:36.040
And this makes the path much more difficult to handle than a domain name.

177
11:36.040 --> 11:44.110
But we don't have to stress because the good news is that the IETF standard 3987 deals with non-ASCII

178
11:44.140 --> 11:46.480
characters in the path.

179
11:46.480 --> 11:50.530
And at the crux of it, it's actually pretty simple.

180
11:50.710 --> 11:58.410
The spec says that browsers need to represent all non-ASCII characters using percent escaping, aka URL encoding.

181
11:58.430 --> 12:00.810
So what does this mean for our URL?

182
12:00.870 --> 12:02.900
Well, let's just forget about the domain section.

183
12:02.900 --> 12:04.160
Let's just look at our path.

184
12:04.610 --> 12:05.020
Right.

185
12:05.030 --> 12:07.460
"dir1" and the Japanese symbol.

186
12:09.160 --> 12:10.020
That's the path.

187
12:10.030 --> 12:15.250
Let's assume the page the characters are on is encoded in UTF-8 because that's pretty much every

188
12:15.250 --> 12:17.020
single site we visit today.

189
12:18.070 --> 12:19.180
How will this work?

190
12:20.970 --> 12:25.700
Well, the IRI spec says that the IRI should be converted to UTF-8 first.

191
12:25.710 --> 12:33.960
The user agent, aka the browser, then needs to convert every non-ASCII character to percent

192
12:33.990 --> 12:34.740
escapes.

193
12:34.770 --> 12:41.280
Remember, this is just URL encoding. So each byte is written as a percent sign followed by two hexadecimal

194
12:41.280 --> 12:41.980
digits.

195
12:42.000 --> 12:44.700
So what does this mean for our URL?

196
12:45.650 --> 12:47.840
Well, firstly, the dir1 stays as dir1.

197
12:47.870 --> 12:51.020
They are all ASCII character based, right?

198
12:51.060 --> 12:56.360
A "d", an "i", an "r" and a "1" all form part of the ASCII character set, so nothing needs to happen.

199
12:56.390 --> 13:00.440
The only thing that needs to be percent-encoded is that symbol.

200
13:01.180 --> 13:02.260
And there we have it.

201
13:02.290 --> 13:04.030
It'll look something like this.
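
NOTE
Editor's sketch, not part of the lecture: the two steps just described
(convert to UTF-8, then percent-escape every non-ASCII character) written
out by hand in Python. The path "/dir1/日" and the helper name
percent_encode_path are hypothetical, chosen for illustration. The helper is
deliberately simplified: it leaves every ASCII character alone, whereas a
real encoder also escapes some ASCII characters such as spaces.
```python
from urllib.parse import quote
def percent_encode_path(path: str) -> str:
    """Keep ASCII characters; write every other character as %XX per UTF-8 byte."""
    out = []
    for ch in path:
        if ord(ch) < 128:
            out.append(ch)  # ASCII such as "dir1" passes through unchanged
        else:
            # One %XX escape for each byte of the character's UTF-8 encoding.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)
encoded = percent_encode_path("/dir1/日")
# For this input the standard library's quote() gives the same result.
same_as_stdlib = encoded == quote("/dir1/日")
print(encoded)  # /dir1/%E6%97%A5
```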

202
13:04.060 --> 13:05.140
Isn't that pretty cool?

203
13:05.470 --> 13:10.540
It just means the URL path section is now in URI form.

204
13:10.870 --> 13:13.990
It's kind of being converted from an IRI to a URI.

205
13:14.020 --> 13:15.460
And why is this important?

206
13:15.490 --> 13:17.710
Well, remember a few slides back.

207
13:17.980 --> 13:20.320
Actually, if I go back here, let me try and show you.

208
13:21.830 --> 13:22.340
Here.

209
13:22.370 --> 13:25.940
On the slide here, remember I said there are many different documents,

210
13:25.940 --> 13:32.090
specs and browsers that support IRIs, but not so many protocols allow IRIs to pass through,

211
13:32.090 --> 13:32.920
unchanged.

212
13:32.930 --> 13:33.290
Right.

213
13:33.290 --> 13:36.530
And the protocol we're dealing with is HTTP.

214
13:36.800 --> 13:43.640
So HTTP needs a URL, a valid URL in order to transport that over the wire.

215
13:47.780 --> 13:48.380
Whew, here

216
13:48.380 --> 13:49.970
we are, back here again.

217
13:50.480 --> 13:54.050
That's why it's important, right, to get from an IRI to a valid URI.

218
13:54.440 --> 14:00.110
It's just going to allow protocols such as HTTP to send that request.

219
14:00.110 --> 14:03.710
And just note how that dir1 did not change.

220
14:03.950 --> 14:08.710
They were unreserved characters in the ASCII set, which we spoke about earlier.

221
14:08.720 --> 14:13.630
So at this point, the user agent, the browser can now send the request for the page.

222
14:15.130 --> 14:18.400
Whew, so this is all good and well, and we are almost done.

223
14:18.400 --> 14:20.500
But you might be thinking, okay, cool.

224
14:20.500 --> 14:24.550
So anything in the path name has to be kind of percent encoded.

225
14:24.550 --> 14:25.420
I've got it.

226
14:25.570 --> 14:32.020
But when we looked at an example of a form that submitted Japanese symbols, why did we see those actual Japanese

227
14:32.020 --> 14:33.790
symbols in the address bar?

228
14:34.750 --> 14:40.810
In other words, why don't we see all the percentage hex values in the address bar?

229
14:41.500 --> 14:43.360
Well, that's a very, very good question.

230
14:43.360 --> 14:49.420
And just remember, the address bar on the browser is not the actual URL that's sent over the wire.

231
14:49.450 --> 14:55.690
The address bar is a UI component that allows users to enter all kinds of fun strings that will get

232
14:55.690 --> 14:58.480
converted over to a URL at some point.

233
14:58.480 --> 15:04.510
Basically, it just makes the web experience nicer for a user, so you can kind of think of the URL

234
15:04.510 --> 15:08.710
address bar as just being a visual aid to us users.

235
15:08.710 --> 15:14.470
It's not necessarily the actual true form of the URL. And modern clients,

236
15:14.470 --> 15:16.190
I'm just meaning web browsers,

237
15:16.190 --> 15:22.340
they are able to transform back and forth between percent encoding and Unicode,

238
15:22.340 --> 15:28.820
so the URL is transferred as ASCII, but it looks pretty for us as the user because the browser understands

239
15:28.820 --> 15:34.490
both - the browser understands IRIs, it understands URIs, so it makes sense that it should just

240
15:34.490 --> 15:38.780
display us the correct symbols. And it can do its thing in the background.
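
NOTE
Editor's sketch, not part of the lecture: the reverse direction, going from
the ASCII form sent over the wire back to the readable form shown in the
address bar. The path is hypothetical, and Python's unquote only stands in
for whatever a given browser actually does when it displays an address.
```python
from urllib.parse import unquote
wire_path = "/dir1/%E6%97%A5"       # ASCII-only form, as sent over HTTP
display_path = unquote(wire_path)   # escapes decoded as UTF-8 for display
print(display_path)  # /dir1/日
```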

241
15:38.780 --> 15:42.020
I know, I know this lecture is getting very long and I'm just about to finish off.

242
15:42.020 --> 15:44.390
What is the bottom line of this whole lecture?

243
15:44.570 --> 15:52.070
Well, it is that IRIs are basically URIs that allow non-ASCII characters to be used. It makes sense.

244
15:52.310 --> 15:57.290
And most browsers today will allow you to see international characters in the address bar.

245
15:57.290 --> 16:05.210
But in the background they are using techniques to convert these characters to ASCII so it can be transported

246
16:05.210 --> 16:07.820
over the HTTP protocol.

247
16:07.820 --> 16:10.040
They're using techniques.

248
16:10.430 --> 16:11.940
What kind of techniques?

249
16:11.960 --> 16:13.400
Well, it depends, right?

250
16:13.880 --> 16:18.590
It depends if those international characters are in the hostname or in the path.

251
16:18.590 --> 16:21.650
If it's in the hostname, the browsers use Punycode.

252
16:21.680 --> 16:29.660
If it's in the path, the browsers use percent URL encoding, which is defined in RFCs

253
16:29.660 --> 16:31.880
3986 and 3987.

254
16:33.410 --> 16:34.970
So there we have it.
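
NOTE
Editor's sketch, not part of the lecture: the bottom line above as one small
Python function, using Punycode for the host name and percent-encoding for
the rest. The name iri_to_uri and the sample address are hypothetical. It
assumes an absolute http(s) address with a host name, ignores user info,
and relies on Python's IDNA 2003 codec, so it is a simplification of what
browsers really do.
```python
from urllib.parse import quote, urlsplit, urlunsplit
def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)
    # Host name: non-ASCII labels become "xn--" Punycode labels.
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host += f":{parts.port}"
    # Path, query and fragment: non-ASCII characters become %XX escapes.
    # "%" is kept as-is so existing escapes are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, host, path, query, fragment))
uri = iri_to_uri("http://日本語.example/dir1/日")
print(uri)
```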

255
16:35.360 --> 16:39.110
I told you this was going to blow your mind 🤯.

256
16:39.110 --> 16:41.900
And seriously, this is very, very advanced stuff.

257
16:42.140 --> 16:48.020
It took me a long time to actually wrap my head around how URL encoding works, so I hope you appreciate

258
16:48.020 --> 16:48.350
it.

259
16:48.350 --> 16:52.370
And yes, you might not need to know as much as we've discussed here, but you know what?

260
16:52.370 --> 16:54.320
It's just another feather in your cap.

261
16:54.320 --> 16:56.300
It's going to make you a better programmer.

262
16:56.300 --> 17:02.960
And when you start dealing with forms, GET requests and you start seeing some percentage URL encoding,

263
17:02.960 --> 17:05.620
sometimes you'll see that numerical character reference,

264
17:05.630 --> 17:10.700
remember that? When you see those things, at least now you'll be able to appreciate what's happening.

265
17:10.700 --> 17:13.430
So I think let me end the lecture here.

266
17:13.550 --> 17:18.130
There are still a few more things I want to talk about when it comes to URL encoding, things that are

267
17:18.130 --> 17:19.570
perhaps a bit more practical.

268
17:19.570 --> 17:23.740
Stay motivated, grab a coffee and I'll see you in the next lecture.