WEBVTT

0
00:00.870 --> 00:06.810
We're going to be getting into this later, but just know that the question mark, and that equals sign, these...

1
00:06.810 --> 00:11.520
are known as "reserved characters" for URL encoding. Alright, 

2
00:11.850 --> 00:12.970
well this is pretty straightforward...

3
00:13.110 --> 00:13.820
so let's move on.

4
00:14.100 --> 00:15.660
Let's get into the more complicated bit...

5
00:16.670 --> 00:23.330
and that is, what is this about? 
(sound effect: what did you say?)
Well, that is a brilliant question, my dear student...

6
00:23.340 --> 00:25.100
and that's what I want to talk about now.

7
00:26.030 --> 00:32.750
Firstly, the Chrome browser knows that it cannot represent theta in the ISO charset...

8
00:33.200 --> 00:38.930
and so it doesn't even try to represent it in that character encoding type, because it can't; it doesn't exist.

9
00:39.500 --> 00:41.000
So what does it do then?

10
00:41.510 --> 00:49.730
Well, the default action of this browser is to transform the character - that theta character -

11
00:50.300 --> 00:54.410
into what's known as "numeric character references".

12
00:54.680 --> 01:00.440
And as I mentioned before, other browsers may behave differently, opting to generate question marks...

13
01:00.440 --> 01:04.560
or even to prevent the input being inserted in the URL entirely.

14
01:04.580 --> 01:06.770
It just depends on the browser.

15
01:07.400 --> 01:10.640
I know we're doing a lot of quick cuts here, but this is very, very important.

16
01:10.740 --> 01:16.430
But I just want to mention here: remember in the previous lecture, when we defined the accept-charset encoding...

17
01:16.430 --> 01:20.550
type as utf-8? What happened when we used theta in that instance? 

18
01:21.350 --> 01:22.040
Well, that's right...

19
01:22.040 --> 01:27.320
in that case, we actually saw that theta character in the URL itself.

20
01:27.830 --> 01:32.270
And the reason is that theta exists in the UTF-8 character set.

21
01:32.300 --> 01:37.720
And that just meant that the browser didn't have to encode that character into a numerical character

22
01:37.730 --> 01:38.200
reference.

23
01:38.660 --> 01:45.380
And as I mentioned, it's because that theta character could just be encoded directly into the URL in...

24
01:45.380 --> 01:46.850
its correct encoding type.

25
01:47.080 --> 01:52.640
But now that we've specified an encoding type that does not recognize theta, the browser has to make...

26
01:52.640 --> 01:56.110
another plan. It has to do something, right? It can't just do nothing.

27
01:56.420 --> 01:57.740
And that's what it's trying to do,

28
01:57.740 --> 02:01.100
and that's why it converts it into a numerical character reference.

29
02:01.670 --> 02:06.500
OK, so the first thing you need to realize is that when the character is outside of that encoding...

30
02:06.500 --> 02:13.260
type, this browser, Chrome, will try and convert that character into a numeric character reference.

31
02:14.060 --> 02:18.560
The important thing to know, though, is that the browser would not have to do this if that character

32
02:18.770 --> 02:21.950
was, in fact, part of the encoding type we specified.

33
02:22.280 --> 02:26.960
So this is now something else that has to happen because it doesn't fit within the encoding type.

34
02:26.970 --> 02:31.460
It has to convert that character into a numeric character reference.

35
02:31.780 --> 02:32.910
"Well, what does that mean, Clyde?"

36
02:32.980 --> 02:40.760
Well, it means that each value is encoded into this format - an "&", a "#", a "D" and a ";". 

37
02:40.760 --> 02:47.740
And that "D" represents the character's decimal codepoint value in the Unicode character set.

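NOTE
A minimal sketch of the format just described, using Python purely for
illustration. The helper name to_ncr is hypothetical and not from the lecture;
it only shows how the D in the NCR format is filled with a decimal codepoint.
```python
def to_ncr(ch):
    """Build the NCR form: '&', '#', the decimal Unicode codepoint, ';'."""
    # ord() returns the character's Unicode codepoint as a decimal integer.
    return f"&#{ord(ch)};"
#
print(to_ncr("θ"))
```
Any single character can be passed to to_ncr in the same way.
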
38
02:47.750 --> 02:53.420
This is important and this is the distinguishing difference when a character's outside of the encoding

39
02:53.420 --> 02:53.770
type.

40
02:53.990 --> 03:02.000
If the character is in the encoding type, then what's encoded in the URL is the hexadecimal form of the bytes

41
03:02.000 --> 03:03.780
of that specific character.

42
03:04.010 --> 03:06.760
Here, that doesn't exist because it's not in the encoding type...

43
03:07.160 --> 03:12.710
therefore it has to be converted into a numeric character reference, and that is stored by the browser as a decimal

44
03:12.830 --> 03:14.150
codepoint value.

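NOTE
A rough sketch of the two cases contrasted above, in Python for illustration only.
Assumptions: "the ISO charset" is taken to be ISO-8859-1 (the lecture does not
name it), and encode_for is a hypothetical helper that imitates the described
Chrome behaviour rather than reproducing the browser's actual code.
```python
from urllib.parse import quote
#
def encode_for(ch, charset):
    """Percent-encode ch if charset can represent it, else fall back to an NCR."""
    try:
        # In the charset: the character's bytes are written as hexadecimal.
        return quote(ch, safe="", encoding=charset, errors="strict")
    except UnicodeEncodeError:
        # Outside the charset: use the decimal numeric character reference.
        return ch.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
#
print(encode_for("θ", "utf-8"))       # %CE%B8
print(encode_for("θ", "iso-8859-1"))  # &#952;
```
The NCR produced in the second case has not yet been URL encoded.
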
45
03:14.750 --> 03:19.670
And before we move on, I just want to remind you that there is a difference between numerical character references,

46
03:19.670 --> 03:25.460
which is what the browser is doing now when a character is not within that accept-charset type, and...

47
03:25.460 --> 03:31.520
an external character set a.k.a. the charset. There are differences between the two. Very quickly, a...

48
03:31.520 --> 03:38.930
numeric character reference, or NCR for short, is just a common markup construct in HTML, used in order

49
03:38.930 --> 03:43.310
to represent characters that cannot be written directly in the chosen encoding.

50
03:43.760 --> 03:50.360
And remember in our example here, that theta, that character cannot be URL encoded directly because

51
03:50.360 --> 03:53.120
it doesn't exist in the encoding type that we've specified.

52
03:53.450 --> 03:56.990
That's why this NCR has been the fallback. 

53
03:57.970 --> 04:05.870
With an external character set, for example, this is chosen by us, by the web author, by the programmer.

54
04:05.890 --> 04:07.250
And why do we define it?

55
04:07.270 --> 04:13.270
Well, we define it so we have complete control over what characters we want to allow our browser to accept.

56
04:13.720 --> 04:20.830
So, in short, the numerical character reference is a way for browsers to understand what the characters

57
04:20.830 --> 04:23.980
mean. The external character set, the charset attribute,

58
04:24.250 --> 04:29.050
well, that's just an encoding that you and I can specify to both browsers and servers.

59
04:29.470 --> 04:32.830
So those are the key differences between the two. But enough of this already.

60
04:33.160 --> 04:36.070
Let's jump back into why the URL looks like it does.

61
04:36.500 --> 04:38.680
Remember, we've got this format, right?

62
04:39.010 --> 04:46.990
This is the numerical character reference format. And that "D" stores whatever value you are dealing with...

63
04:47.560 --> 04:49.240
in the decimal codepoint value.

64
04:49.450 --> 04:51.520
So what is the theta value?

65
04:52.450 --> 04:57.310
Well, the theta sign has a value of 952 in the Unicode character set.

66
04:58.150 --> 05:00.790
This is going to result in what URL?

67
05:00.790 --> 05:04.840
Just take our URL, that standard & # D and ;...

68
05:05.200 --> 05:07.260
and now we can replace the D, right.

69
05:07.310 --> 05:12.280
We can replace that D with the value of 952. There we go, 

70
05:12.310 --> 05:13.330
this is what it looks like.

71
05:13.990 --> 05:17.170
Okay, but this doesn't look like our URL previously, does it? 

72
05:17.230 --> 05:19.540
We don't have all those weird percent signs and numbers.

73
05:19.810 --> 05:20.350
That's right.

74
05:20.350 --> 05:23.100
We're not quite done yet because up to this point...

75
05:23.560 --> 05:26.790
no URL encoding has taken place.

76
05:27.040 --> 05:29.290
Whaaat? I know it sounds strange...

77
05:29.500 --> 05:33.160
just remember that URL encoding does have to take place at some point.

78
05:33.280 --> 05:38.920
Every time a URL is sent to a server, there are rules that the browser has to abide by.

79
05:39.040 --> 05:44.620
And the major rule there is that all the characters in that URL have to form part of the ASCII

80
05:44.620 --> 05:45.370
character set.

81
05:46.150 --> 05:47.360
That's just a bit of FYI.

82
05:47.770 --> 05:49.990
Don't worry, we're going to be talking more about this later.

83
05:50.740 --> 05:53.650
So for now, let's look at our URL.

84
05:53.740 --> 05:58.330
But now we're not concerned about the number 952, are we? We know what that means...

85
05:58.340 --> 06:04.330
we know that that is the decimal Unicode codepoint for the theta character.

86
06:04.450 --> 06:05.130
Simple enough.

87
06:05.620 --> 06:09.490
Let's now deal with those other characters, the &, #, and the ; ...

88
06:10.300 --> 06:12.490
only now does URL encoding take place, 

89
06:12.490 --> 06:13.540
at this point in time.

90
06:13.780 --> 06:20.800
The URL encoding rules specify that reserved characters have to be converted into their hexadecimal equivalents.

91
06:20.980 --> 06:26.170
And again, I don't want to get into massive detail about how this conversion takes place, but hexadecimal

92
06:26.200 --> 06:30.820
equivalents of the &, # and ;, are the following...

93
06:31.210 --> 06:36.190
let me make this more clear by changing the colors of each so you can see where they have gone and what

94
06:36.190 --> 06:37.600
their hexadecimal values are.

95
06:41.440 --> 06:47.230
Does that make sense? The & is %26, the # is %23 and...

96
06:47.230 --> 06:53.740
that ; is %3B. Whew, I know. We are covering very advanced things, and I hope you are following

97
06:53.740 --> 06:54.100
along.

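NOTE
A small sketch of this final URL encoding step, in Python for illustration only.
It assumes the NCR for theta built from the decimal codepoint 952, and uses the
standard library's urllib.parse.quote with safe="" so that no reserved
character is left unescaped. The variable names are illustrative.
```python
from urllib.parse import quote
#
ncr = "&#952;"  # numeric character reference for theta
# Digits are left as they are; '&', '#' and ';' become %XX hexadecimal escapes.
encoded = quote(ncr, safe="")
print(encoded)  # %26%23952%3B
```
The digits 952 pass through unchanged; only the three surrounding characters are escaped.
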
98
06:54.340 --> 06:58.330
But you know what's weird? This doesn't really solve our problem.