WEBVTT

0
00:02.390 --> 00:08.270
Let's carry on exactly where we left off. Remember what we've discussed so far, that text has to be converted...

1
00:08.270 --> 00:11.960
into binary data in order for your machine to understand it.

2
00:11.960 --> 00:15.980
And we need to do that whenever we transport data over the Web...

3
00:16.310 --> 00:22.220
for example, when we send data to a server, when a user submits a form. That's why we're talking about...

4
00:22.220 --> 00:22.510
this.

5
00:22.520 --> 00:24.010
Don't get lost in all the detail.

6
00:24.410 --> 00:26.420
Sometimes we stand too close to the wall...

7
00:26.690 --> 00:28.700
we can't even see the painting.

8
00:29.180 --> 00:33.110
So that's why we have to take a step back sometimes to remind ourselves what this is all about.

9
00:33.650 --> 00:35.750
We discussed ISO, that it's very old school.

10
00:36.120 --> 00:42.560
We then discussed the ASCII character set, which only has 128 unique values.

11
00:42.560 --> 00:44.240
And that is just not enough.

12
00:44.240 --> 00:45.560
Let's be realistic.

13
00:45.860 --> 00:47.450
We need more characters.

14
00:47.780 --> 00:52.650
And this brings me onto the UTF-8 character encoding set.

15
00:52.670 --> 00:58.430
As we've already seen, to create a table that maps characters to numbers for a language that...

16
00:58.430 --> 01:05.240
uses more than 256 characters, one byte or 8 bits, simply isn't enough.

17
01:05.690 --> 01:10.880
If you have two bytes, which is 16 bits, you can store 65,536

18
01:11.000 --> 01:12.260
unique values.

19
01:12.440 --> 01:18.380
And over the years, many encoding types and many organizations, many people have tried to invent different...

20
01:18.380 --> 01:21.830
encoding types, to deal with the ASCII limitations.

21
01:22.370 --> 01:29.810
But finally someone got fed up, and they set out to create one encoding standard to unify all encoding...

22
01:29.810 --> 01:30.410
standards.

23
01:30.800 --> 01:34.850
And this standard, my dear students, is called Unicode.

24
01:35.450 --> 01:41.420
And it was invented by the Unicode Consortium, which is just an organization, a group of people that...

25
01:41.420 --> 01:47.090
came up with a standard, as I mentioned, to rule the roost 🐔, not that kind of roost...

26
01:47.810 --> 01:56.440
and this Unicode set basically defines a ginormous table, a massive table of one million, one hundred...

27
01:56.450 --> 02:02.360
and fourteen thousand, one hundred and twelve code points that can be used for all sorts of letters and symbols.

28
02:02.900 --> 02:08.690
Just so you know, this is plenty enough to encode all European, Middle Eastern, Far Eastern, and Southern ...

29
02:08.690 --> 02:14.960
Northern, Western, prehistoric and even futuristic robotic languages in the future that we don't even...

30
02:14.960 --> 02:15.620
know about yet.

31
02:15.830 --> 02:17.690
It is plenty, plenty enough.

32
02:18.290 --> 02:19.580
But lemme ask you this question.

33
02:20.030 --> 02:20.480
How many...

34
02:20.480 --> 02:24.650
bits do you think Unicode uses to encode all of these characters?

35
02:25.430 --> 02:28.370
It's a trick question because the answer is ... none.

36
02:28.940 --> 02:34.010
What! This is because Unicode is not an encoding type.

37
02:34.410 --> 02:35.210
Are you confused 😕?

38
02:36.050 --> 02:36.830
Many people are,

39
02:36.830 --> 02:37.460
so don't worry.

40
02:38.360 --> 02:44.090
Unicode, first and foremost, defines a table of code points for characters.

41
02:44.480 --> 02:50.090
Let me repeat that. Unicode defines a table of code points for characters.

42
02:50.480 --> 02:54.260
And again, don't let all this fancy jargon intimidate you.

43
02:54.740 --> 03:00.980
All this means is that it's a fancy way of saying, for example, big A is just assigned the code...

44
03:00.980 --> 03:06.590
point 0041. And little "a" is the code point 0061.

45
03:07.250 --> 03:12.770
And what about the coffee ☕ image? Well that's assigned a code point of 2615.

46
03:13.250 --> 03:15.590
And you might think I'm actually joking, but I'm not.

47
03:15.770 --> 03:18.230
This is actually defined as a coffee cup.
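
NOTE
Not part of the lecture: a minimal Python sketch of the code point table idea.
It uses only the built-ins ord() and chr(); the helper name code_point is made up for illustration.
```python
def code_point(ch: str) -> str:
    """Return the Unicode code point of one character, e.g. 'U+0041'."""
    # ord() gives the code point as an integer; show it as 4 hex digits.
    return f"U+{ord(ch):04X}"
labels = [code_point(ch) for ch in ("A", "a", "\u2615")]
print(labels)  # ['U+0041', 'U+0061', 'U+2615']
# chr() goes the other way: from a code point back to the character.
coffee = chr(0x2615)
```
Nothing here says how many bits are stored; a code point is just a number in the table.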

48
03:18.620 --> 03:22.730
Let me jump over to Word quickly, and I'll show you. As I said in the lecture, 

49
03:22.730 --> 03:24.050
I'm not making these things up.

50
03:24.050 --> 03:31.400
We do literally have a coffee symbol in Unicode, and its code is 2615.

51
03:31.910 --> 03:38.960
I'm in Microsoft Word and all we have to do to convert that into the character, is to press "Alt + x" on the...

52
03:38.960 --> 03:44.930
keyboard, "alt + x". And BOOMSHAKALAKA 💥. There is our coffee symbol.

53
03:44.930 --> 03:45.500
So there we go.

54
03:45.500 --> 03:46.940
I'm not just making these things up.

55
03:47.180 --> 03:51.920
These are literally the different codes assigned to the different characters,

56
03:51.920 --> 03:54.250
the different emojis in the Unicode standard.

57
03:54.260 --> 03:59.570
Cool, let's hop back into the lecture. Okay, starting to make sense.

58
03:59.690 --> 04:01.070
I'm not just making these things up.

59
04:01.490 --> 04:04.160
Alright, so now we have a whole bunch of code points.

60
04:04.640 --> 04:11.750
We know that all of these code points have to be converted into bits in the background. And to convert...

61
04:11.750 --> 04:12.590
these code points into bits...

62
04:12.590 --> 04:19.280
we need an encoding type and this is what UTF is all about. To represent...

63
04:19.400 --> 04:22.200
1.1m values, 2 bytes...

64
04:22.200 --> 04:23.360
is just not enough.

65
04:23.750 --> 04:28.940
Interestingly, 3 bytes would be enough, but 3 bytes is often just awkward to work with for computer

66
04:28.940 --> 04:29.420
programmers...

67
04:29.420 --> 04:37.190
so 4 bytes is just a lot more comfortable. But, 4 bytes also poses some problems and is overkill...

68
04:37.190 --> 04:38.180
in many cases.
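
NOTE
Not part of the lecture: a quick arithmetic check of the byte counts, as a Python sketch.
The names capacity and UNICODE_CODE_POINTS are made up for illustration.
```python
UNICODE_CODE_POINTS = 0x110000  # 1,114,112 code points, U+0000 through U+10FFFF
def capacity(n_bytes: int) -> int:
    """Number of distinct values that fit in n_bytes bytes (8 bits each)."""
    return 2 ** (8 * n_bytes)
for n_bytes in (1, 2, 3, 4):
    # Print the byte count, how many values it holds, and whether that covers Unicode.
    print(n_bytes, capacity(n_bytes), capacity(n_bytes) >= UNICODE_CODE_POINTS)
```
Two bytes hold 65,536 values, which falls short; three bytes hold 16,777,216, which is enough.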

69
04:38.180 --> 04:39.140
Why do I say that?

70
04:39.770 --> 04:44.300
Well, unless you're actually using Chinese or some of the other characters with big numbers that take...

71
04:44.300 --> 04:49.640
lots of bits to encode, you're never going to use a huge chunk of those four bytes.

72
04:50.330 --> 04:55.190
So, for example, if the letter A was encoded into four bytes every single time...

73
04:55.190 --> 04:56.030
it would look like this.

74
04:57.620 --> 05:01.850
And if this was the case, I'm sure you can see how bloated a document would become. 

75
05:02.220 --> 05:05.190
In fact, it would be 4 times the necessary size.
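
NOTE
Not part of the lecture: the slide is not in this transcript, so here is a rough Python sketch of the same point.
It assumes the "utf-32-be" codec as the fixed four-byte encoding; the 1000-character sample is made up.
```python
text = "A" * 1000
utf32 = text.encode("utf-32-be")  # fixed width: 4 bytes per character, no byte order mark
utf8 = text.encode("utf-8")       # 1 byte per character for plain ASCII text
print("A".encode("utf-32-be").hex(" "))  # 00 00 00 41
print(len(utf32), len(utf8))  # 4000 1000
```
Three of the four bytes for "A" are zero, which is where the bloat comes from.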

76
05:05.970 --> 05:10.060
"OK, Clyde, stop giving me problems." Well, okay. To fix this...

77
05:10.080 --> 05:20.100
there are several ways to convert Unicode characters into bits. The most common are UTF-32, UTF-16 and UTF-8.

78
05:20.590 --> 05:23.450
Well, let's first look at UTF-8 and UTF-16. 

79
05:23.700 --> 05:26.640
These are known as variable-length encodings.

80
05:26.880 --> 05:32.190
And all I mean by this, is that if a character can be represented using a single byte, for example,

81
05:32.190 --> 05:38.100
because its code point is a very small number like the letter A, then UTF-8 will encode it with...

82
05:38.100 --> 05:43.950
a single byte. If it requires 2 bytes, it will use 2 bytes, so on and so forth. 

83
05:44.390 --> 05:48.720
For example, if you look at the below, we can see the letter A could be stored in one byte.

84
05:49.170 --> 05:56.660
But when you get some cool letter in Chinese for example, here it's stored in 3 bytes. Whew, that...

85
05:56.850 --> 05:58.770
seems a bit elaborate though, doesn't it? 

86
05:59.190 --> 06:00.990
And it is. It's very elaborate.

87
06:00.990 --> 06:08.010
And the system uses very elaborate ways to use the highest bits in a byte to signal how many bytes a...

88
06:08.010 --> 06:08.970
character consists of.
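
NOTE
Not part of the lecture: a small Python sketch of the signal bits in UTF-8.
The Chinese character U+4E2D is an assumed example (the one on the slide is not in this transcript), and utf8_bits is a made-up helper.
```python
def utf8_bits(ch: str) -> list:
    """Return each UTF-8 byte of ch as an 8-digit binary string."""
    return [f"{b:08b}" for b in ch.encode("utf-8")]
# A leading 0 marks a single-byte character.
print(utf8_bits("A"))  # ['01000001']
# A leading 1110 marks the start of a 3-byte character; each following byte starts with 10.
print(utf8_bits("\u4e2d"))  # ['11100100', '10111000', '10101101']
```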

89
06:09.450 --> 06:14.070
So this can save space, but it also may waste space if these signal bits need to be used often.

90
06:14.400 --> 06:17.670
And that's why UTF-16 is there. It kind of sits in the middle...

91
06:17.880 --> 06:24.830
but instead of using a single byte, it always uses at least 2 bytes, growing up to 4 bytes as necessary...

92
06:24.990 --> 06:27.960
same as UTF-8. Whew, we've learnt a lot. 
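
NOTE
Not part of the lecture: a Python sketch comparing byte counts in UTF-8 and UTF-16.
The three sample characters are assumptions chosen for illustration, and "utf-16-be" is used so no byte order mark is counted.
```python
samples = ("A", "\u4e2d", "\U0001F92F")  # Latin letter, Chinese character, emoji
sizes = {}
for ch in samples:
    # Store (bytes in UTF-8, bytes in UTF-16) for each character.
    sizes[ch] = (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
print(list(sizes.values()))  # [(1, 2), (3, 2), (4, 4)]
```
UTF-16 never goes below 2 bytes, so it costs more for "A" but less for the Chinese character.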

93
06:28.260 --> 06:34.770
The other thing I want to say about UTF-8 is that it is binary compatible with ASCII, which is...

94
06:34.770 --> 06:37.050
the de facto baseline for all encodings.

95
06:37.350 --> 06:44.310
All characters available in ASCII encoding only take up a single byte in UTF-8 and they are the exact...

96
06:44.310 --> 06:46.470
same bytes as are used in ASCII.

97
06:47.040 --> 06:50.520
So in other words, ASCII maps 1:1 with UTF-8. 

98
06:50.850 --> 06:55.660
Any characters that are not in ASCII take up two or more bytes in UTF-8.
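
NOTE
Not part of the lecture: a short Python check of the ASCII compatibility claim.
The sample strings are made up for illustration.
```python
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
# ASCII text produces byte-for-byte identical output in both encodings.
print(ascii_bytes == utf8_bytes)  # True
# A character outside ASCII, such as U+00E9 (e with acute accent), needs more than one byte.
accented_len = len("\u00e9".encode("utf-8"))
print(accented_len)  # 2
```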

99
06:56.160 --> 07:02.220
So, what this means is that for most programming languages that expect to parse ASCII, you can include UTF-8...

100
07:02.220 --> 07:05.430
directly into your programs and you won't have any issues.

101
07:05.730 --> 07:06.780
How awesome is that?

102
07:07.380 --> 07:07.720
Whew, 

103
07:07.810 --> 07:11.710
I don't know about you, but I just feel like my mind has been blown 🤯.

104
07:12.060 --> 07:18.600
We've learned an incredible amount about this simple looking attribute, "accept-charset=utf-8". 

105
07:19.350 --> 07:24.360
Let me just give you a quick, quick summary, because you really have learnt a lot in a very quick...

106
07:24.360 --> 07:25.140
space of time.

107
07:25.290 --> 07:31.800
Just remember, a computer cannot store letters, numbers, pictures or anything else like this.

108
07:32.070 --> 07:39.630
The only thing it can store and work with are bits. And bits are binary data. And remember a bit can only...

109
07:39.630 --> 07:40.530
have two values.

110
07:40.530 --> 07:45.090
Yes or no, true or false, 1 or 0, or whatever else you want to call these two values.

111
07:45.420 --> 07:46.340
Just a bit of FYI...

112
07:46.340 --> 07:53.400
since a computer works with electricity, an actual bit is a blip of electricity that's either there...

113
07:53.400 --> 07:55.440
or isn't. For humans...

114
07:55.440 --> 07:58.440
we just like representing these as 1s and 0s, 

115
07:58.650 --> 08:04.620
so of course I've been sticking with this convention. But moving on, to use bits to represent anything

116
08:04.620 --> 08:07.590
at all besides actual bits or 1s and 0s,

117
08:07.890 --> 08:14.940
we need to convert a sequence of characters into bits and then we need to later decode those bits back...

118
08:14.940 --> 08:15.840
into characters.

119
08:16.440 --> 08:21.030
And the rules governing all of this are what's referred to as the encoding type.
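
NOTE
Not part of the lecture: a minimal Python sketch of the encode and decode round trip.
The sample text and the choice of Latin-1 as the "wrong" encoding are assumptions for illustration.
```python
original = "caf\u00e9 \u2615"
encoded = original.encode("utf-8")   # characters to bytes
decoded = encoded.decode("utf-8")    # bytes back to characters, same rules
wrong = encoded.decode("latin-1")    # same bytes read with different rules
print(decoded == original)  # True
print(wrong == original)    # False
```
The bytes never changed; only the rules used to read them did, which is why the last comparison fails.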

120
08:21.420 --> 08:24.310
Don't get confused or get overwhelmed.

121
08:24.330 --> 08:28.170
It's actually quite interesting once you understand all of this, and it's very practical.

122
08:28.170 --> 08:32.910
In fact, all text is already encoded in some encoding.

123
08:33.450 --> 08:38.520
When you type text into source code, it has some encoding, specifically whatever...

124
08:38.520 --> 08:40.670
you saved it as in your text editor.

125
08:41.250 --> 08:44.580
If you get it from a database, it's already in some encoding.

126
08:45.060 --> 08:49.200
If you read text from a file, it's already in some encoding.

127
08:49.680 --> 08:53.190
So just remember, text is either encoded or it's not.

128
08:53.560 --> 08:59.670
And when you specify the accept-charset attribute, you're telling the browser basically to encode the...

129
08:59.670 --> 09:05.100
submitted form text using, for example, UTF-8, before it's sent to the server.
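
NOTE
Not part of the lecture: a rough Python sketch of what UTF-8 form data can look like on the wire.
accept-charset names the encoding used for the submitted form data; this imitates that step with the standard library's urlencode.
The field name "drink" is made up for illustration.
```python
from urllib.parse import urlencode
# Encode one form field as UTF-8, then percent-encode each byte.
body = urlencode({"drink": "\u2615"}, encoding="utf-8")
print(body)  # drink=%E2%98%95
```
The coffee symbol becomes three percent-encoded bytes, matching its three-byte UTF-8 form.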

130
09:05.460 --> 09:07.260
And that's all there is to it.

131
09:07.500 --> 09:10.230
So I hope it's starting to make a bit more sense.

132
09:10.500 --> 09:12.630
Don't worry if this was very hard for you.

133
09:12.630 --> 09:14.160
It is a very advanced topic.

134
09:14.160 --> 09:19.200
I warned you at the beginning - I'm throwing you into the deep end. But I hope you're having a ton of fun.

135
09:19.620 --> 09:25.070
And there are a few more things I want to harp on about this accept-charset, because it is fascinating.

136
09:25.410 --> 09:30.240
For example, you're probably wondering what happens if you type a symbol or character that's not defined...

137
09:30.240 --> 09:31.060
by the character set.

138
09:31.170 --> 09:34.880
That's exactly what I want to jump into, in the next lecture.

139
09:35.370 --> 09:35.870
See you now 👋.