1
00:00:00,000 --> 00:00:03,540
Now, this approach works
very well until you have

2
00:00:03,540 --> 00:00:06,825
very large bodies of text
with many, many words.

3
00:00:06,825 --> 00:00:09,180
So for example, you could
try the complete works of

4
00:00:09,180 --> 00:00:12,300
Shakespeare and you'll
likely hit memory errors,

5
00:00:12,300 --> 00:00:15,960
as you'd be assigning the one-hot
encodings of the labels to

6
00:00:15,960 --> 00:00:20,640
matrices that have
31,477 elements,

7
00:00:20,640 --> 00:00:23,100
which is the number of unique
words in the collection,

8
00:00:23,100 --> 00:00:26,060
and there are over
15 million sequences

9
00:00:26,060 --> 00:00:28,775
generated using the algorithm
that we showed here.

10
00:00:28,775 --> 00:00:31,100
So the labels alone would require

11
00:00:31,100 --> 00:00:33,785
many terabytes of RAM.
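To put a rough number on that, here is a minimal sketch of the arithmetic. The vocabulary size and the sequence count are the figures just mentioned; the bytes per value is an assumption (4 for float32, 8 for float64), and the function name is only illustrative, not something from the workbook.

```python
# Back-of-the-envelope estimate of the memory needed for one-hot labels.
# One label vector per sequence, one slot per unique word.

def one_hot_label_bytes(vocab_size, num_sequences, bytes_per_value):
    """Bytes needed to hold a dense (num_sequences x vocab_size) label matrix."""
    return vocab_size * num_sequences * bytes_per_value

VOCAB_SIZE = 31_477          # unique words, as quoted in the narration
NUM_SEQUENCES = 15_000_000   # "over 15 million", so this is a lower bound

for name, width in (("float32", 4), ("float64", 8)):  # assumed dtypes
    total = one_hot_label_bytes(VOCAB_SIZE, NUM_SEQUENCES, width)
    print(f"{name}: about {total / 1e12:.2f} TB")
```

Either way the labels alone land in the terabyte range, which is far more than a typical machine has in RAM.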

12
00:00:33,785 --> 00:00:35,600
So for your next task,

13
00:00:35,600 --> 00:00:37,580
you'll go through
a workbook by yourself

14
00:00:37,580 --> 00:00:40,025
that uses character-based
prediction.

15
00:00:40,025 --> 00:00:43,340
The total number of unique
characters in a corpus is far

16
00:00:43,340 --> 00:00:44,480
less than the total number of

17
00:00:44,480 --> 00:00:46,760
unique words, at
least in English.

18
00:00:46,760 --> 00:00:48,920
So the same principles
that you use to

19
00:00:48,920 --> 00:00:51,410
predict words can
be applied here.
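As a rough illustration of what character-based prediction looks like, here is a minimal sketch in plain Python. It assumes a fixed-length sliding window, which may differ from what the workbook actually does, and the function name, the sample text, and the window size are all made up for this example.

```python
# Character-level training pairs: a window of characters is the input,
# and the character that follows the window is the label.

def char_training_pairs(text, window=5):
    """Return (char_to_id, pairs), where each pair is (input ids, next-char id)."""
    chars = sorted(set(text))                      # the character vocabulary
    char_to_id = {ch: i for i, ch in enumerate(chars)}
    ids = [char_to_id[ch] for ch in text]
    pairs = [
        (ids[i:i + window], ids[i + window])       # input window, label
        for i in range(len(ids) - window)
    ]
    return char_to_id, pairs

sample = "to be or not to be"                      # hypothetical sample text
char_to_id, pairs = char_training_pairs(sample, window=5)
print(len(char_to_id), "unique characters")
print(len(pairs), "training pairs")
```

On a tiny sample like this the character vocabulary is not smaller than the word vocabulary. The saving shows up on a large corpus, where the set of characters stays at a few dozen entries while the set of words keeps growing, so each one-hot label stays short.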

20
00:00:51,410 --> 00:00:54,185
The workbook is at
this URL, so try it out,

21
00:00:54,185 --> 00:00:55,730
and once you've done that,
you'll be ready for

22
00:00:55,730 --> 00:00:58,290
this week's final exercise.