1
00:00:00,470 --> 00:00:03,270
So let's start with
a simple example.

2
00:00:03,270 --> 00:00:05,310
I've taken a
traditional Irish song

3
00:00:05,310 --> 00:00:07,635
and here's the first few
words of it,

4
00:00:07,635 --> 00:00:10,710
and here's the beginning
of the code to process it.

5
00:00:10,710 --> 00:00:12,960
In this case, to
keep things simple,

6
00:00:12,960 --> 00:00:16,065
I put the entire song
into a single string.

7
00:00:16,065 --> 00:00:18,540
You can see that string
here and I've

8
00:00:18,540 --> 00:00:20,985
denoted line breaks with \n.

9
00:00:20,985 --> 00:00:24,450
Then, by calling
the split function on \n,

10
00:00:24,450 --> 00:00:26,940
I can create a Python list
of sentences from

11
00:00:26,940 --> 00:00:30,345
the data and I'll convert
all of that to lowercase.

12
00:00:30,345 --> 00:00:33,090
Using the tokenizer, I can then

13
00:00:33,090 --> 00:00:35,550
call fit_on_texts on this corpus

14
00:00:35,550 --> 00:00:37,370
of work and it will create

15
00:00:37,370 --> 00:00:40,505
a dictionary of the words
in the overall corpus.

16
00:00:40,505 --> 00:00:43,220
This is a key-value pair
with the key being

17
00:00:43,220 --> 00:00:46,520
the word and the value being
the token for that word.

18
00:00:46,520 --> 00:00:49,969
We can find the total number
of words in the corpus

19
00:00:49,969 --> 00:00:52,315
by getting the length
of its word index.

20
00:00:52,315 --> 00:00:53,540
We'll add one to this,

21
00:00:53,540 --> 00:00:56,690
to account for
out-of-vocabulary words.
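
The steps described above can be sketched roughly as follows. This is a minimal pure-Python stand-in, not the tokenizer library's own implementation: `build_word_index` is a hypothetical helper that mimics what `fit_on_texts` is described as producing (a word-to-token dictionary), and the `data` string is a placeholder because the song's actual lyrics are not shown in this transcript.

```python
from collections import Counter

# Placeholder text: the real song lyrics are not shown in the transcript.
# Line breaks are denoted with "\n", as described.
data = "First line of the song\nSecond line of the song"

# Convert everything to lowercase and split on "\n" to get a list of lines.
corpus = data.lower().split("\n")


def build_word_index(texts):
    """Hypothetical stand-in for fit_on_texts: map each word to an integer token.

    Assumption: tokens start at 1 and more frequent words get smaller tokens.
    """
    counts = Counter(word for line in texts for word in line.split())
    return {
        word: token
        for token, (word, _) in enumerate(counts.most_common(), start=1)
    }


# Key-value pairs: the key is the word, the value is its token.
word_index = build_word_index(corpus)

# Length of the word index, plus one as described in the talk.
total_words = len(word_index) + 1

print(corpus)
print(word_index)
print(total_words)
```

Because the tokens in this sketch start at 1, adding one to the length means every token is a valid index into an array of size `total_words`, with index 0 left unused.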