So let's start with
a simple example. I've taken a
traditional Irish song, and here are the first few
words of it, and here's the beginning
of the code to process it. In this case, to
keep things simple, I put the entire song
into a single string. You can see that string
here, and I've denoted line breaks with \n. Then, by calling
the split function on \n, I can create a Python list
of sentences from the data, and I'll convert
all of that to lowercase. Using the tokenizer, I can then
call fit_on_texts on this corpus, and it will
create the dictionary of words
in the overall corpus. This is a set of key-value pairs,
with the key being the word and the value being
the token for that word. We can find the total number
of unique words in the corpus by getting the length
of its word index. We'll add one to this to account for
out-of-vocabulary words.
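
To make those steps concrete, here is a minimal sketch in plain Python. It's a stand-in rather than the code from the lecture: the text in data is a made-up placeholder, not the song's actual lyrics, and the word_index dictionary is built by hand, on the assumption that the tokenizer numbers words from 1 with the most frequent words first.

```python
from collections import Counter

# Hypothetical placeholder text, with \n marking line breaks.
# (Not the song lyrics used in the lecture.)
data = "The cat sat on the mat\nThe dog sat on the log"

# Lowercase everything and split on \n to get a list of sentences.
corpus = data.lower().split("\n")

# Count how often each word appears across the whole corpus.
counts = Counter(word for line in corpus for word in line.split())

# Build the word index: key is the word, value is its token.
# Assumption: tokens start at 1 and go to the most frequent words first.
word_index = {
    word: token
    for token, (word, _) in enumerate(counts.most_common(), start=1)
}

# Total words is the length of the word index, plus one
# as described in the lecture.
total_words = len(word_index) + 1

print(corpus)
print(word_index)
print(total_words)
```

With the tokenizer from the lecture, the hand-built part would be replaced by calling fit_on_texts on corpus and then reading the tokenizer's word_index.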