So let's start with a simple example. I've taken a traditional Irish song, and here are the first few words of it, along with the beginning of the code to process it. In this case, to keep things simple, I put the entire song into a single string. You can see that string here, and I've denoted line breaks with \n. Then, by calling the split function on \n, I can create a Python list of sentences from the data, and I'll convert all of that to lowercase. Using the tokenizer, I can then call fit_on_texts on this corpus of work, and it will create the dictionary of words in the overall corpus. This is a set of key-value pairs, with the key being the word and the value being the token for that word. We can find the total number of words in the corpus by getting the length of its word index. We'll add one to this to account for out-of-vocabulary words.
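
To make those steps concrete, here's a minimal sketch in plain Python. It is not the tokenizer itself: build_word_index is a hypothetical stand-in that roughly imitates what fit_on_texts produces, assuming words are split on whitespace and numbered from 1 with the most frequent word first. The text in data is placeholder text, not the actual song lyrics.

```python
from collections import Counter

# Placeholder text standing in for the song (NOT the real lyrics).
# Line breaks are denoted with \n, as in the walkthrough.
data = "The first line of the song\nThe second line of the song\nAnd a third line"

# Lowercase everything, then split on \n to get a list of sentences.
corpus = data.lower().split("\n")


def build_word_index(sentences):
    """Hypothetical stand-in for fit_on_texts: map each word to an integer token.

    Assumption: words are separated by whitespace, and tokens start at 1
    with the most frequent word first (ties keep first-seen order).
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    # sorted() is stable, so equally frequent words stay in first-seen order.
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: token for token, (word, _) in enumerate(ordered, start=1)}


word_index = build_word_index(corpus)

# Total number of words is the length of the word index, plus one as in
# the walkthrough. Because tokens start at 1, the extra slot also leaves
# room for index 0.
total_words = len(word_index) + 1

print(corpus)
print(word_index)
print(total_words)
```

With the real tokenizer you would read the same mapping from its word index after calling fit_on_texts; the sketch is only meant to show the shape of the data at each step.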