This code is very similar to what you saw in
the earlier videos, but let's look at
it line by line. We've just created a sentences
list from the headlines in the sarcasm data set. By calling
tokenizer.fit_on_texts, we'll generate the word index and initialize
the tokenizer. We can see the word index as before by reading
the word_index property. Note that this returns
all words that the tokenizer saw when
tokenizing the sentences. If you specify num_words to
get the top 1,000 or whatever, you may be confused by seeing more entries than that here. It's an easy mistake to make, but the key thing to remember is that the cut to the top 1,000, or whatever you specified, is applied in the
texts_to_sequences step, not when the word index is built. Our word index is much larger than in
the previous example, so we'll see a greater variety of words in it. Here are a few. Now we'll create
the sequences from the text and pad them. Here's the code to do that. It's very similar to
what you did earlier, and here's the output. First, I took the first headline in the data set and
showed its encoding. We can see that each word
in the sentence has been replaced by the value stored
under that word's key in the word index. This is the size of
the padded matrix. We had 26,709 sentences, and they were padded
out to 40 words long, which was the length
of the longest one. You could truncate
this if you like, but I'll keep it at 40. That's it for processing
the sarcasm data set. Let's take a look at that
in action in a screencast.
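
As a recap of the steps just described, here's a minimal sketch in plain Python. It is not the code from the video and it does not use the Keras tokenizer. It only imitates the behavior discussed: the word index holds every word seen, the num_words cut is applied when the sequences are created, and the sequences are padded to the length of the longest one. The three sample headlines, the helper names (build_word_index, texts_to_sequences, pad_to_longest), the whitespace tokenization, the "keep indices below num_words" rule, and padding on the right with zeros are all assumptions made for illustration.

```python
from collections import Counter

# Made-up stand-ins for the sarcasm headlines (the real data set has 26,709).
sentences = [
    "the cat sat on the mat",
    "the dog sat",
    "a cat and a dog met the bird",
]

def build_word_index(texts):
    """Rank every word by frequency (1 = most common); no vocabulary cap here."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Encode each text as word-index values; the cap is applied at this step."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            # Assumed rule for this sketch: keep only indices below num_words.
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

def pad_to_longest(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest sequence's length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [value] * (max_len - len(seq)) for seq in sequences]

word_index = build_word_index(sentences)
sequences = texts_to_sequences(sentences, word_index, num_words=4)
padded = pad_to_longest(texts_to_sequences(sentences, word_index))

print(len(word_index))              # 10: the full vocabulary, despite num_words=4
print(sequences[0])                 # [1, 2, 3, 1]: only the most frequent words survive
print(len(padded), len(padded[0]))  # 3 8: number of sentences, longest length
```

The point to notice is that len(word_index) stays at the full vocabulary size even though num_words=4 was passed when creating the sequences. That is the same effect that can look confusing when you inspect the word index after asking for only the top 1,000 words.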