So I've made a few changes to
the code to handle padding. Here's the complete listing and we'll break it down
piece by piece. First, in order to use the padding functions
you'll have to import pad_sequences from tensorflow.keras.preprocessing.sequence. Then once the tokenizer
has created the sequences, these sequences can be passed to pad_sequences in order to
have them padded like this. The result is pretty
straightforward. You can now see that the list of sentences has been
padded out into a matrix and that each row in the matrix
has the same length. It achieved this by putting the appropriate number of
zeros before the sentence. So in the case of
the sentence 5, 3, 2, 4, it put zeros at the front. In the case of
the longer sentence here, it didn't need to add any. Often you'll see examples
where the padding is after the sentence and not
before as you just saw. If you, like me, are more comfortable with that, you can change the code to this, adding the parameter
padding='post'. You may have noticed that the matrix width was the same
as the longest sentence. But you can override that
with the maxlen parameter. So, for example, if you only want your sentences to have
a maximum of five words, you can say maxlen
equals five, like this. This of course will
lead to the question: if I have sentences longer
than maxlen, then I'll lose information,
but from where? Like with the padding,
the default is pre, which means that
you will lose from the beginning of the sentence. If you want to override
this so that you lose from the end instead, you can do so with the
truncating parameter like this. So you've now seen how to
encode your sentences, how to pad them, and how to
use the word index to encode previously unseen sentences using an out-of-vocabulary token. But you've done it with
very simple hard-coded data. Let's take a look at
the code in action in a screencast, and then we'll come back and look at how to use
much more complex data.
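
As a recap of the padding options walked through above, here is a minimal sketch of what the calls might look like. It is an illustration under assumptions, not the listing from the slides: the token sequences are hypothetical stand-ins for what the tokenizer would produce (only 5, 3, 2, 4 is mentioned above), and the variable names are made up for this example.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical stand-ins for the output of tokenizer.texts_to_sequences(...).
# Only [5, 3, 2, 4] comes from the walkthrough; the rest are assumed.
sequences = [
    [5, 3, 2, 4],
    [5, 3, 2, 7],
    [6, 3, 2, 4],
    [8, 6, 9, 2, 4, 10, 11],
]

# Default behaviour: zeros go before each sequence, and the matrix
# width equals the length of the longest sequence (7 here).
padded_pre = pad_sequences(sequences)

# padding='post' puts the zeros after each sequence instead.
padded_post = pad_sequences(sequences, padding='post')

# maxlen=5 caps the width at five tokens. Longer sequences are cut,
# and by default (truncating='pre') tokens are lost from the beginning.
padded_max5 = pad_sequences(sequences, padding='post', maxlen=5)

# truncating='post' loses tokens from the end instead.
padded_trunc_post = pad_sequences(
    sequences, padding='post', truncating='post', maxlen=5
)

print(padded_pre[0])         # [0 0 0 5 3 2 4]
print(padded_post[0])        # [5 3 2 4 0 0 0]
print(padded_max5[3])        # [ 9  2  4 10 11]
print(padded_trunc_post[3])  # [8 6 9 2 4]
```

Comparing the last two printed rows shows where the information is lost: with the default the long sentence keeps its last five tokens, and with truncating='post' it keeps its first five.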