So, let's take a look at this in a slightly more sophisticated example. I'm going to take
the tokenizer as we had before, but I'm also going to introduce this pad_sequences tool.
The idea behind pad_sequences is that it allows you to use sentences of different lengths,
and to use padding or truncation to make all of the sentences the same length. In this case
I have the same sentences as before: "I love my dog", "I love my cat", "You love my dog".
But I've added this new sentence, "Do you think my dog is amazing?", which is a different
length from the other sentences: those all had four words, and this one has seven. So my tokenizer,
I'm going to create as before, but I'm also going to use this parameter called oov_token.
The idea here is that I'm going to create a special token to use for words that aren't
recognized, that aren't in the word_index itself. So I'm going to pick something unique
here that I wouldn't expect to see in the corpus, something like <OOV> in angle brackets,
and specify that as my OOV token. Then I'm going to call tokenizer.fit_on_texts(sentences),
and take a look at the word_index for that. And let's actually run this.
We'll see now that in my word_index, <OOV> is value 1, my is value 2, love is 3, et cetera.
And we have a total of 11 entries in the word_index for this corpus: it's actually ten
unique words plus the OOV token. On the tokenizer, I can then convert the words in those
sentences to sequences of tokens by calling the texts_to_sequences method. That's going to
produce sequences, and that's what I'm printing out here.
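
To make the conversion step concrete, here is a minimal plain-Python sketch of turning sentences into sequences of tokens. The word_index values are the ones that come out of this walkthrough; the helper to_sequences is a hypothetical stand-in for illustration, not the library's texts_to_sequences implementation.

```python
import re

# Word index consistent with this walkthrough: OOV first, then words by frequency.
word_index = {
    "<OOV>": 1, "my": 2, "love": 3, "dog": 4, "i": 5, "you": 6,
    "cat": 7, "do": 8, "think": 9, "is": 10, "amazing": 11,
}

sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog",
    "Do you think my dog is amazing?",
]

def to_sequences(texts, word_index, oov_token="<OOV>"):
    """Hypothetical helper: swap each word for its index, or the OOV index."""
    oov_index = word_index[oov_token]
    result = []
    for text in texts:
        # Lowercase and drop punctuation before looking each word up.
        words = re.findall(r"[a-z']+", text.lower())
        result.append([word_index.get(word, oov_index) for word in words])
    return result

sequences = to_sequences(sentences, word_index)
for seq in sequences:
    print(seq)
```
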
So my sequences are {5, 3, 2, 4} for the first sentence, which is "I love my dog", et cetera.
So these are the sequences: {5, 3, 2, 4}, {5, 3, 2, 7}, {6, 3, 2, 4} and
{8, 6, 9, 2, 4, 10, 11}. Now, we can see these are all
different lengths, but we want to make them the same length. That's where pad_sequences
comes into it. I'm going to say here that my padded set is pad_sequences with the sequences,
and let's make it a maximum length of five words. This maximum length of five means that
the four-word sentences end up being pre-padded with a 0. And the seven-word sentence ends
up having its first two words cut off, because we did say maximum length equals 5. If I
said maximum length equals 8, for example, and then ran this, we could see now that they're
all pre-padded with zeros, including the long sentence being pre-padded with a single 0.
There are parameters on pad_sequences that we saw in the lessons that will allow us to do
it post, if we wanted to do so.
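
The padding and truncation behaviour described here can be sketched in plain Python as follows. The helper pad is hypothetical; its parameter names (maxlen, padding, truncating, value) are chosen to mirror those of Keras pad_sequences, but the body is only an illustration of the behaviour and assumes maxlen is a positive integer.

```python
def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: pad or truncate every sequence to exactly maxlen."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # "pre" truncation drops tokens from the front, "post" from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" padding puts the fill values first, "post" puts them last.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

padded_5 = pad(sequences, maxlen=5)
print(padded_5[0])   # [0, 5, 3, 2, 4]
print(padded_5[3])   # [9, 2, 4, 10, 11]  (first two tokens cut off)

padded_8 = pad(sequences, maxlen=8)
print(padded_8[3])   # [0, 8, 6, 9, 2, 4, 10, 11]

padded_post = pad(sequences, maxlen=8, padding="post")
print(padded_post[0])  # [5, 3, 2, 4, 0, 0, 0, 0]
```
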
With post-padding, the zeros would appear afterwards. So now, I want to take a look at
words that the tokenizer wasn't fit to. For example, my text data is "I really love my dog"
and "my dog loves my manatee". If I now tokenize them and create sequences out of that,
we'll see {5, 1, 3, 2, 4} for the first sentence. 5 is "I", 1 is out of vocabulary because
"really" wasn't actually there, and {3, 2, 4} is still "love my dog". So this is how the
out-of-vocabulary token comes into it: when it sees a word that wasn't in the word_index,
it will just use the out-of-vocabulary token, 1, for that. Similarly, for "my dog loves my
manatee" I get {2, 4, 1, 2, 1}. The word "loves" is not in it, even though the word "love"
is, and of course "manatee" isn't in it either. So I end up with just {2, 4, 2} as the
words that really have meaning in this, and that's "my dog my". "Loves" and "manatee" are
out-of-vocabulary tokens.
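
A small plain-Python sketch of that lookup, assuming the word index from this walkthrough, shows which words fall back to the OOV token. The variable names test_data and results are made up for illustration, and the lookup is a stand-in for what the tokenizer does rather than its actual code.

```python
word_index = {
    "<OOV>": 1, "my": 2, "love": 3, "dog": 4, "i": 5, "you": 6,
    "cat": 7, "do": 8, "think": 9, "is": 10, "amazing": 11,
}

test_data = ["I really love my dog", "my dog loves my manatee"]

results = []
for text in test_data:
    words = text.lower().split()
    # Any word missing from the index is replaced by the OOV token's index.
    tokens = [word_index.get(word, word_index["<OOV>"]) for word in words]
    unknown = [word for word in words if word not in word_index]
    results.append((tokens, unknown))
    print(tokens, "unknown:", unknown)
```

Note that "loves" is treated as unknown even though "love" is in the index, because the lookup is an exact match on the word.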
I'm also padding them. So my {5, 1, 3, 2, 4} gets padded and my {2, 4, 1, 2, 1} also gets
padded, because I said that maxlen=10. If I set that to 2, for example, we'll see they end
up getting truncated, and I'm getting the last two words here. So that's a basic
introduction to how the tokenizer works and how padding works, to be able to get your
sentences all the same length. Hope this was useful for you.
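
To round this off, here is a compact plain-Python sketch of the default pre-padding and pre-truncation applied to the two test sequences, first with a length of 10 and then with a length of 2. The one-line helper pad_pre is hypothetical and only imitates the behaviour described above; it assumes maxlen is a positive integer.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: put maxlen fill values in front, then keep the last maxlen items."""
    return ([value] * maxlen + seq)[-maxlen:]

test_sequences = [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

padded_10 = [pad_pre(seq, 10) for seq in test_sequences]
print(padded_10[0])  # [0, 0, 0, 0, 0, 5, 1, 3, 2, 4]

padded_2 = [pad_pre(seq, 2) for seq in test_sequences]
print(padded_2)      # [[2, 4], [2, 1]]  (only the last two tokens survive)
```
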