So what do we learn from this? First of all, we really need a lot of
training data to get a broad vocabulary or we could end up with
sentences like, my dog my, like we just did. Secondly, in many cases, instead of just ignoring unseen words, it's a good idea to put a special value in when an unseen word
is encountered. You can do this with a property on the tokenizer.
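As a rough sketch of what that property looks like in code, assuming the Keras Tokenizer from tensorflow.keras.preprocessing.text and its oov_token argument. The training sentences, the num_words value, and the "<OOV>" string are illustrative assumptions, not necessarily what was on screen:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical training sentences, chosen only for illustration.
sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?",
]

# oov_token names the placeholder used for any word missing from the word index.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Test data containing words the tokenizer never saw: really, loves, manatee.
test_data = [
    "i really love my dog",
    "my dog loves my manatee",
]
test_seq = tokenizer.texts_to_sequences(test_data)

print(word_index)
print(test_seq)  # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

The placeholder takes index 1 in the word index, so every unseen word shows up as a 1 in the sequences and the sentence keeps its original length.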
Let's take a look. Here's the complete code showing both the original sentences
and the test data. What I've changed is to add a property, oov_token, to
the tokenizer constructor. You can see now that I've
specified that I want the token OOV, for out of vocabulary, to be used for words that
aren't in the word index. You can use whatever
you like here, but remember that it should
be something unique and distinct that isn't
confused with a real word. So now, if I run this code, I'll get my test sequences
looking like this. I pasted the word index
underneath so you can look it up. The first sentence will be, i OOV love my dog. The second will be, my dog OOV my OOV. Still not
syntactically great, but it is doing better. As the corpus grows and
more words are in the index, hopefully previously
unseen sentences will have better coverage. Next up is padding. As we mentioned
earlier, when we were building neural networks
to handle pictures and fed them into
the network for training, we needed them to
be uniform in size. Often, we used the generators to resize the images
to fit, for example. With text, you'll face a similar requirement: before
you can train on it, you need to have some level
of uniformity of size, so padding is your friend there.
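To make the idea of padding concrete, here is a small plain-Python sketch. It is only an illustration of the concept, not the library routine: the pad helper, the choice of 0 as the filler value, and padding at the front are all assumptions, and the sequences are hypothetical token lists.

```python
def pad(sequences, value=0):
    """Left-pad every sequence with `value` so all match the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]


# Hypothetical tokenized sentences of different lengths.
sequences = [
    [5, 3, 2, 4],
    [5, 3, 2, 7],
    [6, 3, 2, 4],
    [8, 6, 9, 2, 4, 10, 11],
]

padded = pad(sequences)
for row in padded:
    print(row)
```

Every row now has the same length as the longest sentence, which is the kind of uniformity the network needs. Keras offers a pad_sequences utility for this job, so in practice you would not write the helper yourself.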