We have a pre-trained
sub-words tokenizer now, so we can inspect its vocabulary by looking
at its sub-words property. If we want to see how it
encodes or decode strings, we can do so with this code. So we can encode
simply by calling the encode method
passing it the string. Similarly, decode by
calling the decode method. We can see the results
of the tokenization when we print out the encoded
and decoded strings. If we want to see
the tokens themselves, we can take each element
and decode that, showing the value to token. Note that this is
case sensitive and punctuation is maintained unlike
the tokenizer we saw in the last video. You don't need to do
anything with them yet, I just wanted to show you how sub-word tokenization works. So now, let's take a look at
classifying IMDB with it. What the results are going to be? Here's the model. Again, it should look very
familiar at this point. One thing to take
into account though, is the shape of
the vectors coming from the tokenizer through
the embedding, and it's not easily flattened. So we'll use Global Average
Pooling 1D instead. Trying to flatten them, will
cause a TensorFlow crash. Here's the output of
the model summary. You can compile and train
the model like this, it's pretty standard code. Training is dealing with a lot of hyper-parameters
and sub-words, so expect it to be slow. Running on a colab
with GPU took me about four-and-a-half
minutes per epoch. So set it off and give
it some time to train. If your results don't look good, don't worry, that's
part of the point. You can graph the results
with this code, and your graphs will probably
look something like this. In my case, the accuracy was
barely about 50 percent, which you could get
with a random guess. While losses decreasing, it's decreasing in a very small way. So why do you think
that might be? Well, the keys in
the fact that we're using sub-words and not for-words, sub-word meanings are often
nonsensical and it's only when we put them together in sequences that they have
meaningful semantics. Thus, some way from learning from sequences would be
a great way forward, and that's exactly what
you're going to do next week with
recurrent neural networks