So we'll start by looking
at TensorFlow Datasets, which you can find at this URL. If you look at the
IMDB reviews dataset, you'll see that
there are a bunch of versions that you can use. These include "plain_text",
which we used in the last video, "bytes", where the text is
encoded at the byte level, and sub-word encoding, which
we'll look at in this video. One thing to note is
that you should use TensorFlow 2.0 for the code
I'll be sharing here. There are some inconsistencies
with version 1.x. So if you're using Colab, you should first print
out the TF version. If it is 1.x, you should
install TensorFlow 2 like this. Note that over time the alpha0 in the version name will change to later versions. So I would recommend
that you look up the latest install guide for TensorFlow 2.0 if
you hit any issues. I'd recommend running
this code again to ensure that you are on version 2
before going any further, particularly if you're using a Colab or a Jupyter notebook. Once you're on TensorFlow 2, you can now start using
the IMDB subwords dataset. We'll use the 8k version today. Getting access to
your training and test data is then
as easy as this. Next, if you want to access
the sub-word tokenizer, you can do it with this code. You can learn all about the subword text encoder
at this URL.
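Since the code on screen isn't captured here, this is a rough sketch of the steps just described: checking the TensorFlow version, loading the 8k subwords version of the IMDB reviews, and getting the tokenizer. It assumes the tensorflow and tensorflow_datasets packages are installed and that the "imdb_reviews/subwords8k" config is still available in your TensorFlow Datasets release, since newer releases may have deprecated or removed it. The function names major_version and load_imdb_subwords are just illustrative helpers, not part of either library.

```python
def major_version(version: str) -> int:
    """Return the major part of a version string such as '2.0.0-alpha0'."""
    return int(version.split(".")[0])


def load_imdb_subwords():
    """Return (train_data, test_data, tokenizer) for IMDB subwords8k.

    The imports are kept inside the function so this file can be loaded
    without TensorFlow installed. The first call downloads the data.
    """
    import tensorflow as tf
    import tensorflow_datasets as tfds

    # Print the TF version first; stop here if it is still 1.x.
    print(tf.__version__)
    if major_version(tf.__version__) < 2:
        raise RuntimeError(
            "TensorFlow 2 expected, found " + tf.__version__
            + "; see the current install guide."
        )

    # Assumption: this config name exists in the installed TFDS release.
    imdb, info = tfds.load(
        "imdb_reviews/subwords8k", with_info=True, as_supervised=True
    )

    # as_supervised=True yields (encoded_text, label) pairs in each split.
    train_data, test_data = imdb["train"], imdb["test"]

    # Assumption: the text feature exposes its sub-word encoder as `.encoder`,
    # as it did in the TFDS releases that shipped the subwords configs.
    tokenizer = info.features["text"].encoder
    return train_data, test_data, tokenizer
```

In a notebook you would call train_data, test_data, tokenizer = load_imdb_subwords() once, and then reuse the three values in later cells.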