Earlier this week, we looked at using TensorFlow Datasets, or TFDS, to load the reviews from the IMDb dataset and perform
classification on them. In that video, you loaded
the raw text for the reviews, and tokenized them yourself. However, often with
prepackaged datasets like these, some data scientists have done
the work for you already, and the IMDb dataset
is no exception. In this video, we'll take
a look at a version of the IMDb dataset that has
been pre-tokenized for you, but the tokenization
is done on subwords. We'll use that to demonstrate how text classification can
have some unique issues, namely that the order
of the words can be just as important
as which words are present.
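
To make that last point concrete, here's a minimal sketch in Python. It doesn't use the real pre-tokenized IMDb data or its encoder. The tiny vocabulary and the encode and decode helpers are hypothetical, made up purely for illustration, and the greedy longest-match rule is just one simple way to split text into subwords. What it shows is that two texts can contain exactly the same subword tokens and still be different sequences, and that an unknown word gets broken into pieces that don't mean much on their own.

```python
from collections import Counter

# Hypothetical toy vocabulary of subword pieces (NOT the real TFDS vocabulary).
# Single letters and the space act as a fallback so any lowercase text can be encoded.
VOCAB = ["tensor", "flow", "ten", "sor", " "] + list("abcdefghijklmnopqrstuvwxyz")


def encode(text, vocab=VOCAB):
    """Greedy longest-match subword tokenization; returns a list of vocab indices."""
    pieces = sorted(vocab, key=len, reverse=True)  # try the longest pieces first
    ids = []
    i = 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                ids.append(vocab.index(piece))
                i += len(piece)
                break
        else:
            raise ValueError(f"no subword covers {text[i]!r}")
    return ids


def decode(ids, vocab=VOCAB):
    """Join the subword pieces back into the original text."""
    return "".join(vocab[i] for i in ids)


a = encode("flow tensor")
b = encode("tensor flow")

# Same tokens are present in both, but the sequences differ.
print(Counter(a) == Counter(b))  # same bag of tokens
print(a == b)                    # different order

# A word outside the vocabulary is split into several smaller pieces.
print([VOCAB[i] for i in encode("flowers")])
```

A model that only looks at which tokens are present can't tell a and b apart. It's only when you take the order into account that the two become distinguishable, and with subwords that matters even more, because the individual pieces often carry little meaning by themselves.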