1
00:00:00,000 --> 00:00:01,960
Earlier this week, we looked at

2
00:00:01,960 --> 00:00:03,910
using TensorFlow Datasets,

3
00:00:03,910 --> 00:00:06,040
or TFDS, to load the reviews from

4
00:00:06,040 --> 00:00:09,550
the IMDb dataset and perform
classification on them.

5
00:00:09,550 --> 00:00:12,910
In that video, you loaded
the raw text for the reviews,

6
00:00:12,910 --> 00:00:14,675
and tokenized it yourself.

7
00:00:14,675 --> 00:00:18,190
However, often with
prepackaged datasets like these,

8
00:00:18,190 --> 00:00:21,130
some data scientists have done
the work for you already,

9
00:00:21,130 --> 00:00:23,650
and the IMDb dataset
is no exception.

10
00:00:23,650 --> 00:00:26,500
In this video, we'll take
a look at a version of

11
00:00:26,500 --> 00:00:30,460
the IMDb dataset that has
been pre-tokenized for you,

12
00:00:30,460 --> 00:00:33,140
but the tokenization
is done on subwords.

13
00:00:33,140 --> 00:00:34,900
We'll use that to demonstrate how

14
00:00:34,900 --> 00:00:38,500
text classification can
have some unique issues,

15
00:00:38,500 --> 00:00:40,540
namely that the sequence
of words can

16
00:00:40,540 --> 00:00:43,730
be just as important
as their existence.