1
00:00:00,000 --> 00:00:02,865
So we'll start by looking
at TensorFlow Datasets,

2
00:00:02,865 --> 00:00:05,025
you can find them at this URL.

3
00:00:05,025 --> 00:00:07,905
If you look at the
IMDB reviews dataset,

4
00:00:07,905 --> 00:00:09,240
you'll see that
there's a bunch of

5
00:00:09,240 --> 00:00:10,860
versions that you can use.

6
00:00:10,860 --> 00:00:12,840
These include "plain_text",
which we

7
00:00:12,840 --> 00:00:14,670
used in the last video, "bytes",

8
00:00:14,670 --> 00:00:17,505
where the text is
encoded at byte level,

9
00:00:17,505 --> 00:00:21,995
and subword encoding, which
we'll look at in this video.

10
00:00:21,995 --> 00:00:24,650
One thing to note is
that you should use

11
00:00:24,650 --> 00:00:27,590
TensorFlow 2.0 for the code
I'll be sharing here.

12
00:00:27,590 --> 00:00:30,875
There are some inconsistencies
with version 1.x.

13
00:00:30,875 --> 00:00:32,570
So if you're using Colab,

14
00:00:32,570 --> 00:00:35,320
you should first print
out the TF version.

15
00:00:35,320 --> 00:00:39,375
If it is 1.x, you should
install TensorFlow 2 like this.

16
00:00:39,375 --> 00:00:41,270
Note that over time the alpha

17
00:00:41,270 --> 00:00:43,640
0 will change to later versions.

18
00:00:43,640 --> 00:00:45,410
So I would recommend
that you look up

19
00:00:45,410 --> 00:00:46,850
the latest install guide for

20
00:00:46,850 --> 00:00:49,385
TensorFlow 2.0 if
you hit any issues.

21
00:00:49,385 --> 00:00:52,430
I'd recommend running
this code again to ensure that

22
00:00:52,430 --> 00:00:55,550
you are on version 2
before going any further,

23
00:00:55,550 --> 00:00:56,720
particularly if you're using

24
00:00:56,720 --> 00:00:59,605
a Colab or a Jupyter notebook.

25
00:00:59,605 --> 00:01:02,399
Once you're on TensorFlow 2,

26
00:01:02,399 --> 00:01:06,630
you can now start using
the IMDB subwords dataset.

27
00:01:06,630 --> 00:01:09,280
We'll use the 8k version today.

28
00:01:09,280 --> 00:01:11,300
Getting access to
your training and

29
00:01:11,300 --> 00:01:14,145
test data is then
as easy as this.

30
00:01:14,145 --> 00:01:17,840
Next, if you want to access
the subwords tokenizer,

31
00:01:17,840 --> 00:01:19,460
you can do it with this code.

32
00:01:19,460 --> 00:01:20,810
You can learn all about

33
00:01:20,810 --> 00:01:24,270
the subwords text
encoder at this URL.