1
00:00:00,000 --> 00:00:01,695
In the previous video,

2
00:00:01,695 --> 00:00:04,710
you saw how to tokenize
the words and sentences,

3
00:00:04,710 --> 00:00:06,405
building up a dictionary of

4
00:00:06,405 --> 00:00:08,715
all the words to make a corpus.

5
00:00:08,715 --> 00:00:11,610
The next step will be
to turn your sentences

6
00:00:11,610 --> 00:00:14,820
into lists of values
based on these tokens.

7
00:00:14,820 --> 00:00:16,425
Once you have them,

8
00:00:16,425 --> 00:00:19,680
you'll likely also need to
manipulate these lists,

9
00:00:19,680 --> 00:00:22,625
not least to make
every sentence the same length,

10
00:00:22,625 --> 00:00:24,150
otherwise, it may be hard to

11
00:00:24,150 --> 00:00:26,070
train a neural network with them.

12
00:00:26,070 --> 00:00:28,140
Remember when we
were doing images,

13
00:00:28,140 --> 00:00:30,270
we defined an input layer
with the size of

14
00:00:30,270 --> 00:00:32,925
the image that we're feeding
into the neural network.

15
00:00:32,925 --> 00:00:36,149
In cases where images
were differently sized,

16
00:00:36,149 --> 00:00:38,055
we would resize them to fit.

17
00:00:38,055 --> 00:00:40,785
Well, you're going to face
the same thing with text.

18
00:00:40,785 --> 00:00:43,550
Fortunately, TensorFlow
includes APIs

19
00:00:43,550 --> 00:00:45,035
to handle these issues.

20
00:00:45,035 --> 00:00:47,405
We'll look at those
in this video.

21
00:00:47,405 --> 00:00:51,175
Let's start with creating
a list of sequences,

22
00:00:51,175 --> 00:00:53,660
the sentences encoded
with the tokens that we

23
00:00:53,660 --> 00:00:55,490
generated. I've updated

24
00:00:55,490 --> 00:00:58,240
the code that we've been
working on to this.

25
00:00:58,240 --> 00:01:00,410
First of all, I've added

26
00:01:00,410 --> 00:01:03,395
another sentence to the end
of the sentences list.

27
00:01:03,395 --> 00:01:05,765
Note that all of
the previous sentences

28
00:01:05,765 --> 00:01:07,040
had four words in them.

29
00:01:07,040 --> 00:01:08,495
So this one's a bit longer.

30
00:01:08,495 --> 00:01:11,735
We'll use that to demonstrate
padding in a moment.

31
00:01:11,735 --> 00:01:13,775
The next piece of
code is this one,

32
00:01:13,775 --> 00:01:14,960
where I simply call

33
00:01:14,960 --> 00:01:17,735
texts_to_sequences
on the tokenizer,

34
00:01:17,735 --> 00:01:20,870
and it will turn them into
a set of sequences for me.

35
00:01:20,870 --> 00:01:22,775
So if I run this code,

36
00:01:22,775 --> 00:01:24,535
this will be the output.

37
00:01:24,535 --> 00:01:26,945
At the top is the new dictionary,

38
00:01:26,945 --> 00:01:30,950
with new tokens for
my new words like amazing,

39
00:01:30,950 --> 00:01:33,470
think, is, and do.

40
00:01:33,470 --> 00:01:36,440
At the bottom is
my list of sentences

41
00:01:36,440 --> 00:01:39,050
that have been encoded
into integer lists,

42
00:01:39,050 --> 00:01:41,360
with the tokens
replacing the words.

43
00:01:41,360 --> 00:01:46,465
So for example, 'I love my
dog' becomes 4, 2, 1, 3.

44
00:01:46,465 --> 00:01:48,840
One really handy thing about

45
00:01:48,840 --> 00:01:51,350
this that you'll use
later is the fact

46
00:01:51,350 --> 00:01:52,940
that the texts_to_sequences

47
00:01:52,940 --> 00:01:56,105
call can take
any set of sentences,

48
00:01:56,105 --> 00:01:59,625
so it can encode them based
on the word set that it

49
00:01:59,625 --> 00:02:03,320
learned from the one that was
passed into fit_on_texts.

50
00:02:03,320 --> 00:02:06,380
This is very significant if
you think ahead a little bit.

51
00:02:06,380 --> 00:02:09,755
If you train a neural network
on a corpus of texts,

52
00:02:09,755 --> 00:02:12,935
and the text has a word index
generated from it,

53
00:02:12,935 --> 00:02:16,250
then when you want to do
inference with the trained model,

54
00:02:16,250 --> 00:02:18,440
you'll have to encode
the text that you want to

55
00:02:18,440 --> 00:02:21,440
infer on with
the same word index,

56
00:02:21,440 --> 00:02:23,645
otherwise it would
be meaningless.

57
00:02:23,645 --> 00:02:26,060
So if you consider this code,

58
00:02:26,060 --> 00:02:27,935
what do you expect
the outcome to be?

59
00:02:27,935 --> 00:02:31,190
There are some familiar words
here, like love, my,

60
00:02:31,190 --> 00:02:34,805
and dog but also
some previously unseen ones.

61
00:02:34,805 --> 00:02:36,165
If I run this code,

62
00:02:36,165 --> 00:02:37,855
this is what I would get.

63
00:02:37,855 --> 00:02:41,580
I've added the dictionary
underneath for convenience.

64
00:02:41,580 --> 00:02:44,465
So 'I really love my dog'

65
00:02:44,465 --> 00:02:47,475
would still be encoded
as 4, 2, 1, 3,

66
00:02:47,475 --> 00:02:50,750
which is 'I love my
dog' with 'really'

67
00:02:50,750 --> 00:02:55,015
being lost as the word is
not in the word index,

68
00:02:55,015 --> 00:02:59,610
and 'my dog loves my manatee'
would get encoded to 1,

69
00:02:59,610 --> 00:03:03,910
3, 1, which is just 'my dog my'.