1
00:00:00,000 --> 00:00:02,040
So what do we learn from this?

2
00:00:02,040 --> 00:00:03,720
First of all, we really

3
00:00:03,720 --> 00:00:05,820
need a lot of
training data to get

4
00:00:05,820 --> 00:00:07,530
a broad vocabulary or

5
00:00:07,530 --> 00:00:09,435
we could end up with
sentences like,

6
00:00:09,435 --> 00:00:12,030
my dog my, like we just did.

7
00:00:12,030 --> 00:00:14,460
Secondly, in many cases,

8
00:00:14,460 --> 00:00:16,410
it's a good idea, instead of

9
00:00:16,410 --> 00:00:18,780
just ignoring unseen words,

10
00:00:18,780 --> 00:00:20,475
to put a special value in

11
00:00:20,475 --> 00:00:22,815
when an unseen word
is encountered.

12
00:00:22,815 --> 00:00:24,480
You can do this with a property

13
00:00:24,480 --> 00:00:26,625
on the tokenizer.
Let's take a look.

14
00:00:26,625 --> 00:00:28,950
Here's the complete code showing

15
00:00:28,950 --> 00:00:32,220
both the original sentences
and the test data.

16
00:00:32,220 --> 00:00:34,575
What I've changed is to add

17
00:00:34,575 --> 00:00:38,670
a property, oov_token, to
the tokenizer constructor.

18
00:00:38,670 --> 00:00:42,225
You can see now that I've
specified that I want the token

19
00:00:42,225 --> 00:00:44,870
OOV, for out of vocabulary,

20
00:00:44,870 --> 00:00:48,320
to be used for words that
aren't in the word index.

21
00:00:48,320 --> 00:00:50,870
You can use whatever
you like here,

22
00:00:50,870 --> 00:00:53,420
but remember that it should
be something unique and

23
00:00:53,420 --> 00:00:56,975
distinct that isn't
confused with a real word.

24
00:00:56,975 --> 00:00:59,585
So now, if I run this code,

25
00:00:59,585 --> 00:01:01,760
I'll get my test sequences
looking like this.

26
00:01:01,760 --> 00:01:05,120
I pasted the word index
underneath so you can look it up.

27
00:01:05,120 --> 00:01:07,355
The first sentence will be,

28
00:01:07,355 --> 00:01:10,280
i, out of vocab, love my dog.

29
00:01:10,280 --> 00:01:13,130
The second will be, my dog oov,

30
00:01:13,130 --> 00:01:17,209
my oov. Still not
syntactically great,

31
00:01:17,209 --> 00:01:18,800
but it is doing better.
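
To make the effect of the out-of-vocabulary token concrete, here is a minimal plain-Python sketch of the idea. It is not the tokenizer's own implementation. The helper names (`tokenize`, `build_word_index`, `to_sequences`), the `<OOV>` marker string, and the example sentences are all assumptions chosen for illustration.

```python
from collections import Counter

OOV_TOKEN = "<OOV>"  # assumed marker; any string that cannot be mistaken for a real word


def tokenize(sentence):
    # Lowercase each word and strip basic punctuation (a simplification).
    words = (w.strip("!?.,").lower() for w in sentence.split())
    return [w for w in words if w]


def build_word_index(sentences, oov_token=None):
    # More frequent words get smaller indices; indices start at 1.
    counts = Counter(w for s in sentences for w in tokenize(s))
    words = [w for w, _ in counts.most_common()]
    if oov_token is not None:
        words.insert(0, oov_token)  # reserve index 1 for unseen words
    return {w: i for i, w in enumerate(words, start=1)}


def to_sequences(sentences, word_index, oov_token=None):
    oov_id = word_index.get(oov_token)  # None when no OOV token was reserved
    sequences = []
    for s in sentences:
        ids = (word_index.get(w, oov_id) for w in tokenize(s))
        # Without an OOV token, unseen words are silently dropped.
        sequences.append([i for i in ids if i is not None])
    return sequences


# Illustrative data (assumed, not taken from the transcript).
train_sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?",
]
test_sentences = ["i really love my dog", "my dog loves my manatee"]

plain_index = build_word_index(train_sentences)
oov_index = build_word_index(train_sentences, OOV_TOKEN)

print(to_sequences(test_sentences, plain_index))           # unseen words vanish
print(to_sequences(test_sentences, oov_index, OOV_TOKEN))  # unseen words become index 1
```

With the OOV token reserved, each test sequence keeps the same length as its sentence, which is why the second sentence reads "my dog oov my oov" rather than "my dog my".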

32
00:01:18,800 --> 00:01:22,565
As the corpus grows and
more words are in the index,

33
00:01:22,565 --> 00:01:25,130
hopefully previously
unseen sentences

34
00:01:25,130 --> 00:01:26,690
will have better coverage.

35
00:01:26,690 --> 00:01:28,360
Next up is padding.

36
00:01:28,360 --> 00:01:30,320
We mentioned this
earlier when we were

37
00:01:30,320 --> 00:01:32,855
building neural networks
to handle pictures.

38
00:01:32,855 --> 00:01:35,450
When we fed them into
the network for training,

39
00:01:35,450 --> 00:01:37,940
we needed them to
be uniform in size.

40
00:01:37,940 --> 00:01:40,220
Often, we used the generators

41
00:01:40,220 --> 00:01:42,805
to resize the images
to fit, for example.

42
00:01:42,805 --> 00:01:44,420
With text, you'll face

43
00:01:44,420 --> 00:01:47,740
a similar requirement. Before
you can train with text,

44
00:01:47,740 --> 00:01:51,320
you need to have some level
of uniformity of size,

45
00:01:51,320 --> 00:01:53,820
so padding is your friend there.
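
As a rough illustration of what uniform size means for token sequences, here is a minimal plain-Python sketch that pads every sequence to the length of the longest one. The function name `pad`, the choice of 0 as the padding value, and padding at the front are assumptions for this example, not a description of any particular library's behavior.

```python
def pad(sequences, pad_value=0):
    # Find the longest sequence, then left-fill the shorter ones with pad_value.
    max_len = max((len(s) for s in sequences), default=0)
    return [[pad_value] * (max_len - len(s)) + list(s) for s in sequences]


# Two sequences of different lengths (illustrative values).
padded = pad([[5, 1, 3, 2, 4], [2, 4, 1]])
print(padded)  # [[5, 1, 3, 2, 4], [0, 0, 2, 4, 1]]
```

After padding, every row has the same length, so the batch can be treated as a rectangular array.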