1
00:00:00,000 --> 00:00:03,420
So I've made a few changes to
the code to handle padding.

2
00:00:03,420 --> 00:00:04,860
Here's the complete listing and

3
00:00:04,860 --> 00:00:07,005
we'll break it down
piece by piece.

4
00:00:07,005 --> 00:00:09,270
First, in order to use

5
00:00:09,270 --> 00:00:11,610
the padding functions
you'll have to import

6
00:00:11,610 --> 00:00:13,050
pad_sequences from

7
00:00:13,050 --> 00:00:16,605
tensorflow.keras.preprocessing.sequence.

8
00:00:16,605 --> 00:00:20,535
Then once the tokenizer
has created the sequences,

9
00:00:20,535 --> 00:00:22,740
these sequences can be passed to

10
00:00:22,740 --> 00:00:26,820
pad_sequences in order to
have them padded like this.

11
00:00:26,820 --> 00:00:28,905
The result is pretty
straightforward.

12
00:00:28,905 --> 00:00:30,630
You can now see that the list of

13
00:00:30,630 --> 00:00:32,730
sentences has been
padded out into

14
00:00:32,730 --> 00:00:34,935
a matrix and that

15
00:00:34,935 --> 00:00:37,350
each row in the matrix
has the same length.

16
00:00:37,350 --> 00:00:38,850
It achieved this by putting

17
00:00:38,850 --> 00:00:41,955
the appropriate number of
zeros before the sentence.

18
00:00:41,955 --> 00:00:45,170
So in the case of
the sentence 5, 3, 2, 4,

19
00:00:45,170 --> 00:00:46,910
it padded with zeros at the front.

20
00:00:46,910 --> 00:00:48,950
In the case of
the longer sentence

21
00:00:48,950 --> 00:00:50,855
here it didn't need to do any.
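
To make the default behaviour concrete, here is a minimal pure-Python sketch of pre-padding. The lesson's own listing calls TensorFlow's pad_sequences; the helper pad_pre below is a hypothetical stand-in that works on plain lists, and the second sequence is an assumed example of a longer sentence.

```python
def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length.

    Hypothetical stand-in for the default behaviour of pad_sequences.
    """
    width = max(len(seq) for seq in sequences)
    return [[value] * (width - len(seq)) + list(seq) for seq in sequences]


# [5, 3, 2, 4] is quoted in the lesson; the longer sequence is assumed.
sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
padded = pad_pre(sequences)
for row in padded:
    print(row)
```

The shorter sentence gains zeros at the front, while the longest one is left untouched because it already sets the row width.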

22
00:00:50,855 --> 00:00:54,320
Often you'll see examples
where the padding is

23
00:00:54,320 --> 00:00:57,755
after the sentence and not
before as you just saw.

24
00:00:57,755 --> 00:00:59,120
If you, like me,

25
00:00:59,120 --> 00:01:00,620
are more comfortable with that,

26
00:01:00,620 --> 00:01:02,795
you can change the code to this,

27
00:01:02,795 --> 00:01:06,020
adding the parameter
padding equals post.
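
As a rough illustration of what padding='post' changes, the sketch below moves the zeros to the end of each row. pad_post is a hypothetical pure-Python helper, not the library function, and the sample sequences are assumed for the example.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length.

    Hypothetical stand-in for pad_sequences(sequences, padding='post').
    """
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (width - len(seq)) for seq in sequences]


# Assumed sample data: one short and one long tokenized sentence.
sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
padded_post = pad_post(sequences)
for row in padded_post:
    print(row)
```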

28
00:01:06,020 --> 00:01:07,970
You may have noticed that

29
00:01:07,970 --> 00:01:11,675
the matrix width was the same
as the longest sentence.

30
00:01:11,675 --> 00:01:14,685
But you can override that
with the maxlen parameter.

31
00:01:14,685 --> 00:01:16,580
So for example if you only want

32
00:01:16,580 --> 00:01:19,490
your sentences to have
a maximum of five words,

33
00:01:19,490 --> 00:01:23,310
you can say maxlen
equals five like this.
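
Here is a small sketch of what maxlen=5 does when the other settings stay at their defaults. pad_to_maxlen is a hypothetical pure-Python helper standing in for pad_sequences(sequences, maxlen=5), and the sample sequences are assumed.

```python
def pad_to_maxlen(sequences, maxlen, value=0):
    """Force every row to exactly `maxlen` tokens using the 'pre' defaults.

    Hypothetical stand-in for pad_sequences(sequences, maxlen=maxlen).
    """
    rows = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        rows.append([value] * (maxlen - len(kept)) + kept)
    return rows


# Assumed sample data: one short and one long tokenized sentence.
sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
capped = pad_to_maxlen(sequences, maxlen=5)
for row in capped:
    print(row)
```

Every row now has exactly five entries: the short sentence gets one zero, and the long one is cut down to five tokens.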

34
00:01:23,310 --> 00:01:26,495
This of course will
lead to the question:

35
00:01:26,495 --> 00:01:29,555
if I have sentences longer
than maxlen,

36
00:01:29,555 --> 00:01:32,575
then I'll lose information,
but from where?

37
00:01:32,575 --> 00:01:35,560
As with the padding,
the default is pre,

38
00:01:35,560 --> 00:01:37,040
which means that
you will lose from

39
00:01:37,040 --> 00:01:38,750
the beginning of the sentence.

40
00:01:38,750 --> 00:01:40,700
If you want to override
this so that you

41
00:01:40,700 --> 00:01:42,590
lose from the end instead,

42
00:01:42,590 --> 00:01:46,490
you can do so with the
truncating parameter like this.
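
To compare the two truncation modes side by side, here is a simplified sketch. pad_sequences_sketch is a hypothetical pure-Python helper that mimics only the padding, truncating, and maxlen options of pad_sequences on plain lists; the sample sequences are assumed.

```python
def pad_sequences_sketch(sequences, maxlen, padding="pre",
                         truncating="pre", value=0):
    """Simplified, hypothetical stand-in for pad_sequences (plain lists only)."""
    rows = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops tokens from the front, "post" drops them from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        rows.append(fill + seq if padding == "pre" else seq + fill)
    return rows


# Assumed sample data: one short and one long tokenized sentence.
sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
lose_from_start = pad_sequences_sketch(sequences, maxlen=5, padding="post")
lose_from_end = pad_sequences_sketch(sequences, maxlen=5, padding="post",
                                     truncating="post")
print(lose_from_start[1])
print(lose_from_end[1])
```

With the default, the long sentence keeps its last five tokens; with truncating set to 'post', it keeps its first five.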

43
00:01:46,490 --> 00:01:49,640
So you've now seen how to
encode your sentences,

44
00:01:49,640 --> 00:01:52,820
how to pad them and how to
use word indexing to encode

45
00:01:52,820 --> 00:01:55,295
previously unseen sentences using

46
00:01:55,295 --> 00:01:57,035
out-of-vocab tokens.

47
00:01:57,035 --> 00:02:00,155
But you've done it with
very simple hard-coded data.

48
00:02:00,155 --> 00:02:02,180
Let's take a look at
the code in action in

49
00:02:02,180 --> 00:02:03,920
a screencast and then we'll come

50
00:02:03,920 --> 00:02:07,170
back and look at how to use
much more complex data.