1
00:00:02,091 --> 00:00:05,450
So, let's take a look at this in
a slightly more sophisticated example, so

2
00:00:05,450 --> 00:00:07,661
I'm going to take the tokenizer
as we had before.

3
00:00:07,661 --> 00:00:10,730
But I'm also going to introduce
this pad_sequences tool.

4
00:00:10,730 --> 00:00:13,383
The idea behind
the pad_sequences tool is that

5
00:00:13,383 --> 00:00:16,233
it allows you to use sentences
of different lengths

6
00:00:16,233 --> 00:00:20,538
and use padding or truncation to make
all of the sentences the same length, so

7
00:00:20,538 --> 00:00:23,181
in this case I have the same
sentences as before.

8
00:00:23,181 --> 00:00:27,051
I love my dog, I love my cat, you love my
dog, but I've added this new sentence.

9
00:00:27,051 --> 00:00:28,294
Do you think my dog is amazing?

10
00:00:28,294 --> 00:00:31,918
Which is a different length from these
other sentences, these all had four words,

11
00:00:31,918 --> 00:00:32,760
this one has more.

12
00:00:34,040 --> 00:00:36,750
So my tokenizer,
I'm going to create as before, but

13
00:00:36,750 --> 00:00:39,750
I'm also going to use this
parameter called oov_token.

14
00:00:39,750 --> 00:00:42,985
The idea here is that I'm
going to create a new token,

15
00:00:42,985 --> 00:00:47,500
a special token that I'm going to use for
words that aren't recognized,

16
00:00:47,500 --> 00:00:49,940
that aren't in the word_index itself.

17
00:00:49,940 --> 00:00:53,885
So, I'm going to just create this, and
I'm going to create something unique here,

18
00:00:53,885 --> 00:00:56,104
that I wouldn't expect
to see in the corpus.

19
00:00:56,104 --> 00:01:00,154
Something like OOV in angle brackets, and I'm
going to specify that as my OOV token.

20
00:01:00,154 --> 00:01:03,619
So then I'm going to call
tokenizer.fit_on_texts(sentences), and

21
00:01:03,619 --> 00:01:06,210
I'm going to take a look
at the word_index for that.

22
00:01:06,210 --> 00:01:11,253
And let's actually run this.
We'll see now that in my word_index,

23
00:01:11,253 --> 00:01:15,511
OOV is now value 1, my is value 2,
love is 3 et cetera.

24
00:01:15,511 --> 00:01:19,159
And we have a total of 11
entries in the word_index for this corpus;

25
00:01:19,159 --> 00:01:22,061
it's actually ten words
plus the OOV token.

26
00:01:23,740 --> 00:01:28,490
So on the tokenizer, I can then convert
the words in those sentences to

27
00:01:28,490 --> 00:01:33,740
sequences of tokens, by calling
the texts_to_sequences method.

28
00:01:33,740 --> 00:01:36,887
And that's going to produce sequences, and

29
00:01:36,887 --> 00:01:42,135
that's what I'm printing out here,
so my sequences are 5, 3, 2,

30
00:01:42,135 --> 00:01:47,900
4 for the first sentence,
which is "I love my dog", et cetera.

31
00:01:47,900 --> 00:01:52,721
So these are the sequences {5, 3, 2, 4},
{5, 3, 2, 7}, {6, 3, 2, 4}, and

32
00:01:52,721 --> 00:01:55,240
{8, 6, 9, 2, 4, 10, 11}.

33
00:01:55,240 --> 00:01:57,286
Now, we can see these are all
different lengths, but

34
00:01:57,286 --> 00:01:59,340
we want to make them the same length.

35
00:01:59,340 --> 00:02:02,063
So that's where pad_sequences
comes into it, so

36
00:02:02,063 --> 00:02:06,140
I'm going to say here my padded set
is pad_sequences with the sequences.

37
00:02:06,140 --> 00:02:09,440
I'm going to say, let's make it
a maximum length of five words.

38
00:02:09,440 --> 00:02:12,795
So, this maximum length of five words,

39
00:02:12,795 --> 00:02:18,925
means that these four-word sentences
end up being pre-padded with 0.

40
00:02:18,925 --> 00:02:23,733
And the seven-word sentence
ends up having its first two words cut off,

41
00:02:23,733 --> 00:02:27,350
because we did say
maximum length equals 5.

42
00:02:27,350 --> 00:02:30,471
If I said maximum length equals 8, for
example,

43
00:02:30,471 --> 00:02:34,349
and then ran this, we could see now
that they're all pre-padded with zeros,

44
00:02:34,349 --> 00:02:37,691
including this long sentence
being pre-padded with a single 0.

45
00:02:37,691 --> 00:02:40,968
There are parameters on pad_sequences
that we saw in the lessons

46
00:02:40,968 --> 00:02:42,615
that will allow us to do it post

47
00:02:42,615 --> 00:02:45,383
if we wanted to do so, and
then the zeros would appear afterwards.

48
00:02:48,440 --> 00:02:51,851
So now, let's take a look at
words that the tokenizer wasn't fit to.

49
00:02:51,851 --> 00:02:56,140
So for example, my test data is "I really
love my dog" and "my dog loves my manatee".

50
00:02:56,140 --> 00:03:00,251
If I now tokenize them and
create sequences out of that,

51
00:03:00,251 --> 00:03:04,810
we'll see {5, 1, 3, 2,
4} for the first sentence.

52
00:03:04,810 --> 00:03:10,789
And 5 is "I", 1 is out of vocabulary
because "really" wasn't actually there,

53
00:03:10,789 --> 00:03:13,950
and {3, 2, 4} is still "love my dog".

54
00:03:13,950 --> 00:03:17,197
So this is how the out-of-vocabulary
token comes into it:

55
00:03:17,197 --> 00:03:20,129
when it sees a word that
isn't in the word_index,

56
00:03:20,129 --> 00:03:24,919
it will just use the
out-of-vocabulary token 1 for that, and

57
00:03:24,919 --> 00:03:29,710
similarly for "my dog loves my manatee",
I get {2, 4, 1, 2, 1}.

58
00:03:29,710 --> 00:03:33,127
The word "loves" is not in it,
even though the word "love" is, and

59
00:03:33,127 --> 00:03:35,670
of course manatee isn't in it either.

60
00:03:35,670 --> 00:03:40,233
So I end up with just {2, 4,
2} as the words that really have

61
00:03:40,233 --> 00:03:43,760
meaning in this, and that's
"my", "dog", "my".

62
00:03:43,760 --> 00:03:46,536
And "loves" and "manatee"
are out-of-vocabulary tokens, and

63
00:03:46,536 --> 00:03:49,041
of course here you can see,
I'm also padding them.

64
00:03:49,041 --> 00:03:53,002
So, my {5, 1, 3, 2,
4} gets padded and my {2, 4, 1,

65
00:03:53,002 --> 00:03:56,762
2, 1} also gets padded,
because I said that maxlen=10.

66
00:03:56,762 --> 00:04:02,269
If I set that, for example, to 2,
we'll see they end up getting truncated;

67
00:04:02,269 --> 00:04:05,004
I'm getting the last two words here.

68
00:04:05,004 --> 00:04:08,207
So that's a basic introduction to
how the tokenizer works, and

69
00:04:08,207 --> 00:04:09,727
how padding actually works.

70
00:04:09,727 --> 00:04:13,291
Padding lets you get
your sentences all the same length.

71
00:04:13,291 --> 00:04:14,743
hope this was useful for you.
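
The walkthrough above can be retraced with a small pure-Python sketch. This is not the Keras implementation: the helper names (`words`, `fit_word_index`, `to_sequences`, `pad`) are hypothetical stand-ins for the tokenizer's fit_on_texts, texts_to_sequences, and pad_sequences steps, the `<OOV>` spelling of the marker is assumed, and the word splitting is deliberately simplified. It only mirrors the behavior described in the video: words ranked by frequency, index 1 reserved for out-of-vocabulary words, and padding and truncation applied at the front by default.

```python
import re
from collections import Counter

OOV = "<OOV>"  # assumed spelling of the out-of-vocabulary marker


def words(text):
    # Lowercase and keep runs of letters; a simplified stand-in for the
    # tokenizer's own punctuation filtering.
    return re.findall(r"[a-z']+", text.lower())


def fit_word_index(sentences, oov_token=OOV):
    # Rank words by frequency; ties keep first-seen order.
    counts = Counter(w for s in sentences for w in words(s))
    index = {oov_token: 1}  # index 1 is reserved for unseen words
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def to_sequences(texts, index, oov_token=OOV):
    # Unknown words fall back to the OOV index.
    return [[index.get(w, index[oov_token]) for w in words(t)] for t in texts]


def pad(sequences, maxlen, padding="pre", truncating="pre"):
    out = []
    for seq in sequences:
        if len(seq) > maxlen:
            # "pre" drops words from the front, "post" from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        zeros = [0] * (maxlen - len(seq))
        out.append(zeros + seq if padding == "pre" else seq + zeros)
    return out


sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog",
    "Do you think my dog is amazing?",
]
word_index = fit_word_index(sentences)
sequences = to_sequences(sentences, word_index)
padded = pad(sequences, maxlen=5)

test_data = ["I really love my dog", "my dog loves my manatee"]
test_seq = to_sequences(test_data, word_index)

print(word_index)
print(sequences)
print(padded)
print(test_seq)
print(pad(test_seq, maxlen=2))
```

With maxlen=5 the four-word sentences gain one leading zero and the seven-word sentence keeps only its last five tokens; passing padding="post" to this sketch moves the zeros to the end instead.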