1
00:00:00,000 --> 00:00:02,400
Here you can see
the tokenizer from

2
00:00:02,400 --> 00:00:04,080
the Keras
preprocessing library.

3
00:00:04,080 --> 00:00:05,760
The tokenizer is
your friend when it

4
00:00:05,760 --> 00:00:07,935
comes to doing natural
language processing.

5
00:00:07,935 --> 00:00:10,290
It does all the heavy
lifting of managing tokens,

6
00:00:10,290 --> 00:00:13,035
turning your text into
streams of tokens, et cetera.

7
00:00:13,035 --> 00:00:14,490
Now the reason why
you would need

8
00:00:14,490 --> 00:00:15,930
this is that when it
comes to training

9
00:00:15,930 --> 00:00:17,580
neural networks, you're
going to be doing a lot of

10
00:00:17,580 --> 00:00:19,350
math, and math deals with numbers.

11
00:00:19,350 --> 00:00:20,700
Instead of having

12
00:00:20,700 --> 00:00:23,130
the words being trained
in a neural network,

13
00:00:23,130 --> 00:00:24,960
you can actually have
a number representing

14
00:00:24,960 --> 00:00:27,450
that word, and it just makes
your life a lot easier.

15
00:00:27,450 --> 00:00:29,010
Here you can see
I have a body of

16
00:00:29,010 --> 00:00:30,330
text with my sentences,

17
00:00:30,330 --> 00:00:32,010
"I love my dog" and "I love my cat."

18
00:00:32,010 --> 00:00:35,114
I'm going to tokenize
those using the tokenizer.

19
00:00:35,114 --> 00:00:38,594
Now one note: you'll
often create

20
00:00:38,594 --> 00:00:40,830
the tokenizer using
the num_words property,

21
00:00:40,830 --> 00:00:42,255
or rather the num_words parameter.

22
00:00:42,255 --> 00:00:44,410
In this case, what
it's going to do is in

23
00:00:44,410 --> 00:00:46,630
your body of text
that it's tokenizing,

24
00:00:46,630 --> 00:00:48,895
it will take the 100
most common words

25
00:00:48,895 --> 00:00:50,995
or whatever value that
you actually put in here.

26
00:00:50,995 --> 00:00:52,990
I have a lot fewer than
100 unique words

27
00:00:52,990 --> 00:00:55,315
here so it's not really
going to have any effect.

28
00:00:55,315 --> 00:00:57,880
What fit_on_texts will then
do is it will go through

29
00:00:57,880 --> 00:01:00,340
the entire body of text and it

30
00:01:00,340 --> 00:01:03,010
will create a dictionary
with the key being

31
00:01:03,010 --> 00:01:07,100
the word and the value being
the token for that word.

32
00:01:07,100 --> 00:01:10,180
If I run this, we'll actually
see that in action.

33
00:01:10,180 --> 00:01:13,370
Here you can see now it's
created a word index for me.

34
00:01:13,370 --> 00:01:15,960
In the word index, "I"
would be number 1,

35
00:01:15,960 --> 00:01:17,115
"love" would be number 2,

36
00:01:17,115 --> 00:01:18,240
"my" will be number 3,

37
00:01:18,240 --> 00:01:19,290
"dog" will be number 4,

38
00:01:19,290 --> 00:01:21,000
and "cat" will be number 5.

39
00:01:21,000 --> 00:01:22,990
Those are the unique words that

40
00:01:22,990 --> 00:01:24,765
are actually in this
corpus of text.

41
00:01:24,765 --> 00:01:26,495
A few things to take note of.

42
00:01:26,495 --> 00:01:28,190
Number 1 is that
punctuation, like

43
00:01:28,190 --> 00:01:31,385
the spaces and the comma,
has actually been removed.

44
00:01:31,385 --> 00:01:33,470
It cleans up my text for me in

45
00:01:33,470 --> 00:01:36,040
that way, to just actually
pull out the words.

46
00:01:36,040 --> 00:01:37,970
Number 2, you may have
noticed that I have

47
00:01:37,970 --> 00:01:39,890
a lowercase i here
and an uppercase

48
00:01:39,890 --> 00:01:43,490
I here, and as you can see,
it's case insensitive:

49
00:01:43,490 --> 00:01:45,275
it's just using "i"
and it's

50
00:01:45,275 --> 00:01:47,885
giving the same
token for both of these.

51
00:01:47,885 --> 00:01:50,030
Now let me
change this a little

52
00:01:50,030 --> 00:01:51,770
bit by adding some
new words to it.

53
00:01:51,770 --> 00:01:54,005
For example, here:
"You love my dog!"

54
00:01:54,005 --> 00:01:55,580
Notice that "You" is

55
00:01:55,580 --> 00:01:59,210
capitalized and "dog" has
an exclamation mark after it,

56
00:01:59,210 --> 00:02:02,480
but it's not going to confuse
that with the previous dog.

57
00:02:02,480 --> 00:02:04,190
If I run it, we'll

58
00:02:04,190 --> 00:02:06,515
see now that I have a
whole new set of tokens.

59
00:02:06,515 --> 00:02:07,895
I have one new one:

60
00:02:07,895 --> 00:02:09,140
I have six instead of five,

61
00:02:09,140 --> 00:02:10,760
and that's because
the word "you" is

62
00:02:10,760 --> 00:02:12,890
the only unique new word in

63
00:02:12,890 --> 00:02:14,360
this corpus, because "love,"

64
00:02:14,360 --> 00:02:16,280
"my," and "dog" were
there previously,

65
00:02:16,280 --> 00:02:19,760
but you'll see the exclamation
mark from "dog" was removed.

66
00:02:19,910 --> 00:02:22,220
That's a basic introduction to

67
00:02:22,220 --> 00:02:23,720
how the tokenizer
actually works,

68
00:02:23,720 --> 00:02:27,060
and you'll be using that
a lot in this course.
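
As a rough companion to the walkthrough above, here is a minimal plain-Python sketch of the behavior described: lowercasing the text, stripping punctuation, and numbering words from 1 by how often they appear. It is not the Keras implementation. `build_word_index` and `FILTERS` are hypothetical names made up for illustration, and the exact example sentences (the lowercase "i", the comma, the exclamation mark) are assumed from the narration.

```python
# Hypothetical stand-in for the word-index step described in the video.
# Assumption: punctuation is replaced by spaces, text is lowercased, and
# words are ranked by frequency, with ties kept in first-seen order.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def build_word_index(sentences, filters=FILTERS):
    """Map each unique word to an integer token, starting at 1."""
    # Translation table that turns every filtered character into a space.
    table = str.maketrans({ch: " " for ch in filters})
    counts = {}  # dicts keep insertion order, so first-seen order is preserved
    for sentence in sentences:
        for word in sentence.lower().translate(table).split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so equally frequent words stay in first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: token for token, (word, _) in enumerate(ranked, start=1)}


two_sentences = ["i love my dog", "I, love my cat"]
print(build_word_index(two_sentences))
# {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

three_sentences = two_sentences + ["You love my dog!"]
print(build_word_index(three_sentences))
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

With the third sentence added, "you" is the only new entry, so the index grows from five words to six. In this sketch the numbering also shifts, because "love" and "my" now occur more often than "i".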