1
00:00:00,000 --> 00:00:01,920
Here's the code to encode

2
00:00:01,920 --> 00:00:04,055
the two sentences that
we just spoke about.

3
00:00:04,055 --> 00:00:06,400
Let's unpack it line by line.

4
00:00:06,400 --> 00:00:08,580
TensorFlow and Keras give us

5
00:00:08,580 --> 00:00:10,500
a number of ways to encode words,

6
00:00:10,500 --> 00:00:13,560
but the one I'm going to
focus on is the tokenizer.

7
00:00:13,560 --> 00:00:15,770
This will handle
the heavy lifting for us,

8
00:00:15,770 --> 00:00:17,160
generating the dictionary of

9
00:00:17,160 --> 00:00:21,285
word encodings and creating
vectors out of the sentences.

10
00:00:21,285 --> 00:00:23,925
I'll put the sentences
into an array.

11
00:00:23,925 --> 00:00:25,950
Note that I've
already capitalized

12
00:00:25,950 --> 00:00:29,175
'I' as it is at the beginning
of the sentence.

13
00:00:29,175 --> 00:00:32,595
I then create an instance
of the tokenizer.

14
00:00:32,595 --> 00:00:35,435
I pass a parameter,
num_words, to it.

15
00:00:35,435 --> 00:00:37,970
In this case, I'm using
100, which is way too big,

16
00:00:37,970 --> 00:00:40,675
as there are only five
distinct words in this data.

17
00:00:40,675 --> 00:00:43,760
If you're creating a training
set based on lots of text,

18
00:00:43,760 --> 00:00:45,020
you usually don't know

19
00:00:45,020 --> 00:00:48,230
how many unique distinct words
there are in that text.

20
00:00:48,230 --> 00:00:50,300
So by setting
this hyperparameter,

21
00:00:50,300 --> 00:00:52,130
what the tokenizer
will do is take

22
00:00:52,130 --> 00:00:55,700
the top 100 words by volume
and just encode those.

23
00:00:55,700 --> 00:00:58,820
It's a handy shortcut when
dealing with lots of data,

24
00:00:58,820 --> 00:01:00,890
and worth experimenting
with when you

25
00:01:00,890 --> 00:01:03,800
train with real data
later in this course.

26
00:01:03,800 --> 00:01:05,690
Sometimes the impact of

27
00:01:05,690 --> 00:01:08,544
fewer words can be minimal
in training accuracy,

28
00:01:08,544 --> 00:01:10,380
but huge in training time,

29
00:01:10,380 --> 00:01:12,540
but do use it carefully.
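To make the top-N idea concrete, here is a minimal pure-Python sketch. It is not the Keras tokenizer itself: `top_n_vocabulary` is a hypothetical helper, written only to illustrate keeping the most frequent words and dropping the rest.

```python
from collections import Counter


def top_n_vocabulary(sentences, num_words):
    """Hypothetical helper: index only the num_words most frequent words."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split on whitespace (a simplification for this sketch).
        counts.update(sentence.lower().split())
    # Stable sort by descending count; ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:num_words]
    # Indices start at 1, most frequent word first.
    return {word: index for index, (word, _) in enumerate(kept, start=1)}


sentences = ["I love my dog", "I love my cat"]
small_vocab = top_n_vocabulary(sentences, num_words=3)
print(small_vocab)
```

With a limit of 3, only the three most frequent words get an index; 'dog' and 'cat' are left out.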

30
00:01:12,540 --> 00:01:14,920
The fit_on_texts method of

31
00:01:14,920 --> 00:01:18,805
the tokenizer then takes in
the data and encodes it.

32
00:01:18,805 --> 00:01:22,720
The tokenizer provides
a word_index property which

33
00:01:22,720 --> 00:01:26,200
returns a dictionary
containing key-value pairs,

34
00:01:26,200 --> 00:01:27,610
where the key is the word,

35
00:01:27,610 --> 00:01:30,250
and the value is the
token for that word,

36
00:01:30,250 --> 00:01:33,830
which you can inspect by
simply printing it out.

37
00:01:33,830 --> 00:01:35,605
You can see the results here.
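As a rough illustration of what such a word index looks like, the following pure-Python sketch mimics the behavior described here (lowercasing, then numbering words from most to least frequent). `build_word_index` is a hypothetical stand-in, not the Keras API.

```python
from collections import Counter


def build_word_index(sentences):
    """Hypothetical stand-in: map each word to an integer token."""
    counts = Counter()
    for sentence in sentences:
        # Lowercasing is why the capitalized 'I' ends up as 'i'.
        counts.update(sentence.lower().split())
    # Stable sort by descending count; ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


sentences = ["I love my dog", "I love my cat"]
word_index = build_word_index(sentences)
print(word_index)
# {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
```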

38
00:01:35,605 --> 00:01:38,515
Remember when we said that
the word 'I' was capitalized,

39
00:01:38,515 --> 00:01:40,600
note that it's lower-cased here.

40
00:01:40,600 --> 00:01:43,225
That's another thing that
the tokenizer does for you.

41
00:01:43,225 --> 00:01:45,160
It also strips punctuation out.

42
00:01:45,160 --> 00:01:48,865
This is really useful if
you consider this case.

43
00:01:48,865 --> 00:01:51,235
Here, I've added
another sentence,

44
00:01:51,235 --> 00:01:52,810
'You love my dog!'

45
00:01:52,810 --> 00:01:54,780
but there's something
very different about it.

46
00:01:54,780 --> 00:01:57,655
I've added an exclamation
after the word 'dog'.

47
00:01:57,655 --> 00:01:59,420
Now, should this be treated as

48
00:01:59,420 --> 00:02:01,190
a different word than just dog?

49
00:02:01,190 --> 00:02:02,915
Well, of course not.

50
00:02:02,915 --> 00:02:05,345
So the results of
the code that we saw

51
00:02:05,345 --> 00:02:07,640
earlier with
this new corpus of data,

52
00:02:07,640 --> 00:02:09,365
will look like this.

53
00:02:09,365 --> 00:02:13,145
Notice that we still only
have 'dog' as a key,

54
00:02:13,145 --> 00:02:16,415
that the exclamation
didn't impact this,

55
00:02:16,415 --> 00:02:19,010
and of course, we
have a new key for

56
00:02:19,010 --> 00:02:21,995
the word 'you' that was detected.
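The punctuation case can be sketched the same way in plain Python. This is an illustration under assumptions, not the Keras implementation: `build_word_index` is a hypothetical helper, and replacing every character in `string.punctuation` with a space is a simplification of whatever filter the real tokenizer applies.

```python
import string
from collections import Counter

# Assumption for this sketch: treat every ASCII punctuation mark as a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def build_word_index(sentences):
    """Hypothetical helper: lowercase, strip punctuation, rank by frequency."""
    counts = Counter()
    for sentence in sentences:
        cleaned = sentence.lower().translate(PUNCT_TABLE)
        counts.update(cleaned.split())
    # Stable sort by descending count; ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


sentences = ["I love my dog", "I love my cat", "You love my dog!"]
word_index = build_word_index(sentences)
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

'dog!' and 'dog' collapse into a single key, and 'you' shows up as the one new entry.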

57
00:02:21,995 --> 00:02:25,610
So you've seen the beginnings
of handling texts by

58
00:02:25,610 --> 00:02:28,580
creating word-based
encodings of that text,

59
00:02:28,580 --> 00:02:31,355
with some very simple code
in TensorFlow and Keras.

60
00:02:31,355 --> 00:02:32,930
In the next video,
we'll take a look

61
00:02:32,930 --> 00:02:35,430
at the code and see how it works.