1
00:00:00,000 --> 00:00:02,160
This code is very similar to

2
00:00:02,160 --> 00:00:04,050
what you saw in
the earlier videos,

3
00:00:04,050 --> 00:00:05,985
but let's look at
it line by line.

4
00:00:05,985 --> 00:00:09,480
We've just created a sentences
list from the headlines

5
00:00:09,480 --> 00:00:11,040
in the sarcasm data set.

6
00:00:11,040 --> 00:00:14,310
So by calling
tokenizer.fit_on_texts,

7
00:00:14,310 --> 00:00:16,050
we'll generate the word index

8
00:00:16,050 --> 00:00:18,315
and we'll initialize
the tokenizer.

9
00:00:18,315 --> 00:00:20,480
We can see the word index as

10
00:00:20,480 --> 00:00:23,135
before by reading
the word_index property.

11
00:00:23,135 --> 00:00:25,730
Note that this returns
all words that

12
00:00:25,730 --> 00:00:29,060
the tokenizer saw when
tokenizing the sentences.

13
00:00:29,060 --> 00:00:33,165
If you specify num_words to
get the top 1000 or whatever,

14
00:00:33,165 --> 00:00:34,790
you may be confused by seeing

15
00:00:34,790 --> 00:00:36,650
something greater than that here.

16
00:00:36,650 --> 00:00:38,270
It's an easy mistake to make,

17
00:00:38,270 --> 00:00:40,075
but the key thing to remember

18
00:00:40,075 --> 00:00:41,520
is that when it takes the top

19
00:00:41,520 --> 00:00:43,560
1000 or whatever you specified,

20
00:00:43,560 --> 00:00:47,365
it does that in the
texts_to_sequences process.

21
00:00:47,365 --> 00:00:49,580
Our word index is much

22
00:00:49,580 --> 00:00:51,500
larger than with
the previous example.

23
00:00:51,500 --> 00:00:53,120
So we'll see a greater variety

24
00:00:53,120 --> 00:00:55,295
of words in it. Here are a few.

25
00:00:55,295 --> 00:00:58,430
Now we'll create
the sequences from the text,

26
00:00:58,430 --> 00:00:59,585
as well as padding them.

27
00:00:59,585 --> 00:01:00,890
Here's the code to do that.

28
00:01:00,890 --> 00:01:03,710
It's very similar to
what you did earlier,

29
00:01:03,710 --> 00:01:06,235
and here's the output.

30
00:01:06,235 --> 00:01:09,060
First, I took the first headline

31
00:01:09,060 --> 00:01:11,375
in the data set and
showed its output.

32
00:01:11,375 --> 00:01:14,060
We can see that it has been
encoded with the values for

33
00:01:14,060 --> 00:01:18,010
the keys that are the corresponding
words in the sentence.

34
00:01:18,010 --> 00:01:21,350
This is the size of
the padded matrix.

35
00:01:21,350 --> 00:01:24,710
We had 26,709 sentences,

36
00:01:24,710 --> 00:01:26,630
and they were encoded
with padding

37
00:01:26,630 --> 00:01:28,190
to get them up to 40 words

38
00:01:28,190 --> 00:01:30,835
long, which was the length
of the longest one.

39
00:01:30,835 --> 00:01:32,820
You could truncate
this if you like,

40
00:01:32,820 --> 00:01:34,905
but I'll keep it at 40.

41
00:01:34,905 --> 00:01:38,395
That's it for processing
the Sarcasm data set.

42
00:01:38,395 --> 00:01:41,730
Let's take a look at that
in action in a screencast.
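
If it helps to see the steps above in one place, here is a minimal plain-Python sketch of the idea. It is not the Keras tokenizer itself: the helper functions, the three toy sentences, and the choice to pad with zeros at the end are all assumptions made for illustration. What it tries to show is that the word index holds every word seen, that the num_words cap only takes effect when the text is turned into sequences, and that padding brings every row up to the length of the longest sentence.

```python
from collections import Counter


def build_word_index(sentences):
    """Rank every word seen, most frequent first, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(sentences, word_index, num_words=None):
    """Encode each sentence as word ids; the num_words cap is applied here, not in the index."""
    sequences = []
    for s in sentences:
        ids = [word_index[w] for w in s.lower().split() if w in word_index]
        if num_words is not None:
            # Assumption in this sketch: keep only ids below num_words.
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


def pad_sequences(sequences, pad_value=0):
    """Pad every sequence at the end, up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Toy stand-ins for the headlines (hypothetical data, not the Sarcasm data set).
sentences = [
    "the cat sat on the mat",
    "the dog sat",
    "a bird flew over the mat today",
]

word_index = build_word_index(sentences)
capped = texts_to_sequences(sentences, word_index, num_words=4)
padded = pad_sequences(texts_to_sequences(sentences, word_index))

print(len(word_index))              # the index still holds every word seen
print(capped)                       # only the most frequent words survive the cap
print(len(padded), len(padded[0]))  # rows x length of the longest sentence
```

With the real data set, the same shape check on the padded matrix is what gives the 26,709 by 40 result mentioned above.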