1
00:00:00,240 --> 00:00:03,640
So let's take a look at tokenizing and
padding a lot more data.

2
00:00:03,640 --> 00:00:06,680
So instead of just a few sentences
that we were looking at earlier on,

3
00:00:06,680 --> 00:00:08,691
we're actually going to
use a full dataset and

4
00:00:08,691 --> 00:00:12,340
this is the sarcasm dataset that we
were talking about in the lesson.

5
00:00:12,340 --> 00:00:15,968
So here, I'm just going to
wget that sarcasm dataset and

6
00:00:15,968 --> 00:00:19,440
download it to /tmp/sarcasm.json.

7
00:00:19,440 --> 00:00:23,069
So once I've downloaded it I'm now
going to just create three lists, one for

8
00:00:23,069 --> 00:00:26,060
the sentences, one for
the labels and one for the urls.

9
00:00:26,060 --> 00:00:27,930
I'm not actually going to
use the urls here but

10
00:00:27,930 --> 00:00:31,040
just if you want to use them
yourself here's how you do it.

11
00:00:31,040 --> 00:00:35,198
And then for each item in the data store,
because in JSON it's really easy for

12
00:00:35,198 --> 00:00:38,450
me to iterate through it, for
each item in the data store.

13
00:00:38,450 --> 00:00:41,180
I can then just add
the headline to sentences.

14
00:00:41,180 --> 00:00:44,007
I can add the item
is_sarcastic to labels and

15
00:00:44,007 --> 00:00:46,689
I can add the item article_link to urls.

16
00:00:46,689 --> 00:00:50,949
Okay, now that my data is ready I'm
going to import my tokenizer and

17
00:00:50,949 --> 00:00:52,950
my pad_sequences as before.

18
00:00:52,950 --> 00:00:57,077
I'm going to use the tokenizer, just
specify my out-of-vocabulary token, and

19
00:00:57,077 --> 00:01:01,740
I'm going to call fit_on_texts
on all of the sentences.

20
00:01:01,740 --> 00:01:03,591
Let's actually run it and
see what happens.

21
00:01:05,040 --> 00:01:07,786
So we run,
we see it's downloading the data.

22
00:01:07,786 --> 00:01:13,056
Now we have a much larger vocabulary,
with 29,657 words in the index,

23
00:01:13,056 --> 00:01:18,035
and we can start seeing things like <OOV>
for out-of-vocabulary is 1, to is number 2, of is

24
00:01:18,035 --> 00:01:20,090
number 3, the is number 4.

25
00:01:20,090 --> 00:01:23,391
Remember these are going to be sorted
into their order of commonality.

26
00:01:23,391 --> 00:01:27,561
So you should see basic words like
the, in, for pretty high up in the list.

27
00:01:30,140 --> 00:01:34,060
So once I have my word index and you can
see I'm just printing out the length of

28
00:01:34,060 --> 00:01:36,581
that, which is what gives me the 29,657.

29
00:01:36,581 --> 00:01:38,730
And I printed the word_index.

30
00:01:38,730 --> 00:01:42,413
So now on my tokenizer I can call
texts_to_sequences as before and

31
00:01:42,413 --> 00:01:46,240
pass my sentences into that
to turn them into sequences.

32
00:01:46,240 --> 00:01:49,476
In the last screencast we used
pad_sequences but the padding was

33
00:01:49,476 --> 00:01:53,500
using the default which was pre and
everything was prefixed with zeros.

34
00:01:53,500 --> 00:01:57,788
This time, setting padding='post' will
just allow us to put zeros after

35
00:01:57,788 --> 00:02:02,240
the sentence so
that the sentence will be post padded.

36
00:02:02,240 --> 00:02:06,154
And if I for example look at sentence
number two in the corpus and what padded

37
00:02:06,154 --> 00:02:10,256
two in the corpus looks like, we'll see
sentence number two is mom's starting to

38
00:02:10,256 --> 00:02:14,400
fear son's web series is the closest
thing she will have to a grandchild.

39
00:02:14,400 --> 00:02:17,171
And we can see the tokens for
those actual words in here.

40
00:02:17,171 --> 00:02:19,590
So for example, here was number 2.

41
00:02:19,590 --> 00:02:24,710
Mom's starting to, and we can see
that the word to is number 2.

42
00:02:24,710 --> 00:02:28,701
And maybe, are there any others that
are pretty high up in the list?

43
00:02:28,701 --> 00:02:30,920
We could see, for example, number 39.

44
00:02:30,920 --> 00:02:36,421
If I go through my list here and
see what 39 is,

45
00:02:36,421 --> 00:02:42,040
we'll see that the word will is number 39.

46
00:02:42,040 --> 00:02:46,151
So if I come back to here, in she will
have to a grandchild, will is number 39.

47
00:02:47,420 --> 00:02:52,724
And then finally, the shape of the
padded one shows that each of the sentences

48
00:02:52,724 --> 00:02:58,440
in the dataset is being padded up to
40 characters long, or rather 40 words long.

49
00:02:58,440 --> 00:03:04,880
And so we have
26,709 sentences in the dataset.

50
00:03:04,880 --> 00:03:07,510
So the shape of my padded
array will be that.

51
00:03:07,510 --> 00:03:11,285
And this is what can be used to then train
a neural network with embeddings that

52
00:03:11,285 --> 00:03:12,751
you'll be seeing next week.