So let's take a look at tokenizing and
padding a lot more data. So instead of just the few sentences
that we were looking at earlier on, we're actually going to
use a full dataset, and this is the sarcasm dataset that we
were talking about in the lesson. So here, I'm just going to
wget that sarcasm dataset and download it to /tmp/sarcasm.json. So once I've downloaded it, I'm now
going to just create three lists, one for the sentences, one for
the labels and one for the URLs. I'm not actually going to
use the URLs here, but if you want to use them
yourself, here's how you do it. And then, because in JSON
it's really easy for me to iterate through it, for
each item in the datastore I can just add
the headline to sentences, I can add the item's
is_sarcastic to labels, and I can add the item's article_link to urls. Okay, now that my data is ready, I'm
going to import my Tokenizer and my pad_sequences as before. I'm going to create the tokenizer, just
specifying my OOV (out-of-vocabulary) token, and I'm going to call fit_on_texts
on all of the sentences. Let's actually run it and
see what happens. So we run it, and
we see it's downloading the data. Now we have a much larger
vocabulary, with 29,657 words in the index, and we can start seeing things like <OOV>,
out of vocabulary, is number 1, "to" is number 2, "of" is number 3, "the" is number 4. Remember, these are going to be sorted
in order of commonality. So you should see basic words like
"the" and "for" pretty high up in the list. So once I have my word index, and you can
see I'm just printing out the length of that, which is what gives me the 29,657, and I've printed the word_index. So now on my tokenizer I can call
texts_to_sequences as before and pass my sentences into that
to turn them into sequences. In the last screencast we used
pad_sequences, but the padding was using the default, which was pre, so
everything was prefixed with zeros. This time, setting padding='post' will
just allow us to put zeros after the sentence, so
that the sentence will be post-padded. And if I, for example, look at sentence
number two in the corpus, and what padded number two looks like, we'll see
sentence number two is "mom's starting to fear son's web series is the closest
thing she will have to a grandchild". And we can see the tokens for
those actual words in here. So for example, here is number 2: "mom's starting to", and we can see
that the word "to" is number 2. And are there maybe any others that
are pretty high up in the list? We could take, for example, number 39. If I go through my list here and
see what 39 is, we'll see that the word "will" is number 39. So if I come back to here, in "she will
have to a grandchild", "will" is number 39. And then finally, the shape of the
padded array is that each of the sentences in the dataset is being padded up to
40 words long (words, not characters). And
we have 26,709 sentences in the dataset, so the shape of my padded
array will be 26,709 by 40. And this is what can then be used to train
a neural network with embeddings that you'll be seeing next week.
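
If you want to see the mechanics of those steps without downloading anything, here is a minimal sketch in plain Python. It is not the Keras code from the screencast: the three records are made up (only the field names headline, is_sarcastic and article_link come from the dataset), and build_word_index, texts_to_sequences and pad_post are hypothetical helpers that roughly imitate what Tokenizer(oov_token="<OOV>"), fit_on_texts, texts_to_sequences and pad_sequences(padding='post') do. The real Tokenizer also strips punctuation, which this sketch skips.

```python
import json
from collections import Counter

# Hypothetical stand-in for /tmp/sarcasm.json: three made-up records that
# only reuse the field names mentioned in the walkthrough.
raw = """[
  {"article_link": "https://example.com/1", "headline": "the dog chased the ball", "is_sarcastic": 0},
  {"article_link": "https://example.com/2", "headline": "the cat sat", "is_sarcastic": 1},
  {"article_link": "https://example.com/3", "headline": "a dog barked at the cat today", "is_sarcastic": 0}
]"""
datastore = json.loads(raw)

# Three parallel lists, one entry per record.
sentences, labels, urls = [], [], []
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])
    urls.append(item["article_link"])


def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer; more common words get smaller numbers."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}  # index 1 is reserved for unseen words
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace every word with its index, falling back to the OOV index."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in t.lower().split()] for t in texts]


def pad_post(sequences, value=0):
    """Append zeros so every sequence is as long as the longest one."""
    maxlen = max(len(s) for s in sequences)
    return [s + [value] * (maxlen - len(s)) for s in sequences]


word_index = build_word_index(sentences)
sequences = texts_to_sequences(sentences, word_index)
padded = pad_post(sequences)

print(len(word_index))               # vocabulary size, including <OOV>
print(sentences[1])
print(padded[1])                     # zeros come after the words
print((len(padded), len(padded[0]))) # (number of sentences, padded length)
```

On the real dataset the same two numbers in that last print are what show up as the shape of the padded array: the number of headlines, then the length of the longest one.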