So let's take a look at tokenizing and padding a lot more data. Instead of just the few sentences that we were looking at earlier on, we're actually going to use a full dataset, and this is the sarcasm dataset that we were talking about in the lesson. So here I'm just going to wget that sarcasm dataset and download it to /tmp/sarcasm.json. Once I've downloaded it, I'm going to create three lists: one for the sentences, one for the labels, and one for the URLs. I'm not actually going to use the URLs here, but if you want to use them yourself, here's how you do it. Then, because it's JSON, it's really easy for me to iterate through it: for each item in the datastore, I can add the item's headline to sentences, the item's is_sarcastic to labels, and the item's article_link to urls.

Okay, now that my data is ready, I'm going to import my Tokenizer and my pad_sequences as before. I'm going to create the tokenizer, just specifying my OOV (out-of-vocabulary) token, and I'm going to call fit_on_texts on all of the sentences. Let's actually run it and see what happens. So we run it, and we see it's downloading the data. Now we have a much larger vocabulary: we have 29,657 words in the index, and we can start seeing things like OOV is number 1, "to" is number 2, "of" is number 3, "the" is number 4. Remember, these are going to be sorted in order of commonality, so you should see basic words like "the" and "for" pretty high up in the list. So once I have my word index, you can see I'm just printing out the length of it, which is what gives me the 29,657, and then I print the word_index itself.

So now on my tokenizer I can call texts_to_sequences as before and pass my sentences into that to turn them into sequences. In the last screencast we used pad_sequences, but the padding was using the default, which was pre, so everything was prefixed with zeros.
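To make the loading and indexing steps a bit more concrete, here is a rough sketch in plain Python. It is not the Keras tokenizer itself: it's a simplified stand-in that splits on spaces and ignores punctuation handling, and the two-item JSON string and the example.com links are made up to take the place of /tmp/sarcasm.json. Only the key names (headline, is_sarcastic, article_link) and the idea of numbering words by how common they are, with the OOV token first, come from the walkthrough.

```python
import json
from collections import Counter

# Hypothetical two-item stand-in for /tmp/sarcasm.json (same three keys per item).
raw = """[
  {"article_link": "https://example.com/a",
   "headline": "mom starting to fear the worst", "is_sarcastic": 1},
  {"article_link": "https://example.com/b",
   "headline": "how to get the most of the day", "is_sarcastic": 0}
]"""
datastore = json.loads(raw)

# Build the three parallel lists, one entry per item in the datastore.
sentences, labels, urls = [], [], []
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])
    urls.append(item["article_link"])

# Simplified word index: the OOV token takes number 1, then words are
# numbered from most common to least common.
counts = Counter(word for s in sentences for word in s.lower().split())
word_index = {"<OOV>": 1}
for word, _ in counts.most_common():
    word_index[word] = len(word_index) + 1

# Turn each sentence into a list of word numbers.
sequences = [[word_index[w] for w in s.lower().split()] for s in sentences]

# A word the index has never seen falls back to the OOV number.
unseen = [word_index.get(w, word_index["<OOV>"]) for w in "mom loves the day".split()]

print(len(word_index))  # 12
print(sequences[0])     # [4, 5, 3, 6, 2, 7]
print(unseen)           # [4, 1, 2, 12]
```

In this toy corpus "the" appears three times and "to" twice, so they get numbers 2 and 3 right after the OOV token, which is the same kind of ordering you see in the real word index.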
This time, I'm going to show padding='post', which will just allow us to put zeros after the sentence, so that the sentence will be post-padded. If I, for example, look at sentence number 2 in the corpus and at what padded number 2 looks like, we'll see sentence number 2 is "mom's starting to fear son's web series is the closest thing she will have to a grandchild". And we can see the tokens for those actual words in here. So, for example, here is number 2: "mom's starting to", and we can see that the word "to" is number 2. And maybe there are others that are pretty high up in the list. We could take, for example, number 39. If I go through my list here and see what 39 is, we'll see that the word "will" is number 39. So if I come back to here, in "she will have to a grandchild", "will" is number 39.

And then finally, the shape of the padded array: each of the sentences in the dataset is being padded up to 40 words long, and we have 26,709 sentences in the dataset, so the shape of my padded array will be 26,709 by 40. And this is what can then be used to train a neural network with embeddings, which you'll be seeing next week.
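If it helps to see the difference between pre and post padding side by side, here is a small sketch in plain Python. The pad function below is a hypothetical, simplified stand-in for pad_sequences, not the Keras implementation: it only zero-fills every sequence to the length of the longest one and does no truncation. The two short sequences are made-up numbers rather than rows from the sarcasm data.

```python
def pad(sequences, padding="pre"):
    """Zero-fill every sequence to the length of the longest one (illustrative only)."""
    maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        zeros = [0] * (maxlen - len(seq))
        # "pre" puts the zeros in front, "post" puts them after the tokens.
        padded.append(zeros + seq if padding == "pre" else seq + zeros)
    return padded

# Made-up token sequences of different lengths.
sequences = [[4, 5, 3, 6], [8, 3, 9, 2, 10, 11]]

pre = pad(sequences)                   # zeros come first
post = pad(sequences, padding="post")  # zeros come last

# Shape as (number of sentences, padded length).
shape = (len(post), len(post[0]))

print(pre[0])   # [0, 0, 4, 5, 3, 6]
print(post[0])  # [4, 5, 3, 6, 0, 0]
print(shape)    # (2, 6)
```

On the full sarcasm dataset the same two numbers are what the walkthrough reports: 26,709 sentences, each padded to 40 tokens.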