So that was an example
using the IMDB data set, where data is provided
to you by the tfds API, which I hope you found helpful. Now I'd like to return to the sarcasm data
set from last week and look at building
a classifier for it. We'll start by importing
tensorflow and json, as well as the Tokenizer and pad_sequences from
preprocessing. Now let's set up
our hyperparameters: the vocabulary size, embedding dimensions,
maximum length of sentences, and other settings like
the training size. This data set has
about 27,000 records. So let's train on 20,000
and validate on the rest. The sarcasm data is
stored at this URL, so you can download it to /tmp/sarcasm.json
with this code. Now that you have the data set,
you can open it and load it as an iterable
with this code. You can create an array
for sentences, and another for labels, and then iterate
through the datastore, loading each headline
as a sentence, and each is_sarcastic field
as your label.
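
The steps just described can be sketched roughly as follows. This is a minimal illustration, not the code shown on screen. Only the training size of 20,000 and the field names headline and is_sarcastic come from the lecture. The other hyperparameter values, the sample records, and the file name sarcasm_sample.json are assumptions made up for the example. The real download URL is not reproduced here, and the tensorflow imports are left out so the sketch runs with the standard library alone.

```python
import json
import os
import tempfile

# Hyperparameters. training_size (20,000) is from the lecture;
# the other three values are illustrative assumptions.
vocab_size = 10000
embedding_dim = 16
max_length = 100
training_size = 20000

# Hypothetical stand-in for the downloaded file: a JSON list of records,
# each with a "headline" string and an "is_sarcastic" flag.
sample_records = [
    {"headline": "local man wins argument with himself", "is_sarcastic": 1},
    {"headline": "city council approves new budget", "is_sarcastic": 0},
    {"headline": "area cat declares itself employee of the month", "is_sarcastic": 1},
]
path = os.path.join(tempfile.gettempdir(), "sarcasm_sample.json")
with open(path, "w") as f:
    json.dump(sample_records, f)

# Open the file and load it as an iterable (a list of dicts).
with open(path, "r") as f:
    datastore = json.load(f)

# Collect each headline as a sentence and each is_sarcastic field as its label.
sentences = []
labels = []
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])

# Train on the first training_size records and validate on the rest.
# With this tiny sample, every record lands in the training split.
training_sentences = sentences[:training_size]
training_labels = labels[:training_size]
validation_sentences = sentences[training_size:]
validation_labels = labels[training_size:]
```

With the full data set of about 27,000 records, the same slicing would leave roughly 7,000 records for validation.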