1
00:00:00,000 --> 00:00:03,610
So that was an example
using the IMDB data set,

2
00:00:03,610 --> 00:00:07,530
where data is provided
to you by the tfds API,

3
00:00:07,530 --> 00:00:09,285
which I hope you found helpful.

4
00:00:09,285 --> 00:00:10,890
Now I'd like to return to

5
00:00:10,890 --> 00:00:13,260
the sarcasm data
set from last week,

6
00:00:13,260 --> 00:00:16,545
and let's look at building
a classifier for that.

7
00:00:16,545 --> 00:00:20,160
We'll start with importing
tensorflow and json,

8
00:00:20,160 --> 00:00:21,690
as well as the Tokenizer and

9
00:00:21,690 --> 00:00:24,420
pad_sequences from
preprocessing.

10
00:00:24,420 --> 00:00:26,879
Now let's set up
our hyperparameters:

11
00:00:26,879 --> 00:00:28,230
the vocabulary size,

12
00:00:28,230 --> 00:00:31,275
embedding dimensions,
maximum length of sentences,

13
00:00:31,275 --> 00:00:33,540
and other settings such as
the training size.

14
00:00:33,540 --> 00:00:36,900
This data set has
about 27,000 records.

15
00:00:36,900 --> 00:00:40,885
So let's train on 20,000
and validate on the rest.

16
00:00:40,885 --> 00:00:44,240
The sarcasm data is
stored at this URL,

17
00:00:44,240 --> 00:00:45,470
so you can download it

18
00:00:45,470 --> 00:00:49,685
to /tmp/sarcasm.json
with this code.

19
00:00:49,685 --> 00:00:51,875
Now that you have the data set,

20
00:00:51,875 --> 00:00:53,510
you can open it and load it

21
00:00:53,510 --> 00:00:55,730
as an iterable with this code.
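
A minimal sketch of that loading step, assuming the downloaded file is a single JSON array of records. The function name load_datastore is hypothetical; the lesson's on-screen code may be laid out differently.

```python
import json


def load_datastore(path):
    """Open a JSON file and return its parsed contents.

    Assumes the file holds one JSON array, so the result is a list
    you can iterate over record by record.
    """
    with open(path, "r") as f:
        return json.load(f)
```

In the lesson's setup, the path would be /tmp/sarcasm.json.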

22
00:00:55,730 --> 00:00:58,130
You can create a list
for sentences,

23
00:00:58,130 --> 00:00:59,525
and another for labels,

24
00:00:59,525 --> 00:01:01,580
and then iterate
through the datastore,

25
00:01:01,580 --> 00:01:03,770
loading each headline
as a sentence,

26
00:01:03,770 --> 00:01:07,680
and each is_sarcastic field
as your label.
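
The loop described here might look something like the following. The field names headline and is_sarcastic come from the narration; the function name and the two sample records are made-up placeholders, not real entries from the data set.

```python
def extract_sentences_and_labels(datastore):
    """Collect each record's headline and is_sarcastic value into two parallel lists."""
    sentences = []
    labels = []
    for item in datastore:
        sentences.append(item["headline"])
        labels.append(item["is_sarcastic"])
    return sentences, labels


# Placeholder records, invented only to show the expected shape.
sample_datastore = [
    {"headline": "placeholder headline one", "is_sarcastic": 0},
    {"headline": "placeholder headline two", "is_sarcastic": 1},
]
sentences, labels = extract_sentences_and_labels(sample_datastore)
print(sentences)
print(labels)
```

Keeping the two lists in the same order means that sentences[i] always pairs with labels[i], which is what the later training and validation split relies on.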