1
00:00:00,000 --> 00:00:01,755
So far this week,

2
00:00:01,755 --> 00:00:03,150
you've been looking at text,

3
00:00:03,150 --> 00:00:04,950
and how to tokenize the text,

4
00:00:04,950 --> 00:00:06,630
and then turn sentences into

5
00:00:06,630 --> 00:00:09,855
sequences using the tools
available in TensorFlow.

6
00:00:09,855 --> 00:00:13,500
You did that using some very
simple hard-coded sentences.

7
00:00:13,500 --> 00:00:14,760
But of course, when it comes

8
00:00:14,760 --> 00:00:16,530
to doing real-world problems,

9
00:00:16,530 --> 00:00:18,015
you'll be using a lot more data

10
00:00:18,015 --> 00:00:19,905
than just these simple sentences.

11
00:00:19,905 --> 00:00:21,480
So in this lesson, we'll take

12
00:00:21,480 --> 00:00:23,130
a look at some public datasets

13
00:00:23,130 --> 00:00:24,630
and how you can
process them to get

14
00:00:24,630 --> 00:00:26,970
them ready to train
a neural network.

15
00:00:26,970 --> 00:00:29,520
We'll start with this
one published by

16
00:00:29,520 --> 00:00:33,030
Rishabh Misra with details
on Kaggle at this link.

17
00:00:33,030 --> 00:00:36,435
It's a really fun CC0
public domain dataset

18
00:00:36,435 --> 00:00:39,030
all around sarcasm detection.

19
00:00:39,030 --> 00:00:41,640
Really? Yeah, really.

20
00:00:41,640 --> 00:00:44,785
This dataset is very
straightforward and simple,

21
00:00:44,785 --> 00:00:46,910
not to mention very
easy to work with.

22
00:00:46,910 --> 00:00:48,605
It has three elements in it.

23
00:00:48,605 --> 00:00:51,665
The first, is_sarcastic,
is our label.

24
00:00:51,665 --> 00:00:53,270
It's a one if the record is

25
00:00:53,270 --> 00:00:55,615
considered sarcastic;
otherwise it's zero.

26
00:00:55,615 --> 00:00:57,445
The second is a headline,

27
00:00:57,445 --> 00:00:59,990
which is just plain text,
and the third is

28
00:00:59,990 --> 00:01:02,825
the link to the article that
the headline describes.

29
00:01:02,825 --> 00:01:06,350
Parsing the contents of
HTML, stripping out scripts,

30
00:01:06,350 --> 00:01:07,910
and styles, etc., is a little

31
00:01:07,910 --> 00:01:09,695
bit beyond the scope
of this course.

32
00:01:09,695 --> 00:01:12,500
So we're just going to
focus on the headlines.

33
00:01:12,500 --> 00:01:15,110
If you download the data
from that Kaggle site,

34
00:01:15,110 --> 00:01:16,910
you'll see something like this.

35
00:01:16,910 --> 00:01:20,330
As you can see, it is
a set of list entries with

36
00:01:20,330 --> 00:01:23,930
name-value pairs where
the names are article_link,

37
00:01:23,930 --> 00:01:28,075
headline, and is_sarcastic,
and the values are as shown.

38
00:01:28,075 --> 00:01:31,760
To make it much easier to
load this data into Python,

39
00:01:31,760 --> 00:01:34,790
I made a little tweak to
the data to look like this,

40
00:01:34,790 --> 00:01:37,190
which you can feel free
to do or you can download

41
00:01:37,190 --> 00:01:39,065
my amended dataset from the link

42
00:01:39,065 --> 00:01:41,885
in the Colab for
this part of the course.

43
00:01:41,885 --> 00:01:44,505
Once you have the data like this,

44
00:01:44,505 --> 00:01:47,185
it's then really easy
to load it into Python.

45
00:01:47,185 --> 00:01:49,000
Let's take a look at the code.

46
00:01:49,000 --> 00:01:51,715
So first you need to import json.

47
00:01:51,715 --> 00:01:54,415
This allows you to load
data in JSON format and

48
00:01:54,415 --> 00:01:57,715
automatically create
a Python data structure from it.

49
00:01:57,715 --> 00:02:00,415
To do that you simply
open the file,

50
00:02:00,415 --> 00:02:03,610
and pass it to json.load
and you'll get a list

51
00:02:03,610 --> 00:02:06,985
whose entries each hold the three
types of data: headlines,

52
00:02:06,985 --> 00:02:10,525
URLs, and is_sarcastic labels.

53
00:02:10,525 --> 00:02:13,510
Because I want
the sentences as a list

54
00:02:13,510 --> 00:02:15,880
of their own to pass
to the tokenizer,

55
00:02:15,880 --> 00:02:19,465
I can then create a list
of sentences and later,

56
00:02:19,465 --> 00:02:22,480
if I want the labels for
creating a neural network,

57
00:02:22,480 --> 00:02:24,410
I can create a list of them too.

58
00:02:24,410 --> 00:02:26,975
While I'm at it, I
may as well do URLs

59
00:02:26,975 --> 00:02:28,250
even though I'm not
going to use them

60
00:02:28,250 --> 00:02:30,415
here, but you might want to.

61
00:02:30,415 --> 00:02:33,020
Now I can iterate through
the list that was

62
00:02:33,020 --> 00:02:36,355
created with a "for item
in datastore" loop.

63
00:02:36,355 --> 00:02:38,090
For each item, I can then

64
00:02:38,090 --> 00:02:40,520
copy the headline
to my sentences,

65
00:02:40,520 --> 00:02:43,040
the is_sarcastic to my labels

66
00:02:43,040 --> 00:02:45,820
and the article_link to my URLs.

67
00:02:45,820 --> 00:02:48,724
Now I have something I can
work with in the tokenizer,

68
00:02:48,724 --> 00:02:50,910
so let's look at that next.
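
The loading steps described in this lesson can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code shown on screen: it assumes the amended file is a single JSON array of objects with the keys article_link, headline, and is_sarcastic. The sample records, the example.com links, and the file name sarcasm_sample.json are made up for illustration so the sketch runs without the real download.

```python
import json
import os
import tempfile

# Hypothetical sample records in the assumed amended layout:
# one JSON array of objects. Links and headlines are invented.
sample_records = [
    {"article_link": "https://example.com/a",
     "headline": "example headline one",
     "is_sarcastic": 0},
    {"article_link": "https://example.com/b",
     "headline": "example headline two",
     "is_sarcastic": 1},
]

# Write the sample to a temporary file to stand in for the downloaded data.
path = os.path.join(tempfile.mkdtemp(), "sarcasm_sample.json")
with open(path, "w") as f:
    json.dump(sample_records, f)

# Open the file and pass it to json.load to get a list of dictionaries.
with open(path, "r") as f:
    datastore = json.load(f)

# Copy each field into a list of its own.
sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])
    urls.append(item["article_link"])

print(sentences)
print(labels)
```

With the real dataset, you would skip the sample-writing step and open the downloaded file directly; the sentences list is then what gets passed to the tokenizer.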