1
00:00:00,000 --> 00:00:03,690
To split the corpus into
training and validation sets,

2
00:00:03,690 --> 00:00:05,445
we'll use this code.

3
00:00:05,445 --> 00:00:07,275
To get the training set,

4
00:00:07,275 --> 00:00:10,889
you take array items from
zero to the training size,

5
00:00:10,889 --> 00:00:12,660
and to get the testing set,

6
00:00:12,660 --> 00:00:14,190
you can go from training size to

7
00:00:14,190 --> 00:00:16,725
the end of the array
with code like this.

8
00:00:16,725 --> 00:00:19,010
To get the training
and testing labels,

9
00:00:19,010 --> 00:00:22,630
you'll use similar code
to slice the labels array.
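The slide code is not part of this transcript, so here is a minimal sketch of the slicing described above. The variable names, the toy data, and the value of `training_size` are assumptions made for illustration.

```python
# Toy data standing in for the real corpus (hypothetical values).
sentences = ["first", "second", "third", "fourth", "fifth"]
labels = [0, 1, 0, 1, 1]

# Hypothetical split point: the first `training_size` items are used for training.
training_size = 3

# Items from zero up to (not including) training_size form the training set.
training_sentences = sentences[:training_size]
# Items from training_size to the end of the array form the testing set.
testing_sentences = sentences[training_size:]

# The labels are sliced the same way so they stay aligned with the sentences.
training_labels = labels[:training_size]
testing_labels = labels[training_size:]

print(len(training_sentences), len(testing_sentences))
```

Using the same split point for both arrays keeps each sentence paired with its own label.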

10
00:00:22,630 --> 00:00:24,870
Now that we have training and

11
00:00:24,870 --> 00:00:27,255
test sets of
sentences and labels,

12
00:00:27,255 --> 00:00:28,995
it's time to sequence them.

13
00:00:28,995 --> 00:00:30,585
To pad those sequences,

14
00:00:30,585 --> 00:00:32,620
you'll use this code.

15
00:00:32,620 --> 00:00:34,670
You start with a tokenizer,

16
00:00:34,670 --> 00:00:37,100
passing it the number of
words you want to tokenize

17
00:00:37,100 --> 00:00:40,355
on and the desired
out-of-vocabulary token.

18
00:00:40,355 --> 00:00:42,515
Then fit that on the training set

19
00:00:42,515 --> 00:00:44,269
by calling fit_on_texts,

20
00:00:44,269 --> 00:00:47,420
passing it the training
sentences array.

21
00:00:47,420 --> 00:00:49,445
Then you can use

22
00:00:49,445 --> 00:00:51,800
texts_to_sequences to create
the training sequences,

23
00:00:51,800 --> 00:00:54,740
replacing the words
with their tokens.

24
00:00:54,740 --> 00:00:57,800
Then you can pad
the training sequences to

25
00:00:57,800 --> 00:01:01,165
the desired length or
truncate if they're too long.

26
00:01:01,165 --> 00:01:05,000
Next, you'll do the same
but with the test set.
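To make the tokenize, sequence, and pad steps concrete, here is a plain-Python sketch of what they do. It is a stand-in written for illustration, not the Keras tokenizer itself. The function names, `max_words`, `max_length`, the `<OOV>` string, and the choice to pad and truncate at the end of each sequence are all assumptions.

```python
from collections import Counter

def fit_word_index(texts, max_words, oov_token="<OOV>"):
    """Build a word -> token map from the training texts only."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}  # token 0 is reserved for padding
    # Keep the most frequent words; anything else maps to the OOV token later.
    for word, _ in counts.most_common(max_words):
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace each word with its token, using the OOV token for unseen words."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in text.lower().split()] for text in texts]

def pad_or_truncate(sequences, max_length):
    """Cut sequences that are too long and zero-pad those that are too short."""
    padded = []
    for seq in sequences:
        seq = seq[:max_length]                               # truncate at the end
        padded.append(seq + [0] * (max_length - len(seq)))   # pad at the end
    return padded

# Hypothetical toy sentences.
training_sentences = ["the cat sat", "the dog sat down"]
testing_sentences = ["the bird sat"]

# Fit on the training set only, then reuse the same word index for both sets.
word_index = fit_word_index(training_sentences, max_words=10)
training_padded = pad_or_truncate(to_sequences(training_sentences, word_index), max_length=4)
testing_padded = pad_or_truncate(to_sequences(testing_sentences, word_index), max_length=4)

print(training_padded)
print(testing_padded)
```

Because the word index is built from the training sentences alone, a word that appears only in the test set ("bird" here) is replaced by the out-of-vocabulary token.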

27
00:01:05,000 --> 00:01:08,930
Now, we can create our neural
network in the usual way.

28
00:01:08,930 --> 00:01:11,240
We'll compile it with
binary cross-entropy,

29
00:01:11,240 --> 00:01:13,790
as we're classifying
into two different classes.
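For reference, binary cross-entropy for two-class labels can be sketched in plain Python as below. This illustrates the standard formula rather than the library's implementation. The function name and the `eps` clipping value are assumptions.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the prediction so that log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical predictions: confident and right versus confident and wrong.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
print(good, bad)
```

The loss is small when the predicted probability is close to the true label and grows quickly when a confident prediction is wrong.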

30
00:01:13,790 --> 00:01:16,050
When we call model.summary,

31
00:01:16,050 --> 00:01:17,985
we'll see that it
looks like this,

32
00:01:17,985 --> 00:01:19,370
pretty much as we'd expect.

33
00:01:19,370 --> 00:01:20,750
It's pretty simple: an

34
00:01:20,750 --> 00:01:22,510
embedding feeds into
an average pooling,

35
00:01:22,510 --> 00:01:24,520
which then feeds our DNN.
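As a rough picture of that data flow, here is a plain-Python sketch of a forward pass: embedding lookup, average pooling over the sequence, then a dense step. It is not the course's model code. The dense part is collapsed to a single sigmoid unit for brevity, and the embedding values, weights, and bias are made-up numbers.

```python
import math

def forward(token_ids, embeddings, weights, bias):
    """Embedding lookup -> average pooling -> one dense sigmoid unit."""
    # Embedding: look up one vector per token in the sequence.
    vectors = [embeddings[t] for t in token_ids]
    # Average pooling: take the mean over the sequence for each embedding dimension.
    dim = len(vectors[0])
    pooled = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # Dense unit with a sigmoid, giving a probability for the positive class.
    z = sum(w * x for w, x in zip(weights, pooled)) + bias
    return pooled, 1.0 / (1.0 + math.exp(-z))

# Hypothetical 2-dimensional embeddings for token ids 0, 1, and 2.
embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
pooled, probability = forward([1, 2], embeddings, weights=[0.5, -0.5], bias=0.5)
print(pooled, probability)
```

Pooling turns a sequence of any length into one fixed-size vector, which is why the dense layers after it stay small.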

36
00:01:24,520 --> 00:01:26,595
To train for 30 epochs,

37
00:01:26,595 --> 00:01:28,980
you pass in the padded
data and labels.

38
00:01:28,980 --> 00:01:30,430
If you want to validate,

39
00:01:30,430 --> 00:01:33,350
you'll give the testing
padded and labels too.

40
00:01:33,350 --> 00:01:35,270
After training for a little while,

41
00:01:35,270 --> 00:01:36,620
you can plot the results.

42
00:01:36,620 --> 00:01:38,630
Here's the code for a simple plot.

43
00:01:38,630 --> 00:01:41,690
We can see accuracy
increased nicely as we

44
00:01:41,690 --> 00:01:44,270
trained and the
validation accuracy

45
00:01:44,270 --> 00:01:46,325
was okay, but not great.

46
00:01:46,325 --> 00:01:49,400
What's interesting is
the loss values on the right,

47
00:01:49,400 --> 00:01:51,200
the training loss fell,

48
00:01:51,200 --> 00:01:52,580
but the validation loss

49
00:01:52,580 --> 00:01:55,480
increased. Well,
why might that be?