To split the corpus into
training and validation sets, we'll use this code. To get the training set, you take array items from
zero to the training size, and to get the testing set, you go from the training size to the end of the array
with code like this. To get the training
and testing labels, you'll use similar code
to slice the labels array. Now that we have training and test sets of
sentences and labels, it's time to turn them into sequences and pad those sequences, which you'll do with this code. You start with a tokenizer, passing it the number of
words you want to tokenize and the desired
out-of-vocabulary token. Then fit it on the training set by calling fit_on_texts, passing it the training
sentences array. Then you can use texts_to_sequences to create
the training sequences, replacing the words
with their tokens. Then you can pad
the training sequences to the desired length, or
truncate them if they're too long. Next, you'll do the same
for the test set. Now, we can create our neural
network in the usual way. We'll compile it with
binary cross-entropy, as we're classifying
into two different classes. When we call the model's summary, we'll see that it
looks like this, pretty much as we'd expect. It's pretty simple: an embedding feeds into
an average pooling layer, which then feeds our DNN. To train for 30 epochs, you pass in the padded
data and labels. If you want to validate, you'll give it the testing
padded data and labels too. After training for a little while, you can plot the results. Here's the code for a simple plot. We can see accuracy
increase nicely as we trained and the
validation accuracy was okay, but not great. What's interesting is
the loss values on the right: the training loss fell, but the validation loss increased. Well,
why might that be?
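
For reference, here is a minimal sketch of the preprocessing steps described above: slicing the data, building a vocabulary from the training sentences only, replacing words with tokens, and padding or truncating. It's plain Python that imitates the logic rather than the Keras Tokenizer and pad_sequences utilities themselves, so the helper names (build_word_index, to_sequences, pad_post), the toy sentences, and the values of training_size, vocab_size, and max_length are all assumptions made for illustration. Padding and truncating at the end of each sequence is also an assumption.

```python
from collections import Counter

# Hypothetical toy data standing in for the corpus (not the lecture's dataset).
sentences = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "i love my dog",
    "do you think my cat is amazing",
]
labels = [0, 1, 0, 1]

training_size = 3    # assumed split point for this toy example
vocab_size = 10      # assumed: only token ids below this value are kept
max_length = 5       # assumed padded length
oov_token = "<OOV>"  # assumed out-of-vocabulary marker

# Items from zero to training_size are for training; the rest are for testing.
training_sentences = sentences[:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[:training_size]
testing_labels = labels[training_size:]


def build_word_index(texts, oov):
    """Map words to ids by descending frequency; id 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def to_sequences(texts, word_index, num_words, oov):
    """Replace each word with its id; unseen or out-of-range words become OOV."""
    oov_id = word_index[oov]
    sequences = []
    for text in texts:
        ids = [word_index.get(word, oov_id) for word in text.lower().split()]
        sequences.append([i if i < num_words else oov_id for i in ids])
    return sequences


def pad_post(sequences, maxlen):
    """Truncate long sequences at the end, then pad short ones with zeros."""
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded


# Fit the vocabulary on the training sentences only, then reuse it for the test set.
word_index = build_word_index(training_sentences, oov_token)
training_padded = pad_post(
    to_sequences(training_sentences, word_index, vocab_size, oov_token), max_length)
testing_padded = pad_post(
    to_sequences(testing_sentences, word_index, vocab_size, oov_token), max_length)

print(training_padded)
print(testing_padded)
```

Notice that the test sentence contains several words the vocabulary never saw, and they all collapse to the OOV id. That's the reason for fitting on the training sentences only: the test set then shows how the model copes with words it hasn't been trained on.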