1
00:00:00,000 --> 00:00:05,109
In the previous video we looked at the
data, a string containing a single song,

2
00:00:05,109 --> 00:00:08,480
and saw how to prepare that for
generating new text.

3
00:00:08,480 --> 00:00:12,532
We saw how to tokenize the data and
then create sub-sentence

4
00:00:12,532 --> 00:00:16,750
n-grams that were labelled with
the next word in the sentence.

5
00:00:16,750 --> 00:00:21,624
We then one-hot encoded the labels to
get us into a position where we can

6
00:00:21,624 --> 00:00:26,838
build a neural network that can,
given a sentence, predict the next word.

7
00:00:26,838 --> 00:00:32,147
Now that we have our data as xs and ys,
it's relatively simple for us to create

8
00:00:32,147 --> 00:00:37,640
a neural network to classify what the next
word should be, given a set of words.

9
00:00:37,640 --> 00:00:40,074
Here's the code.

10
00:00:40,074 --> 00:00:41,919
We'll start with an embedding layer.

11
00:00:41,919 --> 00:00:46,905
We'll want it to handle all of our words,
so we set that in the first parameter.

12
00:00:46,905 --> 00:00:51,177
The second parameter is the number of
dimensions to use to plot the vector

13
00:00:51,177 --> 00:00:51,883
for a word.

14
00:00:51,883 --> 00:00:55,389
Feel free to tweak this to see what
its impact would be on results, but

15
00:00:55,389 --> 00:00:57,270
I'm going to keep it at 64 for now.

16
00:00:57,270 --> 00:01:01,609
Finally, the length of the input
sequences will be fed in, and

17
00:01:01,609 --> 00:01:05,288
this is the length of
the longest sequence minus 1.

18
00:01:05,288 --> 00:01:09,891
We subtract one because we cropped off
the last word of each sequence to get

19
00:01:09,891 --> 00:01:14,880
the label, so our input sequences will be one
word shorter than the maximum sequence length.

20
00:01:14,880 --> 00:01:16,866
Next we'll add an LSTM.

21
00:01:16,866 --> 00:01:19,411
As we saw with LSTMs
earlier in the course,

22
00:01:19,411 --> 00:01:23,373
their cell state means that they
carry context along with them, so

23
00:01:23,373 --> 00:01:26,997
it's not just next door neighbor
words that have an impact.

24
00:01:26,997 --> 00:01:30,990
I'll specify 20 units here, but again,
you should feel free to experiment.

25
00:01:30,990 --> 00:01:34,856
Finally there's a dense layer
sized as the total words,

26
00:01:34,856 --> 00:01:39,051
which is the same size that we used for
the one-hot encoding.

27
00:01:39,051 --> 00:01:42,526
Thus this layer will have one neuron
per word, and

28
00:01:42,526 --> 00:01:46,513
that neuron should light up
when we predict a given word.

29
00:01:46,513 --> 00:01:49,878
We're doing a categorical classification,
so

30
00:01:49,878 --> 00:01:53,580
we'll set the loss to be
categorical cross-entropy.

31
00:01:53,580 --> 00:01:55,973
And we'll use the Adam optimizer,

32
00:01:55,973 --> 00:02:00,218
which seems to work particularly well for
tasks like this one.

33
00:02:00,218 --> 00:02:06,728
Finally, we'll train for a lot of epochs,
say about 500, as it takes a while for

34
00:02:06,728 --> 00:02:12,412
a model like this to converge,
particularly as it has very little data.

35
00:02:12,412 --> 00:02:16,000
So if we train the model for
500 epochs, it will look like this.
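
The model described in this video could be sketched roughly as follows. This is a minimal sketch, not the exact on-screen code: it assumes TensorFlow's Keras API, the values of `total_words` and `max_sequence_len` are hypothetical placeholders (in practice they come from the tokenizer and padded sequences prepared in the previous video), and the softmax activation on the dense layer is an assumption that matches the categorical cross-entropy loss.

```python
import tensorflow as tf

# Hypothetical placeholder values; in practice both come from the
# preprocessing done in the previous video.
total_words = 100       # vocabulary size (assumed)
max_sequence_len = 11   # longest n-gram sequence, label included (assumed)

model = tf.keras.Sequential([
    # Inputs are one word shorter than the longest sequence, because
    # the last word of each sequence was cropped off to be the label.
    tf.keras.Input(shape=(max_sequence_len - 1,)),
    # One 64-dimensional vector per word in the vocabulary.
    tf.keras.layers.Embedding(total_words, 64),
    # 20 LSTM units; the cell state carries context along the sequence.
    tf.keras.layers.LSTM(20),
    # One neuron per word, matching the one-hot encoded labels.
    # The softmax activation is an assumption, not stated in the video.
    tf.keras.layers.Dense(total_words, activation="softmax"),
])

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)

# Training would then look something like this, where xs and ys are the
# padded inputs and one-hot labels from the previous video:
# history = model.fit(xs, ys, epochs=500)
```

The embedding size (64), the number of LSTM units (20) and the epoch count (500) are the values mentioned in the video; all three are reasonable things to experiment with.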