1
00:00:00,000 --> 00:00:02,700
So now, let's look
at the code to take

2
00:00:02,700 --> 00:00:05,085
this corpus and turn
it into training data.

3
00:00:05,085 --> 00:00:08,730
Here's the beginning. I will
unpack this line by line.

4
00:00:08,730 --> 00:00:11,250
First of all, our
training x's will

5
00:00:11,250 --> 00:00:13,020
be called input sequences,

6
00:00:13,020 --> 00:00:14,865
and this will be a Python list.

7
00:00:14,865 --> 00:00:16,890
Then for each line in the corpus,

8
00:00:16,890 --> 00:00:19,860
we'll generate a token list
using the tokenizer's

9
00:00:19,860 --> 00:00:22,095
texts_to_sequences method.

10
00:00:22,095 --> 00:00:25,065
This will convert
a line of text like,

11
00:00:25,065 --> 00:00:27,510
"In the town of Athy
one Jeremy Lanigan,"

12
00:00:27,510 --> 00:00:31,545
into a list of the tokens
representing the words.

13
00:00:31,545 --> 00:00:34,320
Then we'll iterate
over this list of

14
00:00:34,320 --> 00:00:37,740
tokens and create a number
of n-gram sequences,

15
00:00:37,740 --> 00:00:39,420
namely the first two words

16
00:00:39,420 --> 00:00:41,175
in the sentence are one sequence,

17
00:00:41,175 --> 00:00:44,860
then the first three are
another sequence, etc.

18
00:00:44,860 --> 00:00:46,770
The result of this will be,

19
00:00:46,770 --> 00:00:48,470
for the first line in the song,

20
00:00:48,470 --> 00:00:51,515
the following input sequences
that will be generated.

21
00:00:51,515 --> 00:00:54,590
The same process will
happen for each line,

22
00:00:54,590 --> 00:00:55,610
but as you can see,

23
00:00:55,610 --> 00:00:57,830
the input sequences are simply

24
00:00:57,830 --> 00:01:00,695
the sentences being
broken down into phrases,

25
00:01:00,695 --> 00:01:01,850
the first two words,

26
00:01:01,850 --> 00:01:04,535
the first three words, etc.
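
Here's a minimal sketch of that loop, so you can see the phrases being built. It's an illustration, not the exact code on the slide: the small word_index dictionary and the to_token_list helper are hypothetical stand-ins for a fitted tokenizer and its texts_to_sequences method, and only the first four words of the line are used, with the token values mentioned in this video.

```python
# Hypothetical vocabulary standing in for a fitted tokenizer's word index.
# The token values 4, 2, 66 and 8 are the ones mentioned in the video.
word_index = {"in": 4, "the": 2, "town": 66, "of": 8}


def to_token_list(line):
    """Stand-in for texts_to_sequences on one line (assumed helper)."""
    return [word_index[w] for w in line.lower().split() if w in word_index]


corpus = ["In the town of"]

input_sequences = []
for line in corpus:
    token_list = to_token_list(line)
    # Build the n-gram prefixes: first two tokens, first three, and so on.
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[: i + 1])

print(input_sequences)  # [[4, 2], [4, 2, 66], [4, 2, 66, 8]]
```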

27
00:01:04,535 --> 00:01:07,190
We next need to
find the length of

28
00:01:07,190 --> 00:01:09,350
the longest sentence
in the corpus.

29
00:01:09,350 --> 00:01:11,720
To do this, we'll
iterate over all of

30
00:01:11,720 --> 00:01:13,250
the sequences and find the

31
00:01:13,250 --> 00:01:15,685
longest one with code like this.

32
00:01:15,685 --> 00:01:18,490
Once we have our longest
sequence length,

33
00:01:18,490 --> 00:01:20,600
the next thing to
do is pad all of

34
00:01:20,600 --> 00:01:23,285
the sequences so that
they are the same length.

35
00:01:23,285 --> 00:01:25,460
We will pre-pad with zeros to

36
00:01:25,460 --> 00:01:27,620
make it easier to
extract the label;

37
00:01:27,620 --> 00:01:30,295
you'll see that in a few moments.

38
00:01:30,295 --> 00:01:33,980
So now, our line will
be represented by

39
00:01:33,980 --> 00:01:37,790
a set of padded input sequences
that looks like this.
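
The two steps just described, finding the longest sequence and pre-padding with zeros, can be sketched like this. It's written in plain Python as an illustration: the pre_pad helper is a hypothetical name, and in Keras you would typically get the same effect from pad_sequences with padding='pre'. The sequences are the short example ones built from the token values mentioned in this video.

```python
# Example n-gram sequences (token values as mentioned in the video).
input_sequences = [[4, 2], [4, 2, 66], [4, 2, 66, 8]]

# Length of the longest sequence in the data.
max_sequence_len = max(len(seq) for seq in input_sequences)


def pre_pad(seq, length):
    """Hypothetical helper: left-pad a sequence with zeros to `length`."""
    return [0] * (length - len(seq)) + seq


padded_sequences = [pre_pad(seq, max_sequence_len) for seq in input_sequences]

for row in padded_sequences:
    print(row)
# [0, 0, 4, 2]
# [0, 4, 2, 66]
# [4, 2, 66, 8]
```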

40
00:01:37,790 --> 00:01:40,160
Now that we have our sequences,

41
00:01:40,160 --> 00:01:41,750
the next thing we need to do

42
00:01:41,750 --> 00:01:43,685
is turn them into x's and y's,

43
00:01:43,685 --> 00:01:46,115
our input values
and their labels.

44
00:01:46,115 --> 00:01:47,870
When you think about it,

45
00:01:47,870 --> 00:01:50,750
now that the sentences are
represented in this way,

46
00:01:50,750 --> 00:01:54,350
all we have to do is take all
but the last token as

47
00:01:54,350 --> 00:01:55,760
the x and then use

48
00:01:55,760 --> 00:01:59,275
the last token as
the y, or our label.

49
00:01:59,275 --> 00:02:01,330
We do that like this,

50
00:02:01,330 --> 00:02:03,170
where for the first sequence,

51
00:02:03,170 --> 00:02:04,670
everything up to the four is

52
00:02:04,670 --> 00:02:07,795
our input and the
two is our label.

53
00:02:07,795 --> 00:02:09,830
Similarly, here for

54
00:02:09,830 --> 00:02:11,930
the second sequence
where the input is

55
00:02:11,930 --> 00:02:13,550
two words and the label is

56
00:02:13,550 --> 00:02:17,100
the third word, tokenized to 66.

57
00:02:17,210 --> 00:02:21,710
Here, the input is three words
and the label is eight,

58
00:02:21,710 --> 00:02:24,110
which was the fourth word
in the sentence.

59
00:02:24,110 --> 00:02:26,890
By this point, it should be
clear why we did pre-padding,

60
00:02:26,890 --> 00:02:29,090
because it makes it much
easier for us to get

61
00:02:29,090 --> 00:02:32,550
the label simply by
grabbing the last token.
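
As a small sketch of that split, assuming the pre-padded sequences are held in a plain Python list (the variable names xs and labels are just illustrative), slicing off the last token of each row gives the inputs and labels walked through above:

```python
# Pre-padded example sequences (token values as mentioned in the video).
padded_sequences = [
    [0, 0, 4, 2],
    [0, 4, 2, 66],
    [4, 2, 66, 8],
]

# Everything except the last token is the input...
xs = [row[:-1] for row in padded_sequences]
# ...and the last token is the label.
labels = [row[-1] for row in padded_sequences]

print(xs)      # [[0, 0, 4], [0, 4, 2], [4, 2, 66]]
print(labels)  # [2, 66, 8]
```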