1
00:00:00,000 --> 00:00:01,530
So now, we have to split

2
00:00:01,530 --> 00:00:04,595
our sequences into
our x's and our y's.

3
00:00:04,595 --> 00:00:07,670
To do this, let's grab
the first n tokens,

4
00:00:07,670 --> 00:00:09,335
and make them our x's.

5
00:00:09,335 --> 00:00:12,475
We'll then get the last token
and make it our label.

6
00:00:12,475 --> 00:00:14,130
Before the label becomes a y,

7
00:00:14,130 --> 00:00:16,470
there's one more step, and
you'll see that shortly.

8
00:00:16,470 --> 00:00:18,390
Python makes this really

9
00:00:18,390 --> 00:00:20,550
easy to do with its list syntax.

10
00:00:20,550 --> 00:00:22,290
So to get my x's,

11
00:00:22,290 --> 00:00:23,340
I just get all of

12
00:00:23,340 --> 00:00:27,015
the input sequences sliced
to remove the last token.

13
00:00:27,015 --> 00:00:28,424
To get the labels,

14
00:00:28,424 --> 00:00:30,030
I get all of the input sequences

15
00:00:30,030 --> 00:00:33,015
sliced to keep the last token.
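
The slicing described above can be sketched as follows. This is a minimal illustration, assuming the padded sequences are held in a 2-D NumPy array; the array contents and the names `input_sequences`, `xs`, and `labels` are hypothetical and chosen only for the example.

```python
import numpy as np

# Hypothetical padded token sequences: one row per training sequence,
# zero-padded on the left so every row has the same length.
input_sequences = np.array([
    [0, 0, 4, 2],
    [0, 4, 2, 66],
    [4, 2, 66, 8],
])

# Every row, all columns except the last: the input tokens.
xs = input_sequences[:, :-1]

# Every row, only the last column: the token to predict.
labels = input_sequences[:, -1]
```

Here `xs` keeps the same number of rows but is one column narrower, and `labels` is a 1-D array with one entry per row.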

16
00:00:33,015 --> 00:00:35,600
Now, I should one-hot encode

17
00:00:35,600 --> 00:00:38,720
my labels as this really is
a classification problem.

18
00:00:38,720 --> 00:00:40,595
Where given a sequence of words,

19
00:00:40,595 --> 00:00:42,650
I can classify from the corpus,

20
00:00:42,650 --> 00:00:44,945
what the next word
would likely be.

21
00:00:44,945 --> 00:00:47,060
So to one-hot encode,

22
00:00:47,060 --> 00:00:49,220
I can use the Keras utility to

23
00:00:49,220 --> 00:00:51,905
convert a list to a categorical.

24
00:00:51,905 --> 00:00:54,950
I simply give it
the list of labels and

25
00:00:54,950 --> 00:00:58,040
the number of classes which
is my number of words,

26
00:00:58,040 --> 00:01:01,840
and it will create a one-hot
encoding of the labels.
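
To show what the one-hot step produces, here is a rough sketch written with plain NumPy rather than the Keras utility mentioned in the lesson. The label values and `total_words` are assumptions made up for the illustration. In Keras, `tf.keras.utils.to_categorical(labels, num_classes=total_words)` is the call that would typically play this role.

```python
import numpy as np

# Hypothetical labels (the last token of each sequence) and vocabulary size.
labels = np.array([2, 66, 8])
total_words = 100  # assumed number of classes, i.e. number of words

# Start with all zeros: one row per label, one column per possible word.
ys = np.zeros((len(labels), total_words), dtype="float32")

# For each row, set the column at that row's label index to 1.
ys[np.arange(len(labels)), labels] = 1.0
```

Each row of `ys` contains a single 1, located at the index given by the corresponding label.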

27
00:01:01,840 --> 00:01:04,610
So for example, if we

28
00:01:04,610 --> 00:01:07,490
consider this list of
tokens as a sentence,

29
00:01:07,490 --> 00:01:11,090
then the x is the list
up to the last value,

30
00:01:11,090 --> 00:01:15,480
and the label is the last value
which in this case is 70.

31
00:01:15,480 --> 00:01:18,710
The y is a one-hot
encoded array whose

32
00:01:18,710 --> 00:01:21,920
length is the size of
the corpus of words and

33
00:01:21,920 --> 00:01:24,950
the value that is set
to one is the one at

34
00:01:24,950 --> 00:01:26,870
the index of the label which in

35
00:01:26,870 --> 00:01:29,620
this case is the element at index 70.

36
00:01:29,620 --> 00:01:32,120
Okay. You now have all of

37
00:01:32,120 --> 00:01:34,960
the data ready to train
a network for prediction.

38
00:01:34,960 --> 00:01:36,690
Hopefully, this was
useful for you.

39
00:01:36,690 --> 00:01:39,245
You'll see the neural network
in the next video.

40
00:01:39,245 --> 00:01:41,060
But first, let's see a

41
00:01:41,060 --> 00:01:43,010
screencast of processing the data,

42
00:01:43,010 --> 00:01:46,110
using the methods that
you saw in this lesson.