1
00:00:11,660 --> 00:00:17,150
In this lecture, we are going to look at a notebook which demonstrates how to do the text

2
00:00:17,150 --> 00:00:18,320
preprocessing that we just discussed.

3
00:00:19,100 --> 00:00:23,960
As usual, you can look at the title of the notebook to determine what notebook we are currently looking

4
00:00:23,960 --> 00:00:24,230
at.

5
00:00:25,550 --> 00:00:31,520
So to start, we're going to import the class Tokenizer and the function pad_sequences.

6
00:00:34,960 --> 00:00:39,200
All right, next, we're going to create a dummy dataset with just three sentences.

7
00:00:39,220 --> 00:00:40,350
"I like eggs and ham,"

8
00:00:40,360 --> 00:00:43,120
"I love chocolate and bunnies," and "I hate onions."

9
00:00:45,330 --> 00:00:46,020
Let's run that.

10
00:00:47,840 --> 00:00:53,270
OK, so next, we're going to define the max vocab size to be 20,000.

11
00:00:53,840 --> 00:00:56,090
Now, this is usually a pretty reasonable value.

12
00:00:56,750 --> 00:01:04,010
The Oxford Dictionary has almost 200,000 words, but the most common words in the English language number

13
00:01:04,010 --> 00:01:05,480
only about 3,000.

14
00:01:06,320 --> 00:01:11,930
So 3,000 words will cover about 95 percent of most texts, according to a quick Google search.

15
00:01:12,470 --> 00:01:14,930
Thus, 20,000 is probably good enough.

16
00:01:16,870 --> 00:01:20,350
So in the first line here, we instantiate the Tokenizer class.

17
00:01:21,100 --> 00:01:26,260
Next, we call tokenizer.fit_on_texts and pass in our sentences list.

18
00:01:27,400 --> 00:01:32,710
Next, we call tokenizer.texts_to_sequences and pass in the same sentences list.

19
00:01:33,250 --> 00:01:35,980
This returns our sequences of integers.

20
00:01:36,820 --> 00:01:41,830
By the way, you can think of these two functions in terms of scikit-learn's feature transformers, where

21
00:01:41,830 --> 00:01:44,620
you would always have a .fit and a .transform.

22
00:01:45,100 --> 00:01:49,000
So the first one is like a .fit, and the second one is like a .transform.

23
00:01:49,330 --> 00:01:49,990
Same idea.

24
00:01:51,520 --> 00:01:53,080
So let's run this one.

25
00:01:55,920 --> 00:01:58,350
Next, we're going to print out our sequences list.

26
00:02:00,820 --> 00:02:06,340
Now, normally, on a text dataset of any practical size, this would just print out way too much data.

27
00:02:06,580 --> 00:02:09,340
But since we only have three sentences, this is reasonable.

28
00:02:10,340 --> 00:02:11,330
So let's run this.

29
00:02:13,380 --> 00:02:18,750
OK, so here we can see that each sentence has been converted into a list of integers,

30
00:02:19,020 --> 00:02:21,240
each integer corresponding to a word.

31
00:02:22,140 --> 00:02:27,270
Importantly, note that the integers start counting from one and not zero, as you may have expected.

32
00:02:27,840 --> 00:02:31,650
As I mentioned earlier, this is because TensorFlow uses zero for padding.

33
00:02:36,490 --> 00:02:40,990
And you might be wondering, how do I know which word corresponds to which integer?

34
00:02:41,740 --> 00:02:46,360
Luckily, the tokenizer object already stores a dictionary with this information.

35
00:02:47,260 --> 00:02:51,310
So let's print out tokenizer.word_index.

36
00:02:52,270 --> 00:02:56,770
And this is just like our word-to-index mapping, as we discussed in the intuition lecture.

37
00:03:01,360 --> 00:03:06,790
So you should be able to look at this and confirm that if you map each integer back to the corresponding

38
00:03:06,790 --> 00:03:09,490
word, you should get back the original sentences.

39
00:03:14,270 --> 00:03:19,700
Next, let's try out the function pad_sequences. In the first block here, we're going to call

40
00:03:19,700 --> 00:03:21,800
pad_sequences with all the default values.

41
00:03:23,430 --> 00:03:24,210
So what do we get?

42
00:03:27,740 --> 00:03:33,230
Well, the default appears to set the maximum sequence length to be the maximum sentence length, so

43
00:03:33,230 --> 00:03:34,130
that's five.

44
00:03:34,820 --> 00:03:39,770
The first two sentences were already length five, so there's no padding. For the third sentence,

45
00:03:39,770 --> 00:03:41,450
padding was added at the beginning.

46
00:03:45,450 --> 00:03:51,420
Now, in the second block, we can see that if we explicitly pass in maxlen=5, we get the

47
00:03:51,420 --> 00:03:52,080
same answer.

48
00:03:55,830 --> 00:03:58,770
Next, let's see what happens if we set padding to 'post'.

49
00:04:02,980 --> 00:04:08,170
So we can see that the first two sentences still have no padding because they are of the maximum length.

50
00:04:08,830 --> 00:04:12,580
The third sentence now has zeros at the end instead of at the beginning.

51
00:04:17,779 --> 00:04:20,570
Next, we can see what happens if we add too much padding.

52
00:04:20,810 --> 00:04:25,520
So let's set maxlen equal to six, which is longer than all of the sentences in our dataset.

53
00:04:28,570 --> 00:04:33,820
As you can see, the first two sentences have been padded with one zero so that the sequence length

54
00:04:33,820 --> 00:04:34,570
is six.

55
00:04:38,950 --> 00:04:44,590
Next, we're going to see what happens if we set maxlen equal to a number less than the maximum sequence

56
00:04:44,590 --> 00:04:44,950
length.

57
00:04:45,490 --> 00:04:46,780
So let's try four.

58
00:04:50,280 --> 00:04:56,520
In this case, we can see that each sequence has been truncated to cut off the beginning of the sequences.

59
00:04:57,240 --> 00:05:02,820
This makes sense as the default because an RNN typically pays more attention to the final values

60
00:05:02,820 --> 00:05:03,840
in a sequence anyway.

61
00:05:08,990 --> 00:05:13,700
Next, we're going to see what happens if we set maxlen to four again, but this time we're going to

62
00:05:13,700 --> 00:05:15,830
set the truncating argument to post.

63
00:05:19,650 --> 00:05:24,450
All right, so in this case, we can see that the ends of the sequences have been cut off instead of

64
00:05:24,450 --> 00:05:25,380
the beginnings.
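
The steps walked through in this lecture can be sketched in plain Python. This is only an illustration of the behavior described, not the Keras implementation: the helper names tokenize, simple_fit, simple_transform, and simple_pad are hypothetical, and the sketch assumes lowercasing, splitting on spaces, indices assigned by descending word frequency starting at 1, zero used for padding, and 'pre' as the default for both padding and truncating.

```python
from collections import Counter

# Hypothetical helpers for illustration only; these are NOT Keras functions.

sentences = [
    "I like eggs and ham.",
    "I love chocolate and bunnies.",
    "I hate onions.",
]

def tokenize(text):
    # Assumed preprocessing: lowercase, split on spaces, strip basic punctuation.
    return [w.strip(".,!?") for w in text.lower().split()]

def simple_fit(texts):
    # Like a .fit step: build the word-to-index mapping.
    # The most frequent word gets index 1; index 0 is reserved for padding.
    counts = Counter(w for t in texts for w in tokenize(t))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ordered, start=1)}

def simple_transform(texts, word_index):
    # Like a .transform step: map each known word to its integer.
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]

def simple_pad(seqs, maxlen=None, padding="pre", truncating="pre"):
    # With no maxlen, pad to the length of the longest sequence.
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        # 'pre' truncation drops values from the start, 'post' from the end.
        s = s[-maxlen:] if truncating == "pre" else s[:maxlen]
        zeros = [0] * (maxlen - len(s))
        # 'pre' padding puts zeros in front, 'post' puts them at the end.
        out.append(zeros + s if padding == "pre" else s + zeros)
    return out

word_index = simple_fit(sentences)
sequences = simple_transform(sentences, word_index)

print(word_index)
print(sequences)
print(simple_pad(sequences))                                # defaults
print(simple_pad(sequences, padding="post"))                # zeros at the end
print(simple_pad(sequences, maxlen=6))                      # extra padding
print(simple_pad(sequences, maxlen=4))                      # beginnings cut off
print(simple_pad(sequences, maxlen=4, truncating="post"))   # ends cut off
```

In the notebook itself, the Tokenizer class and pad_sequences do this work; the sketch only makes the 'pre' and 'post' behavior easy to trace by hand on three short sentences.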