1
00:00:11,110 --> 00:00:16,360
So in this lecture, we'll be looking at how to do text preprocessing using the Keras API.

2
00:00:17,200 --> 00:00:19,360
Note that this isn't absolutely critical.

3
00:00:20,020 --> 00:00:25,240
In fact, what we'll be doing is looking at the syntax to do something which you've already learned

4
00:00:25,240 --> 00:00:25,930
how to do.

5
00:00:26,710 --> 00:00:31,000
The reason we're doing this is because Keras makes it easier to do these steps.

6
00:00:31,540 --> 00:00:38,170
In fact, the API is so nice that sometimes PyTorch users will just use Keras preprocessing instead of

7
00:00:38,170 --> 00:00:45,220
PyTorch, although this may be due to the fact that the PyTorch text library is pretty bad. In any case,

8
00:00:45,250 --> 00:00:50,410
do note that while this does make things easier, this comes at the cost of some flexibility.

9
00:00:51,040 --> 00:00:57,070
For example, index zero is reserved for padding, so you won't be able to assign it to a word,

10
00:00:57,910 --> 00:00:58,640
as you'll see.

11
00:00:58,660 --> 00:01:03,550
This makes the embedding matrix somewhat inefficient because the first row becomes useless.

12
00:01:04,150 --> 00:01:09,490
Consider it an exercise to think about why this is the case after you've learned about embeddings.

13
00:01:13,980 --> 00:01:20,760
So before we move on to the TensorFlow syntax, let's review how exactly we want to convert text into

14
00:01:20,760 --> 00:01:21,480
integers.

15
00:01:22,260 --> 00:01:27,420
This is just very high level, but if you want more detail, simply go back to the previous lectures

16
00:01:27,420 --> 00:01:28,290
in the course.

17
00:01:29,700 --> 00:01:32,280
So the first step is to tokenize our text.

18
00:01:33,030 --> 00:01:39,360
Recall that our text will consist of documents, and each document is just a big long string containing

19
00:01:39,360 --> 00:01:42,900
multiple sentences, each of which may contain multiple words.

20
00:01:43,500 --> 00:01:48,510
Our goal with tokenization is to convert each document into a list of tokens.

21
00:01:48,990 --> 00:01:52,860
Each token might be a whole word, part of a word, or just a character.

22
00:01:53,670 --> 00:01:59,670
Once we've tokenized our text, the next step is to assign an integer ID to each unique token.

23
00:02:00,330 --> 00:02:07,290
So the word "a" might get ID zero, "aa" might get one, "bobcat" might get 81,000, and so forth.

24
00:02:09,030 --> 00:02:11,039
Let's recall why we want to do this.

25
00:02:11,820 --> 00:02:18,180
The reason we need an integer to map to each token is because these integers represent the positions

26
00:02:18,420 --> 00:02:19,620
in a feature vector.

27
00:02:20,220 --> 00:02:25,470
For example, if we're building a TF-IDF matrix, we need to know which column corresponds to which

28
00:02:25,470 --> 00:02:25,980
token.

29
00:02:26,880 --> 00:02:32,850
Now, going forward, we won't be using TF-IDF, but we still need to be able to map tokens to integers,

30
00:02:33,120 --> 00:02:39,270
as you'll see. In fact, for most of deep learning, we won't be using bag of words, but instead we'll

31
00:02:39,270 --> 00:02:41,070
be working with full sequences.

32
00:02:41,760 --> 00:02:47,610
To that end, there will be one additional step in this lecture, which is that after we create a mapping

33
00:02:47,610 --> 00:02:53,160
from token to integer, we're going to convert each document into a sequence of integers.

34
00:02:53,730 --> 00:02:59,760
So, for example, the document "I like cats" might become a list containing the integers 435,

35
00:02:59,760 --> 00:03:07,440
602, and 27, implying that 435 represents "I", 602 represents "like", and 27

36
00:03:07,440 --> 00:03:08,520
represents "cats".

37
00:03:09,390 --> 00:03:10,100
Practically,

38
00:03:10,110 --> 00:03:13,890
you can see how this might be helpful in terms of memory. Strings

39
00:03:13,890 --> 00:03:16,620
take up a lot of space, but numbers do not.

40
00:03:17,250 --> 00:03:23,190
So intuitively, you can understand how passing data around in your code in terms of numbers is much

41
00:03:23,190 --> 00:03:24,780
more efficient than text.
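
The three steps reviewed above (tokenize, assign an integer ID to each unique token, convert each document into a sequence of integers) can be sketched in plain Python. This is only an illustration under simple assumptions: tokens are lowercased, whitespace-separated words, IDs are handed out in order of first appearance, and the names `tokenize` and `build_vocab` are made up for this sketch rather than taken from any library.

```python
def tokenize(document):
    # Simplest possible tokenizer: lowercase, then split on whitespace.
    return document.lower().split()

def build_vocab(tokenized_docs):
    # Give each unique token the next free integer, in order of first appearance.
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

documents = ["I like cats", "I like dogs"]
tokenized = [tokenize(doc) for doc in documents]
vocab = build_vocab(tokenized)

# Replace every token with its integer ID.
sequences = [[vocab[token] for token in tokens] for tokens in tokenized]

print(vocab)      # {'i': 0, 'like': 1, 'cats': 2, 'dogs': 3}
print(sequences)  # [[0, 1, 2], [0, 1, 3]]
```

Each document keeps its word order, which is what distinguishes a sequence representation from bag of words.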

42
00:03:29,390 --> 00:03:34,850
OK, so in TensorFlow, all of what I just described is done using the Tokenizer class.

43
00:03:35,540 --> 00:03:41,990
This is imported from the Keras module since it was originally part of Keras before TensorFlow adopted

44
00:03:42,260 --> 00:03:42,680
it.

45
00:03:43,280 --> 00:03:46,460
You'll notice that the interface is kind of like what you see in

46
00:03:46,460 --> 00:03:52,910
scikit-learn, where you fit on some text and then you transform on other, possibly different text.

47
00:03:53,510 --> 00:03:59,720
Except in this case, the fit method is called fit_on_texts and the transform method is called

48
00:03:59,720 --> 00:04:00,770
texts_to_sequences.

49
00:04:01,520 --> 00:04:06,350
So this will directly convert any input text into sequences of integers.

50
00:04:07,950 --> 00:04:13,860
As usual, you typically want to fit on the training set, while you only transform on the test set.

51
00:04:14,460 --> 00:04:20,339
This emulates the real world operation of your model since in the future it will need to transform data

52
00:04:20,370 --> 00:04:21,720
your model has never seen.

53
00:04:23,220 --> 00:04:27,030
Now let's talk about what kind of data you should pass into these functions.

54
00:04:27,990 --> 00:04:31,020
Basically, you should pass in lists of strings.

55
00:04:31,620 --> 00:04:37,740
So, for example, if you have 100 documents, then you would pass in a list of size 100, and each

56
00:04:37,740 --> 00:04:42,270
element of that list would be a string representing a document.

57
00:04:43,140 --> 00:04:45,810
But note that the Tokenizer is pretty flexible.

58
00:04:46,380 --> 00:04:52,470
So alternatively, instead of having each document be a single string, you can also have each document

59
00:04:52,470 --> 00:04:54,060
be a list of strings.

60
00:04:54,630 --> 00:04:58,830
In that case, it would mean that each document has already been tokenized.

61
00:05:00,070 --> 00:05:06,730
So in that sense, the name of this object is kind of misleading because although it does do tokenization,

62
00:05:07,150 --> 00:05:12,880
it also does many more steps, such as creating a mapping from each token to an integer and converting

63
00:05:12,880 --> 00:05:15,100
each document into an integer list.

64
00:05:16,210 --> 00:05:22,000
And even more strangely, it works even when your inputs are already tokenized, in which case the so-called

65
00:05:22,000 --> 00:05:25,180
Tokenizer wouldn't actually be tokenizing anything.

66
00:05:29,900 --> 00:05:35,750
So let's talk about what kind of output one might expect after using the methods I just described.

67
00:05:36,680 --> 00:05:42,710
Suppose we pass in a very small dataset with just three documents, and each document is a sentence.

68
00:05:43,340 --> 00:05:45,920
The first sentence is "I like eggs and ham."

69
00:05:46,100 --> 00:05:48,410
The second is "I love chocolate and bunnies."

70
00:05:48,740 --> 00:05:50,660
And the third is "I hate onions."

71
00:05:51,260 --> 00:05:56,000
After using the code I just described, the output of the texts_to_sequences

72
00:05:56,000 --> 00:05:57,890
method would be as follows.

73
00:05:59,240 --> 00:06:05,180
As you can see, it's just a list of lists of integers where each integer corresponds to a word.

74
00:06:06,020 --> 00:06:10,190
Now you might be wondering how do I know which word corresponds to which integer?

75
00:06:10,910 --> 00:06:14,540
Of course, this information is stored within the Tokenizer object.

76
00:06:15,140 --> 00:06:20,990
Simply use the attribute word_index, and this will return a dictionary where the word is the key and

77
00:06:20,990 --> 00:06:22,370
the integer is the value.
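
To make the fit/transform interface concrete, here is a small pure-Python stand-in. It is not the Keras implementation: `SimpleTokenizer` is a hypothetical class that only borrows the method and attribute names mentioned above (`fit_on_texts`, `texts_to_sequences`, `word_index`). It assumes whitespace tokenization, gives the most frequent words the smallest IDs, and starts numbering at 1 so that 0 stays free for padding. The integers it produces belong to this sketch and may not match the output shown in the lecture.

```python
from collections import Counter

class SimpleTokenizer:
    """Hypothetical stand-in mirroring the fit/transform interface described above."""

    def __init__(self):
        self.word_index = {}

    def _tokens(self, doc):
        # Accept either a raw string or an already-tokenized list of strings.
        if isinstance(doc, str):
            return doc.lower().split()
        return [token.lower() for token in doc]

    def fit_on_texts(self, texts):
        counts = Counter()
        for doc in texts:
            counts.update(self._tokens(doc))
        # Most frequent words get the smallest IDs; ties keep first-seen order.
        ranked = sorted(counts.items(), key=lambda item: -item[1])
        self.word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

    def texts_to_sequences(self, texts):
        # Words that were never seen during fitting are skipped.
        return [
            [self.word_index[t] for t in self._tokens(doc) if t in self.word_index]
            for doc in texts
        ]

train_docs = [
    "I like eggs and ham",
    "I love chocolate and bunnies",
    "I hate onions",
]
tokenizer = SimpleTokenizer()
tokenizer.fit_on_texts(train_docs)          # fit on the training text only
train_sequences = tokenizer.texts_to_sequences(train_docs)
print(train_sequences)  # [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]

# Transform new text without refitting; "pizza" is unknown and gets dropped.
test_sequences = tokenizer.texts_to_sequences(["I love ham and pizza"])
print(test_sequences)   # [[1, 6, 5, 2]]
```

Passing `[["i", "hate", "eggs"]]` to `texts_to_sequences` works too, which illustrates the point that the input may already be tokenized.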

78
00:06:27,090 --> 00:06:30,090
So let's discuss some of the arguments into the constructor.

79
00:06:30,960 --> 00:06:36,390
The first argument is num_words, which allows you to specify the maximum number of words you want to

80
00:06:36,390 --> 00:06:36,840
keep.

81
00:06:37,740 --> 00:06:43,500
As you've seen, the vocabulary size of any language can get quite large, so this might help your model

82
00:06:43,500 --> 00:06:45,960
focus on only the most important words.

83
00:06:46,680 --> 00:06:49,680
As you might expect, the most frequent words are kept.

84
00:06:51,450 --> 00:06:57,600
The next argument is filters, which specifies which characters will simply be removed and ignored.

85
00:06:58,230 --> 00:06:58,950
By default,

86
00:06:58,980 --> 00:07:04,200
this includes all punctuation, tabs, and line breaks, except for the single quote.

87
00:07:05,910 --> 00:07:12,030
The next argument is lower, which allows you to specify that all text should be lowercased. The default

88
00:07:12,030 --> 00:07:13,020
for this is True.

89
00:07:14,310 --> 00:07:18,750
The next argument is split, which specifies the separator for each token.

90
00:07:20,010 --> 00:07:25,830
The next argument is char_level, which allows you to specify that you want character tokens instead

91
00:07:25,830 --> 00:07:26,910
of word tokens.

92
00:07:27,360 --> 00:07:28,830
By default, this is False.

93
00:07:30,850 --> 00:07:35,950
The next argument is oov_token, which stands for out-of-vocabulary token.

94
00:07:36,580 --> 00:07:39,670
This is for tokens which are not included in the vocabulary.

95
00:07:40,120 --> 00:07:45,790
For example, if you encounter a word in the test set that was not in the train set, it would be assigned

96
00:07:45,790 --> 00:07:46,600
to this token.

97
00:07:47,350 --> 00:07:52,480
If you don't set this argument, these tokens will simply be ignored, which is the default.
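
The effect of capping the vocabulary and of an out-of-vocabulary token can be sketched without any library. The function `encode`, its parameters, and the toy `word_index` below are all hypothetical. The sketch assumes that IDs are ranked by frequency (smaller ID means more frequent), that a cap of `num_words` keeps only IDs below that number, and that one ID has been set aside for unknown words.

```python
def encode(tokens, word_index, num_words=None, oov_index=None):
    """Map tokens to IDs, optionally capping the vocabulary and marking unknowns."""
    ids = []
    for token in tokens:
        idx = word_index.get(token)
        # A word counts as unknown if it was never seen or falls outside the cap.
        unknown = idx is None or (num_words is not None and idx >= num_words)
        if not unknown:
            ids.append(idx)
        elif oov_index is not None:
            ids.append(oov_index)
        # With no OOV index configured, the unknown token is silently dropped.
    return ids

# Toy vocabulary for illustration: ID 1 is set aside for unknown words.
word_index = {"<OOV>": 1, "i": 2, "and": 3, "like": 4, "eggs": 5}
tokens = ["i", "like", "green", "eggs"]

dropped = encode(tokens, word_index)                          # unknown words vanish
marked = encode(tokens, word_index, oov_index=1)              # unknown words become 1
capped = encode(tokens, word_index, num_words=4, oov_index=1) # rarer words become 1 too

print(dropped)  # [2, 4, 5]
print(marked)   # [2, 4, 1, 5]
print(capped)   # [2, 1, 1, 1]
```

Dropping unknown words shortens the sequence, while mapping them to a dedicated ID preserves the sequence length and the position of each word.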

98
00:07:57,100 --> 00:08:02,050
Now you'll notice that there's nothing in the arguments about stemming or lemmatization or stop words,

99
00:08:02,440 --> 00:08:04,630
as we learned about earlier in the course.

100
00:08:05,380 --> 00:08:11,800
And this is because we typically don't want to do these operations in deep learning. As an example,

101
00:08:11,800 --> 00:08:13,900
suppose you're trying to generate text.

102
00:08:14,440 --> 00:08:19,270
This would be pretty difficult if you couldn't generate stop words, as stop words are pretty common.

103
00:08:19,870 --> 00:08:22,000
So your text would be incoherent.

104
00:08:22,930 --> 00:08:26,800
The same goes for stemming and lemmatization. As an example,

105
00:08:26,800 --> 00:08:31,720
if you're doing machine translation, "running" and "ran" will have different translations.

106
00:08:32,200 --> 00:08:34,390
In this case, the difference is important.

107
00:08:38,990 --> 00:08:45,200
Now, just as a small bonus, I want to discuss another method you can call, which is texts_to_matrix.

108
00:08:45,770 --> 00:08:52,040
Essentially, this allows you to convert your text into TF-IDF, count vectors, or binary vectors, just

109
00:08:52,040 --> 00:08:53,990
as we described earlier in the course.

110
00:08:54,710 --> 00:08:59,900
This won't be needed since in deep learning, we actually want to keep our text as a sequence, but

111
00:08:59,900 --> 00:09:01,460
it's nice to know that it exists.
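
As a reminder of what count vectors and binary vectors look like, here is a plain-Python sketch that builds them from integer sequences. The function `to_matrix` and its `mode` argument are made up for this illustration and are not the Keras method; TF-IDF weighting is left out to keep the sketch short.

```python
def to_matrix(sequences, vocab_size, mode="count"):
    """Turn integer sequences into one fixed-length row per document."""
    matrix = []
    for seq in sequences:
        row = [0] * vocab_size
        for idx in seq:
            if mode == "binary":
                row[idx] = 1      # only record whether the word is present
            else:
                row[idx] += 1     # record how many times the word occurs
        matrix.append(row)
    return matrix

# Two toy documents; column 0 is unused because ID 0 is reserved for padding.
sequences = [[1, 2, 1], [3]]
counts = to_matrix(sequences, vocab_size=4)
binary = to_matrix(sequences, vocab_size=4, mode="binary")

print(counts)  # [[0, 2, 1, 0], [0, 0, 0, 1]]
print(binary)  # [[0, 1, 1, 0], [0, 0, 0, 1]]
```

Note that word order is lost in both outputs, which is why this representation is set aside in favor of sequences.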

112
00:09:06,140 --> 00:09:11,000
Now, at this point, we're going to look ahead a little bit and talk about the concept of padding.

113
00:09:11,750 --> 00:09:17,810
The reason why padding is required is because in some deep learning libraries such as TensorFlow, your

114
00:09:17,810 --> 00:09:19,970
sequences must all have the same length.

115
00:09:20,690 --> 00:09:23,300
Practically, you can see why this might be the case.

116
00:09:23,960 --> 00:09:26,990
We know that we like to use arrays to store our data.

117
00:09:27,680 --> 00:09:30,590
But arrays must have the same size in all dimensions.

118
00:09:31,040 --> 00:09:36,500
There's no jagged array in NumPy. But with text, we do have variable-length sequences.

119
00:09:36,800 --> 00:09:41,330
Not all sentences have the same length, and not all documents have the same length.

120
00:09:43,010 --> 00:09:48,290
I mentioned that this involves some looking ahead because you won't really see why this is true until

121
00:09:48,290 --> 00:09:50,060
you study CNNs and RNNs.

122
00:09:50,660 --> 00:09:55,520
So if you haven't studied those topics yet, you might not have any intuition about why we should do

123
00:09:55,520 --> 00:09:58,370
this. For now, just trust that this is true.

124
00:09:59,420 --> 00:10:04,160
In any case, there is a function called pad_sequences, which is also part of the Keras API.

125
00:10:04,880 --> 00:10:10,760
The default action is to take your list of lists of integers and to prepend zeros so that all of the

126
00:10:10,760 --> 00:10:14,210
sequences have the length of the longest sequence.

127
00:10:14,960 --> 00:10:20,660
So if we use our example from earlier, you can see that two zeros have been added to the front of the

128
00:10:20,660 --> 00:10:26,720
third sentence because that sentence only has three words, whereas the other two sentences have five.

129
00:10:27,650 --> 00:10:28,070
Again,

130
00:10:28,130 --> 00:10:29,360
looking ahead a little bit,

131
00:10:29,750 --> 00:10:34,640
it's typically the case that we want to add zeros at the front, since models like RNNs have

132
00:10:34,640 --> 00:10:37,910
trouble remembering things from too far in the past.

133
00:10:38,570 --> 00:10:42,080
Thus, we would like to have the most useful information near the end.

134
00:10:43,370 --> 00:10:47,840
There are other options we'll explore in the next lecture, such as adding zeros at the end instead

135
00:10:47,840 --> 00:10:51,950
of the front, and truncating the sequences to limit the maximum length.
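
The padding behavior described above can be written out in a few lines of plain Python. The function `pad` and its parameters are hypothetical names for this sketch, not the Keras function. The default here prepends zeros up to the longest sequence, as described. Dropping tokens from the front when truncating is an assumption made only for this illustration. The input sequences are illustrative IDs for the three example sentences.

```python
def pad(sequences, maxlen=None, padding="pre", value=0):
    """Pad (and, if needed, truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # Assumption for this sketch: keep the last maxlen tokens.
            seq = seq[len(seq) - maxlen:]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

# Illustrative IDs for "I like eggs and ham", "I love chocolate and bunnies",
# and "I hate onions".
sequences = [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]

pre_padded = pad(sequences)                   # zeros added at the front
post_padded = pad(sequences, padding="post")  # zeros added at the end
truncated = pad(sequences, maxlen=2)          # every row cut down to length 2

print(pre_padded)   # [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [0, 0, 1, 9, 10]]
print(post_padded)  # [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10, 0, 0]]
print(truncated)    # [[2, 5], [2, 8], [9, 10]]
```

Because 0 is used as the fill value, it cannot also stand for a real word, which is why index zero is reserved.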

136
00:10:56,600 --> 00:11:00,080
As usual in machine learning, it's useful to think of shapes.

137
00:11:00,650 --> 00:11:06,680
So now that we've converted our text into tokens and then into a padded sequence of integers, let's

138
00:11:06,680 --> 00:11:12,350
think about what shape we should have if we were to store the results in a numpy array or TensorFlow

139
00:11:12,350 --> 00:11:12,920
Tensor.

140
00:11:13,850 --> 00:11:20,840
Basically, just remember N by T: N is the number of documents and T is the maximum document length.

141
00:11:21,710 --> 00:11:27,410
Just as we use the letter D by convention for the number of features, we use T by convention for the sequence

142
00:11:27,410 --> 00:11:27,650
length.
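
As a quick check on the N by T shape, the snippet below reads both numbers off a padded list of lists. The data is made up for illustration: three documents padded to length five.

```python
# Three padded documents, each of length five (illustrative values).
padded = [
    [1, 3, 4, 2, 5],
    [1, 6, 7, 2, 8],
    [0, 0, 1, 9, 10],
]

N = len(padded)     # number of documents
T = len(padded[0])  # padded sequence length

# Every row must have length T, otherwise the data could not form an array.
is_rectangular = all(len(row) == T for row in padded)
shape = (N, T)

print(shape)           # (3, 5)
print(is_rectangular)  # True
```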

