1
00:00:11,670 --> 00:00:18,170
In this lecture, I want to introduce you to the concepts behind text preprocessing. In the previous lecture,

2
00:00:18,180 --> 00:00:23,520
we learned that, obviously, you can't multiply words by numbers; that doesn't make any sense.

3
00:00:23,790 --> 00:00:29,100
In order to use deep learning, which really means neural networks, we must have numbers, because a neural

4
00:00:29,100 --> 00:00:33,120
network is basically just a series of multiplications and additions.

5
00:00:33,810 --> 00:00:40,500
So the question now becomes: how do I turn my text into numbers? Specifically,

6
00:00:40,510 --> 00:00:45,610
we need the kind of numbers that can be passed into an embedding layer, which means we need integers

7
00:00:45,610 --> 00:00:48,580
that correspond to rows in an embedding matrix.
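To make "integers that correspond to rows" concrete, here is a minimal pure-Python sketch. The name embedding_matrix and all of its values are made up for illustration; a real embedding layer would learn these values during training.
```python
# Hypothetical embedding matrix: 4 words, 3 dimensions each (values are arbitrary).
embedding_matrix = [
    [0.0, 0.0, 0.0],   # row 0
    [0.1, 0.2, 0.3],   # row 1
    [0.4, 0.5, 0.6],   # row 2
    [0.7, 0.8, 0.9],   # row 3
]
# A sentence already converted to integers: each integer selects one row.
sentence = [2, 3, 1]
word_vectors = [embedding_matrix[i] for i in sentence]
print(word_vectors)
```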

8
00:00:53,730 --> 00:00:57,450
Let's start from some basic and hopefully pretty obvious points.

9
00:00:57,450 --> 00:00:59,750
First, how is text stored?

10
00:00:59,760 --> 00:01:06,390
The answer is text files. To be clear, I mean plain text files, and not Microsoft Word or anything like

11
00:01:06,390 --> 00:01:06,720
that.

12
00:01:07,530 --> 00:01:12,810
Hopefully you've done enough Python that you know how to open a text file, read its contents line

13
00:01:12,810 --> 00:01:15,910
by line and so forth.

14
00:01:16,090 --> 00:01:22,870
Hopefully you know that Python files are themselves text files, so text files are a fundamental ingredient

15
00:01:22,870 --> 00:01:24,170
of computer programming.

16
00:01:24,250 --> 00:01:25,030
You should love them.

17
00:01:26,020 --> 00:01:29,600
OK, so let's say I read in a plain text file using Python.

18
00:01:29,620 --> 00:01:33,580
How are those file contents represented in Python code?

19
00:01:33,580 --> 00:01:36,340
Well, the answer is: as a string.
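As a quick illustration, here is a small sketch that writes a throwaway file and reads it back. The file name and contents are invented just to keep the example self-contained.
```python
import os
import tempfile
# Write a small temporary file so the example does not depend on any existing file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Today I walked my dog.\nMy dog chased a rabbit.\n")
    # Reading the whole file back gives a single Python string.
    with open(path, encoding="utf-8") as f:
        contents = f.read()
print(type(contents).__name__)
print(repr(contents))
```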

20
00:01:36,340 --> 00:01:37,460
But wait.

21
00:01:37,510 --> 00:01:39,550
If the whole file is a string,

22
00:01:39,550 --> 00:01:43,520
how can I get the individual words and sentences?

23
00:01:43,540 --> 00:01:49,750
Remember that in order to convert my words into word vectors, those words must be represented individually

24
00:01:49,780 --> 00:01:52,900
as integers. But starting out,

25
00:01:53,030 --> 00:01:58,670
I have nothing but a humongous string full of multiple sentences, each sentence containing multiple

26
00:01:58,670 --> 00:01:59,720
words.

27
00:01:59,720 --> 00:02:00,290
No good.

28
00:02:05,420 --> 00:02:05,850
OK.

29
00:02:05,870 --> 00:02:07,870
I'll make it a little easier on you.

30
00:02:07,880 --> 00:02:10,640
Suppose we impose some structure on our text file.

31
00:02:10,640 --> 00:02:12,520
Let's make it a CSV.

32
00:02:12,590 --> 00:02:13,990
We've all seen CSVs before.

33
00:02:14,020 --> 00:02:17,200
So I hope there's no need to explain what these are.

34
00:02:17,210 --> 00:02:22,670
Let's suppose, since this is what we're going to see later anyway, that this is a two-column CSV where

35
00:02:22,670 --> 00:02:27,310
the first column is some string representing an email or a tweet that we want to classify.

36
00:02:27,560 --> 00:02:30,730
And the second column is its corresponding label.

37
00:02:30,740 --> 00:02:36,470
This makes things slightly easier, because each line of the file consists of only two parts: an email

38
00:02:36,470 --> 00:02:41,880
or a tweet, and some corresponding label. As a side note,

39
00:02:41,890 --> 00:02:49,150
sometimes people, including myself, will refer to these emails or tweets simply as documents. That may

40
00:02:49,150 --> 00:02:54,050
be confusing to you as a beginner, because you might think of a document as an entire file,

41
00:02:54,060 --> 00:02:55,530
like a book or a report.

42
00:02:56,110 --> 00:03:02,410
But in NLP, when we say document, that probably just refers to a single sample, which could be a sentence

43
00:03:02,410 --> 00:03:04,110
or maybe a few sentences.

44
00:03:04,390 --> 00:03:10,620
It could possibly be longer, but that depends on the task you're doing. So be aware that the word document

45
00:03:10,740 --> 00:03:13,470
is kind of an overloaded term. In NLP,

46
00:03:13,470 --> 00:03:17,160
it doesn't mean a document as an entire file, like a Word document.

47
00:03:17,280 --> 00:03:19,980
It just means a single sample in your dataset.

48
00:03:19,980 --> 00:03:24,840
So from here on out, when I say the word document, I'm going to assume you understand what I'm talking

49
00:03:24,840 --> 00:03:25,230
about.

50
00:03:30,440 --> 00:03:35,890
Now that our data is in the form of a CSV, maybe we can use pandas to load it into a DataFrame.
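Here is a minimal sketch of loading such a two-column CSV. To keep it dependency-free it uses the standard-library csv module rather than pandas (pandas.read_csv plays the same role when you want a DataFrame). The column names text and label and the two rows are invented for illustration.
```python
import csv
import io
# Made-up two-column CSV held in memory; a real file would be opened with open().
raw = io.StringIO(
    'text,label\n'
    '"Win a free prize now, click here",spam\n'
    '"Are we still meeting at noon?",ham\n'
)
rows = list(csv.DictReader(raw))
documents = [row["text"] for row in rows]   # first column: the documents
labels = [row["label"] for row in rows]     # second column: the labels
print(documents)
print(labels)
```
Note that each entry of documents is still a whole string, not a list of words.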

51
00:03:36,840 --> 00:03:42,750
But the problem is that each of the cells in the first column of this DataFrame is still a document; they

52
00:03:42,750 --> 00:03:49,710
contain multiple words, when what we really want is individual words. In the next portion of this lecture,

53
00:03:49,770 --> 00:03:55,500
I want to discuss how we might process these documents in pure Python code to give you an understanding

54
00:03:55,500 --> 00:04:01,950
of how this works and how you might implement it yourself. In the end, we're going to end up using a library

55
00:04:01,980 --> 00:04:08,700
that's part of the PyTorch ecosystem, but it's quite immature and limited in its capabilities, so it's

56
00:04:08,700 --> 00:04:14,360
entirely likely that in the future it may not fit your needs and you may need to write your own text

57
00:04:14,370 --> 00:04:20,940
preprocessing code. So it's good to have some idea of how this works. In my more advanced and in-depth

58
00:04:20,940 --> 00:04:22,030
NLP courses,

59
00:04:22,080 --> 00:04:25,470
we actually go over how to do text preprocessing in Python.

60
00:04:25,470 --> 00:04:29,820
So if you're interested in getting practice with that, you can ask me about it on the Q&A.

61
00:04:35,000 --> 00:04:40,460
In any case, let's start with what we have: a document, which contains multiple words and possibly multiple

62
00:04:40,460 --> 00:04:41,910
sentences.

63
00:04:41,930 --> 00:04:45,900
We know that we want to split this up into individual words.

64
00:04:45,950 --> 00:04:48,620
This is a process called tokenization.

65
00:04:48,950 --> 00:04:54,020
If you wanted to take the simplest possible approach you could achieve this by taking your document

66
00:04:54,440 --> 00:04:57,820
which is a string and just calling the split function.

67
00:04:57,830 --> 00:05:03,560
This is probably fine in a lot of cases, but there are some downsides to this too, such as that

68
00:05:03,560 --> 00:05:07,630
it doesn't take care of punctuation. Consider the document:

69
00:05:07,660 --> 00:05:14,890
"Today I walked my dog. My dog chased a rabbit." We see here that although the word dog appears twice, the

70
00:05:14,890 --> 00:05:21,780
two times it appears are different, because the first time there is a period appended at the end.

71
00:05:22,180 --> 00:05:28,030
So if you wanted to, say, assign a unique integer to each word that appears in your dataset, these would

72
00:05:28,030 --> 00:05:30,150
appear as two distinct words.

73
00:05:30,280 --> 00:05:31,780
That's possibly not good.

74
00:05:32,260 --> 00:05:36,420
So you may want to have some code that removes punctuation.

75
00:05:36,580 --> 00:05:41,110
You might also want to lowercase all the letters as well, although there have been models that kept the

76
00:05:41,110 --> 00:05:46,990
original casing.
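Here is a minimal sketch of both ideas on a two-sentence document like the one above. The helper name tokenize is hypothetical, and stripping ASCII punctuation with string.punctuation is just one possible cleanup, not the only reasonable one.
```python
import string
document = "Today I walked my dog. My dog chased a rabbit."
# Simplest approach: split on whitespace. Punctuation stays attached to words.
naive_tokens = document.split()
print(naive_tokens)
# One possible cleanup: lowercase, then delete ASCII punctuation before splitting.
def tokenize(text):
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()
tokens = tokenize(document)
print(tokens)
```
With the naive split, "dog." and "dog" are different strings; after the cleanup, both occurrences become "dog".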

77
00:05:47,180 --> 00:05:51,990
Let's suppose that we've made it to the next stage, where we've successfully tokenized all our documents

78
00:05:52,050 --> 00:05:58,380
so that instead of just single strings, they now appear as lists of strings, where each string is an individual

79
00:05:58,380 --> 00:05:59,280
word.

80
00:05:59,640 --> 00:06:06,870
As you know, the next step is to convert these words, as strings, into an integer representation.

81
00:06:06,990 --> 00:06:11,240
Clearly each word must have a uniquely assigned integer for this to work.

82
00:06:11,820 --> 00:06:18,150
So the word "a" might be assigned the integer 0, "aa" might be assigned 1, and so on, until we get to the

83
00:06:18,150 --> 00:06:23,010
last word in the dictionary, Zeitung, which might be assigned 1 million.

84
00:06:23,040 --> 00:06:30,340
The question is: how do you write this in code?

85
00:06:30,440 --> 00:06:32,680
Again, this is all just conceptual for now.

86
00:06:32,750 --> 00:06:34,730
But here are some ideas.

87
00:06:34,730 --> 00:06:38,950
First, you define some counter called current index, and you set that to zero.

88
00:06:39,020 --> 00:06:44,210
You also instantiate a set which keeps track of every word you've seen so far.

89
00:06:44,210 --> 00:06:49,550
Then you loop through each word in your corpus. By the way, corpus is just a fancy term for your text

90
00:06:49,550 --> 00:06:51,550
data set.

91
00:06:51,700 --> 00:06:54,760
So each time you encounter a word that you haven't seen before,

92
00:06:54,850 --> 00:06:56,300
you do two things.

93
00:06:56,590 --> 00:07:02,690
First, you add this word and the current index, which will be assigned to this word, to a dictionary.

94
00:07:02,830 --> 00:07:09,040
We'll call this the word-to-index dictionary, since the key is a word and the value is its index.

95
00:07:09,030 --> 00:07:14,860
Second, we increment the current index by 1, so that the next time we encounter a new word it will be

96
00:07:14,860 --> 00:07:16,730
assigned a new index value.

97
00:07:16,840 --> 00:07:17,910
Pretty simple, I hope.
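Here is a minimal sketch of that loop, assuming the corpus has already been tokenized. The toy corpus is made up, and the dictionary's keys double as the set of words seen so far, so no separate set is kept.
```python
# Toy tokenized corpus (made up for illustration).
corpus = [
    ["today", "i", "walked", "my", "dog"],
    ["my", "dog", "chased", "a", "rabbit"],
]
current_index = 0
word2idx = {}
for tokens in corpus:
    for word in tokens:
        if word not in word2idx:      # first time we see this word
            word2idx[word] = current_index
            current_index += 1        # the next new word gets the next integer
print(word2idx)
```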

98
00:07:23,050 --> 00:07:29,030
One caveat to the previous pseudocode is that usually we don't want to start at the index 0.

99
00:07:29,080 --> 00:07:32,280
Instead, I propose that we start at the index 2.

100
00:07:32,470 --> 00:07:34,710
So why would I want to do that?

101
00:07:34,720 --> 00:07:38,710
Well, if you recall, each sentence in our dataset might be of a different length.

102
00:07:38,860 --> 00:07:45,310
So we need padding. Padding is usually done with zeros, so zero is typically a reserved number for that

103
00:07:45,310 --> 00:07:46,630
use case.

104
00:07:46,630 --> 00:07:48,250
But why start at 2 and not 1?

105
00:07:50,260 --> 00:07:53,660
Well, remember that we're going to have a train set and a test set.

106
00:07:53,680 --> 00:07:59,350
What if our test set contains words that didn't exist in the train set? Then these should be considered

107
00:07:59,440 --> 00:08:04,240
unknown words, because we couldn't have learned anything about them during the training process.

108
00:08:04,300 --> 00:08:11,350
And so for unknown words, we should assign the index 1. Therefore, we have two special indices: zero

109
00:08:11,350 --> 00:08:18,940
represents padding and 1 represents unknown words. As a side note, the library we'll use later, called

110
00:08:18,970 --> 00:08:20,060
torchtext,

111
00:08:20,080 --> 00:08:23,950
actually does the opposite, but obviously that doesn't change the point.

112
00:08:23,980 --> 00:08:28,410
So in actuality, zero is unknown and one is padding.

113
00:08:28,690 --> 00:08:37,570
And of course, this is only the case if you use torchtext and you don't write this yourself.
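Here is a minimal sketch of reserving the two special indices, using this lecture's convention (0 for padding, 1 for unknown) rather than the library's. The names PAD_INDEX, UNK_INDEX, build_vocab, and encode are hypothetical helpers, not part of any library.
```python
PAD_INDEX = 0   # reserved for padding (convention used in this lecture)
UNK_INDEX = 1   # reserved for words never seen in the train set
def build_vocab(tokenized_docs):
    word2idx = {}
    current_index = 2   # real words start after the two reserved indices
    for tokens in tokenized_docs:
        for word in tokens:
            if word not in word2idx:
                word2idx[word] = current_index
                current_index += 1
    return word2idx
def encode(tokens, word2idx):
    # Words missing from the vocabulary map to UNK_INDEX.
    return [word2idx.get(word, UNK_INDEX) for word in tokens]
train_docs = [["my", "dog", "chased", "a", "rabbit"]]
word2idx = build_vocab(train_docs)
print(encode(["my", "cat", "chased", "a", "rabbit"], word2idx))
```
The word "cat" never appeared in the train documents, so it is encoded as 1.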

114
00:08:37,780 --> 00:08:38,080
OK.

115
00:08:38,110 --> 00:08:39,940
So what's the next step?

116
00:08:39,970 --> 00:08:44,620
Remember that we would like to have constant-length sequences, or at least constant-length sequences per

117
00:08:44,620 --> 00:08:45,390
batch.

118
00:08:45,820 --> 00:08:51,910
So let's suppose we have a batch of sentences, each of different length. The first sentence is 9, 3,

119
00:08:51,910 --> 00:08:54,010
100, 27, and 6.

120
00:08:54,130 --> 00:09:00,630
The second sentence is 14, 21, and the third sentence is 99, 501, 87, and

121
00:09:00,630 --> 00:09:06,520
2. Remember, at this point we've converted each sentence into a list of tokens, which we then converted

122
00:09:06,580 --> 00:09:14,460
into a list of integers. We can see that the lengths of these sentences are five, two, and four, respectively.

123
00:09:15,240 --> 00:09:18,070
Since the longest sentence has length five,

124
00:09:18,150 --> 00:09:22,390
the two shorter sentences would have to be padded to be of length five.

125
00:09:22,470 --> 00:09:29,940
So the second sentence becomes 14, 21, 0, 0, 0 and the third sentence becomes 99, 501,

126
00:09:29,940 --> 00:09:32,170
87, 2, and 0.

127
00:09:32,250 --> 00:09:36,100
And so, just a reminder: we're using the convention that zero equals padding,

128
00:09:36,210 --> 00:09:38,100
even though this might not be the case in the code.
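Here is a minimal sketch of padding at the end for the batch just described. The helper name pad_post is hypothetical, and it assumes the lecture's convention that 0 means padding.
```python
PAD_INDEX = 0   # lecture convention; a library may reserve a different value
def pad_post(sequences, pad_value=PAD_INDEX):
    max_len = max(len(seq) for seq in sequences)
    # Append pad_value until every sequence reaches max_len.
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
batch = [[9, 3, 100, 27, 6], [14, 21], [99, 501, 87, 2]]
for row in pad_post(batch):
    print(row)
```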

129
00:09:43,260 --> 00:09:45,300
But here's an interesting idea.

130
00:09:45,420 --> 00:09:48,740
Instead of padding at the end, why not pad at the beginning?

131
00:09:48,930 --> 00:09:55,420
So the second sentence becomes 0, 0, 0, 14, 21 and the third sentence becomes 0, 99,

132
00:09:55,420 --> 00:10:01,380
501, 87, and 2. In the first case, we call that post-padding, because the padding

133
00:10:01,620 --> 00:10:07,470
goes at the end of each sentence, and in the case we're considering now, we call it pre-padding, because

134
00:10:07,470 --> 00:10:11,310
the padding goes at the beginning of each sentence. As a quiz,

135
00:10:11,310 --> 00:10:16,970
think about why we might want to choose one over the other. Which one might be better for text classification

136
00:10:16,980 --> 00:10:18,600
using RNNs, and why?

137
00:10:19,410 --> 00:10:30,800
I'll give you a minute to think about it, or you can pause the video and come back when you're ready.

138
00:10:30,850 --> 00:10:31,140
All right.

139
00:10:31,170 --> 00:10:36,390
So hopefully you thought about your answer to the question: which might be better for text classification

140
00:10:36,400 --> 00:10:43,780
using RNNs, pre-padding or post-padding? The answer is pre-padding. In our case, the RNN will read the

141
00:10:43,780 --> 00:10:45,380
sequence from left to right,

142
00:10:45,490 --> 00:10:48,230
or equivalently, from beginning to end.

143
00:10:48,280 --> 00:10:55,200
That's not a problem in and of itself, but recall: what challenges do RNNs face?

144
00:10:55,270 --> 00:10:59,950
The main challenge for RNNs is learning long-term dependencies, due to the vanishing gradient

145
00:10:59,950 --> 00:11:00,400
problem.

146
00:11:01,030 --> 00:11:04,180
So if we put padding at the end, what happens?

147
00:11:04,180 --> 00:11:09,040
That means if there was any pattern to be learned from the sentence, it's going to appear at the beginning,

148
00:11:09,910 --> 00:11:15,220
and if there's tons of padding, say 50 or 100 zeros, then there's a good chance that by the end of all

149
00:11:15,220 --> 00:11:21,550
those zeros the RNN will have completely forgotten what it saw earlier. And due to the vanishing gradient

150
00:11:21,550 --> 00:11:26,890
problem, it's going to be very hard for the RNN to learn to capture those patterns that appeared

151
00:11:26,890 --> 00:11:28,350
at the beginning of the sentence.

152
00:11:30,090 --> 00:11:35,310
Therefore, if you're doing a many-to-one task in the left-to-right direction, it's better that you put

153
00:11:35,310 --> 00:11:41,660
your padding at the beginning.
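Here is a minimal sketch of padding at the beginning, so that the real tokens are the last thing a left-to-right RNN reads. The helper name pad_pre is hypothetical, and 0 is assumed to be the padding value as in this lecture.
```python
def pad_pre(sequences, pad_value=0):
    max_len = max(len(seq) for seq in sequences)
    # Prepend pad_value so the real tokens sit at the end of each row.
    return [[pad_value] * (max_len - len(seq)) + seq for seq in sequences]
batch = [[9, 3, 100, 27, 6], [14, 21], [99, 501, 87, 2]]
for row in pad_pre(batch):
    print(row)
```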

154
00:11:41,760 --> 00:11:48,350
Okay, so let's summarize all the conceptual steps that need to happen in a text preprocessing pipeline.

155
00:11:48,710 --> 00:11:53,810
First, your data might arrive in some weird format. For example, it may be a collection of web

156
00:11:53,810 --> 00:12:00,110
page URLs, or a database table, or you might be using the Twitter API to download tweets as JSON documents.

157
00:12:00,830 --> 00:12:08,920
For reasons you'll understand very soon, it'll be convenient if you can format these into a CSV.

158
00:12:09,010 --> 00:12:14,770
Second, when you load in your data, whether it's a CSV or not, this data won't be in the form of one string

159
00:12:14,770 --> 00:12:21,100
per word. In order to convert each document into a list of strings, where each string contains only a

160
00:12:21,100 --> 00:12:22,480
single word or token,

161
00:12:22,900 --> 00:12:27,900
we need to perform a process called tokenization.

162
00:12:28,290 --> 00:12:34,130
Third, once you have your documents encoded as lists of strings, you need to convert each of those tokens

163
00:12:34,130 --> 00:12:36,830
into an integer representation.

164
00:12:36,860 --> 00:12:45,410
That means you need to assign, or map, each token (a word) to a unique integer, and in addition, you'll need

165
00:12:45,410 --> 00:12:51,320
to convert each sentence from a list of strings into a list of corresponding integers.

166
00:12:51,380 --> 00:12:57,750
At this point, your dataset consists of nothing but lists of integers. Finally,

167
00:12:57,780 --> 00:13:03,210
you'll want to perform padding so that your dataset, or at least each batch, has the same sequence length.

168
00:13:04,320 --> 00:13:10,350
I hope that none of these steps seems particularly difficult to write by yourself in Python code.

169
00:13:10,350 --> 00:13:16,650
As an exercise, I would strongly recommend writing your own Python code to perform each of these steps.

170
00:13:16,650 --> 00:13:23,010
This isn't strictly required for the upcoming example, but it's an extremely useful skill to have, especially

171
00:13:23,010 --> 00:13:28,740
if you plan on doing something unique and not covered by the basic use cases in existing libraries.
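As a starting point for that exercise, here is one possible end-to-end sketch: tokenize, build the word-to-index mapping, encode, and pre-pad each batch only up to its own longest sequence. All helper names and the toy documents are hypothetical, and the indices follow this lecture's convention (0 for padding, 1 for unknown), not any library's.
```python
import string
PAD_INDEX, UNK_INDEX = 0, 1
def tokenize(text):
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()
def build_vocab(tokenized_docs):
    word2idx = {}
    for tokens in tokenized_docs:
        for word in tokens:
            word2idx.setdefault(word, len(word2idx) + 2)   # 0 and 1 are reserved
    return word2idx
def batches(encoded_docs, batch_size):
    # Pre-pad each batch only up to that batch's own longest sequence.
    for start in range(0, len(encoded_docs), batch_size):
        batch = encoded_docs[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        yield [[PAD_INDEX] * (max_len - len(seq)) + seq for seq in batch]
documents = ["Today I walked my dog.", "My dog chased a rabbit!", "No good.", "OK."]
tokenized = [tokenize(doc) for doc in documents]
word2idx = build_vocab(tokenized)
encoded = [[word2idx.get(w, UNK_INDEX) for w in tokens] for tokens in tokenized]
for batch in batches(encoded, batch_size=2):
    print(batch)
```
Padding per batch means the second batch here is only two tokens wide, instead of being padded out to the longest document in the whole dataset.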