1
00:00:11,050 --> 00:00:16,149
So in this lecture, we are going to begin looking at our first and most simple method of converting

2
00:00:16,149 --> 00:00:17,380
text into vectors.

3
00:00:17,890 --> 00:00:22,570
This method is so simple that it doesn't even have a name, but if you want to give it a name, you

4
00:00:22,570 --> 00:00:23,920
would simply call it counting.

5
00:00:25,120 --> 00:00:30,700
As a side note, recognize that this technique is an instance of the bag of words approach, since we

6
00:00:30,700 --> 00:00:33,850
will be ignoring the order of the words in each document.

7
00:00:38,370 --> 00:00:43,500
OK, so suppose we're doing a simple classification task where we want to differentiate between the

8
00:00:43,500 --> 00:00:46,650
documents about biology and documents about physics.

9
00:00:47,310 --> 00:00:51,270
Now the first thing to keep in mind is that document is a very generic word.

10
00:00:51,720 --> 00:00:55,230
This could mean anything depending on what your real world task is.

11
00:00:55,710 --> 00:01:01,440
For example, you might want to categorize journal papers in which case one document would be one journal

12
00:01:01,440 --> 00:01:01,950
paper.

13
00:01:02,640 --> 00:01:08,070
Or you might want to categorize files on your desktop, in which case one document would be one file.

14
00:01:08,880 --> 00:01:13,980
Of course, these files could still be journal papers, but they might also be for books, maybe even

15
00:01:13,980 --> 00:01:15,900
simple text files, and so forth.

16
00:01:16,500 --> 00:01:20,280
On the other hand, a document could simply refer to a single sentence.

17
00:01:20,730 --> 00:01:25,620
For example, I might want to just look at a single sentence and classify whether it is positive or

18
00:01:25,620 --> 00:01:26,220
negative.

19
00:01:27,150 --> 00:01:29,550
An example of that might be news headlines.

20
00:01:29,970 --> 00:01:35,640
So there is a wide range of possible document sizes from a single sentence all the way to a full book.

21
00:01:36,210 --> 00:01:38,460
Now let's think about what our data would look like.

22
00:01:43,200 --> 00:01:47,610
So we're going to start from a very general point of view where your data might be stored inside an

23
00:01:47,610 --> 00:01:54,420
Excel spreadsheet, a CSV, or a pandas DataFrame. I'll assume that you all know your basic programming.

24
00:01:54,420 --> 00:01:59,580
So if you had a bunch of data files lying around and you wanted to classify them, you would know how

25
00:01:59,580 --> 00:02:05,280
to convert them into this format, since this isn't a machine learning task, but just a basic programming

26
00:02:05,280 --> 00:02:05,760
task.

27
00:02:06,060 --> 00:02:09,120
We would consider that a prerequisite to this course.

28
00:02:09,960 --> 00:02:14,940
In any case, let's suppose that you were able to get your data into this format or that it was given

29
00:02:14,940 --> 00:02:16,650
to you this way, which it could be

30
00:02:16,650 --> 00:02:21,390
in many cases. We can see that we have essentially two columns of data.

31
00:02:21,870 --> 00:02:24,750
Column one is the text where each row is a document.

32
00:02:25,140 --> 00:02:28,260
Column two is the label corresponding to the document beside it.

33
00:02:29,010 --> 00:02:34,350
OK, so hopefully you find this format to be very simple, and you have a good idea about why your data

34
00:02:34,350 --> 00:02:37,290
would be formatted in this way in the real world.

35
00:02:38,720 --> 00:02:45,050
Note that we might also call column one the inputs to our model, while column two consists of the targets.

36
00:02:45,620 --> 00:02:51,380
Our goal in this lecture is to figure out how we will convert the documents in column one into a numerical

37
00:02:51,380 --> 00:02:55,010
representation that would be fit for use in a machine learning model.

38
00:02:59,830 --> 00:03:05,770
OK, so as mentioned, this method simply involves counting. Firstly, let's suppose that we know how

39
00:03:05,770 --> 00:03:08,260
to determine the size of our vocabulary.

40
00:03:08,770 --> 00:03:12,610
If you don't know how to do this yet, then please consider it as an exercise.

41
00:03:13,150 --> 00:03:16,660
Basically, you would just need to loop through every word in your text corpus.

42
00:03:16,930 --> 00:03:20,650
Count up all the unique words, and that is your vocabulary size.

43
00:03:21,250 --> 00:03:27,550
Suppose that we call our vocabulary size big V. It's the number of unique words in your training corpus.
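
Here is a minimal sketch of that exercise in Python. The toy corpus and the naive whitespace split are assumptions for illustration only; real text needs proper tokenization.

```python
# Toy training corpus (hypothetical); each string is one document.
corpus = [
    "I like eggs",
    "I hate cats",
    "I like cats and I like eggs",
]

# Loop through every word in the corpus and keep only the unique ones.
vocabulary = set()
for document in corpus:
    for word in document.split():  # naive whitespace split, an assumption
        vocabulary.add(word)

V = len(vocabulary)  # vocabulary size
print(V)  # 6
```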

44
00:03:29,150 --> 00:03:33,110
OK, now the next thing to do is, for each document in our text corpus,

45
00:03:33,410 --> 00:03:37,200
we're going to create a vector of size V. Inside this vector,

46
00:03:37,220 --> 00:03:41,300
we are going to count the number of times each word appears in the document.

47
00:03:41,930 --> 00:03:46,880
Of course, this also means that we've assigned each position in the vector to correspond with a unique

48
00:03:46,880 --> 00:03:47,450
word.

49
00:03:48,020 --> 00:03:51,860
So, for example, the first position will contain the count for the word A.

50
00:03:52,250 --> 00:03:57,740
The second position will contain the count for the word AA and so forth all the way up to the word zygote,

51
00:03:57,980 --> 00:04:00,200
which we will assume is the last word in our corpus

52
00:04:00,440 --> 00:04:01,820
if we sort alphabetically.

53
00:04:03,260 --> 00:04:05,930
Note that the words need not go in alphabetical order.

54
00:04:05,960 --> 00:04:09,890
But for the purpose of this lecture, we'll just assume that they do for simplicity.

55
00:04:14,570 --> 00:04:18,740
So let's go through a very simple example just to demonstrate how this works.

56
00:04:19,490 --> 00:04:26,540
Let's suppose that we have a very simple vocabulary with just six words: "I", "like", "and", "hate", "eggs", and "cats".

57
00:04:27,290 --> 00:04:32,300
Now, suppose that our documents are as follows. Document one is "I like eggs".

58
00:04:32,750 --> 00:04:34,880
Document two is "I hate cats".

59
00:04:35,240 --> 00:04:38,150
Document three is "I like cats and I like eggs".

60
00:04:38,900 --> 00:04:45,410
Now let's convert these into vectors using counting. Because the vocabulary size is six,

61
00:04:45,680 --> 00:04:50,720
we're going to end up with three vectors, each of size six, one for each of our documents.

62
00:04:51,320 --> 00:04:56,510
Let's suppose that the words will be ordered alphabetically, so the first position corresponds to "and",

63
00:04:56,810 --> 00:04:59,090
the second to "cats", and so forth.

64
00:05:00,500 --> 00:05:05,240
OK, so Document one would be converted to zero zero one zero one one.

65
00:05:05,630 --> 00:05:10,640
This is because "eggs" shows up once, "I" shows up once, and "like" shows up once.

66
00:05:11,330 --> 00:05:15,140
Document two would be converted to zero one zero one one zero.

67
00:05:15,560 --> 00:05:20,600
This is because "cats" shows up once, "hate" shows up once, and "I" shows up once.

68
00:05:21,200 --> 00:05:25,850
And finally, Document three would be converted to one one one zero two two.

69
00:05:26,480 --> 00:05:31,160
This is because "and" shows up once, "cats" shows up once, "eggs" shows up

70
00:05:31,160 --> 00:05:36,260
once, "I" shows up twice, and "like" shows up twice. As an exercise,

71
00:05:36,500 --> 00:05:41,270
please go through these yourself and confirm that they are correct. Or, if I've made a mistake, which

72
00:05:41,270 --> 00:05:42,480
is entirely possible,

73
00:05:42,590 --> 00:05:43,670
please let me know.
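
One way to check these by machine is the short sketch below. It assumes the alphabetical word order used in this example and a naive whitespace split; the function name count_vector is just a hypothetical helper.

```python
from collections import Counter

# Vocabulary in the order assumed in this example.
vocab = ["and", "cats", "eggs", "hate", "I", "like"]

documents = [
    "I like eggs",
    "I hate cats",
    "I like cats and I like eggs",
]

def count_vector(document, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(document.split())  # naive whitespace split
    return [counts[word] for word in vocab]

vectors = [count_vector(doc, vocab) for doc in documents]
for vec in vectors:
    print(vec)
# [0, 0, 1, 0, 1, 1]
# [0, 1, 0, 1, 1, 0]
# [1, 1, 1, 0, 2, 2]
```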

74
00:05:48,290 --> 00:05:53,450
OK, so now that we know how to convert text into vectors by counting, let's return to our original

75
00:05:53,450 --> 00:05:57,770
task, which was to classify between biology and physics documents.

76
00:05:58,490 --> 00:06:00,860
Let's suppose that our counting scheme is very simple.

77
00:06:00,860 --> 00:06:03,770
We only count the words mitochondria and gravity.

78
00:06:04,910 --> 00:06:10,400
Now, if you don't know, the mitochondria is a component of the cell, and thus this would be associated

79
00:06:10,400 --> 00:06:11,360
with biology.

80
00:06:11,870 --> 00:06:15,620
On the other hand, gravity is usually involved with the study of physics.

81
00:06:16,310 --> 00:06:19,340
So suppose that we did our counting for all of our documents.

82
00:06:19,700 --> 00:06:21,080
What might we expect to get?

83
00:06:21,860 --> 00:06:24,110
Well, we would expect to get something like this.

84
00:06:24,740 --> 00:06:30,770
Most of the biology documents appear along the mitochondria axis, while most of the physics documents

85
00:06:30,770 --> 00:06:32,630
appear along the gravity axis.

86
00:06:33,080 --> 00:06:37,550
Of course, sometimes biology documents might contain the word gravity and vice versa.

87
00:06:38,570 --> 00:06:40,280
But clearly, there's a pattern here.

88
00:06:40,760 --> 00:06:44,810
I can draw a neat line that separates the documents of the two classes.

89
00:06:45,230 --> 00:06:51,410
That is to say, using these count vectors, I can now classify between documents about biology and

90
00:06:51,410 --> 00:06:52,760
documents about physics.

91
00:06:53,420 --> 00:06:58,760
So suppose I have a new document, and I'd like to automatically determine which class the document

92
00:06:58,760 --> 00:07:02,000
belongs to without having to read the document myself.

93
00:07:02,630 --> 00:07:08,510
What I could do is use my computer program to count up the number of times each word appears, thus

94
00:07:08,510 --> 00:07:11,210
converting it into a vector in this space.

95
00:07:11,840 --> 00:07:15,350
Then I look at which side of the line the vector falls on,

96
00:07:15,500 --> 00:07:18,410
and that would be the class I predict for my new document.
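
As a rough sketch of that idea, the code below counts the two words and checks which side of a line the resulting vector falls on. The boundary used here (mitochondria count equals gravity count) is a hypothetical hand-drawn line; in practice a machine learning model would learn the line from the training data.

```python
def count_features(document):
    """Count only the two words of interest: [mitochondria, gravity]."""
    words = document.lower().split()
    return [words.count("mitochondria"), words.count("gravity")]

def predict(document):
    """Classify by which side of the line mito == grav the vector falls on."""
    mito, grav = count_features(document)
    # Hypothetical boundary; ties default to physics in this sketch.
    return "biology" if mito > grav else "physics"

new_document = "the mitochondria produces energy while gravity acts on the cell mitochondria"
print(count_features(new_document))  # [2, 1]
print(predict(new_document))         # biology
```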

97
00:07:19,130 --> 00:07:20,930
OK, so hopefully that's pretty simple.

98
00:07:25,700 --> 00:07:30,530
Now, there are a few things we have yet to discuss about converting text into vectors by counting.

99
00:07:31,130 --> 00:07:33,050
These are all very practical issues.

100
00:07:33,530 --> 00:07:37,760
That is, you wouldn't be able to code this up in Python without solving these issues first.

101
00:07:39,200 --> 00:07:44,720
So the first issue is to recognize that text is represented in a computer by a string.

102
00:07:45,260 --> 00:07:51,740
But our algorithm involves counting individual words, so the detail we have to consider is how do we

103
00:07:51,740 --> 00:07:57,680
get a single string of text which contains multiple words with punctuation and other complications

104
00:07:57,980 --> 00:08:00,200
into single words that we can count?

105
00:08:00,770 --> 00:08:04,520
This is a process called tokenization, which we will discuss shortly.
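
To make the problem concrete, here is one very simple tokenizer sketch. The regular expression and the choice to lowercase everything are assumptions for illustration, not the method this course settles on.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("I like cats, and I like eggs!"))
# ['i', 'like', 'cats', 'and', 'i', 'like', 'eggs']
```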

106
00:08:05,780 --> 00:08:08,900
The second issue is one which we sort of alluded to earlier.

107
00:08:09,410 --> 00:08:14,000
How do we know which component of our document vectors corresponds to which word?

108
00:08:14,510 --> 00:08:16,400
How do we keep track of this information?

109
00:08:17,210 --> 00:08:22,010
Clearly, we're going to need some kind of mapping that will tell us which position corresponds with

110
00:08:22,010 --> 00:08:22,760
which word.

111
00:08:23,360 --> 00:08:27,740
Again, we will discuss this shortly, but I want you to start thinking now about how this might be

112
00:08:27,740 --> 00:08:28,130
done.
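
One possible answer, sketched under the assumption that documents are already tokenized, is a dictionary from word to position. The names build_word2idx and vectorize are hypothetical helpers, and positions here follow order of first appearance rather than alphabetical order.

```python
def build_word2idx(tokenized_docs):
    """Assign each new word the next free position, in order of first appearance."""
    word2idx = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in word2idx:
                word2idx[token] = len(word2idx)
    return word2idx

def vectorize(tokens, word2idx):
    """Build a count vector; words missing from the mapping are skipped."""
    vector = [0] * len(word2idx)
    for token in tokens:
        idx = word2idx.get(token)
        if idx is not None:
            vector[idx] += 1
    return vector

train_docs = [["i", "like", "eggs"], ["i", "hate", "cats"]]
word2idx = build_word2idx(train_docs)
print(word2idx)  # {'i': 0, 'like': 1, 'eggs': 2, 'hate': 3, 'cats': 4}
print(vectorize(["i", "like", "dogs"], word2idx))  # [1, 1, 0, 0, 0]
```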

113
00:08:32,870 --> 00:08:37,460
The next issue I want to discuss in this lecture is how can we implement this in code?

114
00:08:38,120 --> 00:08:39,919
Well, there are two ways to do this.

115
00:08:40,640 --> 00:08:46,190
The first way, which doesn't require much thinking, simply uses the CountVectorizer class inside

116
00:08:46,190 --> 00:08:46,610
scikit-learn.

117
00:08:47,390 --> 00:08:49,820
As usual, this proceeds in three steps.

118
00:08:50,450 --> 00:08:53,810
The first step is to initialize a CountVectorizer object.

119
00:08:54,410 --> 00:08:58,790
The second step is to fit the CountVectorizer object to your training corpus.

120
00:08:59,270 --> 00:09:04,400
You may also call fit_transform, which will give you back the matrix of counts and fit the model at

121
00:09:04,400 --> 00:09:05,240
the same time.

122
00:09:06,080 --> 00:09:08,720
The third step is to transform the test corpus.

123
00:09:09,290 --> 00:09:14,720
Remember that in general, we don't want to fit on the whole data set at the same time, since the object

124
00:09:14,720 --> 00:09:17,750
may store parameters that depend on the data you passed in.

125
00:09:18,530 --> 00:09:23,510
For example, if we have a word that appears in the test set but not the train set, you wouldn't want

126
00:09:23,510 --> 00:09:28,250
to include that in your count vectors as your model wouldn't know what to do with those words anyway,

127
00:09:28,460 --> 00:09:30,890
since we never saw them during the training process.
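
The three steps can be sketched as follows, assuming scikit-learn is installed. The two tiny corpora are hypothetical; note that the test document contains a word, "dogs", that never appears in training.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_corpus = ["I like eggs", "I hate cats"]  # hypothetical training documents
test_corpus = ["I like dogs"]                   # "dogs" never appears in training

# Step 1: initialize the object.
vectorizer = CountVectorizer()

# Step 2: fit on the training corpus and get its count matrix in one call.
X_train = vectorizer.fit_transform(train_corpus)

# Step 3: transform the test corpus using the vocabulary learned in step 2.
X_test = vectorizer.transform(test_corpus)

# With default settings the text is lowercased and one-character tokens
# such as "I" are dropped, so the learned vocabulary has four words.
print(sorted(vectorizer.vocabulary_))  # ['cats', 'eggs', 'hate', 'like']
print(X_test.toarray())                # [[0 0 0 1]]
```

The unseen word "dogs" simply gets no column, so only "like" is counted in the test vector.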

128
00:09:35,620 --> 00:09:41,170
Now, the second way to implement our counting method is to simply implement it ourselves using Python

129
00:09:41,170 --> 00:09:45,880
and potentially NumPy or SciPy. Although we won't look at how to do this in code,

130
00:09:46,150 --> 00:09:49,390
you are strongly encouraged to do it on your own as an exercise.

131
00:09:49,780 --> 00:09:55,030
It's easy enough that you can do it by yourself without assistance, and it's a good way to gain experience

132
00:09:55,030 --> 00:09:56,080
with writing code.

133
00:09:56,470 --> 00:09:58,030
So please give this a try.

134
00:10:02,740 --> 00:10:07,840
As a side note, the reason you would want to use SciPy instead of NumPy is because it has sparse

135
00:10:07,840 --> 00:10:08,650
matrices.

136
00:10:09,280 --> 00:10:12,720
Suppose that we have N documents and a vocabulary size of V.

137
00:10:13,420 --> 00:10:19,390
Our data will ultimately be stored in a matrix of size N by V: N rows and V columns.

138
00:10:20,050 --> 00:10:23,950
Now, it's pretty likely that most documents will not contain most words.

139
00:10:24,250 --> 00:10:25,900
That's just the nature of language.

140
00:10:26,410 --> 00:10:31,840
As an example, suppose that our data is a collection of articles from Wikipedia, and our vocabulary

141
00:10:31,840 --> 00:10:34,000
is pretty much all of the English language.

142
00:10:34,540 --> 00:10:39,820
Well, clearly any single Wikipedia article will not contain every word in the English language.

143
00:10:40,570 --> 00:10:46,150
More likely, most, if not all, Wikipedia articles will only contain a very small fraction of the

144
00:10:46,150 --> 00:10:52,930
possible words in the English language, and as such, our final N by V matrix will consist mostly

145
00:10:52,930 --> 00:10:53,720
of zeros.

146
00:10:54,520 --> 00:10:59,650
Now, it turns out that there are very efficient ways to store matrices when they consist of mostly

147
00:10:59,650 --> 00:11:00,400
zeros.

148
00:11:00,700 --> 00:11:02,710
These are called sparse matrices.

149
00:11:03,250 --> 00:11:07,270
At this point, the details aren't very important, but it's just something to be aware of.

150
00:11:07,810 --> 00:11:11,380
I recommend for now to read the documentation if you want to learn more.

151
00:11:11,770 --> 00:11:15,130
But at this time, it's not strictly necessary for what we're going to do.

152
00:11:15,640 --> 00:11:20,860
So if you do the exercise of implementing the CountVectorizer yourself, you may want to look into

153
00:11:20,860 --> 00:11:21,890
using these.
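
To illustrate the idea only, the sketch below stores just the non-zero counts of each row in a dictionary. This is a plain-Python illustration and not how SciPy stores data internally; in practice a format such as scipy.sparse.csr_matrix serves this purpose. The helper name to_sparse_rows is hypothetical.

```python
def to_sparse_rows(dense_matrix):
    """Keep only the non-zero counts of each row as {column: count}."""
    return [
        {col: value for col, value in enumerate(row) if value != 0}
        for row in dense_matrix
    ]

# N = 3 documents, V = 6 vocabulary words (count vectors from the toy example).
dense = [
    [0, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 2, 2],
]
sparse = to_sparse_rows(dense)
print(sparse[2])  # {0: 1, 1: 1, 2: 1, 4: 2, 5: 2}

stored = sum(len(row) for row in sparse)
print(stored, "of", len(dense) * len(dense[0]), "entries stored")  # 11 of 18 entries stored
```

With a realistic vocabulary of tens of thousands of words, the fraction of stored entries would be far smaller than in this toy case.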

154
00:11:26,680 --> 00:11:32,320
OK, so the final topic I want to touch on in this lecture is the concept of vector normalization.

155
00:11:33,100 --> 00:11:38,590
Now suppose that some of the documents in our dataset are very long, while others are very short.

156
00:11:39,220 --> 00:11:45,070
Our resulting count vectors might have large size disparities simply because long documents contain

157
00:11:45,070 --> 00:11:46,030
more words.

158
00:11:46,480 --> 00:11:48,070
More words means higher counts.

159
00:11:48,340 --> 00:11:50,230
Higher counts means larger vectors.

160
00:11:50,800 --> 00:11:56,230
And this might be a bad thing since, as mentioned, we want similar documents to be close to each other

161
00:11:56,230 --> 00:11:57,430
in vector space.

162
00:11:59,020 --> 00:12:01,300
So there are two ways to normalize vectors.

163
00:12:02,230 --> 00:12:08,050
Firstly, note that our count vectors always contain non-negative numbers. Either the value is zero

164
00:12:08,050 --> 00:12:12,610
because the word didn't appear or the value is greater than zero if the word did appear.

165
00:12:12,970 --> 00:12:15,400
But we can't have a negative count for any word.

166
00:12:16,090 --> 00:12:20,500
So one way to normalize a vector is to make it have an L2 norm of one.

167
00:12:21,190 --> 00:12:26,170
As you recall from your high school math studies, this involves dividing by the square root of the

168
00:12:26,170 --> 00:12:28,030
sum of squares of each element.

169
00:12:28,330 --> 00:12:30,460
This is basically the length of the vector.

170
00:12:32,200 --> 00:12:37,930
The second method is to simply divide by the sum so that the sum of the elements in every document vector

171
00:12:37,930 --> 00:12:38,530
is one.

172
00:12:39,340 --> 00:12:44,770
As you'll see, when we study probabilistic models, this gives us something like the probability that

173
00:12:44,770 --> 00:12:46,480
each word appears in the document.

174
00:12:47,050 --> 00:12:50,080
So that's one way to interpret this method of normalization.

175
00:12:52,140 --> 00:12:58,620
Note that an equivalent method is to make the L1 norm of the count vector one. The reason why this

176
00:12:58,620 --> 00:13:02,580
is equivalent is because we already know that all the counts are non-negative.

177
00:13:03,210 --> 00:13:07,650
Technically, the L1 norm is the sum of all the absolute values of each element.

178
00:13:08,160 --> 00:13:13,350
But since we already know all the values are non-negative, the absolute values are equal to the original

179
00:13:13,350 --> 00:13:13,830
values.

180
00:13:14,010 --> 00:13:16,410
And this is just equal to dividing by the sum.

181
00:13:17,340 --> 00:13:21,120
So either way, this will ensure that the sum of all the elements is one.
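
Both normalizations can be sketched in a few lines of plain Python. The function names are hypothetical, and returning an all-zero vector unchanged is an assumption made here to avoid dividing by zero.

```python
import math

def l2_normalize(vector):
    """Divide by the square root of the sum of squares (the vector's length)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def l1_normalize(vector):
    """Divide by the sum of absolute values; for counts this is just the sum."""
    total = sum(abs(x) for x in vector)
    return [x / total for x in vector] if total > 0 else list(vector)

counts = [1, 1, 1, 0, 2, 2]  # document three from the toy example
l1 = l1_normalize(counts)
l2 = l2_normalize(counts)
print(sum(l1))                 # approximately 1.0
print(sum(x * x for x in l2))  # approximately 1.0
```

For this vector the sum is 7 and the length is the square root of 11, so the two methods scale the same counts by different amounts.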

182
00:13:22,810 --> 00:13:27,880
As a final note, I want to mention that the CountVectorizer inside scikit-learn does not implement

183
00:13:27,880 --> 00:13:28,960
any normalization.

184
00:13:29,560 --> 00:13:32,710
However, TF-IDF does, which we will learn about later.

185
00:13:33,190 --> 00:13:38,620
So if you want to build count vectors with normalization, you would use TF-IDF and not the

186
00:13:38,620 --> 00:13:39,400
CountVectorizer.

