1
00:00:11,100 --> 00:00:16,560
So in this lecture, we're going to discuss how to improve the count vectorizer using a method known

2
00:00:16,560 --> 00:00:17,910
as TF-IDF.

3
00:00:19,230 --> 00:00:27,330
So TF-IDF is a very popular method in NLP, specifically for document retrieval and text mining. We'll begin

4
00:00:27,330 --> 00:00:32,610
this lecture by discussing the motivation for why we might need TF-IDF and the drawbacks of the

5
00:00:32,850 --> 00:00:34,440
count vectorizer that it solves.

6
00:00:34,980 --> 00:00:40,650
We'll then look at how TF-IDF works so that you understand why it does what we say it should and how

7
00:00:40,650 --> 00:00:41,940
you might code it yourself.

8
00:00:46,710 --> 00:00:52,050
OK, so let's start from the beginning, which is to answer the question: what is wrong with the count vectorizer?

9
00:00:52,740 --> 00:00:56,110
Let's consider why we use stop words. As you recall,

10
00:00:56,130 --> 00:00:59,250
stop words are a list of words to filter out.

11
00:00:59,670 --> 00:01:01,620
These are words we do not want to keep.

12
00:01:02,310 --> 00:01:05,310
Now, let's recall why we do not want to keep these words.

13
00:01:05,730 --> 00:01:09,000
These are words like "the", "it", and so forth.

14
00:01:09,510 --> 00:01:11,970
So why do we not want to count these words?

15
00:01:12,690 --> 00:01:19,140
The answer is that these words are very unlikely to be helpful for any NLP task we want to do. If we

16
00:01:19,140 --> 00:01:21,750
want to do document retrieval or build a search engine,

17
00:01:21,990 --> 00:01:23,100
these won't be useful.

18
00:01:23,670 --> 00:01:27,900
Every document probably contains the words "of", "the", "it", and so forth.

19
00:01:28,380 --> 00:01:31,500
Doing a search based on these words is likely not a good idea.

20
00:01:32,250 --> 00:01:35,070
What about spam detection or sentiment analysis?

21
00:01:35,670 --> 00:01:39,630
Again, these words are unlikely to be useful for either of these tasks.

22
00:01:40,080 --> 00:01:43,890
Both spam and non-spam emails will contain words like these.

23
00:01:44,190 --> 00:01:48,660
Sentences with positive and negative sentiment will both contain words like these.

24
00:01:53,250 --> 00:01:58,320
Now, here's a question to consider: how do we know that our list of stop words is correct?

25
00:01:58,980 --> 00:02:00,360
The answer is we do not.

26
00:02:00,690 --> 00:02:03,660
In fact, stop words can be application specific.

27
00:02:04,980 --> 00:02:10,979
For example, the word mitochondria might be useful if we want to differentiate biology documents from

28
00:02:10,979 --> 00:02:12,330
documents about physics.

29
00:02:12,840 --> 00:02:17,460
But what if all our documents are about biology and they all contain the word mitochondria?

30
00:02:18,030 --> 00:02:24,570
In this particular case, that word is not so useful. So perhaps it would be useful if we could automatically

31
00:02:24,570 --> 00:02:29,250
identify good words and bad words in terms of what will help us model our documents.

32
00:02:29,820 --> 00:02:35,340
That is, we would like to be able to follow some general principle instead of having to manually curate

33
00:02:35,340 --> 00:02:36,510
a stop word list.

34
00:02:41,160 --> 00:02:46,950
Let's also recall why we like to have stop words in terms of the resulting document vectors. As you

35
00:02:46,950 --> 00:02:53,010
recall, our method of converting text into vectors is by counting. The value for each component in our

36
00:02:53,010 --> 00:02:57,690
vector is the count of how many times each word appears in our document.

37
00:02:58,260 --> 00:02:59,880
Stop words tend to appear a lot.

38
00:03:00,210 --> 00:03:03,630
They are like glue words that help you form a coherent sentence.

39
00:03:04,200 --> 00:03:10,260
Thus, their counts tend to be very large, and large components in a vector will completely overpower

40
00:03:10,260 --> 00:03:12,300
the other smaller components.

41
00:03:12,870 --> 00:03:17,820
For example, if we're trying to find clusters of vectors or we're trying to find the distance between

42
00:03:17,820 --> 00:03:22,650
vectors, those large components will have more influence than the smaller components.

43
00:03:23,070 --> 00:03:26,810
And we don't really want words like "the" to have such a large influence.

44
00:03:31,440 --> 00:03:35,520
So the basis for TF-IDF is essentially what we have already discussed.

45
00:03:36,300 --> 00:03:40,080
We want to ignore words that appear in many different documents.

46
00:03:40,500 --> 00:03:44,880
That's because they don't help us when we want to differentiate between different documents.

47
00:03:45,390 --> 00:03:50,250
In other words, when a word appears in many different documents, we want to scale down the vector

48
00:03:50,250 --> 00:03:51,570
component for that word.

49
00:03:52,230 --> 00:03:58,110
So, for example, the word "the" may have a high count due to appearing many times in a document.

50
00:03:58,680 --> 00:04:05,160
What we would like to do with TF-IDF is to scale this count down based on some count of how many documents

51
00:04:05,160 --> 00:04:05,970
it appears in.

52
00:04:10,760 --> 00:04:18,140
So the complete solution for this issue is the TF-IDF method. TF-IDF stands for term frequency-inverse

53
00:04:18,140 --> 00:04:19,220
document frequency.

54
00:04:19,910 --> 00:04:22,340
Intuitively, it does what I've just described.

55
00:04:22,910 --> 00:04:27,320
Now this is not the final form just yet, but I want to give you the intuition first.

56
00:04:28,400 --> 00:04:30,980
So in the numerator, we have the term frequency.

57
00:04:31,280 --> 00:04:36,950
This is simply the count that is we count how many times the word appears like we normally do.

58
00:04:37,640 --> 00:04:41,630
But instead of just the count, we also divide by the document frequency.

59
00:04:42,140 --> 00:04:48,020
That's what we mean when we say the inverse document frequency, and this value is based on how many

60
00:04:48,020 --> 00:04:49,880
documents this word appears in.

61
00:04:50,720 --> 00:04:55,700
And of course, if the word appears in many documents, then this value will be larger, which will

62
00:04:55,700 --> 00:04:57,440
make the TF-IDF smaller.

63
00:04:58,190 --> 00:05:00,710
OK, so I hope you get the idea. On the top,

64
00:05:00,710 --> 00:05:07,010
we have the count, which increases when the word appears more often in a single document. On the bottom,

65
00:05:07,010 --> 00:05:11,870
we have the document frequency, which increases when the word appears in more documents.

66
00:05:12,290 --> 00:05:17,150
But since it's on the bottom, it decreases the value of the component in our feature vector.

67
00:05:21,990 --> 00:05:26,220
OK, so now that we understand the intuition, let's look at TF-IDF in full.

68
00:05:27,000 --> 00:05:32,010
We'll start with the most common variation, but later in this lecture, we'll discuss other variations

69
00:05:32,010 --> 00:05:32,550
as well.

70
00:05:33,510 --> 00:05:40,020
So TF-IDF is actually the multiplication of two terms: the term frequency and the inverse document

71
00:05:40,020 --> 00:05:40,710
frequency.

72
00:05:41,370 --> 00:05:43,620
We'll discuss each of these terms one by one.

73
00:05:44,640 --> 00:05:50,880
The first thing to notice is the arguments of each term. Notice that the term frequency has two arguments:

74
00:05:50,880 --> 00:05:54,840
t, which is the term, and d, which is the document.

75
00:05:55,500 --> 00:05:59,790
This makes sense because each term will have a different count for each document.

76
00:06:00,300 --> 00:06:05,850
For example, document one might contain the word mitochondria five times, while document two might

77
00:06:05,850 --> 00:06:07,980
contain the word gravity zero times.

78
00:06:09,090 --> 00:06:13,200
However, the IDF term only has one argument, which is the term t.

79
00:06:13,770 --> 00:06:20,100
This makes sense because the IDF isn't specific to any particular document, only to a particular term.

80
00:06:20,760 --> 00:06:23,700
The IDF is measured by summing over all documents.

81
00:06:24,180 --> 00:06:30,360
For example, we can say the word mitochondria appears in 10 out of 100 documents in our corpus.

82
00:06:35,120 --> 00:06:38,270
OK, so let's start by looking at the term frequency component.

83
00:06:38,900 --> 00:06:44,210
As mentioned, this component is simply the count, which is the same thing as the count vectorizer.

84
00:06:44,960 --> 00:06:50,000
In other words, when you call the transform function with scikit-learn's CountVectorizer, this TF

85
00:06:50,000 --> 00:06:51,680
matrix is what you get back.

86
00:06:52,580 --> 00:06:55,790
Also note that it should be obvious why this is a matrix.

87
00:06:56,150 --> 00:06:59,360
There are two arguments corresponding to the rows and columns.

88
00:07:00,050 --> 00:07:02,870
It's also helpful to think of the size of this matrix.

89
00:07:03,950 --> 00:07:07,550
Suppose that we have V unique terms and N documents.

90
00:07:08,120 --> 00:07:14,150
In this case, our TF matrix would be of size N by V or V by N, depending on how you orient the

91
00:07:14,150 --> 00:07:16,880
matrix. By convention,

92
00:07:16,910 --> 00:07:22,910
although we write it with t first and then d, we typically store this matrix as an N by V matrix.

93
00:07:23,510 --> 00:07:28,490
Of course, this makes sense, since this is the same shape as the output of the count vectorizer.
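
To make the shape concrete, here is a minimal pure-Python sketch of a term frequency matrix stored as N by V. The three toy documents, the whitespace tokenization, and the names `docs`, `vocab`, `word_to_col`, and `tf` are assumptions for illustration only; this is not how scikit-learn implements it.

```python
from collections import Counter

# Hypothetical toy corpus (not from the lecture).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "mitochondria are the powerhouse",
]

# Naive whitespace tokenization, just for this sketch.
tokenized = [doc.split() for doc in docs]

# Vocabulary: one column per unique term.
vocab = sorted({word for doc in tokenized for word in doc})
word_to_col = {word: j for j, word in enumerate(vocab)}

N, V = len(docs), len(vocab)

# tf[i][j] = how many times term j appears in document i (N rows, V columns).
tf = [[0] * V for _ in range(N)]
for i, doc in enumerate(tokenized):
    for word, count in Counter(doc).items():
        tf[i][word_to_col[word]] = count

print(N, V)
print(tf[0][word_to_col["the"]])
```

Each row is one document and each column is one term, which matches the N by V orientation described above.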

94
00:07:33,100 --> 00:07:35,380
The next step is to consider the IDF term.

95
00:07:36,160 --> 00:07:42,370
Now, earlier, I said that for intuition only you can think of this as one over the document frequency.

96
00:07:42,940 --> 00:07:45,070
However, that's not precisely correct.

97
00:07:45,700 --> 00:07:51,250
What we actually want to do is represent this as a proportion, and then take the log of this value.

98
00:07:52,090 --> 00:07:59,290
So suppose that N(t) is the number of documents that contain the term t. As before, N by itself is

99
00:07:59,290 --> 00:08:00,940
the total number of documents.

100
00:08:01,510 --> 00:08:07,060
Therefore, N(t) divided by N is the proportion of documents that contain the term t.

101
00:08:07,750 --> 00:08:11,290
We then take the inverse of this proportion, and then we take the log.

102
00:08:11,650 --> 00:08:12,850
This is the IDF.
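
Written out, the definition just described looks like the following. The symbols tf(t, d), idf(t), and N(t) are notation assumed here from the spoken description, since the slides themselves are not part of this transcript.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{N(t)}
```

Here N is the total number of documents and N(t) is the number of documents that contain the term t.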

103
00:08:17,470 --> 00:08:20,710
So the only curious part about this is why we take the log.

104
00:08:21,580 --> 00:08:24,490
The first thing to note is that it doesn't break any rules.

105
00:08:24,970 --> 00:08:30,880
Since the log is a monotonically increasing function, if N over N(t) is larger, then the log of that

106
00:08:30,880 --> 00:08:31,990
will also be larger.

107
00:08:32,650 --> 00:08:37,450
Thus, the TF-IDF will go down as the term appears in more and more documents.

108
00:08:38,169 --> 00:08:41,590
The main idea is that the log squashes down large values.

109
00:08:42,130 --> 00:08:47,470
So imagine that the term T appears in just one document, but we have one million documents.

110
00:08:47,980 --> 00:08:51,370
Without the log term, we would end up multiplying by one million.

111
00:08:52,240 --> 00:08:57,310
With the log term, we scale this down to about 13.8, which is a much more sensible

112
00:08:57,310 --> 00:09:00,310
value in terms of feature vectors for machine learning.
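
The one-million-document example can be checked in a couple of lines. This is only a sketch using the natural log; the variable names are made up for illustration.

```python
import math

N = 1_000_000   # total number of documents
n_t = 1         # number of documents containing the term t

ratio = N / n_t          # what we would multiply by without the log
idf = math.log(ratio)    # natural log squashes the large value

print(ratio)             # 1000000.0
print(round(idf, 1))     # 13.8
```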

113
00:09:01,660 --> 00:09:04,480
Note that there are deeper reasons for taking the log as well.

114
00:09:04,900 --> 00:09:08,590
This involves information theory, which is outside the scope of this course.

115
00:09:08,950 --> 00:09:12,820
But please check the extra reading if you would like to learn more.

116
00:09:17,520 --> 00:09:23,490
OK, so now that you understand how TF-IDF works, let's discuss how one might apply it in Python.

117
00:09:24,240 --> 00:09:24,960
As before,

118
00:09:24,990 --> 00:09:27,120
we'll discuss how to do this in scikit-learn.

119
00:09:27,870 --> 00:09:33,270
In fact, you'll notice that this is very easy because there is a class called TfidfVectorizer,

120
00:09:33,510 --> 00:09:36,600
which has essentially the same API as the count vectorizer.

121
00:09:37,320 --> 00:09:39,090
Therefore, the steps remain the same.

122
00:09:40,660 --> 00:09:45,010
We begin by creating an instance of the TfidfVectorizer class.

123
00:09:45,640 --> 00:09:48,730
The next step is to fit the vectorizer to the training set.

124
00:09:49,270 --> 00:09:55,240
We can also transform this into a TF-IDF matrix in a single step by calling fit_transform.

125
00:09:56,570 --> 00:10:02,780
The next step is to transform the test set by calling transform. Note again that we do not want to

126
00:10:02,780 --> 00:10:08,480
fit to the test set because our vocabulary and IDF values should come from the train set.

127
00:10:10,050 --> 00:10:15,600
Note that like the count vectorizer, the TfidfVectorizer also has many options you can choose

128
00:10:15,600 --> 00:10:16,560
in the constructor.

129
00:10:17,100 --> 00:10:22,410
So, for example, you can use stop_words, your own tokenizer, strip_accents, and so forth.
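
The steps above can be sketched as follows. The toy train and test documents are assumptions for illustration, and the constructor options shown are just two of the ones mentioned. Note also that scikit-learn's default IDF is a smoothed variant, so the numbers will not exactly match the plain log of N over N(t).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data (not from the lecture).
train_docs = [
    "mitochondria produce energy in the cell",
    "gravity pulls objects toward the earth",
    "the cell membrane protects the cell",
]
test_docs = [
    "energy and gravity in physics",
]

# Step 1: create the vectorizer, optionally passing constructor options.
vectorizer = TfidfVectorizer(stop_words="english", strip_accents="unicode")

# Step 2: fit to the training set and transform it in a single call.
X_train = vectorizer.fit_transform(train_docs)

# Step 3: only transform the test set, so the vocabulary and IDF values
# come from the training set.
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)
```

Both matrices have the same number of columns, because the test set reuses the vocabulary learned from the train set.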

130
00:10:26,960 --> 00:10:32,330
So at this point in the lecture, we understand how TF-IDF works and what issue it solves.

131
00:10:33,230 --> 00:10:39,230
We know that the TF IDF is made up of two terms the term frequency and the inverse document frequency.

132
00:10:40,070 --> 00:10:44,120
The next step in this lecture is to discuss some variations on these items.

133
00:10:44,660 --> 00:10:48,470
Note that none of these variations are necessarily better than the others.

134
00:10:49,460 --> 00:10:54,950
As always, if you want to know whether or not something will work on your data set, the best and only

135
00:10:54,950 --> 00:10:57,710
option is to try it on your data set.

136
00:11:02,430 --> 00:11:05,310
So let's look at some variations for the term frequency.

137
00:11:06,540 --> 00:11:12,690
One method is to simply represent them using binary values, zero or one: one if the word appears in

138
00:11:12,690 --> 00:11:15,030
the document and zero if it does not.

139
00:11:15,510 --> 00:11:18,840
In fact, this method can be used in the count vectorizer as well.

140
00:11:19,890 --> 00:11:25,350
Another method, which is sometimes presented as the default for the term frequency, is to normalize

141
00:11:25,350 --> 00:11:29,230
the count by dividing by the sum over all terms in the document.

142
00:11:29,850 --> 00:11:35,730
As you recall, the result is that each term now represents the proportion of time that the word appears

143
00:11:35,730 --> 00:11:36,660
in the document.

144
00:11:37,770 --> 00:11:40,470
This can also be thought of as a type of normalization.

145
00:11:41,850 --> 00:11:45,450
Yet another method is to take the log of one plus the count.

146
00:11:45,960 --> 00:11:49,560
This has the effect of reducing the influence of extreme values.

147
00:11:50,250 --> 00:11:53,700
Again, recall that the log function squashes down large values.

148
00:11:54,090 --> 00:11:56,520
The larger they are, the more they get squashed.
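
As a rough sketch, the three term frequency variations just described can be computed with NumPy from a matrix of raw counts. The small `counts` matrix is made up for illustration.

```python
import numpy as np

# Hypothetical raw counts: 2 documents (rows) by 3 terms (columns).
counts = np.array([
    [5.0, 0.0, 1.0],
    [2.0, 2.0, 0.0],
])

# Binary: 1 if the word appears in the document, 0 if it does not.
tf_binary = (counts > 0).astype(float)

# Normalized: divide each count by the sum over all terms in the document.
tf_normalized = counts / counts.sum(axis=1, keepdims=True)

# Log: log(1 + count), which squashes down large counts.
tf_log = np.log1p(counts)

print(tf_binary)
print(tf_normalized)
print(tf_log)
```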

149
00:12:01,100 --> 00:12:03,950
OK, so now let's look at some variations of the IDF.

150
00:12:05,290 --> 00:12:11,560
One alternative is to use the smooth IDF. In this case, we add one to the denominator, and we also

151
00:12:11,560 --> 00:12:13,030
add one after taking the log.

152
00:12:13,840 --> 00:12:19,360
Basically, this prevents us from ever getting a value of zero, which is possible if N(t) is equal to

153
00:12:19,360 --> 00:12:19,840
N.

154
00:12:20,470 --> 00:12:23,170
In that case, we get log of one, which is zero.

155
00:12:23,890 --> 00:12:26,630
This prevents us from possibly dividing by zero

156
00:12:26,650 --> 00:12:31,480
later in the code. Another option is the IDF max.

157
00:12:32,020 --> 00:12:38,050
In this case, instead of using N, we use the maximum term count from the same document, and thus

158
00:12:38,050 --> 00:12:43,270
the ratio inside the log is relative to the current document instead of the whole data set.

159
00:12:44,920 --> 00:12:51,750
Yet another variation is the probabilistic IDF. In this case, we use N minus N(t) instead of just

160
00:12:51,760 --> 00:12:57,310
N. Note that this quantity gives us the log odds, also known as the logit.

161
00:12:58,840 --> 00:13:04,630
One final variation on the IDF I want to discuss is the trivial IDF, where we just set it to one.

162
00:13:05,320 --> 00:13:09,190
In this case, the TF-IDF just becomes the term frequency itself.
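
Here is a minimal NumPy sketch of several of these IDF variations, assuming a corpus of 100 documents and made-up document counts `n_t`. The IDF max variant is left out because it depends on the current document rather than only on the term. The formulas follow the spoken descriptions above, so treat the exact forms as assumptions.

```python
import numpy as np

N = 100                                         # total number of documents
n_t = np.array([1.0, 10.0, 50.0, 90.0])         # documents containing each term

# Standard IDF: log of the inverse proportion.
idf_standard = np.log(N / n_t)

# Smooth IDF: add one to the denominator, and add one after taking the log.
idf_smooth = np.log(N / (1.0 + n_t)) + 1.0

# Probabilistic IDF: use N - n_t in the numerator, giving the log odds.
# Undefined (minus infinity) if a term appears in every document.
idf_probabilistic = np.log((N - n_t) / n_t)

# Trivial IDF: always one, so TF-IDF reduces to the term frequency.
idf_trivial = np.ones_like(n_t)

print(idf_standard)
print(idf_smooth)
print(idf_probabilistic)
print(idf_trivial)
```

In every non-trivial variant, the value goes down as the term appears in more documents.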

163
00:13:13,950 --> 00:13:21,720
So one final variation on the TF-IDF I want to discuss is normalizing the entire TF-IDF vector. As

164
00:13:21,720 --> 00:13:24,390
you recall, we can do the same thing for count vectors.

165
00:13:25,170 --> 00:13:32,370
So to remind you, what we do is we take the existing TF-IDF vector and then we divide it by its L2 norm.

166
00:13:32,850 --> 00:13:36,630
This makes it so that the length of every TF-IDF vector is one.

167
00:13:37,440 --> 00:13:43,260
As you recall, one advantage of this method is that it makes ranking by Euclidean distance or cosine

168
00:13:43,260 --> 00:13:45,270
distance yield the same results.

169
00:13:46,620 --> 00:13:51,960
Now, unlike the count vectorizer, this type of normalization is implemented in scikit-learn.

170
00:13:52,410 --> 00:13:57,600
So if you'd like to use it, you simply set the norm argument in the TfidfVectorizer constructor.

171
00:13:58,170 --> 00:14:02,160
This takes in two possible values, which can be either "l1" or "l2".

172
00:14:03,390 --> 00:14:09,750
Note that L2 is the default, so if you don't pass in anything, your TF-IDF vectors will all be normalized

173
00:14:09,780 --> 00:14:11,040
to have unit length.
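
A quick way to see the effect of the norm argument is to compare the row lengths with and without it. The toy documents below are assumptions for illustration; passing `norm=None` is used here only to show the unnormalized case.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not from the lecture).
docs = [
    "mitochondria produce energy in the cell",
    "gravity pulls objects toward the earth",
    "the cell membrane protects the cell",
]

# Default: norm="l2", so every row is scaled to unit length.
X_l2 = TfidfVectorizer().fit_transform(docs).toarray()

# norm=None turns the normalization off for comparison.
X_raw = TfidfVectorizer(norm=None).fit_transform(docs).toarray()

l2_lengths = np.linalg.norm(X_l2, axis=1)
raw_lengths = np.linalg.norm(X_raw, axis=1)

print(l2_lengths)    # each entry is (numerically) 1.0
print(raw_lengths)   # generally not 1.0
```

Dividing each row of `X_raw` by its own length reproduces `X_l2`, which is the normalization step described above.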