1
00:00:11,120 --> 00:00:16,700
So in this lecture, we will be looking at the notebook where we build TF IDF completely from scratch

2
00:00:17,300 --> 00:00:22,640
for this notebook, once we build our TFI, the effectors, what we will do is pick random documents

3
00:00:22,640 --> 00:00:27,680
from our data set and look at which words end up having the largest TFT of components.

4
00:00:28,370 --> 00:00:33,830
What we expect to see is that words with large TFI of values should display two traits.

5
00:00:34,940 --> 00:00:38,090
Firstly, they should be words that appear often in the documents.

6
00:00:38,450 --> 00:00:40,460
That means their term frequency is high.

7
00:00:41,300 --> 00:00:45,260
Secondly, they should be relatively unique compared to the whole data set.

8
00:00:45,650 --> 00:00:50,600
That means their document frequency is low and hence their inverse document frequency is high.

9
00:00:51,320 --> 00:00:55,700
So this is our hypothesis about what should happen before looking at the results.

10
00:00:58,760 --> 00:01:02,720
So we'll begin this notebook by downloading our BBC news data once again.

11
00:01:03,470 --> 00:01:05,560
Note that this is an arbitrary choice.

12
00:01:05,570 --> 00:01:08,270
You should feel free to choose any data set you like.

13
00:01:14,810 --> 00:01:16,730
The next step is to do our imports.

14
00:01:17,450 --> 00:01:23,390
Note that for this script, our inputs are quite simple because we'll be implementing TF IDF using only

15
00:01:23,390 --> 00:01:24,560
basic components.

16
00:01:30,440 --> 00:01:35,120
The next step is to download the necessary data for MLC, where tokenize are.

17
00:01:40,270 --> 00:01:44,350
The next step is to read, in our case, we're using PD that reads, Yes, we.

18
00:01:48,820 --> 00:01:54,610
The next step is to call DfT head to remind ourselves what our data looks like, although this step

19
00:01:54,610 --> 00:01:56,050
seems to be quite trivial.

20
00:01:56,110 --> 00:01:59,830
I always find that printing things out helps me through the coding process.

21
00:02:06,820 --> 00:02:11,290
The next step is to populate our word to a next mapping, as you recall.

22
00:02:11,320 --> 00:02:18,880
This is necessary so that we know which word maps to which columns in our RDF matrix will begin by initializing

23
00:02:18,880 --> 00:02:19,990
index to zero.

24
00:02:20,140 --> 00:02:22,420
And we're 280X to be an empty dit.

25
00:02:23,200 --> 00:02:27,190
We'll also create an empty list where we will store our tokenized documents.

26
00:02:27,640 --> 00:02:30,070
This is helpful since we'll need these tokens later.

27
00:02:30,280 --> 00:02:31,480
Why don't we do our accounts?

28
00:02:34,080 --> 00:02:39,000
The next step is to loop through the text column of our data frame inside this loop.

29
00:02:39,030 --> 00:02:44,370
We begin by lower casing and tokenizing the current document we call the result words.

30
00:02:45,330 --> 00:02:47,970
The next step is to create an empty list called Duck.

31
00:02:47,970 --> 00:02:54,900
Is it since we will be building our word to index dictionary, there is no need to store the words themselves.

32
00:02:55,410 --> 00:02:59,820
Instead, it would be more efficient to simply store the indices that they map to.

33
00:03:01,830 --> 00:03:05,340
The next step is, do you live through each word inside this loop?

34
00:03:05,460 --> 00:03:09,230
We check whether or not the current word is in word to RDX.

35
00:03:09,750 --> 00:03:14,160
If it is not, then that means this word it doesn't yet have a corresponding index.

36
00:03:14,610 --> 00:03:20,190
So if this is the case, we add this word to our dictionary where the key is the word in the value is

37
00:03:20,190 --> 00:03:21,020
the index.

38
00:03:21,720 --> 00:03:27,360
The next step is to increment index by one so that it has a new value the next time you use it.

39
00:03:28,620 --> 00:03:34,050
At this point, we know that our word has a corresponding index because we either just created it or

40
00:03:34,050 --> 00:03:35,280
it already existed.

41
00:03:35,850 --> 00:03:40,630
Therefore, we can grab its index from word to ADX and then store it inside our list.

42
00:03:40,650 --> 00:03:47,640
Doc as it once we're outside the inner loop, our isn't list is complete or in other words, the sentence

43
00:03:47,640 --> 00:03:52,500
has been converted to integers and then we can store that inside our tokenized docs list.

44
00:04:00,970 --> 00:04:06,790
OK, so at this point, we're to index mapping is complete, and we have converted all of our documents

45
00:04:06,790 --> 00:04:08,740
into lists of integer IDs.

46
00:04:09,400 --> 00:04:11,920
The next step is to create the reverse mapping as well.

47
00:04:12,760 --> 00:04:18,160
So this code iterates through our word to index dictionary and simply makes the value the key and the

48
00:04:18,160 --> 00:04:19,060
key, the value.

49
00:04:19,810 --> 00:04:24,790
Note that this is somewhat inefficient because we're using a dictionary data structure, even though

50
00:04:24,790 --> 00:04:28,900
our indices are only integers from zero up to the vocabulary size.

51
00:04:29,350 --> 00:04:31,960
It would be more efficient to store a list instead.

52
00:04:32,350 --> 00:04:38,980
Since the index to a list is already an integer as an exercise, you may want to think about how do

53
00:04:38,980 --> 00:04:40,390
you store it that way instead?

54
00:04:45,580 --> 00:04:49,090
The next step is to find the number of documents which we will call end.

55
00:04:49,750 --> 00:04:52,540
Note that this is simply the number of rows in the F.

56
00:04:57,730 --> 00:05:01,200
The next step is to find the number of unique words which we will call.

57
00:05:01,210 --> 00:05:04,180
We note that this is simply the size of words.

58
00:05:04,180 --> 00:05:04,990
Why acts?

59
00:05:10,240 --> 00:05:13,720
The next step is to instantiate our term frequency matrix.

60
00:05:14,290 --> 00:05:17,470
Note that this is the same thing as our count vector matrix.

61
00:05:18,130 --> 00:05:23,140
Now, in practice, although it would be more efficient to use a sparse matrix that makes things a bit

62
00:05:23,140 --> 00:05:26,140
more complicated without improving your understanding.

63
00:05:26,890 --> 00:05:32,470
Thus, in this script, we will be using a dense matrix, which is just a regular Nampai array.

64
00:05:33,070 --> 00:05:37,930
Using a sparse matrix will be an exercise to extend what you've learned in this lecture.

65
00:05:39,370 --> 00:05:45,500
OK, so as you can see, we initialize this matrix to be a matrix of zeros of size n by V.

66
00:05:52,070 --> 00:05:58,400
The next step is to populate our term frequency matrix, to do this, we simply loop through our tokenized

67
00:05:58,400 --> 00:05:59,180
documents.

68
00:05:59,810 --> 00:06:04,100
As you recall, we already have them all stored as lists of integer IDs.

69
00:06:05,210 --> 00:06:10,430
Note that we use Enumerate, which gives us the index for the document at the same time, which we will

70
00:06:10,430 --> 00:06:11,750
denote with the letter I.

71
00:06:12,410 --> 00:06:17,480
So it tells us which document we are looking at inside the first.

72
00:06:17,690 --> 00:06:21,560
We have a second loop which goes through each word index in the documents.

73
00:06:22,100 --> 00:06:25,180
Since these are already word indices, we'll call each of them J.

74
00:06:26,180 --> 00:06:29,060
Thus, to populate our term frequency matrix.

75
00:06:29,420 --> 00:06:33,110
We simply increment the value at position I.J by one.

76
00:06:33,890 --> 00:06:38,330
This means that in document I, we saw the J ith word one more time.

77
00:06:44,790 --> 00:06:46,980
The next step is to compute the IDF term.

78
00:06:48,470 --> 00:06:52,760
Note that the IDF term can be derived from the term frequency matrix.

79
00:06:53,300 --> 00:06:55,400
So let's think about how that can be done.

80
00:06:56,360 --> 00:06:59,450
Remember that we will have an IDF value for each word.

81
00:06:59,930 --> 00:07:04,820
Thus, we will have v IDF values or, in other words, a vector of size v.

82
00:07:05,750 --> 00:07:10,280
Thus, what we need to look at are the columns of our term frequency matrix.

83
00:07:10,460 --> 00:07:11,840
One word at a time.

84
00:07:12,680 --> 00:07:16,520
So for each column, the term frequency matrix contains a zero.

85
00:07:16,880 --> 00:07:22,100
If the word did not appear in a specific document and the number greater than zero if it did.

86
00:07:22,820 --> 00:07:27,110
Thus, what we are interested in is when TAF is greater than zero.

87
00:07:27,770 --> 00:07:33,020
By doing this Boolean operation, we get back a Boolean matrix of the same size.

88
00:07:33,620 --> 00:07:36,930
As you recall, a Boolean value is either true or false.

89
00:07:37,340 --> 00:07:41,420
But in Python, it true is treated like one and false is treated like zero.

90
00:07:42,050 --> 00:07:47,630
Therefore, if you want to know how many values are true, we simply need to sum up all the values.

91
00:07:48,140 --> 00:07:50,030
So that's why we call the same function.

92
00:07:50,840 --> 00:07:55,970
But note that if we call the some function with no arguments, it will just sum up the whole matrix

93
00:07:55,970 --> 00:07:57,260
and give us a single number.

94
00:07:58,040 --> 00:08:01,220
And that number will be the number of ones in the whole matrix.

95
00:08:01,520 --> 00:08:03,020
But this is not what we want.

96
00:08:03,680 --> 00:08:09,050
Instead, we want to have the sum over each column or equivalently the sum over each word.

97
00:08:09,620 --> 00:08:12,410
Therefore, we must specify access equals zero.

98
00:08:12,890 --> 00:08:14,660
We call the result document free.

99
00:08:16,040 --> 00:08:21,350
So this is a V size array, which contains the number of documents that each word appears in.

100
00:08:22,100 --> 00:08:24,740
Of course, this is not the IDF vector just yet.

101
00:08:25,400 --> 00:08:31,970
As you recall, the IDF value is n divided by the document frequency and then logged to squash down

102
00:08:31,970 --> 00:08:32,600
the value.

103
00:08:38,080 --> 00:08:41,740
The next step is to complete our computation of the TFI Taf.

104
00:08:42,610 --> 00:08:48,760
Now you might think that this would not work, since t.f is an end by matrix and the IDF is just a v

105
00:08:48,780 --> 00:08:49,600
size vector.

106
00:08:50,320 --> 00:08:56,350
However, recall that NumPy automatically broadcasts when you try to perform operations on objects of

107
00:08:56,350 --> 00:08:57,610
different dimensions.

108
00:08:58,120 --> 00:09:03,790
As an exercise, you might want to try this the slow way to verify that Nampai will do this broadcast

109
00:09:03,790 --> 00:09:04,540
incorrectly.

110
00:09:05,320 --> 00:09:10,510
In fact, you really should do this exercise because it will help you think through how we compute the

111
00:09:10,510 --> 00:09:13,600
TFI t.f using t.f and IDF.

112
00:09:18,750 --> 00:09:23,970
The next step is to set a random scene so that we get consistent results for the next portion of the

113
00:09:23,970 --> 00:09:24,600
notebook.

114
00:09:25,080 --> 00:09:28,650
You can feel free to turn this off if you want to see other examples.

115
00:09:33,870 --> 00:09:39,510
So in the next block of code, what we are going to do is choose random documents from our dataset and

116
00:09:39,510 --> 00:09:42,450
prints out the top five TF IDF terms.

117
00:09:43,680 --> 00:09:48,930
As you recall, our goal is to basically make sure that these makes sense in terms of what we expect

118
00:09:48,990 --> 00:09:49,530
to find.

119
00:09:49,530 --> 00:09:53,010
You have to do so to quickly explain this code.

120
00:09:53,040 --> 00:09:58,590
We start by choosing a random integer from zero up to when this will be the document index.

121
00:10:00,010 --> 00:10:03,220
The next step is to index ADF at this index.

122
00:10:03,610 --> 00:10:06,070
This will give us back one row of ADF.

123
00:10:07,560 --> 00:10:10,050
The next step is to print the label for this row.

124
00:10:10,590 --> 00:10:13,830
This will help us further confirm that our results make sense.

125
00:10:14,160 --> 00:10:17,400
Well, want to make sure that the resulting words correspond to the label?

126
00:10:19,440 --> 00:10:22,170
The next step is to print out the first line of text.

127
00:10:22,770 --> 00:10:26,130
As you recall, each news article contains multiple lines.

128
00:10:26,550 --> 00:10:30,540
However, if we print out the whole article, it will make things more difficult to see.

129
00:10:31,230 --> 00:10:37,140
So to do this, we simply split the text on the new line character and grab the zero with return value.

130
00:10:39,110 --> 00:10:44,420
The next step is to print out the top five terms as determined by their TF IDF value.

131
00:10:45,350 --> 00:10:50,870
We'll begin by grabbing the TF IDF vector at the iith row of our ATF IDF matrix.

132
00:10:51,320 --> 00:10:52,970
We'll call the results scores.

133
00:10:53,990 --> 00:10:56,150
The next step is to sort the scores.

134
00:10:56,930 --> 00:10:58,400
Now there are two things to know.

135
00:10:59,210 --> 00:11:03,710
Firstly, when we call sort this by default sorts in ascending order.

136
00:11:04,310 --> 00:11:09,020
Thus, if we want the top score to come first, then we should negate the scores.

137
00:11:10,580 --> 00:11:15,200
Secondly, it's not the actual score we care about, but the ordering of the score.

138
00:11:15,980 --> 00:11:21,410
Thus, we call our exhort instead of sort, which gives us back the indices in the sorted order.

139
00:11:23,500 --> 00:11:27,920
The next step is to loop through the first five indices inside this loop.

140
00:11:27,940 --> 00:11:30,500
We print out the word corresponding to the index.

141
00:11:31,900 --> 00:11:37,050
So here you can see why we needed the reverse mapping to go from index back to word.

142
00:11:38,050 --> 00:11:42,850
Otherwise, we would have just printed the index and it wouldn't be clear which word it belongs to.

143
00:11:43,900 --> 00:11:45,520
OK, so let's try our function.

144
00:11:51,520 --> 00:11:55,240
OK, so our first two randomly selected article is about sports.

145
00:11:56,290 --> 00:11:58,870
The headline is Athens memories soar above.

146
00:11:58,870 --> 00:12:05,620
Those in the top five terms are Paula Athens 1500 Meter, Her and Kelly.

147
00:12:06,340 --> 00:12:11,350
So from my perspective, all of these makes sense, except for the word her, which seems more like

148
00:12:11,350 --> 00:12:12,220
a stop word.

149
00:12:12,940 --> 00:12:18,490
Perhaps it's the case that this word simply doesn't appear to often in many news documents, and thus

150
00:12:18,490 --> 00:12:20,170
the IDF value is high.

151
00:12:20,830 --> 00:12:26,410
However, the other terms, such as Paula Athens, 500 Meter and Kelly makes sense.

152
00:12:26,800 --> 00:12:31,930
These are likely to be unique to this article while also appearing frequently in the article.

153
00:12:34,180 --> 00:12:35,890
So let's look at another example.

154
00:12:40,490 --> 00:12:42,590
So this article is about politics.

155
00:12:42,830 --> 00:12:46,260
The headline is Clark faces ID cards, rebellion.

156
00:12:46,880 --> 00:12:48,560
So these terms also make sense.

157
00:12:48,950 --> 00:12:51,830
Cards, Clark Rebellion, ID and Bill.

158
00:12:52,610 --> 00:12:58,190
In some sense, you can think of these words as the most important words in the article or the words

159
00:12:58,190 --> 00:12:59,900
that best represent the article.

160
00:13:01,360 --> 00:13:06,820
Notice how these are all terms which are relevant to this article and again, are likely to appear often

161
00:13:06,820 --> 00:13:09,910
in this article, but probably not all articles.

162
00:13:11,780 --> 00:13:13,340
So let's look at another example.

163
00:13:18,650 --> 00:13:24,590
So the next example is another article about sports again, we see terms which look like they would

164
00:13:24,590 --> 00:13:26,600
be unique to this article.

165
00:13:27,770 --> 00:13:33,740
Note that Hingis is a tennis player and we see terms like 30th and 95th, which makes sense because

166
00:13:33,740 --> 00:13:36,710
in tennis, we often care about player rankings.

167
00:13:43,340 --> 00:13:48,320
OK, so now that you understand the basics of how to build a fighter, yes, here are some ways you

168
00:13:48,320 --> 00:13:49,970
could extend this exercise.

169
00:13:50,900 --> 00:13:52,490
The first exercise is simple.

170
00:13:53,210 --> 00:13:57,410
As you recall, one part of the script involves creating the count vectors.

171
00:13:57,950 --> 00:14:03,080
Of course, we already know of some code that can do this, which also returns a sparse matrix.

172
00:14:03,440 --> 00:14:05,990
This is the count victimizer inside can learn.

173
00:14:06,800 --> 00:14:12,410
So this exercise involves replacing our counting code with the count vector riser instead.

174
00:14:13,070 --> 00:14:18,590
This should be relatively simple, but recall that the count victimizer also does another step, which

175
00:14:18,590 --> 00:14:20,540
is to create the word to index mapping.

176
00:14:21,200 --> 00:14:24,860
So you'll have to check the documentation to find out where this is.

177
00:14:26,300 --> 00:14:33,320
The second exercise is more difficult for this exercise, you should use spies see us, our matrix objects

178
00:14:33,650 --> 00:14:35,630
instead of a dense Nampai array.

179
00:14:36,440 --> 00:14:42,260
Note that this is not just a matter of replacing and zeros with the CSR Matrix constructor.

180
00:14:42,950 --> 00:14:48,140
The reason for this is our existing code is not compatible with CSR Matrix.

181
00:14:48,770 --> 00:14:55,220
In particular, the part where we increment the value at position I.J is not allowed or it may be allowed,

182
00:14:55,220 --> 00:14:56,720
but it is very inefficient.

183
00:14:57,230 --> 00:15:03,050
So you should figure out the other ways you can initialize a CSR matrix and use one of those instead.