1
00:00:11,090 --> 00:00:15,620
So in this lecture, we will be looking at a notebook to implement Tex rank in Python.

2
00:00:16,460 --> 00:00:21,320
Now please note that this is an advanced notebook where we will implement the concepts we previously

3
00:00:21,320 --> 00:00:21,980
discussed.

4
00:00:22,580 --> 00:00:27,140
If all you want is a function that can do all this work, simply skip to the next lecture.

5
00:00:28,340 --> 00:00:33,560
Now, as promised, most of this notebook is the same as the previous notebook, and the only difference

6
00:00:33,560 --> 00:00:35,300
is in how we compute the scores.

7
00:00:35,930 --> 00:00:41,810
Because of this will be starting this notebook at the point right after we've computed the TFI Taf matrix.

8
00:00:49,940 --> 00:00:53,120
So the next step will be to compute the cosine similarity.

9
00:00:53,900 --> 00:00:58,160
Now this function is very concise since we can just pass an X and get back.

10
00:00:58,160 --> 00:00:58,640
Yes.

11
00:00:59,090 --> 00:01:04,550
But because it's so concise, it may not be clear what it's actually doing if you've never seen it before.

12
00:01:05,390 --> 00:01:10,820
Also note that s is analogous to the Google matrix, which we call G in the theory lectures.

13
00:01:12,270 --> 00:01:18,000
In any case, you recall that the cosine similarity is supposed to be a function that computes the similarity

14
00:01:18,330 --> 00:01:19,770
between two vectors.

15
00:01:20,520 --> 00:01:23,370
But here we are, just passing in the whole Matrix X.

16
00:01:24,030 --> 00:01:25,800
So what does this actually mean?

17
00:01:26,850 --> 00:01:32,160
Well, firstly, note that this function comes from so I could learn if you like, please scroll up

18
00:01:32,160 --> 00:01:34,200
to the top and check the import yourself.

19
00:01:35,730 --> 00:01:41,430
So this function is a special function, which actually does many similarity computations all at once.

20
00:01:42,060 --> 00:01:46,290
It helps to think about what we would do if this function did not exist.

21
00:01:47,190 --> 00:01:50,460
Essentially, what we want to do is a double for loop.

22
00:01:51,150 --> 00:01:56,880
This is because we want the cosine similarity between every sentence and every other sentence.

23
00:01:58,680 --> 00:02:03,420
So suppose we had some function that could compute the similarity between two vectors.

24
00:02:04,050 --> 00:02:10,020
If this is the case, then inside our loop, we would simply compute the similarity between Vector AI

25
00:02:10,229 --> 00:02:11,280
and Vector J.

26
00:02:12,570 --> 00:02:18,450
Furthermore, we would store this similarity in a sentence by sentence matrix called S at position i

27
00:02:18,450 --> 00:02:18,800
j.

28
00:02:19,650 --> 00:02:21,240
So hopefully that straightforward.

29
00:02:21,960 --> 00:02:27,360
Basically, the second learn version of this function does all that work internally so that the matrix

30
00:02:27,360 --> 00:02:30,390
as we get back is exactly what I just described.

31
00:02:36,160 --> 00:02:41,240
Now, just as a sanity check, we're going to check the shape of s if we did this correctly.

32
00:02:41,260 --> 00:02:45,550
This should be a relatively small matrix since the number of sentences is small.

33
00:02:45,790 --> 00:02:47,620
Well, the number of words is large.

34
00:02:51,370 --> 00:02:53,110
OK, so this result makes sense.

35
00:02:53,830 --> 00:02:59,380
Firstly, note that this is a square matrix, which must be the case since its sentences by sentences.

36
00:03:00,010 --> 00:03:03,160
Furthermore, this implies that we have 17 sentences.

37
00:03:06,650 --> 00:03:10,730
As a further sanity check, we can also check the length of our sentences list.

38
00:03:14,290 --> 00:03:16,630
And we get back 17 as expected.

39
00:03:20,670 --> 00:03:26,100
Now, as you recall, our matrix must eventually be a mark of matrix, which means that each row must

40
00:03:26,100 --> 00:03:31,270
come to one which is currently not the case in order for us to make this true.

41
00:03:31,290 --> 00:03:35,280
We need to normalize our matrix by dividing every row by its sun.

42
00:03:36,150 --> 00:03:42,450
Note that we must also pass and keep them as equals true so that the result is a 2D matrix, which ensures

43
00:03:42,450 --> 00:03:45,450
that numpy broadcasting operates the way we expect.

44
00:03:51,420 --> 00:03:56,310
The next step is to check the some of the first arrow of our matrix as a sanity check.

45
00:03:59,500 --> 00:04:01,450
And it seems to one as expected.

46
00:04:04,690 --> 00:04:08,260
Now, as you recall, this matrix is not our final matrix.

47
00:04:08,710 --> 00:04:12,580
We also need to smooth this matrix in case it contains zeros.

48
00:04:13,390 --> 00:04:17,290
To do this, we'll start by creating a uniform transition matrix called you.

49
00:04:18,010 --> 00:04:21,670
It's simply a matrix of 1s divided by the number of sentences.

50
00:04:26,910 --> 00:04:31,560
The next step, which is, again, just a sanity check, is to make sure that the some of the first

51
00:04:31,560 --> 00:04:32,520
row is one.

52
00:04:36,070 --> 00:04:39,010
And as expected, the some of the first row is one.

53
00:04:42,830 --> 00:04:49,810
The next step is to compute our final matrix, which is a convex combination of SC and you basically

54
00:04:49,820 --> 00:04:53,930
we're going to take 85 percent of this and 15 percent of you.

55
00:04:54,680 --> 00:04:59,900
Of course, these proportions must sum to one so that each row in the result still sums to one.

56
00:05:04,920 --> 00:05:09,690
The next step is to just double check that the first row of the result still sums to one.

57
00:05:13,070 --> 00:05:15,680
And as expected, it still seems to one.

58
00:05:19,240 --> 00:05:26,320
OK, so the next step is to compute the eigenvalues and eigenvectors of our matrix in this notebook.

59
00:05:26,350 --> 00:05:31,840
I'm actually going to show you both methods since they both actually work and you can decide which you

60
00:05:31,840 --> 00:05:32,680
think is better.

61
00:05:33,640 --> 00:05:39,940
Now, as mentioned by convention, we think of the eigenvalue equation in terms of column vectors instead

62
00:05:39,940 --> 00:05:40,990
of row vectors.

63
00:05:41,560 --> 00:05:47,260
Because of this, Nampai implements the AISC function to work on column vectors by default, and therefore

64
00:05:47,260 --> 00:05:50,830
we need to transpose our matrix mathematically.

65
00:05:50,830 --> 00:05:56,800
The reason why we need to do this is easy to see if we transpose our state transition equation to be

66
00:05:56,800 --> 00:05:58,090
in the standard format.

67
00:05:58,870 --> 00:06:04,600
When we do this, the state distribution becomes a column vector and our matrix is transposed.

68
00:06:05,350 --> 00:06:09,790
Note that this function gives us back both the eigenvalues and the eigenvectors.

69
00:06:14,820 --> 00:06:21,030
So let's check the eigenvalues to make sure they are what we expect, as you recall, the theory states

70
00:06:21,030 --> 00:06:23,640
that one of these eigenvalues should be one.

71
00:06:27,420 --> 00:06:30,570
As you can see, the first eigenvalue is indeed one.

72
00:06:33,560 --> 00:06:36,350
The next step is to check the corresponding eigenvectors.

73
00:06:37,400 --> 00:06:40,640
Now this requires a bit of understanding of NumPy API.

74
00:06:41,420 --> 00:06:46,230
You might be wondering why is this Colin Ma0 and not just zero by itself?

75
00:06:46,910 --> 00:06:51,290
And the reason is that the eigenvalues are stored as columns and not as rows.

76
00:06:51,830 --> 00:06:57,020
Therefore, if we want the corresponding eigen vector that goes with one, we need to index the first

77
00:06:57,020 --> 00:06:58,790
column and not the first row.

78
00:07:02,220 --> 00:07:04,890
So for now, just make note of these values.

79
00:07:09,110 --> 00:07:15,230
The next step is to test our assumption that this is in fact an eigenvectors corresponding to an eigenvalue

80
00:07:15,230 --> 00:07:20,330
of one, to do this, we simply multiply our eigenvectors by the matrix.

81
00:07:20,660 --> 00:07:22,730
And this should return the same I can vector.

82
00:07:26,440 --> 00:07:28,900
OK, so it appears that the result is the same.

83
00:07:29,170 --> 00:07:31,660
So this is in fact a proper vector.

84
00:07:35,030 --> 00:07:39,200
Now, at this point, if you did this exercise yourself, you may have gotten stuck.

85
00:07:39,800 --> 00:07:45,050
This is because if you haven't noticed, our eigenvectors vector contains some strange values.

86
00:07:45,590 --> 00:07:50,300
In particular, it doesn't look like they seem to one, nor are they even positive.

87
00:07:50,900 --> 00:07:55,160
This doesn't make sense since they're supposed to represent a probability distribution.

88
00:07:55,820 --> 00:08:01,190
So you might be thinking something is wrong, but in fact, this is entirely expected.

89
00:08:01,850 --> 00:08:04,490
As you recall, eigenvectors are not unique.

90
00:08:04,970 --> 00:08:10,400
This is because you can simply multiply both sides of the eigenvalue equation by a constant, and it

91
00:08:10,400 --> 00:08:11,390
would still hold.

92
00:08:12,110 --> 00:08:17,050
And of course, this constant can be absorbed into the eigen vector to make a new eigenvectors vector

93
00:08:17,060 --> 00:08:18,560
parallel with the original.

94
00:08:19,580 --> 00:08:22,010
So what constant should we multiply it with?

95
00:08:23,330 --> 00:08:28,090
The answer is we should divide our eigenvectors by its sum as an exercise.

96
00:08:28,100 --> 00:08:32,990
Please confirm that this will result in all the values being positive and all the values something to

97
00:08:32,990 --> 00:08:33,440
one.

98
00:08:40,770 --> 00:08:46,140
Now, before we move on, I want to show you a more direct method of finding the limiting distribution,

99
00:08:46,470 --> 00:08:51,620
which might look simpler depending on how you think of the problem in this block of code.

100
00:08:51,630 --> 00:08:55,090
We're simply going to find the limiting distribution by brute force.

101
00:08:55,770 --> 00:09:01,140
That is to say, we're going to actually go through the transitions until the distribution converges.

102
00:09:02,430 --> 00:09:06,090
To begin, we'll initialize our limiting distribution to be the uniform.

103
00:09:06,090 --> 00:09:12,090
Distribution will define a threshold of 10 to the minus eight, which is what we'll use to quit our

104
00:09:12,090 --> 00:09:12,540
loop.

105
00:09:13,890 --> 00:09:16,830
We'll also initialize a delta variable to infinity.

106
00:09:17,430 --> 00:09:22,350
Inside the loop, this will store how much the distribution has changed from one step to the next.

107
00:09:23,670 --> 00:09:26,970
Finally, we'll keep track of the number of iterations of the loop.

108
00:09:27,900 --> 00:09:30,720
The next step is to enter a while loop that continues.

109
00:09:30,720 --> 00:09:36,450
As long as Delta is bigger than the threshold inside the loop, we increment it by one.

110
00:09:37,740 --> 00:09:43,200
The next step is to compute the distribution for the next step, which we get from multiplying our distribution

111
00:09:43,200 --> 00:09:43,920
by X.

112
00:09:44,760 --> 00:09:47,400
We'll start this in a temporary variable called P.

113
00:09:48,540 --> 00:09:53,790
The next step is to compute the next delta, which is the sum of absolute differences between the new

114
00:09:53,790 --> 00:09:56,250
distribution and the old distribution.

115
00:09:59,640 --> 00:10:02,370
The next step is to update the distribution to be.

116
00:10:04,050 --> 00:10:08,520
When we exit our loop will print the number of iterations to see how long it took.

117
00:10:13,620 --> 00:10:16,890
OK, so as you can see, this only took 41 steps.

118
00:10:20,810 --> 00:10:23,690
The next step is to just print our limiting distribution.

119
00:10:27,130 --> 00:10:31,240
Although you could simply check whether these numbers are the same by looking at them yourself.

120
00:10:31,630 --> 00:10:34,060
Well, check this numerically later in this notebook.

121
00:10:37,210 --> 00:10:40,990
The next step is to check for some of the values just as a sanity check.

122
00:10:44,160 --> 00:10:46,810
As you can see, the distribution of sums to one.

123
00:10:46,830 --> 00:10:47,820
As expected.

124
00:10:51,400 --> 00:10:56,590
The next step is to compute the sum of absolute differences between the answer we got before and the

125
00:10:56,590 --> 00:10:58,210
answer we got by brute force.

126
00:11:01,850 --> 00:11:06,830
As you can see, the result is on the order of 10 to the minus eight, which makes sense because that's

127
00:11:06,830 --> 00:11:08,270
what we set as a threshold.

128
00:11:11,590 --> 00:11:17,680
The final step in this lecture is to set our scores to the limiting distribution, as you recall, this

129
00:11:17,680 --> 00:11:23,320
distribution actually corresponds to the text rank scores, which we will then use to find the top sentences

130
00:11:23,320 --> 00:11:24,310
in our document.

131
00:11:29,290 --> 00:11:32,800
As before, the next step is to find the sort index for the score.

132
00:11:33,880 --> 00:11:38,650
Note that since the remaining steps are the same as the previous notebook, we'll go through them quickly.

133
00:11:42,650 --> 00:11:46,820
The next step is to print our summary, which is just the top five scoring sentences.

134
00:11:52,480 --> 00:11:56,350
OK, so the summary is the retail sales figures are very weak.

135
00:11:56,710 --> 00:12:01,990
But as Bank of England Governor Mervyn King indicated last night, you don't really get an accurate

136
00:12:01,990 --> 00:12:04,490
impression of Christmas trading until about Easter.

137
00:12:04,510 --> 00:12:05,620
Said Mr Shah.

138
00:12:06,250 --> 00:12:09,940
A number of retailers have already reported poor figures for December.

139
00:12:10,450 --> 00:12:16,030
The warnings echoed an earlier caution from Bank of England Governor Mervyn King not to read too much

140
00:12:16,030 --> 00:12:17,860
into the poor December figures.

141
00:12:18,310 --> 00:12:24,610
Retail sales dropped by one percent on the month in December after a 0.6 percent rise in November.

142
00:12:24,880 --> 00:12:30,880
The Office of National Statistics said clothing retailers and non-specialist stores were the worst hit,

143
00:12:31,210 --> 00:12:35,800
with only internet retailers showing any significant growth, according to the ONS.

144
00:12:36,970 --> 00:12:39,490
So again, this seems like a pretty reasonable summary.

145
00:12:42,960 --> 00:12:48,150
So as before, one thing you can do is check the summary against the title, which is like a summary

146
00:12:48,150 --> 00:12:48,750
itself.

147
00:12:49,770 --> 00:12:54,540
Also, you might want to check this summary against what we got in the previous lecture to see if the

148
00:12:54,540 --> 00:12:59,100
results improved on other examples, I noticed a pretty big improvement.

149
00:13:04,880 --> 00:13:09,950
OK, so as with the previous notebook, we're going to put all the steps into a function which will

150
00:13:09,950 --> 00:13:12,440
take in some text and print out a summary.

151
00:13:13,310 --> 00:13:17,330
This time we won't step through the function since it's the same as what we just went through.

152
00:13:24,050 --> 00:13:29,300
The next step is to test our function on a new article, which I've chosen from the entertainment class.

153
00:13:35,150 --> 00:13:41,690
So this time the summary is good REM Green Day and the Black Eyed Peas took home two awards each, as

154
00:13:41,690 --> 00:13:42,680
well as Best Female.

155
00:13:42,710 --> 00:13:48,020
Goodrem also took home the Pepsi viewer's choice award, whilst Green Day bagged the prize for Best

156
00:13:48,020 --> 00:13:49,850
Rock video for American Idiot.

157
00:13:50,360 --> 00:13:54,410
Other winners included Green Day, voted for Best Group and the Black Eyed Peas.

158
00:13:54,830 --> 00:13:59,780
The Black Eyed Peas won awards for Best R&B video and Sexiest VIDEO, both for Hey Mama.

159
00:14:00,290 --> 00:14:05,840
Local singer and songwriter Missy Higgins took the title of Breakthrough Artist of the Year, with Australian

160
00:14:05,840 --> 00:14:11,600
Idol winner Guy Sebastian taking the honours for Best Pop VIDEO Again, this seems like a reasonable

161
00:14:11,600 --> 00:14:12,140
summary.

162
00:14:15,650 --> 00:14:21,260
As before, it's nice to compare our summary against the title to see how the authors summarize the

163
00:14:21,260 --> 00:14:21,860
story.

164
00:14:25,820 --> 00:14:28,340
And in this sense, our summary seems to work well.