1
00:00:11,070 --> 00:00:16,110
So in this lecture, we'll look at a more advanced method of summarizing text known as text rank.

2
00:00:16,740 --> 00:00:21,900
In my opinion, this tends to work better than the simple TFI Taf method we previously discussed.

3
00:00:22,590 --> 00:00:27,480
Now, because this method is quite advanced, these lectures have been split up into multiple parts.

4
00:00:28,140 --> 00:00:32,400
This first theory lecture will look at intuition only, which will be for everyone.

5
00:00:33,060 --> 00:00:38,190
The second lecture will look at how tech's rank really works, which is based on Markov chains and requires

6
00:00:38,190 --> 00:00:41,670
a decent amount of knowledge about probability and linear algebra.

7
00:00:42,420 --> 00:00:46,830
And the reason we need to do this is because there is no library to do what we're going to do.

8
00:00:47,520 --> 00:00:52,050
In order to write the code, you'll have to understand what the code is doing, which means you'll have

9
00:00:52,050 --> 00:00:53,190
to understand the math.

10
00:00:53,880 --> 00:00:56,160
But note that I have to qualify this statement.

11
00:00:56,670 --> 00:01:00,990
There is a library you can use where you simply pass in some text and get a summary.

12
00:01:01,470 --> 00:01:04,410
So that will be the last example I show you in this section.

13
00:01:05,190 --> 00:01:10,500
The reason that this is not the main code is because plugging your text into one line of code doesn't

14
00:01:10,500 --> 00:01:12,450
really help you understand what's going on.

15
00:01:13,050 --> 00:01:18,390
In fact, when it's only one line of code, it doesn't really even matter which method you use because

16
00:01:18,390 --> 00:01:20,340
every method looks exactly the same.

17
00:01:20,970 --> 00:01:23,610
So it's not a great learning experience in that respect.

18
00:01:24,270 --> 00:01:29,370
However, if you're more application minded and you don't really have the skill or desire to learn about

19
00:01:29,370 --> 00:01:34,890
how this works, you can feel free to skip to this later lecture, which I've called Text, Summarization

20
00:01:34,890 --> 00:01:36,570
and Python are the easy way.

21
00:01:37,290 --> 00:01:40,920
Now that being said, let's continue on to how tech's rank works.

22
00:01:45,500 --> 00:01:51,530
So the best way to understand how tech's ring works is to first review how the TF IDF method works.

23
00:01:52,730 --> 00:01:55,190
As you recall, this is a very simple process.

24
00:01:55,760 --> 00:01:58,400
We begin by splitting the document into sentences.

25
00:01:58,910 --> 00:02:04,850
We then compute ATF IDF matrix on those sentences, which gives us a matrix for each row, represents

26
00:02:04,850 --> 00:02:07,610
a sentence in each column represents a word.

27
00:02:08,240 --> 00:02:13,100
In other words, every sentence becomes a vector of 250 of values for each word.

28
00:02:13,760 --> 00:02:20,120
We then score each sentence by taking the average of the TFR of components after we have a score for

29
00:02:20,120 --> 00:02:20,870
each sentence.

30
00:02:21,080 --> 00:02:24,350
The summary is simply made up of the top scoring sentences.

31
00:02:25,640 --> 00:02:28,070
So where does TEX rank fit into this picture?

32
00:02:28,760 --> 00:02:34,700
Well, Tex Rank is simply an alternative method of scoring each sentence instead of simply taking the

33
00:02:34,700 --> 00:02:36,590
average of the TFI D of values.

34
00:02:36,860 --> 00:02:39,350
We use a more advanced scoring methodology.

35
00:02:40,160 --> 00:02:43,160
Importantly, note that all the other steps remain the same.

36
00:02:43,760 --> 00:02:49,430
We still split the documents into sentences and we still compute the TFI Taf matrix, and we still form

37
00:02:49,430 --> 00:02:51,920
the summary by taking the top scoring sentences.

38
00:02:52,490 --> 00:02:55,580
The only difference is in how we compute those scores.

39
00:03:00,130 --> 00:03:02,170
So how does techs rank scoring work?

40
00:03:03,010 --> 00:03:08,110
Well, to understand this, you really need to understand Google's PageRank, which is what tech's rank

41
00:03:08,110 --> 00:03:08,860
is based on.

42
00:03:09,730 --> 00:03:15,640
As you recall, Page Rank was a state of the art ranking method for search engines 20 or so years ago,

43
00:03:15,640 --> 00:03:19,900
and it's the reason why Google became one of the largest tech companies in the world.

44
00:03:21,310 --> 00:03:23,800
So what is the intuition behind Page Rank?

45
00:03:24,610 --> 00:03:29,980
Well, we first begin by recognizing that the internet is made up of web pages, and these web pages

46
00:03:30,250 --> 00:03:32,920
are what will be potentially returned to search results.

47
00:03:33,610 --> 00:03:37,750
Thus, our goal is to compute a score for each of these web pages.

48
00:03:38,440 --> 00:03:43,750
Once we have a score for each web page, we can then rank the web pages by this score and then sort

49
00:03:43,750 --> 00:03:45,850
our search results, according to this ranking.

50
00:03:46,600 --> 00:03:48,130
So this should sound familiar.

51
00:03:49,390 --> 00:03:52,090
Now the question is how do we score each web page?

52
00:03:56,670 --> 00:04:02,550
So the secret to scoring each Web page is to take what is called a random walk in terms of web pages.

53
00:04:02,580 --> 00:04:07,680
This simply means that for each web page, we randomly select a link on that web page.

54
00:04:08,010 --> 00:04:14,220
Go to the next web page and repeat this process after doing a bunch of fancy math, which is the subject

55
00:04:14,220 --> 00:04:15,540
of a more advanced lecture.

56
00:04:15,810 --> 00:04:17,399
A surprising thing happens.

57
00:04:18,329 --> 00:04:24,390
It turns out that the probability of landing on any single web page after a certain amount of time remains

58
00:04:24,390 --> 00:04:25,110
constant.

59
00:04:25,770 --> 00:04:29,460
It also turns out that it doesn't even matter which web page you started on.

60
00:04:30,240 --> 00:04:35,400
That is to say, no matter where you start, if you just keep going from web page to web page randomly,

61
00:04:35,670 --> 00:04:38,340
you'll eventually settle on some fixed distribution.

62
00:04:39,090 --> 00:04:44,310
So if you imagine a very small internet of just a few web pages, here's what it might look like.

63
00:04:44,940 --> 00:04:51,150
As you can see, the chance of landing on Page B is thirty eight point four percent for Page C. It's

64
00:04:51,150 --> 00:04:53,490
thirty four point three percent and so forth.

65
00:04:58,260 --> 00:05:03,960
Intuitively, you can imagine that a webpage with more links to it would be more popular and hence you'd

66
00:05:03,960 --> 00:05:06,810
have a higher chance of landing on that web page.

67
00:05:07,410 --> 00:05:12,660
On the other hand, web pages with only a few links or no links to it would be less popular, and the

68
00:05:12,660 --> 00:05:15,750
probability of landing on those pages would be smaller.

69
00:05:16,590 --> 00:05:18,150
So hopefully that makes sense.

70
00:05:18,540 --> 00:05:22,800
Popular web pages have a higher chance of being landed on an unpopular web.

71
00:05:22,800 --> 00:05:25,270
Pages have a smaller chance now.

72
00:05:25,290 --> 00:05:26,700
What does this mean for us?

73
00:05:27,420 --> 00:05:31,170
Well, these probabilities are in fact, the Page Rank scores.

74
00:05:31,620 --> 00:05:36,330
Popular web pages get a higher score and less popular web pages get a lower score.

75
00:05:37,470 --> 00:05:42,600
The assumption is that more authoritative, legitimate pages will have more incoming links.

76
00:05:44,130 --> 00:05:48,930
Now, of course, you might wonder, well, what if people just make a bunch of websites and point all

77
00:05:48,930 --> 00:05:53,550
of them to their own web site, which they want to rank highly as an exercise?

78
00:05:53,580 --> 00:05:55,650
I'll let you think about why that won't work.

79
00:06:00,290 --> 00:06:04,760
OK, so now that you know how PageRank works, how does this relate to text rank?

80
00:06:05,480 --> 00:06:10,910
Well, the intuition is this we know that we want to score each sentence similar to how we score each

81
00:06:10,910 --> 00:06:12,380
web page in PageRank.

82
00:06:12,920 --> 00:06:16,460
Therefore, we treat each sentence as if it were a web page.

83
00:06:17,060 --> 00:06:19,970
So in this analogy, every sentence is a web page.

84
00:06:21,500 --> 00:06:26,480
The next step is to think about what's the equivalent of a link from one web page to another.

85
00:06:27,350 --> 00:06:32,600
In the case of sentences, you recall that we computed the TF IDF vector for each sentence.

86
00:06:33,290 --> 00:06:39,770
The number of links from one sentence to another is just the cosine similarity between their two IDF

87
00:06:39,770 --> 00:06:40,550
vectors.

88
00:06:41,180 --> 00:06:47,990
Intuitively, if many sentences have a high cosine similarity to one authoritative sentence, that authoritative

89
00:06:47,990 --> 00:06:49,940
sentence will end up with a higher score.

90
00:06:50,810 --> 00:06:54,830
But note that this is not symmetric, even though cosine similarity is.

91
00:06:56,470 --> 00:07:02,830
And this is because if this one authoritative sentence has a high cosine similarity to many other sentences,

92
00:07:03,130 --> 00:07:08,740
the probability of taking a random walk to any of those other sentences will be small because it has

93
00:07:08,740 --> 00:07:10,840
to spread out among those sentences.

94
00:07:11,830 --> 00:07:17,590
In other words, if you take a random walk among the sentences using cosine similarity as a proxy for

95
00:07:17,590 --> 00:07:22,810
the number of links from one sentence to another, you will eventually have a higher chance of landing

96
00:07:22,810 --> 00:07:24,730
on a more authoritative sentence.

97
00:07:25,420 --> 00:07:31,000
And again, as mentioned as you take this random walk, this eventually settles on a fixed probability

98
00:07:31,000 --> 00:07:37,570
distribution of sentences, and these probabilities are the scores that we will use to build our summary.

99
00:07:38,320 --> 00:07:40,570
So that's the intuition behind text rank.

100
00:07:41,970 --> 00:07:47,190
Now, please note that this final part of the intuition lecture is what I would consider to be intermediate

101
00:07:47,190 --> 00:07:48,030
to advanced.

102
00:07:48,480 --> 00:07:53,070
It does take some time to think about, so don't be afraid of it seems confusing at first.

103
00:07:53,580 --> 00:07:57,780
If you found this last part difficult to understand, please skip over this for now.