1
00:00:11,020 --> 00:00:17,680
So in this lecture, we will look at how to summarize text by using TF-IDF. I like this method because

2
00:00:17,680 --> 00:00:20,430
it's surprisingly simple at a high level.

3
00:00:20,440 --> 00:00:23,740
The way it works is this: we begin by splitting the document

4
00:00:23,740 --> 00:00:26,050
we want to summarize into sentences.

5
00:00:26,680 --> 00:00:31,180
Once we have our sentences, we're going to come up with a method to score each sentence.

6
00:00:31,690 --> 00:00:36,520
Once we have a score for each sentence, we're going to rank the sentences by those scores.

7
00:00:36,970 --> 00:00:40,030
Our summary will simply be the top scoring sentences.

8
00:00:40,540 --> 00:00:42,460
So pretty simple, in my opinion.

9
00:00:43,330 --> 00:00:49,390
Also note that this is an extractive method because we extract the sentences from the given documents.

10
00:00:49,870 --> 00:00:52,300
There is no need to generate text from scratch.

11
00:00:56,890 --> 00:00:59,740
OK, so let's go through this process in more detail.

12
00:01:00,550 --> 00:01:04,480
As mentioned, the first step is to split our document into sentences.

13
00:01:04,989 --> 00:01:10,450
Note that we call this sentence tokenization, and note that there is a function for this in NLTK.

14
00:01:11,410 --> 00:01:16,630
Now, once we have these sentences, we're going to treat each of them as if they were a separate document

15
00:01:16,990 --> 00:01:19,210
when we build our TF-IDF matrix.

16
00:01:19,990 --> 00:01:24,580
In other words, notice how this is unlike the way we've used TF-IDF in the past.

17
00:01:25,690 --> 00:01:31,960
In previous examples, we used TF-IDF on a set of documents so that each row was a document and each

18
00:01:31,960 --> 00:01:33,070
column was a term.

19
00:01:33,790 --> 00:01:39,490
In this case, since the sentences themselves are documents, what we will get is a matrix of sentences

20
00:01:39,490 --> 00:01:40,480
by terms.

21
00:01:41,530 --> 00:01:45,070
Also notice that this does not require a whole data set of documents,

22
00:01:45,370 --> 00:01:49,180
since everything is computed from the sentences of a single document.

23
00:01:49,930 --> 00:01:51,880
So that's one benefit of these methods.

24
00:01:52,030 --> 00:01:54,400
They don't require a big data set to learn from.
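
To make the sentences-by-terms idea concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the lecture's implementation: the regex splitter `split_sentences` is a naive stand-in for a real sentence tokenizer such as NLTK's `sent_tokenize`, the helper names are hypothetical, and the smoothed IDF formula is just one common variant.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive stand-in for a real sentence tokenizer: split after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_matrix(sentences):
    # Each sentence is treated as its own "document".
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(sentences)
    # Document frequency: the number of sentences that contain each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed IDF (an assumed variant): terms found in many sentences get less weight.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # One row per sentence, one column per term; absent terms stay at zero.
        matrix.append([counts[w] * idf[w] for w in vocab])
    return vocab, matrix


text = "Cats chase mice. Mice eat cheese. Cheese is made from milk."
sentences = split_sentences(text)
vocab, matrix = tfidf_matrix(sentences)
print(len(matrix), "sentences x", len(vocab), "terms")
```

Note that nothing outside the single input text is needed: the document frequencies are counted over that document's own sentences.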

25
00:01:59,130 --> 00:02:03,000
OK, so once we have our TF-IDF matrix, what do we do with it?

26
00:02:03,930 --> 00:02:06,330
Well, we need some way to score each sentence.

27
00:02:07,290 --> 00:02:13,590
The simplest way to do that is to take the average of the non-zero values for each sentence. As you

28
00:02:13,590 --> 00:02:14,170
recall,

29
00:02:14,190 --> 00:02:18,960
each sentence's TF-IDF vector is a row in our TF-IDF matrix.

30
00:02:19,650 --> 00:02:24,600
So for each sentence, which is a row, find all the non-zero values and take their mean.

31
00:02:25,950 --> 00:02:31,710
For example, if we have a row containing the values one, two and three, but the rest are all zeros, then

32
00:02:31,710 --> 00:02:35,170
the result will just be the average of one, two and three, which is two.
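
The scoring rule just described can be sketched in a few lines of Python. The function name `score_sentence` is hypothetical, and returning zero for an all-zero row is an assumption made here to avoid dividing by zero.

```python
def score_sentence(row):
    # Keep only the non-zero entries of this sentence's row.
    nonzero = [v for v in row if v != 0]
    # Mean of the non-zero entries; an all-zero row scores 0.0 (assumed convention).
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


# The row from the example: values one, two and three, the rest zeros.
row = [0, 1, 0, 2, 3, 0, 0]
print(score_sentence(row))
```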

33
00:02:36,810 --> 00:02:39,390
Let's now think about the logic behind this process.

34
00:02:40,020 --> 00:02:47,550
We know that each component of a TF-IDF vector tells us how often a specific word appears, but also recall

35
00:02:47,550 --> 00:02:52,710
that these are weighted by the IDF term, which makes the score lower if it's a word that appears

36
00:02:52,710 --> 00:02:54,870
very often across all sentences.

37
00:02:55,410 --> 00:03:00,210
In other words, unimportant words have a smaller value, while more important words will have a larger

38
00:03:00,210 --> 00:03:00,690
value.

39
00:03:01,620 --> 00:03:07,260
Furthermore, this value will be even larger if an important word appears more often in that sentence.

40
00:03:08,550 --> 00:03:11,940
Now, you might be wondering why take the mean and not the sum?

41
00:03:12,750 --> 00:03:18,450
Well, as you recall, some sentences might be longer and others might be shorter. Since all the values

42
00:03:18,450 --> 00:03:19,230
are positive,

43
00:03:19,260 --> 00:03:23,490
if we take the sum, then the score would be biased towards longer sentences.

44
00:03:24,030 --> 00:03:29,010
By taking the mean, we are asking the model to pick the sentence that contains words with the highest

45
00:03:29,010 --> 00:03:30,510
scores on average.

46
00:03:31,740 --> 00:03:36,540
The second thing you might be wondering is why do we take the non-zero values and not the mean of the

47
00:03:36,540 --> 00:03:37,350
whole vector?

48
00:03:38,160 --> 00:03:42,900
To answer this question, we have to remember that the TF-IDF matrix is very sparse.

49
00:03:43,530 --> 00:03:45,840
That is, most of the values will be zero.

50
00:03:46,590 --> 00:03:52,110
The sentences with the least zeros will be the sentences with a large variety of words, but not necessarily

51
00:03:52,110 --> 00:03:53,550
the most important words.

52
00:03:54,240 --> 00:03:58,410
Of course, you're encouraged to test out different variations to see what works best.
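
As one way to test out the variations, here is a small sketch that compares the three scoring options discussed above. The two rows are made-up numbers chosen for illustration, not real TF-IDF output: a short sentence with one high-weight word and a long sentence with five low-weight words.

```python
def score_sum(row):
    # Sum of all values: grows with the number of words in the sentence.
    return sum(row)


def score_mean_all(row):
    # Mean over the whole row, zeros included: favors rows with fewer zeros.
    return sum(row) / len(row)


def score_mean_nonzero(row):
    # Mean over the non-zero values only.
    nonzero = [v for v in row if v != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


# Hypothetical rows over a six-term vocabulary.
short_row = [3.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one important word
long_row = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # five unimportant words

for name, fn in [
    ("sum", score_sum),
    ("mean of all", score_mean_all),
    ("mean of non-zero", score_mean_nonzero),
]:
    winner = "short" if fn(short_row) > fn(long_row) else "long"
    print(f"{name}: short={fn(short_row):.2f} long={fn(long_row):.2f} -> {winner}")
```

With these particular numbers, only the mean of the non-zero values ranks the short sentence first; the other two prefer the sentence with more words.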

53
00:04:02,960 --> 00:04:07,880
OK, so now that we know how to compute the scores for each sentence, the next question to consider

54
00:04:07,880 --> 00:04:10,160
is what do we do with these scores?

55
00:04:10,910 --> 00:04:15,260
The answer is to sort them and then to pick the sentences with the top scores.

56
00:04:15,890 --> 00:04:18,350
The question really is how do you do this?

57
00:04:19,040 --> 00:04:24,020
And for this, there really are multiple options. You'd have to choose which option works best for you.

58
00:04:24,950 --> 00:04:27,740
I'll discuss a few possibilities, but there could be more

59
00:04:27,770 --> 00:04:29,450
if you want to think about this further.

60
00:04:31,040 --> 00:04:36,680
So the simplest option would be to take the top N sentences, for example, top five, top 10 and

61
00:04:36,680 --> 00:04:37,400
so forth.

62
00:04:38,240 --> 00:04:42,710
Another simple option is to take the top N words or the top N characters.

63
00:04:43,250 --> 00:04:47,810
This could be useful if you're building a search engine, since there is limited space for you to show

64
00:04:47,810 --> 00:04:48,530
your summary.

65
00:04:49,770 --> 00:04:52,260
Yet another method is to use a percentage.

66
00:04:52,680 --> 00:04:57,450
For instance, take the top 10 percent of sentences or the top 10 percent of words.

67
00:04:58,980 --> 00:05:04,710
Another possibility I've seen is to take all the sentences where the score is greater than some threshold,

68
00:05:04,800 --> 00:05:10,140
for instance, the average score. However, that might include too many sentences.

69
00:05:10,500 --> 00:05:15,720
So you can modify that by multiplying the average score by a factor to get a new threshold.

70
00:05:16,320 --> 00:05:20,460
In my opinion, this is also not ideal, since this factor is kind of arbitrary.

71
00:05:21,030 --> 00:05:25,350
So try it if you like and remember to experiment to see what works best.
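
Three of the selection options above (top N sentences, a percentage of sentences, and a threshold based on the average score) can be sketched as follows. The function names and the example scores are hypothetical, and rounding the percentage to at least one sentence is an assumption made for this sketch.

```python
def top_n(scores, n):
    # Indices of the n highest-scoring sentences, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def top_fraction(scores, fraction):
    # Keep a fraction of the sentences, but always at least one (assumed).
    n = max(1, round(len(scores) * fraction))
    return top_n(scores, n)


def above_threshold(scores, factor=1.0):
    # Keep sentences scoring above factor * average score.
    threshold = factor * sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s > threshold]


# Made-up sentence scores for illustration.
scores = [0.2, 0.9, 0.5, 0.7, 0.1]
print(top_n(scores, 2))
print(top_fraction(scores, 0.4))
print(above_threshold(scores))
print(above_threshold(scores, factor=1.5))
```

Raising `factor` tightens the threshold and returns fewer sentences, which is the knob described above as somewhat arbitrary.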