1
00:00:11,190 --> 00:00:15,210
So in this lecture, we'll be looking at the notebook for how to summarize text.

2
00:00:15,900 --> 00:00:19,830
We'll begin by doing all our imports, all of which you've seen before in this course.

3
00:00:24,870 --> 00:00:28,400
The next step is to download the necessary files from NLTK.

4
00:00:34,660 --> 00:00:40,810
The next step is to download our dataset, which is the BBC News data once again. As mentioned, please

5
00:00:40,810 --> 00:00:45,250
feel free to try this on any document you like to test out how well it works.

6
00:00:51,150 --> 00:00:54,990
The next step is to load in our data using pd.read_csv.

7
00:00:59,260 --> 00:01:03,940
The next step is to call head() to remind ourselves what is inside this DataFrame.

8
00:01:07,290 --> 00:01:09,840
As you recall, we have text and labels.

9
00:01:13,320 --> 00:01:18,150
The next step is to choose a random document. Note that I've chosen a business document.

10
00:01:22,900 --> 00:01:27,010
The next step is to define a wrap function, which helps us print the results more nicely.
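
A minimal sketch of what such a wrap helper might look like, assuming it is built on the standard library's textwrap.fill. The name wrap, the default width, and the sample text are assumptions for illustration, not necessarily the notebook's exact code.

```python
import textwrap

def wrap(text, width=70):
    """Wrap each line of `text` to `width` columns for nicer printing."""
    # Wrap line by line so the existing paragraph breaks are preserved.
    return "\n".join(
        textwrap.fill(line, width=width, fix_sentence_endings=True)
        for line in text.split("\n")
    )

# Placeholder text: a title line, a blank line, then one long body line.
sample = "Some title\n\n" + "This is a placeholder sentence. " * 5
print(wrap(sample, width=40))
```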

11
00:01:32,010 --> 00:01:35,400
The next step is to print our chosen document to see what we got.

12
00:01:40,040 --> 00:01:45,740
OK, so it's "Christmas sales worst since 1981", an article about how sales are not good.

13
00:01:46,460 --> 00:01:50,420
Note that for these articles, the title is included with the article text.

14
00:01:51,260 --> 00:01:53,450
In fact, it's the first line in the text.

15
00:01:54,050 --> 00:01:59,270
This is actually quite useful because the title can be thought of as a summary for the text.

16
00:01:59,960 --> 00:02:04,970
Therefore, one way to check whether or not our summary is good is to see if the title gives us the

17
00:02:04,970 --> 00:02:07,340
same information as our own summary.

18
00:02:11,990 --> 00:02:15,230
The next step is to tokenize the document into sentences.

19
00:02:15,950 --> 00:02:18,650
Now you'll notice that this line seems a bit complex.

20
00:02:18,830 --> 00:02:23,900
And this is because we're trying to remove the title from the article before we do any processing.

21
00:02:25,010 --> 00:02:30,470
So to explain what this does, we first call iloc[0], which gives us the document as a string.

22
00:02:31,400 --> 00:02:35,420
Once we have this string, we then call the split function to split on new lines.

23
00:02:35,990 --> 00:02:39,500
But we only want to split once since we only want to remove the title.

24
00:02:39,920 --> 00:02:44,010
So we also pass in a 1. Since we split it once,

25
00:02:44,030 --> 00:02:46,370
this will give us back a list of two items.

26
00:02:46,670 --> 00:02:50,960
The first of which is the title and the second of which is the article text.

27
00:02:51,770 --> 00:02:56,060
Therefore, we choose the value at index one, which gives us back the text.
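
The title-removal step described above can be sketched on a plain string as follows. The document text here is a placeholder; in the notebook the string would come from the DataFrame via iloc[0], and the body would then be passed to a sentence tokenizer.

```python
# Placeholder document: the first line is the title, the rest is the body.
doc = "Some title\n\nFirst body sentence. Second body sentence.\nMore text."

# maxsplit=1 splits only at the first newline, giving [title, remaining text].
parts = doc.split("\n", 1)
title, body = parts[0], parts[1]

print(title)
print(body.strip())
```

Without the second argument, split would break the body at every newline, and index 1 would no longer hold the whole article text.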

28
00:03:01,420 --> 00:03:05,020
The next step is to create our TF-IDF vectorizer object.

29
00:03:05,620 --> 00:03:09,580
Note that we're using English stop words and also L1 normalization.

30
00:03:10,210 --> 00:03:15,910
This ensures that we won't be biased towards longer sentences, which would have a higher score simply

31
00:03:15,910 --> 00:03:18,310
due to the fact that they contain more words.

32
00:03:23,900 --> 00:03:29,600
The next step is to call fit_transform, which gives us back our TF-IDF matrix, which as usual, we

33
00:03:29,600 --> 00:03:30,500
call X.

34
00:03:36,350 --> 00:03:41,570
The next step is to write a function to score a sentence, given its TF-IDF representation.

35
00:03:42,440 --> 00:03:48,590
As you can see, it takes in a single row from the TF-IDF matrix. Inside the function,

36
00:03:48,600 --> 00:03:51,590
we start by selecting only the non-zero values.

37
00:03:52,250 --> 00:03:56,930
Once we have those values, which I've called little x, we then call the mean function.

38
00:03:57,920 --> 00:04:03,260
As you recall, this is exactly what we described in the previous lecture, which was to take the average

39
00:04:03,260 --> 00:04:04,850
of all the non-zero values.
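
A minimal sketch of this scoring rule, assuming the row is passed as a dense 1-D NumPy array. The real TF-IDF matrix is typically sparse, so the notebook's version may index it differently; the empty-row guard is also an addition for illustration.

```python
import numpy as np

def get_sentence_score(tfidf_row):
    """Return the mean of the non-zero TF-IDF values in one row."""
    x = tfidf_row[tfidf_row != 0]
    # Guard against a row with no non-zero entries (e.g. only stop words).
    return x.mean() if x.size > 0 else 0.0

# Made-up row: three non-zero weights out of five terms.
row = np.array([0.0, 0.5, 0.0, 0.25, 0.25])
print(get_sentence_score(row))
```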

40
00:04:11,190 --> 00:04:16,110
In this next block of code, we're going to compute the score for each sentence and store them in a

41
00:04:16,110 --> 00:04:17,430
new array called scores.

42
00:04:18,450 --> 00:04:25,050
We begin by initializing an array of zeros with size equal to the number of sentences. We then do a for

43
00:04:25,290 --> 00:04:29,280
loop through each index from zero up to the number of sentences.

44
00:04:30,150 --> 00:04:36,870
Note that at this point, we don't need the sentences themselves, but only the TF-IDF matrix. Inside

45
00:04:36,870 --> 00:04:37,230
the loop,

46
00:04:37,230 --> 00:04:43,350
we index the TF-IDF matrix by grabbing the ith row and passing this into the function called

47
00:04:43,350 --> 00:04:44,400
get_sentence_score.

48
00:04:45,030 --> 00:04:50,130
This will return the score for the ith sentence, at which point we can store it inside our scores

49
00:04:50,130 --> 00:04:50,730
array.
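
The loop described above might look like the following sketch. The matrix X here is a small made-up dense array standing in for the TF-IDF matrix, and the scoring helper is redefined so the block runs on its own.

```python
import numpy as np

def get_sentence_score(tfidf_row):
    """Mean of the non-zero values in one row (0.0 if there are none)."""
    x = tfidf_row[tfidf_row != 0]
    return x.mean() if x.size > 0 else 0.0

# Made-up dense stand-in for the TF-IDF matrix: 3 sentences, 4 terms.
X = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.2, 0.3, 0.5],
    [1.0, 0.0, 0.0, 0.0],
])

# One score per sentence, filled in row by row.
scores = np.zeros(X.shape[0])
for i in range(X.shape[0]):
    scores[i] = get_sentence_score(X[i])

print(scores)
```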

50
00:04:57,220 --> 00:05:02,080
Now, as a mini exercise, you should recognize that there is a way to make the above two blocks of

51
00:05:02,080 --> 00:05:03,460
code more efficient.

52
00:05:04,180 --> 00:05:08,020
In particular, it's possible to compute all the scores in one step.

53
00:05:08,710 --> 00:05:14,720
So consider how you might accomplish that and then implement it yourself. As a sanity check,

54
00:05:14,740 --> 00:05:19,840
compare it with the existing code to ensure that both methods give you the same result.

55
00:05:24,180 --> 00:05:30,030
OK, so the next step is to sort the scores. As you recall, it's not the scores themselves we really

56
00:05:30,030 --> 00:05:33,660
care about, but rather their ordering. As such,

57
00:05:33,660 --> 00:05:39,400
what we really need is the sort index, which we obtain by calling argsort instead of just sort.

58
00:05:39,770 --> 00:05:43,170
Note that I've also negated the scores, since by default

59
00:05:43,410 --> 00:05:48,090
it sorts in ascending order, whereas we would like the top scoring sentences first.
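
As a small illustration of the negation trick, here is a sketch using made-up scores. The variable names are assumptions, not necessarily the ones used in the notebook.

```python
import numpy as np

scores = np.array([0.10, 0.40, 0.25, 0.05])  # made-up sentence scores

# argsort returns the indices that would sort the array in ascending order,
# so sorting the negated scores yields indices from highest to lowest score.
sort_idx = np.argsort(-scores)

print(sort_idx)          # indices of sentences, best first
print(scores[sort_idx])  # the scores in descending order
```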

60
00:05:54,020 --> 00:05:59,240
The next step is to generate our summary. Note that I've made a few comments here to explain that there

61
00:05:59,240 --> 00:06:03,410
are many different options for how to choose which sentences to include.

62
00:06:04,430 --> 00:06:10,280
To review, the simplest method, which is the one we'll use, is to simply pick the top N sentences.

63
00:06:11,030 --> 00:06:16,820
Another method is to pick the top N words or even the top N characters. As you recall,

64
00:06:16,850 --> 00:06:21,800
this can be useful for applications such as search engines, where the space to include your summary

65
00:06:21,800 --> 00:06:22,550
is limited.

66
00:06:24,380 --> 00:06:30,050
Another option is to use a percentage instead of just the number, and the final options are to pick

67
00:06:30,050 --> 00:06:35,480
the sentences above some threshold, like the average score or some multiple of the average score.

68
00:06:36,410 --> 00:06:41,360
Note that I've made one final remark here, which is that using the methods where we pick the sentences

69
00:06:41,360 --> 00:06:45,170
above some threshold, it's not necessary to sort by score.

70
00:06:46,010 --> 00:06:48,950
Instead, we might wish to keep the sentences in order.

71
00:06:49,670 --> 00:06:51,150
This could make a lot of sense.

72
00:06:51,170 --> 00:06:55,760
For example, if you're reading a story, you don't want to read the end of the story before the beginning

73
00:06:55,760 --> 00:06:56,480
of the story.

74
00:06:57,140 --> 00:07:02,300
If it's a news article, that could also make sense since news articles are written such that the most

75
00:07:02,360 --> 00:07:05,900
general facts show up first and the details show up later.

76
00:07:06,680 --> 00:07:11,900
So these comments here are probably more important to consider compared to the actual code itself.
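
To make the threshold option concrete, here is a hedged sketch that keeps sentences scoring above the average while preserving document order. The sentences, the scores, and the multiple of 1.0 are all made up for illustration.

```python
import numpy as np

sentences = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
scores = np.array([0.10, 0.40, 0.25, 0.05])  # made-up scores

# Threshold = some multiple of the average score (1.0 is an arbitrary choice).
threshold = 1.0 * scores.mean()

# A boolean mask keeps the sentences in document order, so no sorting is needed.
keep = scores > threshold
summary = [s for s, k in zip(sentences, keep) if k]

print(summary)
```

Note that with this approach the length of the summary is not fixed in advance; it depends on how many sentences clear the threshold.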

77
00:07:15,460 --> 00:07:21,340
In any case, the code is as follows: we simply loop through the first five elements of the sort index.

78
00:07:21,970 --> 00:07:22,810
Inside the loop,

79
00:07:22,840 --> 00:07:26,260
we then print the score along with the corresponding sentence.

80
00:07:32,560 --> 00:07:38,660
OK, so the summary is as follows. A number of retailers have already reported poor figures for December.

81
00:07:39,230 --> 00:07:43,490
However, reports from some high street retailers highlight the weakness of the sector.

82
00:07:44,240 --> 00:07:50,660
The ONS revised the annual 2004 rate of growth down from the 5.9 percent estimated in November

83
00:07:50,930 --> 00:07:52,370
to 3.2 percent.

84
00:07:53,030 --> 00:07:58,340
Our view is the Bank of England will keep its powder dry and wait to see the big picture. A British

85
00:07:58,340 --> 00:08:03,200
Retail Consortium survey found that Christmas 2004 was the worst in 10 years.

86
00:08:04,700 --> 00:08:06,170
OK, so not a bad summary.

87
00:08:10,760 --> 00:08:16,220
Now, as you recall, I mentioned that one way to check whether or not this is a good summary is to

88
00:08:16,220 --> 00:08:19,820
compare it with the title, which is itself kind of like a summary.

89
00:08:20,780 --> 00:08:25,880
Note that this is the same code as before, except that we take the first element instead of the second,

90
00:08:25,880 --> 00:08:27,110
which gives us the title.

91
00:08:31,030 --> 00:08:35,890
In fact, we've already seen this title, which is Christmas sales worst since 1981.

92
00:08:38,049 --> 00:08:43,270
Now, at this point, we should realize that whether or not our summary is good is really a subjective

93
00:08:43,270 --> 00:08:43,960
concept.

94
00:08:44,530 --> 00:08:47,920
It's really up to you whether or not the summary suits your needs.

95
00:08:48,430 --> 00:08:53,230
There isn't really a concept of accuracy, since it's possible that many different summaries would be

96
00:08:53,230 --> 00:08:54,040
sufficient.

97
00:08:54,730 --> 00:08:59,860
There are summaries which would be obviously bad, but again, that's more of a subjective assessment.

98
00:09:00,910 --> 00:09:07,360
Finally, note that objective metrics for summaries do exist, such as the ROUGE score, but understanding

99
00:09:07,360 --> 00:09:11,320
that requires even further study, which is outside the scope of this course.

100
00:09:12,070 --> 00:09:16,870
Furthermore, it's still not perfect, so the benefit of you learning it at this point does not outweigh

101
00:09:16,870 --> 00:09:17,590
the cost.

102
00:09:21,160 --> 00:09:26,500
Now, you'll recognize that the above process took quite a bit of code, and this is the kind of thing that

103
00:09:26,500 --> 00:09:28,120
we would like to put inside a function.

104
00:09:28,990 --> 00:09:33,250
So the next step is to implement a function that performs the steps we just went through.

105
00:09:34,120 --> 00:09:37,540
This function will take in a document and print the corresponding summary.

106
00:09:38,350 --> 00:09:41,830
Since you've seen all these steps before, we'll just go through them quickly.

107
00:09:42,700 --> 00:09:44,860
So our input variable is called text.

108
00:09:45,520 --> 00:09:48,880
Inside the function, we tokenize the text into sentences.

109
00:09:52,660 --> 00:09:55,960
We then convert the sentences into a TF-IDF matrix.

110
00:09:58,380 --> 00:10:03,150
The next step is to score each sentence using the get_sentence_score function we defined above.

111
00:10:05,390 --> 00:10:09,500
Once we have the scores, we then sort the scores and get back the sort index.

112
00:10:11,780 --> 00:10:16,910
Once we have the sort index, we then loop through the first five values and inside the loop, we print

113
00:10:16,910 --> 00:10:19,940
out the score along with the corresponding sentence.

114
00:10:25,870 --> 00:10:29,230
OK, so let's test out our new function on a random document.

115
00:10:29,800 --> 00:10:33,040
This time I've chosen a document from the entertainment class.

116
00:10:38,360 --> 00:10:44,540
OK, so the summary is: the Black Eyed Peas won awards for best R&B video and sexiest video, both for

117
00:10:44,540 --> 00:10:45,190
Hey Mama.

118
00:10:45,650 --> 00:10:51,020
The ceremony was held at the Luna Park fairground in Sydney Harbour and was hosted by the Osbourne family.

119
00:10:51,530 --> 00:10:55,670
Goodrem, Green Day and the Black Eyed Peas took home two awards each. Other

120
00:10:55,670 --> 00:10:59,210
winners included Green Day, voted best group, and the Black Eyed Peas.

121
00:10:59,690 --> 00:11:04,550
The VH1 First Music Award went to Cher, honouring her achievements within the music industry.

122
00:11:05,480 --> 00:11:07,790
OK, so that seems like a pretty good summary.

123
00:11:10,990 --> 00:11:15,100
Now, let's print out the title of this document, which is kind of like a reference summary for us

124
00:11:15,100 --> 00:11:15,940
to compare to.

125
00:11:19,780 --> 00:11:23,800
OK, so the title is "Goodrem wins top female MTV prize".

126
00:11:24,460 --> 00:11:31,030
Now, in my opinion, our summary is better because it tells us what multiple bands have won. The title

127
00:11:31,030 --> 00:11:33,310
only tells us about one particular artist.

128
00:11:33,700 --> 00:11:38,110
And it's not clear why this is the most relevant fact, such that it would make up the title.

129
00:11:39,130 --> 00:11:44,470
The downside to our summary is that it doesn't tell us what awards ceremony the article is about, which

130
00:11:44,470 --> 00:11:47,020
is something you would probably expect of a summary.

131
00:11:50,070 --> 00:11:55,830
In fact, because of how news articles are written, an even simpler way to summarize them would be to

132
00:11:55,830 --> 00:11:57,990
take the title and the first few lines.

133
00:11:58,830 --> 00:12:01,500
So let's see what we get when we print the whole article.

134
00:12:06,920 --> 00:12:12,290
As you can see, the first sentence does, in fact, tell us what awards ceremony it's about, which

135
00:12:12,290 --> 00:12:14,720
is the Australian MTV Music Awards.

136
00:12:15,470 --> 00:12:20,300
So again, practically speaking, if it's a news article, remember that they are always written from

137
00:12:20,300 --> 00:12:21,710
general to specific.

138
00:12:22,250 --> 00:12:27,420
So it may be simpler just to take the first few lines. In terms of machine learning,

139
00:12:27,440 --> 00:12:33,300
recall that this is an example of applying domain knowledge. Since we know how news articles are designed,

140
00:12:33,320 --> 00:12:35,090
we can use that information.
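
The "title plus first few lines" baseline mentioned above could be sketched like this. The function name lead_summary, its n_lines parameter, and the sample document are hypothetical and only for illustration.

```python
def lead_summary(doc, n_lines=2):
    """Return the title and the first `n_lines` non-empty lines of the body."""
    # Drop blank lines so the title is always the first remaining line.
    lines = [line.strip() for line in doc.split("\n") if line.strip()]
    if not lines:
        return "", []
    title, body = lines[0], lines[1:]
    return title, body[:n_lines]

# Placeholder document with a title and three body lines.
doc = "Some title\n\nLine one.\nLine two.\nLine three."
title, lead = lead_summary(doc)
print(title)
print(lead)
```

This relies entirely on the general-to-specific structure of news articles, so it should not be expected to work as well on other kinds of text.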