1
00:00:11,100 --> 00:00:14,520
So in this lecture, we will continue looking at our previous notebook.

2
00:00:14,910 --> 00:00:17,940
But this time we'll be using a library to do the work for us.

3
00:00:18,630 --> 00:00:23,380
We'll begin by installing a package called Sumy, which contains an implementation of TextRank.

4
00:00:24,150 --> 00:00:29,040
Note that the results will be different because it uses a different method of computing similarities.

5
00:00:35,290 --> 00:00:40,660
The next step is to import a few classes from the Sumy package, including a summarizer, a text parser,

6
00:00:40,930 --> 00:00:42,010
and a tokenizer.

7
00:00:42,640 --> 00:00:45,100
At this point, you can guess why these might be useful.

8
00:00:49,880 --> 00:00:52,700
So the next step is to show you how to create a summary.

9
00:00:53,360 --> 00:00:57,020
We first begin by creating an object of type TextRankSummarizer.

10
00:00:57,710 --> 00:01:03,230
Now, unfortunately, this doesn't take in text directly since, as you recall, NLP can work with

11
00:01:03,230 --> 00:01:04,550
many different languages.

12
00:01:05,150 --> 00:01:10,710
So the next step is to create a PlaintextParser object, which will take in our text and the tokenizer

13
00:01:10,710 --> 00:01:11,450
object.

14
00:01:12,050 --> 00:01:17,120
Note that when we create the tokenizer object, we pass in English, which will be used to tokenize

15
00:01:17,130 --> 00:01:20,300
the document correctly since our document is in English.

16
00:01:21,650 --> 00:01:27,170
The final step is to generate our summary by calling the summarizer, passing in the document, and telling

17
00:01:27,170 --> 00:01:28,850
it how many sentences to return.

18
00:01:29,750 --> 00:01:34,790
Note that the document is passed in by calling the document attribute on our parser object.
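As a rough sketch, the steps described so far might look like the following in Python. The import paths follow Sumy's public API as I understand it. The function name and the default of five sentences are made up for illustration, and the tokenizer assumes that NLTK's sentence tokenizer data is installed.

```python
def summarize_with_textrank(text, sentences_count=5):
    """Return the top-ranked sentences of `text` as a list of strings.

    Hypothetical helper for illustration; not part of Sumy itself.
    """
    # Imported lazily so this sketch can be loaded without Sumy installed.
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.summarizers.text_rank import TextRankSummarizer

    summarizer = TextRankSummarizer()

    # The tokenizer is told the language so that it can split the
    # document into sentences and words correctly.
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # The summarizer takes the parsed document, not the raw text,
    # plus the number of sentences to return.
    summary = summarizer(parser.document, sentences_count)

    # Each element is a Sentence object; str() gives its text.
    return [str(sentence) for sentence in summary]
```

Calling summarize_with_textrank(article_text, 5) on an English article would then return up to five sentences, where article_text is whatever string you loaded in the notebook.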

19
00:01:35,960 --> 00:01:41,780
In addition, notice that this library uses the same method we used earlier in the section for choosing

20
00:01:41,780 --> 00:01:44,630
which sentences to keep. As you recall,

21
00:01:44,660 --> 00:01:49,430
there are many ways of doing this, including choosing the top N sentences, the top x percent of

22
00:01:49,430 --> 00:01:51,860
sentences, the top N words, and so forth.
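To make these selection strategies concrete, here is a minimal sketch in plain Python. It assumes you already have one score per sentence (for example from TextRank). The function names and the whitespace-based word count are illustrative choices, not taken from any particular library.

```python
import math


def top_n_sentences(scores, n):
    """Indices of the n highest-scoring sentences, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


def top_fraction(scores, fraction):
    """Indices of the top `fraction` of sentences (at least one)."""
    n = max(1, math.ceil(len(scores) * fraction))
    return top_n_sentences(scores, n)


def top_n_words(sentences, scores, max_words):
    """Take the best-scoring sentences until the word budget would be exceeded."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())  # crude word count on whitespace
        if used + length > max_words:
            break
        chosen.append(i)
        used += length
    return sorted(chosen)
```

All three return sentence indices in their original order, so the summary reads in the same sequence as the source document.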

23
00:01:52,490 --> 00:01:57,440
Thus, it just so happens that the person who wrote this package decided on the same method as we did

24
00:01:57,440 --> 00:01:58,400
in this course.

25
00:02:03,580 --> 00:02:05,200
OK, so let's print our summary.

26
00:02:09,130 --> 00:02:14,470
So note that the summary is a tuple of sentence objects, which makes it hard to see since it goes off

27
00:02:14,470 --> 00:02:15,010
the screen.

28
00:02:18,530 --> 00:02:23,210
The next step is to print our summary by looping through each sentence, casting each sentence to a

29
00:02:23,210 --> 00:02:28,580
string and then using our wrap function to keep the printout within a limited number of columns.
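The wrap function from the notebook is not shown in this lecture, so the following is only a stand-in built on Python's standard textwrap module. The names wrap and format_summary and the 70-column default are assumptions for illustration.

```python
import textwrap


def wrap(text, width=70):
    """Break `text` into lines of at most `width` characters."""
    return textwrap.fill(text, width=width)


def format_summary(summary, width=70):
    """Cast each sentence to a string, wrap it, and join the results."""
    return "\n".join(wrap(str(sentence), width) for sentence in summary)
```

With this in place, print(format_summary(summary)) keeps every line of the printout within the chosen number of columns.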

30
00:02:32,660 --> 00:02:39,200
OK, so this time the summary is: The 21-year-old singer won the award for Best Female Artist, with

31
00:02:39,200 --> 00:02:45,320
Australian Idol runner-up Shannon Noll taking the title of Best Male at the ceremony. As well as Best

32
00:02:45,320 --> 00:02:45,770
Female,

33
00:02:45,770 --> 00:02:50,660
Goodrem also took home the Pepsi Viewer's Choice Award, while Green Day bagged the prize for Best

34
00:02:50,660 --> 00:02:52,400
Rock Video for American Idiot.

35
00:02:52,970 --> 00:02:57,830
The Black Eyed Peas won awards for Best R&B Video and Sexiest Video, both for Hey Mama.

36
00:02:58,310 --> 00:03:03,470
Local singer and songwriter Missy Higgins took the title of Breakthrough Artist of the Year, with Australian

37
00:03:03,470 --> 00:03:07,220
Idol winner Guy Sebastian taking the honors for Best Pop Video.

38
00:03:07,760 --> 00:03:13,010
The ceremony was held at the Luna Park Fairground at Sydney Harbor and was hosted by the Osbourne family.

39
00:03:14,330 --> 00:03:17,180
So interestingly, this summary seems to work pretty well.

40
00:03:17,180 --> 00:03:24,350
So perhaps TF-IDF with cosine similarity is not necessarily the best way to compute similarities, although

41
00:03:24,350 --> 00:03:27,320
you would still want to test this on other documents, to be sure.

42
00:03:31,030 --> 00:03:36,550
Now, this might be a bit of a surprise, but our old friend latent semantic analysis makes an appearance

43
00:03:36,550 --> 00:03:38,810
once again. In this block,

44
00:03:38,830 --> 00:03:43,660
we use an LSA-based summarizer, which is also included in the Sumy package.

45
00:03:44,290 --> 00:03:49,270
Note that all the summarizers in this library have the same API, so I won't explain the syntax

46
00:03:49,270 --> 00:03:49,720
again.
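Because the API is shared, swapping in the LSA-based summarizer is roughly a one-line change. The sketch below assumes Sumy's LsaSummarizer class and the same parser and tokenizer setup as before. The function name and default sentence count are hypothetical.

```python
def summarize_with_lsa(text, sentences_count=5):
    """Return an LSA-based summary of `text` as a list of strings.

    Hypothetical helper for illustration; not part of Sumy itself.
    """
    # Imported lazily so this sketch can be loaded without Sumy installed.
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.summarizers.lsa import LsaSummarizer

    # Only the summarizer class differs from the TextRank version.
    summarizer = LsaSummarizer()
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summary = summarizer(parser.document, sentences_count)
    return [str(sentence) for sentence in summary]
```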

47
00:03:54,360 --> 00:04:00,000
OK, so this time the summary is as follows. I won't read it out, but again, this seems like a pretty

48
00:04:00,000 --> 00:04:00,900
decent summary.

49
00:04:05,770 --> 00:04:09,310
So the next method I want to show you in this lecture is even simpler.

50
00:04:09,850 --> 00:04:15,100
Instead of having to deal with parsers and tokenizers and so forth, Gensim has a function where

51
00:04:15,100 --> 00:04:17,740
you can just pass in text and get a summary.

52
00:04:18,550 --> 00:04:24,040
Now, interestingly, Gensim's summarizer also makes use of TextRank, which makes it very appropriate

53
00:04:24,040 --> 00:04:24,790
for this lecture.

54
00:04:25,480 --> 00:04:30,970
It happens to use a variation on the similarity function since, as mentioned, you are free to choose

55
00:04:30,970 --> 00:04:34,630
any method you like for comparing how similar two sentences are.

56
00:04:35,440 --> 00:04:40,930
Unfortunately, the documentation doesn't specify which variation they used, but they do link to the

57
00:04:40,930 --> 00:04:42,940
paper on which their method is based.

58
00:04:43,600 --> 00:04:49,360
In fact, this paper lists the TF-IDF and cosine method, which is the similarity function we used

59
00:04:49,360 --> 00:04:50,590
in the previous lecture.

60
00:04:51,310 --> 00:04:53,620
So check out this paper if you want to learn more.

61
00:04:54,670 --> 00:04:59,530
Note that we've also included the arguments for this function here, since they relate to what we discussed

62
00:04:59,530 --> 00:05:00,760
earlier in this section.

63
00:05:01,600 --> 00:05:06,670
In particular, you'll recall that there are multiple methods of choosing how many sentences to include

64
00:05:06,670 --> 00:05:07,450
in the summary.

65
00:05:08,110 --> 00:05:11,620
These arguments reflect some of the options I previously discussed.

66
00:05:12,850 --> 00:05:18,100
First, we have ratio, which lets us choose a proportion of sentences to include, for example, 10

67
00:05:18,100 --> 00:05:19,480
percent or 20 percent.

68
00:05:21,250 --> 00:05:25,600
The next possibility is word_count, which lets you choose how many words to include.

69
00:05:26,500 --> 00:05:32,230
Note that if you specify one, you can't specify the other, since both of these can't be used together.

70
00:05:35,890 --> 00:05:40,060
In any case, as promised, you can see that this is just a single function call,

71
00:05:40,330 --> 00:05:43,360
where we pass in some text and get a summary back.
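A hedged sketch of that single call is shown below. It assumes a Gensim release older than 4.0, since the gensim.summarization module was removed in 4.0. The wrapper name is made up, and the explicit error for passing both arguments is my own choice to mirror the point above, not behavior of Gensim itself.

```python
def summarize_with_gensim(text, ratio=None, word_count=None):
    """Summarize `text` by a proportion of sentences or by a word budget.

    Hypothetical wrapper for illustration; not part of Gensim itself.
    """
    # Treat the two length controls as mutually exclusive in this sketch.
    if ratio is not None and word_count is not None:
        raise ValueError("pass either ratio or word_count, not both")

    # Imported lazily; this module exists only in Gensim releases before 4.0.
    from gensim.summarization import summarize

    if word_count is not None:
        return summarize(text, word_count=word_count)
    if ratio is not None:
        return summarize(text, ratio=ratio)
    # Fall back to the library's default length setting.
    return summarize(text)
```

For example, summarize_with_gensim(article_text, ratio=0.2) asks for roughly 20 percent of the sentences, while summarize_with_gensim(article_text, word_count=100) asks for about 100 words.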

72
00:05:49,240 --> 00:05:51,700
OK, so the result consists of two sentences.

73
00:05:52,240 --> 00:05:54,700
Interestingly, this is not that great of a summary.

74
00:05:55,390 --> 00:06:00,550
Personally, I prefer the summary we generated earlier, which included the names of multiple bands

75
00:06:00,550 --> 00:06:01,570
that won awards.