1
00:00:11,010 --> 00:00:16,560
In this lecture, I will be giving you an official exercise prompt in order to complete the article

2
00:00:16,560 --> 00:00:17,130
spinner.

3
00:00:17,910 --> 00:00:20,790
Now, technically, you already know what the exercise is.

4
00:00:21,210 --> 00:00:24,060
So this lecture is more of a reminder to do it.

5
00:00:24,630 --> 00:00:28,320
In addition, I'll provide a few details in case you find them helpful.

6
00:00:29,130 --> 00:00:33,360
As usual, you can feel free to look at the solution notebook to get the dataset.

7
00:00:33,720 --> 00:00:37,350
But please do not go further as this will ruin the exercise for you.

8
00:00:39,070 --> 00:00:45,100
OK, so the dataset we'll be using is the BBC News data once again. Our model will be trained on

9
00:00:45,100 --> 00:00:46,640
the business articles only.

10
00:00:46,810 --> 00:00:51,550
So if you want to make your model like the solution, you'll need to filter for these articles.

11
00:00:52,420 --> 00:00:54,220
You may find that you prefer to train on

12
00:00:54,250 --> 00:01:00,330
all the articles, or perhaps on a different dataset entirely, such as Wikipedia. Whatever dataset

13
00:01:00,370 --> 00:01:02,440
you use, it is up to you.

14
00:01:03,490 --> 00:01:08,590
Personally, I think it would be more interesting if you used Wikipedia, since the dataset is much

15
00:01:08,590 --> 00:01:11,410
larger and contains much more text per article.

16
00:01:11,980 --> 00:01:16,210
However, it would take more work. If you're interested in doing this,

17
00:01:16,450 --> 00:01:21,760
you can find archived versions of Wikipedia at dumps.wikimedia.org.

18
00:01:22,540 --> 00:01:28,120
Note that the last time I checked, these were stored as XML files, so you'll need to do some work in

19
00:01:28,120 --> 00:01:30,190
order to parse these into plaintext.

20
00:01:31,420 --> 00:01:37,480
Another option is to use our poetry data, that is, poems by either Robert Frost or Edgar Allan Poe.

21
00:01:37,990 --> 00:01:39,820
I think this would be interesting as well.

22
00:01:41,700 --> 00:01:46,860
OK, so once you've decided what text you want to use, your next task will be to build the model we

23
00:01:46,860 --> 00:01:47,610
described.

24
00:01:48,150 --> 00:01:53,670
Our model is essentially a three-dimensional matrix where the two context words are used as conditioning

25
00:01:53,670 --> 00:01:58,440
variables for the middle word. In terms of the arrays which we'll store in Python,

26
00:01:58,770 --> 00:02:02,280
this is no different than the second-order Markov model, except

27
00:02:02,280 --> 00:02:05,280
the ordering of the dimensions will be switched around.

28
00:02:06,060 --> 00:02:07,940
That is, in either of these cases,

29
00:02:07,950 --> 00:02:10,910
conceptually, you have a V by V by V matrix.

30
00:02:12,310 --> 00:02:18,460
In addition, you may want to consider whether it is better to store the distributions in a Python dictionary.

31
00:02:19,180 --> 00:02:24,790
As you recall, this also requires you to be able to sample from a probability distribution, which

32
00:02:24,790 --> 00:02:26,590
is represented as a dictionary.

33
00:02:27,220 --> 00:02:31,450
Note that we've written this code before, so there is no need for you to figure it out again.
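
As a reminder, here is a minimal sketch of the dictionary-based approach, assuming the documents are already tokenized into lists of strings. The names build_model and sample_word are hypothetical and are not taken from the solution notebook.

```python
import random
from collections import defaultdict


def build_model(docs):
    """Map (previous word, next word) -> {middle word: probability}."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in docs:
        # Every token with both a left and a right neighbour is a "middle" word.
        for i in range(1, len(tokens) - 1):
            key = (tokens[i - 1], tokens[i + 1])
            counts[key][tokens[i]] += 1

    # Normalize the counts so each inner dictionary sums to 1.
    model = {}
    for key, middle_counts in counts.items():
        total = sum(middle_counts.values())
        model[key] = {w: c / total for w, c in middle_counts.items()}
    return model


def sample_word(dist, rng=random):
    """Draw one word from a non-empty {word: probability} dictionary."""
    r = rng.random()
    cumulative = 0.0
    for word, p in dist.items():
        cumulative += p
        if r < cumulative:
            return word
    return word  # guard against floating-point round-off


# Toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
model = build_model(docs)
print(model[("the", "sat")])
```

Only the context pairs that actually occur in the text get an entry, so this avoids allocating the full V by V by V array.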

34
00:02:36,110 --> 00:02:41,600
Now, there are some details about the language model that would be worth discussing. Specifically,

35
00:02:41,630 --> 00:02:43,610
what should we consider as tokens?

36
00:02:44,690 --> 00:02:50,030
As you recall, one method is to simply drop all punctuation and use words as tokens.

37
00:02:50,570 --> 00:02:55,070
Another method is to keep punctuation using NLTK's word_tokenize.

38
00:02:55,760 --> 00:03:01,550
Another method is to simply use a string split, which keeps the punctuation by default, although in

39
00:03:01,550 --> 00:03:05,210
this case, the punctuation will be stuck to the adjacent word.

40
00:03:06,930 --> 00:03:12,150
In the coming solution, we'll be using NLTK's word_tokenize, which I think is appropriate for

41
00:03:12,150 --> 00:03:15,990
this task because it keeps punctuation as separate tokens.

42
00:03:16,620 --> 00:03:22,530
So for example, if a sentence ends with a question mark and we replace the word before it, then the

43
00:03:22,530 --> 00:03:24,870
new word will also have come from a question.

44
00:03:26,430 --> 00:03:31,560
Furthermore, this will be helpful when we string the sentence back together after it has been spun.

45
00:03:32,250 --> 00:03:37,620
Otherwise, it would be difficult to read the result because we wouldn't immediately know where one

46
00:03:37,620 --> 00:03:39,570
sentence starts and another ends.

47
00:03:40,770 --> 00:03:45,210
We also wouldn't know which sentences are statements, which are questions and so forth.

48
00:03:45,780 --> 00:03:49,260
We would also lose units of measurement where there are numbers.

49
00:03:49,680 --> 00:03:54,840
So for example, since we'll be looking at business articles, there will be many documents containing

50
00:03:54,840 --> 00:03:56,250
monetary values.

51
00:03:56,940 --> 00:03:59,040
So keeping punctuation is useful.
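
To make the comparison concrete, here is a rough sketch of the three tokenization choices on a made-up sentence. It uses only the standard library: simple_tokenize is a hypothetical regex stand-in for a tokenizer that keeps punctuation as separate tokens, and it is not NLTK's word_tokenize, which may split some text differently.

```python
import re
import string

# Made-up example sentence with a percentage and a monetary value.
text = "Profits rose 5% to $1.2bn. Will it last?"

# Option 1: drop all punctuation and use the remaining words as tokens.
no_punct = text.translate(str.maketrans("", "", string.punctuation)).split()

# Option 2: plain string split; punctuation stays stuck to the adjacent word.
split_tokens = text.split()


# Option 3: keep punctuation as separate tokens (hypothetical stand-in).
def simple_tokenize(s):
    # Numbers (with an optional decimal part and unit suffix), then words,
    # then any single non-space, non-word character.
    return re.findall(r"\d+(?:\.\d+)?\w*|\w+|[^\w\s]", s)


separate_tokens = simple_tokenize(text)

print(no_punct)
print(split_tokens)
print(separate_tokens)
```

With the first option the currency and percent symbols are lost, and with the second the question mark is glued to the last word, so only the third keeps both the units and the sentence boundaries as tokens of their own.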

52
00:04:03,740 --> 00:04:08,090
Now, once you have your model, you'll need to write a function to actually spin the article.

53
00:04:08,900 --> 00:04:11,840
This is pretty straightforward, but still requires some work.

54
00:04:12,440 --> 00:04:17,329
You'll need to figure out which words you want to replace and also how often to replace them.

55
00:04:18,350 --> 00:04:23,240
For example, if you replace every word, then the result won't even resemble the original.

56
00:04:23,510 --> 00:04:25,910
Nor will it be likely to even make sense.

57
00:04:26,570 --> 00:04:31,760
You'll also need to consider certain details, like: do you want to ever replace two words in a row?

58
00:04:33,110 --> 00:04:38,880
Furthermore, you'll want to think about how to check whether a word can be replaced. For some words,

59
00:04:38,900 --> 00:04:41,090
there simply may not be any other options,

60
00:04:41,360 --> 00:04:42,050
since the

61
00:04:42,050 --> 00:04:43,010
trigram is unique.
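
One possible way to handle these decisions is sketched below. It is not the solution's code. It assumes the model is a dictionary keyed by (previous word, next word) whose values are {middle word: probability} dictionaries. The function name spin, the replace_prob parameter, and the toy model are all hypothetical.

```python
import random


def spin(tokens, model, replace_prob=0.3, rng=random):
    """Return (new tokens, flags marking which positions were replaced)."""
    out = list(tokens)
    replaced = [False] * len(tokens)
    i = 1
    while i < len(tokens) - 1:
        key = (tokens[i - 1], tokens[i + 1])
        dist = model.get(key, {})
        # A word is replaceable only if the context offers some other word.
        candidates = {w: p for w, p in dist.items() if w != tokens[i]}
        if candidates and rng.random() < replace_prob:
            words = list(candidates)
            weights = list(candidates.values())
            out[i] = rng.choices(words, weights=weights, k=1)[0]
            replaced[i] = True
            i += 2  # skip the neighbour so we never replace two words in a row
        else:
            i += 1
    return out, replaced


# Tiny hand-written model, for illustration only.
model = {
    ("the", "sat"): {"cat": 0.5, "dog": 0.5},
    ("sat", "the"): {"on": 1.0},
}
tokens = "the cat sat on the mat".split()
spun, replaced = spin(tokens, model, replace_prob=1.0)
print(spun)
```

Here "on" is never replaced because its trigram is unique in the toy model, which is the situation just described. Whether to skip the neighbour after a replacement, and what probability to use, are design choices left to you.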

62
00:04:47,740 --> 00:04:53,830
Another interesting task to consider is this: we'll end up with a tokenized document, but how are we going

63
00:04:53,830 --> 00:04:57,280
to put it back together so that it looks like something we can read?

64
00:04:58,120 --> 00:05:01,210
One way to do this is to use the join function in Python.

65
00:05:01,690 --> 00:05:07,330
However, this is not perfect because, for example, if we join each token with a space, then we'll

66
00:05:07,330 --> 00:05:09,790
be putting spaces before punctuation as well.

67
00:05:10,450 --> 00:05:13,540
And of course, this is not correct syntax for English.

68
00:05:14,290 --> 00:05:20,590
Instead, we'll be using a class called TreebankWordDetokenizer from NLTK, which can be used

69
00:05:23,500 --> 00:05:28,660
to detokenize a list of tokens.

70
00:05:23,500 --> 00:05:28,660
So once you've finished spinning your article, you should print the result, which displays both the

71
00:05:28,660 --> 00:05:32,650
original words and the replacements so that it's easy to compare.

72
00:05:33,310 --> 00:05:36,310
And this will require you to detokenize the tokens.
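
The sketch below shows the problem with a plain join and one way to print the original words next to their replacements. It uses only the standard library: rough_detokenize is a hypothetical, much cruder stand-in for NLTK's TreebankWordDetokenizer (whose detokenize method takes a list of tokens), and the word<replacement> marking is just one assumed display format.

```python
import re


def rough_detokenize(tokens):
    """Join tokens with spaces, then remove the space before punctuation."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:%])", r"\1", text)


def show_changes(original, spun):
    """Mark each replaced position as original<replacement>."""
    marked = [o if o == s else f"{o}<{s}>" for o, s in zip(original, spun)]
    return rough_detokenize(marked)


# Made-up token lists, for illustration only.
original = ["Will", "the", "cat", "sit", "?"]
spun = ["Will", "the", "dog", "sit", "?"]

print(" ".join(original))            # plain join leaves a space before "?"
print(rough_detokenize(original))
print(show_changes(original, spun))
```

The regex only handles a few closing punctuation marks, so for real articles with quotes, brackets, and currency symbols a proper detokenizer is the safer choice.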

73
00:05:37,390 --> 00:05:39,820
So good luck, and I'll see you in the next lecture.