1
00:00:11,070 --> 00:00:16,260
So in this lecture, we will be looking at the notebook to implement the article spinner in code.

2
00:00:16,980 --> 00:00:20,690
We'll begin by downloading our dataset, which is the BBC news dataset,

3
00:00:21,060 --> 00:00:23,250
as we've seen in previous lectures.

4
00:00:31,600 --> 00:00:36,730
The next step is to do our imports. Of note here is the textwrap module.

5
00:00:37,300 --> 00:00:40,100
This is helpful when we print out our spun article,

6
00:00:40,120 --> 00:00:44,170
since if we don't wrap the text, each paragraph will go off the screen.

7
00:00:45,220 --> 00:00:47,200
Also make note of the TreebankWordDetokenizer

8
00:00:47,200 --> 00:00:53,350
class, which is used to detokenize a list of tokens back into a single string.

9
00:00:54,130 --> 00:00:59,350
In the real world, this is necessary, since that's how your article will eventually be presented.

10
00:01:06,260 --> 00:01:10,220
The next step is to download the necessary data files for NLTK.

11
00:01:15,800 --> 00:01:19,730
The next step is to load in our data using pd.read_csv.

12
00:01:23,940 --> 00:01:28,380
The next step is to call df.head() to remind ourselves what our data looks like.

13
00:01:32,750 --> 00:01:37,100
So as you recall, we have two columns, which are the text and the labels.

14
00:01:40,950 --> 00:01:45,900
The next step is to cast our labels to a set to remind ourselves what labels we have.

15
00:01:46,650 --> 00:01:51,150
As you recall, we're going to be choosing just one of these labels to train our model.

16
00:01:51,900 --> 00:01:56,910
As mentioned, you can feel free to use the whole data set if you like, or even a completely different

17
00:01:56,910 --> 00:01:58,980
data set, which you find interesting.

18
00:02:04,900 --> 00:02:10,479
The next step is to set our label of choice, which for the purpose of this lecture will be business.

19
00:02:10,960 --> 00:02:13,480
Please feel free to pick a different label if you like.

20
00:02:18,580 --> 00:02:24,820
The next step is to grab only the text for only the rows that match our chosen label. As always,

21
00:02:24,850 --> 00:02:27,490
recall that you can read this from the inside out.

22
00:02:28,060 --> 00:02:32,080
The inside part is where we select only the rows that match our label.

23
00:02:32,770 --> 00:02:36,490
Once we've done that, we can grab the appropriate column, which is text.

24
00:02:37,270 --> 00:02:41,740
Once we have the result, we'll call the head function once again to check the result.

25
00:02:46,230 --> 00:02:52,220
OK, so as you can see, the result appears to be a pandas Series containing only business articles,

26
00:02:52,440 --> 00:02:53,850
which is what we expect.

27
00:02:59,000 --> 00:03:04,400
The next step is to create our model. As you recall, we can break this into two parts.

28
00:03:04,880 --> 00:03:07,910
The first part is to count the number of possible outcomes.

29
00:03:08,180 --> 00:03:12,890
And the second part is to normalize the counts so that they become probabilities.

30
00:03:13,580 --> 00:03:16,400
So this block of code pertains to the first part.

31
00:03:17,000 --> 00:03:20,150
We'll begin by creating an empty dictionary called probs.

32
00:03:20,840 --> 00:03:23,870
Note that this will be a dictionary of dictionaries.

33
00:03:24,470 --> 00:03:27,710
The key for this dictionary will be a tuple of context words.

34
00:03:28,190 --> 00:03:30,900
In our case, that's the previous word and the next word.

35
00:03:31,790 --> 00:03:36,740
The value for this dictionary will be another dictionary. For the nested dictionary,

36
00:03:36,740 --> 00:03:38,990
the keys will be possible middle words.

37
00:03:39,410 --> 00:03:43,550
The value will be the corresponding count for the corresponding middle word.

38
00:03:46,740 --> 00:03:50,880
The next step is to loop through all of our documents. Inside the loop,

39
00:03:50,910 --> 00:03:53,640
we will first split the document into lines.

40
00:03:54,210 --> 00:03:59,010
As you recall, each document contains multiple paragraphs, which are separate lines.

41
00:03:59,580 --> 00:04:02,700
This includes the title of the article and the text.

42
00:04:03,750 --> 00:04:08,040
The next step is to loop through each line or, in other words, loop through each paragraph.

43
00:04:09,670 --> 00:04:14,350
Inside this inner loop, we will then tokenize the line by calling word_tokenize.

44
00:04:14,890 --> 00:04:16,870
This will give us a list of tokens.

45
00:04:17,740 --> 00:04:19,959
The next step is to loop through the tokens.

46
00:04:20,709 --> 00:04:23,830
Now you'll notice that I'm not looping through the tokens directly.

47
00:04:24,340 --> 00:04:29,110
This is because on each iteration, we'll need to grab three tokens simultaneously.

48
00:04:29,770 --> 00:04:33,160
That is to say, we are actually iterating through trigrams.

49
00:04:33,730 --> 00:04:39,340
This is why we want to loop through the index instead of the token, and also why we stop at len(tokens)

50
00:04:39,340 --> 00:04:40,090
minus two.

51
00:04:42,040 --> 00:04:48,190
So inside this third loop, we start by grabbing our three tokens. Notice that these are three consecutive

52
00:04:48,190 --> 00:04:52,210
tokens at indices i, i + 1, and i + 2.

53
00:04:53,930 --> 00:04:59,510
The next step is to form the key to our dictionary, which is a tuple containing t_0 and t_2.

54
00:05:00,110 --> 00:05:02,630
Obviously, t_1 is the middle word.

55
00:05:04,680 --> 00:05:09,450
The next step is to check whether or not this key exists in our probs dictionary.

56
00:05:10,110 --> 00:05:15,960
If it does not, we should create an entry for it, which will be initialized as an empty dictionary.

57
00:05:19,830 --> 00:05:25,250
The next step is to increment the count for the middle word, t_1. To do this,

58
00:05:25,260 --> 00:05:30,900
we have to handle two cases: the case where t_1 is not present already and the case where it

59
00:05:30,900 --> 00:05:31,470
is.

60
00:05:32,310 --> 00:05:36,210
So if t_1 is not present, we simply set the count to one.

61
00:05:37,020 --> 00:05:42,240
However, if t_1 is present, then we increment the count by using plus equals one.

62
00:05:43,170 --> 00:05:48,150
OK, so hopefully you're convinced that by the end of this loop, our dictionary will contain the

63
00:05:48,150 --> 00:05:49,260
proper counts.
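
The counting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny `docs` list is made up, and `str.split()` stands in for NLTK's `word_tokenize` so the snippet runs without any downloads. The names `t_0`, `t_1`, `t_2`, and `key` are assumed spellings of the variables mentioned in the lecture.

```python
# Sketch of the counting step. Assumption: str.split() replaces
# word_tokenize, and docs is a made-up two-paragraph "article".
docs = [
    "US media giant reports profit\n\nUS banking giant reports loss",
]

# key: (previous word, next word) -> {middle word: count}
probs = {}
for doc in docs:
    # Each line of a document is a paragraph (or an empty separator line).
    for line in doc.split("\n"):
        tokens = line.split()
        # Stop at len(tokens) - 2 because each step reads three tokens.
        for i in range(len(tokens) - 2):
            t_0, t_1, t_2 = tokens[i], tokens[i + 1], tokens[i + 2]
            key = (t_0, t_2)
            if key not in probs:
                probs[key] = {}
            # Two cases: first sighting of the middle word, or a repeat.
            if t_1 not in probs[key]:
                probs[key][t_1] = 1
            else:
                probs[key][t_1] += 1
```

With this toy input, the context `("US", "giant")` ends up with two candidate middle words, each counted once.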

64
00:05:57,020 --> 00:06:01,710
The next step is to normalize the counts we just collected. To do this,

65
00:06:01,730 --> 00:06:05,630
we're going to begin by looping through each key-value pair in probs.

66
00:06:06,290 --> 00:06:12,470
In this case, key represents the tuple of context words and d represents the probabilities for the middle

67
00:06:12,470 --> 00:06:12,920
word.

68
00:06:13,490 --> 00:06:16,160
Except these are not probabilities just yet.

69
00:06:16,580 --> 00:06:20,990
For now, they are just counts. In order to turn them into probabilities,

70
00:06:21,200 --> 00:06:24,800
we'll need the total count, which is the sum of all the values in d.

71
00:06:25,820 --> 00:06:28,940
Once we have the total, we can then loop through d itself.

72
00:06:29,750 --> 00:06:35,390
In this case, the key is the middle word and the value is the count. Inside this loop,

73
00:06:35,420 --> 00:06:40,580
we turn this into a probability by dividing the current count by the total count.
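
As a rough sketch of the normalization step, assuming a `probs` dictionary of counts in the format described earlier (the counts below are invented for illustration):

```python
# Hypothetical counts: (previous word, next word) -> {middle word: count}
probs = {
    ("US", "giant"): {"media": 3, "banking": 1},
    ("jumped", "%"): {"1.8": 1},
}

for key, d in probs.items():
    # Total count over all middle words seen in this context.
    total = sum(d.values())
    # Overwriting existing keys while iterating is safe because the
    # dictionary does not change size.
    for k, v in d.items():
        d[k] = v / total
```

After this loop, each inner dictionary sums to one, so it can be treated as a probability distribution over middle words.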

74
00:06:46,730 --> 00:06:52,160
The next step is to print out our probs dictionary to confirm that it has the correct format.

75
00:06:57,240 --> 00:07:00,180
OK, so it appears that our dictionary looks like it should.

76
00:07:00,780 --> 00:07:05,790
The key is a tuple of the two context words and the value is a probability dictionary.

77
00:07:06,630 --> 00:07:11,790
Note that for some trigrams, there is only one possible middle word. For these cases,

78
00:07:11,790 --> 00:07:16,890
we probably wouldn't bother trying to replace those words because the result would be no change.

79
00:07:17,880 --> 00:07:21,510
However, note that for some keys, there are many possible words.

80
00:07:22,140 --> 00:07:25,410
For example, make note of the key (US, giant).

81
00:07:26,070 --> 00:07:33,690
In this case, many trigrams could make sense: US agrochemical giant, US banking giant, US foods

82
00:07:33,690 --> 00:07:36,360
giant, US media giant, and so forth.

83
00:07:36,990 --> 00:07:38,850
So typical business news.

84
00:07:43,330 --> 00:07:49,930
Also notice how our punctuation is tokenized, so you could have jumped one point eight percent or jumped

85
00:07:49,930 --> 00:07:51,400
ten point seven percent.

86
00:07:57,960 --> 00:08:00,750
OK, so at this point, our model is complete.

87
00:08:01,410 --> 00:08:06,090
The next part is to figure out how to actually use the model to spin articles.

88
00:08:06,630 --> 00:08:11,520
Conceptually, it's simple, but as you'll see, there are a few details we need to contend with.

89
00:08:12,390 --> 00:08:16,050
Firstly, let's recall that each line in a document is a paragraph.

90
00:08:16,740 --> 00:08:22,950
So if we grab the first article at iloc[0] and we split on newlines, we should be able to see

91
00:08:22,950 --> 00:08:24,300
a list of paragraphs.

92
00:08:29,000 --> 00:08:34,669
So notice that some lines are simply empty, which is because our articles are formatted to have an

93
00:08:34,669 --> 00:08:36,590
empty line between paragraphs.

94
00:08:41,030 --> 00:08:44,179
The next step is to write a function called spin_document.

95
00:08:44,900 --> 00:08:47,990
The main idea of this function is that it's a higher level function.

96
00:08:48,740 --> 00:08:53,870
Basically, we're going to break our problem down into slightly smaller parts, which is to spin each

97
00:08:53,870 --> 00:08:55,340
paragraph one by one.

98
00:08:56,240 --> 00:09:00,980
So the point of this function is to simply call another function, which will spin each paragraph.

99
00:09:01,730 --> 00:09:06,680
It's also responsible for joining the paragraphs back together at the end so that the result has the

100
00:09:06,680 --> 00:09:08,690
same format as the input.

101
00:09:10,800 --> 00:09:16,950
OK, so as input, we take in a variable called doc, which is a string representing the entire document.

102
00:09:18,350 --> 00:09:24,230
Inside the function, we begin by splitting the document by newline. As you recall, we just did this

103
00:09:24,230 --> 00:09:26,480
above, so you know what the result looks like.

104
00:09:27,830 --> 00:09:33,500
We'll also create an empty list called output, where we will store each spun paragraph as it is

105
00:09:33,500 --> 00:09:34,220
received.

106
00:09:35,240 --> 00:09:41,490
The next step is to enter a loop, which will loop through each line, or each paragraph. Inside the loop,

107
00:09:41,510 --> 00:09:44,060
we will check whether the line is empty or not.

108
00:09:45,230 --> 00:09:49,670
If it is not, then we call the spin_line function, which we have not yet seen.

109
00:09:50,210 --> 00:09:51,920
The result is called new_line.

110
00:09:52,700 --> 00:09:57,260
Otherwise, the line is simply an empty line and we can set new_line equal to line.

111
00:09:58,190 --> 00:10:02,450
Once we have the new_line variable, we can then append it to our output list.

112
00:10:03,080 --> 00:10:08,240
Once we finish the loop, we can then join the output by a newline character, which will reverse the

113
00:10:08,240 --> 00:10:09,410
split we did above.
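
The structure of this higher-level function might look something like the sketch below. The `spin_line` here is only a placeholder (it upper-cases the text) so that the example runs on its own. It is not the real implementation, which is covered later in the lecture.

```python
def spin_line(line):
    # Placeholder standing in for the real paragraph spinner.
    return line.upper()


def spin_document(doc):
    # Each line is a paragraph; empty lines separate paragraphs.
    lines = doc.split("\n")
    output = []
    for line in lines:
        if line:
            new_line = spin_line(line)
        else:
            # Keep empty lines so the output has the same layout as the input.
            new_line = line
        output.append(new_line)
    # Joining on "\n" reverses the split above.
    return "\n".join(output)
```

Because empty lines pass through unchanged, the blank separators between paragraphs survive the round trip.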

114
00:10:15,320 --> 00:10:20,270
So the main thing on your mind at this point should be: what does the spin_line function look like?

115
00:10:20,870 --> 00:10:26,060
But before we get into that, we need to think about how to detokenize a list of tokens.

116
00:10:26,690 --> 00:10:31,010
As you recall, we need to tokenize each paragraph in order to spin the article.

117
00:10:31,460 --> 00:10:35,960
But once it's been spun, we need to turn the list of tokens back into a string.

118
00:10:36,830 --> 00:10:42,350
The hard part about this is sometimes tokens should be joined by a space, but sometimes they should

119
00:10:42,350 --> 00:10:43,820
be joined without a space.

120
00:10:44,330 --> 00:10:49,460
For example, if we have two words side by side, we would like to join those with a space.

121
00:10:49,910 --> 00:10:55,580
But if we have a word and punctuation, then we would not want to put a space between them. In order

122
00:10:55,580 --> 00:10:56,250
to do this,

123
00:10:56,270 --> 00:11:02,780
we're going to make use of a class called TreebankWordDetokenizer. We'll begin by creating an instance

124
00:11:02,780 --> 00:11:03,830
of this class.

125
00:11:08,310 --> 00:11:11,880
The next step is to pick a random sentence from one of our documents.

126
00:11:12,420 --> 00:11:17,970
At this point, we just want to print this out to confirm that if we tokenize and then detokenize this

127
00:11:17,970 --> 00:11:20,790
sentence, we end up with the same sentence.

128
00:11:28,760 --> 00:11:33,950
So now that we know what our sentence looks like, the next step will be to call word_tokenize on this

129
00:11:33,950 --> 00:11:34,670
sentence.

130
00:11:35,360 --> 00:11:39,350
This will return a list of tokens. On that list of tokens,

131
00:11:39,350 --> 00:11:44,840
we are then going to call detokenize using the detokenizer object that we just created.

132
00:11:50,490 --> 00:11:55,320
OK, so as you can see, the detokenized string is the same as the original string.

133
00:11:56,010 --> 00:12:01,860
This means that punctuation has been treated correctly without spaces at the appropriate points.

134
00:12:05,080 --> 00:12:10,510
The next step is to define our sample_word function, which will sample a random word from a probability

135
00:12:10,510 --> 00:12:15,020
distribution represented as a dictionary. As you recall,

136
00:12:15,040 --> 00:12:18,820
this is the same as what we've seen before, so I won't explain it again.
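
For readers who want a reminder, one common way to sample from a distribution stored as a dictionary is to walk the cumulative sum. The sketch below uses the standard library's `random.random()`. It is an assumption about how such a function could be written, not necessarily the version shown in the earlier lecture.

```python
import random


def sample_word(d):
    # d maps each word to its probability; the values should sum to one.
    p0 = random.random()  # uniform draw in [0, 1)
    cumulative = 0.0
    for word, p in d.items():
        cumulative += p
        if p0 < cumulative:
            return word
    # Floating-point round-off can leave the sum slightly below one,
    # so fall back to the last word (assumes d is non-empty).
    return word
```

Words with larger probabilities cover a wider slice of the interval from zero to one, so they are returned more often.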

137
00:12:25,080 --> 00:12:29,460
The next step is to implement our spin_line function. As input,

138
00:12:29,520 --> 00:12:34,050
this takes in one line as a string. Inside the function,

139
00:12:34,080 --> 00:12:37,830
we begin by calling word_tokenize to get a list of tokens.

140
00:12:38,400 --> 00:12:43,860
We'll also initialize an index to start at zero, which will be used while we loop through the tokens.

141
00:12:44,550 --> 00:12:48,030
The reason we are not using a for loop will become clear later on.

142
00:12:49,230 --> 00:12:54,930
We'll also initialize our output at this point, which will be a list containing just the first token.

143
00:12:55,980 --> 00:13:00,210
Note that the first token cannot be spun because there is no previous word.

144
00:13:02,140 --> 00:13:07,870
The next step will be to enter a while loop up to len(tokens) minus two. As before,

145
00:13:07,900 --> 00:13:12,760
the reason we stop at minus two is because we'll be indexing three tokens at a time.

146
00:13:13,930 --> 00:13:19,480
Inside the loop, we'll grab three tokens at indices i, i + 1, and i + 2.

147
00:13:21,290 --> 00:13:26,870
The next step will be to form our key, which is a tuple containing the previous token, t_0, and

148
00:13:26,870 --> 00:13:34,100
the next token, t_2. The next step is to grab the probability distribution corresponding to this key.

149
00:13:34,760 --> 00:13:37,730
That is the distribution over all possible middle words.

150
00:13:38,990 --> 00:13:40,810
We'll call the result p_dist.

151
00:13:43,230 --> 00:13:47,070
The next step is to check whether or not we should replace this middle word.

152
00:13:52,230 --> 00:13:59,610
So the next if statement defines when we should do this. Note that we are using two criteria. The first

153
00:13:59,610 --> 00:14:04,560
criterion is obvious, and that is that the length of p_dist must be bigger than one.

154
00:14:05,220 --> 00:14:10,080
If it is not, then there is no possibility of replacing this word with something else.

155
00:14:10,950 --> 00:14:13,230
The second criterion is stochastic.

156
00:14:13,740 --> 00:14:19,470
We generate a random number and then check whether or not that random number is below zero point three.

157
00:14:20,220 --> 00:14:23,460
Basically, this will control how often words are replaced.

158
00:14:24,150 --> 00:14:28,830
As you recall, the random function draws a number uniformly between zero and one.

159
00:14:29,580 --> 00:14:35,250
Therefore, by using a threshold of zero point three, we're saying there's a probability of 30 percent

160
00:14:35,580 --> 00:14:38,490
to replace each word that can be replaced.

161
00:14:40,730 --> 00:14:46,580
OK, so if we want to replace this middle word, then we enter this if statement. The next step is

162
00:14:46,580 --> 00:14:48,700
to find a replacement for the word.

163
00:14:49,820 --> 00:14:56,030
To do this, we simply call the function sample_word, passing in the probability dictionary p_dist.

164
00:14:56,660 --> 00:14:57,950
We'll call the result middle.

165
00:14:59,780 --> 00:15:04,700
Now, because it would be nice to compare each replacement with the original, we're still going to

166
00:15:04,700 --> 00:15:06,920
append t_1 to the output list.

167
00:15:07,700 --> 00:15:11,480
The next step will be to append middle surrounded by angle brackets.

168
00:15:12,050 --> 00:15:17,780
Since these brackets are unlikely to occur in the news articles, it should be unambiguous when we see

169
00:15:17,780 --> 00:15:19,040
these in the results.

170
00:15:20,790 --> 00:15:27,390
Finally, we'll also append the next word, t_2. What this means is that we will never replace two words

171
00:15:27,390 --> 00:15:28,020
in a row.

172
00:15:28,740 --> 00:15:31,710
It's not that you can't, but I've simply decided not to.

173
00:15:35,470 --> 00:15:40,720
In this case, because we've effectively added two new words to the output, we're going to increment

174
00:15:40,720 --> 00:15:41,620
i by two.

175
00:15:42,310 --> 00:15:45,040
So this is the reason we did not use a for loop.

176
00:15:46,180 --> 00:15:49,540
If we used a for loop, we would always be incrementing by one.

177
00:15:50,050 --> 00:15:53,110
But this lets us control how much to increment i.

178
00:15:56,470 --> 00:16:01,900
The next step is to look at the else block. In this case, we are not replacing the middle word.

179
00:16:02,440 --> 00:16:06,160
Therefore, we simply append t_1 to the output list.

180
00:16:06,790 --> 00:16:09,730
We also increment i by one instead of two.

181
00:16:13,370 --> 00:16:16,700
OK, so by this point, we will be outside the while loop.

182
00:16:17,510 --> 00:16:23,120
There is one final thing we need to consider, which is the value of i. At this point,

183
00:16:23,150 --> 00:16:28,550
i could be either one or two steps away from the end of the sentence because we may have incremented

184
00:16:28,550 --> 00:16:31,520
i by one or two on the last step of the loop.

185
00:16:32,810 --> 00:16:36,770
Therefore, the behavior will be different depending on what we just did.

186
00:16:37,580 --> 00:16:43,670
Now, if we incremented i by two, that means we replaced the second-last token and we appended the

187
00:16:43,670 --> 00:16:45,920
final token to the output list.

188
00:16:46,460 --> 00:16:48,470
And since we incremented i by two,

189
00:16:48,830 --> 00:16:51,980
that means i will be greater than len(tokens) minus two.

190
00:16:53,030 --> 00:16:59,030
In that case, there is nothing else to do. In the other case, where we did not increment i by two,

191
00:16:59,420 --> 00:17:01,940
that means we did not replace the second-last token.

192
00:17:02,660 --> 00:17:04,160
In that case, we incremented

193
00:17:04,160 --> 00:17:05,000
i by one,

194
00:17:05,300 --> 00:17:08,180
and so i is equal to len(tokens) minus two.

195
00:17:09,050 --> 00:17:12,050
In this case, we still have yet to append the final token.

196
00:17:12,319 --> 00:17:14,180
So that is what we do at this point.

197
00:17:15,290 --> 00:17:21,290
The final step, once the output list is complete, is to call the detokenize function to convert our

198
00:17:21,290 --> 00:17:24,020
list of tokens back into a single string.
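
Putting the pieces together, the function walked through above could be sketched as follows. Several things here are assumptions made so the example is self-contained: `str.split()` and `" ".join()` stand in for NLTK's `word_tokenize` and the detokenizer, `probs` and `replace_prob` are passed as parameters, `probs.get` is used so unseen contexts do not raise an error, and `sample_word` is a simple cumulative-sum sampler.

```python
import random


def sample_word(d):
    # Draw a word from a {word: probability} dictionary.
    p0 = random.random()
    cumulative = 0.0
    for word, p in d.items():
        cumulative += p
        if p0 < cumulative:
            return word
    return word  # fallback for floating-point round-off


def spin_line(line, probs, replace_prob=0.3):
    tokens = line.split()  # stand-in for word_tokenize
    if not tokens:
        return line
    i = 0
    # The first token has no previous word, so it is never replaced.
    output = [tokens[0]]
    while i < len(tokens) - 2:
        t_0, t_1, t_2 = tokens[i], tokens[i + 1], tokens[i + 2]
        key = (t_0, t_2)
        p_dist = probs.get(key, {})
        if len(p_dist) > 1 and random.random() < replace_prob:
            middle = sample_word(p_dist)
            output.append(t_1)                 # keep the original for comparison
            output.append("<" + middle + ">")  # mark the replacement
            output.append(t_2)                 # never replace two words in a row
            i += 2
        else:
            output.append(t_1)
            i += 1
    # If the last step moved i by one, the final token is still missing.
    if i == len(tokens) - 2:
        output.append(tokens[-1])
    return " ".join(output)  # stand-in for the detokenizer
```

Setting `replace_prob` to zero returns the line unchanged (up to whitespace), which is a quick way to check that no tokens are dropped or duplicated at the end of the sentence.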