1
00:00:11,090 --> 00:00:16,640
So in this lecture, we will be looking at the code to implement our second order language model to

2
00:00:16,640 --> 00:00:18,530
generate Robert Frost poems.

3
00:00:19,280 --> 00:00:21,500
We'll start by importing Nampai and String.

4
00:00:22,280 --> 00:00:27,830
We'll also set the NumPy random seed so that you get results which are consistent with this video.

5
00:00:28,550 --> 00:00:33,530
Note that this isn't required and you are encouraged to turn this off to see other examples.

6
00:00:39,620 --> 00:00:44,150
OK, so the next step will be to create dictionaries to represent our probabilities.

7
00:00:44,720 --> 00:00:49,370
The first dictionary, called Initial, will store the initial state distribution.

8
00:00:50,450 --> 00:00:55,940
The second dictionary, called First Order, will store the First Order transition probabilities, but

9
00:00:55,940 --> 00:00:58,130
only for the second word in each sentence.

10
00:00:58,640 --> 00:01:04,730
This is unlike the First Order Markov model, where the matrix was counted for all one step transitions.

11
00:01:05,330 --> 00:01:07,610
Of course, you could count the words that way, too.

12
00:01:07,850 --> 00:01:09,590
It's simply that I've decided not to.

13
00:01:10,790 --> 00:01:16,220
The third dictionary, which I've called Second Order, will store the second order transition probabilities.

14
00:01:16,400 --> 00:01:17,570
As you might expect.

15
00:01:22,680 --> 00:01:28,410
The next step is to define a function call to remove punctuation, which will remove punctuation from

16
00:01:28,410 --> 00:01:29,520
each line of text.

17
00:01:30,030 --> 00:01:35,190
This is optional, so if you'd like to leave us out, feel free to do so as before.

18
00:01:35,220 --> 00:01:39,930
For this course, there's no need to worry about what this code actually does, since it's not machine

19
00:01:39,930 --> 00:01:40,500
learning.

20
00:01:44,970 --> 00:01:50,070
The next step is to download our data set, which is a text file containing Robert Frost poems.

21
00:01:57,290 --> 00:01:59,660
The next step is to define a function called add to.

22
00:02:00,800 --> 00:02:05,690
So this function will take in three items a dictionary, a key and a value.

23
00:02:06,560 --> 00:02:12,800
The basic idea is the key will represent some starting word or pairs of words in the second order case.

24
00:02:13,580 --> 00:02:16,940
Recall that Python dictionaries can be any type of object.

25
00:02:17,450 --> 00:02:21,980
The value for this dictionary will be a list storing all the possible next words.

26
00:02:22,640 --> 00:02:28,190
So as a quick example, suppose that we're modeling Second Order transitions from a text corpus.

27
00:02:28,670 --> 00:02:35,240
We come across the phrases I am happy, I am sad, I am content, I am happy, I am happy, I am hungry.

28
00:02:36,020 --> 00:02:42,950
In this case, the key to the dictionary would be a tuple containing the tokens I and -- the value

29
00:02:42,950 --> 00:02:46,680
would be a list containing happy three times and the rest of the words.

30
00:02:46,700 --> 00:02:52,820
Once the goal is, we're going to eventually turn this into a dictionary where the word points to the

31
00:02:52,820 --> 00:02:54,860
probability of that word appearing.

32
00:02:55,490 --> 00:02:59,060
But the first step is to simply collect the words from the text.

33
00:03:05,610 --> 00:03:11,010
The next step is to traverse our data set where we make use of the function we just wrote to populate

34
00:03:11,010 --> 00:03:12,240
the dictionaries above.

35
00:03:12,900 --> 00:03:18,730
So we'll start by looping over each line of the Robert Frost text file inside the loop.

36
00:03:18,750 --> 00:03:21,150
The first step will be to our strip and lowercase.

37
00:03:21,150 --> 00:03:27,000
The line will then remove the punctuation and then call the split function to tokenize the sentence.

38
00:03:27,690 --> 00:03:30,240
So the result of this will be a list of tokens.

39
00:03:33,590 --> 00:03:38,450
The next step will be to grab the length of our list of tokens, which we will call big tea.

40
00:03:40,810 --> 00:03:48,550
The next step is to look through each index from zero up to Big T inside the loop or grab the IV token,

41
00:03:48,550 --> 00:03:49,750
which we'll call T.

42
00:03:52,810 --> 00:03:58,930
The next step will be to check whether or not I is equal to zero, if it is, then this is the first

43
00:03:58,930 --> 00:04:00,010
word in a sentence.

44
00:04:00,340 --> 00:04:03,250
So we would like to update our initial state distribution.

45
00:04:03,940 --> 00:04:10,210
We can do this by incrementing the value stored for T by one note that we use the get method.

46
00:04:10,240 --> 00:04:17,110
So if T does not exist in the dictionary yet, it automatically gets a count of zero as before.

47
00:04:17,110 --> 00:04:22,510
The first step is only to gather the counts after we've gone through the whole data set.

48
00:04:22,750 --> 00:04:24,640
We can normalize this distribution.

49
00:04:28,270 --> 00:04:33,520
The next step is to look at the Ellis block, as you may have inferred, the L'ESPOIR takes care of

50
00:04:33,520 --> 00:04:35,980
any token when I is bigger than zero.

51
00:04:36,490 --> 00:04:39,280
In other words, this is not the first word in the sentence.

52
00:04:40,090 --> 00:04:45,850
Therefore, the first thing we can do is grab the previous word in the sentence at Index II minus one,

53
00:04:46,390 --> 00:04:48,640
we'll call the results underscore one.

54
00:04:51,800 --> 00:04:57,680
The next step is to check whether or not we are at the end of a sentence, if we are, then I is equal

55
00:04:57,680 --> 00:04:59,150
to Big T minus one.

56
00:05:00,020 --> 00:05:05,390
In this case, we would like to add a special second order transition by essentially creating a fake

57
00:05:05,390 --> 00:05:05,990
token.

58
00:05:06,710 --> 00:05:11,270
As you can see, this fake token is called end in all uppercase letters.

59
00:05:11,930 --> 00:05:17,960
The purpose of this is when we are generating new poems, we know that the line should end if we sample

60
00:05:17,960 --> 00:05:18,980
the end a token.

61
00:05:19,700 --> 00:05:21,200
And of course, this is required.

62
00:05:21,200 --> 00:05:23,000
Otherwise our line would go on forever.

63
00:05:25,200 --> 00:05:29,500
So pay attention to the arguments we pass into the add to debt function.

64
00:05:30,270 --> 00:05:34,350
The dictionary we want to update is the one for second order transitions.

65
00:05:34,920 --> 00:05:39,420
The key for this dictionary is a tuple containing T underscore one and T.

66
00:05:40,200 --> 00:05:42,990
The value for this dictionary is our fake and token.

67
00:05:45,510 --> 00:05:49,950
The next step is to enter a new if statement where we check if I is equal to one.

68
00:05:50,700 --> 00:05:55,920
Note that this is a separate if statement since we might want to run both statements on the same token.

69
00:05:56,700 --> 00:06:00,660
In other words, they are not mutually exclusive as an exercise.

70
00:06:00,810 --> 00:06:03,870
Think of a situation where both statements would be run.

71
00:06:06,270 --> 00:06:08,910
OK, so what should we do if I is equal to one?

72
00:06:09,690 --> 00:06:12,780
This is when we are looking at the second word in a sentence.

73
00:06:13,320 --> 00:06:19,680
Therefore, we would like to update the First Order dictionary note again the arguments into the adjudicate

74
00:06:19,680 --> 00:06:20,280
function.

75
00:06:20,910 --> 00:06:24,270
The key is the previous word, which is to underscore one.

76
00:06:24,900 --> 00:06:27,360
The value is the current word, which is T.

77
00:06:31,880 --> 00:06:33,980
The next step is to look at the Ellis block.

78
00:06:34,610 --> 00:06:40,100
Basically, this looks at any case we have not yet considered, which is when I is not zero and I is

79
00:06:40,100 --> 00:06:44,660
not one that is to say we are not looking at the first or second word.

80
00:06:45,830 --> 00:06:48,980
In this case, we should have at least two previous words to look at.

81
00:06:49,400 --> 00:06:55,670
So we can grab the second last token at Index II minus to the next step is to update our second order

82
00:06:55,670 --> 00:06:57,600
dictionary again.

83
00:06:57,620 --> 00:06:58,880
Notice the arguments.

84
00:06:59,240 --> 00:07:02,960
The key is the previous two words and the value is the current word.

85
00:07:09,490 --> 00:07:14,980
Now, one thing to keep in mind is that at this point, the initial state distribution is the only dictionary

86
00:07:14,980 --> 00:07:16,660
that contains actual counts.

87
00:07:17,140 --> 00:07:22,720
Therefore, it's the only dictionary we can normalize directly for the other dictionaries, they simply

88
00:07:22,720 --> 00:07:25,390
store samples of possible next words.

89
00:07:25,810 --> 00:07:27,820
We have yet to turn those into counts.

90
00:07:29,730 --> 00:07:33,840
So in this step, we're going to normalize the initial state distribution first.

91
00:07:34,470 --> 00:07:37,920
We'll start by grabbing the sum of all the values, which gives us the total.

92
00:07:38,970 --> 00:07:42,330
Of course, this is just equal to the number of lines we encountered.

93
00:07:43,290 --> 00:07:48,390
The next step is to live through each key value pair in the dictionary for each pair.

94
00:07:48,420 --> 00:07:52,530
We're going to take the count, which I've called see and divide that by the total.

95
00:07:53,250 --> 00:07:58,080
As you know, this is our maximum likelihood estimate for the probability of each token.

96
00:08:04,990 --> 00:08:10,690
OK, so the next step is a bit more difficult, as you know, for the transitions, we don't yet have

97
00:08:10,690 --> 00:08:11,260
counts.

98
00:08:11,560 --> 00:08:14,890
We simply have lists which store all the possible next words.

99
00:08:15,550 --> 00:08:20,140
So the following function called list to predict does two things.

100
00:08:20,800 --> 00:08:24,070
The first thing it does is it creates a dictionary of counts.

101
00:08:24,730 --> 00:08:30,370
The second thing it does is once it has the counts, it then normalizes the counts, which converts

102
00:08:30,370 --> 00:08:31,750
them into probabilities.

103
00:08:33,250 --> 00:08:40,580
OK, so the input argument to this function is T-S, which is just a list of tokens inside the function.

104
00:08:40,600 --> 00:08:43,590
We'll start by creating a new empty dictionary called D.

105
00:08:44,680 --> 00:08:48,250
The next step is to grab the total number of samples, which we'll call end.

106
00:08:48,760 --> 00:08:50,030
This is just the length of time.

107
00:08:51,670 --> 00:08:54,760
The next step is to live through each token into yes.

108
00:08:56,140 --> 00:09:01,690
Inside the loop, we simply increment the value for the token by one each time we encounter the token,

109
00:09:02,470 --> 00:09:07,000
as before we use the get method so that we can say the default value is zero.

110
00:09:08,330 --> 00:09:11,900
Once this loop is done, we will have a dictionary starring the counts.

111
00:09:12,500 --> 00:09:16,040
So the key is the token and the value is the corresponding counts.

112
00:09:17,480 --> 00:09:23,450
The next step is to then loop through each key value pair in the dictionary inside this loop.

113
00:09:23,510 --> 00:09:28,340
We simply divide the count by the total, which gives us the probability for each token.

114
00:09:29,570 --> 00:09:31,070
Once we're done, we return.

115
00:09:38,610 --> 00:09:41,550
OK, so the next step is to make use of our new function.

116
00:09:42,120 --> 00:09:47,010
We'll start by looping through the first start a dictionary which stores the probability for each second

117
00:09:47,010 --> 00:09:48,390
word in a sentence.

118
00:09:48,990 --> 00:09:54,630
As you recall, what we previously stored in this dictionary was a list of possible next tokens.

119
00:09:55,860 --> 00:10:01,860
We now call the list to repeat its function to simply replace that list of tokens with a dictionary

120
00:10:01,860 --> 00:10:04,680
of probabilities as an exercise.

121
00:10:04,890 --> 00:10:07,890
You should think about what structure this dictionary will have.

122
00:10:08,190 --> 00:10:14,400
Once we are done, basically convince yourself that the result will be a dictionary of dictionaries.

123
00:10:19,870 --> 00:10:25,720
The next step is to use our function again, but this time on the Second Order dictionary, the main

124
00:10:25,720 --> 00:10:31,600
difference with this dictionary is that the key is now a tuple of two words instead of just one word.

125
00:10:32,080 --> 00:10:37,600
But we're still doing the same thing as before, which is converting a list of words into a dictionary

126
00:10:37,600 --> 00:10:39,430
of probabilities of words.

