1
00:00:02,210 --> 00:00:06,920
Everyone and welcome back to this class natural language processing in Python.

2
00:00:11,640 --> 00:00:15,750
In this lecture, you're going to get your first introduction to language modeling.

3
00:00:16,500 --> 00:00:23,190
Intuitively, what we want is this we want to build some kind of model that will assign high probabilities

4
00:00:23,220 --> 00:00:28,530
to real words and sentences and low probabilities to unreal words and sentences.

5
00:00:29,130 --> 00:00:30,360
Let's look at an example.

6
00:00:30,510 --> 00:00:36,630
In fact, let's just use the same example from before the phrase I like cats should have high probability

7
00:00:36,960 --> 00:00:45,240
because it is a real sentence or as the phrase why in jail OBE should have low probability because it

8
00:00:45,240 --> 00:00:46,770
is not a real sentence.

9
00:00:47,370 --> 00:00:53,190
The idea is, if you decrypt the message correctly, then the result should have high probability.

10
00:00:53,580 --> 00:00:58,050
If you decrypt the message incorrectly, then the result should have low probability.

11
00:01:03,220 --> 00:01:06,100
Let's think about how we can build such a probability model.

12
00:01:06,820 --> 00:01:11,620
The key concepts we require at this time are the engram and the markup model.

13
00:01:12,370 --> 00:01:18,280
First, let's explain the Engram and Engram simply refers to a sequence of the tokens.

14
00:01:18,790 --> 00:01:23,200
Now, if you've never heard of any of these words before, this probably sounds like gibberish to you.

15
00:01:23,560 --> 00:01:26,170
So let's break it down first.

16
00:01:26,230 --> 00:01:30,340
Tokens in the usual context would refer to individual words.

17
00:01:30,730 --> 00:01:35,770
But in our code cracking example, tokens will actually refer to individual letters.

18
00:01:36,430 --> 00:01:42,460
This makes sense because if we decode a sentence, each individual word must follow a logical sequence

19
00:01:42,460 --> 00:01:45,220
of letters to be more specific.

20
00:01:45,310 --> 00:01:47,350
Usually, we work with small values event.

21
00:01:47,950 --> 00:01:52,870
For example, we might consider you anagrams, which are individual letters by themselves.

22
00:01:53,320 --> 00:01:57,370
We can also consider by grams, which are sequences of two letters.

23
00:01:57,880 --> 00:02:01,660
We can also consider tri grams, which are sequences of three letters.

24
00:02:02,170 --> 00:02:04,840
We could do up to four grams or five grams and so forth.

25
00:02:05,050 --> 00:02:10,720
But this would be unusual for our example we will stick with by grams and unit grams.

26
00:02:11,410 --> 00:02:13,780
Now what does this have to do with Markov models?

27
00:02:18,940 --> 00:02:25,060
In the abstract form, a mark of model is a probability model that says that the current state X of

28
00:02:25,060 --> 00:02:29,140
T depends only on the previous state x of T minus one.

29
00:02:29,920 --> 00:02:35,740
In other words, it does not depend on any previous state like X of T minus two x of C minus three and

30
00:02:35,740 --> 00:02:36,280
so on.

31
00:02:36,940 --> 00:02:38,530
This is how we can write it formally.

32
00:02:38,980 --> 00:02:47,170
We see that p of X of T given X of T minus one x of T minus two and so forth is equal to p of X of T

33
00:02:47,380 --> 00:02:48,730
given X of T minus one.

34
00:02:49,630 --> 00:02:53,020
If you are afraid of math, don't worry this is actually very simple.

35
00:02:53,020 --> 00:02:57,730
As you'll see in our case, the state simply refers to the letter.

36
00:02:58,210 --> 00:03:03,160
So I want to know how often does one letter follow some other, possibly the same letter?

37
00:03:03,940 --> 00:03:08,920
This model makes what is called the mark of assumption, because we assume that each letter depends

38
00:03:08,920 --> 00:03:11,800
only on the previous letter and not on any others.

39
00:03:12,310 --> 00:03:16,000
Of course, this assumption is probably false, but it is quite useful.

40
00:03:21,150 --> 00:03:23,730
Let's think about what this means in a practical sense.

41
00:03:24,120 --> 00:03:28,980
Think about the word cat as you can see, the letter A follows the letter C.

42
00:03:29,430 --> 00:03:31,350
The letter T follows the letter A..

43
00:03:31,800 --> 00:03:35,610
So I want to know P of a given C and T given A..

44
00:03:35,760 --> 00:03:38,010
These are the two big diagrams that appear in this word.

45
00:03:39,300 --> 00:03:45,030
This will tell me how often the Letter H follows a letter C and how often the letter T follows the letter

46
00:03:45,030 --> 00:03:45,390
A..

47
00:03:46,140 --> 00:03:48,720
If our data set only consists of the word cat.

48
00:03:48,990 --> 00:03:53,580
Then of course, these probabilities are 100 percent because we only have these two examples.

49
00:03:58,720 --> 00:04:01,690
But now imagine you have an entire corpus of data.

50
00:04:02,080 --> 00:04:07,120
For example, you could take all the text on Wikipedia or all the text from a book like, say, Moby

51
00:04:07,120 --> 00:04:07,600
Dick.

52
00:04:08,290 --> 00:04:15,490
We would then be able to find a bigram probabilities for every combination of letters, ABC, ID and

53
00:04:15,490 --> 00:04:16,060
so on.

54
00:04:16,690 --> 00:04:22,930
In general, I can write this as follows the probability of the letter at position T, which we're calling

55
00:04:22,930 --> 00:04:23,470
X.

56
00:04:23,920 --> 00:04:31,270
Given the letter at position T minus one is measured by the number of times the letter at position T

57
00:04:31,570 --> 00:04:37,360
follows the letter at position T minus one divided by the number of times the letter at position T minus

58
00:04:37,360 --> 00:04:38,920
one appears in total.

59
00:04:39,760 --> 00:04:44,470
As you can see, this is quite simple, but explaining it in words is a little complicated.

60
00:04:45,100 --> 00:04:50,320
All you need to know how to do is count, which I think I can confidently assume all of you know how

61
00:04:50,320 --> 00:04:56,740
to do well, just in case let's think of a concrete example, we can say that the probability that a

62
00:04:56,770 --> 00:05:03,670
follow C is the number of times that a follow C divided by the number of CS in the dataset.

63
00:05:08,710 --> 00:05:15,280
As a mini quiz question, consider this How many bigram probabilities will we have to calculate, given

64
00:05:15,280 --> 00:05:17,290
that there are 26 letters in the alphabet?

65
00:05:18,010 --> 00:05:22,450
I'll give you a minute to think about this, so please pause the video until you have the answer.

66
00:05:29,580 --> 00:05:30,900
All right, so here's the answer.

67
00:05:31,500 --> 00:05:36,490
The number of probabilities we have to calculate if we have v letters is v squared.

68
00:05:37,200 --> 00:05:41,390
We need to consider AA a b a c all the way up to a z.

69
00:05:42,030 --> 00:05:48,570
Then we need to consider it be a B-BBEE B.C. all the way up to B.C. We continue this pattern all the

70
00:05:48,570 --> 00:05:49,740
way up to Z.

71
00:05:50,340 --> 00:05:52,440
As you can see, this forms a square.

72
00:05:52,950 --> 00:05:58,830
In other words, since there are 26 letters, we need to find a 776 different probabilities.

73
00:06:03,960 --> 00:06:07,140
OK, so now that I have this bigram model, how can I use it?

74
00:06:08,040 --> 00:06:12,210
Remember that what I would like to have is the probability of an entire sentence.

75
00:06:12,660 --> 00:06:14,640
A sentence is made up of words.

76
00:06:15,180 --> 00:06:20,280
So it would be helpful if we could first consider how I can find the probability of a single word.

77
00:06:21,030 --> 00:06:24,240
First, let's recall the rule of conditional probabilities.

78
00:06:24,720 --> 00:06:29,550
If I want to know the probability of the sequence, Abby, this is a joint probability.

79
00:06:30,090 --> 00:06:36,360
It's equal to the probability that B follows a multiplied by the probability that A appears by itself.

80
00:06:37,200 --> 00:06:42,480
Therefore, if I see a two letter word, this is how I can calculate the probability of that word.

81
00:06:43,290 --> 00:06:49,920
Note that p of B given a is a bigram probability, whereas p of by itself is a unique grant probability.

82
00:06:54,980 --> 00:07:00,860
What if I see a three letter word in this case, I can apply the chain rule of probability, which says

83
00:07:00,860 --> 00:07:08,480
that P of ABC is equal to PFC, given a b times of be given a times PRV.

84
00:07:09,440 --> 00:07:16,340
But remember, we are making the mark of assumption, which means that PFC given AB reduces to PFC given

85
00:07:16,340 --> 00:07:22,460
B since we assume that C only depends on the previous letter and not on any letter before that.

86
00:07:23,060 --> 00:07:29,960
So P of ABC as equal to PFC, given b type of be given a times b of a.

87
00:07:30,470 --> 00:07:33,140
Now we have to buy grams and one unit gram.

88
00:07:38,200 --> 00:07:41,080
In general, I can extend this to a word of any length.

89
00:07:41,590 --> 00:07:47,710
So if I have a word of length Big T, then its probability is just a unique grand probability multiplied

90
00:07:47,710 --> 00:07:50,350
by the product of each of the bigram probabilities.

91
00:07:51,100 --> 00:07:54,280
Take a moment to look at this equation and make sure it makes sense.

92
00:07:55,880 --> 00:07:59,870
Notice how the first letter is always represented by a unique grand probability.

93
00:07:59,990 --> 00:08:03,500
That's because there are no previous letters that come before the first letter.

94
00:08:03,740 --> 00:08:04,700
By definition.

95
00:08:06,130 --> 00:08:11,770
And just to be super clear, the unit grand probabilities are only counted from letters that appear

96
00:08:11,770 --> 00:08:13,480
as the first letter of a word.

97
00:08:13,930 --> 00:08:14,590
These are not.

98
00:08:14,590 --> 00:08:18,820
You anagram probabilities in the sense that these letters can appear anywhere in a word.

99
00:08:20,580 --> 00:08:23,880
For all the subsequent letters, we use, their bigram probabilities.

100
00:08:24,060 --> 00:08:27,540
We want to count up from T equals two rather than C equals one.

101
00:08:27,900 --> 00:08:32,730
So that the first bigram probability we have to consider as p of x two given x one.

102
00:08:37,929 --> 00:08:44,230
Finally, let's consider how we would extend this to a sentence which may contain multiple words since

103
00:08:44,230 --> 00:08:45,220
each word is separate.

104
00:08:45,550 --> 00:08:50,650
We can just consider the probability of each word separately and then multiply all those probabilities

105
00:08:50,650 --> 00:08:51,160
together.

106
00:08:51,790 --> 00:08:53,920
So that would give us the equation we see here.

107
00:08:54,610 --> 00:08:57,670
And this notation each W represents one word.

108
00:08:57,970 --> 00:09:01,510
And there are big N words and total indexed by little end.

109
00:09:02,480 --> 00:09:09,110
The N-word has big tea event letters and will use the superscript on X to show which word it belongs

110
00:09:09,110 --> 00:09:14,150
to, the subscript on X still refers to the position of the letter in that word.

111
00:09:14,840 --> 00:09:20,780
So x subscript t superscript n means the position T in the end the word.

112
00:09:25,920 --> 00:09:30,780
One thing you might wonder is we're only considering letters, but don't words also follow a certain

113
00:09:30,780 --> 00:09:31,870
logical sequence.

114
00:09:32,220 --> 00:09:36,780
For example, you wouldn't have a sentence that just says Cat, Cat Cat over and over again.

115
00:09:38,070 --> 00:09:43,110
However, our model doesn't differentiate between a sentence that doesn't make sense, like cat, cat,

116
00:09:43,110 --> 00:09:49,530
cat versus a sentence that actually does make sense, like I like cats in a more advanced model.

117
00:09:49,800 --> 00:09:55,140
You might want to consider a language model over sequences of words alongside sequences of letters,

118
00:09:55,410 --> 00:09:58,830
but you'll see that this actually isn't necessary for our example.

119
00:10:03,830 --> 00:10:08,750
Now that we understand the basic language model, let's discuss some modifications will make in order

120
00:10:08,750 --> 00:10:10,250
to improve how things work.

121
00:10:11,090 --> 00:10:13,580
First, we're going to do something called one smoothing.

122
00:10:14,150 --> 00:10:15,020
This is important.

123
00:10:15,680 --> 00:10:21,170
Consider what would happen if there is some rare bigram that never occurs in our training corpus, but

124
00:10:21,170 --> 00:10:23,630
does occur in the message we are trying to decrypt.

125
00:10:24,290 --> 00:10:28,130
In that case, the probability of that bigram will be measured as zero.

126
00:10:28,880 --> 00:10:34,010
The problem with zeros is that our sentence model is a product of individual probabilities.

127
00:10:34,550 --> 00:10:38,120
And whenever you multiply by zero, the whole thing becomes zero.

128
00:10:40,200 --> 00:10:45,510
That's not good, because it doesn't give us a good indication of whether the sentence as a whole is

129
00:10:45,510 --> 00:10:46,260
probable or not.

130
00:10:51,340 --> 00:10:54,820
The usual approach to fixing this problem is called + one smoothing.

131
00:10:55,510 --> 00:11:01,810
Basically, this gives us a tiny probability of each transition occurring so that even if we encounter

132
00:11:01,810 --> 00:11:06,640
something that never occurred in the training set, it won't make the whole sentence probability zero.

133
00:11:07,510 --> 00:11:13,780
The idea is we want to add one to the numerator and we want to add Big V the total number of letters

134
00:11:13,990 --> 00:11:16,150
in the alphabet to the denominator.

135
00:11:16,930 --> 00:11:22,690
The reason we want to add Big V to the denominator is because of the rule that probabilities must sum

136
00:11:22,690 --> 00:11:23,290
to one.

137
00:11:23,890 --> 00:11:25,810
You can check this for yourself quite easily.

138
00:11:30,950 --> 00:11:36,620
First, we consider the sum over all possible next letters, little we going up from one to big V,

139
00:11:37,400 --> 00:11:41,270
then we replace the probability with the add ones moving expression.

140
00:11:42,140 --> 00:11:47,180
We can see that the denominator doesn't depend on little v so we can pull it outside the sum.

141
00:11:47,990 --> 00:11:52,290
Then we can simplify the terms inside the Sun for the first term.

142
00:11:52,310 --> 00:11:58,610
The count of X of T minus one going to all possible letters is just the number of times X of T minus

143
00:11:58,610 --> 00:11:59,270
one occurred.

144
00:12:00,050 --> 00:12:03,910
For the second term, the sum of one v times is just V.

145
00:12:04,640 --> 00:12:08,660
Therefore, the top is equal to the bottom and the sum is one as expected.

146
00:12:13,770 --> 00:12:18,960
The next thing we have to consider is how do we use this model to actually decrypt the message?

147
00:12:19,560 --> 00:12:25,290
Remember the intuition once we build our model by finding all these probabilities using an English text

148
00:12:25,290 --> 00:12:30,750
corpus, we can then find that the probability of any sequence of letters or sentences.

149
00:12:31,470 --> 00:12:39,030
So let's say I translate the message incorrectly and I get some gibberish output like G B G Q LRP.

150
00:12:39,900 --> 00:12:43,260
This should give me a very low probability because it's not English.

151
00:12:43,860 --> 00:12:49,980
Whereas if I translate the message correctly to I like cats, then I should get a very high probability

152
00:12:50,250 --> 00:12:52,010
because this is correct English.

153
00:12:53,200 --> 00:12:58,840
This is, in other words, like a maximum likelihood problem because we want to find a decoding such

154
00:12:58,840 --> 00:13:01,390
that our output message has the maximum likelihood.

155
00:13:06,550 --> 00:13:09,880
There is one practical consideration we have to make before moving on.

156
00:13:10,660 --> 00:13:14,590
The problem with probabilities and language modeling is that they are very small.

157
00:13:15,370 --> 00:13:19,840
Let's say on average that the probability of a transition is one tenth or zero point one.

158
00:13:20,530 --> 00:13:23,090
I'm not sure how accurate that is, but let's suppose it's true.

159
00:13:23,890 --> 00:13:29,920
What happens if I have a sentence that say 100 characters lower than my probability will be 10 to the

160
00:13:29,920 --> 00:13:33,100
power minus 100, which is a very small number.

161
00:13:34,390 --> 00:13:39,250
These probabilities are generally so small that the computer will just end up rounding them down to

162
00:13:39,250 --> 00:13:39,760
zero.

163
00:13:40,970 --> 00:13:46,550
That's not good because I can't find out which probability is the maximum if they all give me zero.

164
00:13:51,600 --> 00:13:57,480
The solution to this is to take the log likelihood instead of the raw likelihood, since our likelihood

165
00:13:57,480 --> 00:14:03,120
is just the product, the log likelihood becomes a sum due to the multiplication rule for the log function.

166
00:14:03,810 --> 00:14:04,770
Why does this work?

167
00:14:05,400 --> 00:14:09,270
Suppose I take the log of zero point one, then I get minus two point three.

168
00:14:09,990 --> 00:14:12,360
But what if I take the log of a much smaller number?

169
00:14:13,140 --> 00:14:18,600
I still only get minus twenty point seven, even though this number is somewhere around millions or

170
00:14:18,600 --> 00:14:20,970
billions of times smaller than zero point one.

171
00:14:23,850 --> 00:14:26,130
Now you might ask, want this change the answer?

172
00:14:26,730 --> 00:14:30,840
In fact, it won't, because the log is a monotonically increasing function.

173
00:14:31,470 --> 00:14:34,470
Remember that we don't care about the actual likelihood itself.

174
00:14:34,830 --> 00:14:37,260
We only care that our translation is correct.

175
00:14:37,890 --> 00:14:40,710
In other words, I don't want to know the value of the likelihood.

176
00:14:41,100 --> 00:14:44,640
I only want to know which translation gives me the maximum likelihood.

177
00:14:45,330 --> 00:14:48,570
In other words, I only care about their values relative to each other.

178
00:14:49,410 --> 00:14:51,540
You can test this yourself to prove it's correct.

179
00:14:52,140 --> 00:14:54,450
Take any two positive numbers A and B.

180
00:14:55,110 --> 00:15:00,240
You can easily show that if B is greater than a, then log B is also greater than log.

181
00:15:01,080 --> 00:15:02,490
There is no exception to this.

182
00:15:03,150 --> 00:15:09,600
Therefore, if you find the decryption mapping that yields a log likelihood of log B and it's better

183
00:15:09,600 --> 00:15:14,640
than the decryption mapping that yields a log likelihood of log AA, then you can be sure that B is

184
00:15:14,640 --> 00:15:15,750
also greater than A.

185
00:15:20,880 --> 00:15:25,560
All right, so this lecture was quite long, but the concepts are very simple and we can summarize them

186
00:15:25,560 --> 00:15:26,250
concisely.

187
00:15:27,180 --> 00:15:32,130
First, we're going to build a language model using you anagrams and Vikram's at the character level.

188
00:15:32,850 --> 00:15:36,930
We can measure these probabilities using simple counting with add one smoothing.

189
00:15:37,500 --> 00:15:39,390
If you can count, then you can do this.

190
00:15:40,140 --> 00:15:45,120
We can count the letters in some standard English text corpus like Wikipedia or from books.

191
00:15:45,810 --> 00:15:52,320
Then we can apply the intuition that real English sentences should have a higher probability than unreal

192
00:15:52,320 --> 00:15:53,460
English sentences.

193
00:15:54,000 --> 00:15:59,820
Therefore, if our description is correct, the decrypted message should have a higher probability than

194
00:15:59,820 --> 00:16:01,830
an incorrectly decrypted message.