1
00:00:11,090 --> 00:00:17,150
So in this lecture, we will consider a slight modification to our method of estimating a pie and also

2
00:00:17,150 --> 00:00:20,960
how to compute the probability of a sequence using the Markov model.

3
00:00:21,650 --> 00:00:26,510
In other words, we're going to consider slightly different ways to answer the two questions we previously

4
00:00:26,510 --> 00:00:27,140
posed.

5
00:00:27,800 --> 00:00:32,870
So the previous estimates we discussed are called the maximum likelihood estimates, since they are

6
00:00:32,870 --> 00:00:37,880
the estimates that yield the maximum value of the likelihood given the dataset being trained on.

7
00:00:38,720 --> 00:00:43,460
In this lecture, we will look at one potential problem of using this method and how to fix it.

8
00:00:48,130 --> 00:00:54,280
So let's consider how we find the probability of a sequence using a PI from this expression, we can

9
00:00:54,280 --> 00:00:57,880
see that it involves only one operation multiplication.

10
00:00:58,780 --> 00:01:03,460
Now let's suppose that one of these values is zero, which can happen if we're trying to find the likelihood

11
00:01:03,460 --> 00:01:08,290
of a sentence that contains pairs of words that never appeared in our training dataset.

12
00:01:08,890 --> 00:01:13,120
This is entirely possible since our train set will be different from our test set.

13
00:01:14,350 --> 00:01:19,540
Well, because this expression contains only multiplication and we know that anything times zero is

14
00:01:19,540 --> 00:01:20,290
still zero.

15
00:01:20,530 --> 00:01:26,080
This will make the entire expression zero, although this seems like it might be the right answer.

16
00:01:26,140 --> 00:01:32,230
It's not because we probably don't want to conclude a sentence is impossible simply because it contains

17
00:01:32,230 --> 00:01:34,750
a pair of words that did not appear in our training set.

18
00:01:39,480 --> 00:01:41,280
So what should we do about this problem?

19
00:01:42,240 --> 00:01:48,420
Intuitively, what we would like to do is give a small probability to every possible transition, even

20
00:01:48,420 --> 00:01:50,430
if they never occur in the training set.

21
00:01:51,060 --> 00:01:54,510
By doing this, we can avoid encountering any zeros during testing.

22
00:01:55,740 --> 00:02:01,710
The easiest way to do this is called +1 one smoothing +1 one smoothing is exactly what it sounds like.

23
00:02:01,830 --> 00:02:05,730
We simply out of fake count of one to every possible transition.

24
00:02:06,480 --> 00:02:11,100
The resulting estimate for AIG then looks like this on the numerator.

25
00:02:11,110 --> 00:02:17,940
We had a one on the denominator we had em, as you recall, for a probability distribution to be valid.

26
00:02:18,360 --> 00:02:20,700
All of the probabilities must sum to one.

27
00:02:22,190 --> 00:02:28,310
Since we had a one to each of the impossible values, we must also add em to the denominator as well.

28
00:02:28,910 --> 00:02:34,760
You should convince yourself if it's not already obvious that this indeed results in each row of AIG,

29
00:02:34,940 --> 00:02:35,810
something to one.

30
00:02:40,710 --> 00:02:44,340
Note that it's also possible to use add one smoothing for pie as well.

31
00:02:44,820 --> 00:02:48,000
Again, we add one to the top and we get em to the bottom.

32
00:02:52,740 --> 00:02:56,610
One simple extension to add one smoothing is add Epsilon smoothing.

33
00:02:57,300 --> 00:03:01,650
Perhaps we believe that adding one is a bit too biased in this case.

34
00:03:01,650 --> 00:03:07,830
What we do instead is add a small fake accounts epsilon to the numerator and epsilon times m to the

35
00:03:07,830 --> 00:03:15,270
denominator again by the same logic we use before we can conclude that this results in each row of AIG,

36
00:03:15,390 --> 00:03:16,260
something to one.

37
00:03:17,010 --> 00:03:19,290
And of course, the same thing applies to PI as well.

38
00:03:20,460 --> 00:03:25,470
Furthermore, note that we can add even more smoothing to our estimate simply by making Epsilon greater

39
00:03:25,470 --> 00:03:26,010
than one.

40
00:03:30,770 --> 00:03:36,230
So thus far, we've considered how we might modify our answer to one of the questions we posed earlier.

41
00:03:36,590 --> 00:03:39,650
How to find a pie given a training data set.

42
00:03:40,580 --> 00:03:45,110
The next question we're going to consider in this lecture is how we can modify our answer to the other

43
00:03:45,110 --> 00:03:47,750
question which was given a sequence.

44
00:03:48,050 --> 00:03:50,750
How do we find the probability of that sequence?

45
00:03:51,560 --> 00:03:56,910
So let's start by discussing our motivation for why we would want to do this again.

46
00:03:56,930 --> 00:04:02,240
We'll begin by recognizing the fact that our joint probability involves only multiplying a bunch of

47
00:04:02,240 --> 00:04:03,140
numbers together.

48
00:04:03,860 --> 00:04:06,450
In particular, these numbers are probabilities.

49
00:04:07,160 --> 00:04:12,440
In the case of language, these probabilities will be very small simply due to the fact that there are

50
00:04:12,440 --> 00:04:14,360
so many words in the English language.

51
00:04:15,140 --> 00:04:20,600
In many applications, it's not uncommon to consider a vocabulary of twenty thousand or fifty thousand

52
00:04:20,600 --> 00:04:21,200
words.

53
00:04:21,649 --> 00:04:24,410
So these probabilities will likely be very small.

54
00:04:25,340 --> 00:04:28,550
Now what happens when you multiply many small numbers together?

55
00:04:29,330 --> 00:04:31,190
The answer is they get even smaller.

56
00:04:31,670 --> 00:04:36,110
So, for example, zero point one times zero point one is zero point zero one.

57
00:04:37,160 --> 00:04:42,530
So as you multiply more and more of these small numbers together, you get closer and closer to zero.

58
00:04:43,400 --> 00:04:46,280
Of course, computers do not have infinite precision.

59
00:04:46,910 --> 00:04:51,560
Eventually, you're going to get so close to zero that the computer will just round the answer down

60
00:04:51,560 --> 00:04:52,160
to zero.

61
00:04:53,360 --> 00:04:59,300
This is very problematic since if we want to compare two sentences which are both very improbable,

62
00:04:59,600 --> 00:05:03,440
if they both end up giving us zero, then our answer will not be valid.

63
00:05:04,340 --> 00:05:08,870
Now, this might sound like a very weird problem to consider, but in practice, you should find that

64
00:05:08,870 --> 00:05:10,160
this happens quite easily.

65
00:05:10,820 --> 00:05:13,670
So I recommend if you don't believe that this is true.

66
00:05:13,940 --> 00:05:18,860
Try it yourself in wait until you encounter it to convince yourself that it can and will happen.

67
00:05:23,560 --> 00:05:25,420
So what's the solution to this issue?

68
00:05:26,260 --> 00:05:28,870
The solution is to work in the log space instead.

69
00:05:29,860 --> 00:05:34,780
That is, instead of finding probabilities directly, we can find a log probabilities.

70
00:05:35,440 --> 00:05:41,320
Note that it's fine to do this transformation because what we often want to do is compare to probabilities.

71
00:05:41,860 --> 00:05:48,250
So for example, if we have two sentences, we want to know which sentence is more likely or if we have

72
00:05:48,250 --> 00:05:53,320
to Markov models, we might want to know under which Markov model does the sequence yield the largest

73
00:05:53,320 --> 00:05:54,130
probability.

74
00:05:55,240 --> 00:06:01,090
Also note that doing this transformation would not change the answer, since the log is a monotonically

75
00:06:01,090 --> 00:06:02,140
increasing function.

76
00:06:02,830 --> 00:06:06,940
If you don't believe this, I recommend trying it yourself using a calculator.

77
00:06:07,480 --> 00:06:12,130
Choose any values and b then take the log of both A and B.

78
00:06:12,790 --> 00:06:18,280
You should find that if A is greater than B, then log is also greater than log B, and this should

79
00:06:18,280 --> 00:06:19,150
always be true.

80
00:06:21,610 --> 00:06:25,330
Now you might ask, how does taking the law solve the under flow problem?

81
00:06:26,170 --> 00:06:29,740
Well, suppose you have a very small value, say, 10 to the minus 10.

82
00:06:30,400 --> 00:06:33,160
Suppose that we take the log based 10 of this value.

83
00:06:33,760 --> 00:06:38,170
This simply gives us minus 10, which a computer has no problem representing.

84
00:06:39,160 --> 00:06:41,530
Consider the number 10 to the minus 100.

85
00:06:42,100 --> 00:06:47,860
This number is a factor of 10 smaller, but if we take the log of this, we get minus 100, which is

86
00:06:47,860 --> 00:06:50,160
again easy to represent with a computer.

87
00:06:54,850 --> 00:06:59,290
Now, under these circumstances, the log probability is very easy to compute.

88
00:07:00,010 --> 00:07:05,350
As you recall, there is a rule for logarithms, which states that the log of a product is the sum of

89
00:07:05,350 --> 00:07:06,010
logs.

90
00:07:06,520 --> 00:07:11,620
In fact, if you know that you won't need the values of an PI directly, you can simply store their

91
00:07:11,620 --> 00:07:12,940
logged values instead.

92
00:07:13,840 --> 00:07:18,730
Note that you would not want to compute the product form and then take the log since you would end up

93
00:07:18,730 --> 00:07:21,520
taking the log of zero, which doesn't solve a problem.

94
00:07:22,450 --> 00:07:26,530
So the solution is to take the sum of the log probabilities instead.

95
00:07:27,640 --> 00:07:33,250
Another thing to notice is that in a computer, doing this sum is more computationally efficient than

96
00:07:33,250 --> 00:07:38,290
computing a product, as you recall, adding as much faster than doing multiplication.

97
00:07:38,920 --> 00:07:44,170
So even if you do not have problems with under flow, you may still want to use log probabilities for

98
00:07:44,170 --> 00:07:44,890
efficiency.