1
00:00:11,090 --> 00:00:16,010
OK, so in this lecture, we will be looking at another application of Markov models.

2
00:00:16,640 --> 00:00:22,310
This time we will be using Markov models to generate text instead of just classifying text.

3
00:00:23,090 --> 00:00:29,930
So one important distinction to make is this when you are classifying text, that's an example of supervised

4
00:00:29,930 --> 00:00:33,140
learning because we're using labeled data to build our model.

5
00:00:33,920 --> 00:00:39,380
In this example, we will be doing unsupervised learning because we will not require labeled text.

6
00:00:44,030 --> 00:00:49,430
Now, one factor that might be interesting to some of you is that when we built our base classifier,

7
00:00:49,820 --> 00:00:52,700
that was an example of what we call a generative model.

8
00:00:53,330 --> 00:00:59,180
In particular, we can split up supervised models into two categories based on the kind of distribution

9
00:00:59,180 --> 00:00:59,810
we learn.

10
00:01:00,500 --> 00:01:06,230
These categories are discriminative and generative at a high level discriminative.

11
00:01:06,230 --> 00:01:08,780
Models learn P of Y given X directly.

12
00:01:09,380 --> 00:01:14,750
That is, there's no need to apply Bayes rule because the model already gives us P of Y given X.

13
00:01:15,320 --> 00:01:18,920
Some examples of this are logistic regression and neural networks.

14
00:01:20,270 --> 00:01:23,480
Generative models are the opposite, as you saw.

15
00:01:23,510 --> 00:01:29,630
This involves learning X given Y and later we find P of Y given X using Bayes rule.

16
00:01:30,500 --> 00:01:36,080
So the reason why this is interesting is because this lecture will help us understand why we call these

17
00:01:36,080 --> 00:01:42,740
generative models in the first place by learning P of X, given Y or, in other words, a model of the

18
00:01:42,740 --> 00:01:47,510
data, we will be able to use that distribution to generate more data.

19
00:01:48,080 --> 00:01:53,420
And because this distribution can be used to generate data, we call it a generative model.

20
00:01:58,220 --> 00:02:03,950
OK, so in order to understand how we can use Markov models to generate data, we need to understand

21
00:02:03,950 --> 00:02:09,680
the concept of sampling most of you taking this course will already know how this works, but I'm going

22
00:02:09,680 --> 00:02:11,900
to review it with you just in case.

23
00:02:12,650 --> 00:02:17,990
So let's consider how we might draw a sample from the normal distribution with mean zero and variance

24
00:02:17,990 --> 00:02:18,440
one.

25
00:02:19,190 --> 00:02:22,940
This is the usual Nampai function and spit out random dot random.

26
00:02:24,080 --> 00:02:28,520
OK, so that's an example of how to sample from a continuous distribution.

27
00:02:29,120 --> 00:02:30,830
What about a discrete distribution?

28
00:02:31,820 --> 00:02:36,560
Well, suppose we'd like to draw a sample from the newly which you can think of like a coin toss.

29
00:02:37,100 --> 00:02:39,200
The result can only be zero or one.

30
00:02:40,370 --> 00:02:47,030
Well, one option is to call out random choice, passing in either a list containing zero and one or

31
00:02:47,030 --> 00:02:50,660
passing in the integer to both will yield the same results.

32
00:02:51,590 --> 00:02:53,570
But what if we have more than two options?

33
00:02:54,140 --> 00:02:57,260
Let's suppose we want to draw the numbers zero up to nine.

34
00:02:58,190 --> 00:03:03,770
In this case, we can still use that random choice again, but passing in a 10 instead.

35
00:03:05,430 --> 00:03:11,250
OK, but these examples are too simplistic because they always assign equal probability to each outcome.

36
00:03:12,030 --> 00:03:15,930
What if we want to Bernoulli where the probability of heads is zero point eight?

37
00:03:16,680 --> 00:03:22,260
In this case, we need to make use of the argument, which allows you to assign a probability to every

38
00:03:22,260 --> 00:03:23,250
possible value.

39
00:03:24,990 --> 00:03:30,690
So in the newly case, we would pass in a list containing zero point two and zero point eight in order

40
00:03:30,690 --> 00:03:34,140
to have the probability of sampling a one be 80 percent.

41
00:03:35,580 --> 00:03:41,850
As you recall, this is exactly what we need to sample words from a distribution of words since we are

42
00:03:41,850 --> 00:03:43,890
storing our words as integers.

43
00:03:44,550 --> 00:03:50,310
So as a simple example, suppose that our vocabulary size is three and it contains the words cat, dog

44
00:03:50,310 --> 00:03:51,000
and mouse.

45
00:03:51,540 --> 00:03:56,820
These have the probabilities zero point two, zero point five and zero point three, respectively.

46
00:03:57,750 --> 00:04:03,420
Then supposing that we know how to map the integers zero, one and two to their corresponding word,

47
00:04:03,750 --> 00:04:06,240
we can just use and put out random choice.

48
00:04:07,170 --> 00:04:12,450
Now, please note that there are many more ways to generate samples from a distribution and even to

49
00:04:12,450 --> 00:04:13,680
store a distribution.

50
00:04:14,190 --> 00:04:17,850
So if you had any other ideas, please feel free to use them instead.

51
00:04:18,510 --> 00:04:21,120
Remember that these exercises are for you to do.

52
00:04:21,329 --> 00:04:25,350
You should really be using these opportunities to explore your coding abilities.

53
00:04:25,800 --> 00:04:31,470
So do not treat my code as the target for what you must do, but simply as an example of what you could

54
00:04:31,470 --> 00:04:31,890
do.

55
00:04:36,660 --> 00:04:41,410
So for this lecture, we will be extending our knowledge of the Markov model just a little bit.

56
00:04:42,030 --> 00:04:47,670
In particular, you've seen so far how we can make the mark of assumption where the current state depends

57
00:04:47,670 --> 00:04:49,080
only on the previous state.

58
00:04:49,860 --> 00:04:51,630
However, that's pretty restrictive.

59
00:04:52,290 --> 00:04:54,870
Consider, for example, these two sentences.

60
00:04:55,350 --> 00:05:00,180
I made myself a peanut butter sandwich and I'll go and see her myself.

61
00:05:01,080 --> 00:05:06,270
Now, suppose that we just generated the word myself and we'd like to now generate the next word.

62
00:05:06,930 --> 00:05:12,480
Well, we can see that a peanut butter sandwich is a feasible continuation of the word myself.

63
00:05:13,200 --> 00:05:19,050
Furthermore, we can see that I'll go and see her myself as a feasible start of a sentence containing

64
00:05:19,050 --> 00:05:20,010
the word myself.

65
00:05:21,420 --> 00:05:27,150
However, we can also see that if we only focus on one word at a time, we could end up generating a

66
00:05:27,150 --> 00:05:31,380
sentence like I'll go and see her myself a peanut butter sandwich.

67
00:05:31,860 --> 00:05:34,350
Of course, this sentence does not make sense.

68
00:05:35,400 --> 00:05:38,370
Now consider the sentence the peanut is not a nut.

69
00:05:39,120 --> 00:05:43,500
Again, suppose we just generated the words I made myself a peanut.

70
00:05:44,010 --> 00:05:45,990
If we only focus on the word peanut.

71
00:05:46,290 --> 00:05:49,920
We can see that it is feasible to end with the other sentence.

72
00:05:50,310 --> 00:05:53,250
But altogether, this sentence does not make sense.

73
00:05:53,640 --> 00:05:55,980
I made myself a peanut is not a nut.

74
00:05:56,610 --> 00:05:58,560
This sentence is incomprehensible.

75
00:06:03,350 --> 00:06:09,920
So here's one way to extend the Markov model instead of only depending on a one pass the state, we

76
00:06:09,920 --> 00:06:11,840
can depend on two past states.

77
00:06:12,380 --> 00:06:16,550
In this context, we referred to the number of past states as the model order.

78
00:06:17,120 --> 00:06:24,050
So a model where the current state depends on two past states is called Second Order Markov, a model

79
00:06:24,050 --> 00:06:28,390
where the current state depends on only one party state is called the First Order markup.

80
00:06:29,660 --> 00:06:31,640
OK, so why would you want to do this?

81
00:06:32,210 --> 00:06:37,460
Well, the hope is that by including more information from the past, we can make better predictions

82
00:06:37,460 --> 00:06:38,360
about the future.

83
00:06:38,960 --> 00:06:41,990
Of course, it remains to be seen whether or not this will help.

84
00:06:42,590 --> 00:06:47,330
As always, the model is machine learning is experimentation, not philosophy.

85
00:06:52,100 --> 00:06:58,490
So it's worth thinking about how we will actually implement a second order Markov model, as you recall,

86
00:06:58,490 --> 00:07:03,530
we need to have a new state transition distribution where the current state depends on the previous

87
00:07:03,530 --> 00:07:04,440
two states.

88
00:07:05,000 --> 00:07:10,700
We can write the says price of T equals K given us of T minus one equals J.

89
00:07:11,000 --> 00:07:13,310
And this of T minus two equals I.

90
00:07:14,390 --> 00:07:19,370
So analogously, with our first order Markov model, we can store this in a three dimensional array

91
00:07:19,370 --> 00:07:22,670
we call a with the subscript i, j and K.

92
00:07:23,870 --> 00:07:25,940
Now here's something that should concern you.

93
00:07:26,690 --> 00:07:32,700
We know that each of the indices I, J and K can take on any of the M states as before.

94
00:07:32,720 --> 00:07:40,130
This means our E matrix will have the shape M by M by M or MQ and by Matrix I really mean a tensor.

95
00:07:41,180 --> 00:07:46,610
So what should be concerning about this is that the size of this tensor seems to grow exponentially

96
00:07:46,910 --> 00:07:49,340
with the number of extra previous terms.

97
00:07:49,910 --> 00:07:53,780
So if you want to add another past state, you'll have to the power four.

98
00:07:54,290 --> 00:07:58,280
If you want to have another past state, you'll have enter the power five and so forth.

99
00:07:58,940 --> 00:08:03,170
Of course, it then becomes infeasible to include it too many past states.

100
00:08:03,710 --> 00:08:08,810
Not only would it be computationally intensive to store in memory, you will reach a point where you

101
00:08:08,810 --> 00:08:11,630
don't even have enough data to estimate the values.

102
00:08:16,080 --> 00:08:20,940
The next thing to discuss is that like the first order Markov model, we still need a first order state

103
00:08:20,940 --> 00:08:22,230
transition matrix.

104
00:08:22,770 --> 00:08:28,530
So just to disambiguate between First Order and Second Order, we'll call the First Order Matrix A1

105
00:08:28,530 --> 00:08:31,880
and we'll call the Second Order Tensor a a.

106
00:08:31,890 --> 00:08:36,150
One will have the indices, I.J and A2 will have the indices IDGC.

107
00:08:36,990 --> 00:08:41,190
And of course, these are both in addition to the initial state distribution pie.

108
00:08:42,120 --> 00:08:47,280
This makes sense since when we want to generate the first word in a sentence, we will use pie.

109
00:08:47,880 --> 00:08:51,630
When we want to generate the second word in a sentence, we will use A1.

110
00:08:51,840 --> 00:08:58,860
Since there is only one previous word after this, we will use A2 since at this point, every word will

111
00:08:58,860 --> 00:09:01,170
be preceded by at least two words.

112
00:09:02,640 --> 00:09:08,730
However, note that this version of A1 is not like the transition matrix in a first order Markov model.

113
00:09:09,210 --> 00:09:14,220
This is because A1 is only being used as a model for the second word in a sentence.

114
00:09:14,610 --> 00:09:16,020
Not any word after that.

115
00:09:20,690 --> 00:09:25,610
I also want to do a little preview at this point, which is to think about this in the context of neural

116
00:09:25,610 --> 00:09:26,330
networks.

117
00:09:26,990 --> 00:09:32,300
What you will learn when you study deep learning is that firstly, there is no need to make the mark

118
00:09:32,300 --> 00:09:33,020
of assumption.

119
00:09:33,560 --> 00:09:38,690
But furthermore, almost magically, there is essentially no increase in computational cost.

120
00:09:38,960 --> 00:09:44,660
When you add more previous words now, this is not quite accurate, since the increase will depend on

121
00:09:44,660 --> 00:09:46,340
the type of neural network you use.

122
00:09:46,640 --> 00:09:49,910
But for the purpose of this discussion, you can think of it as linear.

123
00:09:50,480 --> 00:09:56,800
Either way, it's still much more efficient than exponential as we've seen when we do naive accounting,

124
00:09:57,080 --> 00:10:00,470
and we try to measure each probability value one by one.

125
00:10:00,740 --> 00:10:03,980
There is an exponential growth in the computational cost.

126
00:10:04,640 --> 00:10:10,580
So it's a good exercise to think about what is a neural network actually doing that allows this to happen.