1
00:00:11,060 --> 00:00:17,030
So in this lecture, we will be discussing how a markov model is represented mathematically and by extension

2
00:00:17,030 --> 00:00:18,200
in computer code.

3
00:00:18,860 --> 00:00:22,160
As you recall, the Markov model is used for modeling sequences.

4
00:00:22,640 --> 00:00:24,830
The question is sequences of what?

5
00:00:25,490 --> 00:00:30,710
For the purpose of this section, we will be considering sequences of categorical symbols, which we

6
00:00:30,710 --> 00:00:32,060
will simply call states.

7
00:00:32,540 --> 00:00:38,060
So, for example, if we're modeling the weather, we might have three states sunny, rainy and cloudy.

8
00:00:38,900 --> 00:00:43,960
If we're modeling language, then our states might be parts of speech tags, for example, noun, a

9
00:00:43,970 --> 00:00:45,830
verb, adjective, and so forth.

10
00:00:46,410 --> 00:00:50,150
Okay, so these are examples of states which are categorical symbols.

11
00:00:50,360 --> 00:00:55,850
And by creating a markov model for these symbols, we will have a model for sequences of these states.

12
00:01:00,370 --> 00:01:03,370
In this section, we will use the letter S to mean state.

13
00:01:03,940 --> 00:01:09,520
Note that in other resources you may see other letters being used such as Z or X, depending on the

14
00:01:09,520 --> 00:01:10,270
context.

15
00:01:10,690 --> 00:01:15,010
But in this section we will use the letter s, since that seems to be a neutral symbol.

16
00:01:15,970 --> 00:01:22,720
Now, given this letter s, we can use the notation s of T to mean the state at timestamp t in this

17
00:01:22,720 --> 00:01:23,110
course.

18
00:01:23,110 --> 00:01:24,640
Time will also be discrete.

19
00:01:25,000 --> 00:01:28,150
So we'll have time, step one at time, step two, and so forth.

20
00:01:28,480 --> 00:01:30,160
But we won't have something like time.

21
00:01:30,160 --> 00:01:31,420
Step 1.5.

22
00:01:32,320 --> 00:01:38,200
Okay, so one convention we will use is that we are going to number the states from one up to M where

23
00:01:38,200 --> 00:01:40,420
M is the total number of possible states.

24
00:01:40,870 --> 00:01:45,130
So for the example where the states are a cloudy, rainy and sunny M would be three.

25
00:01:46,240 --> 00:01:51,010
In general, when we want to refer to a specific state, we'll use the letter I or J.

26
00:01:51,430 --> 00:01:59,200
So for example, press of T equals I translates to the probability that the state at time t is equal

27
00:01:59,200 --> 00:01:59,770
to I.

28
00:02:04,560 --> 00:02:09,539
Now, of course, we can have such a probability for all values of AI from one up to m.

29
00:02:10,229 --> 00:02:15,780
This gives us M different probabilities, and together they form a probability distribution.

30
00:02:16,440 --> 00:02:18,240
This distribution has a special name.

31
00:02:18,330 --> 00:02:20,550
We call it the state distribution.

32
00:02:21,150 --> 00:02:26,400
So, for example, the state distribution would answer the question What is the probability that the

33
00:02:26,400 --> 00:02:28,080
weather is rainy on Sunday?

34
00:02:28,950 --> 00:02:34,260
Note that as shorthand we may simply write parts of T to refer to this distribution.

35
00:02:38,940 --> 00:02:44,310
Now, as you recall, Markov models are all about modelling sequences where each state depends only

36
00:02:44,310 --> 00:02:46,230
on the previous state we came from.

37
00:02:46,890 --> 00:02:54,480
So using mathematical symbols we can write p of us of t equals j given as of t minus one equals II.

38
00:02:55,260 --> 00:02:56,520
So what does this mean?

39
00:02:57,240 --> 00:03:03,540
If we translate this into words, it translates to the probability that the state at time t is equal

40
00:03:03,540 --> 00:03:04,200
to j.

41
00:03:04,650 --> 00:03:07,890
Given that the state of time t minus one was equal to.

42
00:03:07,890 --> 00:03:11,460
I note that this is a conditional distribution.

43
00:03:12,510 --> 00:03:18,960
Now, one question we can consider is this How many of these conditional probability values exist?

44
00:03:19,680 --> 00:03:25,860
Well, we must have a value for every possible ie from one up to M and every possible j from one up

45
00:03:25,860 --> 00:03:26,310
to M.

46
00:03:27,000 --> 00:03:31,170
Therefore, we should have m times m values which is m squared.

47
00:03:35,780 --> 00:03:41,360
Now you may have noticed that it's quite tedious to write down PV as a t equals j given as of t minus

48
00:03:41,360 --> 00:03:42,230
one equals II.

49
00:03:42,740 --> 00:03:45,050
In fact, we normally avoid writing this.

50
00:03:45,590 --> 00:03:49,640
Instead, we can simply store all these values in a matrix called a.

51
00:03:50,150 --> 00:03:52,700
We call this the State Transition Matrix.

52
00:03:53,300 --> 00:03:58,940
Our convention is that AJ is equal to the probability that the previous state was I.

53
00:03:58,970 --> 00:04:00,350
And the next state is J.

54
00:04:01,070 --> 00:04:07,340
Thus the first index corresponds to the previous state, and the second index corresponds to the next

55
00:04:07,340 --> 00:04:07,670
state.

56
00:04:08,360 --> 00:04:10,940
And clearly it has the shape M by M.

57
00:04:11,270 --> 00:04:13,280
It has m rows and columns.

58
00:04:17,769 --> 00:04:20,740
Now, you might have noticed something weird about our definition of that.

59
00:04:21,579 --> 00:04:25,960
If we look closely, we can recognize that, in fact, one variable is missing.

60
00:04:26,620 --> 00:04:28,540
The variable that's missing is the time.

61
00:04:28,540 --> 00:04:31,030
T So what happens to t?

62
00:04:32,140 --> 00:04:36,370
So in general, these conditional distributions may change over time.

63
00:04:36,760 --> 00:04:42,970
That is to say they will be functions of T, but for the Markov model, this is usually not the case.

64
00:04:43,540 --> 00:04:49,900
Instead, we will assume that the same probabilities hold for all values of t that is across all time.

65
00:04:50,680 --> 00:04:55,180
When this is true, we say that the Markov process is time homogeneous.

66
00:04:55,600 --> 00:04:57,670
For us, this will always be the case.

67
00:05:02,420 --> 00:05:06,640
Now, although this will not be very useful for us in terms of applying market models.

68
00:05:06,650 --> 00:05:10,340
I thought it was worth mentioning since students often like seeing pictures.

69
00:05:11,030 --> 00:05:15,620
So one way of representing a markov model is with a state transition diagram.

70
00:05:16,340 --> 00:05:20,750
In this type of picture, we have one circle for each possible value of the state.

71
00:05:21,320 --> 00:05:26,150
The arrows represent the probability of transitioning from one state to the next.

72
00:05:26,780 --> 00:05:31,580
So, for example, the probability of it going from sunny to sunny is 0.8.

73
00:05:32,030 --> 00:05:35,450
The probability of going from cloudy to sunny is 0.2.

74
00:05:36,830 --> 00:05:40,940
Note that this diagram assumes that the Markov model is time homogeneous.

75
00:05:41,240 --> 00:05:47,450
Since these probabilities hold, no matter what time step we are at as an exercise, try to figure out

76
00:05:47,660 --> 00:05:50,930
what the state transition matrix for this Markov model would be.

77
00:05:55,780 --> 00:06:00,730
Now, if all we have is the state transition matrix, A is our picture complete?

78
00:06:01,330 --> 00:06:02,320
The answer is no.

79
00:06:02,890 --> 00:06:07,450
You see, A only tells us the probability of moving from one state to the next.

80
00:06:07,840 --> 00:06:14,140
But every sequence starts somewhere, each value, and tells us the probability of going from the previous

81
00:06:14,140 --> 00:06:15,670
state to the next state.

82
00:06:16,180 --> 00:06:20,710
But if we're looking at the very first state in a sequence, there is no previous state.

83
00:06:21,280 --> 00:06:26,050
So for a concrete example of this, just think about a sentence, which is a sequence of words.

84
00:06:26,590 --> 00:06:30,760
Suppose our sentence is the quick brown fox jumps over the lazy dog.

85
00:06:31,330 --> 00:06:33,210
The first word in the sentence is the.

86
00:06:33,550 --> 00:06:35,470
And nothing comes before this word.

87
00:06:36,070 --> 00:06:41,380
To handle this issue, we need to define another distribution called the initial state distribution.

88
00:06:42,340 --> 00:06:46,090
Typically, we use the Greek letter pie to represent this distribution.

89
00:06:46,690 --> 00:06:50,500
So pi a vi z equal to this one equals II.

90
00:06:51,250 --> 00:06:53,080
In this case, we say a sub one.

91
00:06:53,350 --> 00:06:57,430
Because we're assuming that time one is the first time step in the sequence.

92
00:06:58,900 --> 00:07:04,210
Note that since I can take on the values from one up to m pi as a vector of size m.

93
00:07:08,900 --> 00:07:09,260
Okay.

94
00:07:09,260 --> 00:07:11,240
So let's recap what we've learned so far.

95
00:07:11,930 --> 00:07:17,660
We've learned that a markov model is described by two distributions, the state transition matrix A

96
00:07:17,930 --> 00:07:20,090
and the initial state distribution pie.

97
00:07:20,750 --> 00:07:24,290
Given these two objects, we can ask some interesting questions.

98
00:07:24,860 --> 00:07:28,400
Here are two questions we're going to try to answer in this lecture.

99
00:07:29,270 --> 00:07:34,220
The first question is given a and pie and the sequence s of one of two s of t.

100
00:07:34,670 --> 00:07:37,280
What is the probability of this sequence?

101
00:07:38,030 --> 00:07:40,790
And the second question we're going to answer is very practical.

102
00:07:41,330 --> 00:07:45,800
How do we find a n py in the first place as usual?

103
00:07:45,800 --> 00:07:51,200
Because this is machine learning, we'll be given a dataset and using this dataset will determine some

104
00:07:51,200 --> 00:07:54,650
procedure for estimating the best values of and py.

105
00:07:59,350 --> 00:07:59,710
Okay.

106
00:07:59,710 --> 00:08:01,870
So let's start by tackling the first question.

107
00:08:02,530 --> 00:08:08,500
Given a sequence of states in the Markov model, which is an PI, how do we find the probability of

108
00:08:08,500 --> 00:08:09,340
that sequence?

109
00:08:10,060 --> 00:08:12,010
Luckily, we've already discussed this.

110
00:08:12,610 --> 00:08:17,830
In general, we could apply the chain rule of probability, but thanks to the Markov property, this

111
00:08:17,830 --> 00:08:20,620
simplifies to the product of state transitions.

112
00:08:21,850 --> 00:08:26,860
The difference now is that we will be using our new variables in PI to represent our answer.

113
00:08:28,330 --> 00:08:33,130
It's also important at this point to start thinking about how you would do this in a computer.

114
00:08:34,330 --> 00:08:40,659
Inside a computer, a will be a matrix or a 2D array and pi will be a vector or a1d array.

115
00:08:41,350 --> 00:08:45,670
Thus you should consider how you would write some code to carry out this computation.

116
00:08:46,060 --> 00:08:50,530
Multiplying the required components of Pioneer, perhaps in some kind of loop.

117
00:08:55,310 --> 00:08:55,670
Okay.

118
00:08:55,670 --> 00:09:02,570
So the next question we want to answer is how do we train a markov model now to fully derive this rigorously,

119
00:09:02,570 --> 00:09:03,440
take some work.

120
00:09:03,470 --> 00:09:06,200
So for now, we'll only discuss the intuition.

121
00:09:06,860 --> 00:09:09,230
But the intuition is very easy to understand.

122
00:09:09,590 --> 00:09:11,720
In fact, it requires nothing but counting.

123
00:09:12,800 --> 00:09:13,700
So to begin.

124
00:09:13,730 --> 00:09:18,350
Suppose that we flip a coin a bunch of times, which is a sequence of heads and tails.

125
00:09:18,680 --> 00:09:23,120
And we would like to know what is our best guess for the probability of heads?

126
00:09:23,990 --> 00:09:25,970
Well, we all know the answer to this question.

127
00:09:26,120 --> 00:09:30,020
It's the number of heads divided by the total number of coin tosses.

128
00:09:30,800 --> 00:09:35,000
As you recall, this is a binary distribution, also known as a Bernoulli.

129
00:09:35,960 --> 00:09:38,540
But in our case, we can have M greater than two.

130
00:09:39,110 --> 00:09:42,080
For example, each state is a word in the English dictionary.

131
00:09:42,860 --> 00:09:44,600
In this case, the answer is similar.

132
00:09:45,350 --> 00:09:51,170
Suppose that we've taken all of Wikipedia and we would like to know what is our best guess for the probability

133
00:09:51,170 --> 00:09:52,580
of seeing the word cat.

134
00:09:53,300 --> 00:09:59,450
The answer is the number of times the word cat appears divided by the total number of words in the text.

135
00:10:04,040 --> 00:10:04,400
Okay.

136
00:10:04,400 --> 00:10:09,230
So how can we apply this principle to estimating an pi in a mark of model?

137
00:10:10,250 --> 00:10:12,440
Let's start with pie since that's a bit simpler.

138
00:10:13,220 --> 00:10:19,880
Pie a VI is equal to the number of times each sequence started with State I divided by the total number

139
00:10:19,880 --> 00:10:21,650
of sequences in our dataset.

140
00:10:22,340 --> 00:10:26,780
Of course, this requires that our dataset consists of multiple sequences to begin with.

141
00:10:27,230 --> 00:10:29,540
Otherwise our estimate would not be very good.

142
00:10:31,130 --> 00:10:32,930
Okay, so what about AJ?

143
00:10:33,710 --> 00:10:41,390
Our best guess for AJ is the number of times we transition from state AI to state j divided by the total

144
00:10:41,390 --> 00:10:43,490
number of times we were in State II.

145
00:10:44,450 --> 00:10:47,900
Now, just in case this is a bit confusing, let's think of an example.

146
00:10:48,770 --> 00:10:52,380
Suppose that we're working with the English language and each state is a word.

147
00:10:53,150 --> 00:10:57,990
We would like to know what is the probability of seeing the word cat following the word the.

148
00:10:59,540 --> 00:11:04,310
This is simply the number of times we see the sequence the cat divided by the total.

149
00:11:04,310 --> 00:11:06,950
Number of times we see the word the by itself.

150
00:11:11,690 --> 00:11:12,080
Okay.

151
00:11:12,080 --> 00:11:16,550
So let's summarize what we've learned in this lecture since we've covered quite a lot of new content.

152
00:11:17,150 --> 00:11:22,490
In this lecture, we learn that using more generic terms, our job is to model a sequence of states.

153
00:11:23,060 --> 00:11:29,450
Our Markov model is made up of two objects the state transition matrix A and the initial state distribution

154
00:11:29,450 --> 00:11:29,990
pie.

155
00:11:30,710 --> 00:11:33,680
At this point, you know what these probabilities represent.

156
00:11:34,430 --> 00:11:36,000
We also learn how to answer it.

157
00:11:36,020 --> 00:11:38,900
Two important questions concerning Markov models.

158
00:11:39,440 --> 00:11:46,100
The first question we learn to answer is Given a sequence of states and a mark of model, find the probability

159
00:11:46,220 --> 00:11:47,360
of that sequence.

160
00:11:48,140 --> 00:11:52,370
The second question we learn to answer is given a dataset of sequences.

161
00:11:52,640 --> 00:11:56,330
Find the best fitting Markov model, which means find a in pie.

162
00:11:57,260 --> 00:12:02,390
Note that because this is machine learning, we call the process of finding a in pie, the learning

163
00:12:02,390 --> 00:12:04,220
process or the training process.

164
00:12:04,880 --> 00:12:09,470
Now, again, it's always useful to think about how you would implement this in computer code.

165
00:12:09,920 --> 00:12:15,410
So as an exercise, consider thinking about how you would take a bunch of sentences like what you would

166
00:12:15,410 --> 00:12:19,160
find on Wikipedia or your favorite book and compute a pie.

167
00:12:19,970 --> 00:12:24,950
In fact, if you really want to test your understanding, you should write the code to actually do this.