1
00:00:11,090 --> 00:00:15,810
So in this lecture, we will be discussing the Markov property, which is the essential ingredient

2
00:00:15,810 --> 00:00:17,510
that we need for the Markov model.

3
00:00:18,380 --> 00:00:20,030
So what is the Markov property?

4
00:00:20,810 --> 00:00:25,910
Well, in order to understand the Markov property, we need to discuss what we are normally trying

5
00:00:25,910 --> 00:00:28,730
to do in machine learning when we have a sequence.

6
00:00:29,420 --> 00:00:35,570
As you recall, a sequence can be any ordered list of values or symbols, like a time series or a sentence,

7
00:00:35,810 --> 00:00:37,220
which is a sequence of words.

8
00:00:37,730 --> 00:00:45,590
So one common and intuitive thing to do is to ask: what is the probability of a sequence? That is, given

9
00:00:45,590 --> 00:00:47,590
a sequence x one up to x T,

10
00:00:47,990 --> 00:00:51,440
what is the probability of seeing x one up to x T?

11
00:00:52,250 --> 00:00:55,070
Using this, we can do all sorts of useful things.

12
00:00:55,460 --> 00:01:01,550
For example, if we only know the values from x one up to x T minus one, we can ask the question: what

13
00:01:01,550 --> 00:01:05,480
is the distribution of X of T given all the previous values?

14
00:01:06,020 --> 00:01:10,940
This is useful if you want to forecast the next value in a time series or predict the next word in

15
00:01:10,940 --> 00:01:11,690
a sentence.

16
00:01:16,420 --> 00:01:22,810
So the Markov property is a very restrictive assumption on the dependency structure of the joint probability

17
00:01:22,810 --> 00:01:23,620
distribution.

18
00:01:24,490 --> 00:01:31,090
Basically, it says that each symbol X of T depends only on the symbol at the previous time step, X of

19
00:01:31,090 --> 00:01:31,900
T minus one.

20
00:01:32,740 --> 00:01:38,950
It does not depend on X of T minus two, X of T minus three or any other X in the past or in the future.

21
00:01:39,760 --> 00:01:45,310
OK, so X of T depends only on one thing, and that's the X at the preceding time step.

22
00:01:46,150 --> 00:01:49,750
So what does this mean in terms of our joint probability distribution?

23
00:01:50,530 --> 00:01:57,180
Well, it means that the distribution of x one up to x T is equal to p of x one times p of x two given

24
00:01:57,190 --> 00:02:03,310
x one times p of x three given x two and so on, up to p of x T given x T minus one.

25
00:02:04,930 --> 00:02:10,720
Note that the first symbol, x one, is not conditioned on anything, since we're assuming nothing comes

26
00:02:10,720 --> 00:02:11,710
before X1.
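
The factorization just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two weather states, the `initial` and `transition` tables, and the function name `sequence_probability` are all hypothetical and do not come from the lecture.

```python
# Hypothetical toy model: the states and every probability below are made up
# for illustration; none of these numbers come from the lecture.
initial = {"sunny": 0.6, "rainy": 0.4}  # p(x1)
transition = {                          # p(x_t | x_{t-1})
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def sequence_probability(sequence):
    """Joint probability of a sequence under the first-order Markov property."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    # The first symbol is not conditioned on anything.
    prob = initial[sequence[0]]
    # Every later symbol is conditioned only on the one right before it.
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= transition[prev][curr]
    return prob


print(sequence_probability(["sunny", "sunny", "rainy"]))
```

For the three-symbol sequence above, the result is the product 0.6 times 0.8 times 0.2, which is roughly 0.096.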

27
00:02:16,450 --> 00:02:18,820
So here's another way to write the Markov property.

28
00:02:19,510 --> 00:02:24,880
We start with the general form on the left, where we have X of T given X of T minus one, X of T minus

29
00:02:24,880 --> 00:02:26,930
two all the way down to X of one.

30
00:02:27,670 --> 00:02:31,600
This is equal to X of T given X of T minus one only.

31
00:02:32,110 --> 00:02:34,360
And this is if the Markov property is true.

32
00:02:35,410 --> 00:02:42,160
Another way of saying this is that, given X of T minus one, X of T is independent of X of T minus two, X of T minus three and so forth.

33
00:02:46,650 --> 00:02:52,980
Another interesting question we can ask is: what if the Markov property were not true? In this case,

34
00:02:52,980 --> 00:02:55,350
we can still factor out the joint distribution.

35
00:02:55,410 --> 00:02:58,590
It just doesn't look as nice as it did in the Markov case.

36
00:02:59,400 --> 00:03:04,770
To see how we can do this, it helps to start with a simple example with just two variables x one,

37
00:03:04,770 --> 00:03:05,440
and x two.

38
00:03:06,390 --> 00:03:15,270
In this case, we can say p of x one and x two is equal to p of x one times p of x two given x one. As you

39
00:03:15,270 --> 00:03:15,730
recall.

40
00:03:15,750 --> 00:03:19,470
This is just the product rule, which follows from the definition of conditional probability.

41
00:03:24,200 --> 00:03:27,620
Now, what if we have three variables: x one, x two and x three?

42
00:03:28,400 --> 00:03:30,590
In this case, we can still do the same thing.

43
00:03:31,340 --> 00:03:39,230
We can start by saying p of x one and x two and x three is equal to p of x one times p of x two, x three given

44
00:03:39,230 --> 00:03:39,740
x one.

45
00:03:40,700 --> 00:03:46,610
Of course, the next thing to do is to split up the x two, x three term, which gives us p of x two given

46
00:03:46,610 --> 00:03:49,790
x one times p of x three given x one

47
00:03:49,790 --> 00:03:54,020
and x two. This is just the application of the same rule once again.

48
00:03:58,910 --> 00:04:02,210
So in general, this is called the chain rule of probability.

49
00:04:02,930 --> 00:04:08,330
Essentially, what we do is repeatedly split up the distribution, starting with x one, then x two,

50
00:04:08,330 --> 00:04:09,680
then x three and so forth.

51
00:04:10,190 --> 00:04:15,650
And each time we split up the distribution, the next symbol depends on more and more past symbols.

52
00:04:16,310 --> 00:04:19,910
So you see, we've kind of learned about the Markov property backwards.

53
00:04:20,360 --> 00:04:23,600
In fact, what you are seeing here is the most general form.

54
00:04:24,200 --> 00:04:29,990
But if we assume that the Markov property is true, then this simplifies, since each X term does not

55
00:04:29,990 --> 00:04:34,220
depend on any other X terms, except for the immediately preceding term.
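
Written out as equations, the two factorizations compared above look as follows. The subscript notation (x_1 up to x_T, with t as the time index) is an assumption about how the spoken formulas appear on the slides.

```latex
% General case: the chain rule of probability, no assumptions needed
p(x_1, \dots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1}, x_{t-2}, \dots, x_1)

% With the Markov property: each term keeps only the preceding symbol
p(x_1, \dots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})
```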

56
00:04:38,990 --> 00:04:43,220
So now that you know what the Markov property is, let's think about why it's useful.

57
00:04:44,120 --> 00:04:47,720
So imagine that we would like to build a model of the English language.

58
00:04:48,170 --> 00:04:53,720
Suppose that in our model we consider the most commonly used English words, which number around two

59
00:04:53,720 --> 00:04:54,380
thousand.

60
00:04:55,160 --> 00:05:00,200
Now, suppose that we would like to build a probability distribution for the tenth word in a sentence,

61
00:05:00,440 --> 00:05:02,300
given the previous nine words.

62
00:05:02,840 --> 00:05:08,180
So this distribution has ten variables in total: x one, x two, all the way up to x ten.

63
00:05:09,020 --> 00:05:12,650
But each of these ten variables has two thousand possible values.

64
00:05:13,280 --> 00:05:16,640
So what is the total dimensionality of this distribution?

65
00:05:17,360 --> 00:05:21,140
Well, it's two thousand times two thousand times two thousand and so on.

66
00:05:21,230 --> 00:05:22,400
Ten times.

67
00:05:23,000 --> 00:05:25,580
As you can imagine, this is a pretty large number.

68
00:05:26,300 --> 00:05:30,530
As an exercise, you may want to try to calculate how large this number is.

69
00:05:32,930 --> 00:05:35,510
So what's the problem with this being such a large number?

70
00:05:36,350 --> 00:05:39,920
Well, this is the number of probabilities that we have to estimate.

71
00:05:40,460 --> 00:05:47,120
In order to estimate all these numbers, we must have a sufficient amount of data to learn from. As another

72
00:05:47,120 --> 00:05:52,240
bonus exercise, consider whether or not a sufficient amount of data actually exists.
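
As a rough check on the size of this number, here is a short Python sketch using the lecture's figures of about 2,000 words and a 10-word sequence. The comparison with a Markov model is an addition that the lecture does not compute, and it assumes a single table of next-word probabilities shared by every position in the sentence.

```python
# Figures from the lecture: about 2,000 words and a 10-word sequence.
vocab_size = 2000
sequence_length = 10

# Full joint distribution: one probability per possible 10-word sequence.
full_joint_entries = vocab_size ** sequence_length

# Added comparison (not computed in the lecture). Assumption: a first-order
# Markov model with one table p(x_t | x_{t-1}) shared by every position,
# plus the initial distribution p(x1).
markov_entries = vocab_size + vocab_size ** 2

print(f"full joint: {full_joint_entries:.3e} entries")
print(f"Markov:     {markov_entries:,} entries")
```

The full joint table has on the order of ten to the thirty-third entries, while the Markov version needs only about four million.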

73
00:05:56,920 --> 00:06:00,340
Now, we often call the Markov property the Markov assumption.

74
00:06:01,030 --> 00:06:06,490
This is because we often assume that the Markov property holds, even when we know it does not.

75
00:06:07,150 --> 00:06:12,820
For example, with language, it's pretty obvious that the next word in a sentence doesn't only depend

76
00:06:12,820 --> 00:06:14,080
on the preceding word.

77
00:06:14,800 --> 00:06:22,390
Consider the following two sentences: "I like green eggs and ham" and "I like to code in C++ and Python."

78
00:06:23,320 --> 00:06:28,810
If the Markov property were true, this is saying that the words "ham" and "Python" only depend on the

79
00:06:28,810 --> 00:06:29,680
word "and".

80
00:06:30,280 --> 00:06:31,900
But clearly, this is not true.

81
00:06:32,590 --> 00:06:38,500
We can see that the word "ham" becomes more likely because "green eggs" appeared earlier in the sentence.

82
00:06:39,130 --> 00:06:45,280
We can also see that the word "Python" becomes more likely because "code in C++" appeared earlier in the

83
00:06:45,280 --> 00:06:46,270
other sentence.

84
00:06:46,990 --> 00:06:52,690
So using this simple example, it is clear that the Markov property does not hold for language, even

85
00:06:52,690 --> 00:06:56,200
though this is a very popular use case of Markov chains.

86
00:06:57,160 --> 00:07:02,020
However, there is a popular quote in statistics that helps to shed some light on this issue.

87
00:07:02,980 --> 00:07:08,410
The famous statistician George Box once said, "All models are wrong, but some are useful."

88
00:07:09,130 --> 00:07:14,020
That is, we know that the Markov assumption is wrong, but despite this, it still turns out to be

89
00:07:14,020 --> 00:07:15,010
useful anyway.

90
00:07:15,850 --> 00:07:21,610
As mentioned, it's been successfully applied in language, finance, reinforcement learning and biological

91
00:07:21,610 --> 00:07:22,810
sequence analysis.

92
00:07:23,320 --> 00:07:29,080
So although the Markov assumption appears to be very restrictive, it is still useful in the real world.

