1
00:00:11,080 --> 00:00:16,720
So in this lecture, we are going to apply our knowledge of Markov models to build a text classifier,

2
00:00:17,380 --> 00:00:22,510
basically, now that we've done all this theory, it's time to put that theory to a real world application.

3
00:00:23,260 --> 00:00:27,130
So let's start by asking the question What is a text classifier?

4
00:00:27,790 --> 00:00:34,240
A text classifier is a model that will take as input a string of text and output a prediction about

5
00:00:34,240 --> 00:00:35,860
the category it belongs to.

6
00:00:36,490 --> 00:00:41,710
So, for example, suppose that I've trained the model to understand poems by Robert Frost and Edgar

7
00:00:41,710 --> 00:00:42,400
Allan Poe.

8
00:00:42,970 --> 00:00:48,610
I will then input some new text and I will be able to get a prediction about who authored that text.

9
00:00:49,420 --> 00:00:54,370
Now, this might seem like an example which is not useful in the real world, but that's only because

10
00:00:54,370 --> 00:00:56,150
you may have forgotten about my rule.

11
00:00:56,230 --> 00:00:59,410
All data is the same because of this rule.

12
00:00:59,410 --> 00:01:03,310
The same method can be applied to any problem with the same structure.

13
00:01:03,970 --> 00:01:08,170
So, for example, I could train a model to predict whether or not an email is spam.

14
00:01:08,740 --> 00:01:13,570
I could train a model to predict whether a movie review has positive or negative sentiment.

15
00:01:14,350 --> 00:01:19,810
These are two of the most popular applications of text classifiers, and this would be using the exact

16
00:01:19,810 --> 00:01:20,560
same code.

17
00:01:25,400 --> 00:01:30,530
Now, I want to make note that the goal of this example is not to build the best and most accurate text

18
00:01:30,530 --> 00:01:31,310
classifier.

19
00:01:31,880 --> 00:01:36,200
The goal of this example is to exercise and to practice what we have learned.

20
00:01:36,740 --> 00:01:41,000
If our goal is accuracy, then we might simply use a pre-trained a deep neural network.

21
00:01:41,450 --> 00:01:46,910
Of course, plugging our data into one line of code does not count as an exercise that would be like

22
00:01:46,910 --> 00:01:50,870
watching people on TV to exercise instead of doing exercise yourself.

23
00:01:51,500 --> 00:01:56,780
So as you go through this exercise, be mindful of what its purposes and try your hardest to see the

24
00:01:56,780 --> 00:01:58,910
big picture of how everything fits together.

25
00:02:03,620 --> 00:02:05,660
OK, so let's recognize one fact.

26
00:02:06,640 --> 00:02:12,370
Text classification as an example of supervised learning, but Markov models are unsupervised.

27
00:02:12,850 --> 00:02:15,040
This is because we do not use labels.

28
00:02:15,550 --> 00:02:18,610
Our training data consists only of sequences of text.

29
00:02:19,180 --> 00:02:22,660
So how can we make use of markup models in a supervised context?

30
00:02:23,590 --> 00:02:24,940
The answer is Bayes rule.

31
00:02:25,540 --> 00:02:29,350
In fact, what we are really building in this lecture is a Bayes classifier.

32
00:02:34,140 --> 00:02:38,230
The basic idea is this suppose that I have two sets of poems.

33
00:02:38,280 --> 00:02:41,730
Poems by Robert Frost and poems by Edgar Allan Poe.

34
00:02:42,600 --> 00:02:47,760
Now suppose that I train at two separate Markov models for each poet, so each of these will have their

35
00:02:47,760 --> 00:02:50,880
own distinct matrices and PI vectors.

36
00:02:52,390 --> 00:02:58,150
As you recall, one task we can perform when we have a Markov model is to figure out the probability

37
00:02:58,150 --> 00:03:00,280
of a sequence given a pot.

38
00:03:01,120 --> 00:03:04,270
So suppose I have a poem whose author is unknown?

39
00:03:04,990 --> 00:03:07,420
I would like to know which author wrote this poem.

40
00:03:08,350 --> 00:03:13,780
Well, what do I get when I compute the probability of this poem under the different Aizen pies?

41
00:03:14,410 --> 00:03:20,530
The answer is I get pie of poem given author equals Robert Frost and Poem Given Author equals Edgar

42
00:03:20,530 --> 00:03:21,220
Allan Poe.

43
00:03:21,970 --> 00:03:27,700
The conditioning here arises from the fact that each model was trained on data by a specific author.

44
00:03:32,340 --> 00:03:37,680
Now, what I would really like to know is the distribution, which is kind of opposite to what we have.

45
00:03:38,220 --> 00:03:44,400
Specifically, we have of palm given author, as I recall what we would like to know this piece of author

46
00:03:44,400 --> 00:03:45,150
given Palm.

47
00:03:45,900 --> 00:03:51,210
If I have this distribution, then my prediction for which author wrote the poem would simply be the

48
00:03:51,210 --> 00:03:52,710
one with the highest probability.

49
00:03:53,520 --> 00:03:59,610
In general, I will choose the class case star, which is the AMAX of PE of class equals K given the

50
00:03:59,610 --> 00:04:02,130
input x over all classes K.

51
00:04:06,920 --> 00:04:12,500
So now that we understand what distribution we want, let's think about how we can find this distribution.

52
00:04:13,460 --> 00:04:15,380
In fact, this is just Bay's rule.

53
00:04:15,380 --> 00:04:23,570
Once again, specifically of author given poem is equal to a poem given author times b of Arthur divided

54
00:04:23,570 --> 00:04:24,440
by P of Palm.

55
00:04:25,340 --> 00:04:30,590
So this is the basic form of what we need, but there are several ways we can simplify this expression

56
00:04:30,590 --> 00:04:31,280
on the right.

57
00:04:35,950 --> 00:04:43,240
Firstly, note that all we want is the AMAX puV poem does not depend on any specific author, and therefore

58
00:04:43,240 --> 00:04:45,610
it is constant with respect to author.

59
00:04:46,540 --> 00:04:49,060
Therefore, it simply does not need to be considered.

60
00:04:49,600 --> 00:04:51,550
We only need to consider the numerator.

61
00:04:52,630 --> 00:04:56,830
When we take the AMAX, we will get the same answer, whether we include this term or not.

62
00:04:58,030 --> 00:05:03,250
Furthermore, as you recall, with Markov models, we often like to work with log probabilities.

63
00:05:03,790 --> 00:05:09,280
Remember that since the log is a monotonically increasing function, it does not change the answer as

64
00:05:09,280 --> 00:05:10,570
to which is the ARG Max.

65
00:05:11,080 --> 00:05:16,360
And since the numerator only involves multiplication, it's easy to take a log of all the terms.

66
00:05:17,170 --> 00:05:23,200
So what we end up with is that we would like the ARG mass of log of palm giving author plus log p of

67
00:05:23,200 --> 00:05:23,680
author.

68
00:05:24,400 --> 00:05:30,400
And again, recall that log p of palm given author is just the log probability of a markup model, which

69
00:05:30,400 --> 00:05:31,690
we know how to compute.

70
00:05:36,350 --> 00:05:42,980
One further simplification we can make occurs when the prior piece of author is uniform, for example,

71
00:05:42,980 --> 00:05:48,440
given no other information, perhaps there's no reason why Robert Frost or Edgar Allan Poe would be

72
00:05:48,440 --> 00:05:49,130
more likely.

73
00:05:49,940 --> 00:05:53,090
Thus, both would have a 50 percent chance of being chosen.

74
00:05:54,050 --> 00:06:00,200
If this is the case, then Pivot author is also constant, and it can also be removed from the AMAX.

75
00:06:00,770 --> 00:06:05,480
In this case, our decision comes down to choosing the AMAX of Poem Given author.

76
00:06:06,230 --> 00:06:11,480
Of course, this is just the likelihood of once again, and so we call this method maximum likelihood.

77
00:06:12,080 --> 00:06:17,690
But note that this is a special case that can only be used if we assume that the prior is uniform.

78
00:06:22,490 --> 00:06:25,400
OK, so let's recap what we've discussed in this lecture.

79
00:06:26,060 --> 00:06:31,520
We discovered that we can use markup models for text classification by creating a separate Markov model

80
00:06:31,520 --> 00:06:32,550
for each class.

81
00:06:33,230 --> 00:06:39,200
The Markov model can be used to compute the likelihood of X given class equals K for each Class K.

82
00:06:40,070 --> 00:06:43,550
In order to come up with a decision rule, we must apply Bayes rule.

83
00:06:44,300 --> 00:06:51,020
We would like the ARG max of P of class equals K given the input x over all classes k, this is our

84
00:06:51,020 --> 00:06:53,390
prediction, which we will call K star.

85
00:06:54,830 --> 00:07:00,320
The next step is to realize that this posterior probability can be simplified since we don't actually

86
00:07:00,320 --> 00:07:06,800
need to compute it and we only want the ARG Max specifically, we can reduce this decision rule to the

87
00:07:06,800 --> 00:07:09,560
log of the likelihood plus the log of the prior.

88
00:07:10,310 --> 00:07:14,170
By the way, we call this the maximum a posteriori method or map.

89
00:07:15,050 --> 00:07:20,360
Furthermore, we noted that if the prior is uniform, that is, all classes have an equal chance of

90
00:07:20,360 --> 00:07:24,230
being chosen, then our problem reduces to maximum likelihood.

