1
00:00:11,040 --> 00:00:16,920
So in this lecture, we'll continue discussing LDA and specifically we'll talk about the intuition behind

2
00:00:16,920 --> 00:00:19,280
how LDA works now.

3
00:00:19,290 --> 00:00:25,860
LDA is a topic which could cover a whole course by itself, and it has many prerequisites in my experience,

4
00:00:25,860 --> 00:00:28,830
more than typical deep learning and machine learning and so forth.

5
00:00:29,850 --> 00:00:34,320
So this lecture will discuss some high level ideas, and if you want to know the details, check out

6
00:00:34,320 --> 00:00:36,490
the paper in extra reading dot text.

7
00:00:37,830 --> 00:00:40,350
Now you'll notice I mark this lecture as advanced.

8
00:00:40,920 --> 00:00:43,410
Why is it advanced if it's just intuition?

9
00:00:44,220 --> 00:00:49,080
The answer is that I'll be mentioning some of the techniques we use in Bayesian machine learning, which

10
00:00:49,080 --> 00:00:54,120
should be intuitive if you have some experience, but we'll probably seem like magic if you do not.

11
00:00:54,750 --> 00:00:59,670
This is just to give you some idea of what kind of prerequisites would be needed if you wanted a full

12
00:00:59,670 --> 00:01:01,590
understanding of this method yourself.

13
00:01:06,270 --> 00:01:09,960
Now, our goal in this lecture is to essentially understand this picture.

14
00:01:10,620 --> 00:01:14,760
This is called a graphical model, which uses a convention called plate notation.

15
00:01:15,510 --> 00:01:20,310
I want to start with this because although it looks complicated now, hopefully I'll be able to fill

16
00:01:20,310 --> 00:01:22,080
in the blanks by the end of this lecture.

17
00:01:22,620 --> 00:01:24,870
So just keep this picture in mind as our goal.

18
00:01:29,610 --> 00:01:34,470
So to begin building our intuition, I'd like to start with clustering, but instead of just clustering,

19
00:01:34,890 --> 00:01:39,940
let's compare it with classification, which is the supervised analog, as you recall.

20
00:01:39,960 --> 00:01:45,540
These are both essentially doing the same thing, which is mapping an input X to some Category Z or

21
00:01:45,540 --> 00:01:46,020
Y.

22
00:01:46,920 --> 00:01:51,570
In the unsupervised case we call a Z, and it represents a cluster identity.

23
00:01:52,590 --> 00:01:56,550
In the supervised case, we call it Y, and it represents a class.

24
00:01:57,120 --> 00:02:01,800
But although they have different names, they both still represent discrete categories.

25
00:02:03,090 --> 00:02:07,740
So you'll notice that for our classifier, we are using the Bayes classifier, which, as you recall,

26
00:02:08,070 --> 00:02:09,630
is a generative approach.

27
00:02:10,259 --> 00:02:12,690
In fact, both of these are generative models.

28
00:02:13,590 --> 00:02:18,870
In the supervised case we have whi which points to X and in the unsupervised case we have Z, which

29
00:02:18,870 --> 00:02:23,580
points to X. Now there's a slight difference in how I've colored these circles.

30
00:02:24,240 --> 00:02:27,000
In the supervised case, both X and Y are dark.

31
00:02:27,450 --> 00:02:30,960
But in the unsupervised case, only X is dark while Z is light.

32
00:02:31,500 --> 00:02:37,500
This is not a mistake in graphical model notation, and dark means it's a variable we observe, while

33
00:02:37,500 --> 00:02:39,780
light means it's a variable we did not observe.

34
00:02:44,510 --> 00:02:46,550
But I want to point out another similarity.

35
00:02:47,150 --> 00:02:52,340
If you have experience with these models, then perhaps you know that they both use Bayes rule to find

36
00:02:52,340 --> 00:02:58,610
what is called a posterior distribution in the supervisor case, we want P of Y given X.

37
00:02:59,150 --> 00:03:07,010
We do this by applying Bayes rule in particular p of X given y times B of Y divided by Pivac's Pivac's

38
00:03:07,010 --> 00:03:14,150
given a Y describes the relationship or the arrow between X and Y, for example, given a Class Y X

39
00:03:14,150 --> 00:03:18,800
is a Gaussian in the unsupervised case, we want P of Z given X.

40
00:03:19,340 --> 00:03:25,580
Again, we do this by applying Bayes rule, which is p of X given z type of z divided by P of X.

41
00:03:26,540 --> 00:03:29,990
So it's the exact same equation, except the Z replaces Y.

42
00:03:30,890 --> 00:03:34,940
Now you realize that this poses a problem, as you recall.

43
00:03:34,970 --> 00:03:36,380
Z is not observed.

44
00:03:36,920 --> 00:03:39,280
Therefore, we don't actually know P of Z.

45
00:03:39,290 --> 00:03:42,890
And since we never see Z, how can we compute any of this?

46
00:03:43,610 --> 00:03:46,760
The answer is through a method called expectation maximization.

47
00:03:47,480 --> 00:03:51,470
Of course, we won't discuss how that works in this course, since that's essentially a whole course

48
00:03:51,470 --> 00:03:52,160
by itself.

49
00:03:52,580 --> 00:03:55,190
But again, I want to let you know the names of these techniques.

50
00:03:55,430 --> 00:04:00,790
In case you want to study them yourself, if you're interested in resources to learn these techniques,

51
00:04:01,010 --> 00:04:02,810
just contact me on my website.

52
00:04:07,460 --> 00:04:12,740
Now, using these graphical models, it allows us to tell a story about how our data was created.

53
00:04:13,310 --> 00:04:15,800
This is where our model assumptions are embedded.

54
00:04:16,700 --> 00:04:18,860
Since the models I've introduced are quite simple.

55
00:04:19,130 --> 00:04:23,780
It might seem trivial, but I hope that this makes more sense as we encounter further examples.

56
00:04:24,860 --> 00:04:30,050
Now being that both of these models have the same graphs, they both follow the same data generating

57
00:04:30,050 --> 00:04:30,800
process.

58
00:04:31,460 --> 00:04:33,140
Let's start with the super vice case.

59
00:04:34,460 --> 00:04:38,870
In the supervisor case, suppose that we have two classes spam and not spam.

60
00:04:39,650 --> 00:04:45,080
What this model tells us is that since there are no arrows going into the wide circle, these values

61
00:04:45,080 --> 00:04:47,460
are simply generated out of thin air from of.

62
00:04:47,510 --> 00:04:54,860
Why not given any other information, for example, of why might be 20 percent spam and 80 percent not

63
00:04:54,860 --> 00:04:55,370
spam?

64
00:04:56,060 --> 00:05:00,260
So using that distribution, we can sample a realization of why.

65
00:05:01,130 --> 00:05:04,640
Suppose that for this instance, why is spam from this?

66
00:05:04,640 --> 00:05:07,190
Why we can then follow the arrow to X?

67
00:05:07,910 --> 00:05:11,330
Suppose that x seven y is multi nominal as we've seen.

68
00:05:12,200 --> 00:05:18,050
Then we simply draw a sample from this multi normal distribution, which gives us, say, five instances

69
00:05:18,050 --> 00:05:22,970
of car, six instances of insurance, three instances of cash and so forth.

70
00:05:25,060 --> 00:05:30,460
Note that since this multi no is dependent on why there will be different distributions, depending

71
00:05:30,460 --> 00:05:32,440
on whether why was spam or not spam.

72
00:05:33,970 --> 00:05:39,040
So anyway, this will result in our X vector, which contains the various counts for the words in our

73
00:05:39,040 --> 00:05:39,700
corpus.

74
00:05:41,230 --> 00:05:44,470
Importantly, notice how this is an unrealistic process.

75
00:05:45,040 --> 00:05:47,080
Nobody generates an email this way.

76
00:05:47,770 --> 00:05:52,540
Instead, we generate emails by writing them so that they are coherent and makes sense.

77
00:05:52,960 --> 00:05:57,100
We don't actually create documents by picking how many times each word should appear.

78
00:05:57,790 --> 00:06:02,350
But remember that this is a model and models always come with simplifying assumptions.

79
00:06:03,220 --> 00:06:08,630
It may not be realistic, but it does the job we need it to do in these instances.

80
00:06:08,650 --> 00:06:12,760
I always like to remind students of the quote by famous statistician George Box.

81
00:06:13,270 --> 00:06:15,940
He says all models are wrong, but some are useful.

82
00:06:16,480 --> 00:06:18,160
So this obviously applies here.

83
00:06:19,810 --> 00:06:25,690
Now, let's consider the unsupervised case as mentioned since the graphical model is the same.

84
00:06:26,080 --> 00:06:29,260
There actually is no difference in the data generating process.

85
00:06:29,770 --> 00:06:34,900
We still sample a z, which again comes out of thin air and doesn't depend on anything else.

86
00:06:35,620 --> 00:06:42,100
Then, given Z, we follow the arrow to X, which tells us how to generate X given the different values

87
00:06:42,100 --> 00:06:42,730
of Z.

88
00:06:43,510 --> 00:06:48,730
The major difference in the supervised case and the unsupervised case is with the data we are given.

89
00:06:49,600 --> 00:06:52,330
The model is the same, but the data is different.

90
00:06:52,960 --> 00:06:59,530
We never get to observe the ZS, but we do observe the what the Z's are simply absent from our dataset,

91
00:06:59,830 --> 00:07:01,180
which means we have to guess.

92
00:07:01,870 --> 00:07:03,850
In addition to guessing what they mean.

93
00:07:04,090 --> 00:07:07,630
You'll recall that we also have to guess how many there are in the first place.

94
00:07:12,220 --> 00:07:17,140
Now, there's one small addition I want to make to our graphical model, which is to introduce the full

95
00:07:17,140 --> 00:07:18,220
plate notation.

96
00:07:19,030 --> 00:07:24,430
As you can see, I've had it rectangles two hour diagrams with the letter N. in the bottom right corner.

97
00:07:25,390 --> 00:07:30,220
Basically, you can think of this like a for loop in the letter N. That tells you how many times that

98
00:07:30,220 --> 00:07:31,090
loop runs.

99
00:07:31,840 --> 00:07:34,150
So you can imagine this process and pseudo code.

100
00:07:34,780 --> 00:07:36,400
We have a for loop with the index.

101
00:07:36,400 --> 00:07:39,730
I going from one up to n inside the loop.

102
00:07:39,740 --> 00:07:45,610
We first draw sample from P of Y to get a realization of Y, which we do know y survie.

103
00:07:46,450 --> 00:07:52,750
From this, we then sample from P of X given y Saibai to get x of--i, which is a document belonging

104
00:07:52,750 --> 00:07:53,980
to the Class Y Saibai.

105
00:07:54,610 --> 00:07:56,290
So that's pretty much all there is to it.

106
00:08:00,940 --> 00:08:06,220
So now that we understand clustering from a graphical model perspective, we can think of unsupervised

107
00:08:06,220 --> 00:08:08,560
learning in sort of a big picture sense.

108
00:08:09,280 --> 00:08:14,080
This slide shows three different unsupervised models in order of increasing complexity.

109
00:08:14,800 --> 00:08:19,450
So the left side is the simplest and the right side is the most complex, which is LDA.

110
00:08:20,350 --> 00:08:24,400
On the left, we have what is called a u anagram model, which will describe shortly.

111
00:08:25,060 --> 00:08:29,470
We can visualize a MoneyGram model in terms of a single, multivariate Gaussian.

112
00:08:29,860 --> 00:08:33,220
Although do note that we don't typically use Gaussians for word counts.

113
00:08:34,299 --> 00:08:37,570
The point of this diagram is to show that there are no clusters.

114
00:08:37,960 --> 00:08:42,940
Our model assumes that all data comes from the same cluster or, in other words, the same topic.

115
00:08:44,330 --> 00:08:48,830
In the middle, we have clustering once again, as you may have learned in the past.

116
00:08:49,250 --> 00:08:51,980
One way to view clustering is that it's a mixture model.

117
00:08:52,640 --> 00:08:54,260
Again, we can visualize this.

118
00:08:54,920 --> 00:08:59,630
So this is a Gaussian mixture model, but again, be aware that we typically do not use Gaussians for

119
00:08:59,630 --> 00:09:00,210
NLP.

120
00:09:01,670 --> 00:09:06,860
Finally, on the right side, we have LDA, which unfortunately is not as easy to visualize.

121
00:09:07,490 --> 00:09:11,870
However, graphical models in played notation give us the language we need to describe it.

122
00:09:16,580 --> 00:09:20,840
So the simplest model is the one where there are no hidden variables and no clusters.

123
00:09:21,260 --> 00:09:24,440
We simply have the word by itself in the paper.

124
00:09:24,440 --> 00:09:29,120
They call this the you anagram model, which makes sense since it's a bag of words and each word is

125
00:09:29,120 --> 00:09:30,830
independent of the other words.

126
00:09:31,730 --> 00:09:34,490
You'll notice that here we have two plates which are nested.

127
00:09:35,180 --> 00:09:38,940
In particular, we have the outside plate, which is a loop over and samples.

128
00:09:39,320 --> 00:09:42,320
And we have the inside plate, which is a loop over de words.

129
00:09:43,040 --> 00:09:50,580
Of course, this results in an end by documents, by terms matrix, as we've seen many times using pseudocode.

130
00:09:50,600 --> 00:09:54,980
We can again imagine the data generating process as a nested loop.

131
00:09:55,910 --> 00:09:59,510
So we have four i from one to N 4G from one to B.

132
00:09:59,930 --> 00:10:05,900
Then for each exigé draw a sample for the number of times where J appears in Document I.

133
00:10:10,620 --> 00:10:15,300
Let's now return to the clustering model, which in the paper is called a mixture of anagrams.

134
00:10:15,960 --> 00:10:21,780
In this picture, I've again used two nested plates with one over the samples and one over the words.

135
00:10:22,440 --> 00:10:26,880
As you can see, for each sample, we first generate a topic Ze'evi.

136
00:10:27,120 --> 00:10:30,890
So there's one topic per sample from this topic.

137
00:10:30,900 --> 00:10:32,790
We then loop through each of our D words.

138
00:10:33,180 --> 00:10:38,220
And for each of these words, we generate a count exigé given z sub by.

139
00:10:43,030 --> 00:10:46,330
So at this point, we're finally ready to understand LDA.

140
00:10:47,080 --> 00:10:52,180
The reason I've done things this way is because the model is kind of weird and how it generates topics,

141
00:10:52,870 --> 00:10:56,600
whereas previously we had one word for topic with LDA.

142
00:10:56,620 --> 00:11:03,490
We actually sample a new topic for every word that is, we sample a topic as many times as we sample

143
00:11:03,490 --> 00:11:04,330
the word count.

144
00:11:05,290 --> 00:11:11,410
You'll notice that the LDA model has many more variables in addition to Z, which is the topic, and

145
00:11:11,410 --> 00:11:12,670
X, which is the word.

146
00:11:13,090 --> 00:11:15,160
We also have theta, beta and alpha.

147
00:11:15,850 --> 00:11:19,630
These are called priors, which are an integral part of Bayesian models.

148
00:11:20,260 --> 00:11:25,210
We'll discuss that more later, but for now, we can use what we've learned to understand the data generating

149
00:11:25,210 --> 00:11:30,460
process as you can see Alpha and beta show up outside the plates.

150
00:11:30,880 --> 00:11:36,130
So these are fixed for all end documents and all the words inside the first plate.

151
00:11:36,130 --> 00:11:38,620
We sampled Theta for each of the documents.

152
00:11:39,310 --> 00:11:41,560
Theta is a prior for the topic z.

153
00:11:41,920 --> 00:11:45,480
Well, Alpha is the prior four theta from Theta.

154
00:11:45,490 --> 00:11:49,000
We follow the arrow to Z, but this is inside another plate.

155
00:11:49,720 --> 00:11:56,530
This means that for each of the documents will be sampling a new topic d times, but notice that these

156
00:11:56,530 --> 00:12:00,010
topics are not completely random because they depend on theta.

157
00:12:00,850 --> 00:12:06,460
So practically speaking, you can imagine that Theta might be biased to produce topics in tech like

158
00:12:06,460 --> 00:12:09,490
software, Microsoft, Spam, Google and so forth.

159
00:12:10,840 --> 00:12:14,560
Now you'll notice that both the Xanax are inside the nested plates.

160
00:12:15,160 --> 00:12:18,310
This means that they are both sampled the same number of times.

161
00:12:19,120 --> 00:12:23,740
In other words, we can think of both Z and X as matrices indexed by AI and J.

162
00:12:24,520 --> 00:12:29,350
Although X is a matrix of word counts and Z is a matrix of categories which are topics.

163
00:12:30,580 --> 00:12:35,830
Again, I want to mention that this is weird because we normally think of a document as belonging to

164
00:12:35,830 --> 00:12:37,990
one topic or a mixture of topics.

165
00:12:38,650 --> 00:12:42,730
Llda is more complex because it samples a different topic for every word.

166
00:12:44,070 --> 00:12:49,770
Finally, note that the word count X also depends on beta, which you can think of like a prior over

167
00:12:49,770 --> 00:12:50,700
the word counts.

168
00:12:55,200 --> 00:12:59,070
Now, this is essentially as far as I want to go in explaining this topic.

169
00:12:59,670 --> 00:13:04,770
The key part of this was to understand plate notation so that you can understand the data generating

170
00:13:04,770 --> 00:13:09,030
process, which means we understand the model assumptions of LDA.

171
00:13:09,690 --> 00:13:14,400
We won't get into the finer points of this model like what it means to be Bayesian or how the model

172
00:13:14,400 --> 00:13:15,630
parameters are estimated.

173
00:13:16,260 --> 00:13:21,780
If you're interested, basically the key principle in Bayesian learning is that everything is a random

174
00:13:21,780 --> 00:13:22,290
variable.

175
00:13:23,130 --> 00:13:27,810
Because of this, everything has a distribution and every distribution has parameters.

176
00:13:28,350 --> 00:13:32,550
When those parameters are priors to other parameters, we call them hyper parameters.

177
00:13:37,160 --> 00:13:42,470
So, for example, under normal machine learning circumstances, Theta would be a parameter or, in

178
00:13:42,470 --> 00:13:44,210
other words, a set of numbers or weights.

179
00:13:44,420 --> 00:13:45,380
And that would be it.

180
00:13:45,980 --> 00:13:51,680
But since this is a Bayesian model, this theta is not just the set of weights, but a distribution

181
00:13:51,680 --> 00:13:52,580
over weights.

182
00:13:53,090 --> 00:13:55,850
This distribution has a hyper parameter called alpha.

183
00:13:57,050 --> 00:14:02,000
This also brings us to why this model is called latent directly allocation in the first place.

184
00:14:02,720 --> 00:14:04,610
Now this goes far beyond this course.

185
00:14:04,610 --> 00:14:08,900
But if you studied Bayesian machine learning with me in the past, then this should make sense.

186
00:14:09,650 --> 00:14:14,840
As you recall, in Bayesian machine learning, we try to pick convenient distributions to work with

187
00:14:15,230 --> 00:14:16,910
to make our computations simpler.

188
00:14:17,630 --> 00:14:23,420
It turns out that for some distributions, there are special priors called conjugate priors, which

189
00:14:23,420 --> 00:14:26,930
makes it very easy to apply Bayes rule to find the posterior.

190
00:14:27,830 --> 00:14:32,840
Now, because their topics z are discrete, they come from a multi normal distribution.

191
00:14:33,650 --> 00:14:37,670
It turns out that the conjugate prior for the multi gnomeo is the deer Ashleigh.

192
00:14:38,450 --> 00:14:43,910
This is why Theta is sampled from a deer Ashleigh, which explains why this model has the name it has.

193
00:14:44,600 --> 00:14:47,450
And as a side note, you can check all these facts on Wikipedia.

194
00:14:47,540 --> 00:14:49,580
If you want to confirm that they are true.

