1
00:00:11,070 --> 00:00:16,230
So to understand where clustering and LDA fit in the grand scheme of things, it's helpful to look

2
00:00:16,230 --> 00:00:17,250
at some pictures.

3
00:00:17,970 --> 00:00:20,010
Let's start with the simplest thing we could do.

4
00:00:20,730 --> 00:00:26,160
The simplest thing is if we had no clusters or, in other words, no topics, and every document comes

5
00:00:26,160 --> 00:00:27,480
from the same distribution.

6
00:00:28,380 --> 00:00:32,910
We can picture this like a single Gaussian, although do note that we don't typically use Gaussians

7
00:00:32,910 --> 00:00:33,930
to model word counts.

8
00:00:34,350 --> 00:00:40,110
This is for intuition only, and believe it or not, this is in fact unsupervised learning.

9
00:00:40,740 --> 00:00:44,040
So when your grade school teacher makes a bell curve for your exam grades,

10
00:00:44,370 --> 00:00:46,800
they are actually doing a form of unsupervised learning.

11
00:00:48,360 --> 00:00:52,020
Now, one step up from the simplest thing is to have multiple Gaussians.

12
00:00:52,470 --> 00:00:55,980
This is essentially soft clustering, also known as a mixture model.

13
00:00:56,880 --> 00:01:00,630
In this case, each Gaussian represents a separate cluster or a separate topic.

14
00:01:01,020 --> 00:01:05,220
And the job of our algorithm is to identify the locations of these clusters.

15
00:01:06,090 --> 00:01:09,240
This also tells us which documents belong to which cluster.

16
00:01:10,650 --> 00:01:15,750
Now, I'm skipping a few intermediate steps, but if we add even more complexity, we finally end up

17
00:01:15,750 --> 00:01:16,500
with LDA.

18
00:01:17,430 --> 00:01:21,300
Unfortunately, LDA isn't easy to visualize with our current machinery.

19
00:01:21,720 --> 00:01:25,590
So you'll just have to wait until the next lecture to understand this relationship.

20
00:01:26,580 --> 00:01:31,200
What I can say is that for LDA, each document is a mixture of topics.

21
00:01:31,350 --> 00:01:37,080
And furthermore, each topic is a mixture of words. That doesn't really differentiate it from a mixture

22
00:01:37,080 --> 00:01:37,510
model.

23
00:01:37,530 --> 00:01:40,230
But again, the precise details will be in the next lecture.

24
00:01:40,920 --> 00:01:46,140
But understanding that documents are mixtures of topics and topics are mixtures of words will help us

25
00:01:46,140 --> 00:01:47,430
with the rest of this lecture.

