1
00:00:11,360 --> 00:00:16,940
So in this lecture, we'll be discussing the intuition behind latent Dirichlet allocation, also known

2
00:00:16,940 --> 00:00:17,870
as LDA.

3
00:00:18,620 --> 00:00:21,830
We'll start this lecture with a simple outline of what will be discussed.

4
00:00:23,210 --> 00:00:26,540
The first thing I'll point you to is the paper that introduced this model.

5
00:00:26,540 --> 00:00:28,340
Today, it's considered a classic.

6
00:00:28,640 --> 00:00:33,320
And a lot of the techniques should be familiar if you have some experience with Bayesian methods.

7
00:00:34,700 --> 00:00:39,530
The next step in this lecture will be to look at unsupervised learning at a very high level.

8
00:00:40,010 --> 00:00:44,870
This will help you understand how it contrasts with the supervised methods we looked at earlier in this

9
00:00:44,870 --> 00:00:45,470
course.

10
00:00:46,640 --> 00:00:51,260
We'll spend some time looking at different kinds of unsupervised methods, such as clustering and how

11
00:00:51,260 --> 00:00:52,550
that fits into this picture.

12
00:00:53,420 --> 00:00:58,400
In a sense, clustering methods could also be used for topic modeling, since each cluster would be

13
00:00:58,400 --> 00:00:59,390
a separate topic.

14
00:00:59,960 --> 00:01:04,879
For example, document one might belong to cluster two, while document two belongs to cluster three

15
00:01:04,879 --> 00:01:05,690
and so forth.

16
00:01:06,140 --> 00:01:09,770
But you'll see how LDA gives us a richer description of clusters.

17
00:01:11,350 --> 00:01:16,420
Once we've finished discussing unsupervised methods at a very high level, we'll start looking at LDA

18
00:01:16,420 --> 00:01:17,260
specifically.

19
00:01:17,860 --> 00:01:20,920
We'll try to understand it from what I call an API point of view.

20
00:01:21,760 --> 00:01:27,310
In particular, we'll ask questions like: what are the inputs, what are the outputs, and how can I interpret

21
00:01:27,310 --> 00:01:28,180
those outputs?

22
00:01:28,840 --> 00:01:31,900
We'll see how this translates to how we use LDA in Python.

23
00:01:32,740 --> 00:01:37,660
Since LDA is part of scikit-learn, the inputs and outputs are really what we'll have to deal with.

24
00:01:38,710 --> 00:01:41,950
This is the first step towards understanding what LDA does.

25
00:01:42,970 --> 00:01:46,780
The final part of this lecture will look at the intuition behind LDA.

26
00:01:47,530 --> 00:01:52,450
LDA is a Bayesian model, which admittedly probably means nothing to you if you don't have experience

27
00:01:52,450 --> 00:01:52,810
with it.

28
00:01:53,230 --> 00:01:59,170
But I'll try to give you some appreciation for what this entails. Bayesian models are graphical models,

29
00:01:59,170 --> 00:02:04,060
which allow us to impose some structure and assumptions on the data generating process.

30
00:02:08,750 --> 00:02:14,030
As a quick side note, please be aware that LDA is also an acronym for another machine learning method

31
00:02:14,390 --> 00:02:18,920
called linear discriminant analysis. In older versions of scikit-learn,

32
00:02:19,220 --> 00:02:25,590
there's a model called LDA, which refers to this and not the topic of this lecture. In newer versions,

33
00:02:25,610 --> 00:02:30,140
this was moved to a module called discriminant_analysis so as to avoid confusion.
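
To make the naming distinction concrete, here is a minimal sketch, assuming a recent scikit-learn release is installed. The variable names `topic_model` and `classifier` are just illustrative.

```python
# The topic model discussed in this lecture: latent Dirichlet allocation.
from sklearn.decomposition import LatentDirichletAllocation
# The unrelated classifier that shares the acronym: linear discriminant analysis.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# n_components is the number of topics; 3 is an arbitrary choice for this sketch.
topic_model = LatentDirichletAllocation(n_components=3, random_state=0)
classifier = LinearDiscriminantAnalysis()

# Print the class names to confirm which "LDA" each object is.
print(type(topic_model).__name__)
print(type(classifier).__name__)
```

Only the first of these two classes is relevant to topic modeling.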

34
00:02:34,800 --> 00:02:40,170
OK, so as promised, the first topic of this lecture is just a quick mention of the paper, which was

35
00:02:40,170 --> 00:02:45,540
authored by David Blei, Andrew Ng and Michael Jordan, all of whom are prominent researchers in machine

36
00:02:45,540 --> 00:02:45,930
learning.

37
00:02:46,680 --> 00:02:49,320
I'm sure you've heard of at least one, if not all three.

38
00:02:50,310 --> 00:02:53,100
Now the reason I want to mention this paper is twofold.

39
00:02:53,880 --> 00:02:59,700
Firstly, note that these lectures will only discuss some basic intuitions about how LDA works.

40
00:03:00,180 --> 00:03:06,780
Actually learning all the details behind LDA is a significant task, and it would be very advanced. If

41
00:03:06,780 --> 00:03:09,720
you are an advanced student and you want to know these details,

42
00:03:09,990 --> 00:03:13,140
the paper would be a good next step after watching these lectures.

43
00:03:13,860 --> 00:03:19,230
It's included as extra reading in the course repo, so you can find it in the usual place.

44
00:03:20,550 --> 00:03:24,930
The second reason I want to mention this paper is because it's a beautiful demonstration of just how

45
00:03:24,930 --> 00:03:29,670
far you can go in machine learning, and it uses a lot of interesting advanced techniques.

46
00:03:30,270 --> 00:03:36,270
These include probabilistic graphical models, expectation maximization and variational inference.

47
00:03:36,660 --> 00:03:39,390
All of these are important topics in Bayesian machine learning.

48
00:03:40,920 --> 00:03:45,750
Even if you do not understand the techniques being discussed, you should take some time to marvel at

49
00:03:45,750 --> 00:03:47,250
the complexity behind this method.

50
00:03:48,300 --> 00:03:53,580
In fact, one more interesting note to make about this paper is that it begins by building off from

51
00:03:53,580 --> 00:03:57,360
latent semantic indexing, which is a topic we covered in this course.

52
00:03:57,930 --> 00:04:01,800
So it forms a nice connection with other topics you learn about in this class.

53
00:04:06,620 --> 00:04:08,270
OK, so now let's answer the question:

54
00:04:08,420 --> 00:04:11,330
What is unsupervised learning and why is it useful?

55
00:04:12,200 --> 00:04:16,490
So the best way to understand unsupervised learning is to first think about supervised learning.

56
00:04:17,480 --> 00:04:23,180
Let's consider a simple task where our job is to predict, given a piece of text, whether it's about physics,

57
00:04:23,180 --> 00:04:27,920
biology or math. In order to build a model to perform this task,

58
00:04:28,130 --> 00:04:31,110
we need a labeled dataset. As you recall,

59
00:04:31,130 --> 00:04:34,130
you can think of this like an Excel spreadsheet with two columns.

60
00:04:34,760 --> 00:04:40,520
In this case, one column is the input, namely the text, and the second column is the target, specifying

61
00:04:40,520 --> 00:04:42,350
either physics, biology or math.

62
00:04:43,220 --> 00:04:48,200
The difference between supervised and unsupervised learning is that with unsupervised learning, we

63
00:04:48,200 --> 00:04:50,630
are only given the text, but no labels.

64
00:04:52,140 --> 00:04:57,900
As always, you can visualize machine learning by plotting your input vectors on a grid. In the supervised

65
00:04:57,900 --> 00:05:03,780
case, each data point is colored and those colors represent the classes. In the unsupervised case,

66
00:05:03,780 --> 00:05:08,790
the data points all have the same color because we're not given any information other than simply

67
00:05:08,790 --> 00:05:09,900
where those points live.

68
00:05:10,890 --> 00:05:15,750
Our job, then, is to figure out what we can do even when we're not given these colors.

69
00:05:16,500 --> 00:05:20,790
So what if, given this text, we would still like to categorize these documents?

70
00:05:21,270 --> 00:05:24,390
Is this possible even if we are not given any labels?

71
00:05:24,960 --> 00:05:26,760
Surprisingly, the answer is yes.

72
00:05:27,540 --> 00:05:32,910
We call this clustering, and the canonical example is k-means, which you've probably seen in the past.

73
00:05:37,550 --> 00:05:42,560
Now, as you recall for machine learning, our first step is typically to convert our input data into

74
00:05:42,560 --> 00:05:45,620
vectors, which you can visualize like points in a grid.

75
00:05:46,280 --> 00:05:48,770
This will continue to be the case in these lectures.

76
00:05:49,760 --> 00:05:55,010
So let's suppose we've converted our text into vectors using some method like counting or TF-IDF,

77
00:05:55,340 --> 00:05:57,350
and we then cluster these vectors like so.
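
As a rough sketch of that pipeline, assuming scikit-learn is installed: the toy corpus below is made up for illustration, and the choice of three clusters is ours, not something the data tells us.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus invented for this sketch; note that there are no labels.
docs = [
    "quantum particles and energy in physics",
    "cells and genes in biology",
    "algebra and calculus in math",
    "energy and momentum of particles",
    "genes, proteins and cells",
    "calculus, geometry and algebra",
]

# Convert each document into a TF-IDF vector: shape (n_documents, n_terms).
X = TfidfVectorizer().fit_transform(docs)

# We have to pick the number of clusters ourselves.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# The labels are arbitrary integers; they mean nothing until we inspect the documents.
for doc, label in zip(docs, labels):
    print(label, doc)
```

Each document receives exactly one cluster index, which is the hard assignment that LDA will later relax.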

78
00:05:58,250 --> 00:06:03,440
One important fact about unsupervised learning is that these clusters have no meaning by themselves.

79
00:06:03,860 --> 00:06:08,270
We might name these clusters; for example, we might call them clusters one, two and three.

80
00:06:08,930 --> 00:06:12,590
But this is unlike supervised learning where we know what each category means.

81
00:06:12,770 --> 00:06:18,440
For example, physics, biology and math. With unsupervised learning, there's no guarantee that our

82
00:06:18,440 --> 00:06:24,380
three clusters correspond to these three subjects, even if we trained on a physics, biology and math

83
00:06:24,380 --> 00:06:25,070
dataset.

84
00:06:25,670 --> 00:06:29,780
The only way they attain meaning is by what we assign to them based on what we know.

85
00:06:34,480 --> 00:06:39,100
Another important fact to remember is that we don't even know how many clusters there should be.

86
00:06:39,880 --> 00:06:45,070
Occasionally we can simply look at the data if it's two dimensional and very simple, or we can try

87
00:06:45,070 --> 00:06:47,290
to guess based on some out-of-sample metric.

88
00:06:47,830 --> 00:06:51,850
But generally, the number of clusters is unknown and it must be chosen by us.

89
00:06:52,480 --> 00:06:56,920
This is also unlike supervised learning where we know the categories because they come with our data

90
00:06:56,920 --> 00:07:04,090
set, and because we're given these categories, we know exactly how many of them exist. To foreshadow

91
00:07:04,090 --> 00:07:04,690
a little bit,

92
00:07:04,720 --> 00:07:09,160
this also means that for topic modeling, we don't know how many topics there should be.

93
00:07:09,550 --> 00:07:11,970
We simply pick a number and see how it goes.

94
00:07:16,580 --> 00:07:21,650
Now, at this point, it's useful to turn our attention to the practical side, which is just to look

95
00:07:21,650 --> 00:07:28,190
at LDA in terms of an API. That is, what are the inputs to this model and what are the outputs?

96
00:07:28,820 --> 00:07:33,500
This will help you understand what we previously discussed, and it will help you understand what happens

97
00:07:33,500 --> 00:07:34,430
in the code example.

98
00:07:35,090 --> 00:07:38,570
If you don't like math, then this is essentially as far as you need to go.

99
00:07:39,200 --> 00:07:43,850
However, if you do want to know a little bit about how this model works, then that will come after.

100
00:07:45,170 --> 00:07:47,540
OK, so the inputs to this model are quite simple:

101
00:07:47,750 --> 00:07:50,630
just the count vectors we learned about earlier in this course.

102
00:07:51,440 --> 00:07:55,760
Note that this is a bag of words model, and so we don't take the order of words into account.

103
00:07:56,750 --> 00:07:58,990
As you recall, this can be done using the CountVectorizer

104
00:07:59,000 --> 00:08:00,110
inside scikit-learn.

105
00:08:00,110 --> 00:08:02,690
To review how this works,

106
00:08:02,900 --> 00:08:09,290
each document is converted into a vector, and each element of the vector represents a word count. Since

107
00:08:09,290 --> 00:08:11,150
we have multiple documents in our corpus,

108
00:08:11,420 --> 00:08:16,670
our dataset will then be a matrix, with the number of rows equal to the number of documents and columns

109
00:08:16,670 --> 00:08:18,200
equal to the number of terms.

110
00:08:22,780 --> 00:08:28,840
So the outputs of LDA are as follows. We essentially get two matrices as output, although I'll qualify

111
00:08:28,840 --> 00:08:29,710
this statement later.

112
00:08:30,700 --> 00:08:33,640
The first matrix is a matrix of topics by words.

113
00:08:34,150 --> 00:08:37,270
The second matrix is a matrix of documents by topics.
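
A minimal sketch of where these two matrices show up, assuming scikit-learn's `LatentDirichletAllocation`. The corpus and the choice of two topics are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus invented for this sketch.
docs = [
    "microsoft software virus security",
    "software security update virus",
    "government law human rights",
    "people government rights law",
]

counts = CountVectorizer().fit_transform(docs)  # documents by terms

n_topics = 2  # chosen by us; the model does not tell us this number
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

doc_topics = lda.fit_transform(counts)  # documents by topics
topic_words = lda.components_           # topics by words

print(topic_words.shape)
print(doc_topics.shape)
```

As far as I understand the scikit-learn documentation, the rows of `components_` are unnormalized, so dividing each row by its sum is needed to read it as a distribution over words.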

114
00:08:38,380 --> 00:08:40,539
So how can we interpret what this means?

115
00:08:41,110 --> 00:08:43,090
Well, it helps to simply look at a picture.

116
00:08:43,809 --> 00:08:46,000
Let's start with the topics by words matrix.

117
00:08:46,690 --> 00:08:52,000
For this matrix, each row is a topic, and each topic is represented by a vector, which is that row.

118
00:08:52,810 --> 00:08:57,310
Obviously, the number of components in the vector is the number of words in our corpus.

119
00:08:58,710 --> 00:09:03,630
Now, as you recall, for LDA, each topic is a distribution over words.

120
00:09:04,050 --> 00:09:06,750
Or put another way, not a discrete category.

121
00:09:08,040 --> 00:09:14,310
So as you can see, one way to visualize the topics is to simply plot these distributions as a bar chart

122
00:09:14,640 --> 00:09:16,330
showing the top words for each topic.

123
00:09:17,160 --> 00:09:19,350
That is the words with the highest probability.

124
00:09:20,040 --> 00:09:21,600
In other words, these are sorted.

125
00:09:22,560 --> 00:09:28,440
So in the first chart, we can see that topic one has the highest values for words like software, Microsoft,

126
00:09:28,440 --> 00:09:30,270
virus, security and so forth.

127
00:09:30,840 --> 00:09:33,910
Clearly, this topic has something to do with desktop software.

128
00:09:34,680 --> 00:09:39,480
And notice that what I've just done is an example of how we assign a meaning to topics.

129
00:09:40,320 --> 00:09:46,920
Let's now look at topic two. The top words are people, government, US, UK, human, rights, law.

130
00:09:47,490 --> 00:09:52,350
This topic appears to be related to politics, and we can do the same thing with the other topics as

131
00:09:52,350 --> 00:09:52,740
well.

132
00:09:57,440 --> 00:10:03,530
Now, let's look at the second output matrix, which is documents by topics. Again, we can visualize

133
00:10:03,530 --> 00:10:08,810
each document with a bar chart displaying how much each topic contributes to that document.

134
00:10:09,530 --> 00:10:14,720
Notice how this is like a soft clustering, since we don't assign a document to just one topic.

135
00:10:15,500 --> 00:10:21,020
And this actually makes a lot of sense, since documents can be about multiple different topics, depending

136
00:10:21,020 --> 00:10:22,910
on how those topics are differentiated.

137
00:10:24,050 --> 00:10:28,850
For example, we previously saw two topics related to software, but with different words.

138
00:10:29,300 --> 00:10:32,990
In one case, it was centered around Microsoft and viruses and security.

139
00:10:33,440 --> 00:10:36,530
In the other case, it was centered around Google and search and spam.

140
00:10:37,310 --> 00:10:41,120
Well, what if we had a document about how viruses were acquired through spam?

141
00:10:41,600 --> 00:10:46,820
Now it's a document that has components of two different topics. So that should give you an idea of

142
00:10:46,820 --> 00:10:49,540
why probabilistic topic assignments are useful.
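
To see the soft assignment in code, here is a small sketch assuming scikit-learn and NumPy. The corpus, the new document and the choice of two topics are all invented for illustration, and the exact proportions will depend on the data and the random seed.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus invented for this sketch.
docs = [
    "microsoft software virus security patch",
    "virus security software update microsoft",
    "google search spam filter results",
    "spam search google ranking results",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row sums to 1 across topics

# A new document that borrows words from both groups above.
new_doc = ["virus spread through spam search results"]
mixed = lda.transform(vectorizer.transform(new_doc))

print(np.round(doc_topics, 2))
print(np.round(mixed, 2))

# A hard, k-means-style label is still available by taking the largest proportion.
hard_labels = doc_topics.argmax(axis=1)
print(hard_labels)
```

Each row is a set of topic proportions rather than a single label, which is what lets one document belong partly to several topics.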