1
00:00:11,090 --> 00:00:16,490
So in this lecture, we'll be discussing non-negative matrix factorization and how this can be applied

2
00:00:16,490 --> 00:00:20,330
to topic modeling will begin this lecture with a quick outline.

3
00:00:21,110 --> 00:00:23,540
This lecture essentially has two distinct parts.

4
00:00:24,170 --> 00:00:29,270
The first part is to discuss recommender systems, which is what matrix factorization is best known

5
00:00:29,270 --> 00:00:29,690
for.

6
00:00:30,500 --> 00:00:35,570
The second part is to connect this to topic modeling to understand why something built for recommender

7
00:00:35,570 --> 00:00:39,890
systems can also be applied to this seemingly unrelated task.

8
00:00:44,430 --> 00:00:50,130
So let's talk about recommenders, suppose you are Netflix, you have a whole bunch of movies and a

9
00:00:50,130 --> 00:00:55,620
whole bunch of users, and your job is to recommend movies to users, which they have not yet seen.

10
00:00:56,520 --> 00:01:01,950
If you can do this successfully, your users will be happy and entertained and more likely to use your

11
00:01:01,950 --> 00:01:02,970
service in the future.

12
00:01:04,050 --> 00:01:09,960
Now there's a lot of almost philosophical discussion that could be had about how to recommend, for

13
00:01:09,960 --> 00:01:15,270
example, YouTube, in my opinion, as a bad recommender system because it keeps showing me the same

14
00:01:15,270 --> 00:01:16,920
things, which is very boring.

15
00:01:17,850 --> 00:01:23,130
However, for the purpose of this lecture, we'll need to simplify what it means to make a good recommendation

16
00:01:24,120 --> 00:01:24,750
for us.

17
00:01:24,780 --> 00:01:28,530
A good recommendation is simply a movie that a user will rate highly.

18
00:01:29,430 --> 00:01:35,700
Once we have predictions for a user's movie ratings, we can simply show the user those movies in order

19
00:01:35,700 --> 00:01:37,650
with the highest predicted rating first.

20
00:01:38,100 --> 00:01:42,570
So if I think you're going to read something five stars, I'd probably show you that first.

21
00:01:43,680 --> 00:01:49,440
Therefore, our task has been simplified predict the ratings for movies that users have not yet seen.

22
00:01:54,060 --> 00:01:58,810
Now, this type of recommender system requires that we already have a set of ratings to learn from.

23
00:01:59,640 --> 00:02:04,740
So imagine that we have a set of users numbered from one up to em and a set of movies numbered from

24
00:02:04,740 --> 00:02:05,490
one up to end.

25
00:02:06,300 --> 00:02:11,130
If you have experience with databases, this should make complete sense since these numbers would just

26
00:02:11,130 --> 00:02:14,550
be used radios and movie IDs in your database tables.

27
00:02:15,510 --> 00:02:22,080
Now, imagine that we have a table of ratings with three columns User ID, Movie ID and rating every

28
00:02:22,080 --> 00:02:23,520
time a user writes a movie.

29
00:02:23,850 --> 00:02:29,070
We add a new row to this table with that rating, which can then be used in our machine learning system.

30
00:02:33,790 --> 00:02:39,040
Now, database tables are nice, but we need to think a bit more abstractly in order to solve this problem,

31
00:02:39,880 --> 00:02:42,940
in particular, how we would like to store our ratings.

32
00:02:43,000 --> 00:02:44,500
Is it a ratings matrix?

33
00:02:45,280 --> 00:02:49,780
In this matrix, users go along the rows and movies go along the columns.

34
00:02:50,230 --> 00:02:52,870
Therefore, this is a matrix of size m by N.

35
00:02:54,130 --> 00:02:58,330
However, note that this is no ordinary matrix in particular.

36
00:02:58,360 --> 00:03:01,510
This is a matrix where we do not know most of the values.

37
00:03:02,200 --> 00:03:05,140
That is to say, most of the values are simply missing.

38
00:03:06,010 --> 00:03:07,810
Of course, this makes complete sense.

39
00:03:08,500 --> 00:03:11,200
Think about why this would be if you haven't yet done so.

40
00:03:11,920 --> 00:03:15,190
Suppose Netflix has 100000 movies in their database.

41
00:03:15,820 --> 00:03:18,790
Now, consider how many movies you have personally seen.

42
00:03:19,510 --> 00:03:25,900
Now, consider how many movies you have personally rated probably a tiny percentage of those 100000

43
00:03:25,900 --> 00:03:26,530
movies.

44
00:03:28,330 --> 00:03:33,250
Now, assume that you are an average user and that pretty much all other users have the same proportion

45
00:03:33,250 --> 00:03:33,940
of ratings.

46
00:03:34,660 --> 00:03:39,430
It should be easy to see that this ratings matrix mostly contains missing values.

47
00:03:40,090 --> 00:03:41,770
We call this a sparse matrix.

48
00:03:42,670 --> 00:03:47,890
Now, note that this meaning of sparse is different from what we encounter in NLP, such as when we're

49
00:03:47,920 --> 00:03:51,430
using TFI Taf with a tier free of metrics.

50
00:03:51,460 --> 00:03:54,490
This is called sparse because many of the values are zero.

51
00:03:55,090 --> 00:04:00,280
This is different from the ratings matrix, which is called sparse because many of the values are simply

52
00:04:00,280 --> 00:04:01,000
not there.

53
00:04:02,230 --> 00:04:04,600
Now let's remind ourselves of what our goal is.

54
00:04:05,290 --> 00:04:10,900
As you recall, our goal is to make predictions from movie ratings that users have not yet made.

55
00:04:11,620 --> 00:04:16,089
In other words, our goal is to fill in the missing values of this ratings matrix.

56
00:04:16,690 --> 00:04:18,760
So now we have a concrete task.

57
00:04:23,420 --> 00:04:30,110
So what is matrix factorization, the basic concept behind matrix factorization is that we are going

58
00:04:30,110 --> 00:04:37,220
to imagine that a ratings matrix is really just the result of multiplying two smaller matrices, as

59
00:04:37,220 --> 00:04:40,910
you recall, the ratings matrix as Ambrose and and the columns.

60
00:04:41,600 --> 00:04:46,730
We're going to break this up into two factors which are also matrices, which by convention we call

61
00:04:46,730 --> 00:04:52,730
W and H W has the shape m by K and H has the shape K by N.

62
00:04:53,720 --> 00:04:58,790
As you know, from the rules of matrix multiplication, the results of multiplying these two matrices

63
00:04:58,790 --> 00:05:03,080
is a new matrix of size m by N, which matches our ratings matrix.

64
00:05:03,890 --> 00:05:09,080
Of course, the trick here is that when you multiply these two matrices together, there won't be any

65
00:05:09,080 --> 00:05:09,950
missing values.

66
00:05:10,040 --> 00:05:15,050
Thus, in makes predictions for values which didn't appear in the given ratings matrix.

67
00:05:19,590 --> 00:05:24,330
Now, for those of you with some familiarity with machine learning, if you want to think about how

68
00:05:24,330 --> 00:05:28,470
we actually find these unique matrices, let's discuss that briefly.

69
00:05:29,220 --> 00:05:34,680
As you recall, for many machine learning models, the way we train them is by defining some lost function

70
00:05:35,070 --> 00:05:39,990
and then finding the parameters in this case, W and H, which minimize that loss.

71
00:05:41,070 --> 00:05:46,350
So, for example, we might want to minimize the squared error between the true ratings and the predicted

72
00:05:46,350 --> 00:05:46,950
ratings.

73
00:05:47,730 --> 00:05:53,790
Another possible option is the cail divergence, which is the one we will use and corresponds to probabilistic

74
00:05:53,790 --> 00:05:55,320
lean semantic analysis.

75
00:05:55,950 --> 00:05:59,220
So there's a connection to other topics we've studied in this course.

76
00:06:00,550 --> 00:06:04,630
Note that the facts I've just presented are optional, but may be useful and intuitive.

77
00:06:04,870 --> 00:06:06,400
If you have some experience.

78
00:06:10,990 --> 00:06:15,040
Another optional topic I want to discuss in this lecture is why it works.

79
00:06:15,700 --> 00:06:21,070
Why does factor rising our matrix into two smaller matrices results in an accurate predictive model?

80
00:06:21,910 --> 00:06:24,860
I consider this topic to be intermediate to advanced.

81
00:06:24,880 --> 00:06:27,730
So if you'd like to skip this, please feel free to do so.

82
00:06:29,560 --> 00:06:34,840
So as you recall, the main idea behind machine learning is to create a model with a small number of

83
00:06:34,840 --> 00:06:38,980
parameters that essentially compresses the information in the training set.

84
00:06:39,730 --> 00:06:45,250
If we wanted perfect accuracy, all we would have to do is simply memorize the training set in the dictionary.

85
00:06:45,940 --> 00:06:49,180
Of course, that's not helpful, since it doesn't help us predict a new data.

86
00:06:50,110 --> 00:06:55,450
So, for example, to train a linear model, we might have an input matrix of size 1000 by 10.

87
00:06:55,780 --> 00:06:58,330
But the model itself only has 10 weights.

88
00:06:58,870 --> 00:07:03,370
So we compress all that information from the training set into just 10 numbers.

89
00:07:03,880 --> 00:07:07,570
This is much smaller than our input data, which had 10000 numbers.

90
00:07:08,410 --> 00:07:10,840
So the same idea applies to our ratings matrix.

91
00:07:11,410 --> 00:07:17,440
Our ratings matrix technically could have rank em or rank in whichever is smaller and perhaps even less.

92
00:07:18,130 --> 00:07:23,530
As you recall, the rank of a matrix is the number of linearly independent rows of columns, whichever

93
00:07:23,530 --> 00:07:24,070
is smaller.

94
00:07:24,910 --> 00:07:28,360
For all intents and purposes, just think of it as the smallest dimension.

95
00:07:29,420 --> 00:07:32,420
But if you look at our model, which consists of W and H.

96
00:07:32,780 --> 00:07:39,500
We choose to be much smaller than both men, which means that they both have rank typically values of

97
00:07:39,500 --> 00:07:41,360
care between ten and a few hundred.

98
00:07:42,170 --> 00:07:46,370
And in total, the number of parameters would be m times K plus K times N.

99
00:07:47,240 --> 00:07:53,150
So by approximating a very large matrix with two very small matrices which have much lower rank, we

100
00:07:53,150 --> 00:07:58,100
are compressing the information in the training set, which allows us to predict unknown data points

101
00:07:58,430 --> 00:08:00,830
just as we do with models like linear regression.

102
00:08:02,410 --> 00:08:07,990
Also important to note is that by doing this, we are making a modeling assumption that the true rank

103
00:08:07,990 --> 00:08:14,440
of the ratings matrix is actually just K, and this is because when you multiply W and H together,

104
00:08:14,830 --> 00:08:20,320
which both have ranked K, this gives us the predicted R matrix, which still has rank K.

105
00:08:20,950 --> 00:08:25,420
You can think of K is the true number of hidden factors or in relation to this lecture.

106
00:08:25,780 --> 00:08:27,430
The true number of hidden topics.

107
00:08:32,130 --> 00:08:35,159
Now we've just described regular matrix factorization.

108
00:08:35,730 --> 00:08:40,720
But what makes it non-negative matrix factorization for the purpose of this lecture?

109
00:08:40,740 --> 00:08:42,059
The idea is very simple.

110
00:08:42,600 --> 00:08:46,830
It just means that both W and H only contain at non-negative values.

111
00:08:47,430 --> 00:08:52,500
Of course, this should be the case, since we know that these values will end up being some representation

112
00:08:52,500 --> 00:08:53,310
of topics.

113
00:08:53,970 --> 00:08:59,430
Documents only contain non-negative amounts of topics, and topics only contain non-negative amounts

114
00:08:59,430 --> 00:09:00,180
of words.

115
00:09:00,750 --> 00:09:03,990
That is, the values we see are always greater than or equal to zero.

116
00:09:04,710 --> 00:09:09,540
We can ensure that these values remain not negative, by the way we do our updates, but this is not

117
00:09:09,540 --> 00:09:11,250
a relevant topic for this course.

118
00:09:15,870 --> 00:09:20,100
Now, I alluded to this on the previous slide, but let's address this question explicitly.

119
00:09:20,760 --> 00:09:24,690
We just discussed matrix factorization in terms of recommender systems.

120
00:09:25,170 --> 00:09:27,360
But why is it related to topic modeling?

121
00:09:28,320 --> 00:09:30,030
Well, it helps to recall LDA.

122
00:09:30,870 --> 00:09:37,530
The output of LDA, as you recall, is essentially two matrices, one with documents by topics and another

123
00:09:37,530 --> 00:09:38,880
with topics by words.

124
00:09:39,570 --> 00:09:41,820
But this is exactly what an AMF gives us.

125
00:09:42,540 --> 00:09:48,690
If the input to our matrix factorization model is a tier of RDF matrix, it will be documents by terms.

126
00:09:49,230 --> 00:09:50,640
The topics are hidden variables.

127
00:09:50,640 --> 00:09:54,960
So as usual, we choose this value in our recommenders analogy.

128
00:09:54,970 --> 00:10:01,950
This would be K when we perform matrix factorization, we get the two matrices W and H, which will

129
00:10:01,950 --> 00:10:04,650
be documents by topics and topics by words.

130
00:10:05,250 --> 00:10:08,520
Therefore, this is exactly the same as what we get with LDA.

131
00:10:09,630 --> 00:10:15,270
And therefore applying an AMF to topic modeling is simply a matter of plugging in our documents by terms

132
00:10:15,270 --> 00:10:16,020
matrix.