1
00:00:11,700 --> 00:00:16,440
In this section of the course, we are going to look at how deep learning can be used for recommender

2
00:00:16,440 --> 00:00:17,920
systems.

3
00:00:17,940 --> 00:00:21,700
This lecture will discuss the theory behind how deep learning can be applied.

4
00:00:21,840 --> 00:00:25,330
And following that, we will look at the code. To start,

5
00:00:25,350 --> 00:00:29,070
I want to talk about recommender systems as a general concept.

6
00:00:29,250 --> 00:00:32,880
Where can we find recommender systems, and what is their purpose?

7
00:00:32,880 --> 00:00:38,010
I think you'll be surprised to find out that recommender systems are one of the most applicable machine

8
00:00:38,010 --> 00:00:39,970
learning concepts ever.

9
00:00:39,990 --> 00:00:45,060
It's a type of machine learning that can be used by nearly any consumer-facing business, and you most

10
00:00:45,060 --> 00:00:53,440
certainly encounter recommender systems multiple times daily.

11
00:00:53,630 --> 00:00:55,100
Let's start with the first website

12
00:00:55,100 --> 00:00:57,080
everybody goes to when they go online:

13
00:00:57,110 --> 00:01:01,440
Google. Google search is, in fact, a recommendation engine.

14
00:01:01,640 --> 00:01:07,070
Actually, when you perform a search, multiple recommendation engines are working to build the page that

15
00:01:07,070 --> 00:01:08,390
you see.

16
00:01:08,390 --> 00:01:13,940
Not only do you have the search results, which are recommendations based on your query, your geographical

17
00:01:13,940 --> 00:01:16,470
location, and maybe even your search history,

18
00:01:16,670 --> 00:01:20,450
but don't forget that Google is also an advertising company.

19
00:01:20,450 --> 00:01:26,450
The online advertisements you see when you go to google.com are also recommendations based on your

20
00:01:26,450 --> 00:01:27,740
previous browsing history.

21
00:01:32,900 --> 00:01:35,320
Another great example is Amazon.

22
00:01:35,540 --> 00:01:40,790
When you search for a product on Amazon, it's more likely that you'll see advertisements for similar

23
00:01:40,790 --> 00:01:42,590
products in the future.

24
00:01:42,590 --> 00:01:46,130
You might also see recommendations for related products.

25
00:01:46,130 --> 00:01:51,920
For example, if you're buying an iPod, you might also be in the market for new headphones, a case, and a

26
00:01:51,920 --> 00:01:52,730
screen protector.

27
00:01:57,840 --> 00:02:04,560
Another great, but seriously unfortunate, example is the news. News sites employ recommendation systems,

28
00:02:04,620 --> 00:02:09,720
and they even do A/B testing on their article headlines to improve click rate.

29
00:02:09,720 --> 00:02:15,570
This is very unfortunate because this means that news has become more biased and more polarizing over

30
00:02:15,570 --> 00:02:16,590
time.

31
00:02:16,590 --> 00:02:18,520
People don't care about boring news.

32
00:02:18,570 --> 00:02:24,000
They want to read something exciting, and news companies are more than happy to give that to you in place

33
00:02:24,000 --> 00:02:26,490
of real, factual, and unbiased news.

34
00:02:31,570 --> 00:02:35,670
There are really too many examples of recommendation systems to mention.

35
00:02:35,670 --> 00:02:41,910
Here are a few more examples to help you internalize what exactly it means to recommend. We have Netflix,

36
00:02:41,910 --> 00:02:44,600
which recommends movies and TV shows.

37
00:02:44,640 --> 00:02:51,090
We have YouTube, which recommends videos. We have Reddit, which recommends a mix of user-generated content

38
00:02:51,330 --> 00:02:54,400
and third-party content posted by users.

39
00:02:54,450 --> 00:03:00,330
We have Facebook, which sorts your news feed based on what will make you stay on it the longest.

40
00:03:00,330 --> 00:03:05,490
Closely related to that, we have Instagram, which is owned by Facebook, and it does largely the same thing:

41
00:03:06,220 --> 00:03:10,820
you're shown a feed of Instagram posts that are recommended in order to keep you engaged.

42
00:03:15,800 --> 00:03:19,830
In this lecture we'll be focusing on a specific kind of recommender.

43
00:03:20,000 --> 00:03:25,580
This recommender works on data that comes in the form of triples. The three items that form a sample

44
00:03:25,580 --> 00:03:29,680
are the user, the item, and the rating that the user gave that item.

45
00:03:34,920 --> 00:03:38,400
As an example, suppose you're creating a movie recommender.

46
00:03:38,480 --> 00:03:41,610
Here is what a few rows of your dataset might look like.

47
00:03:41,790 --> 00:03:42,440
Alice rates

48
00:03:42,450 --> 00:03:44,800
Avatar a 5, Bob rates

49
00:03:44,820 --> 00:03:49,490
Star Wars a 4.5, and Carol rates The Godfather a 4.

50
00:03:49,560 --> 00:03:51,920
So all of our samples will look like this.

51
00:03:52,080 --> 00:03:53,760
The first item is the user.

52
00:03:53,760 --> 00:03:55,560
The second item is the movie.

53
00:03:55,560 --> 00:03:57,120
And the third item is the rating.
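
As a rough sketch, the three sample rows just described could be written in Python like this. The `Rating` type, the field names, and the integer encoding step are assumptions made for illustration; they are not taken from the course code.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One sample: (user, movie, rating). Hypothetical helper type."""
    user: str
    movie: str
    rating: float


# The three example rows from the lecture.
ratings = [
    Rating("Alice", "Avatar", 5.0),
    Rating("Bob", "Star Wars", 4.5),
    Rating("Carol", "The Godfather", 4.0),
]

# Assumption: give every distinct user and movie an integer ID,
# since models generally work with indices rather than names.
user_to_idx = {u: i for i, u in enumerate(sorted({r.user for r in ratings}))}
movie_to_idx = {m: i for i, m in enumerate(sorted({r.movie for r in ratings}))}

# Each sample becomes (user index, movie index, rating).
encoded = [(user_to_idx[r.user], movie_to_idx[r.movie], r.rating) for r in ratings]
print(encoded)
```

Every (user, movie) pair that does not appear in `ratings` is simply missing, which is the usual situation for this kind of data.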

54
00:04:02,260 --> 00:04:06,970
One important feature of this type of dataset is that it must be incomplete.

55
00:04:06,970 --> 00:04:08,200
Why?

56
00:04:08,200 --> 00:04:08,980
Well, think about it.

57
00:04:09,400 --> 00:04:15,220
If all users have already watched and rated all existing movies, then what would there be to recommend?

58
00:04:15,910 --> 00:04:18,820
If I've already watched and rated every movie in existence,

59
00:04:18,820 --> 00:04:21,280
there is no need to recommend anything to me.

60
00:04:21,280 --> 00:04:24,280
You can't recommend a movie to me because I've already seen it.

61
00:04:25,330 --> 00:04:27,070
Luckily, for any real dataset,

62
00:04:27,100 --> 00:04:28,870
this won't be the case.

63
00:04:28,870 --> 00:04:34,660
Imagine you have 1 million users and one hundred thousand movies, which is not unrealistic.

64
00:04:34,660 --> 00:04:39,220
You can probably count the number of movies you've seen in your life, and it's probably a lot less than

65
00:04:39,220 --> 00:04:40,770
one hundred thousand.

66
00:04:40,930 --> 00:04:44,800
The number of movies you've actually bothered to rate is probably even smaller.

67
00:04:49,940 --> 00:04:56,050
The question now is: once we have such a dataset, how does it help us make recommendations?

68
00:04:56,060 --> 00:05:03,350
Let's take a very simple example. Consider Bob. Bob rated every movie in the Star Wars series a 5. Bob

69
00:05:03,380 --> 00:05:06,270
also rated a few of the Star Trek movies a 5,

70
00:05:06,500 --> 00:05:09,510
and Bob also rated Avatar a 5.

71
00:05:09,530 --> 00:05:13,310
Common sense tells us that Bob is probably a fan of sci-fi.

72
00:05:13,490 --> 00:05:17,720
Bob likes movies that have space travel, aliens, and so forth.

73
00:05:17,780 --> 00:05:22,330
It is probable, then, that Bob would like movies that have the same features.

74
00:05:22,550 --> 00:05:28,190
A pretty obvious recommendation for Bob would be a Star Trek movie or TV show that he hasn't seen

75
00:05:28,190 --> 00:05:28,430
yet.

76
00:05:33,520 --> 00:05:35,990
Let's try to generalize this concept.

77
00:05:36,400 --> 00:05:42,130
As you know, we're given a dataset that consists of triples containing users, items, and the users' ratings

78
00:05:42,130 --> 00:05:43,760
for those items.

79
00:05:43,780 --> 00:05:45,960
Suppose we can fit a model to this data.

80
00:05:46,060 --> 00:05:49,510
Let's just call it f. f takes in two arguments:

81
00:05:49,510 --> 00:05:54,540
a user u and a movie m, and it outputs a predicted rating.

82
00:05:54,580 --> 00:05:57,730
There are two things that we would like this function to do.

83
00:05:57,730 --> 00:06:03,520
Number one, obviously, if the user, movie, and rating appear in our dataset, then we would like the predicted

84
00:06:03,520 --> 00:06:07,510
rating to be close to the real rating. And number two,

85
00:06:07,510 --> 00:06:13,150
we would like this function to be able to output a predicted rating even if the rating did not appear

86
00:06:13,150 --> 00:06:19,850
in our training set. Luckily, a neural network is a perfect model for this use case.

87
00:06:19,850 --> 00:06:25,160
As you know, a neural network is a function approximator. It fits to the dataset so that when it

88
00:06:25,160 --> 00:06:31,130
makes predictions, those predictions are close to the training targets. But not only that, it can also

89
00:06:31,190 --> 00:06:39,870
make new predictions for data which did not appear in the training set.

90
00:06:39,910 --> 00:06:43,090
So how might we use such a model to make recommendations?

91
00:06:43,690 --> 00:06:45,510
Well, here's one strategy.

92
00:06:45,730 --> 00:06:52,000
Once we have a model that can predict ratings, this becomes easy. For a given user, predict a rating for

93
00:06:52,000 --> 00:06:58,930
every movie that they have not yet seen. Then sort these movies by predicted rating. The movies you want

94
00:06:58,930 --> 00:07:03,910
to recommend are simply the movies with the highest predicted rating for this user.

95
00:07:03,970 --> 00:07:04,630
Pretty simple.
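
Here is a minimal sketch of that strategy in Python. The `predict` function stands in for a trained model f(u, m); the `recommend` helper and the scores in `fake_scores` are hypothetical values made up purely so the example runs.

```python
def recommend(user, all_movies, seen, predict, k=3):
    """Return the k unseen movies with the highest predicted rating."""
    # Step 1: keep only the movies this user has not rated yet.
    unseen = [m for m in all_movies if m not in seen]
    # Step 2: sort them by predicted rating, highest first, and keep the top k.
    return sorted(unseen, key=lambda m: predict(user, m), reverse=True)[:k]


# Hypothetical predicted ratings, standing in for a trained model.
fake_scores = {
    ("Bob", "Star Wars"): 5.0,
    ("Bob", "Avatar"): 5.0,
    ("Bob", "Star Trek"): 4.8,
    ("Bob", "The Godfather"): 3.9,
    ("Bob", "Titanic"): 2.5,
}


def predict(user, movie):
    return fake_scores[(user, movie)]


movies = ["Star Wars", "Avatar", "Star Trek", "The Godfather", "Titanic"]
top = recommend("Bob", movies, seen={"Star Wars", "Avatar"}, predict=predict, k=2)
print(top)  # ['Star Trek', 'The Godfather']
```

Scoring every unseen movie is fine for a small catalog like this one; whether it is practical for a very large catalog is a separate question that this sketch does not address.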

96
00:07:09,710 --> 00:07:13,760
The next question is, how can we go about building such a model?

97
00:07:13,850 --> 00:07:16,070
We know that we need a neural network of some sort.

98
00:07:16,790 --> 00:07:23,690
But here's a problem: both users and movies are categorical objects. Neural networks are basically a series

99
00:07:23,690 --> 00:07:25,660
of matrix multiplications.

100
00:07:25,910 --> 00:07:30,200
As you know, we can't multiply a categorical object by a number.

101
00:07:30,200 --> 00:07:31,850
What is Star Wars times five?

102
00:07:31,850 --> 00:07:33,110
This does not make sense.

103
00:07:38,230 --> 00:07:41,160
Luckily, we have encountered this problem before.

104
00:07:41,350 --> 00:07:47,830
Strangely, natural language processing is the field we draw inspiration from in order to build deep recommender

105
00:07:47,830 --> 00:07:49,390
systems.

106
00:07:49,390 --> 00:07:53,310
As you know, in NLP, the main input is words.

107
00:07:53,410 --> 00:07:54,550
Words are categorical.

108
00:07:54,550 --> 00:07:56,560
They cannot be multiplied by numbers.

109
00:07:57,040 --> 00:07:58,690
So what do we do?

110
00:07:58,750 --> 00:08:05,350
We use an embedding. Recall that an embedding is a mapping from each category to a feature vector, which

111
00:08:05,350 --> 00:08:08,290
is essentially a list of numbers that can be multiplied.
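
To make that concrete, here is a small sketch of an embedding as a lookup table in plain Python. The dimension, the names, and the random initial values are all assumptions for illustration; in a real model these vectors would be learned during training.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

EMBEDDING_DIM = 4  # assumed size of each feature vector
categories = ["Alice", "Bob", "Carol"]

# Each category gets an integer index...
index = {name: i for i, name in enumerate(categories)}

# ...and each index selects one row of the embedding table.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in categories
]


def embed(name):
    """Map a category to its feature vector (a list of floats)."""
    return embedding_table[index[name]]


bob_vector = embed("Bob")
print(len(bob_vector))  # 4
```

Unlike the name "Bob", `bob_vector` is a list of numbers, so it can be fed into a matrix multiplication.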

112
00:08:13,450 --> 00:08:17,090
Here's how we might use a neural network for recommender systems.

113
00:08:17,140 --> 00:08:23,620
Let's say we have an input user, Bob, and an input movie, Star Wars. The first thing we do is map both the

114
00:08:23,620 --> 00:08:27,920
user and the movie to their respective embedding vectors.

115
00:08:28,000 --> 00:08:32,110
So I have a vector for Bob and I have a vector for Star Wars.

116
00:08:32,230 --> 00:08:37,680
Now, usually these embedding vectors are the same size, but this need not be the case.

117
00:08:37,760 --> 00:08:42,160
So now both Bob and Star Wars are represented by feature vectors.

118
00:08:42,200 --> 00:08:48,090
The next thing we do is concatenate these two feature vectors into a single feature vector.

119
00:08:48,140 --> 00:08:51,340
Once we have a feature vector, the rest is obvious.

120
00:08:51,440 --> 00:08:57,720
We just do the same thing we've always done: pass this through a neural network. Since this is a feature

121
00:08:57,720 --> 00:09:01,320
vector, and not some special object like an image or a sequence,

122
00:09:01,320 --> 00:09:06,900
we can use simple ANNs rather than CNNs or RNNs. For recommender systems,

123
00:09:06,900 --> 00:09:12,900
the final output layer will be a linear regression with no activation function, since predicting ratings

124
00:09:12,900 --> 00:09:15,900
is a regression task and not a classification task.
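
The whole forward pass can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the layer sizes, the single hidden layer with a ReLU, the helper names, and the random untrained weights are all made up here, so the predicted number is meaningless until the model is trained.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(x, weights, bias):
    """Linear layer: one output per row of weights, no activation."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


users = {"Alice": 0, "Bob": 1, "Carol": 2}
movies = {"Avatar": 0, "Star Wars": 1, "The Godfather": 2}
USER_DIM, MOVIE_DIM, HIDDEN = 4, 3, 8  # embedding sizes need not match

user_emb = random_matrix(len(users), USER_DIM)     # one row per user
movie_emb = random_matrix(len(movies), MOVIE_DIM)  # one row per movie
w1, b1 = random_matrix(HIDDEN, USER_DIM + MOVIE_DIM), [0.0] * HIDDEN
w2, b2 = random_matrix(1, HIDDEN), [0.0]


def predict_rating(user, movie):
    # 1. Look up both embeddings and concatenate them into one feature vector.
    x = user_emb[users[user]] + movie_emb[movies[movie]]
    # 2. Pass the feature vector through an ordinary hidden layer.
    h = relu(dense(x, w1, b1))
    # 3. Linear output with no activation, because this is regression.
    return dense(h, w2, b2)[0]


print(predict_rating("Bob", "Star Wars"))
```

Training would adjust the embedding tables together with `w1`, `b1`, `w2`, and `b2` so that the outputs move toward the known ratings, typically by minimizing a squared error.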

125
00:09:21,110 --> 00:09:22,580
As a point of interest,

126
00:09:22,580 --> 00:09:28,260
I said earlier that we can use NLP techniques to inspire models for recommender systems.

127
00:09:28,340 --> 00:09:34,020
In fact, the opposite is true as well: recommender techniques have been used in NLP.

128
00:09:34,070 --> 00:09:40,040
My favorite example of this is a technique called matrix factorization. Matrix factorization, which we

129
00:09:40,040 --> 00:09:44,960
haven't talked about, is an algorithm that has been known for quite a while in the field of recommender

130
00:09:44,960 --> 00:09:49,370
systems. Around the time word embeddings were becoming popular,

131
00:09:49,370 --> 00:09:54,980
specifically due to the algorithm known as word2vec, another similar algorithm came out called GloVe.

132
00:09:55,760 --> 00:09:58,360
GloVe essentially solves the same task as word2vec.

133
00:09:58,430 --> 00:10:03,710
It's an algorithm for finding word embeddings. Using these word embeddings, you can do all sorts of fun

134
00:10:03,710 --> 00:10:06,950
analogies, such as king minus man equals queen minus woman.

135
00:10:07,790 --> 00:10:12,080
Well, it turns out that GloVe is nothing more than matrix factorization,

136
00:10:12,170 --> 00:10:17,540
just an old technique from recommender systems. So it's always good to study techniques from different

137
00:10:17,540 --> 00:10:21,740
fields, because you never know when something you learn can be applied to your own work.