1
00:00:11,060 --> 00:00:16,790
So in this lecture, we will be looking at the code to implement a recommender system using TF-IDF.

2
00:00:17,480 --> 00:00:22,220
As mentioned in the previous exercise prompt, you are strongly encouraged to code this yourself first.

3
00:00:22,610 --> 00:00:27,080
So if you haven't yet done so, this is your last chance to attempt the exercise.

4
00:00:27,500 --> 00:00:28,820
Otherwise, let's move on.

5
00:00:31,720 --> 00:00:35,500
So we'll start by downloading our data set, which is a database of movies.

6
00:00:43,100 --> 00:00:45,380
The next block of code contains our imports.

7
00:00:46,010 --> 00:00:51,320
Note that in this script, we'll need the JSON library since some of the data is stored in JSON format.

8
00:00:51,950 --> 00:00:57,620
Furthermore, note that we've imported cosine similarity and Euclidean distance despite the fact that

9
00:00:57,620 --> 00:00:59,630
these are easy to implement ourselves.

10
00:01:00,140 --> 00:01:05,420
These functions are efficient at doing many computations at once, so it's usually beneficial to use

11
00:01:05,420 --> 00:01:06,680
the built-in methods.

12
00:01:12,240 --> 00:01:15,780
The next step is to read in our data using pd.read_csv.

13
00:01:19,800 --> 00:01:23,400
The next step is to call df.head() to see what's inside our data frame.

14
00:01:27,000 --> 00:01:33,180
OK, so as you can see, the data frame contains quite a few columns, including budget, genres, homepage,

15
00:01:33,180 --> 00:01:34,590
keywords, and so forth.

16
00:01:35,400 --> 00:01:38,400
Note that some of these columns are stored as JSON strings.

17
00:01:38,850 --> 00:01:43,040
You're encouraged to look through this yourself to see what data you might want to use.

18
00:01:47,920 --> 00:01:50,860
So the next step is to explore our data set format.

19
00:01:51,460 --> 00:01:56,650
We'll begin by retrieving the first row of data, which can be done using iloc, passing in the index

20
00:01:56,650 --> 00:01:57,250
zero.

21
00:02:03,590 --> 00:02:09,139
As you can see, this prints out a pandas Series with each of the column names along with the corresponding

22
00:02:09,139 --> 00:02:09,919
values.

23
00:02:13,840 --> 00:02:15,910
The next step is to print out the genres.

24
00:02:16,270 --> 00:02:19,240
This is one of the columns we'll be using to build our text.

25
00:02:23,260 --> 00:02:30,970
OK, so as you can see, this is a JSON-formatted string. In particular, it's a list. Inside the list,

26
00:02:30,970 --> 00:02:37,150
we have individual JSON documents, and each of these items has two keys, which are id and name.

27
00:02:37,990 --> 00:02:44,140
So clearly we are interested in the name attribute, which contains the actual names of the genres or,

28
00:02:44,140 --> 00:02:47,200
in other words, what we would simply refer to as the genres.

29
00:02:47,860 --> 00:02:53,530
Note that some genres, like science fiction, contain two words. In order to treat these as a single

30
00:02:53,530 --> 00:02:54,010
token,

31
00:02:54,370 --> 00:02:58,090
we may want to remove the space and treat it as if it were one word.

32
00:03:01,580 --> 00:03:03,620
The next step is to look at the keywords.

33
00:03:04,010 --> 00:03:07,520
This is another one of those columns we'll be using to build our text.

34
00:03:12,140 --> 00:03:18,230
OK, so as you can see, this is a much more expansive list, but note that it has the same format as

35
00:03:18,230 --> 00:03:19,010
the genres.

36
00:03:19,340 --> 00:03:24,530
It's a list of JSON documents and inside each document we're interested in the name attribute.

37
00:03:28,280 --> 00:03:34,550
OK, so the next step is to convert the JSON string into a format we can actually use in Python, namely

38
00:03:34,550 --> 00:03:36,980
a Python list of Python dictionaries.

39
00:03:37,490 --> 00:03:39,950
We'll do this by calling json.loads.
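As a minimal sketch of this parsing step, assuming a genres string shaped like the one described above (the sample value below is made up for illustration and is not copied from the dataset):

```python
import json

# Hypothetical sample in the same shape as the genres column:
# a JSON list of objects, each with "id" and "name" keys.
genres_json = '[{"id": 1, "name": "Action"}, {"id": 2, "name": "Science Fiction"}]'

# json.loads turns the JSON string into a Python list of dictionaries.
j = json.loads(genres_json)

print(type(j))       # <class 'list'>
print(j[1]["name"])  # Science Fiction
```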

40
00:03:47,150 --> 00:03:53,150
The next step is to write some code to demonstrate how we will convert one of these JSON lists into a single

41
00:03:53,150 --> 00:03:54,050
line of text.

42
00:03:54,800 --> 00:03:59,180
As you recall, this is what is required by the TF-IDF vectorizer.

43
00:03:59,930 --> 00:04:01,280
So what's going on here?

44
00:04:02,090 --> 00:04:06,320
Well, let's start with the fact that j is a list, which we have a for loop over.

45
00:04:06,740 --> 00:04:08,750
So we are looping through this list.

46
00:04:09,470 --> 00:04:13,220
Each item in the list is represented by the variable jj.

47
00:04:14,060 --> 00:04:16,100
OK, so what are we doing with jj?

48
00:04:17,089 --> 00:04:22,340
Well, as you recall, this is a tiny dictionary and we want the value stored in the name attribute.

49
00:04:23,000 --> 00:04:28,880
So normally all we would have to do is use jj['name'], but we can't only do that.

50
00:04:29,510 --> 00:04:34,220
As you recall, some genres such as science fiction, contain multiple tokens.

51
00:04:34,610 --> 00:04:36,740
We would like to treat them like a single word.

52
00:04:37,490 --> 00:04:43,070
One easy way to do this is to split the string on whitespace and then join it back together using an

53
00:04:43,070 --> 00:04:43,820
empty string.

54
00:04:44,780 --> 00:04:50,840
What this will effectively do is just concatenate each individual token into a single string, removing

55
00:04:50,840 --> 00:04:51,740
any whitespace.

56
00:04:52,610 --> 00:04:56,360
Finally, note that we do this for every genre in our list of genres.

57
00:04:57,410 --> 00:05:01,700
The last step is to join all the genres together with a single space in between.

58
00:05:06,180 --> 00:05:12,090
OK, so as you can see, the result is as expected. All the genres now appear in a single string,

59
00:05:12,330 --> 00:05:14,070
separated by a single space.

60
00:05:14,730 --> 00:05:19,980
Note that genres like science fiction, which contained multiple tokens, are now joined into a single

61
00:05:19,980 --> 00:05:20,520
token.
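The split-and-join steps above can be sketched as follows. The list `j` is a hypothetical stand-in for what json.loads returns, and the variable name `genres` is an assumption for illustration:

```python
# Hypothetical parsed genres, as json.loads would return them.
j = [{"id": 1, "name": "Action"}, {"id": 2, "name": "Science Fiction"}]

# For each genre, split the name on whitespace and re-join the pieces with
# an empty string (removing the spaces), then join all the genres together
# with a single space in between.
genres = " ".join("".join(jj["name"].split()) for jj in j)

print(genres)  # Action ScienceFiction
```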

62
00:05:24,280 --> 00:05:27,700
OK, so the next step is to put what we just did into a function.

63
00:05:28,360 --> 00:05:32,710
We're going to do this for both the genres column and the keywords column, and then we're going to

64
00:05:32,710 --> 00:05:35,260
concatenate the results into one string.

65
00:05:36,160 --> 00:05:40,180
So we'll start by defining a function called genres_and_keywords_to_string.

66
00:05:40,900 --> 00:05:45,160
This takes in a single row of our data frame. Inside this function,

67
00:05:45,190 --> 00:05:49,600
we're going to call json.loads for the genres column, as we did above.

68
00:05:50,440 --> 00:05:55,510
The next step is to join all the genres into a single string using the same code as above.

69
00:05:56,590 --> 00:05:59,860
The next step is to do all the same steps to the keywords column.

70
00:06:00,550 --> 00:06:05,260
As you recall, both of these columns have the same format, so the same code will work.

71
00:06:06,790 --> 00:06:11,290
The final step is to join the genres and keywords together into a single string.
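A minimal sketch of such a function is shown below. It is a reconstruction from the description, not the code shown on screen. The column names genres and keywords come from the lecture; the sample row is hypothetical, and a plain dict stands in for a data frame row.

```python
import json


def genres_and_keywords_to_string(row):
    """Build one text string from a row's JSON-encoded genres and keywords."""
    # Parse the genres and collapse multi-word names into single tokens.
    genres = json.loads(row["genres"])
    genres = " ".join("".join(j["name"].split()) for j in genres)

    # The keywords column has the same format, so the same code works.
    keywords = json.loads(row["keywords"])
    keywords = " ".join("".join(j["name"].split()) for j in keywords)

    return f"{genres} {keywords}"


# Hypothetical row for illustration only.
row = {
    "genres": '[{"id": 1, "name": "Science Fiction"}]',
    "keywords": '[{"id": 10, "name": "space war"}, {"id": 11, "name": "alien"}]',
}
print(genres_and_keywords_to_string(row))  # ScienceFiction spacewar alien
```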

72
00:06:17,100 --> 00:06:22,440
The next step is to call df.apply, which will run the function we just wrote on every row of our

73
00:06:22,440 --> 00:06:23,130
data frame,

74
00:06:23,310 --> 00:06:24,240
one at a time.

75
00:06:25,200 --> 00:06:27,660
Note that we'll assign this to a new column called string.

76
00:06:32,960 --> 00:06:39,440
The next step is to create an instance of the TfidfVectorizer class. Note that I've set max_features

77
00:06:39,440 --> 00:06:43,640
to two thousand, which will limit the number of columns in the final matrix.

78
00:06:44,150 --> 00:06:48,560
You're encouraged to read the documentation if you want to learn more, but basically it keeps the most

79
00:06:48,560 --> 00:06:50,470
frequent terms in the corpus.

80
00:06:54,930 --> 00:07:00,750
The next step is to call fit_transform on our dataset. As you recall, our text is stored in a column

81
00:07:00,750 --> 00:07:01,470
called string.

82
00:07:02,580 --> 00:07:07,590
Note that for this example, we won't have a train and test set since that doesn't reflect how this

83
00:07:07,590 --> 00:07:09,330
would be used in the real world.

84
00:07:09,990 --> 00:07:15,540
In practice, we would have a single database of movies, and our TF-IDF vectors would be trained

85
00:07:15,540 --> 00:07:16,980
based on whatever we had.
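As a rough sketch of the vectorizer step, assuming scikit-learn is installed. The three strings below are hypothetical stand-ins for the string column, so the resulting matrix is far smaller than the real one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the data frame's string column.
strings = [
    "Horror Thriller slasher masked killer",
    "Horror slasher summercamp",
    "Romance Comedy wedding bride",
]

# max_features caps the vocabulary at the most frequent terms in the corpus.
tfidf = TfidfVectorizer(max_features=2000)

# Fit and transform the whole collection; there is no train/test split here.
X = tfidf.fit_transform(strings)

print(X.shape)  # (number of documents, number of terms kept)
```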

86
00:07:21,320 --> 00:07:24,770
The next step is to print out X just to see what we get.

87
00:07:25,580 --> 00:07:30,440
Note that we did this for CountVectorizer as well, so it would be interesting to see if the result

88
00:07:30,440 --> 00:07:31,340
is still the same.

89
00:07:35,360 --> 00:07:41,120
OK, so as you can see, it is in fact the same, as X is still stored as a sparse matrix.

90
00:07:41,840 --> 00:07:46,310
Note that the dataset has about 4800 rows and two thousand columns.

91
00:07:46,790 --> 00:07:51,860
And the reason it has two thousand columns is because we set max features to this value.

92
00:07:52,550 --> 00:07:57,680
In other words, we could have had more columns, but we've decided to throw out less frequent terms.

93
00:07:58,220 --> 00:08:02,720
You're encouraged to experiment with this value and see what impact it has on the results.

94
00:08:06,190 --> 00:08:12,100
Now, the next step might seem a bit strange, but recall that we are now working with a matrix of numbers.

95
00:08:12,610 --> 00:08:16,540
It's not obvious which row of the matrix corresponds to which movie.

96
00:08:17,170 --> 00:08:22,810
Of course, our data has been processed in order, so the movies in our data frame will correspond to

97
00:08:22,810 --> 00:08:25,090
the vectors in our TF-IDF matrix.

98
00:08:26,290 --> 00:08:31,210
Now, for reasons which may not be clear yet, it would be useful to have a mapping that tells us for

99
00:08:31,210 --> 00:08:33,760
a given movie, which index does it have?

100
00:08:34,480 --> 00:08:37,000
Luckily, the way to do this is relatively simple.

101
00:08:37,690 --> 00:08:43,360
Note that in our data frame, the index of this data frame is already a list of integers starting from

102
00:08:43,360 --> 00:08:45,220
zero and counting up by one.

103
00:08:45,940 --> 00:08:52,120
This, coincidentally, is also the same as how we count the indices in an array, such as our TF-IDF

104
00:08:52,120 --> 00:08:52,990
matrix.

105
00:08:53,530 --> 00:08:56,890
Therefore, all we have to do now is create a pandas Series.

106
00:08:57,460 --> 00:09:03,340
The values in this series will be df.index, that is, zero, one, two, and so forth.

107
00:09:04,510 --> 00:09:10,300
The index for the series will be the title of the movies, so we're kind of reversing what our data

108
00:09:10,300 --> 00:09:11,590
frame already does.

109
00:09:12,040 --> 00:09:16,210
Our original data frame has integer indices, and this points to the movie.

110
00:09:16,690 --> 00:09:22,630
This new series has the movie title as the index, and it points to the original index as the value.

111
00:09:27,250 --> 00:09:32,470
OK, so as you can see, the titles are stored in the index column and the values in the series are

112
00:09:32,470 --> 00:09:34,240
just zero one two and so forth.
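A minimal sketch of this reverse mapping, assuming pandas is installed. The three-row data frame is hypothetical, and the name `movie2idx` is inferred from the narration rather than confirmed by it:

```python
import pandas as pd

# Hypothetical miniature data frame with the default integer index.
df = pd.DataFrame({"title": ["Scream 3", "Mortal Kombat", "Runaway Bride"]})

# Values are the row positions (df.index); the index is the movie title.
movie2idx = pd.Series(df.index, index=df["title"])

print(movie2idx["Mortal Kombat"])  # 1
```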

113
00:09:38,380 --> 00:09:45,070
The next step is to see how our movie2idx mapping will be used and also to use this to make a recommendation.

114
00:09:45,700 --> 00:09:49,840
I've chosen the movie Scream 3, but you're welcome to choose any movie you wish.

115
00:09:53,700 --> 00:09:58,140
OK, so notice that this gives us back the index where this movie was stored.

116
00:10:01,220 --> 00:10:03,590
OK, so why do we even want this index?

117
00:10:04,220 --> 00:10:06,650
Luckily, that's the next thing we're going to look at.

118
00:10:07,310 --> 00:10:14,030
As you can see, this index is used to grab the correct row inside our TF-IDF matrix, which we've

119
00:10:14,030 --> 00:10:14,840
called X.

120
00:10:15,440 --> 00:10:19,400
This is the TF-IDF vector that corresponds to the movie Scream

121
00:10:19,400 --> 00:10:19,880
3.

122
00:10:23,980 --> 00:10:29,410
OK, so notice that when we print out this query vector, we get back a sparse matrix of size one by

123
00:10:29,410 --> 00:10:31,450
2000, which makes sense.

124
00:10:31,900 --> 00:10:35,890
We've essentially just grabbed a single row of our TF-IDF matrix.

125
00:10:39,210 --> 00:10:43,080
Now, suppose that you'd like to see the values that are held within this matrix.

126
00:10:43,590 --> 00:10:46,620
One simple way to do this is to call the toarray function.

127
00:10:50,640 --> 00:10:56,670
So as you can see, most of the values are just zero, as you may have expected since this is a sparse

128
00:10:56,670 --> 00:10:57,390
matrix.

129
00:11:01,340 --> 00:11:07,280
The next step is to compute the cosine similarity between our query vector and every vector in X.

130
00:11:07,910 --> 00:11:12,660
Note that this includes the query vector itself because the query vector came from X.

131
00:11:13,160 --> 00:11:14,600
This is not always the case,

132
00:11:14,630 --> 00:11:18,740
for example, if you have a search engine where the user can type in their own data.

133
00:11:19,460 --> 00:11:21,380
However, for the purpose of this lecture,

134
00:11:21,620 --> 00:11:27,020
we'll assume that our system is like Netflix, where we base our recommendations on movies which are

135
00:11:27,020 --> 00:11:28,520
already in our database.

136
00:11:30,460 --> 00:11:36,430
OK, so this gives us back a variable, which we will call scores. Now, just as an exercise, let's think

137
00:11:36,430 --> 00:11:38,560
about how many scores there will be.

138
00:11:39,460 --> 00:11:42,370
Remember that this function does a pairwise similarity.

139
00:11:42,910 --> 00:11:48,880
So if I had 10 query vectors and 20 vectors to check against, then my result would be 10 by 20.

140
00:11:49,630 --> 00:11:54,280
In this case, I have one query vector and about 4800 vectors in my database.

141
00:11:54,760 --> 00:11:58,960
Therefore, the result will be approximately one by 4800.

142
00:12:03,120 --> 00:12:06,360
OK, so note that most of the values appear to be zero.

143
00:12:06,990 --> 00:12:12,810
This makes sense since if two movies don't share any common terms, their dot product will be zero, which

144
00:12:12,810 --> 00:12:15,540
means the cosine similarity will also be zero.
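The shape of the result, and the zero score for vectors with no shared terms, can be checked on a toy example. This sketch assumes NumPy and scikit-learn are installed; the small dense matrix is a hypothetical stand-in for the sparse TF-IDF matrix:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF-like rows; row 2 shares no terms with row 0.
X = np.array([
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
])

query = X[0:1]  # slice so the query stays 2-D: one query vector

# Pairwise similarity: one query against three vectors gives a 1 x 3 result.
scores = cosine_similarity(query, X)

print(scores.shape)  # (1, 3)
print(scores)        # first entry is the query against itself
```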

145
00:12:18,890 --> 00:12:24,300
Now, because our array is of shape one by N, we would like to flatten it to be a 1-D array.

146
00:12:29,320 --> 00:12:32,680
The next step is to plot our scores just to see what they look like.

147
00:12:36,940 --> 00:12:40,870
OK, so as you can see, it appears very noisy with one big spike.

148
00:12:41,440 --> 00:12:46,570
Of course, this big spike is just the query movie, since the cosine similarity between two of

149
00:12:46,570 --> 00:12:52,870
the same vector is just one. The other similarities are much smaller, with a max around zero point

150
00:12:52,870 --> 00:12:53,350
four.

151
00:12:54,370 --> 00:12:59,380
Note that because the movies are in a random order, it's difficult to tell where we have zeros.

152
00:13:03,740 --> 00:13:09,140
Now we know that we would like to do something like sort the scores, but this is not exactly what we

153
00:13:09,140 --> 00:13:09,650
want.

154
00:13:10,550 --> 00:13:16,160
Firstly, we know that sorting by default usually results in the items going in ascending order.

155
00:13:16,910 --> 00:13:21,830
In our case, we want them to go in descending order with the most similar item at the front.

156
00:13:22,520 --> 00:13:28,070
This makes sense since if we want the top five matches, we'll just need to take the first five values.

157
00:13:29,030 --> 00:13:34,020
The second issue is that we don't really want sort because we don't care about the score itself.

158
00:13:34,790 --> 00:13:40,340
We instead want argsort, which tells us which order the movies go in if we sorted by the score.

159
00:13:41,120 --> 00:13:45,740
Again, we don't care about the score values, only how they rank amongst one another.

160
00:13:51,680 --> 00:13:54,410
The next step is to plot the scores after sorting them.

161
00:13:55,310 --> 00:13:59,510
Note that we can simply index the scores array by the previous results.

162
00:14:00,140 --> 00:14:06,170
This is because with NumPy arrays, you can index them with other arrays, provided that the index

163
00:14:06,170 --> 00:14:08,390
array contains the appropriate values.
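A small sketch of argsort and array indexing, assuming NumPy is installed. The scores are made up, and negating them is one common way to get descending order; the narration does not say which approach the original code uses:

```python
import numpy as np

# Hypothetical similarity scores for five movies.
scores = np.array([0.1, 1.0, 0.0, 0.4, 0.25])

# argsort returns positions in ascending order of value, so negating the
# scores yields the positions in descending order of score.
sorted_idx = (-scores).argsort()

print(sorted_idx)  # [1 3 4 0 2]

# Indexing the array with another array reorders the scores, highest first.
sorted_scores = scores[sorted_idx]
print(sorted_scores)
```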

164
00:14:13,860 --> 00:14:16,440
OK, so this result makes much more sense.

165
00:14:17,090 --> 00:14:22,590
Of course, the top score is one, which is just the query. We then have a few hundred movies which are

166
00:14:22,590 --> 00:14:26,400
partial matches with a score less than one but bigger than zero.

167
00:14:27,540 --> 00:14:32,910
Finally, we see that most of the movies have a score of zero because they are completely unrelated

168
00:14:32,910 --> 00:14:33,630
to the query.

169
00:14:38,190 --> 00:14:41,280
OK, so the next step is to actually retrieve our matches.

170
00:14:41,790 --> 00:14:43,650
In fact, we've kind of already done that.

171
00:14:44,250 --> 00:14:50,550
All we really need to do is take the sorted indices and index those from position one up to position

172
00:14:50,550 --> 00:14:51,190
six.

173
00:14:51,840 --> 00:14:56,430
Recall that we don't want to start at zero because the first movie is just the query itself.

174
00:14:57,060 --> 00:14:59,490
We'll call the result recommended_idx.

175
00:15:06,320 --> 00:15:09,380
Now, currently, these recommendations are just integers.

176
00:15:09,890 --> 00:15:13,340
What would be more meaningful to us is to see the actual titles.

177
00:15:13,970 --> 00:15:18,230
Of course, as you recall, these indices map back to our original data frame.

178
00:15:18,830 --> 00:15:25,190
Thus, we can simply use iloc, passing in recommended_idx. Since we only care to see

179
00:15:25,190 --> 00:15:25,880
the title,

180
00:15:25,910 --> 00:15:28,430
we can also grab only the column called title.

181
00:15:34,150 --> 00:15:38,800
OK, so as you can see, these results seem promising. For Scream 3,

182
00:15:38,830 --> 00:15:44,710
we get two Friday the 13th movies, Graduation Day, The Calling, and The Glimmer Man.

183
00:15:45,250 --> 00:15:47,500
These are all thrillers, which makes sense.

184
00:15:48,070 --> 00:15:51,730
You're welcome to look these up for yourself to confirm that this is the case.

185
00:15:56,950 --> 00:16:02,260
OK, so because we don't want to have to do all that work again, to make new recommendations for other

186
00:16:02,260 --> 00:16:07,360
movies, we're going to write a function to encapsulate all the previous work we just did.

187
00:16:08,140 --> 00:16:13,360
So let's write a function called recommend that will take in a single input, which is the title of

188
00:16:13,360 --> 00:16:13,900
a movie.

189
00:16:14,650 --> 00:16:17,920
We'll assume that this title always exists in our database.

190
00:16:19,860 --> 00:16:24,750
The first step in this function is as before to grab the index for the movie title.

191
00:16:26,100 --> 00:16:31,890
Note that one thing we didn't discuss previously is that the Pandas API is a bit inconsistent.

192
00:16:32,550 --> 00:16:38,340
Specifically, if your movie title is the same for multiple rows, the result will not just be a single

193
00:16:38,340 --> 00:16:39,120
index.

194
00:16:39,720 --> 00:16:44,430
In that case, you'll get a pandas Series, which is a completely different type of object.

195
00:16:45,150 --> 00:16:47,820
So the next step is to check the type of idx.

196
00:16:48,360 --> 00:16:52,530
If it's a pandas Series, then we know that there were multiple of the same title.

197
00:16:53,370 --> 00:16:56,370
So inside the if statement, we simply grab the first item.

198
00:16:57,270 --> 00:16:58,620
Note that this is arbitrary.

199
00:16:59,160 --> 00:17:04,230
Another option would be to ask the user which one they want to choose or simply combine the results

200
00:17:04,230 --> 00:17:06,060
of all the movies with that title.

201
00:17:07,050 --> 00:17:12,359
Of course, in a real system such as Netflix, you wouldn't necessarily be searching by title, but

202
00:17:12,359 --> 00:17:15,690
instead you would have the ID of movies the user actually watched.

203
00:17:16,260 --> 00:17:18,540
So in that case, there wouldn't be any ambiguity.

204
00:17:22,619 --> 00:17:27,300
Note that the remaining steps in this function are the same as what we've already seen, so I'll just

205
00:17:27,300 --> 00:17:28,349
review them quickly.

206
00:17:29,220 --> 00:17:34,740
The next step is to use idx to grab the corresponding TF-IDF vector from X.

207
00:17:35,220 --> 00:17:36,360
We'll call this our query.

208
00:17:37,560 --> 00:17:43,680
The next step is to apply the cosine similarity function between the query vector and all of X. We'll

209
00:17:43,680 --> 00:17:45,090
call the results scores.

210
00:17:46,500 --> 00:17:50,340
The next step is to flatten the scores so that they become a 1D array.

211
00:17:51,300 --> 00:17:56,040
The next step is to sort the scores in descending order and to grab the top five matches,

212
00:17:56,220 --> 00:18:04,230
excluding the element at index zero. The final step is to index our original data frame with the recommendations

213
00:18:04,500 --> 00:18:07,430
and to return the titles for those recommendations.
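Putting the steps just reviewed together, a sketch of such a function might look like the following. It assumes pandas and scikit-learn are installed. The five-row data frame is hypothetical, the `top_n` parameter is an added convenience that is not part of the lecture's single-input function, and the row is taken with a slice rather than a bare index only to keep the query two-dimensional.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical miniature dataset standing in for the movie database.
df = pd.DataFrame({
    "title": ["Slasher A", "Slasher B", "Wedding A", "Wedding B", "Slasher A"],
    "string": [
        "Horror Thriller slasher mask",
        "Horror slasher camp",
        "Romance Comedy wedding bride",
        "Comedy wedding family",
        "Horror remake slasher mask",
    ],
})

X = TfidfVectorizer(max_features=2000).fit_transform(df["string"])
movie2idx = pd.Series(df.index, index=df["title"])


def recommend(title, top_n=5):
    # Look up the row position for this title (assumed to exist).
    idx = movie2idx[title]
    # Duplicate titles return a Series; arbitrarily keep the first one.
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]

    query = X[idx:idx + 1]                # 1 x n_terms sparse row
    scores = cosine_similarity(query, X)  # 1 x n_movies
    scores = scores.flatten()             # 1-D array

    # Descending order; position 0 is the query itself, so skip it.
    recommended_idx = (-scores).argsort()[1:top_n + 1]
    return df["title"].iloc[recommended_idx]


print(recommend("Wedding A", top_n=1))
```

Note that "Slasher A" appears twice on purpose, so calling recommend("Slasher A") exercises the duplicate-title branch.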

214
00:18:14,060 --> 00:18:19,670
OK, so the next step will be to test our function using Scream 3 once again to see if the results

215
00:18:19,670 --> 00:18:20,540
will be the same.

216
00:18:26,330 --> 00:18:30,410
OK, so luckily, the results are still the same, which means our function works.

217
00:18:34,290 --> 00:18:37,260
Next, let's check the recommendations for Mortal Kombat.

218
00:18:41,720 --> 00:18:44,300
OK, so these results also look pretty good.

219
00:18:45,050 --> 00:18:48,470
First, we see another Mortal Kombat movie, which makes sense.

220
00:18:49,100 --> 00:18:52,010
We also see Dead or Alive, In the Name of the King

221
00:18:52,010 --> 00:18:52,530
3,

222
00:18:52,880 --> 00:18:54,920
Street Fighter, and Alone in the Dark.

223
00:18:55,610 --> 00:19:00,140
Now, you probably don't recognize these if you're not into video games, but rest assured,

224
00:19:00,230 --> 00:19:02,690
these are all movies based on video games.

225
00:19:03,230 --> 00:19:08,570
Even more, Dead or Alive and Street Fighter, just like Mortal Kombat, are video games from the fighting

226
00:19:08,570 --> 00:19:09,260
game genre.

227
00:19:09,830 --> 00:19:15,290
In fact, there was a big rivalry between Street Fighter and Mortal Kombat back when arcades were popular

228
00:19:15,290 --> 00:19:16,190
in the 90s.

229
00:19:16,640 --> 00:19:18,640
So these results make a lot of sense.

230
00:19:23,220 --> 00:19:26,400
Finally, let's check the recommendations for Runaway Bride.

231
00:19:31,560 --> 00:19:34,710
OK, so again, these results also look pretty good.

232
00:19:35,160 --> 00:19:38,250
There is House of D, My Big Fat Greek Wedding 2,

233
00:19:38,490 --> 00:19:44,390
It Happened One Night, An Education, and Our Family Wedding.

234
00:19:44,400 --> 00:19:48,060
Again, you've probably never heard of these movies unless you're into that sort of thing.

235
00:19:48,390 --> 00:19:53,190
But if you look them up, you'll see that these are all movies that fall into the romance, comedy or

236
00:19:53,190 --> 00:19:54,270
drama category.

237
00:19:54,750 --> 00:19:58,120
And of course, there's a lot of overlap between these different categories.

238
00:19:58,140 --> 00:20:02,940
For example, a romance is often also a drama, or you'll have a romantic comedy.

239
00:20:03,900 --> 00:20:07,890
At least three of these movies are about weddings, at least from the titles.

240
00:20:08,430 --> 00:20:13,770
The query movie has bride in the title, while My Big Fat Greek Wedding and Our Family Wedding have

241
00:20:13,770 --> 00:20:15,750
the actual word wedding in the title.

242
00:20:16,440 --> 00:20:19,680
So again, these recommendations appear to make sense.