1
00:00:11,120 --> 00:00:15,290
So in this lecture, you'll be given a few exercises to practice using SVD.

2
00:00:16,309 --> 00:00:20,990
Previously, you learned that SVD is helpful for reducing redundant dimensions.

3
00:00:21,500 --> 00:00:25,910
So for instance, run and sprint might be combined into a single dimension.

4
00:00:26,510 --> 00:00:32,570
And therefore, if you used a search engine, a user querying for the word run might also get back documents

5
00:00:32,570 --> 00:00:34,010
that contain the word sprint.

6
00:00:35,210 --> 00:00:41,390
So your first exercise is this: we built a recommender system using TF-IDF earlier in the course.

7
00:00:41,930 --> 00:00:48,650
It works by doing a nearest neighbor search on the TF-IDF vectors directly, but we know that TF-IDF vectors

8
00:00:48,650 --> 00:00:53,180
can have large dimensionality and further might contain redundant terms.

9
00:00:53,840 --> 00:00:58,910
In our case, when we were looking at movie reviews, one example might be action and adventure.

10
00:00:59,180 --> 00:01:01,910
Typically, these both occur together for the same movie.

11
00:01:02,840 --> 00:01:06,170
In other words, they are redundant. For this exercise,

12
00:01:06,170 --> 00:01:11,900
you should modify that script to use latent semantic analysis on the TF-IDF vectors.

13
00:01:12,500 --> 00:01:17,630
So now your nearest neighbor search should be done on the Z vectors or the latent semantic analysis

14
00:01:17,630 --> 00:01:21,440
vectors instead of the X vectors or the TF-IDF vectors.
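
As a rough sketch of this first exercise: the snippet below runs the nearest neighbor search on the Z vectors rather than the X vectors. The toy `docs` list, the `recommend` helper, and the choice of 2 components are all assumptions made for illustration; the actual script from earlier in the course will have its own data and structure.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the movie text used earlier in the course (hypothetical data).
docs = [
    "action adventure hero fights villains",
    "adventure action quest across the jungle",
    "romantic comedy about a wedding",
    "comedy romance set in paris",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)       # N x D sparse TF-IDF matrix

svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)            # N x K dense LSA matrix

def recommend(query_idx, Z, top_n=1):
    """Return indices of the top_n nearest documents by cosine similarity on Z."""
    sims = cosine_similarity(Z[query_idx:query_idx + 1], Z).ravel()
    sims[query_idx] = -np.inf       # never recommend the query document itself
    return np.argsort(-sims)[:top_n]

print(recommend(0, Z))
```

The only change relative to a plain TF-IDF recommender is that the similarity is computed on `Z` instead of `X`.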

15
00:01:26,050 --> 00:01:31,720
The next exercise will involve revisiting text classification, which could be anything from classifying

16
00:01:31,720 --> 00:01:35,410
the news to spam detection to sentiment analysis.

17
00:01:35,950 --> 00:01:38,650
Note that we did these examples earlier in the course.

18
00:01:39,340 --> 00:01:45,460
So your second task will be to take one of these notebooks and modify it to use latent semantic analysis

19
00:01:45,790 --> 00:01:47,290
as a preprocessing step.

20
00:01:47,950 --> 00:01:53,350
That is, instead of training your model on X train and Y train and then testing on X test and Y test,

21
00:01:53,860 --> 00:01:56,470
you are going to transform the X matrix first.

22
00:01:57,190 --> 00:02:02,950
As such, your classifier will be trained on Z Train and Y Train and tested on Z Test and Y Test.

23
00:02:04,470 --> 00:02:10,020
It would be useful to do a comparison between the performance of the model with and without this step

24
00:02:10,050 --> 00:02:11,910
using different numbers of features.

25
00:02:13,380 --> 00:02:19,620
As you recall, the CountVectorizer and TF-IDF can be told to keep only the most frequent words.

26
00:02:20,160 --> 00:02:26,700
So for a fair comparison, if your Z matrix has dimension K, then also keep K words in your TF-IDF

27
00:02:26,700 --> 00:02:30,960
matrix and then plot your model's performance for different values of K.

28
00:02:31,590 --> 00:02:38,070
So you'll have K on the horizontal axis and model performance, such as accuracy on the vertical axis.
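
One possible way to set up this comparison is sketched below. The synthetic sentiment data, the word lists, the logistic regression classifier, and the particular values of K are all assumptions for illustration; substitute the dataset and model from whichever notebook you choose. Both pipelines are fit on the training text only, so Z test is produced by the SVD that was fit on X train.

```python
import random

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in dataset (hypothetical): 3 sentiment words + 4 neutral words per document.
random.seed(0)
POS = ["great", "fun", "loved", "excellent", "enjoyable", "brilliant"]
NEG = ["boring", "awful", "hated", "terrible", "dull", "weak"]
NEUTRAL = ["movie", "plot", "actor", "scene", "film", "story", "music", "ending"]

def make_doc(sentiment_words):
    words = random.choices(sentiment_words, k=3) + random.choices(NEUTRAL, k=4)
    return " ".join(words)

texts = [make_doc(POS) for _ in range(60)] + [make_doc(NEG) for _ in range(60)]
labels = [1] * 60 + [0] * 60

text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)

def score_tfidf_only(k):
    # Baseline: keep only the k most frequent words, train directly on X.
    model = make_pipeline(TfidfVectorizer(max_features=k), LogisticRegression())
    model.fit(text_train, y_train)
    return model.score(text_test, y_test)

def score_with_lsa(k):
    # LSA: full TF-IDF matrix reduced to k dimensions, train on Z.
    model = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=k, random_state=0),
        LogisticRegression(),
    )
    model.fit(text_train, y_train)
    return model.score(text_test, y_test)

ks = [2, 4, 8]
results = {k: (score_tfidf_only(k), score_with_lsa(k)) for k in ks}
for k, (acc_x, acc_z) in results.items():
    print(f"K={k}: TF-IDF only={acc_x:.3f}, with LSA={acc_z:.3f}")
```

To finish the exercise, plot the two accuracy series in `results` against `ks`, with K on the horizontal axis.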

29
00:02:42,800 --> 00:02:50,000
Your next exercise is to revisit topic modeling. As you recall, topic modeling gives us back two matrices,

30
00:02:50,390 --> 00:02:54,350
one with documents by topics and another with topics by words.

31
00:02:54,980 --> 00:03:02,150
For this exercise, you're going to use latent semantic analysis instead of LDA or NMF. Now, to convince

32
00:03:02,150 --> 00:03:05,060
you that SVD gives us back the same matrices,

33
00:03:05,390 --> 00:03:08,510
I have to discuss a little bit about how SVD works.

34
00:03:09,260 --> 00:03:16,790
Essentially, SVD splits up your input matrix into three matrices: U multiplied by S multiplied by

35
00:03:16,790 --> 00:03:17,870
V transpose.

36
00:03:18,500 --> 00:03:23,090
Now, you don't really have to know what these matrices mean or why they're relevant, but you do have

37
00:03:23,090 --> 00:03:24,080
to know their shapes.

38
00:03:24,740 --> 00:03:28,730
Suppose that our X matrix has the shape N by D. Then our U

39
00:03:28,730 --> 00:03:36,500
matrix will have the shape N by K, our S matrix will be a diagonal matrix with the shape K by K, and our V

40
00:03:36,500 --> 00:03:37,580
transpose matrix

41
00:03:37,790 --> 00:03:39,470
will have the shape K by D.

42
00:03:40,160 --> 00:03:46,820
Because of this, V, not transposed, has the shape D by K, but we typically work with the K by D version

43
00:03:46,820 --> 00:03:47,240
anyway.
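
The shapes described above can be checked with a few lines of NumPy. This is only a sketch: the random matrix and the sizes N, D, and K are made up, and the rank-K factors are obtained here by truncating NumPy's full SVD, which is an assumption about how you might reproduce the shapes by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 6, 4, 2
X = rng.random((N, D))                 # N x D input matrix (made-up data)

# Thin SVD: U is N x min(N, D), s holds the singular values, Vt is min(N, D) x D.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top K components.
U_k = U[:, :K]                         # N x K
S_k = np.diag(s[:K])                   # K x K diagonal matrix
Vt_k = Vt[:K, :]                       # K x D (V transpose)

print(U_k.shape, S_k.shape, Vt_k.shape)

# Multiplying the three factors gives a rank-K approximation with X's shape.
X_approx = U_k @ S_k @ Vt_k
print(X_approx.shape)
```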

44
00:03:48,050 --> 00:03:49,850
But this should remind you a lot of NMF.

45
00:03:49,890 --> 00:03:57,650
NMF is simply another way to do a matrix decomposition, where we say X is equal to W times H.

46
00:03:58,220 --> 00:04:02,900
In this case, W has the shape N by K and H has the shape K by D.

47
00:04:03,710 --> 00:04:09,380
So the V transpose matrix in SVD is analogous to the H matrix in NMF.

48
00:04:09,650 --> 00:04:11,630
They are both topics by words.

49
00:04:12,200 --> 00:04:18,680
In fact, in scikit-learn, the V transpose matrix can be accessed by calling svd.components with an

50
00:04:18,680 --> 00:04:21,110
underscore, just like NMF and LDA.

51
00:04:21,980 --> 00:04:27,680
Furthermore, when you call model.transform, you get back your Z matrix, which is of shape N

52
00:04:27,680 --> 00:04:30,620
by K or, in other words, documents by topics.

53
00:04:31,160 --> 00:04:37,040
So exactly the same as NMF and LDA. As a side note, for SVD,

54
00:04:37,580 --> 00:04:42,770
this is equivalent to U times S. So in fact, no code changes are even required.

55
00:04:43,100 --> 00:04:47,510
You simply need to plug in SVD wherever you see NMF or LDA.
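
To illustrate the drop-in idea, here is a small sketch using scikit-learn's `NMF` and `TruncatedSVD`. The toy `docs`, the `top_words` helper, and K = 2 are assumptions for illustration, not the course notebook. The last few lines also check, up to the sign of each component, that the transformed matrix matches U times S computed with NumPy.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
docs = [
    "the movie has action and adventure",
    "an action movie with a long chase",
    "a romance movie with comedy",
    "comedy and romance in one movie",
    "adventure movie with some comedy",
]
K = 2

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # N x D
words = vectorizer.get_feature_names_out()

def top_words(model, words, n=3):
    """List the n highest-weighted words per topic for any model with components_."""
    return [[words[j] for j in np.argsort(-row)[:n]] for row in model.components_]

nmf = NMF(n_components=K, init="nndsvda", random_state=0, max_iter=1000)
W = nmf.fit_transform(X)                      # documents x topics

svd = TruncatedSVD(n_components=K, random_state=0)
Z = svd.fit_transform(X)                      # documents x topics

# Same helper, same attribute name, different model.
print(top_words(nmf, words))
print(top_words(svd, words))

# Compare Z against U times S from a full SVD (signs of components may differ).
U, s, Vt = np.linalg.svd(X.toarray(), full_matrices=False)
US = U[:, :K] * s[:K]
same_up_to_sign = np.allclose(np.abs(Z), np.abs(US), atol=1e-6)
print(same_up_to_sign)
```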

56
00:04:48,590 --> 00:04:53,840
Now there's one slight difference between SVD and the other methods which you will see for yourself when

57
00:04:53,840 --> 00:04:55,160
you do this exercise.

58
00:04:55,820 --> 00:05:01,250
Unlike LDA and NMF, there are no constraints on the values that we get back from SVD.

59
00:05:01,490 --> 00:05:07,160
They can be either positive or negative, so you might find documents that contain negative amounts

60
00:05:07,160 --> 00:05:07,970
of topics.

61
00:05:08,570 --> 00:05:14,990
So that's why LDA and NMF would be conceptually better for topic modeling compared to SVD.
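
You can see the sign difference on a tiny example. This is a sketch with made-up documents and K = 2; it simply prints the smallest entry of the documents-by-topics matrix from each method.

```python
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; every document shares the word "movie".
docs = [
    "action adventure movie",
    "action movie chase",
    "romance comedy movie",
    "comedy movie wedding",
]

X = TfidfVectorizer().fit_transform(docs)

Z_svd = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
W_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000).fit_transform(X)

# SVD places no sign constraint on its output; NMF output is non-negative by construction.
print("smallest SVD entry:", Z_svd.min())
print("smallest NMF entry:", W_nmf.min())
```

A negative entry in `Z_svd` is what "a negative amount of a topic" looks like in practice.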

62
00:05:19,830 --> 00:05:26,490
So the final exercise is perhaps the most difficult. As you saw earlier in this course, latent semantic

63
00:05:26,490 --> 00:05:31,770
analysis can also be applied to text summarization. Exactly how it does

64
00:05:31,770 --> 00:05:35,130
this is a bit non-trivial. For this exercise,

65
00:05:35,130 --> 00:05:40,470
I've attached two papers to extra_reading.txt, which explain how these methods work.

66
00:05:41,070 --> 00:05:42,690
The first paper is the simplest.

67
00:05:43,260 --> 00:05:47,760
The second paper references the first and tries to show that their method is better.

68
00:05:48,450 --> 00:05:52,440
So the exercise is to read these two papers and implement the LSA

69
00:05:52,440 --> 00:05:55,470
summarizer yourself using both of these methods.

