1
00:00:11,060 --> 00:00:16,880
So in the previous lecture, we discussed SVD in a generic sense, but we don't yet know how it applies

2
00:00:16,880 --> 00:00:19,380
to NLP. In this lecture,

3
00:00:19,400 --> 00:00:26,030
we'll be discussing latent semantic analysis, which is precisely what we get when we apply SVD to a

4
00:00:26,030 --> 00:00:29,360
term-document matrix or a document-term matrix.

5
00:00:30,110 --> 00:00:36,410
But before we get to NLP specifically, let's discuss one more thing SVD can do, which is reduce

6
00:00:36,410 --> 00:00:37,790
redundant dimensions.

7
00:00:38,420 --> 00:00:43,550
As you recall, I previously gave an example where we were looking at the height and weight of people.

8
00:00:44,180 --> 00:00:50,060
We noted that these two dimensions are redundant because they can both be described by a single variable,

9
00:00:50,060 --> 00:00:51,080
which is size.

10
00:00:51,800 --> 00:00:56,510
To connect this back to the previous lecture, note how this is achieved by rotation.

11
00:00:57,140 --> 00:01:03,830
We simply rotate the data, or equivalently rotate the axes, such that the wide part goes along one axis

12
00:01:04,220 --> 00:01:06,440
and the skinny part goes along the other axis.

13
00:01:07,430 --> 00:01:10,820
Clearly, the wide part is the part that represents our size variable.

14
00:01:11,180 --> 00:01:14,090
And so this is the dimension we would decide to keep.

15
00:01:14,720 --> 00:01:18,560
The other dimension looks like pure noise so it can be removed.

16
00:01:19,220 --> 00:01:22,430
So that's how rotating allows us to remove redundancy.
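As a rough illustration of the rotation just described, here is a minimal NumPy sketch. The height and weight data are synthetic, and the variable names (size, height, weight) and the noise level are assumptions made up for this example, not data from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden "size" variable that drives both measurements
size = rng.normal(size=200)
height = size + 0.1 * rng.normal(size=200)  # tracks size, plus a little noise
weight = size + 0.1 * rng.normal(size=200)  # tracks size, plus a little noise

X = np.column_stack([height, weight])
X = X - X.mean(axis=0)  # center so the rotation is about the origin

# SVD finds the rotation; rows of Vt are the new axes
U, s, Vt = np.linalg.svd(X, full_matrices=False)

Z = X @ Vt.T   # the rotated data: wide part first, skinny part second
Z1 = Z[:, :1]  # keep only the wide direction, drop the noise direction

print(s)
```

If the two columns really are redundant, the first singular value should come out much larger than the second, and Z1 holds the single size-like coordinate.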

17
00:01:22,970 --> 00:01:29,240
Height and weight were redundant, and we combine these into a single variable called size using rotation

18
00:01:29,540 --> 00:01:30,770
or, in other words, SVD.

19
00:01:35,900 --> 00:01:40,550
Now, at the beginning of this section, I mentioned the concepts of synonymy and polysemy.

20
00:01:41,240 --> 00:01:45,850
Personally, I don't think these are great ways to introduce SVD or LSA.

21
00:01:46,520 --> 00:01:52,670
And the reason is that most resources tend to introduce these concepts, but never explain exactly how

22
00:01:52,680 --> 00:01:54,620
SVD fixes these issues.

23
00:01:55,160 --> 00:01:57,110
They usually just end up saying something like:

24
00:01:57,390 --> 00:01:59,900
"And therefore, SVD appears to help."

25
00:02:00,530 --> 00:02:03,470
This is, in my opinion, not satisfactory.

26
00:02:04,220 --> 00:02:09,050
And this lecture will propose a simpler view based on what SVD actually does.

27
00:02:09,830 --> 00:02:16,250
As you've just seen, SVD can reduce redundant dimensions. Let's consider how this might arise in

28
00:02:16,250 --> 00:02:16,850
NLP.

29
00:02:18,230 --> 00:02:24,410
Suppose that we have two words which always co-occur, for instance, cat and feline. So every time

30
00:02:24,410 --> 00:02:28,700
cat has a high value in the document-term matrix, feline does as well.

31
00:02:29,510 --> 00:02:33,050
Another example is mitochondria, ribosome, and cytoplasm.

32
00:02:33,740 --> 00:02:35,750
These are all parts of a biological cell.

33
00:02:36,110 --> 00:02:37,970
But note that they are not synonymous.

34
00:02:38,390 --> 00:02:40,640
They simply appear when the other words appear.

35
00:02:41,330 --> 00:02:44,450
However, you can imagine how these, too, are redundant.

36
00:02:45,710 --> 00:02:51,590
Any document about cells is likely to include words like mitochondria and ribosome and cytoplasm.

37
00:02:52,190 --> 00:02:56,390
Every time the count for one of these is high, the counts for the others are high as well.

38
00:02:57,080 --> 00:02:59,810
And thus we have the same situation as height and weight.

39
00:03:00,320 --> 00:03:02,960
These can all be reduced to single variables.

40
00:03:03,440 --> 00:03:10,790
Mitochondria, ribosome, and cytoplasm are words having to do with the cell. Cat and feline both just mean

41
00:03:10,790 --> 00:03:11,300
cat.

42
00:03:11,810 --> 00:03:14,630
And therefore, their dimensionality can be reduced.

43
00:03:15,290 --> 00:03:17,750
And again, these are not necessarily synonyms.

44
00:03:18,020 --> 00:03:24,590
They simply co-occur. And note how this is the same situation as our height and weight example.

45
00:03:25,100 --> 00:03:28,580
When the value of one is big, the value of the other is big as well.

46
00:03:29,180 --> 00:03:33,530
In other words, it's the same kind of redundancy that SVD can help get rid of.
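To make the redundancy concrete, here is a small sketch with a made-up document-term count matrix. The counts and the column order are hypothetical; the columns for cat and feline always move together, as do the columns for the three cell words.

```python
import numpy as np

# Columns: cat, feline, mitochondria, ribosome, cytoplasm (hypothetical counts)
X = np.array([
    [2, 2, 0, 0, 0],   # documents about cats
    [3, 3, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 2, 2, 2],   # documents about cells
    [0, 0, 4, 4, 4],
    [0, 0, 1, 1, 1],
], dtype=float)

# Singular values measure how much of the data lies along each rotated axis
s = np.linalg.svd(X, compute_uv=False)

# Count the axes that carry any signal at all
n_needed = int(np.sum(s > 1e-9))

print(s.round(3), n_needed)
```

There are five word columns, but only two singular values are non-zero: one direction for the cat words and one for the cell words.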

47
00:03:38,220 --> 00:03:42,900
So what do we get when we apply SVD to a document-term matrix?

48
00:03:43,560 --> 00:03:45,750
Well, it's very similar to topic modeling.

49
00:03:46,350 --> 00:03:52,590
We input a document-term matrix called X and we transform it into a document-topic matrix called

50
00:03:52,590 --> 00:03:53,040
Z.

51
00:03:53,760 --> 00:03:59,760
As mentioned, our special SVD rotation will make it so that all the true data occurs in the leftmost

52
00:03:59,760 --> 00:04:00,690
columns of Z.

53
00:04:01,050 --> 00:04:03,230
while all the noise occurs to the right.

54
00:04:03,930 --> 00:04:07,350
The code we use will actually cut off all the noise columns automatically.

55
00:04:07,710 --> 00:04:10,200
So what we get back is just the relevant data.

56
00:04:11,250 --> 00:04:14,610
Again, you can imagine these like hidden factors or topics.

57
00:04:15,150 --> 00:04:17,220
So one topic might be regarding cats.

58
00:04:17,670 --> 00:04:20,880
Another topic might be regarding cell biology and so forth.

59
00:04:22,220 --> 00:04:27,800
One important concept we haven't had time to discuss is that these hidden factors are actually ordered

60
00:04:27,800 --> 00:04:29,240
in terms of importance.

61
00:04:29,810 --> 00:04:32,300
So the first column is the most important topic.

62
00:04:32,660 --> 00:04:35,990
The second column is the second most important topic and so forth.

63
00:04:36,650 --> 00:04:42,980
So when you go to visualize your data by keeping only the first two columns, you are guaranteed that

64
00:04:42,980 --> 00:04:45,170
you have the two most important columns.
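The ordering can be checked directly with NumPy, whose SVD routine returns singular values sorted from largest to smallest. The sketch below uses a random count-like matrix as stand-in data; the matrix, its size, and the name energy are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 8 x 6 count-like matrix standing in for a document-term matrix
X = rng.poisson(lam=2.0, size=(8, 6)).astype(float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

Z = U * s                        # transformed data, one column per hidden factor
energy = s**2 / np.sum(s**2)     # share of the total squared magnitude per column
Z2 = Z[:, :2]                    # the two most important columns, e.g. for a 2D plot

print(energy.round(3))
```

Each column of Z has length equal to its singular value, so the leftmost columns are the ones that carry the most of the data.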

65
00:04:45,650 --> 00:04:50,120
This is unlike LDA and matrix factorization, where there is no such concept.

66
00:04:54,800 --> 00:05:00,470
OK, so now that you know what SVD does, how it works, and how to apply it to NLP, let's look at

67
00:05:00,470 --> 00:05:01,790
how we'll do this in Python.

68
00:05:02,540 --> 00:05:08,840
As usual, we can use scikit-learn, but note that SVD is actually a fundamental operation in linear

69
00:05:08,840 --> 00:05:11,450
algebra, and it's not strictly machine learning.

70
00:05:12,170 --> 00:05:17,210
And as such, you'll find that SVD is actually a function in NumPy and SciPy as well.

71
00:05:17,750 --> 00:05:24,440
In fact, scikit-learn uses this function behind the scenes. When you learn about the math behind SVD,

72
00:05:24,860 --> 00:05:30,710
you may find that this function is actually more useful in terms of using SVD for other applications.

73
00:05:31,430 --> 00:05:35,450
In any case, we begin as usual by creating an object of type TruncatedSVD.

74
00:05:36,920 --> 00:05:41,660
It says truncated because, as mentioned, we will be cutting off all the noise columns.

75
00:05:42,260 --> 00:05:46,970
Note that this takes in an argument called n_components, just like LDA and NMF.

76
00:05:47,840 --> 00:05:53,330
In fact, just like with topic modeling, we as the user need to choose how many components we want

77
00:05:53,330 --> 00:05:53,960
to keep.

78
00:05:54,590 --> 00:05:56,810
Sometimes this will be very easy to choose.

79
00:05:57,260 --> 00:06:01,320
For instance, if we want to visualize our data, then we should choose two,

80
00:06:01,340 --> 00:06:03,350
since our plot will be two-dimensional.

81
00:06:03,830 --> 00:06:09,260
But if we want to use SVD as a preprocessing step before spam detection, then we might want to choose

82
00:06:09,260 --> 00:06:11,150
this value based on cross-validation.

83
00:06:11,900 --> 00:06:15,920
So how to choose the number of components will be context dependent.

84
00:06:16,640 --> 00:06:20,540
There are more advanced concepts you might want to consider when choosing the number of components,

85
00:06:20,930 --> 00:06:22,880
but those are outside the scope of this lecture.

86
00:06:25,450 --> 00:06:27,910
The next step, as usual, is to call fit.

87
00:06:28,690 --> 00:06:35,020
Now what is interesting about SVD is that it works pretty much exactly the same if your data is documents

88
00:06:35,020 --> 00:06:37,120
by terms or terms by documents.

89
00:06:37,870 --> 00:06:43,650
Normally in this course, we use documents by terms because we treat documents as samples and terms as

90
00:06:43,660 --> 00:06:44,320
features.

91
00:06:44,860 --> 00:06:48,340
But for the upcoming code example, we'll be using terms by documents.

92
00:06:49,030 --> 00:06:54,340
This is because we'll be treating each term as a sample and we'll be reducing the dimensionality of

93
00:06:54,340 --> 00:06:59,260
term vectors and then plotting those term vectors to see what patterns we can find.

94
00:07:03,920 --> 00:07:09,800
As usual, we can then call the transform method to get back our transformed data Z, or we can do it

95
00:07:09,800 --> 00:07:12,110
all in one step by calling fit_transform.

96
00:07:13,070 --> 00:07:20,270
Now again, as with LDA and NMF, the Z matrix has the shape N by K, where N is the number of samples,

97
00:07:20,570 --> 00:07:24,860
and K is the number of hidden factors, which corresponds to n_components above.

98
00:07:25,520 --> 00:07:28,970
So in the case where we want to visualize our data, K would be two.
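Putting the steps together, a minimal scikit-learn sketch might look like the following. The toy corpus is invented for illustration, and the variable names (docs, X_terms) and random_state=0 are assumptions; this is not necessarily the code used later in the course.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus, for illustration only
docs = [
    "the cat is a feline",
    "my cat is a happy feline",
    "mitochondria ribosome and cytoplasm are in the cell",
    "the cell has mitochondria ribosome and cytoplasm",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # documents by terms (sparse counts)
X_terms = X.T                       # terms by documents: each term is a sample

# Keep K = 2 hidden factors so the term vectors can be plotted in 2D
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X_terms)      # fit and transform in one step

print(Z.shape)  # (N, K): number of terms by number of components
```

Each row of Z is a two-dimensional vector for one term, which could then be passed to a scatter plot to look for patterns.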

99
00:07:29,960 --> 00:07:35,120
And by the way, note that this is another machine learning technique in which we have no targets and

100
00:07:35,120 --> 00:07:37,490
we call transform instead of predict.

101
00:07:38,060 --> 00:07:41,690
In other words, this is another instance of unsupervised learning.