1
00:00:11,150 --> 00:00:17,870
So in this lecture, we will be introducing the intuition behind singular value decomposition, also

2
00:00:17,870 --> 00:00:18,740
known as SVD.

3
00:00:19,820 --> 00:00:22,940
There are three basic concepts you want to keep in mind for this lecture.

4
00:00:23,870 --> 00:00:28,130
Number one, SVD is useful for visualizing your data.

5
00:00:28,820 --> 00:00:33,350
Number two, SVD is useful for reducing dimensionality.

6
00:00:33,980 --> 00:00:39,680
And number three, SVD works by finding the best rotation of your data points.

7
00:00:40,400 --> 00:00:45,530
This lecture will look at the relationships between these three concepts, and we'll look at a hopefully

8
00:00:45,530 --> 00:00:48,440
intuitive picture that explains how this works.

9
00:00:53,060 --> 00:00:56,870
OK, so let's first talk about why doing these things is even useful.

10
00:00:57,620 --> 00:00:59,870
Let's begin with reducing dimensionality.

11
00:01:00,620 --> 00:01:04,580
We know that in NLP, our data is pretty much always high dimensional.

12
00:01:05,180 --> 00:01:11,150
This is because we work with document-term matrices like TF-IDF, where the number of terms can be

13
00:01:11,150 --> 00:01:17,960
in the tens of thousands or even millions. Many datasets will have a number of features much greater

14
00:01:17,960 --> 00:01:22,790
than the number of samples, which, as you know, is generally considered to be undesirable.

15
00:01:23,510 --> 00:01:30,380
We would much rather have a data matrix of size N by 100 than a data matrix of size N by 100,000.

16
00:01:31,730 --> 00:01:37,910
In addition, more dimensions simply means more data to process, which takes more time, and that's

17
00:01:37,910 --> 00:01:38,870
also not good.

18
00:01:39,530 --> 00:01:45,740
So reducing dimensionality is helpful because it means anything further in our pipeline will be more

19
00:01:45,740 --> 00:01:47,180
efficient and fast.

20
00:01:51,890 --> 00:01:53,990
OK, so what about visualization?

21
00:01:54,920 --> 00:01:58,370
Well, visualization is always useful in data science.

22
00:01:58,850 --> 00:02:00,290
We would like to see our data.

23
00:02:01,010 --> 00:02:06,720
Notice how we do this all the time when we imagine how regression works or how classification works.

24
00:02:07,070 --> 00:02:09,259
We think of them in two dimensional spaces.

25
00:02:09,680 --> 00:02:11,660
It helps us to see how they work.

26
00:02:12,380 --> 00:02:17,810
Of course, the world around us is only three dimensional, so it's not immediately obvious how we can

27
00:02:17,810 --> 00:02:20,480
visualize a 10,000 dimensional dataset.

28
00:02:25,120 --> 00:02:30,910
The next thing to discuss is why reducing dimensionality and visualizing your data are actually almost

29
00:02:30,910 --> 00:02:31,660
the same thing.

30
00:02:32,800 --> 00:02:35,980
Well, suppose that our data has dimension 10,000.

31
00:02:36,580 --> 00:02:37,960
This cannot be visualized.

32
00:02:38,680 --> 00:02:44,890
But suppose that we could reduce dimensionality and transform it in such a way that our transformation

33
00:02:44,890 --> 00:02:48,970
had fewer dimensions while still retaining all the important information.

34
00:02:49,780 --> 00:02:52,720
This might sound like magic, but it is in fact possible.

35
00:02:53,830 --> 00:02:59,350
Now, supposing that we could do this, then let's say we simply reduced dimensionality all the way

36
00:02:59,350 --> 00:03:02,110
down from 10,000 to two or three.

37
00:03:02,980 --> 00:03:07,180
Then we could plot this data and look for any useful patterns and so forth.

38
00:03:07,960 --> 00:03:11,950
Thus, reducing dimensionality is what allows us to visualize.

39
00:03:12,520 --> 00:03:19,030
We can only visualize two or three dimensions, and to get data in two or three dimensions, we must reduce

40
00:03:19,030 --> 00:03:19,840
dimensionality.

41
00:03:20,410 --> 00:03:22,330
So that's how these two ideas are connected.

42
00:03:27,140 --> 00:03:33,020
Now, an obvious question to ask is, how do we actually know that data can be reduced in this way?

43
00:03:33,860 --> 00:03:37,490
Consider what we do with topic modeling or, even simpler, clustering.

44
00:03:38,330 --> 00:03:41,900
In the case of clustering, we assign each of our data points to a cluster.

45
00:03:42,680 --> 00:03:46,160
A cluster is a set of data points where all the points are very similar.

46
00:03:46,940 --> 00:03:53,510
Suppose that the data we used for clustering had 100 dimensions, but we chose to only have three clusters.

47
00:03:54,110 --> 00:03:58,400
In this case, we would have reduced the dimensionality of our data set to three.

48
00:03:59,660 --> 00:04:05,210
Now, obviously, this picture is just two dimensional because I want to visualize clustering, but

49
00:04:05,210 --> 00:04:08,420
in actuality, this would be happening in 100 dimensions.

50
00:04:09,020 --> 00:04:14,150
As we've discussed, you can't see a 100 dimensional data set unless it is first reduced.

51
00:04:18,790 --> 00:04:21,130
Another example is with heights and weights.

52
00:04:21,790 --> 00:04:27,160
We know that generally speaking, if you are taller, you are bigger and if you are bigger, then you

53
00:04:27,160 --> 00:04:27,790
weigh more.

54
00:04:28,480 --> 00:04:31,360
There seems to be a common factor here, which is size.

55
00:04:32,110 --> 00:04:37,930
Imagine a line going through these points and that what we would like to do is project all of our height

56
00:04:37,930 --> 00:04:40,450
and weight data points onto this line.

57
00:04:41,320 --> 00:04:46,720
If we can do that, then we've reduced our dimensionality from two down to one.

58
00:04:47,560 --> 00:04:53,230
Specifically, any point at the bottom left corner would mean small size, and any point

59
00:04:53,230 --> 00:04:55,630
at the top right corner would mean large size.

60
00:04:56,230 --> 00:05:00,910
But clearly, we don't need two numbers, height and weight, to tell us about a person.

61
00:05:01,600 --> 00:05:05,860
Instead, we can describe them with only one number, which is size.
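
As a rough sketch of the height and weight idea, here is one way to project two-dimensional points onto a single line. It assumes NumPy, and the measurements are made up purely for illustration. It also uses the first singular vector of the centered data as "the line", which is one reasonable choice rather than the only one. The raw units are left unscaled to keep the sketch short.

```python
import numpy as np

# Made-up (height in cm, weight in kg) pairs, purely for illustration.
people = np.array([
    [150.0, 50.0],
    [160.0, 58.0],
    [170.0, 66.0],
    [180.0, 77.0],
    [190.0, 85.0],
])

# Center the data so the line passes through the middle of the cloud.
centered = people - people.mean(axis=0)

# The first right singular vector gives the direction of the line
# along which the points are most spread out.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]
if direction[0] < 0:  # the sign is arbitrary; make "taller" mean "larger"
    direction = -direction

# One number per person: their position along the line ("size").
size = centered @ direction
print(size)
```

Each person is now described by one number instead of two, and a larger value corresponds to the top right corner of the plot.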

62
00:05:10,470 --> 00:05:15,420
Now, when we started this lecture, I mentioned that there are three concepts we want to understand.

63
00:05:15,990 --> 00:05:19,620
Visualization, dimensionality reduction, and rotation.

64
00:05:20,310 --> 00:05:22,620
We've looked at the first two, but not the third.

65
00:05:23,580 --> 00:05:30,300
What I hope to convince you of next is that all of this is possible by rotating our data. To understand

66
00:05:30,300 --> 00:05:31,110
how this works,

67
00:05:31,470 --> 00:05:33,150
we have to kind of work backwards.

68
00:05:33,840 --> 00:05:37,320
We'll start with a story about how we think our data came to be.

69
00:05:38,280 --> 00:05:43,740
You see, the whole point of dimensionality reduction is that we believe our data has a small number

70
00:05:43,740 --> 00:05:48,480
of dimensions, but it is embedded into a larger dimensional space.

71
00:05:49,290 --> 00:05:52,020
That is why it can be reduced in the first place.

72
00:05:52,740 --> 00:05:58,350
If your data is truly of the same dimensionality as the space it lives in, then it cannot be reduced.

73
00:05:59,010 --> 00:06:06,330
It can only be reduced if we believe that the data comes from a smaller dimensional source and is embedded

74
00:06:06,450 --> 00:06:08,460
into a larger dimensional space.

75
00:06:13,070 --> 00:06:18,500
To understand this, let's start with a simple happy face. This happy face is clearly just a set

76
00:06:18,500 --> 00:06:20,390
of data points in two dimensions.

77
00:06:21,110 --> 00:06:26,180
And as a side note, if you're curious about how I generated this plot, I encourage you to try it as

78
00:06:26,180 --> 00:06:27,080
an exercise.

79
00:06:31,740 --> 00:06:38,370
OK, so now suppose that I took our happy face and embedded this two-dimensional object into a three

80
00:06:38,370 --> 00:06:39,420
dimensional space.

81
00:06:40,200 --> 00:06:46,140
This is exactly what we imagine happens with any other data set where we attempt to use dimensionality

82
00:06:46,140 --> 00:06:46,830
reduction.

83
00:06:47,550 --> 00:06:52,710
It's a smaller dimensional set of data points embedded in a larger dimensional space.

84
00:06:57,280 --> 00:06:58,420
Now, here's the issue.

85
00:06:59,050 --> 00:07:05,080
When our two dimensional object gets embedded into this three dimensional space, it doesn't necessarily

86
00:07:05,080 --> 00:07:06,460
align with the axes.

87
00:07:07,150 --> 00:07:12,370
Therefore, it's not easy to see the happy face unless you are at a nice orientation.

88
00:07:13,150 --> 00:07:17,710
And obviously, the ideal orientation would be to see the happy face head on.

89
00:07:18,760 --> 00:07:23,410
But notice how some orientations can be very bad while others can be better.

90
00:07:24,250 --> 00:07:29,770
For instance, if you picture the happy face facing upwards or to the side, you can't see the face

91
00:07:29,770 --> 00:07:30,280
at all.

92
00:07:30,880 --> 00:07:33,190
It just looks like a flat cloud of data points.

93
00:07:34,120 --> 00:07:39,700
Of course, this is what happens when you embed a two dimensional object in a three dimensional space.

94
00:07:40,420 --> 00:07:43,000
The two dimensional object is like a sheet of paper.

95
00:07:43,360 --> 00:07:47,170
So when you look at the sheet from the side, you can't see the sheet.

96
00:07:48,010 --> 00:07:53,440
For us, we can see a bit of thickness because I also added some noise in the third dimension.

97
00:07:54,190 --> 00:08:00,790
Noise is an important aspect of SVD because we imagine that this process also ignores any dimensions

98
00:08:01,120 --> 00:08:03,070
which are just a little bit of noise.

99
00:08:03,850 --> 00:08:05,410
But consider what I'm doing here.

100
00:08:06,010 --> 00:08:13,030
I'm rotating the graph, using my mouse, and under these various rotations, you can see the varying

101
00:08:13,030 --> 00:08:15,730
degrees of usefulness of each perspective.

102
00:08:16,390 --> 00:08:22,570
Some perspectives are useful since they allow us to see the face, and some perspectives are not useful

103
00:08:22,840 --> 00:08:24,850
because all we can see is the noise.

104
00:08:25,810 --> 00:08:27,040
So what is our goal?

105
00:08:27,850 --> 00:08:35,140
Our goal clearly is to rotate our perspective such that we can see the face in the correct orientation.

106
00:08:35,950 --> 00:08:40,330
That is, we would like to see the face head on facing us out of the screen.

107
00:08:40,990 --> 00:08:43,150
And this is exactly what SVD does.

108
00:08:44,350 --> 00:08:49,540
You can imagine that if we could find the exact orientation we need to see the face,

109
00:08:49,990 --> 00:08:54,370
the third dimension, which is just noise, would be coming out of the page.

110
00:08:55,120 --> 00:09:00,700
At that point, we could simply discard this third dimension and keep only the two that are useful.

111
00:09:01,390 --> 00:09:06,190
By doing this, we would have reduced the dimensionality from three down to two.
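
Here is a minimal sketch of that rotate-then-discard idea, assuming NumPy. A simple ring of points stands in for the happy face, and the noise level and random seed are arbitrary choices for illustration. The "random orientation" is a random orthogonal matrix from a QR decomposition, which is a rotation possibly combined with a reflection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the happy face: any 2D point cloud works (here, an oval ring).
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
face_2d = np.column_stack([2.0 * np.cos(angles), np.sin(angles)])

# Embed it in 3D: the third coordinate is just a little noise.
noise = 0.01 * rng.standard_normal((len(face_2d), 1))
flat_3d = np.hstack([face_2d, noise])

# Put the sheet at a random orientation (orthogonal matrix from QR).
random_orientation, _ = np.linalg.qr(rng.standard_normal((3, 3)))
data = flat_3d @ random_orientation.T

# SVD of the centered data finds the orientation for us.
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

rotated = centered @ vt.T   # equal to u * s
reduced = rotated[:, :2]    # discard the third dimension

print(s)              # the third singular value is tiny compared to the others
print(reduced.shape)  # (200, 2)
```

The sign of each recovered axis is arbitrary, so the shape may come back mirrored or flipped. The distances between points are preserved, though, which is what matters for seeing the structure.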

112
00:09:10,880 --> 00:09:16,970
OK, so let's suppose we applied SVD to our happy face, which is currently embedded in 3D space.

113
00:09:17,690 --> 00:09:23,030
I actually did this, and this is the result. You'll notice that the happy face is upside down.

114
00:09:23,600 --> 00:09:25,010
This is perfectly OK.

115
00:09:25,760 --> 00:09:30,650
Remember that the job of SVD isn't to know that the eyes are at the top and the smile is underneath.

116
00:09:31,220 --> 00:09:37,040
Obviously, there's no way for SVD to know the structure of a face, but what it does know is that the

117
00:09:37,040 --> 00:09:43,700
third dimension was completely useless and that the true information in the data consisted of the two

118
00:09:43,730 --> 00:09:45,050
dimensional happy face.

119
00:09:49,740 --> 00:09:54,990
Now, because of the ambiguity of language, there may be some confusion about this demonstration.

120
00:09:55,770 --> 00:09:59,820
In particular, I said that I added a third dimension, which is just noise.

121
00:10:00,510 --> 00:10:04,200
So you might be wondering: if I want the two dimensional happy face,

122
00:10:04,590 --> 00:10:09,060
why can't I just remove that third dimension, which I said is just noise?

123
00:10:09,720 --> 00:10:14,850
And the reason you cannot do this is because of how we imagine our data is created.

124
00:10:15,660 --> 00:10:20,040
Recall that our two dimensional happy face is not aligned with the axes.

125
00:10:20,310 --> 00:10:21,990
It's at a random orientation.

126
00:10:22,830 --> 00:10:28,890
But if we were to choose, say, dimension one and dimension two without first rotating the data correctly,

127
00:10:29,280 --> 00:10:31,470
you would get a squished version of the face.

128
00:10:32,100 --> 00:10:36,840
This is because the happy face was embedded randomly in the three dimensional space.

129
00:10:37,530 --> 00:10:42,990
So it's equivalent to just picking a random perspective, which of course, is unlikely to be correct.

130
00:10:44,130 --> 00:10:48,300
Mathematically, you can imagine our data points being an N by 3 matrix.

131
00:10:48,870 --> 00:10:54,330
N is the number of data points, and three is obviously the number of dimensions, representing the three

132
00:10:54,330 --> 00:10:55,020
axes.

133
00:10:56,070 --> 00:11:02,670
So in the untransformed, or original, space, you can imagine that each dimension contains both a bit

134
00:11:02,670 --> 00:11:04,710
of data and a bit of noise.

135
00:11:05,190 --> 00:11:06,450
Everything is mixed up.

136
00:11:07,410 --> 00:11:12,690
If we choose only two of these dimensions to look at, we're still looking at a mixture of both a bit

137
00:11:12,690 --> 00:11:14,400
of data and a bit of noise.

138
00:11:15,900 --> 00:11:22,050
The goal of transforming this data, or in other words finding the right rotation, is to get back another

139
00:11:22,050 --> 00:11:29,040
N by 3 matrix, which is organized more nicely. In particular, in our new transformed N by 3

140
00:11:29,040 --> 00:11:29,640
matrix,

141
00:11:30,060 --> 00:11:32,370
the first two dimensions are purely data,

142
00:11:32,700 --> 00:11:35,100
while the third dimension is purely noise.

143
00:11:35,700 --> 00:11:38,520
So that's what we're doing when we rotate our happy face.

144
00:11:39,600 --> 00:11:45,540
Because the happy face is oriented randomly, you have a bit of the happy face in all three dimensions.

145
00:11:46,080 --> 00:11:51,810
But after rotating the happy face correctly, you move the happy face to the first two dimensions

146
00:11:52,200 --> 00:11:54,360
and you move the noise to the third dimension.
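
To make the "everything is mixed up" point concrete, here is a small sketch, assuming NumPy, that compares keeping dimensions one and two of the raw matrix against keeping the first two dimensions after the SVD rotation. The data, the 60 degree tilt, and the helper name `variance_kept` are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical N by 3 data: a flat 2D cloud plus faint noise, then tilted in 3D.
n = 500
flat = np.column_stack([
    rng.uniform(-1.0, 1.0, n),
    rng.uniform(-1.0, 1.0, n),
    0.01 * rng.standard_normal(n),
])
tilt = np.deg2rad(60.0)
rotate_about_x = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(tilt), -np.sin(tilt)],
    [0.0, np.sin(tilt), np.cos(tilt)],
])
data = flat @ rotate_about_x.T
centered = data - data.mean(axis=0)


def variance_kept(columns):
    """Fraction of the total variance held by the first two columns."""
    return columns[:, :2].var(axis=0).sum() / columns.var(axis=0).sum()


# Naive: keep dimensions one and two without rotating first.
naive = variance_kept(centered)

# SVD: rotate first, then keep the first two dimensions.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
rotated = centered @ vt.T
after_svd = variance_kept(rotated)

print(naive, after_svd)
```

Without the rotation, a large share of the real signal sits in the dimension that gets thrown away, which is the squished picture. After the rotation, almost nothing of value is left in the third dimension.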

147
00:11:55,470 --> 00:12:00,720
Now, in practice, there are going to be many more data dimensions and many more noise dimensions.

148
00:12:01,110 --> 00:12:05,940
But hopefully this lecture helped to give you some intuition about what SVD does.