1
00:00:11,060 --> 00:00:15,920
So in this lecture, we'll be looking at the notebook to implement latent semantic analysis.

2
00:00:16,520 --> 00:00:20,840
We'll begin this notebook by downloading our data set, which is a list of book titles.

3
00:00:29,300 --> 00:00:32,330
The next step is to import everything we need for this script.

4
00:00:32,870 --> 00:00:39,020
The only thing new here is TruncatedSVD. Notice how this comes from the same decomposition module

5
00:00:39,230 --> 00:00:45,230
as NMF and LDA, which should give you some idea of where this fits in the universe of machine learning

6
00:00:45,230 --> 00:00:45,860
methods.

7
00:00:53,990 --> 00:00:57,200
The next step is to download the data we need for NLTK.

8
00:01:02,800 --> 00:01:05,890
The next step is to create a WordNetLemmatizer object.

9
00:01:06,460 --> 00:01:11,530
Note that in this notebook, we'll be writing our own custom tokenizer, since the data set contains

10
00:01:11,530 --> 00:01:13,540
specific terms we'll want to remove.

11
00:01:18,890 --> 00:01:24,410
The next step is to read in our book titles. Note that our file is structured so that each line consists

12
00:01:24,410 --> 00:01:25,220
of one title.

13
00:01:30,550 --> 00:01:33,280
The next step is to get an initial list of stop words.

14
00:01:33,880 --> 00:01:38,410
Note that we've cast this to a set, since we'll want to add more stop words in the next step.

15
00:01:44,090 --> 00:01:47,240
The next step is to add more stop words to our set of stop words.

16
00:01:47,810 --> 00:01:52,010
As noted in the comment, this is an excellent example of where we need to do something

17
00:01:52,130 --> 00:01:58,220
domain-specific. In this instance, because we're working with book titles, we don't care about words

18
00:01:58,220 --> 00:02:02,720
like introduction, edition, series, application, approach, and so forth.

19
00:02:03,320 --> 00:02:07,460
Normally, these wouldn't be considered stop words, but because we're working with book titles, they

20
00:02:07,490 --> 00:02:07,970
would be.

21
00:02:08,690 --> 00:02:10,759
These words are too generic to be meaningful,

22
00:02:11,000 --> 00:02:13,100
but they appear in many book titles.
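As a rough sketch of the idea, here is how a set of stop words might be extended with domain-specific terms. The initial list below is a small hypothetical stand-in, since the notebook's actual list isn't shown here; the added words are the ones mentioned above.

```python
# A small stand-in for an initial stop word list (hypothetical; the
# notebook's actual list is not shown in this transcript).
initial_stopwords = ["the", "a", "an", "of", "and", "to", "in"]

# Cast to a set so that adding words and testing membership are cheap.
stops = set(initial_stopwords)

# Domain-specific additions for book titles, using the words named in the lecture.
stops = stops.union({"introduction", "edition", "series", "application", "approach"})

# Filter the tokens of a made-up title against the extended set.
title_tokens = ["introduction", "to", "modern", "physics"]
kept = [t for t in title_tokens if t not in stops]
print(kept)  # ['modern', 'physics']
```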

23
00:02:19,830 --> 00:02:26,040
The next step is to create our tokenizer function. As input, we take in a string s, which represents

24
00:02:26,040 --> 00:02:26,700
a book title.

25
00:02:27,630 --> 00:02:29,820
The first step is to downcase the text.

26
00:02:33,640 --> 00:02:36,550
The next step is to tokenize the title into tokens.

27
00:02:39,910 --> 00:02:45,490
The next step is to remove any tokens that are too short. For examples of tokens that are too short,

28
00:02:45,760 --> 00:02:48,580
I recommend turning this off and seeing what happens.

29
00:02:51,980 --> 00:02:54,200
The next step is to lemmatize the tokens.

30
00:02:57,550 --> 00:02:59,530
The next step is to remove stop words.

31
00:03:03,040 --> 00:03:06,310
The final step is to remove any tokens that contain digits.

32
00:03:07,090 --> 00:03:11,980
These are similar to stop words, but unlike stop words, any number can appear anywhere.

33
00:03:12,250 --> 00:03:15,850
So it's easier to simply remove any token that contains digits.

34
00:03:18,560 --> 00:03:21,500
At this point, we can return the remaining list of tokens.
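The tokenizer steps just described might be sketched as follows. This is a standard-library approximation, not the notebook's code: the regular-expression tokenization, the length threshold of 2, the tiny stop word set, and the `toy_lemmatize` helper (a placeholder for a real lemmatizer) are all assumptions for illustration.

```python
import re

# Hypothetical stop word set; the real one is built earlier in the notebook.
STOPS = {"the", "a", "an", "of", "and", "to", "introduction", "edition"}

def toy_lemmatize(token):
    # Placeholder for a real lemmatizer: only strips a trailing plural "s".
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def my_tokenizer(s):
    s = s.lower()                                        # downcase
    tokens = re.findall(r"[a-z0-9]+", s)                 # split into tokens
    tokens = [t for t in tokens if len(t) > 2]           # drop short tokens (assumed threshold)
    tokens = [toy_lemmatize(t) for t in tokens]          # reduce to base form
    tokens = [t for t in tokens if t not in STOPS]       # remove stop words
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]  # drop tokens with digits
    return tokens

print(my_tokenizer("Introduction to Algorithms, 3rd Edition"))  # ['algorithm']
```

Commenting out the length filter is an easy way to see which short tokens would otherwise slip through.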

35
00:03:27,430 --> 00:03:33,220
The next step is to create a CountVectorizer object. Note that I've set binary to True, which means

36
00:03:33,220 --> 00:03:37,450
that we won't count the tokens, but instead just note whether or not they appear.

37
00:03:37,870 --> 00:03:43,630
So the output will only contain values of zero and one. For the tokenizer, we'll pass in the function

38
00:03:43,630 --> 00:03:44,530
we just created.

39
00:03:49,210 --> 00:03:53,290
The next step is to convert our list of titles into a count matrix.
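To illustrate what binary=True does, here is a minimal sketch with scikit-learn's CountVectorizer. The three titles and the whitespace `simple_tokenizer` are made up for the example; the notebook passes in its own tokenizer function instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

def simple_tokenizer(s):
    # Hypothetical stand-in for the custom tokenizer: lowercase and split on whitespace.
    return s.lower().split()

# Made-up titles; note that "data" appears twice in the first one.
titles = ["data data science", "art history", "computer science"]

# binary=True records presence or absence instead of counts.
vectorizer = CountVectorizer(binary=True, tokenizer=simple_tokenizer)
X = vectorizer.fit_transform(titles)  # sparse matrix, documents by terms

print(X.shape)      # (3, 5)
print(X.toarray())  # only zeros and ones, even for the repeated "data"
```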

40
00:04:01,820 --> 00:04:06,530
The next step is to create an index-to-word mapping, which we'll need later in the script.

41
00:04:07,100 --> 00:04:10,580
I've left a comment here showing you conceptually what we want to do.

42
00:04:11,420 --> 00:04:17,209
Essentially, we want to create a list, since a list has integers as the index and any object as the

43
00:04:17,209 --> 00:04:17,690
value.

44
00:04:18,350 --> 00:04:23,570
We then loop through the vocabulary, which is a word-to-index mapping, and then simply reverse this

45
00:04:23,570 --> 00:04:24,020
mapping.

46
00:04:25,130 --> 00:04:29,330
So the vocabulary has the word as the key and the index as the value.

47
00:04:29,900 --> 00:04:34,160
But our index-to-word mapping has the index as the key and the word as the value.

48
00:04:37,980 --> 00:04:42,600
Luckily, we don't actually have to do all this work, since there's a function called

49
00:04:42,600 --> 00:04:44,400
get_feature_names_out, which does what we want.
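Conceptually, reversing the mapping by hand might look like the sketch below. The small `word2idx` dictionary is a made-up stand-in for the vectorizer's vocabulary.

```python
# Hypothetical word-to-index mapping, in the style of a vectorizer vocabulary.
word2idx = {"science": 2, "art": 0, "history": 1}

# A list works as an index-to-word mapping: position is the index, value is the word.
index_word = [None] * len(word2idx)
for word, idx in word2idx.items():
    index_word[idx] = word

print(index_word)  # ['art', 'history', 'science']
```

Calling get_feature_names_out on a fitted CountVectorizer returns the words in this same index order, so the loop is only needed to understand the idea.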

50
00:04:50,180 --> 00:04:56,030
The next step is to transpose our count matrix. As you recall, by default, the CountVectorizer

51
00:04:56,060 --> 00:04:57,800
gives us documents by terms.

52
00:04:58,400 --> 00:05:03,710
However, we want to find word vectors, which means we want to reduce the dimensionality of the

53
00:05:03,740 --> 00:05:04,940
term-document matrix.

54
00:05:10,710 --> 00:05:15,780
The next step is to perform SVD on our term-document matrix, which gives us back Z.
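Here is a minimal sketch of the transpose and SVD steps using TruncatedSVD. The tiny binary matrix is made up, and two components is an assumption chosen because the result gets drawn as a 2D scatterplot.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Made-up binary count matrix: 4 documents (rows) by 5 terms (columns).
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
])

# Transpose so that each row is a term and each column is a document.
X_terms = X.T

# Reduce each term to a 2-dimensional vector (assumed number of components).
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X_terms)

print(Z.shape)  # (5, 2): one 2D vector per term
```

The same calls work on the sparse matrix that CountVectorizer returns, so no conversion to a dense array should be needed.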

55
00:05:21,350 --> 00:05:24,860
Now, the next step is to essentially create a scatterplot for Z.

56
00:05:25,610 --> 00:05:29,240
Unfortunately, Colab doesn't have interactive plots by default.

57
00:05:29,870 --> 00:05:34,790
If you ran this on your local computer, you would be able to zoom in, move around the plot and so

58
00:05:34,790 --> 00:05:35,270
forth.

59
00:05:35,630 --> 00:05:39,410
So if you want to try that, please feel free to do so. For us,

60
00:05:39,420 --> 00:05:43,670
we're going to do everything in this notebook, so we're going to install a package called Plotly.

61
00:05:51,590 --> 00:05:57,080
The next step is to import Plotly Express, which has an API similar to Matplotlib.

62
00:06:04,050 --> 00:06:05,880
The next step is to create our scatterplot.

63
00:06:06,570 --> 00:06:12,270
Note that this is where we make use of our index-to-word mapping. As you recall, Z is just a list of

64
00:06:12,270 --> 00:06:13,050
data points.

65
00:06:13,410 --> 00:06:18,360
But in order to know what these data points actually mean, we have to actually draw the words on the

66
00:06:18,360 --> 00:06:18,870
plot.

67
00:06:27,340 --> 00:06:30,700
OK, so we see right away that our plot is very interesting.

68
00:06:31,450 --> 00:06:34,300
There appear to be two directions where the data spreads.

69
00:06:35,020 --> 00:06:40,900
On one side, which almost goes horizontal, we can see the words computer and science. On the other

70
00:06:40,900 --> 00:06:42,790
side, which almost goes vertical,

71
00:06:43,150 --> 00:06:45,100
we can see the words history and art.

72
00:06:45,820 --> 00:06:51,460
So this is very interesting because it reflects essentially the two sides of most college offerings.

73
00:06:52,030 --> 00:06:57,220
Typically, the biggest differentiator of college programs is whether they are based on science and

74
00:06:57,220 --> 00:06:59,290
technology or liberal arts.

75
00:06:59,740 --> 00:07:02,200
And these two axes reflect this observation.

76
00:07:03,190 --> 00:07:05,740
Now, the center is where most of the words reside.

77
00:07:06,400 --> 00:07:11,320
Note that you can zoom into this plot by clicking and holding while dragging your mouse to highlight

78
00:07:11,320 --> 00:07:12,640
the portion you want to see.

79
00:07:13,240 --> 00:07:19,150
So please try this yourself and try to find different areas of interest because there are many possible

80
00:07:19,150 --> 00:07:20,140
areas to look at.

81
00:07:20,500 --> 00:07:23,200
I've chosen a select few to discuss for this lecture.

82
00:07:28,940 --> 00:07:34,280
So if we zoom in just a bit, we can see more of the same pattern. To the right, we have words like

83
00:07:34,280 --> 00:07:40,640
business, statistic, biology, engineering, probability, Earth, political and so forth.

84
00:07:41,060 --> 00:07:47,900
Again, reflecting science and tech. At the top, we have words like American, world, global, modern,

85
00:07:48,260 --> 00:07:50,090
again reflecting liberal arts.

86
00:07:56,270 --> 00:07:58,550
So if we zoom in again, we see more of the same.

87
00:07:59,240 --> 00:08:05,450
To the right, we see words like psychology, algorithm, mathematics, machine, physiology, actuarial,

88
00:08:05,840 --> 00:08:09,890
Java, environmental, and more words related to science and tech.

89
00:08:10,610 --> 00:08:16,790
At the top, we see words like culture, Buddhism, narrative, Judaism, Greek, literature, religion,

90
00:08:17,150 --> 00:08:20,990
feminism, diaspora, and other words related to liberal arts.

91
00:08:26,760 --> 00:08:33,270
If we zoom in even more, we see the same pattern yet again. To the right, we see words like thermodynamics,

92
00:08:33,270 --> 00:08:38,909
geology, finance, computational, quantum, security, communication, and physics.

93
00:08:39,360 --> 00:08:41,730
Again, all words related to science and tech.

94
00:08:42,600 --> 00:08:49,590
Vertically, we see words like state, Islam, sex, sexuality, Christianity, Asian, East, anthology,

95
00:08:49,590 --> 00:08:52,680
philosophy, and other words related to liberal arts.

96
00:08:54,870 --> 00:08:58,980
Now, I don't want to go through the whole list of words, but please look through this on your own

97
00:08:59,370 --> 00:09:04,770
to see whether this pattern continues as you zoom in more and more. Check whether or not the results

98
00:09:04,770 --> 00:09:05,370
make sense.

99
00:09:05,730 --> 00:09:09,120
And think about why this happened and how it's related to SVD.

