1
00:00:11,040 --> 00:00:16,920
In this lecture, we are going to continue our discussion about how to convert text into vectors using

2
00:00:16,920 --> 00:00:17,520
counting.

3
00:00:18,210 --> 00:00:24,000
At this point, we understand the basics of tokenizing a string into a list of words and then counting

4
00:00:24,000 --> 00:00:26,910
each word and putting those counts into a vector.

5
00:00:27,480 --> 00:00:29,850
But there are still a few details we need to explore.

6
00:00:30,630 --> 00:00:34,290
In this lecture, we are going to consider the concept of stop words.

7
00:00:39,080 --> 00:00:42,800
So suppose that we did our word counting procedure as described.

8
00:00:43,250 --> 00:00:44,720
What do we think might happen?

9
00:00:45,440 --> 00:00:50,540
Well, let's consider some very common words, words that appear everywhere, but do not necessarily

10
00:00:50,570 --> 00:00:52,100
carry a lot of information.

11
00:00:52,670 --> 00:00:56,240
Words like "and", "the", "it", "is", and so forth.

12
00:00:56,690 --> 00:00:59,300
What might happen if we counted these words?

13
00:01:00,170 --> 00:01:03,230
One thing to consider is whether or not these words are useful.

14
00:01:03,920 --> 00:01:09,590
Pretty much any text can contain these words, no matter what topic they are about, whether or not they

15
00:01:09,590 --> 00:01:12,500
are spam, whether or not they are positive or negative.

16
00:01:13,100 --> 00:01:18,680
So it could be said that these words don't carry useful information for whatever task we are doing.

17
00:01:23,450 --> 00:01:25,070
Here's another thing to consider.

18
00:01:25,670 --> 00:01:29,330
As you recall, increasing dimensionality is generally bad.

19
00:01:29,870 --> 00:01:32,090
This will be a common theme throughout this course.

20
00:01:32,090 --> 00:01:36,170
So don't worry if you don't understand everything there is to know about this right away.

21
00:01:36,710 --> 00:01:41,060
But just keep in mind that we prefer not to have very high-dimensional vectors.

22
00:01:41,660 --> 00:01:46,310
One simple reason is that the larger they are, the more computation we have to do.

23
00:01:46,880 --> 00:01:49,910
More computation takes more time and more space.

24
00:01:54,520 --> 00:01:58,660
Here's another thing to consider. Putting high dimensionality aside,

25
00:01:59,020 --> 00:02:04,300
what about the fact that these words do not help us differentiate between different kinds of documents?

26
00:02:04,960 --> 00:02:10,300
Recall that one method we use to gain an understanding about vectors is their distance to each other.

27
00:02:10,990 --> 00:02:16,960
But if we make a vector that contains lots of "and"s, lots of "the"s, and lots of "it"s, all the documents

28
00:02:16,960 --> 00:02:18,730
will look like similar vectors.

29
00:02:19,210 --> 00:02:24,010
The counts for these words will overshadow the counts for the other, possibly more important words

30
00:02:24,250 --> 00:02:26,200
like mitochondria or voltage.

31
00:02:26,710 --> 00:02:30,070
So it may be useful to simply ignore these words altogether.

32
00:02:30,730 --> 00:02:33,070
We call this list of words that we want to ignore

33
00:02:33,100 --> 00:02:34,060
stop words.
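
To make the overshadowing effect concrete, here is a minimal sketch in plain Python. The two example sentences and the four-word stop list are made up for illustration, and cosine similarity is used as the closeness measure, which is one reasonable choice rather than the one the lecture necessarily has in mind.

```python
import math
from collections import Counter

# Illustrative stop word list (an assumption, not taken from the lecture).
STOP_WORDS = {"the", "and", "it", "is"}


def count_vector(text, stop_words=frozenset()):
    """Tokenize on whitespace and count each word, skipping stop words."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in stop_words)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two made-up documents about unrelated topics that share only common words.
doc_a = "the voltage and the current and the resistance and it is the circuit"
doc_b = "the mitochondria and the cell and the membrane and it is the organelle"

with_stops = cosine_similarity(count_vector(doc_a), count_vector(doc_b))
without_stops = cosine_similarity(
    count_vector(doc_a, STOP_WORDS), count_vector(doc_b, STOP_WORDS)
)

print("similarity with stop words:   ", round(with_stops, 3))
print("similarity without stop words:", round(without_stops, 3))
```

With the common words counted, the two documents look very similar even though they share no topic words. Once those words are ignored, the similarity drops to zero.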

34
00:02:38,660 --> 00:02:44,570
Luckily, people have already thought about this issue, and thus libraries like scikit-learn allow you

35
00:02:44,570 --> 00:02:47,780
to remove stop words from the counting process automatically.

36
00:02:48,440 --> 00:02:53,180
So when you create your CountVectorizer object, there is an argument called stop_words

37
00:02:53,450 --> 00:02:57,210
that allows you to specify the stop words. By default,

38
00:02:57,230 --> 00:02:58,430
this is set to None.

39
00:02:58,430 --> 00:03:02,570
And so if you don't specify any stop words, then all words will be kept.

40
00:03:03,680 --> 00:03:05,690
There are two other possible options.

41
00:03:06,290 --> 00:03:08,480
The first is to pass in the string "english".

42
00:03:09,470 --> 00:03:14,390
If you do this, there is a built-in list of stop words in English that will be ignored.

43
00:03:14,900 --> 00:03:20,030
This is probably the easiest thing to do, but unfortunately it only covers English.

44
00:03:20,810 --> 00:03:24,800
The second option is to pass in a list of user-defined stop words.

45
00:03:25,250 --> 00:03:30,260
So perhaps you're working in a different language, or you're working in a specific domain where the

46
00:03:30,260 --> 00:03:34,190
stop words are not the same as they are for generic English documents.

47
00:03:34,580 --> 00:03:40,370
For example, you might be working in a business or an industry that uses very niche terms, but some

48
00:03:40,370 --> 00:03:43,220
of them are not important since they are used too commonly.

49
00:03:43,910 --> 00:03:47,120
In this case, you can specify your own stop word list.
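
The three options just described can be sketched as follows. The example documents and the custom stop list are made up for illustration. The sketch assumes scikit-learn is installed and reads the fitted `vocabulary_` attribute to see which words survived.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents for illustration.
docs = [
    "the voltage is high and the current is low",
    "the mitochondria is the powerhouse of the cell",
]


def vocabulary(stop_words):
    """Fit a CountVectorizer with the given stop_words and return its vocabulary."""
    vectorizer = CountVectorizer(stop_words=stop_words)
    vectorizer.fit(docs)
    return set(vectorizer.vocabulary_)


# Option 1: the default, None. No words are removed.
vocab_default = vocabulary(None)

# Option 2: the string "english". The built-in English list is ignored.
vocab_english = vocabulary("english")

# Option 3: a user-defined list (hypothetical domain-specific choice).
vocab_custom = vocabulary(["the", "is", "voltage"])

print(sorted(vocab_default))
print(sorted(vocab_english))
print(sorted(vocab_custom))
```

Note that the custom list replaces the built-in one rather than adding to it, so a word like "and" is still counted in the third case.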

50
00:03:51,730 --> 00:03:56,950
Now, if you want to do NLP in another language, one possibility is to use stop words from

51
00:03:57,340 --> 00:03:57,780
NLTK.

52
00:03:58,450 --> 00:04:03,580
So here is some sample code to show you how to get a list of stop words in other languages.

53
00:04:04,090 --> 00:04:10,480
The first step is to import nltk. The next step, if you haven't already done so, is to call

54
00:04:10,480 --> 00:04:10,810
nltk.download,

55
00:04:11,290 --> 00:04:14,140
passing in the argument "stopwords".

56
00:04:14,530 --> 00:04:17,170
This will download the stop words for each language.

57
00:04:18,850 --> 00:04:24,400
The next step is to import the stopwords module, and at this point you can call the words function,

58
00:04:25,030 --> 00:04:26,860
passing in your language of choice.

59
00:04:27,310 --> 00:04:32,050
So for example, you can pass in "english", "german", "spanish", and even "arabic".

60
00:04:32,590 --> 00:04:35,140
But note that not all languages are supported.

61
00:04:35,800 --> 00:04:39,400
To get a full list of languages, check the directory shown on this slide.

62
00:04:40,150 --> 00:04:45,520
If the language you want is not there, you may still be able to find a list of stop words simply by

63
00:04:45,520 --> 00:04:46,630
doing a search.

