1
00:00:11,060 --> 00:00:17,720
So a few students have asked me over the years, how can I apply what I've learned in NLP to other languages?

2
00:00:18,530 --> 00:00:23,930
Now you may have noticed this whole course is in English, but does that mean what you learn in this course

3
00:00:23,930 --> 00:00:26,330
can only be applied to English text?

4
00:00:27,110 --> 00:00:28,610
Luckily, the answer is no.

5
00:00:28,970 --> 00:00:33,290
You can, in fact, apply everything you learned in this course to any language.

6
00:00:34,220 --> 00:00:39,650
In fact, this is an especially relevant topic because this course has students from all around the

7
00:00:39,650 --> 00:00:40,190
world.

8
00:00:41,210 --> 00:00:47,600
In these situations, I like to remind students that many NLP techniques can even be applied to bioinformatics

9
00:00:47,600 --> 00:00:48,410
and genomics.

10
00:00:48,890 --> 00:00:53,030
In other words, strings of DNA, RNA, amino acids and so forth.

11
00:00:53,660 --> 00:00:59,300
If these techniques can be applied even to languages created by nature, then surely they are capable

12
00:00:59,300 --> 00:01:01,640
of handling languages created by humans.

13
00:01:02,420 --> 00:01:07,700
So in this lecture, I hope to show you that it would be very easy to apply what you learn in this course

14
00:01:08,060 --> 00:01:10,880
to other languages should you find the need to do so.

15
00:01:15,680 --> 00:01:21,650
So let's begin by recalling the high-level steps of an NLP analysis, like the ones we do in this course.

16
00:01:22,490 --> 00:01:25,910
The first step is to simply recognize what our data would look like.

17
00:01:26,540 --> 00:01:27,890
Our data will be text.

18
00:01:28,490 --> 00:01:30,710
This will be the case no matter the language.

19
00:01:31,760 --> 00:01:33,920
The next step is to tokenize the text.

20
00:01:34,610 --> 00:01:40,670
Perhaps we might remove stop words and do some stemming or lemmatization in order to simplify the tokens.

21
00:01:41,720 --> 00:01:45,260
Once we have these tokens, we need to map each token to an integer.

22
00:01:46,130 --> 00:01:47,960
Let's review why this has to be done.

23
00:01:49,700 --> 00:01:54,290
As you recall, machine learning works with numbers. In tabular machine learning,

24
00:01:54,290 --> 00:02:00,140
our data is represented as a table, with rows corresponding to documents and columns corresponding to

25
00:02:00,140 --> 00:02:04,370
tokens. Since each column corresponds to a different token

26
00:02:04,760 --> 00:02:09,590
and we need to know which token, we must have a mapping from token to column index.

27
00:02:10,130 --> 00:02:12,530
Therefore, we must map tokens to integers.

28
00:02:14,130 --> 00:02:19,490
Once we have this mapping, we can then convert our data set into count vectors or TF-IDF.
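
The steps just described (tokenize, map each token to an integer, build count vectors) can be sketched in plain Python. This is a minimal illustration rather than the course's own code: the toy corpus, the whitespace tokenization, and the helper names `build_vocab` and `count_vectors` are all assumptions made for the example.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def count_vectors(tokenized_docs, vocab):
    """Build a table with one row per document and one column per token."""
    rows = []
    for tokens in tokenized_docs:
        row = [0] * len(vocab)
        for tok, n in Counter(tokens).items():
            row[vocab[tok]] = n
        rows.append(row)
    return rows


# Toy corpus (made up for illustration), tokenized by splitting on whitespace.
docs = ["the cat sat", "the dog sat on the cat"]
tokenized = [d.split() for d in docs]

vocab = build_vocab(tokenized)      # token -> column index
X = count_vectors(tokenized, vocab)  # document-by-token count matrix
print(vocab)
print(X)
```

Only the line that produces `tokenized` depends on the language; the mapping and the counting work on any list of tokens.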

29
00:02:20,310 --> 00:02:25,380
Once we have this data matrix of document vectors, we can then apply any machine learning technique

30
00:02:25,380 --> 00:02:26,100
we've studied.

31
00:02:26,700 --> 00:02:30,840
So, for instance, we can create a recommender system or perform spam detection.

32
00:02:31,350 --> 00:02:36,120
We can summarize a document or we can build a topic model to find latent topics.

33
00:02:37,170 --> 00:02:42,630
Alternatively, you can use tokens directly and build a Markov model like we do in this course.
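
As a rough sketch of using tokens directly, a first-order Markov model can be estimated by counting which token follows which. The function name `fit_bigram_model` and the toy token lists are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def fit_bigram_model(tokenized_docs):
    """Estimate P(next token | previous token) from raw bigram counts."""
    counts = defaultdict(Counter)
    for tokens in tokenized_docs:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, following in counts.items():
        total = sum(following.values())
        probs[prev] = {nxt: n / total for nxt, n in following.items()}
    return probs


# Toy token sequences (made up); any tokenizer for any language could produce these.
tokenized = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
model = fit_bigram_model(tokenized)
print(model["the"])
```

The model only ever sees lists of tokens, so it is indifferent to the language those tokens came from.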

34
00:02:44,280 --> 00:02:46,920
Now there's a reason we're reviewing all these steps.

35
00:02:47,580 --> 00:02:52,650
What I want you to notice is that these are the steps you need to do no matter what language your text

36
00:02:52,650 --> 00:02:53,040
is in.

37
00:02:53,790 --> 00:02:57,000
The only thing that changes is how you do your tokenization.

38
00:02:57,420 --> 00:03:02,850
And I hope to convince you in this lecture that this is not a task of machine learning, but a task

39
00:03:02,850 --> 00:03:03,990
of learning a language.

40
00:03:04,590 --> 00:03:10,350
So, for example, if you want to work with Japanese, it's not a matter of learning different NLP techniques

41
00:03:10,350 --> 00:03:11,520
for Japanese.

42
00:03:12,030 --> 00:03:15,900
Instead, it's actually a task of learning the language of Japanese.

43
00:03:16,410 --> 00:03:19,980
The NLP parts are still the same as what you learned in this course.

44
00:03:24,520 --> 00:03:27,640
To illustrate this, let's consider some Japanese text.

45
00:03:28,300 --> 00:03:31,150
Here are some immediate questions that should come to mind.

46
00:03:31,900 --> 00:03:38,410
Firstly, what is a word in Japanese? Is a word considered to be one of these symbols or multiple contiguous

47
00:03:38,410 --> 00:03:39,070
symbols?

48
00:03:39,910 --> 00:03:42,340
Secondly, what is a sentence in Japanese?

49
00:03:42,850 --> 00:03:48,070
Does Japanese use the same punctuation as English so that we can find the boundaries between different

50
00:03:48,070 --> 00:03:48,820
sentences?

51
00:03:50,050 --> 00:03:53,380
This also reminds us to think about the boundaries between words.

52
00:03:54,010 --> 00:03:59,770
If we want to tokenize Japanese text into words, we need to think not only of the words themselves,

53
00:03:59,770 --> 00:04:01,660
but also how to split the words.

54
00:04:03,140 --> 00:04:07,100
Another question to consider is what are the stop words in Japanese?

55
00:04:07,670 --> 00:04:12,890
Just like in English, these would be words that are so common that they won't be useful in our analysis.

56
00:04:13,790 --> 00:04:16,640
Now think about the common theme behind these questions.

57
00:04:17,209 --> 00:04:22,790
Are these questions a matter of learning more NLP, or are they a matter of understanding the Japanese

58
00:04:22,790 --> 00:04:23,780
language itself?

59
00:04:24,650 --> 00:04:29,960
I hope you can see that it's not a matter of learning NLP, it's a matter of learning Japanese.

60
00:04:30,440 --> 00:04:35,720
And of course, if you want to learn Japanese, this would require you to take a course on Japanese,

61
00:04:35,720 --> 00:04:37,070
not a course on NLP.

62
00:04:41,830 --> 00:04:46,900
Now, I want to point out another fact, which is that it's not a matter of this course being

63
00:04:46,900 --> 00:04:51,490
English-centric, but rather the whole machine learning ecosystem being English-centric.

64
00:04:52,180 --> 00:04:57,430
For instance, if you use the CountVectorizer in scikit-learn, this will not tokenize Japanese

65
00:04:57,430 --> 00:04:57,940
correctly.

66
00:04:58,720 --> 00:05:05,140
So although technically the CountVectorizer can do tokenization for you, this is limited only to languages

67
00:05:05,140 --> 00:05:10,540
that can be parsed by its tokenizer, which probably assumes English and other languages with similar

68
00:05:10,540 --> 00:05:11,290
syntax.

69
00:05:12,220 --> 00:05:18,070
But this brings us back to the practical world. In particular, if scikit-learn's CountVectorizer won't

70
00:05:18,070 --> 00:05:20,890
tokenize my Japanese text, what should I do?

71
00:05:21,730 --> 00:05:27,430
Well, the answer is to simply build a tokenizer yourself using your knowledge of the Japanese language

72
00:05:27,880 --> 00:05:31,150
or to find an open source tokenizer made by someone else.
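
To make the problem concrete, here is a small sketch using only Python's `re` module. The pattern is copied from what scikit-learn documents as the default `token_pattern` of `CountVectorizer`; treating it as a faithful stand-in for the library's tokenization is an assumption. The function `char_bigram_tokenize` is a hypothetical, deliberately naive replacement and not a real Japanese word segmenter.

```python
import re

# Assumption: mirrors scikit-learn's documented default token_pattern,
# i.e. runs of two or more word characters.
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def regex_tokenize(text):
    """Tokenize the way a whitespace/punctuation-based pattern would."""
    return DEFAULT_PATTERN.findall(text)


def char_bigram_tokenize(text):
    """Naive stand-in: overlapping character bigrams inside each run of word characters."""
    tokens = []
    for run in re.findall(r"\w+", text):
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


english = "I like cats."
japanese = "私は猫が好きです。"  # "I like cats."

print(regex_tokenize(english))         # words separated at the spaces
print(regex_tokenize(japanese))        # the whole sentence comes back as one token
print(char_bigram_tokenize(japanese))  # crude, but gives usable sub-sentence units
```

Because Japanese does not put spaces between words, the pattern returns the entire sentence as a single token. If a hand-built or third-party tokenizer is available, `CountVectorizer` accepts a callable through its `tokenizer` parameter, so the rest of the pipeline can stay unchanged.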

73
00:05:32,230 --> 00:05:38,740
In fact, I was able to find one within seconds after simply doing a search for how to tokenize Japanese

74
00:05:38,740 --> 00:05:39,310
text.

75
00:05:39,910 --> 00:05:42,190
In other words, this is not very difficult at all.

76
00:05:46,940 --> 00:05:49,820
Now, both of these approaches have pros and cons.

77
00:05:50,540 --> 00:05:56,120
Firstly, if you build your own Japanese tokenizer, this requires you to actually know Japanese.

78
00:05:56,810 --> 00:05:59,750
Unfortunately, learning a new language is pretty difficult.

79
00:05:59,780 --> 00:06:02,840
So if you're starting from scratch, this may not be for you.

80
00:06:03,890 --> 00:06:08,600
However, if you're working with Japanese because you live in Japan and you already know Japanese,

81
00:06:08,990 --> 00:06:10,160
then this is a non-issue.

82
00:06:11,480 --> 00:06:15,320
The second downside to this is that building things takes time and effort.

83
00:06:16,190 --> 00:06:19,160
So what about finding a tokenizer built by someone else?

84
00:06:19,820 --> 00:06:22,040
This would be ideal if it works.

85
00:06:22,700 --> 00:06:28,100
It means that not only do you not have to spend time and effort on a pretty low-level task, you also

86
00:06:28,100 --> 00:06:30,140
don't have to worry about learning a new language.

87
00:06:31,070 --> 00:06:32,450
But there are downsides as well.

88
00:06:33,350 --> 00:06:38,480
Firstly, if no one has bothered to make a tokenizer in your language of choice, then this isn't

89
00:06:38,480 --> 00:06:39,860
an option and you're stuck.

90
00:06:40,760 --> 00:06:42,980
Secondly, you have to trust that it works.

91
00:06:43,610 --> 00:06:49,220
So using this method, it would still be advantageous to know the language such that you can properly

92
00:06:49,220 --> 00:06:52,910
judge whether or not the tokenizer is performing as expected.

93
00:06:57,540 --> 00:07:02,550
Ultimately, if you're going to be working with a particular language, you should probably know that

94
00:07:02,550 --> 00:07:04,980
language so that you can analyze the results.

95
00:07:05,550 --> 00:07:10,920
This applies not just to low-level tasks like tokenization, but also higher-level tasks like sentiment

96
00:07:10,920 --> 00:07:11,730
analysis.

97
00:07:12,600 --> 00:07:17,370
For instance, if you don't know the language, then you won't be able to analyze why your model is

98
00:07:17,370 --> 00:07:18,390
making mistakes.

99
00:07:19,050 --> 00:07:24,300
As an example, you might have text saying "this movie is good", but marked as negative sentiment.

100
00:07:25,020 --> 00:07:30,090
Suppose your model predicts positive, which is technically incorrect because it doesn't match the label.

101
00:07:30,930 --> 00:07:37,110
In this case, you can diagnose the issue as being an incorrect target, but you can only do this if

102
00:07:37,110 --> 00:07:38,610
you actually know the language.

103
00:07:39,300 --> 00:07:44,400
So the lesson is, if you're going to be working with a different language, then it's probably a good

104
00:07:44,400 --> 00:07:46,170
idea to learn that language.

105
00:07:46,770 --> 00:07:50,910
And again, this is in the realm of a language course, not an NLP course.

106
00:07:55,550 --> 00:08:01,670
So to summarize, this lecture is really just a long-winded way to say that most student concerns about

107
00:08:01,670 --> 00:08:06,620
working with other languages are misplaced. In terms of the NLP side,

108
00:08:06,650 --> 00:08:08,420
the solution is actually very simple.

109
00:08:09,200 --> 00:08:11,720
Most of what you learned in this course remains the same.

110
00:08:12,350 --> 00:08:16,100
The only part which is different is in how you tokenize the text.

111
00:08:16,790 --> 00:08:20,750
Tokenizing text in other languages really has two solutions.

112
00:08:21,230 --> 00:08:27,200
Number one, learn the language and build one yourself, or number two, do a search and find one that

113
00:08:27,200 --> 00:08:28,430
has already been built.

114
00:08:29,030 --> 00:08:31,790
I tried this myself and found some within seconds.

115
00:08:32,390 --> 00:08:34,970
Now, whether or not they actually work is a different story.

116
00:08:35,630 --> 00:08:40,490
But as you can see, this isn't really the kind of thing you expect to learn inside an NLP course.

117
00:08:40,909 --> 00:08:43,100
And it's certainly not a machine learning task.

118
00:08:43,610 --> 00:08:49,430
It's simply a matter of doing a search and finding an existing tool or, in the case that it doesn't exist,

119
00:08:49,760 --> 00:08:51,030
building one yourself.

120
00:08:51,740 --> 00:08:56,510
Either way, it's strongly recommended that you actually learn the language you're working with so that

121
00:08:56,510 --> 00:08:59,840
you can understand the results and see whether they're good or bad.

122
00:09:04,590 --> 00:09:09,840
So another question I get sometimes is about word embeddings, which are word vectors we use in neural

123
00:09:09,840 --> 00:09:10,590
networks.

124
00:09:11,580 --> 00:09:15,180
Occasionally, students inquire about word embeddings in other languages.

125
00:09:15,900 --> 00:09:18,400
Again, it really comes down to the same strategy.

126
00:09:18,400 --> 00:09:24,120
You do a search and find out if what you're looking for has been uploaded by someone else.

127
00:09:24,840 --> 00:09:29,610
Unfortunately, there's not much I can do to help you since I would just take the same approach.

128
00:09:29,820 --> 00:09:32,310
So anything I would find you would find as well.

129
00:09:33,990 --> 00:09:37,470
Alternatively, if the word embeddings you need do not exist,

130
00:09:37,980 --> 00:09:42,030
well, again, the same strategy applies: simply build them yourself.

131
00:09:42,780 --> 00:09:48,450
Again, this involves tokenizing the text, which we already discussed and then running some algorithm

132
00:09:48,450 --> 00:09:49,800
like word2vec or GloVe.

133
00:09:50,310 --> 00:09:54,720
Note that both of these are open source, so you don't actually have to write any code yourself.

134
00:09:59,460 --> 00:10:05,550
Another question I received in the recent past was concerning multilingual models, that is models that

135
00:10:05,550 --> 00:10:07,860
understand multiple languages at once.

136
00:10:08,700 --> 00:10:13,950
Note that this is an active area of research, so you're unlikely to find any well-established models

137
00:10:14,280 --> 00:10:16,560
that can handle any language of your choice.

138
00:10:17,430 --> 00:10:22,050
One of the main ideas is to make the word embeddings consistent across languages.

139
00:10:22,500 --> 00:10:27,570
So, for example, the word "cat" in English should give you the same vector as "chat" in French.

140
00:10:28,380 --> 00:10:32,160
An example of a model that is multilingual is GPT-3.

141
00:10:32,550 --> 00:10:36,660
Although do note that its skill in each language has been said to be inconsistent.