1
00:00:11,120 --> 00:00:16,280
So in this lecture, we will be introducing the next section of this course, which is on vector models

2
00:00:16,280 --> 00:00:17,150
in NLP.

3
00:00:18,380 --> 00:00:20,840
So let's think about why this section exists.

4
00:00:22,040 --> 00:00:26,720
Well, this class is all about the study of language and how machine learning can be applied to it.

5
00:00:27,290 --> 00:00:30,020
We know that machine learning is essentially applied math.

6
00:00:30,590 --> 00:00:33,200
We also know that math typically works on numbers.

7
00:00:34,100 --> 00:00:35,510
So this is the main issue.

8
00:00:35,540 --> 00:00:38,000
Language is not represented by numbers.

9
00:00:38,510 --> 00:00:43,520
Language is represented by words and characters, which are discrete, categorical objects.

10
00:00:44,120 --> 00:00:50,270
So this section essentially answers the question: how can we convert language into a numerical representation

11
00:00:50,630 --> 00:00:55,580
so that we can apply machine learning to language? As a side effect of this,

12
00:00:55,790 --> 00:01:00,680
this section will also cover text preprocessing, which is what must be done in the code when you're

13
00:01:00,680 --> 00:01:04,690
working with text to actually get the text into a numerical format.

14
00:01:09,440 --> 00:01:14,150
OK, so now that we know the fundamental purpose of this section, let's go through an outline of what

15
00:01:14,150 --> 00:01:15,080
we will discuss.

16
00:01:15,710 --> 00:01:18,620
We'll start with just some basic definitions in NLP.

17
00:01:19,250 --> 00:01:23,420
Specifically, you'll learn about terms like token, character and vocabulary.

18
00:01:23,960 --> 00:01:27,120
Many people already know what these words mean, but some do not.

19
00:01:27,170 --> 00:01:31,130
So just for completeness' sake, we will discuss what all of these words mean.

20
00:01:32,030 --> 00:01:37,700
Once we understand our basic definitions, the next step will be to answer the question: what is a vector?

21
00:01:38,750 --> 00:01:43,460
Now, I'm sure you all learned about vectors in high school mathematics, so this should just be a review.

22
00:01:44,180 --> 00:01:49,190
We'll also discuss why the concept of vectors is useful in NLP, and I'll give you a little preview

23
00:01:49,190 --> 00:01:51,200
of how they will be used throughout the course.

24
00:01:52,520 --> 00:01:57,680
Once we understand the concept of vectors, the next step will be to discuss various techniques that

25
00:01:57,680 --> 00:02:01,640
will be used during the process of converting text into vectors.

26
00:02:02,150 --> 00:02:06,920
So, for example, tokenization, stop words, stemming, and lemmatization.

27
00:02:08,240 --> 00:02:13,190
So after we understand some of the basic text processing techniques, we'll then look at the simplest

28
00:02:13,190 --> 00:02:17,000
way to convert text into vectors, which is by simple counting.

29
00:02:18,110 --> 00:02:21,080
We'll also look at how this can be applied to do machine learning.

30
00:02:22,400 --> 00:02:25,730
At this point, we won't study any machine learning techniques in depth.

31
00:02:26,000 --> 00:02:30,380
But this is just to give you a quick preview of how effective simple counting can be.

32
00:02:31,640 --> 00:02:36,410
Once we learn how to turn text into vectors by counting, we'll then look at some of the limitations

33
00:02:36,410 --> 00:02:37,280
of this method.

34
00:02:37,850 --> 00:02:44,060
We'll learn that one way to overcome these limitations is by using a technique called TF-IDF, which

35
00:02:44,060 --> 00:02:47,060
stands for term frequency-inverse document frequency.

36
00:02:48,200 --> 00:02:53,600
We'll also look at how TF-IDF can be applied in practice by building a simple recommendation system.

37
00:02:54,710 --> 00:02:56,150
As an advanced exercise,

38
00:02:56,150 --> 00:02:59,660
we'll also look at how to implement TF-IDF from scratch.

39
00:03:00,170 --> 00:03:05,120
For those advanced students who really want to understand how things work, this exercise is strongly

40
00:03:05,120 --> 00:03:05,930
recommended.

41
00:03:06,470 --> 00:03:08,060
Otherwise, it's OK to skip.

42
00:03:09,490 --> 00:03:14,140
In the final portion of this section, we will then look at some of the more recent ways that researchers

43
00:03:14,140 --> 00:03:18,040
have been using to create word vectors, such as word2vec and GloVe.

44
00:03:18,670 --> 00:03:24,340
Now for this part, we will only be looking at basic intuition since the mathematics are quite involved.

45
00:03:24,730 --> 00:03:29,920
In fact, it would take an entire course to cover these subjects in depth, which I've already done.

46
00:03:30,160 --> 00:03:34,480
So there's no need to repeat that, and you could just think of this more like a survey of those more

47
00:03:34,480 --> 00:03:35,650
advanced techniques.

