1
00:00:11,080 --> 00:00:16,120
So in this lecture, we will be summarizing this section of the course, which was on vector models

2
00:00:16,120 --> 00:00:16,850
in NLP.

3
00:00:17,650 --> 00:00:22,990
This section was essentially about how to make the connection between text, which is represented as

4
00:00:22,990 --> 00:00:27,650
a string, and numbers, which are what we require to do any kind of analytics.

5
00:00:28,480 --> 00:00:32,650
Thus, this section was about how to convert text into vectors.

6
00:00:34,140 --> 00:00:39,870
Part of this section was about text preprocessing, since this is a necessary step before conversion

7
00:00:39,870 --> 00:00:41,130
into vectors can happen.

8
00:00:41,790 --> 00:00:47,910
In particular, we learned about tokenization, which is the process of converting text into tokens.

9
00:00:48,540 --> 00:00:52,230
We learned that tokens can be words, characters, or subwords.

10
00:00:52,860 --> 00:00:58,650
We also learned about the bag-of-words concept, where we ignore the ordering of the tokens in the document.

11
00:00:59,970 --> 00:01:02,190
We also learned that not all words are equal.

12
00:01:02,580 --> 00:01:09,480
Some words, like "and", "it", and "is", can appear in any kind of document and thus are likely not informative.

13
00:01:10,230 --> 00:01:11,700
We call these stop words.

14
00:01:12,420 --> 00:01:17,670
We then learned about stemming and lemmatization, which involve converting a word into its root.

15
00:01:18,360 --> 00:01:23,730
This is useful because we don't want to have to independently learn about each variation of a word.

16
00:01:24,300 --> 00:01:29,580
We already know that words like "run", "runs", "running", and "ran" all mean the same thing.

17
00:01:30,960 --> 00:01:34,800
Furthermore, this helps us to reduce the dimensionality of our vectors.

18
00:01:35,430 --> 00:01:40,770
One simple reason we want to do this is because the more dimensions we have, the more time it takes

19
00:01:40,770 --> 00:01:41,910
to do computation.

20
00:01:46,570 --> 00:01:50,260
We then learned about several techniques for converting text into vectors.

21
00:01:50,620 --> 00:01:55,540
The simplest being counts of tokens where each token is a separate vector component.

22
00:01:56,320 --> 00:02:01,030
Using this method, we were able to build a text classifier with pretty good performance.

23
00:02:01,720 --> 00:02:06,850
We then learned about TF-IDF, which handles one of the problems with count vectors, which is that

24
00:02:07,150 --> 00:02:12,910
it doesn't account for non-informative words that may have high counts but show up in many documents.

25
00:02:13,750 --> 00:02:18,760
We also looked at the concept of vector similarity, which we applied to build a movie recommendation

26
00:02:18,760 --> 00:02:19,300
system.

27
00:02:20,380 --> 00:02:25,450
As an advanced exercise, we then looked at how to implement TF-IDF from scratch.

28
00:02:25,990 --> 00:02:30,160
This helped to give us deeper insight into how TF-IDF actually works.

29
00:02:31,480 --> 00:02:36,640
Furthermore, it introduced us to an important concept, which was the word-to-index mapping.

30
00:02:37,360 --> 00:02:43,000
This is especially useful in deep learning where we work with the indices and not the words themselves.

31
00:02:44,750 --> 00:02:50,120
The final topic we looked at was neural word embeddings, which are word vectors used in deep learning

32
00:02:50,120 --> 00:02:51,260
and neural networks.

33
00:02:51,800 --> 00:02:57,140
Specifically, we did a little preview of word2vec and GloVe, which, unlike the other methods we've

34
00:02:57,140 --> 00:03:01,100
discussed, convert words into vectors instead of whole documents.

35
00:03:01,910 --> 00:03:06,320
We then saw how these vectors can do interesting things like word analogies.

36
00:03:07,010 --> 00:03:12,440
They showed us that the simple process of converting words into vectors does something very meaningful.

37
00:03:13,640 --> 00:03:16,880
Words really represent concepts on multiple dimensions.

38
00:03:17,330 --> 00:03:19,070
For example, take the word king.

39
00:03:19,700 --> 00:03:22,880
King represents the ruler of a society in one dimension.

40
00:03:23,390 --> 00:03:25,730
King represents male in another dimension.

41
00:03:26,270 --> 00:03:28,310
King is a noun in another dimension.

42
00:03:28,910 --> 00:03:32,600
In fact, these dimensions are not abstract, but rather numeric.

43
00:03:33,260 --> 00:03:38,870
This is actually a very interesting idea, since if we can represent concepts with numbers, then it

44
00:03:38,870 --> 00:03:45,080
may also be the case that thinking and reasoning are simply mathematical operations on those numbers.