1
00:00:11,080 --> 00:00:15,820
So in this lecture, we will be discussing some basic definitions we'll be using in this course.

2
00:00:16,270 --> 00:00:21,580
There will be more along the way as needed, but this is just to get us started with the basics. As a

3
00:00:21,580 --> 00:00:22,150
side note,

4
00:00:22,180 --> 00:00:27,310
this course will be focused on the English language, so it would be best if you have familiarity with

5
00:00:27,310 --> 00:00:31,960
it, which I'm sure is probably the case since the lectures are also in English.

6
00:00:33,160 --> 00:00:37,030
So that being said, let's recall that we communicate in sentences.

7
00:00:37,900 --> 00:00:44,470
A sentence is a sequence of words which typically begins with a capitalized word and ends with punctuation

8
00:00:44,470 --> 00:00:46,330
like a period or a question mark.

9
00:00:46,990 --> 00:00:53,350
OK, so hopefully you are familiar with all the terms I just used, specifically sentences, sequences,

10
00:00:53,350 --> 00:00:56,560
words, capitalization and punctuation.

11
00:00:57,100 --> 00:01:00,640
If you don't know what any of these words mean, let me know on the Q&A.

12
00:01:05,480 --> 00:01:07,430
OK, so let's go through some more terms.

13
00:01:07,880 --> 00:01:10,760
We've stated that a sentence is a sequence of words.

14
00:01:11,690 --> 00:01:14,420
Sometimes we refer to words as tokens.

15
00:01:14,840 --> 00:01:20,540
The term token is a more general term, which can refer to words, but also can refer to punctuation

16
00:01:20,570 --> 00:01:22,340
or even subunits of words.

17
00:01:22,880 --> 00:01:27,770
The details of this are best left to another lecture, but for now, just note that it is common to

18
00:01:27,770 --> 00:01:30,230
use the term token in place of word.
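
As a minimal sketch of the difference between words and tokens (this is not from the lecture; the function names and the regular expression are illustrative assumptions), here are two simple ways to split a sentence. The first treats whitespace-separated words as tokens, the second also treats each punctuation mark as its own token.

```python
import re

def word_tokens(text):
    # Split on whitespace only, so punctuation stays attached to the word.
    return text.split()

def word_and_punct_tokens(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation mark, which becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Is this a sentence?"
print(word_tokens(sentence))            # ['Is', 'this', 'a', 'sentence?']
print(word_and_punct_tokens(sentence))  # ['Is', 'this', 'a', 'sentence', '?']
```

Subword tokenizers go one step further and split words into smaller pieces, which is left to another lecture.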

19
00:01:34,850 --> 00:01:37,130
OK, so here's another source of ambiguity.

20
00:01:37,730 --> 00:01:43,070
We know that words are made up of letters. As I'm sure you're aware, the English language is made up

21
00:01:43,070 --> 00:01:44,540
of the letters A to Z.

22
00:01:44,690 --> 00:01:46,820
And there are 26 such letters.

23
00:01:47,330 --> 00:01:52,250
However, like the term token, there is a more general term that we can use, which is the character.

24
00:01:52,910 --> 00:01:58,310
The concept of characters is more general because they can represent not only letters, but punctuation

25
00:01:58,310 --> 00:01:59,540
and spaces as well.

26
00:02:00,140 --> 00:02:04,250
So, for example, a space is considered a character, as is a newline.

27
00:02:05,360 --> 00:02:11,360
If you've ever coded in C++ or Java, then you should be familiar with characters, as the character is

28
00:02:11,360 --> 00:02:13,250
a data type in these languages.

29
00:02:13,940 --> 00:02:19,640
Sometimes in NLP, we build models in terms of words or tokens, but other times we build models in

30
00:02:19,640 --> 00:02:20,810
terms of characters.

31
00:02:21,290 --> 00:02:25,130
Both have their pros and cons, so it's good to be aware of these possibilities.
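
To make the word-versus-character distinction concrete, here is a small sketch (the helper name `char_tokens` and the sample text are hypothetical, chosen only for illustration). It shows that spaces and newlines count as characters, and that the same text gives a much longer sequence at the character level than at the word level.

```python
def char_tokens(text):
    # Every character is a token, including spaces and newlines.
    return list(text)

text = "Hi there.\nBye."
chars = char_tokens(text)
words = text.split()

print(len(chars))     # 14 characters, counting the space and the newline
print(len(words))     # 3 whitespace-separated words
print("\n" in chars)  # True: the newline is a character too
```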

32
00:02:29,710 --> 00:02:32,020
OK, so let's go through some more definitions.

33
00:02:32,650 --> 00:02:35,140
We know that different words make up a language.

34
00:02:35,590 --> 00:02:37,120
What is the set of all words?

35
00:02:37,450 --> 00:02:39,130
We call this the vocabulary.

36
00:02:39,940 --> 00:02:45,400
Note that in most cases, our vocabulary will not consist of every possible word in the English language,

37
00:02:45,400 --> 00:02:47,770
but only a reasonable subset of those words.

38
00:02:48,490 --> 00:02:53,650
Sometimes this might number in the thousands, tens of thousands, hundreds of thousands, or perhaps

39
00:02:53,650 --> 00:02:54,580
even millions.

40
00:02:55,090 --> 00:02:57,730
The choice of how many words to use is really up to you.

41
00:02:58,130 --> 00:03:00,580
You should choose this based on the results you get.
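
One common way to limit the vocabulary size, sketched here under assumptions (the function name `build_vocabulary`, the lowercasing, and the whitespace tokenization are illustrative choices, not something the lecture prescribes), is to keep only the most frequent words in your data.

```python
from collections import Counter

def build_vocabulary(sentences, max_size):
    # Count every lowercased, whitespace-separated word across all sentences.
    counts = Counter(
        word for sentence in sentences for word in sentence.lower().split()
    )
    # Keep only the max_size most frequent words.
    return [word for word, _ in counts.most_common(max_size)]

sentences = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocabulary(sentences, max_size=3)
print(vocab)  # "the" comes first; "dog" and "ran" are left out
```

Here `max_size` is the knob you would tune based on the results you get.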

42
00:03:01,810 --> 00:03:05,740
On the other hand, if you're using a pre-trained model, then you have no choice.

43
00:03:05,770 --> 00:03:08,120
Simply use what is used by the authors.

44
00:03:12,630 --> 00:03:15,960
The next term I want to introduce you to is corpus.

45
00:03:16,320 --> 00:03:20,250
So here are two definitions I found by searching this term on DuckDuckGo.

46
00:03:20,970 --> 00:03:27,360
The first definition is a large collection of writings of a specific kind or on a specific subject.

47
00:03:27,960 --> 00:03:34,150
And the second definition is a collection of writings or recorded remarks used for linguistic analysis.

48
00:03:34,620 --> 00:03:36,180
So either of these is fine.

49
00:03:36,870 --> 00:03:42,210
Basically, when I say corpus in this class, it refers to the dataset, which we will be using to train

50
00:03:42,210 --> 00:03:43,260
a machine learning model.

51
00:03:48,030 --> 00:03:51,660
OK, so the next term I want to introduce you to is the N-gram.

52
00:03:52,320 --> 00:03:57,930
Basically, this just means a sequence of N consecutive items or tokens, whether they be words,

53
00:03:57,930 --> 00:03:59,550
characters or subwords.

54
00:04:00,270 --> 00:04:05,330
This is not a complex topic by any means, and in fact, we don't even really need to give it a name,

55
00:04:05,340 --> 00:04:06,330
in my opinion.

56
00:04:07,110 --> 00:04:11,280
However, this may help you if you're reading the literature and you come across this term.

57
00:04:12,450 --> 00:04:15,030
Note that we have special names for small values of N.

58
00:04:15,780 --> 00:04:20,740
For example, if we're looking at just single tokens, we call those unigrams.

59
00:04:21,149 --> 00:04:25,380
If we're looking at sets of two consecutive tokens, we call those bigrams.

60
00:04:25,740 --> 00:04:28,410
If we're looking at three, we call them trigrams.

61
00:04:29,130 --> 00:04:30,510
OK, so you get the idea.
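
The definition above can be sketched in a few lines. The helper name `ngrams` is a hypothetical choice for illustration; it slides a window of length n over any list of tokens, whether those tokens are words or characters.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the tokens; each window is one N-gram.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "we communicate in sentences".split()

print(ngrams(tokens, 1))  # 4 unigrams
print(ngrams(tokens, 2))  # [('we', 'communicate'), ('communicate', 'in'), ('in', 'sentences')]
print(ngrams(tokens, 3))  # 2 trigrams

# The same function works on characters.
print(ngrams(list("cat"), 2))  # [('c', 'a'), ('a', 't')]
```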

62
00:04:31,080 --> 00:04:32,850
Now, where does this come into play?

63
00:04:33,480 --> 00:04:39,120
Well, one instance where you'll see this idea pop up is word2vec, which is basically just a fancier way

64
00:04:39,120 --> 00:04:41,460
of building a neural network for bigrams.

65
00:04:42,600 --> 00:04:48,090
Another place in which you'll see this concept is with Markov models, which look at bigram probabilities.
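
As a rough sketch of what "bigram probabilities" can mean in practice, the code below estimates the probability of the next word given the current word by simple counting. This is an assumption for illustration (plain count-based estimates with no smoothing, and a hypothetical function name `bigram_probabilities`), not necessarily how the Markov models later in the course are built.

```python
from collections import Counter, defaultdict

def bigram_probabilities(sentences):
    # counts[current][nxt] = how often nxt directly follows current.
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            counts[current][nxt] += 1
    # Normalize each row of counts so it sums to 1.
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {w: c / total for w, c in followers.items()}
    return probs

probs = bigram_probabilities(["the cat sat", "the dog sat", "the cat ran"])
print(probs["the"])  # "cat" follows "the" 2 times out of 3, "dog" 1 time out of 3
print(probs["cat"])  # {'sat': 0.5, 'ran': 0.5}
```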

66
00:04:48,780 --> 00:04:52,740
Now again, it's my opinion that the terminology itself doesn't matter too much.

67
00:04:53,070 --> 00:04:56,160
But if you see this term, then at least you know what it means.