1
00:00:11,100 --> 00:00:16,350
So in this lecture, we are going to discuss one important topic which will be necessary if you want

2
00:00:16,350 --> 00:00:18,840
to implement TF-IDF from scratch.

3
00:00:19,620 --> 00:00:22,980
Doing so would be considered an advanced exercise.

4
00:00:23,340 --> 00:00:25,350
So if you'd like to try this exercise,

5
00:00:25,560 --> 00:00:30,060
note that this is a step that must be completed. If you only want to use

6
00:00:30,060 --> 00:00:35,340
scikit-learn, then you don't necessarily need to know how this works. At the same time,

7
00:00:35,370 --> 00:00:40,710
note that if you want to do anything more advanced in NLP, you still have to know how to do this anyway.

8
00:00:41,430 --> 00:00:47,190
For example, if you're going to do NLP in TensorFlow, PyTorch, or JAX, you'll need to know how this

9
00:00:47,190 --> 00:00:47,790
works.

10
00:00:48,300 --> 00:00:54,030
So methods like word2vec, RNNs, and transformers will all require some knowledge about this

11
00:00:54,030 --> 00:00:54,660
topic.

12
00:00:56,150 --> 00:01:01,550
OK, so what is this topic? The topic of this lecture is the word to index mapping.

13
00:01:02,000 --> 00:01:07,340
We'll discuss what that is, why we need it and how to build one from a given text corpus.

14
00:01:12,120 --> 00:01:18,090
OK, so let's start by answering the question: why do we need the word to index mapping? As you recall,

15
00:01:18,090 --> 00:01:23,640
what we do when we convert text into vectors is we take a whole set of documents and convert that into

16
00:01:23,640 --> 00:01:24,450
a matrix.

17
00:01:25,020 --> 00:01:30,390
Essentially, each row of the matrix is a vector, and each vector corresponds to one of the original

18
00:01:30,390 --> 00:01:31,170
documents.

19
00:01:31,650 --> 00:01:36,720
Each column of the matrix and hence each component of the vector corresponds to a word.

20
00:01:37,920 --> 00:01:42,690
Thus, the total size of the matrix is the number of documents by the number of words.

21
00:01:43,410 --> 00:01:48,390
The big question is this: how do we know which column corresponds to which word?

22
00:01:48,900 --> 00:01:50,070
This is not obvious.

23
00:01:50,670 --> 00:01:53,430
Perhaps you may believe they will be in alphabetical order.

24
00:01:53,880 --> 00:01:58,140
Perhaps you may believe they will be in the order the words appeared in the text corpus.

25
00:01:58,710 --> 00:02:03,210
Perhaps you believe they may be ordered by the frequency of those words in the text corpus.

26
00:02:04,590 --> 00:02:09,030
Of course, none of these is necessarily false, but none of these is necessarily true.

27
00:02:09,690 --> 00:02:13,920
The truth is, it depends on how you decide to build the word to index mapping.

28
00:02:14,610 --> 00:02:18,390
That is, you get to decide which column corresponds to which word.

29
00:02:19,050 --> 00:02:22,980
This lecture is mostly about how you can actually write the code to do this.

30
00:02:27,630 --> 00:02:32,970
Now, I have to warn you that this is one of those tasks which really requires you to know how to program.

31
00:02:33,540 --> 00:02:38,250
If you can code and you can logically think through an algorithm, you should find this to be quite

32
00:02:38,250 --> 00:02:38,760
easy.

33
00:02:39,300 --> 00:02:44,400
If you cannot code, you may find this to be difficult. I've seen many Udemy students

34
00:02:44,430 --> 00:02:46,590
struggle with this in the past.

35
00:02:47,070 --> 00:02:51,480
My best advice for you is to simply practice coding and improving your skills.

36
00:02:51,960 --> 00:02:54,240
This is one of those things where it's just pure coding.

37
00:02:54,450 --> 00:02:57,840
There are no tricks and there is no specialized knowledge you need.

38
00:02:58,350 --> 00:03:03,780
It's just basic coding, and it only depends on your ability to think through what you have to do.

39
00:03:04,890 --> 00:03:10,200
In other words, your ability to get this done is dependent only on your thinking ability.

40
00:03:14,880 --> 00:03:20,370
OK, so as a simple example, we're going to start by looking at a text corpus that we can process by

41
00:03:20,370 --> 00:03:20,970
hand.

42
00:03:21,690 --> 00:03:24,990
Let's suppose our text corpus contains only three documents.

43
00:03:25,440 --> 00:03:28,050
Document number one is "I like cats".

44
00:03:28,500 --> 00:03:30,960
Document number two is "I love cats".

45
00:03:31,320 --> 00:03:33,840
Document number three is "I love dogs".

46
00:03:34,920 --> 00:03:39,780
OK, so our first order of business is to determine what is our vocabulary size.

47
00:03:40,200 --> 00:03:43,530
That is, how many unique words exist in our corpus?

48
00:03:44,100 --> 00:03:46,050
In this case, we have five words.

49
00:03:46,170 --> 00:03:48,450
"I", "like", "cats", "love", and "dogs".

50
00:03:48,990 --> 00:03:54,960
Thus, each of these words would occupy columns zero, one, two, three and four inside a matrix with

51
00:03:54,960 --> 00:03:56,100
five columns.

52
00:03:56,670 --> 00:03:59,910
So, for example, we might say zero corresponds with "I",

53
00:04:00,090 --> 00:04:04,110
one corresponds with "like", two corresponds with "cats", and so forth.

54
00:04:04,800 --> 00:04:09,180
In this case, the integers have been assigned in the order that each of the words appears.

55
00:04:09,870 --> 00:04:13,770
You'll see that this is quite natural when we actually look at some code that will do this.
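
To make the column idea concrete, here is a minimal sketch that fills in the document-by-word matrix for the three example documents. It assumes the mapping is written out by hand in order of first appearance (so "love" is 3 and "dogs" is 4), that tokens come from a simple whitespace split, and that each cell holds a word count. The variable names are illustrative, not taken from the lecture's slides.

```python
# Three-document corpus from the lecture.
documents = ["I like cats", "I love cats", "I love dogs"]

# Mapping written out by hand, in order of first appearance (an assumption
# consistent with "zero is I, one is like, two is cats, and so forth").
word_to_index = {"I": 0, "like": 1, "cats": 2, "love": 3, "dogs": 4}

# One row per document, one column per word, all counts starting at zero.
counts = [[0] * len(word_to_index) for _ in documents]

for row, doc in enumerate(documents):
    for token in doc.split():  # simple whitespace tokenizer
        column = word_to_index[token]
        counts[row][column] += 1

for row in counts:
    print(row)
# [1, 1, 1, 0, 0]
# [1, 0, 1, 1, 0]
# [1, 0, 0, 1, 1]
```

Choosing a different mapping would only permute the columns; the rows would still describe the same documents.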

56
00:04:14,520 --> 00:04:19,529
Now, this example is really easy because it's small enough that you can do everything in your head.

57
00:04:20,250 --> 00:04:25,260
The question is, how can you do this when you have millions of documents and thousands of possible

58
00:04:25,260 --> 00:04:25,980
words?

59
00:04:30,600 --> 00:04:36,210
The answer is you're going to have to write some kind of computer program. In this lecture, I'm actually

60
00:04:36,210 --> 00:04:39,510
going to give you some pseudocode that will accomplish this task.

61
00:04:39,960 --> 00:04:44,910
You don't have to follow this particular strategy, but this is just one possible way to make a word

62
00:04:44,910 --> 00:04:45,930
to index map.

63
00:04:46,500 --> 00:04:51,540
If you have your own ideas about how to implement this, you are strongly encouraged to try that instead.

64
00:04:53,240 --> 00:04:55,130
OK, so let's look at our pseudocode.

65
00:04:56,030 --> 00:05:01,310
Note that while I call this pseudocode, it's really actual Python code that will work given that your

66
00:05:01,310 --> 00:05:02,930
inputs are formatted correctly.

67
00:05:03,470 --> 00:05:07,130
So although this is pseudocode, it is also real working code.

68
00:05:08,090 --> 00:05:09,560
OK, so how does this work?

69
00:05:10,130 --> 00:05:14,090
Well, we start by initializing a variable called current index to zero.

70
00:05:14,750 --> 00:05:17,960
We also create an empty dictionary called word to index.

71
00:05:18,620 --> 00:05:25,220
Now, recall that when we say mapping in Python, this usually translates to dictionary. A mapping is where

72
00:05:25,220 --> 00:05:29,840
we take one value as input and produce one corresponding value as output.

73
00:05:30,380 --> 00:05:33,620
That is, the same input should always map to the same output.

74
00:05:34,100 --> 00:05:36,830
And this is exactly what a Python dictionary does.

75
00:05:37,850 --> 00:05:40,040
The next step is to loop through our documents.

76
00:05:40,700 --> 00:05:46,370
This is assuming that our documents are stored in a list or some iterable, and each individual component

77
00:05:46,400 --> 00:05:47,570
is a single string.

78
00:05:49,640 --> 00:05:55,310
The next step is to tokenize the document. As you recall, we've seen several ways of doing this,

79
00:05:55,550 --> 00:06:00,770
such as a simple string split or more complex methods like NLTK's word_tokenize.

80
00:06:01,340 --> 00:06:05,540
This will give us back a list of tokens, which we will assume are words.

81
00:06:07,070 --> 00:06:11,990
Note that this may contain punctuation as well, but for the purpose of this lecture, they should all

82
00:06:11,990 --> 00:06:12,980
be treated the same.

83
00:06:14,650 --> 00:06:18,220
The next step is to loop through each token. Inside the loop,

84
00:06:18,250 --> 00:06:22,480
we check whether or not the token exists in our word to index dictionary.

85
00:06:23,140 --> 00:06:25,390
If it does, then there is nothing to do.

86
00:06:25,690 --> 00:06:28,990
That means our token has already been mapped to an index.

87
00:06:29,800 --> 00:06:35,890
The interesting case is when the token does not yet exist. In this case, we must create a new entry.

88
00:06:36,490 --> 00:06:42,220
We can do so by setting the key of our dictionary to be the current token and the corresponding value

89
00:06:42,220 --> 00:06:43,630
to be the current index.

90
00:06:44,260 --> 00:06:49,810
We must also remember to increment the current index so that the same variable can be used in the same

91
00:06:49,810 --> 00:06:50,290
way

92
00:06:50,530 --> 00:06:56,470
the next time we encounter a new token. In this way, we will increment current index one by one.

93
00:06:56,800 --> 00:06:59,860
The first token we encounter will be assigned index zero.

94
00:07:00,100 --> 00:07:03,820
The second token we encounter will be assigned index one and so forth.

95
00:07:04,600 --> 00:07:09,910
Once this loop is complete, we will have encountered all the unique tokens in our corpus and our word

96
00:07:09,910 --> 00:07:12,400
to index mapping will be populated correctly.
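
The steps just described can be sketched as follows. This is a reconstruction from the spoken description rather than the code shown on the slide: the names `current_index` and `word_to_index` follow the lecture, while the `documents` list and the choice of `str.split` as the tokenizer are assumptions made for illustration.

```python
# Assumed input: a list of documents, each a single string.
documents = ["I like cats", "I love cats", "I love dogs"]

current_index = 0
word_to_index = {}

for doc in documents:
    # Simple whitespace tokenizer; a more complex tokenizer could be swapped in.
    tokens = doc.split()
    for token in tokens:
        if token not in word_to_index:
            # New token: assign it the next free index, then advance the counter.
            word_to_index[token] = current_index
            current_index += 1
        # If the token is already mapped, there is nothing to do.

print(word_to_index)
# {'I': 0, 'like': 1, 'cats': 2, 'love': 3, 'dogs': 4}
```

When the loop finishes, `current_index` equals the vocabulary size, which here is 5.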

97
00:07:17,070 --> 00:07:22,440
Now, at this point, once you know which word maps to which index, you'll be able to do things like

98
00:07:22,440 --> 00:07:28,560
implement your own count vectors or your own TF-IDF. In the next lecture, we will be implementing

99
00:07:28,560 --> 00:07:32,460
TF-IDF from scratch, which involves all three of these steps.

100
00:07:33,180 --> 00:07:38,430
As you'll see, creating your own count vectors requires you to first create a word to index mapping,

101
00:07:38,910 --> 00:07:40,140
and creating your own TF-IDF

102
00:07:40,140 --> 00:07:43,230
requires you to first create your own count vectors.

103
00:07:43,800 --> 00:07:48,750
Thus, every step we've learned about in this section must be completed, and TF-IDF

104
00:07:48,750 --> 00:07:51,720
is like the culmination of everything we've seen.

105
00:07:56,420 --> 00:08:01,910
So for those of you who might be thinking a few steps ahead, you may be wondering how do we treat words

106
00:08:01,910 --> 00:08:05,180
which appear in the test set but do not appear in the train set?

107
00:08:06,080 --> 00:08:08,870
The answer is that there are several ways to deal with this.

108
00:08:09,560 --> 00:08:13,880
Method number one is to simply ignore words which were not seen in the train set.

109
00:08:14,570 --> 00:08:15,620
This makes sense.

110
00:08:16,130 --> 00:08:20,330
For example, if you're building a machine learning model, your model will be trained on the train

111
00:08:20,330 --> 00:08:21,890
set by definition.

112
00:08:22,490 --> 00:08:27,470
Therefore, it won't know what to do with inputs it has never seen because it hasn't been trained to

113
00:08:27,470 --> 00:08:28,070
do so.

114
00:08:28,790 --> 00:08:32,179
There are some situations, however, where ignoring words is not desired.

115
00:08:34,169 --> 00:08:39,900
Method number two is to create a special index for unknown words or rare words that did not appear in

116
00:08:39,900 --> 00:08:40,770
the train set.

117
00:08:41,400 --> 00:08:46,950
In this case, even some words in the train set could be considered rare words such that they get mapped

118
00:08:46,950 --> 00:08:48,150
to the special token.

119
00:08:48,960 --> 00:08:54,540
We could, for example, assign any words whose frequency falls below a certain threshold to be an unknown

120
00:08:54,540 --> 00:08:55,080
word.

121
00:08:55,770 --> 00:09:00,810
In this way, our model will learn what to do with unknown words since they appear in the train set

122
00:09:01,170 --> 00:09:05,310
and then during test time, it will handle new unknown words in the same way.

123
00:09:06,300 --> 00:09:10,230
The downside to this is that all unknown words will be treated the same.
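
As a rough sketch of method number two, the code below reserves index zero for a special unknown token and maps any word whose train-set frequency falls below a threshold to it. The token name `<unk>`, the threshold `MIN_COUNT`, the helper `encode`, and the example documents are all hypothetical choices for illustration, not something prescribed in the lecture.

```python
from collections import Counter

train_docs = ["I like cats", "I love cats", "I love dogs"]

UNK = "<unk>"   # hypothetical name for the special unknown token
MIN_COUNT = 2   # hypothetical frequency threshold for "rare" words

# Count how often each token appears in the train set.
word_counts = Counter(token for doc in train_docs for token in doc.split())

# Reserve index 0 for unknown words, then index only the frequent words.
word_to_index = {UNK: 0}
for doc in train_docs:
    for token in doc.split():
        if word_counts[token] >= MIN_COUNT and token not in word_to_index:
            word_to_index[token] = len(word_to_index)

def encode(doc):
    """Convert a document to indices, sending unseen or rare words to UNK."""
    return [word_to_index.get(token, word_to_index[UNK]) for token in doc.split()]

print(word_to_index)           # {'<unk>': 0, 'I': 1, 'cats': 2, 'love': 3}
print(encode("I like cats"))   # [1, 0, 2]  ("like" is rare in the train set)
print(encode("I love birds"))  # [1, 3, 0]  ("birds" was never seen)
```

Method number one would instead drop any token that is missing from the mapping, so "I love birds" would be encoded using only "I" and "love".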

124
00:09:14,970 --> 00:09:20,820
The next topic I want to discuss in this lecture is the reverse mapping, that is, the mapping from index

125
00:09:20,820 --> 00:09:21,570
back to word.

126
00:09:22,350 --> 00:09:24,060
So why would we ever want this?

127
00:09:24,750 --> 00:09:26,790
One example is with neural networks.

128
00:09:27,360 --> 00:09:29,910
As you recall, machine learning works with numbers.

129
00:09:30,450 --> 00:09:35,880
One way to use neural networks is to take a list of words in a sentence and to try and predict the next

130
00:09:35,880 --> 00:09:36,480
word.

131
00:09:37,110 --> 00:09:41,400
The next word, however, is represented by a number in the neural network output.

132
00:09:42,300 --> 00:09:47,190
So the neural network might say the most likely next word is word number 500.

133
00:09:47,790 --> 00:09:53,550
Of course, in order to figure out what this word actually is in human readable terms, we must know

134
00:09:53,550 --> 00:09:56,720
which word corresponds to word index 500.

135
00:09:57,420 --> 00:10:00,810
Thus, we need a mapping from index back to word.

136
00:10:05,480 --> 00:10:09,170
Another example is with interpretation of machine learning models.

137
00:10:09,710 --> 00:10:13,790
Note that this could be deep neural networks, but does not necessarily have to be.

138
00:10:15,320 --> 00:10:20,810
So in machine learning, we often like to know which of the input features is most important.

139
00:10:21,350 --> 00:10:23,930
This helps us to interpret what our model has learned.

140
00:10:24,500 --> 00:10:30,620
So our algorithm might say input number 345 is the most important input, but how do

141
00:10:30,620 --> 00:10:34,080
we know which word input number 345 corresponds to?

142
00:10:34,850 --> 00:10:36,920
The answer is we must keep track of this.

143
00:10:37,160 --> 00:10:43,410
And the way to do that is with a mapping from index back to word. As an exercise,

144
00:10:43,430 --> 00:10:48,890
you should think of how one might convert a word to index mapping into an index to word mapping and

145
00:10:48,890 --> 00:10:49,670
vice versa.
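
If you'd like to check your answer to this exercise, here is one possible sketch. It assumes the mapping is one-to-one and that the indices run from 0 up to the vocabulary size minus one, as they do when the mapping is built by incrementing a counter. The example mapping and variable names are illustrative.

```python
word_to_index = {"I": 0, "like": 1, "cats": 2, "love": 3, "dogs": 4}

# Swap keys and values to get the reverse mapping as a dictionary.
index_to_word = {index: word for word, index in word_to_index.items()}

# Because the indices are 0..V-1, a plain list also works as the reverse mapping.
index_to_word_list = [None] * len(word_to_index)
for word, index in word_to_index.items():
    index_to_word_list[index] = word

# And vice versa: rebuild the word to index mapping from the list.
rebuilt = {word: index for index, word in enumerate(index_to_word_list)}

print(index_to_word[3])          # love
print(index_to_word_list)        # ['I', 'like', 'cats', 'love', 'dogs']
print(rebuilt == word_to_index)  # True
```

The list version makes lookups by index very direct, while the dictionary version also works if some indices are skipped.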