1
00:00:11,100 --> 00:00:15,310
So in this lecture, we are going to discuss how to do text preprocessing.

2
00:00:16,260 --> 00:00:18,720
Now, you might notice this lecture has a funny name.

3
00:00:18,730 --> 00:00:22,410
I've called it Beginner Blues, PyTorch NLP version.

4
00:00:23,010 --> 00:00:24,570
So what do I mean by this?

5
00:00:25,230 --> 00:00:30,420
Well, let's begin with the fact that this is a brand new updated lecture, which did not exist when

6
00:00:30,420 --> 00:00:31,870
I first released this course.

7
00:00:32,370 --> 00:00:36,440
In fact, there have been a few free updates to this course over the years.

8
00:00:37,050 --> 00:00:38,940
So why does this lecture exist?

9
00:00:39,690 --> 00:00:44,670
Well, you may have noticed, if you went through the legacy NLP lectures, that they use the library

10
00:00:44,670 --> 00:00:45,810
called torchtext.

11
00:00:46,110 --> 00:00:51,600
And at the current time, the torchtext library has been updated such that the previous code does

12
00:00:51,600 --> 00:00:53,400
not work with the latest versions.

13
00:00:53,910 --> 00:00:56,670
Many beginners got stuck because of this change.

14
00:00:57,690 --> 00:01:03,150
Now, the reason I wanted to make this lecture is because over the years I've seen many beginner mistakes

15
00:01:03,150 --> 00:01:06,820
being made and many misconceptions that need to be addressed.

16
00:01:07,260 --> 00:01:11,670
So if you are one of the students that got caught wondering, how can I do NLP

17
00:01:11,790 --> 00:01:15,690
if torchtext changed their API, then this lecture is for you.

18
00:01:20,490 --> 00:01:25,800
So let's start this lecture by saying that this is not something anyone who is a student of mine should

19
00:01:25,800 --> 00:01:31,440
have gotten stuck with. Basically, I taught the skills to deal with this sort of thing approximately

20
00:01:31,440 --> 00:01:32,580
five years ago.

21
00:01:33,060 --> 00:01:37,500
So this is not new technology by any means, merely a change in the API.

22
00:01:38,100 --> 00:01:41,640
So that's one of the major downsides of being dependent on libraries.

23
00:01:42,090 --> 00:01:47,340
If you haven't studied my in-depth series of courses and you depend on libraries to get things done,

24
00:01:47,640 --> 00:01:50,790
then you will pretty much get stuck whenever anything changes.

25
00:01:51,660 --> 00:01:57,600
On the other hand, students who have taken my in-depth series of courses tend not to get stuck because

26
00:01:57,600 --> 00:02:01,980
they've been taught the skills such that a change like this is not even an issue.

27
00:02:02,730 --> 00:02:08,160
So in this lecture and the coming lectures, we'll be borrowing a bit of content from my in-depth series,

28
00:02:08,370 --> 00:02:14,670
specifically deep NLP, RNNs and HMMs, where I first taught this technique many years ago.

29
00:02:19,240 --> 00:02:24,340
Now, you might think I'm just being too hard on beginners, but here's the reality, what I'm about

30
00:02:24,340 --> 00:02:29,480
to teach you in this lecture is a very practical skill used very often in the real world.

31
00:02:29,950 --> 00:02:31,450
In fact, it's so useful,

32
00:02:31,450 --> 00:02:35,520
I use this as an interview question for data science employees.

33
00:02:36,250 --> 00:02:42,010
Basically, the question is to perform the text preprocessing steps in Python, which in this course

34
00:02:42,190 --> 00:02:44,590
we had previously done using torchtext.

35
00:02:45,160 --> 00:02:50,410
The hard truth is I would never hire someone who could not write the code to solve this problem.

36
00:02:50,950 --> 00:02:53,680
In fact, my colleagues loved this question as well.

37
00:02:53,860 --> 00:02:57,060
And so they are also using this question in their interviews.

38
00:02:57,490 --> 00:03:02,250
So don't be surprised if you go to your data science interview and they ask you this question.

39
00:03:02,830 --> 00:03:04,650
We use code like this every day.

40
00:03:04,660 --> 00:03:08,690
And so if you can't write this kind of code, we will not give you the job.

41
00:03:09,280 --> 00:03:14,500
Thus, if you were one of the beginners who saw this and just gave up or got frustrated because the

42
00:03:14,500 --> 00:03:19,870
library changed on you, realize that this is not a good attitude for a data scientist to have.

43
00:03:20,290 --> 00:03:23,140
If this is your attitude, you are not getting the job.

44
00:03:23,860 --> 00:03:29,230
The main theme of this lecture is basically how you can be a builder and a leader instead of a follower

45
00:03:29,230 --> 00:03:32,060
and someone who only commits to rote memorization.

46
00:03:32,830 --> 00:03:36,310
Now, I do want to mention that most of you did not get stuck.

47
00:03:36,760 --> 00:03:42,520
For the most part, I find that the students of my VIP courses tend to be more mature and well-rounded,

48
00:03:42,790 --> 00:03:44,720
not always, but to a greater degree.

49
00:03:45,220 --> 00:03:46,560
So this is a good thing.

50
00:03:47,110 --> 00:03:51,130
If you were one of the students that did not get stuck, then congratulations.

51
00:03:51,310 --> 00:03:52,690
You are on the right track.

52
00:03:57,550 --> 00:04:03,010
Another falsehood I want to address before getting to the main content is that beginners often think

53
00:04:03,010 --> 00:04:05,440
they should always be using the latest version.

54
00:04:06,130 --> 00:04:12,070
In fact, I have a YouTube video about this called Why Bad Programmers Always Need the latest version.

55
00:04:12,730 --> 00:04:16,630
The reality is this is not the way things work in the real world.

56
00:04:17,110 --> 00:04:20,230
You see, as a business, you really only have two choices.

57
00:04:20,620 --> 00:04:26,200
Number one, do the work that your clients have asked for to produce real meaningful output and get

58
00:04:26,200 --> 00:04:32,680
paid real money. Or number two, do behind-the-scenes work, such as updating libraries, which does not produce

59
00:04:32,680 --> 00:04:36,610
any change in output for your clients and which you do not get paid for.

60
00:04:37,540 --> 00:04:41,260
So as a business, we tend to choose number one rather than number two.

61
00:04:41,710 --> 00:04:46,480
Beginners often feel like everything has to use the latest library because they haven't yet had the

62
00:04:46,480 --> 00:04:49,350
opportunity to work on real world projects.

63
00:04:49,870 --> 00:04:51,970
Many times we even skip versions.

64
00:04:52,240 --> 00:04:56,620
So today we might be using version five and we might not update until version 10.

65
00:04:57,550 --> 00:05:02,500
Beginners who think they have to update their code every month whenever a new version comes out are

66
00:05:02,500 --> 00:05:04,040
not very practically minded.

67
00:05:04,450 --> 00:05:07,060
This is not how we operate in a real business.

68
00:05:11,720 --> 00:05:16,550
So for the rest of this section, I'm actually going to leave the legacy lectures up since you'll need

69
00:05:16,550 --> 00:05:22,880
to watch them in order to complete the following exercise. The exercise is to write your own code to

70
00:05:22,880 --> 00:05:29,030
replicate the text preprocessing steps, that is, write the code to replace the torchtext functionality

71
00:05:29,030 --> 00:05:29,930
that we need.

72
00:05:31,040 --> 00:05:36,770
Now, you might wonder, how can I do this if I haven't yet taken Lazy Programmer's deep NLP course?

73
00:05:37,200 --> 00:05:37,970
It's true.

74
00:05:38,000 --> 00:05:42,650
This exercise will be a little easier if you've taken my in-depth series of courses.

75
00:05:43,070 --> 00:05:44,930
But it's not true that you can't complete it

76
00:05:45,050 --> 00:05:51,590
if you have not taken my in-depth series of courses. To complete this exercise, you only need basic

77
00:05:51,590 --> 00:05:52,760
Python knowledge.

78
00:05:53,180 --> 00:05:57,020
And actually this is why it surprised me so much that so many people got stuck.

79
00:06:02,220 --> 00:06:05,780
So to summarize the exercise, the steps are roughly as follows.

80
00:06:06,330 --> 00:06:11,910
Number one, tokenize each document. You can use any tokenizer you like, but the simplest method

81
00:06:11,910 --> 00:06:18,300
is to simply call string.split. Number two, map each token to a word index.

82
00:06:18,750 --> 00:06:22,710
As you recall, these will be used as indices to the embedding matrix.

83
00:06:23,280 --> 00:06:28,370
Note also that you'll need special indices for padding since each batch needs to have the same length.

84
00:06:29,520 --> 00:06:35,610
Number three, once you have your word to index mapping, convert all your documents into integer format.

85
00:06:36,570 --> 00:06:41,770
Number four, use your integer formatted documents as input into your neural network.
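Steps one through three can be sketched in plain Python. This is only a minimal illustration, not the code from the legacy lecture: the helper names, the `<pad>` and `<unk>` tokens, the lowercasing, and the toy documents are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical special tokens: index 0 is reserved for padding,
# index 1 for words that were never seen when building the vocabulary.
PAD, UNK = "<pad>", "<unk>"

def tokenize(doc):
    # Simplest possible tokenizer: lowercase, then split on whitespace.
    return doc.lower().split()

def build_vocab(docs, min_count=1):
    # Step 2: map each token to a unique integer index.
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    word2idx = {PAD: 0, UNK: 1}
    for word, count in counts.items():
        if count >= min_count:
            word2idx[word] = len(word2idx)
    return word2idx

def encode(doc, word2idx):
    # Step 3: convert a document into a list of integers.
    unk = word2idx[UNK]
    return [word2idx.get(tok, unk) for tok in tokenize(doc)]

docs = ["the cat sat", "the dog sat down"]  # toy data for illustration
word2idx = build_vocab(docs)
encoded = [encode(d, word2idx) for d in docs]
print(encoded)
```

The resulting integers are what would later index into the embedding matrix. Reserving an unknown-word index is an extra choice here so that unseen test-time words do not raise an error.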

86
00:06:42,300 --> 00:06:47,070
Of course, the last step you already know how to do if you've completed the previous sections of this

87
00:06:47,070 --> 00:06:54,120
course. As a bonus, you should also create your own data generator that produces batches of neural network

88
00:06:54,120 --> 00:06:54,820
inputs.

89
00:06:55,230 --> 00:06:58,140
Remember to also return these as Torch tensors.

90
00:06:59,430 --> 00:07:04,830
As a further bonus, the sequence length of each batch should only be as long as the longest sequence

91
00:07:04,830 --> 00:07:10,560
in the batch. Otherwise, the sequence length can simply be the maximum document length in the whole

92
00:07:10,560 --> 00:07:11,130
data set.
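The bonus data generator could look roughly like the sketch below. It assumes the documents are already integer-encoded, that index 0 is the padding index, that padding goes at the end of each sequence, and that there is one label per document; the function names and toy data are hypothetical. To keep the sketch dependency-free it yields plain lists, which would be wrapped in Torch tensors before being fed to a model.

```python
import random

def pad_batch(seqs, pad_idx=0, max_len=None):
    # Pad (or truncate) every sequence to the same length.
    # By default the length is the longest sequence in this batch only.
    if max_len is None:
        max_len = max(len(s) for s in seqs)
    return [s[:max_len] + [pad_idx] * (max_len - len(s)) for s in seqs]

def batch_generator(encoded_docs, labels, batch_size=2, shuffle=True, seed=None):
    # Yield (inputs, targets) batches covering the dataset once.
    order = list(range(len(encoded_docs)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        inputs = pad_batch([encoded_docs[i] for i in idx])
        targets = [labels[i] for i in idx]
        # With PyTorch installed, convert here, e.g.
        # torch.tensor(inputs, dtype=torch.long), torch.tensor(targets)
        yield inputs, targets

encoded_docs = [[2, 3, 4], [2, 5, 4, 6], [7]]  # toy integer-encoded documents
labels = [0, 1, 0]                             # toy labels, one per document
batches = list(batch_generator(encoded_docs, labels, batch_size=2, shuffle=False))
print(batches)
```

Passing `max_len` to `pad_batch` with the maximum document length in the whole dataset gives the simpler fixed-length variant instead of per-batch padding.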

93
00:07:12,330 --> 00:07:18,690
OK, so as you'll see, this is nothing but a review of what is already shown in the legacy text preprocessing

94
00:07:18,690 --> 00:07:19,260
lecture.

95
00:07:19,950 --> 00:07:25,230
So watch that lecture to see an example of how your code should behave and then write your own code

96
00:07:25,230 --> 00:07:26,820
to reproduce that behavior.

97
00:07:31,510 --> 00:07:36,550
Now, after all this, you might be wondering, why not simply use the latest version of torchtext?

98
00:07:37,030 --> 00:07:42,670
And to these students, I say you must watch this whole lecture again because the purpose has evaded

99
00:07:42,670 --> 00:07:43,040
you.

100
00:07:43,630 --> 00:07:48,950
You see, your approach of using libraries is the whole reason why you are stuck here in the first place.

101
00:07:49,450 --> 00:07:54,040
Of course, if you learn today's library, you're just going to be stuck again three months from now.

102
00:07:54,280 --> 00:07:58,270
And then six months later, when it changes again, you're going to be stuck again.

103
00:07:58,850 --> 00:08:01,630
So hopefully you see why this approach is a bad idea.

104
00:08:03,070 --> 00:08:09,490
Furthermore, the torchtext API is just not that nice, which is, in my estimation, why they decided

105
00:08:09,490 --> 00:08:11,090
to change it in the first place.

106
00:08:11,740 --> 00:08:15,910
Unfortunately, the latest version is also not that nice and, in my opinion,

107
00:08:15,910 --> 00:08:17,080
not really worth learning.

108
00:08:18,440 --> 00:08:24,260
In fact, I've seen several courses that simply use the Keras text module instead because it has a much

109
00:08:24,260 --> 00:08:31,660
nicer API. Instead, a better and more powerful approach is to actually learn how these things work.

110
00:08:33,110 --> 00:08:38,450
Getting back to the real world, often when we're hiring people, we don't care which APIs you know

111
00:08:38,450 --> 00:08:43,070
how to use. We care that you can write code. When we're hiring people,

112
00:08:43,070 --> 00:08:46,760
we don't want people who are going to get stuck every time a library changes.

113
00:08:47,120 --> 00:08:50,500
Those employees tend to need too much handholding to get things done.

114
00:08:51,050 --> 00:08:55,230
And as such, my recommendation is to not be that kind of employee.

115
00:08:55,970 --> 00:09:01,070
The main reason why beginners were getting stuck is because they were just trying to memorize syntax

116
00:09:01,250 --> 00:09:03,480
without any understanding of what it was doing.

117
00:09:04,100 --> 00:09:07,460
So the approach of rote memorization is very bad in programming.

118
00:09:08,210 --> 00:09:14,000
So not only is the approach in this exercise nicer, it's also conducive to improving your understanding

119
00:09:14,000 --> 00:09:15,070
of how things work.

120
00:09:19,720 --> 00:09:25,420
Another question you may have is, isn't it always better to use a tried and true solution instead of

121
00:09:25,420 --> 00:09:26,230
rolling your own?

122
00:09:26,980 --> 00:09:32,110
And for this type of question, it's obvious you have to apply some critical thinking and context.

123
00:09:32,620 --> 00:09:36,400
Obviously, you're not going to build your own database or a Web server framework.

124
00:09:36,610 --> 00:09:37,740
That would be absurd.

125
00:09:38,230 --> 00:09:43,660
You won't build your own SVM or random forest unless, of course, you're taking a course on SVM or

126
00:09:43,660 --> 00:09:46,300
random forest, in which case that would make sense.

127
00:09:46,840 --> 00:09:51,510
If you're doing a Kaggle contest, then of course, you're not building your own SVM for the contest.

128
00:09:51,880 --> 00:09:53,710
So it's all about context.

129
00:09:54,820 --> 00:10:00,630
Normally, big complex code bases like a database or a web server would not be built from scratch.

130
00:10:01,090 --> 00:10:06,630
Simple code like looping through a log file to parse out the important fields would be built from scratch.

131
00:10:06,970 --> 00:10:11,130
So there's always a spectrum. In the case of text preprocessing,

132
00:10:11,140 --> 00:10:15,880
what you'll see is that this actually falls on the simple side instead of the complex side.

133
00:10:16,480 --> 00:10:22,780
In fact, another reason, aside from all the good reasons already given, is performance. In the section

134
00:10:22,780 --> 00:10:24,100
on recommender systems,

135
00:10:24,310 --> 00:10:29,740
we'll see that building your own data loader is actually more performant than using the built-in utilities

136
00:10:29,740 --> 00:10:30,810
in PyTorch.