1
00:00:11,060 --> 00:00:14,900
So in this lecture, we are going to look at the CountVectorizer in code.

2
00:00:15,590 --> 00:00:20,660
Now this lecture is a bit of a preview of machine learning since technically we haven't yet discussed

3
00:00:20,660 --> 00:00:23,000
those concepts in this particular section.

4
00:00:23,600 --> 00:00:28,970
However, you need not be intimidated by this, since we are only going to make use of scikit-learn.

5
00:00:29,750 --> 00:00:32,960
As you may know, this involves only three lines of code.

6
00:00:33,440 --> 00:00:35,960
In particular, we create the model object.

7
00:00:36,290 --> 00:00:38,300
We fit the model to the training data.

8
00:00:38,690 --> 00:00:41,300
Then we check the model's train and test performance.
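
As a rough sketch of those three steps, assuming a tiny made-up count matrix rather than the data used in this lecture (the variable names here are illustrative only):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: each row is a document, each column is a word count.
Xtrain = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]])
Ytrain = np.array([0, 0, 1, 1])
Xtest = np.array([[1, 0, 0], [0, 1, 0]])
Ytest = np.array([0, 1])

model = MultinomialNB()                   # 1. create the model object
model.fit(Xtrain, Ytrain)                 # 2. fit the model to the training data
train_acc = model.score(Xtrain, Ytrain)   # 3. check train and test performance
test_acc = model.score(Xtest, Ytest)

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```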

9
00:00:41,900 --> 00:00:47,750
So if you are not familiar with these steps, please let me know on the Q&A so we can get that fixed

10
00:00:47,750 --> 00:00:48,530
for you promptly.

11
00:00:49,040 --> 00:00:53,930
I have free resources to help you refresh on this process, so please take advantage of them.

12
00:00:54,470 --> 00:00:57,800
Otherwise, I'm going to assume that you are comfortable with these steps.

13
00:01:01,390 --> 00:01:04,150
OK, so let's start by importing everything we need.

14
00:01:04,750 --> 00:01:10,990
We'll start with basic numerical computing libraries such as NumPy and pandas. From scikit-learn, we'll

15
00:01:10,990 --> 00:01:16,390
need the CountVectorizer class, the MultinomialNB class, and the train_test_split function.

16
00:01:17,950 --> 00:01:21,550
MultinomialNB is our classifier of choice for this notebook.

17
00:01:21,880 --> 00:01:24,760
But of course, you can feel free to use whatever you like.

18
00:01:25,450 --> 00:01:30,040
Note that we also have several imports from NLTK, which should look familiar.

19
00:01:37,540 --> 00:01:40,830
The next step is to download the data we need for NLTK.

20
00:01:47,980 --> 00:01:52,840
The next step is to download our data set, which consists of samples of BBC News.

21
00:01:53,500 --> 00:01:56,680
The goal of this data set is to classify the documents.

22
00:01:57,190 --> 00:02:02,230
As you recall, if you read the news, articles are usually categorized into sections.

23
00:02:02,560 --> 00:02:07,690
For example, there's the business section, entertainment, politics, sports and so forth.

24
00:02:08,320 --> 00:02:12,460
Note that this will be a supervised task since we will be given the labels.

25
00:02:19,850 --> 00:02:23,960
The next step is to read in our CSV using pd.read_csv.

26
00:02:28,020 --> 00:02:31,560
The next step is to call df.head() to see what our data looks like.

27
00:02:35,970 --> 00:02:40,240
As you can see, this data has two columns. As you may recall,

28
00:02:40,260 --> 00:02:43,080
this is what I described in the previous lectures.

29
00:02:43,560 --> 00:02:49,140
The text column contains the documents, and the labels column contains the label for each document.

30
00:02:49,590 --> 00:02:51,780
Each row is a separate document.

31
00:02:52,410 --> 00:02:57,240
Please feel free to actually print out the full text yourself to get a better sense of the data we're

32
00:02:57,240 --> 00:02:57,840
using.

33
00:03:01,200 --> 00:03:06,990
The next step is to assign the two columns to separate variables, which I've called inputs and labels.
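
Here is a minimal sketch of reading the data and splitting out the two columns. The file name is not given here, so this assumes a small in-memory CSV with made-up rows; only the column names text and labels and the variable names inputs and labels come from the lecture.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV file; the real file has many more rows.
csv_text = io.StringIO(
    'text,labels\n'
    '"Shares rose after the earnings report",business\n'
    '"The striker scored twice in the final",sport\n'
    '"New phone launches next month",tech\n'
)

df = pd.read_csv(csv_text)
print(df.head())          # first few rows: one document per row

inputs = df['text']       # the documents
labels = df['labels']     # the category of each document
```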

34
00:03:11,600 --> 00:03:14,420
The next step is to plot a histogram of our labels.

35
00:03:15,050 --> 00:03:18,950
The point of this is to see whether or not we have imbalanced classes.

36
00:03:19,400 --> 00:03:22,970
That is if any class is over or underrepresented.

37
00:03:23,540 --> 00:03:28,730
This can be an issue when we check our model's performance since, for example, if ninety nine percent

38
00:03:28,730 --> 00:03:34,430
of our data belongs to one class, we can obtain ninety nine percent accuracy by only predicting that

39
00:03:34,430 --> 00:03:35,030
class.

40
00:03:35,570 --> 00:03:40,700
In that case, we would want to use other metrics in order to get a better sense of how our model is

41
00:03:40,700 --> 00:03:41,240
doing.
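
To make the 99 percent example concrete, here is a small sketch with made-up labels. It assumes a "model" that always predicts the majority class, and uses recall on the rare class as one example of an alternative metric.

```python
import numpy as np

# Hypothetical imbalanced labels: 99 samples of class 0, 1 sample of class 1.
y_true = np.array([0] * 99 + [1])

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)

# Recall for class 1: of the true class-1 samples, what fraction did we find?
minority_recall = np.mean(y_pred[y_true == 1] == 1)

print("accuracy:", accuracy)
print("recall on the rare class:", minority_recall)
```

The accuracy looks excellent even though the rare class is never predicted, which is why accuracy alone can mislead on imbalanced data.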

42
00:03:46,640 --> 00:03:51,980
So notice that our labels seem to be pretty evenly spread out. Because of this, we can say that our

43
00:03:51,980 --> 00:03:53,600
data is not imbalanced.

44
00:03:53,840 --> 00:03:57,470
And so there isn't a great need to check alternative metrics.

45
00:03:58,130 --> 00:04:04,490
Also, this gives us a chance to see what labels we're working with. Notice that we have five labels: business,

46
00:04:04,490 --> 00:04:07,370
entertainment, politics, sport and tech.

47
00:04:10,860 --> 00:04:16,440
The next step is to do a train-test split. Note that we do this before using the CountVectorizer.
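
A minimal sketch of the split, assuming toy documents in place of the real columns. The random_state value and the output variable names are assumptions made for this example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy documents and labels standing in for the real columns.
inputs = ["doc a", "doc b", "doc c", "doc d", "doc e", "doc f", "doc g", "doc h"]
labels = ["business", "sport", "tech", "business", "sport", "tech", "business", "sport"]

# random_state is an assumption, used only to make the split repeatable.
inputs_train, inputs_test, Ytrain, Ytest = train_test_split(
    inputs, labels, random_state=123
)

# By default, a quarter of the samples are held out for testing.
print(len(inputs_train), len(inputs_test))
```

Note that what gets split here is still raw text. The vectorizer only sees the training portion later on.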

48
00:04:20,950 --> 00:04:24,490
The next step is to instantiate our CountVectorizer object.

49
00:04:25,210 --> 00:04:31,210
Note that for our first experiment, we will be using all the default values and thus, for this instance,

50
00:04:31,450 --> 00:04:33,250
we're not going to have any arguments.

51
00:04:37,650 --> 00:04:41,220
The next step is to call fit_transform on the training inputs.

52
00:04:41,730 --> 00:04:45,150
We follow that by calling transform on the test inputs.

53
00:04:45,960 --> 00:04:51,840
Sometimes, students inquire about why we call fit_transform in one case, but only transform in

54
00:04:51,840 --> 00:04:52,740
the other case.

55
00:04:53,340 --> 00:04:58,110
Recall that the training data is supposed to represent what we have when we build our model.

56
00:04:58,530 --> 00:05:03,570
The test data is supposed to represent what we have when we apply our model to data we haven't seen

57
00:05:03,570 --> 00:05:04,170
before.

58
00:05:04,800 --> 00:05:10,170
As such, we would not want to fit on the test data because that's not how test data will be used.
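
Here is a small sketch of what that means in practice, assuming two made-up training documents and one made-up test document. A word that appears only in the test input never becomes a column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents; "goalkeeper" appears only in the test input.
inputs_train = ["the market fell", "the team won the match"]
inputs_test = ["the goalkeeper won"]

vectorizer = CountVectorizer()
Xtrain = vectorizer.fit_transform(inputs_train)  # learns the vocabulary, then counts
Xtest = vectorizer.transform(inputs_test)        # counts only, using the learned vocabulary

print(sorted(vectorizer.vocabulary_))
print(Xtrain.shape, Xtest.shape)  # same number of columns in both
```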

59
00:05:15,280 --> 00:05:20,950
OK, so although the CountVectorizer has given us Xtrain and Xtest, you may be wondering, what

60
00:05:20,950 --> 00:05:22,630
do these variables represent?

61
00:05:23,620 --> 00:05:25,270
Recall why we are doing this.

62
00:05:25,780 --> 00:05:30,070
Our input data is text, but machine learning only works on numbers.

63
00:05:30,580 --> 00:05:33,790
We have just converted our text into vectors of numbers.

64
00:05:34,360 --> 00:05:40,510
Specifically, Xtrain and Xtest are matrices with the number of rows equal to the number of data samples

65
00:05:40,810 --> 00:05:43,690
and number of columns equal to the vocabulary size.
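
To check that claim on a tiny example, here is a sketch with three made-up documents. Converting to a dense array is only sensible at this toy scale.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, used only to illustrate the matrix shape.
docs = ["the cat sat", "the dog barked", "stocks fell sharply"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

n_rows, n_cols = X.shape
print(n_rows == len(docs))                     # one row per document
print(n_cols == len(vectorizer.vocabulary_))   # one column per vocabulary word
print(X.toarray())                             # dense view of the counts
```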

66
00:05:44,650 --> 00:05:46,540
So let's just print out Xtrain.

67
00:05:50,490 --> 00:05:56,010
As you can see, Xtrain is not what you may have expected, or perhaps it is if you remember what

68
00:05:56,010 --> 00:05:56,970
I told you earlier.

69
00:05:57,870 --> 00:06:03,420
You may have expected to see an array of numbers since that's what Xtrain is, but instead we see

70
00:06:03,420 --> 00:06:04,230
something else.

71
00:06:04,740 --> 00:06:07,290
We see that Xtrain is a sparse matrix.

72
00:06:07,890 --> 00:06:13,740
As you recall, the reason these special types of matrices are used is because most of the values in

73
00:06:13,740 --> 00:06:15,060
the matrix are zero.

74
00:06:15,600 --> 00:06:18,990
That is, most documents do not make use of most words.

75
00:06:19,440 --> 00:06:22,890
Using a sparse matrix representation is more efficient.

76
00:06:24,270 --> 00:06:31,140
Also note the size of this matrix: we have about six hundred rows and about 26,000 columns.

77
00:06:31,770 --> 00:06:35,250
Note that this is typically very undesirable in machine learning.

78
00:06:35,790 --> 00:06:40,500
We normally like to have many more rows compared to columns. As you'll see,

79
00:06:40,530 --> 00:06:44,070
this is not necessarily a detriment in this particular case.

80
00:06:47,240 --> 00:06:49,880
Now, to see just how sparse this matrix is,

81
00:06:50,120 --> 00:06:54,770
we're going to compute the percentage of values in Xtrain which are non-zero.

82
00:06:55,460 --> 00:06:58,810
So let's break down this expression. On the numerator,

83
00:06:58,820 --> 00:07:01,340
we have the count of non-zero values.

84
00:07:01,850 --> 00:07:03,710
So why does this give us that count?

85
00:07:04,580 --> 00:07:06,830
Basically, you have to work from the inside out.

86
00:07:07,460 --> 00:07:09,950
First, we have Xtrain != 0.

87
00:07:10,370 --> 00:07:11,810
So what does this give us?

88
00:07:12,350 --> 00:07:16,970
This gives us a Boolean matrix with True where the value is non-zero and False

89
00:07:16,970 --> 00:07:20,070
otherwise. As you recall, in Python,

90
00:07:20,090 --> 00:07:23,390
True is treated as one and False is treated as zero.

91
00:07:24,110 --> 00:07:29,210
Therefore, if we simply take the sum of all the elements in this matrix, we will get the count of all

92
00:07:29,210 --> 00:07:30,650
the non-zero values.

93
00:07:36,600 --> 00:07:39,960
On the denominator, we have the product of Xtrain.shape.

94
00:07:40,620 --> 00:07:42,060
So why does that make sense?

95
00:07:42,630 --> 00:07:48,060
Well, Xtrain.shape will give us a tuple containing the number of rows and number of columns.

96
00:07:48,540 --> 00:07:50,700
How many elements are inside Xtrain?

97
00:07:51,300 --> 00:07:54,420
Well, it's the number of rows times the number of columns.

98
00:07:54,780 --> 00:07:56,190
In other words, the product.

99
00:07:56,940 --> 00:07:59,850
OK, so I hope you're convinced that this expression is correct.

100
00:08:00,240 --> 00:08:04,410
It's the number of non-zero elements divided by the total number of elements.
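
The expression described above can be sketched as follows. This assumes three made-up documents, so the resulting fraction is far larger than what you would see on the real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; the real training set is much larger and much sparser.
docs = ["the cat sat", "the dog barked", "stocks fell sharply"]
Xtrain = CountVectorizer().fit_transform(docs)

# Numerator: (Xtrain != 0) is a Boolean matrix, and True counts as 1 when summed.
nonzero_count = (Xtrain != 0).sum()

# Denominator: rows times columns, that is, the total number of entries.
total_count = np.prod(Xtrain.shape)

fraction_nonzero = nonzero_count / total_count
print(fraction_nonzero)
```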

101
00:08:08,740 --> 00:08:15,670
So as you can see, less than one percent of the matrix contains non-zero values. Thus, we are justified

102
00:08:15,670 --> 00:08:17,530
in using a sparse representation.

103
00:08:21,110 --> 00:08:25,070
So in the next block of code, we'd like to do our usual machine learning steps.

104
00:08:25,760 --> 00:08:32,210
Again, we start by creating an instance of our model, which is MultinomialNB. The second step is

105
00:08:32,210 --> 00:08:34,309
to fit the model on the train set.

106
00:08:35,030 --> 00:08:39,559
The third step is to check the performance of the model on both the train and test sets.

107
00:08:40,190 --> 00:08:43,250
As you recall, the score function returns the accuracy.
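
If you'd like to convince yourself that score is the accuracy, here is a small sketch that compares it with the fraction of correct predictions computed by hand. The documents and labels are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training documents and labels.
train_docs = [
    "shares rose sharply",
    "profits fell",
    "team won the match",
    "striker scored twice",
]
Ytrain = np.array(["business", "business", "sport", "sport"])

vectorizer = CountVectorizer()
Xtrain = vectorizer.fit_transform(train_docs)

model = MultinomialNB()
model.fit(Xtrain, Ytrain)

score = model.score(Xtrain, Ytrain)
# Accuracy by hand: the fraction of predictions that match the true labels.
manual_accuracy = np.mean(model.predict(Xtrain) == Ytrain)

print(score, manual_accuracy)
```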

108
00:08:47,510 --> 00:08:52,640
OK, so we get about 99 percent on the train set and about 97 percent on the test

109
00:08:52,640 --> 00:08:52,970
set.

110
00:08:53,540 --> 00:08:58,850
This is already pretty good, but let's see if any of the variations we discussed will be able to beat

111
00:08:58,850 --> 00:08:59,600
this score.

112
00:09:04,800 --> 00:09:11,010
So the next experiment in this script will be to use stop words. As you recall, by default, stop

113
00:09:11,010 --> 00:09:12,240
words are not removed.

114
00:09:12,990 --> 00:09:16,590
Note that this time we're going to do everything in a single block of code,

115
00:09:16,890 --> 00:09:22,380
since you've already seen this code before. So please double check each line for yourself to make sure

116
00:09:22,380 --> 00:09:23,580
that this is the case.
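
As a small sketch of the only line that changes, here is the vectorizer with and without stop word removal. This assumes scikit-learn's built-in English list (stop_words='english') and two made-up documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents containing a few common English stop words.
docs = ["the market fell and the team won", "a striker scored in the final"]

default_vec = CountVectorizer()                   # default: stop words are kept
stop_vec = CountVectorizer(stop_words='english')  # built-in English stop word list

default_vocab = set(default_vec.fit(docs).vocabulary_)
stop_vocab = set(stop_vec.fit(docs).vocabulary_)

print(sorted(default_vocab - stop_vocab))  # words dropped as stop words
```

The rest of the pipeline (fit_transform, transform, fit, score) stays the same as in the first experiment.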

117
00:09:28,100 --> 00:09:32,330
OK, so as you can see, our performance is nearly the same as before.

118
00:09:33,080 --> 00:09:38,030
Note that although we did a bit better this time, this was not always the case when I ran this code.

119
00:09:38,510 --> 00:09:42,590
So please do your own experiments and check how consistent this result is.

120
00:09:46,800 --> 00:09:52,290
The next experiment in this notebook will be to use lemmatization. Now, because we're using the

121
00:09:52,290 --> 00:09:54,310
CountVectorizer class in scikit-learn,

122
00:09:54,600 --> 00:10:00,510
this is mainly an exercise in figuring out how to work with the API to use an external tokenizer.

123
00:10:01,350 --> 00:10:06,070
We'll begin by defining a function that will map part-of-speech tags in NLTK.

124
00:10:06,660 --> 00:10:09,780
As you recall, this was described in a previous lecture.
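
One common way to write such a mapping is sketched below. The function name and the noun fallback are assumptions, and the one-letter codes are written as plain strings so the sketch runs without NLTK; they match the values of NLTK's wordnet.ADJ, wordnet.VERB, wordnet.NOUN, and wordnet.ADV.

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet part-of-speech code."""
    if treebank_tag.startswith('J'):
        return 'a'   # adjective
    elif treebank_tag.startswith('V'):
        return 'v'   # verb
    elif treebank_tag.startswith('N'):
        return 'n'   # noun
    elif treebank_tag.startswith('R'):
        return 'r'   # adverb
    else:
        return 'n'   # fall back to noun, an assumption for this sketch


print(get_wordnet_pos('VBD'), get_wordnet_pos('JJ'), get_wordnet_pos('DT'))
```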

125
00:10:16,610 --> 00:10:20,690
The next block of code is used to define a class called LemmaTokenizer.

126
00:10:21,410 --> 00:10:27,350
Basically, this is going to do all the work of tokenizing and lemmatizing each document. Although the

127
00:10:27,350 --> 00:10:29,000
syntax is a bit complex,

128
00:10:29,390 --> 00:10:31,280
pay attention to the high level idea.

129
00:10:32,660 --> 00:10:38,210
What we want to do is create an object and then we want to be able to call that object as if it were

130
00:10:38,210 --> 00:10:38,900
a function.

131
00:10:39,500 --> 00:10:44,840
We can accomplish that by defining a special call function with two underscores on each side.
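
Here is the high level idea on its own, in plain Python with no NLP libraries. The class name and its behavior are made up for illustration.

```python
class LowercaseSplitter:
    """A hypothetical object that can be called like a function."""

    def __init__(self):
        # Any setup that should happen once goes in the initializer.
        self.calls = 0

    def __call__(self, doc):
        # This runs whenever the object itself is called: splitter(doc)
        self.calls += 1
        return doc.lower().split()


splitter = LowercaseSplitter()
tokens = splitter("Hello World")   # same as splitter.__call__("Hello World")
print(tokens)
print(callable(splitter))
```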

132
00:10:46,670 --> 00:10:51,890
OK, so in the initializer, we start by instantiating a WordNetLemmatizer object.

133
00:10:52,610 --> 00:10:54,530
The next step is to look at the call function.

134
00:10:55,160 --> 00:11:00,830
This essentially takes in one argument, which is the document to tokenize. Inside the function,

135
00:11:00,860 --> 00:11:04,550
we start by calling the word_tokenize function from NLTK.

136
00:11:05,390 --> 00:11:07,780
This will convert our document into tokens.

137
00:11:08,330 --> 00:11:12,200
As you recall, you can think of this like a fancier version of String Split.

138
00:11:13,040 --> 00:11:16,970
The next step is to obtain the part-of-speech tags. As you recall,

139
00:11:17,210 --> 00:11:20,870
this can be done by calling nltk.pos_tag.

140
00:11:21,440 --> 00:11:27,710
This returns a list containing tuples, and each tuple contains each word in the document, along with

141
00:11:27,710 --> 00:11:28,970
its corresponding tag.

142
00:11:29,810 --> 00:11:34,970
The final step is to loop through each word and tag pair and to call the lemmatize function.

143
00:11:35,660 --> 00:11:40,760
The output of this is a list containing each lemmatized word in the input document.

144
00:11:47,650 --> 00:11:51,430
The next step is to try our LemmaTokenizer with the CountVectorizer.

145
00:11:52,090 --> 00:11:53,230
So how do we use it?

146
00:11:54,100 --> 00:11:59,560
As you can see, we pass in an object of type LemmaTokenizer for the tokenizer argument.

147
00:12:00,160 --> 00:12:05,200
Note that this takes in any callable, so an object with a call function is acceptable.

148
00:12:05,590 --> 00:12:07,720
A regular function would be fine as well.

149
00:12:08,590 --> 00:12:13,300
In fact, you may want to rewrite this code to use a regular function if you're more comfortable with

150
00:12:13,300 --> 00:12:14,170
that approach.
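
To illustrate that either kind of callable works, here is a sketch that passes both an object and a regular function as the tokenizer argument. A plain string split stands in for the lemmatizing logic, and both names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer


class SplitTokenizer:
    """Hypothetical callable object; stands in for a tokenizer class."""

    def __call__(self, doc):
        return doc.split()


def split_tokenizer(doc):
    """The same logic written as a regular function."""
    return doc.split()


docs = ["the cat sat", "the dog sat down"]  # made-up documents

vec_from_object = CountVectorizer(tokenizer=SplitTokenizer())
vec_from_function = CountVectorizer(tokenizer=split_tokenizer)

X_obj = vec_from_object.fit_transform(docs)
X_fn = vec_from_function.fit_transform(docs)

# Both vectorizers learn exactly the same vocabulary.
print(vec_from_object.vocabulary_ == vec_from_function.vocabulary_)
```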

151
00:12:14,860 --> 00:12:16,900
And again, the remaining steps are the same.

152
00:12:25,900 --> 00:12:31,360
OK, so interestingly, while the train score is still about 99 percent, the test score has dropped

153
00:12:31,360 --> 00:12:35,050
to about 96 something percent, which is less than before.

154
00:12:35,800 --> 00:12:41,860
Note that you're encouraged to try running this yourself to observe that lemmatization is the slowest process

155
00:12:42,130 --> 00:12:44,080
out of everything we tried in this notebook.

156
00:12:44,680 --> 00:12:49,540
This video has been edited to make it look like it didn't take that much time, but please run this

157
00:12:49,540 --> 00:12:50,440
yourself to see.

158
00:12:54,330 --> 00:13:00,420
So the next experiment in the script is to try stemming. As you recall, stemming is a more crude version

159
00:13:00,420 --> 00:13:01,500
of lemmatization.

160
00:13:02,370 --> 00:13:08,500
Again, we're going to accomplish this by defining a class with a call function. In the initializer,

161
00:13:08,520 --> 00:13:12,700
we're going to instantiate a PorterStemmer. Inside the call function,

162
00:13:12,730 --> 00:13:18,360
we're going to call word_tokenize to tokenize our documents, and then we call porter.stem for

163
00:13:18,360 --> 00:13:19,050
each token.

164
00:13:19,830 --> 00:13:22,440
Note that there is no need for part-of-speech tags.

165
00:13:23,100 --> 00:13:26,130
Again, the output will be a list of stemmed tokens.
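
A rough sketch of such a class is shown below. To keep it runnable without downloading NLTK's tokenizer data, it assumes a plain string split in place of word_tokenize, so it is not identical to the class described here.

```python
from nltk.stem import PorterStemmer


class StemTokenizer:
    def __init__(self):
        # Create the stemmer once, in the initializer.
        self.porter = PorterStemmer()

    def __call__(self, doc):
        # Assumption: str.split is used here instead of NLTK's word_tokenize,
        # only so that this sketch needs no extra data downloads.
        tokens = doc.split()
        return [self.porter.stem(t) for t in tokens]


stems = StemTokenizer()("running dogs jumped")
print(stems)
```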

166
00:13:32,280 --> 00:13:36,120
The next step is to try our StemTokenizer with the CountVectorizer.

167
00:13:36,750 --> 00:13:39,060
Note that the syntax is the same as before.

168
00:13:46,890 --> 00:13:50,070
OK, so this time our train performance is slightly worse,

169
00:13:50,430 --> 00:13:55,590
while our test performance is slightly better. But note that this is still worse than our first two

170
00:13:55,590 --> 00:13:58,710
experiments, which did not use any special tokenizer.

171
00:14:02,460 --> 00:14:08,790
The next step is to define the simplest possible tokenizer, which is just a string split. Note that

172
00:14:08,790 --> 00:14:14,070
we need to put this into a function since, as you recall, the tokenizer argument must be callable

173
00:14:14,280 --> 00:14:16,650
and it must accept a document as input.

174
00:14:17,250 --> 00:14:21,240
In other words, this function has the same interface as the LemmaTokenizer and StemTokenizer

175
00:14:21,390 --> 00:14:22,530
we defined above.
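
Here is a sketch of such a function, compared against the default tokenization on one made-up sentence. The function name is an assumption. Notice that with a plain split, punctuation stays attached to the words.

```python
from sklearn.feature_extraction.text import CountVectorizer


def simple_tokenizer(s):
    # The simplest possible tokenizer: split on whitespace.
    return s.split()


docs = ["Profits rose, shares fell."]  # a made-up document

default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)
split_vocab = sorted(
    CountVectorizer(tokenizer=simple_tokenizer).fit(docs).vocabulary_
)

print(default_vocab)
print(split_vocab)
```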

176
00:14:27,770 --> 00:14:32,150
OK, so the next step is to try our simple tokenizer with the CountVectorizer.

177
00:14:37,480 --> 00:14:43,480
So this time our train score has increased, but our test score is only on par with what we got before.

178
00:14:44,140 --> 00:14:49,240
At the same time, this is a sign that a simple string split is a reasonable choice.

179
00:14:54,070 --> 00:14:56,380
OK, so what have we learned in this lecture?

180
00:14:57,070 --> 00:15:01,870
We've learned that it's not at all clear which method will perform best before you even try.

181
00:15:02,650 --> 00:15:06,880
Oftentimes, people assume that the most complex method will perform best.

182
00:15:07,300 --> 00:15:09,320
In this case, that was lemmatization.

183
00:15:10,060 --> 00:15:11,280
But what we saw was that

184
00:15:11,320 --> 00:15:15,700
this, in fact, led to the worst performance out of all the experiments we tried.

185
00:15:16,450 --> 00:15:20,920
However, also note that this was just for a single instance of the experiment.

186
00:15:23,030 --> 00:15:29,210
As a final exercise for this lecture, you should print out the size of Xtrain in each case. What

187
00:15:29,210 --> 00:15:32,780
you should find is that there is quite a large difference between the cases.

188
00:15:33,290 --> 00:15:37,760
Consider why these differences exist using what you've learned about each method.
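
One way to set up this exercise is sketched below. It assumes three made-up documents and covers only the variations that need no NLTK data; you would add the lemmatizing and stemming tokenizers to the dictionary yourself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents standing in for the training inputs.
inputs_train = [
    "The market fell, and the market recovered.",
    "The team won the match!",
    "Shares in the team fell.",
]

# Each entry is one experiment; add your own tokenizers here.
vectorizers = {
    "default": CountVectorizer(),
    "stop_words": CountVectorizer(stop_words='english'),
    "split": CountVectorizer(tokenizer=lambda s: s.split()),
}

shapes = {}
for name, vectorizer in vectorizers.items():
    Xtrain = vectorizer.fit_transform(inputs_train)
    shapes[name] = Xtrain.shape
    print(name, Xtrain.shape)
```

The number of rows is the same in every case, since it is the number of documents. Only the number of columns, the vocabulary size, changes.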