1
00:00:11,130 --> 00:00:16,890
So in this lecture, I'm going to walk you through the exercise for classifying the SMS spam data set

2
00:00:17,370 --> 00:00:20,690
and a few extra things that I think will help you understand the data.

3
00:00:21,330 --> 00:00:24,990
So hopefully you had a chance to try this out before watching this lecture.

4
00:00:26,430 --> 00:00:29,700
So the first thing we're going to look at is our list of imports.

5
00:00:30,270 --> 00:00:32,250
Note that most of this should be familiar.

6
00:00:32,910 --> 00:00:39,150
Some notable additions are the ROC AUC score, F1 score, and the confusion matrix.

7
00:00:39,810 --> 00:00:44,850
In addition, we'll also be using the word cloud library, which will let us visualize the most common

8
00:00:44,850 --> 00:00:46,260
words from each class.

9
00:00:46,890 --> 00:00:50,700
This will be helpful in understanding why our model behaves the way it does.

10
00:00:51,090 --> 00:00:54,510
In other words, why it thinks an email is spam or not spam.

11
00:01:01,270 --> 00:01:03,100
The next step is to download the data.

12
00:01:09,450 --> 00:01:14,100
Next, I'm going to load in the data as usual. Since we have a CSV,

13
00:01:14,130 --> 00:01:16,920
we're going to use the pandas read_csv function.

14
00:01:17,730 --> 00:01:23,490
Also notice I'm passing in a special encoding here called ISO-8859-1.

15
00:01:24,240 --> 00:01:29,730
Knowing what this is isn't important, but the reason why we need it is because the CSV contains some

16
00:01:29,730 --> 00:01:30,960
invalid characters.

17
00:01:31,530 --> 00:01:36,510
So if you tried to use the default encoding, you're going to get an error that there's a character

18
00:01:36,510 --> 00:01:38,260
that's not valid in UTF-8.

19
00:01:39,030 --> 00:01:43,740
Generally, when you're working with text, you're going to come across invalid characters, especially

20
00:01:43,740 --> 00:01:47,880
these days with the existence of emojis and other non-standard symbols.
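
To make the encoding point concrete, here is a minimal sketch using only built-in Python byte decoding. The sample bytes are a made-up example, not taken from the data set: they contain the byte 0xA3, which is the pound sign in ISO-8859-1 but is not valid UTF-8 on its own.

```python
# Hypothetical raw bytes: 0xA3 is the pound sign in ISO-8859-1,
# but by itself it is not a valid UTF-8 sequence.
raw = b"Win \xa3100 now"

# Decoding as UTF-8 raises an error on the invalid byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every single byte to a character, so it never fails.
latin1_text = raw.decode("iso-8859-1")

# Alternatively, keep UTF-8 but silently drop the bytes that cannot be decoded.
ignored_text = raw.decode("utf-8", errors="ignore")

print(utf8_ok)
print(latin1_text)
print(ignored_text)
```

The last option mirrors the idea of ignoring decode errors: the invalid byte simply disappears from the text.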

21
00:01:53,310 --> 00:01:57,510
OK, so the next step is to do df.head() to see what our data frame looks like.

22
00:02:02,640 --> 00:02:07,620
So as you may have expected, we have two columns, one for the labels and one for the text.

23
00:02:08,970 --> 00:02:14,220
Now, for some reason, after you load in this CSV, we are going to see some extra columns that are

24
00:02:14,220 --> 00:02:14,940
just empty.

25
00:02:15,540 --> 00:02:20,430
So what we're going to do is clean up this data frame a little bit by dropping these columns.

26
00:02:23,700 --> 00:02:27,060
So here we have some code to drop the unwanted columns.

27
00:02:31,300 --> 00:02:35,080
The next step is to do df.head() again to see what the result is.

28
00:02:37,940 --> 00:02:41,270
OK, so now you can see that our unwanted columns are gone.

29
00:02:45,840 --> 00:02:51,540
The next step is to rename the columns we kept, which are currently just v1 and v2, which are not

30
00:02:51,540 --> 00:02:55,800
really descriptive names, so we're going to rename them to labels and data.

31
00:03:00,150 --> 00:03:05,760
The next step is to do df.head() again to ensure that our data frame is in the format we want it to be

32
00:03:05,760 --> 00:03:06,090
in.

33
00:03:08,920 --> 00:03:12,400
OK, so now our data frame looks a lot better than it was before.

34
00:03:16,910 --> 00:03:19,640
The next step is to draw a histogram of our labels.

35
00:03:20,090 --> 00:03:23,570
This will help us determine whether or not we have imbalanced classes.

36
00:03:28,080 --> 00:03:33,690
OK, so as you can see, in this case, we do have imbalanced classes. In particular,

37
00:03:33,720 --> 00:03:40,470
ham is much more common than spam, so it will make sense to look at other metrics such as the F1 score

38
00:03:40,470 --> 00:03:41,580
and the AUC.

39
00:03:45,650 --> 00:03:51,920
So the next step is to create a new column called b_labels, which assigns a value of zero to ham and

40
00:03:51,920 --> 00:03:53,480
a value of one to spam.

41
00:03:54,230 --> 00:03:57,360
We're also going to extract this column as a NumPy array.

42
00:03:58,490 --> 00:04:01,820
Note that you probably don't need to do this since, last I remember,

43
00:04:02,120 --> 00:04:06,860
scikit-learn accepts data frames and series as input arguments, and it allows your labels to be

44
00:04:06,860 --> 00:04:07,520
strings.

45
00:04:08,030 --> 00:04:12,650
But if you were to write your own model, you'd want the labels to be represented numerically.

46
00:04:12,950 --> 00:04:15,470
So in general, it's good practice to do this.
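
As a rough sketch of the last two steps, here is the same idea in plain Python: count the classes to check for imbalance, then map the string labels to integers. The toy label list and the variable names are assumptions for illustration. The lecture does this on a data frame column.

```python
from collections import Counter

# Toy labels standing in for the labels column (hypothetical data).
labels = ["ham", "ham", "spam", "ham", "spam", "ham"]

# Counting each class is the numeric version of the histogram.
class_counts = Counter(labels)

# Map the string labels to integers: ham -> 0, spam -> 1.
label_to_int = {"ham": 0, "spam": 1}
binary_labels = [label_to_int[label] for label in labels]

print(class_counts)
print(binary_labels)
```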

47
00:04:20,480 --> 00:04:25,360
The next step is to call train_test_split to split up our data into train and test.

48
00:04:26,930 --> 00:04:32,420
Note that you could even do cross validation so you can calculate on average how well the model does

49
00:04:32,420 --> 00:04:33,650
on the validation set.

50
00:04:34,130 --> 00:04:37,490
That's probably a more accurate measure than just one train-test split.

51
00:04:44,010 --> 00:04:49,500
OK, so the next step is to create our X matrix, which contains the input features for every sample.

52
00:04:50,220 --> 00:04:51,180
For this experiment,

53
00:04:51,210 --> 00:04:56,160
I'm going to use the CountVectorizer from scikit-learn, which just gives you the raw counts.

54
00:04:56,700 --> 00:04:59,520
I have code here for TF-IDF that's commented out,

55
00:04:59,880 --> 00:05:06,360
in case you want to try that too. Notice how I'm passing in the decode_error argument set to ignore.

56
00:05:06,900 --> 00:05:12,750
This is because, as I mentioned earlier, if any invalid UTF-8 characters are found, we want to

57
00:05:12,750 --> 00:05:13,650
just ignore them.

58
00:05:20,630 --> 00:05:24,020
The next step is to just print out X train to see what we got back.

59
00:05:27,170 --> 00:05:32,630
OK, so we get back a sparse matrix instead of the usual NumPy array, where you can see all the numbers.

60
00:05:33,230 --> 00:05:33,960
As you recall,

61
00:05:33,980 --> 00:05:36,590
this is because the array contains a lot of zeros.

62
00:05:36,950 --> 00:05:40,520
And so this is the most efficient format to store the data in.
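
To see why a sparse format pays off, here is a small sketch of raw word counts built by hand. It is not how the library stores things internally. It only illustrates the idea of keeping the non-zero entries, and the three messages are made up.

```python
from collections import Counter

# Toy messages (hypothetical); each one uses only a few vocabulary words.
messages = ["free call now", "call me now now", "ok see you"]

# Build a vocabulary and assign each word a column index.
vocabulary = sorted({word for msg in messages for word in msg.split()})
word_index = {word: j for j, word in enumerate(vocabulary)}

# Sparse storage: for each message, keep only {column index: count}.
sparse_rows = []
for msg in messages:
    counts = Counter(msg.split())
    sparse_rows.append({word_index[word]: count for word, count in counts.items()})

# Dense storage: one entry per (message, word) pair, mostly zeros.
dense_rows = [
    [row.get(j, 0) for j in range(len(vocabulary))]
    for row in sparse_rows
]

total_cells = len(messages) * len(vocabulary)
nonzero_cells = sum(len(row) for row in sparse_rows)
print(nonzero_cells, "non-zero entries out of", total_cells)
```

With a real vocabulary of thousands of words, the share of non-zero entries becomes far smaller than in this toy case.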

63
00:05:45,160 --> 00:05:50,020
OK, so now that we have our data, training the model and evaluating the model is super simple.

64
00:05:50,350 --> 00:05:55,300
We just call the object's constructor, then we call the fit function, and then we call the score function

65
00:05:55,300 --> 00:05:56,680
for both training and test.

66
00:06:00,780 --> 00:06:04,980
OK, so we do pretty well, about 98 percent accuracy on the test set.

67
00:06:05,940 --> 00:06:11,940
One interesting thing you can try later is to use TF-IDF or use a different classifier and see if the

68
00:06:11,940 --> 00:06:12,900
results improve.

69
00:06:17,080 --> 00:06:23,740
OK, so recall that our classes are imbalanced and thus accuracy may not be the best measure of performance.

70
00:06:24,370 --> 00:06:27,730
It could be the case that our model is just predicting ham all the time,

71
00:06:28,030 --> 00:06:33,670
since that is the over-represented class. In order to make sure that this is not the case, we should

72
00:06:33,670 --> 00:06:34,960
check other metrics.

73
00:06:35,590 --> 00:06:39,490
The first alternative metric we're going to try is the F1 score.

74
00:06:40,270 --> 00:06:45,010
To do this, we're first going to get our model's predictions for both train and test, which we'll

75
00:06:45,010 --> 00:06:46,930
call P train and P test.

76
00:06:47,590 --> 00:06:52,450
Note that we've assigned these to variables since they will be used again later in this notebook.

77
00:06:53,740 --> 00:06:59,170
The next step is to call the F1 score function, passing in the targets and the corresponding predictions

78
00:06:59,410 --> 00:07:00,790
for both train and test.

79
00:07:05,430 --> 00:07:11,220
OK, so as you can see, the F1 score is in the 90s for both train and test, which is a good sign that

80
00:07:11,220 --> 00:07:17,550
our model is performing well for both classes. It's another way of saying that both precision and recall

81
00:07:17,550 --> 00:07:23,070
are good since, as you recall, the F1 score is the harmonic mean of these two.
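
Here is a minimal sketch of that harmonic mean, written by hand rather than with the library function. The helper name and the toy targets are assumptions for illustration. It also shows why accuracy alone can mislead when classes are imbalanced: a model that always predicts ham still scores 80 percent accuracy on this toy data, but its F1 score for spam is zero.

```python
def precision_recall_f1(targets, predictions, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy imbalanced targets: 8 ham (0) and 2 spam (1).
targets = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that predicts ham every time.
always_ham = [0] * len(targets)
accuracy_always_ham = sum(t == p for t, p in zip(targets, always_ham)) / len(targets)
_, _, f1_always_ham = precision_recall_f1(targets, always_ham)

# A model that catches one spam, misses one, and raises one false alarm.
mixed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
precision_mixed, recall_mixed, f1_mixed = precision_recall_f1(targets, mixed)

print(accuracy_always_ham, f1_always_ham)
print(precision_mixed, recall_mixed, f1_mixed)
```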

82
00:07:27,420 --> 00:07:33,360
In the next block, we're also going to check the AUC as yet another alternative measure of performance.

83
00:07:34,050 --> 00:07:34,870
As you recall,

84
00:07:34,890 --> 00:07:40,860
this function requires our model's posterior probabilities, so we need to call the predict_proba function.

85
00:07:41,610 --> 00:07:44,130
In addition, we want the column at index one,

86
00:07:44,400 --> 00:07:50,760
since these represent the probabilities for class one. We'll assign these to the variable names prob train

87
00:07:50,760 --> 00:07:51,810
and prob test.

88
00:07:53,700 --> 00:07:59,610
The next step is to call the roc_auc_score function, passing in the targets along with the corresponding

89
00:07:59,610 --> 00:08:02,370
model probabilities for both train and test.

90
00:08:06,770 --> 00:08:12,020
So as you can see, our model does very well on this measure, obtaining close to one for both train and

91
00:08:12,020 --> 00:08:12,590
test.

92
00:08:16,240 --> 00:08:21,370
The next step is to see if we can get a more fine-grained view of our model's performance by looking at

93
00:08:21,370 --> 00:08:22,840
the confusion matrix.

94
00:08:23,560 --> 00:08:27,610
Note that we don't have to compute these ourselves since there's a function in scikit-learn

95
00:08:27,610 --> 00:08:29,530
that can do this for us.

96
00:08:30,070 --> 00:08:33,970
However, if you did want to compute this yourself, it would be quite simple.

97
00:08:34,090 --> 00:08:37,240
So please try that as an exercise if you would like.
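
One possible way to approach that exercise is sketched below for the two-class case. The function name and the toy data are assumptions. The layout follows the convention that rows are the true labels and columns are the predictions, which is the same layout scikit-learn's confusion_matrix uses.

```python
def confusion_matrix_2x2(targets, predictions):
    """Count (true label, predicted label) pairs for labels 0 and 1."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(targets, predictions):
        cm[t][p] += 1  # row = true label, column = predicted label
    return cm


# Toy data: 0 = ham, 1 = spam (hypothetical).
targets = [0, 0, 0, 1, 1, 0, 1, 0]
predictions = [0, 0, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix_2x2(targets, predictions)
print(cm)
```

The diagonal entries cm[0][0] and cm[1][1] count the correct predictions, and the off-diagonal entries count the two kinds of mistake.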

98
00:08:41,210 --> 00:08:46,640
OK, so as you can see, this only returns an array, which is not that useful to look at.

99
00:08:47,120 --> 00:08:52,340
You can probably guess based on these numbers what each entry means, but that wouldn't be ideal.

100
00:08:56,400 --> 00:09:00,000
Instead, a better method is to plot the confusion matrix.

101
00:09:00,600 --> 00:09:04,830
Now, normally you can do this in scikit-learn, and that is the case right now.

102
00:09:05,430 --> 00:09:11,640
However, you'll notice this comment I made here, which explains the situation. Basically, at the time

103
00:09:11,640 --> 00:09:14,640
I've made this lecture, scikit-learn version 1 is out.

104
00:09:15,030 --> 00:09:20,910
But as is typical with these data science Python libraries, these updates come with breaking changes.

105
00:09:21,630 --> 00:09:27,870
Unfortunately, the confusion matrix is one of those breaking changes as the function to plot the confusion

106
00:09:27,870 --> 00:09:29,580
matrix has been moved.

107
00:09:30,390 --> 00:09:36,390
Unfortunately, this new version of scikit-learn is not currently easy to install in Colab, so we

108
00:09:36,390 --> 00:09:39,390
won't be looking at the new way to do this in this lecture.

109
00:09:40,290 --> 00:09:46,080
Now, breaking changes tend to always confuse a lot of beginners, so it's also not good to use the

110
00:09:46,080 --> 00:09:47,220
current version either.

111
00:09:48,090 --> 00:09:52,980
The solution for this notebook is to simply implement a confusion matrix plot ourselves

112
00:09:53,310 --> 00:09:59,190
with the help of Seaborn and pandas. As an exercise, I would recommend looking up the new way to do

113
00:09:59,190 --> 00:10:04,140
this in scikit-learn as of version 1 and to try that out on your own machine.

114
00:10:05,190 --> 00:10:10,260
In any case, you can see that we've defined a function called plot_cm, which takes in a confusion

115
00:10:10,260 --> 00:10:11,610
matrix as input.

116
00:10:12,360 --> 00:10:16,920
Inside the function, we define a list of classes which are just ham and spam.

117
00:10:18,490 --> 00:10:23,950
The next step is to convert the confusion matrix into a pandas data frame, which is useful because

118
00:10:24,250 --> 00:10:26,800
both the columns and the rows can have names.

119
00:10:27,400 --> 00:10:32,410
In particular, we are going to name them by their class names, which will show up on our plot.

120
00:10:34,490 --> 00:10:40,490
The next step is to call the Seaborn heatmap function with several arguments to make it show annotations

121
00:10:40,730 --> 00:10:42,500
and to format the numbers correctly.

122
00:10:43,670 --> 00:10:49,490
The final step in this function is to label the axes, so we know which one corresponds to the prediction

123
00:10:49,760 --> 00:10:51,800
and which one corresponds to the target.

124
00:10:52,580 --> 00:10:58,100
After defining the plot_cm function, we're going to call it on our train confusion

125
00:10:58,100 --> 00:10:58,820
matrix.

126
00:11:06,470 --> 00:11:12,620
OK, so as you can see, it looks about right. On the diagonal, which represents correct predictions,

127
00:11:12,860 --> 00:11:16,240
we have the largest numbers. On the off-diagonals,

128
00:11:16,250 --> 00:11:17,990
we have relatively small numbers.

129
00:11:18,590 --> 00:11:24,740
Note that this goes for both classes, so we rarely predict ham as spam and we rarely predict spam as

130
00:11:24,740 --> 00:11:25,220
ham.

131
00:11:30,060 --> 00:11:33,660
The next step is to plot the confusion matrix for the test set as well.

132
00:11:39,500 --> 00:11:44,750
So again, we see the same pattern where most of the predictions are correct for both classes.

133
00:11:45,290 --> 00:11:49,280
This makes sense in light of our high F1 scores and AUCs.

134
00:11:54,670 --> 00:11:59,650
So in this example, because we have access to the raw data, there's some interesting stuff we can

135
00:11:59,650 --> 00:12:00,100
do.

136
00:12:00,820 --> 00:12:03,160
So I've created this function called Visualize.

137
00:12:03,880 --> 00:12:06,280
What this is going to do is create a word cloud.

138
00:12:06,880 --> 00:12:11,170
Basically, that's a picture where more frequent words appear larger and less frequent

139
00:12:11,170 --> 00:12:12,280
words appear smaller.

140
00:12:12,850 --> 00:12:18,430
So for each of the classes, we'll be able to see what the most common words in a spam message are and

141
00:12:18,700 --> 00:12:20,860
what the most common words in a ham message are.

142
00:12:21,520 --> 00:12:24,970
So this function takes in a label, which can be either ham or spam.

143
00:12:25,570 --> 00:12:30,190
We then grab only the rows that contain that label, and we loop through the data column.

144
00:12:31,300 --> 00:12:35,830
We keep a running string of each message we encounter and just concatenate them together.

145
00:12:36,400 --> 00:12:40,450
This is because that's what the word cloud library expects to get as its input.

146
00:12:41,890 --> 00:12:45,250
So once we create the word cloud, we then use Matplotlib to show it.

147
00:12:46,750 --> 00:12:47,950
So let's run this block.

148
00:12:53,100 --> 00:12:55,320
OK, so let's first call this on the spam data.

149
00:13:00,840 --> 00:13:09,150
OK, so we see that some of the common spam words are text, call, now, free, mobile, call, text, and so forth.

150
00:13:09,360 --> 00:13:10,830
So that makes a lot of sense.

151
00:13:14,840 --> 00:13:16,970
So now let's call our function on the ham data.

152
00:13:23,810 --> 00:13:28,670
So here we see that the common words for ham are love, will,

153
00:13:28,700 --> 00:13:31,720
OK, now, go, and so forth.

154
00:13:31,730 --> 00:13:34,340
So it's a lot different than the spam messages.

155
00:13:41,130 --> 00:13:46,530
OK, so in this next block of code, we're going to do a simple analysis to figure out what our model

156
00:13:46,530 --> 00:13:47,430
is getting wrong.

157
00:13:48,180 --> 00:13:52,730
There shouldn't be too many examples since we're getting about 98 to 99 percent accuracy.

158
00:13:53,400 --> 00:13:58,800
So we create a new column called Predictions, and we set it to the predictions generated by our trained

159
00:13:58,800 --> 00:13:59,400
model.

160
00:14:10,000 --> 00:14:12,970
The next step is to create a variable called sneaky spam.

161
00:14:13,420 --> 00:14:17,080
It's sneaky spam because it's able to bypass our spam filter.

162
00:14:17,920 --> 00:14:23,740
So to do this, I need to filter the data frame by selecting any rows where the prediction is zero,

163
00:14:23,750 --> 00:14:25,150
but the true label is one.

164
00:14:26,170 --> 00:14:28,660
So we use the element-wise AND operation.

165
00:14:29,110 --> 00:14:34,090
So if you didn't know how to do this, now you know. Then we loop through the data field and print each

166
00:14:34,090 --> 00:14:34,720
message.

167
00:14:35,500 --> 00:14:37,900
So here are some sneaky spam messages.

168
00:14:40,730 --> 00:14:44,030
OK, so we have free message, Hey there, darling.

169
00:14:44,570 --> 00:14:46,790
Did you hear about the new divorce Barbie?

170
00:14:47,810 --> 00:14:49,700
Do you realize that in about 40 years?

171
00:14:53,920 --> 00:14:59,800
OK, so you can see from this that some of these are obviously spam and some of them are not quite obviously

172
00:14:59,800 --> 00:15:00,250
spam.

173
00:15:00,970 --> 00:15:02,770
So overall, it's a mixture of the two.

174
00:15:09,980 --> 00:15:14,840
OK, so in the next block of code, we're going to create another variable called not actually spam,

175
00:15:15,170 --> 00:15:20,870
since these are messages that our spam classifier detects as spam, but are actually legitimate messages.

176
00:15:21,320 --> 00:15:24,770
So it's the same process as before, except I've switched the zero and the one.
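
The two filters can be sketched in plain Python as follows. The rows, messages, and field names here are invented for illustration. In pandas the same condition is written with the element-wise & operator, with each comparison wrapped in parentheses.

```python
# Toy rows standing in for the data frame (hypothetical messages).
rows = [
    {"data": "Free entry, text WIN now", "label": 1, "prediction": 1},
    {"data": "Hey there, did you hear the news?", "label": 1, "prediction": 0},
    {"data": "See you at lunch", "label": 0, "prediction": 0},
    {"data": "Unlimited texts, limited minutes", "label": 0, "prediction": 1},
]

# Sneaky spam: predicted ham (0) but truly spam (1).
sneaky_spam = [
    row["data"] for row in rows
    if row["prediction"] == 0 and row["label"] == 1
]

# Not actually spam: predicted spam (1) but truly ham (0).
not_actually_spam = [
    row["data"] for row in rows
    if row["prediction"] == 1 and row["label"] == 0
]

for msg in sneaky_spam:
    print("sneaky spam:", msg)
for msg in not_actually_spam:
    print("not actually spam:", msg)
```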

177
00:15:26,270 --> 00:15:28,800
So here are the not actually spam messages.

178
00:15:34,490 --> 00:15:41,960
OK, so what's very interesting about this is that apparently these are not spam, but if you read these

179
00:15:41,960 --> 00:15:46,190
messages like, Hey, great deal farm tour, blah blah blah.

180
00:15:47,780 --> 00:15:52,070
OK, so for a lot of these, you can actually understand why these might be considered spam.

181
00:15:52,640 --> 00:15:55,520
It actually makes a lot of sense why these would be misclassified.

182
00:15:55,820 --> 00:15:59,270
So, for example, unlimited texts, limited minutes.

183
00:15:59,780 --> 00:16:04,700
So that could be something that was actually spam or something that your friend is sending you to give

184
00:16:04,700 --> 00:16:05,660
you some information.

185
00:16:06,800 --> 00:16:11,780
You also have stuff like this at the bottom where you know, we have send aid for customer service.

186
00:16:12,530 --> 00:16:15,260
So that actually looks like a spam message.

187
00:16:15,920 --> 00:16:17,990
So perhaps some of the rows are mislabeled.