1
00:00:11,110 --> 00:00:15,580
So in this lecture, we will be looking at how to implement sentiment analysis.

2
00:00:18,240 --> 00:00:21,390
We'll begin by downloading our data set, as we normally do.

3
00:00:22,350 --> 00:00:25,470
Note that this data set is a data set of airline tweets.

4
00:00:25,980 --> 00:00:29,790
So these are tweets that Twitter users are making to an airline.

5
00:00:30,570 --> 00:00:35,850
Based on this, you might be able to guess whether most of these tweets are positive, negative or neutral.

6
00:00:44,090 --> 00:00:47,540
The next step is to import everything we need for this notebook.

7
00:00:48,140 --> 00:00:51,650
Most of this stuff you should have seen before, so there's not much to say.

8
00:00:58,030 --> 00:01:04,180
The next step is to load in our data using pd.read_csv. Note that I've given this the variable

9
00:01:04,180 --> 00:01:08,830
name df_, since we'll be filtering this data frame shortly.

10
00:01:14,080 --> 00:01:17,890
The next step is to call the head command to see what our data frame looks like.

11
00:01:22,540 --> 00:01:26,350
OK, so notice that we have many columns, most of which are not needed.

12
00:01:26,950 --> 00:01:32,530
The columns that we will make use of are the sentiment column and the column called text, which contains

13
00:01:32,530 --> 00:01:33,460
the tweet itself.

14
00:01:34,600 --> 00:01:39,010
Please feel free to look through this yourself to see if there's anything else you'd like to use.

15
00:01:39,910 --> 00:01:44,680
One column that would have been interesting to use is the sentiment confidence.

16
00:01:45,280 --> 00:01:50,250
This one is interesting since some models allow you to apply weights to each data sample.

17
00:01:51,010 --> 00:01:53,470
By default, these weights are usually uniform.

18
00:01:54,190 --> 00:01:59,200
But in this case, since the confidence score differs from sample to sample, it might make sense to

19
00:01:59,200 --> 00:02:01,270
weight the samples by this confidence.

20
00:02:01,870 --> 00:02:06,220
In that way, the model will pay less attention to samples with low confidence.

21
00:02:06,850 --> 00:02:11,650
In any case, we are not doing that in this notebook, but it might be a nice exercise for you to try

22
00:02:11,650 --> 00:02:12,370
by yourself.

23
00:02:16,480 --> 00:02:22,360
The next step is to keep only the columns we want, which are airline_sentiment and text, as mentioned.

24
00:02:23,350 --> 00:02:25,450
Note that we will call the result df.

25
00:02:30,760 --> 00:02:34,570
The next step is to call the head function to see what our new data frame looks like.

26
00:02:39,740 --> 00:02:45,140
So as you can see, we now have the usual two-column data frame that we expect for this type of

27
00:02:45,140 --> 00:02:45,740
problem.

28
00:02:49,300 --> 00:02:54,820
The next step is to plot a histogram of the labels to check whether or not we have any class imbalance.

29
00:02:59,490 --> 00:03:06,480
OK, so it does appear that we have class imbalance. In particular, the negative class is overrepresented.

30
00:03:07,350 --> 00:03:13,530
This makes sense since users often use Twitter to complain to companies about their bad experiences.

31
00:03:14,130 --> 00:03:19,830
So because of this class imbalance, we should apply techniques like plotting the confusion matrix as

32
00:03:19,830 --> 00:03:21,420
well as computing the AUC.

33
00:03:26,130 --> 00:03:31,410
The next step is to convert the class labels into the conventional format, which are integers from

34
00:03:31,410 --> 00:03:32,910
zero up to K minus one.

35
00:03:33,750 --> 00:03:39,570
As you can see, this simply means I'm going to create a dictionary mapping from string label to integer.

36
00:03:40,500 --> 00:03:45,810
Once I've done that, I can call the map function on the data frame column, which I want to apply this

37
00:03:45,810 --> 00:03:46,620
mapping to.

38
00:03:51,460 --> 00:03:56,320
The next step is to call df.head again to check that the targets were created correctly.

39
00:04:00,040 --> 00:04:03,130
OK, so this confirms that our targets look as expected.

40
00:04:07,210 --> 00:04:10,270
The next step is to split our data into train and test.

41
00:04:15,120 --> 00:04:19,649
The next step is to call the head function on our train data set to make sure that it looks like we

42
00:04:19,649 --> 00:04:20,310
expect.

43
00:04:24,130 --> 00:04:27,880
As you can see, the rows have been shuffled, which is what we expect.

44
00:04:32,440 --> 00:04:35,920
The next step is to create a TF-IDF vectorizer object.

45
00:04:36,490 --> 00:04:41,860
Note that I've set max_features to two thousand, which will limit the vocabulary size of our model.

46
00:04:46,710 --> 00:04:52,190
The next step is to fit our vectorizer to the training data and to transform it into X_train.

47
00:04:57,820 --> 00:05:03,940
The next step is to print X_train to confirm that we get back a matrix of TF-IDF values.

48
00:05:08,410 --> 00:05:12,490
So as expected, we get a sparse matrix with two thousand columns.

49
00:05:15,890 --> 00:05:19,340
The next step is to transform the test data into X_test.

50
00:05:24,180 --> 00:05:29,040
The next step is to assign Y_train and Y_test, which come from the target column of their

51
00:05:29,040 --> 00:05:30,460
respective data frames.

52
00:05:35,210 --> 00:05:40,700
The next step is to build a logistic regression instance, train it, and check our model's performance.

53
00:05:42,170 --> 00:05:44,720
Note that I have set max_iter to 500 since,

54
00:05:44,720 --> 00:05:50,810
as I recall, I got a warning that the training process had not converged with the default value.

55
00:05:52,070 --> 00:05:57,410
As you know, the next step is to call fit and then to call score on both the train and test sets.

56
00:06:02,870 --> 00:06:08,840
OK, so as you can see, we get about 85 percent on the train set and about 80 percent on the test set.

57
00:06:09,890 --> 00:06:13,160
But remember, these numbers should not be looked at in isolation,

58
00:06:13,520 --> 00:06:15,380
since we have imbalanced classes.

59
00:06:19,620 --> 00:06:21,540
The next step is to check the AUC.

60
00:06:22,320 --> 00:06:28,740
Now, as you recall, the AUC is defined as the area under the ROC curve, which is defined in terms of binary

61
00:06:28,740 --> 00:06:29,460
detection.

62
00:06:30,270 --> 00:06:36,240
However, recall that in this problem, we have three classes, so let's proceed as we normally would

63
00:06:36,240 --> 00:06:38,310
by calling model.predict_proba.

64
00:06:39,240 --> 00:06:44,820
Now, normally for two classes, the return value would be an N-by-2 array and we would choose the

65
00:06:44,820 --> 00:06:46,110
column at index one.

66
00:06:46,950 --> 00:06:48,900
So you can see how I've commented that out.

67
00:06:49,710 --> 00:06:52,920
In this case, we're actually going to keep all the columns.

68
00:06:53,490 --> 00:06:58,710
This is necessary for the next step, which is to call the function roc_auc_score.

69
00:06:59,610 --> 00:07:04,890
You'll notice that I've passed in an additional argument called multi_class, and I've set it to

70
00:07:04,890 --> 00:07:05,370
ovo.

71
00:07:06,210 --> 00:07:08,820
So ovo stands for one versus one.

72
00:07:09,030 --> 00:07:11,640
This is as opposed to one versus rest.

73
00:07:12,270 --> 00:07:17,850
Essentially, these are two different strategies to build a multiclass model from a binary model.

74
00:07:18,840 --> 00:07:23,280
Now, it's not my goal in this lecture to go off on a tangent about how that's done at this point.

75
00:07:23,670 --> 00:07:27,240
But if you want a deeper understanding, please feel free to look it up.

76
00:07:27,720 --> 00:07:29,970
It is, in my opinion, kind of interesting.

77
00:07:30,720 --> 00:07:36,330
The main idea you should keep in mind is that there are ways to convert binary metrics into multiclass

78
00:07:36,330 --> 00:07:37,050
metrics.

79
00:07:42,760 --> 00:07:47,140
OK, so interestingly, the AUC values are much higher than the accuracies.

80
00:07:51,050 --> 00:07:54,590
OK, so the next step is to work on plotting the confusion matrix.

81
00:07:55,310 --> 00:08:01,040
So we'll start by getting our model predictions this time as class labels instead of probabilities.

82
00:08:06,150 --> 00:08:08,820
The next step is to call the confusion matrix function.

83
00:08:09,660 --> 00:08:14,640
Note that I have set normalize to true, which will make it so that each row of the matrix sums to

84
00:08:14,640 --> 00:08:15,120
one.

85
00:08:15,900 --> 00:08:21,270
This is helpful for interpreting the output, since the number of samples in each class varies a lot.

86
00:08:25,730 --> 00:08:29,150
OK, so as you can see, we get back a three-by-three matrix,

87
00:08:29,390 --> 00:08:30,380
as expected.

88
00:08:34,700 --> 00:08:38,630
The next step is to plot the confusion matrix, as we had previously done.

89
00:08:39,530 --> 00:08:44,690
Recall that the reason why we are not using scikit-learn, even though it has this functionality,

90
00:08:44,990 --> 00:08:50,420
is because they are in the middle of transitioning to a new version and this API is going to change.

91
00:08:51,140 --> 00:08:53,300
Luckily, this method is just as good.

92
00:08:59,400 --> 00:09:05,400
OK, so as you can see, our model does very well on the negative class, getting about 96 percent of

93
00:09:05,400 --> 00:09:08,380
samples correct. For the positive class,

94
00:09:08,400 --> 00:09:10,710
we only get about 70 percent correct.

95
00:09:11,400 --> 00:09:16,740
And interestingly, when we do not predict the positive class correctly, we actually end up predicting

96
00:09:16,740 --> 00:09:18,450
negative rather than neutral.

97
00:09:19,080 --> 00:09:23,940
So it doesn't seem to be a matter of simply confusing positive tweets with neutral tweets.

98
00:09:24,900 --> 00:09:30,690
Now, if we look at the neutral class, the performance here is the worst, at about 65 percent.

99
00:09:31,440 --> 00:09:32,100
In this case,

100
00:09:32,100 --> 00:09:37,950
again, when we are incorrect, we most often predict negative, which happens about 30 percent of

101
00:09:37,950 --> 00:09:38,640
the time.

102
00:09:39,480 --> 00:09:43,350
Thus, our model appears to have a bias towards the negative class.

103
00:09:43,950 --> 00:09:47,190
This makes sense since it is the overrepresented class.

104
00:09:51,810 --> 00:09:57,120
So the next step is to plot the confusion matrix for the test set, again setting normalize to true.

105
00:10:02,550 --> 00:10:08,010
OK, so in this case, the results reflect what we see for the train set, except that the performance

106
00:10:08,010 --> 00:10:09,030
is a bit worse.

107
00:10:09,720 --> 00:10:16,410
In particular, we now only get about 64 percent of the positive tweets correct and about 53 percent

108
00:10:16,620 --> 00:10:17,820
of the neutral tweets.

109
00:10:18,360 --> 00:10:21,660
And again, there is a large bias towards the negative class.

110
00:10:23,130 --> 00:10:28,480
Now, since this lecture has been quite long so far, we're going to take a break. In the next lecture,

111
00:10:28,500 --> 00:10:33,150
we'll continue by looking at the binary case, along with interpreting the model weights.