1
00:00:11,270 --> 00:00:15,890
So in this lecture, we will continue looking at our notebook for sentiment analysis.

2
00:00:17,360 --> 00:00:22,280
Let us begin by grabbing only the samples which have been classified as positive or negative.

3
00:00:23,030 --> 00:00:28,880
To do this, we're going to create a list of desired targets which contains the codes for both the positive

4
00:00:28,880 --> 00:00:31,440
and negative classes.

5
00:00:31,460 --> 00:00:34,190
As you recall, these are stored in the dictionary called target_map.

6
00:00:34,820 --> 00:00:37,130
We'll call the result binary_target_list.

7
00:00:39,000 --> 00:00:44,880
The next step is to filter df_train and df_test, where we only want the rows where the target

8
00:00:44,880 --> 00:00:51,570
is in this list of desired targets. We can accomplish this by using the function isin.

9
00:00:52,890 --> 00:00:58,050
Note that another way to do this would have been to use the not equals operator since we simply don't

10
00:00:58,050 --> 00:01:01,170
want any samples where the target is equal to neutral.
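The filtering step described here can be sketched as follows. This is a minimal illustration on a tiny made-up DataFrame: the column name `target`, the label codes in `target_map`, and the sample rows are assumptions for the example, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical label codes; the notebook's actual target_map may differ.
target_map = {"positive": 1, "negative": 0, "neutral": 2}

# Tiny stand-in for df_train (made-up rows).
df_train = pd.DataFrame({
    "text": ["great flight", "lost my bag", "flight at 5pm", "thanks crew"],
    "target": [1, 0, 2, 1],
})

binary_target_list = [target_map["positive"], target_map["negative"]]

# Option 1: keep only rows whose target is in the list of desired targets.
df_b_train = df_train[df_train["target"].isin(binary_target_list)]

# Option 2: drop rows whose target equals the neutral code.
df_b_alt = df_train[df_train["target"] != target_map["neutral"]]

# Both approaches select the same rows.
print(df_b_train.equals(df_b_alt))
print(df_b_train.head())
```

With only three classes the two options are equivalent; `isin` generalizes more easily if more classes are added later.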

11
00:01:06,700 --> 00:01:11,860
The next step will be to call the head function on one of our new data frames to check that they only

12
00:01:11,860 --> 00:01:14,100
contain positive and negative tweets.

13
00:01:19,260 --> 00:01:21,360
OK, so this appears to be correct.

14
00:01:25,320 --> 00:01:32,040
The next step is to convert our new data set into TF-IDF vectors. Since we already have a vectorizer

15
00:01:32,040 --> 00:01:32,640
object,

16
00:01:32,940 --> 00:01:39,510
we can simply reuse the same one from above. As before, we call the fit_transform function on the train

17
00:01:39,510 --> 00:01:42,420
set and then we call transform on the test set.
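A minimal sketch of the fit_transform / transform pattern, assuming scikit-learn's `TfidfVectorizer` with default settings. The example sentences and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the train and test text columns.
train_texts = ["great flight thanks", "flight delayed again", "worst luggage service"]
test_texts = ["thanks for the great service", "delayed luggage"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the train set only.
X_train = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary and IDF weights on the test set.
# Words that never appeared in the train set are ignored.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Calling only `transform` on the test set keeps the number of columns identical between the two matrices and avoids leaking test-set statistics into the features.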

18
00:01:47,540 --> 00:01:51,650
The next step is to obtain Y_train and Y_test from the target column.

19
00:01:56,280 --> 00:02:01,500
The next step is to train a new model on a binary data set using the same code as before.

20
00:02:02,250 --> 00:02:04,470
Again, we'll print the accuracy to start.

21
00:02:08,990 --> 00:02:15,080
OK, so this time we've done much better. We get about 93 percent on the train set and about 91

22
00:02:15,080 --> 00:02:16,700
percent on the test set.

23
00:02:21,090 --> 00:02:23,190
The next step is to compute the AUC.

24
00:02:24,330 --> 00:02:30,930
Note that because we now have binary classes, we only need the column at index one from the probabilities

25
00:02:30,930 --> 00:02:33,120
returned by the predict_proba method.
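As a rough sketch of the binary AUC computation, assuming scikit-learn's `LogisticRegression` and `roc_auc_score`. The synthetic data here is an assumption for illustration and stands in for the TF-IDF features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary data (not the tweet data from the notebook).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# predict_proba returns one column per class, in the order of model.classes_.
proba = model.predict_proba(X)  # shape (N, 2)

# For binary labels 0/1, column index 1 is the probability of class 1.
p_pos = proba[:, 1]

auc = roc_auc_score(y, p_pos)
print("AUC:", auc)
```

In the multiclass case the full probability matrix is needed, which is why this simplification only applies once the neutral class has been removed.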

26
00:02:37,010 --> 00:02:40,760
OK, so notice that our AUCs are now much better than before.

27
00:02:45,090 --> 00:02:49,380
The next step in this notebook is to work on interpreting what our model has learned.

28
00:02:50,460 --> 00:02:54,180
We'll begin by demonstrating how to obtain the weights of a trained model.

29
00:02:58,830 --> 00:03:04,380
OK, so notice that we get back a two dimensional array where the first dimension has size one.

30
00:03:05,310 --> 00:03:10,020
Thus, if our model has D inputs, this would be an array of size one by D.
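A small sketch of the weight array's shape, assuming a scikit-learn `LogisticRegression` trained on binary labels. The random data and the choice of `D = 7` input features are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

D = 7  # number of input features (arbitrary for this example)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, D))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# For a binary problem, coef_ is a 2-D array with a single row.
print(model.coef_.shape)

# Taking row 0 gives a flat vector with one weight per input.
weights = model.coef_[0]
print(weights.shape)
```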

31
00:03:13,280 --> 00:03:18,950
The next step is to plot a histogram of the weights in order to get an idea of the range of values.

32
00:03:23,140 --> 00:03:26,200
So notice that most of the weights are centered around zero.

33
00:03:26,590 --> 00:03:30,400
While there are some outliers around two, three or even four.

34
00:03:35,140 --> 00:03:40,780
Now, as you recall, if we look at our model only, we won't know which words correspond to which inputs.

35
00:03:41,320 --> 00:03:46,400
Thus, it is necessary to obtain a word-to-index mapping.

36
00:03:46,420 --> 00:03:49,840
As you recall, this is stored in our TF-IDF vectorizer object.

37
00:03:50,500 --> 00:03:54,100
In particular, it's stored inside an attribute called vocabulary_.

38
00:03:58,300 --> 00:04:04,750
OK, so as expected, this returns a dictionary where the key is the word and the value is the index.
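A minimal sketch of reading the word-to-index mapping, assuming scikit-learn's `TfidfVectorizer`, which exposes it as the `vocabulary_` attribute after fitting. The two example sentences and the name `word_index_map` are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(["great flight", "delayed flight"])

# Dictionary mapping each word (key) to its column index (value).
word_index_map = vectorizer.vocabulary_
print(word_index_map)
```

The index of a word is the column of the TF-IDF matrix, and therefore also the position of that word's weight in the model's weight vector.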

39
00:04:09,950 --> 00:04:15,710
So in order to interpret our model, what we're going to do is find the most extreme words for each

40
00:04:15,710 --> 00:04:22,290
class. That is, find the words with the largest magnitude of weights.

41
00:04:22,310 --> 00:04:27,440
As you recall, these are the words that will have the largest effect on the output, provided that the input

42
00:04:27,440 --> 00:04:32,210
is the same. Since there aren't that many words with a weight larger than two,

43
00:04:32,480 --> 00:04:34,100
we'll set a threshold at two.

44
00:04:35,390 --> 00:04:39,340
Alternatively, you could use a more statistical method like percentiles.
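One possible way to set percentile-based thresholds, sketched with NumPy. The random weights and the choice of the 1st and 99th percentiles are assumptions for illustration; the lecture does not specify which percentiles to use.

```python
import numpy as np

# Made-up weights standing in for model.coef_[0].
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.5, size=1000)

# Treat the top 1% and bottom 1% of weights as "extreme".
upper = np.percentile(weights, 99)
lower = np.percentile(weights, 1)

n_pos = int((weights > upper).sum())
n_neg = int((weights < lower).sum())
print(upper, lower, n_pos, n_neg)
```

Unlike a fixed cutoff of two, this adapts to the scale of the weights, which changes with the amount of regularization.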

45
00:04:40,580 --> 00:04:44,090
The next step is to loop through each word in our word-to-index mapping.

46
00:04:45,350 --> 00:04:50,090
Note that on each iteration, we get the word and its index at the same time.

47
00:04:51,170 --> 00:04:56,180
Inside this loop, we'll index our model weight vector using the index called index.

48
00:04:56,420 --> 00:04:58,280
And this will give us a single weight.

49
00:04:59,360 --> 00:05:03,350
The next step is to check whether this weight is larger than a threshold.

50
00:05:03,950 --> 00:05:08,840
If this is the case, then we will print the word along with its corresponding weight.

51
00:05:16,110 --> 00:05:18,990
OK, so these results seem to make a lot of sense.

52
00:05:19,520 --> 00:05:21,250
We can see words like thanks,

53
00:05:21,270 --> 00:05:21,720
great,

54
00:05:21,870 --> 00:05:23,010
best, love,

55
00:05:23,040 --> 00:05:23,840
appreciate,

56
00:05:23,910 --> 00:05:24,540
awesome,

57
00:05:24,570 --> 00:05:25,200
kudos,

58
00:05:25,230 --> 00:05:25,950
amazing,

59
00:05:25,950 --> 00:05:26,820
and so forth.

60
00:05:27,600 --> 00:05:32,220
Interestingly, we also see some terms which correspond to the name of the airline itself.

61
00:05:33,030 --> 00:05:38,370
Perhaps it's simply the case that these are the airlines which are typically associated with positive

62
00:05:38,370 --> 00:05:43,140
sentiment, or that users tend to use these names more in positive tweets.

63
00:05:48,080 --> 00:05:50,720
So the next step is to look at the negative words.

64
00:05:51,500 --> 00:05:56,600
Note that this loop is the same as the previous loop, except that this time we check whether the weight

65
00:05:56,600 --> 00:05:59,180
is less than the negative of the threshold.

66
00:05:59,750 --> 00:06:03,950
That is, we're going to print the words for any weight which is less than minus two.
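The two loops can be sketched as follows. To keep the example deterministic, the word-to-index mapping and the weights are hand-written stand-ins for `vectorizer.vocabulary_` and `model.coef_`; the numbers are made up and are not the notebook's results.

```python
import numpy as np

# Hypothetical stand-ins: in the notebook these would come from the
# fitted vectorizer and the trained model.
word_index_map = {"thanks": 0, "great": 1, "flight": 2, "delayed": 3, "worst": 4}
coef = np.array([[2.7, 3.1, 0.1, -2.9, -3.4]])  # shape (1, D)

threshold = 2

positive_words = []
print("Most positive words:")
for word, index in word_index_map.items():
    weight = coef[0, index]  # single weight for this word
    if weight > threshold:
        positive_words.append(word)
        print(word, weight)

negative_words = []
print("Most negative words:")
for word, index in word_index_map.items():
    weight = coef[0, index]
    if weight < -threshold:
        negative_words.append(word)
        print(word, weight)
```

A word like "flight", whose weight is close to zero, appears in neither list because it carries little information about sentiment.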

67
00:06:10,420 --> 00:06:13,180
OK, so these words also make a lot of sense.

68
00:06:13,750 --> 00:06:16,750
We see words like hours, delayed, cancelled,

69
00:06:16,780 --> 00:06:18,140
nothing, worst,

70
00:06:18,160 --> 00:06:18,780
why,

71
00:06:18,790 --> 00:06:20,770
and luggage. Of course,

72
00:06:20,800 --> 00:06:26,620
terms like delayed, cancelled, and luggage are associated with common problems that people have when

73
00:06:26,620 --> 00:06:27,850
traveling by plane.

74
00:06:28,450 --> 00:06:32,650
So it's pretty clear that these words would be associated with negative sentiment.

75
00:06:38,280 --> 00:06:44,070
So now that you understand how to do sentiment analysis, the following exercises are designed to help

76
00:06:44,070 --> 00:06:45,270
you practice further.

77
00:06:46,110 --> 00:06:50,050
The first exercise is to check which tweets our model is getting wrong.

78
00:06:51,150 --> 00:06:55,350
In particular, you're going to print the most wrong tweets for both classes.

79
00:06:56,010 --> 00:07:01,710
In other words, this means find a negative review where the output probability is closest to one.

80
00:07:02,490 --> 00:07:07,560
This means, out of all the negative reviews our model got wrong, this one was the one it was most

81
00:07:07,560 --> 00:07:08,640
confident about.

82
00:07:09,330 --> 00:07:10,860
Print the probability as well.

83
00:07:11,820 --> 00:07:15,540
Similarly, you want to do the same process for the positive reviews.

84
00:07:15,930 --> 00:07:19,110
In other words, find the most wrong positive review.

85
00:07:21,680 --> 00:07:23,780
The second exercise is very simple.

86
00:07:24,590 --> 00:07:30,470
One easy way to help your model handle imbalanced classes is to simply weight the loss function.

87
00:07:31,370 --> 00:07:36,800
Now, since we've only discussed the intuition behind logistic regression, you may not know exactly

88
00:07:36,800 --> 00:07:37,790
what this means.

89
00:07:38,420 --> 00:07:42,830
But if you decide to study this topic further, this will make more sense.

90
00:07:43,730 --> 00:07:46,670
Luckily, the code for this variation is very simple.

91
00:07:47,600 --> 00:07:53,330
When you instantiate your model, you're going to set an argument called class_weight to the string 'balanced'.

92
00:07:54,110 --> 00:07:57,710
This will weight the loss function based on the frequency of each class.

93
00:07:58,310 --> 00:08:01,940
So try this and observe how the confusion matrix changes.
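A rough starting point for this exercise, assuming scikit-learn's `LogisticRegression`, which accepts `class_weight='balanced'`. The synthetic two-class data below is an assumption chosen to make the effect visible; the notebook's data has three classes and will give different numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score

# Synthetic imbalanced data: 900 majority samples, 100 minority samples.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, size=900),
                    rng.normal(1.5, 1.0, size=100)]).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Rows of the confusion matrix are true classes, columns are predictions.
cm_plain = confusion_matrix(y, plain.predict(X))
cm_balanced = confusion_matrix(y, balanced.predict(X))
print(cm_plain)
print(cm_balanced)

# Recall per class: index 0 is the majority, index 1 is the minority.
recall_plain = recall_score(y, plain.predict(X), average=None)
recall_balanced = recall_score(y, balanced.predict(X), average=None)
print(recall_plain, recall_balanced)
```

Comparing the two recall vectors shows the trade-off: the minority class is recovered more often with balanced weights, typically at the cost of more mistakes on the majority class.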

94
00:08:02,690 --> 00:08:08,090
You should find that the model does better on the classes which were underrepresented, meaning the

95
00:08:08,090 --> 00:08:10,250
positive class and the neutral class.

96
00:08:11,000 --> 00:08:14,360
Of course, this improvement doesn't necessarily come for free.

97
00:08:15,080 --> 00:08:20,690
You may find that doing this also leads to the model performing worse on the majority class, which

98
00:08:20,690 --> 00:08:22,550
corresponds to negative sentiment.