1
00:00:11,110 --> 00:00:16,660
So in this lecture, I'm going to give you an official exercise prompt for this section, which is to

2
00:00:16,660 --> 00:00:23,050
implement sentiment analysis. As with the other exercises in this course, please feel free to look

3
00:00:23,050 --> 00:00:26,030
at the official notebook in order to get the data set.

4
00:00:26,050 --> 00:00:29,020
But please do not cheat by looking at the whole solution.

5
00:00:29,890 --> 00:00:31,570
So the exercise can be described

6
00:00:31,570 --> 00:00:38,080
quite simply. What you are going to get is a data set of tweets which are labeled as positive, negative

7
00:00:38,080 --> 00:00:38,740
or neutral.

8
00:00:39,610 --> 00:00:45,400
Your job, of course, is to build a classifier and assess its accuracy on both the train and test sets.

9
00:00:46,180 --> 00:00:51,940
Note that because the data set is just a single file, you will need to split the data into train and

10
00:00:51,940 --> 00:00:52,870
test yourself.

11
00:00:53,920 --> 00:00:57,580
Also note that this dataset comes with columns which you do not need.

12
00:00:58,210 --> 00:01:02,800
Therefore, you'll need to look at the data carefully to choose the right columns for the text and the

13
00:01:02,800 --> 00:01:03,520
labels.
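
As a rough sketch of this step, the snippet below keeps only the text and label columns and then splits the data. The tiny DataFrame and its column names (`tweet_id`, `text`, `sentiment`, `timezone`) are hypothetical stand-ins, since the real file's columns are not described here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real file; all column names are assumptions.
df = pd.DataFrame({
    "tweet_id": range(8),
    "text": ["love it", "hate it", "it is ok", "great flight",
             "terrible delay", "on time", "awesome crew", "lost my bag"],
    "sentiment": ["positive", "negative", "neutral", "positive",
                  "negative", "neutral", "positive", "negative"],
    "timezone": ["UTC"] * 8,
})

# Keep only the columns needed for the text and the labels.
df = df[["text", "sentiment"]]

# Hold out 25% of the rows as the test set; random_state makes it repeatable.
train_text, test_text, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.25, random_state=42
)

print(len(train_text), "train rows,", len(test_text), "test rows")
```

On a real data set you would load the file first (for example with `pd.read_csv`) and inspect `df.columns` before deciding which columns to keep.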

14
00:01:08,000 --> 00:01:12,530
So let's go through some additional details that may help you complete the exercise.

15
00:01:13,400 --> 00:01:20,400
Firstly, note that it will be your choice which vectorization strategy you want to use. As you recall,

16
00:01:20,420 --> 00:01:22,610
this will include tokenization as well.

17
00:01:23,450 --> 00:01:28,640
You might choose the CountVectorizer with default settings or even TF-IDF with stop words and

18
00:01:28,640 --> 00:01:31,310
lemmatization and normalization and so forth.

19
00:01:31,820 --> 00:01:38,550
So that is up to you. As your classifier, you should use logistic regression, which is available inside

20
00:01:38,550 --> 00:01:39,230
scikit-learn.

21
00:01:39,860 --> 00:01:42,530
You may be interested in trying other classifiers as well.
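
One possible way to wire the vectorizer and the classifier together is sketched below. The toy sentences and labels are made up for illustration, and the `TfidfVectorizer` settings are just one choice among many, not a recommended configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data; in the exercise these come from your own train/test split.
train_text = ["I love this airline", "great flight and friendly crew",
              "worst delay ever", "they lost my luggage again",
              "flight leaves at noon", "boarding starts at gate four"]
y_train = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
test_text = ["friendly crew, great service", "luggage lost and long delay",
             "gate opens at noon"]
y_test = ["positive", "negative", "neutral"]

# TF-IDF with English stop words removed; CountVectorizer() with default
# settings could be swapped in here instead.
vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)
X_train = vectorizer.fit_transform(train_text)  # learn the vocabulary on train only
X_test = vectorizer.transform(test_text)        # reuse that vocabulary on test

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns the accuracy for classifiers.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The vectorizer is fit on the training text only, so nothing about the test tweets leaks into the vocabulary.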

22
00:01:43,760 --> 00:01:46,670
Finally, you want to check the performance of your model.

23
00:01:47,690 --> 00:01:53,120
Note that by default, when you call the score function inside scikit-learn, this returns the accuracy.

24
00:01:54,020 --> 00:01:59,000
However, recall that this is not an ideal scoring function when the classes are imbalanced.

25
00:02:00,200 --> 00:02:05,780
Thus, you should check whether the classes are imbalanced in order to determine if other scoring functions

26
00:02:05,780 --> 00:02:12,350
are necessary to use. Some examples of scoring functions that take into account class imbalance are the

27
00:02:12,350 --> 00:02:14,420
AUC and F1 score.

28
00:02:15,110 --> 00:02:17,990
In addition, you'll want to plot the confusion matrix.

29
00:02:19,040 --> 00:02:25,430
Now, one thing to pay attention to is that the AUC and F1 are not naturally designed for multiclass

30
00:02:25,430 --> 00:02:26,150
problems.

31
00:02:26,630 --> 00:02:32,000
However, scikit-learn does offer several alternative options for the multiclass case.

32
00:02:32,600 --> 00:02:36,770
So I encourage you to play around with these options and check out the results.
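
To make these checks concrete, here is a small sketch with hand-written labels and predicted probabilities (both hypothetical, not results from the actual data set). It counts the classes, then computes the confusion matrix, a macro-averaged F1, and a one-vs-rest AUC, which are two of the multiclass options mentioned above.

```python
from collections import Counter

import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

classes = ["negative", "neutral", "positive"]

# Hypothetical true labels; note that "negative" is over-represented.
y_true = ["negative", "negative", "negative", "negative",
          "neutral", "neutral", "positive", "positive"]

# Hypothetical predicted probabilities, columns ordered as in `classes`.
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])
y_pred = [classes[i] for i in proba.argmax(axis=1)]

# Check for class imbalance before trusting plain accuracy.
class_counts = Counter(y_true)
print("class counts:", class_counts)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)

# average="macro" weights each class equally; "weighted" is another option.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# multi_class="ovr" computes one-vs-rest AUC; "ovo" is the alternative.
auc_ovr = roc_auc_score(y_true, proba, multi_class="ovr", labels=classes)

print("macro F1:", macro_f1)
print("one-vs-rest AUC:", auc_ovr)
```

With a fitted model, `proba` would come from `model.predict_proba(X_test)` and `y_pred` from `model.predict(X_test)`.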

33
00:02:41,410 --> 00:02:47,920
Now, although the original data set has three classes, your next task will be to build another classifier

34
00:02:48,160 --> 00:02:51,550
which learns from only two classes: positive and negative.

35
00:02:52,450 --> 00:02:55,510
That is, you're going to filter out the neutral class.

36
00:02:56,320 --> 00:03:01,510
Note that the code for this will essentially be the same since the interface for scikit-learn works

37
00:03:01,510 --> 00:03:04,150
for both binary and multiclass data sets.
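
Filtering out the neutral class could look something like the sketch below. The column names `text` and `sentiment`, the label strings, and the new `target` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with the same three labels as the exercise.
df = pd.DataFrame({
    "text": ["love it", "hate it", "it is ok", "great flight", "on time"],
    "sentiment": ["positive", "negative", "neutral", "positive", "neutral"],
})

# Drop every neutral row, keeping only positive and negative tweets.
binary_df = df[df["sentiment"] != "neutral"].copy()

# Optional: map the labels to 1 (positive) and 0 (negative).
binary_df["target"] = (binary_df["sentiment"] == "positive").astype(int)

print(binary_df)
```

From here, the vectorizer and classifier code is unchanged; only the rows passed to it differ.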

38
00:03:05,780 --> 00:03:09,830
Firstly, what you should notice is that this greatly improves the results.

39
00:03:10,400 --> 00:03:16,130
This makes sense since the neutral class is in between positive and negative in terms of polarity.

40
00:03:16,910 --> 00:03:23,480
That is, we're likely to confuse neutral with the neighbouring classes and vice versa. Since we expect

41
00:03:23,480 --> 00:03:26,690
positive and negative tweets to be very different from each other,

42
00:03:27,080 --> 00:03:29,360
they should be easier to discriminate between.

43
00:03:30,650 --> 00:03:36,200
Finally, after you've trained your binary model and checked its performance, you should also interpret

44
00:03:36,200 --> 00:03:38,390
the weights. In particular,

45
00:03:38,420 --> 00:03:43,700
try printing out the words corresponding to the most positive and the most negative weights.

46
00:03:44,360 --> 00:03:49,460
I think what you'll find is that these are, for the most part, unsurprising, which is a good thing

47
00:03:49,460 --> 00:03:52,070
since it means our model is working as expected.

48
00:03:52,640 --> 00:03:55,310
So good luck and I'll see you in the next lecture.

