1
00:00:11,050 --> 00:00:17,050
So in this lecture, I will be giving you an official exercise prompt in preparation for the next lecture.

2
00:00:17,770 --> 00:00:22,660
As with the other exercises in this course, please feel free to look at the official notebook in order

3
00:00:22,660 --> 00:00:23,650
to get the data set.

4
00:00:23,950 --> 00:00:26,620
But please do not cheat by looking at the whole solution.

5
00:00:27,820 --> 00:00:34,330
So the exercise can be described quite simply: what you are going to get is a data set of SMS messages

6
00:00:34,600 --> 00:00:36,610
which are labeled as spam or not spam.

7
00:00:37,330 --> 00:00:42,580
Your job, of course, is to build a classifier and assess its accuracy on both the training and test

8
00:00:42,580 --> 00:00:43,090
sets.

9
00:00:43,930 --> 00:00:49,270
Note that because the data set is just a single file, you will need to split the data into train and

10
00:00:49,270 --> 00:00:50,170
test yourself.

11
00:00:54,960 --> 00:00:59,280
So let's go through some additional details that may help you complete the exercise.

12
00:01:00,180 --> 00:01:06,460
Firstly, note that it will be your choice which vectorization strategy you want to use. As you recall,

13
00:01:06,480 --> 00:01:08,820
this will include tokenization as well.

14
00:01:09,660 --> 00:01:15,900
You may choose the CountVectorizer with default settings or even TF-IDF with stop words and lemmatization and

15
00:01:15,900 --> 00:01:17,460
normalization and so forth.

16
00:01:17,910 --> 00:01:24,420
So that is up to you. As your classifier, you should choose an appropriate form of Naive Bayes, either writing

17
00:01:24,420 --> 00:01:25,980
it yourself or using scikit-learn,

18
00:01:25,980 --> 00:01:28,770
depending on how advanced you want to go.

19
00:01:29,790 --> 00:01:33,330
Furthermore, you should feel free to try other classifiers as well.

20
00:01:34,170 --> 00:01:36,930
Finally, you'll want to check the performance of your model.

21
00:01:37,650 --> 00:01:42,810
Note that by default, when you call the score function in scikit-learn, this returns the accuracy.

22
00:01:43,560 --> 00:01:48,180
However, recall that this is not an ideal scoring function when the classes are imbalanced.

23
00:01:48,840 --> 00:01:54,510
Thus, you should check whether the classes are imbalanced in order to determine if other scoring functions

24
00:01:54,510 --> 00:02:00,900
are necessary. Some examples of scoring functions that take into account class imbalance are the

25
00:02:00,900 --> 00:02:02,640
F1 score and the AUC.
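
To make the point about imbalanced classes concrete, here is a minimal sketch in plain Python rather than the course's official solution. The labels are made up for illustration (1 = spam, 0 = not spam), and the helper names `class_balance`, `accuracy`, and `f1_score` are hypothetical, written by hand here instead of taken from any library. It shows how a model that always predicts "not spam" can still score high accuracy while its F1 score is zero.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration: 9 "not spam" messages and 1 "spam" message.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A useless model that always predicts "not spam".
y_pred = [0] * len(y_true)

balance = class_balance(y_true)
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("class balance:", balance)
print("accuracy:", acc)
print("F1 score:", f1)
```

If the class fractions come out far from even on your own data, that is a hint to report something like the F1 score alongside the accuracy for both the train and test sets.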