1
00:00:11,040 --> 00:00:16,650
OK, so in this lecture, I'm going to give you an official exercise prompt in preparation for the next

2
00:00:16,650 --> 00:00:17,490
coding lecture.

3
00:00:18,300 --> 00:00:23,520
By the end of this lecture, you should know exactly what you need to do in order to complete the exercise

4
00:00:23,520 --> 00:00:26,220
yourself without any further assistance.

5
00:00:26,760 --> 00:00:31,950
So after watching this lecture, try to make sure that you don't watch the coding lecture without trying

6
00:00:31,950 --> 00:00:33,180
the exercise first.

7
00:00:33,750 --> 00:00:36,480
This lecture will give you a few extra pointers and tips.

8
00:00:36,870 --> 00:00:40,710
Although technically, you could do the exercise already at this point.

9
00:00:45,380 --> 00:00:48,860
OK, so just at a very high level, what is the exercise?

10
00:00:49,400 --> 00:00:55,520
The exercise is to take two sets of poems by two different authors, Edgar Allan Poe and Robert Frost.

11
00:00:56,090 --> 00:00:59,780
The URLs for these poems will be given to you in the coming notebooks.

12
00:01:00,230 --> 00:01:06,380
So don't try to find them yourself or to type in the URLs by hand, as some students sometimes try to

13
00:01:06,380 --> 00:01:06,770
do.

14
00:01:07,490 --> 00:01:12,170
Instead, it's perfectly OK to grab the URLs from the notebooks directly.

15
00:01:12,590 --> 00:01:14,210
Please don't feel bad about doing that.

16
00:01:15,140 --> 00:01:18,740
Obviously, that's as far as you should go in terms of looking at the notebook

17
00:01:19,100 --> 00:01:25,670
if you want to do the exercise without cheating. The next step is to build a text classifier to distinguish

18
00:01:25,670 --> 00:01:29,000
between the two authors, as described in the previous lecture.

19
00:01:29,630 --> 00:01:33,650
This will involve building two separate Markov models, one for each of the poets.

20
00:01:34,520 --> 00:01:39,860
Now, after building the model, you'll want to check how well the model performs by computing the train

21
00:01:39,860 --> 00:01:41,060
and test accuracy.

22
00:01:41,960 --> 00:01:47,420
Furthermore, since the classes may be imbalanced, you might also want to use a metric that takes this

23
00:01:47,420 --> 00:01:50,000
into account, such as the F1 score.

24
00:01:50,840 --> 00:01:54,350
OK, so that's pretty much the whole exercise from a bird's-eye point of view.

25
00:01:59,080 --> 00:02:04,540
Now, let's go through some details concerning the exercise. Specifically, what I would like to do

26
00:02:04,570 --> 00:02:07,180
is give you a step by step outline of the code.

27
00:02:07,870 --> 00:02:10,830
Note that this is specific to my implementation.

28
00:02:10,840 --> 00:02:16,270
So if you had other ideas about how you wanted to implement the high level steps, then these details

29
00:02:16,270 --> 00:02:17,620
may not apply to you.

30
00:02:19,260 --> 00:02:24,930
OK, so the first step will be to loop through each data file and save each line of each file into a

31
00:02:24,930 --> 00:02:25,470
list.

32
00:02:26,010 --> 00:02:31,020
So each line of each poem will be considered one sample into our machine learning model.

33
00:02:32,040 --> 00:02:37,500
Furthermore, while we do this, we're also going to keep track of the labels, that is, which line belongs

34
00:02:37,500 --> 00:02:38,160
to which poet.

35
00:02:38,730 --> 00:02:40,390
So that will be another list.
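A minimal sketch of this first step, assuming each poet's lines arrive as an iterable of strings (an open file object would do). The name `collect_lines`, the choice to skip blank lines, and the placeholder strings are all assumptions made for illustration, not the course's code.

```python
import io

def collect_lines(sources):
    """Gather non-empty lines plus a parallel list of integer labels.

    `sources` is a list of iterables of text lines, one per class;
    the position in the list is used as the class label.
    """
    texts, labels = [], []
    for label, source in enumerate(sources):
        for line in source:
            line = line.rstrip()
            if line:  # assumption: blank lines are not useful samples
                texts.append(line)
                labels.append(label)
    return texts, labels

# Placeholder data; in the exercise these would be the two poem files.
poet_a = io.StringIO("first line of a\n\nsecond line of a\n")
poet_b = io.StringIO("only line of b\n")
texts, labels = collect_lines([poet_a, poet_b])
```

Here `texts` and `labels` have the same length, so `labels[i]` says which poet wrote `texts[i]`.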

36
00:02:42,060 --> 00:02:47,670
The next step will be to do a train test split where we randomly split the data into train and test.
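One way the random split could look, using only the standard library. The function name `split_train_test`, the 25% test fraction, and the fixed seed are assumptions; a library helper would work equally well.

```python
import random

def split_train_test(texts, labels, test_fraction=0.25, seed=0):
    """Shuffle the sample indices and cut them into train and test parts."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([texts[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# Placeholder samples: eight lines, four per class.
texts = ["line %d" % i for i in range(8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
(train_x, train_y), (test_x, test_y) = split_train_test(texts, labels)
```

Shuffling indices rather than the lists themselves keeps each line paired with its label.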

37
00:02:48,270 --> 00:02:52,080
At this point, our input data is still in the form of lines of text.

38
00:02:52,680 --> 00:02:56,520
Of course, this is not appropriate input for a machine learning model.

39
00:02:57,240 --> 00:03:02,550
As you recall, we need to implement a Markov transition matrix where each state represents a word in

40
00:03:02,550 --> 00:03:03,510
the vocabulary.

41
00:03:04,140 --> 00:03:08,790
Therefore, we need to know which indices in the matrix correspond to which words.

42
00:03:09,660 --> 00:03:14,640
Therefore, what we need is a mapping from unique word to unique integer index.

43
00:03:16,050 --> 00:03:20,190
So the next step will be to loop through our data, collecting all the unique words.

44
00:03:20,850 --> 00:03:25,920
Note that this will involve tokenizing each line of text, each of which currently contains multiple

45
00:03:25,920 --> 00:03:26,610
words.

46
00:03:27,120 --> 00:03:32,550
It's up to you how you want to do tokenization, but for this exercise, a simple string split should

47
00:03:32,550 --> 00:03:33,270
suffice.

48
00:03:34,800 --> 00:03:40,440
From these unique words, we'll be able to assign each of them a unique integer index, which will then

49
00:03:40,440 --> 00:03:43,260
be used as indices into a Markov matrix.

50
00:03:44,130 --> 00:03:49,170
So you should know that when we use the word "mapping", that typically refers to a Python dictionary.

51
00:03:50,100 --> 00:03:53,490
Also note that we'll need a special index for unknown words.

52
00:03:54,000 --> 00:03:58,500
This is because there may be words in the test set that do not appear in the train set.
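A rough sketch of such a mapping, assuming whitespace tokenization and that index 0 is reserved for unknown words. The `<unk>` token and the name `build_word2idx` are illustrative choices, not something fixed by the exercise.

```python
def build_word2idx(train_texts):
    """Map each unique training word to a unique integer index."""
    word2idx = {"<unk>": 0}  # assumption: index 0 stands for unknown words
    for line in train_texts:
        for token in line.split():  # simple whitespace tokenization
            if token not in word2idx:
                word2idx[token] = len(word2idx)
    return word2idx

# Placeholder training lines.
word2idx = build_word2idx(["the cat sat", "the dog"])
```

Only the train set is used to build the dictionary, which is exactly why the unknown index is needed at test time.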

53
00:04:03,410 --> 00:04:08,840
OK, so once you have your word to integer index mapping, you want to then convert all the lines of

54
00:04:08,840 --> 00:04:11,480
text into integer representation.

55
00:04:12,140 --> 00:04:17,420
This will be useful when we want to index the state transition matrices since we can use the integers

56
00:04:17,420 --> 00:04:18,019
directly.
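The conversion could be as small as the sketch below, assuming a dictionary that already contains an `<unk>` entry. Both the `encode` name and the toy dictionary are hypothetical.

```python
def encode(line, word2idx, unk_token="<unk>"):
    """Turn a line of text into a list of integer word indices."""
    unk = word2idx[unk_token]
    # Words missing from the dictionary fall back to the unknown index.
    return [word2idx.get(token, unk) for token in line.split()]

# Toy dictionary; "purrs" is deliberately absent.
word2idx = {"<unk>": 0, "the": 1, "cat": 2}
encoded = encode("the cat purrs", word2idx)
```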

57
00:04:19,490 --> 00:04:25,840
OK, so the next step is to train a Markov model for each class. As you recall, we have two classes:

58
00:04:25,850 --> 00:04:28,010
Robert Frost and Edgar Allan Poe.

59
00:04:28,760 --> 00:04:33,500
Each Markov model should be trained on lines of poems only for their respective poet.

60
00:04:34,250 --> 00:04:39,410
So one Markov model will be trained on Edgar Allan Poe lines, while the other Markov model will be

61
00:04:39,410 --> 00:04:41,060
trained on Robert Frost lines.

62
00:04:42,140 --> 00:04:45,530
Of course, the implementation details are your exercise.

63
00:04:46,370 --> 00:04:51,110
Note that you want to use some form of smoothing, as mentioned in the previous lectures.

64
00:04:51,890 --> 00:04:56,900
Furthermore, you should consider whether you need the A's and the pi's directly, or whether you only

65
00:04:56,900 --> 00:04:59,210
need the log of the A's and the log of the pi's.

66
00:05:00,230 --> 00:05:05,540
In addition, in order to apply Bayes' rule, you'll also need the priors, so compute these from the

67
00:05:05,540 --> 00:05:06,350
data as well.
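For orientation only, here is one possible shape for the training step, assuming add-one smoothing and plain Python lists (NumPy arrays would be the more natural container). The names `train_markov` and `log_priors` and the toy sequences are assumptions, and this sketch stores logs only, which is one answer to the question above rather than the only one.

```python
import math

def train_markov(sequences, vocab_size, alpha=1.0):
    """Estimate log initial-state and log transition probabilities."""
    # Starting every count at alpha is add-alpha smoothing.
    pi = [alpha] * vocab_size
    A = [[alpha] * vocab_size for _ in range(vocab_size)]
    for seq in sequences:
        if not seq:
            continue
        pi[seq[0]] += 1  # first word of the line
        for prev, curr in zip(seq, seq[1:]):
            A[prev][curr] += 1  # word-to-word transition
    pi_total = sum(pi)
    log_pi = [math.log(c / pi_total) for c in pi]
    log_A = []
    for row in A:
        row_total = sum(row)
        log_A.append([math.log(c / row_total) for c in row])
    return log_A, log_pi

def log_priors(labels, n_classes):
    """Log of each class's share of the training samples."""
    # Assumes every class appears at least once in `labels`.
    return [math.log(labels.count(k) / len(labels)) for k in range(n_classes)]

# Toy integer sequences over a 3-word vocabulary, all from one class.
log_A, log_pi = train_markov([[1, 2], [1, 1, 2]], vocab_size=3)
priors = log_priors([0, 0, 0, 1], n_classes=2)
```

In the exercise, `train_markov` would be called twice, once on each poet's encoded training lines.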

68
00:05:07,580 --> 00:05:11,240
At this point, you should have everything you need to make a prediction.

69
00:05:11,720 --> 00:05:17,300
Your data is stored as integers, your Markov models are trained, and you have the priors for each class.

70
00:05:21,980 --> 00:05:25,850
So the next step will be to write a function that can actually make predictions.

71
00:05:26,390 --> 00:05:31,580
Specifically, it should take in an input sequence and it should compute the log posterior for each

72
00:05:31,580 --> 00:05:32,270
class.

73
00:05:33,290 --> 00:05:37,580
Then it should take the argmax to get the prediction for that input sequence.

74
00:05:38,480 --> 00:05:43,820
And remember that there is no need to compute the full posterior since the denominator is constant for

75
00:05:43,820 --> 00:05:44,660
each class.
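A hedged sketch of the prediction function, assuming each model is a pair of log transition probabilities and log initial-state probabilities. The function names and the two hand-written toy models are made up for illustration. The score is log-likelihood plus log prior, which is the log posterior up to the shared denominator.

```python
import math

def log_likelihood(seq, log_A, log_pi):
    """Log probability of an integer sequence under one Markov model."""
    if not seq:
        return 0.0
    total = log_pi[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        total += log_A[prev][curr]
    return total

def predict(seq, models, log_priors):
    """Return the index of the class with the highest unnormalized log posterior."""
    scores = [
        log_likelihood(seq, log_A, log_pi) + log_prior
        for (log_A, log_pi), log_prior in zip(models, log_priors)
    ]
    return max(range(len(scores)), key=scores.__getitem__)  # argmax

def to_log(matrix):
    return [[math.log(p) for p in row] for row in matrix]

# Toy 2-word models: class 0 tends to repeat words, class 1 tends to alternate.
log_pi = [math.log(0.5), math.log(0.5)]
models = [
    (to_log([[0.9, 0.1], [0.1, 0.9]]), log_pi),
    (to_log([[0.1, 0.9], [0.9, 0.1]]), log_pi),
]
equal_priors = [math.log(0.5), math.log(0.5)]
repeat_pred = predict([0, 0, 0], models, equal_priors)
alternate_pred = predict([0, 1, 0], models, equal_priors)
```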

76
00:05:45,470 --> 00:05:50,480
Now, of course, once you have a function to make predictions, you can then compute the predictions

77
00:05:50,480 --> 00:05:51,860
for both the train and test set.

78
00:05:52,520 --> 00:05:58,850
After doing so, you should compute the accuracy for both the train and test set. As a bonus, check

79
00:05:58,850 --> 00:06:00,920
whether or not the classes are imbalanced.

80
00:06:01,340 --> 00:06:03,740
Note that this information is in the priors.

81
00:06:04,640 --> 00:06:11,030
If the classes are imbalanced, then compute the confusion matrix to see which class has incorrect predictions

82
00:06:11,030 --> 00:06:11,750
most often.

83
00:06:12,590 --> 00:06:18,080
Furthermore, you may want to compute a metric like the F1 score, which takes into account any imbalance

84
00:06:18,080 --> 00:06:19,040
in the classes.
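The metrics themselves are short enough to write by hand, as in this sketch for the two-class case. The function names and the toy label lists are assumptions; library implementations would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes=2):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # avoids dividing by zero when nothing is correctly found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels and predictions.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

Reading the off-diagonal entries of `cm` shows which class is being misclassified more often.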

85
00:06:20,450 --> 00:06:24,770
OK, so after having completed all these steps, the exercise will be done.

86
00:06:25,040 --> 00:06:28,220
So please try this yourself and I'll see you in the next lecture.