1
00:00:11,070 --> 00:00:16,560
In this lecture, we will be looking at the code to implement Markov models for building a text classifier.

2
00:00:17,340 --> 00:00:22,890
As always, I strongly encourage you to treat this as an exercise and to try to code this yourself from

3
00:00:22,890 --> 00:00:23,490
scratch.

4
00:00:23,940 --> 00:00:26,820
You already know what to do and what the results should be.

5
00:00:27,330 --> 00:00:32,640
So if you'd like to treat this as an exercise, please look only at the first block of code where we

6
00:00:32,640 --> 00:00:33,900
download the data sets.

7
00:00:34,470 --> 00:00:40,500
The only thing left to do is to translate our ideas into Python, which does not require any new concepts

8
00:00:40,620 --> 00:00:41,790
or any new knowledge.

9
00:00:42,240 --> 00:00:44,880
So please try to complete this by yourself first.

10
00:00:45,450 --> 00:00:50,640
As output, you should display the accuracy of your text classifier on both the train and test set.

11
00:00:53,070 --> 00:00:59,610
OK, so assuming you've already attempted the exercise, let's continue. As mentioned, we'll begin by

12
00:00:59,610 --> 00:01:04,950
downloading the dataset, which consists of poems by Edgar Allan Poe and Robert Frost.

13
00:01:05,640 --> 00:01:10,830
Our goal will be to build a classifier that can distinguish between the two authors using one line of

14
00:01:10,830 --> 00:01:13,710
a poem, so each line of a poem will be a sample.

15
00:01:14,760 --> 00:01:19,830
Now, just as a reminder for beginners, please do not try to type these URLs by hand.

16
00:01:20,160 --> 00:01:23,310
Simply copy and paste the links in the notebook itself.

17
00:01:31,450 --> 00:01:35,170
The next step is to import everything we need for the rest of the notebook.

18
00:01:40,830 --> 00:01:45,360
The next step is to define a list that contains the file names for our text files.

19
00:01:50,350 --> 00:01:54,220
The next step is to run the head command to see what's inside our text files.

20
00:01:54,730 --> 00:01:56,260
We'll start with Edgar Allan Poe.

21
00:02:01,230 --> 00:02:05,220
OK, so as you can see, this file contains poems by Edgar Allan Poe.

22
00:02:05,730 --> 00:02:07,380
Note that some lines are empty.

23
00:02:08,160 --> 00:02:11,790
Also note that there is capitalization and punctuation.

24
00:02:15,450 --> 00:02:18,450
The next step is to run the head command for Robert Frost.

25
00:02:22,950 --> 00:02:28,200
Again, we see that some lines are empty and that there is capitalization and punctuation.

26
00:02:32,650 --> 00:02:36,920
The next step is to collect our data into lists. For this experiment,

27
00:02:36,940 --> 00:02:39,520
we'll be treating each line as an input sample.

28
00:02:39,970 --> 00:02:46,450
This is as opposed to an entire verse or an entire poem. So we'll begin by creating two empty lists,

29
00:02:46,720 --> 00:02:50,140
one to hold on to the text and one to hold on to the labels.

30
00:02:53,660 --> 00:02:59,630
The next step is to loop through each input file. Notice that we use enumerate, which gives us the index

31
00:02:59,630 --> 00:03:00,710
at the same time.

32
00:03:01,580 --> 00:03:06,800
So the first file will have the index zero and the second file will have the index one. We'll be

33
00:03:06,800 --> 00:03:11,070
using these as labels for our dataset. Inside the loop,

34
00:03:11,090 --> 00:03:14,480
we're going to print out the file name along with the corresponding label.

35
00:03:14,990 --> 00:03:18,080
This is so that we'll know which label corresponds to which author.

36
00:03:20,730 --> 00:03:25,140
The next step is to loop through each line of the file. Inside the inner loop,

37
00:03:25,170 --> 00:03:30,720
we begin by calling rstrip. As you recall, when you loop through a file line by line in Python,

38
00:03:30,960 --> 00:03:33,230
it includes the newline character at the end.

39
00:03:33,670 --> 00:03:37,230
The rstrip call will remove this newline since it's not needed.

40
00:03:38,220 --> 00:03:42,420
The next step is to call the lower method, which converts all the text to lowercase.

41
00:03:43,200 --> 00:03:45,090
The next step is to call if line.

42
00:03:45,960 --> 00:03:51,960
As you recall, some lines are empty, so the following code will only run if the line is not empty.

43
00:03:53,910 --> 00:03:57,840
OK, so inside the if statement, we begin by removing punctuation.

44
00:03:58,800 --> 00:04:02,220
Note that understanding this code is not crucial at this point.

45
00:04:02,550 --> 00:04:05,490
You can easily just copy and paste this code from Stack Overflow.

46
00:04:06,990 --> 00:04:11,760
So after this line, our text will be lowercase and all punctuation will be removed.

47
00:04:14,010 --> 00:04:19,890
The final step is to append this text to our list of input texts and to append the label to our list

48
00:04:19,890 --> 00:04:20,670
of labels.
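
The loading loop described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the file names and the short in-memory strings standing in for the downloaded files are hypothetical, and the punctuation removal uses `str.translate`, which is one common way to do it.

```python
import string

# Hypothetical stand-ins for the two downloaded text files (assumption);
# the real notebook reads the Poe and Frost files from disk instead.
sample_files = {
    "poe_sample.txt": "The Night, it came!\n\nSoft and slow.\n",
    "frost_sample.txt": "A road; a wall.\n",
}

input_texts = []  # one entry per non-empty line
labels = []       # author index for the matching entry in input_texts

# Translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans("", "", string.punctuation)

for label, (name, contents) in enumerate(sample_files.items()):
    print(f"{name} corresponds to label {label}")
    for line in contents.splitlines(keepends=True):
        # Drop the trailing newline and normalize to lowercase.
        line = line.rstrip().lower()
        if line:  # skip empty lines
            line = line.translate(remove_punctuation)
            input_texts.append(line)
            labels.append(label)
```

After the loop, `input_texts` holds lowercase lines without punctuation and `labels` holds the author index for each line.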

49
00:04:25,440 --> 00:04:30,930
OK, so as you can see, Edgar Allan Poe has been assigned label zero and Robert Frost has been assigned

50
00:04:30,930 --> 00:04:31,650
label one.

51
00:04:35,210 --> 00:04:37,310
The next step is to do a train test split.

52
00:04:38,060 --> 00:04:42,470
Part of the challenge in this notebook is figuring out when to do your train-test split.

53
00:04:43,190 --> 00:04:47,990
It might seem natural to do this later after we've converted the text into integers and so forth.

54
00:04:48,260 --> 00:04:49,660
But this way is more sensible.

55
00:04:50,450 --> 00:04:56,180
As you recall, the test set should emulate texts that we have never seen before, so it may contain

56
00:04:56,180 --> 00:04:59,390
unknown words that don't match up to any integer index.

57
00:05:03,780 --> 00:05:08,610
The next step is to check the length of Y train and Y test to see how many samples we have.

58
00:05:14,350 --> 00:05:17,500
The next step is to print out a few rows from the train text.

59
00:05:22,410 --> 00:05:27,830
As you can see, these are random lines of poems, and everything is lowercase without punctuation.

60
00:05:31,930 --> 00:05:34,780
The next step is to print out a few rows from Y Train.

61
00:05:39,250 --> 00:05:44,800
As you can see, we get a mixture of zeros and ones since the train_test_split function also shuffles

62
00:05:44,800 --> 00:05:45,670
the data set.

63
00:05:48,980 --> 00:05:53,310
The next step will be to convert our text into integers. As you recall,

64
00:05:53,330 --> 00:05:57,320
these will be indexes into the Markov matrices we are about to define.

65
00:05:58,250 --> 00:06:04,220
We'll begin by setting idx to one, which will act as our current index as we loop through the text.

66
00:06:04,940 --> 00:06:10,610
We'll also initialize a word2idx dictionary with one entry, which maps the unknown token to zero.

67
00:06:11,510 --> 00:06:15,480
Basically, this will never be used for the train set, but it may be used for the test set

68
00:06:16,040 --> 00:06:19,550
if the test set contains words that do not appear in the train set.

69
00:06:24,850 --> 00:06:31,150
The next step is to loop through our train text in order to populate word2idx. Inside the loop, we'll

70
00:06:31,150 --> 00:06:35,020
begin by splitting the text into tokens. As you recall,

71
00:06:35,050 --> 00:06:40,510
there are a few options for tokenization, but calling the split function is the simplest thing to do.

72
00:06:41,950 --> 00:06:43,160
The next step is to loop through

73
00:06:43,180 --> 00:06:49,510
each token we find. Inside the loop, we start by checking whether or not the token is already in our

74
00:06:49,510 --> 00:06:51,190
word2idx dictionary.

75
00:06:51,730 --> 00:06:55,030
If it is, then there is nothing to do. If it is not,

76
00:06:55,060 --> 00:06:57,580
then we need to create a new entry for this token.

77
00:06:58,540 --> 00:07:04,030
So inside the if statement, we start by assigning the current value of idx to the current token.

78
00:07:04,900 --> 00:07:10,360
The next step is to increment idx, so it will have the right value for the next token we encounter.
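
A minimal sketch of the vocabulary-building loop just described. The three training lines are made up for illustration, and the `"<unk>"` string used for the unknown token is an assumption; the transcript only says that the unknown token maps to zero.

```python
# Hypothetical training lines, already lowercased and stripped of punctuation.
train_text = ["the night it came", "a road a wall", "the wall it fell"]

idx = 1                   # next free integer index
word2idx = {"<unk>": 0}   # index 0 is reserved for unknown words (token name assumed)

for text in train_text:
    tokens = text.split()  # simplest possible tokenization: split on whitespace
    for token in tokens:
        if token not in word2idx:
            word2idx[token] = idx  # give the new token the current index
            idx += 1               # advance so the next new token gets a fresh index
```

Because only the train text is used here, any word that appears solely in the test set has no entry and must be handled separately.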

79
00:07:16,660 --> 00:07:19,600
The next step is to simply print out word2idx.

80
00:07:20,110 --> 00:07:23,440
Feel free to check this yourself, to see what words it contains.

81
00:07:34,410 --> 00:07:37,500
The next step is to check the length of word2idx.

82
00:07:38,010 --> 00:07:42,750
Note that this will determine the size of our Markov matrix and the initial state distribution.

83
00:07:49,010 --> 00:07:52,130
The next step is to convert this data into integer format.

84
00:07:52,910 --> 00:07:56,180
Recall that this is what we need to index our A and pi.

85
00:07:56,990 --> 00:08:01,760
So the first step will be to create two empty lists, which we will use to store the results.

86
00:08:05,110 --> 00:08:11,020
The next step will be to loop through the train text. Inside the loop, we'll begin by again splitting

87
00:08:11,020 --> 00:08:13,090
the text to do tokenization.

88
00:08:13,840 --> 00:08:19,270
The next step will be to use a list comprehension to map each token to its corresponding index.

89
00:08:20,470 --> 00:08:24,820
The result of this will be a list of integers representing one line of a poem.

90
00:08:25,960 --> 00:08:29,920
The next step is to append this list of integers to our list defined above.

91
00:08:33,870 --> 00:08:36,809
The next step is to do the same procedure for the test set.

92
00:08:37,380 --> 00:08:43,210
This is mostly the same as the above loop, but there is one major difference. As you recall,

93
00:08:43,230 --> 00:08:48,660
it's possible that not every word in the test set appears in the train set, so we can't naively

94
00:08:48,660 --> 00:08:51,330
try to index the word2idx dictionary.

95
00:08:51,960 --> 00:08:57,510
Instead, we'll use the get function, which ensures that we always return a default value of zero.

96
00:08:58,230 --> 00:09:01,200
As you recall, this corresponds to the unknown token.
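
The difference between the train and test loops can be sketched as follows. The tiny vocabulary, the sample lines, and the list names `train_text_int` and `test_text_int` are all hypothetical; only the use of `dict.get` with a default of zero comes from the lecture.

```python
# Hypothetical vocabulary built from the train set only.
word2idx = {"<unk>": 0, "the": 1, "night": 2, "it": 3, "came": 4}

train_text = ["the night came"]
test_text = ["the dawn came"]   # "dawn" never appeared in the train set

train_text_int = []
test_text_int = []

for text in train_text:
    tokens = text.split()
    # Every train token is guaranteed to be in word2idx, so direct indexing is safe.
    train_text_int.append([word2idx[token] for token in tokens])

for text in test_text:
    tokens = text.split()
    # Unseen words fall back to index 0, the unknown token.
    test_text_int.append([word2idx.get(token, 0) for token in tokens])
```

Direct indexing with `word2idx[token]` on the test set would raise a `KeyError` for "dawn", which is why the second loop uses `get`.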

97
00:09:06,380 --> 00:09:10,160
The next step is to print out a few rows from the list we just populated.

98
00:09:14,350 --> 00:09:19,300
So as you can see, the result is a list of lists of integers as expected.

99
00:09:24,030 --> 00:09:29,610
OK, so at this point, we're ready to build our A and pi matrices, which represent the Markov model.

100
00:09:31,080 --> 00:09:35,220
Now one important thing to remember is that we don't just have one Markov model.

101
00:09:35,370 --> 00:09:37,530
We'll have as many as there are classes.

102
00:09:37,980 --> 00:09:41,460
Since we have two classes, we'll have two A's and two pi's.

103
00:09:42,780 --> 00:09:47,070
The first step is to assign the length of word2idx to a variable called V.

104
00:09:47,760 --> 00:09:52,410
By convention, this is what we normally use to represent the vocabulary size.

105
00:09:54,090 --> 00:09:57,690
In the theory lectures, we often use the letter M for generic states.

106
00:09:57,900 --> 00:10:01,320
But since we know we're working with language, it's OK to use V.

107
00:10:02,550 --> 00:10:06,960
The next step is to initialize A0, pi0, A1, and pi1.

108
00:10:07,740 --> 00:10:11,520
You'll notice that I've initialized each of these arrays to all ones.

109
00:10:12,150 --> 00:10:15,420
The reason for this is we're going to be using add-one smoothing.

110
00:10:16,050 --> 00:10:21,000
So these ones are the initial fake counts for each initial word and each transition.
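
A dependency-free sketch of this initialization, assuming a small hypothetical vocabulary size. The notebook most likely uses array objects for these; plain nested lists are used here only to keep the example self-contained, and the helper `ones_matrix` is introduced purely for illustration.

```python
V = 4  # hypothetical vocabulary size; in the notebook this is len(word2idx)


def ones_matrix(n):
    """Return an n x n list of lists filled with 1.0 (illustrative helper)."""
    # Build each row separately so that rows do not share the same list object.
    return [[1.0] * n for _ in range(n)]


# One Markov model per class: A holds transition counts, pi holds initial-word counts.
# Starting every count at one is the "fake count" used by add-one smoothing.
A0 = ones_matrix(V)
pi0 = [1.0] * V

A1 = ones_matrix(V)
pi1 = [1.0] * V
```

With these starting values, no transition or initial word ends up with a count of zero, even if it never occurs in the training text for that class.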

111
00:10:22,710 --> 00:10:26,550
The next step will be to add the rest of the counts from the actual text.

