1
00:00:11,710 --> 00:00:17,200
In this lecture we are going to look at extending our MLP example so that we can use a trained model

2
00:00:17,200 --> 00:00:19,000
to make predictions.

3
00:00:19,000 --> 00:00:23,650
A lot of people ask this question, and at a high level you just need to follow the same steps that we

4
00:00:23,650 --> 00:00:28,090
did earlier to preprocess the text and call the model's predict function.

5
00:00:28,090 --> 00:00:33,040
However, some students get overwhelmed by the amount of code, and so hopefully this lecture will clear

6
00:00:33,040 --> 00:00:35,220
up any confusion you may have had.

7
00:00:35,590 --> 00:00:41,470
Before we start, first recognize that we are in a separate VIP notebook, which you can find in the VIP

8
00:00:41,470 --> 00:00:42,790
notebooks document.

9
00:00:43,330 --> 00:00:48,070
So to start, we're going to do a few things differently in this notebook to extend what we discussed

10
00:00:48,070 --> 00:01:08,590
earlier.

11
00:01:08,740 --> 00:01:13,630
So if we scroll down to after the model is trained, you'll see that I also do a check for how many of

12
00:01:13,630 --> 00:01:16,060
our observations are actually spam.

13
00:01:16,270 --> 00:01:17,940
As you can see, it's not that many.

14
00:01:17,950 --> 00:01:19,990
Only about 13 percent.

15
00:01:20,110 --> 00:01:23,840
This is an example of a data set with imbalanced classes.

16
00:01:23,890 --> 00:01:28,900
So when we see accuracy in the high 90s, we might be concerned that our model is just predicting the

17
00:01:28,900 --> 00:01:31,990
majority class all the time.
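To make that concern concrete, here is a minimal sketch in plain Python. The labels are invented to match the roughly 13 percent spam rate mentioned above; they are not the course's data, and the "model" is a hypothetical baseline that ignores its input.

```python
# Hypothetical labels: 1 = spam, 0 = not spam, with 13% spam as in the lecture.
labels = [1] * 13 + [0] * 87

# A baseline "model" that always predicts the majority class.
majority_class = max(set(labels), key=labels.count)
predictions = [majority_class] * len(labels)

# Overall accuracy looks respectable even though no spam is ever caught.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the spam class: fraction of true spam that was predicted as spam.
spam_recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / labels.count(1)

print(majority_class, accuracy, spam_recall)
```

A baseline like this gets most examples right while catching no spam at all, which is why accuracy alone is not enough on imbalanced data.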

18
00:01:32,000 --> 00:01:36,680
Next, we're going to look at the confusion matrix to see if our model is doing well on one class but

19
00:01:36,680 --> 00:01:37,490
not on the other.
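As a rough illustration of what the confusion matrix tells us, here is a small sketch in plain Python. It is not the notebook's code; the function name and the example targets and predictions are assumptions made for this example.

```python
def confusion_matrix(targets, predictions, num_classes=2):
    """Return a matrix where entry [t][p] counts samples of true class t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(targets, predictions):
        matrix[t][p] += 1
    return matrix

# Hypothetical targets and predictions (0 = not spam, 1 = spam).
targets = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(targets, predictions)

# Per-class recall: correct predictions for a class divided by that class's row total.
recalls = [cm[c][c] / sum(cm[c]) for c in range(len(cm))]
print(cm, recalls)
```

If the recall for one class is far below the other, the model is doing well on one class but not on the other, even when overall accuracy is high.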

20
00:01:38,150 --> 00:01:45,890
So I've pasted the same confusion matrix code from earlier in the course in this block. If we scroll

21
00:01:45,890 --> 00:01:50,040
down to the results, we can see that we actually do well on both classes.

22
00:01:50,120 --> 00:01:51,470
So this goes for the test set,

23
00:01:53,350 --> 00:01:54,410
and for the train set.

24
00:01:57,990 --> 00:02:02,970
So it seems that even though the classes were imbalanced, this did not pose a problem for the model.

25
00:02:09,190 --> 00:02:09,730
Next,

26
00:02:09,760 --> 00:02:13,610
let's go on to making a prediction on a new test sentence.

27
00:02:13,660 --> 00:02:16,040
There are two ways I want to show you how to do this.

28
00:02:16,150 --> 00:02:21,640
The first way is closer to what we did earlier in the script, which I think is easier to understand.

29
00:02:21,640 --> 00:02:25,060
That is, we want to start with a CSV file containing our data.

30
00:02:25,780 --> 00:02:30,880
So first, I have a snippet of code here that shows you how to sample from the DataFrame, but only for

31
00:02:30,880 --> 00:02:32,440
the rows with a label that is spam.

32
00:02:38,190 --> 00:02:39,090
In the next block,

33
00:02:39,090 --> 00:02:44,670
I assign this to a variable called small sample and then I write this to a CSV file called

34
00:02:44,670 --> 00:02:46,270
sample_test.csv.

35
00:02:46,650 --> 00:02:52,510
Note that the only thing I'm writing to this file is the sentences themselves, not any of the other columns.

36
00:02:52,560 --> 00:02:58,170
So in the first line, I open sample_test.csv in write mode. In the next line,

37
00:02:58,170 --> 00:03:01,680
I write the header, which is just the string data.

38
00:03:01,680 --> 00:03:08,250
Then I loop through the small sample DataFrame using the iterrows function. Inside the loop, I select

39
00:03:08,310 --> 00:03:12,810
the data column from the row and then write it to the file along with a newline.
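The steps just described can be sketched roughly as follows. This is not the notebook's code: it uses a plain list of dictionaries as a hypothetical stand-in for the sampled DataFrame rows, the `labels` column name and the example sentences are invented, and it swaps the raw file write for Python's `csv.writer` so that sentences containing commas are quoted.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the sampled spam rows; only the "data" column gets written.
small_sample = [
    {"labels": "spam", "data": "Win a free prize now"},
    {"labels": "spam", "data": "Claim your reward, call today"},
]

# Write to a temporary directory so the sketch does not touch the working directory.
path = os.path.join(tempfile.mkdtemp(), "sample_test.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["data"])  # header row: a single column called "data"
    for row in small_sample:
        writer.writerow([row["data"]])  # fields containing commas are quoted automatically

# Read the file back to confirm it contains only the data column.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Writing each sentence with a plain `f.write` works too, as long as no sentence contains a comma or a newline that a CSV reader would treat as a separator.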

40
00:03:18,590 --> 00:03:23,290
Next, we use the cat command to confirm that our CSV has been written properly.

41
00:03:23,360 --> 00:03:33,000
As you can see, it contains only the data column with the corresponding spam sentences.

42
00:03:33,240 --> 00:03:37,220
The next step is to create a TabularDataset from our CSV.

43
00:03:37,230 --> 00:03:39,520
This is almost the same as the code from earlier.

44
00:03:39,630 --> 00:03:44,250
But notice that in the fields argument, I only have one field, which is data.

45
00:03:44,340 --> 00:03:47,940
I also pass in the text field object that we created earlier.

46
00:03:47,940 --> 00:03:54,060
Since this contains all of the information about tokenization and the vocabulary and so forth from the

47
00:03:54,060 --> 00:03:54,690
train set.

48
00:04:01,170 --> 00:04:06,840
Next, I create an Iterator object from the dataset object, which is similar to what we did earlier when

49
00:04:06,840 --> 00:04:09,160
we called Iterator.splits.

50
00:04:09,270 --> 00:04:16,290
As you can see, it takes in almost the same arguments: a dataset, a batch size, a sort key, and a device,

51
00:04:17,010 --> 00:04:22,950
but you'll notice that before, the splits function took in multiple datasets, whereas this takes in

52
00:04:22,950 --> 00:04:25,030
only a single dataset.

53
00:04:25,050 --> 00:04:29,270
Now you might wonder: how did I figure out that such an object existed?

54
00:04:29,280 --> 00:04:33,150
Well guys, you have to read the documentation. All the information is there.

55
00:04:39,840 --> 00:04:40,670
In the next block,

56
00:04:40,680 --> 00:04:46,110
we loop through our sample iterator. So this is the moment of truth, where we get to find out whether

57
00:04:46,110 --> 00:04:47,010
this will work or not.

58
00:04:47,850 --> 00:04:52,730
As you can see, it's slightly different from what we had earlier. In the previous loops,

59
00:04:52,740 --> 00:04:59,460
we got back the data directly as tensors, but it seems that in this loop the data is returned as a Batch

60
00:04:59,520 --> 00:05:00,600
object.

61
00:05:00,780 --> 00:05:03,510
We can access the data by using inputs.data.

62
00:05:04,320 --> 00:05:10,560
Unfortunately, the API for the Iterator object versus Iterator.splits is inconsistent.

63
00:05:10,560 --> 00:05:14,620
Well, you just have to observe what's going on and make the necessary changes.

64
00:05:14,640 --> 00:05:19,850
Nobody said you wouldn't have to put in some effort. As you can see, our predictions come out just fine.

65
00:05:26,620 --> 00:05:31,570
Now your next question might be: what if I have just a single sentence and I don't want to write it to

66
00:05:31,570 --> 00:05:32,750
a CSV?

67
00:05:33,070 --> 00:05:39,190
In this case, we can still make use of the torchtext documentation to figure out what to do. So we'll

68
00:05:39,190 --> 00:05:45,430
start with a sentence from the spam set, and I'll assign this to a variable called single sentence. The

69
00:05:45,430 --> 00:05:51,970
next step is to use our text field object and call the preprocess function on this single sentence.

70
00:05:52,090 --> 00:05:56,050
Here I'm just printing out the result to confirm that it does the right thing.

71
00:05:56,230 --> 00:06:01,690
As you can see, this only does the tokenization part, but it does not map the tokens to integers.

72
00:06:07,840 --> 00:06:13,420
The next step is to call the numericalize function, which converts the above tokenized sentence into a

73
00:06:13,420 --> 00:06:15,220
sequence of integers.

74
00:06:15,220 --> 00:06:21,400
Unfortunately, the API is a bit inconsistent here again, since this accepts a list of tokenized sentences

75
00:06:21,640 --> 00:06:24,750
rather than just a single tokenized sentence.
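To show the shape of the data at each step, here is a minimal sketch in plain Python. It does not use torchtext: the two functions below are simplified stand-ins named after the ones mentioned in the lecture, and the vocabulary and example sentence are invented for illustration.

```python
# Hypothetical vocabulary built from a training set; index 0 is reserved for unknown words.
vocab = {"<unk>": 0, "free": 1, "prize": 2, "call": 3, "now": 4}

def preprocess(sentence):
    """Tokenize one sentence: lowercase it and split on whitespace."""
    return sentence.lower().split()

def numericalize(batch_of_token_lists):
    """Map each token list in a batch to a list of integer indexes."""
    return [
        [vocab.get(token, vocab["<unk>"]) for token in tokens]
        for tokens in batch_of_token_lists
    ]

single_sentence = "Free prize call now please"
tokens = preprocess(single_sentence)

# numericalize expects a batch, so the single tokenized sentence is wrapped in a list.
indexes = numericalize([tokens])
print(tokens, indexes)
```

The point to notice is the extra pair of brackets: tokenization works on one sentence, while the integer mapping works on a batch of sentences, so a single sentence becomes a batch of size one.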

76
00:06:24,790 --> 00:06:33,880
Finally, the last step is just to put these all together and then use our model to get a prediction.

77
00:06:34,650 --> 00:06:40,790
So in the last block of code, we call the text field's preprocess function to convert our sentence into tokens.

78
00:06:40,800 --> 00:06:46,530
Next, we call numericalize to convert our tokens into a sequence of word indexes.

79
00:06:46,530 --> 00:06:50,480
Lastly, we call our model and pass in the sequence of word indexes,

80
00:06:50,520 --> 00:06:52,860
after moving the input tensor to the GPU.

81
00:06:55,750 --> 00:07:01,270
As a final aside for this lecture, you may want to check out another notebook I created, which uses the

82
00:07:01,270 --> 00:07:04,540
Keras API to preprocess the text.

83
00:07:04,540 --> 00:07:09,190
It might surprise you that this is possible, but in fact, since the Keras API just gives you back a

84
00:07:09,190 --> 00:07:10,560
NumPy array,

85
00:07:10,570 --> 00:07:13,980
there is no mixture of Keras code with PyTorch code.

86
00:07:13,990 --> 00:07:19,960
I found that this was advantageous in a few ways, namely that it achieves better accuracy, perhaps due

87
00:07:19,960 --> 00:07:27,460
to smarter default tokenization, and the API is easier to work with and more consistent than torchtext.

88
00:07:27,460 --> 00:07:32,020
You can find this notebook, along with the rest of the VIP notebooks, in the extra section.
