1
00:00:11,690 --> 00:00:17,180
So in the previous lecture we talked about all the steps that are required to preprocess your text

2
00:00:17,210 --> 00:00:24,670
in order to convert your documents into lists of integers so that they can be ingested by a neural network.

3
00:00:24,680 --> 00:00:28,120
So now the big question: how do you do this in PyTorch?

4
00:00:28,580 --> 00:00:35,570
Well, it's not done in PyTorch per se, but it's part of the PyTorch ecosystem. Just like how specialized

5
00:00:35,570 --> 00:00:41,390
computer vision operations are done by the torchvision library, specialized NLP operations are done

6
00:00:41,390 --> 00:00:43,450
by the torchtext library.

7
00:00:43,520 --> 00:00:48,500
This is also automatically installed in Google Colab, so there isn't anything special you need to

8
00:00:48,500 --> 00:00:49,490
install on your own

9
00:00:49,610 --> 00:00:54,460
if you use Google Colab. Unfortunately, the syntax is quite strange.

10
00:00:54,470 --> 00:01:02,560
So bear with me as I try to make it digestible.

11
00:01:02,620 --> 00:01:08,620
First, we're going to assume that our task is text classification, which it is, and it's going to be a many

12
00:01:08,620 --> 00:01:10,390
to one task.

13
00:01:10,390 --> 00:01:16,210
That means for each document which is a sequence of words our prediction is going to be a single label

14
00:01:16,210 --> 00:01:20,690
for that entire document. That might sound a little bit abstract.

15
00:01:20,690 --> 00:01:24,650
So to give you a concrete example think of spam detection.

16
00:01:24,740 --> 00:01:30,770
Your input is an email, which is a sequence of words, and you would like to output a single label: spam

17
00:01:30,770 --> 00:01:37,790
or not spam. torchtext will want this data to be formatted as a CSV, with one column for the label

18
00:01:38,110 --> 00:01:40,600
and one column for the input text.

19
00:01:40,820 --> 00:01:45,860
I assume that at this stage you know how to write a program in Python so that if your data does not

20
00:01:45,860 --> 00:01:55,550
come in this format you know what to do to convert it into the required format.
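
As a rough illustration of that conversion, here is a minimal sketch that writes a two-column CSV with Python's standard csv module. The toy rows, the column names data and label, and the in-memory buffer are all assumptions made for the example; with a real dataset you would write to a file on disk instead.

```python
import csv
import io

# Hypothetical toy data: (text, label) pairs, where 1 = spam and 0 = not spam.
rows = [
    ("Win a free prize now", 1),
    ("Are we still meeting, at noon?", 0),
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["data", "label"])  # header row (column names are assumed)
writer.writerows(rows)              # text containing commas is quoted automatically

csv_text = buffer.getvalue()
print(csv_text)
```

The csv module takes care of quoting, so text that itself contains commas still ends up in a single column.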

21
00:01:55,560 --> 00:02:03,090
The next step is to create Field objects. Field objects are part of the torchtext.data module. Both the

22
00:02:03,090 --> 00:02:08,910
inputs column and the targets column will be Field objects, but with different arguments. For the input

23
00:02:08,910 --> 00:02:09,360
data,

24
00:02:09,360 --> 00:02:15,690
I want to pass in the arguments sequential=True, batch_first=True, lower=True, and

25
00:02:15,690 --> 00:02:17,630
pad_first=True.

26
00:02:17,790 --> 00:02:22,750
Let's run through these quickly, although they should be pretty intuitive already.

27
00:02:23,100 --> 00:02:29,550
We set sequential equal to True since the data is sequential: each value is a sequence of words.

28
00:02:29,550 --> 00:02:33,660
We set batch_first equal to True because we want the N dimension to come first

29
00:02:33,780 --> 00:02:40,600
and the T dimension to come last. We set lower equal to True because we want to lowercase the words.

30
00:02:40,650 --> 00:02:45,630
This isn't always necessary, but our dataset is pretty small, in which case it's usually a pretty good

31
00:02:45,630 --> 00:02:46,070
idea.

32
00:02:48,180 --> 00:02:53,940
And lastly, we set pad_first equal to True so that we use pre-padding instead of post-padding.
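
To make the difference concrete, here is a small pure-Python sketch of pre-padding versus post-padding. It is not torchtext's own code: the function name pad_sequences and the padding value 0 are assumptions chosen for illustration.

```python
def pad_sequences(sequences, pad_value=0, pad_first=True):
    """Pad lists of integers to the length of the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_value] * (max_len - len(seq))
        # pad_first=True puts the padding before the tokens (pre-padding).
        padded.append(padding + seq if pad_first else seq + padding)
    return padded


batch = [[5, 8, 2], [7], [4, 9]]
pre = pad_sequences(batch, pad_first=True)    # padding on the left
post = pad_sequences(batch, pad_first=False)  # padding on the right
print(pre)
print(post)
```

With pre-padding, the real tokens sit at the end of each row, so they are the last thing a recurrent model sees.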

33
00:02:53,940 --> 00:02:57,660
We already discussed why that might be desirable, although you should try both,

34
00:02:57,690 --> 00:03:04,620
just to be sure. For the label, we're going to create another Field object, but with an entirely different

35
00:03:04,620 --> 00:03:11,190
set of arguments. Specifically, we're going to set sequential equal to False, use_vocab equal to False, and

36
00:03:11,190 --> 00:03:18,040
is_target equal to True. Since the target is not sequential, we set sequential equal to False.

37
00:03:18,060 --> 00:03:19,800
This should be clear.

38
00:03:19,800 --> 00:03:24,360
We set use_vocab equal to False, which might seem strange, but let's think about it.

39
00:03:24,390 --> 00:03:28,220
Remember that there are a whole bunch of text preprocessing steps that we need to do,

40
00:03:28,380 --> 00:03:33,290
one of which involves assigning a unique integer to each word in the dataset.

41
00:03:33,300 --> 00:03:36,990
This is the process of building the vocabulary of words in the dataset.

42
00:03:38,750 --> 00:03:43,730
Of course, such a process wouldn't make sense to do on the targets, because the targets are just numerical

43
00:03:43,730 --> 00:03:44,690
labels.

44
00:03:44,780 --> 00:03:48,020
So we set use_vocab equal to False.

45
00:03:48,050 --> 00:03:50,690
Finally, we set is_target equal to True,

46
00:03:50,690 --> 00:03:54,110
since this field is for the target column, after all.

47
00:03:54,110 --> 00:03:58,680
In my opinion, this could have a better API, but it is what it is.

48
00:03:58,730 --> 00:04:03,350
You might also want to consider how these arguments might change if you were doing something like neural

49
00:04:03,350 --> 00:04:09,390
machine translation.

50
00:04:09,620 --> 00:04:15,350
The next step is to instantiate a TabularDataset object, which is also part of the torchtext.data

51
00:04:15,350 --> 00:04:15,830
module.

52
00:04:16,670 --> 00:04:21,650
So now you understand why we want to create a CSV to store our data.

53
00:04:21,650 --> 00:04:27,920
It's because a CSV is a table, and torchtext only knows how to read tabular datasets.

54
00:04:27,920 --> 00:04:33,320
Now, you might think it seems kind of limited to only have tabular datasets.

55
00:04:33,320 --> 00:04:39,030
What if we need a dataset that is not tabular? But if you think about it, pretty much every text dataset

56
00:04:39,170 --> 00:04:42,160
you can think of will probably fit into this framework.

57
00:04:42,230 --> 00:04:49,310
Whether that's sentiment analysis, spam detection, part-of-speech tagging, machine translation, and so forth.

58
00:04:49,310 --> 00:04:54,500
Even if it's unsupervised learning and all you have is just one document per line, that's still technically

59
00:04:54,500 --> 00:04:55,830
a CSV.

60
00:04:55,940 --> 00:05:00,240
So all these different kinds of datasets can fit into a CSV format.

61
00:05:00,680 --> 00:05:05,090
And as long as you're a competent programmer and you know how to make your own CSV, this should not

62
00:05:05,090 --> 00:05:05,720
be an issue.

63
00:05:06,650 --> 00:05:11,520
So let's go through the arguments one by one, so you know what they mean.

64
00:05:12,100 --> 00:05:17,940
The first argument, which is the path, should be clear: it's the path to the data file where your CSV data

65
00:05:17,940 --> 00:05:21,140
set is stored.

66
00:05:21,420 --> 00:05:25,880
The second argument is the format, which I have specified as CSV.

67
00:05:25,980 --> 00:05:31,350
Other options are TSV and JSON, but these are not so significant, in my opinion.

68
00:05:31,500 --> 00:05:37,290
A TSV is just like a CSV but with tabs instead of commas. It's trivial to switch between them, so there's

69
00:05:37,350 --> 00:05:41,260
no advantage to using one over the other.

70
00:05:41,280 --> 00:05:46,620
JSON is nice, but it just adds extra characters to your data file without any real benefit.

71
00:05:46,800 --> 00:05:51,570
You can use it if you want, but again, there's no distinct advantage unless for some reason your data

72
00:05:51,600 --> 00:05:53,080
already came in that format.

73
00:05:56,350 --> 00:06:04,460
Next, we set skip_header equal to True, since in our dataset there is a header row that we'll want to ignore.

74
00:06:04,850 --> 00:06:10,850
Finally, we set the fields argument, which is a list of tuples. Each tuple contains two items:

75
00:06:10,880 --> 00:06:14,420
the name of the field and the corresponding Field object.

76
00:06:14,550 --> 00:06:21,910
I've called the first one data and the second one label.

77
00:06:22,140 --> 00:06:27,990
The next step is to split the data into train and test sets, which is nice and easy with torchtext. Just

78
00:06:27,990 --> 00:06:34,550
call dataset.split, and this returns two items: your train set and your test set. As input,

79
00:06:34,560 --> 00:06:40,020
you can pass in an argument called split_ratio to say what percentage of the dataset you want to be

80
00:06:40,020 --> 00:06:41,630
part of the train set.

81
00:06:41,880 --> 00:06:50,230
By default this is 70 percent so I haven't bothered to change it.
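
Conceptually, a split with a ratio of 0.7 amounts to something like the following pure-Python sketch. The helper split_dataset, the fixed seed, and the toy documents are assumptions for illustration only, not torchtext's implementation.

```python
import random


def split_dataset(examples, split_ratio=0.7, seed=0):
    """Shuffle a copy of the examples, then cut it into train and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(round(split_ratio * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


examples = [f"document {i}" for i in range(10)]
train_set, test_set = split_dataset(examples)
print(len(train_set), len(test_set))
```

Every example lands in exactly one of the two sets, and the ratio only controls where the cut is made.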

82
00:06:50,340 --> 00:06:56,100
The next step is to call the build_vocab function on our text field from earlier. This assigns a unique

83
00:06:56,130 --> 00:07:02,910
integer to each unique token in the dataset, so you can imagine that internally torchtext is doing the

84
00:07:02,910 --> 00:07:07,740
tokenization and all the other text preprocessing steps that we discussed earlier.

85
00:07:08,160 --> 00:07:14,130
This function by itself doesn't return anything, but after doing this we can access the vocab attribute

86
00:07:14,400 --> 00:07:18,320
in the text field which returns a vocab object.

87
00:07:18,600 --> 00:07:20,220
So what is a vocab object?

88
00:07:25,370 --> 00:07:31,120
Well, the vocab object contains the word-to-index mapping that we talked about in the previous lecture.

89
00:07:31,280 --> 00:07:38,120
You can access it through vocab.stoi. So stoi should remind you of C-style naming, if you've

90
00:07:38,120 --> 00:07:41,110
ever coded in the C programming language before.

91
00:07:41,630 --> 00:07:47,600
This returns a dictionary where the key is a token and the value is a unique integer that corresponds

92
00:07:47,600 --> 00:07:48,350
to the token.

93
00:07:49,340 --> 00:07:54,870
Additionally, you can access vocab.itos, which gives you the opposite.

94
00:07:55,040 --> 00:08:00,300
In particular it returns a list of all the unique tokens in the dataset.

95
00:08:00,680 --> 00:08:04,700
Now you might be wondering: how can the opposite of a dictionary be a list?

96
00:08:09,820 --> 00:08:10,070
Well,

97
00:08:10,080 --> 00:08:14,770
remember that a dictionary is nothing but a set of indices and corresponding values.

98
00:08:14,820 --> 00:08:16,800
So too is a list.

99
00:08:16,800 --> 00:08:19,790
A list has the indices 0 up to K minus 1

100
00:08:19,890 --> 00:08:25,710
if the list has length K. The values of a list are just the elements of that list.

101
00:08:25,770 --> 00:08:33,270
So if I look up vocab.stoi of dog and I get back 211, then if I look up vocab.itos of

102
00:08:33,270 --> 00:08:40,330
211, I will get back the string dog. So they are inverse mappings relative to each other.
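
Here is a minimal sketch of how such a pair of mappings could be built by hand. The toy documents, the whitespace tokenization, the two special tokens, and the frequency-based ordering are all assumptions of this sketch, so the exact integers will differ from what torchtext assigns.

```python
from collections import Counter

# Hypothetical toy corpus, lowercased and split on whitespace.
documents = ["the dog barks", "the cat sleeps", "the dog sleeps"]
counts = Counter(token for doc in documents for token in doc.lower().split())

# itos is a list: position i holds the token whose integer is i.
# Two assumed special tokens come first, then tokens by descending frequency.
itos = ["<unk>", "<pad>"] + [token for token, _ in counts.most_common()]

# stoi is a dictionary: token -> integer, i.e. the inverse of itos.
stoi = {token: index for index, token in enumerate(itos)}

index = stoi["dog"]
print(index, itos[index])  # looking the index up in itos gives back "dog"
```

Because stoi is built by enumerating itos, the two structures are inverses of each other by construction.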

103
00:08:40,370 --> 00:08:44,660
Now you might be wondering: who cares, and why do we even need these mappings?

104
00:08:44,670 --> 00:08:50,640
Well, if all we want to do is text classification, then we don't necessarily, but there may be instances

105
00:08:50,640 --> 00:08:56,340
where, let's say, you're doing text generation or machine translation, where the output of the neural

106
00:08:56,340 --> 00:08:58,650
network is a sequence of integers.

107
00:08:58,860 --> 00:09:04,170
Of course in order to turn that into text you have to convert that sequence of integers back into a

108
00:09:04,170 --> 00:09:10,730
sentence and in order to do that you must know which integer corresponds to which word in your vocabulary.
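
As a sketch of that last step, the snippet below turns a sequence of integers back into a sentence. The itos list, the predicted indices, and the decode helper are hypothetical; in practice the list would come from the vocabulary built on your own dataset.

```python
# Hypothetical index-to-string list, standing in for a real vocabulary.
itos = ["<unk>", "<pad>", "the", "dog", "sleeps"]


def decode(indices, itos, pad_token="<pad>"):
    """Map each integer back to its token and drop the padding tokens."""
    tokens = [itos[i] for i in indices]
    return " ".join(token for token in tokens if token != pad_token)


predicted = [1, 1, 2, 3, 4]  # e.g. a pre-padded sequence produced by a model
sentence = decode(predicted, itos)
print(sentence)
```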

109
00:09:15,740 --> 00:09:16,090
Okay.

110
00:09:16,110 --> 00:09:19,790
So at this point we have a train dataset and a test dataset.

111
00:09:19,810 --> 00:09:20,260
Now what?

112
00:09:21,040 --> 00:09:27,610
Well, imagine your dataset is huge, so huge that, just like with our image datasets, we'd rather do batch gradient

113
00:09:27,610 --> 00:09:30,800
descent rather than full gradient descent.

114
00:09:30,940 --> 00:09:36,370
So just like in torchvision, where we had an iterator over our image dataset, we would like to use an

115
00:09:36,370 --> 00:09:39,430
iterator from torchtext for our text dataset.

116
00:09:40,510 --> 00:09:48,340
Here's an example of how we would create an iterator for our text datasets. The first argument is a tuple

117
00:09:48,460 --> 00:09:51,910
containing the datasets that we previously created.

118
00:09:51,910 --> 00:09:54,400
The second argument is a sort key.

119
00:09:54,520 --> 00:10:00,160
The idea behind this is that, because we'll need to pad each batch of our dataset, we would like each

120
00:10:00,160 --> 00:10:04,630
of the sentences in each batch to be of similar size.

121
00:10:04,630 --> 00:10:10,780
So by specifying the sort key to be a function that returns the length of the text field, we can tell

122
00:10:10,780 --> 00:10:17,360
torchtext to organize the batches so that each sentence in a batch will be of roughly the same length.
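
The effect of a length-based sort key can be sketched in plain Python as follows. This only illustrates the idea of grouping similar lengths together; it is not how torchtext's iterator is actually implemented, and the helper names and toy sequences are assumptions.

```python
def bucket_batches(examples, batch_size, sort_key=len):
    """Sort examples by sort_key, then slice them into consecutive batches."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_needed(batches):
    """Count how many padding slots are needed to make each batch rectangular."""
    return sum(max(map(len, batch)) - len(seq) for batch in batches for seq in batch)


# Hypothetical tokenized sentences of very different lengths.
examples = [[1, 2, 3, 4, 5, 6], [7], [8, 9], [1, 2, 3, 4, 5], [6, 7], [8]]

bucketed = bucket_batches(examples, batch_size=2)
unsorted = [examples[i:i + 2] for i in range(0, len(examples), 2)]

print(padding_needed(bucketed), padding_needed(unsorted))
```

Grouping sentences of similar length means far fewer padding values per batch, which is the motivation for passing a sort key.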

123
00:10:17,380 --> 00:10:19,620
Next, we pass in batch_sizes.

124
00:10:19,780 --> 00:10:24,500
This is a tuple containing the same number of elements as we have datasets.

125
00:10:24,520 --> 00:10:29,590
Basically, the first number represents the batch size of the train dataset, and the second number represents

126
00:10:29,590 --> 00:10:31,240
the batch size of the test dataset.

127
00:10:32,760 --> 00:10:36,490
Usually we can specify the batch size of the test dataset to be larger,

128
00:10:36,690 --> 00:10:43,710
since it doesn't need to do any heavy processing other than just making predictions. Finally, we set the

129
00:10:43,710 --> 00:10:48,900
device argument to the device object that we've been creating in all of our notebooks, which refers

130
00:10:48,900 --> 00:10:50,270
to the GPU.

131
00:10:50,430 --> 00:10:55,500
So this means that our dataset will be automatically placed on the GPU without us having to do so

132
00:10:55,500 --> 00:10:56,070
manually.

133
00:11:01,160 --> 00:11:01,420
All right.

134
00:11:01,450 --> 00:11:04,630
So at this point you should be asking: what's next?

135
00:11:04,630 --> 00:11:10,450
Since we now have iterators that can loop through our train set and our test set and return batches

136
00:11:10,480 --> 00:11:14,990
that are just N by T arrays of integers corresponding to each sentence,

137
00:11:15,010 --> 00:11:16,630
there's nothing more to do.

138
00:11:16,780 --> 00:11:22,840
We can pass this data directly into our RNN for both training and testing, so you can imagine that

139
00:11:22,840 --> 00:11:30,350
our training loop will look like it usually does, as will our loop to calculate the accuracy. One minor

140
00:11:30,350 --> 00:11:35,840
caveat is that the torchtext iterator gives us back a target that is only one-dimensional.

141
00:11:35,840 --> 00:11:41,360
We know that in PyTorch we like to have two-dimensional targets, so we can call the view function to

142
00:11:41,360 --> 00:11:48,020
convert each batch of targets from an N-length one-dimensional array to an N by 1 two-dimensional array.