1
00:00:11,680 --> 00:00:17,620
In this lecture, we are going to work on a notebook that just does text preprocessing, so we can explore

2
00:00:17,620 --> 00:00:19,630
a little bit of how it works.

3
00:00:19,630 --> 00:00:24,460
This lecture will be useful before we move on to the next script where we will assume that we already

4
00:00:24,460 --> 00:00:26,080
know how it works.

5
00:00:26,110 --> 00:00:31,060
As usual, you can look at the title of the notebook to determine which notebook we are currently looking

6
00:00:31,060 --> 00:00:35,180
at.

7
00:00:35,360 --> 00:00:41,210
OK, so we're going to start by making some fake data. Since we eventually want this to be saved as a

8
00:00:41,230 --> 00:00:41,990
CSV,

9
00:00:42,090 --> 00:00:48,000
it's convenient to create a DataFrame. But in order to do that, a convenient way to specify the data

10
00:00:48,150 --> 00:00:49,110
is in a dictionary.

11
00:00:49,740 --> 00:00:56,310
So in this dictionary, the keys refer to our eventual columns, and the values are lists whose entries correspond

12
00:00:56,310 --> 00:00:57,870
to each row in the data.

13
00:00:58,620 --> 00:01:01,420
So we're going to have three samples in our dataset.

14
00:01:01,500 --> 00:01:03,920
The labels are 0, 1, and 1.

15
00:01:04,230 --> 00:01:13,320
The sentences are "I like eggs and ham", "eggs I like", and "ham and eggs or just ham". I've added some punctuation

16
00:01:13,320 --> 00:01:22,180
to this dataset so we can see what happens to it in our tokenizer.

17
00:01:22,540 --> 00:01:28,720
Next, we pass this data into the DataFrame constructor to instantiate a DataFrame.

18
00:01:28,720 --> 00:01:32,440
Next, we call df.head() so we can look at the DataFrame.

19
00:01:32,620 --> 00:01:34,540
Hopefully this is what you expected to see.

20
00:01:38,270 --> 00:01:38,780
Next,

21
00:01:38,870 --> 00:01:46,340
we save the DataFrame to a CSV file by calling the to_csv function. We can also use the command

22
00:01:46,340 --> 00:01:55,260
line version of head to see the contents of the file.

23
00:01:55,270 --> 00:02:01,710
Next, we're going to create our two Field objects, one for the text and one for the label. For the text,

24
00:02:01,710 --> 00:02:07,620
I've set sequential equal to True, since our data is a sequence of words. I've set batch_first equal to

25
00:02:07,630 --> 00:02:13,890
True, since I want the data to be N by something rather than something by N. I've set lower equal to True

26
00:02:14,040 --> 00:02:19,020
so that the data is lowercased. I've set tokenize equal to 'spacy'.

27
00:02:19,020 --> 00:02:24,570
This might seem odd to you, but spaCy is another NLP library in Python that is only somewhat tangentially

28
00:02:24,570 --> 00:02:26,380
related to PyTorch.

29
00:02:26,430 --> 00:02:32,290
Basically, if you leave this argument unset, it's just going to use string.split by default.

30
00:02:32,400 --> 00:02:36,050
I would recommend you try both to see what the differences are.

31
00:02:36,060 --> 00:02:40,360
I wanted to use spaCy here to show you how it deals with punctuation.

32
00:02:40,590 --> 00:02:45,810
Finally, we set pad_first equal to True so that the padding goes at the beginning of each sentence rather

33
00:02:45,810 --> 00:02:48,260
than at the end.

34
00:02:48,270 --> 00:02:50,400
Next we set the label field.

35
00:02:50,400 --> 00:02:52,490
Here we set sequential equal to False,

36
00:02:52,530 --> 00:02:57,770
since this column does not contain sequential data. We set use_vocab equal to False,

37
00:02:57,870 --> 00:03:04,880
since this column contains no vocabulary and it's just integers. And finally, we set is_target equal to

38
00:03:04,880 --> 00:03:05,630
True,

39
00:03:05,630 --> 00:03:07,160
since this is the target column.

40
00:03:12,950 --> 00:03:17,570
Next, we create a TabularDataset object, where we pass in the CSV file

41
00:03:17,570 --> 00:03:24,170
we created earlier. We set format equal to 'csv', since it's a CSV, and we set skip_header equal

42
00:03:24,170 --> 00:03:28,190
to True, since pandas saves the header row by default.

43
00:03:28,190 --> 00:03:35,070
We set the fields argument to a list containing the fields we just created. One important note is that

44
00:03:35,070 --> 00:03:40,000
these fields must be specified in the same order as they appear in your file.

45
00:03:40,320 --> 00:03:45,150
So if your file contains the data column to the left of the label column, you would put the data field

46
00:03:45,150 --> 00:03:45,750
first.

47
00:03:49,210 --> 00:03:54,640
So, if you check the attributes of the dataset object, you'll see it has one called examples, which is

48
00:03:54,640 --> 00:03:57,130
a list of Example objects.

49
00:03:57,130 --> 00:04:01,200
What I've done here is I've grabbed the first one so we can see what's in it.

50
00:04:01,360 --> 00:04:03,940
As you can see there is an attribute called data.

51
00:04:04,240 --> 00:04:09,640
And if we look at this attribute we can see that it contains our sentences but not in their regular

52
00:04:09,640 --> 00:04:10,680
form.

53
00:04:10,690 --> 00:04:13,710
In fact, our sentence is now tokenized.

54
00:04:13,840 --> 00:04:18,880
You can see that the period has been separated from the word ham which would not be the case if you

55
00:04:18,880 --> 00:04:20,620
just did a plain string.split.
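
To make that difference concrete, here is a small sketch. It is not the notebook's code: simple_tokenize is a hypothetical regex stand-in for a punctuation-aware tokenizer such as spaCy, and the trailing period in the example sentence is an assumption.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: lowercase the text,
    # then emit runs of word characters and single punctuation marks
    # as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

sentence = "I like eggs and ham."  # punctuation assumed for illustration

split_tokens = sentence.lower().split()
punct_tokens = simple_tokenize(sentence)

print(split_tokens)  # ['i', 'like', 'eggs', 'and', 'ham.']
print(punct_tokens)  # ['i', 'like', 'eggs', 'and', 'ham', '.']
```

With a plain split, "ham." stays a single token, so "ham" and "ham." would end up as two different vocabulary entries.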

56
00:04:26,410 --> 00:04:30,010
If we look at the label, it just returns 0, as we would expect.

57
00:04:34,540 --> 00:04:34,860
Next,

58
00:04:34,870 --> 00:04:37,150
I'm going to split the dataset.

59
00:04:37,150 --> 00:04:40,810
This is kind of weird in our case since the data only contains three samples.

60
00:04:40,900 --> 00:04:45,730
But let's just pretend this is OK.

61
00:04:45,750 --> 00:04:50,760
Next, we're going to call the build_vocab function on our text field and pass in the train dataset.

62
00:04:51,720 --> 00:04:56,180
Oddly this is not done automatically when you create the dataset object.

63
00:04:56,220 --> 00:04:57,840
Again, it's not a mature library.

64
00:05:01,740 --> 00:05:05,150
Next, we access the vocab attribute and check its type.

65
00:05:05,190 --> 00:05:07,530
We can see that it is a Vocab object.

66
00:05:12,100 --> 00:05:18,520
Next, we access the stoi attribute, which returns a dictionary. Since our dataset is so small, which

67
00:05:18,520 --> 00:05:25,450
I did on purpose, we can see the entire dictionary. As you can see, each token in the dataset is assigned

68
00:05:25,450 --> 00:05:27,240
a unique integer.

69
00:05:27,310 --> 00:05:34,620
Also, punctuation is considered a token, since we used the spaCy tokenizer. We can see that the unknown

70
00:05:34,620 --> 00:05:39,920
token gets assigned the index 0 and the pad token gets assigned the index 1.

71
00:05:40,020 --> 00:05:42,330
We can also see that all the words are lowercase.

72
00:05:49,980 --> 00:05:50,540
Next,

73
00:05:50,550 --> 00:05:52,840
let's look at the itos attribute.

74
00:05:53,130 --> 00:05:56,020
We can see that it contains a list of our tokens.

75
00:05:56,340 --> 00:06:02,070
You should confirm to yourself that the index of each element in this list corresponds to the values

76
00:06:02,100 --> 00:06:08,240
in the dictionary above.
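
As a rough sketch of how the two mappings relate, here is a tiny vocabulary built in plain Python. This is not the library's implementation: build_vocab below is a hypothetical helper, the token lists and their punctuation are assumed, and the ordering rule (most frequent first, ties alphabetical) is an assumption too. Only the unknown token at index 0 and the pad token at index 1 follow what we saw in the notebook.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    # Count every token across all tokenized sentences.
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Assumed ordering: most frequent first, ties broken alphabetically.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    # itos: index -> token (a list); stoi: token -> index (a dict).
    itos = list(specials) + ordered
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

# Assumed tokenized training sentences, for illustration only.
train_tokens = [
    ["eggs", "i", "like", "!"],
    ["ham", "and", "eggs", "or", "just", "ham", "?"],
]
stoi, itos = build_vocab(train_tokens)

# The two mappings are inverses of each other.
consistent = all(stoi[tok] == i for i, tok in enumerate(itos))
print(consistent)

# A token that never appeared in training falls back to the unknown index.
print(stoi.get("spam", stoi["<unk>"]))
```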

77
00:06:08,270 --> 00:06:11,440
Next we create a device object as we normally do.

78
00:06:16,010 --> 00:06:19,310
Next we create an iterator from our datasets.

79
00:06:19,370 --> 00:06:21,300
We set the batch size equal to 2,

80
00:06:21,440 --> 00:06:24,650
since both the train and test sets only have a maximum size of 2.

81
00:06:30,630 --> 00:06:36,010
Next, we loop through the train iterator and print the inputs and targets along with their shapes.

82
00:06:36,090 --> 00:06:44,440
As you can see the shape of the inputs is 2 by 7.

83
00:06:44,540 --> 00:06:50,700
Next we do the same thing for the test iterator and the shape of the inputs is 1 by 6.

84
00:06:50,750 --> 00:06:52,140
So what do we learn from this?

85
00:06:52,730 --> 00:06:58,250
Well, we learn that PyTorch already intelligently makes your batches have the smallest sequence

86
00:06:58,250 --> 00:06:59,720
length possible.

87
00:06:59,720 --> 00:07:03,720
There is no unnecessary padding in the train batch or the test batch.

88
00:07:03,770 --> 00:07:10,100
This is good since for example if our data has a maximum sequence length of 10 but the sentences in

89
00:07:10,100 --> 00:07:14,360
our batch only have length 2 or 3, then that would be a huge waste of computation.
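
Here is a minimal sketch of that idea, padding each batch only up to the longest sequence inside it. The helper pad_batch and the integer sequences are made up for illustration and are not the iterator's actual code. The pad index of 1 and padding at the front match the settings described earlier.

```python
def pad_batch(sequences, pad_index=1, pad_first=True):
    # Pad only up to the longest sequence in THIS batch,
    # not the longest sequence in the whole dataset.
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_index] * (max_len - len(seq))
        # pad_first=True puts the padding before the tokens.
        padded.append(padding + list(seq) if pad_first else list(seq) + padding)
    return padded

# Two made-up index sequences of length 4 and 7.
batch = pad_batch([[2, 7, 9, 4], [3, 6, 2, 10, 8, 3, 5]])
for row in batch:
    print(row)
# [1, 1, 1, 2, 7, 9, 4]
# [3, 6, 2, 10, 8, 3, 5]
```

The result is 2 by 7, because 7 is the longest sequence in this batch. A batch holding a single sequence of length 6 would come out 1 by 6, with no padding at all.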

90
00:07:18,790 --> 00:07:25,510
Finally, if we look at these sequences of integers carefully, we can probably deduce which sequence belongs

91
00:07:25,510 --> 00:07:27,590
to which original sentence.

92
00:07:27,730 --> 00:07:32,260
The one with all the padding is the easiest because that corresponds to the shortest sentence in our

93
00:07:32,260 --> 00:07:33,830
small dataset.

94
00:07:34,060 --> 00:07:40,420
As an exercise, here's what you want to do: instead of just deducing which sentence goes with which sequence,

95
00:07:40,720 --> 00:07:46,540
make use of the index to token mapping from earlier to programmatically convert these sequences back

96
00:07:46,540 --> 00:07:48,340
into their original sentence form.
