1
00:00:00,050 --> 00:00:03,770
In this section, we'll dive into preparing our data.

2
00:00:03,770 --> 00:00:11,960
So we have this malaria data set, which is made available on the TensorFlow platform.

3
00:00:11,960 --> 00:00:14,000
So here, let's scroll down.

4
00:00:14,000 --> 00:00:15,680
Here we have a description.

5
00:00:15,680 --> 00:00:24,920
The malaria data set contains 27,558 cell images with equal instances of parasitized and uninfected

6
00:00:24,920 --> 00:00:30,410
cells from the thin blood smear slide images of segmented cells.

7
00:00:30,830 --> 00:00:34,220
We have this additional information right here.

8
00:00:34,220 --> 00:00:36,800
And then we can see the size.

9
00:00:36,800 --> 00:00:43,340
And then we have this figure, which shows us some sample images.

10
00:00:43,340 --> 00:00:47,000
Now we are going to click on explore in Know Your Data.

11
00:00:47,000 --> 00:00:48,740
So let's open this.

12
00:00:48,830 --> 00:00:56,750
And then we could check out the different data samples we have here and overall information about

13
00:00:56,750 --> 00:00:57,500
our data.

14
00:00:57,500 --> 00:01:00,860
Let's select a cell at random from the right.

15
00:01:00,860 --> 00:01:04,160
You see here we have this uninfected cell.

16
00:01:04,460 --> 00:01:07,850
We have all the different features and their values. Has faces:

17
00:01:07,850 --> 00:01:08,570
False.

18
00:01:08,570 --> 00:01:09,170
Is portrait:

19
00:01:09,170 --> 00:01:10,790
False. Labels.

20
00:01:10,970 --> 00:01:12,320
Number of faces.

21
00:01:12,320 --> 00:01:16,790
Well, this is a general Know Your Data platform.

22
00:01:16,790 --> 00:01:20,750
So you would have these kinds of features, actually.

23
00:01:20,750 --> 00:01:27,740
So although we're not working with faces in this case, we would definitely

24
00:01:27,740 --> 00:01:29,690
have this because this is something general.

25
00:01:29,690 --> 00:01:33,890
So we have the aspect ratio.

26
00:01:33,890 --> 00:01:41,690
We have the format as PNG, we have the resolution, we have the mode RGB, we have the height and then

27
00:01:41,690 --> 00:01:42,740
we have the width.

28
00:01:42,740 --> 00:01:50,120
So you see we have this image right here: exposure quality, sharpness score, and so on and so forth.

29
00:01:50,120 --> 00:01:51,590
So that's it.

30
00:01:51,590 --> 00:01:54,590
We see we have the label and then we have our image.

31
00:01:54,590 --> 00:01:58,310
So you could pick out some other one and then check that out.

32
00:01:58,310 --> 00:02:03,170
Anyway, here you notice that we only have these uninfected cells.

33
00:02:03,170 --> 00:02:05,750
And that's simply because we start with the uninfected.

34
00:02:05,750 --> 00:02:07,880
That is, out of these 27,000 items:

35
00:02:07,880 --> 00:02:16,580
The first, let's say, 13,000 or 13,500, or whatever the number is, are uninfected cells, and then

36
00:02:16,580 --> 00:02:18,620
the rest are parasitized cells.

37
00:02:18,620 --> 00:02:25,370
So if we change the order, you see now we have the parasitized cells, and clearly you could see,

38
00:02:25,460 --> 00:02:30,890
visually, the difference between the parasitized and the uninfected.

39
00:02:30,920 --> 00:02:36,320
Notice how you have these little purple patches here and there in the different images.

40
00:02:36,320 --> 00:02:41,480
And then when you click here, you find that the image is much clearer, or rather, you don't have

41
00:02:41,480 --> 00:02:47,240
those kinds of patches we had here. You see, these patches we have with the parasitized

42
00:02:47,240 --> 00:02:51,530
cells, we no longer have them with the uninfected cells.

43
00:02:52,010 --> 00:02:55,670
You could also decide to group by these different features.

44
00:02:55,670 --> 00:02:59,330
So, for example, let's group by aspect ratio.

45
00:02:59,330 --> 00:03:03,410
See we have the different aspect ratios and so on and so forth.

46
00:03:03,410 --> 00:03:10,130
So you could decide on how you want to group the images and visualize them that way.

47
00:03:10,130 --> 00:03:12,500
So that's it for this part.

48
00:03:12,680 --> 00:03:16,190
We'll now dive into loading our data set.

49
00:03:16,190 --> 00:03:20,900
And to load our data set, we'll make use of the TensorFlow Datasets load method.

50
00:03:20,900 --> 00:03:23,330
So we'll head over to the documentation.

51
00:03:23,330 --> 00:03:29,120
That's the TensorFlow Datasets API docs, Python to be specific: tensorflow_datasets.

52
00:03:29,120 --> 00:03:31,760
And then we have this load method right here.

53
00:03:31,760 --> 00:03:38,360
So with this load method we'll be able to simply specify, well, let's get back here.

54
00:03:38,360 --> 00:03:47,390
We'll be able to simply specify this name and then load up this data, such that we'll have as output:

55
00:03:47,810 --> 00:03:53,180
Here, our TensorFlow data set along with, obviously, its information.

56
00:03:53,300 --> 00:03:56,060
With that said, we'll start with the imports.

57
00:03:56,060 --> 00:04:01,820
So we get back to our notebook and then we simply import TensorFlow as tf.

58
00:04:01,820 --> 00:04:04,010
Then we import numpy.

59
00:04:04,040 --> 00:04:15,650
numpy as np. We import matplotlib for visualizations: matplotlib.pyplot as plt.

60
00:04:15,650 --> 00:04:18,500
And then we import TensorFlow Datasets:

61
00:04:19,160 --> 00:04:25,370
tensorflow_datasets as tfds.

62
00:04:25,400 --> 00:04:31,070
We'll run that and make sure you have the runtime set to GPU.

63
00:04:31,100 --> 00:04:33,230
Let's now go ahead and load the data.

64
00:04:33,230 --> 00:04:37,880
So we have dataset and then dataset_info.

65
00:04:37,910 --> 00:04:41,210
Then we have TensorFlow Datasets.

66
00:04:41,840 --> 00:04:44,690
That's tfds.load.

67
00:04:44,690 --> 00:04:48,500
And now all you need to specify is malaria.

68
00:04:48,500 --> 00:04:53,900
So as we have seen already, the name which we saw in the documentation is the exact same name

69
00:04:53,900 --> 00:04:54,740
we use here.

70
00:04:54,740 --> 00:04:56,600
So we have that malaria.

71
00:04:56,600 --> 00:04:59,780
And then we have with_info.

72
00:04:59,900 --> 00:05:02,030
That's with_info set to True.
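A minimal sketch of the call just described, assuming the tensorflow_datasets package is installed. The helper name load_malaria is made up for illustration, and the import sits inside the function so that defining it does not trigger a download.
```python
DATASET_NAME = "malaria"  # same name as shown on the catalog page
def load_malaria():
    """Return (dataset, dataset_info) for the malaria data set (hypothetical helper)."""
    # Imported lazily so this sketch can be defined without TFDS installed.
    import tensorflow_datasets as tfds
    # with_info=True makes tfds.load return a (dataset, info) tuple.
    dataset, dataset_info = tfds.load(DATASET_NAME, with_info=True)
    return dataset, dataset_info
```
Calling load_malaria() in a notebook would download the data on first use, so expect that first call to take a while.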

73
00:05:02,060 --> 00:05:04,640
Let's run that and then see what we obtain.

74
00:05:04,640 --> 00:05:07,430
So let's have dataset.

75
00:05:07,430 --> 00:05:12,920
And then we have dataset_info. This is now downloaded.

76
00:05:12,920 --> 00:05:17,450
You see here we have the data set and we have the data set info.

77
00:05:17,450 --> 00:05:21,260
So we have this data set, which is a PrefetchDataset.

78
00:05:21,260 --> 00:05:23,630
You can see the shape right here:

79
00:05:23,660 --> 00:05:25,190
None by None by 3.

80
00:05:25,190 --> 00:05:31,580
So we have an RGB image, and the data type is unsigned int 8 (uint8).

81
00:05:31,880 --> 00:05:35,000
We have the label, which is an integer.

82
00:05:35,000 --> 00:05:42,440
So we will have zero for, let's say, parasitized.

83
00:05:42,440 --> 00:05:47,390
And then we will have one for uninfected.

84
00:05:47,390 --> 00:05:49,070
Uninfected.

85
00:05:49,070 --> 00:05:49,940
So that's it.

86
00:05:49,940 --> 00:05:57,920
So let's take this off and say we select only train, and there we go.

87
00:05:57,920 --> 00:05:59,930
You see we have just that data set.

88
00:05:59,930 --> 00:06:08,090
So let's say for i in train, and let's take a single element.

89
00:06:08,090 --> 00:06:11,810
So let's take just that single element and let's print out that i.

90
00:06:13,390 --> 00:06:14,410
Oh that's it.

91
00:06:14,410 --> 00:06:15,790
You see, we have the image.

92
00:06:15,790 --> 00:06:18,940
That's it, 103 by 103 by three.

93
00:06:18,940 --> 00:06:24,490
And then we scroll down and we have the corresponding label which is one.

94
00:06:24,490 --> 00:06:30,460
And then we have the data set info: the name, the full name, the description.

95
00:06:30,460 --> 00:06:40,720
And so on, right up to the citation. We could verify that the length of our data set is 27,558.

96
00:06:40,720 --> 00:06:42,850
By printing this out.

97
00:06:42,850 --> 00:06:46,420
And then we could now dive into splitting up our data set.

98
00:06:46,420 --> 00:06:48,400
So here we have the split.

99
00:06:48,400 --> 00:06:51,280
You see that we specify train, train, and train.

100
00:06:51,280 --> 00:06:56,890
Because our data set is initially made up of just the train data.

101
00:06:57,040 --> 00:07:03,700
And so now what we are saying is that we want to have all the elements in the first 80%.

102
00:07:03,700 --> 00:07:13,660
So essentially what we're saying is we're going to have element zero right up to element 0.8

103
00:07:13,660 --> 00:07:17,620
times 27,558.

104
00:07:17,620 --> 00:07:20,230
So let's take this off.

105
00:07:20,500 --> 00:07:28,900
We run that, and you see, what this tells us is we're going from zero right up

106
00:07:28,900 --> 00:07:32,380
to 22,046.

107
00:07:32,380 --> 00:07:37,900
So these first 22,000 or so elements will be used for training.

108
00:07:37,930 --> 00:07:44,440
Then the next 10% of our full data set will be used for validation.

109
00:07:44,440 --> 00:07:48,220
And then the final 10% will be used for testing.
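The arithmetic behind this 80/10/10 split can be sketched in plain Python. Rounding each boundary to the nearest index is an assumption about how the percentage slices are resolved, and split_sizes is a made-up helper name.
```python
TOTAL = 27_558  # number of images in the data set
def split_sizes(total, train_pct=80, val_pct=10):
    """Sizes of the slices [0:80%], [80%:90%] and [90%:100%]."""
    # Integer arithmetic avoids floating-point surprises; +50 rounds to nearest.
    train_end = (total * train_pct + 50) // 100
    val_end = (total * (train_pct + val_pct) + 50) // 100
    return train_end, val_end - train_end, total - val_end
print(split_sizes(TOTAL))  # (22046, 2756, 2756)
```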

110
00:07:48,220 --> 00:07:51,280
So with that said, let's run this.

111
00:07:51,280 --> 00:07:57,100
And now, with this set, you will find that we have this list made of three different

112
00:07:57,100 --> 00:07:57,910
data sets.

113
00:07:57,910 --> 00:08:06,130
So let's say we have train_dataset equal to dataset[0].

114
00:08:06,130 --> 00:08:10,330
And then we repeat the same for validation and testing.

115
00:08:10,330 --> 00:08:12,040
So here we have two.

116
00:08:12,070 --> 00:08:14,230
We have one.

117
00:08:14,620 --> 00:08:15,400
There we go.

118
00:08:15,400 --> 00:08:16,600
We have one.

119
00:08:16,600 --> 00:08:18,460
And then we could take this off.

120
00:08:18,460 --> 00:08:21,220
We have validation and then we have testing.

121
00:08:21,220 --> 00:08:23,170
So let's run that.

122
00:08:23,170 --> 00:08:26,950
And now we could get the length.

123
00:08:27,580 --> 00:08:33,640
We could get the length of our train data set.

124
00:08:33,640 --> 00:08:39,880
And we could do the same for the validation and the test data sets: test.

125
00:08:39,880 --> 00:08:44,200
And val. Run that and see what we get.

126
00:08:44,200 --> 00:08:45,250
So that's it.

127
00:08:45,250 --> 00:08:47,320
That's how we split up our data set.

128
00:08:47,320 --> 00:08:53,110
We can again print out those different lengths for the train, validation, and testing.

129
00:08:53,110 --> 00:09:00,460
So you see we have 22,046 for training and 2,756 for both the validation and the testing.

130
00:09:00,460 --> 00:09:03,550
From here we'll go ahead and shuffle our data set.

131
00:09:03,550 --> 00:09:07,330
If you check out the documentation, you see we have shuffle_files.

132
00:09:07,330 --> 00:09:09,640
Let's just take this this way.

133
00:09:09,640 --> 00:09:16,300
And we have here shuffle_files, and we'll set that to True.

134
00:09:16,300 --> 00:09:20,350
And the reason why it's important for us to shuffle is simple.

135
00:09:20,680 --> 00:09:24,610
Suppose that you have some data.

136
00:09:24,610 --> 00:09:28,180
And this first part is uninfected.

137
00:09:28,180 --> 00:09:31,000
And then this next part, this next part.

138
00:09:31,000 --> 00:09:32,710
Let's take this off.

139
00:09:32,710 --> 00:09:36,580
This next part is parasitized.

140
00:09:37,390 --> 00:09:38,620
Let's take this from here.

141
00:09:38,650 --> 00:09:45,490
Now, if you take out the first 80 percent, then you end up with something like this

142
00:09:45,490 --> 00:09:46,210
being your train.

143
00:09:46,210 --> 00:09:48,070
So this will be your train data.

144
00:09:48,400 --> 00:09:49,690
Let's take that off.

145
00:09:49,690 --> 00:09:52,150
So let's say this is going to be our train data.

146
00:09:52,330 --> 00:09:57,730
And then this is going to be our validation and testing.

147
00:09:57,730 --> 00:10:00,520
But the problem now is this:

148
00:10:01,300 --> 00:10:08,980
Validation and testing are made up of only parasitized cells, whereas if we had shuffled our data, then

149
00:10:08,980 --> 00:10:14,020
we would have a mixture of the parasitized and uninfected cells.

150
00:10:14,020 --> 00:10:16,480
So you would have something like this instead.

151
00:10:16,600 --> 00:10:21,280
Let's just take this randomly.

152
00:10:21,280 --> 00:10:23,380
We'll randomly have that.

153
00:10:24,700 --> 00:10:25,660
And there we go.

154
00:10:25,660 --> 00:10:32,380
So in this case, you see that we're going to go from, well, let's add this

155
00:10:32,380 --> 00:10:32,980
up.

156
00:10:34,480 --> 00:10:36,040
We have that.

157
00:10:36,870 --> 00:10:42,870
We're going to have this first 80% and then we'll break this up again.

158
00:10:42,870 --> 00:10:50,220
And you see now that our validation and the testing data sets have both the uninfected and the parasitized

159
00:10:50,220 --> 00:10:50,970
cells.
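A toy illustration of this argument in plain Python, not of shuffle_files itself. The ten labels, the helper tail_classes and the seed are all made up for the sketch.
```python
import random
labels = [1] * 5 + [0] * 5  # ordered: uninfected (1) first, then parasitized (0)
def tail_classes(seq, train_ratio=0.8):
    """Classes present in the part left over for validation and testing."""
    cut = int(train_ratio * len(seq))
    return set(seq[cut:])
print(tail_classes(labels))  # {0}: only parasitized cells are left
shuffled = labels[:]
random.Random(0).shuffle(shuffled)  # fixed seed, chosen only for repeatability
print(tail_classes(shuffled))  # depends on the shuffle; a mix becomes possible
```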

160
00:10:50,970 --> 00:10:56,220
So that's why it's very important for us to shuffle our files. Run that again.

161
00:10:56,640 --> 00:10:58,050
And there we go.

162
00:10:58,050 --> 00:11:07,080
So we could run this again, and then feel free to check out, let's say, val.

163
00:11:07,650 --> 00:11:11,070
Val and that's it.

164
00:11:12,150 --> 00:11:14,250
Well, let's get back.

165
00:11:14,250 --> 00:11:16,170
This should be val_dataset instead.

166
00:11:16,170 --> 00:11:18,270
So here we have val_dataset.

167
00:11:19,110 --> 00:11:20,910
And we could take that off.

168
00:11:20,910 --> 00:11:23,160
Run that and see what we get.

169
00:11:23,160 --> 00:11:27,030
You see we have our image and we have its corresponding label.

170
00:11:27,150 --> 00:11:30,750
Another way in which we could carry out the splitting is manually.

171
00:11:30,750 --> 00:11:36,900
So instead of relying on this split parameter right here, we could actually manually split up

172
00:11:36,900 --> 00:11:37,620
our data set.

173
00:11:37,620 --> 00:11:38,760
And this is important.

174
00:11:38,760 --> 00:11:44,760
And this is useful in cases where we're not dealing with a data set provided by TensorFlow.

175
00:11:44,760 --> 00:11:47,850
So let's say we do not have the split.

176
00:11:48,000 --> 00:11:49,230
Let's take this off.

177
00:11:49,700 --> 00:11:51,770
Let's suppose we do not have the split.

178
00:11:51,770 --> 00:11:53,270
We still have the shuffling.

179
00:11:53,690 --> 00:11:57,050
Now we're going to define our split method.

180
00:11:57,050 --> 00:12:01,670
So here we have the split method which will take in our data set.

181
00:12:01,670 --> 00:12:05,180
And then it will take in our split ratios.

182
00:12:05,390 --> 00:12:09,080
So we'll define train_ratio.

183
00:12:09,080 --> 00:12:14,420
That's train_ratio, val_ratio.

184
00:12:14,420 --> 00:12:17,390
And then we have test_ratio.

185
00:12:17,690 --> 00:12:24,170
So right here, to start, let's return None.

186
00:12:24,830 --> 00:12:29,000
So what we'll do is we're going to start by defining these ratios.

187
00:12:29,000 --> 00:12:38,120
So let's say we have train_ratio, which we'll set to 0.8.

188
00:12:38,120 --> 00:12:42,380
And then we have val and test ratios.

189
00:12:42,380 --> 00:12:46,790
So test_ratio and then val_ratio.

190
00:12:46,790 --> 00:12:49,040
And here it's 0.1.

191
00:12:49,040 --> 00:12:51,560
And 0.1.

192
00:12:51,560 --> 00:12:52,340
That's it.

193
00:12:52,340 --> 00:12:57,770
And now we're going to create a simple data set to illustrate how our manual

194
00:12:57,770 --> 00:12:58,880
splitting will work.

195
00:12:58,880 --> 00:13:05,660
We have the data set, or let's just call this RDS, our data set.

196
00:13:05,660 --> 00:13:11,540
And then we have tf.data.Dataset.range.

197
00:13:11,540 --> 00:13:12,710
And then we put ten.

198
00:13:12,710 --> 00:13:14,570
So we get these ten values.

199
00:13:14,570 --> 00:13:16,190
So let's just print out.

200
00:13:16,190 --> 00:13:20,150
Well, let's just call our split method.

201
00:13:20,150 --> 00:13:26,090
So here we'll have train RDS, we'll have val RDS.

202
00:13:26,090 --> 00:13:29,540
And then we will have test RDS.

203
00:13:29,540 --> 00:13:33,020
So that's what we expect to be returned right here.

204
00:13:33,020 --> 00:13:35,690
Train data set.

205
00:13:35,690 --> 00:13:42,350
We have val data set, and then we have test data set.

206
00:13:42,980 --> 00:13:43,850
That's it.

207
00:13:43,850 --> 00:13:49,970
And now we'll call our split method, which will take in the data set, or RDS.

208
00:13:49,970 --> 00:13:53,690
And then we'll pass in the ratios.

209
00:13:53,690 --> 00:13:57,560
The train ratio, then we pass in the val ratio.

210
00:13:57,560 --> 00:14:00,350
And then we also pass in the test ratio.

211
00:14:00,500 --> 00:14:01,520
So that's it.

212
00:14:01,520 --> 00:14:03,680
Let's get back test ratio.

213
00:14:04,190 --> 00:14:07,100
Let's now comment this out as we've just done.

214
00:14:07,100 --> 00:14:14,390
And then we print out our data set with as_numpy_iterator.

215
00:14:14,810 --> 00:14:17,600
And then we have list.

216
00:14:19,360 --> 00:14:21,700
And run that and see what we get.

217
00:14:23,560 --> 00:14:24,340
There we go.

218
00:14:24,340 --> 00:14:29,680
As you could see, we have values ranging from 0 to 9, which is what we expect because that's what

219
00:14:29,680 --> 00:14:31,480
we specified right here.

220
00:14:31,630 --> 00:14:40,210
Now, if we want the first 80%, that's the first eight elements, then it suffices

221
00:14:40,210 --> 00:14:43,690
to get right here and say take. Or, no.

222
00:14:43,690 --> 00:14:44,710
Not this.

223
00:14:44,710 --> 00:14:52,030
Let's say we just define these values right here, and we'll say,

224
00:14:52,030 --> 00:15:02,530
on the data set, take the first eight elements, or take the train ratio times the total,

225
00:15:02,680 --> 00:15:09,580
or, let's say, the total number of elements.

226
00:15:09,580 --> 00:15:11,110
We set that to ten.

227
00:15:11,800 --> 00:15:13,360
We set it to ten.

228
00:15:13,390 --> 00:15:14,500
There we go.

229
00:15:14,500 --> 00:15:16,570
And we replace this with total.

230
00:15:17,260 --> 00:15:18,460
That's it.

231
00:15:18,460 --> 00:15:19,510
We have total.

232
00:15:19,510 --> 00:15:23,770
So it's going to be eight. Well, this should be an int.

233
00:15:23,770 --> 00:15:27,220
So we'll just make sure we have an int right there.

234
00:15:27,220 --> 00:15:30,340
Run that again and then we'll see what we get.

235
00:15:30,400 --> 00:15:32,680
See we have our data set.

236
00:15:32,680 --> 00:15:36,640
And now we want to make sure we get the first eight elements.

237
00:15:37,150 --> 00:15:42,700
Well, let's print the initial data set.

238
00:15:42,700 --> 00:15:45,940
Let's now get the val data set.

239
00:15:46,420 --> 00:15:51,280
So this is the complete data set; now we want to have the val data set.

240
00:15:51,760 --> 00:15:54,310
You see that we get the first eight elements.

241
00:15:54,310 --> 00:15:56,290
You see we have these first eight elements.

242
00:15:56,290 --> 00:16:02,770
Now if we want to get the next elements, all you need to do is skip the first eight.

243
00:16:02,770 --> 00:16:06,700
And then once you skip the first eight, you take the next.

244
00:16:06,700 --> 00:16:08,320
So this is 0.1.

245
00:16:08,320 --> 00:16:16,060
So what we'll do is we're going to have our val data set. Well, this is actually train.

246
00:16:16,060 --> 00:16:18,190
So this is actually train data set.

247
00:16:18,190 --> 00:16:19,750
So our train data set.

248
00:16:19,750 --> 00:16:24,670
And now for our val data set.

249
00:16:24,670 --> 00:16:29,470
Let's take this off. For our val data set, we'll specify the val ratio.

250
00:16:29,500 --> 00:16:32,260
So here we have our val ratio.

251
00:16:32,290 --> 00:16:34,090
But also, this is skip.

252
00:16:34,090 --> 00:16:39,970
So we want to skip the first eight elements.

253
00:16:39,970 --> 00:16:42,040
So let's say we have train.

254
00:16:43,060 --> 00:16:43,900
See that.

255
00:16:43,900 --> 00:16:47,290
So these are the first eight elements that we're going to skip.

256
00:16:47,650 --> 00:16:49,720
And we will take all this to the end.

257
00:16:49,720 --> 00:16:51,730
So here is skip.

258
00:16:51,730 --> 00:16:52,720
And then here is take.

259
00:16:52,750 --> 00:16:58,540
Take simply means you have a data set and you take the first n elements.

260
00:16:58,540 --> 00:17:02,800
So if n is set to eight then we'll take the first eight elements.

261
00:17:02,830 --> 00:17:10,870
Now for skip, you're simply going to start from

262
00:17:10,870 --> 00:17:13,180
the (n+1)th element.
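The take and skip behaviour described here can be mimicked in plain Python. This is only an analogy built on itertools.islice, not the tf.data implementation, and the helper names take and skip are chosen to mirror the data set methods.
```python
from itertools import islice
def take(iterable, n):
    """First n elements, like the data set's take(n)."""
    return list(islice(iterable, n))
def skip(iterable, n):
    """Everything after the first n elements, like skip(n)."""
    return list(islice(iterable, n, None))
values = range(10)
print(take(values, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(skip(values, 8))  # [8, 9]
```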

263
00:17:13,180 --> 00:17:17,350
So let's say val data set, and we skip these first eight elements.

264
00:17:17,740 --> 00:17:20,350
Here we have train.

265
00:17:20,350 --> 00:17:22,240
So yeah we should have train.

266
00:17:23,080 --> 00:17:24,160
Train.

267
00:17:24,370 --> 00:17:25,630
Let's copy this.

268
00:17:26,540 --> 00:17:27,650
There we go.

269
00:17:27,650 --> 00:17:29,210
We paste that out.

270
00:17:29,480 --> 00:17:31,700
So now we should have val.

271
00:17:31,700 --> 00:17:33,620
So let's take this off and run that.

272
00:17:33,620 --> 00:17:36,770
So we see how we get the val data set.

273
00:17:36,770 --> 00:17:40,340
Well, this for now is a combination of the validation and the test data sets.

274
00:17:40,340 --> 00:17:42,710
That's because we just skipped the first eight elements.

275
00:17:42,710 --> 00:17:43,070
You see.

276
00:17:43,070 --> 00:17:44,330
Now we have eight and nine.

277
00:17:44,330 --> 00:17:52,220
So those are the last elements. Now we'll just change this slightly and call this val_test.

278
00:17:52,220 --> 00:17:55,700
Because it's the combination of the validation and the test data sets.

279
00:17:55,700 --> 00:17:56,840
So that's it.

280
00:17:56,840 --> 00:17:58,010
We have our test.

281
00:17:58,010 --> 00:18:06,140
And now we'll define our val data set to be simply our val_test data set.

282
00:18:06,140 --> 00:18:09,710
And now we take the top elements.

283
00:18:09,710 --> 00:18:11,840
Or rather we take this first element.

284
00:18:11,840 --> 00:18:14,510
So we have int.

285
00:18:14,990 --> 00:18:16,160
Let's cancel that.

286
00:18:16,160 --> 00:18:24,620
We have our int, and then we have our val ratio times the total.

287
00:18:24,620 --> 00:18:26,360
So this is going to give us one.

288
00:18:26,360 --> 00:18:27,980
So we'll just take the first element.

289
00:18:27,980 --> 00:18:31,010
And then we have the test.

290
00:18:31,860 --> 00:18:32,880
Let's take this off.

291
00:18:32,910 --> 00:18:33,750
Paste it.

292
00:18:33,750 --> 00:18:37,590
We have test, and then we have the val_test data set.

293
00:18:37,590 --> 00:18:39,060
But instead now we skip.

294
00:18:39,330 --> 00:18:40,560
So we skip that.

295
00:18:40,560 --> 00:18:44,370
And then we have 0.1 times ten, which is one.

296
00:18:44,370 --> 00:18:46,170
And we get this last element.

297
00:18:46,170 --> 00:18:54,150
So now let's print out our val and test data sets.

298
00:18:54,150 --> 00:18:55,350
So take this off.

299
00:18:55,350 --> 00:18:58,470
We have val, and then, take this off, we have test.

300
00:18:58,740 --> 00:19:02,010
Run that again and see what we get.

301
00:19:02,910 --> 00:19:05,550
You can see that we now have eight for the val.

302
00:19:05,550 --> 00:19:07,260
And then we have nine for the test.

303
00:19:07,260 --> 00:19:12,270
Now let's say we had changed this to 0.2, 0.2, and 0.6.

304
00:19:12,270 --> 00:19:19,350
Run that again, and you see that we have for the train the first six elements, and then for the

305
00:19:19,350 --> 00:19:22,230
rest, we're going to have the next two and the last two.

306
00:19:22,290 --> 00:19:24,750
See here we have our train data set.

307
00:19:24,750 --> 00:19:26,760
Here we have our val data set.

308
00:19:26,760 --> 00:19:29,130
And then here we have our test data set.

309
00:19:29,130 --> 00:19:31,140
So our method works perfectly.

310
00:19:31,140 --> 00:19:31,860
That's it.

311
00:19:31,860 --> 00:19:37,980
Now let's dive into our split method and implement what we just described right here.

312
00:19:38,460 --> 00:19:40,470
So we'll start by getting this total.

313
00:19:40,470 --> 00:19:43,890
To get total you just need to have the data set size.

314
00:19:43,890 --> 00:19:51,360
So let's say dataset_size, which is simply the length of our data set.

315
00:19:51,540 --> 00:19:57,060
And once you have the length of the data set, you now do train_dataset.

316
00:19:57,090 --> 00:19:59,340
That's exactly what we had seen already here.

317
00:19:59,340 --> 00:20:02,760
And we specify the train ratio times total.

318
00:20:02,760 --> 00:20:09,210
So let's just copy all this, or let's cut that out and then paste it here. Shift this.

319
00:20:09,990 --> 00:20:10,920
There we go.

320
00:20:10,920 --> 00:20:15,660
So we have our train data set val data set and test data set.

321
00:20:15,660 --> 00:20:20,550
Now this is dataset_size instead.

322
00:20:20,550 --> 00:20:25,650
And here we have dataset_size.

323
00:20:26,520 --> 00:20:27,870
There we go.

324
00:20:28,320 --> 00:20:32,640
We can now run all this.

325
00:20:32,640 --> 00:20:34,800
Then let's get back here.

326
00:20:35,070 --> 00:20:36,600
Let's take all this off.

327
00:20:37,290 --> 00:20:41,580
And now we have our train data set Val data set and test data set.
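Putting the steps together, a sketch of the split method might look like this. The names follow the narration where they are clear and are guesses otherwise; ListDataset is a made-up stand-in so the sketch runs without TensorFlow. A tf.data data set with a known length exposes the same take, skip and len operations.
```python
def split(dataset, train_ratio, val_ratio, test_ratio):
    """Split into train/val/test parts using take and skip."""
    dataset_size = len(dataset)
    train_dataset = dataset.take(int(train_ratio * dataset_size))
    # Everything after the train part: validation and test combined.
    val_test_dataset = dataset.skip(int(train_ratio * dataset_size))
    val_dataset = val_test_dataset.take(int(val_ratio * dataset_size))
    # The test part is whatever remains, so test_ratio is not used directly.
    test_dataset = val_test_dataset.skip(int(val_ratio * dataset_size))
    return train_dataset, val_dataset, test_dataset
class ListDataset:
    """Tiny list-backed stand-in exposing take/skip/len (illustration only)."""
    def __init__(self, items):
        self.items = list(items)
    def __len__(self):
        return len(self.items)
    def take(self, n):
        return ListDataset(self.items[:n])
    def skip(self, n):
        return ListDataset(self.items[n:])
train, val, test = split(ListDataset(range(10)), 0.8, 0.1, 0.1)
print(train.items, val.items, test.items)  # [0, 1, 2, 3, 4, 5, 6, 7] [8] [9]
```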

328
00:20:41,580 --> 00:20:46,500
So let's get back to 0.1 and then 0.8.

329
00:20:46,950 --> 00:20:49,200
We don't need this total anymore.

330
00:20:49,770 --> 00:20:51,930
We don't need this either.

331
00:20:51,930 --> 00:20:53,430
Our data set is this data set.

332
00:20:53,430 --> 00:20:55,350
So let's take this off.

333
00:20:55,350 --> 00:21:00,900
So we have dataset. We run that and see what we get. While that's running:

334
00:21:00,900 --> 00:21:07,110
We could simply update this here to get the length of train data set.

335
00:21:07,440 --> 00:21:10,380
We'll get the length of val data set.

336
00:21:11,550 --> 00:21:18,030
And then we get the length of the test data set. Run that too.

337
00:21:18,030 --> 00:21:27,270
And make sure that we have the exact same lengths for both the first method and our manual splitting method.

338
00:21:27,270 --> 00:21:29,520
So we're getting this error.

339
00:21:29,730 --> 00:21:32,160
This has no attribute take.

340
00:21:32,160 --> 00:21:35,730
And that's because here we need to specify that this is train.

341
00:21:35,730 --> 00:21:37,500
So we run that again.

342
00:21:38,370 --> 00:21:39,720
That should be fine.

343
00:21:39,720 --> 00:21:40,890
Now see.

344
00:21:40,890 --> 00:21:43,110
And then we also run this too.

345
00:21:43,110 --> 00:21:44,880
So we now compare the lengths.

346
00:21:44,880 --> 00:21:47,610
And you could see that they are almost the same.

347
00:21:47,610 --> 00:21:57,150
The only difference we have comes from the fact that this int rounds down, whereas with this method,

348
00:21:57,180 --> 00:22:04,470
that is, making use of the load method's split, we instead have a round up.

349
00:22:04,470 --> 00:22:09,240
So that's why we go from 2,755.8 to 2,756.

350
00:22:09,360 --> 00:22:15,960
Whereas with the manual split, we go from 2,755.8 to 2,755.

351
00:22:15,960 --> 00:22:19,350
So that's why we have this little difference.
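The difference can be reproduced with plain arithmetic. In this sketch manual_sizes is a made-up helper that mirrors the int-based manual split, and round() stands in for the load method's behaviour, which is an assumption here.
```python
TOTAL = 27_558
def manual_sizes(total, train_ratio=0.8, val_ratio=0.1):
    """Sizes from the manual split: int() truncates, test takes the remainder."""
    train = int(train_ratio * total)
    val = int(val_ratio * total)
    return train, val, total - train - val
print(manual_sizes(TOTAL))  # (22046, 2755, 2757)
exact_val = 0.1 * TOTAL  # about 2755.8
print(int(exact_val), round(exact_val))  # 2755 2756
```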

352
00:22:19,860 --> 00:22:21,780
That said, that's it for this section.

353
00:22:21,780 --> 00:22:27,090
We're going to move on to the next section where we'll dive into visualizing our data.