1
00:00:00,180 --> 00:00:06,060
Hello, everyone, and welcome to this new session in which we'll treat dataset versioning with Weights

2
00:00:06,060 --> 00:00:07,050
and Biases.

3
00:00:07,530 --> 00:00:14,430
If you've already used Git for code versioning, then you should note that this is kind of similar to

4
00:00:14,430 --> 00:00:19,520
what Git does, as here we have this original data.

5
00:00:19,530 --> 00:00:20,460
Let's take this off.

6
00:00:20,460 --> 00:00:29,580
We have this original data, which is then preprocessed into this data here, and then this other preprocessed

7
00:00:29,580 --> 00:00:35,730
data, or this preprocessed version, is again preprocessed to give us this one.

8
00:00:35,730 --> 00:00:43,590
And then finally we have this last step or this last preprocessing step which gives us this data set

9
00:00:43,590 --> 00:00:47,310
version which happens to be an augmented dataset version.

10
00:00:47,310 --> 00:00:55,410
So this shows us that if at any point in time we are satisfied with this preprocessing, which was done

11
00:00:55,410 --> 00:01:03,390
right here to produce this dataset version, we could simply preprocess from, or make use of, this

12
00:01:03,390 --> 00:01:05,100
data set version right here.

13
00:01:05,100 --> 00:01:11,970
And so this greatly simplifies our dataset management when working on our different machine learning

14
00:01:11,970 --> 00:01:12,480
projects.

15
00:01:12,480 --> 00:01:20,890
Just as Git permits us to do code versioning, Weights and Biases gives us the possibility of doing dataset

16
00:01:20,940 --> 00:01:22,740
versioning and model versioning.

17
00:01:22,740 --> 00:01:29,490
And this can be done using Weights and Biases Artifacts, which help us save and organize machine learning

18
00:01:29,490 --> 00:01:32,010
data sets throughout a project's lifecycle.

19
00:01:32,010 --> 00:01:37,470
We are going to start with dataset versioning. The most common ways in which Weights and Biases

20
00:01:37,470 --> 00:01:38,910
Artifacts have been used for data

21
00:01:38,910 --> 00:01:45,900
versioning are to version data seamlessly, prepackage data splits like training, validation,

22
00:01:45,900 --> 00:01:53,430
and test sets, iteratively refine datasets, juggle multiple datasets, and finally visualize and

23
00:01:53,430 --> 00:02:00,330
share data workflows. Before getting into how artifacts could be used in dataset versioning,

24
00:02:00,330 --> 00:02:02,970
let's look at this simple example.

25
00:02:02,970 --> 00:02:07,920
In order to obtain the malaria data set, we have to start by loading this data.

26
00:02:07,920 --> 00:02:11,490
We have this data loader which is represented by the square.

27
00:02:11,490 --> 00:02:16,230
And now, once we've loaded this data, we have our dataset.

28
00:02:16,230 --> 00:02:20,520
So let's say we have this original dataset right here.

29
00:02:20,520 --> 00:02:24,990
We now go ahead to split this data.

30
00:02:25,320 --> 00:02:30,480
So from here we can split up this data and then have the three different parts.

31
00:02:30,480 --> 00:02:35,640
We have the training data, the validation data and the test data.

32
00:02:35,670 --> 00:02:42,480
At this point, all three datasets are passed into these preprocessing units right here.

33
00:02:42,480 --> 00:02:49,020
So we preprocess the train, preprocess the validation, and preprocess the test, and this gives us

34
00:02:49,020 --> 00:02:49,950
an output here.

35
00:02:49,950 --> 00:02:54,510
So for the training, we have preprocessed training data here.

36
00:02:54,510 --> 00:03:00,960
We have this preprocessed training data, we have this preprocessed validation data, and then here

37
00:03:00,960 --> 00:03:04,860
we also have this preprocessed test data.

38
00:03:04,890 --> 00:03:11,790
Again, at this point we are now going to carry out augmentation on the preprocessed training data.

39
00:03:11,790 --> 00:03:18,660
This PTD means preprocessed training data, and then the preprocessed validation data,

40
00:03:18,660 --> 00:03:20,940
preprocessed testing data.

41
00:03:21,540 --> 00:03:27,960
So we have that and then we pass this through this augmentation process.

42
00:03:27,960 --> 00:03:35,160
So we have this process, whose role will be that of carrying out data augmentation on this dataset to

43
00:03:35,160 --> 00:03:42,840
produce another data set, which is actually now an augmented version of this.

44
00:03:42,840 --> 00:03:45,840
So here we have preprocessed training data.

45
00:03:45,870 --> 00:03:54,030
Now we have augmented training data, and this is kind of like the lifecycle of our dataset in this particular

46
00:03:54,030 --> 00:04:02,900
problem we're working on. Imagine that this original dataset contains mislabeled examples.

47
00:04:02,910 --> 00:04:10,980
In that case, what you want to do now is to carry out this labeling correctly, such that we have this

48
00:04:10,980 --> 00:04:14,400
data set here which now has been cleaned.

49
00:04:14,400 --> 00:04:22,170
Overall, this leads to a higher level of accountability, as when working in a team, let's say you've

50
00:04:22,170 --> 00:04:25,140
done some modification, let's say you have cleaned the data.

51
00:04:25,170 --> 00:04:33,180
All the people in your team can now view this cleaned dataset and decide whether to modify it, delete it,

52
00:04:33,180 --> 00:04:34,920
or keep using it.

53
00:04:34,920 --> 00:04:42,150
So if we suppose that the team accepts this clean data set and everyone is happy with this newly cleaned

54
00:04:42,150 --> 00:04:47,730
dataset, we now see that instead of passing this directly into the split, we will now pass this

55
00:04:47,730 --> 00:04:49,080
one into the split.

56
00:04:49,410 --> 00:04:51,380
So we'll have something like this.

57
00:04:51,400 --> 00:04:53,400
We'll go this way. Let's take this off.

58
00:04:54,000 --> 00:04:57,090
In that case, we will have to go this way.

59
00:04:57,330 --> 00:04:58,800
We go this way.

60
00:04:59,960 --> 00:05:06,860
And then into the split. And then, talking about dataset versioning, for each and every dataset we've

61
00:05:06,860 --> 00:05:07,600
created here.

62
00:05:07,740 --> 00:05:14,930
Like for this one, this one, this, this, this, this, this or this one, we have different versions.

63
00:05:14,930 --> 00:05:23,060
So we could have, for example, this PVD here, that is, the preprocessed validation data, and it could

64
00:05:23,060 --> 00:05:26,030
have a version zero, or v0.

65
00:05:26,030 --> 00:05:27,260
So we have this version.

66
00:05:27,260 --> 00:05:28,280
All right.

67
00:05:28,280 --> 00:05:28,790
Yeah.

68
00:05:28,790 --> 00:05:36,740
And then later on you may modify this preprocessing and it leads us to have another version of this

69
00:05:36,740 --> 00:05:38,060
validation data.

70
00:05:38,060 --> 00:05:44,090
So you could have another version, version one, another version, version two, and another version you could

71
00:05:44,090 --> 00:05:48,920
call version best; you can have version latest.

72
00:05:50,110 --> 00:05:51,340
And so on and so forth.

73
00:05:52,180 --> 00:05:59,560
And a good thing is, when using Weights and Biases to store this data, or when we're

74
00:05:59,560 --> 00:06:08,440
using Weights and Biases Artifacts, the data is stored in such a way that if we have data in this original

75
00:06:08,440 --> 00:06:16,300
dataset which is exactly the same as what we have in this preprocessed validation data, then that

76
00:06:16,300 --> 00:06:17,800
data wouldn't be duplicated.

77
00:06:18,040 --> 00:06:25,120
Now, it's true that if we preprocess data, obviously we'll have the whole dataset, all the elements

78
00:06:25,120 --> 00:06:26,290
of the dataset changing.

79
00:06:26,320 --> 00:06:29,320
Now let's take this example for the clean data set.

80
00:06:29,410 --> 00:06:36,250
So here we have this original dataset. Let's suppose we have, say, ten

81
00:06:36,250 --> 00:06:41,230
different examples, different samples in this original data set.

82
00:06:41,350 --> 00:06:46,930
And then after cleaning our dataset, only two of these samples have been modified.

83
00:06:46,930 --> 00:06:51,240
So we have eight unmodified and two modified.

84
00:06:51,250 --> 00:06:52,990
So we have modified.

85
00:06:52,990 --> 00:06:56,770
We have changed the labels of these two samples here.

86
00:06:56,800 --> 00:07:04,750
What Weights and Biases will do is ensure that these eight here aren't duplicated; that is, it wouldn't

87
00:07:04,750 --> 00:07:10,960
create extra space for those eight others which haven't changed from this previous original data set

88
00:07:10,960 --> 00:07:11,460
here.

89
00:07:11,470 --> 00:07:19,180
And so in that case, we're going to store only these two new samples here, while Weights and Biases keeps

90
00:07:19,180 --> 00:07:25,840
track of the fact that these other eight samples haven't been modified and so don't necessarily need

91
00:07:25,840 --> 00:07:33,560
to occupy extra space in the storage unit, which Weights and Biases makes available for us for free.
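A minimal sketch of the deduplication idea described above, assuming content-addressed storage keyed by a SHA-256 digest. It only illustrates the principle and is not Weights and Biases' actual implementation; the names `BlobStore` and `log_version` are hypothetical.
```python
import hashlib
class BlobStore:
    """Toy content-addressed store: identical bytes are kept only once."""
    def __init__(self):
        self.blobs = {}     # digest -> file content
        self.versions = []  # one manifest (file name -> digest) per version
    def log_version(self, files):
        """Record a new version and return how many new blobs were stored."""
        manifest, added = {}, 0
        for name, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = content
                added += 1
            manifest[name] = digest
        self.versions.append(manifest)
        return added
# Ten samples in the original dataset (version 0).
original = {f"sample_{i}.txt": f"image-{i},label-0".encode() for i in range(10)}
# Cleaning relabels two samples; the other eight are byte-for-byte unchanged.
cleaned = dict(original)
cleaned["sample_3.txt"] = b"image-3,label-1"
cleaned["sample_7.txt"] = b"image-7,label-1"
store = BlobStore()
new_in_v0 = store.log_version(original)  # all ten samples are new
new_in_v1 = store.log_version(cleaned)   # only the two relabeled samples are new
```
Both versions keep a full manifest of ten file names, but the second version adds only two blobs to storage.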

92
00:07:33,580 --> 00:07:41,350
That said, we could look at these different processes right here, the processes in the square boxes

93
00:07:41,350 --> 00:07:51,580
as wandb runs, while these different forms, and even the versions, which our data takes are the artifacts.

94
00:07:51,670 --> 00:07:57,180
And so we could consider this here, where this represents the artifacts and this the runs.

95
00:07:57,190 --> 00:08:03,280
Also, we see that the artifacts are connected together by these different runs.

96
00:08:03,280 --> 00:08:11,110
So these two artifacts, this training data and the original dataset, have been connected together by

97
00:08:11,110 --> 00:08:19,890
this split run, and then these two, PTD and train data, are connected together by the preprocessing

98
00:08:19,930 --> 00:08:20,410
run.

99
00:08:20,410 --> 00:08:26,920
Getting back to the documentation, we'll see how we create this artifact here, which is called

100
00:08:26,920 --> 00:08:30,040
new_dataset, of type raw_data.

101
00:08:30,040 --> 00:08:34,390
And then this is created within this run right here.

102
00:08:34,390 --> 00:08:38,020
So we see how we create this run, and we specify the project, my_project.

103
00:08:38,020 --> 00:08:42,340
And once we create this artifact, we're going to add data into it.

104
00:08:42,370 --> 00:08:48,400
Now, one thing you could do with wandb is simply add a whole directory.

105
00:08:48,400 --> 00:08:53,560
So we're supposing that all your data is in a given directory, and so all you need to do is specify

106
00:08:53,560 --> 00:08:58,210
this path, and then you make this data part of this wandb

107
00:08:58,210 --> 00:09:01,120
artifact, which we've called my_data.

108
00:09:01,120 --> 00:09:05,260
And then finally you log this artifact to wandb.

109
00:09:05,380 --> 00:09:08,920
Let's copy this sample code and paste it here.

110
00:09:08,920 --> 00:09:14,170
We have that sample code and then we could get started with our dataset versioning.

111
00:09:14,170 --> 00:09:20,280
Now we are going to put this in a with statement, so we have here with wandb.init, and then

112
00:09:20,290 --> 00:09:25,930
we specify the project we're working on, which is this malaria detection project, the entity,

113
00:09:26,290 --> 00:09:28,090
learn and that's it.

114
00:09:28,090 --> 00:09:32,950
So we have this here, and then we are going to create our original data.

115
00:09:32,950 --> 00:09:34,960
So we have original data.

116
00:09:34,990 --> 00:09:35,740
There we go.

117
00:09:35,740 --> 00:09:37,270
We are going to create this artifact.

118
00:09:37,270 --> 00:09:42,700
Actually, we have that: original_data, wandb.Artifact, new_dataset, type raw_data.

119
00:09:42,730 --> 00:09:48,400
Now, to check out the different arguments we could pass in here, let's get back to the documentation.

120
00:09:48,760 --> 00:09:52,030
Let's check out here, and then you could scroll down.

121
00:09:52,030 --> 00:09:54,460
So here, you have these references.

122
00:09:54,850 --> 00:09:57,130
Let's reduce this and have this.

123
00:09:57,130 --> 00:09:59,680
Clearly, we have the references, and under the references,

124
00:09:59,680 --> 00:10:03,190
You have this Python library, and then you have wandb.Artifact.

125
00:10:03,190 --> 00:10:05,620
So you could click on this and what do you get?

126
00:10:05,620 --> 00:10:07,870
You have this documentation right here.

127
00:10:07,870 --> 00:10:14,020
So we have the different arguments: name, type, description, metadata, incremental, and use_as.

128
00:10:14,050 --> 00:10:16,930
Now note that these four are optional.

129
00:10:16,930 --> 00:10:21,160
So that's why in the example we just had a name and a type.

130
00:10:21,160 --> 00:10:29,440
We'll now add these here: we have name and then type, and we have description, and we'll have metadata.

131
00:10:29,440 --> 00:10:31,060
So let's have this description.

132
00:10:31,060 --> 00:10:40,660
Right here, for the description, we could simply say malaria dataset, or TensorFlow malaria dataset.

133
00:10:40,660 --> 00:10:43,270
So that's it for the description, which is actually a string.

134
00:10:43,270 --> 00:10:49,570
And then we have this metadata, which is this dictionary which contains information related

135
00:10:49,690 --> 00:10:50,490
to our dataset.

136
00:10:50,560 --> 00:10:52,510
So here we define this dictionary.

137
00:10:52,510 --> 00:10:54,940
We are going to store, for example, the source.

138
00:10:54,940 --> 00:11:01,660
So we could have source here, and then we say TF dataset.

139
00:11:01,710 --> 00:11:03,930
Okay, so we have that source.

140
00:11:03,940 --> 00:11:06,880
We could also add other information from this home page.

141
00:11:06,880 --> 00:11:12,130
So let's even copy out this description here and then paste it out here.

142
00:11:12,130 --> 00:11:16,900
So in the place of this description, we have that and there we go.

143
00:11:16,900 --> 00:11:17,680
That's fine.

144
00:11:17,680 --> 00:11:20,650
Now we'll check out on the home page.

145
00:11:20,650 --> 00:11:22,330
We have home page source code.

146
00:11:22,330 --> 00:11:25,210
Let's copy this and paste it here.

147
00:11:25,210 --> 00:11:29,050
And now we have all this necessary metadata information.

148
00:11:29,050 --> 00:11:29,860
So here it is.

149
00:11:29,860 --> 00:11:31,360
We've created this artifact.

150
00:11:31,360 --> 00:11:38,110
We then paste this code, which we had seen previously, and which permits us to load the malaria dataset

151
00:11:38,110 --> 00:11:45,310
from TensorFlow Datasets, and then we'll save this dataset in the NumPy compressed format.

152
00:11:45,310 --> 00:11:55,300
So with our original data, which is this one here, this artifact, dot new_file, so unlike

153
00:11:55,300 --> 00:12:02,080
here, where we add a directory, here we create this new file, which now contains this data

154
00:12:02,080 --> 00:12:02,620
set.

155
00:12:02,620 --> 00:12:12,250
So with original_data.new_file, we'll call it original_data.npz.

156
00:12:12,250 --> 00:12:14,200
So that's our file name.

157
00:12:14,200 --> 00:12:18,040
And then the mode is going to be a write mode.

158
00:12:18,040 --> 00:12:19,870
And we have this as file.

159
00:12:19,870 --> 00:12:27,820
So we have this artifact, we add this new file with a file name, and then we save this file while

160
00:12:27,820 --> 00:12:29,620
putting in the appropriate content.

161
00:12:29,620 --> 00:12:36,610
So right here we have np.savez; that's our compressed format. Here we have .npz.

162
00:12:36,610 --> 00:12:44,500
And then yeah, we have this file and then what we pass in is our data set, which is this one.

163
00:12:44,500 --> 00:12:51,730
And so at this point we've written our information or this data set in this artifact, and then we're

164
00:12:51,730 --> 00:12:54,100
now ready to log this.

165
00:12:54,100 --> 00:12:59,320
So here we have run.log_artifact; it's basically this here.

166
00:12:59,320 --> 00:13:01,300
So let's, let's do this.

167
00:13:01,300 --> 00:13:02,470
Let's take this off.

168
00:13:02,470 --> 00:13:04,390
We have there we go.

169
00:13:04,390 --> 00:13:06,070
We have run.log_artifact.

170
00:13:06,070 --> 00:13:10,390
And then what we're passing in here is original data.

171
00:13:10,420 --> 00:13:11,860
Okay, so that's it.

172
00:13:11,860 --> 00:13:15,490
So now we've seen how to create this run.

173
00:13:15,490 --> 00:13:21,730
And in this run we have this artifact, and then we put the information in the artifact.

174
00:13:21,760 --> 00:13:30,430
Now, one thing we could do is put all this in a method, so we'll define the method load_original_data.

175
00:13:30,430 --> 00:13:33,990
So we load the original data and simply that's it.

176
00:13:34,000 --> 00:13:36,850
So let's indent this one step, and that's fine.

177
00:13:36,850 --> 00:13:37,600
There we go.

178
00:13:37,600 --> 00:13:40,750
So we have that load_original_data method defined.

179
00:13:40,750 --> 00:13:44,080
Let's add this code cell, and then we can call it right here.

180
00:13:44,080 --> 00:13:48,700
So we could call load_original_data, and there we go.

181
00:13:48,700 --> 00:13:51,370
So let's get back to this diagram we had previously.

182
00:13:51,370 --> 00:13:52,990
You'll see that in this diagram.

183
00:13:52,990 --> 00:13:59,800
If we take off, if we don't consider this path, that is, information flows this way and then gets split

184
00:13:59,800 --> 00:14:01,080
and so on and so forth.

185
00:14:01,090 --> 00:14:07,690
So here what we have is this load_original_data method, which takes into consideration these

186
00:14:07,690 --> 00:14:10,930
two here. We have this run, which is the data loader.

187
00:14:10,930 --> 00:14:12,700
So we have a loader run.

188
00:14:12,700 --> 00:14:20,800
And then what it does is retrieve this original dataset from TensorFlow Datasets and then produce

189
00:14:20,800 --> 00:14:22,960
an artifact, which is this one now.

190
00:14:22,960 --> 00:14:30,550
So the artifact we have just created here, this artifact original_data, is this one, which we had done

191
00:14:30,550 --> 00:14:31,330
previously here.

192
00:14:31,330 --> 00:14:35,440
We then run the cell and load the original data.

193
00:14:35,440 --> 00:14:38,680
We get this output, and unfortunately, this process has failed.

194
00:14:38,680 --> 00:14:45,940
Let's check out the reason why this failed: cannot convert a tensor of dtype variant to a NumPy array.

195
00:14:46,990 --> 00:14:54,790
Now, since this variable dataset we have here is of the type variant, what we're going to do is we're

196
00:14:54,790 --> 00:14:59,290
going to take out each and every element of this dataset.

197
00:15:00,470 --> 00:15:02,840
And save it in a directory.

198
00:15:03,020 --> 00:15:06,470
So from here, we are going to copy out this code.

199
00:15:06,500 --> 00:15:09,350
Let's copy all this code and paste it back here.

200
00:15:09,350 --> 00:15:13,520
And now what we have in here is we are going to go through this data set.

201
00:15:13,530 --> 00:15:20,900
So we have for data in dataset.

202
00:15:20,930 --> 00:15:24,920
Now we have a list here, and then we could pick out the zeroth element.

203
00:15:24,920 --> 00:15:26,510
So pick out the zeroth element.

204
00:15:26,510 --> 00:15:30,340
And then for that we are going to create this folder data set right here.

205
00:15:30,350 --> 00:15:34,940
So let's have this new folder and call it data set.

206
00:15:35,270 --> 00:15:40,640
So we have that new folder, dataset. In that folder, we are going to put in these different

207
00:15:40,670 --> 00:15:45,920
tensors, which are going to be stored in this NumPy compressed format.

208
00:15:46,070 --> 00:15:46,880
So that's it.

209
00:15:46,880 --> 00:15:49,070
We have this; we'll modify the name.

210
00:15:49,070 --> 00:15:56,690
So let's say we have malaria dataset, and then we give it a number.

211
00:15:56,960 --> 00:16:02,270
So let's have that plus we'll give it a number plus this.

212
00:16:02,270 --> 00:16:08,000
So here, now we have this k; we've initialized k, and then we're going to continually be incrementing

213
00:16:08,010 --> 00:16:08,760
this k.

214
00:16:08,780 --> 00:16:15,710
So here we now have malaria dataset, we have k, and then we have this extension right here.

215
00:16:15,710 --> 00:16:18,380
So that's how we're going to save this file.

216
00:16:18,380 --> 00:16:20,720
And then the data we're going to be saving is going to be this.

217
00:16:20,720 --> 00:16:25,400
So we have data which we're going to be saving in each and every one of these files.

218
00:16:25,430 --> 00:16:30,590
Now, let's test this out and let's put out a break here so we see exactly what we're getting.

219
00:16:30,590 --> 00:16:36,230
We run the cell: original_data not defined. What we need to do is open here.

220
00:16:36,230 --> 00:16:40,150
So we're going to open this file and then save that.

221
00:16:40,160 --> 00:16:41,570
Let's run this again.

222
00:16:41,570 --> 00:16:43,490
We check out now this directory.

223
00:16:43,490 --> 00:16:45,910
We have here dataset, and there we go.

224
00:16:45,920 --> 00:16:48,970
You see, we have this information here saved.

225
00:16:48,980 --> 00:16:52,010
Now let's run this for the whole dataset.

226
00:16:52,010 --> 00:16:55,490
So let's take out the break and then run this.

227
00:16:55,490 --> 00:17:01,760
Let's print out k, like after every thousand steps.

228
00:17:01,760 --> 00:17:09,710
So if k modulo 1000 equals zero, we should print out k.
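A minimal sketch of the loop just described, assuming each dataset element is an (image, label) pair that NumPy can convert to arrays. The helper names and the malaria_dataset_<k>.npz naming are illustrative, and using `np.savez_compressed` is an assumption, since the video only mentions a NumPy compressed format.
```python
import os
def sample_path(directory, k):
    """File path for the k-th sample, e.g. dataset/malaria_dataset_0.npz."""
    return os.path.join(directory, f"malaria_dataset_{k}.npz")
def save_samples(dataset, directory="dataset", report_every=1000):
    """Save every (image, label) pair of an iterable to its own .npz file."""
    import numpy as np  # third-party; imported lazily so the sketch loads without it
    os.makedirs(directory, exist_ok=True)
    k = 0
    for image, label in dataset:
        with open(sample_path(directory, k), mode="wb") as file:
            # Positional arrays are stored under the keys arr_0 and arr_1.
            np.savez_compressed(file, np.asarray(image), np.asarray(label))
        k += 1
        if k % report_every == 0:
            print(k)  # progress report every report_every samples
    return k
```
Calling `save_samples` on any iterable of pairs returns the number of files written, so it can be checked against the expected dataset size.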

229
00:17:09,740 --> 00:17:11,210
Okay, so let's run again.

230
00:17:11,210 --> 00:17:13,210
We now have all this data logged.

231
00:17:13,220 --> 00:17:21,800
Let's take off this one now and then get back to the documentation, and you have my_data.add_dir.

232
00:17:22,310 --> 00:17:24,200
So you're passing the directory right there.

233
00:17:24,200 --> 00:17:32,180
So from here we now have instead original_data.add_dir, and then we're going to pass in the

234
00:17:32,180 --> 00:17:34,240
directory, which in this case is dataset.

235
00:17:34,250 --> 00:17:35,960
So we have that dataset passed.

236
00:17:35,960 --> 00:17:39,440
We don't really need to have this again, so we could take that off.

237
00:17:39,530 --> 00:17:40,640
Okay, so that's it.

238
00:17:40,940 --> 00:17:42,050
Everything looks fine.

239
00:17:42,050 --> 00:17:43,370
Let's now run this.

240
00:17:43,490 --> 00:17:46,580
All right here we run that and run this.

241
00:17:46,580 --> 00:17:49,940
And this time around, the artifact is logged successfully.
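The steps above can be sketched roughly as follows, assuming the per-sample files were saved in a local dataset folder. The project name and the metadata values are placeholders rather than the exact strings used in the video; wandb is imported inside the function so the helper can be defined without the package installed, and calling it requires a Weights and Biases login.
```python
def build_metadata():
    """Metadata dictionary attached to the artifact (illustrative values)."""
    return {"source": "TensorFlow Datasets", "dataset": "malaria"}
def load_original_data(data_dir="dataset", project="malaria-detection"):
    """Log every file under data_dir as one raw-data artifact."""
    import wandb  # third-party; lazy import keeps this sketch importable
    with wandb.init(project=project) as run:
        original_data = wandb.Artifact(
            name="new_dataset",
            type="raw_data",
            description="TensorFlow malaria dataset",
            metadata=build_metadata(),
        )
        original_data.add_dir(data_dir)   # add the whole directory to the artifact
        run.log_artifact(original_data)   # upload it; wandb assigns the version
```
Logging the same artifact name again with changed contents produces the next version (v0, v1, and so on) instead of overwriting the previous one.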

242
00:17:49,940 --> 00:17:57,230
So now we are going to check out our dashboard, and here you'll see that we have this raw_data,

243
00:17:57,260 --> 00:18:02,060
new_dataset, and then you have these versions, two different versions of this new_dataset.

244
00:18:02,060 --> 00:18:04,880
We have version zero and version one.

245
00:18:04,880 --> 00:18:07,460
Now you click on this version one, you should be able to have this.

246
00:18:07,460 --> 00:18:08,930
Let's check out the overview.

247
00:18:08,930 --> 00:18:13,610
You see, you have this dataset, malaria dataset, and all this.

248
00:18:13,610 --> 00:18:15,740
In fact, this is the metadata we put in already.

249
00:18:15,740 --> 00:18:16,850
So that's it.

250
00:18:16,850 --> 00:18:18,410
We have this API.

251
00:18:18,410 --> 00:18:22,880
We're going to come back to this actually, because to use this artifact, we're making use of this

252
00:18:22,880 --> 00:18:23,450
API.

253
00:18:23,480 --> 00:18:26,240
Then we check out the metadata.

254
00:18:26,240 --> 00:18:30,980
We have this metadata which we logged, or which we put in in the notebook.

255
00:18:30,980 --> 00:18:32,150
We have the files.

256
00:18:32,150 --> 00:18:39,650
You see those different files here, this directory you see you have the root and then you have basically

257
00:18:39,650 --> 00:18:40,550
all those files.

258
00:18:40,550 --> 00:18:43,970
Now you have this graph view, which for now is very simple.

259
00:18:43,970 --> 00:18:49,520
So here you have a run; the run is actually this run here.

260
00:18:49,520 --> 00:18:50,960
So this run.

261
00:18:50,960 --> 00:18:54,050
And then we have this artifact which has been created.

262
00:18:54,050 --> 00:18:56,930
Now this artifact contains the raw data.

263
00:18:57,110 --> 00:19:00,800
And so we could click on Explode and basically that's it.

264
00:19:00,800 --> 00:19:00,980
Yeah.

265
00:19:00,980 --> 00:19:07,280
You have the name of this run, vivid water, the name of the run, and then the data is this new_dataset, and

266
00:19:07,280 --> 00:19:08,570
it's version one.

267
00:19:08,570 --> 00:19:11,300
And then the next thing you want to do is to move to the next step.

268
00:19:11,300 --> 00:19:16,290
That is, to be able to split this dataset into these three different parts.

269
00:19:16,310 --> 00:19:22,280
Now, what we could do other than this is preprocess this before doing the splitting, so we don't have to

270
00:19:22,280 --> 00:19:23,720
do this preprocessing thrice.

271
00:19:23,720 --> 00:19:29,510
So instead of having this, let's take this off here. We will modify this such that here is what we

272
00:19:29,510 --> 00:19:30,380
now obtain.

273
00:19:30,380 --> 00:19:37,970
So coming back to this, we have this run and this artifact, which are already set; it's basically what

274
00:19:37,970 --> 00:19:38,930
we have here.

275
00:19:38,930 --> 00:19:42,050
And then the next step will be to preprocess this.

276
00:19:42,050 --> 00:19:48,530
So we'll do the preprocessing on our dataset and then produce this preprocessed data.

277
00:19:48,530 --> 00:19:50,930
And then from here we'll now do the splitting.

278
00:19:50,930 --> 00:19:56,420
We have this run, whose role will be to split our data into train, validation, and testing, and then for

279
00:19:56,420 --> 00:19:59,990
the train, we'll have another run, whose role will be to do

280
00:20:00,080 --> 00:20:02,400
augmentation on our training data.

281
00:20:02,420 --> 00:20:07,400
So, in fact, in summary, this is what we want to achieve.

282
00:20:07,400 --> 00:20:11,930
Like when we're done with all this, we want to have a graph view which looks like this.

283
00:20:11,930 --> 00:20:15,140
We now go straight forward into the preprocessing.

284
00:20:15,350 --> 00:20:15,920
This is copied.

285
00:20:15,920 --> 00:20:17,630
We get back to the code.

286
00:20:17,630 --> 00:20:18,200
That's fine.

287
00:20:18,200 --> 00:20:19,580
We paste this out here.

288
00:20:19,760 --> 00:20:22,820
We then put this in a with statement and then run this code.

289
00:20:22,820 --> 00:20:23,840
And here's what we get.

290
00:20:23,840 --> 00:20:26,900
We've now downloaded these 27,000 files.

291
00:20:26,900 --> 00:20:28,190
We check out here.

292
00:20:28,340 --> 00:20:31,880
And we should have this artifacts directory, new_dataset, version one.

293
00:20:31,880 --> 00:20:35,390
And see, we have all those files which we had logged previously.

294
00:20:35,390 --> 00:20:42,530
So now this means that the next time you want to work on this dataset, you don't really need to come

295
00:20:42,530 --> 00:20:44,270
and run this here.

296
00:20:44,270 --> 00:20:48,290
Like we don't need to run this again so we don't have to do this again.

297
00:20:48,290 --> 00:20:55,640
All we need to do now is just to use this artifact, which Weights and Biases stores for us.
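A minimal sketch of reusing the logged artifact instead of rebuilding the dataset, assuming the artifact is named new_dataset and version v1 is wanted. The project name is a placeholder, and wandb is imported lazily so the helper can be defined without the package installed.
```python
def artifact_reference(name, version):
    """Build the 'name:version' string that identifies an artifact version."""
    return f"{name}:{version}"
def download_dataset(project="malaria-detection", name="new_dataset", version="v1"):
    """Fetch a logged artifact version and return the local directory path."""
    import wandb  # third-party; lazy import keeps this sketch importable
    with wandb.init(project=project) as run:
        # use_artifact also records this run as a consumer in the graph view.
        artifact = run.use_artifact(artifact_reference(name, version))
        return artifact.download()  # path of the folder holding the files
```
The returned path can then be passed to os.listdir to iterate over the downloaded files.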

298
00:20:55,640 --> 00:20:57,680
And so now that we have this.

299
00:20:57,800 --> 00:21:02,270
We'll get back here, and then we do this resize_rescale.

300
00:21:02,330 --> 00:21:05,210
Practically our preprocessing here is resizing and rescaling.

301
00:21:05,210 --> 00:21:11,270
So we are going to do a resize and a rescale of all images.

302
00:21:11,360 --> 00:21:18,890
You can copy this and then get back to where we actually have loaded our artifacts.

303
00:21:18,890 --> 00:21:23,780
When we print this out, we have this path to our different files.

304
00:21:23,780 --> 00:21:25,790
So we're going to have artifacts here.

305
00:21:25,790 --> 00:21:28,040
We can check out all those files here.

306
00:21:28,040 --> 00:21:32,390
And then right now we're going to create this other one.

307
00:21:32,390 --> 00:21:36,620
So let's copy this and then have this pasted here.

308
00:21:36,650 --> 00:21:42,110
Now, here we have this preprocess_data.

309
00:21:42,110 --> 00:21:44,180
So this is this method.

310
00:21:44,180 --> 00:21:48,770
We're going to be defining wandb.init, project, and entity as well.

311
00:21:48,770 --> 00:21:57,320
We've created this new run, and then we now have this preprocessed data, which is

312
00:21:57,320 --> 00:22:06,260
now this new artifact, and the name is preprocessed dataset, and then the type is

313
00:22:06,260 --> 00:22:18,650
preprocessed data. We have our preprocessed data. OK, for the description we'll say: a preprocessed version of

314
00:22:18,650 --> 00:22:22,130
the malaria dataset.

315
00:22:22,700 --> 00:22:28,340
Let's take this off. And then for the metadata, let's take all this off; let's

316
00:22:28,340 --> 00:22:29,660
have this taken off.

317
00:22:30,230 --> 00:22:33,680
But you could always put information in the metadata.

318
00:22:33,680 --> 00:22:39,920
We are now going to go through each and every file we have in this directory here.

319
00:22:39,920 --> 00:22:52,250
So we have for file in, let's say for f, for f in os.listdir of this directory here.

320
00:22:52,280 --> 00:22:56,690
Let's copy this from here. So we're going to create a list from this directory,

321
00:22:56,690 --> 00:22:58,100
and we go through this list.

322
00:22:58,100 --> 00:23:05,900
So we go through this list, and then for each and every file in this list, what we'll be doing is open up

323
00:23:05,900 --> 00:23:06,620
that file.

324
00:23:06,620 --> 00:23:12,170
So here you have artifact directory, which we've just defined, and then we have this modified directory

325
00:23:12,170 --> 00:23:18,560
plus F to specify the current file, and then we read that file as file.

326
00:23:18,560 --> 00:23:23,090
So from here now, once we've read this as file, we have x, y outputs.

327
00:23:23,090 --> 00:23:24,890
We call this output.

328
00:23:24,890 --> 00:23:32,900
So for each file, since each file we had has an X and Y, we'll take this from here and then we have

329
00:23:32,900 --> 00:23:35,080
this load file.

330
00:23:35,090 --> 00:23:40,250
Now, once we have this, we're going to have X or better still x full.

331
00:23:40,250 --> 00:23:41,090
That's X data.

332
00:23:41,090 --> 00:23:51,020
So let's call this x dataset, or rather dataset x, so we have our dataset x dot append.

333
00:23:51,020 --> 00:23:53,360
We're going to create this as a list.

334
00:23:53,360 --> 00:24:02,480
So we have here dataset x, a list, and then we have dataset y, another list.

335
00:24:02,660 --> 00:24:03,680
So that's it.

336
00:24:03,920 --> 00:24:12,440
We've created these two lists and then we want to take each and every element and append it to this dataset

337
00:24:12,440 --> 00:24:14,990
X and dataset Y respectively.

338
00:24:15,020 --> 00:24:23,120
Here we have that for x, and then we have dataset y dot append y.

339
00:24:23,120 --> 00:24:28,640
Now recall what we have to process as this is our preprocessing method.

340
00:24:28,640 --> 00:24:30,350
So here we have this resize.

341
00:24:30,350 --> 00:24:35,600
Let's make sure we run this resize rescale, and it takes in the image; it takes in x, basically.

342
00:24:35,600 --> 00:24:45,230
So here we have resize rescale; instead of passing x, we'll do resize rescale and then we pass in x.

343
00:24:45,470 --> 00:24:48,740
Now we ensure that we define this im size.

344
00:24:48,950 --> 00:24:57,590
Let's have this size defined here, im size equal to 224. Run that again; that should be fine.

345
00:24:57,890 --> 00:24:59,780
We need to also specify the.

346
00:24:59,900 --> 00:25:00,380
load argument.

347
00:25:00,380 --> 00:25:02,300
We have allow_pickle here.

348
00:25:02,300 --> 00:25:05,180
So we have this argument, which is set to true.

349
00:25:05,210 --> 00:25:08,240
You could always check this out in the documentation, right here.

350
00:25:09,080 --> 00:25:10,420
And then from here,

351
00:25:10,430 --> 00:25:13,060
We are not going to take this directly.

352
00:25:13,070 --> 00:25:13,430
Yeah.

353
00:25:13,430 --> 00:25:16,070
What we get is an npz array.

354
00:25:16,280 --> 00:25:27,140
And then to obtain x and y, we have x and y, which is equal to the npz array.

355
00:25:27,350 --> 00:25:31,010
We get the NumPy array and we get the values.

356
00:25:31,490 --> 00:25:33,560
So that's what we do to get this.

357
00:25:33,560 --> 00:25:39,700
You could always check out the documentation to understand how np.load and all of this work.

358
00:25:39,710 --> 00:25:43,310
Now once you have X and Y, this is what we pass in here.

359
00:25:43,310 --> 00:25:46,310
So from here we can run this.

360
00:25:46,340 --> 00:25:47,180
This looks fine.

361
00:25:47,180 --> 00:25:49,520
We have our data set and that's okay.

362
00:25:49,520 --> 00:25:56,870
But before running, we have to ensure that we convert this now into a TensorFlow dataset.

363
00:25:56,870 --> 00:26:03,320
So let's get this done by having this called data set.

364
00:26:04,220 --> 00:26:11,660
And we have tf.data.Dataset.from_tensor_slices.

365
00:26:11,660 --> 00:26:16,960
And then what we are going to pass in here is dataset x and dataset y.

366
00:26:16,970 --> 00:26:23,630
Now we're going to be saving this as a file, so we're going to take out this and we have this portion

367
00:26:23,630 --> 00:26:26,930
of this code right here, which is from the documentation.

368
00:26:26,930 --> 00:26:28,850
So here we have with artifact.

369
00:26:28,850 --> 00:26:34,850
The artifact here is preprocessed data.

370
00:26:34,850 --> 00:26:40,280
So with preprocessed data dot new file, we'll specify the file name.

371
00:26:40,580 --> 00:26:46,760
Let's call this preprocessed dataset, as file.

372
00:26:46,760 --> 00:26:52,270
We then save this with NumPy; we have np.savez.

373
00:26:52,280 --> 00:26:56,690
The compressed format we specify the file and the data to be saved.

374
00:26:56,690 --> 00:27:03,880
In this case our data is this dataset right here, which is this TensorFlow dataset. Okay,

375
00:27:03,890 --> 00:27:12,740
we can now do log artifact of preprocessed dataset, or rather preprocessed data.

376
00:27:12,770 --> 00:27:18,680
Then before running we are going to take out just a part of all this data set.

377
00:27:18,680 --> 00:27:21,020
So we'll take out only a thousand elements.

378
00:27:21,020 --> 00:27:27,650
And the reason why we're doing this is because we do not have enough memory to store all the

379
00:27:28,250 --> 00:27:34,370
27,558 different data points as a single variable.

380
00:27:34,370 --> 00:27:36,350
So we'll have that for now.

381
00:27:36,350 --> 00:27:40,640
Let's have this 1000 elements so you could see how this is done.

382
00:27:40,640 --> 00:27:46,820
Now, if you have a real world problem where you have, say, 100,000 different elements or 100,000

383
00:27:46,820 --> 00:27:51,380
different data points, then you could break them up into simpler parts.
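As a rough sketch of what breaking a large dataset into simpler parts could look like, here is a small chunking helper. The name iter_chunks and the chunk size of 1000 are assumptions made for illustration; they are not code from the video.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most chunk_size items (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


# Stand-in for a long list of file names; each chunk could be processed
# and saved on its own so that only one chunk sits in memory at a time.
file_names = [f"file_{i}.npz" for i in range(2500)]
chunks = list(iter_chunks(file_names, 1000))
print([len(chunk) for chunk in chunks])  # [1000, 1000, 500]
```

Each chunk could then be written to its own file inside the artifact, which is one possible way to stay within memory limits.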

384
00:27:51,380 --> 00:27:58,160
So here we have that 1000; we take out just 1000, and then one thing we'll do is copy out

385
00:27:58,160 --> 00:27:59,120
this part here.

386
00:27:59,120 --> 00:28:03,110
We'll copy out this part and then include it in this run.

387
00:28:03,110 --> 00:28:05,300
So before, just before this.

388
00:28:05,480 --> 00:28:06,800
So we include this.

389
00:28:06,800 --> 00:28:08,630
Here, in this run.

390
00:28:08,630 --> 00:28:15,980
And the fact that you include this in this run will link up this new dataset with our preprocessed

391
00:28:15,980 --> 00:28:16,670
data set.

392
00:28:16,700 --> 00:28:23,660
Now, that said, we could run this and then run the next cell. What we get is this error saying: cannot

393
00:28:23,660 --> 00:28:26,870
convert a tensor of dtype variant to a NumPy array.

394
00:28:26,870 --> 00:28:33,020
So here we are having this tensor of dtype variant and we're trying to store it as a NumPy array,

395
00:28:33,020 --> 00:28:34,340
and to solve this problem.

396
00:28:34,340 --> 00:28:42,620
Now what we'll do is we'll just ignore this and then save the dataset X and Y as this list we have dataset

397
00:28:42,620 --> 00:28:49,520
x and dataset y, so let's run this now and preprocess our data.
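Here is a minimal sketch of saving the x and y lists with NumPy and reading them back. It makes a few assumptions that differ from the video: the arrays are stored under the keys "x" and "y", an in-memory buffer stands in for the artifact file, and the tiny arrays are dummy data. With plain numeric arrays like these, allow_pickle is not needed when loading.

```python
import io

import numpy as np

# Dummy stand-ins for the preprocessed images and their labels.
dataset_x = [np.full((2, 2, 3), i, dtype=np.float32) for i in range(4)]
dataset_y = [0, 1, 0, 1]

# An in-memory buffer plays the role of the file opened from the artifact.
buffer = io.BytesIO()
np.savez(buffer, x=np.stack(dataset_x), y=np.asarray(dataset_y))

# Rewind and load; the keys match the keyword names passed to np.savez.
buffer.seek(0)
with np.load(buffer) as npz_array:
    x = npz_array["x"]
    y = npz_array["y"]

print(x.shape, y.shape)
```

Saving plain NumPy arrays rather than a tf.data.Dataset object avoids the dtype variant error, since np.savez only knows how to serialize arrays.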

398
00:28:50,480 --> 00:28:54,390
We now have successfully logged this to our artifact.

399
00:28:54,410 --> 00:28:57,680
Let's get to our dashboard, the wandb dashboard.

400
00:28:57,680 --> 00:28:59,960
Let's refresh this page here.

401
00:28:59,960 --> 00:29:01,790
You'll see your artifacts.

402
00:29:01,790 --> 00:29:06,830
You could click on artifacts, malaria detection artifacts, and then you could select from this.

403
00:29:06,830 --> 00:29:11,030
So this is what we had previously, this new data set under this raw data.

404
00:29:11,060 --> 00:29:16,220
Now, our preprocessed data: we have this preprocessed dataset and this most recent version that we

405
00:29:16,220 --> 00:29:17,810
got; you could check out the files.

406
00:29:17,810 --> 00:29:25,700
You see this file right here, the metadata, the API, which you could use now to carry out other operations.

407
00:29:26,180 --> 00:29:29,930
The overview, the graph view; this is our graph view right here.

408
00:29:29,960 --> 00:29:32,000
Now let's explode this graph view.

409
00:29:32,030 --> 00:29:32,410
Okay.

410
00:29:32,450 --> 00:29:34,160
We have exploded this graph view.

411
00:29:34,160 --> 00:29:35,360
Let's zoom.

412
00:29:35,390 --> 00:29:35,980
Okay.

413
00:29:35,990 --> 00:29:37,430
Now you see that?

414
00:29:37,430 --> 00:29:39,050
Let's drag this one here.

415
00:29:39,050 --> 00:29:45,690
And then what you'll notice is we have this first part which has to do with the creation or rather with

416
00:29:45,690 --> 00:29:47,900
the loading of our initial original dataset.

417
00:29:47,900 --> 00:29:55,480
And then once we load this original dataset, the next thing we did was to now pre process this dataset.

418
00:29:55,490 --> 00:29:58,580
Now, we carried out other runs previously.

419
00:29:58,580 --> 00:29:59,720
That's why you have this.

420
00:29:59,970 --> 00:30:02,910
You don't really need to take this into consideration from here.

421
00:30:02,910 --> 00:30:08,070
We just continue from this point here; we copy out this API code.

422
00:30:08,070 --> 00:30:12,240
We have that copied and then we get back to our code.

423
00:30:12,240 --> 00:30:13,230
We are going to start

424
00:30:13,230 --> 00:30:20,070
now with the data splitting. Let's paste this out here. You see that we've already had two

425
00:30:20,070 --> 00:30:21,240
artifacts created.

426
00:30:21,240 --> 00:30:25,410
One was the original data, the other the preprocessed data.

427
00:30:25,440 --> 00:30:30,570
The next will be the train data, validation data and test data.

428
00:30:30,570 --> 00:30:33,930
And so that's why we call this section the data splitting.

429
00:30:33,960 --> 00:30:39,570
We'll again copy out this part from the data preprocessing and paste it here.

430
00:30:40,320 --> 00:30:48,200
Now we have this, let's go ahead and take this off from here and then replace this one here.

431
00:30:48,210 --> 00:30:51,450
We have this artifact, okay?

432
00:30:51,450 --> 00:30:54,120
We have this artifact now, which is the ref.

433
00:30:54,150 --> 00:30:56,700
We're using our preprocessed dataset.

434
00:30:56,700 --> 00:31:00,690
And then from here we have three artifacts which we are going to create.

435
00:31:00,690 --> 00:31:03,360
We have train data.

436
00:31:03,690 --> 00:31:09,450
We can just call this train dataset; the type is preprocessed data; the description, training

437
00:31:09,450 --> 00:31:13,830
dataset; and then the artifact directory, we could get it from here.

438
00:31:13,830 --> 00:31:18,060
So when we create this artifact, we're going to get this artifact directory.

439
00:31:18,060 --> 00:31:24,600
For now, let's just create the other artifacts so we copy this out train data, we have validation

440
00:31:24,600 --> 00:31:28,140
data, and then the test data. To obtain the artifact directory,

441
00:31:28,140 --> 00:31:29,670
we are going to run this.

442
00:31:29,670 --> 00:31:31,890
We have this output here.

443
00:31:32,040 --> 00:31:33,630
Click on this artifact and see.

444
00:31:33,630 --> 00:31:40,530
We have this preprocessed dataset which has been loaded, and we have our prep dataset, the npz file, here.

445
00:31:40,530 --> 00:31:49,020
So here we copy this path and then in place of this artifact directory, we're going to place

446
00:31:49,020 --> 00:31:49,590
this path.

447
00:31:50,310 --> 00:31:52,020
Let's take this off.

448
00:31:52,020 --> 00:31:55,380
We have this path now which has been placed.

449
00:31:55,380 --> 00:31:57,330
We have now the artifact.

450
00:31:57,330 --> 00:32:09,540
Let's call this artifact file and then let's paste this out here, take this off and there we go.

451
00:32:09,810 --> 00:32:11,760
We've now taken all this off.

452
00:32:12,450 --> 00:32:19,820
Okay, so we have this here, and then we're going to have our artifact file instead.

453
00:32:19,830 --> 00:32:22,890
So here, artifact file.

454
00:32:22,890 --> 00:32:29,820
So we read that as file, and then we're going to load this file, with allow_pickle, and get the array.

455
00:32:30,780 --> 00:32:33,530
Then at this point, we define the train split, val

456
00:32:33,530 --> 00:32:37,620
split and test split. From here, we're going to paste this out.

457
00:32:37,620 --> 00:32:42,300
We have our train array because we're trying to create these three different arrays.

458
00:32:42,300 --> 00:32:47,100
We have train array, which takes values from zero to the train split.

459
00:32:47,850 --> 00:32:54,300
We define a data length, which is the length of our array at index

460
00:32:54,840 --> 00:32:55,800
zero.

461
00:32:55,800 --> 00:33:01,740
Now, recall what the array we have is made of.

462
00:33:01,770 --> 00:33:05,370
This is a list, actually, made of two parts: the x and the y.

463
00:33:05,400 --> 00:33:08,790
This x has a length of 1000, the Y length of 1000.

464
00:33:08,790 --> 00:33:14,220
That's why we're doing array the zero index of array, and then we're taking this length.

465
00:33:14,220 --> 00:33:17,730
So we are going to get the length, a value of 1000.

466
00:33:17,730 --> 00:33:23,910
And once we have this data length now we're saying that we're picking out those X, we're picking out

467
00:33:23,910 --> 00:33:31,470
this X right here, and then we take in values from 0 to 80% of the total length.

468
00:33:31,470 --> 00:33:36,720
And so we have train split, which is 0.8, times data

469
00:33:36,750 --> 00:33:39,690
len. So that's what we have.

470
00:33:39,960 --> 00:33:48,450
And then we repeat the same for this Y, then for the validation array, we're going to start from this

471
00:33:48,450 --> 00:33:52,680
train splits, we're going to start from here times that.

472
00:33:52,680 --> 00:33:59,460
So to get the exact index with respect to the total dataset, that's why we're multiplying by the data length;

473
00:33:59,460 --> 00:34:06,750
we're going to go from here right up to the train split plus the validation split.

474
00:34:06,750 --> 00:34:12,810
So now we're going from 0.8 to 0.9 and we have that set.

475
00:34:12,810 --> 00:34:17,250
We'll just repeat this for the validation for the Y right here.

476
00:34:17,250 --> 00:34:25,410
So let's copy this out and then paste this here.

477
00:34:26,880 --> 00:34:27,750
That looks fine.

478
00:34:27,750 --> 00:34:33,120
We have that pasted; we have it here and here. This looks fine.

479
00:34:33,120 --> 00:34:36,780
Let's repeat the same process for the test array.

480
00:34:37,380 --> 00:34:41,500
Copy this and paste it out here: test array.

481
00:34:41,520 --> 00:34:47,150
We're going to go from train split plus val split right up to the end.

482
00:34:47,160 --> 00:34:50,340
So we're going from 0.8 plus 0.1.

483
00:34:50,340 --> 00:34:55,730
That is 0.9 times the total data length, from the 900th value right to the end.

484
00:34:55,740 --> 00:34:58,770
We just copy this again out here.

485
00:34:59,040 --> 00:35:03,450
Copy that out and then paste this for the y value.

486
00:35:03,600 --> 00:35:09,950
So we have this here, take this off, paste it, and then we go right to the end.

487
00:35:09,960 --> 00:35:11,730
So that looks fine again.

488
00:35:11,730 --> 00:35:16,230
We now have our train array, validation array and test array.

489
00:35:16,260 --> 00:35:21,570
We're now set to write this information in our artifacts.

490
00:35:21,570 --> 00:35:24,570
So here we have with pre processed data.

491
00:35:24,570 --> 00:35:27,660
Instead of this, we now have train here; recall our artifacts.

492
00:35:27,660 --> 00:35:31,710
We've just created these as train data, validation data and test data.

493
00:35:31,710 --> 00:35:32,890
So with train data

494
00:35:32,910 --> 00:35:42,870
dot new file, we call this train dataset npz, mode wb; we save to the file, and then we are saving

495
00:35:42,870 --> 00:35:43,770
train.

496
00:35:43,770 --> 00:35:45,810
What we're saving is actually just train array.

497
00:35:45,810 --> 00:35:52,860
So we're just saving this train array and then we repeat the same process for the validation and the testing.

498
00:35:53,280 --> 00:36:00,930
Okay, we have that: validation, validation, then test.

499
00:36:00,930 --> 00:36:06,390
And then here we have test, validation.

500
00:36:06,990 --> 00:36:16,440
Okay, here we have validation and then here too we have test. Then now we log our different

501
00:36:16,440 --> 00:36:17,160
artifacts.

502
00:36:17,160 --> 00:36:21,870
So we're not only logging this one artifact, but the three different artifacts.

503
00:36:21,960 --> 00:36:23,760
So take this back then.

504
00:36:23,760 --> 00:36:33,900
Here we log train data, validation data and test data.

505
00:36:33,900 --> 00:36:35,370
Okay, we have that set.

506
00:36:35,370 --> 00:36:36,750
We could run this.

507
00:36:36,750 --> 00:36:39,390
Let's change it to split data.

508
00:36:39,480 --> 00:36:41,850
We run this cell here.

509
00:36:41,850 --> 00:36:45,480
Everything looks fine, and then we move on to split our data.

510
00:36:45,510 --> 00:36:46,320
Split data?

511
00:36:46,320 --> 00:36:51,030
We run that cell and wait for the response here, the output we get.

512
00:36:51,030 --> 00:36:58,650
Now, the reason why we're having this error is because we must have integers and not floats at this level.

513
00:36:58,650 --> 00:37:05,310
So when we have these indices, we have to convert them all into integers.

514
00:37:05,310 --> 00:37:07,410
So we have this int here.

515
00:37:07,950 --> 00:37:08,970
There we go.

516
00:37:09,120 --> 00:37:16,650
int. And from here we now run this cell again and then split our data.
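The split logic with the int conversion can be sketched as follows. This is a minimal, standalone version that assumes plain Python lists in place of the loaded arrays; the helper name split_indices is hypothetical and not taken from the video.

```python
from typing import Tuple


def split_indices(data_len: int, train_split: float = 0.8,
                  val_split: float = 0.1) -> Tuple[int, int]:
    """Return the integer end indices of the train and validation slices."""
    # Slice bounds must be integers, so the float products are converted with int().
    train_end = int(train_split * data_len)
    val_end = int((train_split + val_split) * data_len)
    return train_end, val_end


# Dummy stand-ins for the x and y parts of the loaded array.
x = list(range(1000))
y = [i % 2 for i in range(1000)]

data_len = len(x)
train_end, val_end = split_indices(data_len)

train_x, train_y = x[:train_end], y[:train_end]
val_x, val_y = x[train_end:val_end], y[train_end:val_end]
test_x, test_y = x[val_end:], y[val_end:]

print(len(train_x), len(val_x), len(test_x))  # 800 100 100
```

Without the int() calls, a slice such as x[:0.8 * data_len] raises a TypeError, which matches the kind of error described above.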

517
00:37:16,680 --> 00:37:18,960
The data has now been split successfully.

518
00:37:18,960 --> 00:37:21,780
Let's get back to our dashboard right here.

519
00:37:21,780 --> 00:37:25,380
We have our wandb dashboard, artifacts.

520
00:37:25,380 --> 00:37:27,450
Let's refresh this page.

521
00:37:27,660 --> 00:37:33,360
As you could see, we have raw data, preprocessed data, and now we have the test data, validation

522
00:37:33,360 --> 00:37:35,520
data and our training data.

523
00:37:35,550 --> 00:37:42,120
Let's click on this training data so you could look at the files here, the metadata or the API.

524
00:37:42,120 --> 00:37:46,440
We can always make use of this in creating other artifacts.

525
00:37:46,440 --> 00:37:49,560
And let's get to the graph view. In this graph view,

526
00:37:49,560 --> 00:37:57,570
we'll now be able to see the link between this original dataset, the preprocessed data and the train data

527
00:37:57,570 --> 00:37:57,950
set.

528
00:37:57,960 --> 00:37:58,920
Here's what we get.

529
00:37:58,920 --> 00:38:02,310
Let's click on Explode and you could see this clearly now.

530
00:38:02,310 --> 00:38:08,760
So yeah, you see, you have this original data set, this artifact here we have this run which produces

531
00:38:08,760 --> 00:38:15,570
this data preprocessing artifact, and then we have this run which produces this train data set validation

532
00:38:15,570 --> 00:38:18,390
data set and test data set.

533
00:38:18,390 --> 00:38:22,170
So let's drag this; take this off.

534
00:38:22,890 --> 00:38:25,290
We have this path here.

535
00:38:25,290 --> 00:38:27,120
You see, we have this here.

536
00:38:27,120 --> 00:38:30,540
We take this, this and this.

537
00:38:30,690 --> 00:38:35,640
Let's copy out this code right here and then start with the data augmentation.

538
00:38:35,640 --> 00:38:37,890
So from here, we now download the train data set.

539
00:38:37,890 --> 00:38:39,030
We could check this out.

540
00:38:39,150 --> 00:38:42,570
Our artifacts, you should have train data set right here.

541
00:38:42,570 --> 00:38:43,110
There we go.

542
00:38:43,110 --> 00:38:44,520
We have our train data set.

543
00:38:44,520 --> 00:38:47,250
We can now copy this part.

544
00:38:47,430 --> 00:38:51,120
Let's copy this path and paste it here.

545
00:38:51,150 --> 00:38:55,910
Now we could run this wandb.finish to stop that run, and that should be fine.

546
00:38:55,920 --> 00:39:01,620
Now let's go ahead and copy out this part of the code which was used for preprocessing.

547
00:39:01,620 --> 00:39:08,310
So here we have this preprocess data; we copy this and then, there we go, let's paste

548
00:39:08,310 --> 00:39:13,740
this here. We have our augment data and then project, artifacts.

549
00:39:13,740 --> 00:39:16,980
We're going to use this artifacts here.

550
00:39:17,130 --> 00:39:22,890
So let's come right here and then get this path.

551
00:39:23,520 --> 00:39:24,090
Okay?

552
00:39:24,090 --> 00:39:25,620
So we're going to get that.

553
00:39:26,000 --> 00:39:31,760
And then replace this one with this path, the path to the train dataset.

554
00:39:32,630 --> 00:39:35,870
Let's paste this out and this should be fine.

555
00:39:36,740 --> 00:39:37,850
There's actually a file.

556
00:39:37,850 --> 00:39:39,890
So we have artifact file.

557
00:39:39,950 --> 00:39:46,580
Now we have this; take this off. Instead of preprocessed data, we're going to use

558
00:39:46,580 --> 00:39:59,360
augmented data: wandb.Artifact, augmented dataset; the type,

559
00:39:59,360 --> 00:40:05,750
let's say preprocessed data; and the description, an augmented version

560
00:40:07,590 --> 00:40:11,370
of the malaria train dataset.

561
00:40:11,880 --> 00:40:14,250
So everything looks fine for now.

562
00:40:14,250 --> 00:40:15,270
We have that.

563
00:40:15,270 --> 00:40:22,630
And then here, let's get back to this and then copy this out and paste it right here.

564
00:40:22,680 --> 00:40:25,890
So we're getting this from our training data, obviously.

565
00:40:26,280 --> 00:40:27,300
There we go.

566
00:40:27,480 --> 00:40:29,700
And everything looks fine.

567
00:40:29,730 --> 00:40:38,120
The next thing to do is the actual augmentation, and then we log the dataset to our artifact.

568
00:40:38,130 --> 00:40:42,570
So let's take this back and then we take this off.

569
00:40:42,570 --> 00:40:44,580
Now we have this.

570
00:40:44,580 --> 00:40:47,910
We open up this artifact file, right

571
00:40:47,910 --> 00:40:50,400
here: artifact file.

572
00:40:50,520 --> 00:40:53,760
We open that up, we obtain our array.

573
00:40:53,910 --> 00:41:01,410
Let's call that array. Then for images, or rather for image in the array.

574
00:41:01,860 --> 00:41:04,920
Our array, or rather, we've taken out x.

575
00:41:04,920 --> 00:41:11,790
So for image in x. Before moving on, we're going to create this list here, dataset x.

576
00:41:11,790 --> 00:41:19,560
And then we have that. Okay, we have dataset x, and then dataset x dot append augment of whatever image

577
00:41:19,560 --> 00:41:20,520
we want to pass in.

578
00:41:20,640 --> 00:41:23,570
Let's indent this one step and that's fine.

579
00:41:23,580 --> 00:41:27,870
So for all images, we are going to do this augmentation.

580
00:41:27,870 --> 00:41:35,310
Then after this we have dataset Y, which is simply the unchanged labels we've had already.
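The augmentation loop can be sketched like this. It assumes a toy left-right flip on nested lists as the augment function; that function and the helper names are hypothetical stand-ins, since the video's own augment method is defined elsewhere.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]


def flip_left_right(image: Image) -> Image:
    """Toy augmentation: mirror every row of the image (hypothetical stand-in)."""
    return [list(reversed(row)) for row in image]


def augment_dataset(images: Sequence[Image], labels: Sequence[int],
                    augment: Callable[[Image], Image]) -> Tuple[List[Image], List[int]]:
    """Apply augment to every image and keep the labels unchanged."""
    dataset_x = [augment(image) for image in images]
    dataset_y = list(labels)
    return dataset_x, dataset_y


images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
labels = [0, 1]

aug_x, aug_y = augment_dataset(images, labels, flip_left_right)
print(aug_x[0])  # [[2, 1], [4, 3]]
print(aug_y)     # [0, 1]
```

Only the images pass through the augmentation; the labels are copied over as they are, which mirrors how dataset y is handled here.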

581
00:41:35,310 --> 00:41:40,080
So from here now we have our dataset x and our dataset y.

582
00:41:40,290 --> 00:41:49,920
Then we can now run the cell and then we run the augment data we have here, augment data, and that should

583
00:41:49,920 --> 00:41:50,640
be fine.

584
00:41:50,820 --> 00:41:54,240
We obtain this error: preprocessed data is not defined.

585
00:41:54,240 --> 00:41:59,040
So let's check back here and we see that this should be augmented data instead.

586
00:41:59,040 --> 00:42:00,600
So let's change this.

587
00:42:00,600 --> 00:42:02,850
And we have augmented data.

588
00:42:03,060 --> 00:42:03,990
There we go.

589
00:42:03,990 --> 00:42:07,710
We run these cells and see what we get.

590
00:42:07,710 --> 00:42:10,620
Now, those artifacts have been logged successfully.

591
00:42:10,620 --> 00:42:17,340
We could get to wandb and check this out. Here we have this, and then we click on augmented dataset.

592
00:42:17,340 --> 00:42:19,500
So loading the artifacts.

593
00:42:20,310 --> 00:42:21,380
Now that's loaded.

594
00:42:21,390 --> 00:42:25,260
Let's click on this and check out this graph.

595
00:42:25,260 --> 00:42:26,550
So let's explode.

596
00:42:26,550 --> 00:42:28,020
And then this is what we have now.

597
00:42:28,020 --> 00:42:34,470
So you see again that we have this path, we have our data, we separate this or split this into train

598
00:42:34,470 --> 00:42:38,160
validation and testing. Here you have the training, validation and testing.

599
00:42:38,160 --> 00:42:45,300
And then after the training we have this run which converts this training data into augmented data,

600
00:42:45,300 --> 00:42:47,280
which is this one right here.

601
00:42:47,940 --> 00:42:54,810
And so at this point, we have different versions of our data which we could make use of depending on

602
00:42:54,810 --> 00:42:55,740
our needs.

603
00:42:55,920 --> 00:42:59,490
Thank you for getting right up to this point and see you next time.
