1
00:00:00,110 --> 00:00:02,600
Hi there and welcome to this new section.

2
00:00:02,600 --> 00:00:10,730
Previously, we saw that we could train a model such that it takes in input-output pairs.

3
00:00:10,730 --> 00:00:13,070
That is, these input-output pairs.

4
00:00:13,070 --> 00:00:23,060
And then later on, when given new data, it is able to tell us what the price of the car is based on

5
00:00:23,060 --> 00:00:24,140
this new data.

6
00:00:24,140 --> 00:00:31,040
And what this tells us is our model relies a lot on the data for performance.

7
00:00:31,040 --> 00:00:38,990
And so in this section we shall be focusing on how to prepare our data so that the model makes

8
00:00:38,990 --> 00:00:40,250
use of it for training.

9
00:00:40,250 --> 00:00:43,250
So we're going to head over to Kaggle.com.

10
00:00:43,250 --> 00:00:51,740
And more specifically, to Mayank Patel's Kaggle dataset entitled Second Hand Cars Data Set.

11
00:00:51,770 --> 00:00:55,580
Now this data set is what we'll be using in this section.

12
00:00:55,580 --> 00:00:57,950
So here we have the different features.

13
00:00:57,950 --> 00:01:04,430
You go from the ID to on road old, on road now, years, kilometers, rating, and so on and so forth.

14
00:01:04,430 --> 00:01:08,780
You'll note that we are not going to make use of on road old and on road

15
00:01:08,780 --> 00:01:11,690
now, as these two features aren't very clear.

16
00:01:11,690 --> 00:01:20,000
Generally, in machine learning we want to have an idea or better still, know in detail what the features

17
00:01:20,000 --> 00:01:21,200
we're working with.

18
00:01:21,290 --> 00:01:22,460
are.

19
00:01:22,460 --> 00:01:23,780
So that's it.

20
00:01:23,780 --> 00:01:26,360
You could open the Detail view.

21
00:01:26,360 --> 00:01:29,780
You have all the details of the different features.

22
00:01:29,780 --> 00:01:38,090
So for ID 1, we have on road old, 535,651, and then on road

23
00:01:38,090 --> 00:01:40,130
Now we have number of years.

24
00:01:40,130 --> 00:01:46,040
So this tells us that the car with ID 1 has been in use for three years.

25
00:01:46,160 --> 00:01:52,310
The kilometers, the number of kilometers covered is 78,945.

26
00:01:52,340 --> 00:01:56,660
The rating is one, the condition is two, and so on and so forth.

27
00:01:56,660 --> 00:02:00,650
So we have all this information right up to the horsepower.

28
00:02:00,650 --> 00:02:05,840
Then if we click on Column here, you have this sort of summary.

29
00:02:05,840 --> 00:02:09,860
So for every feature let's take the number of years.

30
00:02:09,860 --> 00:02:15,800
For example, you have the number of valid samples we have in our dataset.

31
00:02:15,800 --> 00:02:17,960
We have a thousand valid samples.

32
00:02:17,960 --> 00:02:22,340
The number of mismatched samples is zero and the number of missing samples is zero.

33
00:02:22,370 --> 00:02:29,270
The mean is 4.056 and the standard deviation is 1.72.

34
00:02:29,300 --> 00:02:30,530
Check out another feature.

35
00:02:30,530 --> 00:02:35,540
Like the number of kilometers covered, you have a mean of 100,000.

36
00:02:35,900 --> 00:02:39,680
while the standard deviation is 29.1 thousand.

37
00:02:39,680 --> 00:02:47,780
So what this simply means is that most of the values we have will lie

38
00:02:47,780 --> 00:02:54,860
in the range of 100 thousand minus 29.1 thousand to 100 thousand plus 29.1 thousand.

39
00:02:54,860 --> 00:03:00,500
Our normal distribution curve will look like this, where we have the mean at 100 thousand

40
00:03:00,500 --> 00:03:07,580
and the standard deviation, which is 29.1 thousand, meaning that the values fall within the range 100,

41
00:03:07,580 --> 00:03:12,200
because here we have 100, minus 29.

42
00:03:12,230 --> 00:03:17,360
Obviously this is in thousands, since we have k, so 100 minus 29.

43
00:03:17,360 --> 00:03:21,470
And here we have 100 plus 29.

44
00:03:21,470 --> 00:03:24,920
That's essentially about 81.

45
00:03:24,920 --> 00:03:27,440
Actually, it's not 81, it's about 71.

46
00:03:27,440 --> 00:03:31,730
So we're going from 71 to 129.

47
00:03:31,730 --> 00:03:35,720
So what this means is that most values fall within this range.

48
00:03:35,720 --> 00:03:43,130
Or the fact that a value is in this range means that it has a higher probability of appearing

49
00:03:43,130 --> 00:03:44,060
in our data set.
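As a quick check of the arithmetic above, here is a minimal sketch that computes the one-standard-deviation range from the rounded values shown on Kaggle. The variable names are just for illustration, and the reading that "most" values fall in this range assumes the kilometers are roughly normally distributed.

```python
# Rounded summary values for the kilometers feature, as read off the Kaggle page.
mean_km = 100_000
std_km = 29_100

# One standard deviation on either side of the mean.
low = mean_km - std_km
high = mean_km + std_km

print(low, high)  # 70900 129100, i.e. roughly 71k to 129k
```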

50
00:03:44,150 --> 00:03:44,990
At this point.

51
00:03:44,990 --> 00:03:49,070
You could go ahead and click right here and download this data set.

52
00:03:49,070 --> 00:03:51,500
So it's just 25kB.

53
00:03:51,500 --> 00:03:54,290
You will have your train.csv file downloaded.

54
00:03:54,290 --> 00:04:01,550
Then when you download and extract that file, you could simply drag it into your Colab

55
00:04:01,550 --> 00:04:02,090
notebook.

56
00:04:02,090 --> 00:04:08,210
So notice how here you have this "Drop files to upload them to session storage" prompt. You drag and

57
00:04:08,210 --> 00:04:09,500
drop in here.

58
00:04:09,500 --> 00:04:14,330
You should also note that once you disconnect from the session, you're going to lose this file.

59
00:04:14,330 --> 00:04:18,770
So it's only in the session that we have access to this file.

60
00:04:18,770 --> 00:04:23,930
And so with the file copied in here, we could start with importing.

61
00:04:23,930 --> 00:04:27,050
So let's import TensorFlow.

62
00:04:27,080 --> 00:04:29,990
import tensorflow as tf.

63
00:04:29,990 --> 00:04:34,160
This is for working with models.

64
00:04:34,160 --> 00:04:42,350
So, modeling, or deep learning modeling. Then we have import pandas,

65
00:04:42,710 --> 00:04:44,600
pandas as pd.

66
00:04:45,560 --> 00:04:52,220
This will be used for reading and processing data.

67
00:04:52,220 --> 00:04:57,260
And then we have import seaborn as sns.

68
00:04:57,410 --> 00:04:59,660
This will be used for

69
00:05:00,330 --> 00:05:01,110
visualizations.

70
00:05:01,140 --> 00:05:06,330
Now we could run this and then click to open our train.csv file.

71
00:05:06,330 --> 00:05:12,210
You see we have this train.csv file right here, which could show 100 samples.

72
00:05:12,210 --> 00:05:15,960
You see that we have 1 to 100 of 1000 entries.

73
00:05:15,960 --> 00:05:20,280
So we have 1000 different input-output pairs,

74
00:05:20,700 --> 00:05:23,370
exactly the same as we had seen on Kaggle.

75
00:05:23,370 --> 00:05:28,500
So here we have the ID, on road old, on road now, years, kilometers, and so on and so forth,

76
00:05:28,740 --> 00:05:29,670
up to torque.

77
00:05:29,670 --> 00:05:30,570
That's for the inputs.

78
00:05:30,570 --> 00:05:34,140
And then now we have the output which is the current price.

79
00:05:34,140 --> 00:05:35,730
So with that we close that.

80
00:05:35,730 --> 00:05:37,920
And then now let's copy this path.

81
00:05:37,920 --> 00:05:47,130
So we copy that path and then we do data = pd.read_csv and specify that path.

82
00:05:47,130 --> 00:05:50,340
So let's just create a file path here.

83
00:05:50,430 --> 00:05:51,540
File path.

84
00:05:51,900 --> 00:05:53,070
There we go.

85
00:05:53,070 --> 00:05:54,720
We have our file path.

86
00:05:54,720 --> 00:05:58,080
Now we specify our file path.

87
00:05:58,080 --> 00:06:00,300
And then we also specify the separator.

88
00:06:00,300 --> 00:06:08,430
So here we have CSV, which stands for comma-separated values.

89
00:06:09,030 --> 00:06:15,750
And giving this separator as a comma simply tells us that each and every value is separated by a

90
00:06:15,750 --> 00:06:16,140
comma.

91
00:06:16,140 --> 00:06:17,760
So that is that simple.

92
00:06:17,760 --> 00:06:23,970
So let's run this, and then we call data.head() and see what we get.

93
00:06:24,270 --> 00:06:27,960
Open this train.csv file with your text editor.

94
00:06:27,960 --> 00:06:33,960
And one thing you will notice is that all the elements in a given row are

95
00:06:33,960 --> 00:06:34,890
separated with a comma.

96
00:06:34,890 --> 00:06:36,600
So we have the ID, one.

97
00:06:36,600 --> 00:06:42,900
We have on road old 535,651, which is this.

98
00:06:43,500 --> 00:06:45,120
Let's get back.

99
00:06:45,120 --> 00:06:49,170
We have the on road now 798,000.

100
00:06:49,200 --> 00:06:51,600
That's it here and so on and so forth.

101
00:06:51,600 --> 00:06:53,940
So you have three, you have the kilometers.

102
00:06:53,940 --> 00:06:55,530
You could scroll.

103
00:06:55,530 --> 00:06:57,210
So you get to the current price.

104
00:06:57,660 --> 00:07:01,740
We have the current price 351,318.

105
00:07:01,740 --> 00:07:09,420
And this comma too separates the header row's different columns, like v.id, on road old, on road

106
00:07:09,420 --> 00:07:12,090
now, up to the current price.

107
00:07:12,090 --> 00:07:18,210
If you happen to work with CSVs which are separated by something other than the comma, like say for

108
00:07:18,210 --> 00:07:25,380
example semicolon, then you would want to change the separator and have the semicolon.
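To make the separator point concrete, here is a minimal sketch using pandas `read_csv` with its `sep` argument. The two tiny in-memory tables and their column names are made up for illustration and are not the real train.csv; in the notebook you would pass your own file path instead of a `StringIO` object.

```python
import io

import pandas as pd

# A tiny made-up table, once comma-separated and once semicolon-separated.
comma_text = "id,years,km\n1,3,78945\n2,6,117220\n"
semicolon_text = comma_text.replace(",", ";")

# The separator passed to read_csv has to match the one used in the file.
data_comma = pd.read_csv(io.StringIO(comma_text), sep=",")
data_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(data_comma.head())
print(data_comma.equals(data_semicolon))  # both parse to the same table
```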

109
00:07:25,380 --> 00:07:30,780
Now with that said, let's get back to our CSV file.

110
00:07:30,780 --> 00:07:33,840
This is just an extract of 21 samples.

111
00:07:34,200 --> 00:07:40,620
Here you could see already that this CSV file contains our dataset.

112
00:07:40,620 --> 00:07:44,220
Looking at this, we could break up our data set into the two main parts.

113
00:07:44,220 --> 00:07:45,900
There are the inputs and the outputs.

114
00:07:45,900 --> 00:07:54,300
So let's just take all this. We have here the inputs, right up to this point.

115
00:07:54,300 --> 00:07:56,910
We have the inputs see.

116
00:07:56,910 --> 00:08:01,230
And then to the right we have the outputs.

117
00:08:01,230 --> 00:08:07,260
So if we have 21 samples, like in this case, then it means we have 21 rows.

118
00:08:07,260 --> 00:08:13,230
So we have a 2D tensor which is 21 by the number of columns.

119
00:08:13,230 --> 00:08:18,540
Here we have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.

120
00:08:18,540 --> 00:08:20,580
So we have 11 columns.

121
00:08:20,580 --> 00:08:25,950
While on the output side we again have 21 rows.

122
00:08:25,950 --> 00:08:30,090
But this time around we have only a single column.

123
00:08:30,570 --> 00:08:34,830
Now we could generalize this, so we could go from 21 to N.

124
00:08:34,830 --> 00:08:40,950
So depending on the number of samples we have, we could have N, and then we have 11.

125
00:08:40,950 --> 00:08:43,710
And then here we have n by one.

126
00:08:43,710 --> 00:08:47,670
So we have this 2D tensor which is N by 11.

127
00:08:47,670 --> 00:08:50,250
And this other 2D tensor which is n by one.

128
00:08:50,250 --> 00:08:53,190
After loading the data you could still check out the shape.

129
00:08:53,190 --> 00:08:56,550
Now note that this shape contains both the inputs and the outputs.

130
00:08:56,550 --> 00:09:01,110
So the N is maintained at 1000, because we have 1000 samples.

131
00:09:01,110 --> 00:09:03,360
But now we have a combination of the inputs and the outputs.

132
00:09:03,360 --> 00:09:05,940
So we have 11 plus one which is 12.

133
00:09:06,240 --> 00:09:12,180
From this point we will make use of Seaborn to carry out some visualizations.

134
00:09:12,180 --> 00:09:21,840
So we'll make use of the pairplot function from Seaborn and compare the different features.

135
00:09:22,140 --> 00:09:29,340
So we have number of years, kilometers, rating, condition, up to torque for the inputs, and then for

136
00:09:29,340 --> 00:09:29,970
the output.

137
00:09:29,970 --> 00:09:32,010
We have the current price.

138
00:09:32,010 --> 00:09:40,980
Anyway, in this pair plot we are comparing each and every feature with the others.

139
00:09:40,980 --> 00:09:46,950
So we'll compare years with all these other features and for example economy with all the other features.

140
00:09:46,950 --> 00:09:53,010
So let's run this and see what we get. Here is the overall pair plot we're getting.

141
00:09:53,010 --> 00:09:59,580
And if we copy this onto our board, you see that for each and every one,

142
00:09:59,710 --> 00:10:06,160
for each and every feature, we have a plot which shows how that feature is related to the

143
00:10:06,160 --> 00:10:06,580
others.

144
00:10:06,580 --> 00:10:12,070
So this plot here, for example, shows how years is related to itself.

145
00:10:12,070 --> 00:10:17,530
And then this plot shows how years is related to this other one, kilometers.

146
00:10:17,530 --> 00:10:19,450
So let's just look at this.

147
00:10:19,450 --> 00:10:25,660
Let's zoom in and take a closer look at this. Here we have years.

148
00:10:26,680 --> 00:10:28,690
And here we have kilometers.

149
00:10:28,690 --> 00:10:36,520
So the way kilometers is related to the number of years tells us that whether a car has been

150
00:10:36,520 --> 00:10:43,900
used for maybe, say, two years or six years, depending on the driver, we may have a higher

151
00:10:43,900 --> 00:10:47,380
value for the number of kilometers covered or a lower value.

152
00:10:47,620 --> 00:10:57,580
So if we have a car which is brand new and that is used maybe to go to work

153
00:10:57,580 --> 00:11:02,770
and get back home, then we could fall around this range here, with fewer kilometers covered.

154
00:11:02,770 --> 00:11:11,110
But if you just bought a brand new car to be used as a taxi, then you would fall in this range,

155
00:11:11,110 --> 00:11:15,310
as you're basically using the car for the whole day.

156
00:11:15,460 --> 00:11:19,870
Also, you may have bought a second car or a third car.

157
00:11:19,870 --> 00:11:24,040
So you have maybe three cars, and you rarely use the third car.

158
00:11:24,040 --> 00:11:28,360
So even after seven years, the car hasn't run so many kilometers.

159
00:11:28,360 --> 00:11:35,260
On the other hand, even with normal usage, after seven years we expect that the car has run

160
00:11:35,320 --> 00:11:36,400
many kilometers.

161
00:11:36,400 --> 00:11:44,890
So actually, this shows us that the number of kilometers covered depends on the usage of the

162
00:11:44,890 --> 00:11:45,280
car.

163
00:11:45,280 --> 00:11:52,000
And so whether the car is one year old or seven years old, the usage or the number of kilometers covered

164
00:11:52,000 --> 00:11:52,870
may be different.

165
00:11:52,870 --> 00:11:59,830
One other plot which we would like to look at is this one, comparing the current

166
00:11:59,830 --> 00:12:02,830
price and the number of kilometers covered.

167
00:12:02,830 --> 00:12:08,620
You will notice that as there is an increase in the number of kilometers

168
00:12:08,620 --> 00:12:11,440
covered, the price tends to drop.

169
00:12:11,440 --> 00:12:18,370
And unlike these other features, where we don't have a direct or inverse correlation with the current

170
00:12:18,370 --> 00:12:24,820
price, if you increase the number of kilometers or if the car has run through so many kilometers,

171
00:12:24,820 --> 00:12:32,890
then the car's price tends to drop regardless of the usage of the car, regardless of the type of car

172
00:12:32,890 --> 00:12:34,540
or any other factor.

173
00:12:34,540 --> 00:12:38,620
So this particular feature is really important.

174
00:12:38,620 --> 00:12:44,110
With that said, feel free to check out the Seaborn documentation to learn more on

175
00:12:44,110 --> 00:12:50,200
how to visualize data so that you could make much better and quicker decisions.

176
00:12:50,530 --> 00:12:54,910
Let's go ahead and add this header right here.

177
00:12:54,910 --> 00:13:00,040
So we would have Data Preparation.

178
00:13:00,070 --> 00:13:01,120
There we go.

179
00:13:01,120 --> 00:13:09,400
We have that header. Before we go on to extract the data and create the separate inputs X and output

180
00:13:09,400 --> 00:13:10,150
y.

181
00:13:10,180 --> 00:13:12,910
We are going to shuffle this data.

182
00:13:12,940 --> 00:13:20,140
Now the reason why we're shuffling this data is because we want to avoid the model being biased based

183
00:13:20,140 --> 00:13:23,440
on the order in which the data was collected.

184
00:13:23,440 --> 00:13:28,360
So we shall simply create randomly shuffled data.

185
00:13:28,360 --> 00:13:35,410
So here we have the shuffled data, using tf.random.shuffle.

186
00:13:35,410 --> 00:13:37,480
And then you pass in the data.

187
00:13:37,480 --> 00:13:38,440
So that's it.

188
00:13:38,440 --> 00:13:42,640
Now if we do shuffle um data head.

189
00:13:43,210 --> 00:13:50,260
Now if we get the first five elements of the shuffled data

190
00:13:50,260 --> 00:13:57,910
and then we compare with the data first five, you would find that the ordering is going to be different.

191
00:13:57,910 --> 00:14:13,060
So here, unlike the data, whose IDs are 1, 2, 3, 4, 5, the shuffled data has 288, 201, 882, 557, and 313.

192
00:14:13,060 --> 00:14:16,120
So this shows us that our data has been shuffled.
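Here is a minimal sketch of that shuffling step on a small made-up tensor, so the effect is easy to see. `tf.random.shuffle` shuffles along the first dimension, so whole rows move together; the two-column tensor below is a stand-in for the real data, which is assumed to have been converted to a tensor first.

```python
import tensorflow as tf

# Made-up stand-in for the dataset: first column is an id, second a value tied to it.
tensor_data = tf.constant([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]])

# Rows are reordered randomly; the values within each row stay together.
shuffled_data = tf.random.shuffle(tensor_data)

print(tensor_data[:5])
print(shuffled_data[:5])
```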

193
00:14:16,120 --> 00:14:20,890
Then since we want to get x and y we'll simply have our x.

194
00:14:20,890 --> 00:14:24,970
Now this x is going to contain the.

195
00:14:25,000 --> 00:14:26,710
If you look at data let's just print.

196
00:14:26,710 --> 00:14:29,530
Let us print out data.head() here.

197
00:14:29,530 --> 00:14:33,790
So we have data.head().

198
00:14:33,970 --> 00:14:35,200
There we go.

199
00:14:35,230 --> 00:14:41,110
Our X is going to contain just some parts of our shuffled data.

200
00:14:41,110 --> 00:14:43,420
First of all it's going to contain all the rows.

201
00:14:43,420 --> 00:14:46,630
Remember we had seen this already in the section on tensors and variables.

202
00:14:46,630 --> 00:14:48,550
So we get all the rows.

203
00:14:48,550 --> 00:14:57,160
And then once we get all these different rows, the next thing we'll do is to select the

204
00:14:57,160 --> 00:14:59,110
specific columns.

205
00:14:59,110 --> 00:15:02,260
So here we want this column.

206
00:15:02,290 --> 00:15:05,020
This is the zeroth column: 0, 1, 2.

207
00:15:05,020 --> 00:15:06,070
And this is three.

208
00:15:06,070 --> 00:15:12,010
Because we don't want on road now, we don't want on road old, and we don't want the ID. We want years

209
00:15:12,010 --> 00:15:13,930
up to torque.

210
00:15:13,930 --> 00:15:19,090
So we want index three right up to, but not including, the last column.

211
00:15:19,690 --> 00:15:24,100
And then for Y we want the shuffled data.

212
00:15:24,100 --> 00:15:25,540
Shuffle data.

213
00:15:26,130 --> 00:15:30,360
And here we want all the rows again, and just the last column.

214
00:15:30,360 --> 00:15:35,130
So let's run that print out X, and then we print out y.

215
00:15:35,130 --> 00:15:38,700
We have our X, which is of shape 1000 by eight.

216
00:15:38,700 --> 00:15:43,140
And we have Y, which is of shape 1000.

217
00:15:43,140 --> 00:15:45,420
So, a 1D tensor.

218
00:15:45,420 --> 00:15:46,920
Let's just print out the shape.

219
00:15:46,920 --> 00:15:51,420
So you could see that we have y shape and x shape.

220
00:15:51,420 --> 00:15:54,420
Now we want our inputs.

221
00:15:54,420 --> 00:15:56,430
That is, we want both X and Y to be 2D.

222
00:15:56,430 --> 00:16:09,450
So we're going to go from this shape of 1000 to 1000 by one.

223
00:16:09,450 --> 00:16:10,860
So that's what we want to do.

224
00:16:10,860 --> 00:16:14,310
And we're going to make use of the expand_dims method.

225
00:16:14,310 --> 00:16:20,160
So let's get back here and we do tf.expand_dims.

226
00:16:20,160 --> 00:16:24,930
And since we want to add this axis at the end, we're just going to set the axis to negative one.

227
00:16:24,930 --> 00:16:26,550
So we run that again.
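The slicing and reshaping steps above can be sketched as follows. This is a minimal example on a made-up 4 by 12 tensor standing in for the shuffled data (the real one is 1000 by 12); the variable names are assumptions, and the column layout is taken to be three unused columns, eight input features, then the price.

```python
import tensorflow as tf

# Made-up stand-in for the shuffled data: 4 samples, 12 columns.
shuffled_data = tf.reshape(tf.range(48, dtype=tf.float32), (4, 12))

# Inputs: all rows, columns 3 up to (not including) the last one, so 8 features.
X = shuffled_data[:, 3:-1]

# Output: all rows, last column only. This gives a 1D tensor of shape (4,).
y = shuffled_data[:, -1]

# Add a trailing axis so y becomes 2D with shape (4, 1).
y = tf.expand_dims(y, axis=-1)

print(X.shape, y.shape)  # (4, 8) (4, 1)
```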

228
00:16:26,910 --> 00:16:29,610
And then let's take this off.

229
00:16:30,060 --> 00:16:39,000
And then we print out the X shape and Y shape. In general, before passing our input data into our model,

230
00:16:39,000 --> 00:16:44,670
What we want to do is to normalize this data to speed up the training process.

231
00:16:44,670 --> 00:16:48,000
And so we are given this input data.

232
00:16:48,330 --> 00:16:49,920
Let's call that X.

233
00:16:49,920 --> 00:17:00,210
We shall subtract a mean value from each and every one of those values and divide by a standard deviation.

234
00:17:00,330 --> 00:17:08,760
And so what the model is going to get is actually an X prime, which

235
00:17:08,760 --> 00:17:11,430
is the normalized version of our x.

236
00:17:11,700 --> 00:17:16,320
Now the way we could get the mean and standard deviation is by simply going through all the different

237
00:17:16,320 --> 00:17:20,250
values, getting the mean and the standard deviation.

238
00:17:20,250 --> 00:17:26,730
Now you should take note that we also don't want to include data, which the model will not see during

239
00:17:26,730 --> 00:17:27,210
training.

240
00:17:27,210 --> 00:17:30,240
So we only want to get this mean.

241
00:17:30,240 --> 00:17:34,050
and the standard deviation from the data

242
00:17:34,050 --> 00:17:36,690
the model is going to be trained on. In practice,

243
00:17:36,690 --> 00:17:43,020
The way we could normalize our data is by making use of TensorFlow's Normalization layer.

244
00:17:43,020 --> 00:17:46,500
So you see here we have the simple definition.

245
00:17:46,500 --> 00:17:50,520
We have these arguments, which are well explained.

246
00:17:50,520 --> 00:17:53,070
And so you could always feel free to check this out.

247
00:17:53,070 --> 00:17:55,500
And then you will have some examples.

248
00:17:55,500 --> 00:17:57,120
Now let's get back here.

249
00:17:57,120 --> 00:18:02,520
And then we're going to import from TensorFlow,

250
00:18:02,550 --> 00:18:11,760
or rather, from tensorflow.keras.layers,

251
00:18:11,760 --> 00:18:18,150
we're going to import Normalization.

252
00:18:18,150 --> 00:18:18,810
There we go.

253
00:18:18,810 --> 00:18:20,760
So we import that.

254
00:18:20,760 --> 00:18:27,990
Let's go down and then show how we could make use of that Normalization layer to normalize some random

255
00:18:27,990 --> 00:18:28,410
data.

256
00:18:28,410 --> 00:18:36,990
So here we have our 1D tensor, let's say 3, 4, 5, 6, 7.

257
00:18:36,990 --> 00:18:37,710
There we go.

258
00:18:37,710 --> 00:18:39,600
So we have this 1D tensor.

259
00:18:39,600 --> 00:18:42,690
And then we shall define a normalizer.

260
00:18:42,690 --> 00:18:47,310
We have our normalizer, which is a Normalization layer.

261
00:18:47,310 --> 00:18:48,840
Normalization.

262
00:18:49,410 --> 00:18:50,460
There we go.

263
00:18:50,460 --> 00:18:53,490
And then what we do is we say normalizer.

264
00:18:53,490 --> 00:18:55,620
The normalizer takes in a.

265
00:18:55,620 --> 00:18:57,750
So let's print it out and see what we get.

266
00:18:58,230 --> 00:19:00,330
Let's print that out.

267
00:19:00,930 --> 00:19:03,660
You can see that when we don't specify any value here,

268
00:19:03,690 --> 00:19:06,600
we get back the exact same input.

269
00:19:06,600 --> 00:19:11,610
Now let's modify this and say we have five and four, or let's be more specific.

270
00:19:11,610 --> 00:19:15,660
The mean is five and the variance is four.

271
00:19:15,660 --> 00:19:23,100
Note that the standard deviation is the square root of the variance.

272
00:19:23,100 --> 00:19:32,490
And so if you are doing x minus the mean, divided by the standard deviation, it is really just x

273
00:19:32,520 --> 00:19:36,570
minus the mean divided by the square root of the variance.

274
00:19:36,570 --> 00:19:38,430
So let's put these brackets here.

275
00:19:38,430 --> 00:19:42,300
So you see that clearly: (x minus the mean) divided by the square root of the variance.

276
00:19:42,300 --> 00:19:46,200
And so, given that we specified the variance and the mean,

277
00:19:46,200 --> 00:19:48,750
Now let's run this again see what we get.

278
00:19:48,750 --> 00:19:53,850
You see we have -1, -0.5, 0, 0.5, and 1.

279
00:19:53,850 --> 00:19:58,710
Now, if we try to test this out, let's get back.

280
00:19:58,830 --> 00:20:02,730
Let's take three.

281
00:20:02,730 --> 00:20:06,720
So here we have three: three minus the mean.

282
00:20:06,720 --> 00:20:07,740
The mean is five.

283
00:20:07,740 --> 00:20:12,420
Then divided by the square root of the variance, which is two.

284
00:20:12,450 --> 00:20:14,850
So that gives us negative one.

285
00:20:14,850 --> 00:20:17,250
And that's how you have this negative one right here.

286
00:20:17,250 --> 00:20:22,980
So when you specify the mean and the variance, for each

287
00:20:22,980 --> 00:20:25,320
and every value, we're going to run this computation.

288
00:20:25,630 --> 00:20:27,910
And the values are now going to be normalized.

289
00:20:27,910 --> 00:20:28,990
So that's it.
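A minimal sketch of the example just described, assuming a TensorFlow version where `Normalization` is available under `tensorflow.keras.layers`. The tensor is written as a single row (shape 1 by 5), which is an assumption about how the notebook defines it; the mean of five and variance of four are the values used above.

```python
import tensorflow as tf
from tensorflow.keras.layers import Normalization

# One row holding the values 3, 4, 5, 6, 7.
a = tf.constant([[3.0, 4.0, 5.0, 6.0, 7.0]])

# Fixed statistics: mean 5 and variance 4, so the standard deviation is 2.
normalizer = Normalization(mean=5.0, variance=4.0)

# Each value becomes (x - mean) / sqrt(variance).
normalized = normalizer(a)
print(normalized)  # approximately [[-1, -0.5, 0, 0.5, 1]]
```

For instance, the first entry is (3 - 5) / 2 = -1, matching the hand computation above.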

290
00:20:29,020 --> 00:20:31,360
We could also test this out for a 2D

291
00:20:31,510 --> 00:20:32,620
tensor.

292
00:20:33,160 --> 00:20:34,540
Let's get back here.

293
00:20:34,570 --> 00:20:36,280
We have this 2D.

294
00:20:36,490 --> 00:20:39,160
Let's say we have six seven, eight.

295
00:20:39,580 --> 00:20:47,770
Or rather, let's say four, five, six, seven, and eight. Close that, run that again.

296
00:20:47,860 --> 00:20:56,350
See, the first row is as we had previously, but now for the other one we have -0.5, 0,

297
00:20:56,350 --> 00:20:59,020
0.5, 1, and 1.5.

298
00:20:59,050 --> 00:21:05,560
You could test this for eight, for example: eight minus five is three, divided

299
00:21:05,560 --> 00:21:10,240
by the square root of the variance, which is two, three divided by two is 1.5.

300
00:21:10,270 --> 00:21:13,000
That's how we have this 1.5 right here.

301
00:21:13,030 --> 00:21:20,080
Now, when working with data, we generally don't know the mean and the variance upfront.

302
00:21:20,080 --> 00:21:25,720
And so because we don't know the mean and the variance upfront, we want to be able to calculate this

303
00:21:25,720 --> 00:21:26,680
automatically.

304
00:21:26,680 --> 00:21:34,960
And so what we could do is allow our normalizer to adapt its mean and variance based

305
00:21:34,960 --> 00:21:35,800
on the data.

306
00:21:35,800 --> 00:21:39,070
So let's get back here and say normalizer.

307
00:21:39,070 --> 00:21:44,890
normalizer.adapt(a).

308
00:21:44,920 --> 00:21:48,220
So we want our normalizer now to adapt to a.

309
00:21:48,220 --> 00:21:51,880
And so now we don't need to have this five and four anymore.

310
00:21:51,880 --> 00:21:53,530
So let's run this.

311
00:21:53,530 --> 00:21:56,620
You see we have these values negative ones and then ones.

312
00:21:56,650 --> 00:21:59,050
Now let's explain why we have these values.

313
00:21:59,050 --> 00:22:06,310
First of all, getting back to the documentation, you would find that we have this axis set to negative

314
00:22:06,310 --> 00:22:07,510
one by default.

315
00:22:07,510 --> 00:22:14,950
So if we change the axis value, if we say the axis is zero, you would

316
00:22:14,950 --> 00:22:18,040
find that we will have completely different values.

317
00:22:18,040 --> 00:22:24,520
So not putting any value for the axis is the same as setting the axis to negative one.

318
00:22:24,520 --> 00:22:31,390
Now, when you specify for a 2D tensor like a that the axis is negative one, it means that you are

319
00:22:31,390 --> 00:22:36,730
going to compute a separate mean and variance for each column.

320
00:22:36,730 --> 00:22:44,050
Remember, a has two rows and five columns, so it has shape

321
00:22:44,050 --> 00:22:44,950
two by five.

322
00:22:44,950 --> 00:22:49,450
And so when you specify the axis to be negative one, you're specifying the columns.

323
00:22:49,450 --> 00:22:53,200
What you're saying is you want to compute the mean for each of those different columns.

324
00:22:53,200 --> 00:23:00,790
So for this first column, three and four, our mean is 3.5, which is logical because, let's get

325
00:23:00,790 --> 00:23:01,420
back here.

326
00:23:01,420 --> 00:23:06,760
If we do (three plus four) divided by two, you have 3.5.

327
00:23:06,760 --> 00:23:09,670
And then the standard deviation is going to be 0.5.

328
00:23:09,670 --> 00:23:17,770
So what we're saying is, if we have these two values, three and four, then our mean here

329
00:23:17,920 --> 00:23:21,790
would be 3.5.

330
00:23:22,180 --> 00:23:26,200
Now, to the left, the value we have is three.

331
00:23:26,200 --> 00:23:28,840
And to the right the value we have is four.

332
00:23:28,840 --> 00:23:33,820
Now what separates three and 3.5 is 0.5.

333
00:23:33,820 --> 00:23:38,710
And what separates 3.5 and four is also 0.5.

334
00:23:38,710 --> 00:23:42,760
And this happens to be our standard deviation.

335
00:23:42,760 --> 00:23:48,370
So in this case our standard deviation equals 0.5.

336
00:23:48,520 --> 00:23:53,320
While the mean here is 3.5.

337
00:23:53,320 --> 00:23:56,320
And so now if we do three let's take this off.

338
00:23:57,460 --> 00:24:09,550
If we do three minus 3.5, then all of that divided by 0.5, what we're going to have

339
00:24:09,550 --> 00:24:13,900
is -0.5 divided by 0.5, which is negative one.

340
00:24:13,900 --> 00:24:16,270
And then if you do let's just copy this.

341
00:24:16,540 --> 00:24:22,120
If we do (four minus 3.5) divided by 0.5, you would have one.

342
00:24:22,120 --> 00:24:24,850
So that's how we obtain these values negative one and one.
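To double-check the hand computation, here is a minimal NumPy sketch that mirrors what adapting with the default axis of negative one amounts to for this 2 by 5 tensor: one mean and one standard deviation per column. It uses NumPy rather than the Normalization layer itself, purely to make the arithmetic visible; note that the variance of each column is 0.25, and its square root, 0.5, is the standard deviation.

```python
import numpy as np

a = np.array([[3.0, 4.0, 5.0, 6.0, 7.0],
              [4.0, 5.0, 6.0, 7.0, 8.0]])

# One statistic per column, computed across the two rows.
mean = a.mean(axis=0)      # [3.5, 4.5, 5.5, 6.5, 7.5]
variance = a.var(axis=0)   # [0.25, 0.25, 0.25, 0.25, 0.25]
std = np.sqrt(variance)    # [0.5, 0.5, 0.5, 0.5, 0.5]

normalized = (a - mean) / std
print(normalized)  # first row is all -1, second row is all 1
```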

343
00:24:24,850 --> 00:24:31,300
Now for the rest it's a similar situation: because we have four and five, the mean there is 4.5, and then we

344
00:24:31,300 --> 00:24:32,200
have four.

345
00:24:32,200 --> 00:24:36,670
And then the standard deviation is 0.5, because the gap between 4 and 5 is one.

346
00:24:36,670 --> 00:24:42,250
And then when you take half of that, you have a standard deviation of 0.5.

347
00:24:42,250 --> 00:24:47,980
Now if you replace this with five, and 4.5 here, you see again we have negative one and one.

348
00:24:47,980 --> 00:24:50,350
So that's how we obtain all this.

349
00:24:50,350 --> 00:24:53,200
And now let's add another.

350
00:24:53,200 --> 00:24:57,070
Let's get back and then add another row.

351
00:24:57,070 --> 00:25:01,810
So let's say 9, 1, 2, 4, and then 0.

352
00:25:01,810 --> 00:25:03,070
Run that again.

353
00:25:03,790 --> 00:25:05,290
We're getting this error.

354
00:25:05,290 --> 00:25:06,580
Well let's take this off.

355
00:25:06,850 --> 00:25:07,930
Run that again.

356
00:25:08,560 --> 00:25:16,330
You see now that we have our inputs, which have now been normalized.

357
00:25:16,330 --> 00:25:19,120
And so now we could go ahead and normalize our x.

358
00:25:19,120 --> 00:25:21,760
So we're going to create x normalized.

359
00:25:22,000 --> 00:25:24,430
We could just have this.

360
00:25:24,790 --> 00:25:26,530
We take that off.

361
00:25:26,530 --> 00:25:27,880
We take a off.

362
00:25:27,880 --> 00:25:33,070
And then we have the normalizer, and we adapt it to our X.

363
00:25:33,070 --> 00:25:41,440
And then we'll have this be X normalized. X normalized is simply the normalizer,

364
00:25:41,770 --> 00:25:44,920
and we pass in the X.

365
00:25:44,920 --> 00:25:45,940
So that's it.

366
00:25:45,940 --> 00:25:53,920
So take that off, and then let's print out X normalized.

367
00:25:54,580 --> 00:25:55,480
There we go.

368
00:25:55,480 --> 00:25:59,380
We now have the values of x which have been normalized.
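Putting the last steps together, here is a minimal sketch of adapting the normalizer to an input tensor and applying it. The random 1000 by 8 tensor is a placeholder for the real car features, and the names are assumptions. In line with the earlier remark about only using data the model is trained on, you may prefer to call `adapt` on the training portion only once the data is split.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Normalization

# Placeholder inputs: 1000 samples with 8 features each (not the real dataset).
rng = np.random.default_rng(0)
X = tf.constant(rng.normal(loc=50.0, scale=10.0, size=(1000, 8)), dtype=tf.float32)

# Let the layer learn one mean and one variance per feature column.
normalizer = Normalization()
normalizer.adapt(X)

# Apply the learned statistics: each column ends up with mean near 0 and std near 1.
X_normalized = normalizer(X)
print(X_normalized.shape)  # (1000, 8)
```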

369
00:25:59,380 --> 00:26:03,340
We still have our 1000 by eight shaped tensor.

370
00:26:03,520 --> 00:26:06,220
So that's it for this section on data preparation.

371
00:26:06,220 --> 00:26:09,160
We are now going to move on to modeling.
