1
00:00:00,420 --> 00:00:06,780
Hi and welcome to the section where we talk about pre-processing our data before we input it into the

2
00:00:06,780 --> 00:00:08,970
Keras model for training.

3
00:00:09,630 --> 00:00:12,540
So why do we need to preprocess our data?

4
00:00:13,080 --> 00:00:19,500
Well, Keras expects data in a certain format, and expects data with an extra dimension added onto it.

5
00:00:19,950 --> 00:00:24,320
That's just because of how the internal structure of data is processed inside of it.

6
00:00:24,360 --> 00:00:25,300
It's not a big deal.

7
00:00:25,320 --> 00:00:26,520
It's nothing scary at all.

8
00:00:26,580 --> 00:00:29,850
I'll show you in the code how we add that extra dimension onto it.

9
00:00:30,540 --> 00:00:34,560
Then, the data type initially is unsigned int8.

10
00:00:34,830 --> 00:00:36,870
We have to change that to float32.

11
00:00:37,440 --> 00:00:42,180
Secondly, we have to actually normalize the data between zero and one. Previously,

12
00:00:42,180 --> 00:00:46,940
in PyTorch, we normalized between minus one and one, but either will work for both libraries.

13
00:00:46,960 --> 00:00:50,640
So you can also do this between minus one and one.

14
00:00:50,910 --> 00:00:53,280
And likewise for PyTorch, you can go from zero to one.

15
00:00:53,610 --> 00:00:54,420
They both work.

16
00:00:54,900 --> 00:00:59,550
Generally, it's better to have zero-mean, zero-centered data, like what we did in the PyTorch lesson.

17
00:01:00,090 --> 00:01:02,660
However, it's fine to do zero to one as well.

18
00:01:02,670 --> 00:01:08,010
I frequently do zero to one in most of the image models that I'm training. It's not going to make

19
00:01:08,010 --> 00:01:13,380
a big difference. It can in some cases, but in most cases, it's not going to make a big

20
00:01:13,380 --> 00:01:13,800
difference.
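
The two normalization ranges mentioned above can be sketched with NumPy. This is only an illustration: the `pixels` array is a made-up stand-in for image data, and the variable names are assumptions, not taken from the lesson's notebook.

```python
import numpy as np

# Hypothetical stand-in for a few uint8 pixel values (0..255).
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

# Scale to [0, 1]: convert to float32 first, then divide by 255.
zero_to_one = pixels.astype("float32") / 255.0

# Scale to [-1, 1]: subtract 0.5 to center, then divide by 0.5.
minus_one_to_one = (zero_to_one - 0.5) / 0.5

print(zero_to_one.min(), zero_to_one.max())
print(minus_one_to_one.min(), minus_one_to_one.max())
```

Either range is a simple linear rescaling of the same pixel values; the second one is just centered around zero.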

21
00:01:14,550 --> 00:01:17,250
And then we do one-hot encoding with the label data.

22
00:01:17,700 --> 00:01:19,530
So let's take a look at what we're doing here.

23
00:01:20,430 --> 00:01:26,670
So for the image rows and image columns, those are basically 28 by 28, which you can see here.

24
00:01:26,670 --> 00:01:33,180
I just print that out here, just so you guys know, and we use that in these functions here where

25
00:01:33,180 --> 00:01:36,030
we reshape. So, for the reshape that we want:

26
00:01:36,240 --> 00:01:42,630
This is the initial shape of our data. Consider it as like a block of data: 60,000 entries of 28

27
00:01:42,630 --> 00:01:43,710
by 28 images.

28
00:01:44,250 --> 00:01:47,190
We want to get this into this shape, with an extra dimension of one here.

29
00:01:47,880 --> 00:01:49,380
So to do that, it's quite simple.

30
00:01:49,380 --> 00:01:52,080
Actually, we just use the reshape function from NumPy.

31
00:01:52,710 --> 00:01:54,270
So it's x_train.reshape.

32
00:01:54,600 --> 00:01:56,540
We take the length of this array.

33
00:01:57,180 --> 00:01:59,970
That's from the shape, the sixty thousand.

34
00:02:00,450 --> 00:02:03,630
We have 28 by 28, and I add a one here.

35
00:02:04,110 --> 00:02:06,120
And likewise, we do it for x_test.

36
00:02:06,870 --> 00:02:09,030
So then we keep track of the image shape here.

37
00:02:09,420 --> 00:02:11,730
This is the image rows, image columns, and one.

38
00:02:12,420 --> 00:02:15,000
And that's basically the number of dimensions here.

39
00:02:15,000 --> 00:02:22,620
And then here's where we convert to float32, both for the training data, the x_train,

40
00:02:22,620 --> 00:02:23,640
and the test data.

41
00:02:23,640 --> 00:02:24,060
Sorry.

42
00:02:24,780 --> 00:02:26,140
And here's where we normalize.

43
00:02:26,160 --> 00:02:32,280
We just divide it by 255.0, which keeps it as a float, and then we just print out

44
00:02:32,280 --> 00:02:33,030
the x_train shape.

45
00:02:33,030 --> 00:02:37,950
And here it is, as a quick test to make sure that our final shape is what we desired.

46
00:02:38,160 --> 00:02:42,990
Sixty thousand, twenty-eight, twenty-eight, one: four dimensions, exactly what we wanted.
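
The reshape, type conversion, and scaling steps described above might look roughly like this. It is a minimal sketch under stated assumptions: random arrays stand in for the real dataset (600 and 100 samples instead of 60,000 and 10,000, to keep it light), and the names `img_rows`, `img_cols`, and `input_shape` are guesses at what the notebook uses.

```python
import numpy as np

# Hypothetical stand-in for the image arrays: random uint8 pixels.
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(600, 28, 28), dtype=np.uint8)
x_test = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)

# Image height and width, read from the data itself (28 and 28).
img_rows, img_cols = x_train.shape[1], x_train.shape[2]

# Add the extra trailing dimension: (N, 28, 28) -> (N, 28, 28, 1).
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)

# Shape of a single image, kept for later use when defining a model.
input_shape = (img_rows, img_cols, 1)

# Convert uint8 -> float32, then scale pixel values into [0, 1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

print(x_train.shape, x_train.dtype)
print(x_test.shape, x_test.dtype)
```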

47
00:02:43,650 --> 00:02:49,080
Now let's move on to one-hot encoding, sorry, one-hot encoding our labels.

48
00:02:49,200 --> 00:02:53,460
So to do that, basically, let me explain to you what one-hot encoding is.

49
00:02:53,940 --> 00:03:01,920
It's basically a format where, to represent the labels four, five, and two, or any of the 10 digits or

50
00:03:01,920 --> 00:03:02,920
classes that there could be.

51
00:03:03,930 --> 00:03:10,140
You basically create a row with columns like this, where you have zero, one, two, three, four, five, six, seven,

52
00:03:10,140 --> 00:03:18,210
eight, nine, and then basically just put a one here when it represents a four, and all the other

53
00:03:18,220 --> 00:03:19,200
numbers would be zero.

54
00:03:19,230 --> 00:03:24,180
Likewise for five, likewise for two, and likewise for six. It's quite simple to understand this format.
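
To make the format concrete, here is a small NumPy sketch of one-hot encoding the labels four, five, two, and six. It uses `np.eye` as a swapped-in illustration, not the Keras utility used in the lesson, and the variable names are assumptions.

```python
import numpy as np

# The example labels mentioned above.
labels = np.array([4, 5, 2, 6])
num_classes = 10  # digits 0 through 9

# Row i of the 10x10 identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by the labels encodes all of them.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot)
```

Each row has a single one in the column matching its label and zeros everywhere else.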

55
00:03:24,690 --> 00:03:29,190
And that's just a format that Keras uses when trying to identify the class labels.

56
00:03:29,820 --> 00:03:31,320
You don't have to do this in PyTorch, but here you do.

57
00:03:31,500 --> 00:03:39,540
Fortunately, to do that, we use the to_categorical function that's in the Keras utilities here,

58
00:03:40,440 --> 00:03:42,890
and you can see it's quite simple to implement.

59
00:03:42,900 --> 00:03:46,980
We just input y_train into this function and get back the output.

60
00:03:47,010 --> 00:03:52,890
Similarly for y_test. And we're just printing the number of classes here. We just store the number

61
00:03:52,890 --> 00:03:57,150
of classes here in this variable, because index

62
00:03:57,150 --> 00:04:04,440
one of this shape gives us the number of classes now. Because remember, now it's a 2D structure, so

63
00:04:04,440 --> 00:04:06,060
we have to use one instead of zero.

64
00:04:06,270 --> 00:04:12,150
We use one because zero would give us the number of labels, the number of rows, whereas one gives us the

65
00:04:12,150 --> 00:04:18,120
number of columns. And we get the number of pixels as well, which would just be this by this.
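
The indexing described above can be sketched as follows. The arrays are small made-up stand-ins (five samples), and the names `num_classes` and `num_pixels` are assumptions about the notebook's variables.

```python
import numpy as np

# Hypothetical one-hot labels for five samples (rows) over 10 classes (columns).
y_train = np.eye(10, dtype="float32")[np.array([5, 0, 4, 1, 9])]

# Hypothetical image block matching those five samples.
x_train = np.zeros((5, 28, 28, 1), dtype="float32")

num_samples = y_train.shape[0]  # index 0: number of rows, one per label
num_classes = y_train.shape[1]  # index 1: number of columns, one per class
num_pixels = x_train.shape[1] * x_train.shape[2]  # 28 * 28 pixels per image

print(num_samples, num_classes, num_pixels)
```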

66
00:04:19,140 --> 00:04:20,300
We actually don't need this.

67
00:04:20,310 --> 00:04:26,430
I think. I'll have to double-check and see if we're using this somewhere else in the code, but for now, this

68
00:04:26,430 --> 00:04:28,050
is what is important here.

69
00:04:28,830 --> 00:04:30,090
This to_categorical.

70
00:04:30,810 --> 00:04:34,830
So I'll stop there for now and we'll go back to building a model.

71
00:04:35,310 --> 00:04:37,630
But before I continue, let's take a look at this.

72
00:04:37,650 --> 00:04:39,900
Let's take a look at what the first element looks like.

73
00:04:40,320 --> 00:04:45,120
So you can see for sure that it has been converted into the one-hot encoding. Here you can see it's 10

74
00:04:45,120 --> 00:04:45,570
digits.

75
00:04:45,990 --> 00:04:49,530
This is zero one two three four five.

76
00:04:49,560 --> 00:04:57,150
So the first digit, the first class of image in our training dataset, is five.

77
00:04:57,780 --> 00:04:59,880
So we'll stop there and then we'll move on.

78
00:05:00,240 --> 00:05:03,210
Onto building a model in the next section.

79
00:05:03,360 --> 00:05:03,810
Thank you.
