1
00:00:12,960 --> 00:00:17,290
In this lecture, we are going to continue looking at Siamese networks.

2
00:00:17,310 --> 00:00:21,780
This lecture will focus on splitting the data into train and test sets.

3
00:00:21,930 --> 00:00:27,930
Previously, we loaded all of our images into one big NumPy array, and we also have another

4
00:00:27,930 --> 00:00:32,090
array to tell us which subject those images contain.

5
00:00:32,460 --> 00:00:39,000
To start, I would like to know how many images of each subject we have. To do this,

6
00:00:39,120 --> 00:00:43,580
I'm going to use the Counter object and pass in the labels.

7
00:00:43,730 --> 00:00:49,700
Now, when we print this out, we can see that most of the subjects have 11 images, except the first subject,

8
00:00:50,030 --> 00:00:52,340
which has 12 images.
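
A minimal sketch of the counting step just described. The `labels` values here are made-up stand-ins (three subjects instead of the real dataset), chosen only to mirror the "12 for the first subject, 11 for the rest" pattern.

```python
from collections import Counter

# Hypothetical stand-in labels: subject 0 has 12 images, subjects 1 and 2 have 11.
labels = [0] * 12 + [1] * 11 + [2] * 11

# Counter maps each subject ID to the number of images carrying that label.
label_count = Counter(labels)
print(label_count)
```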

9
00:00:52,340 --> 00:00:57,800
The reason we want to do this is so that we can select an appropriate number of test images.

10
00:01:02,230 --> 00:01:07,230
Next, we're going to grab the set of unique labels.

11
00:01:07,270 --> 00:01:12,150
Then we get the number of subjects, which is just the length of the set we just created.

12
00:01:14,820 --> 00:01:15,220
Now,

13
00:01:15,220 --> 00:01:17,280
here is one key part of this.

14
00:01:17,440 --> 00:01:23,350
We're going to make it so that for each subject three images will be used for the test set.

15
00:01:23,350 --> 00:01:29,530
Importantly, what you do not want to do is make your dataset first, where you have a bunch of pairs of

16
00:01:29,530 --> 00:01:32,560
images, and then do a train-test split.

17
00:01:32,710 --> 00:01:38,020
The reason you don't want to do that is that when you create your pairs of images, the images will

18
00:01:38,020 --> 00:01:40,230
be repeated multiple times.

19
00:01:40,330 --> 00:01:46,510
So if you try to do a train-test split after pairing, then you will have the same images appearing in

20
00:01:46,510 --> 00:01:48,380
both the train and test sets.

21
00:01:48,430 --> 00:01:52,550
So don't make that mistake.

22
00:01:52,560 --> 00:01:58,890
Instead, we are going to decide first which images will be used for train and test, then we'll create our pairs

23
00:01:58,950 --> 00:02:01,590
out of those train and test images.
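
To make the leakage concrete, here is a small illustrative sketch. The image IDs and the all-combinations pairing are assumptions for the demo, not the lecture's actual pairing code.

```python
from itertools import combinations

# Hypothetical image IDs standing in for real images.
image_ids = [0, 1, 2, 3]

# Wrong order: build every pair first, then split the list of pairs.
pairs = list(combinations(image_ids, 2))  # each image appears in several pairs
train_pairs, test_pairs = pairs[:4], pairs[4:]
train_seen = {i for pair in train_pairs for i in pair}
test_seen = {i for pair in test_pairs for i in pair}
leaked = train_seen & test_seen  # images that ended up in both splits
print("leaked images:", sorted(leaked))

# Right order: split the images first, then pair within each split.
train_ids, test_ids = image_ids[:2], image_ids[2:]
safe_train_pairs = list(combinations(train_ids, 2))
safe_test_pairs = list(combinations(test_ids, 2))
safe_train_seen = {i for pair in safe_train_pairs for i in pair}
safe_test_seen = {i for pair in safe_test_pairs for i in pair}
print("overlap after splitting first:", safe_train_seen & safe_test_seen)
```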

24
00:02:01,590 --> 00:02:08,190
Since we chose three as the number of images to take from each subject for the test set, that means

25
00:02:08,220 --> 00:02:19,450
N_test will be three times N_subjects. Then N_train will just be N minus N_test.
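
The arithmetic above can be sketched as follows. The variable names and the stand-in `labels` are assumptions for illustration; only the relationships (three test images per subject, the rest for training) come from the lecture.

```python
# Hypothetical stand-in labels: subject 0 has 12 images, subjects 1 and 2 have 11.
labels = [0] * 12 + [1] * 11 + [2] * 11

unique_labels = set(labels)        # the distinct subject IDs
n_subjects = len(unique_labels)    # number of subjects

n_test_per_subject = 3             # images held out per subject
N = len(labels)                    # total number of images
n_test = n_test_per_subject * n_subjects
n_train = N - n_test
print(n_subjects, n_test, n_train)
```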

26
00:02:19,490 --> 00:02:24,290
Next we're going to initialize the arrays that will hold the train and test images.

27
00:02:24,500 --> 00:02:31,610
So we create train images, train labels, test images, and test labels using the N_train and N_test

28
00:02:31,610 --> 00:02:38,350
we just defined. Importantly, realize that we haven't yet created our pairs of images.

29
00:02:38,360 --> 00:02:45,700
This is all just pre-work that we need to do.
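
A rough sketch of the initialization, assuming grayscale images of a made-up height and width. The shape `(H, W)`, the sizes, and the variable names are placeholders, not values from the lecture.

```python
import numpy as np

H, W = 60, 80            # hypothetical image height and width
n_train, n_test = 25, 9  # hypothetical split sizes from the previous step

# Zero-filled placeholders; the loop later overwrites every row.
train_images = np.zeros((n_train, H, W))
train_labels = np.zeros(n_train)
test_images = np.zeros((n_test, H, W))
test_labels = np.zeros(n_test)
```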

30
00:02:45,940 --> 00:02:49,660
Next, we're going to fill up the arrays we just initialized.

31
00:02:49,720 --> 00:02:52,450
Currently, they are all just zeros.

32
00:02:52,540 --> 00:02:58,150
And the reason we need to do it this way is because all the images are currently randomly ordered.

33
00:02:58,330 --> 00:03:03,460
So we're going to loop through each image in the dataset, and for each subject we'll take the first

34
00:03:03,460 --> 00:03:10,060
three images we encounter and assign them to the test set. Any image thereafter we will assign

35
00:03:10,060 --> 00:03:18,070
to the train set. We start with a dictionary called count_so_far, which counts how many images of

36
00:03:18,070 --> 00:03:21,380
each subject we've encountered so far in our loop.

37
00:03:21,520 --> 00:03:27,130
The key to this dictionary is the subject ID, and the value is the count.

38
00:03:27,130 --> 00:03:32,800
We also keep track of a train index and a test index so we know where to place each image we encounter.

39
00:03:35,720 --> 00:03:36,920
Inside the loop,

40
00:03:36,980 --> 00:03:40,480
the first thing we do is increment the count for the current label.

41
00:03:40,730 --> 00:03:45,850
So we index by the label and add one, and we use get here so that the default value is 0

42
00:03:45,860 --> 00:03:52,610
if there is nothing there already. Next, we check whether the count for the subject is bigger than three.

43
00:03:53,660 --> 00:03:54,320
If it is,

44
00:03:54,320 --> 00:03:57,400
that means we've stored all the test images already.

45
00:03:57,500 --> 00:04:04,250
So we should store this image as a train image. We store the image variable in the train

46
00:04:04,250 --> 00:04:12,860
images array at the current train index, then we store the label in train labels, then we increment

47
00:04:12,860 --> 00:04:13,820
the train index.

48
00:04:17,080 --> 00:04:22,510
Then, in the else block, we do the same thing, except for the test images and test labels.
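
Putting the loop together, here is a self-contained sketch. The random stand-in data, the tiny image size, and names such as `train_idx` and `test_idx` are assumptions for illustration; the `count_so_far` logic follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 3 subjects, randomly ordered, tiny 4x5 "images".
labels = np.array([0] * 12 + [1] * 11 + [2] * 11)
rng.shuffle(labels)
H, W = 4, 5
images = rng.random((len(labels), H, W))

n_subjects = len(set(labels.tolist()))
n_test = 3 * n_subjects
n_train = len(labels) - n_test

train_images = np.zeros((n_train, H, W))
train_labels = np.zeros(n_train)
test_images = np.zeros((n_test, H, W))
test_labels = np.zeros(n_test)

count_so_far = {}  # subject ID -> number of images seen so far
train_idx = 0      # next free row in the train arrays
test_idx = 0       # next free row in the test arrays

for img, label in zip(images, labels):
    # get() returns 0 the first time a subject is seen
    count_so_far[label] = count_so_far.get(label, 0) + 1

    if count_so_far[label] > 3:
        # this subject's three test images are already stored
        train_images[train_idx] = img
        train_labels[train_idx] = label
        train_idx += 1
    else:
        test_images[test_idx] = img
        test_labels[test_idx] = label
        test_idx += 1
```

After the loop, `train_idx` should equal `n_train` and `test_idx` should equal `n_test`, which is a quick sanity check that every row was filled.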
