1
00:00:11,690 --> 00:00:16,870
Before we move on, it helps to think about what exactly we are trying to build.

2
00:00:16,880 --> 00:00:22,880
Remember that eventually what we need to feed into the neural network is a batch of pairs of images

3
00:00:23,270 --> 00:00:29,710
along with the target telling us whether each of those pairs is a match or not. However,

4
00:00:29,720 --> 00:00:36,080
keep in mind that we do not actually want to store these batches of images, because that would mean we

5
00:00:36,080 --> 00:00:41,000
would repeat the same images in memory more times than we have to.

6
00:00:41,000 --> 00:00:48,000
Instead, it would be better if we could just generate these batches on the fly. To that end,

7
00:00:48,330 --> 00:00:52,890
what we actually want is just a list of pairs of indexes.

8
00:00:53,250 --> 00:01:00,810
We'll have one array to carry the matching pairs and one array to carry the non-matching pairs. In both

9
00:01:00,810 --> 00:01:01,330
arrays,

10
00:01:01,350 --> 00:01:09,330
all we will store is simply a series of tuples of indexes. So you can imagine, if I have this big array

11
00:01:09,750 --> 00:01:13,720
and the first two images, the images at index 0 and index 1,

12
00:01:13,830 --> 00:01:20,040
if they're matches, I don't want to store those images together. I'm just going to store 0 and 1, the tuple,

13
00:01:20,400 --> 00:01:27,830
and then grab those images later.
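
A minimal sketch of this idea, not the lecture's own code: the pair lists hold only indexes, and the images are looked up when a batch is assembled. The toy `images` list, the `make_batch` helper, and the 1 = match / 0 = non-match target convention are all assumptions made for illustration.

```python
# Hypothetical stand-in data: each "image" is just a short list of pixel values.
images = [[0, 0], [0, 1], [5, 5], [5, 6]]

# Only index pairs are stored; no image data is duplicated here.
positives = [(0, 1), (2, 3)]   # matching pairs
negatives = [(0, 2), (1, 3)]   # non-matching pairs


def make_batch(images, positives, negatives):
    """Look up the images for each index pair at batch time."""
    pairs = positives + negatives
    left = [images[i] for i, _ in pairs]
    right = [images[j] for _, j in pairs]
    # Assumed convention: 1 means match, 0 means non-match.
    targets = [1] * len(positives) + [0] * len(negatives)
    return left, right, targets


left, right, targets = make_batch(images, positives, negatives)
```

Because the batch is built from lookups, an image that takes part in many pairs is still stored only once in `images`.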

14
00:01:28,090 --> 00:01:34,960
Now, in order to create such a data structure, it helps to first have a dictionary that maps from subject

15
00:01:35,260 --> 00:01:38,440
to the indexes where that subject's images are stored.

16
00:01:39,250 --> 00:01:41,920
So that's what I'm creating here in the next block.

17
00:01:41,920 --> 00:01:47,000
Train label to idx and test label to idx. For both of these,

18
00:01:47,050 --> 00:01:53,260
the key will be the subject ID and the value will be a list containing all the indexes where that

19
00:01:53,260 --> 00:01:55,860
subject's images are stored.

20
00:01:55,930 --> 00:01:59,460
You should be able to see that this is what these two loops are doing.

21
00:01:59,530 --> 00:02:05,350
I'm just looping through each of the train labels. Then I'm using the label as the key and adding the

22
00:02:05,350 --> 00:02:11,330
index to a list, which is stored as the corresponding value. And then I do the same thing for the test

23
00:02:11,330 --> 00:02:11,960
labels.
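
The two loops described here could look roughly like the sketch below. The label values are made up, and the names `train_label2idx`, `test_label2idx`, and `build_label2idx` are guesses based on the narration rather than confirmed identifiers.

```python
from collections import defaultdict

# Hypothetical labels: one subject ID per image, in storage order.
train_labels = [1, 1, 2, 2, 1]
test_labels = [3, 3, 4]


def build_label2idx(labels):
    """Map each subject ID to the list of indexes where its images are stored."""
    label2idx = defaultdict(list)
    for i, label in enumerate(labels):
        label2idx[label].append(i)
    return dict(label2idx)


train_label2idx = build_label2idx(train_labels)
test_label2idx = build_label2idx(test_labels)
```

With these toy labels, subject 1 maps to indexes 0, 1, and 4, and subject 2 maps to indexes 2 and 3.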

24
00:02:17,350 --> 00:02:22,510
The next block is where I create my lists of tuples. For both train and test,

25
00:02:22,510 --> 00:02:28,990
I'm going to have two lists: one to store the matches, which I'm calling the positives, and one to store

26
00:02:28,990 --> 00:02:32,080
the non-matches, which I'm calling the negatives.

27
00:02:32,080 --> 00:02:35,200
It helps again to think of just a simple example.

28
00:02:35,200 --> 00:02:41,530
So imagine we have just four images: A and B, which belong to subject 1, and C and D, which belong to subject

29
00:02:41,530 --> 00:02:42,560
2.

30
00:02:42,700 --> 00:02:50,440
For this example, we will have just 2 positive samples, AB and CD, so those would be stored in our positives

31
00:02:50,440 --> 00:02:59,590
list. For the negatives list, we would have AC, AD, BC, and BD. In total, that's 6 pairs, which is equal to

32
00:02:59,590 --> 00:03:00,520
4 choose 2.
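
As a quick check of this four-image example, the sketch below enumerates every unordered pair and sorts it into positives or negatives. The `subject_of` dictionary is a hypothetical encoding of the example, not part of the lecture code.

```python
from itertools import combinations
from math import comb

# Images A and B belong to subject 1; C and D belong to subject 2.
subject_of = {"A": 1, "B": 1, "C": 2, "D": 2}

positives, negatives = [], []
for x, y in combinations(sorted(subject_of), 2):
    if subject_of[x] == subject_of[y]:
        positives.append(x + y)   # same subject: a match
    else:
        negatives.append(x + y)   # different subjects: a non-match

print(positives)  # ['AB', 'CD']
print(negatives)  # ['AC', 'AD', 'BC', 'BD']
print(len(positives) + len(negatives) == comb(4, 2))  # True
```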

33
00:03:03,650 --> 00:03:06,740
So here's how we're going to build these lists in code.

34
00:03:06,740 --> 00:03:11,550
We're going to start by looping over those label-to-index mappings we just created.

35
00:03:11,570 --> 00:03:15,000
Remember that the key is the label, or subject ID,

36
00:03:15,230 --> 00:03:20,860
whereas the value is a list of indices where that subject's images are stored.

37
00:03:20,900 --> 00:03:27,670
So the first thing I do in the loop is figure out all the indices that do not belong to the subject.

38
00:03:27,710 --> 00:03:32,270
I call that other indices, and that's easy.

39
00:03:32,270 --> 00:03:38,630
I just take the set difference: the full set of indices from 0 up to N train, and then subtract

40
00:03:38,660 --> 00:03:41,810
the set of indices that belong to this subject.

41
00:03:41,810 --> 00:03:48,390
And remember that this is not numerical subtraction but set difference. Then I'm going to loop through

42
00:03:48,660 --> 00:03:54,880
each index that belongs to this subject. For each index,

43
00:03:54,900 --> 00:03:59,220
I'm going to loop through every index after it in the same list.

44
00:03:59,400 --> 00:04:07,200
So I loop through indices, and then I loop through indices again, but only after the index i, and then

45
00:04:07,200 --> 00:04:10,440
I append all these index pairs to the train positives list.

46
00:04:14,020 --> 00:04:22,470
Then I loop through all the other indices and add all these pairs to the train negatives. As a side note,

47
00:04:22,900 --> 00:04:27,550
if you find this complicated, I always recommend simply not following my code.

48
00:04:27,700 --> 00:04:33,250
Instead, write it yourself, and you'll encounter all the obstacles and have to solve the same problems

49
00:04:33,250 --> 00:04:34,540
on your own.

50
00:04:34,570 --> 00:04:39,730
This will give you a much better understanding of what's going on, rather than trying to follow me.

51
00:04:42,580 --> 00:04:46,610
And so after this, we just do the same thing over the test set.

52
00:04:46,690 --> 00:04:52,090
So it's the exact same structure, except instead of train label to idx we have test label to idx,

53
00:04:53,110 --> 00:04:59,170
and we're populating test positives and test negatives rather than train positives and train negatives.
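
One way the loop described above might be written, as a sketch under assumptions rather than the lecture's actual code. The `build_pairs` helper and the toy mappings are hypothetical; wrapping the loop in a function is only a convenience so the same structure can be reused for train and test.

```python
def build_pairs(label2idx, n):
    """Build (positives, negatives) lists of index tuples from a label -> indexes map."""
    positives, negatives = [], []
    for label, indices in label2idx.items():
        # Set difference, not numerical subtraction: every index not owned by this subject.
        other_indices = set(range(n)) - set(indices)
        for i, idx1 in enumerate(indices):
            # Pair idx1 with every later index of the same subject.
            for idx2 in indices[i + 1:]:
                positives.append((idx1, idx2))
            # Pair idx1 with every index belonging to a different subject.
            for idx2 in other_indices:
                negatives.append((idx1, idx2))
    return positives, negatives


# Hypothetical toy mappings: two subjects with two images each.
train_label2idx = {1: [0, 1], 2: [2, 3]}
test_label2idx = {7: [0, 1], 8: [2, 3]}

train_positives, train_negatives = build_pairs(train_label2idx, n=4)
test_positives, test_negatives = build_pairs(test_label2idx, n=4)
```

Written this way, each non-matching pair is appended in both orders, for example (0, 2) and (2, 0), so the toy data gives 8 negative tuples rather than 4. If only unordered pairs are wanted, keeping a negative only when `idx1 < idx2` would be one option.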
