1
00:00:12,210 --> 00:00:18,270
In this lecture, we are going to discuss how to work with large datasets in PyTorch, specifically large

2
00:00:18,360 --> 00:00:20,430
image data sets.

3
00:00:20,430 --> 00:00:25,740
When you're first learning about deep learning, it's convenient to have datasets like MNIST, CIFAR-10,

4
00:00:26,070 --> 00:00:30,490
and SVHN preloaded into a CSV or a NumPy array.

5
00:00:30,690 --> 00:00:36,660
That way, you can focus on the machine learning aspects and not the data processing aspects. But in the

6
00:00:36,660 --> 00:00:42,890
real world, images don't come in the form of CSVs. Instead, images are images.

7
00:00:42,930 --> 00:00:48,250
In other words, actual image files like JPEG, PNG, and so forth.

8
00:00:48,540 --> 00:00:51,240
These are files that sit on your computer.

9
00:00:51,240 --> 00:00:54,400
They are not organized in a nice NumPy array for you.

10
00:00:54,570 --> 00:01:00,780
On top of that, images are usually much larger than what you see in MNIST or CIFAR-10, where you have

11
00:01:00,780 --> 00:01:09,150
a 28 by 28 image or a 32 by 32 image. ImageNet models like VGG and ResNet are trained on images of

12
00:01:09,150 --> 00:01:14,210
size 224 by 224, which is an order of magnitude larger than MNIST.

13
00:01:19,370 --> 00:01:20,650
As an exercise:

14
00:01:20,660 --> 00:01:27,010
Think about how much space you would need to store one million images, each of size 224 by 224.

15
00:01:27,260 --> 00:01:30,110
Please pause the video until you've calculated an answer.

16
00:01:37,330 --> 00:01:37,600
All right.

17
00:01:37,630 --> 00:01:42,700
So hopefully you thought about how much space it would take to store one million images of size

18
00:01:42,700 --> 00:01:44,620
224 by 224.

19
00:01:44,620 --> 00:01:45,940
Here's how much space it would take.

20
00:01:46,960 --> 00:01:49,390
First we have one million images.

21
00:01:49,390 --> 00:01:52,920
Multiply that by 224 by 224 by 3:

22
00:01:52,930 --> 00:01:55,210
that's the number of bytes per image, at one byte per color channel.

23
00:01:55,270 --> 00:01:56,730
This is the total number of bytes

24
00:01:56,740 --> 00:01:58,200
our dataset would take up.

25
00:01:58,660 --> 00:02:02,440
This is about one hundred fifty billion bytes.

26
00:02:02,440 --> 00:02:06,820
This turns out to be about one hundred forty gigabytes.
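
For reference, the arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes uncompressed RGB images stored at one byte per channel, and it reads "gigabytes" as binary gigabytes (2**30 bytes).

```python
# Storage estimate for one million uncompressed 224 x 224 RGB images.
num_images = 1_000_000
height, width, channels = 224, 224, 3
bytes_per_value = 1  # assumption: uint8 pixels, one byte per color channel

total_bytes = num_images * height * width * channels * bytes_per_value
total_gib = total_bytes / 2**30  # binary gigabytes

print(f"{total_bytes:,} bytes")   # about 150 billion bytes
print(f"{total_gib:.1f} GiB")     # about 140 gigabytes
```

If the images were stored as 32-bit floats instead, the total would be four times larger.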

27
00:02:07,140 --> 00:02:17,270
It should be clear that you cannot fit a dataset of this size into memory on a standard machine.

28
00:02:17,300 --> 00:02:18,700
So what should we do?

29
00:02:19,160 --> 00:02:19,760
As always,

30
00:02:19,760 --> 00:02:25,860
I don't like to give people code without having them first think about how to approach the problem themselves.

31
00:02:25,940 --> 00:02:29,440
So let's first take a moment to think about how to approach this problem

32
00:02:29,450 --> 00:02:36,230
as an engineer, rather than as a simple API user copying code off the Internet. First, recognize the

33
00:02:36,230 --> 00:02:42,970
difference between disk and memory. Disk is generally slow, and memory is generally fast.

34
00:02:43,010 --> 00:02:46,350
The tradeoff is that disk generally has lots of space,

35
00:02:46,460 --> 00:02:53,380
while memory has much less. Your model reads data from memory, but the images live on disk.

36
00:02:53,450 --> 00:02:59,120
Second, recognize that our approach is batch gradient descent, where we only look at one batch of data

37
00:02:59,150 --> 00:03:02,340
at a time rather than the entire dataset.

38
00:03:02,480 --> 00:03:07,950
So technically only that batch of data needs to exist in memory for the time that it's needed.

39
00:03:13,070 --> 00:03:15,480
In fact that's all we need to know.

40
00:03:15,490 --> 00:03:19,090
Let's suppose we have two arrays to represent our dataset.

41
00:03:19,240 --> 00:03:24,820
A list of the file names where the images live, and a list of target labels for those images.

42
00:03:24,820 --> 00:03:29,950
Let's say our batch size is 32. In order to do batch gradient descent,

43
00:03:29,950 --> 00:03:35,820
all we need to do is loop through each of the arrays 32 items at a time. Inside the loop,

44
00:03:35,830 --> 00:03:40,060
we load in those 32 images and store them in a NumPy array.

45
00:03:40,300 --> 00:03:45,150
Of course, 32 images is a totally feasible amount to store in memory.

46
00:03:45,190 --> 00:03:48,440
One million is not, but 32 is just fine.

47
00:03:48,490 --> 00:03:54,640
Then we can call a function like model.train_on_batch, passing in X and Y, which will do one iteration

48
00:03:54,640 --> 00:03:59,470
of gradient descent on this batch of data. On the next iteration of the loop,

49
00:03:59,500 --> 00:04:04,570
we throw out the old data and reassign X and Y to the next 32 images.

50
00:04:04,690 --> 00:04:09,270
Thus at any point in time only 32 images exist in memory.

51
00:04:09,310 --> 00:04:12,290
Once they are used, they are forgotten until the next epoch.
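
The manual loop described above might look roughly like the sketch below. Note that `load_image` and `train_on_batch` are hypothetical stand-ins written only for this illustration (a real version would decode the file from disk and run one gradient descent step), and the file names and labels are made up.

```python
def load_image(filename):
    # Hypothetical stand-in: a real loader would read and decode the file from disk.
    return [0.0] * 4  # tiny fake "image"

def train_on_batch(x, y):
    # Hypothetical stand-in for one gradient descent step on a single batch.
    assert len(x) == len(y)
    return len(x)

# The two parallel lists that represent the dataset (made-up values).
filenames = [f"img_{i}.jpg" for i in range(100)]
labels = [i % 3 for i in range(100)]
batch_size = 32

batch_sizes_seen = []
for start in range(0, len(filenames), batch_size):
    # Only this batch of images is loaded into memory.
    batch_files = filenames[start:start + batch_size]
    x = [load_image(f) for f in batch_files]
    y = labels[start:start + batch_size]
    batch_sizes_seen.append(train_on_batch(x, y))
    # On the next iteration, x and y are reassigned and the old batch is discarded.

print(batch_sizes_seen)  # [32, 32, 32, 4]
```

The last batch is smaller because 100 is not a multiple of 32. One full pass through this loop corresponds to one epoch.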

52
00:04:17,790 --> 00:04:22,010
Luckily, in code, although you could do this manually, you don't have to.

53
00:04:22,200 --> 00:04:27,800
We're going to make use of a few special functions and classes that make it very easy for us.

54
00:04:27,810 --> 00:04:32,610
First I want to show you all the main ingredients and then I'll describe how to put them all together.

55
00:04:34,170 --> 00:04:39,210
What you should recognize is that this is just like the Dataset and DataLoader combinations we've

56
00:04:39,210 --> 00:04:46,720
seen throughout the course. Previously, the image dataset objects we used loaded images into

57
00:04:46,720 --> 00:04:48,210
single arrays.

58
00:04:48,310 --> 00:04:49,690
These images were very tiny.

59
00:04:49,690 --> 00:04:54,640
So that was OK: they could all live in memory at the same time in the same array.

60
00:04:55,030 --> 00:05:01,400
We had specialized dataset objects for different datasets such as MNIST, Fashion-MNIST, and CIFAR.

61
00:05:01,420 --> 00:05:07,120
So your question should be: is there a kind of dataset object that does not load all the data into memory

62
00:05:07,120 --> 00:05:09,290
all at once, that is generic,

63
00:05:09,310 --> 00:05:15,300
so it's not tied to any specific dataset, and that can be used for actual image files like JPEGs and

64
00:05:15,310 --> 00:05:17,020
PNGs instead of arrays?

65
00:05:19,900 --> 00:05:25,890
And of course, as you would expect, this functionality is included in the torchvision library.

66
00:05:26,020 --> 00:05:29,680
So how does it work? Well, quite appropriately,

67
00:05:29,690 --> 00:05:32,300
it's an object called ImageFolder.

68
00:05:32,300 --> 00:05:38,480
Basically, it means your dataset is a set of files that exist in some folder specified by some path,

69
00:05:39,080 --> 00:05:43,400
and the arguments are generally what you would expect from a data set object.

70
00:05:43,400 --> 00:05:49,460
The first argument is the path to the data folder, and the second argument is a list of image transformations.

71
00:05:50,770 --> 00:05:52,210
As you recall from earlier,

72
00:05:52,210 --> 00:05:57,340
These could be used for data augmentation and for simple scaling like making the pixel values go from

73
00:05:57,340 --> 00:06:03,030
0 to 1.

74
00:06:03,140 --> 00:06:08,750
The last thing I want to mention in this lecture is that using the ImageFolder dataset object necessitates

75
00:06:08,810 --> 00:06:11,670
a very specific folder structure.

76
00:06:11,720 --> 00:06:16,520
This folder structure is very reasonable so there's no reason for you to not be able to store your data

77
00:06:16,520 --> 00:06:17,810
this way.

78
00:06:17,810 --> 00:06:19,240
Here's how it goes.

79
00:06:19,550 --> 00:06:23,060
First assume your train and validation data live in different folders.

80
00:06:23,060 --> 00:06:25,820
Let's call them train and validation.

81
00:06:25,820 --> 00:06:30,980
Next within each of these folders we have an individual folder for each class.

82
00:06:30,980 --> 00:06:33,680
The name of the folder should be the class name.

83
00:06:33,680 --> 00:06:39,440
So for example, if you want to classify cars, trucks, and helicopters, then we should have three folders:

84
00:06:39,500 --> 00:06:43,140
car, truck, and helicopter. Next,

85
00:06:43,140 --> 00:06:47,090
within those nested folders is where we would store the actual images.

86
00:06:47,130 --> 00:06:52,280
So inside the car folder, we would have images of all our cars. Inside the truck folder,

87
00:06:52,290 --> 00:06:55,970
we would have images of all our trucks, and inside the helicopter folder,

88
00:06:56,040 --> 00:06:58,450
we would have images of helicopters.
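
To make the layout concrete, here is a small sketch that builds this folder structure in a temporary directory using only the Python standard library. The file name `example_0.jpg` is a made-up placeholder (an empty file, not a real image), used only to show where images would go.

```python
import os
import tempfile

# Build train/ and validation/, each with one subfolder per class.
root = tempfile.mkdtemp()
class_names = ["car", "truck", "helicopter"]

for split in ["train", "validation"]:
    for name in class_names:
        folder = os.path.join(root, split, name)
        os.makedirs(folder)
        # Empty placeholder file standing in for a real image.
        open(os.path.join(folder, "example_0.jpg"), "wb").close()

# The class names are simply the subfolder names.
found_classes = sorted(os.listdir(os.path.join(root, "train")))
print(found_classes)  # ['car', 'helicopter', 'truck']
```

The label of each image is implied by the folder it sits in, so no separate list of targets is needed.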

89
00:06:58,650 --> 00:07:02,430
Again, this is just how torchvision expects your data to be organized.

90
00:07:02,490 --> 00:07:06,750
So if you want to use those built-in functions, then you must conform to this format.