1
00:00:00,630 --> 00:00:01,910
Hi and welcome back.

2
00:00:01,950 --> 00:00:09,000
So in this lesson, we take a look at a problem that has basically stumped computer vision practitioners

3
00:00:09,000 --> 00:00:11,090
and engineers for years.

4
00:00:11,100 --> 00:00:12,030
For decades, maybe.

5
00:00:12,510 --> 00:00:20,370
And that's how to get depth from an image, from a monocular RGB image, meaning that there's no depth

6
00:00:20,370 --> 00:00:27,810
information like a LiDAR or RGB-D camera or a distance or depth sensor can give us, like a

7
00:00:27,810 --> 00:00:32,640
time-of-flight sensor that can give us the depth of things in the image, for individual

8
00:00:32,640 --> 00:00:33,030
pixels even.

9
00:00:33,030 --> 00:00:40,140
However, most cameras, pretty much all camera phones, don't use that, and that includes

10
00:00:40,140 --> 00:00:43,720
regular image and video cameras as well as CCTV.

11
00:00:43,770 --> 00:00:51,690
So we just deal with plain RGB images, and we need to find a way to infer depth from that image, depth

12
00:00:51,690 --> 00:00:52,800
meaning the distance.

13
00:00:53,310 --> 00:00:57,900
So let's open notebook 63, and we'll take a look at this lesson here.

14
00:00:58,320 --> 00:01:02,550
So credit for this notebook goes to Victor Basu.

15
00:01:03,030 --> 00:01:09,690
This is another official Keras tutorial online, and you can see this is what we're trying to do. We're

16
00:01:09,690 --> 00:01:14,310
trying to get depth estimation, so you can see this is a regular RGB image here.

17
00:01:14,320 --> 00:01:15,810
Apologies for the low quality.

18
00:01:15,810 --> 00:01:18,660
It's just a small image that's blown up now.

19
00:01:19,080 --> 00:01:25,530
This is the ground truth depth map here, so you can see this wall is closer to you and things

20
00:01:26,220 --> 00:01:27,240
further away are red.

21
00:01:28,170 --> 00:01:30,120
And that's what the ground truth labels look like.

22
00:01:30,630 --> 00:01:32,880
And this is what a prediction looks like.

23
00:01:32,880 --> 00:01:34,350
And it's actually quite good.

24
00:01:34,650 --> 00:01:35,330
Very close.

25
00:01:35,340 --> 00:01:41,010
It's just a bit fuzzier around the edges sometimes. You can see it for different images.

26
00:01:41,010 --> 00:01:43,530
Here's another indoor image.

27
00:01:44,010 --> 00:01:47,900
This dataset is composed, I believe, of indoor and outdoor images.

28
00:01:48,450 --> 00:01:49,500
So let's begin.

29
00:01:49,650 --> 00:01:55,500
So we run the setup, which just loads the standard libraries that we're going to use, and we download the data.

30
00:01:55,890 --> 00:02:03,690
This takes just over a minute, and it's basically a subset of DIODE, which stands for Dense

31
00:02:03,690 --> 00:02:06,510
Indoor and Outdoor Depth dataset.

32
00:02:07,380 --> 00:02:10,320
The original dataset is eighty one gigs.

33
00:02:10,500 --> 00:02:15,540
So here we're just training on the validation set, so it's much smaller.

34
00:02:16,830 --> 00:02:22,920
Obviously, our results aren't going to be as good, but in the interest of time and for tutorial

35
00:02:22,920 --> 00:02:24,960
purposes, we will use that dataset.

36
00:02:25,590 --> 00:02:27,440
So we prepare the dataset here.

37
00:02:27,450 --> 00:02:29,430
We just point to the right paths and images.

38
00:02:29,430 --> 00:02:35,100
So we have to get the image, the depth and the depth mask, and we'll visualize those things shortly.

39
00:02:35,700 --> 00:02:41,310
Next, we just prepare the hyperparameters as well: how many epochs, the batch size, the learning rate,

40
00:02:41,490 --> 00:02:42,360
height and width.

41
00:02:43,590 --> 00:02:50,970
Next, we have our data generator class. It's like a training section,

42
00:02:50,970 --> 00:02:55,250
I guess you can consider it, with the pre-processing of the depth map.

43
00:02:55,260 --> 00:02:56,000
All of those things.

44
00:02:56,010 --> 00:02:57,060
Actually, this isn't the training.

45
00:02:57,060 --> 00:03:01,050
Sorry, but this is just some of the functions that we will be using.

46
00:03:01,440 --> 00:03:06,480
OK, so now we can visualize some of our samples here.

47
00:03:06,480 --> 00:03:09,210
So let's take a look at that.

48
00:03:09,810 --> 00:03:13,140
So you can see this is the original image here.

49
00:03:13,230 --> 00:03:15,220
These are the ground truth labels.

50
00:03:15,270 --> 00:03:21,870
You can see the corner is the furthest part of the image, and as you go closer here, these are closer to

51
00:03:21,870 --> 00:03:24,120
us. Similarly,

52
00:03:24,120 --> 00:03:27,900
here you can see the objects in the foreground are closest, and the ones in the back

53
00:03:27,900 --> 00:03:29,190
are in the darker red.

54
00:03:29,760 --> 00:03:34,770
This wall is further away than everything else, and it gets deeper as we move to the right of the

55
00:03:34,770 --> 00:03:37,710
image and so on and so on.

56
00:03:37,710 --> 00:03:38,730
You can see this one.

57
00:03:38,730 --> 00:03:44,850
This actually has a lot of variation in it, even though it's not that much depth difference, but it's

58
00:03:45,240 --> 00:03:50,730
needed to illustrate and teach our model how to infer depth from these images.

59
00:03:51,600 --> 00:03:56,010
Actually, what I thought was a mirror just now is actually a window in the room.

60
00:03:56,100 --> 00:04:03,000
So that's why everything is much darker red, and you can see it at the corner point as well.

61
00:04:03,840 --> 00:04:09,570
So now we can actually visualize the depth, because remember, it's a depth image, so you can actually

62
00:04:09,570 --> 00:04:11,130
see what it looks like here.

63
00:04:11,130 --> 00:04:12,150
So this is one of them.

64
00:04:12,660 --> 00:04:14,820
You can actually see the depths out of it.

65
00:04:15,150 --> 00:04:16,650
So it's pretty cool, isn't it?

66
00:04:17,760 --> 00:04:18,810
Pretty images?

67
00:04:19,680 --> 00:04:24,030
Next, we're going to build a model, and the basic model is a U-Net model.

68
00:04:24,720 --> 00:04:30,960
So this is where we define our downscale block here and the upscale block too. Remember how the

69
00:04:30,960 --> 00:04:33,960
U-Net has a bottleneck in the middle.

70
00:04:34,650 --> 00:04:35,880
And that's what we have here.

71
00:04:36,150 --> 00:04:41,850
So we've defined that. Next, we define the loss functions here.

72
00:04:42,690 --> 00:04:44,610
So we have three different losses.

73
00:04:44,610 --> 00:04:52,980
We have the structural similarity index (SSIM), L1 loss, or point-wise depth in our case, and depth smoothness

74
00:04:52,980 --> 00:04:53,410
loss.

75
00:04:53,430 --> 00:04:59,430
So we define our loss functions and calculate them there. And next,

76
00:04:59,670 --> 00:05:05,640
we're ready to train our model, so we just set up all our training and data generators, and

77
00:05:05,640 --> 00:05:11,360
then we run model.fit to train the model for two epochs, and it trains quite quickly on the

78
00:05:11,360 --> 00:05:19,130
GPU here, and you can see the validation loss starts at about 1.4, and then goes down to roughly one, and then goes

79
00:05:19,130 --> 00:05:22,040
down all the way to point one eight one one nine.

80
00:05:22,580 --> 00:05:23,720
That's not bad.

81
00:05:23,810 --> 00:05:26,750
Actually, it's better than the training loss, to be honest.

82
00:05:27,350 --> 00:05:34,490
So now we can visualize the output, so we take a look at loads of images here, and you can see this

83
00:05:34,490 --> 00:05:35,990
is the real image.

84
00:05:35,990 --> 00:05:36,850
The RGB image,

85
00:05:36,850 --> 00:05:43,520
the image that we're training on. This is the map that it's supposed to generate using the U-Net model.

86
00:05:44,090 --> 00:05:45,170
And this is the depth.

87
00:05:45,320 --> 00:05:46,820
So you can see it's not.

88
00:05:47,420 --> 00:05:48,170
It's not that bad.

89
00:05:48,530 --> 00:05:49,010
It's OK.

90
00:05:49,850 --> 00:05:54,170
This one is, I think this one is actually reversed.

91
00:05:54,320 --> 00:05:54,530
If.

92
00:05:55,100 --> 00:05:57,560
Yeah, this one actually thinks the foreground

93
00:05:58,100 --> 00:06:00,290
is this, and this is the background, unfortunately.

94
00:06:01,130 --> 00:06:06,900
This one is a little bit better, actually, although it does think something in the corner is far, for

95
00:06:07,080 --> 00:06:09,350
some reason. This one,

96
00:06:09,390 --> 00:06:13,250
the ground truth looks quite weird in the labeling, but I guess it's accurate.

97
00:06:14,240 --> 00:06:16,070
It probably doesn't need to be so granular.

98
00:06:16,160 --> 00:06:17,270
Some big changes here.

99
00:06:17,870 --> 00:06:21,320
This one is actually probably better from an analysis point of view.

100
00:06:21,360 --> 00:06:24,120
I like it because I put a ring here.

101
00:06:24,140 --> 00:06:30,980
This is in the far back, although maybe it's just the red from this looking darker red over

102
00:06:30,980 --> 00:06:31,250
it.

103
00:06:32,450 --> 00:06:37,730
This one, this one is okay, but it has different colors for the same object.

104
00:06:38,420 --> 00:06:42,310
It shouldn't really be. And so on.

105
00:06:42,320 --> 00:06:44,150
So you can examine all of these here.

106
00:06:44,810 --> 00:06:46,760
So there's a lot of possible improvements.

107
00:06:46,760 --> 00:06:51,560
However, remember we didn't train this for very long, and we didn't train this on the full dataset.

108
00:06:52,640 --> 00:06:56,630
So there are a lot more improvements that the researchers did do to get the best results.

109
00:06:57,440 --> 00:07:01,950
These are some of the references you can take a look at to learn more about this topic.

110
00:07:02,000 --> 00:07:03,350
It's a very interesting topic.

111
00:07:03,350 --> 00:07:09,230
It's a very hot topic as well, although there are some very good networks now that do depth estimation

112
00:07:09,710 --> 00:07:10,700
quite well.

113
00:07:11,090 --> 00:07:18,080
So once you have a reference point, you can basically get the depth of any object to a fairly, fairly

114
00:07:18,080 --> 00:07:21,080
rough, but good enough accuracy.

115
00:07:21,680 --> 00:07:22,760
So that's it.

116
00:07:23,420 --> 00:07:25,770
I'll stop there and I'll see you in the next lesson.

117
00:07:25,820 --> 00:07:26,720
Thank you for watching.