1
00:00:00,120 --> 00:00:06,870
Hi, and welcome back to the course. In this section, we'll take a look at performing image segmentation

2
00:00:06,990 --> 00:00:08,790
using some deep learning methods.

3
00:00:09,330 --> 00:00:13,140
But firstly, what exactly is image segmentation?

4
00:00:13,830 --> 00:00:15,510
Well, take a look at these images here.

5
00:00:15,810 --> 00:00:22,860
Firstly, we've built a lot of classifiers earlier in the course. A classifier basically takes the entire

6
00:00:22,860 --> 00:00:27,630
image, like this one here, and classifies it into one of several categories, or classes.

7
00:00:27,750 --> 00:00:32,880
So in this case, it will say this is a dog, because most likely, that's what that image would have

8
00:00:32,880 --> 00:00:37,110
been labeled. However, there's a further question, which we covered in the previous lessons:

9
00:00:37,950 --> 00:00:40,410
How do we figure out where the object is?

10
00:00:40,620 --> 00:00:44,940
That's what localization is, and that's where object detection comes into play.

11
00:00:45,390 --> 00:00:48,210
So we've done that extensively in the last sections.

12
00:00:48,780 --> 00:00:51,360
So you can see this is what object detection looks like.

13
00:00:51,870 --> 00:00:53,730
Now, what about this?

14
00:00:54,210 --> 00:00:56,190
Which pixels belong to which object?

15
00:00:56,670 --> 00:00:59,330
So in this case, we want to mark everywhere

16
00:00:59,340 --> 00:01:02,430
that's a dog in a certain color, or a certain class.

17
00:01:02,790 --> 00:01:09,030
And we use different colors to highlight the classes in these images, and likewise for the cat.

18
00:01:09,750 --> 00:01:17,220
So basically, you can see image segmentation is sort of a more advanced way of doing object detection.

19
00:01:17,580 --> 00:01:23,310
But instead of localizing a box, it's now doing pixel-level predictions.

20
00:01:24,330 --> 00:01:28,380
And that's exactly what the definition of segmentation sort of is.

21
00:01:28,800 --> 00:01:34,530
It's a process of dividing an image into different regions based on characteristics of the pixels,

22
00:01:34,680 --> 00:01:42,030
where the characteristics in this case are the classes, in order to identify objects or

23
00:01:42,030 --> 00:01:47,310
boundaries and simplify the image so it can be analyzed more efficiently.

24
00:01:47,730 --> 00:01:49,520
So that's effectively what it is here.

25
00:01:49,740 --> 00:01:55,980
So think of it as predicting the class for each pixel of an image. That's effectively what it is.
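As a rough sketch of that idea (not from the lesson itself), assume a hypothetical model has already produced a score for each class at each pixel; the predicted mask is then just the highest-scoring class per pixel. The array shape and class labels below are made up for illustration.
```python
import numpy as np
# Hypothetical per-pixel class scores with shape (num_classes, height, width).
# Class 0 = background, 1 = dog, 2 = cat (labels assumed for illustration).
scores = np.zeros((3, 2, 2))
scores[0, 0, 0] = 5.0  # top-left pixel scores highest as background
scores[1, 0, 1] = 5.0  # top-right pixel scores highest as dog
scores[1, 1, 0] = 5.0  # bottom-left pixel scores highest as dog
scores[2, 1, 1] = 5.0  # bottom-right pixel scores highest as cat
# Segmentation = pick the best class independently for every pixel.
mask = scores.argmax(axis=0)
print(mask)
```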

26
00:01:56,670 --> 00:02:03,420
So before we move into the deep learning methods, I just want to highlight that previously, before

27
00:02:03,420 --> 00:02:07,440
deep learning methods became popular and in style,

28
00:02:08,250 --> 00:02:13,380
segmentation was done by a few different classical methods.

29
00:02:13,770 --> 00:02:18,960
One of those would have been k-means. You can see an example of k-means in the video playing here.

30
00:02:20,340 --> 00:02:23,790
You can see it's trying to find the clusters, and the centroid of each cluster.

31
00:02:24,210 --> 00:02:27,120
That's effectively what it would do for different colors in an image.

32
00:02:27,240 --> 00:02:30,790
So it tries to localize and group different colors together.

33
00:02:30,810 --> 00:02:35,030
That's how k-means would have looked. There's also another one called mean shift.

34
00:02:35,070 --> 00:02:40,860
So mean shift here just basically moves toward the center of mass.

35
00:02:41,370 --> 00:02:42,480
So that's how you can do it.

36
00:02:42,780 --> 00:02:45,840
You can do some simple segmentation with these methods right now.
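To make the k-means idea concrete, here is a minimal NumPy sketch, assuming the image is an (H, W, 3) array and that k is no larger than the number of distinct colors. The function name kmeans_segment and the toy image are illustrative assumptions, not code from the lesson.
```python
import numpy as np
def kmeans_segment(image, k=2, iters=10, seed=0):
    """Cluster pixel colors with plain k-means; returns an (H, W) label map."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    rng = np.random.default_rng(seed)
    # Start from k distinct pixel colors chosen at random.
    unique_colors = np.unique(pixels, axis=0)
    centroids = unique_colors[rng.choice(len(unique_colors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest centroid (Euclidean distance in color space).
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean color of the pixels assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = pixels[labels == j].mean(axis=0)
    return labels.reshape(image.shape[:2])
# Toy 4x4 RGB image: dark left half, bright right half.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, 2:] = 250
labels = kmeans_segment(image, k=2)
```
With k=2, one cluster plays the role of the foreground and the other the background, which is roughly how a simple cutout would be obtained.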

37
00:02:46,530 --> 00:02:53,970
However, I wouldn't recommend you use this for anything more advanced than maybe background extraction

38
00:02:53,970 --> 00:02:55,440
or foreground extraction.

39
00:02:56,130 --> 00:03:00,480
So let's take a look at how k-means works on these images here.

40
00:03:00,510 --> 00:03:07,110
Suppose you want to extract a cutout from this image here and place the cutout on this background. You

41
00:03:07,110 --> 00:03:08,440
can use k-means to do that.

42
00:03:08,460 --> 00:03:14,160
You can see here typically how k-means is doing it, and you can see in this one, it actually

43
00:03:14,160 --> 00:03:19,700
gets where this is; the k value here would have been two.

44
00:03:20,250 --> 00:03:25,980
With that value, it gets the cutout, and then you can see the cutout and pieces of the leaf here.

45
00:03:26,520 --> 00:03:32,940
This was a very old way of doing basically foreground extraction and placing it onto a different background.

46
00:03:33,960 --> 00:03:38,460
But now we have a lot more advanced, context-aware methods that we can use, which

47
00:03:38,460 --> 00:03:40,920
I won't go into here because that's a different topic.

48
00:03:41,400 --> 00:03:43,830
This topic is purely segmentation.

49
00:03:44,580 --> 00:03:48,480
So as we can see, this works for this image, but it barely works.

50
00:03:48,480 --> 00:03:49,350
It doesn't work that well.

51
00:03:49,860 --> 00:03:55,800
And you can imagine that for multiple classes and cluttered scenes, to get something like this output, where

52
00:03:55,800 --> 00:04:01,290
you can actually see the cars being segmented, pedestrians being segmented, objects like traffic

53
00:04:01,290 --> 00:04:03,470
lights, trees, buildings, the road,

54
00:04:03,490 --> 00:04:09,450
and the sidewalk, all in different colors, you would definitely need something more advanced, and

55
00:04:09,450 --> 00:04:11,760
that something is deep learning methods.

56
00:04:12,450 --> 00:04:16,820
So just to do a quick refresher, again, this is what image recognition tells us.

57
00:04:16,820 --> 00:04:18,930
This is what object detection tells us.

58
00:04:19,920 --> 00:04:25,530
And now we can see what semantic segmentation is, and instance segmentation.

59
00:04:26,010 --> 00:04:27,830
So take a look at these two images here.

60
00:04:27,840 --> 00:04:32,460
And before I tell you what the major difference is, I'm pretty sure you can figure it out.

61
00:04:32,700 --> 00:04:38,910
So semantic segmentation effectively just puts all of these sheep in the same class.

62
00:04:39,330 --> 00:04:41,190
And these were sheep, just making sure.

63
00:04:42,060 --> 00:04:46,170
So you can see all of them are blue and they will all sort of merge into each other.

64
00:04:46,830 --> 00:04:51,240
And now, the dog is in a separate class as well, because the wolf, or whatever he is, is a dog.

65
00:04:51,240 --> 00:04:51,930
So he's a dog.

66
00:04:53,100 --> 00:04:55,700
So, now, in instance segmentation,

67
00:04:56,190 --> 00:04:59,940
you can see each instance of the class is in a different

68
00:05:00,020 --> 00:05:00,350
color.

69
00:05:00,860 --> 00:05:05,990
So now we can use something like an instance segmentation algorithm to actually count the number of sheep

70
00:05:06,440 --> 00:05:11,270
that are being observed here, whereas in this one they would all sort of merge together here and you

71
00:05:11,270 --> 00:05:17,540
don't get this separation of individual instances, which is what you want and which most of these deep

72
00:05:17,540 --> 00:05:20,820
learning segmentation models can actually easily adapt to.

73
00:05:20,840 --> 00:05:24,120
Because if you think about it, the model is predicting

74
00:05:24,140 --> 00:05:26,280
the sheep category,

75
00:05:26,320 --> 00:05:29,060
and you can imagine a bounding box for each sheep.

76
00:05:29,510 --> 00:05:33,440
So it does have some inherent knowledge that there are three separate sheep.

77
00:05:33,950 --> 00:05:39,110
However, we can now do this pixel-wise to actually get the separating boundary, which is where

78
00:05:39,110 --> 00:05:42,440
the more advanced and complicated methods come into play.

79
00:05:43,820 --> 00:05:45,270
So these are more examples here.

80
00:05:45,860 --> 00:05:50,490
Obviously, object detection, semantic segmentation, instance segmentation.

81
00:05:51,080 --> 00:05:53,720
Likewise, you can see it in this image here below.

82
00:05:54,740 --> 00:06:02,360
So now we're going to take a look at four very popular and common deep learning segmentation models.

83
00:06:02,510 --> 00:06:06,710
So those would be SegNet, U-Net, DeepLabv3.

84
00:06:06,860 --> 00:06:12,230
There's DeepLab v1 and v2, but v3 is the one that performs best, obviously, and is the

85
00:06:12,230 --> 00:06:12,770
most recent.

86
00:06:13,730 --> 00:06:17,420
And then there's Mask R-CNN, which is actually quite good as well.

87
00:06:18,020 --> 00:06:25,100
So I'm going to go over each of these rather quickly because to get into the detail of each of these

88
00:06:25,100 --> 00:06:30,010
would probably require a full half hour, and you would still only be scratching the surface of it.

89
00:06:30,020 --> 00:06:32,300
You'd just have a deeper understanding of what it is.

90
00:06:32,750 --> 00:06:35,350
So I'm just going to go through the highlights of each one.

91
00:06:35,360 --> 00:06:42,580
So let's take a look at SegNet. SegNet was developed in 2015, and it's a semantic segmentation model.

92
00:06:42,590 --> 00:06:47,180
As we know now, it uses something called an encoder-decoder method.

93
00:06:47,570 --> 00:06:50,680
So you can see this is how the CNN architecture progresses.

94
00:06:50,720 --> 00:06:55,820
As you can see, the feature maps get progressively smaller here, and then there's a constraining point

95
00:06:55,820 --> 00:07:01,310
right there, and then it gets upsampled again right here, up to the prediction.

96
00:07:01,430 --> 00:07:08,840
So it's taking this input from the training data, and this is the output ground truth with labels, so your

97
00:07:08,840 --> 00:07:15,920
network is trying to learn a mapping between this and this. That's effectively what all of these instance

98
00:07:16,370 --> 00:07:18,620
or semantic segmentation models are doing.

99
00:07:19,490 --> 00:07:26,460
So the network was basically based on the VGG16 network, which performs fairly well.

100
00:07:26,490 --> 00:07:33,470
However, it's a very parameter-heavy network, so it's not often used anymore.

101
00:07:33,770 --> 00:07:35,360
But this is how it works.

102
00:07:36,240 --> 00:07:40,760
And you can see basically, as is explained here, the role of the decoder network is to map the low

103
00:07:40,760 --> 00:07:45,320
resolution encoder feature maps to the full input resolution for pixel-wise

104
00:07:45,650 --> 00:07:46,880
classification.

105
00:07:47,030 --> 00:07:49,250
That's effectively what this is doing here.

106
00:07:50,030 --> 00:07:56,930
So now we can take a look at U-Net, and U-Net is actually quite similar, because it has the same down-and-up

107
00:07:57,110 --> 00:07:59,080
structure; that's basically what this architecture looks like.

108
00:07:59,090 --> 00:08:03,590
Looking at it here, you take the input image again and it gets downsampled.

109
00:08:03,590 --> 00:08:09,560
And there's this bottleneck, the constraining point, and then you upsample here again. So there's a contracting

110
00:08:09,560 --> 00:08:11,120
path and an expansive path.

111
00:08:11,840 --> 00:08:16,010
So U-Net was very popular and actually was very successful.

112
00:08:16,490 --> 00:08:18,590
It doesn't use the VGG network.

113
00:08:18,590 --> 00:08:24,650
It uses a different architecture, so that's probably why it gets better performance than SegNet.

114
00:08:26,810 --> 00:08:28,430
But that's about it for U-Net.

115
00:08:28,460 --> 00:08:29,000
Now,

116
00:08:29,120 --> 00:08:32,180
we can move on to DeepLabv3.

117
00:08:32,750 --> 00:08:39,890
So the first version of DeepLab was released in 2014, before the others, and DeepLabv3 was released

118
00:08:39,890 --> 00:08:42,800
in 2017 and had a number of improvements.

119
00:08:43,460 --> 00:08:49,940
It is a semantic segmentation model, and it definitely improves upon previous networks, as they said,

120
00:08:50,450 --> 00:08:57,860
and handles the problem of detecting objects at multiple scales, which was something

121
00:08:57,860 --> 00:09:00,710
that the other networks did have some issues with sometimes.

122
00:09:01,670 --> 00:09:07,740
And you can see here some examples of how it performs. You can see it looks quite good, and

123
00:09:08,490 --> 00:09:10,370
it looks quite smooth in the predictions.

124
00:09:10,370 --> 00:09:15,650
You can actually see it gets these horses and riders correctly, and it looks quite good.

125
00:09:16,460 --> 00:09:22,160
So in order to handle the problem of segmenting objects at multiple scales,

126
00:09:22,730 --> 00:09:30,060
they designed several modules that employ atrous, or dilated, convolution in cascade or

127
00:09:30,080 --> 00:09:36,240
in parallel to capture multi-scale context by adopting multiple dilation rates.

128
00:09:36,350 --> 00:09:41,360
So it's a bit of a mouthful, and it probably won't make much sense to you unless you read the paper

129
00:09:41,900 --> 00:09:43,340
and the sentence in detail.
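As a toy illustration of what a dilation rate means (a 1-D NumPy sketch, not the DeepLab implementation; the function name dilated_conv1d is an assumption), the same three weights can cover a wider stretch of the input simply by spacing the taps further apart.
```python
import numpy as np
def dilated_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D cross-correlation whose kernel taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # input samples covered by one output value
    out = []
    for start in range(len(signal) - span + 1):
        taps = signal[start:start + span:rate]  # take every `rate`-th sample
        out.append(float(np.dot(taps, kernel)))
    return np.array(out)
signal = np.arange(8, dtype=float)  # 0, 1, ..., 7
kernel = np.array([1.0, 1.0, 1.0])
dense = dilated_conv1d(signal, kernel, rate=1)    # each output sees 3 neighboring samples
dilated = dilated_conv1d(signal, kernel, rate=2)  # same 3 weights, spanning 5 samples
```
Running several rates side by side and combining the outputs is the general idea behind capturing context at multiple scales.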

130
00:09:43,370 --> 00:09:50,660
However, you can see again there's this constricting network as we move along here, and then going deeper

131
00:09:50,660 --> 00:09:55,790
with atrous convolution gives us something like this. You can see the rate here and how we get

132
00:09:55,790 --> 00:09:58,160
the expansion going right now.

133
00:09:58,640 --> 00:09:59,660
So that's basically

134
00:09:59,950 --> 00:10:02,740
how the architecture works for this network.

135
00:10:04,060 --> 00:10:06,040
Next, we'll take a look at Mask

136
00:10:06,070 --> 00:10:14,710
R-CNN, which came out in 2017, I believe. I do remember when it was released; it was

137
00:10:14,710 --> 00:10:20,820
quite groundbreaking at the time, because it performed so well, and some of the demo videos

138
00:10:20,830 --> 00:10:23,500
the researchers used were quite impressive.

139
00:10:23,740 --> 00:10:31,260
I do think this was one of them, though this may be from Detectron2. Now, what's good about Facebook's

140
00:10:31,270 --> 00:10:35,350
Detectron2 is that you can actually train Mask R-CNN

141
00:10:35,350 --> 00:10:39,870
quite easily on your own dataset, which we'll be doing in a future lesson.

142
00:10:40,530 --> 00:10:50,500
So Mask R-CNNs are built upon Faster R-CNNs, which generally do very, very

143
00:10:50,500 --> 00:10:50,860
well.

144
00:10:51,310 --> 00:10:57,580
And it wasn't hard to adapt the R-CNNs to perform mask prediction, or segmentation, effectively.

145
00:10:58,300 --> 00:11:05,500
And I would highly recommend you use this one, this one and this one for your segmentation tasks.

146
00:11:06,010 --> 00:11:10,870
Generally, U-Net is quite easy to train and get up and running. DeepLabv3

147
00:11:10,870 --> 00:11:16,200
does have better performance when there's finer detail, like if you were training a model to detect

148
00:11:16,240 --> 00:11:22,300
noses and eyes, you would probably want to use DeepLabv3. And Mask R-CNNs generally

149
00:11:22,300 --> 00:11:24,060
give very, very good performance.

150
00:11:24,070 --> 00:11:28,730
I haven't had a lot of experience training them on different datasets.

151
00:11:28,750 --> 00:11:30,810
However, I do know it works quite well.

152
00:11:30,850 --> 00:11:35,470
I was quite impressed with the results on the datasets that I used to train this model.

153
00:11:36,190 --> 00:11:43,150
So that's it for this lesson. What we'll do next is go into Colab and start experimenting with

154
00:11:43,150 --> 00:11:45,700
some of these deep segmentation models.

155
00:11:46,360 --> 00:11:53,380
And so you can actually see, hands-on, how U-Net, DeepLabv3, as well as Mask R-CNN work;

156
00:11:53,380 --> 00:11:57,440
for Mask R-CNN, we'll be taking a look at Facebook's Detectron2 implementation.

157
00:11:57,970 --> 00:12:02,170
You can see how all of this works, so I'll see you in the next lessons.

158
00:12:02,740 --> 00:12:03,130
Thank you.