1
00:00:00,420 --> 00:00:06,390
Hi, guys, and welcome back to the course. In this section, we'll take a look at image captioning,

2
00:00:06,390 --> 00:00:08,520
which is a very cool project.

3
00:00:09,600 --> 00:00:14,010
I'll actually show you what it does quite shortly, so let's open this notebook and we can take a look

4
00:00:14,010 --> 00:00:15,030
and get started.

5
00:00:15,810 --> 00:00:17,340
So what is image captioning?

6
00:00:17,550 --> 00:00:24,570
Well, imagine we have a picture like this and we feed it to an algorithm that tells us what's happening

7
00:00:24,570 --> 00:00:25,980
in this picture.

8
00:00:26,010 --> 00:00:29,490
That's what captioning is: basically a description of what's happening in the picture.

9
00:00:29,970 --> 00:00:35,370
And you can see in this picture, I would say it's a brown dog jumping over a hurdle in something that

10
00:00:35,370 --> 00:00:37,290
looks like an obstacle course.

11
00:00:37,770 --> 00:00:41,010
And this is what our predicted tag actually says.

12
00:00:41,010 --> 00:00:44,400
The dog jumped over a hurdle, which is quite accurate.

13
00:00:45,090 --> 00:00:51,410
So we're going to create a model that implements image captioning using a CNN and a transformer.

14
00:00:51,420 --> 00:00:57,840
And the reason we need a transformer is because this project is a combination of NLP techniques as

15
00:00:57,840 --> 00:00:59,400
well as computer vision techniques.

16
00:00:59,760 --> 00:01:00,930
So let's get started.

17
00:01:00,930 --> 00:01:03,600
So firstly, run this block of code.

18
00:01:03,990 --> 00:01:11,160
This imports the TensorFlow functions we'll be using, and notice here that we're importing TextVectorization.

19
00:01:11,160 --> 00:01:12,660
That's how we encode the words.

20
00:01:13,050 --> 00:01:16,410
And speaking of words, how is this set?

21
00:01:16,500 --> 00:01:19,170
How do we use data for this type of task?

22
00:01:19,590 --> 00:01:27,510
Well, we use the Flickr8k dataset, which, as the name implies, has 8000 images, and each image

23
00:01:27,960 --> 00:01:32,850
has five different labels that human annotators attributed to the image.

24
00:01:33,240 --> 00:01:36,510
So this one image has these five different descriptions.

25
00:01:36,930 --> 00:01:40,750
A child in a pink dress is climbing up a set of stairs in an entryway.

26
00:01:41,370 --> 00:01:43,110
A girl going into a wooden building.

27
00:01:43,650 --> 00:01:46,140
Wooden playhouse, playhouse, wooden cabin.

28
00:01:46,140 --> 00:01:48,050
You can see that they all basically say

29
00:01:48,450 --> 00:01:49,800
A girl or little girl.

30
00:01:49,890 --> 00:01:51,270
Some describe her outfit.

31
00:01:51,270 --> 00:01:51,870
Some don't.

32
00:01:52,350 --> 00:01:59,250
But they all describe that she's going into a room or wooden cabin, a playhouse, or something along

33
00:01:59,250 --> 00:01:59,850
those lines.

34
00:02:00,960 --> 00:02:04,970
So this dataset is quite big, so download it here.

35
00:02:05,570 --> 00:02:10,790
It may take about a minute, roughly, and then we set some defaults here.

36
00:02:10,800 --> 00:02:16,860
So we have an image size, the vocabulary size that we'll be using, the fixed sequence length, which is a max

37
00:02:16,860 --> 00:02:19,480
of 25, actually a fixed length of 25.

38
00:02:19,980 --> 00:02:27,210
The embedding dimension for the image embeddings, because we're using an EfficientNet network in this case,

39
00:02:28,080 --> 00:02:31,590
and some other parameters as well for the training purposes.

40
00:02:32,370 --> 00:02:36,930
So now we need to create some functions here that load the captions from the dataset.

41
00:02:36,930 --> 00:02:39,000
So that's what this load_captions_data does.

42
00:02:39,450 --> 00:02:45,450
You can see it returns a caption mapping as well as text data, and the caption mapping is basically

43
00:02:45,450 --> 00:02:51,330
a dictionary mapping image names to the corresponding captions, and then text data contains all the

44
00:02:51,570 --> 00:02:52,630
available captions.

45
00:02:52,650 --> 00:02:57,120
So it's basically a function that does that, because we will be using

46
00:02:57,590 --> 00:03:04,050
these text labels as our input into creating an algorithm, or model, I should say.

47
00:03:04,800 --> 00:03:08,400
So then this is a function that creates the train/val split as well.

48
00:03:09,390 --> 00:03:11,080
And we've just created the datasets.

49
00:03:11,080 --> 00:03:16,200
Here you can see 6,114 training samples and 1,529 for validation.

50
00:03:17,610 --> 00:03:18,630
That's all fine.

51
00:03:18,780 --> 00:03:24,270
Now we need to vectorize the text, and vectorizing text basically means taking the words

52
00:03:24,270 --> 00:03:30,730
in the strings and creating an integer sequence where each integer represents what that word is in

53
00:03:30,730 --> 00:03:31,910
the vocabulary.

54
00:03:31,920 --> 00:03:36,720
So remember, we have a max of 10,000 vocabulary words.

55
00:03:37,230 --> 00:03:44,730
So each particular word, let's say the word frog, is encoded with a value of, say, 16.

56
00:03:45,330 --> 00:03:49,290
Every time frog is mentioned in a label, it corresponds to number 16.

57
00:03:49,350 --> 00:03:51,780
That's effectively what this vectorization is.

58
00:03:51,780 --> 00:03:56,070
I mean, vectorization can also involve removing stop words and a lot of other things.

59
00:03:56,560 --> 00:03:59,730
So you can look it up if you want to learn more about vectorization.

60
00:03:59,730 --> 00:04:03,030
This isn't an NLP course, but it's always good knowledge

61
00:04:03,030 --> 00:04:08,600
for a deep learning practitioner to have some basic NLP. NLP stands for natural language processing.

62
00:04:08,610 --> 00:04:11,340
Maybe I should have said that at the beginning, but that's what it is.

63
00:04:12,570 --> 00:04:16,830
And we also defined image augmentation steps that we would be taking as well.

64
00:04:17,400 --> 00:04:23,940
You can see these are the characters we remove as well, and everything is set to lowercase.

65
00:04:24,990 --> 00:04:32,310
Now let's start building the data pipeline for loading all of this data. So we have the decode and resize

66
00:04:32,310 --> 00:04:32,780
function.

67
00:04:32,790 --> 00:04:34,260
We have a read train image.

68
00:04:34,710 --> 00:04:36,240
A read validation image.

69
00:04:36,540 --> 00:04:41,760
Make dataset here, which uses the above functions, and then everything here.

70
00:04:41,900 --> 00:04:43,110
Well, sorry, this one does that.

71
00:04:43,110 --> 00:04:48,600
And then everything creates the list of images that correspond to the captions for the training dataset,

72
00:04:48,600 --> 00:04:50,190
as well as a validation dataset.

73
00:04:50,190 --> 00:04:54,000
So a lot of data preparation work for this model.

74
00:04:54,570 --> 00:04:57,450
Now let's talk about the model or the models involved.

75
00:04:57,630 --> 00:04:59,710
So firstly, we use a

76
00:04:59,840 --> 00:05:03,910
CNN to extract the image features.

77
00:05:04,000 --> 00:05:10,390
Next, we pass those image features to something called a transformer encoder that takes the extracted features

78
00:05:10,390 --> 00:05:14,710
and passes them to a transformer, which encodes a representation of those inputs.

79
00:05:15,220 --> 00:05:17,320
Next, we have the transformer decoder.

80
00:05:17,800 --> 00:05:22,600
That model basically learns to map the inputs to the caption itself.

81
00:05:22,720 --> 00:05:25,240
So that's effectively what we're building here.

82
00:05:26,080 --> 00:05:30,820
So this is just the CNN model, where you can see it's a simple EfficientNet model that we're using

83
00:05:30,820 --> 00:05:32,020
to get the features out of it.

84
00:05:32,020 --> 00:05:34,180
So we just build this model quickly here.

85
00:05:34,840 --> 00:05:40,630
Now, the transformer encoder is a bit more involved, and you can see it has the self-attention modules

86
00:05:40,630 --> 00:05:43,030
with the multi-head attention that we create here.

87
00:05:44,320 --> 00:05:48,700
And then there's also the positional embedding, which is typically needed for the sequence.

88
00:05:49,750 --> 00:05:51,850
And then there's the transformer decoder block.

89
00:05:52,480 --> 00:05:55,180
So transformers are a different type of network.

90
00:05:55,180 --> 00:05:59,530
We haven't touched on them much in this course, except a bit in the vision transformer section.

91
00:06:00,160 --> 00:06:02,930
But you can look it up if you want to learn more about it.

92
00:06:02,950 --> 00:06:08,650
It's actually a very important part of deep learning right now, especially in natural language processing.

93
00:06:08,950 --> 00:06:16,270
So that's it for those networks that we defined, and we have to get the, sorry, causal attention

94
00:06:16,270 --> 00:06:17,590
mask as well.

95
00:06:18,220 --> 00:06:24,460
And then we have the class which creates the image captioning model, which basically just creates everything

96
00:06:24,460 --> 00:06:25,000
that we said.

97
00:06:25,000 --> 00:06:27,210
There are also the losses that we need to involve.

98
00:06:27,580 --> 00:06:32,950
So that way, we can start training the network. In this class, we have the training step

99
00:06:32,950 --> 00:06:36,070
here, where we step through and we do backpropagation.

100
00:06:36,940 --> 00:06:42,730
You know all of this already. It's encoded here as the gradient tape.

101
00:06:42,730 --> 00:06:51,210
So you could actually do it like how you do it in PyTorch, effectively, and then the test step as well

102
00:06:51,220 --> 00:06:53,200
and then the metrics as well.

103
00:06:53,650 --> 00:06:58,900
So you can see we get the model here, we get the encoder, we get the decoder, we get the caption model,

104
00:06:58,900 --> 00:07:03,100
which generates text labels, and that's basically everything here.

105
00:07:03,550 --> 00:07:05,710
You can see the encoder, decoder, and CNN model.

106
00:07:06,490 --> 00:07:11,290
Next, we need to train the model, which is the big part which takes quite a while.

107
00:07:11,290 --> 00:07:15,550
Actually, it might take maybe about an hour or two, perhaps maybe longer.

108
00:07:15,550 --> 00:07:17,740
I'm not sure which GPU you're using.

109
00:07:18,340 --> 00:07:24,040
So we just set up the learning rate schedule, and then we start training the network, and you can see we train for two

110
00:07:24,310 --> 00:07:28,300
epochs, and each epoch takes about two minutes, and that's roughly an hour it's going to take.

111
00:07:29,320 --> 00:07:33,760
And now, once you're finished, you can actually check some sample predictions.

112
00:07:34,330 --> 00:07:41,500
So let's generate some random samples using this generate caption function that takes a random choice

113
00:07:41,500 --> 00:07:46,480
from the validation dataset and predicts a caption for the image using the caption model.

114
00:07:47,050 --> 00:07:47,440
Sorry.

115
00:07:47,530 --> 00:07:53,230
Over here you actually get the decoded caption, and then you can print out the decoded caption

116
00:07:53,230 --> 00:07:55,000
at the end so you can see.

117
00:07:55,210 --> 00:07:55,990
Take a look at this one.

118
00:07:56,920 --> 00:08:02,210
Some dogs in the dog hurdle race, and you can see: a group of dogs are on a sled team.

119
00:08:03,250 --> 00:08:09,490
There's a guy in a kayak or boat paddling a paddle boat, and it's a dog in some red tube.

120
00:08:09,490 --> 00:08:13,180
A black and white dog is running through a grassy area, so you can see this is looking pretty good.

121
00:08:13,660 --> 00:08:17,200
Let's take a look at a couple more from the model, actually.

122
00:08:18,220 --> 00:08:23,890
So the first one looks like a group of men talking: a man in a white shirt and a man in a blue shirt,

123
00:08:23,890 --> 00:08:28,690
although that's wrong, well, this is blue, but I wouldn't say it's that blue, is standing in front

124
00:08:28,690 --> 00:08:29,500
of a crowd of people.

125
00:08:29,710 --> 00:08:30,300
Not not.

126
00:08:30,340 --> 00:08:31,230
Yeah, that's right.

127
00:08:31,250 --> 00:08:32,160
It's not, you know,

128
00:08:32,240 --> 00:08:34,120
maybe what a human might say, but it's close.

129
00:08:35,170 --> 00:08:37,620
This one, a man and woman ride bikes on the sidewalk.

130
00:08:37,630 --> 00:08:39,700
That's pretty much self-explanatory.

131
00:08:39,700 --> 00:08:43,060
It's a very good caption that our model is predicting as well.

132
00:08:43,990 --> 00:08:49,540
And this one, a boy on a skateboard: a boy in a black suit and a boy in a yellow shirt is jumping on a

133
00:08:49,540 --> 00:08:52,000
skateboard, and I only see one boy.

134
00:08:52,030 --> 00:08:53,650
So this one's clearly a mistake.

135
00:08:53,920 --> 00:08:54,730
But that's OK.

136
00:08:55,090 --> 00:08:59,620
Because, as you can see, in the end, when training, we got 41 percent accuracy, which isn't too

137
00:08:59,680 --> 00:09:00,220
great.

138
00:09:00,790 --> 00:09:05,500
But this is a very good start, though, because you can see using deep learning to predict captions

139
00:09:06,160 --> 00:09:07,840
is doing very, very well.

140
00:09:07,900 --> 00:09:12,880
Actually, in my opinion, this is something that maybe 10 years ago no one would have thought was

141
00:09:12,880 --> 00:09:13,840
possible easily.

142
00:09:14,440 --> 00:09:15,610
So there we have it.

143
00:09:15,790 --> 00:09:18,520
So that's it for the image captioning lesson.

144
00:09:19,090 --> 00:09:22,030
I'll stop there for now, and I'll see you in the next lesson.

145
00:09:22,060 --> 00:09:22,900
Thank you for watching.
