WEBVTT

1
00:00:00.000 --> 00:00:02.730
<v ->The fourth principle is where we start to move</v>

2
00:00:02.730 --> 00:00:06.360
outside the prompt itself and more into the wider system.

3
00:00:06.360 --> 00:00:09.630
We wanna identify how often the prompt gets errors

4
00:00:09.630 --> 00:00:11.580
and we wanna rate the responses

5
00:00:11.580 --> 00:00:13.260
and test what drives performance.

6
00:00:13.260 --> 00:00:17.010
So this isn't specifically a tactic to apply to the prompt,

7
00:00:17.010 --> 00:00:18.330
but really a meta tactic

8
00:00:18.330 --> 00:00:21.090
for getting better prompts in general.

9
00:00:21.090 --> 00:00:23.343
Now, if we look at the text models,

10
00:00:24.210 --> 00:00:26.610
we have no evaluation in this case.

11
00:00:26.610 --> 00:00:28.173
We're just prompting ChatGPT.

12
00:00:28.173 --> 00:00:30.630
Now, this is a simple prompt to get product names

13
00:00:30.630 --> 00:00:32.310
that can fit any foot size.

14
00:00:32.310 --> 00:00:36.030
Whereas when we're prompt engineering, we have the case...

15
00:00:36.030 --> 00:00:37.890
Whereas when we're prompt engineering,

16
00:00:37.890 --> 00:00:41.610
we really wanna focus on running the prompt multiple times

17
00:00:41.610 --> 00:00:43.140
and we don't just wanna run it once,

18
00:00:43.140 --> 00:00:45.780
like I've tested this template hundreds of times

19
00:00:45.780 --> 00:00:47.220
just to see how often it fails

20
00:00:47.220 --> 00:00:49.770
and then I've iterated on it since then.

21
00:00:49.770 --> 00:00:52.920
So if we look at the prompt template,

22
00:00:52.920 --> 00:00:55.080
you can see it's really well optimized

23
00:00:55.080 --> 00:00:58.470
and there's nothing superfluous in there.

24
00:00:58.470 --> 00:01:01.200
The specific place where you would evaluate quality

25
00:01:01.200 --> 00:01:05.520
in ChatGPT is really just to rerun the prompt, say 10 times.

26
00:01:05.520 --> 00:01:09.150
Now, I typically evaluate quality when I'm writing code

27
00:01:09.150 --> 00:01:12.450
because it's much easier to run it 10 or 100 times

28
00:01:12.450 --> 00:01:14.910
and then check the responses.

29
00:01:14.910 --> 00:01:17.130
But in this case, you can actually just do it

30
00:01:17.130 --> 00:01:19.860
in the ChatGPT interface as well.

31
00:01:19.860 --> 00:01:21.750
You can see I've run this 10 times,

32
00:01:21.750 --> 00:01:23.733
and I can see that it's providing...

33
00:01:25.680 --> 00:01:28.230
It's providing good quality product names.

34
00:01:28.230 --> 00:01:31.560
And in the run up to building this template,

35
00:01:31.560 --> 00:01:34.740
I tested lots of different combinations of examples,

36
00:01:34.740 --> 00:01:38.280
of descriptions, of direction, formatting,

37
00:01:38.280 --> 00:01:41.910
and settled upon this as it works and is most reliable.

38
00:01:41.910 --> 00:01:44.197
Now, you can actually test lots of different things.

39
00:01:44.197 --> 00:01:46.650
You know, I talked about reliability,

40
00:01:46.650 --> 00:01:48.360
but one of the main things is quality,

41
00:01:48.360 --> 00:01:50.490
like are the names actually any good?

42
00:01:50.490 --> 00:01:51.930
Now, in this case, I specified

43
00:01:51.930 --> 00:01:54.090
I wanted something in the style of Steve Jobs

44
00:01:54.090 --> 00:01:57.180
and you have iFlex, iFit, and you have iShoe,

45
00:01:57.180 --> 00:02:00.210
but OmniStep doesn't have the I in front of it.

46
00:02:00.210 --> 00:02:04.290
So I found that sometimes this was like one in 10 results,

47
00:02:04.290 --> 00:02:06.150
it doesn't include the I at the beginning.

48
00:02:06.150 --> 00:02:07.500
And if that's really important to you,

49
00:02:07.500 --> 00:02:09.900
then that's something you might wanna test for.
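
NOTE
A minimal sketch of that check, assuming the names from your runs have already been collected into a list. The helper name and the sample list are illustrative and not taken from the course code.
```python
def missing_prefix_rate(names, prefix="i"):
    """Return the fraction of names that do not start with the prefix."""
    if not names:
        return 0.0
    misses = [name for name in names if not name.startswith(prefix)]
    return len(misses) / len(names)
# Hypothetical sample built from names mentioned in the video
sample = ["iFlex", "iFit", "iShoe", "OmniStep"]
rate = missing_prefix_rate(sample)
```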

50
00:02:09.900 --> 00:02:10.980
You can also look at the length.

51
00:02:10.980 --> 00:02:13.050
So I have pretty short names here.

52
00:02:13.050 --> 00:02:15.450
One of the main things I wanted it to do was to make...

53
00:02:15.450 --> 00:02:17.580
Really long names are not gonna be as memorable.

54
00:02:17.580 --> 00:02:19.590
You have iShoe, which is great, iFlex,

55
00:02:19.590 --> 00:02:23.250
but then one example we have here is Adapt-a-Step iSneakers,

56
00:02:23.250 --> 00:02:24.720
which really went off the rails.

57
00:02:24.720 --> 00:02:26.670
And that's something you can test programmatically.

58
00:02:26.670 --> 00:02:29.010
You don't need a human to test your results.

59
00:02:29.010 --> 00:02:31.800
You could just actually run a script that says

60
00:02:31.800 --> 00:02:35.340
if the name is longer than a set number of characters,

61
00:02:35.340 --> 00:02:37.860
then, you know, we've done a bad job.

62
00:02:37.860 --> 00:02:41.490
And you can run the prompt 10 times, 100 times,

63
00:02:41.490 --> 00:02:44.400
and see how often it creates one of these long names.
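
NOTE
The length test described here could look something like this sketch. The 12-character limit is an assumption (the video does not give a number), and the function names and sample runs are hypothetical.
```python
MAX_CHARS = 12  # assumed threshold; pick whatever limit suits your product
def is_too_long(name, max_chars=MAX_CHARS):
    """Flag a product name that exceeds the character limit."""
    return len(name.strip()) > max_chars
def long_name_rate(runs, max_chars=MAX_CHARS):
    """Share of names across all runs that exceed the limit.
    `runs` is a list of lists, one list of names per prompt run."""
    names = [name for run in runs for name in run]
    if not names:
        return 0.0
    return sum(is_too_long(n, max_chars) for n in names) / len(names)
# Hypothetical results from two runs of the prompt
runs = [["iShoe", "iFlex"], ["iFit", "Adapt-a-Step iSneakers"]]
failure_rate = long_name_rate(runs)
```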

64
00:02:44.400 --> 00:02:45.690
The other thing is format.

65
00:02:45.690 --> 00:02:47.550
This is particularly important when you're coding

66
00:02:47.550 --> 00:02:49.680
because the output of a prompt

67
00:02:49.680 --> 00:02:52.010
is typically what you are gonna pass on

68
00:02:52.010 --> 00:02:54.000
in the next step in the chain.

69
00:02:54.000 --> 00:02:57.330
And in this case, it came up with a product description

70
00:02:57.330 --> 00:02:58.710
as part of the response

71
00:02:58.710 --> 00:03:00.000
and didn't come up with the names.

72
00:03:00.000 --> 00:03:02.857
It actually just came back with this whole description.

73
00:03:02.857 --> 00:03:04.260
"Innovative and revolutionary,

74
00:03:04.260 --> 00:03:06.420
these shoes are designed to adapt to any foot size."

75
00:03:06.420 --> 00:03:09.060
It's not what I asked it to do. It's not part of the prompt.

76
00:03:09.060 --> 00:03:11.850
And this is an example of what happened before.

77
00:03:11.850 --> 00:03:16.260
I provided three different examples for it to go off.

78
00:03:16.260 --> 00:03:17.550
Once you provide more examples,

79
00:03:17.550 --> 00:03:19.050
the prompt gets a little bit more reliable

80
00:03:19.050 --> 00:03:21.540
and this formatting issue doesn't happen as often.

81
00:03:21.540 --> 00:03:23.760
This is something, again, you can test for in code.

82
00:03:23.760 --> 00:03:26.850
If you're specifying JSON, you can parse the JSON

83
00:03:26.850 --> 00:03:29.460
and check if that parsing worked or not

84
00:03:29.460 --> 00:03:32.310
before you decide to retry or stop that step.
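
NOTE
One way to sketch the parse-then-retry idea with Python's standard json module. The `generate` argument is a hypothetical stand-in for whatever function calls your model, and the limit of three attempts is an assumption.
```python
import json
def parse_with_retry(generate, max_attempts=3):
    """Call generate() until its output parses as JSON, or give up.
    `generate` is a hypothetical zero-argument function that returns
    the raw text of one model response."""
    for _ in range(max_attempts):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry this step
    return None  # stop: no attempt produced valid JSON
# Stand-in for a model call: fails once, then returns valid JSON
responses = iter(["Innovative and revolutionary...", '{"names": ["iShoe"]}'])
result = parse_with_retry(lambda: next(responses))
```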

85
00:03:32.310 --> 00:03:33.660
Let's look at image models

86
00:03:33.660 --> 00:03:37.470
and how evaluation of quality applies to that.

87
00:03:37.470 --> 00:03:39.240
Again, we just have our normal prompts

88
00:03:39.240 --> 00:03:41.310
where we're prompting Stable Diffusion.

89
00:03:41.310 --> 00:03:43.380
We're trying to find a shoe that can fit any foot size

90
00:03:43.380 --> 00:03:45.240
and the results are all over the map.

91
00:03:45.240 --> 00:03:49.740
But the really nice thing about this type of interface

92
00:03:49.740 --> 00:03:51.240
is that with image models,

93
00:03:51.240 --> 00:03:54.120
quite often, evaluation is built into the platform.

94
00:03:54.120 --> 00:03:56.580
So on Midjourney or Stable Diffusion,

95
00:03:56.580 --> 00:03:59.580
you get multiple responses back

96
00:03:59.580 --> 00:04:02.130
and you can choose which one you like.

97
00:04:02.130 --> 00:04:04.710
Yeah, I would run this multiple times

98
00:04:04.710 --> 00:04:06.090
and then when I find one I like,

99
00:04:06.090 --> 00:04:09.000
you can actually grab the seed of that image

100
00:04:09.000 --> 00:04:10.980
and apply that in the advanced settings,

101
00:04:10.980 --> 00:04:12.540
so it's gonna be constraining the creativity

102
00:04:12.540 --> 00:04:14.880
to just that type of shoe.

103
00:04:14.880 --> 00:04:18.150
You can do a lot actually in terms of evaluation,

104
00:04:18.150 --> 00:04:20.730
but the evaluation is usually human and manual.

105
00:04:20.730 --> 00:04:22.500
It's very hard to build

106
00:04:22.500 --> 00:04:25.500
programmatic evaluations or metrics.

107
00:04:25.500 --> 00:04:27.930
You typically just have to look at the responses

108
00:04:27.930 --> 00:04:30.570
and then click on the ones you liked and then save them

109
00:04:30.570 --> 00:04:33.030
and then maybe use them as examples later.

110
00:04:33.030 --> 00:04:35.430
That's the prompt template, if you did wanna use that.

111
00:04:35.430 --> 00:04:38.190
But you can get similar results to me

112
00:04:38.190 --> 00:04:39.540
if you use that template.

113
00:04:39.540 --> 00:04:42.780
We can evaluate the quality that you get back.

114
00:04:42.780 --> 00:04:45.150
Here, I've selected the one that I like

115
00:04:45.150 --> 00:04:47.790
and then I can use that as the base image

116
00:04:47.790 --> 00:04:49.560
when I'm generating small variations

117
00:04:49.560 --> 00:04:52.980
because maybe I don't like some aspects of this.

118
00:04:52.980 --> 00:04:57.180
I don't like how the lip is pointing up, for example,

119
00:04:57.180 --> 00:04:58.560
but I do like the rest of the shoe.

120
00:04:58.560 --> 00:05:00.870
So if I select that seed image,

121
00:05:00.870 --> 00:05:03.570
I could run it again multiple times if I want.

122
00:05:03.570 --> 00:05:06.000
But either way, at the end of the day,

123
00:05:06.000 --> 00:05:09.720
I'm getting a good idea of what are the options

124
00:05:09.720 --> 00:05:11.640
and then I'm choosing which one is good.

125
00:05:11.640 --> 00:05:14.520
Equally, Stable Diffusion is getting some feedback

126
00:05:14.520 --> 00:05:18.240
in terms of here are the four that we've generated

127
00:05:18.240 --> 00:05:21.390
and now the user has decided to download that one

128
00:05:21.390 --> 00:05:23.190
or they've decided to upscale that one

129
00:05:23.190 --> 00:05:24.660
to a higher resolution,

130
00:05:24.660 --> 00:05:26.280
so that means that of these four,

131
00:05:26.280 --> 00:05:28.893
the model did a good job with the one in the bottom,

132
00:05:29.760 --> 00:05:33.420
so that helps them improve the quality of their models too.

133
00:05:33.420 --> 00:05:36.450
Now, you can test different parameters with image models

134
00:05:36.450 --> 00:05:39.600
and that's where I see a lot of experimentation.

135
00:05:39.600 --> 00:05:41.460
There's something called the CFG Scale,

136
00:05:41.460 --> 00:05:44.550
the Classifier-Free Guidance in Stable Diffusion.

137
00:05:44.550 --> 00:05:47.670
And in DreamStudio, they call it prompt strength.

138
00:05:47.670 --> 00:05:50.220
This is how much does the model follow your prompt

139
00:05:50.220 --> 00:05:52.590
versus being creative in itself.

140
00:05:52.590 --> 00:05:54.000
You can see the impact of this.

141
00:05:54.000 --> 00:05:56.520
What I like to do is create an image grid

142
00:05:56.520 --> 00:05:57.900
where I'm changing the parameter

143
00:05:57.900 --> 00:06:00.180
and I can see visually the difference between them.

144
00:06:00.180 --> 00:06:03.570
So you can see with a CFG scale of three, which is low,

145
00:06:03.570 --> 00:06:06.180
we're getting this kind of weird-looking image,

146
00:06:06.180 --> 00:06:08.490
it's not as good, it's not really following the prompt.

147
00:06:08.490 --> 00:06:10.830
And then when we dial up the CFG scale,

148
00:06:10.830 --> 00:06:12.840
we're getting much better results

149
00:06:12.840 --> 00:06:15.330
and testing where in these parameters

150
00:06:15.330 --> 00:06:17.190
the sweet spot is really,

151
00:06:17.190 --> 00:06:19.443
you know, a big key for image generation.