WEBVTT

00:00.836 --> 00:02.370
-: I am gonna teach you about prompt testing

00:02.370 --> 00:04.590
and the good news is you don't have to know

00:04.590 --> 00:05.670
how to code to do this.

00:05.670 --> 00:07.950
Here you can see the outcome of a test.

00:07.950 --> 00:10.650
This is variation A versus variation B.

00:10.650 --> 00:13.050
And specifically I was trying to test something

00:13.050 --> 00:16.350
that would increase the word length of blog posts.

00:16.350 --> 00:18.480
Even if you ask it for quite a long blog post,

00:18.480 --> 00:19.800
it doesn't listen all the time.

00:19.800 --> 00:22.080
There's all sorts of tricks to getting it to listen.

00:22.080 --> 00:23.670
And the trick that I tried here,

00:23.670 --> 00:26.647
if you look into the data sheet, I specifically just said,

00:26.647 --> 00:30.390
"Make it really long or I lose my job."

00:30.390 --> 00:31.560
The reason why it works,

00:31.560 --> 00:34.410
if you look for the emotion prompting paper,

00:34.410 --> 00:37.020
is that large language models can actually understand

00:37.020 --> 00:39.570
and can be enhanced by emotional stimuli.

00:39.570 --> 00:41.040
And that's pretty interesting

00:41.040 --> 00:45.030
because AI is like a simulation of a human brain,

00:45.030 --> 00:46.320
if you think about it that way.

00:46.320 --> 00:48.570
So it does make sense that they would react

00:48.570 --> 00:51.210
in similar ways to the way that we work, right?

00:51.210 --> 00:55.470
So that's the technical explanation to why this works.

00:55.470 --> 00:56.820
All right, so what you can see here,

00:56.820 --> 00:58.350
and I've just outlined these,

00:58.350 --> 01:00.330
is that these responses are actually longer

01:00.330 --> 01:01.350
but not consistently.

01:01.350 --> 01:02.670
In some cases they're shorter.

01:02.670 --> 01:04.230
In this case, we actually got

01:04.230 --> 01:06.180
like a pretty short response back.

01:06.180 --> 01:10.830
In a few other cases we've gotten below a thousand words.

01:10.830 --> 01:13.650
The way I'm calculating word length here

01:13.650 --> 01:16.230
is I'm just counting the result

01:16.230 --> 01:18.210
of splitting the text by space.
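
NOTE
The word count described here can be sketched in Python. This is a stand-in under assumptions: the actual spreadsheet formula is not shown in the transcript, and the name word_count is hypothetical.
```python
def word_count(text: str) -> int:
    # Split on single spaces, as described in the video, and drop empty
    # pieces so that repeated spaces are not counted as extra words.
    return len([piece for piece in text.split(" ") if piece])
print(word_count("Make it really long or I lose my job."))  # 9
```
Splitting only on spaces treats line breaks as part of a word, so the result is a rough count, which is enough for comparing variations.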

01:18.210 --> 01:20.220
You can see that this doesn't have to be really complicated.

01:20.220 --> 01:22.080
Evals is a whole topic, right?

01:22.080 --> 01:24.720
There's a lot that you can learn about evals

01:24.720 --> 01:27.090
and most of them are geared towards programmers

01:27.090 --> 01:29.400
who can set up like an eval framework.

01:29.400 --> 01:30.657
But in this case, you know,

01:30.657 --> 01:32.700
you just need something simple in Google Sheets,

01:32.700 --> 01:34.320
so you just need to count the number of words.

01:34.320 --> 01:36.270
We have an existing experiment

01:36.270 --> 01:39.990
and we've tested adding it in all caps,

01:39.990 --> 01:43.440
but we want to add another variation here.

01:43.440 --> 01:47.850
And that should be as simple as just creating variation C.

01:47.850 --> 01:50.013
And then we're gonna run it 10 times.

01:52.440 --> 01:53.613
Just count 10 of these.

01:54.750 --> 01:56.940
There we go. So we have 10.

01:56.940 --> 02:01.290
And the hypothesis I have

02:01.290 --> 02:03.150
is that if we take this,

02:03.150 --> 02:07.230
the hypothesis I have is that if I specify a word length,

02:07.230 --> 02:09.630
then it might listen to that, right?

02:09.630 --> 02:11.490
We'll see, and this is an experiment, right?

02:11.490 --> 02:12.720
Because we don't actually know

02:12.720 --> 02:14.910
if it's capable of following word length at this point.

02:14.910 --> 02:18.303
I'm just gonna call this Specify Word Length.

02:19.650 --> 02:22.920
And I'm gonna take this existing prompt, the control,

02:22.920 --> 02:26.100
and it's important that you start from the control each time

02:26.100 --> 02:28.470
because you might be getting different results

02:28.470 --> 02:31.110
based on the different changes that you made.

02:31.110 --> 02:33.450
I'm just gonna paste in the template here,

02:33.450 --> 02:35.910
get rid of these quote marks.

02:35.910 --> 02:38.700
And then I'm gonna say an outline for a blog post.

02:38.700 --> 02:42.150
Each section should be a minimum of two paragraphs long,

02:42.150 --> 02:43.900
and the overall article

02:45.330 --> 02:49.563
should be more than 2000 words.

02:51.292 --> 02:53.190
I'm gonna pass that in there.

02:53.190 --> 02:55.076
And then, so this is all, you know,

02:55.076 --> 02:57.630
this is gonna be the same every time.

02:57.630 --> 03:02.630
And the actual post here,

03:03.450 --> 03:04.650
I'm just gonna paste this in.

03:04.650 --> 03:06.840
So this is when the template's filled in.

03:06.840 --> 03:09.870
And if you see here, we have these in curly brackets,

03:09.870 --> 03:12.420
we have writing style, we have topic,

03:12.420 --> 03:14.820
these are variables that you need to fill in.

03:14.820 --> 03:19.440
And if we see an example here, this is where it's filled in.

03:19.440 --> 03:22.920
So you can see that we filled in the topic,

03:22.920 --> 03:24.840
productivity with time blocking,

03:24.840 --> 03:27.720
and we filled in the writing style that we want

03:27.720 --> 03:30.240
and we filled in the example post as well.
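
NOTE
Filling the curly-bracket variables can be sketched with Python string formatting. The template wording below is hypothetical, since the real template is only visible on screen; the variable names topic, writing_style and example_post follow the ones mentioned in the video.
```python
# Hypothetical template text; only the three variables come from the video.
TEMPLATE = (
    "Write a blog post about {topic} in this writing style: {writing_style}. "
    "Here is an example post: {example_post} "
    "Each section should be a minimum of two paragraphs long, "
    "and the overall article should be more than 2000 words."
)
def format_prompt(template: str, **variables: str) -> str:
    # Replace every {name} placeholder with the matching keyword argument.
    return template.format(**variables)
formatted = format_prompt(
    TEMPLATE,
    topic="productivity with time blocking",
    writing_style="(your writing style here)",
    example_post="(an earlier post about the Pomodoro technique)",
)
print(formatted)
```
The template stays the same for every run, and the formatted prompt is the text that actually gets sent to the model.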

03:30.240 --> 03:32.940
So we've given it one example already,

03:32.940 --> 03:37.050
which is a post here about the Pomodoro technique, right?

03:37.050 --> 03:40.140
And then we're just gonna wait for it to come back

03:40.140 --> 03:41.130
with the actual response.

03:41.130 --> 03:43.590
We need to adapt this filled-in version.

03:43.590 --> 03:45.960
We call it the formatted prompt as well,

03:45.960 --> 03:47.310
because that's what we're actually gonna

03:47.310 --> 03:49.593
copy and paste into ChatGPT.

03:50.550 --> 03:52.380
So now we have it in here.

03:52.380 --> 03:53.910
We need to just make that change

03:53.910 --> 03:55.350
that we made to the template.

03:55.350 --> 03:57.570
We said minimum of two paragraphs long

03:57.570 --> 03:59.943
and it should be more than 2000 words.

04:06.810 --> 04:10.170
Okay, so that's the only change that we're testing here,

04:10.170 --> 04:12.240
and we're just isolating that one variable

04:12.240 --> 04:15.180
and we can just copy and paste this down further, right?

04:15.180 --> 04:17.610
So if you're running a more advanced experiment,

04:17.610 --> 04:19.380
you might want to try multiple things.

04:19.380 --> 04:21.360
So maybe you try multiple topics,

04:21.360 --> 04:23.160
maybe you try different examples,

04:23.160 --> 04:25.020
whatever it is, you can test that.

04:25.020 --> 04:26.730
And the reason why, by the way,

04:26.730 --> 04:28.350
I'm running the same thing 10 times,

04:28.350 --> 04:31.350
is you can see just how often the results vary.

04:31.350 --> 04:33.390
And this is the thing with LLMs,

04:33.390 --> 04:35.340
they don't always give you the same response back.

04:35.340 --> 04:37.350
So it's really important that you actually

04:37.350 --> 04:39.420
take a look at what's coming back

04:39.420 --> 04:42.240
and you actually run experiments like this.
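
NOTE
Repeating the same prompt and recording every word count might look like this in Python. It is a sketch under assumptions: run_trials and fake_generate are hypothetical names, and fake_generate is a random stand-in for a real model call, which the video does by hand.
```python
import random
from statistics import mean
from typing import Callable, List
def run_trials(generate: Callable[[str], str], prompt: str, runs: int = 10) -> List[int]:
    # Send the identical formatted prompt `runs` times and record each word count.
    return [len(generate(prompt).split()) for _ in range(runs)]
# Hypothetical stand-in for a model call: returns filler text of random length.
_rng = random.Random(0)
def fake_generate(prompt: str) -> str:
    return " ".join(["word"] * _rng.randint(700, 1300))
counts = run_trials(fake_generate, "formatted prompt goes here")
print(f"runs={len(counts)} min={min(counts)} max={max(counts)} mean={mean(counts):.0f}")
```
Looking at the minimum and maximum alongside the mean shows how much the results vary from run to run.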

04:42.240 --> 04:44.550
Okay, cool, now we have, just to summarize,

04:44.550 --> 04:45.990
we have the specified word length.

04:45.990 --> 04:49.050
We have a hypothesis of what we think might work.

04:49.050 --> 04:52.110
We have the actual prompt template,

04:52.110 --> 04:54.750
which we're recording here for posterity,

04:54.750 --> 04:57.780
and we could use that template to fill in different topics

04:57.780 --> 05:00.900
or different writing styles, whatever it is.

05:00.900 --> 05:05.010
Then we have the formatted prompt for this run, right?

05:05.010 --> 05:08.700
And this is, each one of these rows is a run of ChatGPT.

05:08.700 --> 05:12.180
And this is the exact text that we've pasted in.

05:12.180 --> 05:14.190
We will paste into ChatGPT,

05:14.190 --> 05:16.920
so it's the prompt with the variables filled in.

05:16.920 --> 05:18.600
That's the formatted prompt.

05:18.600 --> 05:19.920
And then we just need to put in here

05:19.920 --> 05:21.480
which model we're gonna use.

05:21.480 --> 05:26.160
In this case we're gonna use GPT-4, which is ChatGPT Plus.

05:26.160 --> 05:28.860
And then this is where we're gonna paste in the response.

05:28.860 --> 05:31.320
We can also copy this formula down

05:31.320 --> 05:33.840
and I'm just gonna drag that down there.

05:33.840 --> 05:36.450
So that's gonna count the number of words.

05:36.450 --> 05:37.320
Cool, all right.

05:37.320 --> 05:40.620
So now I'm not gonna use ChatGPT

05:40.620 --> 05:44.820
because ChatGPT I've found is relatively inconsistent.

05:44.820 --> 05:47.280
It's somewhat personalized to people's results.

05:47.280 --> 05:48.480
If you have custom instructions,

05:48.480 --> 05:50.130
it's gonna change things, right?

05:50.130 --> 05:51.930
I would say whenever you're templating something,

05:51.930 --> 05:53.430
you want to use the API,

05:53.430 --> 05:55.800
but obviously I promised you wouldn't have to code here,

05:55.800 --> 05:57.360
so I'm not gonna make you,

05:57.360 --> 06:00.780
but if you sign up for a developer account at OpenAI,

06:00.780 --> 06:03.150
the main thing that I use here is this playground.

06:03.150 --> 06:04.530
This is the system message.

06:04.530 --> 06:06.420
This is like the custom instructions.

06:06.420 --> 06:07.830
We're not gonna play with that here.

06:07.830 --> 06:11.010
We're just gonna emulate talking to ChatGPT.

06:11.010 --> 06:14.100
We have the model over here so we can use turbo preview,

06:14.100 --> 06:16.440
we can use GPT-4 or whatever it is,

06:16.440 --> 06:18.090
and then you can set the temperature as well.

06:18.090 --> 06:20.160
I just leave it at one right now

06:20.160 --> 06:21.930
and that's relatively balanced.

06:21.930 --> 06:24.030
If you want it to be more deterministic,

06:24.030 --> 06:25.770
so you want it to give the same result every time,

06:25.770 --> 06:27.540
you could lower it, or if you want it to be

06:27.540 --> 06:30.270
a bit more crazy or creative, you could increase it.

06:30.270 --> 06:31.440
The main thing you wanna make sure

06:31.440 --> 06:33.570
is you set the maximum length here.

06:33.570 --> 06:36.720
I'd set it for 4,000 because that should be enough.

06:36.720 --> 06:38.400
One token, like it's saying here,

06:38.400 --> 06:40.380
is roughly four characters of English text.

06:40.380 --> 06:42.930
So about 3/4 of a word.

06:42.930 --> 06:45.720
Yeah, 4,000 tokens would be about 3000 words.

06:45.720 --> 06:48.030
It should be plenty for what we need.
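
NOTE
The token arithmetic can be written out as a short Python sketch. The 0.75 words per token figure is the rule of thumb quoted in the video; approx_words is a hypothetical helper name, and the real ratio varies with the text.
```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: one token is about 3/4 of a word
def approx_words(max_tokens: int) -> int:
    # Rough upper bound on how many English words fit in a token budget.
    return int(max_tokens * WORDS_PER_TOKEN)
print(approx_words(4000))  # 3000
```
A 4,000-token limit therefore leaves headroom above the 2,000-word target being tested.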

06:48.030 --> 06:49.920
And then don't worry about these other settings.

06:49.920 --> 06:52.260
I'm gonna paste in the prompt here.

06:52.260 --> 06:57.260
And we can see, actually, lemme just delete the quote marks,

06:58.289 --> 07:00.480
this is something that Google Sheets adds.

07:00.480 --> 07:02.820
But yeah, this is our formatted prompt.

07:02.820 --> 07:04.500
Remember, don't paste the template in.

07:04.500 --> 07:08.580
So we have here the productivity with time blocking topic.

07:08.580 --> 07:11.520
We have the writing style,

07:11.520 --> 07:14.280
and then we have this example post of another post

07:14.280 --> 07:15.750
I've written in the past.

07:15.750 --> 07:16.890
And it's gonna go off that.

07:16.890 --> 07:19.113
Okay, cool. So let's hit submit.

07:21.450 --> 07:24.153
And it's gonna generate this response.

07:28.140 --> 07:30.360
And this takes a little bit of time with GPT-4,

07:30.360 --> 07:33.450
it's much faster with GPT-3.5. (laughs)

07:33.450 --> 07:35.460
But the nice thing about this, by the way,

07:35.460 --> 07:38.220
is you're also not gonna hit your limits

07:38.220 --> 07:41.340
because GPT-4, you can only send 50 messages a day

07:41.340 --> 07:43.920
or in three hours, which is super annoying, right?

07:43.920 --> 07:45.720
And if you have this situation

07:45.720 --> 07:48.480
where you're trying to test a prompt,

07:48.480 --> 07:50.100
I need to run it 10 times,

07:50.100 --> 07:53.670
that means I can only test five things and that sucks.

07:53.670 --> 07:54.750
It's better to do it through

07:54.750 --> 07:58.140
the OpenAI API playground instead.

07:58.140 --> 08:00.120
One thing I should also mention

08:00.120 --> 08:02.460
is that there is some cost to this.

08:02.460 --> 08:06.630
GPT-4 is pretty cheap, but you can see the pricing

08:06.630 --> 08:08.340
and the usage in here.

08:09.189 --> 08:10.500
I'm just gonna try and open that in another tab.

08:10.500 --> 08:13.740
So you can see I only spent three bucks this month,

08:13.740 --> 08:18.740
but last month I spent, yeah, a fair bit, 130 bucks.

08:19.560 --> 08:21.480
Just depends on what you're doing.

08:21.480 --> 08:23.400
This actual, these API calls,

08:23.400 --> 08:26.730
so like this one I just made here is like 5 cents.

08:26.730 --> 08:31.020
So it is not a big amount, so don't worry too much about it.

08:31.020 --> 08:32.430
Cool. All right, here we go.

08:32.430 --> 08:35.170
We have a response here

08:36.360 --> 08:40.740
and I'm just gonna copy.

08:40.740 --> 08:41.573
We're not even gonna read it.

08:41.573 --> 08:42.993
I'm just gonna paste in here.

08:44.580 --> 08:45.413
And here we go.

08:45.413 --> 08:48.670
We've only got 933 words, so it did ignore

08:50.100 --> 08:53.310
our admonition to make the word length long.

08:53.310 --> 08:54.570
But you can have a couple of choices here.

08:54.570 --> 08:56.280
Either one, you could iterate on the prompt

08:56.280 --> 08:59.010
and just say, "It did so badly on the first try

08:59.010 --> 09:00.930
that I'm not even gonna try it multiple times.

09:00.930 --> 09:02.160
I'm gonna try something else."

09:02.160 --> 09:04.830
Or you could keep running it 10 more times

09:04.830 --> 09:06.180
and pasting these in just to see

09:06.180 --> 09:07.290
if there's some variation.

09:07.290 --> 09:09.423
I'm gonna do that and then I'll restart.

09:10.470 --> 09:13.353
Okay, on the last run now, that you can see

09:13.353 --> 09:15.900
that I've just been copy and pasting in here

09:15.900 --> 09:17.400
and it's not looking good,

09:17.400 --> 09:20.880
actually we're getting worse results so far.

09:20.880 --> 09:22.530
Yeah, with an experiment like this,

09:22.530 --> 09:24.150
typically I would run it a couple of times

09:24.150 --> 09:26.040
and if it's not even close I would stop it.

09:26.040 --> 09:29.400
But this is interesting data, right?

09:29.400 --> 09:31.830
Because here we go, we've got the last one here,

09:31.830 --> 09:33.390
if we just paste that one in.

09:33.390 --> 09:36.660
What this tells us is that it does not respect word count.

09:36.660 --> 09:40.290
It has basically completely ignored what we said here,

09:40.290 --> 09:42.270
that it should be more than 2000 words.

09:42.270 --> 09:45.510
That is a good result when it comes to an experiment,

09:45.510 --> 09:46.590
you'd be surprised

09:46.590 --> 09:49.590
because one, it tells us that technique doesn't work.

09:49.590 --> 09:51.300
So there's a lot of time and effort,

09:51.300 --> 09:53.670
like worrying about whether it's gonna work or not.

09:53.670 --> 09:54.990
We're not gonna try that one next.

09:54.990 --> 09:57.000
We're gonna try a different technique.

09:57.000 --> 09:59.280
The other thing that's really helpful here

09:59.280 --> 10:01.950
is that when we see other people using this technique,

10:01.950 --> 10:03.630
we can tell them of this experiment

10:03.630 --> 10:04.830
and actually give them the data

10:04.830 --> 10:06.810
and say, look, it doesn't look like it works,

10:06.810 --> 10:09.720
but you test it and it fosters a culture

10:09.720 --> 10:11.640
of testing within the organization.

10:11.640 --> 10:13.110
The really nice thing as well

10:13.110 --> 10:14.757
is that this is just a small scale test

10:14.757 --> 10:17.220
with you doing it in Google Sheets.

10:17.220 --> 10:18.870
But like you could pass this to a developer

10:18.870 --> 10:21.420
and say, "Hey, can you test this at a larger scale?"

10:21.420 --> 10:24.660
Or maybe we could, if you find a result that does work

10:24.660 --> 10:28.200
like we did with make it really long in caps,

10:28.200 --> 10:30.063
the emotion prompting, then that's something

10:30.063 --> 10:32.520
that you could validate with a larger scale test.

10:32.520 --> 10:34.380
And you could try it with lots of different topics

10:34.380 --> 10:36.510
or lots of different example blog posts

10:36.510 --> 10:37.830
or lots of different writing styles

10:37.830 --> 10:42.003
and see which of those things impacts the overall length.

10:42.840 --> 10:44.190
You could also then scale this up

10:44.190 --> 10:47.280
and do other types of evaluations.

10:47.280 --> 10:49.320
So maybe the word length isn't as good

10:49.320 --> 10:51.300
when we use one type of prompt,

10:51.300 --> 10:53.160
but maybe the quality is high.

10:53.160 --> 10:55.500
So you could maybe outsource this,

10:55.500 --> 10:57.840
have someone read each of these blog posts

10:57.840 --> 11:00.120
and then put like a rating out of 10.

11:00.120 --> 11:02.940
The reason why it's useful to save this stuff here

11:02.940 --> 11:05.760
is that rather than just running it in ChatGPT,

11:05.760 --> 11:08.250
and then you lose the output, you can never find it again

11:08.250 --> 11:09.810
'cause it's in your history somewhere.

11:09.810 --> 11:12.630
Then you can actually give this data to someone else.

11:12.630 --> 11:14.610
You actually have some real data now.

11:14.610 --> 11:17.730
And you can use that for whatever it is you need.

11:17.730 --> 11:20.340
You can go back and review it in different ways,

11:20.340 --> 11:23.070
or you can add new tests to the experiment.

11:23.070 --> 11:24.360
Really, whatever it is you want,

11:24.360 --> 11:26.280
that's a really powerful thing you can do here.

11:26.280 --> 11:27.960
And just on the final thing,

11:27.960 --> 11:30.898
let me just update this pivot table.

11:30.898 --> 11:32.970
Okay, so to change the range,

11:32.970 --> 11:36.810
you just select the pivot table, then you click Edit here,

11:36.810 --> 11:40.230
and then if you scroll up, you'll see the range here.

11:40.230 --> 11:45.057
Right now it goes to G21, which is just the bottom here.

11:45.057 --> 11:48.753
And we want it to go all the way down to G31.

11:50.407 --> 11:52.350
So we're just gonna change that.

11:52.350 --> 11:55.050
Or if you want to make this automatic,

11:55.050 --> 11:57.120
you could just say to G

11:57.120 --> 11:59.730
and then it's gonna grab everything if you want,

11:59.730 --> 12:02.131
but I'm just gonna do this manually.
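
NOTE
What the pivot table computes, the average word count per variation, can be sketched in Python. The row values below are placeholders rather than the numbers from the sheet, and average_by_variation is a hypothetical name.
```python
from collections import defaultdict
from statistics import mean
def average_by_variation(rows):
    # rows: (variation name, word count) pairs, one per run in the sheet.
    grouped = defaultdict(list)
    for variation, words in rows:
        grouped[variation].append(words)
    return {name: mean(values) for name, values in grouped.items()}
# Placeholder rows for illustration only.
rows = [("Control", 900), ("Control", 1100), ("Specify Word Length", 1000), ("Specify Word Length", 1200)]
print(average_by_variation(rows))
```
Because the function groups whatever rows it receives, new runs are picked up without adjusting a range by hand.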

12:02.131 --> 12:03.930
Cool, all right, so we can see

12:03.930 --> 12:07.810
that we actually had a lower response

12:08.970 --> 12:11.040
than "I'll lose my job" in caps,

12:11.040 --> 12:14.160
but we did have a slightly higher response than the control.

12:14.160 --> 12:15.390
The way we figure that out,

12:15.390 --> 12:17.550
I just copy and paste the formula.

12:17.550 --> 12:18.630
Actually, lemme just do it here.

12:18.630 --> 12:23.630
So it would be this divided by this minus one.

12:24.210 --> 12:28.260
So it is 7.4% better in terms of length

12:28.260 --> 12:30.720
across 10 observations.
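
NOTE
The "this divided by this minus one" calculation is a relative lift, sketched here in Python. The two averages are hypothetical values chosen only to reproduce the 7.4% figure; the real averages are in the sheet.
```python
def relative_lift(variant_avg: float, control_avg: float) -> float:
    # Variant average divided by control average, minus one.
    return variant_avg / control_avg - 1
# Hypothetical averages, not the values from the sheet.
print(f"{relative_lift(1074, 1000):.1%}")  # 7.4%
```
With only 10 runs per variation, a lift this small could easily be noise, which is why a larger-scale test is worth running before relying on it.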

12:30.720 --> 12:31.830
So it's certainly pretty small,

12:31.830 --> 12:33.210
but it did make a difference.

12:33.210 --> 12:34.950
It just didn't make as much of a difference

12:34.950 --> 12:38.130
as the all caps emotion response.

12:38.130 --> 12:39.540
You could also test other things here,

12:39.540 --> 12:40.890
so you could take away the caps

12:40.890 --> 12:42.600
and just use the route of emotion.

12:42.600 --> 12:44.670
Or you could put the word length in caps

12:44.670 --> 12:45.870
as well if you want.

12:45.870 --> 12:48.180
You can get pretty fine-grained with this,

12:48.180 --> 12:50.460
but now you know how to do prompt testing.

12:50.460 --> 12:52.050
You can test whatever you want.

12:52.050 --> 12:54.480
So feel free to make a copy of this sheet,

12:54.480 --> 12:57.030
but also do make your own sheets

12:57.030 --> 12:59.670
or add your own stuff if you find that necessary

12:59.670 --> 13:01.260
for the purposes of your test

13:01.260 --> 13:03.720
or the type of evidence you need to be able to give

13:03.720 --> 13:05.370
to the other people on your team.