WEBVTT

00:00.150 --> 00:01.830
So you have a prompt and it's

00:01.830 --> 00:04.590
a pretty flowery prompt here.

00:04.590 --> 00:07.650
You have a lot of words and we're not

00:07.650 --> 00:09.930
a hundred percent sure whether we need all of these words.

00:09.930 --> 00:12.990
And obviously in Midjourney there's no real limit

00:12.990 --> 00:14.781
to how many tokens you can have,

00:14.781 --> 00:19.230
but it can, I think, add a lot of noise to the image,

00:19.230 --> 00:20.670
if you are pushing the model

00:20.670 --> 00:22.350
in too many different directions

00:22.350 --> 00:25.080
and maybe not all of these tokens are necessary,

00:25.080 --> 00:27.210
you could be getting random results

00:27.210 --> 00:28.987
or like unusual results because

00:28.987 --> 00:30.510
you're including a lot of things

00:30.510 --> 00:33.900
that aren't really contributing a lot, but adding noise.

00:33.900 --> 00:36.516
Okay, so I'm just gonna copy this

00:36.516 --> 00:39.243
and then I'm gonna say /shorten,

00:40.224 --> 00:43.470
and this is a technique that I haven't actually seen

00:43.470 --> 00:46.590
that many people use, but it is pretty amazing.

00:46.590 --> 00:48.840
So if you just hit enter, I haven't seen

00:48.840 --> 00:51.870
any other model that has this.

00:51.870 --> 00:54.900
Maybe it'd be possible to build for Stable Diffusion,

00:54.900 --> 00:57.660
but what it does is it chooses

00:57.660 --> 01:00.302
what are the important tokens and it tells you

01:00.302 --> 01:02.430
here which ones are important.

01:02.430 --> 01:04.533
So graffiti style mural is really important.

01:04.533 --> 01:07.424
So clever is more important than slide.

01:07.424 --> 01:09.570
That's interesting. And then dachshund.

01:09.570 --> 01:11.758
It doesn't need dog, showcasing. Banksy.

01:11.758 --> 01:15.270
And then it's stencil. It doesn't need technique, right?

01:15.270 --> 01:16.830
It doesn't need set against.

01:16.830 --> 01:18.540
It just needs brick and background.

01:18.540 --> 01:21.195
It actually knows, because it is the model,

01:21.195 --> 01:24.765
it can give you the attention that's paid to each token

01:24.765 --> 01:27.782
and make a subjective decision about

01:27.782 --> 01:30.000
what should be included in the prompt.

01:30.000 --> 01:32.190
And then it gives you this pretty cool thing here,

01:32.190 --> 01:36.211
which is it just slowly narrows it down, deletes words.

01:36.211 --> 01:39.151
Then this is just deleting all the unnecessary words

01:39.151 --> 01:41.310
and then cuts more and more away

01:41.310 --> 01:43.774
until literally if you just put in like mural,

01:43.774 --> 01:46.440
dachshund, Banksy, brick, then

01:46.440 --> 01:47.940
you get something pretty good,

01:47.940 --> 01:50.550
even though you haven't used all of these tokens.

01:50.550 --> 01:52.320
So this is just an interesting concept.

01:52.320 --> 01:56.220
One thing I do a lot is I will use this feature

01:56.220 --> 01:59.430
and then copy the prompt across to Stable Diffusion.

01:59.430 --> 02:02.207
I know that's not quite exactly transferable,

02:02.207 --> 02:05.070
but it is useful to know, at least in general,

02:05.070 --> 02:06.510
like what types of words are triggering

02:06.510 --> 02:08.340
for this type of model.

02:08.340 --> 02:10.034
All right. The other thing you can do is

02:10.034 --> 02:11.850
you can click on one of these and it will run that prompt.

02:11.850 --> 02:13.110
But there's one more little thing

02:13.110 --> 02:15.420
I wanna show you, which is Show Details.

02:15.420 --> 02:16.980
And this is really exciting because

02:16.980 --> 02:18.330
it gives you the numbers, right?

02:18.330 --> 02:20.130
It actually gives you the numbers.

02:20.130 --> 02:21.153
You can see there's a little chart here,

02:21.153 --> 02:23.880
the dachshund is very important.

02:23.880 --> 02:25.950
Banksy and the mural, but then whereas

02:25.950 --> 02:27.944
like showcasing is not as important.

02:27.944 --> 02:30.420
So that's how it makes these decisions.

02:30.420 --> 02:32.580
You can look and say, so I thought the word subtly,

02:32.580 --> 02:34.080
for example, would be important,

02:34.080 --> 02:35.490
but it's 0.01

02:35.490 --> 02:37.470
in terms of attention paid to that token.

02:37.470 --> 02:39.900
So yeah, really not worth it.

02:39.900 --> 02:40.890
The other thing is it tells you

02:40.890 --> 02:42.240
a little bit how things are split up.

02:42.240 --> 02:46.260
So it's saying "and social." So that is interesting

02:46.260 --> 02:48.510
because I've used the words stencil art

02:48.510 --> 02:50.580
technique and social commentary.

02:50.580 --> 02:52.860
Like I thought that it would split.

02:52.860 --> 02:54.570
So it's like stencil art technique,

02:54.570 --> 02:56.490
and then social commentary would be separate,

02:56.490 --> 02:58.290
but it's actually paired "and social" together.

02:58.290 --> 03:01.710
So I might be getting unusual results because of that.

03:01.710 --> 03:04.320
Right, so it's really important to know that

03:04.320 --> 03:05.490
and to think about that when

03:05.490 --> 03:08.354
you are starting to split these prompts up

03:08.354 --> 03:10.335
and putting them in different sections.

03:10.335 --> 03:12.477
That's just really good intelligence.

03:12.477 --> 03:15.780
All right, so just out of interest, let's run number five

03:15.780 --> 03:18.640
and see what we get back, and hit submit

03:21.900 --> 03:24.360
and see if we get like a similar image,

03:24.360 --> 03:27.513
even though we've drastically shortened the prompt.

03:33.060 --> 03:34.800
While this is running, I just wanna say as well,

03:34.800 --> 03:36.742
one thing this teaches me is just that

03:36.742 --> 03:40.170
it is not like an LLM, right?

03:40.170 --> 03:41.820
When you're prompting for images,

03:41.820 --> 03:44.730
it doesn't understand like the words necessarily.

03:44.730 --> 03:48.600
It's just grabbing the individual tokens more than anything.

03:48.600 --> 03:52.020
You might not actually be like using natural language

03:52.020 --> 03:54.030
to describe things like I did here where I said,

03:54.030 --> 03:57.360
"and social commentary" or "a contrasted urban brick."

03:57.360 --> 03:58.890
You could literally just list words,

03:58.890 --> 04:01.320
I think, and get similar results.

04:01.320 --> 04:04.175
Here we go. We actually got some pretty cool results here.

04:04.175 --> 04:06.510
Although it did get a little bit messed up here,

04:06.510 --> 04:07.920
and that's pretty amazing though.

04:07.920 --> 04:09.360
When you think we went from

04:09.360 --> 04:13.380
all these words down to just four key words.

04:13.380 --> 04:15.723
That's still a pretty amazing result.
