WEBVTT

00:00.510 --> 00:03.210
-: Okay, let's walk you through the Vision Prompting Guide,

00:03.210 --> 00:04.860
which is how to prompt a vision model

00:04.860 --> 00:08.970
and just really showing you what they're capable of doing.

00:08.970 --> 00:12.180
So, vision models are typically called multimodal models.

00:12.180 --> 00:14.490
That means they have multiple modes of input.

00:14.490 --> 00:17.190
Not just text, but you can input images as well,

00:17.190 --> 00:19.530
and I'm sure there's gonna be more coming

00:19.530 --> 00:24.450
in terms of inputting video and inputting audio, et cetera.

00:24.450 --> 00:28.140
So, one of the big breakthroughs I think in recent years

00:28.140 --> 00:30.270
is that ChatGPT can see now.

00:30.270 --> 00:33.600
So, OpenAI just, they released that a few months ago

00:33.600 --> 00:37.770
at the developer conference at the end of 2023,

00:37.770 --> 00:42.770
and now it's pretty easy to have AI look at an image

00:43.050 --> 00:44.190
and tell you what's in it.

00:44.190 --> 00:48.210
Back in 2014 it was so impossible to do that,

00:48.210 --> 00:52.110
that there was a joke on the internet on XKCD

00:52.110 --> 00:54.360
that if you needed to take a picture

00:54.360 --> 00:57.780
and see if the picture contained a bird,

00:57.780 --> 01:00.330
it would be like a five-year research project

01:00.330 --> 01:02.520
with a team of PhDs, right?

01:02.520 --> 01:04.440
So it was virtually impossible,

01:04.440 --> 01:06.240
whereas now it's pretty easy.

01:06.240 --> 01:08.040
You just ask ChatGPT.

01:08.040 --> 01:11.880
Which multimodal models are available as well as ChatGPT,

01:11.880 --> 01:13.920
which uses GPT-4 Vision?

01:13.920 --> 01:16.057
There's also CLIP by OpenAI.

01:16.057 --> 01:17.400
It's an open source model,

01:17.400 --> 01:19.320
and that was released as part of Stable,

01:19.320 --> 01:22.380
it's actually what Stable Diffusion was originally built on.

01:22.380 --> 01:24.300
It's a very simple model, though.

01:24.300 --> 01:27.090
It can only really tell you what's in an image

01:27.090 --> 01:29.610
or give you captions of an image.

01:29.610 --> 01:33.360
Gemini 1.5 Pro by Google is multimodal.

01:33.360 --> 01:35.970
It also accepts video, which is really cool.

01:35.970 --> 01:39.810
Claude 3 by Anthropic now accepts images too.

01:39.810 --> 01:42.410
And then there's a couple of other open source I'll turn to.

01:42.410 --> 01:45.960
So LLaVA 1.6 by Meta is open source.

01:45.960 --> 01:49.710
That is a fine tune of the Llama 2 model,

01:49.710 --> 01:52.440
so that accepts images as well as text.

01:52.440 --> 01:55.593
And then there's also Qwen VL, which is by Alibaba.

01:57.330 --> 01:58.740
So, what can it do?

01:58.740 --> 02:01.320
Here's a an example that really blew my mind.

02:01.320 --> 02:03.358
I uploaded an image of a Twitch stream

02:03.358 --> 02:04.950
of the game "World of Tanks,"

02:04.950 --> 02:06.780
and it could pull out lots of information.

02:06.780 --> 02:08.010
It could detect different objects.

02:08.010 --> 02:09.141
Is there a tank in this image?

02:09.141 --> 02:11.580
It can read text from the image as well.

02:11.580 --> 02:13.693
It can understand overall layout.

02:13.693 --> 02:15.510
UI can give feedback on that

02:15.510 --> 02:17.970
if you are prompting it for that reason.

02:17.970 --> 02:20.700
It can also understand what's happening.

02:20.700 --> 02:23.610
Like, it can explain what's happening in the scene

02:23.610 --> 02:26.490
based on that image, which is really exciting.

02:26.490 --> 02:28.560
What I'm gonna do now is apply

02:28.560 --> 02:32.970
the five principles of prompting to vision prompting,

02:32.970 --> 02:34.320
because one thing you'll see

02:34.320 --> 02:36.627
is it doesn't matter what the AI model is,

02:36.627 --> 02:39.330
essentially there's always

02:39.330 --> 02:41.370
these same five principles that apply.

02:41.370 --> 02:44.460
So, everything you've learned for text and image prompting

02:44.460 --> 02:46.593
will also work for vision as well.

02:47.460 --> 02:49.320
First principle was give direction,

02:49.320 --> 02:51.330
and that's also true here.

02:51.330 --> 02:54.480
People have found that by pointing to specific objects,

02:54.480 --> 02:56.760
literally annotating an image, like drawing a circle

02:56.760 --> 02:59.550
around the thing that you want to ask about,

02:59.550 --> 03:01.440
that works really well.

03:01.440 --> 03:05.070
And you can combine that with traditional ML

03:05.070 --> 03:07.470
and image segmentation techniques.

03:07.470 --> 03:09.690
So you could draw these on programmatically.

03:09.690 --> 03:12.120
Identify if there's a bottle in the image.

03:12.120 --> 03:15.750
Draw an arrow next to the bottle programmatically,

03:15.750 --> 03:18.270
and then query GPT Vision

03:18.270 --> 03:20.223
to ask your question of that bottle.

03:22.080 --> 03:23.610
Specifying format also works.

03:23.610 --> 03:26.160
So JSON mode works here as well, which is really great.

03:26.160 --> 03:28.430
You can get structured text from an image.

03:28.430 --> 03:30.930
So here's an example of reading a driver's license

03:30.930 --> 03:33.245
and pulling out the important information,

03:33.245 --> 03:35.280
and that's really exciting.

03:35.280 --> 03:38.340
There's also the concept of providing examples.

03:38.340 --> 03:40.230
So it can actually accept multiple images.

03:40.230 --> 03:41.838
It doesn't need to be just one,

03:41.838 --> 03:43.890
and it's in the chat format,

03:43.890 --> 03:47.370
just like any other kind of chat that you have with ChatGPT.

03:47.370 --> 03:49.080
Here's an example where they got it

03:49.080 --> 03:51.180
to correctly read a speedometer,

03:51.180 --> 03:52.770
which is something it fails at,

03:52.770 --> 03:55.323
where they just provided a few examples first.

03:55.323 --> 03:58.170
So, the first two there,

03:58.170 --> 03:59.160
two first two images,

03:59.160 --> 04:02.850
they manually wrote the response,

04:02.850 --> 04:05.250
and then the third one they just included the image

04:05.250 --> 04:09.420
and then GPT-4 was able to write that response there

04:09.420 --> 04:11.670
and get it correct, which is really good.

04:11.670 --> 04:14.040
Evaluating quality is just as important with vision

04:14.040 --> 04:17.250
as it is with any other LLM work that you're doing,

04:17.250 --> 04:19.407
because it sometimes gets things wrong.

04:19.407 --> 04:21.600
And so it does a pretty good of recognizing

04:21.600 --> 04:22.440
who these people are.

04:22.440 --> 04:24.000
They're famous researchers,

04:24.000 --> 04:26.880
but the bounding boxes it draws around them

04:26.880 --> 04:27.810
are a little bit off.

04:27.810 --> 04:31.170
Like, you can see it's missed an Andrew Ng, right,

04:31.170 --> 04:33.450
Andrew Ng here with the boundary box.

04:33.450 --> 04:35.460
Don't fully trust it for every task.

04:35.460 --> 04:36.690
You need to evaluate quality,

04:36.690 --> 04:38.790
just like you do for anything.

04:38.790 --> 04:40.980
Division of labor works here as well.

04:40.980 --> 04:43.453
So, if you divide a task up into multiple prompts

04:43.453 --> 04:45.660
or if even within the same prompt,

04:45.660 --> 04:47.670
if you get it to think step-by-step

04:47.670 --> 04:50.197
by prompting it to figure out,

04:50.197 --> 04:52.110
"Let's count these apples row by row,"

04:52.110 --> 04:53.940
it does a much better job

04:53.940 --> 04:57.330
of getting the exact correct response, right?

04:57.330 --> 04:59.880
You can see here where we didn't ask it to do that.

04:59.880 --> 05:02.283
It didn't count the right number of apples.

05:04.200 --> 05:05.033
Cool.

05:05.033 --> 05:06.087
Now, I think the biggest thing

05:06.087 --> 05:08.970
and the most exciting thing for me is

05:08.970 --> 05:10.410
temporal reasoning.

05:10.410 --> 05:12.870
What I mean by that is it can actually understand

05:12.870 --> 05:14.880
what's happening in a sequence of images.

05:14.880 --> 05:18.420
Here's an example where it just showed four pictures

05:18.420 --> 05:19.470
of someone doing a pushup.

05:19.470 --> 05:21.506
They go down, and then they go up again

05:21.506 --> 05:23.047
and you can actually ask it,

05:23.047 --> 05:24.517
"What's this person gonna do next?"

05:24.517 --> 05:25.680
"They're gonna do a pushup."

05:25.680 --> 05:28.890
It actually understands the sequence of images here,

05:28.890 --> 05:30.870
and it could reorder the images as well

05:30.870 --> 05:32.400
if you put them out of sequence,

05:32.400 --> 05:34.440
which is again very exciting.

05:34.440 --> 05:35.610
Now, this is really different.

05:35.610 --> 05:36.630
I think this is the thing

05:36.630 --> 05:39.060
that you couldn't do before with traditional ML

05:39.060 --> 05:41.220
'cause it didn't have a reasoning engine behind it.

05:41.220 --> 05:43.680
You could identify that there are cameras in the image,

05:43.680 --> 05:45.780
you could identify a person in the image,

05:45.780 --> 05:47.790
but it couldn't really figure out,

05:47.790 --> 05:50.220
like, what that person is doing from frame to frame,

05:50.220 --> 05:53.010
and I think that little glimpse of intelligence,

05:53.010 --> 05:55.019
that is really what makes this very special

05:55.019 --> 05:58.080
and worth the exorbitant cost

05:58.080 --> 06:00.750
that comes with using vision models.

06:00.750 --> 06:04.500
I applied this to a project that I worked on recently

06:04.500 --> 06:07.260
where we transcribed the audio for some video,

06:07.260 --> 06:11.250
but we also transcribed the images using AI vision.

06:11.250 --> 06:14.700
So, we got it to provide commentary on the game,

06:14.700 --> 06:16.770
and this is from a Twitch stream.

06:16.770 --> 06:19.530
So, you can see the sequence of images we provided.

06:19.530 --> 06:21.390
These are the timestamps in seconds.

06:21.390 --> 06:24.360
And then this is the commentary here that the AI wrote,

06:24.360 --> 06:25.320
and it's really good,

06:25.320 --> 06:27.300
and you can stitch all this together as well.

06:27.300 --> 06:28.590
But one of the things that we found

06:28.590 --> 06:33.210
is that it can handle maybe up to 30 images at the minute

06:33.210 --> 06:34.680
in terms of the context length,

06:34.680 --> 06:36.452
and it's also very expensive.

06:36.452 --> 06:40.380
So it turned out to be about $6 an hour to do this,

06:40.380 --> 06:42.180
which isn't actually that far off

06:42.180 --> 06:44.700
from paying someone in a low cost country to do it.

06:44.700 --> 06:45.900
We expect that's gonna come down

06:45.900 --> 06:47.493
by an order of magnitude soon.

06:49.230 --> 06:50.310
One thing you have to watch out

06:50.310 --> 06:52.050
is AI models have a mind of their own.

06:52.050 --> 06:53.340
They can't be trusted.

06:53.340 --> 06:55.860
Just like I said before, evaluation is really important.

06:55.860 --> 06:57.420
We found that quite often,

06:57.420 --> 07:00.870
it would refuse to do the request or it'd get things wrong,

07:00.870 --> 07:02.550
so we had to really develop

07:02.550 --> 07:04.590
a lot of good evaluation techniques

07:04.590 --> 07:06.300
in order to make sure this is working.

07:06.300 --> 07:07.133
And it's a real pain

07:07.133 --> 07:09.600
because every time it gets something wrong,

07:09.600 --> 07:11.220
that's costing you real money.

07:11.220 --> 07:12.270
For vision requests,

07:12.270 --> 07:13.710
it could be a couple of cents each, right?

07:13.710 --> 07:15.840
So if you're doing hundreds of hours of footage,

07:15.840 --> 07:18.600
then these problems really add up.

07:18.600 --> 07:21.599
So, we developed a bunch of evaluation metrics.

07:21.599 --> 07:24.480
A few different things that we found were useful

07:24.480 --> 07:28.740
is we had a test set, where we actually knew the answers,

07:28.740 --> 07:30.810
so we could provide a sequence

07:30.810 --> 07:32.787
and then we knew what sorts of things

07:32.787 --> 07:36.390
were in that sequence, and we asked it to identify

07:36.390 --> 07:37.920
whether those things were in the sequence.

07:37.920 --> 07:40.260
And yeah, what we can do then

07:40.260 --> 07:43.170
is get an accuracy score so we can see,

07:43.170 --> 07:46.980
how often did it predict false and it actually was false,

07:46.980 --> 07:49.020
and how often did it predict false,

07:49.020 --> 07:50.310
but it actually was true?

07:50.310 --> 07:54.390
One of the things we found is that it doesn't hallucinate

07:54.390 --> 07:55.740
when something isn't there.

07:55.740 --> 08:00.740
So, it actually had zero cases where something like,

08:01.140 --> 08:02.520
let's give an example.

08:02.520 --> 08:04.170
Is there a tank in this image, right?

08:04.170 --> 08:06.030
Like, it doesn't imagine

08:06.030 --> 08:07.950
there's a tank in the image if it's not there,

08:07.950 --> 08:10.320
but sometimes it would miss something.

08:10.320 --> 08:11.790
One of the things we're looking for is,

08:11.790 --> 08:13.680
was it close quarters combat?

08:13.680 --> 08:16.860
And sometimes there was actually close quarters combat

08:16.860 --> 08:18.540
and it didn't recognize that,

08:18.540 --> 08:20.580
so the accuracy was about 80%

08:20.580 --> 08:22.830
for the things that we were testing it on.

08:22.830 --> 08:27.090
The other thing we needed to do was actually figure out

08:27.090 --> 08:29.370
how close it got with some labeling.

08:29.370 --> 08:31.980
It wouldn't always label the same things the same way.

08:31.980 --> 08:34.860
We had active combat as one of the labels here,

08:34.860 --> 08:36.480
but then sometimes it would come up with a new label,

08:36.480 --> 08:39.210
which is like active combat engagement, which is so close

08:39.210 --> 08:41.520
that it's pretty much the same thing,

08:41.520 --> 08:44.850
and we would treat that as a winning test answer.

08:44.850 --> 08:46.470
This was the reference answer,

08:46.470 --> 08:48.150
and then this is what it came up with.

08:48.150 --> 08:49.350
And because it's so close,

08:49.350 --> 08:53.580
we used cosine similarity on a vector database

08:53.580 --> 08:54.840
just based on this text

08:54.840 --> 08:57.120
that we could get that similarity back,

08:57.120 --> 09:00.870
and we'd count it as a one if the similarity was above 90%,

09:00.870 --> 09:03.070
and you can see some examples where that is.

09:05.250 --> 09:07.470
One thing we expect is that open source models

09:07.470 --> 09:08.610
will bring down cost.

09:08.610 --> 09:11.990
So this is an experiment I ran with 1.6,

09:11.990 --> 09:14.700
and it's a 30 billion parameter model.

09:14.700 --> 09:15.840
It does a pretty good job.

09:15.840 --> 09:16.710
It's slow, though.

09:16.710 --> 09:19.200
At the present, it takes about five minutes

09:19.200 --> 09:23.100
to process one minute worth of video, which isn't great.

09:23.100 --> 09:24.810
It's much slower than real time.

09:24.810 --> 09:27.570
GPT Vision is actually slightly faster than real time.

09:27.570 --> 09:29.160
This obviously will speed up.

09:29.160 --> 09:31.350
This was me running it on my M3 MacBook.

09:31.350 --> 09:34.110
Over time, I expect the speed will increase,

09:34.110 --> 09:35.160
the cost will come down

09:35.160 --> 09:38.703
and be relatively cheap to sequence any sort of video.

09:39.930 --> 09:42.270
But one of the really exciting things I've seen,

09:42.270 --> 09:43.680
which I think will change a lot,

09:43.680 --> 09:45.480
and it's not something I really predicted,

09:45.480 --> 09:48.210
was using this for software development.

09:48.210 --> 09:52.260
tldraw did a really cool feature in the platform,

09:52.260 --> 09:54.540
where you can take your mock-up,

09:54.540 --> 09:57.210
and then it takes a screenshot of that mock-up

09:57.210 --> 09:59.220
and it feeds it to GPT Vision,

09:59.220 --> 10:03.120
and then what you get back is a generated code, right?

10:03.120 --> 10:04.710
So like it literally generates,

10:04.710 --> 10:07.650
design the code for you based on that mock-up.

10:07.650 --> 10:10.380
You just click Make it Real, which is pretty cool.

10:10.380 --> 10:12.960
So, there's gonna be all sorts of new modalities there.

10:12.960 --> 10:15.300
I think now that AI can see,

10:15.300 --> 10:17.160
then maybe typing text into a box

10:17.160 --> 10:20.700
isn't gonna be the main way we interact with AI,

10:20.700 --> 10:21.840
but we'll see.

10:21.840 --> 10:23.880
And one of the other exciting things I found

10:23.880 --> 10:27.750
is that, especially with Google, I've seen some demos here

10:27.750 --> 10:31.650
that it can return structured data from video.

10:31.650 --> 10:34.590
Here is an example of someone who works at Google.

10:34.590 --> 10:37.320
They uploaded a video of American cars,

10:37.320 --> 10:39.390
and then it pulled out structured data

10:39.390 --> 10:40.257
of what cars are in the video,

10:40.257 --> 10:43.680
and I think this is gonna be really big importance,

10:43.680 --> 10:45.330
especially because Google's model

10:45.330 --> 10:48.570
has a 1 million token context window.

10:48.570 --> 10:50.880
So you could potentially take a whole movie

10:50.880 --> 10:53.940
and pull out structured data of which characters appeared

10:53.940 --> 10:56.010
in which parts of the movie.

10:56.010 --> 10:57.330
That could open up quite a bit.

10:57.330 --> 10:59.430
That's the sort of stuff that people like Netflix

10:59.430 --> 11:03.600
pay millions of dollars for people to do manually right now,

11:03.600 --> 11:07.500
which should hopefully be done by AI in the future

11:07.500 --> 11:09.150
for a lot less money.

11:09.150 --> 11:11.460
So, what are the other vision use cases?

11:11.460 --> 11:13.833
I just listed a few here that I came up with.

11:15.180 --> 11:17.460
This is primarily from a paper I'll link to at the end,

11:17.460 --> 11:20.040
but a few that are interesting.

11:20.040 --> 11:22.230
It's pretty good at understanding medical images,

11:22.230 --> 11:25.710
although it's not doctor, so it can at least help you

11:25.710 --> 11:28.080
diagnose what's going on potentially.

11:28.080 --> 11:30.450
It can recognize logos, it can count objects,

11:30.450 --> 11:32.700
it can tell you what's funny about an image.

11:32.700 --> 11:34.140
It's also good at visual reasoning,

11:34.140 --> 11:36.000
so it can infer from visual clues

11:36.000 --> 11:37.680
in the image what's going on.

11:37.680 --> 11:38.790
It can extract text.

11:38.790 --> 11:40.650
It's really good at code architecture,

11:40.650 --> 11:45.240
not just the example I showed you, where we draw a mock-up

11:45.240 --> 11:47.400
and it turns that into front end code,

11:47.400 --> 11:48.480
but also on the backend.

11:48.480 --> 11:51.450
You can draw a mock-up of a data architecture

11:51.450 --> 11:54.060
or a database structure and it can turn that

11:54.060 --> 11:55.473
into Python code.

11:56.400 --> 11:57.839
It can interpret charts.

11:57.839 --> 11:59.910
It can actually pull data from the chart,

11:59.910 --> 12:03.870
so, "Extract all the sales figures from this pie chart,"

12:03.870 --> 12:05.400
could be really interesting.

12:05.400 --> 12:08.010
It can translate language, which is really powerful,

12:08.010 --> 12:10.110
especially when it gets faster than real time.

12:10.110 --> 12:13.290
You'd be able to hold it up, just like you can do

12:13.290 --> 12:14.700
with Google Translate right now.

12:14.700 --> 12:17.520
You can hold it up to a sign and it will,

12:17.520 --> 12:19.560
it can automatically translate that

12:19.560 --> 12:22.590
back into your native language.

12:22.590 --> 12:24.810
It's pretty good at data extraction as well.

12:24.810 --> 12:26.310
You can pull data out the image,

12:26.310 --> 12:28.040
like you've seen examples of.

12:28.040 --> 12:30.330
It can read human emotions and understand them

12:30.330 --> 12:32.790
and infer information from them.

12:32.790 --> 12:34.890
It can evaluate aesthetics.

12:34.890 --> 12:37.470
So, I've seen some people use this in image prompting.

12:37.470 --> 12:40.740
Like, it can evaluate whether something is beautiful or not.

12:40.740 --> 12:42.810
It can score it, which is pretty cool.

12:42.810 --> 12:44.310
It can detect defects.

12:44.310 --> 12:47.031
I imagine this can be used a lot in factories, right?

12:47.031 --> 12:50.160
In order to identify whether something has gone wrong

12:50.160 --> 12:52.320
or the product has got a hole in it

12:52.320 --> 12:54.240
or something needs to be stopped

12:54.240 --> 12:55.890
about the manufacturing process.

12:55.890 --> 12:57.120
And then the final two I think

12:57.120 --> 12:58.740
is where it really gets sci-fi.

12:58.740 --> 13:02.130
Embodied agents being able to walk around as a robot

13:02.130 --> 13:04.740
and it can actually tell you where to go

13:04.740 --> 13:06.210
to get something from the fridge.

13:06.210 --> 13:09.637
So that means it can actually be applied to a robot,

13:09.637 --> 13:12.810
and that robot could then use GPT Vision

13:12.810 --> 13:15.240
to understand where it's moving in the world.

13:15.240 --> 13:16.590
I think that's pretty cool.

13:16.590 --> 13:19.020
But you can also set up a browsing agent.

13:19.020 --> 13:21.270
So a little bit less sci-fi

13:21.270 --> 13:23.010
than a robot walking around your home,

13:23.010 --> 13:26.400
but you can get it to actually operate your browser,

13:26.400 --> 13:29.340
operate your computer and book things for you

13:29.340 --> 13:30.570
or schedule things for you,

13:30.570 --> 13:33.210
so I think that's gonna be really powerful as well.

13:33.210 --> 13:34.770
If you wanted to learn a little bit more

13:34.770 --> 13:37.500
about visual prompting, I really recommend this paper.

13:37.500 --> 13:39.210
It's where a lot of these examples came from.

13:39.210 --> 13:41.100
It's called "The Dawn of LMMs."

13:41.100 --> 13:44.010
That's LM, large multimodal models,

13:44.010 --> 13:46.710
instead of LLM, large language models.

13:46.710 --> 13:48.480
But yeah, I think this is the future.

13:48.480 --> 13:51.990
I think there's a lot of unexpected businesses

13:51.990 --> 13:54.000
that are gonna come out of this that will prove

13:54.000 --> 13:57.390
to be really impactful on our world.

13:57.390 --> 13:59.990
The quicker you can test out this stuff, the better.
