WEBVTT

00:00.240 --> 00:01.440
-: What is Google Vision?

00:01.440 --> 00:04.500
So, this is one that a lot of people don't know about

00:04.500 --> 00:08.130
because they're really focused on Bard and DALL-E

00:08.130 --> 00:10.980
and all these kind of more famous AI tools.

00:10.980 --> 00:12.810
But Google Vision is actually

00:12.810 --> 00:14.970
one of the first machine learning models that I used,

00:14.970 --> 00:18.540
and it's pretty powerful for image recognition.

00:18.540 --> 00:20.820
It's available as an API in Google Cloud,

00:20.820 --> 00:22.440
and here's an example of it on the right,

00:22.440 --> 00:25.470
as something I did where I took an image

00:25.470 --> 00:28.470
of a woman wearing fashionable clothes

00:28.470 --> 00:31.170
and then just tagged the labels from that image.

00:31.170 --> 00:33.857
It correctly recognized that the image has hair in it

00:33.857 --> 00:36.780
and it has a picture of her head, shoulder, eye.

00:36.780 --> 00:40.020
It's flash photography, label, her sleeve, dress.

00:40.020 --> 00:44.070
It's really useful for extracting what's in an image.

00:44.070 --> 00:46.260
And it does more than just labels.

00:46.260 --> 00:50.430
So this is the same image uploaded to their online demo,

00:50.430 --> 00:52.980
and you can see that it extracts the objects.

00:52.980 --> 00:55.950
So it knows that there's a person in the image, for example.

00:55.950 --> 00:58.920
For faces, it can detect whether there is a face

00:58.920 --> 01:03.420
and then also what the emotion of that face is.

01:03.420 --> 01:05.820
There are also a few other kinds of properties,

01:05.820 --> 01:07.710
like dominant colors you can get,

01:07.710 --> 01:11.790
which is pretty useful I think for categorization of images.

01:11.790 --> 01:13.650
Crop hints as well, so it kind of gives you an idea

01:13.650 --> 01:17.460
of what to focus on if you need to crop that image.

01:17.460 --> 01:19.620
And then one which I've seen some people use

01:19.620 --> 01:21.300
is the safe search detection here.

01:21.300 --> 01:24.390
This isn't tagged as an adult image, which is correct.

01:24.390 --> 01:26.910
There's no violence in here, but it is racy.

01:26.910 --> 01:28.280
It's very likely to be racy, right?

01:28.280 --> 01:30.570
So you can use this to automatically detect

01:30.570 --> 01:33.030
not-safe-for-work images.

01:33.030 --> 01:36.540
Specific places I've used it, one was product images,

01:36.540 --> 01:38.340
tagging what's in your product images

01:38.340 --> 01:40.770
and then being able to filter so you get

01:40.770 --> 01:42.617
a good sense of what types of product images

01:42.617 --> 01:45.750
are converting, for example, on your e-commerce website.

01:45.750 --> 01:48.540
Ad creative, this is some tagging

01:48.540 --> 01:50.250
I ran across all of the creatives

01:50.250 --> 01:51.330
in their Facebook ad account

01:51.330 --> 01:54.865
so you can see what tags tend to perform better than others

01:54.865 --> 01:58.800
and how many creatives have that tag.

01:58.800 --> 02:03.540
So, in this example, we had 23 creatives

02:03.540 --> 02:05.670
that had the label of communication device

02:05.670 --> 02:07.903
and 29 with electronic device,

02:07.903 --> 02:09.810
so that's interesting in itself

02:09.810 --> 02:12.060
to see what sort of patterns are appearing.

02:12.060 --> 02:13.170
And then you can also use it

02:13.170 --> 02:14.636
for not-safe-for-work detection.

02:14.636 --> 02:17.790
Another use case is just detecting if there are faces

02:17.790 --> 02:21.300
or detecting if there are specific labels in an image

02:21.300 --> 02:22.890
in order to correct for them.

02:22.890 --> 02:25.230
AI image generators are not particularly good at faces,

02:25.230 --> 02:26.970
so you can use this as a check:

02:26.970 --> 02:29.700
after you've generated an image with DALL-E, for example,

02:29.700 --> 02:31.230
you could check if there's a face in it

02:31.230 --> 02:35.370
and then regenerate if you don't want faces in your image.

02:35.370 --> 02:37.560
Okay, Google Vision I think is really powerful.

02:37.560 --> 02:41.430
It's a deep learning model, like DALL-E,

02:41.430 --> 02:44.010
but it's specifically trained

02:44.010 --> 02:48.090
for this type of labeling and entity extraction,

02:48.090 --> 02:50.583
so it's really useful for specific use cases.