WEBVTT

1
00:00:00.510 --> 00:00:03.000
<v Maximilian>Now, back in the chat window,</v>

2
00:00:03.000 --> 00:00:04.980
which is arguably where you'll spend

3
00:00:04.980 --> 00:00:08.400
most of your time, and back in user mode,

4
00:00:08.400 --> 00:00:12.150
I want to take a closer look at this plus button here

5
00:00:12.150 --> 00:00:16.770
because this button allows you to attach images

6
00:00:16.770 --> 00:00:19.080
or files to your chat.

7
00:00:19.080 --> 00:00:21.330
Now, in order to work with images,

8
00:00:21.330 --> 00:00:25.050
you need to work with a model that supports images,

9
00:00:25.050 --> 00:00:27.570
one that's multimodal, and you can tell

10
00:00:27.570 --> 00:00:29.940
whether that's the case for a given model

11
00:00:29.940 --> 00:00:33.213
by checking whether this icon here is shown.

12
00:00:34.260 --> 00:00:38.460
You can also go to your app settings and open the model search.

13
00:00:38.460 --> 00:00:41.430
And there, if you select a model, either one you have

14
00:00:41.430 --> 00:00:44.430
or one you haven't downloaded yet,

15
00:00:44.430 --> 00:00:47.220
you will also see that icon there.

16
00:00:47.220 --> 00:00:48.600
And if you hover over that icon,

17
00:00:48.600 --> 00:00:50.460
you'll also see this info text:

18
00:00:50.460 --> 00:00:53.550
"This model can process image inputs."

19
00:00:53.550 --> 00:00:55.860
And of course, you must be working with a model

20
00:00:55.860 --> 00:00:59.580
that does support image inputs in order to be able to upload

21
00:00:59.580 --> 00:01:01.590
and use an image.

22
00:01:01.590 --> 00:01:04.410
So for example, here, I'll open a new chat

23
00:01:04.410 --> 00:01:06.810
and I will attach an image here.

24
00:01:06.810 --> 00:01:08.730
And the image I want to attach

25
00:01:08.730 --> 00:01:12.303
is a car wash receipt I have.

26
00:01:14.070 --> 00:01:17.700
This is the image. It's a pretty hard-to-read receipt

27
00:01:17.700 --> 00:01:20.190
of me washing my car, or rather of the place

28
00:01:20.190 --> 00:01:21.450
where I washed my car.

29
00:01:21.450 --> 00:01:23.040
It's in German, but in the end,

30
00:01:23.040 --> 00:01:26.250
it contains the amount I paid and some other information.

31
00:01:26.250 --> 00:01:28.470
And I can upload this here.

32
00:01:28.470 --> 00:01:30.060
And I could then use this model,

33
00:01:30.060 --> 00:01:32.310
since this Gemma 3 model,

34
00:01:32.310 --> 00:01:34.050
this 12-billion-parameter model,

35
00:01:34.050 --> 00:01:36.060
does support image processing.

36
00:01:36.060 --> 00:01:39.990
I could ask it, "Please extract the core

37
00:01:39.990 --> 00:01:44.067
information from this uploaded image."

38
00:01:44.910 --> 00:01:47.460
And we could see whether it's able to do that or not.

39
00:01:47.460 --> 00:01:50.340
It is technically able to parse images,

40
00:01:50.340 --> 00:01:53.610
but let's see if it can read the content from this image,

41
00:01:53.610 --> 00:01:56.820
since this is really bad quality.

42
00:01:56.820 --> 00:02:00.060
Now, processing images can take a bit longer,

43
00:02:00.060 --> 00:02:03.960
but it then goes ahead, and this indeed doesn't look too bad.

44
00:02:03.960 --> 00:02:06.960
Now, there are some mistakes in here,

45
00:02:06.960 --> 00:02:10.170
like the street name is not totally correct,

46
00:02:10.170 --> 00:02:13.890
but the amount I paid, for example, is correct.

47
00:02:13.890 --> 00:02:15.390
And that's of course pretty impressive

48
00:02:15.390 --> 00:02:17.880
since this is just a small model running locally

49
00:02:17.880 --> 00:02:22.770
on our system, and the image quality here is not great.

50
00:02:22.770 --> 00:02:27.750
By the way, for comparison, here I am using that same image

51
00:02:27.750 --> 00:02:31.470
with the more powerful 27-billion-parameter

52
00:02:31.470 --> 00:02:32.730
Gemma 3 model.

53
00:02:32.730 --> 00:02:35.940
And at least here for me, it does extract

54
00:02:35.940 --> 00:02:38.250
all that information correctly.

55
00:02:38.250 --> 00:02:42.060
So using these models for OCR tasks like this,

56
00:02:42.060 --> 00:02:44.010
for extracting text from images,

57
00:02:44.010 --> 00:02:46.533
can be a very useful use case.

