1
00:00:05,890 --> 00:00:13,990
In this lesson, we are going to talk about the limitations that GPT-4 Vision currently has.

2
00:00:16,670 --> 00:00:25,490
So the first important limitation of GPT-4 Vision concerns medical images.

3
00:00:25,490 --> 00:00:33,350
This is the first use case everybody was thinking about when GPT-4 Vision was launched.

4
00:00:33,350 --> 00:00:37,340
And they immediately

5
00:00:38,260 --> 00:00:40,120
warned us about this.

6
00:00:40,120 --> 00:00:52,180
So GPT-4 Vision is currently not suitable for interpreting specialized medical images like CT scans,

7
00:00:52,180 --> 00:00:55,810
and shouldn't be used for medical advice.

8
00:00:56,410 --> 00:00:59,320
So what they are saying is be careful.

9
00:00:59,320 --> 00:01:05,470
This is a powerful technology, but it has limitations.

10
00:01:05,470 --> 00:01:11,440
So in fields like health, you have to be super careful using this.

11
00:01:12,140 --> 00:01:13,160
Second.

12
00:01:14,250 --> 00:01:19,500
GPT-4 Vision may not perform optimally right now,

13
00:01:19,500 --> 00:01:29,040
today, when handling images with text in non-Latin alphabets such as Japanese or Korean.

14
00:01:29,400 --> 00:01:29,880
Okay.

15
00:01:29,880 --> 00:01:31,500
Second limitation.

16
00:01:31,500 --> 00:01:37,830
We have many students from Japan and South Korea, so pay attention to this.

17
00:01:37,860 --> 00:01:47,430
If the image contains text in a non-Latin alphabet, be careful with that, because the

18
00:01:47,430 --> 00:01:50,220
model may not perform optimally.

19
00:01:54,230 --> 00:01:56,180
Regarding small text.

20
00:01:56,450 --> 00:02:06,980
OpenAI's recommendation is to enlarge text within the image to improve readability, but avoid cropping

21
00:02:06,980 --> 00:02:09,139
important details.

22
00:02:09,169 --> 00:02:15,020
Okay, so be careful with text that is too small in your image.

23
00:02:16,370 --> 00:02:17,960
About rotation.

24
00:02:19,530 --> 00:02:27,750
GPT-4 Vision may misinterpret rotated or upside-down text or images.

25
00:02:28,170 --> 00:02:34,980
Therefore, be careful with that as well: if your image is not in the proper orientation,

26
00:02:35,010 --> 00:02:43,230
GPT-4 Vision can be confused, so it's very important to prepare the image before working with

27
00:02:43,230 --> 00:02:45,240
GPT-4 Vision.

28
00:02:47,400 --> 00:02:49,020
Next,

29
00:02:49,050 --> 00:02:49,620
GPT-4

30
00:02:49,650 --> 00:03:01,650
Vision may struggle to understand graphs or text where colors or styles, like solid, dashed, or dotted lines,

31
00:03:01,650 --> 00:03:02,340
vary.

32
00:03:03,430 --> 00:03:03,910
Okay.

33
00:03:03,910 --> 00:03:14,050
So you need to understand all these scenarios where GPT-4 Vision is not going to work well and avoid

34
00:03:14,050 --> 00:03:21,310
them in order to have an application that works today at a professional level.

35
00:03:22,570 --> 00:03:31,990
Next, GPT-4 Vision may generate incorrect descriptions or captions in certain scenarios.

36
00:03:32,170 --> 00:03:35,350
They don't give us more detail about that.

37
00:03:35,590 --> 00:03:46,510
As you can see, they provide a very general statement, covering themselves against possible mistakes,

38
00:03:46,510 --> 00:03:47,380
possible errors.

39
00:03:47,380 --> 00:03:56,710
So you will see that when we talk about the different methods we can use in order to build a multimodal

40
00:03:56,710 --> 00:04:05,050
LLM application, we are going to see different degrees of accuracy depending on the different methods

41
00:04:05,050 --> 00:04:05,530
we use.

42
00:04:05,530 --> 00:04:11,860
So that is just a short preview of what we are going to see later regarding accuracy.

43
00:04:12,880 --> 00:04:15,160
What about the image shape?

44
00:04:15,160 --> 00:04:22,600
So GPT-4 Vision struggles with panoramic and fisheye images.

45
00:04:22,600 --> 00:04:25,240
So this is very important.

46
00:04:25,690 --> 00:04:34,420
If your business or your application is going to handle these kinds of images, be careful with

47
00:04:34,420 --> 00:04:37,930
panoramic images and fisheye images.

48
00:04:41,340 --> 00:04:45,150
Also about metadata and resizing.

49
00:04:45,510 --> 00:04:55,470
Right now, GPT-4 Vision doesn't process original file names or metadata, and images are resized

50
00:04:55,470 --> 00:05:01,260
before analysis, affecting their original dimensions.

51
00:05:01,290 --> 00:05:04,980
Okay, so be careful about this as well.

52
00:05:05,100 --> 00:05:07,260
A couple of additional things.

53
00:05:07,740 --> 00:05:08,520
First.

54
00:05:09,550 --> 00:05:15,190
GPT-4 Vision may give approximate counts for objects in images.

55
00:05:15,190 --> 00:05:18,250
So do you remember the first example?

56
00:05:18,250 --> 00:05:25,930
We saw this picture with some children and we asked GPT-4 Vision, how many children do we have here?

57
00:05:25,930 --> 00:05:27,340
And it said seven.

58
00:05:27,340 --> 00:05:28,180
And it was correct.

59
00:05:28,180 --> 00:05:30,850
But this is not always the case.

60
00:05:31,060 --> 00:05:35,170
In some cases the count is going to be approximate.

61
00:05:35,170 --> 00:05:36,730
It's not going to be exact.

62
00:05:36,730 --> 00:05:38,710
So be careful with that as well.

63
00:05:38,980 --> 00:05:42,850
And finally, regarding CAPTCHAs, I already told you.

64
00:05:43,570 --> 00:05:51,520
OpenAI says for safety reasons, we have implemented a system to block the submission of CAPTCHAs.

65
00:05:51,970 --> 00:05:56,680
You see, because GPT-4 Vision was

66
00:05:57,620 --> 00:05:59,150
breaking all the CAPTCHAs.

67
00:05:59,150 --> 00:06:08,270
So this is a very interesting note about the power and the potential of multimodal LLMs.

68
00:06:08,750 --> 00:06:18,860
So this is the question you may ask: if GPT-4 Vision is so good, what do we need multimodal LLM applications

69
00:06:18,860 --> 00:06:19,520
for?

70
00:06:20,180 --> 00:06:29,540
If you remember, when we initially talked about the regular ChatGPT model and LLM applications, we

71
00:06:29,540 --> 00:06:33,920
also asked the same question.

72
00:06:34,130 --> 00:06:38,270
So let's see the answer in the next lesson.

