1
00:00:05,520 --> 00:00:12,810
Okay, so this is going to be a super interesting lesson for understanding what you can do with

2
00:00:12,810 --> 00:00:19,590
GPT-4 Vision and what you can do with your multimodal LLM applications.

3
00:00:19,950 --> 00:00:30,630
What are the main use cases of these new multimodal LLM applications or multimodal LLM foundation models?

4
00:00:33,010 --> 00:00:44,260
So there are different ways to classify the use cases of multimodal LLM models like

5
00:00:44,260 --> 00:00:44,740
GPT-4 Vision.

6
00:00:44,770 --> 00:00:53,470
We can use these six main generic use cases.

7
00:00:53,470 --> 00:00:59,230
And we are going to see examples of all of them and applications in different industries.

8
00:00:59,410 --> 00:01:06,940
So the first generic use case will be to identify and describe visual content.

9
00:01:07,390 --> 00:01:11,440
The second, to analyze diagrams and images.

10
00:01:12,250 --> 00:01:16,030
The third, to provide critiques and recommendations.

11
00:01:17,170 --> 00:01:23,290
The fourth, to convert an image into something new.

12
00:01:24,250 --> 00:01:24,940
Five.

13
00:01:24,970 --> 00:01:33,010
Extract data from an image. And six, to solve visual-based tasks.

14
00:01:33,010 --> 00:01:41,470
So let's see examples of all of them and also applications in different industries.

15
00:01:42,090 --> 00:01:53,070
So the first main generic use case of multimodal LLM models and also multimodal LLM applications

16
00:01:53,070 --> 00:01:58,170
is to identify and describe an image, or

17
00:01:59,530 --> 00:02:02,020
elements in an image. Okay.

18
00:02:02,140 --> 00:02:11,080
So we can use GPT-4 Vision or our LLM applications, as you will see at the end of this blog, to describe

19
00:02:11,080 --> 00:02:14,500
what is in one particular image.

20
00:02:15,100 --> 00:02:20,170
And we can also identify and describe visual content.

21
00:02:20,170 --> 00:02:29,380
For example, we can give this image to GPT-4 and ask, "Can you count the number of children in

22
00:02:29,380 --> 00:02:35,320
the image?" And GPT-4 Vision is able to answer. "Can you tell me what they are doing?"

23
00:02:35,320 --> 00:02:42,340
And GPT-4 is able to tell us. Okay, so this is a very simple image, but as you will see, it can

24
00:02:42,340 --> 00:02:44,890
get more complicated than this.

25
00:02:46,930 --> 00:02:55,660
Regarding the second use case, analyzing images,

26
00:02:55,660 --> 00:03:01,870
what can we do? We can analyze medical diagrams and imagery.

27
00:03:01,900 --> 00:03:05,110
We will see that, for example, GPT-4 Vision has a limitation

28
00:03:05,110 --> 00:03:06,250
regarding that.

29
00:03:06,970 --> 00:03:11,020
We can analyze tech diagrams and schemas.

30
00:03:11,020 --> 00:03:14,410
We can analyze images and deduce context.

31
00:03:14,440 --> 00:03:15,910
We will see examples of that.

32
00:03:15,910 --> 00:03:19,150
We can do sentiment analysis from an image.

33
00:03:19,150 --> 00:03:26,260
We can do artistic interpretation, and we can do data analysis from charts, for example.

34
00:03:26,260 --> 00:03:35,200
And these are just a few examples of how we can apply this image analysis using multimodal

35
00:03:35,200 --> 00:03:37,510
LLM models or applications.

36
00:03:38,140 --> 00:03:49,450
So instead of a photo or drawing like the one we presented before, we can load a diagram, a graphic like this,

37
00:03:49,450 --> 00:03:55,870
and we can ask GPT-4, "Can you explain this graph and provide insights?"

38
00:03:55,870 --> 00:04:04,090
And GPT-4 Vision is absolutely able to explain diagrams.

39
00:04:04,090 --> 00:04:07,630
And you will see it's really amazing what it can do.

40
00:04:07,660 --> 00:04:09,160
We will see more examples.

41
00:04:10,550 --> 00:04:20,209
So when we talk about the use case to provide critiques and recommendations, we are talking about much

42
00:04:20,209 --> 00:04:26,030
more than simply analyzing or describing what is in an image.

43
00:04:26,030 --> 00:04:27,950
Here we are talking about

44
00:04:28,670 --> 00:04:30,620
critiquing an image,

45
00:04:31,360 --> 00:04:40,780
providing feedback on an image, providing recommended actions based on an image, or evaluating

46
00:04:40,780 --> 00:04:49,450
an image aesthetically, from an accuracy point of view, or even subjectively.

47
00:04:49,450 --> 00:04:52,900
This is super, super advanced.

48
00:04:53,140 --> 00:04:55,000
So let's see some examples.

49
00:04:55,000 --> 00:05:06,970
For example, we can load a weird image like this into ChatGPT-4 Vision and ask, "What is unusual about

50
00:05:06,970 --> 00:05:07,960
this image?"

51
00:05:08,690 --> 00:05:19,250
And as you can read in the answer, ChatGPT-4 is totally able to make a subjective evaluation of

52
00:05:19,250 --> 00:05:19,880
an image.

53
00:05:19,880 --> 00:05:28,820
So we are talking about very, very powerful stuff with many applications in many fields.

54
00:05:30,040 --> 00:05:41,080
We can also convert an image, or use an image to create something different, like a storyline,

55
00:05:41,080 --> 00:05:47,380
like a prompt, like a recommendation, or like any other actionable format.

56
00:05:47,620 --> 00:05:50,920
For example, imagine what we can do.

57
00:05:51,370 --> 00:05:52,690
We can

58
00:05:53,540 --> 00:06:01,220
take a photograph of an application, a web application, or a dashboard like this one,

59
00:06:01,870 --> 00:06:05,860
and ask ChatGPT-4 Vision,

60
00:06:06,820 --> 00:06:12,100
"Can you tell me how to code a web application like this?"

61
00:06:13,480 --> 00:06:14,830
This is amazing.

62
00:06:15,370 --> 00:06:18,820
We can even go further than that.

63
00:06:18,970 --> 00:06:28,780
We can just provide a handwritten draft of the web application we want and we can ask for the code.

64
00:06:28,780 --> 00:06:32,320
So the prompt is: "Here is the design for a blogging website.

65
00:06:32,320 --> 00:06:38,890
Provide working source code for the website using HTML, CSS, and JavaScript as required."

66
00:06:39,640 --> 00:06:43,840
Imagine what we can do with these models.

67
00:06:45,310 --> 00:06:50,590
We can also extract data from handwritten text,

68
00:06:50,770 --> 00:06:54,970
structured data from an image, or subjective data from an image.

69
00:06:54,970 --> 00:06:58,840
For example, we can provide an image like this.

70
00:06:58,840 --> 00:07:01,270
This is a wheel,

71
00:07:02,440 --> 00:07:12,460
a tire, and we can ask ChatGPT-4, "Read the serial number. Return only the number without additional

72
00:07:12,460 --> 00:07:13,150
text."

73
00:07:13,150 --> 00:07:21,010
And as you can see, ChatGPT-4 is able to read the serial number of the tire.

74
00:07:22,680 --> 00:07:29,280
We can also extract text from a handwritten note.

75
00:07:30,200 --> 00:07:34,910
So this is, as you can see, an old manuscript.

76
00:07:34,910 --> 00:07:37,910
And we are asking ChatGPT-4 Vision,

77
00:07:37,910 --> 00:07:38,960
"Can you read this?"

78
00:07:38,960 --> 00:07:41,360
And here you have it.

79
00:07:42,980 --> 00:07:45,920
We can also ask for recommendations.

80
00:07:45,920 --> 00:07:47,360
And this is amazing.

81
00:07:47,360 --> 00:07:48,920
So see here.

82
00:07:48,920 --> 00:07:51,470
This is just a picture of a plant.

83
00:07:52,180 --> 00:07:53,470
And this is what we say.

84
00:07:53,470 --> 00:07:58,420
"What is this plant, and how should I care for it?"

85
00:07:58,450 --> 00:08:10,780
So ChatGPT-4 identifies the kind of plant we have in the picture and provides care

86
00:08:10,780 --> 00:08:13,990
tips for this kind of plant.

87
00:08:14,680 --> 00:08:15,700
It's amazing.

88
00:08:17,550 --> 00:08:23,700
We can also use ChatGPT-4 Vision to extract data and use it, like this.

89
00:08:23,730 --> 00:08:30,930
We can take a photograph of traffic signs like these and we can ask,

90
00:08:30,930 --> 00:08:32,789
"Suppose

91
00:08:32,789 --> 00:08:36,720
it is Wednesday and the time is 4 p.m.

92
00:08:37,140 --> 00:08:41,280
Am I allowed to park my car at this spot?"

93
00:08:41,669 --> 00:08:44,010
This is really amazing.

94
00:08:44,010 --> 00:08:50,520
And boom, immediately you have the right answer from ChatGPT-4 Vision.

95
00:08:51,320 --> 00:08:57,500
We can also do things like solve visual-based tasks.

96
00:08:57,500 --> 00:09:01,190
For example, we could use

97
00:09:02,010 --> 00:09:07,320
ChatGPT-4 to solve most CAPTCHAs out there.

98
00:09:07,770 --> 00:09:17,070
The problem is that the people from OpenAI realized this and have introduced

99
00:09:17,070 --> 00:09:18,300
some changes.

100
00:09:18,300 --> 00:09:25,500
So right now we cannot use ChatGPT-4 Vision for this, but it was able to do that.

101
00:09:25,950 --> 00:09:29,490
We can also solve other visual-based tasks.

102
00:09:29,850 --> 00:09:36,240
We can explain visual situations, and we can even make ChatGPT-4 Vision

103
00:09:36,240 --> 00:09:37,710
recommend

104
00:09:37,710 --> 00:09:43,110
a strategy to follow based on what we see in an image.

105
00:09:44,150 --> 00:09:46,820
So as you can see,

106
00:09:47,810 --> 00:09:53,750
the number of opportunities and new scenarios

107
00:09:54,320 --> 00:10:05,690
these new multimodal LLM models and applications open is amazing. We weren't even able to take advantage of

108
00:10:06,260 --> 00:10:11,060
all the opportunities we have with regular

109
00:10:11,850 --> 00:10:13,890
LLM applications.

110
00:10:13,890 --> 00:10:20,880
And now here you have a totally different universe to conquer.

111
00:10:20,910 --> 00:10:27,360
It is absolutely available for you to build a lot of new things for many different industries.

112
00:10:27,360 --> 00:10:30,570
So we are in a super exciting moment.

113
00:10:30,570 --> 00:10:38,730
And in this blog you are going to learn how to create a multimodal LLM application, and you will see

114
00:10:38,730 --> 00:10:44,790
the number of uses you can make of this new technology.

115
00:10:44,790 --> 00:10:54,300
Okay, so let's see in the next lesson some important limitations we will have to keep in mind when

116
00:10:54,300 --> 00:11:02,610
we are thinking about multimodal LLM applications and also multimodal LLM models like GPT-4 Vision.

117
00:11:02,610 --> 00:11:05,760
What are the limitations as of today?

118
00:11:05,760 --> 00:11:09,720
This, as you know, is evolving very quickly.

119
00:11:09,720 --> 00:11:15,990
So as of today, GPT-4 Vision has some important limitations that you need to know.

120
00:11:15,990 --> 00:11:19,980
So let's see these limitations in the next lesson.