1
00:00:04,939 --> 00:00:05,360
Okay.

2
00:00:05,360 --> 00:00:14,360
So, very quickly, in this lesson we are going to see the alternative ways to create multimodal LLM applications.

3
00:00:14,360 --> 00:00:17,420
And which one is the best

4
00:00:17,630 --> 00:00:22,550
right now, in terms of efficiency and accuracy.

5
00:00:25,200 --> 00:00:40,260
So the key operation in multimodal LLM applications is the process we use to convert an image into embeddings.

6
00:00:40,260 --> 00:00:41,910
This is the key operation.

7
00:00:41,910 --> 00:00:48,720
Do you remember what the key operation was in our traditional applications?

8
00:00:48,720 --> 00:00:57,600
It was also this process, right? The process we use to generate our embeddings.

9
00:00:57,750 --> 00:01:00,150
We have different alternatives.

10
00:01:00,390 --> 00:01:03,990
Some of them are more efficient or more accurate than others, etc.

11
00:01:03,990 --> 00:01:08,820
So it's very important to understand this step very well.

12
00:01:08,820 --> 00:01:09,180
Right.

13
00:01:09,180 --> 00:01:15,510
So in multimodal applications, we have the same key operation.

14
00:01:15,510 --> 00:01:22,920
How to go from the raw image to the embeddings we are going to use in our RAG technique.

15
00:01:23,860 --> 00:01:25,930
So right now we have

16
00:01:27,010 --> 00:01:33,370
mostly two alternatives to go from image to embeddings.

17
00:01:33,460 --> 00:01:40,630
The first one is to convert the image directly into embeddings.

18
00:01:40,960 --> 00:01:48,580
And this should be, you know, the one that we expect to use, right?

19
00:01:48,580 --> 00:01:57,880
Because it seems to be the most direct and the most efficient, but it

20
00:01:57,880 --> 00:02:08,979
is not as efficient and as accurate today as the second one, which is to create a text summary of the

21
00:02:08,979 --> 00:02:18,730
image, like what we call a caption of the image, and then convert that summary into embeddings.

22
00:02:18,730 --> 00:02:24,100
So the first part, creating the summary, is what GPT-4 Vision does.

23
00:02:25,310 --> 00:02:35,240
You give an image to GPT-4 Vision, and GPT-4 Vision tells you what is in that image.

24
00:02:35,270 --> 00:02:39,440
Do you remember all the use cases we saw in the previous lesson?

25
00:02:39,470 --> 00:02:45,860
So this is the part we are going to use GPT-4 Vision for.

26
00:02:47,350 --> 00:02:56,110
And then, once we have this text summary of the image, we will convert that summary into embeddings

27
00:02:56,110 --> 00:02:58,990
using our regular RAG technique.

28
00:02:59,260 --> 00:03:07,630
And you will see that we are going to use the regular ChatGPT for our RAG technique.

29
00:03:07,930 --> 00:03:08,950
We will see this later.

30
00:03:08,950 --> 00:03:12,460
So we are going to combine both models.

31
00:03:12,460 --> 00:03:22,060
We will use GPT-4 to create a text summary of the image, and then we will convert that summary into

32
00:03:22,060 --> 00:03:25,900
embeddings using the regular ChatGPT model.

33
00:03:25,900 --> 00:03:31,510
Okay, you will see how we do this in the next lesson.

34
00:03:31,510 --> 00:03:42,520
So we are going to see how to create multimodal LLM applications with an orchestration

35
00:03:42,520 --> 00:03:44,710
framework like LangChain.

36
00:03:44,710 --> 00:03:50,260
So we are first going to see the concepts and the steps.

37
00:03:50,260 --> 00:03:53,920
And then we will see the project in practice.
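
As a rough sketch of the second alternative described in this lesson (create a text summary of each image, then convert that summary into embeddings), the flow might look like the code below. Everything named here is an assumption for illustration: `caption_image` and `embed_text` are hypothetical callables standing in for a vision model such as GPT-4 Vision and for a text embedding model, and `toy_caption` and `toy_embed` are toy stand-ins so the sketch runs without any API. It is not the course's actual implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class IndexedImage:
    """One image, its text summary (caption), and the caption's embedding."""
    image_path: str
    caption: str
    embedding: List[float]


def index_images(
    image_paths: Sequence[str],
    caption_image: Callable[[str], str],      # hypothetical: vision model call
    embed_text: Callable[[str], List[float]],  # hypothetical: embedding model call
) -> List[IndexedImage]:
    """Step 1: caption each image. Step 2: embed the caption, not the pixels."""
    records = []
    for path in image_paths:
        caption = caption_image(path)
        records.append(IndexedImage(path, caption, embed_text(caption)))
    return records


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query: str,
    records: Sequence[IndexedImage],
    embed_text: Callable[[str], List[float]],
    top_k: int = 1,
) -> List[IndexedImage]:
    """Retrieve the images whose caption embeddings are closest to the query."""
    query_embedding = embed_text(query)
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(query_embedding, r.embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Toy stand-ins (assumptions) so the sketch runs offline.
TOY_CAPTIONS = {
    "cat.jpg": "a cat sleeping on a sofa",
    "sales.png": "a bar chart of quarterly sales",
}
TOY_VOCAB = ["cat", "sofa", "chart", "sales"]


def toy_caption(image_path: str) -> str:
    # A real application would send the image to a vision model here.
    return TOY_CAPTIONS[image_path]


def toy_embed(text: str) -> List[float]:
    # A real application would call a text embedding model here.
    words = text.lower().split()
    return [float(words.count(term)) for term in TOY_VOCAB]


records = index_images(["cat.jpg", "sales.png"], toy_caption, toy_embed)
best = search("sales chart", records, toy_embed)[0]
print(best.image_path, "->", best.caption)
```

Passing the two models in as callables keeps the two steps of the pipeline separate, so either one can be swapped for a real model call without touching the indexing or retrieval logic.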