1
00:00:05,460 --> 00:00:12,900
In this lesson, we are going to talk about which LLMs, which foundation LLMs, we will use to

2
00:00:12,900 --> 00:00:16,500
create multimodal LLM applications.

3
00:00:21,380 --> 00:00:24,410
As you know, the problem we have

4
00:00:25,260 --> 00:00:33,540
when we are trying to develop multimodal applications is that the regular LLMs we have been

5
00:00:33,540 --> 00:00:41,460
using until now, like ChatGPT, for example, have several limitations

6
00:00:41,610 --> 00:00:49,020
regarding the use of images, tables, and other elements besides text.

7
00:00:49,320 --> 00:00:56,640
The same is true of most of the foundation LLMs on the market right now.

8
00:00:57,530 --> 00:01:06,800
So the solution we have is to use the new multimodal foundation LLMs.

9
00:01:06,980 --> 00:01:09,320
These are extremely new.

10
00:01:09,350 --> 00:01:18,170
They are not as mature or as stable as the regular LLMs we know.

11
00:01:18,410 --> 00:01:24,230
But they are extremely interesting and have huge potential.

12
00:01:24,910 --> 00:01:30,310
The first pioneers in this field were CLIP, LLaVA, and Fuyu.

13
00:01:30,340 --> 00:01:34,660
These are extremely new multimodal models.

14
00:01:34,660 --> 00:01:44,200
We are talking about CLIP from around 2021, while LLaVA and Fuyu were created in 2023.

15
00:01:44,590 --> 00:01:54,430
In the same year, we got the current market leaders, GPT-4 Vision and Gemini Pro.

16
00:01:54,460 --> 00:02:04,420
As you know, GPT-4 Vision is a paid model from OpenAI, and Gemini Pro is a paid model from Google.

17
00:02:05,050 --> 00:02:08,530
Right now, the main leader is GPT-4 Vision.

18
00:02:08,530 --> 00:02:16,240
But Gemini Pro is doing a very good job in the multimodal space.

19
00:02:16,600 --> 00:02:25,840
The pioneers, CLIP, LLaVA, and Fuyu, are all open source, and we trust that in the coming months they are

20
00:02:25,840 --> 00:02:31,150
going to be at a level more similar to GPT-4 Vision.

21
00:02:31,150 --> 00:02:38,380
Let's see whether GPT-4 Vision evolves and becomes far better than it is today.

22
00:02:39,790 --> 00:02:43,000
This caveat here is very important.

23
00:02:43,000 --> 00:02:52,900
So when we are talking about multimodal LLM applications or multimodal foundation LLMs, we need

24
00:02:52,900 --> 00:02:57,670
to understand that they are in a very early stage.

25
00:02:58,270 --> 00:03:05,950
Multimodal LLMs are still far from the maturity of regular LLMs.

26
00:03:05,950 --> 00:03:07,240
This is very important.

27
00:03:07,240 --> 00:03:16,600
So right now is a good moment for experimentation and a good moment to position yourself for the immediate

28
00:03:16,600 --> 00:03:17,470
future.

29
00:03:17,470 --> 00:03:28,720
But it may still be a bit early for production applications, because you will see that even the most

30
00:03:28,720 --> 00:03:38,020
advanced multimodal LLMs, like GPT-4 Vision, have some problems.

31
00:03:38,020 --> 00:03:48,130
They are not 100% stable or 100% usable right now for production applications.

32
00:03:48,130 --> 00:03:55,630
So I would say right now is a very good moment for experimentation and for positioning yourself for the immediate

33
00:03:55,630 --> 00:03:56,410
future.

34
00:03:56,620 --> 00:04:04,090
The evolution of these multimodal models is really, really fast.

35
00:04:04,090 --> 00:04:05,680
It has been very, very fast.

36
00:04:05,680 --> 00:04:06,550
So

37
00:04:07,190 --> 00:04:14,300
we don't know when the moment will come when they are stable enough to be used in production

38
00:04:14,300 --> 00:04:20,450
applications, but I think we may be very near right now.

39
00:04:20,450 --> 00:04:26,750
I'm speaking on March 15th, 2024.

40
00:04:26,750 --> 00:04:36,770
So probably by May or June, we will start seeing solid production applications in this space.

41
00:04:36,770 --> 00:04:37,580
Let's see.

42
00:04:39,360 --> 00:04:51,750
In the next lesson, we are going to start studying the best multimodal LLM foundation models we have

43
00:04:51,750 --> 00:04:53,010
right now in the market.

44
00:04:53,010 --> 00:04:56,940
We will see how to use GPT-4 Vision,

45
00:04:56,940 --> 00:05:04,050
what the main use cases for GPT-4 Vision are, and what the limitations of this model are.

46
00:05:04,050 --> 00:05:08,190
We will see this starting in the following lesson.