1
00:00:03,150 --> 00:00:10,320
Okay, so let's talk about some advanced tips regarding LangSmith datasets.

2
00:00:10,320 --> 00:00:19,290
First, how to evaluate your LLM application with a test dataset in the prompt.

3
00:00:19,290 --> 00:00:25,110
In the prototyping phase, you will create your own test dataset.

4
00:00:25,900 --> 00:00:26,860
In the beta testing phase,

5
00:00:26,890 --> 00:00:31,390
you will add to that initial dataset

6
00:00:32,110 --> 00:00:35,920
examples of real feedback from your beta users,

7
00:00:36,310 --> 00:00:38,290
mostly relevant cases

8
00:00:38,290 --> 00:00:44,710
where the user has labeled the LLM answer as thumbs up or thumbs down.

9
00:00:45,040 --> 00:00:52,870
So this is going to be the most frequent use case for user feedback.

10
00:00:52,870 --> 00:00:56,860
The thumbs up and thumbs down buttons.

11
00:00:56,860 --> 00:01:00,730
You see this in the ChatGPT app, right?

12
00:01:00,730 --> 00:01:07,510
So this is right now the most common way to get feedback from the beta users and also from the final

13
00:01:07,510 --> 00:01:08,230
users.

14
00:01:09,370 --> 00:01:18,010
You can use the test dataset to evaluate different versions of your LLM application with different LLM

15
00:01:18,010 --> 00:01:26,740
models, different LLM model features, and compare the performance in terms of accuracy, latency,

16
00:01:26,740 --> 00:01:35,200
cost, etc. It is very useful to use the comparison view to compare the performance of different versions

17
00:01:35,200 --> 00:01:39,220
of the LLM application with the test dataset.

18
00:01:39,220 --> 00:01:41,470
So remember, if we go to the platform,

19
00:01:42,830 --> 00:01:44,840
you remember that

20
00:01:44,840 --> 00:01:55,160
in the Datasets and Testing dashboard, if we go to one dataset and look at the tests we have,

21
00:01:55,160 --> 00:01:57,800
here we have just one, but if we have more than one,

22
00:01:57,800 --> 00:02:07,340
we can click here on the Compare button to see different graphs of the performance

23
00:02:07,490 --> 00:02:09,380
of the different tests.

24
00:02:09,380 --> 00:02:09,740
Right.

25
00:02:09,740 --> 00:02:14,330
With the comparison view, this is the button that activates the comparison view.

26
00:02:14,360 --> 00:02:17,900
We will see this in more detail in the next lesson.

27
00:02:17,900 --> 00:02:22,550
Especially when we get to the professional project, okay?

28
00:02:22,550 --> 00:02:30,740
So as you can see, evaluating your LLM application with a test dataset is relatively easy.

29
00:02:32,000 --> 00:02:36,950
How many examples should the test dataset have?

30
00:02:36,950 --> 00:02:42,500
So this is a question that the LangChain team has answered.

31
00:02:44,130 --> 00:02:47,370
They say... well, the LangSmith team.

32
00:02:47,880 --> 00:02:52,710
The LangChain team or the LangSmith team, they are the same.

33
00:02:54,330 --> 00:02:55,650
There is a typo here.

34
00:02:57,040 --> 00:02:58,120
They say

35
00:02:59,310 --> 00:03:05,130
that the average test dataset has around 20 examples

36
00:03:05,130 --> 00:03:11,550
when an LLM application development team starts the beta testing phase.

37
00:03:12,250 --> 00:03:19,390
But the right number really depends on each project and how much time and effort they want

38
00:03:19,390 --> 00:03:22,300
or can invest in evaluation.

39
00:03:22,300 --> 00:03:31,960
So I found this very interesting, because I thought that the test dataset was going to have, like,

40
00:03:31,960 --> 00:03:35,530
100 or, you know, hundreds of examples.

41
00:03:35,530 --> 00:03:48,850
But the LangChain team tells us that the average test dataset they have observed has around 20 examples.

42
00:03:49,720 --> 00:03:55,630
So in my opinion, this is a little bit too low.

43
00:03:55,630 --> 00:03:58,510
But this is what they have observed.

44
00:03:59,290 --> 00:04:05,560
And what they say is that, okay, the right number really depends on each project and how much time

45
00:04:05,560 --> 00:04:09,370
and effort they want or can invest in evaluation.

46
00:04:09,400 --> 00:04:13,750
Of course, there is also another thing to keep in mind.

47
00:04:13,750 --> 00:04:19,720
The bigger the dataset, the bigger the cost associated with it.

48
00:04:19,720 --> 00:04:29,590
But, well, this is right now the information that the LangChain team has shared with us.

49
00:04:30,160 --> 00:04:33,880
Another interesting advanced tip: LangSmith

50
00:04:33,880 --> 00:04:39,310
datasets can be used for more things other than evaluation.

51
00:04:39,430 --> 00:04:48,910
The main use of LangSmith datasets is evaluation, but some teams have also used them for other

52
00:04:48,910 --> 00:04:54,040
purposes, like few-shot prompting or even fine-tuning.

53
00:04:54,250 --> 00:04:55,750
Okay, I find this

54
00:04:57,160 --> 00:05:04,210
curious, but I would say 99% of us are going to use datasets for evaluation.

55
00:05:05,440 --> 00:05:10,270
And finally, offline evaluation versus online evaluation.

56
00:05:10,990 --> 00:05:15,520
Offline evaluation is the current LangSmith evaluation.

57
00:05:15,520 --> 00:05:21,460
Your LLM application is tested against a test dataset.

58
00:05:21,460 --> 00:05:26,710
This is the offline evaluation that we have right now in the LangSmith platform.

59
00:05:27,790 --> 00:05:33,640
The online evaluation is the next LangSmith feature they are preparing for us.

60
00:05:35,010 --> 00:05:38,970
Evaluators will run on a sample of your traffic.

61
00:05:38,970 --> 00:05:47,850
For example, evaluate 20% of your downvoted traces with a particular evaluator in production with real

62
00:05:47,850 --> 00:05:48,300
data.

63
00:05:48,300 --> 00:05:51,330
So this is what they are trying to accomplish.

64
00:05:51,330 --> 00:05:52,440
And

65
00:05:53,330 --> 00:05:55,730
hopefully we will see this soon.

66
00:05:55,760 --> 00:06:04,940
Okay, so after seeing all these very interesting advanced tips regarding LangSmith datasets,

67
00:06:04,940 --> 00:06:14,120
we are going to see how LangSmith helps us solve the main challenge we find during the beta testing

68
00:06:14,120 --> 00:06:14,510
phase.

69
00:06:14,510 --> 00:06:17,450
We will see this in the next lesson.