1
00:00:02,750 --> 00:00:10,730
In this application, which is still a basic application, we are going to do a very sophisticated task,

2
00:00:10,730 --> 00:00:14,930
which is the evaluation of an application.

3
00:00:15,680 --> 00:00:16,790
So.

4
00:00:17,730 --> 00:00:24,420
We are going to perform two different kinds of evaluations, and you will see that the second

5
00:00:24,420 --> 00:00:26,310
one is really amazing.

6
00:00:26,490 --> 00:00:33,540
So the problem we want to solve is evaluating the quality of a question-and-answer application

7
00:00:33,540 --> 00:00:34,290
on a document.

8
00:00:34,290 --> 00:00:37,410
This is something we have made before.

9
00:00:37,830 --> 00:00:45,540
So the first step is going to be to reproduce the creation of a question-and-answer application.

10
00:00:45,540 --> 00:00:50,310
We are going to load a document and we want to ask questions about this document.

11
00:00:50,310 --> 00:00:50,670
Right.

12
00:00:50,670 --> 00:00:52,140
We know how to do that.

13
00:00:52,560 --> 00:00:57,930
But what we know is that evaluating an LLM application is not easy.

14
00:00:57,930 --> 00:01:03,960
You remember that in the previous chapters, talking from a more theoretical point

15
00:01:03,960 --> 00:01:11,130
of view, we knew that one of the things that is not easy in LLM applications is evaluation.

16
00:01:11,460 --> 00:01:18,660
So this is a problem because the answers to the same question can be slightly different.

17
00:01:18,660 --> 00:01:28,200
So if you ask ChatGPT the same question twice, you will very probably get two different responses.

18
00:01:28,200 --> 00:01:36,570
Probably these two responses are going to mean the same thing, but they are not going to be in the same

19
00:01:36,960 --> 00:01:38,820
word-for-word format.

20
00:01:38,820 --> 00:01:39,030
Right?

21
00:01:39,030 --> 00:01:47,520
So ChatGPT is going to answer the same question in a different way each time we ask it.

22
00:01:47,640 --> 00:01:55,530
And this makes it very difficult to evaluate the quality of an application, because if you have evaluated

23
00:01:55,530 --> 00:02:01,020
previously a conventional software application, or a machine learning model, for

24
00:02:01,020 --> 00:02:06,420
example, you are used to having a test set of questions and answers, right?

25
00:02:06,420 --> 00:02:07,380
We know the questions.

26
00:02:07,380 --> 00:02:08,310
We know the answers.

27
00:02:08,310 --> 00:02:10,320
So we check the application.

28
00:02:10,320 --> 00:02:13,620
If we have exactly these answers, the application is good.

29
00:02:13,620 --> 00:02:19,590
If the answers are not exactly that, we have a mistake or whatever, right?

30
00:02:19,590 --> 00:02:28,080
So in the case of an LLM application, we have a different situation because we are not going to have two

31
00:02:28,320 --> 00:02:34,920
responses that are exactly the same to the same question.

32
00:02:34,920 --> 00:02:40,500
So a conventional software application evaluation system cannot be applied.

33
00:02:40,500 --> 00:02:42,060
What are we going to do?

34
00:02:42,600 --> 00:02:49,230
First we are going to prepare a list of questions for which we know the answers.

35
00:02:49,230 --> 00:02:58,200
And then we are going to put them to the application using a chain, a LangChain chain, for manual evaluation.

36
00:02:58,200 --> 00:03:02,340
So the first thing we are going to do is manual evaluation.

37
00:03:02,340 --> 00:03:06,210
We are going to visualize the results of the application.

38
00:03:06,210 --> 00:03:10,320
And we are going to compare these results with our own answers.

39
00:03:11,400 --> 00:03:16,650
And the second evaluation we are going to perform is an automatic evaluation.

40
00:03:16,650 --> 00:03:23,940
So we are going to use the LLM application to evaluate itself.

41
00:03:23,940 --> 00:03:30,510
That is the really amazing part of this application, which is still a very basic application.

42
00:03:30,510 --> 00:03:35,970
So the process now is going to have two main parts.

43
00:03:35,970 --> 00:03:42,120
The first part is going to be what we already know, which is to build the QA application.

44
00:03:42,120 --> 00:03:47,460
And the second part, which is the interesting part now, is going to be the evaluation in these two

45
00:03:47,460 --> 00:03:48,330
different steps.

46
00:03:48,330 --> 00:03:52,380
First the manual evaluation and then the automatic evaluation.

47
00:03:53,290 --> 00:04:03,070
So if you look at the code in the first part, as usual, we are going to load the OpenAI API key in

48
00:04:03,070 --> 00:04:05,950
order to connect ourselves with ChatGPT.

49
00:04:06,520 --> 00:04:13,480
We are going to create an instance of the LLM, and we are going to load the document.

50
00:04:13,480 --> 00:04:21,399
Once we have the document, we are going to divide the document into different chunks using the text splitter.

51
00:04:21,399 --> 00:04:33,370
And we are going to use the result of this operation to create a vector database and store

52
00:04:33,370 --> 00:04:38,560
the embeddings we created from these document chunks.

53
00:04:39,130 --> 00:04:47,680
Once we have that, we are going to create a chain in order to query this document.

54
00:04:47,680 --> 00:04:48,490
Okay.

55
00:04:49,550 --> 00:04:53,660
In order to send questions to this document and get responses.

56
00:04:54,050 --> 00:05:00,830
The interesting thing about this chain we are creating here is this last line.

57
00:05:01,280 --> 00:05:07,940
As you can read here, notice that we have added input_key in the chain configuration.

58
00:05:07,940 --> 00:05:11,900
This tells the chain where the user prompt will be located.

59
00:05:11,900 --> 00:05:18,050
Okay, so this is the only difference from the previous basic application.

60
00:05:18,230 --> 00:05:23,810
We are going to evaluate this application with two questions and answers that we already know.

61
00:05:23,810 --> 00:05:31,250
These answers are technically known in the LangChain lingo as ground truth answers.

62
00:05:31,310 --> 00:05:38,480
So the first thing we are going to do is enter these two questions and answers that

63
00:05:38,480 --> 00:05:39,320
we already know.

64
00:05:39,320 --> 00:05:44,360
So we have read the document and we have these two

65
00:05:45,270 --> 00:05:48,870
questions and answers that we know for sure.

66
00:05:48,870 --> 00:05:54,960
So the first question is: where is a whole neighborhood of YC-funded startups?

67
00:05:54,960 --> 00:05:58,470
YC is Y Combinator, an accelerator in Silicon Valley.

68
00:05:58,470 --> 00:06:04,050
And the answer that we know from reading the document is in San Francisco.

69
00:06:04,970 --> 00:06:14,390
The second question based on the document is: what may be the most valuable thing Paul Buchheit

70
00:06:14,390 --> 00:06:16,310
made for Google?

71
00:06:16,310 --> 00:06:21,500
And the answer that we know is true is the motto "Don't be evil."

72
00:06:23,000 --> 00:06:29,210
Once we have these questions and answers that we know for sure, we are going to apply the original

73
00:06:29,210 --> 00:06:36,350
chain we created for the question-and-answer application to these questions.

74
00:06:36,560 --> 00:06:45,050
And we are going to compare the results we get with the answers we already know.

75
00:06:45,050 --> 00:06:47,270
So to the first question.

76
00:06:48,050 --> 00:06:53,570
Where is a whole neighborhood of YC-funded startups, the LLM,

77
00:06:53,570 --> 00:07:01,610
so ChatGPT, is answering: a whole neighborhood of YC-funded startups is in San Francisco.

78
00:07:01,610 --> 00:07:04,100
So it's correct.

79
00:07:04,310 --> 00:07:07,340
This is the answer we know is true.

80
00:07:07,760 --> 00:07:12,950
The second question: what may be the most valuable thing Paul Buchheit made for Google?

81
00:07:12,950 --> 00:07:20,960
The result is the answer from ChatGPT, and it says Paul Buchheit is credited with creating the phrase

82
00:07:20,960 --> 00:07:21,410
blah blah blah.

83
00:07:21,440 --> 00:07:23,330
Don't be evil. So, okay.

84
00:07:23,330 --> 00:07:24,470
The answer is correct.

85
00:07:24,710 --> 00:07:33,050
So the evaluation of this app has been positive, since the app has responded correctly to the evaluation questions,

86
00:07:33,050 --> 00:07:33,500
right?

87
00:07:33,500 --> 00:07:33,920
Okay.

88
00:07:33,920 --> 00:07:37,910
So if we do this manually we are satisfied.

89
00:07:38,570 --> 00:07:39,500
But.

90
00:07:40,740 --> 00:07:48,900
What if instead of two questions and answers, we have 2000 or 20,000 or 2 million, right?

91
00:07:48,930 --> 00:07:52,860
It depends on the size of the document we are evaluating.

92
00:07:53,130 --> 00:07:58,500
So the manual evaluation would be very painful.

93
00:07:58,920 --> 00:08:07,110
And luckily we have a way to do this using the LLM application.

94
00:08:07,110 --> 00:08:15,870
So instead of confirming that manually ourselves, we can ask the LLM to check if the responses are

95
00:08:15,870 --> 00:08:18,000
consistent with the ground

96
00:08:18,210 --> 00:08:20,910
truth answers.

97
00:08:24,000 --> 00:08:24,720
Okay.

98
00:08:25,400 --> 00:08:36,380
So what we are going to do is import a different component, which is the QAEvalChain.

99
00:08:36,380 --> 00:08:46,130
So you remember that for the QA application we imported the RetrievalQA component.

100
00:08:46,130 --> 00:08:48,620
And we used that to create our chain.

101
00:08:48,920 --> 00:08:55,160
In this case we are going to use a different chain, which is the QAEvalChain.

102
00:08:55,550 --> 00:09:06,800
And what we are going to do is apply this chain to the questions and answers we have.

103
00:09:06,980 --> 00:09:16,670
And basically what ChatGPT is going to do with this chain is to read these results, and it's going

104
00:09:16,670 --> 00:09:21,110
to compare each of the answers with the results.

105
00:09:21,110 --> 00:09:26,150
And if they are similar, ChatGPT is going to say correct.

106
00:09:26,150 --> 00:09:30,050
And if they are not, it will say incorrect.

107
00:09:30,050 --> 00:09:38,510
So after doing that automatic evaluation, ChatGPT is telling us that the first result is correct and

108
00:09:38,510 --> 00:09:40,820
the second result is also correct.

109
00:09:40,970 --> 00:09:46,880
So as you can see in this very simple application.

110
00:09:48,290 --> 00:09:53,810
We have achieved two very remarkable things.

111
00:09:53,810 --> 00:10:01,430
First, the manual evaluation of LLM applications and second, automatic evaluation.

112
00:10:01,670 --> 00:10:09,320
Remember that this is not going to solve 100% of the problem of evaluating LLM applications, because

113
00:10:09,320 --> 00:10:14,390
that is a complex matter, as we will see in the next chapters.

114
00:10:14,390 --> 00:10:24,140
But I think this is a very good step forward, and it is going to start giving you interesting concepts and

115
00:10:24,140 --> 00:10:24,860
tools.

