WEBVTT

00:00.120 --> 00:05.360
In this section, we'll dive into a technique called retrieval-augmented generation,

00:05.560 --> 00:07.200
also known as RAG.

00:07.440 --> 00:12.920
But before we go and dive into the technical implementation, I want to discuss the motivation for this

00:12.920 --> 00:16.360
technique and what this technique is going to help us achieve.

00:16.400 --> 00:17.800
And what problem it solves.

00:17.840 --> 00:18.200
All right.

00:18.200 --> 00:22.520
So let's take the use case that we have a very large document.

00:22.920 --> 00:26.920
And this document has tons of information inside it.

00:26.920 --> 00:30.000
So it may be hundreds of pages.

00:30.000 --> 00:36.560
And in this example I took the book Harry Potter and the Sorcerer's Stone, which is a pretty long book.

00:36.880 --> 00:44.920
Now, we might want to ask the LLM questions about this book, and specifically maybe questions focused

00:44.920 --> 00:50.400
on different areas in the book, or maybe specific scenes, maybe specific paragraphs.

00:50.400 --> 00:56.120
So questions, for example, like how to make a certain type of potion where the answer resides in a

00:56.120 --> 00:57.680
very particular paragraph.

01:01.710 --> 01:08.030
This example illustrates the problem using Harry Potter, but it also applies to other fields.

01:08.030 --> 01:14.350
For example, we might have a very large financial document that we want to ask questions about, and we

01:14.350 --> 01:19.550
want to find a certain clause in that document, because that's where the answer is going to come from.

01:19.830 --> 01:26.030
And this problem is especially important when we're dealing with private data, because large language

01:26.030 --> 01:30.150
models are not trained on this private data, so they are not aware of it.

01:30.350 --> 01:36.230
So because our financial document is private, the LLM doesn't really know anything about it.

01:36.230 --> 01:41.630
So we need to find a way for the LLM to effectively answer questions over this document.

01:41.990 --> 01:44.390
So this is the problem that we're solving here.

01:44.590 --> 01:46.310
Now let's talk about the solutions.

01:46.470 --> 01:52.390
Now, a very naive solution is to take the entire book.

01:52.390 --> 01:54.750
Let's say it's a PDF document.

01:55.110 --> 02:01.100
And simply plug everything into the prompt that we send to the LLM.

02:01.100 --> 02:08.180
So we're going to stuff the LLM with the entire book, and we are going to ask the question.

02:08.180 --> 02:11.060
So here's the placeholder for the user's question.

02:11.260 --> 02:12.660
And here's a placeholder

02:12.660 --> 02:16.740
where we are going to plug in the entire Harry Potter book.

02:16.780 --> 02:24.220
Now, this solution might work some of the time, but it has a lot of problems with it and it doesn't

02:24.220 --> 02:25.020
really scale.

02:25.340 --> 02:32.300
There is an inherent hard limit on how much text we can feed the LLM, so if the document is going to

02:32.300 --> 02:37.740
be too long, if this book is going to be too long, we're not going to be able to fit everything

02:37.740 --> 02:40.700
because we're going to exceed the LLM's token limit.
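
NOTE
Editor's sketch of the naive "stuff everything" prompt described above.
The template wording, the 8,000-token limit, and the rule of thumb of
about 4 characters per token are illustrative assumptions, not values
from the video or from any particular model.
```python
# Hypothetical prompt template with the two placeholders mentioned in the talk.
PROMPT_TEMPLATE = (
    "Answer the question using the book below.\n"
    "Book:\n{book}\n"
    "Question: {question}\n"
)
ASSUMED_TOKEN_LIMIT = 8_000  # made-up limit, for illustration only
#
def build_stuffed_prompt(book: str, question: str) -> str:
    """Plug the entire book and the user's question into the placeholders."""
    return PROMPT_TEMPLATE.format(book=book, question=question)
#
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4
#
def fits_in_context(prompt: str, limit: int = ASSUMED_TOKEN_LIMIT) -> bool:
    """True when the estimated token count stays within the assumed limit."""
    return estimate_tokens(prompt) <= limit
#
if __name__ == "__main__":
    question = "How do you make the potion?"
    short_book = "Harry looked at the potion. " * 10
    long_book = "Harry looked at the potion. " * 5_000
    print(fits_in_context(build_stuffed_prompt(short_book, question)))
    print(fits_in_context(build_stuffed_prompt(long_book, question)))
```
With these assumed numbers the short text fits and the long one does not,
which is the hard-limit problem in miniature.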

02:40.700 --> 02:48.220
And even if we have LLMs with a 1 million or 2 million token limit, which is quite common these days,

02:48.500 --> 02:50.860
this solution is still not ideal.

02:50.860 --> 02:53.820
It has many other disadvantages.

02:53.820 --> 03:01.940
For example, the LLM is going to be much less effective with very long prompts and very long contexts.

03:01.940 --> 03:04.220
And this has been shown in research.

03:04.260 --> 03:10.540
For example, the needle-in-a-haystack research shows that large language models,

03:10.540 --> 03:15.620
even with huge token limits, get less effective with very long prompts.

03:16.020 --> 03:22.860
There is also the issue of cost, because larger prompts are going to cost us more, and there is a

03:22.860 --> 03:26.980
latency issue because larger prompts are going to take longer to process.

03:27.060 --> 03:29.900
So let's conclude that we have four problems here.

03:30.100 --> 03:33.220
The first problem is that we have a hard limit

03:33.260 --> 03:39.620
on what we can put in. Second, we have the needle-in-a-haystack problem, which means that the longer

03:39.620 --> 03:43.100
the prompt, the less effective the answer from the LLM.

03:43.580 --> 03:45.380
Third, we have the cost issue.

03:45.380 --> 03:49.380
And fourth and last, we have the latency issue here.

03:50.180 --> 03:50.580
All right.

03:50.580 --> 03:53.610
So let's now look at another solution here.

03:54.410 --> 03:59.250
So the second solution is going to require us to add some pre-processing.

03:59.610 --> 04:03.970
It's going to require us to take the original document,

04:03.970 --> 04:09.370
and no matter how long it is, we're going to split it into smaller chunks.
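
NOTE
Editor's sketch of the naive end of the chunking spectrum: fixed-size
character chunks with a small overlap. The function name chunk_text and
the default sizes are hypothetical choices for illustration; the talk
does not prescribe a particular chunking method here.
```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where each chunk repeats the last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
#
if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```
The overlap is there so that text cut at a chunk boundary still shows
up together in a neighboring chunk.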

04:09.570 --> 04:14.290
Now, this chunking process can be naive, and it can also be complex.

04:14.290 --> 04:16.490
There is a wide range of ways to do it.

04:16.490 --> 04:18.930
And we're going to be discussing this in the course.

04:18.930 --> 04:23.170
So let's assume that we now have those chunks ready.

04:23.570 --> 04:31.050
Now in the second solution, instead of plugging in the entire book, we're going to add another step,

04:31.050 --> 04:39.170
which is going to take the user's query, and it's going to find the most relevant chunk for that query.

04:39.650 --> 04:46.170
We're going to plug into the LLM call only that chunk, which is the most relevant to the question

04:46.170 --> 04:46.730
here.
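
NOTE
Editor's sketch of picking the most relevant chunk for a query. The
scoring used here, counting words shared between the query and each
chunk, is a deliberately simple stand-in chosen for illustration; the
talk has not yet said how relevance is actually computed. The names
tokenize and most_relevant_chunk are hypothetical.
```python
import re
#
def tokenize(text: str) -> set[str]:
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))
#
def most_relevant_chunk(query: str, chunks: list[str]) -> str:
    """Return the chunk that shares the most words with the query."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    query_words = tokenize(query)
    return max(chunks, key=lambda chunk: len(query_words & tokenize(chunk)))
#
if __name__ == "__main__":
    example_chunks = [
        "The owl delivered a letter to the house.",
        "To brew the potion, add powdered root and stir slowly.",
        "The train left from the platform at eleven.",
    ]
    print(most_relevant_chunk("How do I brew the potion?", example_chunks))
```
Only the returned chunk, not the whole book, would then be placed in
the prompt.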

04:46.730 --> 04:54.200
So now we are focusing the large language model so that it grounds its answer only on

04:54.200 --> 04:58.720
the relevant piece of data that is going to answer the question.

04:58.720 --> 05:03.360
So the LLM is going to have a much easier time answering our question.

05:03.360 --> 05:09.600
So instead of sending the entire book, we are going to send only a specific paragraph or only a couple

05:09.640 --> 05:10.760
of paragraphs here.

05:10.920 --> 05:15.040
And we are going to solve all of the problems we talked about.

05:15.080 --> 05:21.800
We are not going to exceed the hard token limit of the LLM because we are sending a much smaller piece of

05:21.800 --> 05:22.560
context.

05:22.880 --> 05:29.160
We won't encounter the needle in the haystack problem because we are only sending very, very specific

05:29.160 --> 05:35.400
pieces of text which are the most relevant for the question. It's going to cost less because we're sending

05:35.440 --> 05:37.720
fewer tokens to the large language model.

05:37.920 --> 05:43.720
And of course, the processing time of the LLM is going to be faster because it's going to process fewer

05:43.720 --> 05:44.320
tokens.

05:44.600 --> 05:51.270
This technique can scale to very large documents, and it can even work with multiple documents here.

05:51.310 --> 05:53.590
Now, it does have its drawbacks.

05:53.790 --> 06:00.030
So it's going to require us to add a pre-processing step to chunk the large documents.

06:00.310 --> 06:03.990
There is a lot of depth to this chunking mechanism.

06:04.190 --> 06:05.990
How do we chunk the document?

06:06.030 --> 06:09.630
What are going to be the tokens that we split the document on?

06:09.950 --> 06:14.550
How are we going to make sure that each chunk is going to have the relevant data?

06:14.710 --> 06:18.950
And what if we're dealing not with a document but with a code repository?

06:19.230 --> 06:23.110
We need different chunking strategies for different types of documents.

06:23.310 --> 06:26.910
And what if we do not know what the content of the document is,

06:26.910 --> 06:29.030
and we get it dynamically from the user?

06:29.110 --> 06:32.750
So there is a lot of depth in this pre-processing part.

06:32.750 --> 06:35.390
And we are going to be covering this in the course.

06:35.630 --> 06:44.390
Now another downside is that we need to have some kind of searching mechanism to find those relevant

06:44.390 --> 06:45.070
chunks.

06:45.510 --> 06:52.710
And what if those retrieved chunks are not that relevant, and we need additional context to send to

06:52.710 --> 06:54.550
the LLM to answer the user's query?

06:54.950 --> 06:58.870
In this section, we're going to be addressing all of those challenges.

06:58.870 --> 07:05.990
And by the way, solution number two that we discussed is actually RAG, retrieval-augmented generation.

07:05.990 --> 07:07.430
And what we actually saw

07:07.430 --> 07:11.350
was a very high-level overview of RAG.

07:11.750 --> 07:16.550
Now, in RAG, retrieval means retrieving the relevant chunks.

07:17.030 --> 07:23.310
Augmentation means that we take our prompt and we augment it with those relevant chunks.

07:24.030 --> 07:30.750
And generation means simply sending that augmented prompt to the LLM, which generates the answer.
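
NOTE
Editor's sketch of the three stages just named: retrieve, augment,
generate. Everything here is hypothetical scaffolding for illustration:
the word-overlap ranking, the prompt wording, and echo_llm, which is a
stub standing in for a real LLM call and does not contact any model.
```python
from typing import Callable
#
def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Retrieval: rank chunks by words shared with the query (toy scoring)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]
#
def augment(query: str, relevant_chunks: list[str]) -> str:
    """Augmentation: add the retrieved chunks to the prompt."""
    context = "\n".join(relevant_chunks)
    return f"Answer using only this context:\n{context}\nQuestion: {query}"
#
def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Generation: send the augmented prompt to the LLM."""
    return llm(prompt)
#
def rag_answer(query: str, chunks: list[str], llm: Callable[[str], str], top_k: int = 2) -> str:
    """Run the three stages in order."""
    return generate(augment(query, retrieve(query, chunks, top_k)), llm)
#
def echo_llm(prompt: str) -> str:
    """Stub LLM: only reports how much text it was given."""
    return f"(stub) received a prompt of {len(prompt)} characters"
#
if __name__ == "__main__":
    book_chunks = [
        "potion recipe uses root",
        "owl post arrives daily",
        "the potion must simmer",
    ]
    print(rag_answer("potion recipe", book_chunks, echo_llm))
```
Swapping echo_llm for a real model call would leave the other two
stages unchanged.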

07:30.870 --> 07:37.950
So what we saw in this video is really the motivation and the intuition behind RAG.

07:38.030 --> 07:45.590
So let's now learn about the implementation itself and how to implement this kind

07:45.590 --> 07:46.150
of technique.