WEBVTT

00:00.840 --> 00:02.370
Instructor: LLMs, regardless

00:02.370 --> 00:04.200
of their specific architecture,

00:04.200 --> 00:06.660
all have predefined token limits.

00:06.660 --> 00:08.880
So this will determine the maximum number

00:08.880 --> 00:11.910
of tokens they can process in a single interaction.

00:11.910 --> 00:15.360
Nowadays, most LLMs have a token limit

00:15.360 --> 00:17.460
of around 4K tokens,

00:17.460 --> 00:20.880
but the number of tokens that the LLM would be able

00:20.880 --> 00:23.880
to digest keeps growing and growing over time.

00:23.880 --> 00:26.400
So we even saw right now that Anthropic

00:26.400 --> 00:30.720
came out with a model that can ingest 100K tokens,

00:30.720 --> 00:32.190
which is a dramatic improvement

00:32.190 --> 00:34.320
over what's currently on the market.

00:34.320 --> 00:36.660
Anyways, even though the number of tokens

00:36.660 --> 00:38.100
of the LLMs will continue

00:38.100 --> 00:41.100
to grow and grow, we still will have a limit

00:41.100 --> 00:44.193
that we will need to work around.

00:45.090 --> 00:48.000
Now, for simplicity, let's assume

00:48.000 --> 00:52.110
that each token is equal to one word.

00:52.110 --> 00:54.930
Now, the total count of tokens

00:54.930 --> 00:57.720
will include both the input prompt

00:57.720 --> 01:01.650
and the generated response that the LLM gives us.

01:01.650 --> 01:03.450
So as long as both

01:03.450 --> 01:07.740
of them do not surpass the token limit, 4K, in this example,

01:07.740 --> 01:10.560
then we will be fine and everything will work.

01:10.560 --> 01:13.950
Now, in an LLM interaction, the LLM does not care

01:13.950 --> 01:16.260
how we divide our token limit.

01:16.260 --> 01:18.960
Now, we can create a prompt with very few tokens

01:18.960 --> 01:22.500
and request the LLM to return an elaborate response

01:22.500 --> 01:24.480
so that way, most of the tokens

01:24.480 --> 01:27.330
are being spent on the response part,

01:27.330 --> 01:29.370
but we can divide it half and half

01:29.370 --> 01:32.460
and we can even say that we want a huge prompt

01:32.460 --> 01:35.370
and we want a very quick and concise response.

01:35.370 --> 01:37.320
The LLM won't really care.

01:37.320 --> 01:39.390
As long as we don't surpass this token limit,

01:39.390 --> 01:42.480
it will be able to digest it and work fine.
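
NOTE
The budget rule described above can be sketched as follows. This is only a minimal
illustration, assuming one token per word as in the lesson; count_tokens and
fits_budget are hypothetical helper names, and real tokenizers count differently.
```python
# Simplified token budget check: prompt tokens plus response tokens must fit the limit.
# Assumption (from the lesson): one token is roughly one word; real tokenizers differ.
TOKEN_LIMIT = 4000  # the 4K limit used as the example in this lesson
def count_tokens(text: str) -> int:
    """Hypothetical helper: count whitespace-separated words as tokens."""
    return len(text.split())
def fits_budget(prompt: str, max_response_tokens: int, limit: int = TOKEN_LIMIT) -> bool:
    """True when prompt tokens plus the reserved response tokens stay within the limit."""
    return count_tokens(prompt) + max_response_tokens <= limit
short_prompt = "Write a long essay about token limits"
huge_prompt = "word " * 3900
print(fits_budget(short_prompt, 3500))  # short prompt, elaborate response
print(fits_budget(huge_prompt, 50))     # huge prompt, concise response
print(fits_budget(huge_prompt, 500))    # huge prompt plus a long response
```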

01:42.480 --> 01:44.850
But in real world applications

01:44.850 --> 01:47.550
and in advanced usage of LLMs,

01:47.550 --> 01:50.100
we will hit the token limit for sure.

01:50.100 --> 01:52.290
This can happen for a variety of reasons,

01:52.290 --> 01:56.520
like having a prompt whose context is too large,

01:56.520 --> 01:59.430
and really, this problem is inevitable.

01:59.430 --> 02:02.670
So once we surpass the token limit,

02:02.670 --> 02:04.980
we will get an error from the LLM stating

02:04.980 --> 02:09.120
that we can't use it because we sent it too many tokens.

02:09.120 --> 02:10.680
Now, there are a couple of strategies

02:10.680 --> 02:12.690
to solve this token limitation.

02:12.690 --> 02:14.730
LangChain supports all of them.

02:14.730 --> 02:16.260
The first one is called stuffing.

02:16.260 --> 02:18.030
Then we have map reduce and refine,

02:18.030 --> 02:20.103
which we're going to discuss in this session.

02:21.600 --> 02:24.030
Now, the best way to explain those strategies

02:24.030 --> 02:26.430
is by using a real world example

02:26.430 --> 02:29.190
of the problem of summarization.

02:29.190 --> 02:31.710
So let's say we have a bunch of documents

02:31.710 --> 02:33.570
that we want to summarize.

02:33.570 --> 02:36.570
So we can use the load_summarize_chain.

02:36.570 --> 02:39.690
Now, this function will give us a LangChain chain

02:39.690 --> 02:41.640
that summarizes documents.

02:41.640 --> 02:44.220
Now, when we initialize this chain,

02:44.220 --> 02:47.010
we can pass it the argument of the chain type.

02:47.010 --> 02:49.260
When chain type is equal to stuff,

02:49.260 --> 02:50.970
we are telling LangChain

02:50.970 --> 02:54.780
that we want our summarization chain to handle the context

02:54.780 --> 02:56.640
with the stuffing strategy.

02:56.640 --> 02:58.380
Now, what is stuffing?

02:58.380 --> 03:00.840
So exactly like a stuffed animal doll

03:00.840 --> 03:03.930
is stuffed with cotton and with fiber,

03:03.930 --> 03:08.250
we are telling LangChain that we want to stuff our context,

03:08.250 --> 03:11.520
which is the documents, into the prompt as is.

03:11.520 --> 03:15.000
So we're simply pushing everything into the prompt

03:15.000 --> 03:17.520
as is without doing any alterations.

03:17.520 --> 03:20.790
Now, this will cost us one API call to the LLM.

03:20.790 --> 03:23.640
Now, maybe this is the most intuitive thing to do,

03:23.640 --> 03:26.850
but eventually, if we use more than a couple

03:26.850 --> 03:29.340
of documents, we'll hit the token limit

03:29.340 --> 03:31.200
because if we have a lot of documents,

03:31.200 --> 03:34.350
then we'll for sure have a lot of tokens.
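
NOTE
The stuffing idea described above can be sketched without any framework. This is
not LangChain's implementation, only an illustration: fake_llm is a hypothetical
stand-in for a real model call, stuff_summarize is a made-up helper name, and the
one-word-per-token count is the simplification used earlier in the lesson.
```python
# Sketch of the stuffing strategy: every document goes into one prompt, one LLM call.
# `fake_llm` is a hypothetical stand-in for a real model call, not a LangChain API.
calls = []
def fake_llm(prompt: str) -> str:
    """Record the call and pretend to summarize by returning a short label."""
    calls.append(prompt)
    return f"summary of {len(prompt.split())} words"
def stuff_summarize(documents: list[str], token_limit: int = 4000) -> str:
    """Join all documents into a single prompt; fail if it exceeds the limit."""
    prompt = "Summarize the following text:\n" + "\n".join(documents)
    if len(prompt.split()) > token_limit:  # one word ~ one token, as assumed in the lesson
        raise ValueError("prompt exceeds the token limit")
    return fake_llm(prompt)
docs = ["first document text", "second document text"]
print(stuff_summarize(docs))
print(len(calls))  # a single call, however many documents were stuffed in
```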

03:34.350 --> 03:37.050
Now, even if we take an edge case

03:37.050 --> 03:41.010
and let's hypothetically assume that there is an LLM model

03:41.010 --> 03:44.370
that can ingest any size of tokens,

03:44.370 --> 03:48.270
then we will still encounter the barrier of the payload

03:48.270 --> 03:51.960
we can send to the server that would process that request.

03:51.960 --> 03:54.000
Okay, so one way to solve this

03:54.000 --> 03:57.210
is to use the map_reduce chain.

03:57.210 --> 03:59.880
Now, this time we're using the load_summarize_chain,

03:59.880 --> 04:03.360
but we pass it the chain type of map_reduce.

04:03.360 --> 04:04.980
Now, in this chain,

04:04.980 --> 04:07.830
what we're doing is taking all the documents

04:07.830 --> 04:10.440
that we used to stuff into our prompt

04:10.440 --> 04:12.630
and send directly to the LLM.

04:12.630 --> 04:15.540
And from each document we're going to be making

04:15.540 --> 04:19.050
a new prompt which will hold the instructions to summarize

04:19.050 --> 04:21.810
and a context, which will be the document.

04:21.810 --> 04:25.320
Basically what we did comes from functional programming

04:25.320 --> 04:28.620
and it's called mapping in the formal terminology.

04:28.620 --> 04:31.560
And it's basically applying a transformation function

04:31.560 --> 04:32.970
to a collection

04:32.970 --> 04:36.720
of some things, like in this example, documents

04:36.720 --> 04:39.360
and to produce from it a new collection.

04:39.360 --> 04:42.270
The new collection is going to be a collection

04:42.270 --> 04:44.700
of prompts with the context.

04:44.700 --> 04:47.370
And now, we want to create a new mapping step,

04:47.370 --> 04:49.470
which will take the prompts

04:49.470 --> 04:51.480
that we created from the documents

04:51.480 --> 04:55.470
and send each one of those prompts to the LLM.

04:55.470 --> 04:58.410
So this transformation will take the prompts,

04:58.410 --> 05:00.270
make an API call to the LLM

05:00.270 --> 05:03.330
and get the summary for each document.

05:03.330 --> 05:07.230
Now, notice that everything can run in parallel over here,

05:07.230 --> 05:09.750
and that's a very great advantage

05:09.750 --> 05:13.380
that will optimize performance and we can do that

05:13.380 --> 05:16.770
because the documents are not dependent on each other

05:16.770 --> 05:18.720
and can be standalone.

05:18.720 --> 05:20.880
Now, the last step would be to take

05:20.880 --> 05:22.440
all those small summaries

05:22.440 --> 05:25.830
that we had made from the original documents

05:25.830 --> 05:28.440
and to create a big summary.

05:28.440 --> 05:30.420
That is going to be the summary

05:30.420 --> 05:33.930
of all those summaries we have prepared.

05:33.930 --> 05:36.480
Now, in the professional terminology,

05:36.480 --> 05:37.800
this is called reducing

05:37.800 --> 05:41.280
and it's basically applying a reduction function,

05:41.280 --> 05:46.110
which here is a call to the LLM, to an iterable

05:46.110 --> 05:48.870
to produce a single value.

05:48.870 --> 05:52.830
Now, the single value here is going to be the final summary.
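
NOTE
The map and reduce steps described above can be sketched in plain Python. This is
an illustration under stated assumptions, not LangChain's internals: fake_llm is a
hypothetical stand-in that keeps the first word of each context line, and
make_prompt and map_reduce_summarize are made-up helper names.
```python
# Sketch of map-reduce summarization with a hypothetical stand-in for the LLM call.
from concurrent.futures import ThreadPoolExecutor
def fake_llm(prompt: str) -> str:
    """Hypothetical model call: 'summarizes' by keeping the first word of each context line."""
    context = prompt.split("\n", 1)[1]
    return " ".join(line.split()[0] for line in context.splitlines() if line.strip())
def make_prompt(context: str) -> str:
    """First mapping step: wrap a piece of context in summarization instructions."""
    return "Summarize the following text:\n" + context
def map_reduce_summarize(documents: list[str]) -> str:
    prompts = [make_prompt(doc) for doc in documents]      # map: documents to prompts
    with ThreadPoolExecutor() as pool:                     # independent calls can run in parallel
        summaries = list(pool.map(fake_llm, prompts))      # map: prompts to per-document summaries
    return fake_llm(make_prompt("\n".join(summaries)))     # reduce: summary of the summaries
docs = ["alpha one two", "beta three four", "gamma five six"]
print(map_reduce_summarize(docs))
```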

05:52.830 --> 05:56.220
And how does this entire process look in LangChain?

05:56.220 --> 05:58.980
Well, it's actually one line.

05:58.980 --> 06:02.040
So all we need to do in order to perform all this

06:02.040 --> 06:06.180
is to simply select the chain type to be map_reduce.

06:06.180 --> 06:08.760
Now, this is very, very cool

06:08.760 --> 06:10.920
and this is where LangChain shines

06:10.920 --> 06:13.620
because it does all the heavy lifting for us

06:13.620 --> 06:15.360
and this entire,

06:15.360 --> 06:18.660
lengthy process is done for us.

06:18.660 --> 06:21.630
Now, the advantages of using map_reduce

06:21.630 --> 06:26.310
are that it can scale to an enormous number of documents

06:26.310 --> 06:28.830
and it can also run in parallel.

06:28.830 --> 06:33.600
So we'll have optimized performance and it'll run fast.

06:33.600 --> 06:36.420
However, it does come with some disadvantages

06:36.420 --> 06:39.480
because we're going to be making a lot of API calls

06:39.480 --> 06:42.030
and this may impact costs

06:42.030 --> 06:46.590
and we might lose information in the mapping process

06:46.590 --> 06:49.740
where we summarize each document.

06:49.740 --> 06:52.320
So we might be losing some context.

06:52.320 --> 06:56.793
And basically, this approach is prone to losing information.

06:58.050 --> 07:03.050
Okay, the next way to handle this is using the refine chain.

07:03.750 --> 07:07.020
Now, I think the best way to explain

07:07.020 --> 07:09.750
what the refine chain does is

07:09.750 --> 07:12.870
to talk about a concept from functional programming,

07:12.870 --> 07:16.830
and this is the function foldl.

07:16.830 --> 07:18.540
I'm going to explain this function,

07:18.540 --> 07:20.580
which seems to be unrelated,

07:20.580 --> 07:23.400
and then we can talk about the refinement strategy

07:23.400 --> 07:26.250
and you'll see how those topics connect to each other.

07:26.250 --> 07:29.460
So in functional programming, the foldl function

07:29.460 --> 07:34.140
or fold left is what is called a higher order function

07:34.140 --> 07:37.080
that is commonly used for iterating over a list

07:37.080 --> 07:41.160
and accumulating a result by applying a binary function

07:41.160 --> 07:45.390
to each element and the value accumulated so far.

07:45.390 --> 07:47.010
Now, a binary function

07:47.010 --> 07:51.000
or a binary operation is simply a fancy way of saying

07:51.000 --> 07:55.200
that this is a function that needs to receive two arguments.

07:55.200 --> 08:00.030
So what the foldl function does is apply the binary function

08:00.030 --> 08:01.560
to the initial value

08:01.560 --> 08:03.510
and the first element of the list.

08:03.510 --> 08:06.450
This will produce a new accumulated value.

08:06.450 --> 08:09.510
Then it applies the binary function again

08:09.510 --> 08:12.960
to the new accumulated value and the second element

08:12.960 --> 08:17.280
and so on and so on until it reaches the end of the list,

08:17.280 --> 08:21.120
and we manage to reduce the list into a single value.

08:21.120 --> 08:22.590
So let's revisit this

08:22.590 --> 08:25.890
and let's explain the example we see right now.

08:25.890 --> 08:30.890
The function foldl or fold left has three parameters.

08:30.960 --> 08:34.320
The first one is the function to apply.

08:34.320 --> 08:36.510
The second one is the initial value,

08:36.510 --> 08:38.850
and the third parameter is the list

08:38.850 --> 08:40.173
to apply the function to.

08:41.100 --> 08:43.920
Now, let's explain how foldl works.

08:43.920 --> 08:46.440
So we first take the initial value

08:46.440 --> 08:48.720
and we take the function to apply.

08:48.720 --> 08:52.890
Now, we simply take the first element from the list

08:52.890 --> 08:54.570
and then we call the function

08:54.570 --> 08:56.190
with those two arguments.

08:56.190 --> 08:59.430
So in this example, we have the initial value one.

08:59.430 --> 09:02.130
We take it and multiply it with the first element

09:02.130 --> 09:05.790
of the list, which is one, and the result is one.

09:05.790 --> 09:09.630
And now, we continue to do it with the rest of the list.

09:09.630 --> 09:12.030
So we have the result, which is one,

09:12.030 --> 09:14.910
and we then take the second element

09:14.910 --> 09:16.860
from the list, which is two.

09:16.860 --> 09:18.960
We now use the function to apply

09:18.960 --> 09:21.660
and we perform one times two,

09:21.660 --> 09:24.750
and then we get the result of two.

09:24.750 --> 09:28.320
Now, we continue this, so we take the result two,

09:28.320 --> 09:31.620
we take now the third element from the list, which is three,

09:31.620 --> 09:34.320
and we take the function to apply

09:34.320 --> 09:38.040
and we perform two times three, which is now six.

09:38.040 --> 09:39.390
This is the result.

09:39.390 --> 09:42.150
Now we are left with the last element of the list.

09:42.150 --> 09:43.800
So we take that element

09:43.800 --> 09:47.670
and we call the function to apply six times four,

09:47.670 --> 09:52.140
and eventually, we reduce this entire list into 24

09:52.140 --> 09:56.190
using the initial value and function to apply.
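
NOTE
The walkthrough above can be reproduced in Python. A minimal sketch: foldl here is
a hand-written helper (Python has no built-in of that name), and functools.reduce
with an initial value behaves as the same left fold.
```python
# foldl sketch in Python; functools.reduce with an initial value is a left fold.
from functools import reduce
from operator import mul
def foldl(func, initial, items):
    """Explicit left fold: apply func to the accumulator and each element in order."""
    acc = initial
    for item in items:
        acc = func(acc, item)
    return acc
result = foldl(mul, 1, [1, 2, 3, 4])
print(result)  # 24, matching the walkthrough: 1*1=1, 1*2=2, 2*3=6, 6*4=24
assert result == reduce(mul, [1, 2, 3, 4], 1)
```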

09:56.190 --> 09:59.790
Now, I want you to imagine that instead of the function

09:59.790 --> 10:03.660
to apply being the multiplication operator,

10:03.660 --> 10:07.530
I'm going to provide a function that will receive two documents

10:07.530 --> 10:09.900
and it's going to combine those documents

10:09.900 --> 10:11.820
and summarize them.

10:11.820 --> 10:15.930
And the initial value, I'm not going to provide one,

10:15.930 --> 10:18.600
but I'm simply going to provide an empty string

10:18.600 --> 10:22.800
or a document that represents an empty document.

10:22.800 --> 10:26.520
And in the list parameter, I'm going to provide the list

10:26.520 --> 10:30.030
of documents that we want to summarize.

10:30.030 --> 10:31.470
Now, I want you to imagine

10:31.470 --> 10:34.290
what would happen if we take the function foldl

10:34.290 --> 10:38.280
and apply it to those arguments that we just mentioned.

10:38.280 --> 10:42.270
So basically, we'll summarize an empty document

10:42.270 --> 10:46.740
and the first document into a first summary.

10:46.740 --> 10:48.270
We'll take this first summary

10:48.270 --> 10:50.850
and then we'll take the second document,

10:50.850 --> 10:52.080
combine them together

10:52.080 --> 10:55.230
and create another refined summarization.

10:55.230 --> 10:56.700
So we keep refining

10:56.700 --> 11:00.450
and refining the summarization until we end up

11:00.450 --> 11:04.470
with the perfect summary for all of the documents.

11:04.470 --> 11:09.000
So this is basically the implementation of the refine chain.
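
NOTE
The refine idea described above can be sketched as a left fold over the documents.
This is an illustration only, not LangChain's implementation: fake_refine is a
hypothetical stand-in for the LLM call that merges a running summary with the next
document. Because each step needs the previous summary, the calls in this sketch
run one after another rather than in parallel.
```python
# Sketch of the refine strategy as a left fold over documents.
# `fake_refine` is a hypothetical stand-in for an LLM call, not a LangChain API.
from functools import reduce
def fake_refine(summary_so_far: str, document: str) -> str:
    """Hypothetical binary function: fold a new document into the running summary."""
    first_word = document.split()[0]
    return first_word if not summary_so_far else f"{summary_so_far}, {first_word}"
docs = ["alpha one two", "beta three four", "gamma five six"]
final_summary = reduce(fake_refine, docs, "")  # initial value: an empty summary
print(final_summary)
```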

11:09.000 --> 11:11.190
So it's very, very cool

11:11.190 --> 11:15.780
and LangChain is doing all of this background work

11:15.780 --> 11:19.200
and heavy lifting for us, which is super, super cool

11:19.200 --> 11:21.240
and saves us a lot of time.

11:21.240 --> 11:24.750
Now, in my opinion, it's simply mind blowing

11:24.750 --> 11:28.203
how elegant and smart this framework is.