WEBVTT

00:00.060 --> 00:00.930
All right, welcome back.

00:00.930 --> 00:02.700
So in this video we're gonna have a look at

00:02.700 --> 00:04.470
how we can do conversation history

00:04.470 --> 00:07.860
and also memory inside of LangChain Expression Language.

00:07.860 --> 00:10.470
We've got a range of different imports at the top here,

00:10.470 --> 00:12.150
some that you'll be probably quite familiar with,

00:12.150 --> 00:14.280
and there's also this format_document

00:14.280 --> 00:16.200
that allows us to take in a document

00:16.200 --> 00:18.510
and a prompt and return a string as well,

00:18.510 --> 00:20.040
and that one's probably new.

00:20.040 --> 00:21.720
We have a vector database

00:21.720 --> 00:24.000
that we've set up from our previous lesson

00:24.000 --> 00:26.250
and so what we are doing is basically just having

00:26.250 --> 00:29.040
two documents where I currently work for James

00:29.040 --> 00:31.830
and also his age and we're using the OpenAIEmbeddings

00:31.830 --> 00:34.680
for that and setting up our RAG retriever.

00:34.680 --> 00:36.427
We then have a template that says,

00:36.427 --> 00:37.867
"Given the following conversation

00:37.867 --> 00:40.777
"and a follow-up question, rephrase the follow-up question

00:40.777 --> 00:44.250
"to be a standalone question in its original language."

00:44.250 --> 00:48.030
We have a chat history and so what we're gonna be covering

00:48.030 --> 00:51.358
on this lesson is how do we get the chat history in?

00:51.358 --> 00:54.057
And then we have the follow-up input, which is the question,

00:54.057 --> 00:57.300
and we return, basically, just the standalone question.

00:57.300 --> 01:00.150
So this is our condensed question prompt.
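
NOTE
A minimal sketch of the prompt described above, using plain Python string formatting rather than a LangChain prompt class. The constant name, the function name, the {chat_history} and {question} placeholders, and the sample values are assumptions for illustration; the instruction wording follows what is read out in the video.
```python
# Hypothetical stand-in for the question-condensing template described in the video.
CONDENSE_QUESTION_TEMPLATE = (
    "Given the following conversation and a follow-up question, "
    "rephrase the follow-up question to be a standalone question, "
    "in its original language.\n\n"
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\n"
    "Standalone question:"
)
def build_condense_prompt(chat_history, question):
    """Fill the template; both inputs must already be plain strings."""
    return CONDENSE_QUESTION_TEMPLATE.format(chat_history=chat_history, question=question)
prompt_text = build_condense_prompt("Human: Who is James?\nAssistant: A colleague.", "Where did he work?")
print(prompt_text)
```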

01:00.150 --> 01:02.550
We then have an answer that uses the context

01:02.550 --> 01:05.070
and that context is gonna come from our retriever

01:05.070 --> 01:08.400
from the question that we got back from this prompt here.

01:08.400 --> 01:11.460
We've also got a default document template,

01:11.460 --> 01:13.980
which will take in the page content

01:13.980 --> 01:17.190
and it will then combine multiple pages

01:17.190 --> 01:18.780
using the document prompt.

01:18.780 --> 01:21.330
So you can see here, it will format each document

01:21.330 --> 01:23.010
to get document strings,

01:23.010 --> 01:25.470
and then after that we join those together

01:25.470 --> 01:29.490
and then we've got a prompt which just has the page content.

01:29.490 --> 01:33.000
So basically all of these combined documents get associated

01:33.000 --> 01:35.868
with a variable of page content.
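
NOTE
A rough sketch of the document-combining step described above, written in plain Python rather than with LangChain itself. The Document dataclass, the combine_documents name, the separator, and the sample texts are assumptions for illustration only.
```python
from dataclasses import dataclass
@dataclass
class Document:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
# Assumed default template: it only uses the page content.
DEFAULT_DOCUMENT_TEMPLATE = "{page_content}"
def combine_documents(docs, template=DEFAULT_DOCUMENT_TEMPLATE, separator="\n\n"):
    """Format each document to a string, then join the strings together."""
    doc_strings = [template.format(page_content=doc.page_content) for doc in docs]
    return separator.join(doc_strings)
docs = [Document("first document text"), Document("second document text")]
combined = combine_documents(docs)
print(combined)
```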

01:35.868 --> 01:40.080
Now looking at the chat history, what we've decided to do

01:40.080 --> 01:43.560
is we have our chat history, which is a list of messages

01:43.560 --> 01:46.260
that could either be a human message, a system message,

01:46.260 --> 01:47.550
or an AI message.

01:47.550 --> 01:49.050
And so we can create a buffer,

01:49.050 --> 01:50.280
which is just an empty string,

01:50.280 --> 01:52.230
and we're just gonna loop through each message

01:52.230 --> 01:54.570
and check to see what type of message that is

01:54.570 --> 01:57.960
and we'll add the appropriate bit of string as the prefix

01:57.960 --> 02:00.240
of each individual message, whether it's a human,

02:00.240 --> 02:02.340
an assistant, or a system message

02:02.340 --> 02:03.990
and then we'll return the buffer.
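
NOTE
A small sketch of the buffer loop described above, in plain Python. The Message dataclass, the role names, and the exact prefix strings are assumptions; the video only says that each message type gets its own prefix and that the result must be a single string.
```python
from dataclasses import dataclass
@dataclass
class Message:
    """Hypothetical stand-in for a chat message; role is 'human', 'ai' or 'system'."""
    role: str
    content: str
# Assumed prefixes, one per message type.
PREFIXES = {"human": "Human", "ai": "Assistant", "system": "System"}
def format_chat_history(messages):
    """Loop over the messages and build one string with a prefix per message."""
    buffer = ""
    for message in messages:
        if message.role not in PREFIXES:
            raise ValueError(f"Unsupported message type: {message.role}")
        buffer += f"\n{PREFIXES[message.role]}: {message.content}"
    return buffer
history = [Message("human", "Who is James?"), Message("ai", "A colleague.")]
formatted = format_chat_history(history)
print(formatted)
```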

02:03.990 --> 02:05.430
Now it's important to do this step

02:05.430 --> 02:07.380
using LangChain Expression Language

02:07.380 --> 02:08.610
because you do have to make sure

02:08.610 --> 02:10.650
that all of the types eventually converge

02:10.650 --> 02:14.460
to a string type when you're using your prompt formatting.

02:14.460 --> 02:16.650
We set up something called a runnable map,

02:16.650 --> 02:18.120
which is exactly the same

02:18.120 --> 02:20.160
as something called the runnable parallel.

02:20.160 --> 02:22.740
But I wanted to demo it, because sometimes you might come across

02:22.740 --> 02:26.130
a runnable map and it's basically just a runnable parallel.
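
NOTE
A conceptual sketch of what a runnable map or runnable parallel does, in plain Python. This is not LangChain's implementation and it does not actually run the steps concurrently; run_map and the two example steps are hypothetical names used only to show that every step receives the same input and the results come back as one dictionary.
```python
def run_map(steps, value):
    """Call every step with the same input and collect the results by key."""
    return {key: step(value) for key, step in steps.items()}
# Hypothetical steps: each one receives the whole input dictionary.
steps = {
    "question": lambda inputs: inputs["question"],
    "history_length": lambda inputs: len(inputs["chat_history"]),
}
result = run_map(steps, {"question": "Where did James work?", "chat_history": []})
print(result)
```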

02:26.130 --> 02:28.830
We have the standalone question which says

02:28.830 --> 02:32.523
take the chat history and format that chat history.

02:33.840 --> 02:35.910
We also have the condensed question prompt,

02:35.910 --> 02:37.170
so we take the chat history,

02:37.170 --> 02:39.660
push that into the condensed question prompt.

02:39.660 --> 02:41.490
We then get the ChatOpenAI model,

02:41.490 --> 02:43.593
and then we get it to answer the question.

02:44.700 --> 02:46.680
We also have the context here

02:46.680 --> 02:49.800
and you can see that's getting the standalone question.

02:49.800 --> 02:51.930
It's piping that into the retriever,

02:51.930 --> 02:56.520
and then those retrieved documents are then being combined

02:56.520 --> 02:58.980
and we have the question here, which is then

02:58.980 --> 03:01.380
the standalone question from before.

03:01.380 --> 03:03.720
And this makes up our conversational Q&amp;A chain

03:03.720 --> 03:05.760
where we have our inputs.

03:05.760 --> 03:07.440
We then have our context

03:07.440 --> 03:09.090
and all of that ends up getting piped

03:09.090 --> 03:12.333
into the answer prompt, which you can see here.

03:14.100 --> 03:16.110
We then pass that to the chat model.

03:16.110 --> 03:18.390
So let's have a look and see what happens.

03:18.390 --> 03:21.633
We just make sure that all of these have been run.

03:24.060 --> 03:26.493
And I've gotta go and get the retriever too.

03:28.650 --> 03:30.930
So now we can invoke our Q&amp;A chain

03:30.930 --> 03:32.880
with the question "Where did James work?"

03:32.880 --> 03:34.200
and the chat history.

03:34.200 --> 03:35.550
Now the problem is at the moment

03:35.550 --> 03:37.260
in LangChain Expression Language,

03:37.260 --> 03:40.830
you do have to add the chat history manually

03:40.830 --> 03:43.170
and they haven't provided a way of automating that.

03:43.170 --> 03:44.370
So the best way at the moment

03:44.370 --> 03:45.900
is to use something like memory,

03:45.900 --> 03:48.720
where you would set a conversational buffer memory

03:48.720 --> 03:50.340
with an output key

03:50.340 --> 03:53.220
and the input would be associated with the question.

03:53.220 --> 03:56.280
You can set return messages to true, which you should do

03:56.280 --> 03:58.770
for the above function to work.

03:58.770 --> 04:01.500
And then what we do is we say load the memory,

04:01.500 --> 04:04.080
so RunnablePassthrough.assign,

04:04.080 --> 04:05.820
and we get the chat history

04:05.820 --> 04:10.290
and we make a runnable lambda that wraps memory's

04:10.290 --> 04:12.180
load_memory_variables,

04:12.180 --> 04:16.870
and it pipes that function into the itemgetter of history.

04:16.870 --> 04:20.250
After that, then we set our standalone question key

04:20.250 --> 04:23.130
to have a question and a chat history.

04:23.130 --> 04:25.200
We then have the condensed question prompt,

04:25.200 --> 04:27.510
the ChatOpenAI, the string parser,

04:27.510 --> 04:30.130
and then basically we get the retrieved documents

04:31.350 --> 04:33.150
which get passed into the retriever.

04:34.140 --> 04:37.050
The standalone question and our final inputs

04:37.050 --> 04:39.510
where we have the combined documents

04:39.510 --> 04:42.510
as the context and the question as the itemgetter

04:42.510 --> 04:45.001
of question from a previous step.

04:45.001 --> 04:47.850
And then after that, then what we do is we take

04:47.850 --> 04:50.550
the final inputs, pipe them into the answer prompt,

04:50.550 --> 04:53.520
pipe that to the model and get the docs.

04:53.520 --> 04:55.590
And so the chain does a couple of things.

04:55.590 --> 04:58.410
So the first thing is we get the loaded memory

04:58.410 --> 05:00.663
and then we go into the standalone question,

05:01.800 --> 05:03.900
which will get us a single question

05:03.900 --> 05:06.000
given the chat history.

05:06.000 --> 05:08.610
After that, then we go and get the retrieved documents.

05:08.610 --> 05:11.940
So using the standalone question, we then pass that

05:11.940 --> 05:14.850
into the retriever to get some relevant documents.

05:14.850 --> 05:17.580
And after that then we get the standalone question

05:17.580 --> 05:20.700
and reassign it with a key of question, right.

05:20.700 --> 05:22.380
And then after we've got that,

05:22.380 --> 05:27.030
then we pass it into the answer which has the final inputs,

05:27.030 --> 05:28.050
which is the context.

05:28.050 --> 05:31.440
So those documents, we then get those as the context key.

05:31.440 --> 05:33.750
And also then we get the question

05:33.750 --> 05:36.690
from the previous step here, you can see

05:36.690 --> 05:40.680
and then we pipe that in to the answer prompt

05:40.680 --> 05:42.930
and get the chat model to respond.

05:42.930 --> 05:44.670
And we also pass in the docs.

05:44.670 --> 05:47.430
So sometimes it can be good to read this in the start,

05:47.430 --> 05:49.470
but also you saw me sort of going, well,

05:49.470 --> 05:50.460
we've got this final inputs

05:50.460 --> 05:52.140
but it's not really its own chain.

05:52.140 --> 05:54.210
It's sort of inside the answer

05:54.210 --> 05:57.360
and the answer is sort of using the final inputs,

05:57.360 --> 05:59.460
which then get passed into an answer prompt,

05:59.460 --> 06:01.530
which then get passed into ChatOpenAI.

06:01.530 --> 06:03.120
So it's important to just sort of see

06:03.120 --> 06:06.030
where do the dictionaries flow, like can you see

06:06.030 --> 06:08.160
how the chain is being created?
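
NOTE
One way to picture how the dictionaries flow through the chain, sketched with plain Python functions instead of LangChain runnables. Every function name here is hypothetical, and fake_condense and fake_retriever are stubs standing in for the model call and the vector store so the sketch runs on its own.
```python
def load_memory(inputs, stored_history):
    """Step 1: add the stored chat history to the incoming dictionary."""
    return {**inputs, "chat_history": stored_history}
def standalone_question(state, condense):
    """Step 2: turn the question plus history into a single standalone question."""
    return {"standalone_question": condense(state["question"], state["chat_history"])}
def retrieve_documents(state, retriever):
    """Step 3: fetch documents for the standalone question and carry it on as 'question'."""
    question = state["standalone_question"]
    return {"docs": retriever(question), "question": question}
def final_inputs(state):
    """Step 4: build the 'context' and 'question' keys the answer prompt expects."""
    return {"context": "\n\n".join(state["docs"]), "question": state["question"]}
# Hypothetical stubs so the sketch runs without a model or a vector store.
fake_condense = lambda question, history: question
fake_retriever = lambda question: ["first document text", "second document text"]
state = load_memory({"question": "Where did James work?"}, stored_history=[])
state = standalone_question(state, fake_condense)
state = retrieve_documents(state, fake_retriever)
answer_inputs = final_inputs(state)
print(answer_inputs)
```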

06:08.160 --> 06:09.330
And then obviously this would, you know,

06:09.330 --> 06:11.520
you would just pass in the question, "Where did I work?"

06:11.520 --> 06:13.530
And then it'll pass back the answer

06:13.530 --> 06:16.330
and the relevant documents that are used to answer that.

06:17.310 --> 06:20.100
Now the thing is, if we run this, you'll see that

06:20.100 --> 06:22.860
at the moment it doesn't have any chat history.

06:22.860 --> 06:24.780
So what you would normally do is you would do

06:24.780 --> 06:26.580
memory.load_memory_variables.

06:26.580 --> 06:29.310
And so the next time this function would run,

06:29.310 --> 06:33.540
the loaded memory, if we scroll back up here, is a function

06:33.540 --> 06:36.840
that will basically run and it will assign

06:36.840 --> 06:41.310
and pipe that to the itemgetter of history, basically.

06:41.310 --> 06:43.770
So that's basically it in a nutshell.

06:43.770 --> 06:45.840
So you can see now that we've done the load variables,

06:45.840 --> 06:48.060
there's actually a bit of history in there.

06:48.060 --> 06:49.950
So when this line gets executed,

06:49.950 --> 06:52.110
we will actually have some chat history

06:52.110 --> 06:53.310
the next time it's run.
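
NOTE
A simplified sketch of why the history is empty on the first run and populated on the next one. BufferMemory is a hypothetical plain-Python stand-in, not LangChain's class; its two method names mirror save_context and load_memory_variables. Saving the exchange after each run is an assumption here, since the video does not show that step explicitly.
```python
class BufferMemory:
    """Hypothetical, simplified stand-in for a conversation buffer memory."""
    def __init__(self):
        self.messages = []
    def save_context(self, inputs, outputs):
        """Store one question/answer exchange."""
        self.messages.append(("human", inputs["question"]))
        self.messages.append(("ai", outputs["answer"]))
    def load_memory_variables(self, _inputs):
        """Return a copy of the stored messages under the 'history' key."""
        return {"history": list(self.messages)}
memory = BufferMemory()
before = memory.load_memory_variables({})
memory.save_context({"question": "Where did James work?"}, {"answer": "placeholder answer"})
after = memory.load_memory_variables({})
print(before)
print(after)
```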

06:53.310 --> 06:56.520
And that's basically how you add conversational history

06:56.520 --> 06:59.070
and memory inside of LangChain Expression Language.