WEBVTT

00:00.270 --> 00:01.620
Eden: Hey there, Eden here,

00:01.620 --> 00:03.600
and in this video I wanna talk about

00:03.600 --> 00:06.450
integrating agents into a production environment.

00:06.450 --> 00:08.670
And specifically I want to discuss

00:08.670 --> 00:10.636
the challenges in doing so.

00:10.636 --> 00:11.610
(graphics whooshing)

00:11.610 --> 00:13.980
So, we saw that when working with agents,

00:13.980 --> 00:16.410
we are using a lot of LLM calls

00:16.410 --> 00:20.280
because we're using the LLM as a reasoning engine.

00:20.280 --> 00:22.320
So, every step we're going to make

00:22.320 --> 00:23.970
and every tool we're going to use,

00:23.970 --> 00:26.340
it needs to come after an LLM call

00:26.340 --> 00:29.220
where the LLM has decided to use that tool.

00:29.220 --> 00:32.790
And this will result in multiple calls to the LLM,

00:32.790 --> 00:35.190
and not only will we have multiple calls,

00:35.190 --> 00:38.700
but those will be sequential calls, one after another,

00:38.700 --> 00:42.870
where each one is waiting for the result of the prior one.

00:42.870 --> 00:45.540
So, this depends on how complex our task is

00:45.540 --> 00:48.390
and how many reasoning steps we need to make,

00:48.390 --> 00:50.010
and can easily escalate

00:50.010 --> 00:52.830
into a very long running application.

00:52.830 --> 00:54.510
So, we need to keep that in mind.

00:54.510 --> 00:56.670
And there are some workarounds for this,

00:56.670 --> 01:00.960
like using a semantic cache and using an LLM cache,

01:00.960 --> 01:03.570
but we won't discuss them in this course.

01:03.570 --> 01:06.930
The next thing I want to talk about is the context window.

01:06.930 --> 01:09.990
You notice that we send a huge prompt to the LLM

01:09.990 --> 01:12.420
every time we make a reasoning step.

01:12.420 --> 01:14.940
Now, most LLMs nowadays

01:14.940 --> 01:18.270
can handle around 32k tokens.

01:18.270 --> 01:19.620
It may sound a lot,

01:19.620 --> 01:21.780
but in a real-world application,

01:21.780 --> 01:24.600
we can easily surpass that limit.

01:24.600 --> 01:27.510
So, eventually, we are limited

01:27.510 --> 01:29.640
by the number of steps we can make

01:29.640 --> 01:31.830
because of the context window.

01:31.830 --> 01:34.920
Now, I know there are models like Anthropic's Claude,

01:34.920 --> 01:38.370
which can receive up to 100k tokens,

01:38.370 --> 01:41.880
but de facto working with 100k tokens

01:41.880 --> 01:45.690
sent to the LLM introduces a lot of problems,

01:45.690 --> 01:48.600
like the LLM tending to forget what's in the middle.

01:48.600 --> 01:51.930
See the "Lost in the Middle" paper.

01:51.930 --> 01:53.730
Okay, so the third thing

01:53.730 --> 01:56.340
I want to talk about is hallucinations.

01:56.340 --> 01:57.810
Now, we know hallucination

01:57.810 --> 02:00.390
is when we send the LLM a question

02:00.390 --> 02:01.680
and the LLM responds

02:01.680 --> 02:04.620
with something not relevant for that question

02:04.620 --> 02:06.480
because eventually the LLM

02:06.480 --> 02:09.450
is guessing one token after another.

02:09.450 --> 02:11.850
And retrieval augmentation, by the way,

02:11.850 --> 02:14.400
is a nice technique to reduce hallucinations

02:14.400 --> 02:17.520
because we ground the LLM with information

02:17.520 --> 02:19.380
that we send in the context.

02:19.380 --> 02:22.350
Anyways, 'cause the large language model

02:22.350 --> 02:24.720
is a statistical creature,

02:24.720 --> 02:27.180
then this means we have a probability

02:27.180 --> 02:28.890
of getting the correct answer.

02:28.890 --> 02:29.940
And the correct answer

02:29.940 --> 02:33.180
is actually choosing the correct tool to use.

02:33.180 --> 02:38.180
So, let's assume that we have 0.9 probability

02:38.220 --> 02:40.140
of getting the correct response

02:40.140 --> 02:41.970
that chooses the correct tool.

02:41.970 --> 02:44.310
Now, if we do this only once,

02:44.310 --> 02:46.890
that's okay and that's a very good probability.

02:46.890 --> 02:49.020
However, if we're making sequential calls

02:49.020 --> 02:51.270
one after another and after another,

02:51.270 --> 02:55.410
then by the multiplication law, after a few steps,

02:55.410 --> 02:57.720
and in this case after six steps,

02:57.720 --> 03:00.510
then we get that the probability

03:00.510 --> 03:04.530
for getting a good answer drops to about 0.53,

03:04.530 --> 03:07.410
and this is only for six consecutive calls.

03:07.410 --> 03:09.870
What if we have a huge task that requires more?

03:09.870 --> 03:12.720
Then this number drops even more.

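NOTE
A minimal sketch of the multiplication law mentioned above, assuming every
step picks the correct tool independently and with the same probability.
The function name is hypothetical and the 0.9 figure is the one from the video.
```python
def chain_success_probability(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps sequential tool choices are correct."""
    # Assumption: steps are independent and share the same success rate.
    return p_step ** n_steps
for n in (1, 3, 5, 6, 10):
    # With p_step = 0.9 this prints 0.90, 0.73, 0.59, 0.53 and 0.35.
    print(f"{n:>2} steps: {chain_success_probability(0.9, n):.2f}")
```
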
03:12.720 --> 03:14.670
So, there are ways to solve it,

03:14.670 --> 03:18.240
and one of the ways is to use fine tuning.

03:18.240 --> 03:21.120
And by that I mean to take an LLM

03:21.120 --> 03:24.180
and fine tune it for tool selection.

03:24.180 --> 03:27.270
So, there are research papers about that

03:27.270 --> 03:30.960
where people managed to fine-tune LLMs

03:30.960 --> 03:35.010
to make the tool selection, for example, to run API calls

03:35.010 --> 03:37.800
to have them yield better results.

03:37.800 --> 03:40.260
So, instead of a 90% chance,

03:40.260 --> 03:43.200
the chance of getting the correct tool is much, much higher

03:43.200 --> 03:46.170
because the LLM is fine tuned on the tools

03:46.170 --> 03:48.660
that the agent has at its disposal.

03:48.660 --> 03:50.790
Okay, let's talk about pricing.

03:50.790 --> 03:53.790
Now, you know that we pay for the tokens

03:53.790 --> 03:56.160
that we send and receive from the LLM.

03:56.160 --> 03:57.510
Now, if you're using agents,

03:57.510 --> 03:59.790
you saw how big the prompts can get.

03:59.790 --> 04:03.000
And when we do this at scale by the millions,

04:03.000 --> 04:05.790
then you can figure out that the billing report

04:05.790 --> 04:08.070
we're going to get is going to be very high.

04:08.070 --> 04:09.990
And I didn't even mention GPT-4,

04:09.990 --> 04:12.960
which has a strong reasoning capability,

04:12.960 --> 04:16.830
but runs very, very slow and is very expensive.

04:16.830 --> 04:19.800
So, if we use that and we use it at scale,

04:19.800 --> 04:21.720
then we can get into a situation

04:21.720 --> 04:26.720
where it's not financially worthwhile for us to run those agents.

04:26.910 --> 04:29.100
So, there are a couple of strategies to solve this,

04:29.100 --> 04:33.390
the first one is to use some kind of cache, semantic cache,

04:33.390 --> 04:36.090
instead of making an LLM call, we discussed this.

04:36.090 --> 04:38.850
And another one is to use retrieval augmentation

04:38.850 --> 04:40.290
for the tool selection.

04:40.290 --> 04:42.300
We're not going to show this in this course,

04:42.300 --> 04:45.120
but retrieval augmentation for the tool selection

04:45.120 --> 04:46.650
can also help in a case

04:46.650 --> 04:48.720
where we have too many tools to choose from.

04:48.720 --> 04:51.240
So, in that way, before we make the LLM call

04:51.240 --> 04:52.590
for the reasoning process,

04:52.590 --> 04:56.040
we do a semantic search and retrieve those relevant tools

04:56.040 --> 04:57.870
that have a high probability

04:57.870 --> 05:00.573
of being the correct tools for our answer.

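NOTE
A rough sketch of retrieval for tool selection as described above: rank the
tools against the task and pass only the top matches to the reasoning call.
Everything here is an assumption for illustration: the tool catalog is made
up, and a bag-of-words vector stands in for a real embedding model.
```python
import math
import re
from collections import Counter
# Hypothetical tool catalog: tool name mapped to a short description.
TOOLS = {
    "sql_query": "run a sql query against the orders database",
    "weather_api": "get the current weather forecast for a city",
    "send_email": "send an email message to a customer",
}
def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
def retrieve_tools(task: str, k: int = 2) -> list[str]:
    # Rank tools by similarity to the task and keep only the top k,
    # so the reasoning prompt lists k tools instead of the whole catalog.
    query = embed(task)
    ranked = sorted(TOOLS, key=lambda name: cosine(query, embed(TOOLS[name])), reverse=True)
    return ranked[:k]
print(retrieve_tools("what is the weather forecast in the city of Paris", k=1))
```
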
05:01.530 --> 05:04.080
So, now we need to talk about response validation.

05:04.080 --> 05:06.540
By the way, all the topics I discussed

05:06.540 --> 05:08.760
are relevant for all LLM applications

05:08.760 --> 05:10.470
and are real-world challenges.

05:10.470 --> 05:12.240
So, it's not only for agents,

05:12.240 --> 05:14.640
it's also for every LLM application.

05:14.640 --> 05:17.100
So, because we're making a call to the LLM

05:17.100 --> 05:18.990
and we're counting on it

05:18.990 --> 05:21.120
and based on that response

05:21.120 --> 05:23.400
we're going to maybe choose some tool

05:23.400 --> 05:25.440
or maybe output something to the user,

05:25.440 --> 05:28.560
then we need to have a mechanism that validates this.

05:28.560 --> 05:32.160
Because even if the LLM responds with the correct answer,

05:32.160 --> 05:34.200
but it's not in the correct format,

05:34.200 --> 05:36.720
then it can mess up our application.

05:36.720 --> 05:40.740
So, testing it is a very complicated task.

05:40.740 --> 05:43.200
And I personally haven't encountered

05:43.200 --> 05:45.690
a robust solution for this issue.

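NOTE
Not a robust solution to the testing problem, just a minimal sketch of a
format check that runs before the application acts on a response. It assumes
the LLM was asked to reply with a JSON object holding "tool" and "input";
that schema and the tool names are hypothetical.
```python
import json
ALLOWED_TOOLS = {"sql_query", "weather_api"}  # hypothetical tool names
def parse_tool_call(raw: str) -> dict:
    """Return the parsed tool call, or raise ValueError if the format is wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    tool = data.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(data.get("input"), str):
        raise ValueError("'input' must be a string")
    return data
```
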
05:45.690 --> 05:47.850
Okay, let's talk about security.

05:47.850 --> 05:50.700
And in agentic applications,

05:50.700 --> 05:53.100
we give the LLM capabilities to do stuff.

05:53.100 --> 05:55.950
For example, to run queries against the database,

05:55.950 --> 05:59.760
make an API call, or talk to a third party,

05:59.760 --> 06:03.450
and the agent has permissions to do all of this stuff.

06:03.450 --> 06:07.170
So, if some malicious user hijacks our prompt

06:07.170 --> 06:08.790
with prompt injection

06:08.790 --> 06:12.840
or gets hold of our API key,

06:12.840 --> 06:16.980
then they can gain access to our tools.

06:16.980 --> 06:19.980
And if our database is proprietary

06:19.980 --> 06:23.190
and we don't want to expose it, then we have a problem.

06:23.190 --> 06:25.320
So, security is a big issue,

06:25.320 --> 06:27.960
and overall we want to adhere

06:27.960 --> 06:29.880
to the least privilege principle,

06:29.880 --> 06:33.210
and that is to give those tools and those agents

06:33.210 --> 06:37.050
the minimum permissions they require.

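NOTE
A minimal sketch of least privilege for agent tools, assuming a deny-by-default
allowlist per agent. All agent and tool names are hypothetical. The SELECT
check is deliberately naive; in practice the database credentials themselves
should be read-only.
```python
# Hypothetical registry: each agent is granted only the tools it needs.
AGENT_PERMISSIONS = {
    "support_agent": {"search_docs"},
    "reporting_agent": {"search_docs", "read_only_sql"},
}
def search_docs(query: str) -> str:
    return f"docs results for {query!r}"
def read_only_sql(query: str) -> str:
    # Naive guard: refuse anything that does not start with SELECT.
    if not query.lstrip().lower().startswith("select"):
        raise PermissionError("only SELECT statements are allowed")
    return f"rows for {query!r}"
TOOLS = {"search_docs": search_docs, "read_only_sql": read_only_sql}
def call_tool(agent: str, tool: str, argument: str) -> str:
    # Deny by default: unknown agents and ungranted tools are both refused.
    if tool not in AGENT_PERMISSIONS.get(agent, set()):
        raise PermissionError(f"{agent} may not use {tool}")
    return TOOLS[tool](argument)
```
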
06:37.050 --> 06:39.390
We want to have guardrails on the prompts

06:39.390 --> 06:41.040
we send to our agents

06:41.040 --> 06:44.400
and to allow some prompts and block others.

06:44.400 --> 06:47.010
So, there are many open source solutions for this.

06:47.010 --> 06:49.470
Now, I recommend using LLM Guard,

06:49.470 --> 06:50.880
which is an open source project

06:50.880 --> 06:53.700
that I think is a very promising one,

06:53.700 --> 06:55.530
which offers a lot of functionality

06:55.530 --> 06:57.360
when it comes to LLM security,

06:57.360 --> 06:59.910
when integrating LLMs in production.

06:59.910 --> 07:02.340
Now, the last topic I want to talk about

07:02.340 --> 07:05.313
is overkilling it when using agents.

07:06.660 --> 07:10.830
Agents are good when we have a nondeterministic sequence

07:10.830 --> 07:13.410
of steps we want to execute.

07:13.410 --> 07:16.620
If we know exactly what we want to execute

07:16.620 --> 07:20.340
and we can define it by talking or by writing code,

07:20.340 --> 07:23.490
then we don't need to use agents.

07:23.490 --> 07:26.490
I've encountered multiple people in companies

07:26.490 --> 07:29.880
trying to achieve things with LLM agents

07:29.880 --> 07:33.300
where the real solution and robust solution

07:33.300 --> 07:37.410
was to simply implement some deterministic code in Python

07:37.410 --> 07:39.630
that would achieve what they wanted.

07:39.630 --> 07:42.540
So, my advice to you before you use agents,

07:42.540 --> 07:44.910
really think if you can implement it yourself

07:44.910 --> 07:46.680
in some deterministic code.

07:46.680 --> 07:49.920
If you can, then I really don't suggest using agents

07:49.920 --> 07:52.593
because, as you saw, it comes with a lot of challenges.

07:53.520 --> 07:55.230
And that's it for this video.

07:55.230 --> 07:58.113
And I really want to give a quick disclaimer here.

07:58.950 --> 08:02.400
I think agents are an amazing technology

08:02.400 --> 08:06.210
and they have a massive potential for us

08:06.210 --> 08:08.730
to achieve great things.

08:08.730 --> 08:11.700
I don't think it's easy going from a prototype

08:11.700 --> 08:13.440
to a production application.

08:13.440 --> 08:17.130
It is totally possible, but it has a lot of challenges.

08:17.130 --> 08:18.900
By no means am I saying

08:18.900 --> 08:21.870
that agents are not ready for production, I didn't say that.

08:21.870 --> 08:24.480
I just mean that we need to be very careful

08:24.480 --> 08:27.510
when we use them because it comes with a lot of cost.

08:27.510 --> 08:30.240
And integrating such great technology

08:30.240 --> 08:33.873
comes with a cost, and we saw the challenges.