WEBVTT

00:00.020 --> 00:03.470
In this video, we're going to have a look at something called the mixture of agents.

00:03.470 --> 00:10.220
We're going to look at how a layered architecture, using layers of LLMs and then an aggregator,

00:10.220 --> 00:13.910
can potentially improve the quality of the outputs.

00:13.910 --> 00:17.090
This works really well when you're using open source models.

00:17.090 --> 00:22.280
And if we have a look at the diagram here, at the bottom left-hand side of the screen, we can

00:22.280 --> 00:25.520
see that you have a prompt which goes into a layer.

00:25.520 --> 00:29.000
This layer of LLMs will then answer that question.

00:29.000 --> 00:36.540
We then have a second layer, which aggregates the results of those LLM calls and synthesizes

00:36.540 --> 00:43.260
those responses into a single, higher-quality response, which then becomes the final

00:43.260 --> 00:43.920
output.

00:43.950 --> 00:48.750
We've done a proof of concept, which again references Together Computer.

00:48.750 --> 00:53.400
They've put together a really great paper on mixture of agents.

00:53.400 --> 00:54.450
So let's have a look.

00:54.450 --> 01:00.100
So the first thing that we do is we install a bunch of packages and then we run our imports.

01:00.130 --> 01:08.080
Now, for this notebook, you will need an OpenAI API key, an Anthropic API key and a Mistral

01:08.080 --> 01:08.740
API key.

01:08.740 --> 01:10.810
So all of those are different platforms.

01:10.810 --> 01:15.880
And then what you can see here is we have a user prompt like, what are three fun things to do in San

01:15.880 --> 01:16.690
Francisco.

01:16.720 --> 01:25.210
We have our reference models, which in this case are GPT-4o, Claude 3.5 Sonnet and

01:25.220 --> 01:26.870
also Mistral's API.

01:26.870 --> 01:30.050
And we have our aggregator model, which is set to GPT-4o.

01:30.080 --> 01:35.600
And then what we're doing here is we're saying: you've been provided with a set of responses

01:35.600 --> 01:41.600
from various open-source models to the latest user query. Your task is to synthesize these responses into

01:41.600 --> 01:43.820
a higher quality response.
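
NOTE
A minimal Python sketch of how the aggregator prompt read out above might be assembled.
The helper name build_aggregator_system_prompt and the numbered-list layout are
assumptions for illustration, not the notebook's actual code. The instruction text
paraphrases the wording spoken in the video.
```python
# Instruction text paraphrased from the video; the exact notebook wording may differ.
AGGREGATOR_INSTRUCTIONS = (
    "You have been provided with a set of responses from various open-source "
    "models to the latest user query. Your task is to synthesize these "
    "responses into a single, high-quality response."
)
#
# Hypothetical helper: append the reference responses, numbered from 1.
def build_aggregator_system_prompt(responses: list[str]) -> str:
    numbered = [f"{i}. {text}" for i, text in enumerate(responses, start=1)]
    return AGGREGATOR_INSTRUCTIONS + "\n\nResponses from models:\n" + "\n".join(numbered)
#
if __name__ == "__main__":
    print(build_aggregator_system_prompt(["Visit Alcatraz.", "Walk the Golden Gate Bridge."]))
```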

01:43.820 --> 01:46.430
And then you get the responses from the models.

01:46.430 --> 01:51.800
And then we run each model, adding a short sleep for each one,

01:51.800 --> 01:53.420
and then we get the responses.

01:53.420 --> 01:57.410
So this is the run LLM function, which is specifically for running an LLM.

01:57.680 --> 02:02.870
And then this main function here will run for each model in the reference models.

02:02.870 --> 02:08.030
It will run this function here and it will gather up the results.

02:08.030 --> 02:13.970
And then we will use the aggregator here with the system message from the aggregator system prompt

02:13.970 --> 02:15.470
that we were talking about earlier.

02:15.470 --> 02:19.020
And we'll pass in all of the results and enumerate those.

02:19.020 --> 02:24.180
And then we have our human user message, which will be the user prompt.
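
NOTE
The flow just described (run every reference model, gather the results, then hand
them to the aggregator) can be sketched roughly as follows. This is only a sketch:
run_llm is a hypothetical stub that echoes its input instead of calling a real
provider, and the model names are plain labels, so it runs without any API keys.
The notebook's real functions and prompts may differ.
```python
import asyncio
#
# Hypothetical stand-in for a real provider call; the sleep simulates latency.
async def run_llm(model: str, system: str, user: str) -> str:
    await asyncio.sleep(0.01)
    return f"[{model}] answer to: {user}"
#
async def mixture_of_agents(user_prompt: str, reference_models: list[str], aggregator_model: str) -> str:
    # Layer 1: query every reference model concurrently and gather the results.
    results = await asyncio.gather(
        *(run_llm(m, "You are a helpful assistant.", user_prompt) for m in reference_models)
    )
    # Layer 2: enumerate the results into the aggregator's system message,
    # then send the original user prompt as the human message.
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(results, start=1))
    system = "Synthesize these responses into a single, high-quality response.\n" + numbered
    return await run_llm(aggregator_model, system, user_prompt)
#
if __name__ == "__main__":
    prompt = "What are three fun things to do in San Francisco?"
    print(asyncio.run(mixture_of_agents(prompt, ["ref-a", "ref-b", "ref-c"], "aggregator")))
```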

02:24.180 --> 02:25.920
So let's have a look and see what happens.

02:25.920 --> 02:31.770
So now when we run this async code, what's actually happening is we've got multiple layers of

02:31.770 --> 02:33.360
these LLMs being called.

02:33.360 --> 02:40.440
And then that's all being aggregated again by GPT-4o to try and hopefully synthesize a better

02:40.440 --> 02:40.860
response.

02:40.860 --> 02:47.110
So this is a really simple way for you to integrate mixture of agents into your LLM-based applications.

02:47.140 --> 02:48.880
A couple of key points here.

02:48.880 --> 02:52.510
This will probably add extra latency to your application.

02:52.510 --> 02:57.640
The reason for this is that the slowest API vendor is going to become your bottleneck.
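
NOTE
A small sketch of why the slowest vendor sets the pace when the reference calls run
concurrently. The vendor names and latency figures are made up for illustration and
fake_call is a hypothetical stand-in for a real API round trip. In a full pipeline
the aggregator call would then add its own latency on top of this.
```python
import asyncio
import time
#
# Hypothetical per-vendor latencies in seconds; real values vary per request.
LATENCIES = {"vendor-a": 0.1, "vendor-b": 0.2, "vendor-c": 0.3}
#
async def fake_call(name: str) -> str:
    await asyncio.sleep(LATENCIES[name])  # stand-in for a real API round trip
    return name
#
async def timed_fan_out() -> float:
    # gather() returns only once every call has finished, so the wall time
    # tracks the slowest call rather than the sum of all calls.
    start = time.perf_counter()
    await asyncio.gather(*(fake_call(n) for n in LATENCIES))
    return time.perf_counter() - start
#
elapsed = asyncio.run(timed_fan_out())
print(f"fan-out took {elapsed:.2f}s; slowest vendor takes {max(LATENCIES.values()):.2f}s")
```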

02:57.640 --> 03:01.720
And also this isn't good for a streaming-based architecture.

03:01.720 --> 03:07.360
So if you need to stream the results immediately to users, that's probably not the best use of

03:07.360 --> 03:08.380
this kind of application.

03:08.380 --> 03:13.060
So you are going to get additional latency, but it can hopefully improve your results.