WEBVTT

00:00.120 --> 00:00.953
-: All right, welcome back.

00:00.953 --> 00:03.720
So in this video, we're gonna have a look at rate limits

00:03.720 --> 00:06.300
and retrying within the Responses API.

00:06.300 --> 00:08.047
I want you to go and open the notebook

00:08.047 --> 00:10.650
rate_limits_and_retrying_with_tenacity,

00:10.650 --> 00:12.930
and we have a variety of different ways

00:12.930 --> 00:14.340
that you can get around errors.

00:14.340 --> 00:17.400
So if you're doing lots and lots of API requests,

00:17.400 --> 00:18.960
you might start running into something

00:18.960 --> 00:20.400
called a rate limit error.

00:20.400 --> 00:22.020
I'm gonna show you a couple of different ways

00:22.020 --> 00:24.900
that you can overcome rate limits by using retrying.

00:24.900 --> 00:26.880
The first thing we're gonna do is install some packages,

00:26.880 --> 00:29.670
so openai, tenacity, and backoff.

00:29.670 --> 00:32.880
Tenacity and backoff are general-purpose retrying packages.

00:32.880 --> 00:34.590
You're then gonna initialize your client,

00:34.590 --> 00:36.600
and you'll need to put your API key here

00:36.600 --> 00:37.890
along with your model.

00:37.890 --> 00:39.690
You can customize the OpenAI client.

00:39.690 --> 00:42.810
By default, it will do two retries,

00:42.810 --> 00:45.960
but you can customize this to do more retries.

00:45.960 --> 00:49.950
So if you are hitting limits when you're using OpenAI,

00:49.950 --> 00:51.510
you can increase the number of retries.

00:51.510 --> 00:54.120
I will say that this will increase the amount

00:54.120 --> 00:55.140
of time it will take

00:55.140 --> 00:58.470
for OpenAI to come back with the output

00:58.470 --> 01:01.230
if it's running into a rate limit error.

01:01.230 --> 01:02.820
So that's one way you could do this

01:02.820 --> 01:05.760
is customizing that max retries on the client itself.
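
NOTE
A minimal sketch of raising the retry count on the client, as described above.
The helper name make_client and the value 5 are assumptions for illustration;
max_retries is the client option the speaker refers to, and the API key is
assumed to come from the OPENAI_API_KEY environment variable.
```python
def make_client(max_retries=5):
    """Build an OpenAI client that retries failed requests more often than the default of 2."""
    # Imported lazily so this helper can be defined before the package is installed.
    from openai import OpenAI
    # More retries means a request can take longer to return when limits are being hit.
    return OpenAI(max_retries=max_retries)
```
Calling make_client() once and reusing the client keeps the retry policy in one place.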

01:05.760 --> 01:06.660
Now you're probably wondering

01:06.660 --> 01:08.400
what are the different types of rate limits?

01:08.400 --> 01:09.480
Well, there's actually several.

01:09.480 --> 01:11.520
There are the requests per minute rate limits.

01:11.520 --> 01:13.110
That's how many requests you can make

01:13.110 --> 01:14.940
against OpenAI per minute.

01:14.940 --> 01:17.040
There's also something called tokens per minute,

01:17.040 --> 01:18.270
and this is where the fact

01:18.270 --> 01:19.770
that you now know about tiktoken,

01:19.770 --> 01:21.240
how to count tokens,

01:21.240 --> 01:22.740
allows you to roughly work out

01:22.740 --> 01:25.590
how many tokens a message is going to use.

01:25.590 --> 01:28.980
As well as that, we can also make a request

01:28.980 --> 01:31.350
against the endpoint.

01:31.350 --> 01:34.110
And when we make a request with the requests package

01:34.110 --> 01:36.300
and we're just putting in one token

01:36.300 --> 01:38.670
and we're just using the role of user in this content,

01:38.670 --> 01:39.660
and then what you'll see

01:39.660 --> 01:42.000
is we get back these response headers

01:42.000 --> 01:44.220
and you'll see that it shows us

01:44.220 --> 01:46.650
how many remaining requests we have.

01:46.650 --> 01:49.473
If we go and have a look at the response.headers,

01:51.690 --> 01:54.720
you'll see there's a variety of different things in here.

01:54.720 --> 01:56.370
So the important ones

01:56.370 --> 01:58.770
will be the version of the API that you're using.

01:58.770 --> 02:01.020
These ones, there's x-ratelimit requests,

02:01.020 --> 02:02.610
so how many requests you have,

02:02.610 --> 02:04.710
how many token limits you have,

02:04.710 --> 02:07.590
and also things like the remaining number of requests

02:07.590 --> 02:09.540
and the remaining number of tokens

02:09.540 --> 02:13.020
and how long it will take until both the requests are reset

02:13.020 --> 02:14.880
and also the tokens are reset.
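
NOTE
A rough sketch of reading the rate limit headers just described from a mapping
of response headers. The function names and the sample values are hypothetical,
and the reset format (durations such as "6m0s" or "120ms") is an assumption
about how the reset headers are written.
```python
import re
# Seconds per unit; "ms" is listed first in the regex so it is not read as "m".
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
def parse_reset(value):
    """Convert a reset string such as '6m0s' or '120ms' into seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", value)
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)
def summarize_rate_limits(headers):
    """Pick the x-ratelimit-* fields out of a mapping of response headers."""
    lowered = {key.lower(): val for key, val in headers.items()}
    summary = {}
    for kind in ("requests", "tokens"):
        summary[kind] = {
            "limit": int(lowered[f"x-ratelimit-limit-{kind}"]),
            "remaining": int(lowered[f"x-ratelimit-remaining-{kind}"]),
            "reset_seconds": parse_reset(lowered[f"x-ratelimit-reset-{kind}"]),
        }
    return summary
# Hypothetical header values for illustration only, not real API output.
sample_headers = {
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-requests": "499",
    "x-ratelimit-reset-requests": "120ms",
    "x-ratelimit-limit-tokens": "200000",
    "x-ratelimit-remaining-tokens": "199990",
    "x-ratelimit-reset-tokens": "6m0s",
}
summary = summarize_rate_limits(sample_headers)
print(summary)
```
With the requests package, passing response.headers to summarize_rate_limits should work the same way.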

02:14.880 --> 02:18.330
You can also customize retries per request,

02:18.330 --> 02:21.300
with the client package itself, with this .with_options.

02:21.300 --> 02:24.780
And you can see here we're adding a custom max retries

02:24.780 --> 02:27.870
on the chat.completions functionality.
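
NOTE
A short sketch of the per-request override mentioned above, assuming an
already initialized client. The wrapper name create_with_custom_retries and
the default of 5 are hypothetical; .with_options and max_retries are the
pieces named in the video.
```python
def create_with_custom_retries(client, max_retries=5, **kwargs):
    """Send one chat completion with its own retry budget, leaving the client default untouched."""
    return client.with_options(max_retries=max_retries).chat.completions.create(**kwargs)
```
This keeps the client-wide setting as it is and only raises retries for the calls that need it.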

02:27.870 --> 02:30.600
You can also use a Python package called Tenacity,

02:30.600 --> 02:33.240
and that basically will decorate a function.

02:33.240 --> 02:35.340
So you can see we've just wrapped

02:35.340 --> 02:38.490
the client.responses.create and all the kwargs,

02:38.490 --> 02:40.290
and we've named it create_with_backoff.

02:40.290 --> 02:42.440
Then we can pass in all the original arguments.

02:42.440 --> 02:44.130
So the model is equal to the MODEL,

02:44.130 --> 02:45.787
the input is equal to this "role": "user",

02:45.787 --> 02:47.160
"content": "Tell me a joke".

02:47.160 --> 02:50.850
And then that's basically setting up six retries

02:50.850 --> 02:53.640
and it's also adding in some wait time

02:53.640 --> 02:55.560
between those retries.
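
NOTE
A sketch of what the Tenacity wrapper described above might look like. The
factory name build_create_with_backoff and the wait bounds of 1 to 60 seconds
are assumptions, not necessarily the notebook's exact values. Note that
stop_after_attempt(6) means six attempts in total, and that by default
Tenacity retries on any exception, not only rate limit errors.
```python
def build_create_with_backoff(client):
    """Wrap client.responses.create so failed calls are retried with random exponential waits."""
    # Imported here so the dependency is only needed when the wrapper is built.
    from tenacity import retry, stop_after_attempt, wait_random_exponential
    @retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
    def create_with_backoff(**kwargs):
        return client.responses.create(**kwargs)
    return create_with_backoff
```
The returned function takes the original arguments, for example model=MODEL and an input list holding the "Tell me a joke" user message.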

02:55.560 --> 02:58.740
You can also code up a manual exponential backoff.

02:58.740 --> 03:01.230
So if you really need to go quite low level,

03:01.230 --> 03:02.250
have a look at this

03:02.250 --> 03:04.260
where you're basically creating a function,

03:04.260 --> 03:07.680
the initial delay, the factor, some random jitter,

03:07.680 --> 03:09.720
and the maximum number of retries.

03:09.720 --> 03:12.060
And then we are using function decorators

03:12.060 --> 03:14.280
to basically loop over that function.

03:14.280 --> 03:17.250
And you'll see here if there is an OpenAI rate limit error,

03:17.250 --> 03:19.440
then we're basically going and sleeping

03:19.440 --> 03:21.450
and then adjusting the delay.

03:21.450 --> 03:23.160
And then you can use that like this,

03:23.160 --> 03:24.570
so you've got this retry backoff,

03:24.570 --> 03:26.100
you've got a create_manual,

03:26.100 --> 03:28.890
and then we're just using that client.responses.create.

03:28.890 --> 03:30.720
So there's another way for you to write that

03:30.720 --> 03:34.290
if you wanted to customize the retrying a lot.
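
NOTE
A minimal sketch of a manual exponential backoff decorator along the lines
described above. TransientRateLimit is a hypothetical stand-in exception so
the example runs without the SDK; in real use you would likely pass
openai.RateLimitError through the errors argument. The sleep parameter is an
added assumption that makes the waits easy to observe.
```python
import functools
import random
import time
class TransientRateLimit(Exception):
    """Hypothetical stand-in for a rate limit error raised by an API client."""
def retry_with_backoff(initial_delay=1.0, factor=2.0, jitter=True, max_retries=5,
                       errors=(TransientRateLimit,), sleep=time.sleep):
    """Return a decorator that retries the wrapped function with growing delays."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except errors:
                    # Give up once the retry budget is spent.
                    if attempt == max_retries:
                        raise
                    # Wait the current delay, plus up to 100% random jitter if enabled.
                    multiplier = (1 + random.random()) if jitter else 1
                    sleep(delay * multiplier)
                    # Grow the delay for the next round.
                    delay *= factor
        return wrapper
    return decorator
```
Decorating a small create_manual function that calls client.responses.create would mirror the usage shown in the video.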

03:34.290 --> 03:36.510
Now, my personal opinion on this

03:36.510 --> 03:38.490
would be to customize the client

03:38.490 --> 03:41.130
and how many max retries you set for the client.

03:41.130 --> 03:43.710
Also, remember to visit your OpenAI platform,

03:43.710 --> 03:44.640
look at the usage,

03:44.640 --> 03:46.620
and then have a look at the various things.

03:46.620 --> 03:48.600
So like the input tokens you're using daily,

03:48.600 --> 03:50.430
the output tokens you're using,

03:50.430 --> 03:52.410
and how many requests you're making,

03:52.410 --> 03:54.720
that will give you a good amount of knowledge

03:54.720 --> 03:57.450
about how close you are to the rate limits.

03:57.450 --> 03:59.280
Now, where do you find the rate limits?

03:59.280 --> 04:01.860
Click on the Settings cog,

04:01.860 --> 04:03.270
go to Limits.

04:03.270 --> 04:05.340
And depending upon your tier,

04:05.340 --> 04:06.750
you'll see you get different limits.

04:06.750 --> 04:09.450
So for example, I'm on Usage Tier 1,

04:09.450 --> 04:11.550
and these are my token limits per model

04:11.550 --> 04:14.013
and my requests per minute limits.

04:15.000 --> 04:17.490
Now, depending upon when you upgrade

04:17.490 --> 04:19.350
to a higher usage tier,

04:19.350 --> 04:22.950
basically if you spend more than a certain amount of money,

04:22.950 --> 04:25.290
you will automatically upgrade

04:25.290 --> 04:26.880
and you can actually buy credits

04:26.880 --> 04:28.890
to make sure that your organization

04:28.890 --> 04:30.753
goes to a new usage tier.

04:31.680 --> 04:34.230
So remember, you've got two types of rate limits,

04:34.230 --> 04:36.930
tokens per minute, TPM,

04:36.930 --> 04:39.210
and requests per minute, RPM.

04:39.210 --> 04:41.340
You need to respect both these limits.
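
NOTE
A small worked example of respecting both limits at once. The function name
is hypothetical, and the figures of 500 requests per minute and 200,000
tokens per minute are only the sample numbers quoted in this video, so check
your own Limits page for real values.
```python
def max_requests_per_minute(rpm_limit, tpm_limit, tokens_per_request):
    """Return how many requests per minute fit under both the RPM and TPM limits."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # Whichever limit is reached first caps the sustainable request rate.
    return min(rpm_limit, tpm_limit // tokens_per_request)
heavy = max_requests_per_minute(500, 200_000, 1_000)
light = max_requests_per_minute(500, 200_000, 100)
print(heavy, light)
```
With 1,000 tokens per request the token limit is the bottleneck, while with 100 tokens per request the request limit is.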

04:41.340 --> 04:42.240
The final thing I'll say

04:42.240 --> 04:45.390
is if you're doing heavy, large workloads,

04:45.390 --> 04:48.180
so you need to process 500,000 requests,

04:48.180 --> 04:50.640
you should look at using the batch API endpoint,

04:50.640 --> 04:53.580
which will not count against your standard rate limits.

04:53.580 --> 04:56.730
The batch API endpoint, these rate limits are separate.

04:56.730 --> 05:00.750
So you can run 900,000 tokens per day

05:00.750 --> 05:02.490
on GPT-4.1

05:02.490 --> 05:05.940
or 2 million tokens per day on GPT-4.1-mini.

05:05.940 --> 05:08.670
You can still then use 500 requests per minute

05:08.670 --> 05:12.870
and 200,000 tokens per minute for your day-to-day usage.

05:12.870 --> 05:14.550
So check out the batch API,

05:14.550 --> 05:16.953
and it has its own separate limits for that.