WEBVTT

00:00.060 --> 00:01.290
-: In this video, you're gonna learn

00:01.290 --> 00:02.583
what prompt caching is.

00:03.491 --> 00:04.500
We're gonna look at the prompt caching methods

00:04.500 --> 00:06.390
for both OpenAI and Anthropic

00:06.390 --> 00:08.220
and how you can understand these

00:08.220 --> 00:10.530
to make decisions about your LLM calls.

00:10.530 --> 00:14.670
So, prompt caching was released by Anthropic before OpenAI.

00:14.670 --> 00:17.160
They released it on the 14th of August 2024.

00:17.160 --> 00:20.400
They have basically made a more cost-effective way

00:20.400 --> 00:21.750
for your input tokens

00:21.750 --> 00:23.910
if you're reusing the same input tokens.

00:23.910 --> 00:27.300
OpenAI followed suit less than two months later,

00:27.300 --> 00:29.040
on the first of October 2024,

00:29.040 --> 00:31.770
and they introduced prompt caching in the API layer.

00:31.770 --> 00:33.120
There are some major differences.

00:33.120 --> 00:35.250
We're gonna explore what prompt caching is

00:35.250 --> 00:36.960
and how you can leverage it as well.

00:36.960 --> 00:38.700
So prompt caching is a way for you

00:38.700 --> 00:42.690
to reuse frequent context in API calls,

00:42.690 --> 00:45.000
and it's solely for input tokens

00:45.000 --> 00:47.280
and doesn't affect output token prices.

00:47.280 --> 00:50.640
So if you're always using the same inputs again and again

00:50.640 --> 00:53.940
and again, then prompt caching is gonna save you money,

00:53.940 --> 00:56.430
whether you're using OpenAI or Anthropic.

00:56.430 --> 00:59.310
The common use cases for prompt caching

00:59.310 --> 01:02.700
are where you want to reduce cost and latency.

01:02.700 --> 01:05.580
You want to use it for the same context again and again,

01:05.580 --> 01:08.850
and it's ideal for long-form content and conversations.

01:08.850 --> 01:11.760
OpenAI offers prompt caching automatically,

01:11.760 --> 01:13.680
so you don't actually have to do anything,

01:13.680 --> 01:17.340
apart from using a prompt

01:17.340 --> 01:21.240
which has to contain at least 1,024 tokens,

01:21.240 --> 01:24.240
and they will cache the longest matching prefix,

01:24.240 --> 01:27.870
which will be at least 1,024 tokens,

01:27.870 --> 01:30.480
and cached input tokens get a 50% cost reduction.
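
NOTE
A rough sketch of the OpenAI discount described above, not official billing logic.
It assumes cached input tokens cost 50% of the base price and that prompts under
1,024 tokens are never cached. The function name, the price of 2.50 per million
tokens, and the token counts are hypothetical, chosen only for illustration.
```python
MIN_CACHEABLE_TOKENS = 1024  # minimum prompt length from the video
CACHED_DISCOUNT = 0.5        # cached tokens billed at 50% of base price
def openai_input_cost(prompt_tokens, cached_tokens, price_per_million):
    """Estimate input cost for one call, given how many tokens hit the cache."""
    if prompt_tokens < MIN_CACHEABLE_TOKENS:
        cached_tokens = 0  # too short to be cached at all
    uncached_tokens = prompt_tokens - cached_tokens
    billable = uncached_tokens + CACHED_DISCOUNT * cached_tokens
    return billable * price_per_million / 1_000_000
# Hypothetical example: a 2,000-token prompt at 2.50 per million input tokens.
full_cost = openai_input_cost(2000, 0, 2.50)    # first call, nothing cached
hit_cost = openai_input_cost(2000, 1920, 2.50)  # later call, 1,920 tokens cached
savings = 1 - hit_cost / full_cost
print(f"full={full_cost:.4f} hit={hit_cost:.4f} savings={savings:.0%}")
```
The cached token count is an input here; in practice you would read it from the
usage data returned with each response.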

01:30.480 --> 01:31.920
The cache also expires

01:31.920 --> 01:34.380
after 5 to 10 minutes of inactivity,

01:34.380 --> 01:38.040
and it's currently available for GPT-4o, GPT-4o mini,

01:38.040 --> 01:40.710
o1-preview, and o1-mini.

01:40.710 --> 01:42.390
If we have a look at Anthropic,

01:42.390 --> 01:44.160
it's a little bit different in the sense

01:44.160 --> 01:46.320
that the cost reduction is much greater

01:46.320 --> 01:48.870
and it also provides a latency reduction.

01:48.870 --> 01:52.350
Now the pricing structure is based on both the cache write

01:52.350 --> 01:53.550
and the cache read

01:53.550 --> 01:56.100
so the initial cache write is slightly more expensive,

01:56.100 --> 01:59.670
about 25% more expensive than the base price

01:59.670 --> 02:00.930
of those input tokens.

02:00.930 --> 02:03.060
However, when you're reading from the cache,

02:03.060 --> 02:07.650
then it's only gonna be 10% of the base price,

02:07.650 --> 02:10.830
and it's available for Haiku, Sonnet, and Opus models.
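
NOTE
A rough sketch of the Anthropic write/read pricing described above, not official
billing logic. It assumes the first call writes the cache at 125% of the base input
price, every later call reads it at 10%, and the cache never expires between calls.
The function names, the price of 3.00 per million tokens, and the token and call
counts are hypothetical, chosen only for illustration.
```python
CACHE_WRITE_MULTIPLIER = 1.25  # cache write: about 25% above base price
CACHE_READ_MULTIPLIER = 0.10   # cache read: 10% of base price
def uncached_cost(tokens, calls, base_price_per_million):
    """Cost of sending the same context on every call with no caching."""
    return tokens * calls * base_price_per_million / 1_000_000
def cached_cost(tokens, calls, base_price_per_million):
    """Cost with one cache write followed by (calls - 1) cache reads."""
    base = tokens * base_price_per_million / 1_000_000
    return base * (CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (calls - 1))
# Hypothetical example: a 10,000-token context reused across 5 calls.
without_cache = uncached_cost(10_000, 5, 3.00)
with_cache = cached_cost(10_000, 5, 3.00)
savings = 1 - with_cache / without_cache
print(f"without={without_cache:.4f} with={with_cache:.4f} savings={savings:.0%}")
```
Under these assumptions a single call costs more with caching (1.25x versus 1x),
but from the second call onward the cached route is cheaper (1.35x versus 2x).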

02:10.830 --> 02:11.940
Now, in the next video, we're gonna look at a notebook

02:11.940 --> 02:15.240
and see exactly

02:15.240 --> 02:17.520
how you can explore prompt caching

02:17.520 --> 02:19.500
with both of these providers.

02:19.500 --> 02:20.760
You'll also learn how to calculate

02:20.760 --> 02:22.533
the input token savings.
