AI FinOps

Prompt Caching vs Semantic Caching vs Exact Match

Two of these caches skip the LLM call for $0; the third makes the call cheaper. Here is what prompt caching, semantic caching, and gateway exact match each catch, and how to layer all three.

Ashraf Ali

06 Jun 2026 • 7 min read

Three different caches can sit in front of an LLM API, and they save money in two completely different ways. Picking the wrong one, or assuming "prompt caching" covers all of it, is how teams leave 40 to 80 percent of their cache savings on the table. This guide breaks down prompt caching vs semantic caching vs gateway exact match: what each one actually catches, where it runs, what it costs, and how to stack all three so they compound instead of overlap.

The short version: two of these caches let you skip the API call entirely and pay $0. The third cannot skip the call, but it makes the call you do send much cheaper. Most production stacks need both kinds.

TL;DR

Gateway exact match returns a stored answer for byte-identical requests. Cost on a hit: $0.
Semantic caching returns a stored answer for near-identical requests above a similarity threshold. Cost on a hit: $0.
Native prompt caching does not skip the call. It reuses a provider-side computed prefix so the cached tokens bill at a steep discount (Anthropic charges cache reads at 0.1x base input; OpenAI discounts cached input 50 to 90 percent). Cache hit rates of 60 to 80 percent are common on stable prefixes.
They layer. A request can hit exact match first, fall through to semantic, then fall through to a discounted prompt-cache call.

The two jobs of an LLM cache

Every caching technique in this comparison does one of two jobs. Keep them separate in your head and the rest is easy.

Job 1: skip the call. If you can prove a new request is the same as one you already answered, you return the stored answer and never touch the provider. You pay nothing. The risk is correctness: return a stored answer for a request that was not actually equivalent, and you serve a wrong result.

Job 2: cheapen the call. Some requests genuinely have to run. They are new, or the answer must be fresh. You cannot skip them, but you can avoid re-paying for the parts the model has already processed. That is what provider-side prompt caching does, and it is the reason a 30,000-token system prompt does not cost full price on every single call.

Exact match and semantic caching do Job 1. Native prompt caching does Job 2. That single distinction is the whole comparison.

Native prompt caching: cheapen the call you can't avoid

Native prompt caching is a feature of the model providers, not of any gateway. OpenAI, Anthropic, and others store the computed state of a long, stable prefix (your system prompt, few-shot examples, a large RAG context block) so that on the next request with the same prefix, you are not billed full price for those tokens again.

The receipts, as of June 2026:

OpenAI applies prompt caching automatically on any prompt over 1,024 tokens, caching the longest previously-seen prefix in 128-token increments. Cached input is discounted (50 percent on older models, 75 to 90 percent on 2026 GPT-5.x models). Caches typically evict after 5 to 10 minutes of inactivity. No code change required.
Anthropic is explicit: a cache read costs 0.1x the base input price (a 90 percent discount), while a 5-minute cache write costs 1.25x and a 1-hour write costs 2x. Note the 2026 change: Anthropic's default cache TTL quietly dropped from 1 hour to 5 minutes in early 2026, which silently inflated costs for intermittent workloads. If your traffic is bursty, you are paying the write multiplier more often than you think.

The key honesty point: prompt caching still bills tokens. It is a discount, not a skip. A cached Anthropic read at 0.1x is cheap, but it is not free, and you still pay full freight on the dynamic tail of every prompt (the user's actual question).

A good gateway automates this: it identifies cacheable prefixes and structures requests to hit the provider's cache, so you do not hand-tune cache_control breakpoints per provider.

Best for: long, stable system prompts and RAG context. Customer support, template generation, agent loops with a fixed instruction block.

Gateway exact match: $0 on byte-identical requests

Gateway exact match is the simplest cache to reason about. The gateway hashes the request body. If an identical hash has been seen before, it returns the stored response and the provider is never called. API cost on that duplicate: zero. Response time drops from seconds to milliseconds.

The catch is in the word "identical." One different character (a trailing space, a reordered JSON key, a different temperature) is a cache miss. That makes exact match safe (you will almost never serve a wrong answer) but narrow (it only catches true duplicates).

Best for: batch processing with repeated inputs, regression test suites that re-run the same prompts, and user-facing queries that happen to repeat verbatim.

Semantic caching: $0 on near-identical requests

Semantic caching is where the real savings envelope opens up. Instead of hashing the bytes, it embeds the request and compares it to prior requests by meaning. When two requests exceed a similarity threshold (around 95 percent), the stored result is reused and the provider call is skipped. Cost on a hit: zero. Typical savings reported: 40 to 60 percent of token cost on workloads full of paraphrased prompts.

This is the difference between catching "What is your refund policy?" and also catching "how do refunds work here?" Exact match sees two different strings. Semantic caching sees one question.

The trade-off is the threshold. Set it too tight and you miss reuse opportunities. Set it too loose and you risk serving a stored answer to a request that was not really equivalent. For high-stakes paths, pair it with application-level cache-key hints so a false positive cannot do damage.

Best for: chat assistants, search, FAQ bots, any user-facing surface where humans ask the same thing fifty different ways.

Prompt caching vs semantic caching vs exact match: the comparison

Cache	Job	Matches on	Where it runs	Cost on hit	Risk	Best for
Gateway exact match	Skip the call	Byte-identical request	Gateway	$0	Near zero	Batch, regression tests, verbatim repeats
Semantic caching	Skip the call	~95% semantic similarity	Gateway	$0	False positives if threshold too loose	Chat, search, FAQ, paraphrased prompts
Native prompt caching	Cheapen the call	Identical long prefix	Provider side	Discounted tokens (0.1x read on Anthropic)	None to correctness	Long stable system prompts, RAG context

How the three layer (they don't compete)

The reason "vs" is the wrong frame: in a real request path, these run in sequence, cheapest first.

Exact match checks the hash. Verbatim duplicate? Return it. $0. Done.
Semantic cache checks similarity. Close enough to a prior request? Return it. $0. Done.
The call runs. Now native prompt caching kicks in so the stable prefix bills at the provider's cache-read discount instead of full price.

Each layer catches what the layer above it missed. Exact match is the cheapest and safest, so it goes first. Semantic catches the paraphrases exact match cannot. Prompt caching cleans up the cost of everything that had to run anyway. Run only one of the three and you are paying full price for two of the three patterns of waste.

Which cache do you actually need?

Mostly long, stable system prompts (agents, RAG): start with native prompt caching. The prefix discount is the single biggest line item.
High duplicate-rate batch jobs or test suites: gateway exact match alone recovers most of the waste, at near-zero risk.
User-facing chat or search: semantic caching is the unlock, because humans never phrase the same question the same way twice.
All of the above (most real products): layer all three. They do not conflict.

Where Alephant fits

Alephant is an AI FinOps gateway: a bring-your-own-key proxy that runs these caches for you and then shows you, per member and per agent, exactly what each one saved. Every AI gateway routes your requests. Only Alephant attributes the savings back to the workload that earned them.

The caching mechanisms run in the Alephant Gateway runtime, including the open-source self-hosted build. In Alephant Cloud, the managed controls for prompt caching and exact-match $0 calls live in the Prompt Registry on the Pro plan and up, and basic semantic caching is a Pro feature with advanced similarity caching on Enterprise. The free tier covers budget-safety controls, not the full caching suite.

For the routing side of the savings story, see How Model Routing Cuts AI Costs by 30 to 70%. For the category itself, see What Is AI FinOps?.

FAQ

What is the difference between prompt caching and semantic caching?

Prompt caching is a provider-side discount: it reuses a previously computed prompt prefix so those tokens bill cheaper, but the call still runs and still costs something. Semantic caching skips the call entirely when a new request is close in meaning to an old one, returning the stored answer for $0.

Is semantic caching better than exact-match caching?

Not better, wider. Exact match only catches byte-identical requests, with almost no risk of a wrong answer. Semantic caching also catches paraphrased requests, which recovers far more savings, at the cost of needing a tuned similarity threshold. Production stacks usually run both.

Does prompt caching make API calls free?

No. Prompt caching discounts the cached portion of the prompt. Anthropic charges a cache read at 0.1x base input (a 90 percent discount) and OpenAI discounts cached input by 50 to 90 percent depending on model, but you still pay for the cache write and for the dynamic part of every prompt.

How much can caching reduce LLM costs?

It depends on the pattern. Provider prompt caching commonly hits 60 to 80 percent on stable prefixes. Semantic caching reports 40 to 60 percent token savings on paraphrase-heavy workloads. Exact match drops duplicate calls to $0. Stacked, the three address different waste, so the savings add rather than overlap.

What is the risk of semantic caching?

A false positive: serving a stored answer to a request that was not truly equivalent. The fix is threshold tuning and, on high-stakes paths, application-level cache-key hints so a near-match cannot return the wrong result.

Why did my Anthropic caching costs go up in 2026?

Anthropic's default cache TTL dropped from 1 hour to 5 minutes in early 2026. Intermittent workloads now let the cache expire between requests, which means paying the 1.25x to 2x write multiplier more often instead of the 0.1x read. A gateway that re-warms or batches calls inside the TTL window mitigates this.

Start measuring your cache savings

Pick the cache that matches your traffic, then layer the other two. If you want the savings attributed per workload instead of guessed at, route through the Alephant Gateway, keep your own keys, and read the hit rates off the dashboard.