Batch API vs Real-Time API

OpenAI's Batch API is 50% off. So is Anthropic's. But "cheaper" only matters if your workload can tolerate a 24-hour SLA. This isn't a breakeven calculator; it's a workload decision tool. Plug in your constraints, see all four strategies side by side, and get a recommendation.

Cost across 4 strategies

  • Real-time, no cache: a standard sync API call
  • Real-time + cache: sync with prompt caching
  • Batch, no cache: the async batch API (24h SLA)
  • Batch + cache: both discounts stacked, where the provider supports it

The Batch API on most providers has a 24-hour SLA, with no guarantee of when within that window your job runs. Don't use batch for anything user-facing or streaming. Whether caching and batch discounts stack is provider-specific: OpenAI yes, Anthropic no (as of latest verification).
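
To make the tradeoff concrete, here is a minimal sketch of the math the calculator runs, in Python. Everything numeric below is a hypothetical placeholder: the per-million-token prices, the token counts, and CACHE_READ_RATIO (cache-read pricing varies by provider and model, and this ignores cache-write premiums). stacking_supported encodes the rule above: True for OpenAI, False for Anthropic.

```python
# Sketch of the four-strategy cost math. All prices and token counts are
# hypothetical placeholders; substitute your provider's actual rates.

BATCH_DISCOUNT = 0.5    # batch at 50% of sync price (OpenAI and Anthropic)
CACHE_READ_RATIO = 0.1  # assumed: cache reads at 10% of full input price
                        # (varies by provider; ignores cache-write premiums)

def monthly_cost(requests, in_tokens, out_tokens, p_in, p_out,
                 cache_hit_rate=0.0, batch=False, stacking_supported=True):
    """Dollar cost for one strategy; p_in/p_out are $ per million tokens."""
    if batch and cache_hit_rate > 0 and not stacking_supported:
        return None  # "n/a": provider doesn't stack batch + cache discounts
    # Blend full-price and cache-read input tokens by hit rate.
    in_price = p_in * ((1 - cache_hit_rate) + CACHE_READ_RATIO * cache_hit_rate)
    out_price = p_out
    if batch:
        in_price *= BATCH_DISCOUNT
        out_price *= BATCH_DISCOUNT
    return requests * (in_tokens * in_price + out_tokens * out_price) / 1e6

# Hypothetical workload: 1M requests/month, 2,000 in + 300 out tokens each,
# at $3/M input and $15/M output.
args = dict(requests=1_000_000, in_tokens=2_000, out_tokens=300,
            p_in=3.0, p_out=15.0)
print(monthly_cost(**args))                                  # 10500.0
print(monthly_cost(**args, cache_hit_rate=0.8))              # ≈ 6180
print(monthly_cost(**args, batch=True))                      # 5250.0
print(monthly_cost(**args, cache_hit_rate=0.8, batch=True))  # ≈ 3090
print(monthly_cost(**args, cache_hit_rate=0.8, batch=True,
                   stacking_supported=False))                # None ("n/a")
```

With these placeholder numbers the ordering is batch + cache < batch < cached real-time < plain real-time, but the point of the tool is that the ordering moves as hit rate and latency tolerance change.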

When batch fits

  • Overnight bulk processing (embeddings, summarizations, classifications; submission sketch below)
  • Backfill jobs running through historical data
  • Periodic aggregation (daily reports, weekly digests)
  • Internal tools where async is fine
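
For the overnight-bulk case, submission on OpenAI looks roughly like this: write requests as JSONL, upload the file, create a batch with a 24h completion window, and poll. This targets the official openai Python SDK; the model name, file name, and documents are placeholders.

```python
# Sketch: submitting an overnight classification job via OpenAI's Batch API.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. One JSONL line per request, each with a unique custom_id for matching
#    outputs back to inputs.
with open("overnight_jobs.jsonl", "w") as f:
    for i, doc in enumerate(["first document ...", "second document ..."]):
        f.write(json.dumps({
            "custom_id": f"job-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # placeholder model
                "messages": [{"role": "user", "content": f"Classify: {doc}"}],
            },
        }) + "\n")

# 2. Upload the file, then create the batch with the 24h window.
batch_file = client.files.create(file=open("overnight_jobs.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
# Later: client.batches.retrieve(batch.id), then download the output file
# once status is "completed".
```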

When batch DOESN'T fit

  • Anything user-facing that waits for a response
  • Streaming UIs (chat, autocomplete)
  • Time-sensitive workflows (fraud scoring, alerting)
  • Pipelines where downstream steps need the output within minutes

When caching beats batch

If your workload has a long stable prefix (system prompt, tools, retrieval context) and you send it many times, sync with caching often beats batch alone. Cache reads can be 10× cheaper than full-price input, while batch is only 2× cheaper. With a high cache hit rate, real-time + cache can be cheaper than batch-no-cache, and it still serves user-facing traffic.
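
The breakeven falls out of one line of algebra: if cache reads cost a fraction r of full input price and h is your hit rate, blended input costs (1 - h) + r*h of full price, while batch input is a flat 0.5, so caching wins on input once (1 - h) + r*h < 0.5. A quick sketch (r = 0.1 is an assumption; substitute your provider's actual ratio, and note this ignores output tokens and cache-write premiums):

```python
# Breakeven cache hit rate at which real-time + cache matches batch on
# input cost. Assumes cache reads cost `r` times full input price and
# batch is a flat 50% off; output tokens aren't cache-discounted, so
# they're ignored here.
def breakeven_hit_rate(r=0.1, batch_discount=0.5):
    # Solve (1 - h) + r*h = batch_discount for h.
    return (1 - batch_discount) / (1 - r)

print(breakeven_hit_rate())       # ≈ 0.56: above a ~56% hit rate, caching wins
print(breakeven_hit_rate(r=0.5))  # 1.0: at 50%-off cache reads, caching alone
                                  # never beats the batch discount on input
```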

When stacking matters

OpenAI lets you stack batch and cache discounts. Anthropic doesn't (cache reads aren't batch-discounted further). For Claude, you usually pick one: caching for online traffic, or batch for offline jobs, not both. The calculator shows "n/a" for batch+cache where stacking isn't supported.
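
In code, the pick-one decision reduces to taking the cheaper of the two single discounts rather than their product. A tiny sketch, reusing the same assumed multipliers as above (0.1x cache reads, 0.5x batch):

```python
# Effective input-price multiplier when batch and cache discounts can't
# stack: take whichever single discount is cheaper for the workload.
def best_input_multiplier(cache_hit_rate, can_batch):
    cached = (1 - cache_hit_rate) + 0.1 * cache_hit_rate  # real-time + cache
    batch = 0.5 if can_batch else float("inf")            # batch, no cache
    return min(cached, batch)

print(best_input_multiplier(0.8, can_batch=True))   # ≈ 0.28: caching wins
print(best_input_multiplier(0.3, can_batch=True))   # 0.5: batch wins
print(best_input_multiplier(0.3, can_batch=False))  # ≈ 0.73: latency rules out batch
```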

Reality check: for most production workloads, the decision is binary. Your latency constraint either allows batch or it doesn't. Don't try to retrofit batch onto an interactive product because the math looks better on paper. Failed batch jobs, retry logic, and ops complexity can eat the savings.