Task statement 4.5 of the Claude Certified Architect — Foundations (CCA-F) exam — "Design efficient batch processing strategies" — sits inside Domain 4 (Prompt Engineering & Structured Output, 20 % weight) and is the single task statement on the blueprint that is explicitly about workload shape rather than prompt shape. The exam tests whether an architect can recognise when a volume of work should move from synchronous Messages API calls to the asynchronous Message Batches API, how to correlate thousands of results back to application records using custom_id, how the 50 % batch discount and 24-hour SLA interact with business latency requirements, and how to handle per-item errors without failing the whole batch. Community pass reports consistently flag batch design as the area where candidates who "default to batch to save cost" lose latency-sensitive scenario questions, and candidates who "default to real-time for safety" lose throughput-and-cost scenario questions.
This study note walks through the full batch-processing surface a CCA-F candidate is expected to design at architecture level. It covers the Message Batches API primitives, the 50 % discount and the 256 MB / 100 000-request structural limits, the 24-hour completion SLA and the 29-day result retention window, the custom_id correlation contract, the request array shape with per-item params, polling patterns for processing_status and in_progress_requests, the decision matrix that maps latency tolerance and volume to batch versus real-time, per-item error versus batch-level failure semantics, retry strategies that resubmit only failed custom_id values, throughput saturation tactics, and the two CCA-F scenarios that exercise batch design most heavily (structured-data-extraction and claude-code-for-continuous-integration). A traps section, practice anchors, and a five-question FAQ close the note.
What the Message Batches API Is — Asynchronous Bulk Inference for Claude
The Message Batches API is a dedicated endpoint on Anthropic's platform that accepts a list of Messages API requests, processes them asynchronously against Claude, and returns the collated results once the batch completes. It is architecturally separate from the synchronous Messages API: requests submitted via batch do not return a response on the original HTTP call; instead, your application submits the batch, receives a batch identifier, and later polls or receives a notification that results are ready.
The Message Batches API exists because a large fraction of production Claude workloads are not interactive. Backfilling a document lake with structured extractions, running a nightly review of the previous day's transcripts, processing a research dataset, scoring ten thousand inbound applications — none of these workloads need a response in two seconds. They need the work to be correct, the cost to be low, and the throughput to be high. The batch API is the architectural primitive that makes those trade-offs explicit.
The Message Batches API is Anthropic's asynchronous bulk inference endpoint that accepts up to 100 000 Messages API requests in a single batch, up to 256 MB of total payload, processes them within a 24-hour SLA at a 50 % discount versus synchronous calls, and retains results for up to 29 days. Each request carries a caller-defined custom_id used to correlate outputs back to application records. Batch is the default primitive for any non-interactive, high-volume Claude workload on CCA-F scenario questions.
Synchronous Messages API vs Asynchronous Batch API in One Sentence
The synchronous Messages API is a request-response pipe optimised for interactive latency; the Message Batches API is a submit-and-collect queue optimised for throughput and cost. Every request body that is valid on one endpoint is valid on the other — the model, the messages, the tools, the system prompt, the JSON schema, the tool_choice, the max_tokens, the temperature, all identical. The only thing that changes is when and how the response arrives.
Batch Processing Use Cases — High-Volume Extraction, Async Review, Dataset Processing
CCA-F scenario questions frame batch-appropriate workloads very specifically. Recognising these frames cold is the fastest path to the right answer on exam day.
High-Volume Structured Data Extraction
The canonical batch workload is structured data extraction across a document corpus: ten thousand resumes to parse into a candidate schema, fifty thousand support tickets to label with a categorisation taxonomy, a hundred thousand invoices to extract into a vendor-aware JSON shape. Each document is independent (no cross-document dependencies), each extraction has the same prompt template with a per-document content substitution, and the business deadline is overnight rather than instant. This is the shape the structured-data-extraction CCA-F scenario cluster tests most heavily.
Asynchronous Multi-Pass Review
Multi-pass review architectures (task 4.6) often fan work through two or three distinct prompts — extraction, then quality checking, then a reconciliation pass. When the review cycle runs nightly or weekly across a large corpus, batching each pass is the default. The batch API handles pass one, results seed pass two, and so on. Latency between passes is measured in hours, not seconds, and the 50 % discount stacks across every pass.
Dataset Labelling and Evaluation
Offline model evaluation pipelines — running a held-out test set of a thousand prompts through Claude to compute metrics, or relabelling a training set to compare taxonomies — are archetypal batch workloads. No human is waiting, volume is high, cost is amplified by repetition, and latency requirements are effectively nonexistent.
CI/CD Bulk Analysis
In the claude-code-for-continuous-integration scenario, a nightly pipeline might analyse every file changed across the previous day's pull requests, or review every open issue for triage labels. These are non-interactive, high-volume, and deadline-bound-by-tomorrow — classic batch candidates. The same pipeline running per-PR on every commit, however, needs synchronous Messages API calls because a developer is waiting.
CCA-F scenario framing telegraphs batch-appropriate workloads with specific language. Phrases like "process the archive overnight", "backfill the dataset", "the team is comfortable with next-day results", "score the historical ticket corpus", or "run the analysis every evening" all signal batch. Phrases like "respond to the user within two seconds", "the agent resolves the ticket live", or "the developer is waiting on the result" signal synchronous Messages API. Candidates who pick batch for interactive-sounding scenarios are actively penalised.
Message Batches API Structure — Submitting Multiple Requests in a Single API Call
A batch submission is a single HTTP POST whose body carries a requests array. Each element in the array is a complete, standalone Messages API request wrapped with a caller-defined custom_id. The entire batch succeeds or fails as one submission, but the requests inside are processed and scored independently.
The Request Array Shape
Each element in requests contains two top-level fields:
- custom_id — a caller-chosen string (unique within the batch) used to correlate this request with its eventual result.
- params — the full Messages API body for this individual request: model, max_tokens, messages, optionally system, tools, tool_choice, temperature, and any other standard Messages API parameters.
Every custom_id must be unique within its batch; duplicates are rejected at submission time. custom_id values are caller-defined strings — Anthropic never generates them for you. This is the single most-missed architectural fact on batch questions.
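The request-array shape above can be sketched as a small builder. This is illustrative: the record data, prompt template, and model name are assumptions, but the two-field item shape (custom_id plus a complete params body) follows the contract described in this section.

```python
# Sketch: building a Message Batches requests array from application records.
# Record contents, prompt wording, and the model id are illustrative.
records = [
    {"id": 101, "text": "Senior backend engineer, 8 years Python..."},
    {"id": 102, "text": "Data analyst, SQL and dashboarding..."},
]

def build_batch_requests(records):
    requests = []
    for record in records:
        requests.append({
            # Caller-defined correlation key, derived from a stable primary key.
            "custom_id": f"resume-{record['id']}",
            # params is a complete, standalone Messages API body.
            "params": {
                "model": "claude-sonnet-4-5",  # example model name
                "max_tokens": 1024,
                "messages": [
                    {"role": "user",
                     "content": f"Extract candidate fields from:\n{record['text']}"}
                ],
            },
        })
    # custom_id must be unique within the batch — fail fast before submission.
    ids = [r["custom_id"] for r in requests]
    assert len(ids) == len(set(ids)), "duplicate custom_id is rejected at submission"
    return requests

batch_requests = build_batch_requests(records)
```

Deriving custom_id from the primary key here means the eventual JSONL results join straight back to the records table with no extra mapping.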
Submission Response
When you submit a batch, the endpoint returns a batch object that includes a batch ID, a processing_status (initially in_progress), request counts (total, processing, succeeded, errored, canceled, expired), and the created / expires-at timestamps. The batch object is the handle you use to poll for completion and later to retrieve results.
Structural Limits
The Message Batches API enforces two structural caps per batch:
- Up to 100 000 requests per batch.
- Up to 256 MB total payload size per batch.
Batches that need more volume are split into multiple submissions. Both caps are hard — the submission is rejected if either is exceeded.
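A minimal client-side pre-check of the two structural caps can reject an oversized submission before it ever reaches the API. This is a sketch: it approximates payload size by serialising the requests body to JSON, and the server's exact byte accounting may differ slightly.

```python
import json

MAX_REQUESTS = 100_000                  # hard per-batch request cap
MAX_PAYLOAD_BYTES = 256 * 1024 * 1024   # hard 256 MB per-batch payload cap

def validate_batch(requests):
    """Return (ok, reason) before submission; both caps are hard rejects."""
    if len(requests) > MAX_REQUESTS:
        return False, f"{len(requests)} requests exceeds the 100 000-request cap"
    # Client-side approximation of the payload the server will measure.
    payload_bytes = len(json.dumps({"requests": requests}).encode("utf-8"))
    if payload_bytes > MAX_PAYLOAD_BYTES:
        return False, f"{payload_bytes} bytes exceeds the 256 MB cap"
    return True, "ok"
```

Running this check before submission turns a batch-level rejection into a local error your pipeline can handle by splitting the workload.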
custom_id is a caller-defined string, unique within a batch, that you attach to each individual request in the requests array. It is the only correlation key between the requests you submit and the results the API returns — Anthropic does not generate or track any application-side identity for you. Standard practice is to derive custom_id from a stable application primary key (for example, ticket-{{id}} or resume-{{uuid}}) so results can be joined back to your database without additional lookup tables. custom_id is NOT system-generated, and this fact is a recurring CCA-F trap.
Batch Pricing — 50 % Cost Reduction vs Real-Time API Calls
The economic case for batch processing is the 50 % discount on both input and output tokens compared to synchronous Messages API calls with the same model. This is the headline number the exam expects you to know without hesitation, and it is the pivot point for every "batch vs real-time" cost calculation a scenario question may set up.
How the Discount Applies
The 50 % discount applies at token-level billing — input tokens, output tokens, and (where applicable) cached tokens within a batched request are all billed at half their synchronous rate. The discount is model-independent within the supported Claude family — Haiku, Sonnet, and Opus batch tokens all get the same 50 % reduction off their respective synchronous prices. The discount is not a rebate; it is applied at invoice time.
What the Discount Does Not Cover
The discount applies to model inference tokens only. It does not change the structural batch limits (100 000 requests, 256 MB). It does not waive the 24-hour SLA. It does not apply to non-batched parts of your system — if your pipeline does a real-time call to pre-score items before batching, the pre-scoring call is billed at full synchronous price.
Economic Decision Framing
At the architecture level, the 50 % batch discount matters most when three conditions hold simultaneously: (a) volume is high enough that the absolute dollar difference is material; (b) no human is waiting on individual results; (c) the 24-hour SLA is compatible with the business deadline. If any of the three fails, the 50 % discount is a distraction and the real architectural choice is elsewhere.
When a scenario question anchors heavily on "reducing inference cost" without stating a latency requirement, batch processing with the 50 % discount is usually the intended answer. When the same scenario instead emphasises "two-second response time" or "real-time user experience", the 50 % discount is a distractor and synchronous Messages API is correct even though it costs twice as much. Read latency and cost together — never in isolation.
Batch SLA — 24-Hour Processing Guarantee and Availability Window
The Message Batches API's service-level commitment is that submitted batches complete within 24 hours. This number is the most trap-laden fact on the entire 4.5 task statement, and candidates lose scored points to it at every sitting.
"Within 24 hours" is an Upper Bound, Not a Target Time
The 24-hour SLA is the maximum time the platform will take to complete a batch under normal conditions. The vast majority of batches — especially small ones — complete in minutes or hours. "Within 24 hours" does not mean "at the 24-hour mark." Answers that assume every batch takes exactly 24 hours are wrong. Answers that assume the SLA is "usually much less than 24 hours but may take up to 24 hours" are correct.
Results Retention Window
Once a batch completes, results are available for retrieval for up to 29 days from creation. After the retention window, results are deleted and cannot be recovered. Applications that need long-term persistence must pull and store results into their own system of record well before expiry — batch results are not a durable archive.
Expired Batches
If a batch does not complete within the maximum processing window (24 hours), it transitions to an expired terminal state. Any requests inside that did not finish are counted in the expired request count, and partial successes within the batch are still retrievable by custom_id. Expired batches cannot be resumed; failed or expired items must be resubmitted as a new batch.
Batch SLA and retention — four numbers to commit to memory:
- Up to 100 000 requests per batch.
- Up to 256 MB total payload per batch.
- Up to 24 hours maximum processing time (typical completion is much faster).
- Results retained 29 days after batch creation.
Distractor cues: answers that say "exactly 24 hours", "guaranteed immediate results", "unlimited retention", "unlimited request count", or "Anthropic generates custom_id automatically" are wrong.
custom_id Field — Correlating Batch Requests to Application Records
The custom_id contract is the architectural bridge between the batch API and your application database. Getting this wrong is the most common implementation-level batch mistake.
custom_id is Caller-Defined, Always
You choose custom_id. Anthropic does not. There is no "system-generated" identity for a batch request. If you do not supply a meaningful custom_id, you will receive a result set you cannot correlate, and you will have to reconstruct identity from brittle heuristics like request order (which is not guaranteed stable across the API boundary).
Derive custom_id from a Stable Key
Production practice is to derive custom_id from your primary key. For a ticket-labelling batch, use ticket-{{id}}. For a resume-parsing batch, use resume-{{uuid}}. For a multi-pass review, use pass{{n}}-{{record_id}}. This makes the result stream trivially joinable against your database.
Uniqueness Within Each Batch
custom_id values must be unique within a single batch submission. They do not need to be unique across batches, but cross-batch reuse is a code smell — it suggests you are submitting the same work twice. If your pipeline retries failed items from batch N into batch N+1, use the same custom_id values so your database upsert logic treats them as updates to the same record.
Correlation Beats Ordering
Never assume the order of results matches the order of requests. The batch API processes requests in whatever order is most efficient, and the result stream can interleave or reorder freely. Always correlate by custom_id; never by array index.
A plausible-sounding distractor on CCA-F batch questions is "use the array index of the request as the correlation key instead of custom_id." This is wrong on two levels: (a) custom_id exists precisely to avoid this anti-pattern, and (b) result ordering is not guaranteed to match request ordering. Always correlate by custom_id. Any answer that routes around custom_id or suggests it is optional for large batches is architecturally broken.
Polling for Batch Results — Checking Completion Status and Retrieving Results
Batch completion is detected by polling the batch's processing_status field, not by waiting for a push notification. CCA-F expects you to recognise the polling pattern and choose a sane polling cadence.
Processing Status Values
Each batch carries a top-level processing_status that progresses through: in_progress (work is still being done), canceling (a cancellation request is being honoured), ended (the batch has reached a terminal state with all requests complete, errored, canceled, or expired). Once processing_status is ended, results are ready for retrieval.
Request-Level Counts
Inside the batch object, the request_counts block tracks processing, succeeded, errored, canceled, and expired counts. These counts update as the batch progresses and let you distinguish partial success from complete failure without retrieving results.
Polling Cadence Guidance
Because most batches finish well inside the 24-hour window, an exponential-backoff polling cadence is standard: poll every 30 seconds for the first few minutes, then every minute, then every five minutes, capping at roughly every fifteen minutes for the tail of the SLA window. Hammering the status endpoint every second is wasteful; polling once an hour risks long idle gaps after completion. A simple scheduled poll every five to fifteen minutes is adequate for most pipelines.
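The cadence described above can be encoded as a deterministic backoff schedule. The exact break points below are a sketch, not a platform requirement — any schedule that polls frequently early and settles toward roughly fifteen-minute intervals fits the guidance.

```python
def poll_delay(elapsed_seconds):
    """Illustrative backoff schedule for polling processing_status.

    30 s polls for the first 5 minutes, 60 s up to 30 minutes,
    300 s up to 2 hours, then 900 s for the tail of the SLA window.
    """
    if elapsed_seconds < 5 * 60:
        return 30
    if elapsed_seconds < 30 * 60:
        return 60
    if elapsed_seconds < 2 * 60 * 60:
        return 300
    return 900
```

A polling loop sleeps for poll_delay(elapsed) between status checks and exits as soon as processing_status reads ended.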
Retrieving Results
Once processing_status is ended, a dedicated results endpoint streams a line-delimited JSON (JSONL) stream of results, one line per original request, each tagged with its custom_id. Each line carries either a message (on success) or an error block (on per-item failure). Consume the JSONL stream into your system of record within the 29-day retention window.
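Consuming the JSONL stream can be sketched as a collation step that partitions results by outcome and keys everything on custom_id. The envelope field names below (result, type, succeeded/errored) are modelled on the shape described in this section; the two sample lines are hypothetical.

```python
import json

# Hypothetical JSONL result lines: each line carries custom_id plus either
# a message (success) or an error block (per-item failure).
jsonl_stream = """\
{"custom_id": "ticket-7", "result": {"type": "succeeded", "message": {"content": "billing"}}}
{"custom_id": "ticket-9", "result": {"type": "errored", "error": {"type": "invalid_request"}}}
"""

def collate_results(stream):
    succeeded, errored = {}, {}
    for line in stream.splitlines():
        row = json.loads(line)
        result = row["result"]
        # Correlate strictly by custom_id — never by line order.
        if result["type"] == "succeeded":
            succeeded[row["custom_id"]] = result["message"]
        else:
            errored[row["custom_id"]] = result.get("error", {})
    return succeeded, errored

succeeded, errored = collate_results(jsonl_stream)
```

The two dictionaries feed directly into the persistence path (successes) and the retry path (errors) without any ordering assumptions.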
Batch Request Structure — requests Array With Per-Item custom_id and params
Beyond the top-level array shape, the per-item params object deserves dedicated attention because the CCA-F exam tests whether candidates understand that each batched request is a full standalone Messages API call.
Every Item Can Have Its Own System Prompt, Tools, and Model
Nothing about the batch API forces uniformity across requests. You can submit a batch where item 1 uses Sonnet with one tool set, item 2 uses Haiku with a different tool set, and item 3 uses Opus with no tools at all. Each item's params is independent. In practice most batches have homogeneous params because the workload is "run the same prompt template across N inputs", but heterogeneity is supported and occasionally useful (for example, routing low-complexity items to Haiku and high-complexity items to Sonnet within one submission).
Tool Use Works Inside Batch, But Only Single-Turn
Batch supports tools and tool_choice in each request's params, which is the mechanism structured extraction uses to force schema-conformant outputs via strict: true tool calling. What batch does not support is multi-turn agentic loops — there is no way for your code to observe a tool_use result in the middle of a batched request and inject a tool_result back. Each batch item is a single Messages API call. If the task requires an agentic loop (read a file, decide, read another file, decide), batch is the wrong primitive and synchronous Messages API with run() or a manual loop is required.
Streaming is Not Supported
The batch API does not support streaming. Responses for each item are returned whole; there are no server-sent events, no incremental tokens, no mid-generation observability. For any workload that genuinely needs streaming (chat UIs, live dashboards), synchronous Messages API is the only answer. Streaming-related internals are explicitly out of scope for CCA-F, but the fact that batch cannot stream is in scope as an architectural constraint.
Batch vs Real-Time Decision Matrix — Latency Tolerance, Cost Sensitivity, Volume
The single most tested concept on task 4.5 is the decision between batch and synchronous Messages API. CCA-F frames this decision along three axes: latency tolerance, cost sensitivity, and volume.
Axis 1: Latency Tolerance
If an individual result must return within seconds (interactive UI, live customer support, synchronous tool call inside an agent), synchronous Messages API is the only correct answer — batch cannot guarantee individual-item latency. If an individual result is allowed to take minutes to hours (overnight backfill, weekly review, nightly extraction), batch is viable. If the business truly does not care when the result arrives within 24 hours, batch is preferred.
Axis 2: Cost Sensitivity
Batch is always 50 % cheaper per token than synchronous for the same model. When absolute cost is a primary driver and latency permits, batch wins. When cost is negligible compared to the business value of a fast response (for example, a customer-support agent resolving a critical ticket), synchronous wins regardless of the 50 % discount.
Axis 3: Volume
Batch is engineered for high-volume work. A single batch can carry 100 000 requests in one submission, amortising the HTTP and authentication overhead across the whole set. Synchronous Messages API is engineered for low-volume, request-response traffic. For ten requests a day, either works; for ten thousand requests a day without an interactive user, batch is the correct primitive.
The 2x2 Mental Model
A simplified 2x2 matrix candidates can draw mentally on exam day:
| | Low Volume | High Volume |
|---|---|---|
| Low Latency Need | Synchronous Messages API | Synchronous Messages API (scale infra) |
| High Latency OK | Synchronous fine; batch saves money | Batch API (default) |
Any CCA-F scenario that lands squarely in the bottom-right quadrant — high volume, high latency tolerance — expects batch as the answer.
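The matrix reduces to a two-input decision function — latency first, then volume — which is exactly the evaluation order the exam rewards. A minimal sketch:

```python
def choose_primitive(latency_tolerant, high_volume):
    """Encode the 2x2 matrix: read the latency requirement first, then volume."""
    if not latency_tolerant:
        # A waiting human or synchronous caller rules batch out entirely.
        return "synchronous Messages API"
    if high_volume:
        return "Message Batches API"
    # Low volume + latency tolerance: either works; batch still halves cost.
    return "either (batch saves 50 %)"
```

Note that cost never appears as an input: once latency tolerance and volume are fixed, the 50 % discount is a consequence of the choice, not a driver of it.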
Cost-at-Scale Quick Estimate
For a workload of one million requests per week averaging 2 000 input tokens and 500 output tokens per request on Sonnet, the synchronous-versus-batch cost delta is substantial. The 50 % batch discount is applied across the entire 2 500-token-per-request load, billed weekly. The scenario does not require candidates to compute exact dollar figures, but it does require recognising that "substantial" means "batch is the correct architectural choice when latency permits." Defaulting to synchronous at that scale is an anti-pattern.
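The "substantial" delta can be made concrete with a back-of-envelope calculation. The per-million-token prices below are hypothetical placeholders for illustration only — real Sonnet pricing lives on Anthropic's pricing page and changes over time; only the 50 % discount factor comes from this note.

```python
# Hypothetical per-million-token prices — illustration only, not real pricing.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00
BATCH_DISCOUNT = 0.50  # the 50 % batch discount

def weekly_cost(requests, in_tokens, out_tokens, batched):
    in_cost = requests * in_tokens / 1e6 * INPUT_PRICE_PER_MTOK
    out_cost = requests * out_tokens / 1e6 * OUTPUT_PRICE_PER_MTOK
    total = in_cost + out_cost
    return total * BATCH_DISCOUNT if batched else total

# One million requests/week at 2 000 input + 500 output tokens each.
sync_cost = weekly_cost(1_000_000, 2_000, 500, batched=False)
batch_cost = weekly_cost(1_000_000, 2_000, 500, batched=True)
```

Whatever the real prices are, the batch figure is half the synchronous figure at any volume — and at a million requests a week, half is a material line item.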
On CCA-F, "always batch to save 50 %" is just as wrong as "never batch because latency matters." The decision is conditional on latency tolerance, cost sensitivity, and volume together. Scenarios that describe interactive support agents, live code review, or chat experiences require synchronous. Scenarios that describe overnight backfills, nightly extraction, or weekly corpus review require batch. Read the latency requirement first; then read the volume; only then weigh cost.
Error Handling in Batches — Per-Item Error vs Batch-Level Failure
Errors in the batch API split cleanly into two categories, and CCA-F tests whether candidates can distinguish them.
Per-Item Errors
A per-item error means one specific request inside the batch failed, but the batch as a whole continued. The results stream contains an error block for that custom_id instead of a message. Common per-item error causes: malformed params for that request, content filter triggers on that specific input, model-specific rejections. Per-item errors are reflected in the errored count of request_counts.
Batch-Level Failures
A batch-level failure means the submission itself could not be processed — most commonly due to structural violations (exceeding the 100 000-request cap, exceeding the 256 MB payload cap, duplicate custom_id values within the submission, invalid JSON at the top level, authentication failure on submission). Batch-level failures are rejected at submission time and never enter in_progress state.
Expired Requests
Requests that were still processing when the 24-hour SLA expires transition to expired. The batch's processing_status becomes ended, and the expired requests appear with an error shape in the result stream. Partial successes within the same batch are unaffected and retrievable.
Cancellation
A batch can be cancelled in flight. Requests that were already completed when the cancellation lands return normally; requests that were still processing are marked canceled and return with an error shape. Cancellation is useful when a downstream dependency changes (for example, the target schema was wrong and you need to resubmit with corrected params).
A common distractor on CCA-F batch-error questions is "if any individual request fails, the entire batch is rolled back and no results are returned." This is false. The batch API is designed for partial success — failures on individual custom_id values do not affect the rest of the batch, and you can retrieve the successful results regardless. Architectures that treat batches as atomic all-or-nothing transactions are incorrect and will force unnecessary full-batch resubmissions.
Retry Strategy for Failed Batch Items — Re-submitting Only Failed custom_ids
Partial-success semantics enable a clean retry strategy: after a batch ends, collect the custom_id values that errored, construct a new batch containing only those, and resubmit. CCA-F questions on retry design consistently choose the "resubmit only failed custom_ids" answer over "resubmit the whole batch".
The Retry Flow
- Poll the batch until processing_status == "ended".
- Iterate the JSONL result stream and collect (custom_id, error_type, original_params) triples for every error result.
- Filter out permanent errors (invalid schema, content-filter trigger) that will never succeed regardless of retry.
- Build a new batch whose requests array contains only the retryable items, reusing the same custom_id values so downstream joins still align.
- Submit the retry batch and track its identifier separately.
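The retry flow can be sketched as a filter-and-rebuild step. The error-type names below are illustrative — classify transient versus permanent using the actual error categories your result stream reports.

```python
# Illustrative permanent-error categories; these never succeed on retry.
PERMANENT_ERRORS = {"invalid_request", "content_filter"}

def build_retry_batch(error_results, original_params_by_id):
    """error_results: list of (custom_id, error_type) from the ended batch."""
    retry = []
    for custom_id, error_type in error_results:
        if error_type in PERMANENT_ERRORS:
            continue  # will never succeed; route to human escalation instead
        # Reuse the SAME custom_id so the downstream upsert stays idempotent.
        retry.append({"custom_id": custom_id,
                      "params": original_params_by_id[custom_id]})
    return retry

originals = {"ticket-9": {"model": "claude-sonnet-4-5", "max_tokens": 256,
                          "messages": [{"role": "user", "content": "label this"}]}}
errors = [("ticket-9", "overloaded"), ("ticket-4", "invalid_request")]
retry_batch = build_retry_batch(errors, originals)
```

Only the transient failure (ticket-9) is resubmitted; the permanent failure is dropped from the retry path before its params are even looked up.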
Why Reuse the Same custom_id
If you reuse the original custom_id, the upsert into your system of record is idempotent: the retry result overwrites the error row with a successful message row. If you generate a fresh custom_id for the retry, you have to maintain a separate mapping table from retry-id back to original-id, which adds failure modes for no benefit. Reuse is the standard pattern.
Distinguishing Transient vs Permanent Errors
Not every error is worth retrying. Transient errors (temporary platform blips, transient rate-limit responses during peak) benefit from retry; permanent errors (the request body violates a schema, the content is rejected by a safety filter, the params contain an invalid model name) do not. Inspect the error category before retry — blind retry of permanent errors burns quota and delays the human-escalation path.
Retry is Its Own Batch, Not an Append
There is no "append to existing batch" operation. Once a batch has ended, you cannot modify it. The retry is always a new batch with a new batch ID and its own 29-day retention window.
Throughput Optimisation — Saturating Batch Limits for Maximum Throughput
When the workload is truly large — millions of items per day — throughput becomes the dominant architectural concern, and CCA-F expects candidates to know the saturation patterns.
Pack Batches to Near the Limits
A batch with 10 requests uses the same scheduling overhead as a batch with 100 000 requests. Pack each batch toward the structural limits (100 000 requests or 256 MB of payload, whichever comes first) to amortise overhead. Splitting a 200 000-request workload into two 100 000-request batches is more efficient than splitting it into twenty 10 000-request batches.
Run Multiple Batches in Parallel
Multiple batches can be in flight simultaneously. A pipeline pushing 500 000 items overnight should submit five parallel 100 000-request batches rather than serialising them. Each batch has its own completion SLA; parallelism does not impair the SLA of any individual batch.
Group by Model for Simpler Scheduling
While heterogeneous batches (mixed models per request) are supported, grouping batches by model simplifies cost tracking, downstream consumption, and retry logic. The operational cost of mixing models inside a batch is usually higher than the fixed-cost saving.
Pre-Filter Before Batching
Batch tokens are cheap but not free. Pre-filter inputs with cheap logic (regex, small classifier, rule-based triage) to avoid sending obviously-empty or obviously-ineligible records to Claude at all. The cost curve rewards the step where you remove work entirely rather than discounting the work by 50 %.
Watch the 256 MB Ceiling
Very large prompts (for example, full document contents batched at 100 000 at a time) can hit 256 MB before 100 000 requests. If your per-request payload averages 3 KB, one batch can hold 80 000 requests before the byte cap bites. If your per-request payload averages 30 KB, the cap bites at 8 000 requests. Size the batch to whichever limit binds first.
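The "size to whichever limit binds first" rule can be sketched as a greedy chunker that flushes a batch whenever the next item would breach either cap. Payload size is approximated client-side by JSON-serialising each request, which may differ slightly from the server's accounting.

```python
import json

MAX_REQUESTS = 100_000
MAX_BYTES = 256 * 1024 * 1024

def chunk_requests(requests, max_requests=MAX_REQUESTS, max_bytes=MAX_BYTES):
    """Greedily pack batches until either structural cap would be exceeded."""
    batches, current, current_bytes = [], [], 0
    for req in requests:
        size = len(json.dumps(req).encode("utf-8"))  # client-side approximation
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            batches.append(current)       # flush: next item would breach a cap
            current, current_bytes = [], 0
        current.append(req)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

With 3 KB items the request cap binds first; with 30 KB items the byte cap binds first — the same chunker handles both without caring which.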
Structured Output Inside Batches — Tool Use and JSON Schema Enforcement
Batched requests inherit the full structured-output toolkit of synchronous Messages API. On CCA-F, structured data extraction scenarios almost always combine batch with strict tool use.
Strict Tool Use for Schema Guarantees
Each batched request's params can include a tools array with one tool marked strict: true. When combined with tool_choice set to that tool, Claude is forced to produce output that exactly matches the tool's JSON schema. This is the recommended approach for any extraction workload where downstream consumers depend on schema conformance.
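A batched request's params using this pattern can be sketched as follows. The tool name, schema fields, and model id are illustrative assumptions; the strict flag and forced tool_choice follow the mechanism described above.

```python
# Sketch of per-item params using strict tool use for schema-guaranteed
# extraction. Tool name, schema, and model id are illustrative.
extraction_tool = {
    "name": "record_invoice",
    "description": "Record the fields extracted from one invoice.",
    "strict": True,  # platform enforces exact schema conformance
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "total", "currency"],
        "additionalProperties": False,
    },
}

params = {
    "model": "claude-sonnet-4-5",  # example model name
    "max_tokens": 512,
    "tools": [extraction_tool],
    # Force Claude to call exactly this tool, so every result is schema-shaped.
    "tool_choice": {"type": "tool", "name": "record_invoice"},
    "messages": [{"role": "user",
                  "content": "Extract fields from: <invoice text>"}],
}
```

Wrapped with a custom_id, this params object drops straight into the requests array — the batch machinery is indifferent to the structured-output machinery inside each item.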
Why Schema Enforcement Matters in Batch
In a batch of 100 000 items, schema drift on even a fraction of a percent means thousands of malformed rows downstream. Strict tool use eliminates that class of failure by making schema conformance a structural guarantee rather than a prompt-based hope. Candidates who choose "stronger prompt instructions" over strict tool use in batch extraction questions are penalised under the programmatic-enforcement-beats-prompt-guidance rule (community pain point pp-01).
Validation Before Persistence
Strict tool use guarantees the shape is correct; it does not guarantee the values are semantically valid. A strict schema accepts "2026-13-45" as a date string unless the regex is tight. Downstream validation (date parsing, enum membership, range checks) still belongs in the consumer path, typically before persistence to the system of record.
Strict tool use is the pattern of setting strict: true on a tool definition inside a batched request's params and using tool_choice to force Claude to call exactly that tool. The platform enforces that the output matches the tool's JSON schema structurally — no extra fields, no missing required fields, no type mismatches. This transforms batch extraction from a probabilistic "please return JSON" to a deterministic schema contract, and is the CCA-F default for any batch workload that feeds downstream structured consumers.
Plain-English Explanation
Batch-processing concepts become concrete when anchored to physical systems candidates already understand. Three analogies cover the decision, the mechanics, and the retry pattern.
Analogy 1: The Postal System — Express Courier vs Bulk Mail
A synchronous Messages API call is an express courier: you hand over one envelope, a rider runs it to the destination, and you get a signed receipt back in fifteen minutes. You pay the full rate because you are paying for speed. A batch submission is bulk mail: you hand over a sack of ten thousand envelopes, each addressed to a different recipient (each envelope has its own custom_id), and the postal sorting centre processes them overnight at half the per-envelope price. You do not know the exact minute any given envelope will arrive, but every envelope will arrive within the service-level window (24 hours), and they all come back sorted by your addressing label so you can reconcile them with your mailing list. If one envelope has an illegible address (a per-item error), the rest still deliver; only that one comes back marked undeliverable. If the whole sack never reaches the sorting centre (a batch-level failure), nothing delivers. An express courier for a birthday card sent to someone abroad is sensible; bulk mail for ten thousand newsletters is sensible; using bulk mail for a birthday card (sending the card a day late) or using express courier for ten thousand newsletters (paying ten thousand courier fees) are both architecturally wrong. CCA-F batch-versus-real-time questions are exactly this choice.
Analogy 2: The Laundromat — Wash-and-Fold Versus Self-Service
Think of synchronous Messages API as a self-service laundromat: you put in one load, you wait, you watch it cycle, you fold it right there, and you walk out with folded clothes in ninety minutes. Batch processing is wash-and-fold service: you drop off a week's worth of laundry in a labelled bag (each piece has an owner-tag — the custom_id), you walk away, you come back the next day and collect the sorted pile at half the per-pound price because the laundromat runs your bag in their efficient overnight rotation alongside twenty other customers' bags. If a single shirt gets a stain during wash (a per-item error), you get that one shirt back with an apology slip; the rest of the laundry is still folded and waiting. If the laundromat's machine breaks and the whole batch is delayed past the promised pickup time, that is a batch-level concern and you get a refund or rerun (cancellation or expiry semantics). You do not use wash-and-fold for the shirt you need to wear to tonight's dinner; you do not use self-service for three families' worth of monthly laundry. The matching of workload to primitive is the CCA-F decision matrix.
Analogy 3: The Restaurant Dinner Rush vs Catering Order — Latency Tolerance in Practice
A restaurant dinner service is the synchronous Messages API — each table (each user) places an order and expects food within fifteen minutes; the kitchen is optimised for low per-order latency even though per-unit cost is high. A catering order for a corporate event tomorrow is the batch API — the caterer accepts the full order today, prepares it overnight when the kitchen is quiet, charges a lower per-plate rate because the batch amortises setup and cleanup across hundreds of covers, and delivers everything at the agreed time tomorrow morning. The diner at table twelve cannot be told "your risotto will be ready within 24 hours, at the platform's discretion" — that is a business disaster. The event client cannot be told "your 200 plates are each prepared on demand at full dinner-service prices, delivered in the order they are finished, starting whenever each one is ready" — that is also a business disaster. Each primitive is correct for its shape of demand, and using the wrong one destroys either the customer experience (batch for interactive) or the unit economics (real-time for bulk). That is the entire architectural thesis of task 4.5.
Which Analogy Fits Which Exam Question
- Questions about the submit-and-collect flow → postal system analogy.
- Questions about per-item error handling and retry → laundromat analogy.
- Questions about batch vs real-time choice → restaurant-versus-catering analogy.
Common Exam Traps
CCA-F batch questions consistently exploit five trap patterns. Each is documented in community pass reports as a "close but wrong" answer choice.
Trap 1: "24-Hour SLA Means Exactly 24 Hours"
The 24-hour SLA is a maximum, not a target time. Most batches complete in minutes to hours. Answers that treat the 24-hour number as the expected completion time, or that design downstream pipelines around waiting a full day, are wrong. Correct framing is "typically much faster than 24 hours; guaranteed not slower."
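The "much faster than 24 hours" framing has a direct design consequence: poll with backoff from the start rather than sleeping a day. Below is a minimal sketch of such a loop. The status fetcher is injected as a callable so the loop stays testable; in production it would wrap the batch-retrieval call and return the batch's processing_status string. The delay values are illustrative assumptions, not recommended settings.

```python
import time

def wait_for_batch(fetch_status, initial_delay=30, max_delay=600, sleep=time.sleep):
    """Poll until the batch reaches a terminal processing_status; return it.

    fetch_status: zero-argument callable returning the current status string
    (e.g. "in_progress" while running, "ended" once results are ready).
    """
    delay = initial_delay
    while True:
        status = fetch_status()
        if status != "in_progress":
            return status  # terminal: results are ready to collect
        sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Because most batches finish in minutes, the short initial delay picks up fast completions quickly, while the cap keeps long-running batches from being polled aggressively for hours.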
Trap 2: "custom_id is System-Generated"
custom_id is always caller-defined. There is no automatic identifier the API provides for correlating results back to application records. Answers that describe extracting a system-generated batch-item-ID from the response — or that skip custom_id entirely in favour of array indices — are architecturally broken. Always derive custom_id from your application primary key.
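A minimal sketch of the caller-side correlation contract, building the batch requests array with custom_id derived from an application primary key. The ticket records, prompt wording, and model name are illustrative assumptions; the per-item shape ({"custom_id": ..., "params": ...}) follows the request array described in this note.

```python
def build_batch_requests(tickets, model="claude-3-5-haiku-latest"):
    """Build the batch `requests` array, one entry per application record."""
    requests = []
    for ticket in tickets:
        requests.append({
            # Derived from the stable primary key; never generated by the API.
            "custom_id": f"ticket-{ticket['id']}",
            # Each params object is a complete, self-contained Messages request.
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [
                    {"role": "user",
                     "content": f"Classify this support ticket:\n{ticket['body']}"}
                ],
            },
        })
    return requests

reqs = build_batch_requests([{"id": 42, "body": "Login page returns 500."}])
```

Deriving custom_id from the primary key makes the later upsert idempotent: resubmitting a failed item under the same custom_id overwrites the same row rather than creating a duplicate.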
Trap 3: "Batch API Supports Streaming"
Batch does not support streaming. Each request's response is returned whole: there are no incremental tokens and no server-sent events. If a scenario requires streaming (chat UI, live dashboard), the synchronous Messages API is the only correct answer. Candidates who pick batch for a workload that requires streaming, lured by the 50 % discount, are directly penalised.
Trap 4: "Batch API Supports Multi-Turn Agentic Loops"
Each batched request is a single Messages API call. There is no mechanism to observe a tool_use mid-batch, execute the tool, and inject a tool_result back for Claude to consume. Workloads that require agentic loops (read a file, decide, read another file) are not batch-appropriate. Tool use inside batch is single-turn only, typically as strict-schema extraction via tool_choice.
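The one tool-use pattern that is batch-appropriate is single-turn strict-schema extraction, where tool_choice forces Claude to answer with exactly one schema-shaped tool_use block and no follow-up turn is needed. A sketch of the params for one such batched request follows; the tool name, schema fields, and model name are illustrative assumptions.

```python
# Hypothetical extraction tool: the input_schema IS the output contract.
EXTRACT_TOOL = {
    "name": "record_invoice",
    "description": "Record the structured fields of one invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "total", "currency"],
    },
}

def extraction_params(document_text, model="claude-3-5-sonnet-latest"):
    """Single-turn strict-extraction params for one batched request."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": [EXTRACT_TOOL],
        # Forcing the tool means the response is always a tool_use block
        # matching the schema; no tool_result turn ever needs to follow.
        "tool_choice": {"type": "tool", "name": "record_invoice"},
        "messages": [{"role": "user", "content": document_text}],
    }

p = extraction_params("Invoice from Acme Ltd, total 120.50 EUR.")
```

Note what is absent: no loop, no tool execution, no injected tool_result. That absence is precisely why this pattern fits batch while agentic tool use does not.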
Trap 5: "Always Batch to Save Cost"
The 50 % discount is attractive, but it does not override latency requirements. Scenarios that describe interactive agents, live user-facing applications, or real-time developer feedback loops require synchronous Messages API regardless of cost. Defaulting to batch because it is cheaper is the mirror image of the "always real-time for safety" anti-pattern and is penalised equally.
Practice Anchors
Batch-processing concepts show up most heavily in two of the six CCA-F scenarios. Treat the following as the architecture spine for scenario-cluster questions on task 4.5.
Structured-Data-Extraction Scenario
This is the scenario that most directly exercises batch design. A team needs to extract structured fields from a large corpus of documents — resumes, invoices, contracts, support tickets, research papers, regulatory filings. Expect questions that test: whether batch or synchronous is the right primitive for a given volume/latency mix; how to combine batch with strict tool use to force schema conformance; how to design custom_id so results join back to the source corpus cleanly; how to handle per-item errors without failing the batch; how to retry only the failed custom_id values; and how to size batches against the 100 000-request and 256 MB caps. The expected answer shape is almost always "submit a batch with strict-tool-use extraction, poll for completion, consume the JSONL result stream into your database keyed on custom_id, retry failed items in a follow-up batch."
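Sizing against the two structural caps is mechanical enough to sketch. The chunker below greedily splits a corpus-sized request list into batches that respect both the 100 000-request and 256 MB limits named above; the greedy strategy and the use of serialized JSON size as the byte measure are assumptions of this sketch.

```python
import json

MAX_REQUESTS = 100_000          # per-batch request cap from this note
MAX_BYTES = 256 * 1024 * 1024   # per-batch size cap from this note

def chunk_batches(requests, max_requests=MAX_REQUESTS, max_bytes=MAX_BYTES):
    """Split a request list into batches that respect both structural caps."""
    batches, current, current_bytes = [], [], 0
    for req in requests:
        size = len(json.dumps(req).encode("utf-8"))
        # Flush the current batch when adding this item would breach a cap.
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(req)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each resulting chunk is submitted as its own batch; because every item carries its own custom_id, the downstream join into the database is unaffected by how the corpus was split.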
Claude-Code-For-Continuous-Integration Scenario
This scenario exercises batch more subtly, in the context of CI/CD pipelines that run Claude Code analysis across large numbers of files, PRs, or issues on a nightly schedule. Expect questions that test when batch is the right primitive inside a CI pipeline (scheduled nightly corpus analysis) versus when synchronous Messages API is required (per-commit PR feedback for a developer who is waiting). Questions may also probe how -p (non-interactive) Claude Code invocations in CI integrate with or differ from the Message Batches API for raw Messages workloads. The architectural decision rule: if the CI task is per-event and a human is waiting, synchronous; if the CI task is scheduled bulk processing, batch.
FAQ — Message Batches API Top 5 Questions
What is the 50 % batch discount, and when does it apply?
The Message Batches API bills input, output, and cached tokens at 50 % of the synchronous Messages API rate for the same model. The discount applies automatically to any request submitted via the batch endpoint, across the whole Claude family (Haiku, Sonnet, Opus). The discount does not waive structural limits (100 000 requests, 256 MB) or the 24-hour SLA. On CCA-F, the 50 % discount is the economic pivot for batch-versus-real-time questions: it is the correct architectural choice when latency allows and volume is high, but it is never a reason to choose batch for interactive workloads where users are waiting on individual results.
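The economic pivot is simple arithmetic, sketched below. The per-million-token rates passed in are placeholders, not current Anthropic pricing; substitute real rates from the pricing page before using this for any actual estimate.

```python
def run_cost(n_requests, in_tokens, out_tokens, in_rate, out_rate, discount=1.0):
    """Cost of a corpus run; rates are $ per million tokens.

    discount=1.0 models the synchronous Messages API; discount=0.5 models
    the Message Batches API's 50 % billing rate on the same token volumes.
    """
    per_item = (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000
    return n_requests * per_item * discount

# Hypothetical corpus: 10 000 documents, 2 000 input / 500 output tokens each,
# at illustrative rates of $3 / $15 per million input / output tokens.
sync_cost = run_cost(10_000, 2_000, 500, 3.0, 15.0)
batch_cost = run_cost(10_000, 2_000, 500, 3.0, 15.0, discount=0.5)
```

Halving a large corpus bill is exactly the lever the exam expects candidates to weigh, and exactly the lever that must lose to a latency requirement when one exists.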
What does the 24-hour SLA actually mean?
The 24-hour SLA is the maximum time Anthropic commits to completing a submitted batch under normal conditions. In practice, most batches complete in minutes to hours; the 24-hour figure is an upper bound, not a target. If a batch does not complete within 24 hours, any requests still unprocessed are marked expired: results that did complete remain retrievable by custom_id, but the expired items are not processed and must be resubmitted in a new batch. Applications that require faster-than-24-hour guarantees on individual results should use the synchronous Messages API; applications that are comfortable with "overnight" latency use batch.
How do I correlate batch results back to my application records?
Use the custom_id field. Every request in the requests array must carry a unique, caller-defined custom_id string, and every result in the output JSONL stream is tagged with that same string. Derive custom_id from a stable application primary key (for example ticket-{id} or resume-{uuid}) so downstream upserts into your system of record are idempotent. Anthropic never generates custom_id for you — the correlation contract is entirely on the caller side. Never use array index or response ordering to correlate; the batch API does not guarantee order preservation.
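A minimal sketch of the consumption side of that contract: index every result line on its custom_id, never on its position. It assumes each JSONL line carries a custom_id field alongside a result object; the exact nesting of the result payload is an assumption standing in for the stream shape described above.

```python
import json

def index_results(jsonl_lines):
    """Key every result line on its caller-defined custom_id."""
    by_id = {}
    for line in jsonl_lines:
        record = json.loads(line)
        # Same string the caller submitted; order in the stream is irrelevant.
        by_id[record["custom_id"]] = record["result"]
    return by_id
```

With results keyed this way, the database upsert is a straight join on the primary key the custom_id was derived from, and re-running the consumer over the same stream is harmless.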
What happens when individual requests inside a batch fail?
The batch API uses partial-success semantics. A per-item error affects only that request — the rest of the batch continues to process, and successful items are returned normally. The request_counts block tracks succeeded, errored, canceled, and expired counts, and the result stream includes an error block (instead of a message block) for each failed custom_id. The correct retry strategy is to filter the result stream for errors, exclude permanent errors (schema violations, content-filter rejections) from retry, and submit a new batch containing only the retryable custom_id values — reusing the original custom_id strings for idempotent upserts.
When should I choose batch over synchronous Messages API, and vice versa?
Choose batch when the workload is high-volume, non-interactive, and the business deadline is measured in hours or the next day rather than seconds — structured extraction over document corpora, nightly dataset labelling, scheduled CI/CD bulk analysis, multi-pass review pipelines. Choose synchronous Messages API when an individual result needs low-latency delivery (seconds), when a human is actively waiting, when the workload requires multi-turn agentic loops, or when streaming output is required (chat UIs, live dashboards). The CCA-F decision rule: read latency first, then volume, then weigh cost. Defaulting to batch because it is cheaper or to synchronous because it is safer — without matching to the workload shape — is the classic anti-pattern.
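The decision rule reads well as a toy function: hard disqualifiers first, then latency, then volume, with cost as the tiebreaker. The thresholds below are illustrative assumptions for study purposes, not exam-blessed numbers.

```python
def choose_primitive(latency_budget_s, n_items,
                     needs_streaming=False, needs_agentic_loop=False):
    """Return 'messages' (synchronous) or 'batch' for a given workload shape."""
    # Hard disqualifiers: batch supports neither streaming nor multi-turn loops.
    if needs_streaming or needs_agentic_loop:
        return "messages"
    # Latency first: any per-item guarantee tighter than 24 h rules out batch.
    if latency_budget_s < 24 * 3600:
        return "messages"
    # Then volume: overnight bulk work takes the 50 % discount.
    if n_items > 1:
        return "batch"
    return "messages"
```

Walking a scenario through this order (disqualifiers, latency, volume, cost) is a reliable way to avoid both the "always batch" and "always real-time" anti-patterns the traps section describes.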
Further Reading
- Batch processing — Message Batches API: https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
- Strict tool use — schema-guaranteed output: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/strict-tool-use
- Increase output consistency — structured outputs: https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency
- Handle tool calls — tool_result format and error responses: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/handle-tool-calls
- Integrate Claude Code into CI/CD pipelines: https://docs.anthropic.com/en/docs/claude-code/ci-cd
Related ExamHub topics: Structured Output with Tool Use and JSON Schemas, Multi-Instance and Multi-Pass Review Architectures, Claude Code CI/CD Integration, Validation, Retry, and Feedback Loops for Extraction.