Task statement 5.3 — "Implement error propagation strategies across multi-agent systems" lives inside Domain 5 (Context Management & Reliability, 15 % exam weight) of the Claude Certified Architect — Foundations (CCA-F) exam. Although Domain 5 is the lightest domain by weight, its scenario density is extremely high: every multi-agent question — and multi-agent questions dominate Domain 1 (27 %) as well — eventually bends toward "what happens when a subagent breaks?" Error propagation is the stitching that decides whether a multi-agent system degrades gracefully, halts safely, or cascades into complete failure.
This study note walks through the full error-propagation surface a CCA-F candidate must be able to design at the architecture level: why nested agent systems create new failure boundaries, how to classify errors as transient versus permanent and local versus cascading, when fail-fast beats fail-safe, how to contain a subagent crash so the coordinator survives, the exact shape of a structured error payload that crosses an agent boundary without losing semantics, how coordinators decide whether to retry a failed subagent, how to proceed with partial results when only some subagents succeed, how to accumulate soft failures until a hard-abort threshold trips, the circuit-breaker pattern for disabling repeat offenders, how to chain root-cause error reporting up to the top-level coordinator, and the distributed-tracing surface you need to debug any of it. A final traps section and six-question FAQ anchor every abstraction to the two exam scenarios that exercise this task most heavily: multi-agent-research-system and customer-support-resolution-agent.
The Error Propagation Challenge — Failures Crossing Boundaries in Nested Agent Systems
A single-agent Claude system has exactly one failure boundary — the agentic loop itself. When a tool fails, the loop sees it; when the loop decides to abort, the caller sees it. A multi-agent system has as many failure boundaries as it has agent layers. A coordinator that spawns three subagents, each of which spawns two more subagents of its own, has a failure surface with at least three boundary types: tool-to-subagent, subagent-to-coordinator, and coordinator-to-caller. Each boundary is an opportunity for an error to be swallowed, distorted, or turned into a silent cascade.
Why Boundaries Matter
A boundary is a point where execution context changes hands. Inside a single agentic loop, Claude sees every tool_result — including errors — as part of a continuous conversation. Across an agent boundary, only what you explicitly pass through is visible. A subagent that fails catastrophically will not automatically tell the coordinator what happened; it will simply return whatever your integration code decides to return. That decision is the error-propagation strategy.
The Three Failure Modes of Multi-Agent Systems
- Silent degradation — A subagent fails and returns a plausible-looking placeholder result. The coordinator trusts the placeholder, synthesizes it into the final answer, and the user receives fabricated content with no error indicator. This is the worst failure mode because it is invisible.
- Cascade failure — A subagent fails and throws an exception that propagates up through the coordinator, crashing the entire run even though other subagents were healthy. A coordinator that cannot proceed without every subagent's result is too brittle.
- Infinite retry — A coordinator retries a failing subagent indefinitely, exhausting tokens and wall-clock budget without ever producing output. This is the failure mode that circuit breakers exist to prevent.
Error propagation is the set of mechanisms by which a failure at one layer of a multi-agent system is surfaced to the next layer up, with enough structured detail for that layer to decide whether to retry, skip, degrade, or abort. Error propagation is the architectural complement to error handling: error handling decides what to do at the failure site; error propagation decides what the failure site tells upstream. A multi-agent system without an explicit propagation strategy is a system that silently fabricates answers.
Why CCA-F Tests This
Community pass reports repeatedly cite Domain 5 scenarios where the candidate must pick the correct coordinator behaviour when a subagent returns a structured error. The distractors are almost always "retry indefinitely", "silently drop the subagent", or "abort the whole run" — each of which is correct in specific circumstances and wrong in others. Getting error propagation right is a core differentiator between a 720 pass and a 900-plus score.
Error Classification — Transient vs Permanent, Local vs Cascading
Before designing any propagation strategy you must classify the error. CCA-F expects recognition of two orthogonal axes.
Axis 1: Transient vs Permanent
- Transient — The underlying cause is expected to resolve on its own without human intervention. Examples: network timeout, rate-limit throttle, momentary database contention, a downstream service rebooting. Transient errors are candidates for retry.
- Permanent — The underlying cause will not resolve without a change to inputs, configuration, or external state. Examples: malformed input, permission denied, resource not found, schema violation. Permanent errors must not be retried; retrying them wastes budget and can mask real problems.
The distinction matters because a coordinator that retries permanent errors will exhaust the iteration cap before giving up, while a coordinator that does not retry transient errors will abort on the first network blip.
Axis 2: Local vs Cascading
- Local — The error affects only one subagent's work and does not invalidate work already completed by peer subagents. A research agent that cannot reach one particular source still has valid results from three other sources.
- Cascading — The error invalidates upstream assumptions or downstream work that depends on this subagent's output. If the "authenticate user" subagent fails, every downstream "fetch user data" subagent is now operating on invalid premises.
A local error is a candidate for partial-failure handling (proceed with what you have). A cascading error must short-circuit downstream work.
The Four Quadrants
| | Local | Cascading |
|---|---|---|
| Transient | Retry within the subagent; coordinator proceeds with partial results if retry exhausts | Retry with backoff; if retry exhausts, abort the whole pipeline |
| Permanent | Skip this subagent; coordinator proceeds with partial results and annotates the gap | Abort immediately; the pipeline's premises are broken |
CCA-F distractors frequently scramble these quadrants. The trap pattern is to offer "retry three times" as the answer for a permanent local error, or "proceed with partial results" for a cascading permanent error. Before you pick a propagation strategy on exam day, run the scenario through the four-quadrant matrix: is it transient or permanent, and is it local or cascading? The answer falls out almost mechanically.
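The four-quadrant matrix can be sketched as a small dispatch function (a study aid, not SDK code; the enum and function names are illustrative):

```python
from enum import Enum

class Duration(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"

class Scope(Enum):
    LOCAL = "local"
    CASCADING = "cascading"

def propagation_action(duration: Duration, scope: Scope) -> str:
    """Map an error's quadrant to the coordinator action from the matrix."""
    if duration is Duration.TRANSIENT:
        # Transient errors earn a retry; scope decides the fallback.
        return ("retry, then proceed with partial results"
                if scope is Scope.LOCAL
                else "retry with backoff, then abort pipeline")
    # Permanent errors are never retried; scope decides skip vs abort.
    return ("skip subagent, annotate the gap"
            if scope is Scope.LOCAL
            else "abort immediately")
```

Running a scenario through this function is exactly the mental motion the exam expects: classify on both axes first, then read off the action.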
Fail-Fast vs Fail-Safe Strategies — When to Abort vs When to Gracefully Degrade
Error-propagation strategy lives on a spectrum between two poles: fail-fast (stop at the first sign of trouble) and fail-safe (keep producing useful output even when parts of the system misbehave). Neither pole is universally correct; CCA-F tests whether you can match strategy to scenario.
Fail-Fast
Fail-fast strategies abort execution at the earliest detection of a problem. The coordinator stops spawning new subagents, reports the error upward with full context, and lets the caller decide what to do. Appropriate when:
- Downstream work depends on the correctness of the failed step (cascading permanent errors).
- The system handles critical state changes (financial transactions, medical records, destructive operations).
- Partial output is worse than no output (a legal document with missing clauses is not usable; skip-and-proceed would produce something dangerous).
- The caller is a human or another automated system that can recover only from explicit error signals.
Fail-Safe
Fail-safe strategies keep the coordinator running despite subagent failures, producing a best-effort result that is clearly annotated with which parts succeeded and which did not. Appropriate when:
- Subagent work is independent (local errors).
- Partial results have genuine user value (three out of four research sources is still useful research).
- The user can make an informed decision from a partial answer (a support agent that resolved the billing question can still do so even if the product-info lookup failed).
- The cost of aborting is high (user frustration, lost work, missed SLA).
The Coordinator's Role
A coordinator is where the fail-fast vs fail-safe decision is actually implemented. Subagents typically do not know whether their failure is cascading or local — only the coordinator has the dependency map. The coordinator reads the structured error, consults its own knowledge of which subagents are critical versus optional, and picks the strategy.
When an exam question describes a multi-agent research system where four subagents investigate independent topics, the correct strategy is almost always fail-safe with partial-result handling. When an exam question describes a support agent where the "verify customer identity" subagent fails, the correct strategy is almost always fail-fast — you cannot safely proceed with unverified identity. Pattern-match on whether downstream work depends on the failed step.
Error Containment — Preventing Subagent Failure From Crashing the Coordinator
A subagent that throws an unhandled exception can tear down the coordinator's entire run if you do not design containment. The CCA-F expectation is that subagent invocations are wrapped in containment primitives that turn raw failures into structured return values the coordinator can reason about.
Three Levels of Containment
- Exception boundary — Every subagent invocation must be inside a try-catch (or language-equivalent) block. An uncaught exception inside a subagent must be caught at the invocation site and converted into a structured error return value. Never let a subagent exception propagate raw.
- Resource isolation — Subagents should not share mutable state with the coordinator. If a subagent corrupts its own conversation history, that corruption must not reach the coordinator's history. The Agent SDK's subagent primitives provide this isolation by default — each subagent runs with its own context window. You lose that isolation if you naively merge subagent messages into the coordinator's message history.
- Timeout wall — Every subagent invocation must have a maximum wall-clock budget. A subagent that hangs on a slow tool must not hold the coordinator hostage indefinitely. The timeout is part of containment because it prevents "hung subagent" from turning into "hung coordinator."
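The three containment levels can be combined in one invocation wrapper. This is a minimal sketch: `subagent_fn` is a placeholder for whatever actually invokes the subagent, and the structured-error dict shape is illustrative, not an SDK type.

```python
import concurrent.futures

def contained_invoke(subagent_fn, task, timeout_s=120):
    """Run one subagent behind an exception boundary and a timeout wall."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(subagent_fn, task)
    try:
        # Resource isolation: the subagent's result is a value, never
        # a merge into the coordinator's own message history.
        return {"ok": True, "result": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        # Timeout wall: a hung subagent becomes a structured transient error.
        return {"ok": False, "errorCategory": "transient", "isRetryable": True,
                "message": f"subagent exceeded {timeout_s}s wall-clock budget"}
    except Exception as exc:
        # Exception boundary: never let a subagent exception propagate raw.
        return {"ok": False, "errorCategory": "internal", "isRetryable": False,
                "message": str(exc)}
    finally:
        pool.shutdown(wait=False)  # do not let a hung worker hold the coordinator
```

The coordinator then branches on the returned dict instead of wrapping every call site in its own try-catch.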
Subagent Context Isolation Is Not Error Isolation
This is the single most tested nuance in the CCA-F error-propagation surface. Subagent context is isolated — the subagent does not see the coordinator's conversation history, and vice versa. That isolation is useful for token management and information hiding. It does not automatically isolate errors. If the subagent throws an exception and your code does not catch it at the invocation site, the coordinator's process crashes even though the contexts were separate.
Community pass reports flag this trap specifically. The distractor reads "subagents have isolated context, so their errors cannot corrupt the coordinator." That statement is only half true. Context isolation prevents conversation-history corruption, but it does not prevent process-level or control-flow corruption. You still need explicit try-catch around every subagent invocation. Answers that conflate context isolation with error isolation are wrong.
Containment in the Agent SDK
When you use the Agent SDK's subagent primitives, the SDK provides default containment: the subagent runs in its own loop, its result (or error) is returned as a structured value to the coordinator, and the coordinator receives it as an observation it can reason about. You still need to check for error indicators on that result and decide what to do. The SDK will not invent a propagation strategy for you.
Structured Error Payloads Across Agent Boundaries — Preserving Error Semantics
When an error crosses an agent boundary — subagent to coordinator, or coordinator to caller — it must carry enough structured information for the receiver to decide what to do. A generic string like "something went wrong" forces every receiver to guess.
The Five Fields Every Cross-Boundary Error Payload Needs
- `errorCategory` — A controlled vocabulary classification: `transient`, `permission`, `business`, `internal`, `validation`. This tells the receiver which of the four quadrants applies.
- `isRetryable` — Boolean. Directly encodes the retry decision so the receiver does not have to re-derive it from the category.
- `message` — A concise, human-readable description of what went wrong. Actionable, not a stack trace dump.
- `context` — Structured metadata about the failure site: which subagent, which tool, which input, which iteration. Essential for debugging and for the coordinator to attribute failure correctly.
- `correlationId` — The distributed-tracing identifier that ties this error back to the top-level request. Essential for observability (covered later).
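The five-field payload maps directly onto a small data type. The class below is an illustrative sketch, not an SDK type; only the field names come from the list above.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorPayload:
    """Structured error crossing an agent boundary (illustrative shape)."""
    errorCategory: str                           # transient | permission | business | internal | validation
    isRetryable: bool                            # encodes the retry decision directly
    message: str                                 # concise, human-readable, actionable
    context: dict = field(default_factory=dict)  # subagent, tool, input, iteration
    correlationId: str = ""                      # ties the error to the top-level request
```

A coordinator can now dispatch deterministically on `errorCategory` and `isRetryable` instead of pattern-matching on message strings.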
Why Structured Payloads Matter
A coordinator that receives only a string has to pattern-match on that string to decide what to do. If the subagent library changes its error wording, the coordinator's logic breaks silently. A coordinator that receives a structured payload can dispatch on `errorCategory` and `isRetryable` with deterministic code. This is the same pattern Domain 2 tests for MCP tool errors — the consumption pattern is identical when the consumer is a coordinator instead of Claude.
Preserving Semantics Up the Chain
When a coordinator re-raises an error to its own caller, it must not collapse the structured payload into a string. Wrap, do not flatten. The top-level caller should still be able to see the original errorCategory, the original subagent's identity, and the original context — not a seven-layer stack of "error in step X: error in step Y: error in step Z."
A structured error payload is a machine-readable object that crosses an agent boundary carrying at minimum: errorCategory (controlled vocabulary), isRetryable (boolean), message (human-readable), context (structured metadata), and correlationId (distributed tracing ID). Structured payloads are what let a coordinator make correct retry, skip, or abort decisions without guessing. Generic string errors are the canonical anti-pattern because they force every receiver to re-derive semantics from text.
Retry Decisions at Coordinator Level — When Coordinator Should Retry a Subagent
Once a coordinator receives a structured error from a subagent, the next decision is: retry the subagent, skip it, or abort? Retry logic at the coordinator level is subtly different from retry logic inside a single agentic loop.
The Retry Decision Tree
1. Read `isRetryable` on the structured error. If false, do not retry; either skip (if local) or abort (if cascading).
2. If `isRetryable` is true, check the retry budget for this subagent on this coordinator run. If the subagent has already been retried N times (typical N = 2 or 3), stop retrying.
3. If the retry budget is available, apply exponential backoff before the retry — waiting 1 second, then 2, then 4. Retrying instantly usually re-hits the same transient condition.
4. If retry succeeds, proceed with the result. If it fails again, increment the failure counter and return to step 2.
5. If the retry budget is exhausted, treat the subagent's failure as terminal for that subagent and apply the appropriate fallback (partial-result handling for local; abort for cascading).
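The decision tree collapses into a short loop. A minimal sketch, assuming `invoke` returns a dict shaped like the structured payload discussed above; `sleep` is injectable so the backoff is testable.

```python
import time

def run_with_retry(invoke, task, max_retries=2, base_delay_s=1.0, sleep=time.sleep):
    """Coordinator-level retry: honours isRetryable, a per-subagent retry
    budget, and exponential backoff (1s, 2s, 4s, ...)."""
    attempt = 0
    while True:
        result = invoke(task)
        if result.get("ok"):
            return result
        if not result.get("isRetryable") or attempt >= max_retries:
            return result  # terminal for this subagent: caller decides skip vs abort
        sleep(base_delay_s * (2 ** attempt))  # exponential backoff before retrying
        attempt += 1
```

Note that the function never loops forever: either the error is not retryable or the budget runs out, and the caller receives the terminal error to apply partial-result handling or abort.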
Retry Budget vs Iteration Cap
A coordinator with three subagents and a retry budget of two per subagent has an effective outer-loop iteration count of up to nine subagent executions (three subagents times three attempts each). Combine this with each subagent's inner agentic-loop iteration cap and the total token envelope can be large. Budget planning is part of the design.
Idempotency Considerations
Retries are safe only if the subagent's work is idempotent. A subagent that creates a database record on every run is not safely retryable — retrying produces duplicates. A subagent that fetches and summarizes a document is idempotent — retrying is safe. Distinguish these at design time. If a subagent is not idempotent, encode that in its definition so the coordinator refuses to retry it, regardless of isRetryable.
CCA-F scenarios that describe subagents performing destructive actions (writing to production databases, sending emails, calling paid APIs) should not be paired with coordinator-level retry. Answers that recommend automatic retry for destructive actions are wrong. Retry is for idempotent work only; destructive work should fail-fast and escalate to a human.
Partial Failure Handling — Coordinator Proceeding With Available Subagent Results
Fail-safe strategy in practice looks like this: the coordinator collects results from every subagent, treats absences and errors as gaps, and produces the best answer it can with the available data.
The Annotated-Gap Pattern
Every partial result must be explicitly annotated with what is missing. A research summary that says "according to the three sources consulted" is honest and auditable; a research summary that says "according to four sources" when only three responded is a lie by fabrication.
The annotation lives in the final synthesis step. When the coordinator gathers subagent outputs, it builds a data structure that records both successful results and failure placeholders, then passes that structure to the synthesis prompt with instructions to explicitly note the gaps.
Threshold Logic
Partial-failure handling is only safe above a minimum-completeness threshold. If a research agent was supposed to consult four sources and only one responded, "partial result" is not partial — it is one-source. Define the threshold in advance:
- Absolute — at least N successful subagents.
- Proportional — at least X % of subagents.
- Criticality-weighted — specific subagents are marked "required" and their failure forces abort regardless of how many others succeeded.
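All three threshold styles compose into one pre-flight check before synthesis. The function below is a sketch; parameter names are illustrative, and `results` is assumed to map subagent name to a success flag.

```python
def meets_threshold(results, required_names=(), min_successes=1, min_fraction=0.0):
    """Decide whether partial-failure handling is safe to proceed with."""
    succeeded = {name for name, ok in results.items() if ok}
    # Criticality-weighted: any required subagent failing forces abort.
    if any(name not in succeeded for name in required_names):
        return False
    # Absolute floor: at least N successful subagents.
    if len(succeeded) < min_successes:
        return False
    # Proportional floor: at least X% of subagents succeeded.
    if results and len(succeeded) / len(results) < min_fraction:
        return False
    return True
```

Defining this check before the run starts (rather than improvising after failures arrive) is exactly the "define the threshold in advance" discipline described above.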
Graceful Degradation in Exam Scenarios
The multi-agent-research-system scenario is the canonical fail-safe candidate: four topic researchers feed one synthesizer; losing one is acceptable, losing three is not. The customer-support-resolution-agent scenario mixes strategies: the identity-verification subagent is critical (fail-fast on failure), while the suggested-articles subagent is optional (fail-safe).
Error Accumulation — Tracking Multiple Soft Failures Before Hard Abort
Not every error is individually fatal. A coordinator may tolerate one flaky tool call or one subagent hiccup but must abort if the pattern repeats. Error accumulation is the mechanism that distinguishes noise from a real outage.
The Accumulator Pattern
Maintain a failure counter across the coordinator's run — per subagent, per tool, or globally. On each soft failure (a `tool_result` with `is_error: true`, a structured error with `errorCategory: transient`, a retry that succeeded only on the final attempt), increment the counter. When any counter crosses a threshold, escalate the coordinator's response:
- Below threshold: log, continue, rely on the built-in retry logic.
- At threshold: alert observability, switch to a safer strategy (lower parallelism, longer timeouts, simpler tool paths).
- Above threshold: hard abort the run and escalate to human review.
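The accumulator pattern is a few lines of state. A sketch with illustrative names; the two thresholds are fixed at construction time, before the run starts.

```python
from collections import Counter

class FailureAccumulator:
    """Per-key soft-failure counters with a warn and an abort threshold."""

    def __init__(self, warn_at=2, abort_at=4):
        self.warn_at = warn_at
        self.abort_at = abort_at
        self.counts = Counter()  # reset by constructing fresh per coordinator run

    def record(self, key):
        """Increment on a soft failure; return the escalation level."""
        self.counts[key] += 1
        n = self.counts[key]
        if n >= self.abort_at:
            return "abort"      # hard-abort the run, escalate to human review
        if n >= self.warn_at:
            return "degrade"    # alert observability, switch to a safer strategy
        return "continue"       # below the noise floor: log and carry on
```

Constructing a fresh accumulator at the start of each coordinator run gives you the counter reset for free.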
Why Accumulation Beats Per-Error Reaction
A coordinator that reacts to every single soft failure by aborting is too brittle for production. A coordinator that never reacts to accumulated failures is too permissive and will happily burn budget in a degraded state. Accumulation thresholds let you tolerate the noise floor while still catching sustained failures.
Accumulation in Multi-Agent Scenarios
A research coordinator spawning ten topic subagents might tolerate one or two individual failures (partial-result handling) but hard-abort if four or more fail within the same run — because four failures in one run likely signals an upstream outage rather than individual source issues.
Error accumulation checklist (commit to memory for exam day):
- Maintain counters per subagent and per error category.
- Increment on every soft failure, including successful-on-retry.
- Set thresholds before the run starts, not dynamically.
- Escalate response when thresholds cross — do not wait for hard abort.
- Reset counters at the start of each new coordinator run.
If an answer describes a coordinator that reacts only to the current error without any accumulation history, it is missing the reliability pattern CCA-F expects.
Circuit Breaker Pattern — Disabling Failing Subagents After Repeated Failures
The circuit breaker is the industry-standard pattern for preventing a downstream failure from swamping an upstream system. CCA-F expects you to recognize the pattern, know its three states, and — importantly — know that it is a design pattern you implement yourself, not a built-in Agent SDK feature.
The Three States
- Closed — Normal operation. Requests flow to the subagent. Failures are counted.
- Open — The failure count has crossed the threshold. All requests are rejected at the coordinator without even attempting the subagent. Fast failure replaces slow failure.
- Half-Open — After a cooldown period, a single probe request is allowed through. If it succeeds, the breaker returns to Closed. If it fails, the breaker re-opens for another cooldown.
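Because the SDK ships no breaker, the three states have to be hand-rolled. The class below is one minimal way to do it (an injectable clock makes the cooldown testable); it is a sketch, not a library API.

```python
import time

class CircuitBreaker:
    """Closed -> Open -> Half-Open breaker for one subagent."""

    def __init__(self, failure_threshold=3, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means Closed

    def allow(self):
        if self.opened_at is None:
            return True                                      # Closed: requests flow
        if self.clock() - self.opened_at >= self.cooldown_s:
            return True                                      # Half-Open: admit one probe
        return False                                         # Open: fast failure

    def record_success(self):
        self.failures, self.opened_at = 0, None              # probe succeeded: re-close

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()                    # trip (or re-trip) open
```

The coordinator checks `allow()` before every invocation of that subagent and skips it outright when the breaker is open, which is exactly the "fast failure replaces slow failure" behaviour described above.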
Why the Breaker Helps
Without a breaker, a coordinator with a persistently failing subagent will retry that subagent on every run, wasting tokens and latency on failures that are now near-certain. A breaker lets the coordinator skip the failing subagent entirely after it has proven itself unhealthy, reserving resources for subagents that still work. In partial-failure scenarios this is pure upside: you get the healthy subagents' output faster.
Breaker vs Accumulator
Accumulators track errors across a single run to decide when to abort. Breakers track errors across multiple runs (or longer time windows) to decide when to disable. They are complementary, not alternatives. A production multi-agent system typically has both.
Breaker Is a Pattern, Not an SDK Feature
This is the second most tested CCA-F nuance in the error-propagation surface. The Claude Agent SDK does not ship a built-in circuit breaker for subagent invocations. If your design relies on a circuit breaker, you are responsible for implementing it — whether via a wrapper around subagent calls, a library like Polly / resilience4j in your runtime, or custom state in your coordinator.
A common distractor claims "enable the Agent SDK circuit breaker" or "configure the built-in subagent circuit breaker in settings." Both phrasings are wrong. Circuit breaker is a pattern choice that you implement in your coordinator's invocation wrapper; it is not a built-in SDK feature you turn on with a flag. Answers that describe circuit breaker as an SDK setting are incorrect.
Error Reporting Chain — Surfacing Root-Cause Errors to the Top-Level Coordinator
In a deep multi-agent system, an error that occurs three layers down must be visible to the top-level coordinator with its root cause preserved. The error-reporting chain is the discipline that makes this possible.
Wrap, Do Not Flatten
When layer N's coordinator catches an error from layer N+1's subagent, it must wrap that error in its own structured payload — adding its own context — rather than collapsing it into a generic string. The receiver at layer N-1 should be able to drill from the outer wrapper down to the inner root cause.
Conceptually the payload chain looks like a nesting doll:
- Top wrapper: `{ source: "synthesis_coordinator", errorCategory: "subagent_failure", cause: ... }`
- Middle wrapper: `{ source: "topic_coordinator", errorCategory: "tool_failure", cause: ... }`
- Innermost: `{ source: "web_search_tool", errorCategory: "transient", message: "connection timeout" }`
Root-Cause Attribution
Top-level logs and user-facing error messages should prefer the innermost error's category and message while preserving the full chain for debugging. A user-facing message like "We could not complete your request because a search tool timed out" is clearer than "Synthesis coordinator failed: topic coordinator failed: web search tool failed: timeout" even though the latter is more complete.
The Correlation-ID Tie-Back
Every error in the chain carries the same correlationId. This lets the observability layer reconstruct the full error tree post-hoc from distributed traces — essential when the end user's error report is the thin summary but the engineer needs the full chain.
Observability for Multi-Agent Errors — Distributed Tracing and Correlation IDs
You cannot debug what you cannot see. Multi-agent error propagation is invisible without explicit observability. CCA-F expects candidates to know the minimum observability surface for a production multi-agent system.
What Every Subagent Invocation Should Emit
- Start event — timestamp, subagent name, coordinator correlationId, parent span ID, input token count.
- Tool call events — every `tool_use` and `tool_result` inside the subagent (matching the single-agent loop observability surface).
- End event — timestamp, duration, final status (success / soft error / hard error), output token count, structured error payload if failed.
Correlation IDs
A correlationId (or traceId) is a unique identifier generated at the top-level coordinator and propagated through every subagent invocation, tool call, and error payload. It is the primary key that lets you reconstruct the entire cross-agent flow from logs.
Without correlation IDs, logs from a coordinator and its subagents are three separate rivers and you cannot tell which subagent's log lines belong to which coordinator run. With correlation IDs, every log line on every agent can be stitched into a single trace.
Distributed-Tracing Spans
Each subagent invocation is a span under the coordinator's parent span. Each tool call inside the subagent is a child span under the subagent's span. The resulting tree visualizes the exact shape of the run — including which subagent took how long, which tool calls failed, and where latency accumulated. Tracing tools (OpenTelemetry, Datadog APM, etc.) render this tree natively when spans and correlationIds are emitted correctly.
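The span shape can be sketched as a context manager that emits start and end events carrying the `correlationId`. This is a minimal illustration, not the OpenTelemetry API; `emit` stands in for whatever log or trace exporter the system actually uses.

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def span(name, correlation_id, parent_span_id=None, emit=print):
    """Emit start/end events for one span in the subagent invocation tree."""
    span_id = uuid.uuid4().hex[:8]
    start = time.monotonic()
    emit({"event": "start", "span": span_id, "parent": parent_span_id,
          "name": name, "correlationId": correlation_id})
    status = "success"
    try:
        yield span_id  # caller nests child spans under this span_id
    except Exception:
        status = "hard_error"  # failure is visible in the trace, then re-raised
        raise
    finally:
        emit({"event": "end", "span": span_id, "name": name,
              "correlationId": correlation_id, "status": status,
              "duration_s": time.monotonic() - start})
```

Nesting calls (coordinator span, subagent spans under it, tool-call spans under those) reproduces the tree that tracing backends render natively.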
Why Observability Is a Propagation Strategy Component
Error propagation is what you do with an error after it occurs. Observability is what you see when an error occurs. Without observability, even a correctly-propagated error is debugged by guesswork. CCA-F scenarios will sometimes frame the question as "how do you know this multi-agent system is failing?" — and the answer is always structured logs plus correlation IDs plus distributed tracing, not "ssh into the server and grep."
Multi-agent observability is not optional on CCA-F. Scenario answers that recommend fixing a broken multi-agent system without first adding tracing and correlation IDs are incorrect. You cannot reason about cross-agent errors — let alone propagate them sensibly — without the observability to see them.
Plain-English Explanation
Error propagation is abstract until you ground it in systems most people already know. Three very different analogies cover the full sweep of the concept.
Analogy 1: The Hospital Triage Network — Error Classification and Fail-Fast vs Fail-Safe
Imagine a hospital with a main triage desk (the coordinator) that directs incoming patients to specialist wards (subagents): cardiology, radiology, orthopedics, and a general lab. A patient comes in complaining of chest pain. The triage desk sends requests to all four wards simultaneously. Cardiology reports back quickly with an EKG result; radiology reports a chest X-ray; orthopedics reports nothing relevant; but the general lab reports that its centrifuge is broken and the blood panel cannot be run for two hours. Now the triage desk must decide. If the missing blood panel is critical (cardiac enzymes for a suspected heart attack), the triage desk fails fast — it escalates to the on-call physician and does not pretend to have a full picture. If the missing lab result is nice-to-have (a routine vitamin D check), the triage desk fails safe — it produces a report annotated with "vitamin D pending, retry tomorrow." A centrifuge that breaks for the third time this week is a cascading permanent problem (order a new centrifuge, do not keep retrying), while a centrifuge that occasionally glitches is transient (retry after reboot). The hospital also has a rule: if three pieces of equipment break in one shift, escalate to the department head — that is error accumulation. If one piece of equipment has broken twelve times in a month, stop sending jobs to it entirely until maintenance certifies it fixed — that is a circuit breaker. Every patient, every test, every technician carries the same admission ID so any problem can be traced back — that is the correlationId. A multi-agent Claude system with good error propagation is a hospital where the triage desk never pretends a missing result is present and never grinds to a halt just because one machine is down.
Analogy 2: The Shipping Container Dock — Containment and Boundaries
Picture a container dock with multiple cranes, each loading a different ship. If one crane's hydraulics fail mid-lift, you do not want that failure to cascade to the neighbouring cranes, bring down the dockyard control tower, or sink the ship currently being loaded. Each crane operates in a steel-reinforced zone with automatic shutoffs that isolate the failure: the crane stops, alarms sound, the control tower gets a structured incident report (crane #3, hydraulic pressure loss, load already suspended, time of failure), and the other cranes keep working. That is error containment. The control tower's incident report is a structured payload, not a panicked shout of "something is wrong with a crane somewhere!" — it specifies which crane, what category of failure (mechanical vs electrical vs operator vs weather), whether it is retryable (let it cool down for ten minutes) or permanent (call maintenance), and a reference number that ties this incident to the shipment manifest. The dock foreman does not decide whether to abort the whole ship's loading based on one crane's alarm; he looks at his dependency map — can ship #7 sail with four of six containers? If yes, fail-safe with partial loading; if no, fail-fast and reschedule. A container dock without this choreography turns every crane hiccup into a full dock shutdown. A multi-agent system without containment turns every subagent exception into a coordinator crash.
Analogy 3: The Power Grid — Circuit Breakers and Cascade Prevention
A city power grid is the cleanest analogy for the circuit breaker pattern because the name itself comes from electrical engineering. When one transformer faults, the breaker attached to that transformer trips open — cutting the transformer off from the rest of the grid — before the fault can burn out neighbouring equipment. The breaker is the boundary between a local failure and a cascading blackout. Every home has one (the main breaker), every neighbourhood substation has one, and every transmission line has one, stacked in layers exactly like a multi-agent system's coordinator hierarchy. After the breaker trips, grid operators do not immediately re-close it; they wait out a cooldown period (half-open state) and then attempt a test re-energization. If the fault was transient (a branch fell on a line and has now been cleared), the breaker stays closed and service resumes. If the fault was permanent (the transformer itself is ruined), the breaker trips again immediately and stays open until a human intervenes. A grid without breakers is a grid where any single fault blacks out an entire region — exactly what happens in a multi-agent Claude system that retries a dead subagent indefinitely. Importantly, the breakers are not something the grid operator "turns on with a switch in the control software" — they are physical devices that had to be installed and wired in. CCA-F tests this nuance: a circuit breaker for subagents is a pattern you implement, not an SDK setting you toggle.
Which Analogy Fits Which Exam Question
- Questions about classification, fail-fast vs fail-safe, partial failure handling → hospital triage analogy.
- Questions about containment boundaries, structured payloads, preventing crash propagation → container dock analogy.
- Questions about circuit breakers, repeated-failure disabling, cascade prevention → power grid analogy.
Common Exam Traps
CCA-F Domain 5 consistently exploits six recurring trap patterns around error propagation in multi-agent systems. Each trap below has surfaced in community pass reports as a plausible-looking distractor.
Trap 1: Assuming Context Isolation Equals Error Isolation
Subagents run with isolated context windows, which protects the coordinator's conversation history from being polluted by subagent messages. Distractors extend this truth into the false claim that subagent failures cannot affect the coordinator. They can. An uncaught subagent exception still crashes the coordinator's process; a subagent that returns malformed output still corrupts the coordinator's synthesis step. You still need explicit try-catch plus output validation around every subagent invocation. Answers that treat context isolation as automatic error isolation are wrong.
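The try-catch-plus-validation discipline can be sketched as a thin wrapper at the invocation site. This is a minimal sketch, not SDK API: the names `invokeSubagent`, `AgentError`, and the payload fields are illustrative assumptions that mirror the structured-payload shape described later in this note.

```typescript
// Hypothetical sketch: wrap every subagent invocation so failures become
// structured data returned to the coordinator, never uncaught exceptions.
interface AgentError {
  ok: false;
  errorCategory: string;            // e.g. "network", "permission", "validation"
  isRetryable: boolean;
  message: string;
  context: Record<string, unknown>;
  correlationId: string;
}

type AgentResult<T> = { ok: true; value: T } | AgentError;

async function invokeSubagent<T>(
  name: string,
  correlationId: string,
  run: () => Promise<T>,
  validate: (v: T) => boolean,
): Promise<AgentResult<T>> {
  try {
    const value = await run();
    // Context isolation does not validate output; the coordinator must.
    if (!validate(value)) {
      return {
        ok: false,
        errorCategory: "validation",
        isRetryable: false,
        message: `subagent ${name} returned malformed output`,
        context: { subagent: name },
        correlationId,
      };
    }
    return { ok: true, value };
  } catch (err) {
    // Without this catch, the exception crashes the coordinator's process.
    return {
      ok: false,
      errorCategory: "exception",
      isRetryable: true,            // assume transient unless classified otherwise
      message: err instanceof Error ? err.message : String(err),
      context: { subagent: name },
      correlationId,
    };
  }
}
```

The coordinator then dispatches on the returned payload rather than relying on exception propagation, which is exactly the behaviour context isolation does not give you for free.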
Trap 2: Treating Circuit Breaker as an SDK Built-In
The circuit breaker pattern is widely understood in distributed systems, so candidates assume the Agent SDK must provide one. It does not. Distractors such as "enable the SDK's subagent circuit breaker" or "set circuit-breaker threshold in settings.json" are fabrications. Circuit breaker is a design pattern you implement in your coordinator's subagent invocation wrapper. Answers that phrase it as an SDK toggle are incorrect.
Trap 3: Retrying Permanent Errors
Retry is for transient errors. A permission denied, a schema violation, or a resource-not-found will not become retryable by waiting and trying again. Distractors offer "retry three times with exponential backoff" as the answer for permission errors; this is both wasteful (consuming token and time budget on known-hopeless attempts) and actively dangerous (on a destructive operation, the second and third attempts could compound damage). Always gate retry on isRetryable: true.
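The retry gate described above (and the idempotency condition from Trap 5) can be expressed as a small predicate plus a backoff schedule. A minimal sketch under stated assumptions — `shouldRetry` and `backoffMs` are illustrative names, not SDK functions:

```typescript
// Hypothetical retry gate: retry only transient errors, only for idempotent
// work, and only within a fixed attempt budget.
interface ErrorPayload {
  errorCategory: string;
  isRetryable: boolean;
}

function shouldRetry(
  error: ErrorPayload,
  attempt: number,        // 1-based count of attempts already made
  maxAttempts: number,
  idempotent: boolean,    // property of the subagent's work, not of the error
): boolean {
  if (!error.isRetryable) return false;   // permission/schema/not-found: never retry
  if (!idempotent) return false;          // retrying could duplicate side effects
  return attempt < maxAttempts;
}

function backoffMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** (attempt - 1);     // 1s, 2s, 4s, ...
}
```

Note that the idempotency flag is supplied by whoever registered the subagent, not derived from the error: a transient network error on a payment subagent still returns false.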
Trap 4: Flattening the Error Chain
A coordinator that catches a subagent error and re-raises it as a plain string destroys the root-cause information needed for both recovery logic and debugging. Distractors recommend "convert the error to a descriptive message before re-throwing" — which sounds responsible but actually erases structure. Correct behaviour is to wrap the inner structured error in an outer structured error, preserving the full nesting. The user-facing message can be simplified; the logged payload must not be.
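Wrap-don't-flatten can be made concrete with a nested error structure. This is a sketch assuming a `cause` field for the inner payload and a shared `correlationId` across layers, as the FAQ below also describes; the type and helper names are illustrative.

```typescript
// Hypothetical nested error: each coordinator layer adds context while
// preserving the inner structured payload intact under `cause`.
interface StructuredError {
  message: string;
  errorCategory: string;
  correlationId: string;
  context: Record<string, unknown>;
  cause?: StructuredError;   // full inner payload, never a flattened string
}

function wrapError(
  inner: StructuredError,
  outer: { component: string; iteration?: number },
): StructuredError {
  return {
    message: `subagent ${outer.component} failed`,
    errorCategory: inner.errorCategory,   // classification propagates upward
    correlationId: inner.correlationId,   // same id at every layer for tracing
    context: outer,
    cause: inner,                         // root cause preserved for the logs
  };
}

// Walking the chain recovers the true root cause for recovery logic and debugging.
function rootCause(e: StructuredError): StructuredError {
  return e.cause ? rootCause(e.cause) : e;
}
```

A user-facing summary can still be derived from `rootCause(...)`, while the logged payload keeps the full chain.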
Trap 5: Retrying Non-Idempotent Subagents
A subagent that writes to a database, sends an email, or charges a credit card is not safely retryable. Retrying may duplicate the side effect. Distractors present "retry the transaction subagent on any transient error" as the correct answer; this is wrong. Non-idempotent subagents must fail-fast on any error and defer to human review or to a compensating action subagent. Idempotency is a property of the subagent's work, not of the error type.
Trap 6: Aborting on Every Local Error
Fail-fast is not always correct. When subagents are independent and partial output is valuable, aborting on the first local error wastes the work of the healthy subagents. Distractors frame "abort the research pipeline when any topic subagent fails" as the "safe" choice; in a multi-agent-research-system scenario that loses three successful sources to preserve consistency with one failed source, the fail-safe partial-result path is correct. Always inspect whether the error is local or cascading before deciding to abort.
Practice Anchors
Error-propagation concepts show up most heavily in two of the six CCA-F scenarios. Treat the following as the architecture spine for scenario-cluster questions that test task 5.3.
Multi-Agent-Research-System Scenario
In this scenario, a top-level research coordinator spawns a set of topic subagents (often four to eight), each of which researches one independent topic by calling web-search, document-read, and summarization tools. The subagents report back, and a synthesis subagent stitches their outputs into a final report. Error propagation questions here almost always fall into the fail-safe / partial-result quadrant: one or two topic subagents fail on transient network errors; the coordinator must decide whether to proceed. The correct answer almost always includes (1) structured error payloads with errorCategory and isRetryable, (2) coordinator-level retry with exponential backoff up to a budget, (3) partial-result handling with explicit annotation of missing sources in the final synthesis, and (4) a threshold below which the coordinator aborts entirely (for example, fewer than half the subagents succeed). Expect traps that offer "abort on first failure" or "retry indefinitely" as the wrong paths.
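The partial-result decision in this scenario reduces to a threshold check plus explicit gap annotation. A sketch under stated assumptions — `TopicResult`, `planSynthesis`, and the one-half default ratio are illustrative, matching the "fewer than half succeed" example above:

```typescript
// Hypothetical coordinator decision after all topic subagents finish:
// proceed with annotated partial results if enough succeeded, abort otherwise.
interface TopicResult {
  topic: string;
  ok: boolean;
  summary?: string;
  errorCategory?: string;
}

interface SynthesisPlan {
  proceed: boolean;
  summaries: string[];
  missingSources: string[];   // gaps explicitly annotated in the final report
}

function planSynthesis(results: TopicResult[], minSuccessRatio = 0.5): SynthesisPlan {
  const succeeded = results.filter((r) => r.ok);
  const failed = results.filter((r) => !r.ok);
  return {
    proceed: succeeded.length / results.length >= minSuccessRatio,
    summaries: succeeded.map((r) => r.summary ?? ""),
    missingSources: failed.map((r) => r.topic),
  };
}
```

The key exam-relevant detail is `missingSources`: the fail-safe path is only correct when the synthesis step honestly discloses what is missing rather than silently narrowing the report.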
Customer-Support-Resolution-Agent Scenario
In this scenario, a support coordinator routes an inbound ticket through a sequence of subagents: identity verification, account lookup, knowledge-base search, and response drafting. Unlike the research scenario, these subagents are not independent — downstream work depends on upstream results. Error propagation questions here test the fail-fast quadrant: if identity verification fails, every downstream subagent is working on invalid premises and must not proceed. The correct answer typically includes (1) cascading-error detection at the coordinator (identity failure is critical), (2) immediate fail-fast with escalation to a human agent rather than retry, (3) structured error reporting up to the caller so the human sees the root cause, and (4) a circuit breaker around the identity-verification subagent so a sustained outage of that service does not block the entire queue. Expect traps that conflate this scenario with the research scenario — partial-result handling is wrong here because partial support resolution on unverified identity is dangerous.
FAQ — Error Propagation Top 5 Questions
How do errors propagate from a subagent back to the coordinator?
They propagate only through whatever return value your code produces at the subagent invocation site. The coordinator does not automatically "see" subagent errors; it sees whatever your wrapper returns. The correct pattern is to wrap every subagent invocation in a try-catch, turn any caught exception into a structured error payload with errorCategory, isRetryable, message, context, and correlationId, and return that payload as the subagent's result. The coordinator then inspects the result and dispatches on errorCategory to decide whether to retry, skip, or abort. A subagent exception that is not caught at the invocation site crashes the coordinator's process regardless of context isolation — context isolation protects conversation history, not control flow.
Does subagent context isolation automatically prevent error cascades to the coordinator?
No. This is the single most tested nuance in the error-propagation surface. Subagent context isolation means the subagent runs with its own conversation history that does not bleed into the coordinator's — that prevents history corruption and helps with token management. It does not prevent process-level or control-flow cascades: an unhandled subagent exception still crashes the coordinator, and a malformed subagent result still corrupts the coordinator's synthesis. You need explicit try-catch around every subagent invocation plus output validation to achieve true error isolation. Community pass reports consistently flag answers that conflate context isolation with error isolation as a top distractor.
When should a coordinator retry a failed subagent versus proceed with partial results?
Retry when the error payload's isRetryable is true, the retry budget for this subagent has not been exhausted, and the subagent's work is idempotent (retrying cannot duplicate side effects). Use exponential backoff between retries — typically 1 / 2 / 4 seconds — to avoid immediately re-hitting the transient condition. Proceed with partial results when the error is permanent or the retry budget is exhausted, the subagent is local (other subagents' work is still valid), and the downstream synthesis step can honestly annotate the gap. Abort the whole run when the error is cascading (downstream work depends on this subagent) or when accumulated soft failures across the run exceed a pre-set threshold.
Is the circuit breaker pattern a built-in Agent SDK feature?
No. The circuit breaker is a design pattern you implement yourself, typically as a wrapper around subagent invocations or via a resilience library in your runtime. The Agent SDK provides subagent primitives with context isolation and structured results, but it does not ship a configurable circuit breaker — there is no circuitBreaker: { threshold: 5 } setting you can toggle. Circuit breakers track failure counts across runs (or over a time window), open when the threshold is crossed, wait a cooldown, and then probe with a single request before fully closing again. Answers that describe circuit breaker as an SDK configuration flag are fabrications and are wrong on the exam.
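Because this is a pattern you implement rather than a toggle, it helps to see the state machine in miniature. A minimal sketch, assuming a per-subagent breaker with an injectable clock for testability; the class and thresholds are illustrative, not SDK surface:

```typescript
// Hypothetical circuit breaker with closed / open / half-open states.
// You wire this into your own subagent invocation wrapper; it is not an
// Agent SDK setting.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold: number,     // consecutive failures before tripping
    private cooldownMs: number,    // wait before allowing a probe request
    private now: () => number = Date.now,
  ) {}

  canInvoke(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";    // cooldown elapsed: allow a single probe
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.state = "closed";         // probe succeeded: fully close again
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";         // trip: stop invoking this subagent
      this.openedAt = this.now();
    }
  }
}
```

Usage is a guard at the invocation site: check `canInvoke()` before spawning the subagent, then call `recordSuccess()` or `recordFailure()` on the structured result.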
How do I preserve root-cause error information across multiple coordinator layers?
Wrap, do not flatten. When an outer coordinator catches a structured error from an inner coordinator or subagent, it should add its own context (which inner component failed, at what iteration, with what input) while preserving the original structured payload as a cause field. The resulting error is a nested structure where the topmost wrapper reflects where the coordinator saw the failure and the innermost layer reflects the true root cause. Every layer shares the same correlationId so distributed tracing can stitch the chain back together. User-facing messages can summarize the root cause (for example, "web search tool timed out"), but the logged payload should retain the full nesting so engineers can debug without guessing. Answers that recommend "convert inner errors to a descriptive string before re-throwing" destroy the information you need and are wrong.
Further Reading
- Subagents in the Agent SDK — error propagation: https://docs.anthropic.com/en/docs/claude-code/sdk/subagents
- Handle tool calls — tool_result format and structured errors: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/handle-tool-calls
- Orchestrate teams of Claude Code sessions (Agent Teams): https://code.claude.com/docs/en/agent-teams
- Building effective agents — escalation and failure handling: https://docs.anthropic.com/en/docs/build-with-claude/agentic-loop
- Tool use with Claude — overview and agentic loop: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview
Related ExamHub topics: Multi-Agent Orchestration with Coordinator-Subagent Patterns, Structured Error Responses for MCP Tools, Escalation and Ambiguity Resolution, Validation, Retry, and Feedback Loops for Extraction.