
Multi-Instance and Multi-Pass Review Architectures

6,100 words · ≈ 31 min read

Task statement 4.6 of the Claude Certified Architect — Foundations (CCA-F) exam — "Design multi-instance and multi-pass review architectures" — sits inside Domain 4 (Prompt Engineering and Structured Output, 20 % weight) and is the pattern language a CCA-F candidate reaches for when a single Claude call is not reliable enough. The exam presents this task primarily through two of its six scenario clusters: Structured Data Extraction (where ensemble and judge-model patterns raise precision on high-stakes fields) and Multi-Agent Research System (where sequential specialized passes and reviewer instances guard synthesis quality). Community pass reports consistently identify 4.6 as a "small-weight, high-ambiguity" topic where distractor answers sound defensible unless you have memorized the cost-quality trade-offs and the correlated-error failure mode of same-model judging.

This study note walks through the full multi-instance and multi-pass surface a CCA-F candidate is expected to design: the definitional split between running N parallel Claude instances on the same input versus running N sequential specialized passes over the same artifact, the concrete use cases (high-stakes classification, legal review, complex code analysis), ensemble voting mechanics, the judge-model critique pattern, specialized pass decomposition (grammar, logic, compliance), pass sequencing by dependency, cost-quality economics against a single high-effort call, confidence aggregation across instances and passes, disagreement escalation thresholds, and temperature variation as a diversity lever. Common exam traps, CCA-F practice anchors, and a six-question FAQ close the note.

What Are Multi-Instance and Multi-Pass Review Architectures?

Multi-instance and multi-pass review architectures are two complementary families of patterns for raising the quality of a Claude output beyond what a single call can reliably produce. They are distinct patterns that candidates frequently conflate, and the exam tests whether you can pick the right one for a given scenario.

A multi-instance architecture runs the same input through several independent Claude instances in parallel and aggregates the outputs (typically by voting or judging). The axis of variation is Claude itself — different random seeds, different temperature settings, occasionally different models — applied to the same prompt and the same input.

A multi-pass architecture runs the same input through several sequential passes, each with a different prompt, role, or focus. The axis of variation is the review lens — a grammar pass, then a logic pass, then a compliance pass — applied to the same artifact.

The two families can be combined (a multi-pass pipeline whose critical pass is itself a multi-instance ensemble), but they solve different problems. Multi-instance reduces the variance of a single answer. Multi-pass decomposes a complex review into narrower, more tractable sub-reviews.

A multi-instance review architecture invokes N independent Claude calls on the same input and aggregates their outputs via voting, averaging, or a judge model. The goal is to reduce the variance of the final answer by averaging over the randomness of individual inferences. It does not reduce systematic (correlated) errors — if the underlying model has a blind spot, every instance shares it. Multi-instance is most valuable on high-stakes classification and extraction tasks where single-call variance is the dominant failure mode.

A multi-pass review architecture runs the same artifact through a sequence of specialized passes, each with a distinct review lens (grammar, logic, compliance, style, and so on). Passes are sequential because later passes typically depend on the output of earlier ones. A multi-pass pipeline decomposes a complex judgement into narrower sub-judgements, each of which fits comfortably inside a single high-quality Claude call. It is the complement of multi-instance: multi-pass widens the review surface; multi-instance deepens any single point in that surface.

The Two Families at a Glance

| Dimension | Multi-Instance | Multi-Pass |
| --- | --- | --- |
| Axis of variation | Claude randomness | Review lens |
| Execution | Parallel | Sequential |
| Aggregation | Vote, average, judge | Chain each output forward |
| Primary benefit | Variance reduction | Decomposition of complexity |
| Primary cost | N times single call | Latency sum of all passes |
| Exam scenarios | Structured Data Extraction | Multi-Agent Research System |

Multi-Instance Architecture — Running Multiple Claude Calls on the Same Input for Consensus

The simplest multi-instance architecture is the N-sample ensemble: the same system prompt, the same user input, and the same schema are issued to Claude N times concurrently. Each call returns an independent output; an aggregator combines the N outputs into a single final answer.

Why Repeat the Same Call N Times

Claude's outputs are not fully deterministic even at temperature 0, and at higher temperatures the variance grows meaningfully. On tasks where the same prompt occasionally yields an edge-case wrong answer — a misclassified category, a missed extraction field, a drifted sentiment label — running the call N times and taking the majority answer smooths over single-call noise. The underlying statistical claim is straightforward: if each call is independently correct with probability p > 0.5, the probability that the majority of N calls is correct rises rapidly with N.
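
The statistical claim can be sanity-checked with the binomial formula. A minimal Python sketch, assuming each call is independently correct with the same probability p (the independence assumption is exactly what correlated errors violate):

```python
from math import comb

def majority_correct_probability(p: float, n: int) -> float:
    """Probability that a strict majority of n independent calls is correct,
    given each call is independently correct with probability p."""
    need = n // 2 + 1  # smallest strict majority for odd n
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# An 85%-accurate single call becomes roughly 97% accurate under 5-way majority vote.
five_way = majority_correct_probability(0.85, 5)
```

Note that the lift evaporates as p approaches 0.5, which is why ensembling a far-below-threshold call is wasted budget.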

When Multi-Instance Is Worth the Cost

Multi-instance is valuable when all three of the following hold:

  • Single-call quality is close to the threshold but not reliably above it.
  • The per-unit cost of running N calls is acceptable (usually because the downstream decision is high-value).
  • The error modes are uncorrelated across calls (variance, not systematic blind spots).

When Multi-Instance Is the Wrong Tool

Multi-instance does not help when the underlying model has a systematic blind spot — if a prompt consistently misclassifies the same edge case, running it ten times just produces ten copies of the same wrong answer. It also does not help on tasks where a single high-effort call (more context, better examples, stricter tool schema) would already clear the quality bar; in those cases, investing in the single call beats ensembling weak calls.

Multi-Pass Architecture — Sequential Specialized Passes Over the Same Artifact

A multi-pass architecture reviews the same artifact (a legal document, a code diff, a support ticket draft) through several sequential passes, each with a narrower focus than a single all-in-one review prompt could achieve. The output of earlier passes feeds into later passes as structured context.

Why Specialize Passes

A single prompt that asks Claude to "review this pull request for security, performance, style, test coverage, and business logic" distributes attention thinly across all five lenses. The same five concerns, executed as five sequential passes, each get the full instruction budget and the full output budget of one Claude call. Community-reported exam language is explicit: decomposition of a complex judgement into narrower steps outperforms a single mega-prompt whenever the lenses would otherwise compete for attention.

The Three-Pass Canonical Shape

A common CCA-F-grade pipeline looks like this:

  1. Structure pass — Does the artifact have the right shape? (Valid JSON, expected sections present, required fields populated.)
  2. Content pass — Is the content correct given the business context? (Facts consistent with source, logic sound, claims supported.)
  3. Style / compliance pass — Does the artifact meet presentation and regulatory rules? (Tone, length, required disclosures, forbidden language.)

Each pass produces a structured report (pass/fail plus notes) that the next pass ingests. The final output is the combined verdict.
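
The three-pass shape can be sketched as a short pipeline. The pass functions below are toy stand-ins (in production each would be one Claude call); `run_pipeline`, the report shape, and the short-circuit rule are illustrative choices, not a prescribed API:

```python
from typing import Callable

# Each pass takes the artifact plus all earlier reports and returns a report.
PassFn = Callable[[str, list[dict]], dict]

def structure_pass(artifact: str, reports: list[dict]) -> dict:
    ok = artifact.strip().startswith("{")  # toy structural check: looks like JSON
    return {"pass": "structure", "ok": ok, "notes": [] if ok else ["not JSON-shaped"]}

def content_pass(artifact: str, reports: list[dict]) -> dict:
    return {"pass": "content", "ok": True, "notes": []}   # stand-in for a Claude call

def style_pass(artifact: str, reports: list[dict]) -> dict:
    return {"pass": "style", "ok": True, "notes": []}     # stand-in for a Claude call

def run_pipeline(artifact: str, passes: list[PassFn]) -> dict:
    reports: list[dict] = []
    for p in passes:
        report = p(artifact, reports)   # each pass sees all earlier reports
        reports.append(report)
        if not report["ok"]:            # short-circuit on hard failure
            break
    return {"verdict": all(r["ok"] for r in reports), "reports": reports}
```

An artifact that fails the structure pass never reaches the content pass, which is the sequencing discipline discussed later in this note.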

Multi-Pass Is Chain Prompting With Reviewers

Multi-pass review is a specific application of prompt chaining. The chain-prompts documentation describes the general pattern; multi-pass review narrows it to the case where each step is a reviewer examining the same artifact from a different angle, rather than a builder producing a new artifact.

Concrete Use Cases

CCA-F scenario questions tend to pick use cases where a single call is plausibly good enough and ask you to justify the extra cost of a review architecture. Three canonical families carry most of the weight.

High-Stakes Classification (Structured Data Extraction Scenario)

When a classification label drives downstream business action — fraud/not-fraud, escalate/auto-close, match/no-match for a regulatory filing — single-call variance translates directly into business risk. A five-instance ensemble with majority vote turns an 85 %-accurate single call into a far-more-accurate final answer, often at a cost that is still cheap compared to the downstream consequence of a wrong label.

Legal Document Review

Legal documents combine several independently complex review lenses — definitional consistency across clauses, cross-reference integrity, compliance with jurisdiction-specific rules, contract-term boundaries — that do not fit cleanly into one prompt. A multi-pass pipeline with one pass per lens, chained in dependency order, consistently outperforms a single "review this contract" call. The multi-pass result is also more auditable because each pass produces an independent trail.

Complex Code Analysis (Code Generation With Claude Code Scenario)

Code review requires correctness review, security review, style review, and test-coverage review. A multi-pass pipeline runs each lens in its own pass, with the structure pass (does the diff even compile?) gating the later semantic passes. For very high-stakes changes, a critical pass (for example, security) can itself be executed as a multi-instance ensemble, combining both families.

Other Production-Grade Candidates

  • Medical triage note review (multi-pass: symptom extraction, then risk scoring, then disposition).
  • Financial disclosure drafting (multi-pass: facts, then compliance, then tone; multi-instance on the compliance pass).
  • Academic peer review simulation (multi-instance on the same paper with temperature variation to surface diverse critiques).

Ensemble Voting — Majority Vote Across Independent Instances for Robustness

Ensemble voting is the archetypal multi-instance aggregator. Three voting strategies cover almost every exam case.

Plain Majority Vote

N instances each produce a categorical output. The aggregator picks the category with the most votes. Ties are broken by a deterministic rule (for example, lowest lexicographic label, or re-ask with a tiebreaker prompt). This is the right aggregator for single-label classification (spam/not-spam, category of 12 choices, severity level 1-5).
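
A minimal aggregator for this strategy, with the deterministic lexicographic tiebreak described above (the function name and shape are illustrative):

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Pick the most common label; break ties by lowest lexicographic label
    so the aggregator is fully deterministic."""
    counts = Counter(labels)
    top = max(counts.values())
    tied = sorted(label for label, c in counts.items() if c == top)
    return tied[0]
```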

Weighted Vote by Confidence

Each instance returns not only a label but a confidence score. Votes are weighted by confidence. A highly confident instance outweighs two low-confidence instances. This works when the model produces calibrated confidence — which it often does not by default, so always evaluate calibration before relying on weighted voting.

Field-Level Vote for Structured Extraction

For structured extraction, voting runs field by field. If five instances extract a JSON object with fields name, date_of_birth, diagnosis_code, the aggregator votes independently for each field. This is critical because a single instance may nail four fields and miss one; field-level voting preserves the correct four and replaces the fifth with the ensemble majority.
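
A field-level aggregator might look like the following sketch. The tie behavior of `Counter.most_common` depends on insertion order, so production code would add an explicit tie rule:

```python
from collections import Counter

def field_level_vote(objects: list[dict]) -> dict:
    """Aggregate N extracted objects field by field: each field keeps the
    value the most instances agree on, independently of other fields."""
    fields = {f for obj in objects for f in obj}
    result = {}
    for f in fields:
        values = [obj[f] for obj in objects if f in obj]
        result[f] = Counter(values).most_common(1)[0][0]
    return result
```

An instance that nails `name` but fumbles `diagnosis_code` still contributes its correct `name` vote, which is exactly what whole-object voting throws away.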

On CCA-F structured-data-extraction questions, the correct ensemble aggregator is almost always field-level vote rather than whole-object vote. Whole-object voting throws away correct fields whenever any field disagrees; field-level voting keeps every field that any majority supports. Answers that describe ensemble extraction but aggregate at the object level are subtly wrong even when the rest of the design is sound.

Why Odd N

Always pick an odd N (3, 5, 7) for majority-vote ensembles so that ties are impossible on binary questions and far less likely on small-category ones. Even N requires a tiebreaker, which either adds latency (re-ask) or introduces bias (always pick the first, always pick the most confident).

Judge-Model Pattern — Using a Second Claude Instance to Score or Critique First Output

The judge-model pattern is a two-step multi-instance architecture where the first instance produces a candidate output and a second instance — the judge — scores, critiques, or accepts/rejects the candidate. The judge instance is given a different, review-oriented system prompt.

Two Common Variants

  • Score-and-gate — The judge returns a numeric score plus a pass/fail; outputs below threshold are rejected and either regenerated, escalated, or passed to a human reviewer.
  • Critique-and-revise — The judge returns structured feedback (concrete issues, not just a grade), the original instance regenerates with that feedback, and the loop runs until the judge approves or a retry cap is hit.
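
The critique-and-revise loop can be sketched as below. `generate` and `judge` are hypothetical callables standing in for two Claude calls with different system prompts, and the verdict shape (`approved`, `issues`) is an assumption, not a fixed API:

```python
def critique_and_revise(generate, judge, task: str, max_retries: int = 3) -> dict:
    """Generate a candidate, let the judge critique it, and regenerate with
    the structured feedback until approval or the retry cap is hit."""
    feedback = None
    for attempt in range(max_retries + 1):
        candidate = generate(task, feedback)
        verdict = judge(candidate)          # e.g. {"approved": bool, "issues": [...]}
        if verdict["approved"]:
            return {"output": candidate, "attempts": attempt + 1, "approved": True}
        feedback = verdict["issues"]        # concrete issues, not just a grade
    return {"output": candidate, "attempts": max_retries + 1, "approved": False}
```

The retry cap matters: without it, a judge with a systematic blind spot can loop forever on an input it can never approve.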

When the Judge Adds Value

The judge adds value when the review task is easier than the generation task. Writing a coherent summary is hard; evaluating whether a summary faithfully represents the source is easier. Drafting a legally compliant clause is hard; checking whether a drafted clause hits seven required points is easier. This asymmetry is the core justification for separating generation from review.

The Correlated-Error Risk

Because the judge is typically the same Claude model as the candidate producer, any systematic blind spot the model has is shared by the judge. If Claude consistently misreads a certain idiomatic phrase, the judge will not flag it. A same-model judge reduces random error and stylistic error effectively, but it does not catch correlated errors. Community pass reports flag this as a recurring trap — answers that claim "the judge guarantees correctness" are wrong.

Judge-model architectures using the same model for both the producer and the judge reduce variance but do not catch correlated (systematic) errors. If the producer and the judge share the same blind spot, every judgement silently approves the same flawed output. CCA-F distractor choices frame the judge pattern as a correctness guarantee; the correct framing is that the judge is a variance-reduction tool, not a correctness oracle.

Judge Prompt Design

A judge prompt should be explicit about scoring criteria, should be concise enough that the judge's attention is fully spent on the review, and should demand structured output (schema or tool use) so the aggregator can act programmatically on the judge's verdict. Judges built around free-form prose are hard to integrate into a reliable pipeline.

Specialized Pass Design — Decomposing Review Into Grammar, Logic, Compliance Passes

Specialized pass design is the craft of splitting a broad review into narrow passes that can each be executed accurately by a single Claude call. The specifics depend on the domain, but the method is consistent.

Pick Orthogonal Lenses

Good passes are orthogonal — they examine different dimensions of the artifact so their findings do not overlap. Grammar, logic, and compliance are orthogonal: grammar is about form, logic is about meaning, compliance is about rules. Poorly chosen passes overlap (for example, "style" and "tone" will largely duplicate each other) and waste the per-pass call budget.

Keep Each Pass's Scope Narrow

Each pass should have a prompt under roughly a page of focused instructions. A pass whose instructions run to many pages is really several passes pretending to be one and should be split further.

Produce Structured Reports, Not Prose

Every pass should produce a structured report (tool use or JSON schema) so the pipeline can act on findings without free-text parsing. A grammar pass returns a list of {issue, location, severity}. A logic pass returns a list of {claim, status, evidence}. A compliance pass returns a list of {rule_id, satisfied, explanation}.
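
The report shapes named above might be typed as small dataclasses so the pipeline can consume findings without free-text parsing. Field names follow the text; the enum-like values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class GrammarIssue:
    issue: str
    location: str
    severity: str        # e.g. "low" | "medium" | "high" (illustrative scale)

@dataclass
class LogicFinding:
    claim: str
    status: str          # e.g. "supported" | "unsupported" | "contradicted"
    evidence: str

@dataclass
class ComplianceFinding:
    rule_id: str
    satisfied: bool
    explanation: str
```

Enforcing these shapes via strict tool use (rather than asking for prose) is what makes the downstream aggregation programmatic.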

Integrate Findings Back Into the Artifact

Specialized passes typically feed into a final revision pass that takes the artifact plus every pass's report and produces a revised version. This is distinct from the reviewer passes themselves and should be written as a separate call.

Pass Sequencing — Ordering Passes by Dependency (Structure → Content → Style)

Ordering passes correctly is as important as choosing them. The wrong order wastes budget on passes whose findings are invalidated by later passes.

Dependency-First Ordering

Run structural passes first (does the artifact have the right shape at all?), then content passes (is what it says correct?), then style or compliance passes (does it meet the rules?). Reversing this order — for example, running a compliance pass before a structure pass — risks the compliance pass reviewing an artifact that the structure pass then rejects outright, which discards the compliance work.

Short-Circuit on Hard Failures

If the structure pass returns "this is not a valid contract at all," do not run the content pass. Early exit on hard failures is the sequencing analogue of the retry-or-abort decision inside an agentic loop — the cheapest work is the work you skip because the preceding step already decided the answer.

Allow Passes to Feed One Another

Some passes need the output of previous passes as input. A "fact verification" pass needs the "claim extraction" pass to list the claims first. A "redline review" pass needs the "diff extraction" pass to list the changes first. Pipeline DAGs, not strict linear chains, cleanly express these dependencies.

Parallelize Orthogonal Passes When Safe

Passes that are genuinely independent can run in parallel. A grammar pass does not depend on a logic pass's output and vice versa; running them concurrently cuts wall-clock latency without changing the aggregated result.
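
A minimal sketch of safe parallelization, using stand-in pass functions in place of real Claude calls:

```python
from concurrent.futures import ThreadPoolExecutor

def grammar_pass(artifact: str) -> dict:
    return {"pass": "grammar", "ok": True}   # stand-in for one Claude call

def logic_pass(artifact: str) -> dict:
    return {"pass": "logic", "ok": True}     # stand-in for one Claude call

def run_orthogonal(artifact: str) -> list[dict]:
    # Grammar and logic do not consume each other's output, so they run
    # concurrently; a dependent pass (e.g. revision) would wait for both.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(p, artifact) for p in (grammar_pass, logic_pass)]
        return [f.result() for f in futures]
```

Wall-clock latency drops to the slower of the two passes instead of their sum, with no change to the aggregated result.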

Pipeline design for multi-pass review follows the same rule as database query planning: do the cheapest, most-selective work first. A structure pass that rejects a third of inputs for free is worth running before any expensive content pass. On CCA-F, the right answer almost always orders passes in dependency-and-selectivity-first order; answers that run every pass unconditionally or in arbitrary order are usually wrong.

Cost-Quality Tradeoffs — Multi-Instance vs Single High-Effort Call Economics

Multi-instance architectures multiply per-unit cost by N. Before reaching for an ensemble, every CCA-F architect should ask whether a single high-effort call — better context, better examples, stricter tool schema, a more capable model — would already clear the quality bar more cheaply.

The Three Budget Questions

  1. Quality headroom — Is a single call already close to the threshold, or far below it? Ensembling a far-below-threshold call still leaves you below threshold. Ensembling a close-to-threshold call pushes you over.
  2. Cost headroom — Is the downstream decision valuable enough to justify N times the call cost? A $0.001 classification used by a free newsletter cannot support a 5-instance ensemble; a $0.001 classification used to gate a $10,000 legal filing obviously can.
  3. Latency headroom — Can the user wait? Multi-instance is typically parallel (no latency hit beyond the slowest call) but multi-pass is sequential (latency sum of all passes). In latency-sensitive contexts, multi-pass is more expensive than its cost table suggests.
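
The first two budget questions reduce to an expected-value comparison. A toy calculator, where every input (per-call cost, error rates, decision value) is an assumption the architect supplies:

```python
def worth_ensembling(per_call: float, n: int, decision_value: float,
                     error_rate_single: float, error_rate_ensemble: float) -> bool:
    """Ensemble pays off when the expected loss it removes exceeds the
    extra call cost. All inputs are architect-supplied estimates."""
    extra_cost = per_call * n - per_call
    avoided_loss = (error_rate_single - error_rate_ensemble) * decision_value
    return avoided_loss > extra_cost
```

On the newsletter-vs-legal-filing contrast above, the same error-rate improvement flips the answer purely because the decision value changes.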

Batch Processing As A Cost Lever

For non-latency-sensitive multi-instance pipelines, the Message Batches API reduces per-call cost by roughly half (50 % discount in current pricing). A five-instance ensemble whose instances are all submitted as one batch job approaches the cost of a 2.5-instance synchronous ensemble. Batch is not appropriate when the pipeline sits inside a user-facing request, because batch results can take up to 24 hours.

Single High-Effort Call As A Counter-Design

A single call with richer context, more examples, a stricter JSON schema (strict tool use), and a more capable model (Opus-tier rather than Sonnet-tier) sometimes beats a five-instance ensemble on equivalent cost. The exam rewards architects who consider this counter-design before defaulting to N-sample ensembling.

CCA-F consistently tests whether candidates reach for multi-instance without first considering whether a single high-effort call (better examples, strict schema, larger model) would clear the quality bar more cheaply. The decision rule is: ensemble only when the error modes are random (not systematic), the downstream decision justifies N times the cost, and a better single call has been attempted or ruled out. Answers that ensemble by default on cost-sensitive pipelines are typically wrong.

Confidence Aggregation — Combining Confidence Signals Across Instances or Passes

An ensemble or multi-pass pipeline produces more than a final answer — it also produces a richer confidence signal than any single call could report.

Agreement Rate As Confidence

In a plain-majority ensemble, the fraction of instances voting for the winning answer is a usable proxy for confidence. Five-of-five agreement is strong confidence. Three-of-five is weak confidence and should trigger human review. This proxy is model-free — it works even when the model itself does not produce calibrated confidence scores.

Weighted Confidence From Per-Instance Scores

When each instance returns a confidence score, the final confidence can be computed as the weighted average of the winning votes. This is more informative than agreement rate alone but requires the model to produce calibrated scores.

Multi-Pass Confidence Is Conjunctive

In a multi-pass pipeline, every pass must approve for the artifact to pass. Overall confidence is the product (or minimum) of per-pass confidences, not the sum. If the compliance pass is 70 % confident and the structure pass is 99 % confident, the pipeline confidence is at most 70 %.
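
The conjunctive rule is a one-liner; the `product` variant additionally assumes the per-pass confidences are independent:

```python
import math

def pipeline_confidence(per_pass: list[float], rule: str = "min") -> float:
    """Combine per-pass confidences conjunctively: every pass must hold,
    so overall confidence cannot exceed the weakest pass."""
    if rule == "min":
        return min(per_pass)
    return math.prod(per_pass)  # "product" rule; assumes independent passes
```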

Calibration Matters More Than Magnitude

A confidence score is only useful if it is calibrated — a 0.8 confidence should correspond to 80 % empirical accuracy. Claude's raw confidence scores are often miscalibrated and need application-side calibration (for example, through a held-out validation set) before they drive routing decisions.

Disagreement Handling — Escalation When Instance Votes Diverge Significantly

Multi-instance pipelines are most valuable on their disagreement signal. When instances agree, the pipeline passes through smoothly; when instances disagree, the pipeline has surfaced a genuine uncertainty that deserves special handling.

Thresholds For Escalation

Typical production thresholds:

  • Unanimous agreement — Accept the answer automatically.
  • Supermajority (e.g., 4 of 5) — Accept but log for periodic audit.
  • Bare majority (e.g., 3 of 5) — Escalate to human review or to a higher-capability model.
  • No majority — Route to human review; do not auto-resolve.
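
The thresholds above can be sketched as a routing function. The cut-offs mirror the 5-instance defaults in the list and would be tuned per pipeline:

```python
from collections import Counter

def route_by_agreement(votes: list[str]) -> str:
    """Map an ensemble's agreement level to an escalation action."""
    n = len(votes)
    top = Counter(votes).most_common(1)[0][1]   # size of the largest voting bloc
    if top == n:
        return "auto_accept"                    # unanimous
    if top / n >= 0.8:
        return "accept_and_log"                 # supermajority, e.g. 4 of 5
    if top > n / 2:
        return "escalate_review"                # bare majority, e.g. 3 of 5
    return "human_review"                       # no majority: never auto-resolve
```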

Escalation Targets

  • Human reviewer — The canonical escalation for high-stakes work.
  • Larger model — Sometimes running the disputed input through Opus rather than Sonnet resolves the disagreement (a small cost relative to always running Opus).
  • Extended-thinking variant — Some disagreements resolve when the same input is run with an extended-thinking system prompt that encourages deeper deliberation.
  • Add-context retry — Supplying more context (source documents, canonical examples) often converts a three-of-five into a five-of-five.

Disagreement Logging Is Its Own Product

Every disagreement is a signal that the prompt, the examples, or the schema may need improvement. A production pipeline that archives its disagreement traces has a feedback loop for prompt improvement. This is why the Anthropic escalation guidance emphasizes logging disagreements with full context for periodic review.

On CCA-F scenario questions about multi-instance pipelines, the correct answer for "what happens on disagreement" is escalation (to a human, a larger model, or extended thinking), not silent majority-vote resolution. Silent resolution discards the most valuable signal the ensemble produces. Answers that ignore the disagreement path are usually wrong.

Temperature Variation — Using Different Temperature Settings Across Instances for Diversity

Running N instances all at temperature 0 gives you N nearly-identical answers, which defeats the purpose of ensembling. Temperature variation is the primary lever for producing diverse-enough outputs to make majority voting meaningful.

The Temperature-Diversity Curve

At temperature 0, Claude's outputs are as deterministic as possible — ensemble instances collapse to the same answer. At moderate temperature (around 0.5), outputs vary enough for voting to be informative while each individual output is still high quality. At very high temperature (above 1.0), outputs become erratic and quality falls faster than diversity rises.

Practical Temperature Schedules

  • All instances at 0.5 to 0.7 — Simple, effective, and the community-reported default.
  • Staggered temperatures (e.g., 0.3, 0.5, 0.7, 0.9, 1.1) — Useful for creative tasks where you want a wide range of stylistic choices.
  • Temperature 0 + seed variation — If supported; gives tightly controlled diversity. Not always available.
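
A staggered schedule is just a loop over temperatures. `call_model` here is a hypothetical wrapper around a single Claude API call, so the sketch stays API-agnostic:

```python
def run_staggered_ensemble(call_model, prompt: str,
                           temperatures=(0.3, 0.5, 0.7, 0.9, 1.1)) -> list[str]:
    """Run one instance per temperature; only the temperature varies, so any
    output diversity is attributable to the sampling setting."""
    return [call_model(prompt, temperature=t) for t in temperatures]
```

The outputs then feed whatever aggregator the pipeline uses (majority vote for labels, field-level vote for extractions).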

Temperature Is Not A Quality Lever

Raising temperature does not make Claude smarter; it makes Claude's outputs more varied. The role of temperature in multi-instance architectures is purely to produce enough diversity that voting is informative. Low temperature with no seed variation defeats ensembling; high temperature with no quality guard tanks ensembling.

Combining Temperature With Tool Use

When extraction is enforced via strict tool use, the schema is fixed but the values inside the schema still vary with temperature. A five-instance extraction ensemble at temperature 0.5 still produces five valid JSON objects; voting then runs field-by-field over those five objects.

Combining Multi-Instance and Multi-Pass

The two families are complementary and frequently combined. The critical pass in a multi-pass pipeline is often executed as a multi-instance ensemble while the non-critical passes stay as single calls.

Example Pipeline

A contract-review pipeline:

  1. Structure pass (single call, cheap).
  2. Content pass (single high-effort call).
  3. Compliance pass (five-instance ensemble with field-level vote, because compliance errors are high-stakes and the ensemble protects against single-call variance).
  4. Revision pass (single call, consumes all previous pass reports).

This hybrid spends N times the call cost only where it matters most.

Why Not Always Ensemble Every Pass

Because multi-instance cost scales linearly with N, ensembling every pass in a multi-pass pipeline drives the total call count to the sum of per-pass instance counts. A three-pass pipeline with five instances per pass costs fifteen call budgets per artifact. Architects ensemble only the passes where single-call variance is the dominant risk.

Observability For Review Architectures

Observability for review pipelines is a superset of agentic-loop observability. Every instance's output, every pass's output, every vote result, and every disagreement trace must be archivable.

What to Log

  • Per-instance output (full JSON or text).
  • Per-instance confidence score (if produced).
  • Aggregated vote result per field.
  • Per-pass report (structured findings).
  • Pass dependency trace (which passes ran, in what order, with what inputs).
  • Disagreement traces (instance IDs, temperatures, divergent outputs) with enough context to replay.
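
A disagreement trace only earns its keep if it can be replayed. A minimal serialization sketch; the record fields are illustrative, chosen to cover the items in the list above:

```python
import json
import time

def disagreement_trace(input_id: str, instances: list[dict]) -> str:
    """Serialize a replayable disagreement trace: one entry per instance,
    carrying the temperature and output needed to reproduce the run."""
    record = {
        "input_id": input_id,
        "logged_at": time.time(),
        "instances": instances,  # e.g. {"instance_id", "temperature", "output"}
    }
    return json.dumps(record)
```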

Why Logging Is a First-Class Concern

Review pipelines are often audited. Legal-review pipelines must be able to explain why a clause was approved. Medical-triage pipelines must be able to explain why a note was routed to disposition X. The pipeline's output is not the final verdict alone — it is the final verdict plus the reasoning trail that justifies it.

Plain-English Explanation

Three very different analogies cover the full sweep of multi-instance and multi-pass review architectures.

Analogy 1: The Panel of Doctors — Multi-Instance Ensemble For a High-Stakes Diagnosis

Imagine a patient with an ambiguous diagnosis. A single doctor's reading of the scan might be 85 % accurate — usually right, occasionally wrong. For a high-stakes case, the hospital convenes a panel of five independent doctors who each review the same scan without consulting each other. Four of the five say "benign"; one says "malignant." The panel now has both a majority answer and a clear disagreement signal. If all five had agreed, the case would be closed; because one disagreed, the panel escalates to a biopsy. This is a multi-instance architecture in miniature. The scan is the input. Each doctor is a Claude instance. The variance across doctors is diversity — if every doctor trained in the same program, their errors are correlated and the panel provides less lift (the same-model judge problem). The tiebreak-or-escalate rule is disagreement handling. The biopsy is the escalation target. Multi-instance review is, in almost every respect, a panel of independent reviewers looking at the same artifact.

Analogy 2: The Manuscript Editorial Process — Multi-Pass Sequential Specialized Review

Imagine a book manuscript going through a publisher's editorial process. A developmental editor reads first and asks "is this the right book at all — does it cohere, does the structure work, are the stakes clear?" If the manuscript clears the developmental edit, a copy editor reads next and asks "are the facts right, are claims cited, is the logic sound?" If the manuscript clears copy edit, a line editor reads next for style and voice. If it clears line edit, a proofreader reads last for typos and formatting. Each editor has a narrow, specialized job. None of them tries to be every kind of editor at once — that is why each of them is good at their lens. This is a multi-pass architecture in miniature. The manuscript is the artifact. Each editor is a specialized Claude pass. Passing to a later editor before earlier editors have signed off is wasteful — a proofreader correcting typos on a manuscript that will be structurally rewritten is lost work, which is why pass sequencing matters. Short-circuiting on hard failure is the developmental editor rejecting a manuscript that cannot be saved; there is no point running the line editor on it. Multi-pass review is, very nearly, the book-publishing editorial stack.

Analogy 3: The Airline Cockpit Checklist — Why Both Families Together Beat Either Alone

Airline safety does not rely on a single pilot reading a single checklist. Before takeoff, a captain and a first officer run through a sequence of checklists — weather, fuel, weight-and-balance, systems — with each pilot independently reading and confirming. The sequential checklists are a multi-pass architecture: each checklist is a specialized pass covering a different lens, run in dependency order (fuel before weight-and-balance before engine start). The dual-pilot cross-check on each checklist is a two-instance multi-instance architecture: when both pilots read the fuel quantity independently and agree, the system has higher confidence than a single-pilot read. Where they disagree, they escalate (look again together, consult ground ops). Aviation safety is the canonical real-world example of combining multi-pass sequencing with multi-instance cross-checking, and the economic logic (both families cost more than a single pilot reading a single checklist, and both are worth it on high-consequence tasks) transfers cleanly to high-stakes Claude pipelines.

Which Analogy Fits Which Exam Question

  • Questions about multi-instance ensembles, voting, and the judge pattern → panel-of-doctors analogy.
  • Questions about multi-pass pipelines, specialized passes, and sequencing → manuscript editorial analogy.
  • Questions about combining both families and cost-quality trade-offs → airline-cockpit analogy.

Common Exam Traps

CCA-F Domain 4 consistently exploits five recurring trap patterns around multi-instance and multi-pass review. All five appear disguised as plausible distractor choices in community pass reports.

Trap 1: Treating Multi-Instance As A Correctness Guarantee

Multi-instance reduces variance; it does not reduce correlated error. If the underlying Claude model has a systematic blind spot on a given input, every instance in the ensemble shares that blind spot and every instance returns the same wrong answer. Distractor answers frame ensembles as "guaranteed correct" or "immune to model errors"; both framings are wrong. The correct framing is that ensembles trade N times the cost for lower variance on random errors.
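The variance-versus-correlated-error distinction can be made concrete with a toy simulation. This is a hedged sketch: the 20 % random-error rate and the `call_model` stub are illustrative assumptions, not real Claude behavior. Majority voting lifts accuracy when instance errors are independent, and does nothing when every instance shares the same blind spot.

```python
import random

random.seed(0)

def call_model(truth, random_error=0.2, systematic_wrong=False):
    # One simulated instance; a systematic blind spot is shared by all instances.
    if systematic_wrong:
        return not truth
    return truth if random.random() > random_error else not truth

def majority_vote(answers):
    return sum(answers) > len(answers) / 2

def accuracy(n_instances, systematic_wrong, trials=10_000):
    hits = 0
    for _ in range(trials):
        votes = [call_model(True, systematic_wrong=systematic_wrong)
                 for _ in range(n_instances)]
        hits += majority_vote(votes)   # True counts as 1
    return hits / trials

# Independent random errors: voting raises accuracy (~0.80 -> ~0.94 at N = 5).
print(accuracy(1, False), accuracy(5, False))
# Correlated errors: every instance is wrong together, voting cannot help.
print(accuracy(5, True))   # 0.0
```

The second print is the whole trap in one number: five instances, N times the cost, and exactly zero quality gain on the systematic case.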

Trap 2: Same-Model Judge Claiming Independence

Judge-model pipelines where both the producer and the judge are the same model have correlated-error risk. If Claude misreads a legal phrase consistently, a Claude-judge misses the same error. The exam rewards candidates who recognize this limitation and reach for a different-model judge (where available), a human-in-the-loop judge for high-stakes work, or an explicit "judge reviews N candidate outputs" ensembling shape.

Trap 3: Multi-Pass Latency Underestimated

Because multi-pass runs sequentially, latency adds up linearly. A four-pass pipeline at two seconds per pass is eight seconds before any batching or user-facing rendering. Distractor answers that use multi-pass for low-latency chat workflows are usually wrong. The exam rewards candidates who route latency-sensitive work to single calls (possibly with strict tool schemas) and route latency-insensitive work to multi-pass pipelines.

Trap 4: Whole-Object Vote For Structured Extraction

When five instances each return a JSON object with ten fields, whole-object voting throws away every correct field in any object that disagrees on any single field. Field-level voting preserves the maximum information. Distractor answers that describe ensemble extraction as "pick the majority JSON object" are subtly wrong. The correct aggregator is field-by-field voting.
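A minimal field-level aggregator sketch (the invoice field names and values are made up for illustration): each field gets its own majority vote, and fields with no strict majority are flagged rather than silently resolved.

```python
from collections import Counter

def field_level_vote(objects):
    # Vote each field independently so a correct field in one object is not
    # discarded just because a different field in that object disagreed.
    result, disagreements = {}, {}
    for field in objects[0]:
        counts = Counter(obj.get(field) for obj in objects)
        value, votes = counts.most_common(1)[0]
        result[field] = value
        if votes <= len(objects) / 2:          # no strict majority: flag, don't guess
            disagreements[field] = dict(counts)
    return result, disagreements

candidates = [
    {"invoice_id": "A-117", "total": 420.00},
    {"invoice_id": "A-117", "total": 420.00},
    {"invoice_id": "A-117", "total": 402.00},  # one instance misread the total
]
merged, flagged = field_level_vote(candidates)
print(merged)   # {'invoice_id': 'A-117', 'total': 420.0}
print(flagged)  # {}  (the misread total loses 2 votes to 1)
```

Whole-object voting on the same three candidates would have had to discard the third object entirely; field-level voting keeps its correct `invoice_id` in play.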

Trap 5: Ensembling On Cost-Sensitive Pipelines Without Trying A Better Single Call First

Multi-instance multiplies cost by N. If a single call at temperature 0 with a stricter schema or more examples would already clear the quality bar, the ensemble is wasted cost. The exam rewards architects who justify ensembling by showing that simpler interventions were tried or ruled out. Distractor answers that jump to ensembling without first exhausting single-call improvements are usually wrong.

Five 4.6 traps, one sentence each:

  • Multi-instance does not fix correlated errors — only variance.
  • Same-model judges share the producer's blind spots.
  • Multi-pass latency is sequential and adds up linearly.
  • Field-level voting beats whole-object voting for structured extraction.
  • Try a better single call before reaching for N-sample ensembles.

Distractor cue: any answer that treats multi-instance as a correctness guarantee, or a Claude-judge as an independent oracle, is wrong.

Practice Anchors

Multi-instance and multi-pass concepts appear most heavily in two of the six CCA-F scenarios.

Structured Data Extraction Scenario

In this scenario, Claude extracts structured fields (entities, dates, amounts, classification labels) from unstructured text — invoices, claims forms, legal filings, clinical notes. A single call's variance is often the dominant error source, especially on edge-case inputs. Expect questions that test whether you correctly specify a multi-instance ensemble with temperature variation, whether you apply field-level voting rather than whole-object voting, whether you recognize the correlated-error limitation of same-model judging, and whether you route disagreements to human review rather than silently resolving by majority. Strict tool use almost always appears in the same pipeline to guarantee that each instance returns a schema-valid object.

Multi-Agent Research System Scenario

In this scenario, a coordinator agent dispatches subagents to research sub-questions and synthesizes their findings into a final report. Multi-pass review shows up at the synthesis stage: a structure pass (does the report have all required sections?), a fact pass (are the claims supported by the subagent findings?), and a coherence pass (does the synthesis flow well?) run in sequence before the report is returned to the user. Critical passes (for example, fact-checking) may be multi-instance ensembles. Expect questions that test your ability to order passes by dependency, short-circuit on structure-pass failure, ensemble only the critical passes, and surface disagreements back through the coordinator's escalation logic.
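The synthesis-stage review above can be sketched as a short-circuiting pipeline. The pass functions below are stand-ins for narrow Claude review calls, and the required section names are assumptions for illustration.

```python
def run_review_pipeline(artifact, passes):
    # Run specialized passes in dependency order; short-circuit on the first
    # hard failure so later passes never review a doomed artifact.
    findings = []
    for name, review in passes:
        ok, notes = review(artifact)
        findings.append((name, ok, notes))
        if not ok:
            return False, findings
    return True, findings

# Stand-in pass functions; in a real pipeline each would be a Claude call.
def structure_pass(report):
    missing = {"summary", "findings", "sources"} - set(report)
    return (not missing, f"missing sections: {sorted(missing)}" if missing else "ok")

def fact_pass(report):
    return (bool(report.get("sources")), "claims must trace to subagent findings")

passes = [("structure", structure_pass), ("fact", fact_pass)]
ok, trace = run_review_pipeline({"summary": "..."}, passes)
print(ok, [name for name, *_ in trace])   # False ['structure']  (fact pass never ran)
```

The trace of (name, ok, notes) tuples is what the coordinator's escalation logic would consume when a pass fails.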

FAQ — Multi-Instance and Multi-Pass Review: Top 5 Questions

What is the difference between multi-instance and multi-pass review architectures?

Multi-instance runs N parallel Claude calls on the same input with the same prompt and aggregates outputs via voting, averaging, or a judge. Its axis of variation is Claude's own randomness (temperature variation across instances). Multi-pass runs N sequential Claude calls on the same artifact with different prompts, each pass applying a narrower review lens (structure, content, compliance). Its axis of variation is the review angle. Multi-instance reduces variance on one judgement; multi-pass decomposes a broad judgement into narrower sub-judgements. The two families are complementary — a multi-pass pipeline can ensemble its critical passes — but they solve different problems and are often confused on the exam.

When should I use a judge model rather than plain majority voting?

Use a judge model when the review task is genuinely easier than the generation task — for example, checking whether a drafted summary covers seven required points is easier than writing the summary. Use plain majority voting when the task is fundamentally a classification (spam/not-spam, category of twelve choices) and any single instance has reasonable accuracy. The judge adds per-call cost (the judge call is itself a Claude call) and shares the producer's correlated errors when both use the same model, so it is not a universal replacement for voting. Reach for a different-model judge or human-in-the-loop judge when correlated-error risk is high.
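One way the judge-over-candidates shape can look in code, as a hedged sketch: `toy_judge` and the 0.8 acceptance bar are illustrative assumptions standing in for a Claude judge call that scores drafts against an explicit rubric.

```python
def judge_over_candidates(candidates, judge, min_score=0.8):
    # Score every producer output, pick the best, escalate if none clears the bar.
    scored = sorted(((judge(c), c) for c in candidates), reverse=True)
    best_score, best = scored[0]
    if best_score < min_score:
        return None, scored        # no candidate is good enough: escalate
    return best, scored

# Toy rubric: fraction of required points a drafted summary mentions.
REQUIRED = {"scope", "method", "results"}
def toy_judge(summary):
    return len(REQUIRED & set(summary.split())) / len(REQUIRED)

drafts = ["scope method results caveats", "scope only", "method results"]
best, scored = judge_over_candidates(drafts, toy_judge)
print(best)   # scope method results caveats  (covers all three required points)
```

Note the shape matches the FAQ's point: checking coverage of required points is easier than writing the summary, which is exactly when a judge earns its extra call.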

How do I decide how many instances (N) to run in a multi-instance ensemble?

Pick odd N (3, 5, 7) so head-to-head ties are impossible on binary questions; on multi-category questions odd N can still produce no majority, which simply routes the case to escalation. Start with N = 3 for moderate-confidence tasks and scale to N = 5 or N = 7 when a single call is far below threshold. Above N = 7, the marginal quality gain drops sharply while cost continues to scale linearly. If the downstream decision is extremely high-stakes (medical, legal, regulatory), consider a hybrid where N = 5 instances feed a human reviewer on disagreement, rather than pushing N higher.

Should I run multi-pass passes in parallel or in sequence?

In sequence whenever later passes depend on earlier passes' output — the structure pass must precede the content pass because content review is wasted on structurally invalid artifacts. In parallel when passes are genuinely independent — a grammar pass and a logic pass on the same document do not depend on each other and can run concurrently to cut wall-clock latency. Model your pipeline as a DAG rather than a strict chain; the DAG captures both the mandatory sequential dependencies and the safe parallel opportunities.
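The parallel branch of the DAG can be sketched with a thread pool. `grammar_pass` and `logic_pass` are stand-ins for independent Claude calls, with `time.sleep` standing in for per-call latency.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def grammar_pass(doc):
    time.sleep(0.2)                    # stands in for one Claude call's latency
    return "grammar: ok"

def logic_pass(doc):
    time.sleep(0.2)
    return "logic: ok"

doc = "draft text"
start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    # Independent lenses run concurrently: wall-clock is ~one pass, not two.
    results = list(pool.map(lambda p: p(doc), [grammar_pass, logic_pass]))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.1f}s")      # ['grammar: ok', 'logic: ok'] ~0.2s
```

Running the same two passes in a plain loop would take ~0.4s; the mandatory sequential edges of the DAG (structure before content) still have to be awaited in order.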

How do I handle disagreements inside a multi-instance ensemble?

Define escalation thresholds by agreement rate: unanimous agreement accepts automatically, supermajority accepts with audit logging, bare majority escalates to human review or a larger model, and no-majority always escalates. Do not silently resolve disagreements by taking the plurality — that discards the most valuable signal the ensemble produces. Log every disagreement trace (all instance outputs, temperatures, and confidence scores) so the prompt or schema can be improved over time. CCA-F scenario answers that omit the escalation path when instances disagree are typically wrong even when the rest of the aggregation logic is correct.
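A minimal sketch of the threshold table above. The two-thirds supermajority cutoff is an illustrative assumption; the exam text does not fix exact percentages.

```python
from collections import Counter

def escalation_action(votes, supermajority=2 / 3):
    # Map the ensemble's agreement rate to an action; never silently resolve
    # a split vote by plurality.
    top_value, top_votes = Counter(votes).most_common(1)[0]
    rate = top_votes / len(votes)
    if rate == 1.0:
        return "accept", top_value                 # unanimous: auto-accept
    if rate >= supermajority:
        return "accept_with_audit", top_value      # supermajority: accept + log
    if rate > 0.5:
        return "escalate_human", top_value         # bare majority: human decides
    return "escalate_always", None                 # no majority: always escalate

print(escalation_action(["spam"] * 5))                            # ('accept', 'spam')
print(escalation_action(["spam", "spam", "spam", "ham", "ham"]))  # ('escalate_human', 'spam')
```

In production the returned action would also carry the full disagreement trace (all instance outputs, temperatures, confidence scores) for the audit log described above.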

Further Reading

Related ExamHub topics: Validation, Retry, and Feedback Loops for Extraction Quality, Explicit Criteria Prompt Design, Few-Shot Prompting for Output Consistency, Human Review Workflows and Confidence Calibration.

Official sources