Iterative refinement is the craft of turning a merely acceptable first-pass Claude output into a production-quality artifact through successive review and revision cycles. Task statement 3.5 of the Claude Certified Architect — Foundations (CCA-F) exam — "Apply iterative refinement techniques for progressive improvement" — sits inside Domain 3 (Claude Code Configuration & Workflows, 20% weight) and is tested heavily inside the code-generation-with-claude-code and structured-data-extraction scenarios. It is one of the quieter task areas where candidates who "know Claude" but have never architected a refinement loop lose points to candidates who have.
This study note walks through the full surface a CCA-F candidate is expected to design at the architecture level: the draft-review-revise cycle, why iteration beats one-shot prompting, the anatomy of a refinement loop, self-critique prompting, criterion-driven review passes, diff-based edits versus full rewrites, convergence detection, human-in-loop refinement, the decision between refining an existing output and regenerating from scratch, iteration budgets, and automated refinement in CI/CD where failing tests are the refinement signal. A closing traps section, practice-anchor cluster, and six-question FAQ tie every abstraction back to the exam scenarios that exercise refinement workflows.
Iterative Refinement Concept — Draft, Review, Revise Cycle Applied to Code and Content
Iterative refinement is a workflow pattern in which an initial Claude output (the draft) is treated as a starting point rather than a final deliverable, and is improved through one or more explicit review-and-revise passes. Each pass applies a specific critique lens — correctness, style, coverage, structure — and produces a new revision that incorporates the feedback. The workflow is the programmatic equivalent of a writer who never ships their first draft.
The canonical three-phase cycle every CCA-F candidate should internalize is:
- Draft — Claude produces an initial output against the task prompt. Quality is "good enough to critique," not "good enough to ship."
- Review — Claude (or a human, or an automated check) evaluates the draft against explicit criteria and produces structured feedback: what is wrong, what is missing, what is weak.
- Revise — Claude rewrites or edits the draft to address the feedback, yielding an improved revision. The cycle may repeat until a convergence criterion is met.
Iterative refinement applies identically to code (draft function → review against requirements → revise) and to content (draft document → review against style guide → revise). The exam treats both as instances of the same pattern.
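The three-phase cycle can be sketched as a small driver loop. This is a structural sketch under stated assumptions, not an SDK call: generate, review, and revise are stand-in callables (in practice each would be a Claude turn), and the gap list is whatever structured feedback the review step returns.

```python
from typing import Callable, List

def refine(task: str,
           generate: Callable[[str], str],
           review: Callable[[str, str], List[str]],
           revise: Callable[[str, str, List[str]], str],
           max_rounds: int = 3) -> str:
    """Draft-review-revise: improve one artifact until no gaps remain."""
    draft = generate(task)                 # Draft: good enough to critique
    for _ in range(max_rounds):            # hard iteration budget
        gaps = review(task, draft)         # Review: explicit gap list
        if not gaps:                       # convergence: empty gap list
            break
        draft = revise(task, draft, gaps)  # Revise: address every gap
    return draft
```

Note that the loop always re-reviews the latest revision, never the original draft, and that the iteration budget caps the cycle even if the reviewer never returns an empty gap list.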
Iterative refinement is a multi-pass workflow in which Claude produces an initial draft, evaluates that draft against explicit quality criteria, and rewrites it to address identified gaps, repeating until convergence or an iteration budget is exhausted. The pattern applies to code generation, structured extraction, and long-form content. Unlike a single-turn prompt, refinement accumulates improvement across turns rather than relying on the first forward pass to be correct. Source ↗
How Refinement Differs from Chaining and Agentic Loops
Iterative refinement is structurally adjacent to two other CCA-F patterns but is not the same.
- Prompt chaining (task 1.4, 1.6) decomposes a task into sequential sub-tasks, each producing a different artifact (plan, then draft, then summary). Chaining moves forward through a pipeline.
- Agentic loops (task 1.1) iterate on tool calls with Claude proposing actions and your system returning observations. Loops move forward by gathering new information from the world.
- Iterative refinement (task 3.5) iterates on the same artifact — the same function, the same extraction, the same document — progressively improving it. The artifact stays; only its quality changes across turns.
Confusing these three patterns is a common Domain 3 trap. If the system is producing new artifacts, it is chaining. If the system is calling tools to gather new facts, it is a loop. If the system is re-editing the same artifact against quality criteria, it is refinement.
Why Iterate? First-Pass Outputs and Compound Improvement Through Feedback
The intuition for why iteration works is rooted in two observations about large language models. First, a single forward pass can only express the model's best guess given the prompt at hand — it cannot consult a review of itself because that review does not yet exist. Second, when a prior draft is present in the context window, Claude can apply a different cognitive lens (evaluator rather than generator) to surface defects that were invisible during generation.
The Compound Improvement Argument
Let a single-pass output have quality q (normalized 0 to 1). A review pass will typically surface a fraction d of the remaining defect surface (say d = 0.5), yielding a revision of quality q + d·(1 − q). Passes compound: after n rounds the remaining defect mass is (1 − d)^n · (1 − q), so quality converges toward 1 asymptotically. The practical implication is not a clean mathematical proof but an engineering rule of thumb: each pass captures roughly half of what the previous pass missed, so two-to-three passes usually outperform a single pass by a wide margin, and further passes exhibit diminishing returns.
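The compounding argument is easy to check numerically, assuming each pass fixes a fixed fraction d of the remaining defect surface:

```python
def quality_after(q: float, d: float, n: int) -> float:
    """Quality after n review passes, if each pass fixes a fraction d
    of the remaining defect surface: 1 - (1 - d)**n * (1 - q)."""
    return 1 - (1 - d) ** n * (1 - q)

# Starting at q = 0.6 with d = 0.5:
# pass 1 -> 0.80, pass 2 -> 0.90, pass 3 -> 0.95
# Two-to-three passes capture most of the available uplift;
# each further pass halves an already-small remainder.
```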
Why Not Just Write a Better First Prompt?
Experienced candidates sometimes argue that a sufficiently well-crafted prompt makes refinement unnecessary. The argument has partial merit — tighter prompts do raise q — but ignores a structural fact: generation and evaluation use different cognitive operations. A prompt cannot fully pre-specify the review lens because the review depends on the actual draft. An output that confidently asserts a plausible-but-wrong fact is hard to prevent in generation but trivial to catch in review. Refinement is not a substitute for good prompting; it is a different kind of work.
When Refinement Is Not Worth It
Refinement is not free. Each pass costs tokens, latency, and potentially human review time. Skip refinement when:
- The task is routine and single-pass quality is consistently above the required threshold.
- The output is cheap to regenerate entirely if a defect is found downstream.
- Latency is the binding constraint and "good enough now" beats "better in 10 seconds."
- Downstream consumers (other agents, structured schemas) will catch defects anyway.
CCA-F scenario questions frequently test the decision to refine versus to ship. The exam rewards architects who recognize that refinement has a cost and who choose it only when the quality uplift justifies the latency and token spend. Answers that add a refinement pass to every workflow are over-engineering and lose points. Source ↗
Refinement Loop Anatomy — Output, Evaluation Criteria, Gap Identification, Targeted Edit
A refinement loop has four architectural parts. Missing any of them produces a loop that either never improves or cannot terminate.
1. The Current Output
The artifact under refinement. On turn 1, this is the initial draft produced from the task prompt. On turn n, it is the output of turn n-1's revise step. The refinement loop always operates on the most recent revision, not the original draft.
2. The Evaluation Criteria
An explicit list of dimensions against which the output is reviewed. Examples for code: correctness against requirements, test coverage, adherence to style guide, absence of TODO placeholders, handling of edge cases. Examples for extraction: schema conformance, coverage of source document, confidence on each field, flagging of ambiguous cases. Criteria must be explicit — a review pass without named criteria degrades into generic "is this good?" which Claude can only answer with generic affirmations.
3. The Gap Identification
A structured enumeration of what the current output fails to satisfy. The gap list should be as concrete as possible — "the function does not handle an empty input list" is useful, "error handling could be better" is not. When Claude produces gap lists, prompt it to output them as a numbered list with one gap per entry, each tied to a specific criterion.
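One low-tech way to make such a gap list machine-usable is to prompt for a strict numbered format and parse it. The "1. [criterion] description" convention below is an illustrative assumption, not anything the exam or the API mandates:

```python
import re

# Illustrative convention: "1. [criterion] description"
GAP_LINE = re.compile(r"^\s*\d+\.\s*\[([^\]]+)\]\s*(.+)$")

def parse_gap_list(text: str) -> list:
    """Turn a numbered gap list into (criterion, description) pairs."""
    gaps = []
    for line in text.splitlines():
        m = GAP_LINE.match(line)
        if m:
            gaps.append((m.group(1), m.group(2).strip()))
    return gaps
```

Lines that do not match the convention are simply skipped, which makes the parser tolerant of preamble text the reviewer turn may emit around the list.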
4. The Targeted Edit
The revise step that incorporates the gap list. Two execution strategies apply: a full rewrite (Claude reproduces the entire artifact with corrections applied) or a diff-based edit (Claude produces only the changes, which your system applies to the current output). Diff-based edits are almost always preferable for long artifacts because they reduce token spend, localize change, and preserve correct portions of the draft.
A refinement loop is composed of four structural parts: (1) the current output under review, (2) the explicit evaluation criteria the output is scored against, (3) the gap list that enumerates where the output falls short of the criteria, and (4) the targeted edit that produces the next revision. Each part has a corresponding failure mode: an implicit criterion list produces vague reviews, an unstructured gap list produces scattershot edits, and a full-rewrite edit on a long artifact wastes tokens and can regress correct portions. Source ↗
Self-Critique Pattern — Prompting Claude to Identify Weaknesses in Its Own Output
The self-critique pattern uses Claude as its own reviewer. After producing a draft, Claude is re-prompted to enumerate the draft's weaknesses against a criterion list, and the output of that critique is fed into a revise pass.
Why Self-Critique Works
A draft-production prompt biases Claude toward producing a coherent, confident artifact. That bias makes the generator a poor critic — it tends to rationalize rather than interrogate. Reprompting Claude with a clearly different role ("you are now a strict code reviewer; list every defect you can find") activates a different evaluative stance. Claude is capable of identifying defects in its own prior output when the new prompt makes critique the primary task.
A Three-Turn Self-Critique Template
- Turn 1 (Generate): "Write a Python function that …."
- Turn 2 (Critique): "Act as a strict code reviewer. Here is the function from turn 1 and the original requirement. List, in order of severity, every defect you can identify: correctness bugs, missing edge cases, style violations, unclear names. Be specific; reference line numbers."
- Turn 3 (Revise): "Using the critique above, revise the function. Fix every defect listed. Do not introduce new behaviour that the original requirement did not specify."
This three-turn shape is the minimum self-critique loop. The same shape extends to four, five, or more turns, but diminishing returns kick in fast.
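The two non-generate turns can be captured as reusable prompt templates. The wording below paraphrases the template above; the exact phrasing is an assumption, not prescribed text:

```python
CRITIQUE_TEMPLATE = (
    "Act as a strict code reviewer. Here is the function and the original "
    "requirement.\n\nRequirement:\n{requirement}\n\nFunction:\n{draft}\n\n"
    "List, in order of severity, every defect you can identify: correctness "
    "bugs, missing edge cases, style violations, unclear names. "
    "Be specific; reference line numbers."
)

REVISE_TEMPLATE = (
    "Using the critique below, revise the function. Fix every defect listed. "
    "Do not introduce new behaviour that the original requirement did not "
    "specify.\n\nCritique:\n{critique}\n\nFunction:\n{draft}"
)

def critique_prompt(requirement: str, draft: str) -> str:
    # Turn 2: put Claude into the evaluator role with named criteria.
    return CRITIQUE_TEMPLATE.format(requirement=requirement, draft=draft)

def revise_prompt(critique: str, draft: str) -> str:
    # Turn 3: feed the gap list back for a targeted revision.
    return REVISE_TEMPLATE.format(critique=critique, draft=draft)
```

Templating the critique turn is what makes self-critique explicit rather than assumed — the evaluator role and the criteria are in the prompt, not left to chance.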
Self-Critique Requires Explicit Instruction
Self-critique does not happen automatically. Without an explicit critique prompt, Claude will not spontaneously interrogate its own prior output — it will treat the prior output as context and move forward. Candidates who assume Claude "naturally iterates" design workflows that never actually refine anything. The exam exploits this assumption in distractor answers.
Iterative refinement does not occur automatically by simply prolonging a conversation or by raising the temperature. Self-critique requires an explicit prompt that puts Claude into an evaluator role with named criteria. Answers that describe refinement as "Claude will automatically improve its output over multiple turns" are wrong. Similarly, increasing temperature adds randomness, not refinement; it often degrades quality rather than improving it. Source ↗
Criterion-Driven Refinement — Specifying Explicit Quality Dimensions for the Review Pass
A review pass without explicit criteria is a review pass that produces vague feedback. Criterion-driven refinement pre-commits the dimensions against which the output is evaluated, ensuring that the review pass produces structured, actionable gap lists.
Writing Criteria That Produce Useful Reviews
Good criteria share three properties. They are specific — "the function handles empty input, null input, and input larger than 10 000 elements" rather than "the function is robust." They are testable — a reviewer can answer yes or no for each criterion without further interpretation. They are independent — each criterion addresses a different axis of quality, so a single defect does not fire multiple criteria at once.
Criterion Sources for Common CCA-F Scenarios
- Code-generation-with-claude-code: requirement coverage, test coverage (if tests exist), style guide adherence, error handling, type annotations, docstring presence, absence of TODOs and placeholders, security anti-patterns (hard-coded secrets, SQL injection), performance concerns relevant to the use case.
- Structured-data-extraction: schema conformance, presence of every required field, confidence tagging on uncertain extractions, handling of ambiguous source passages, explicit null/unknown markers rather than fabricated values.
- Multi-agent-research-system: citation coverage, source diversity, claim-to-source traceability, absence of unsupported assertions, readability for the declared audience.
Weighted Criteria
Not all criteria are equal. A correctness bug outweighs a style violation. A schema-nonconformant extraction outweighs a field missing its confidence tag. Prompt the reviewer to report criterion violations with severity (critical, major, minor) and prompt the revise step to address criticals first. A loop that treats all gaps equally spends turns on cosmetic issues while correctness bugs persist.
Diff-Based Refinement — Applying Minimal Changes Rather than Full Rewrites
When the artifact under refinement is long (a 300-line file, a 2000-word document, a 50-field extraction), asking Claude to produce the full revised artifact on every pass wastes tokens and creates risk. The diff-based refinement pattern constrains Claude to output only the changes.
Why Full Rewrites Are Risky on Long Artifacts
A full-rewrite revise pass can regress correct portions of the draft. Claude is not aware that line 147 of a 300-line file was already correct; if it is asked to reproduce the entire file, it may rephrase or restructure line 147 and introduce a regression. Diff-based edits avoid this failure mode by leaving unchanged portions literally untouched.
Diff Formats Claude Can Produce
- Unified diff (patch format): classic diff -u output with line numbers and context. Machine-applicable. Ideal when the target is a file and a patch-applying step is available.
- Natural-language edit instructions: "In the function authenticate, replace the line if user.token: with if user.token and not user.token.expired:; in logout, add session.clear() after line 42." Requires a follow-up application step (another Claude turn, or a human, or the Edit built-in tool).
- Structured edit blocks: a JSON or XML structure listing each edit with path, line_range, old_content, new_content. Easiest for programmatic application.
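A minimal applier for the structured-edit-block shape, assuming line_range is an inclusive 1-based pair; the path and old_content fields from the full schema are omitted here for brevity:

```python
from typing import TypedDict

class Edit(TypedDict):
    line_range: tuple   # (start, end), inclusive, 1-based
    new_content: str    # replacement text for that range

def apply_edits(text: str, edits: list) -> str:
    """Apply structured edit blocks bottom-up so earlier line
    numbers stay valid while later ranges are replaced."""
    lines = text.splitlines()
    for edit in sorted(edits, key=lambda e: e["line_range"][0], reverse=True):
        start, end = edit["line_range"]
        lines[start - 1:end] = edit["new_content"].splitlines()
    return "\n".join(lines)
```

A production applier would also verify old_content against the current lines before replacing, so a stale or hallucinated edit fails loudly instead of corrupting the artifact.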
Claude Code's built-in Edit tool consumes natural-language or structured edit specifications directly; when refinement happens inside Claude Code, the diff-based pattern is the native shape.
When Full Rewrite Is Acceptable
Short artifacts (sub-500-token drafts), artifacts that require deep structural rework that cannot be expressed as a few edits, or artifacts where Claude has already identified that "most of the draft is wrong" — all legitimately call for a regenerate rather than a diff. The decision is context-dependent, and the exam tests your ability to pick correctly.
For long artifacts, prefer diff-based refinement: ask Claude to produce only the changes, not the full revised output. This reduces token spend, preserves correct portions of the draft, and localizes the change for review. Full-rewrite refinement is appropriate only for short artifacts or when a structural rework means most of the draft needs replacement. Source ↗
Convergence Detection — Knowing When Further Iteration Yields Diminishing Returns
A refinement loop that never terminates burns through budgets just as surely as an agentic loop with no iteration cap. Convergence detection is the mechanism by which the loop decides it has extracted as much improvement as it is going to extract and should stop.
Four Convergence Signals
- Empty gap list — The reviewer pass reports no criterion violations. This is the cleanest termination signal and the one most refinement loops should target first.
- Trivial gap list — The reviewer returns only minor-severity gaps below a threshold. Appropriate when perfection is not required and time matters.
- Repeated gap list — Two consecutive review passes produce the same gap list, indicating the revise step is not making progress. Treat as a hard stop and escalate (human review, regenerate, or accept).
- Diminishing delta — The semantic distance between two consecutive revisions falls below a threshold. Harder to measure in practice but useful when the artifact is structured (JSON, code) and a diff size can serve as a proxy.
Why Convergence Detection Matters
Without a termination test, your loop either runs until the iteration budget expires (wasteful if convergence actually happened earlier) or runs until the reviewer happens to return an empty gap list (unreliable; Claude sometimes returns empty gap lists prematurely and sometimes never does). An explicit convergence test makes termination deterministic.
Composite Termination
Production refinement loops typically combine convergence signals with a hard iteration cap. The loop stops on whichever fires first: the reviewer says "no more gaps," the gap list stops changing across two passes, or the iteration budget hits its limit. This belt-and-braces approach avoids both premature termination and runaway spend.
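The belt-and-braces composite test can be expressed directly. Signal names and the returned reason strings are illustrative:

```python
def should_stop(gaps: list,
                previous_gaps,
                rounds_used: int,
                max_rounds: int) -> tuple:
    """Composite termination: stop on whichever signal fires first."""
    if rounds_used >= max_rounds:
        return True, "iteration budget exhausted"
    if not gaps:
        return True, "empty gap list (converged)"
    if previous_gaps is not None and gaps == previous_gaps:
        return True, "repeated gap list (no progress; escalate)"
    return False, "continue"
```

Checking the budget first guarantees termination even on pathological inputs where the gap list keeps changing forever; the reason string tells the caller whether to ship, escalate, or regenerate.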
Refinement convergence, four signals:
- Empty gap list — reviewer finds no violations; terminate.
- Trivial gap list — only minor gaps remain; terminate if acceptable.
- Repeated gap list — two passes produce the same gaps; terminate and escalate.
- Iteration cap — hard maximum turns reached; terminate regardless.
Every production refinement loop must use at least one convergence signal plus a hard iteration cap. Source ↗
Human-in-Loop Refinement — Incorporating Reviewer Feedback Between Iterations
Not every refinement pass needs to be automated. Human-in-loop refinement inserts a human reviewer between turns, substituting human judgment for Claude's self-critique when the task requires it.
When Human Review Beats Self-Critique
- Novel or ambiguous requirements where "correct" is not fully specified in the prompt and Claude cannot test against an absent standard.
- Legal, medical, or safety-critical outputs where the cost of a missed defect exceeds the cost of human time.
- Subjective quality dimensions — tone, brand voice, aesthetic judgment — where human preference is the ground truth.
- Adversarial settings where Claude may be predictably biased toward certain answers and needs external correction.
The Architecture
The loop shape is identical to self-critique but replaces the Claude-as-reviewer turn with a human-as-reviewer step. The human produces the gap list (often in free text, sometimes through a structured form), and Claude consumes it in the revise turn. Plan mode in Claude Code is a native instance of this pattern: Claude produces a plan, the human approves or revises it, and Claude then executes or re-plans.
Cost and Latency Trade-Offs
Human-in-loop refinement can multiply latency from seconds to hours depending on reviewer availability. For high-throughput workflows this is a fatal cost; for high-stakes workflows it is a bargain. The architecture should pick human review only when warranted by stakes, and automate the rest.
Plan mode is the canonical Claude Code instance of human-in-loop refinement. Claude proposes a plan; the human reviews, edits, or approves; Claude executes or replans. The exam tests whether candidates know that plan mode is not a universal safety mechanism — it adds latency and breaks non-interactive pipelines — and is appropriate only when task risk justifies the human turn. Applying plan mode to a trivial task is over-engineering. Source ↗
Refinement vs Regeneration — When to Refine Existing Output vs Start Fresh
Refinement is not always the right move. Sometimes the correct response to a flawed draft is to throw it out and regenerate from scratch with a better prompt.
Signs the Draft Is Refinable
- Most of the draft satisfies the criteria; the gaps are localized and identifiable.
- The failures are in specific places (a wrong branch, a missing edge case, a typo) rather than in the overall approach.
- The prompt and draft together still fit comfortably inside the context budget.
- The reviewer can articulate the gaps as concrete edits.
Signs the Draft Should Be Regenerated
- The overall approach is wrong. Refinement that tweaks a fundamentally flawed approach produces a polished wrong answer.
- The gap list is dominated by "rewrite this section" or "this entire function is the wrong shape."
- The draft has accumulated multiple refinement passes without convergence. Sunk-cost bias often keeps architects refining when regeneration would finish faster.
- The prompt itself was the problem. A better prompt might produce a far better draft in one pass than five refinement turns could produce on the broken draft.
The Regenerate-With-Lessons Pattern
A middle path: discard the draft but retain the gap list produced by the review pass. Use the gap list to strengthen the prompt for the regenerate pass. This preserves the information extracted from the failed attempt while avoiding the sunk-cost trap of editing a broken draft.
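The regenerate-with-lessons pattern reduces to folding the gap list back into the prompt. The exact wording of the appended section is an illustrative assumption:

```python
def strengthen_prompt(original_prompt: str, gaps: list) -> str:
    """Discard the draft, keep the lessons: regenerate from a prompt
    that pre-empts every defect the review passes surfaced."""
    if not gaps:
        return original_prompt
    lessons = "\n".join(f"- {gap}" for gap in gaps)
    return (f"{original_prompt}\n\n"
            f"A previous attempt failed in these ways; avoid them:\n{lessons}")
```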
When the Exam Tests This Choice
CCA-F scenario questions sometimes describe a scenario in which an architect has been refining an output for many turns without improvement and ask what to do next. The correct answer is usually "regenerate with a stronger prompt informed by what the refinement passes surfaced." Distractors will offer "add another refinement pass" or "increase the temperature" — both wrong.
Iteration Budget — Setting a Max Rounds Cap to Prevent Infinite Refinement Loops
Just as agentic loops need iteration caps, so do refinement loops. Without a cap, a loop whose convergence test never fires will run until token budgets or API budgets run out.
Typical Iteration Budgets
- Short code generation: 2 to 3 passes. Each pass typically captures half of the remaining defect surface; after three passes, diminishing returns are heavy.
- Long document refinement: 3 to 5 passes, often split by criterion group (first pass covers correctness, second pass covers style, third pass covers flow).
- Structured extraction validation: 1 to 2 re-ask passes on validation failure, then escalate. Extraction errors that survive two retry passes are usually unresolvable by further refinement.
Why Budgets Are Separate from Convergence
Convergence tests stop the loop when improvement is happening and plateaus. Budgets stop the loop when improvement refuses to happen at all. Both are needed: convergence alone misses pathological non-terminating loops; budget alone wastes turns on loops that have already converged.
Business-Level Budgets
In production, iteration budgets should be tied to business cost envelopes, not engineering defaults. A content-generation pipeline that refines 10 000 documents a day cannot afford five passes each. A legal contract refinement loop that produces one output a week can afford fifteen. The architect picks budgets from the deployment economics.
Distractor answers often propose unbounded refinement loops that terminate "when Claude decides the output is good." This is insufficient. Claude's judgment of "good enough" is not reliable across arbitrary iterations — it may declare convergence prematurely or never declare it at all. Every production refinement loop needs an explicit iteration budget in addition to any convergence tests. Answers that omit the budget are marked incorrect even when the rest of the design is sound. Source ↗
Refinement in CI/CD — Automated Test Failure as the Refinement Signal
The CI/CD scenario on CCA-F turns the refinement pattern inside out: instead of a human or a Claude reviewer producing the gap list, an automated test suite produces it. Failing tests become the structured critique that drives the revise pass.
The Pattern
- Claude Code produces a code change (the draft).
- A test runner executes and produces a pass/fail report with named failures and diagnostic output.
- If all tests pass, the loop terminates. If tests fail, the failure output is fed back to Claude as a structured critique.
- Claude produces a revised change that addresses the failing tests.
- The test runner runs again. The loop continues until all tests pass or the iteration budget is hit.
This is architecturally identical to the self-critique pattern with the reviewer replaced by npm test or pytest. The advantages are significant: tests are objective, fast, and produce the same gap list across runs. The disadvantages are that tests only cover what they test — a change that passes tests but introduces a regression outside the test coverage will converge prematurely.
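In this variant the test runner's report plays the reviewer's role. A sketch that turns failure summary lines into a gap list — the "FAILED path::test - message" line shape matches pytest's short summary, but treat the parsing as an assumption about your runner's output format:

```python
def failures_to_gaps(test_output: str) -> list:
    """Convert 'FAILED path::test_name - message' summary lines
    into gap-list entries for the revise pass."""
    gaps = []
    for line in test_output.splitlines():
        if line.startswith("FAILED "):
            gaps.append(line[len("FAILED "):].strip())
    return gaps
```

An empty return value is the loop's convergence signal (all tests pass); a non-empty list is fed back to Claude as the structured critique for the next revision.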
Non-Interactive Mode Is Required
CI/CD runs Claude Code non-interactively, which means the -p flag (non-interactive / print mode) must be used. Plan mode and interactive prompts are incompatible with automated pipelines — the pipeline has no human to respond to approval prompts. Candidates who propose a CI/CD refinement loop without the -p flag produce unrunnable designs.
Permission Scoping in CI
CI environments need tightly scoped tool permissions. A refinement loop running in CI should only have access to the tools it needs (Read, Edit, Bash with a restricted command allowlist) and should never have network or filesystem access beyond the workspace. This is a general CI/CD security principle that intersects with refinement architecture.
CCA-F consistently tests the CI/CD refinement pattern in the claude-code-continuous-integration scenario cluster. Expect questions that test: (1) use of the -p flag for non-interactive mode; (2) the use of failing tests as the refinement signal; (3) iteration budgets to prevent runaway CI jobs; and (4) tool permission scoping. Answers that omit -p are almost always incorrect, regardless of the rest of the design. Source ↗
Plain-English Explanation
Abstract refinement mechanics become intuitive when anchored to physical processes most candidates already know. Three very different analogies cover the full sweep of iterative refinement.
Analogy 1: The Editor's Red Pen — Criterion-Driven Review
Picture a newspaper newsroom. A reporter files a draft of an article (Claude's draft). The copy editor picks up a red pen and a printed style guide (the explicit evaluation criteria) and marks every violation: weak leads, passive voice, unnamed sources, fact-check needed (the gap list). The reporter rewrites against the markups (the revise pass) and files a new draft. The copy editor picks up the pen again. By the second or third pass, the red marks are sparse — the article has converged. By the fifth pass, the copy editor has to invent things to mark (diminishing returns). A newsroom without a style guide produces copy editors who write "this could be better" in the margins, which tells the reporter nothing. The explicit criterion list is what converts a vague "make it better" into a loop that actually terminates. This is exactly the shape of criterion-driven refinement: the gap list is the red pen, the criterion list is the style guide, and the revise pass is the reporter rewriting against the markups.
Analogy 2: The Pottery Wheel — Progressive Shaping and Convergence
A potter at a wheel never produces the final vase in a single motion. They start with a rough lump (the draft — structurally right-ish but crude), then shape it through successive passes: pull up the walls, narrow the neck, refine the lip, smooth the surface. Each pass removes a little more material and brings the vase closer to the finished shape. The potter knows by touch when further shaping will do more harm than good — the clay has "converged," and another pass would make the walls too thin or spoil the symmetry (diminishing returns). If the initial lump is centred wrong, no amount of shaping saves it; the potter throws it back and restarts (regeneration). The potter respects the iteration budget because the clay dries as they work — there is a hard limit on how long the session can last. This is exactly the shape of a refinement loop with convergence detection and iteration budgets: the draft is the rough lump, each pass is a shaping step, convergence is the tactile "this is ready" moment, and the drying clay is the iteration budget that forces termination whether or not the potter is satisfied.
Analogy 3: The Workshop Jig — Repeated Measurement Against a Template
A cabinet maker building a drawer front checks their work against a marking jig after every cut. The jig encodes the spec — the drawer must be square, the corners at exactly 90 degrees, the face flush. The cabinet maker cuts, holds the piece against the jig, spots a gap (the gap list), trims the problem corner (the targeted edit), and checks again. When the piece sits flush in the jig with no gaps visible, it has converged. If the cabinet maker keeps trimming after the piece already fits, they overshoot — the drawer is now too small (degradation from over-refinement). The jig is not optional; without it, the cabinet maker is guessing at every cut, and the drawer either never fits or never stops being trimmed. CCA-F answers that describe refinement without explicit criteria are the cabinet maker without the jig: they know they are iterating but have no way to know when they are done. The evaluation criteria are the jig, the gap list is the measured discrepancy, and convergence is the moment the piece seats flush without wobble.
Which Analogy Fits Which Exam Question
- Questions about criterion-driven review → editor's red pen analogy.
- Questions about convergence and diminishing returns → pottery wheel analogy.
- Questions about explicit evaluation criteria being mandatory → workshop jig analogy.
Common Exam Traps
CCA-F Domain 3 consistently exploits six recurring trap patterns around iterative refinement. All six show up in community pass reports and appear disguised as plausible distractor choices.
Trap 1: Confusing Refinement with Temperature Increase
Some distractors propose "increase the temperature to produce a more varied and refined output." This is wrong. Temperature controls sampling randomness, not output quality — a higher temperature produces more diverse first-pass drafts, not better refined ones. In fact, high temperature often degrades refinement because each pass produces a slightly different draft, making the gap list harder to track across turns. Iterative refinement requires explicit re-prompting for review and revision, not parameter tuning.
Trap 2: Assuming Self-Critique Is Automatic
A common trap frames refinement as "Claude will naturally critique its own output as the conversation continues." This is false. Without an explicit prompt that puts Claude into an evaluator role with named criteria, Claude treats prior outputs as context and keeps moving forward — it does not spontaneously interrogate itself. Self-critique requires a dedicated turn with a critique prompt.
Trap 3: No Iteration Budget on the Refinement Loop
Answers that describe a refinement loop terminating only on "when Claude says the output is good enough" are incomplete. Claude's judgment about its own convergence is not reliable; refinement loops without iteration budgets can run indefinitely on pathological inputs. Every production refinement loop needs a hard iteration cap in addition to any convergence tests.
Trap 4: Full-Rewrite Refinement on a Long Artifact
When the artifact is a 300-line file or a 2000-word document, full-rewrite refinement wastes tokens and risks regressing correct portions. Diff-based or targeted-edit refinement is the correct shape. Distractors that propose "re-generate the full file on every refinement pass" are wrong for long artifacts.
Trap 5: Applying Refinement to CI/CD Without the -p Flag
CI/CD is non-interactive. A refinement loop driven by failing tests runs in an automated pipeline with no human to respond to prompts. The -p (print / non-interactive) flag is mandatory. Answers that describe a CI/CD refinement pipeline without the -p flag are architecturally broken — the pipeline will hang or error waiting for interactive input.
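A pipeline step might assemble the invocation like this. This is a sketch; the prompt text and the exact tool-scoping value are illustrative assumptions, so check the current Claude Code CLI docs for option names before relying on them.

```python
def build_ci_command(prompt: str) -> list[str]:
    """Assemble a non-interactive Claude Code invocation for a CI step."""
    return [
        "claude",
        "-p", prompt,          # -p: print/non-interactive mode — mandatory in CI
        "--allowedTools",      # narrow permission scope (illustrative value)
        "Read,Edit,Bash(npm test)",
    ]

cmd = build_ci_command("Fix the failing tests reported below.")
```

Without `-p`, the same invocation starts an interactive session and the CI job hangs waiting for input that will never arrive.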
Trap 6: Plan Mode as a Universal Safety Mechanism
Plan mode is a specific instance of human-in-loop refinement and is appropriate for risky or ambiguous tasks. It is not a universal default. Applying plan mode to every task adds latency and breaks non-interactive pipelines. The exam penalizes designs that use plan mode indiscriminately.
Practice Anchors
Iterative refinement concepts show up most heavily in two of the six CCA-F scenarios. Treat the following as the architectural spine for scenario-cluster questions in this task area.
Code-Generation-With-Claude-Code Scenario
In this scenario, Claude Code produces a code change against a requirement. The refinement workflow is: Claude generates the draft, a review pass (self-critique or test-driven) surfaces defects, and Claude revises. Expect questions that test whether you correctly apply diff-based edits on long files, whether you use failing tests as the refinement signal when tests exist, whether you include an iteration budget, and whether you distinguish refinement (editing the same file) from chaining (producing a separate test file then the implementation). The self-critique template — generate, critique with named criteria, revise — is the default answer shape when the scenario has no automated tests to drive the loop.
Structured-Data-Extraction Scenario
In this scenario, Claude extracts fields from an unstructured document into a structured schema. The refinement workflow is: Claude produces an initial extraction, a validation step (schema check, plus review against the source) identifies missing, wrong, or low-confidence fields, and Claude revises. Expect questions that test whether you use a structured critique (field-by-field gap list) rather than a free-text review, whether you keep the source document in context for the revise pass, whether you cap retries at one or two passes before escalating, and whether you distinguish refinement from validation-retry loops (task 4.4 — see related topics). A common trap: refinement loops on extraction that refuse to terminate when the source genuinely lacks a field. Correct answer: mark the field as unknown/null rather than fabricating and continuing to refine.
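The field-by-field gap list can be sketched as a schema walk. The schema fields and the convention that `None` means "confirmed absent in the source" are assumptions for illustration; the structural point is that an explicit null terminates the loop instead of triggering another refinement pass.

```python
SCHEMA = ["invoice_id", "total", "due_date"]  # hypothetical target schema

def gap_list(extraction: dict) -> list[str]:
    """Structured critique: one gap entry per schema field still unresolved."""
    gaps = []
    for field in SCHEMA:
        if field not in extraction:
            gaps.append(f"{field}: missing")
        # field == None means "confirmed absent in source" — not a gap,
        # so the loop does not keep refining toward a fabricated value
    return gaps

first_pass = {"invoice_id": "INV-42", "total": 119.0}
# due_date unresolved → one gap; a revise pass that finds no date in the
# source should set it to None, which empties the gap list.
```

This is exactly the termination trap from the scenario: a free-text critique keeps saying "due_date still missing" forever, while the structured version distinguishes "not yet extracted" from "not in the source".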
Claude-Code-For-Continuous-Integration Scenario
Though primary weight lives in task 3.6, this scenario intersects refinement directly: CI pipelines that run Claude Code on every pull request typically implement an automated refinement loop where failing tests are the gap list. Expect questions that test the -p flag, iteration budgets, permission scoping, and correct handling of test failures that cannot be fixed by Claude (infrastructure errors, environment differences) — these should be surfaced rather than refined around.
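The CI loop's control flow can be sketched as follows. `run_tests` and `ask_claude_to_fix` are hypothetical stubs for the pipeline's test runner and the non-interactive Claude Code call; the infrastructure-error markers are illustrative examples of failures refinement cannot fix.

```python
MAX_PASSES = 3
INFRA_MARKERS = ("ECONNREFUSED", "No space left on device")  # illustrative

def is_infra_failure(failure: str) -> bool:
    """Failures refinement cannot fix: surface them, don't burn iterations."""
    return any(marker in failure for marker in INFRA_MARKERS)

def ci_refine(run_tests, ask_claude_to_fix) -> str:
    for _ in range(MAX_PASSES):            # tight budget against runaway CI jobs
        failures = run_tests()
        if not failures:
            return "passed"
        if any(is_infra_failure(f) for f in failures):
            return "escalate"              # human-intervention signal
        ask_claude_to_fix(failures)        # failing tests are the gap list
    return "budget-exhausted"
```

The `escalate` branch encodes the exam point: an environment failure surfaced on pass one is worth more than three passes of Claude trying to refine around it.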
FAQ — Iterative Refinement Top 6 Questions
What exactly is iterative refinement in the Claude context?
Iterative refinement is a workflow pattern in which Claude produces an initial output (the draft), evaluates it against explicit criteria to produce a gap list, and revises the output to address the gaps — repeating the review-revise cycle until a convergence signal fires or an iteration budget is exhausted. The pattern applies to code, structured extraction, and long-form content. It is architecturally distinct from agentic loops (which iterate on tool calls to gather new information) and prompt chaining (which produces separate artifacts across sequential steps). Iterative refinement keeps operating on the same artifact; only its quality changes across turns.
How is iterative refinement different from just increasing temperature or running the same prompt again?
Temperature controls sampling randomness; it does not produce refinement. Higher temperature yields more diverse first-pass outputs, not better-refined ones. Running the same prompt twice without a critique turn between them produces two independent drafts, not a refined one. True iterative refinement requires an explicit review step — a dedicated prompt that puts Claude into an evaluator role with named criteria, followed by a revise prompt that consumes the gap list. Without these two additional turns, you have not refined anything; you have generated two drafts.
When should I refine an existing output versus regenerate from scratch?
Refine when the draft is largely correct with localized defects (a missing edge case, a wrong branch, a typo) and the overall approach is sound. Regenerate when the approach itself is wrong, when the gap list is dominated by "rewrite this section" entries, when multiple refinement passes have failed to converge, or when the prompt itself can be strengthened with lessons from the failed draft. A useful middle path: discard the draft but keep the gap list, and use it to enrich the regenerate prompt. Sunk-cost bias frequently pushes architects to keep refining a fundamentally broken draft when regenerating with a better prompt would finish faster.
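The decision can be expressed as a heuristic over the gap list. The thresholds below are illustrative assumptions, not exam-mandated numbers; the shape to remember is that both a rewrite-dominated gap list and an exhausted pass count route to regeneration.

```python
def next_action(gaps: list[str], passes_so_far: int, max_passes: int = 3) -> str:
    """Illustrative refine-vs-regenerate heuristic over a critique gap list."""
    if not gaps:
        return "ship"
    if passes_so_far >= max_passes:
        return "regenerate"   # failed to converge; enrich the prompt with the gap list
    rewrites = sum("rewrite" in g.lower() for g in gaps)
    if rewrites > len(gaps) / 2:
        return "regenerate"   # gap list dominated by structural rewrites
    return "refine"           # localized defects: keep the draft
```

Carrying the gap list into the regenerate prompt is the middle path from the answer above: the draft is discarded, but its diagnosis is not.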
How many refinement passes are typically enough?
For short code generation, two to three passes is typical — each pass captures roughly half the remaining defect surface, so the third pass usually hits heavy diminishing returns. For long document refinement, three to five passes split by criterion group (correctness pass, then style pass, then flow pass) works well. For structured extraction validation, one to two retry passes before escalating — extraction errors that survive two retries are usually symptoms of missing source information rather than bugs refinement can fix. Every production refinement loop should combine an explicit convergence test (empty or trivial gap list) with a hard iteration budget.
How do I implement refinement in a CI/CD pipeline?
Run Claude Code with the -p (non-interactive / print) flag so the pipeline does not hang on interactive prompts. Use the failing-test output as the structured critique — each failing test is a gap with named diagnostic detail. Configure a tight iteration budget (typically two to four passes) to prevent runaway CI jobs, and scope tool permissions narrowly (Read, Edit, Bash with a command allowlist; no network access unless required). Surface test failures that cannot be addressed by Claude (infrastructure differences, environment issues) as human-intervention signals rather than burning iterations trying to refine around them. Plan mode is incompatible with non-interactive CI and must not be used there.
Why does self-critique work when I am using the same model as both author and reviewer?
Self-critique works because generation and evaluation invoke different cognitive stances even in the same model. A draft-production prompt biases Claude toward producing a coherent, confident output — that bias actively suppresses critique. Reprompting Claude with a clearly different role ("act as a strict reviewer; list every defect against these named criteria") activates an evaluative stance that can surface defects the generation stance produced. This is not Claude "noticing its own mistakes in real time" — it is Claude answering a different question on a different turn. The critical implementation detail is that the critique prompt must be explicit and criterion-driven. A vague "is this good?" produces generic affirmations; a specific "list every violation of criterion X, Y, and Z" produces an actionable gap list.
Further Reading
- Plan mode and iterative refinement — Claude Code Docs: https://docs.anthropic.com/en/docs/claude-code/plan-mode
- Claude 4 prompting best practices: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices
- Chain complex prompts for stronger performance: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/chain-prompts
- Handle tool calls — validation and retry loops: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/handle-tool-calls
- Integrate Claude Code into CI/CD pipelines: https://docs.anthropic.com/en/docs/claude-code/ci-cd
- Prompt engineering overview — Claude API Docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
Related ExamHub topics: Plan Mode vs Direct Execution, Validation, Retry, and Feedback Loops for Extraction, Multi-Step Workflows with Enforcement and Handoff, Agentic Loops for Autonomous Task Execution.