
Prompt Design with Explicit Criteria for Precision

6,200 words · ≈ 31 min read

Explicit criteria prompt design is the mechanical core of Domain 4 of the Claude Certified Architect — Foundations (CCA-F) exam. Task statement 4.1 — "Design prompts with explicit criteria to improve precision and reduce false positives" — anchors a domain that carries 20 % weight and supplies a disproportionate share of the Structured-Data-Extraction scenario questions. The exam guide is explicit about what it wants: candidates who can rewrite a vague instruction ("flag suspicious transactions") into an enforceable criteria block ("flag transactions where amount > $10,000 AND merchant_country differs from billing_country AND the account has no prior transactions in that country in the last 90 days") are the ones who pass. Candidates who treat prompting as prose rather than specification tend to miss this task statement on exam day.

This study note walks through the full surface of explicit-criteria design a CCA-F candidate must master: why specificity reduces hallucination, the condition-threshold-action triplet structure, positive versus negative criteria, numeric thresholds versus qualitative adjectives, edge-case enumeration, classification boundary rules, the precision-versus-recall tradeoff that tightening creates, criteria testing against labeled edge sets, versioning discipline, and field-level extraction rules. A common-traps section and FAQ close the loop by tying every abstract principle back to the Structured-Data-Extraction scenario that the exam draws from most aggressively for this task statement.

Why Explicit Criteria Matter — Specificity Reduces Hallucination

Large language models are pattern completers. When a prompt says "flag suspicious transactions," Claude must infer what "suspicious" means from its pretraining distribution — and that distribution contains everything from bank-fraud textbooks to anecdotal forum posts to movie plots. Without explicit criteria, the model fills the gap with an averaged, unpredictable definition, and two different runs of the same prompt can produce incompatible results.

Explicit criteria collapse that inference surface. When the prompt says "flag transactions where amount > $10,000 AND merchant_country differs from billing_country," there is nothing left for Claude to invent. The prompt has become a specification that any conforming implementation — human, rules engine, or model — can execute deterministically. Precision rises because the criteria narrow the match surface; false positives fall because edge cases that used to sneak through on loose semantic overlap are now explicitly excluded.

The CCA-F exam consistently rewards the tightest, most specification-like criteria among four options. A distractor that reads "consider the transaction context and use judgment to flag suspicious items" may sound sophisticated but will reliably lose to "flag transactions where amount > $10,000 AND merchant_country differs from billing_country AND the account has no prior transactions in that country in the past 90 days." Explicit criteria beat model judgment almost every time the exam offers both as options.

Explicit criteria are prompt instructions that express a decision rule as a fully specified predicate — a combination of observable conditions, numeric thresholds, and required actions — leaving no interpretive surface for the model to fill in from pretraining priors. Explicit criteria replace vague phrases like "suspicious," "relevant," or "important" with testable, auditable logic that produces reproducible results across runs and that a non-technical reviewer can evaluate against concrete examples.

Vague Instructions Create Ambiguity Budgets

Every vague word in a prompt — "relevant," "reasonable," "significant," "high-quality" — creates an ambiguity budget that Claude spends by drifting toward its training priors. The fewer ambiguous tokens in the prompt, the less drift. On the CCA-F Structured-Data-Extraction scenario, a prompt that defines "significant finding" as "a finding whose confidence score is ≥ 0.85 AND whose severity is in {high, critical}" will out-perform a prompt that relies on the word "significant" on its own, both in precision and in run-to-run stability.
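The predicate form of that definition can be written down directly. A minimal sketch (the field names confidence and severity are assumptions for illustration):

```python
# "Significant finding" as an explicit, testable predicate rather than an adjective.
SIGNIFICANT_SEVERITIES = {"high", "critical"}

def is_significant(finding: dict) -> bool:
    """A finding is 'significant' iff confidence >= 0.85 AND severity is
    high or critical -- nothing left for the model to interpret."""
    return (finding["confidence"] >= 0.85
            and finding["severity"] in SIGNIFICANT_SEVERITIES)
```

Because the rule is a pure predicate, the same labelled examples that calibrate the prompt can also unit-test this reference implementation.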

Criteria Anatomy — The Condition-Threshold-Action Triplet

Every well-formed explicit criterion decomposes into three parts. Internalizing this triplet structure is the single highest-leverage move a CCA-F candidate can make for Domain 4.

  1. Condition — the observable feature of the input that the rule tests. Examples: transaction.amount, document.length, message.sender_domain, address.country_code.
  2. Threshold — the numeric bound, enumerated set, or pattern against which the condition is compared. Examples: > 10000, in {US, CA, GB}, matches /^\d{3}-\d{2}-\d{4}$/.
  3. Action — the decision the criterion triggers when the condition meets the threshold. Examples: set flag = true, add to review queue, skip extraction, return confidence = low.

A criterion missing any of the three parts is underspecified. "Flag suspicious transactions" has an action (flag) but no condition and no threshold. "Amount > $10,000" has a condition and a threshold but no action. "If suspicious, flag it" has a condition and an action but no threshold — and "suspicious" is not observable.

Writing Criteria in Triplet Form

The recommended prompt shape for a CCA-F-grade criterion block is:

<criteria>
  <rule id="R1">
    Condition: transaction.amount
    Threshold: > 10000 USD
    Action: add "high-value" flag
  </rule>
  <rule id="R2">
    Condition: merchant_country vs billing_country
    Threshold: values are not equal
    Action: add "cross-border" flag
  </rule>
  <rule id="R3">
    Condition: account.prior_transactions_in(merchant_country, 90 days)
    Threshold: == 0
    Action: add "novel-geography" flag
  </rule>
  <rule id="COMBINED">
    Condition: count(flags)
    Threshold: >= 2
    Action: route to fraud review; otherwise pass through
  </rule>
</criteria>

This shape maps one-to-one onto the triplet structure and onto XML-structured prompts recommended for Claude. The explicit <rule id="..."> scaffolding is exam-friendly — scenario questions that present a rewrite from prose to triplet form consistently mark the triplet form as correct.
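Because each rule is a condition-threshold-action triplet over observable fields, the whole block has a deterministic reference implementation. A sketch, with illustrative field names (amount_usd, prior_txns_in_merchant_country_90d, and so on are assumptions):

```python
def evaluate_transaction(txn: dict) -> dict:
    """Deterministic reference implementation of rules R1-R3 plus COMBINED."""
    flags = []
    if txn["amount_usd"] > 10_000:                          # R1: high-value
        flags.append("high-value")
    if txn["merchant_country"] != txn["billing_country"]:   # R2: cross-border
        flags.append("cross-border")
    if txn["prior_txns_in_merchant_country_90d"] == 0:      # R3: novel geography
        flags.append("novel-geography")
    return {
        "flags": flags,
        "route_to_fraud_review": len(flags) >= 2,           # COMBINED rule
    }
```

Any conforming implementation — this function, a rules engine, or Claude following the criteria block — should produce the same flags for the same input, which is exactly the reproducibility the triplet form buys.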

Every explicit criterion has three parts:

  • Condition — the observable feature being tested (e.g., transaction.amount).
  • Threshold — the numeric bound or enumerated set (e.g., > 10000).
  • Action — the decision triggered when the condition meets the threshold (e.g., flag for review).

A criterion missing any one of the three is underspecified. CCA-F scenario questions consistently mark triplet-form answers as correct over prose-form distractors.

Positive vs Negative Criteria — Specifying What to Do AND What Not to Do

A subtle but high-frequency exam pattern: explicit criteria must cover both sides of every boundary. Saying only what to flag leaves Claude to infer what not to flag from pretraining priors; saying only what to ignore leaves the same gap on the other side.

Positive Criteria — Inclusion Rules

Positive criteria describe when to take the target action. They narrow the match surface from "anything Claude considers suspicious" to "exactly these enumerated cases."

Negative Criteria — Exclusion Rules

Negative criteria describe cases that look like matches but must be excluded. Without them, Claude's pattern-completion tendency pulls in near-matches — a behaviour that inflates false positives on the cases that matter most.

Pairing Positive and Negative Criteria

A production-grade criteria block pairs inclusion and exclusion rules for each sensitive boundary:

<criteria>
  <include>
    Flag transactions where amount > 10000 USD AND merchant_country != billing_country.
  </include>
  <exclude>
    Do NOT flag transactions where the merchant is on the user's allowlist,
    regardless of amount or geography.
  </exclude>
  <exclude>
    Do NOT flag recurring subscription charges (same merchant, same amount,
    monthly cadence, at least 3 prior payments) even if amount > 10000 USD.
  </exclude>
</criteria>

Pairing explicit inclusion and exclusion is the standard pattern used in the Structured-Data-Extraction scenario questions. Scenario answers that include negative criteria consistently beat scenario answers that only list positive criteria, because the negative criteria are what actually reduce false positives.
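The paired block reduces to a predicate with explicit exclusions. A sketch (field names such as merchant_id, recurring, and prior_payments are assumptions):

```python
def should_flag(txn: dict, allowlist: set) -> bool:
    """Inclusion rule AND two exclusion rules, mirroring the block above."""
    include = (txn["amount_usd"] > 10_000
               and txn["merchant_country"] != txn["billing_country"])
    # Exclusion 1: allowlisted merchants are never flagged.
    excluded_allowlist = txn["merchant_id"] in allowlist
    # Exclusion 2: established recurring subscriptions are never flagged.
    excluded_recurring = (bool(txn.get("recurring"))
                          and txn.get("prior_payments", 0) >= 3)
    return include and not (excluded_allowlist or excluded_recurring)
```

Note that the exclusions do not tighten the inclusion rule itself; they carve named look-alike classes out of its match surface, which is why they attack false positives directly.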

On CCA-F scenarios, whenever the question mentions "reduce false positives" as the goal, search the option text for explicit negative or exclusion criteria. Options that only list inclusion criteria may improve precision at the margin, but the exam preferentially marks options that pair inclusion with explicit exclusion as correct when false-positive reduction is the stated objective.

Quantifying Criteria — Numeric Thresholds Over Qualitative Adjectives

Numeric thresholds are deterministic; qualitative adjectives are interpretive. Every opportunity to replace an adjective with a number is an opportunity to remove an ambiguity budget from the prompt.

Adjectives to Avoid

The following words are red flags in a CCA-F prompt rewrite question:

  • "high" → replace with an explicit numeric threshold (e.g., > 0.85).
  • "significant" → replace with a measurable magnitude (e.g., change >= 10 %).
  • "large" → replace with units (e.g., document_length > 50000 characters).
  • "recent" → replace with a time bound (e.g., within the last 7 days).
  • "many" → replace with a count (e.g., >= 3 occurrences).
  • "important" → replace with enumerated categories (e.g., severity in {high, critical}).
  • "suspicious" → decompose into condition-threshold-action rules.

Why Numbers Beat Adjectives

Numbers force the prompt author to negotiate with reality. "High confidence" is a feeling; "confidence_score >= 0.85" is a testable boundary that can be adjusted in response to evaluation metrics. When an engineer notices that precision is too low, they can move the number to 0.90 and measure the impact. No equivalent calibration exists for the word "high."

When Adjectives Are Unavoidable

Some criteria genuinely resist numeric reduction — "the document is written in formal register," "the answer is off-topic," "the tone is unprofessional." For these, the correct fallback is enumeration by example: provide a small set of labelled examples that anchor the adjective. This is where explicit criteria meet few-shot prompting (task 4.2). Criteria provide the rule; few-shot examples provide the calibration for any irreducibly qualitative part.

Edge Case Specification — Enumerating Known Ambiguous Cases in the Prompt

Even a well-written criteria block misses boundary cases that an experienced domain expert would spot. The correction is not to broaden the criteria — it is to enumerate the known ambiguous cases directly inside the prompt.

Why Edge Cases Belong in the Prompt

Claude cannot infer an edge case it has never seen. If your domain routinely encounters ambiguous inputs that reasonable humans disagree on — a transaction that meets the amount threshold but was flagged by the customer in advance; an extraction field that is present but empty; a document that contains the target entity but in a quoted context — the correct place to resolve the ambiguity is the prompt, not in downstream cleanup code.

Edge Case Enumeration Pattern

<edge_cases>
  <case id="E1">
    If a transaction matches R1-R3 but the customer has a "travel notice"
    flag active for the merchant_country, DO NOT flag the transaction.
  </case>
  <case id="E2">
    If the extraction field is present in the document but the value is
    empty string, null, or whitespace-only, return { "value": null, "confidence": 0.0 }.
    Do NOT attempt to infer a value from surrounding context.
  </case>
  <case id="E3">
    If the target entity appears inside a quotation (surrounded by quote marks
    or cited as a source), extract it but mark attribution = "quoted".
  </case>
</edge_cases>

Enumerated edge cases are a CCA-F favourite because they convert tacit domain knowledge into explicit instructions that are auditable, testable, and versionable. They also compose with few-shot examples — each edge case can be paired with a labelled example to reinforce the rule.

The CCA-F exam consistently prefers answers that enumerate known edge cases in the prompt over answers that rely on Claude's judgment. When a scenario question describes a recurring ambiguous case ("sometimes the field is empty," "sometimes the merchant is on an allowlist," "sometimes the entity is cited rather than discussed"), the correct answer usually adds an explicit edge-case rule rather than expanding the main criteria or post-processing with code.

Classification Criteria Design — Decision Rules for Boundary Cases

Classification is the CCA-F task that benefits most from explicit criteria. Classification decisions are boundary decisions; every pair of adjacent classes has a boundary that the prompt must resolve. Leaving the boundary implicit hands it to pretraining priors; making it explicit stabilizes the classifier.

Single-Class Boundary Rule Pattern

For a binary classifier, you need one rule that distinguishes the two classes unambiguously. A vague rule ("classify as urgent if the message seems time-sensitive") fails; an explicit rule ("classify as urgent if any of: the message contains the phrase 'by end of day', the message was sent by a VIP account, the message references an active outage") succeeds.

Multi-Class Boundary Rule Pattern

For an N-class classifier, you need N-1 boundary rules plus a priority order. Without priority, Claude may assign inputs to whichever class is listed first or last in the prompt. Explicit priority resolves this:

<classification>
  <rule priority="1">
    If message contains explicit outage language (down, broken, not working)
    OR sender is a VIP account, classify as CRITICAL.
  </rule>
  <rule priority="2">
    If message asks a question that references a specific feature or workflow
    AND no CRITICAL conditions match, classify as TECHNICAL.
  </rule>
  <rule priority="3">
    If message is a thank-you, feedback, or general comment
    AND no higher-priority rule matches, classify as FEEDBACK.
  </rule>
  <rule priority="4">
    Otherwise classify as OTHER.
  </rule>
</classification>
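The priority-ordered rules reduce to a first-match-wins function. A rough sketch with naive substring matching and illustrative field names (vip, references_feature, and kind are assumptions):

```python
def classify(message: dict) -> str:
    """First-match-wins classifier mirroring the priority-ordered rules above.
    Keyword matching here is naive substring search, for illustration only."""
    text = message["text"].lower()
    # Priority 1: outage language or VIP sender.
    if any(kw in text for kw in ("down", "broken", "not working")) or message.get("vip"):
        return "CRITICAL"
    # Priority 2: a question referencing a specific feature or workflow.
    if message.get("references_feature") and "?" in text:
        return "TECHNICAL"
    # Priority 3: thanks, feedback, or general comments.
    if message.get("kind") in {"thanks", "feedback", "comment"}:
        return "FEEDBACK"
    # Priority 4: everything else.
    return "OTHER"
```

The explicit ordering is what makes the classifier stable: an outage question from a VIP is CRITICAL, never TECHNICAL, because rule 1 is checked first.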

Class Definitions Belong in the Prompt

Never assume Claude shares your exact class definitions. The prompt must contain a one-to-two-sentence definition of each class, written in the same language as the criteria. If your classifier has five classes, your prompt has five class definitions. The combined cost (a few hundred tokens) is trivial compared to the precision gain.

False Positive Reduction — Tightening Criteria to Narrow the Match Surface

False positives are the most-cited precision failure in the CCA-F Structured-Data-Extraction scenario. Reducing them mechanically is the job of tightening criteria. Four levers dominate.

Lever 1: Raise Numeric Thresholds

Moving confidence_score > 0.70 to confidence_score > 0.85 mechanically reduces the false-positive rate. The cost is lower recall (some true positives with 0.70 ≤ score < 0.85 will be missed), but for workflows where false positives are expensive — fraud review queues, legal holds, compliance flags — the trade is usually worth it.

Lever 2: Add Required AND-Conditions

Every positive criterion joined by AND narrows the match surface. "Amount > $10,000" catches too many legitimate transactions; "Amount > $10,000 AND merchant_country != billing_country AND no prior transactions in merchant_country" catches a much smaller, much higher-precision set.

Lever 3: Add Negative (Exclusion) Criteria

Exclusions carve legitimate look-alikes out of the match surface without further restricting the positive side. "Do not flag recurring subscriptions" excludes a whole class of false positives without affecting the genuine fraud signal.

Lever 4: Require Evidence Fields

Require Claude to return the specific evidence that triggered the match — "include the exact text excerpt that demonstrates the match," "include the field names that satisfied the rule." Asking for evidence reduces hallucinated matches because Claude cannot fabricate evidence as easily as it can fabricate a flag.

False positive reduction via criteria tightening is the practice of narrowing a prompt's match surface by raising numeric thresholds, adding AND-conditions, adding explicit exclusion rules, and requiring evidence fields in the output. Each lever improves precision at the cost of some recall; a well-calibrated criteria block balances the two based on the business cost of each error type. Structured-Data-Extraction scenario questions on CCA-F consistently reward answers that apply at least two of the four tightening levers simultaneously.

False Negative Tradeoff — Precision vs Recall When Criteria Tighten

Every criterion you add reduces both false positives and true positives. This is the precision-versus-recall tradeoff, and the CCA-F exam tests whether candidates recognize it.

Precision vs Recall — Definitions for Prompt Design

  • Precision = of the items the prompt flagged, what fraction are actually correct matches. High precision means few false positives.
  • Recall = of all the items that are actually correct matches, what fraction did the prompt flag. High recall means few false negatives.

A prompt with criteria so loose that it flags everything has 100 % recall and low precision. A prompt with criteria so tight that it flags nothing has zero recall and undefined precision (it produces no false positives, but no true positives either). The operating point between these extremes is a business decision, not a model decision.
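The definitions translate directly into code. A sketch where flagged and gold are sets of item identifiers:

```python
def precision_recall(flagged: set, gold: set):
    """Precision and recall of a flagging rule against a labelled gold set."""
    tp = len(flagged & gold)
    precision = tp / len(flagged) if flagged else None  # undefined if nothing flagged
    recall = tp / len(gold) if gold else None           # undefined if no true matches
    return precision, recall
```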

Domain Cost Determines the Operating Point

Tight criteria (high precision, low recall) are appropriate when false positives are expensive — legal review, fraud queues, security incidents. Loose criteria (high recall, low precision) are appropriate when false negatives are expensive — medical screening, safety-critical alerts, regulatory compliance. The CCA-F exam expects candidates to articulate this tradeoff explicitly in scenario answers.

Calibration Through Evaluation

You cannot calibrate a precision-recall tradeoff without a labelled evaluation set. The correct workflow:

  1. Run the prompt over a labelled set.
  2. Measure precision and recall at the current criteria.
  3. Adjust criteria (tighten or loosen).
  4. Re-run and re-measure.
  5. Iterate until the operating point matches the business requirement.
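Steps 1–4 of that loop can be sketched as a sweep over candidate confidence thresholds (the input shapes — an id-to-score map and a set of true-match ids — are assumptions):

```python
def sweep_threshold(scores: dict, gold: set, thresholds):
    """Run the flagging rule at each candidate threshold over a labelled set
    and record (precision, recall) so the operating point can be chosen."""
    results = {}
    for t in thresholds:
        flagged = {item_id for item_id, s in scores.items() if s >= t}
        tp = len(flagged & gold)
        precision = tp / len(flagged) if flagged else None
        recall = tp / len(gold)
        results[t] = (precision, recall)
    return results
```

The output makes the tradeoff visible as a table of operating points; picking the row that matches the business requirement is step 5.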

This is the bridge from prompt design (task 4.1) to validation and retry loops (task 4.4). Explicit criteria give you knobs to turn; labelled evaluation sets tell you how far to turn them.

CCA-F scenario answers that acknowledge the precision-recall tradeoff — "tightening these criteria will reduce false positives but may increase false negatives on edge cases; monitor recall on the labelled eval set" — consistently outperform answers that treat tightening as a free improvement. The exam rewards the engineering maturity of naming the cost, not just the benefit.

Criteria Testing — Evaluating Prompts Against Labelled Edge Case Sets

An explicit criteria block is only as good as the evaluation that calibrates it. CCA-F expects candidates to treat prompt changes like code changes: measured against a fixed test set, with explicit pass/fail thresholds.

The Minimum Labelled Set

A usable evaluation set for a criteria block is:

  • 30-100 positive cases — inputs that should match the criteria, covering the typical distribution.
  • 30-100 negative cases — inputs that should NOT match, including look-alikes that previously produced false positives.
  • 10-30 edge cases — the enumerated ambiguous cases the prompt explicitly addresses, each labelled with the expected output.

This is not a research-grade evaluation; it is the minimum viable harness for catching regressions when you tighten or loosen criteria.

Running the Evaluation

Run the prompt over the full labelled set. Compute precision and recall on the positive/negative split. For the edge cases, compute exact-match accuracy — did the prompt produce the expected output for each known-ambiguous case.
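Edge-case exact-match accuracy is the simplest of the three metrics. A sketch where both maps are keyed by edge-case id (the id scheme is an assumption):

```python
def edge_case_accuracy(expected: dict, actual: dict) -> float:
    """Fraction of labelled edge cases where the prompt produced exactly
    the expected output (expected/actual map edge-case id -> output)."""
    hits = sum(1 for case_id, exp in expected.items()
               if actual.get(case_id) == exp)
    return hits / len(expected)
```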

Reading the Results

  • If precision is too low, tighten criteria using the four levers above.
  • If recall is too low, loosen criteria or relax thresholds.
  • If edge-case accuracy is low, the enumerated edge-case rules are not being applied — refine their wording or add reinforcing few-shot examples.

The Eval Loop Is the Design Loop

Explicit criteria without an evaluation loop are guesses. The loop is: change criteria → run eval → measure precision/recall/edge-case accuracy → decide whether to commit, refine, or revert. This is the same discipline as software engineering unit tests, applied to prompts.

Criteria Versioning — Tracking Prompt Changes and Their Precision Impact

Criteria blocks are source code. They belong in version control, with change history, code review, and rollback paths.

What to Version

  • The full prompt text, including system prompt, criteria block, edge cases, and few-shot examples.
  • The evaluation set (positive, negative, edge cases, with labels).
  • The measured metrics (precision, recall, edge-case accuracy) at each committed version.

Changelog Discipline

Every criteria change should have an entry recording:

  • What changed (which rule, which threshold, which edge case).
  • Why it changed (which failing example, which stakeholder request, which business rule update).
  • The measured precision/recall before and after on the eval set.
  • The commit hash or equivalent version identifier.
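A changelog entry is just a small record. One possible shape — all field names here are illustrative, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class CriteriaChange:
    """One changelog entry per committed criteria change (illustrative shape)."""
    rule_id: str             # which rule, threshold, or edge case changed
    change: str              # what changed
    reason: str              # failing example, stakeholder request, rule update
    precision_before: float  # measured on the eval set before the change
    precision_after: float
    recall_before: float
    recall_after: float
    version: str             # commit hash or equivalent identifier
```

Storing entries like this alongside the prompt makes the regression question ("which rule change caused this?") a diff over structured records rather than archaeology.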

Rollback Is a Design Feature

Tightening criteria in response to a false-positive complaint can inadvertently crush recall on legitimate cases. Without versioning, you cannot roll back; with versioning, rollback is one command. CCA-F scenario answers that mention prompt versioning consistently outperform answers that treat prompts as one-off artifacts.

Criteria versioning is a frequent CCA-F scenario hook. When a question describes a prompt that used to work and now produces regressions, the correct answer almost always includes "roll back to the previous prompt version and diff the criteria changes" or "check the criteria changelog to identify which rule change triggered the regression." Answers that propose re-engineering from scratch without consulting the version history are marked as over-engineering.

Criteria for Extraction Tasks — Field-Level Rules for Structured Data

The Structured-Data-Extraction scenario is where explicit criteria do their heaviest lifting on the CCA-F exam. Structured extraction means pulling named fields out of unstructured or semi-structured input and emitting a schema-conformant object. Each field deserves its own criteria block.

Field-Level Criteria

For every field in the extraction schema, the prompt should specify:

  • Source rule — where in the input to look (e.g., "customer_name is the value following 'Customer:' in the header block").
  • Format rule — the expected shape of the value (e.g., "ISO 8601 date string," "E.164 phone number," "all-caps country code").
  • Presence rule — what to do when the field is missing (e.g., "return null; do NOT infer from surrounding context").
  • Ambiguity rule — what to do when multiple candidate values exist (e.g., "if multiple customer names appear, choose the one in the header; if no header, return null").

Strict Tool Use as the Enforcement Layer

Explicit criteria produce the correct values; strict tool use produces the correct shape. Defining the extraction as a tool call with strict: true on the tool schema guarantees that Claude's output conforms to the JSON Schema — missing fields, wrong types, and extra keys all become impossible. The criteria block governs the content; the strict schema governs the structure. These two layers compose — they are not substitutes.
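A sketch of what such a tool definition might look like. The field names are invented for illustration, and whether and how a strict flag is exposed depends on the API version in use, so verify against current documentation:

```python
# Illustrative tool definition: the JSON Schema layer enforces output shape,
# while the prompt's criteria block governs the content of each field.
extraction_tool = {
    "name": "record_extraction",
    "description": "Record fields extracted from the input document.",
    "strict": True,  # strict-mode flag as described above; availability varies by API version
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_name": {"type": ["string", "null"]},
            "invoice_date": {
                "type": ["string", "null"],
                "description": "ISO 8601 date string, or null if absent",
            },
        },
        "required": ["customer_name", "invoice_date"],
        "additionalProperties": False,
    },
}
```

The nullable types encode the presence rule ("return null; do NOT infer"), required plus additionalProperties encode the schema contract, and the criteria block still governs which value lands in each field.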

Evidence-Linked Fields

For audit-sensitive extractions, require each field to include the source text excerpt that justified the value. This is cheap to add to the criteria ("for each extracted field, include a source_excerpt containing the verbatim text from the input that justifies the value") and dramatically reduces hallucinated extractions. The exam recognizes this pattern as a Structured-Data-Extraction best practice.

Field-level extraction criteria are per-field prompt rules that specify, for each field in the output schema, where to look in the input (source rule), what shape the value must take (format rule), what to do when the field is absent (presence rule), and how to resolve multiple candidates (ambiguity rule). When combined with strict tool use and evidence-linked output fields, field-level criteria are the CCA-F-preferred pattern for Structured-Data-Extraction workflows because they produce precise, auditable extractions that are testable against a labelled set.

How Criteria Compose with Few-Shot Examples

Explicit criteria and few-shot examples are complementary, not competitive. The exam consistently tests whether candidates know this.

Criteria Define the Rule; Examples Calibrate the Rule

Criteria express the decision rule in prose or structured form. Examples ground the rule in concrete instances that disambiguate subtle edge cases. A prompt with criteria and zero examples tends to execute the letter of the rule but miss the spirit; a prompt with examples and no criteria tends to overfit the examples and misbehave on inputs that differ from the example set.

When to Add Examples to Criteria

Add few-shot examples whenever:

  • A criterion contains an irreducibly qualitative element (tone, register, professionalism).
  • An edge-case rule has non-obvious correct output that benefits from visual reinforcement.
  • The output format is complex enough that a literal instance demonstrates it better than prose.

A full prompt then assembles these sections in the following order:
<instructions>
  [task description]
</instructions>
<criteria>
  [explicit rules as triplets]
</criteria>
<edge_cases>
  [enumerated ambiguous cases with rules]
</edge_cases>
<examples>
  [3-5 input/output pairs that exercise criteria and edge cases]
</examples>
<input>
  [the actual input to process]
</input>

This order matches Anthropic's published recommendations: criteria and edge cases establish the logic; examples ground the logic; the input comes last so the latest context is freshest in attention.

XML Tags Are Not Optional

Claude is trained to parse XML tags in prompts. Using <criteria>, <edge_cases>, <examples>, and <input> as explicit sections dramatically improves criterion adherence compared to unstructured prose. The exam consistently marks XML-tagged prompts as correct over identical-content prose prompts.

Plain-English Explanation

Abstract criterion mechanics become intuitive when you anchor them to physical systems most candidates already know. Three very different analogies cover the full sweep of explicit-criteria design.

Analogy 1: The Health-Inspector Checklist — Criteria as Triplets

Imagine a restaurant health inspector walking into a kitchen. A vague inspector walks around saying, "flag anything unsanitary." Two inspectors will disagree constantly about what counts as unsanitary, and the same inspector will reach different conclusions on different days. A professional inspector carries a checklist: "flag if surface temperature on cooked meat is below 60 °C" (condition + threshold + action); "flag if raw chicken is stored above ready-to-eat food" (condition + threshold + action); "do NOT flag minor water spots on stainless steel that wipe off" (negative criterion). The checklist turns inspection from interpretive art into a reproducible procedure. Two inspectors with the same checklist produce the same report. Explicit criteria do for Claude what the checklist does for the inspector: they replace interpretive judgment with observable rules, which is why precision improves and run-to-run variance drops. The CCA-F exam rewards candidates who can write the checklist version of a vague instruction.

Analogy 2: The Airport Security Screening Line — Precision, Recall, and Tradeoffs

Airport security is a living precision-versus-recall experiment. A loose screening policy lets everyone through quickly: it raises few false alarms, but real threats slip past, so recall on threat detection is low. A strict policy scrutinizes every passenger: it catches nearly every threat (high recall), but it flags many innocent travelers, so precision is low and false positives pile up. Security leadership must pick an operating point based on the cost of each error type: the cost of a missed threat versus the cost of a false alarm. The policy expresses the operating point as explicit criteria: "send a passenger for secondary screening if any of the following apply — liquid > 100 ml, metal > X grams, randomly selected, or on a watchlist." These are the condition-threshold-action triplets. When a new threat emerges, the criteria tighten (lower threshold, new condition); when complaints about long lines spike, the criteria loosen. Prompt engineers for Claude face the same economy: tighter criteria reduce false positives at the cost of recall; looser criteria increase recall at the cost of false positives. The CCA-F exam wants candidates who articulate the tradeoff rather than treat tightening as a free win.

Analogy 3: The Pharmacist's Prescription Check — Edge Cases and Exclusion Rules

A pharmacist receives a prescription and must decide whether to dispense. Positive criteria: "prescription has a valid signature, patient ID matches, drug is stocked, dose is within guidelines." Negative criteria: "do NOT dispense if the patient is on a contraindicated medication; do NOT dispense if the dose exceeds the weight-adjusted maximum; do NOT dispense if insurance rejects the claim." Edge cases the pharmacist has seen before: "if the prescribed dose looks unusual but the prescriber is a known specialist for this condition, call the prescriber to confirm before dispensing — do not refuse outright." Each of these maps onto a prompt pattern. Positive criteria are inclusion rules. Negative criteria are exclusion rules. Edge cases are enumerated rules with explicit resolutions. The pharmacist who writes down these rules and follows them consistently makes fewer errors than the pharmacist who relies on experience alone, because written rules are auditable, teachable to new staff, and updatable when new safety information arrives. Prompts with explicit criteria behave the same way — they produce audit-ready, precision-optimized outputs that a reviewer can trace back to specific rules.

Which Analogy Fits Which Exam Question

  • Questions about criterion structure → health-inspector checklist analogy.
  • Questions about tightening criteria and tradeoffs → airport security analogy.
  • Questions about edge cases and exclusions → pharmacist analogy.

Common Exam Traps

CCA-F Domain 4 consistently exploits five recurring trap patterns around explicit criteria design. All five appear disguised as plausible distractor choices in the Structured-Data-Extraction scenario.

Trap 1: "More Criteria Is Always Better"

Over-specification causes brittleness. A prompt with 40 criteria will out-flag a prompt with 4 criteria on the training examples, but it will also catastrophically misbehave on any input that does not match one of the 40 anticipated shapes. The CCA-F exam consistently marks "add more rules" as the wrong answer when the scenario involves inputs that drift from the training distribution. The correct answer is often to keep criteria tight but small, and let few-shot examples cover the long tail.

Trap 2: Explicit Criteria Replace Few-Shot Examples

They do not. Criteria and examples are complementary. A prompt that replaces its few-shot examples with more criteria loses the example-grounded calibration that anchors qualitative terms; a prompt that replaces its criteria with more few-shot examples loses the rule-based determinism that makes novel inputs predictable. CCA-F scenario answers that explicitly retain both criteria and examples consistently out-score answers that favour one over the other.

Trap 3: Vague Adjectives Dressed as Criteria

"High confidence" is not a criterion. "Significant impact" is not a criterion. "Suspicious behaviour" is not a criterion. Scenario distractors frequently wrap vague adjectives in criterion-like syntax — <rule>flag anything significant</rule> — and offer this as the "explicit criteria" option. It is not. The correct answer replaces the adjective with a numeric threshold or an enumerated category.

Trap 4: Only Positive Criteria with No Exclusions

Scenario questions with the goal "reduce false positives" often offer an option that tightens positive criteria but adds no exclusion rules. This option improves precision at the margin but loses to the option that pairs the same positive criteria with explicit exclusion rules for known look-alikes. On CCA-F, the exclusion-paired answer wins almost every time when the stated goal is false-positive reduction.

Trap 5: Criteria Changes Without Eval or Versioning

Scenario questions that ask "what would you do next after tightening criteria" often offer "deploy immediately" as a distractor. The correct answer involves running the tightened criteria against a labelled eval set, measuring precision and recall, and versioning the prompt before deployment. Treating prompts as one-off artifacts is penalized; treating them as versioned code is rewarded.

Practice Anchors

Explicit criteria design appears most heavily in the Structured-Data-Extraction scenario, with secondary coverage in two others. Treat the following as the architecture spine for scenario-cluster questions.

Structured-Data-Extraction Scenario

In this scenario, a pipeline ingests documents (invoices, medical records, contracts, support tickets) and extracts named fields into a structured schema. Expect questions that test whether you:

  • Rewrite vague extraction instructions ("pull the key fields") into field-level criteria with source, format, presence, and ambiguity rules.
  • Pair positive inclusion criteria with explicit exclusion rules for look-alike fields.
  • Replace adjectives like "important" or "significant" with numeric thresholds or enumerated categories.
  • Enumerate known edge cases (empty fields, quoted entities, multi-match inputs) inside the prompt.
  • Combine explicit criteria with strict: true tool use for guaranteed schema conformance.
  • Version the prompt and run a labelled eval set before deploying criteria changes.
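
The last two bullets meet in the schema the extraction tool enforces. A sketch of what a strict field-level schema might look like for the invoice case; the field names, enum values, and invoice domain are illustrative assumptions, not an official schema:

```python
# Illustrative JSON Schema for an invoice-extraction tool. Under strict
# schema enforcement, every required field must appear and no extra keys
# are allowed, so format and presence rules become machine-checkable.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date":   {"type": "string", "description": "ISO 8601 date"},
        "total_amount":   {"type": "number"},
        "currency":       {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        # Presence rule made explicit: a missing PO number is null, never guessed.
        "po_number":      {"type": ["string", "null"]},
    },
    "required": ["invoice_number", "invoice_date", "total_amount",
                 "currency", "po_number"],
    "additionalProperties": False,
}
```

The prompt's criteria say where each value comes from and how ambiguity resolves; the schema guarantees the shape of whatever comes back.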

Customer-Support-Resolution-Agent Scenario

The Customer-Support scenario exercises explicit criteria when the agent must classify tickets, detect urgency, or decide when to escalate. Expect questions that test classification rule design with priority ordering, explicit urgency thresholds rather than subjective "seems urgent" heuristics, and enumerated escalation triggers rather than Claude-judgment escalation. The same condition-threshold-action triplet structure applies.

Multi-Agent-Research-System Scenario

The Multi-Agent-Research scenario exercises explicit criteria when sub-agents decide what counts as "a sufficient answer" or "a high-quality source." Expect questions that test whether answer-quality criteria are spelled out as explicit rules (minimum number of cited sources, minimum confidence score, required evidence fields) rather than left to sub-agent judgment. Explicit criteria at the sub-agent prompt level is what prevents quality drift across the research pipeline.

FAQ — Explicit Criteria Design Top 5 Questions

Why do explicit criteria outperform natural-language instructions on CCA-F scenario answers?

Explicit criteria collapse the ambiguity surface that Claude otherwise fills with pretraining priors. A natural-language instruction like "flag suspicious transactions" requires Claude to infer a definition of "suspicious"; two runs can produce different definitions because the inference is underdetermined. Explicit criteria — amount > 10000 AND merchant_country != billing_country AND no prior transactions in that country within 90 days — leave nothing to infer. Precision rises because the match surface is narrower; consistency rises because the rule is deterministic; auditability rises because a reviewer can check each rule against each input. CCA-F scenario answers that present explicit criteria in condition-threshold-action triplet form consistently beat answers that rely on model judgment.
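
The transaction rule above is deterministic enough to read as a predicate. A sketch, assuming a simple dict-based transaction record and history list (the field names are illustrative):

```python
from datetime import date, timedelta

def should_flag(txn: dict, history: list[dict], today: date) -> bool:
    """Flag when amount exceeds $10,000 AND merchant country differs from
    billing country AND there is no prior transaction in that country
    within the last 90 days. Each clause is auditable in isolation."""
    if txn["amount"] <= 10_000:
        return False
    if txn["merchant_country"] == txn["billing_country"]:
        return False
    cutoff = today - timedelta(days=90)
    return not any(
        h["merchant_country"] == txn["merchant_country"] and h["date"] >= cutoff
        for h in history
    )
```

A reviewer disputing a flag can check each rule against the input, which is exactly the auditability the triplet form buys.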

How do I balance tightening criteria against losing recall?

Tightening criteria is never a free improvement — every rule you add reduces both false positives and some true positives. The balance is a business decision driven by the cost of each error type. For queues where false positives are expensive (legal review, fraud investigation, safety incidents), tighten aggressively. For workflows where false negatives are expensive (medical screening, regulatory compliance), keep criteria looser and rely on downstream human review to catch false positives. Calibration requires a labelled evaluation set: measure precision and recall at the current criteria, adjust, re-measure. CCA-F scenario answers that name the tradeoff explicitly (for example, "tightening these criteria will improve precision but may reduce recall; monitor on the labelled eval set") consistently out-score answers that treat tightening as an unconditional win.
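
Measuring that tradeoff on a labelled set takes only a few lines. A sketch assuming parallel boolean lists, where preds holds the prompt's flag decisions and labels holds the ground truth:

```python
def precision_recall(preds: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    Tightening criteria typically raises the first and lowers the second."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Re-running this before and after a criteria change turns "tighten aggressively" from a hunch into a measured move.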

Should explicit criteria replace my few-shot examples?

No. Criteria and examples are complementary and both should be present in a CCA-F-grade prompt. Criteria define the decision rule in structured form; examples calibrate any residual qualitative element of the rule and demonstrate the output format in literal form. A prompt with criteria and no examples tends to execute the letter of the rule but miss subtle formatting conventions; a prompt with examples and no criteria tends to overfit the examples and misbehave on inputs that differ from them. The recommended shape interleaves both: criteria plus edge-case rules plus 3-5 few-shot examples, all wrapped in XML tags. The CCA-F exam consistently marks answers that retain both criteria and examples as correct over answers that favour one over the other.

How do I handle known ambiguous cases that my criteria cannot resolve?

Enumerate the ambiguous cases directly inside the prompt as explicit <edge_cases> entries. For each case, specify the condition that identifies the case ("if a transaction matches the amount threshold but the customer has an active travel notice for the merchant country"), and the explicit resolution ("do NOT flag the transaction; instead, set travel_notice_overrode_flag = true in the output"). Enumerated edge cases convert tacit domain knowledge into auditable prompt instructions. The CCA-F exam consistently prefers this pattern over solutions that expand the main criteria block or handle the case in downstream code. Edge cases belong in the prompt because that is where Claude can apply them; edge cases hidden in post-processing still let Claude produce an incorrect intermediate output.

What is the minimum evaluation discipline for a production criteria block?

A minimum viable evaluation harness for a criteria block contains 30-100 labelled positive cases, 30-100 labelled negative cases (including known look-alikes that have previously caused false positives), and 10-30 labelled edge cases corresponding to the enumerated <edge_cases> rules. Run the prompt over the full set, compute precision and recall on the positive/negative split, and compute exact-match accuracy on the edge cases. Every criteria change is accompanied by a rerun of the eval and a commit that records the before/after metrics. Version the prompt, the eval set, and the measurements together. CCA-F scenario answers that propose deploying a criteria change without an eval rerun are consistently marked as incorrect; answers that include the eval loop and the version commit are marked as the mature engineering response the exam rewards.
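
The loop described above can be sketched as a minimal harness. Here classify is a stand-in for the real prompt call, and the case layout and metric names are assumptions:

```python
from typing import Callable

def run_eval(classify: Callable[[str], bool],
             positives: list[str], negatives: list[str],
             edge_cases: list[tuple[str, bool]]) -> dict:
    """Run a criteria block over a labelled set and report the metrics
    that should be committed alongside the prompt version."""
    tp = sum(classify(x) for x in positives)                       # true positives
    fp = sum(classify(x) for x in negatives)                       # false positives
    edge_hits = sum(classify(x) == expected for x, expected in edge_cases)
    return {
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / len(positives) if positives else 1.0,
        "edge_case_accuracy": edge_hits / len(edge_cases) if edge_cases else 1.0,
    }

# Before deploying a criteria change, record the metrics next to the
# prompt version, e.g. {"prompt_version": "v2", **run_eval(...)}.
```

Committing the metrics dict alongside the prompt text is what makes a later regression traceable to a specific criteria change.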

Further Reading

Related ExamHub topics: Few-Shot Prompting for Output Consistency and Quality, Structured Output with Tool Use and JSON Schemas, Validation, Retry, and Feedback Loops for Extraction Quality, Multi-Instance and Multi-Pass Review Architectures.

Official sources