Vol. I

Foundation Model Evaluation and Benchmarking

6,120 words · about a 31-minute read

Foundation model evaluation is the disciplined process of measuring how well a large language model or other foundation model performs against a defined task, before or after you put it in front of real users. On the AWS Certified AI Practitioner exam (AIF-C01), Domain 3 (Applications of Foundation Models) and Domain 5 (Security, Compliance, and Governance for AI Solutions) both assume you can choose between automatic metrics such as ROUGE, BLEU, BERTScore, and perplexity; pick the right public benchmark such as MMLU, HellaSwag, or HumanEval; configure an Amazon Bedrock Model Evaluation job (automatic or human); route bias checks into Amazon SageMaker Clarify; and A/B test winners in production with Amazon SageMaker shadow testing or Amazon Bedrock Provisioned Throughput. Foundation model evaluation is the single most misunderstood topic in the AIF-C01 blueprint because the metric names sound interchangeable but are not.

This AIF-C01 study guide on foundation model evaluation walks through every automatic metric, every benchmark, every AWS-native foundation model evaluation workflow, and every production A/B testing pattern. You will finish this page knowing exactly which foundation model evaluation metric fits summarization, translation, question answering, or code generation, how Amazon Bedrock Model Evaluation differs from Amazon SageMaker Clarify foundation model evaluation, and how to cost, schedule, and interpret a foundation model evaluation run without wasting thousands of dollars on the wrong metric.

What is Foundation Model Evaluation?

Foundation model evaluation is the structured measurement of a foundation model's output quality, safety, bias, latency, and cost against a task-specific ground truth or reference. Foundation model evaluation splits into three families. The first family is automatic foundation model evaluation — deterministic metrics such as ROUGE, BLEU, BERTScore, perplexity, exact match, and F1 that a script can compute without a human. The second family is benchmark-based foundation model evaluation — curated multiple-choice or code-completion datasets like MMLU, HellaSwag, HumanEval, TruthfulQA, and BIG-bench that rank models on shared leaderboards. The third family is human foundation model evaluation — human reviewers scoring outputs for helpfulness, harmlessness, honesty, tone, factual accuracy, or brand voice, often via Amazon SageMaker Ground Truth, Amazon SageMaker Ground Truth Plus, or Amazon Mechanical Turk.

AWS wraps these three foundation model evaluation families into two managed services that AIF-C01 tests heavily. Amazon Bedrock Model Evaluation supports automatic foundation model evaluation jobs (built-in metrics on built-in or custom prompt datasets) and human foundation model evaluation jobs (your workforce or an AWS-managed workforce). Amazon SageMaker Clarify extends foundation model evaluation with bias, toxicity, stereotyping, and factual-knowledge checks on Amazon Bedrock and Amazon SageMaker JumpStart models. Foundation model evaluation is not a one-shot gate; it is a recurring process that runs before fine-tuning, after fine-tuning, before deployment, during A/B testing with Amazon SageMaker Shadow, and periodically in production to catch drift.

Why foundation model evaluation matters for AIF-C01

AIF-C01 places foundation model evaluation in Task Statement 3.2 (evaluate foundation model performance) and Task Statement 5.2 (recognize features of tools to identify bias and fairness). Exam-signal analysis of the AIF-C01 question pool shows foundation model evaluation appearing in roughly one in six questions, often as a scenario that lists a task (summarize support tickets, translate product descriptions, generate Python unit tests) and asks which foundation model evaluation metric or which AWS foundation model evaluation service fits best. Missing this topic is the single fastest way to drop below the 700-point passing score.

Plain-English Walkthrough of Foundation Model Evaluation

Foundation model evaluation sounds like a PhD topic, but three analogies make it concrete.

Analogy 1 — The open-book exam

Imagine foundation model evaluation as grading an open-book exam that a very confident student just turned in.

  • ROUGE is the teacher running a highlighter over the student's answer and counting how many phrases overlap with the answer key. It is perfect when the answer key is a summary — "did the student cover the main points the textbook summary covered?"
  • BLEU is the language teacher grading a translation exercise. She checks whether the student's English translation uses the same chunks of words as the reference translation. Exact phrasing matters because translations are judged on fluency and fidelity.
  • BERTScore is the wise examiner who does not care about exact words and instead asks "does the student's answer mean the same thing?" She uses a dictionary of meanings (BERT embeddings) and gives credit for paraphrases.
  • Perplexity is how surprised the student was by the textbook itself. Low perplexity means the student finds the textbook predictable and fluent; high perplexity means the student is confused on every page.
  • MMLU / HellaSwag / HumanEval are standardized tests — the SAT, the commonsense pop quiz, and the coding interview. Every student in the world takes the same test, so you can compare schools (models) fairly.
  • LLM-as-judge is letting a senior student grade a junior student's paper. Faster and cheaper than a human teacher, but the senior student has biases (prefers longer answers, prefers its own writing style).
  • Human evaluation (Ground Truth Plus, Mechanical Turk) is the actual teacher reading every essay. Slow, expensive, but the gold standard.

On the AIF-C01 exam, if the question says "summarize customer reviews," you reach for ROUGE. If it says "translate to Japanese," you reach for BLEU. If it says "semantic similarity regardless of wording," you reach for BERTScore. That is the whole trick.

Analogy 2 — The restaurant kitchen

Foundation model evaluation is quality control in a restaurant kitchen that serves AI-generated dishes.

  • Automatic metrics (ROUGE, BLEU, BERTScore, perplexity) are the thermometer, the scale, and the timer. Cheap, instant, objective, but they cannot tell if the dish tastes good.
  • Benchmarks (MMLU, HellaSwag, HumanEval) are the Michelin inspector's standardized checklist. Same test across every restaurant; the score ranks your kitchen globally.
  • LLM-as-judge is the head chef tasting the sous-chef's plate. Faster than hiring food critics, but the head chef has favorites.
  • Human evaluation (Ground Truth Plus, Mechanical Turk) is the real diners filling out comment cards. The only ground truth that matters is revenue, and revenue comes from humans, not metrics.
  • Amazon Bedrock Model Evaluation is the kitchen's built-in testing station — you push a button, the station runs the thermometer, the checklist, or the taste test.
  • Amazon SageMaker Clarify is the food-safety inspector checking for allergens and cross-contamination — bias, toxicity, stereotyping.
  • Amazon SageMaker Shadow testing is serving the new recipe to a dummy plate that never reaches the customer, while the old recipe still goes out. You compare the two silently before switching menus.

Analogy 3 — The insurance underwriter

Foundation model evaluation is how you underwrite risk before deploying a foundation model.

  • Perplexity is the actuarial model's baseline — how predictable is this claimant's behavior? Low perplexity equals a well-calibrated language model.
  • ROUGE, BLEU, BERTScore are the three medical tests the underwriter runs on a new applicant. Each test measures a different organ; you pick the test that matches the policy.
  • Benchmarks (MMLU, HellaSwag, HumanEval) are the credit score — a composite number that the industry already trusts.
  • Bias evaluation via Amazon SageMaker Clarify is the fair-lending audit — does the model treat demographic groups equitably? Required by regulators, required by AIF-C01 Domain 5.
  • A/B testing with Amazon SageMaker Shadow and Amazon Bedrock Provisioned Throughput is the reinsurance layer — you hold back real premium exposure until the new policy proves itself on shadow traffic.
  • Human evaluation (Ground Truth Plus, Mechanical Turk) is the final claims adjuster — the human who signs off on the payout. Expensive, slow, but legally defensible.

Keep these three pictures — open-book exam, restaurant kitchen, insurance underwriter — in your head and every foundation model evaluation question on AIF-C01 becomes a matching exercise.

Core Principles of Foundation Model Evaluation

Foundation model evaluation follows four principles. First, the metric must match the task; a translation metric applied to a summary is worse than no metric at all. Second, the evaluation set must be held out and representative; foundation model evaluation on the training distribution overstates quality. Third, foundation model evaluation must be reproducible; same prompts, same decoding parameters, same seed, same score. Fourth, foundation model evaluation must include safety, bias, and toxicity checks alongside quality metrics — AIF-C01 Domain 5 explicitly tests this.

The task-to-metric map you must memorize

  • Summarization (news, tickets, meeting notes) → ROUGE (primary), BERTScore (secondary).
  • Translation (language A to language B) → BLEU (primary), chrF or BERTScore (secondary).
  • Semantic similarity / paraphrase quality / open-ended Q&A → BERTScore (primary), embedding cosine (secondary).
  • Language modeling fluency / base model health → Perplexity.
  • General knowledge / reasoning / multi-subject → MMLU.
  • Commonsense reasoning / sentence completion → HellaSwag.
  • Code generation / Python function synthesis → HumanEval, MBPP, CodeXGLUE.
  • Open-ended chat quality / helpfulness → LLM-as-judge (MT-Bench style) plus human evaluation.
  • Bias, toxicity, stereotyping → Amazon SageMaker Clarify foundation model evaluation, BBQ, CrowS-Pairs, RealToxicityPrompts.

Foundation model evaluation is the systematic measurement of a foundation model's output quality, safety, and fairness against task-specific ground truth or reference data, using automatic metrics, public benchmarks, LLM-as-judge, or human reviewers. On AWS, foundation model evaluation is delivered primarily through Amazon Bedrock Model Evaluation jobs and Amazon SageMaker Clarify foundation model evaluation.

Automatic Metrics — ROUGE vs BLEU vs BERTScore vs Perplexity

Automatic foundation model evaluation metrics are the first line of defense. They are cheap, fast, reproducible, and scriptable, which makes them ideal for nightly foundation model evaluation pipelines and for Amazon Bedrock Model Evaluation automatic jobs. The trap on AIF-C01 is that the four headline metrics — ROUGE, BLEU, BERTScore, and perplexity — sound interchangeable and are not.

ROUGE — Recall-Oriented Understudy for Gisting Evaluation

ROUGE was introduced by Chin-Yew Lin in 2004 specifically to evaluate automatic summarization. ROUGE measures the overlap of n-grams, word sequences, or word pairs between a candidate summary (the foundation model's output) and one or more human reference summaries. The main ROUGE variants in foundation model evaluation are ROUGE-N (n-gram overlap, typically ROUGE-1 and ROUGE-2), ROUGE-L (longest common subsequence), and ROUGE-Lsum (sentence-level LCS for multi-sentence summaries). ROUGE is recall-oriented: it asks "how many of the reference summary's phrases did the candidate cover?"

Use ROUGE in foundation model evaluation whenever the task is summarization — news digests, support-ticket summaries, meeting minutes, or legal brief condensations. ROUGE is the default summarization metric in Amazon Bedrock Model Evaluation automatic jobs for the text-summarization task type.

ROUGE limitations matter on AIF-C01. ROUGE rewards word overlap, not meaning. A foundation model that paraphrases correctly can score lower than a model that copies reference phrases. Combine ROUGE with BERTScore whenever paraphrase quality matters.
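The overlap computation behind ROUGE-N can be sketched in a few lines. This is a toy version (single reference, whitespace tokenization, no stemming or ROUGE-L); real evaluations should use a maintained implementation such as the rouge-score package.

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Toy ROUGE-N: clipped n-gram overlap between a candidate summary
    and one reference summary. Illustration only, not a full ROUGE."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())           # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)   # ROUGE is recall-oriented
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
```

Note how the paraphrase penalty shows up: "sat" vs "lay" costs a full unigram of recall even though the meaning barely changes, which is exactly why BERTScore is layered on top.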

BLEU — Bilingual Evaluation Understudy

BLEU was introduced by Papineni and colleagues at IBM in 2002 to evaluate machine translation. BLEU is precision-oriented: it asks "how many of the candidate translation's n-grams appear in the reference translation?" BLEU computes precision over 1-grams through 4-grams, applies a brevity penalty to punish too-short outputs, and produces a single score between 0 and 1 (often reported as 0 to 100).

Use BLEU in foundation model evaluation whenever the task is translation — product descriptions translated to Spanish, Japanese user manuals, multilingual chatbot outputs. BLEU is the default translation metric in Amazon Bedrock Model Evaluation automatic jobs for the text-generation task type when the reference is a target-language translation.

BLEU limitations also matter. BLEU penalizes legitimate paraphrases and synonyms. BLEU cannot judge semantic adequacy. Human translators routinely score below 50 on BLEU for creative translations that are objectively correct. Pair BLEU with BERTScore or human evaluation whenever style or creativity matters.
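The mechanics (clipped n-gram precision plus brevity penalty) can be sketched as a toy sentence-level BLEU. Production pipelines should use sacreBLEU, which standardizes tokenization and smoothing; this sketch omits both.

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions (1..max_n) times a brevity penalty. No smoothing."""
    cand, ref = candidate.lower().split(), reference.lower().split()

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precisions = []
    for n in range(1, max_n + 1):
        c, r = ngram_counts(cand, n), ngram_counts(ref, n)
        matched = sum((c & r).values())             # clipped matches
        total = max(sum(c.values()), 1)
        if matched == 0:                            # any zero precision -> BLEU 0
            return 0.0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

The hard zero on any missing n-gram order is why unsmoothed sentence-level BLEU is so harsh on short or creative outputs; corpus-level BLEU and smoothed variants soften this.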

BERTScore — Semantic Similarity via BERT Embeddings

BERTScore, introduced by Zhang and colleagues in 2020, replaces n-gram overlap with contextual embedding similarity. BERTScore tokenizes both the candidate and the reference, runs each token through a BERT-family encoder, and computes cosine similarity between every candidate token and every reference token. BERTScore reports precision, recall, and F1 analogous to ROUGE but grounded in semantic meaning rather than surface form.

Use BERTScore in foundation model evaluation whenever paraphrase quality, semantic similarity, or open-ended question answering matters — chatbot responses, RAG-generated answers, paraphrase generation, or abstractive summarization where the foundation model rewords rather than copies. BERTScore is the default semantic-robustness metric in Amazon Bedrock Model Evaluation automatic jobs for question-answering task types.

BERTScore limitations: BERTScore depends on the encoder's training distribution (English-heavy unless you pick a multilingual encoder), it is more expensive to compute than ROUGE or BLEU, and it cannot detect factual errors — two sentences can be semantically similar and both wrong.
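The greedy-matching step at the heart of BERTScore can be sketched over plain vectors. In the real metric the vectors come from a BERT-family encoder and are optionally IDF-weighted; here they are toy inputs, so this shows only the matching arithmetic, not a usable BERTScore.

```python
import math

def bertscore_f1(cand_vecs, ref_vecs):
    """Greedy-matching F1 as in BERTScore: each token is matched to its
    most cosine-similar token on the other side. Inputs are lists of
    token embedding vectors (toy stand-ins for contextual embeddings)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # Precision: best reference match for every candidate token.
    p = sum(max(cos(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: best candidate match for every reference token.
    r = sum(max(cos(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * p * r / (p + r)
```

Because matching happens in embedding space, a paraphrase whose tokens sit near the reference tokens scores high even with zero n-gram overlap, which is precisely what ROUGE and BLEU miss.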

The ROUGE vs BLEU vs BERTScore cheat sheet

  • Summarization → ROUGE first, BERTScore second.
  • Translation → BLEU first, BERTScore second.
  • Semantic similarity / paraphrase / open-ended Q&A → BERTScore first.
  • Anything requiring factual grounding → automatic metrics plus human evaluation or LLM-as-judge.

The AIF-C01 exam loves scenarios that list a task and four metric options. Memorize the pairing: summarization equals ROUGE, translation equals BLEU, semantic similarity equals BERTScore. If the scenario adds "the team also wants to catch paraphrase quality," layer BERTScore on top of ROUGE or BLEU. Never answer BLEU for a summarization question; never answer ROUGE for a translation question.

Perplexity — Intrinsic Language Model Quality

Perplexity is the exponentiated average negative log-likelihood of a held-out text under the foundation model's probability distribution. In plain English, perplexity measures how surprised the foundation model is by natural text. Lower perplexity means better language modeling; a perplexity of 10 means the model is on average choosing among 10 equally likely next tokens.

Perplexity in foundation model evaluation is intrinsic — it does not need a reference answer, it only needs held-out text. That makes perplexity perfect for monitoring base-model fluency, comparing pretraining checkpoints, or detecting distribution drift. Perplexity is not useful for ranking task-specific fine-tunes because a model can be highly fluent yet unhelpful.

Perplexity is the foundation model evaluation metric you report to the pretraining team, not the product manager. For task-specific foundation model evaluation, switch to ROUGE, BLEU, BERTScore, or a benchmark. AIF-C01 questions that mention "measuring language model fluency on held-out text" point to perplexity.
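The definition above reduces to one line of arithmetic once you have per-token log-probabilities (many inference APIs can return them):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over held-out tokens.
    token_logprobs are natural-log probabilities the model assigned to
    each actual next token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that always gives the true next token probability 0.1 has
# perplexity 10: on average it is "choosing among 10 equally likely tokens".
ppl = perplexity([math.log(0.1)] * 5)
```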

Other automatic metrics you should recognize

  • Exact Match (EM) — percent of outputs identical to the reference. Used for extractive QA.
  • F1 over tokens — token-level precision and recall. Used for extractive QA and NER.
  • chrF / chrF++ — character-level F-score. A modern alternative to BLEU for morphologically rich languages.
  • METEOR — word alignment with synonyms and stems. Classic translation metric.
  • Accuracy / Robustness / Toxicity — first-class metrics in Amazon Bedrock Model Evaluation automatic jobs.
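The first two items, Exact Match and token-level F1, are simple enough to sketch directly. This follows the SQuAD-style convention (lowercasing, whitespace tokens); real harnesses also strip punctuation and articles.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> int:
    """EM after light normalization (strip, lowercase)."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 for extractive QA: overlap of token multisets."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

EM is binary and unforgiving ("Paris" vs "Paris, France" scores 0), which is why token F1 is reported alongside it for extractive QA.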

Task-Specific Benchmarks — MMLU, HellaSwag, HumanEval

Benchmarks are the public leaderboards that rank foundation models. Amazon Bedrock, Anthropic, Meta, and OpenAI all publish benchmark scores, and AIF-C01 expects you to recognize the top three by name.

MMLU — Massive Multitask Language Understanding

MMLU, introduced by Hendrycks and colleagues in 2021, is a 57-subject multiple-choice benchmark covering STEM, humanities, social sciences, medicine, law, and professional exams. MMLU tests general knowledge and reasoning at high-school to expert level. Scores range from 25 percent (random chance) to 100 percent. Modern frontier foundation models score in the high 80s to low 90s.

Use MMLU in foundation model evaluation whenever you need a single number that summarizes broad reasoning ability. MMLU is the headline benchmark on every foundation model release, including Amazon Bedrock-hosted models.

HellaSwag — Commonsense Sentence Completion

HellaSwag, introduced by Zellers and colleagues in 2019, is a commonsense reasoning benchmark where the foundation model picks the most plausible continuation of a short scenario from four choices. HellaSwag is adversarially filtered so that previous-generation language models scored near chance while humans score above 95 percent. Modern foundation models have saturated HellaSwag, but it remains a useful regression test.

HumanEval — Code Generation

HumanEval, introduced by Chen and colleagues at OpenAI in 2021, is a 164-problem Python function-synthesis benchmark. Each problem gives the foundation model a function signature and docstring; the model must write the body; the model's output is executed against unit tests. HumanEval reports pass@1, pass@10, or pass@100. HumanEval is the default code-generation benchmark for every foundation model release.

Use HumanEval in foundation model evaluation whenever the task is code generation — GitHub Copilot-style assistants, Amazon Q Developer, or custom coding chatbots built on Amazon Bedrock.
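The pass@k numbers reported for HumanEval are usually computed with the unbiased estimator from the Chen et al. paper: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k draws would pass.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = samples generated, c = samples that passed the unit tests."""
    if n - c < k:
        return 1.0          # too few failures to fill k draws: certain pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Sampling n > k and averaging this estimator over problems gives much lower variance than literally drawing k samples per problem.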

Other benchmarks AIF-C01 may mention

  • TruthfulQA — measures whether a foundation model repeats common misconceptions.
  • BIG-bench — 200+ diverse tasks, used for broad stress tests.
  • ARC / ARC-Challenge — grade-school science questions.
  • GSM8K / MATH — grade-school and competition math word problems.
  • MBPP — Mostly Basic Python Problems, a simpler sibling of HumanEval.
  • BBQ / CrowS-Pairs / RealToxicityPrompts — bias and toxicity benchmarks used by Amazon SageMaker Clarify foundation model evaluation.

MMLU for general knowledge and reasoning. HellaSwag for commonsense sentence completion. HumanEval for Python code generation. Every AIF-C01 benchmark question reduces to matching one of these three names to the scenario's task.

LLM-as-Judge — Using a Stronger Model to Grade a Weaker Model

LLM-as-judge is the foundation model evaluation pattern where a large, high-quality foundation model scores the outputs of another foundation model. Zheng and colleagues formalized the pattern in 2023 with MT-Bench and Chatbot Arena. LLM-as-judge scales dramatically better than human evaluation — a judge model can review tens of thousands of outputs per hour at a fraction of human cost — while correlating 80 percent or better with expert human ratings on many tasks.

Common LLM-as-judge patterns

  • Single-answer grading — the judge scores one candidate output on a 1-to-10 rubric.
  • Pairwise comparison — the judge picks the better of two candidate outputs (A vs B), often used in Chatbot Arena-style Elo rankings.
  • Reference-based grading — the judge compares the candidate to a gold answer and returns a similarity or correctness score.
  • Rubric-based grading — the judge evaluates multiple dimensions (helpfulness, harmlessness, factuality, tone) and returns a structured JSON score.

Amazon Bedrock supports LLM-as-judge foundation model evaluation natively. You can spin up an Amazon Bedrock Model Evaluation job that uses Claude or another Bedrock-hosted foundation model as the judge, point it at your candidate model's outputs, and receive structured scores in Amazon S3.

LLM-as-judge pitfalls

  • Position bias — judges prefer the first option in pairwise comparisons. Mitigate by randomizing order.
  • Length bias — judges prefer longer answers. Mitigate by instructing the judge to ignore length.
  • Self-preference bias — a judge from the same model family favors its own family's outputs. Mitigate by using a different family as judge.
  • Rubric drift — judges re-interpret the rubric across runs. Mitigate by providing few-shot examples.

AIF-C01 questions sometimes trick you by equating LLM-as-judge with human evaluation. LLM-as-judge is automatic — it is fast, cheap, scalable, but it inherits biases from the judge model. Human evaluation via Amazon SageMaker Ground Truth Plus or Amazon Mechanical Turk remains the gold standard when regulators, brand safety, or high-stakes domains are involved. If the scenario says "regulated industry," "medical advice," or "legal documents," pick human evaluation, not LLM-as-judge.

Human Evaluation — Ground Truth, Ground Truth Plus, Mechanical Turk

Human foundation model evaluation is the only foundation model evaluation family that catches nuance, brand voice, cultural fit, regulatory compliance, and subtle factual errors. AWS offers three human-labeling workforces that plug directly into foundation model evaluation workflows.

Amazon SageMaker Ground Truth — Bring your own workforce

Amazon SageMaker Ground Truth is the AWS service for data labeling and human foundation model evaluation using your own private workforce, an AWS Marketplace vendor, or Amazon Mechanical Turk. You design the labeling UI, you define the rubric, you pay per task. Ground Truth is ideal when the foundation model evaluation workforce needs domain expertise — medical reviewers, legal reviewers, fluent Japanese speakers — that a generic crowdsourcing pool lacks.

Amazon SageMaker Ground Truth Plus — AWS-managed workforce

Amazon SageMaker Ground Truth Plus is the fully managed variant where AWS professional services handle the workforce, quality control, and project management. Ground Truth Plus is the right choice when the customer lacks an internal labeling team and wants a single SLA-backed deliverable. Ground Truth Plus supports generative AI foundation model evaluation out of the box — reviewers rank model outputs, score rubrics, or provide reference answers.

Amazon Mechanical Turk — On-demand crowdsourcing

Amazon Mechanical Turk is the original on-demand human workforce for micro-tasks. Mechanical Turk is cheap and fast but has no domain guarantees, so it fits best for high-volume, low-stakes foundation model evaluation tasks — "is this output polite?" or "which of these two summaries is easier to read?"

Amazon Bedrock human model evaluation jobs

Amazon Bedrock Model Evaluation human jobs let you pick a workforce (your own via SageMaker Ground Truth, or an AWS-managed workforce), define rating scales or comparison rubrics, upload prompt datasets, and receive aggregated foundation model evaluation scores in Amazon S3. Human foundation model evaluation jobs in Amazon Bedrock are the AIF-C01-correct answer whenever the scenario mentions "human reviewers," "brand voice," "subjective quality," or "regulated content."

Whenever AIF-C01 describes a foundation model evaluation scenario with the words "brand voice," "tone," "cultural appropriateness," "helpfulness to end users," or "medical / legal / financial review," the correct answer is a human foundation model evaluation job — Amazon Bedrock Model Evaluation human job with Ground Truth Plus, or Amazon SageMaker Ground Truth with a private workforce. Automatic metrics cannot measure subjective quality.

Amazon Bedrock Model Evaluation — Automatic vs Human Jobs

Amazon Bedrock Model Evaluation is the AWS-native foundation model evaluation service. It runs in two modes: automatic foundation model evaluation jobs and human foundation model evaluation jobs.

Automatic Bedrock Model Evaluation jobs

An automatic Amazon Bedrock Model Evaluation job selects a foundation model, a task type (text summarization, question answering, text classification, open-ended text generation), a built-in or custom prompt dataset, and one or more built-in metrics (accuracy, robustness, toxicity, and task-specific metrics such as ROUGE for summarization or BERTScore for QA). The job runs headlessly, consumes the foundation model via on-demand or provisioned throughput, writes results to Amazon S3, and renders a scorecard in the Amazon Bedrock console.

Automatic Amazon Bedrock Model Evaluation jobs are the right starting point for 90 percent of foundation model evaluation work. They are cheap (you pay only for model inference plus a small orchestration fee), fast (hours, not days), and reproducible.
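A job like the one described above is submitted as a structured request. The sketch below builds the request body for the boto3 `create_evaluation_job` call; treat the exact field names, built-in dataset and metric names, and the model identifier as assumptions to verify against the current Amazon Bedrock API documentation, and the ARNs as placeholders.

```python
# Hypothetical request for an automatic summarization evaluation job.
job_request = {
    "jobName": "summarization-eval-claude",                      # placeholder
    "roleArn": "arn:aws:iam::123456789012:role/BedrockEvalRole", # placeholder
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Summarization",
                "dataset": {"name": "Builtin.Gigaword"},          # assumed name
                "metricNames": ["Builtin.ROUGE", "Builtin.Toxicity"],
            }]
        }
    },
    "inferenceConfig": {
        "models": [{"bedrockModel": {
            "modelIdentifier": "anthropic.claude-3-sonnet-20240229-v1:0"}}]
    },
    "outputDataConfig": {"s3Uri": "s3://my-eval-results/summarization/"},
}

# Submitting it requires AWS credentials and permissions:
# import boto3
# bedrock = boto3.client("bedrock")
# response = bedrock.create_evaluation_job(**job_request)
```

The key structural point for the exam: "automated" vs "humanWorkflowConfig"-style sections is what distinguishes automatic from human jobs, and results always land in the S3 location you configure.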

Human Bedrock Model Evaluation jobs

A human Amazon Bedrock Model Evaluation job adds reviewers on top. You pick up to two foundation models for head-to-head comparison, define rating instructions, pick a workforce (your own via Amazon SageMaker Ground Truth, or AWS-managed), and ship prompts to reviewers. Reviewers score outputs per rubric — thumbs up / down, Likert scale, or pairwise preference — and the job aggregates results into an average score per metric per model.

Built-in metrics inside Amazon Bedrock Model Evaluation

  • Accuracy — task-specific correctness (exact match, F1, or BERTScore depending on the task).
  • Robustness — output stability when prompts are perturbed (typos, casing, synonyms).
  • Toxicity — harmful, profane, or offensive content rate.
  • Task-specific summarization / QA / generation metrics — ROUGE, BERTScore, and others.

The cost-optimal foundation model evaluation pattern is: run an automatic Amazon Bedrock Model Evaluation job first to filter out obvious losers, then run a human Amazon Bedrock Model Evaluation job only on the top two or three candidates. This keeps human reviewer cost below USD 5,000 while still giving you regulator-grade foundation model evaluation results.
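The cascade above is just a filter-then-rank step. A minimal sketch, with hypothetical model names and scores:

```python
def cascade_shortlist(candidates, auto_scores, top_k=2, floor=0.30):
    """Cascade evaluation: drop models below a cheap automatic-metric
    floor, then send only the top_k survivors to an expensive human job.
    auto_scores maps model name -> automatic score (e.g. ROUGE-L F1)."""
    survivors = [m for m in candidates if auto_scores.get(m, 0.0) >= floor]
    survivors.sort(key=lambda m: auto_scores[m], reverse=True)
    return survivors[:top_k]

finalists = cascade_shortlist(
    ["model-a", "model-b", "model-c", "model-d"],
    {"model-a": 0.42, "model-b": 0.25, "model-c": 0.38, "model-d": 0.35},
)
```

Here model-b falls below the floor and model-d loses the ranking, so only two candidates incur human-review cost.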

Amazon SageMaker Clarify Foundation Model Evaluation — Bias and Fairness

Amazon SageMaker Clarify extends classical ML fairness checks into foundation model evaluation territory. SageMaker Clarify foundation model evaluation measures bias, stereotyping, toxicity, and factual knowledge across Amazon Bedrock foundation models, Amazon SageMaker JumpStart foundation models, and custom endpoints.

What Clarify foundation model evaluation measures

  • Bias — disparate performance across demographic groups (BBQ-style prompts, CrowS-Pairs).
  • Stereotyping — does the foundation model produce stereotypical continuations for professions, genders, races, or nationalities?
  • Toxicity — RealToxicityPrompts-style completions scored by a toxicity classifier.
  • Factual knowledge — TriviaQA and similar factual benchmarks.
  • Semantic robustness — output stability under typos, casing, and paraphrase perturbations.

When to pick Clarify foundation model evaluation over Bedrock Model Evaluation

Amazon SageMaker Clarify foundation model evaluation is the right choice when the scenario emphasizes bias, fairness, regulatory compliance, or responsible AI. Amazon Bedrock Model Evaluation is the right choice when the scenario emphasizes task quality (summarization, translation, QA). The two services overlap on toxicity and accuracy; pick based on which sibling — Clarify or Bedrock — is mentioned by name in the scenario.

On AIF-C01 Domain 5 questions about bias, fairness, toxicity, or stereotyping, the correct foundation model evaluation answer is Amazon SageMaker Clarify. On Domain 3 questions about task quality (summarization, translation, code generation), the correct foundation model evaluation answer is Amazon Bedrock Model Evaluation. When both appear as options, read the scenario for the word "bias" or "fairness" to disambiguate.

Cost of Foundation Model Evaluation Runs

Foundation model evaluation is not free. Understanding the cost structure is a real AIF-C01 exam topic and a real production concern.

Automatic Bedrock Model Evaluation cost drivers

  • Foundation model inference — input tokens plus output tokens, priced per 1,000 tokens per model per Region. A 1,000-prompt automatic Amazon Bedrock Model Evaluation job on Claude Sonnet typically costs single-digit to low-double-digit USD.
  • Orchestration — a small per-job fee plus Amazon S3 storage.
  • Judge model inference — if LLM-as-judge is used, you pay a second inference bill for the judge.

Human Bedrock Model Evaluation cost drivers

  • Reviewer hourly rate — Amazon SageMaker Ground Truth Plus is billed per object per reviewer; a 1,000-prompt human foundation model evaluation job with three reviewers per prompt can reach USD 1,000 to USD 10,000 depending on task complexity.
  • Mechanical Turk — cheaper per task (USD 0.01 to USD 1.00) but no domain guarantees.
  • Private workforce — you pay employee time plus SageMaker Ground Truth per-object fees.

Cost-optimization patterns

  • Cascade evaluation — cheap automatic metrics first, expensive human evaluation only on survivors.
  • Sample size tuning — 200 to 500 prompts usually suffice for automatic foundation model evaluation; human evaluation can use smaller samples (50 to 200) with higher signal per prompt.
  • On-demand vs Provisioned Throughput — for one-off foundation model evaluation runs, on-demand Amazon Bedrock pricing is cheaper; for continuous foundation model evaluation in production, Provisioned Throughput amortizes better.

Always set an Amazon Bedrock Model Evaluation budget cap. A careless automatic foundation model evaluation job that iterates over 10,000 prompts across 5 foundation models on a top-tier model can hit USD 1,000 in an afternoon. Use Amazon CloudWatch alarms plus AWS Budgets to stop runaway foundation model evaluation costs.

A/B Testing Foundation Models in Production — Shadow, Canary, Provisioned Throughput

Offline foundation model evaluation tells you which foundation model is better on held-out data. Production A/B testing tells you which foundation model is better on real users. AWS gives you three production-grade A/B testing mechanisms.

Amazon SageMaker Shadow testing

Amazon SageMaker shadow testing lets you deploy a candidate foundation model endpoint that silently receives a copy of production traffic without returning responses to end users. The shadow endpoint's outputs and latency are logged for comparison against the production endpoint. Shadow testing is the safest way to evaluate a foundation model in production because it incurs zero customer-experience risk.

Shadow testing is ideal when the candidate model is hosted on Amazon SageMaker (JumpStart foundation models, fine-tuned Titan, self-deployed Llama, custom models). Shadow testing does not directly apply to Amazon Bedrock on-demand endpoints, but you can emulate it by duplicating prompts to two Amazon Bedrock foundation models and logging both responses.
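That emulation pattern (duplicate each prompt, serve only the production answer, log both) can be sketched as follows. The `invoke_prod`, `invoke_shadow`, and `log` callables are assumed stand-ins; in practice they would wrap `bedrock-runtime` invocations and a logging sink such as S3 or CloudWatch.

```python
import concurrent.futures

def shadow_compare(prompts, invoke_prod, invoke_shadow, log):
    """Emulated shadow test for Bedrock: call both models per prompt,
    return only the production answer, log both for offline comparison."""
    answers = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for prompt in prompts:
            prod_fut = pool.submit(invoke_prod, prompt)
            shadow_fut = pool.submit(invoke_shadow, prompt)
            prod_answer = prod_fut.result()          # users see only this
            log({"prompt": prompt, "prod": prod_answer,
                 "shadow": shadow_fut.result()})     # candidate stays silent
            answers.append(prod_answer)
    return answers

# Demo with stub callables standing in for two Bedrock models:
records = []
answers = shadow_compare(["hello"], lambda p: "prod:" + p,
                         lambda p: "cand:" + p, records.append)
```

Running both calls concurrently keeps the shadow invocation from adding user-visible latency beyond the slower of the two calls.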

Amazon Bedrock Provisioned Throughput for A/B

Amazon Bedrock Provisioned Throughput reserves dedicated inference capacity for a foundation model at a fixed hourly rate. For A/B testing, you can stand up two Provisioned Throughput allocations — one for the current production foundation model, one for the candidate — and route a percentage of traffic to each through your application logic. Provisioned Throughput gives you predictable latency and cost during the A/B window, which matters when foundation model evaluation results are latency-sensitive.
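The "route a percentage of traffic" application logic is a few lines of weighted random selection. A minimal sketch, where the model identifiers are hypothetical labels for the two Provisioned Throughput allocations:

```python
import random

def route(weights, rng=random.random):
    """Pick a model ID according to traffic weights,
    e.g. {'prod': 0.9, 'candidate': 0.1}."""
    r = rng() * sum(weights.values())
    cumulative = 0.0
    for model_id, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return model_id
    return model_id  # floating-point edge case: fall back to last model

# Simulate 10,000 requests through a 90/10 split
random.seed(0)
counts = {"prod": 0, "candidate": 0}
for _ in range(10_000):
    counts[route({"prod": 0.9, "candidate": 0.1})] += 1
```

Logging each request with the chosen model ID is what later lets you compute per-arm lift metrics from the A/B window.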

Canary and linear deployments via SageMaker endpoints

Amazon SageMaker endpoints support blue/green, canary, and linear traffic-shifting strategies. You deploy the new foundation model variant alongside the current variant, shift 10 percent of traffic to the new variant, monitor foundation model evaluation metrics (latency, error rate, business KPI), and ramp to 100 percent if metrics hold.
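The shape of a linear ramp is worth internalizing for the exam: a starting slice, fixed-size steps, and a final jump to full traffic. A toy sketch of the schedule (step sizes and the starting percentage are illustrative, not SageMaker defaults):

```python
def linear_shift_schedule(step_percent, start_percent=10):
    """Traffic percentages sent to the new variant at each linear step,
    ending at full rollout. Metrics are checked between steps."""
    schedule = []
    pct = start_percent
    while pct < 100:
        schedule.append(pct)
        pct += step_percent
    schedule.append(100)
    return schedule

print(linear_shift_schedule(30))  # four steps: 10 -> 40 -> 70 -> 100
```

A canary deployment is the degenerate case: one small step, a monitoring bake period, then 100 percent.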

Which A/B strategy fits which AIF-C01 scenario

  • "Zero customer-facing risk" → Amazon SageMaker Shadow testing.
  • "Gradual traffic shift with rollback" → SageMaker canary or linear deployment.
  • "Predictable cost and latency for a time-boxed A/B on foundation models" → Amazon Bedrock Provisioned Throughput.
  • "Compare two Amazon Bedrock foundation models offline before launch" → Amazon Bedrock Model Evaluation human job with two candidates.

Amazon SageMaker Shadow sends a copy of traffic to the candidate with zero user exposure. Canary shifts a small percentage of real users to the candidate. Amazon Bedrock Provisioned Throughput reserves capacity so that A/B latency and cost stay predictable. Three different production foundation model evaluation knobs for three different risk profiles.

End-to-End Foundation Model Evaluation Workflow on AWS

A production-grade foundation model evaluation workflow on AWS ties every piece together.

  1. Define the task — summarization, translation, QA, code generation, chat.
  2. Assemble a held-out prompt dataset — 200 to 2,000 prompts with references where applicable. Store in Amazon S3.
  3. Pick automatic metrics — ROUGE for summarization, BLEU for translation, BERTScore for semantic, perplexity for base-model health.
  4. Pick benchmarks — MMLU for reasoning, HellaSwag for commonsense, HumanEval for code.
  5. Run automatic Amazon Bedrock Model Evaluation jobs — iterate across candidate foundation models.
  6. Run Amazon SageMaker Clarify foundation model evaluation — bias, toxicity, stereotyping, factual knowledge.
  7. Run human Amazon Bedrock Model Evaluation jobs — top two or three candidates only, via Amazon SageMaker Ground Truth Plus or a private workforce.
  8. Pick a winner and deploy behind Amazon SageMaker Shadow testing or Amazon Bedrock Provisioned Throughput — validate on live traffic before full rollout.
  9. Monitor continuously — periodic foundation model evaluation jobs, drift detection, user feedback loops.
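Step 3's surface-overlap metrics are simple enough to compute by hand, which helps demystify what an automatic job reports. A minimal ROUGE-1 recall sketch; real evaluations use a library such as `rouge-score`, which also applies stemming and reports precision and F1:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Unigram recall: fraction of reference words that the
    candidate summary recovers (clipped by count)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

score = rouge1_recall("the model summarized the quarterly report",
                      "the model wrote a summary of the quarterly report")
```

Because it is recall-oriented, ROUGE rewards candidates that cover the reference's content, which is exactly the behavior you want when grading summaries.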

Common Exam Traps on Foundation Model Evaluation

AIF-C01 loves foundation model evaluation trick questions. Watch for these.

  • "Summarize customer feedback" with BLEU as an option — trap. Correct answer is ROUGE.
  • "Translate to Spanish" with ROUGE as an option — trap. Correct answer is BLEU.
  • "Measure semantic similarity of paraphrased answers" with ROUGE as an option — trap. Correct answer is BERTScore.
  • "Regulated medical content review" with LLM-as-judge as an option — trap. Correct answer is human evaluation via Amazon SageMaker Ground Truth Plus.
  • "Detect bias across demographic groups" with Amazon Bedrock Model Evaluation as an option — partial trap. Correct answer is Amazon SageMaker Clarify foundation model evaluation (Bedrock Model Evaluation covers toxicity but Clarify is the bias-first answer).
  • "Test a new foundation model in production with zero customer risk" → Amazon SageMaker Shadow testing. Not canary.
  • "Predictable cost for a time-boxed A/B window on Amazon Bedrock" → Amazon Bedrock Provisioned Throughput.

Amazon Bedrock Model Evaluation automatic jobs for summarization offer ROUGE, BERTScore, and several robustness metrics, but not BLEU as a primary metric. If a foundation model evaluation question about summarization lists BLEU as the first option, it is a distractor. Stick with ROUGE.

Frequently Asked Questions about Foundation Model Evaluation

Q1. When should I choose ROUGE vs BLEU vs BERTScore for foundation model evaluation?

Choose ROUGE when the task is summarization; ROUGE measures overlap with reference summaries and is recall-oriented. Choose BLEU when the task is translation; BLEU measures n-gram precision against a reference translation with a brevity penalty. Choose BERTScore when semantic similarity or paraphrase quality matters, such as open-ended question answering or abstractive summarization; BERTScore uses contextual embeddings to reward meaning rather than surface form. A common production foundation model evaluation pattern pairs one surface-form metric (ROUGE or BLEU) with BERTScore to catch both literal overlap and semantic adequacy.
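The recall-versus-precision distinction above can be made concrete with a toy unigram BLEU; real BLEU uses up to 4-grams and corpus-level statistics, but the clipped precision and brevity penalty are the essential parts:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Toy BLEU-1: clipped unigram precision times the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty punishes translations shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# Perfect precision, but heavily penalized for being too short
score = bleu1("the cat", "the cat sat on the mat")
```

Note the contrast with ROUGE: here a too-short candidate with perfect word overlap still scores low, because BLEU is precision-oriented with a length penalty, whereas a recall-oriented metric would instead punish missing reference content.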

Q2. What is the difference between Amazon Bedrock Model Evaluation and Amazon SageMaker Clarify foundation model evaluation?

Amazon Bedrock Model Evaluation focuses on task quality — accuracy, robustness, toxicity, ROUGE, BERTScore — for Amazon Bedrock-hosted foundation models, and it supports both automatic and human foundation model evaluation jobs. Amazon SageMaker Clarify foundation model evaluation focuses on responsible-AI dimensions — bias, stereotyping, factual knowledge, semantic robustness — across Amazon Bedrock, Amazon SageMaker JumpStart, and custom endpoints. Use Amazon Bedrock Model Evaluation as the default; layer Amazon SageMaker Clarify whenever bias, fairness, or regulatory compliance is part of the foundation model evaluation requirement.

Q3. How much does a typical Amazon Bedrock Model Evaluation automatic job cost?

Cost depends on the foundation model, the prompt count, and the token length. A representative automatic Amazon Bedrock Model Evaluation job — 500 prompts, average 1,000 input tokens and 500 output tokens, on Claude Sonnet — runs in the single-digit USD range. Scaling to 5,000 prompts across 3 foundation models with a judge model layered on top can reach USD 200 to USD 500. Human foundation model evaluation via Amazon SageMaker Ground Truth Plus typically adds USD 1 to USD 10 per reviewed prompt depending on rubric complexity.

Q4. Is LLM-as-judge acceptable for regulated-industry foundation model evaluation?

LLM-as-judge is acceptable as a first-pass filter in regulated industries but is not sufficient as the final foundation model evaluation signal for decisions that affect health, finance, legal outcomes, or safety. Regulators expect human foundation model evaluation with documented rubrics, inter-rater agreement, and audit trails. On AIF-C01, the exam treats LLM-as-judge and human evaluation as distinct tiers — pick human evaluation via Amazon SageMaker Ground Truth Plus or Amazon Bedrock human model evaluation jobs whenever the scenario mentions regulated industries, high-stakes content, or brand-critical output.

Q5. How do MMLU, HellaSwag, and HumanEval differ, and which should I report to stakeholders?

MMLU measures broad knowledge and reasoning across 57 subjects with multiple-choice questions; report MMLU to stakeholders who care about general intelligence. HellaSwag measures commonsense sentence completion; report HellaSwag when grounded everyday reasoning matters. HumanEval measures Python code generation with executable unit tests; report HumanEval to developer-experience stakeholders. A balanced foundation model evaluation scorecard for an enterprise foundation model rollout usually shows MMLU plus HumanEval plus at least one safety benchmark such as TruthfulQA or a Clarify bias score.

Q6. Can I A/B test two Amazon Bedrock foundation models without Amazon SageMaker endpoints?

Yes. For offline foundation model evaluation, run an Amazon Bedrock Model Evaluation human job with two foundation models selected — Bedrock aggregates pairwise preferences natively. For online production A/B, route a percentage of application traffic between two Amazon Bedrock Provisioned Throughput allocations from your application code or an Amazon API Gateway layer, log responses, and compute lift metrics. Amazon SageMaker Shadow testing is unnecessary when both candidates are Amazon Bedrock-hosted; SageMaker Shadow is for SageMaker-hosted foundation model endpoints.

Q7. What sample size do I need for reliable foundation model evaluation?

For automatic foundation model evaluation metrics on a homogeneous task, 200 to 500 prompts usually give statistically stable ROUGE, BLEU, or BERTScore numbers. For benchmark-based foundation model evaluation, use the full benchmark — MMLU has 14,042 questions, HellaSwag has 10,042, HumanEval has 164 — because benchmarks are reported at full size by convention. For human foundation model evaluation, 50 to 200 prompts with three reviewers per prompt often suffice; inter-rater agreement (Krippendorff's alpha or Cohen's kappa) tells you whether to expand the sample.
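Cohen's kappa for two reviewers is straightforward to compute. A minimal sketch for categorical labels (it assumes chance agreement is below 1, i.e. the raters are not both using a single label for everything):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement.
    Returns 1.0 for perfect agreement, 0.0 for chance-level agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad",  "bad", "good", "bad", "good"]
kappa = cohens_kappa(a, b)
```

A common rule of thumb treats kappa above roughly 0.6 as substantial agreement; lower values suggest the rubric is ambiguous or the sample needs to grow before the human evaluation result is trusted.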

Summary of Foundation Model Evaluation for AIF-C01

Foundation model evaluation is a layered discipline. Start with automatic metrics — ROUGE for summarization, BLEU for translation, BERTScore for semantic similarity, perplexity for base-model fluency. Layer in benchmarks — MMLU for general reasoning, HellaSwag for commonsense, HumanEval for code. Add LLM-as-judge when scale matters and human evaluation when stakes matter. Operate foundation model evaluation on AWS through Amazon Bedrock Model Evaluation automatic and human jobs, Amazon SageMaker Clarify foundation model evaluation for bias and fairness, and Amazon SageMaker Ground Truth Plus or Amazon Mechanical Turk for human workforces. Budget foundation model evaluation runs carefully — automatic jobs are cheap, human jobs are not. Promote foundation model evaluation winners to production through Amazon SageMaker Shadow testing or Amazon Bedrock Provisioned Throughput before full rollout.

Memorize the three core pairings for AIF-C01: summarization equals ROUGE, translation equals BLEU, semantic similarity equals BERTScore. Memorize the three core benchmarks: MMLU, HellaSwag, HumanEval. Memorize the three core AWS foundation model evaluation services: Amazon Bedrock Model Evaluation, Amazon SageMaker Clarify, Amazon SageMaker Ground Truth Plus. With those nine nouns anchored, every AIF-C01 foundation model evaluation question becomes a matching exercise.

Official Sources