examhub.cc · Pass the most valuable certifications in the most efficient way
Vol. I

AI Data Governance and PII Handling

5,820 words · about 30 minutes reading time

AI data governance and PII handling on AWS decides whether a generative AI project ships or gets blocked by legal review. Task 5.2 of the AIF-C01 exam requires you to recognize how AWS services — Amazon Macie, Amazon Comprehend, AWS Glue Data Catalog, AWS Lake Formation, Amazon DataZone, Amazon Bedrock data policy, and regional controls — combine into an end-to-end AI data governance workflow that tracks lineage, respects consent and licensing, scrubs personally identifiable information (PII), and honors data residency. Because training data touches everything downstream — model behavior, legal liability, and regulatory exposure — AI data governance is the first foundation you audit before any model is trained or deployed.

What is AI Data Governance and PII Handling on AWS?

AI data governance on AWS is the coordinated set of policies, services, and workflows that control how data enters, moves through, and exits machine learning and generative AI pipelines. PII handling is a subset of AI data governance focused on detecting, classifying, redacting, and tracking personal data. Together they answer four questions an auditor or regulator will always ask: Where did the training data come from? Do we have permission to use it? Does it contain personal information? Can we prove where it lives today?

At the foundational AIF-C01 level, AI data governance is not a programming task. It is a service-recognition task. Exam items take the form "A company needs to X — which AWS service handles it?" If you can match training-data lineage to AWS Glue Data Catalog, large-scale S3 PII discovery to Amazon Macie, prompt-time PII redaction to Amazon Comprehend, cross-team data access governance to Amazon DataZone, and fine-grained table access to AWS Lake Formation, you can answer nearly every 5.2 item on AI data governance correctly.
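
That service-matching skill can be drilled as a simple lookup table. The sketch below is a study aid only, pairing the scenario phrases from this paragraph with their services; it is not an AWS API or an official mapping:

```python
# Study aid: recurring AIF-C01 Task 5.2 scenario phrases mapped to the
# AWS service that answers them. Illustrative pairing, not an AWS artifact.
GOVERNANCE_SERVICE_MAP = {
    "training-data lineage and schema metadata": "AWS Glue Data Catalog",
    "large-scale S3 PII discovery": "Amazon Macie",
    "prompt-time PII detection and redaction": "Amazon Comprehend",
    "cross-team data discovery and subscription workflow": "Amazon DataZone",
    "row- and column-level table permissions": "AWS Lake Formation",
}

def pick_service(scenario: str) -> str:
    """Return the service matching a scenario phrase, or a prompt to re-read."""
    return GOVERNANCE_SERVICE_MAP.get(scenario, "re-read the scenario")

print(pick_service("large-scale S3 PII discovery"))  # Amazon Macie
```

If you can reproduce this table from memory, you have the core of Task 5.2.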

Why AI data governance matters for AIF-C01

Domain 5 ("Security, Compliance, and Governance for AI Solutions") carries 14% of AIF-C01 weight, and Task 5.2 is the governance-specific slice. AI data governance questions are disproportionately scenario-heavy — they describe a team, a dataset, and a compliance constraint, then ask which AWS service moves the problem forward. Learning the small vocabulary of AI data governance services pays off on both Task 5.2 items and on cross-domain Task 4.1 responsible AI items, because responsible AI assumes the data underneath was governed first.

Scope of this topic versus adjacent topics

AI data governance is distinct from IAM access control (covered in iam-and-bedrock-security), from output filtering via Bedrock Guardrails (covered in bedrock-guardrails-and-controls), and from broader compliance frameworks like the EU AI Act (covered in security-compliance-governance-ai). In this topic the focus is training-data provenance, lineage, PII classification, regional data sovereignty, and the specific data policies of Amazon Bedrock. Keep that mental fence up while reading scenarios.

The "would-you-train-on-this?" audit

Every AI data governance program eventually boils down to a single question you should ask before training or fine-tuning: would I be comfortable if every row of this dataset appeared on the front page of a newspaper tomorrow, attributed to me? If the answer is no — because the data contains customer PII, scraped copyrighted text, or third-party records used without consent — then the AI data governance controls are what stop the pipeline before it generates legal liability. Macie, Comprehend PII, Data Catalog lineage, and DataZone access reviews exist so that this question has an auditable answer, not a gut feeling.

AI Data Governance and PII Handling in Plain Language

Formal AI data governance prose hides how intuitive the concepts really are. Three analogies from different categories help cement the mental model for the AIF-C01 exam.

Analogy 1 — The commercial kitchen and the ingredient traceability binder

Picture a commercial kitchen that prepares meals for a hospital. Every ingredient arriving at the loading dock has a label — which farm it came from, when it was harvested, whether it contains common allergens, and who inspected it. The head chef keeps a binder on the wall listing every ingredient currently in the kitchen, what dish it is about to go into, and which supplier it came from. That binder is AWS Glue Data Catalog and AWS Lake Formation — the single source of truth about every dataset, its schema, and who is allowed to touch it. The allergen sticker on peanut jars is Amazon Macie — a dedicated scanner that finds potentially dangerous ingredients (PII) in the S3 pantry without the chef reading every label manually. Before a meal leaves the kitchen, a food-safety officer tastes it and pulls anything that slipped through — that is Amazon Comprehend PII detection running on prompts and outputs, redacting names and social security numbers before the food reaches the patient. A hospital inspector might later review the binder to confirm every ingredient had consent, a license, and was harvested in an approved region — that is an AI data governance audit. Mix up the labels, and patients get sick. Skip the binder, and regulators close you down.

Analogy 2 — The library special-collections room

Imagine the rare-book floor of a national library. Most books are free to browse in the open stacks, but the special-collections room holds manuscripts that can only be read by researchers who submit a request, sign an agreement, and wear white gloves. Amazon DataZone is the librarian behind the request desk: data producers publish "assets" into a catalog, consumers request access, and DataZone routes the request to the data owner for approval — with the whole audit trail preserved. AWS Lake Formation is the set of glass display cases that enforce row-level and column-level permissions on which manuscripts (rows, columns) each researcher can actually see even after they have been admitted. AWS Glue Data Catalog is the master card index describing every book on every floor, updated whenever a new manuscript arrives. Amazon Macie is the conservation team that periodically walks the stacks with ultraviolet scanners looking for fragile or dangerous manuscripts that need special handling (PII in S3). The EU-only reading room is a separate physical building on European soil — that is regional data sovereignty enforced by AWS Region selection and SCPs. Training an AI model is a researcher checking out a stack of manuscripts to write a book; AI data governance is every policy that decides which manuscripts they may take, which they may never take, and what the finished book is allowed to say about them.

Analogy 3 — The postal system and sealed diplomatic pouches

Think of AI data governance as an international postal system. Every package (dataset) has a declaration form listing origin, contents, and legal purpose — that is data lineage in the Glue Data Catalog. Customs officers X-ray every package looking for prohibited items — that is Amazon Macie scanning S3 buckets for PII at scale. A rule printed on the package says "contents may not leave the European Union" — that is regional data sovereignty, enforced by choosing eu-central-1 and applying SCPs. Diplomatic pouches have a special seal promising their contents are never opened by the postal service itself — that is the Amazon Bedrock data policy, under which AWS does not use your prompts or completions to train provider foundation models. Some pouches travel through a private diplomatic courier network that never touches public roads — that is Amazon Bedrock via AWS PrivateLink, keeping traffic off the internet. The fine print of each embassy's treaty differs — that is the per-provider fine print you must read on Anthropic, Meta, Mistral, Cohere, and Amazon Titan terms inside Bedrock. The postal system itself is trustworthy; what you must still audit is the treaty text.

Core Principles of AI Data Governance on AWS

AI data governance on AWS rests on five interacting principles: provenance, consent, classification, residency, and vendor data policy. Each principle maps to a recognizable set of AWS services that the AIF-C01 exam expects you to distinguish.

Provenance — where did the data come from?

Every training dataset must have a documented source. AWS Glue Data Catalog captures schema, source system, and crawled metadata. AWS Lake Formation tracks who registered the data and when. Amazon DataZone adds a business glossary and a publishing workflow so that "source" is not just an S3 URI but a named business asset with an owner.

Consent — do we have permission to use the data?

Provenance alone is not enough. AI data governance must capture whether the data subject consented to AI training use, whether the dataset license permits derivative model training (many open datasets do not), and whether intellectual property rights were cleared. AWS does not parse contracts for you, but DataZone business metadata fields and Lake Formation resource tags let you encode "consent_status", "license_type", and "ip_cleared_on" as first-class attributes that downstream SageMaker jobs and Bedrock fine-tuning pipelines can check.

Classification — is there PII, PHI, or other sensitive content?

Amazon Macie classifies S3 objects for personal data using managed identifiers (names, emails, SSNs, credit card numbers, AWS keys, and more) and custom identifiers. Amazon Comprehend detects PII entities in free text — useful at prompt time and output time in live Bedrock applications. Classification is what converts a vague "sensitive data" concern into an enforceable policy.

Residency — where does the data physically live?

AI data governance inherits AWS's Region-based residency model. EU training data stays in EU Regions; you can enforce that with SCPs, AWS Config rules, and Lake Formation Regional registrations. Foundation models in Amazon Bedrock are hosted per-Region too: invoking Claude or Titan in eu-central-1 runs inference in eu-central-1 infrastructure. Cross-Region inference settings must be explicit.

Vendor data policy — does AWS or the model provider reuse my data?

The Amazon Bedrock data policy is unambiguous: your prompts, completions, and fine-tuning data are not used by AWS or by the foundation model provider to train or improve their models. This matters to auditors because it separates Bedrock from many consumer AI services. However — and this is the frequently tested subtlety — individual providers on Bedrock may have separate terms for their own hosted services outside Bedrock, and you must read each provider's Bedrock-specific fine print because small variations exist.

You cannot grant permission on data whose origin you do not know. Every AI data governance program on AWS starts with AWS Glue Data Catalog cataloging the dataset, then Lake Formation registering it, then DataZone publishing it as a reviewed asset. Skip provenance, and every subsequent control (PII scan, access grant, residency rule) is built on sand. Reference: https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html

Training data governance is the part of AI data governance that decides what is allowed into a model's weights. Once data is baked into a foundation model or a fine-tuned adapter, it is functionally impossible to fully unlearn. That one-way nature is why AI data governance treats training data with the strictest rules.

Data lineage for AI pipelines

Data lineage is the recorded ancestry of a dataset: which upstream tables it was joined from, which transformations were applied, and which downstream training job consumed it. On AWS, lineage is captured by AWS Glue Data Catalog metadata, SageMaker ML Lineage Tracking (for training jobs), and Amazon DataZone asset versioning. An auditor asking "which S3 prefix was version 2.3 of this fine-tuned Claude model trained on?" can answer that question only if lineage was instrumented from day one.

Consent versus licensing — two distinct legal checks

Consent applies to personal data: did the data subject agree their data could be used for AI model training? Licensing applies to datasets: does the license grant a right to train commercial models? These are distinct legal concepts. Store both as Lake Formation tags and DataZone business metadata so that Lake Formation's row-filter or column-filter policies can gate access based on "consent=granted" or "license=commercial_training_allowed".
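
A pipeline step gating on those attributes might look like the following sketch. The tag names and values mirror the ones quoted in this section; they are a convention you would define yourself, not Lake Formation built-ins:

```python
# Pre-training gate: refuse a dataset unless its governance tags record
# both consent and a commercial-training license. Tag names illustrative.
def training_allowed(tags: dict) -> bool:
    return (
        tags.get("consent") == "granted"
        and tags.get("license") == "commercial_training_allowed"
    )

customer_chats = {"consent": "granted", "license": "commercial_training_allowed"}
scraped_forum = {"consent": "unknown", "license": "research_only"}

assert training_allowed(customer_chats) is True
assert training_allowed(scraped_forum) is False
```

Note the default: a dataset with no tags at all is rejected, which is the safe failure mode for a governance gate.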

Intellectual property and the "would-you-train-on-this?" audit

Intellectual property risk in AI data governance extends beyond explicit copyright. Scraped web text, third-party user-generated content, and competitor product documentation can all trigger IP disputes if used in training data. The practical tool is the "would-you-train-on-this?" audit: a pre-training review where a cross-functional team (legal, privacy, engineering) looks at a sampled cross-section of the dataset and asks whether each row is defensibly usable for model training. Failures at this stage save post-deployment lawsuits.

Data quality gates before training

Beyond legal constraints, AI data governance enforces quality gates — duplicate removal, label accuracy sampling, distribution checks, and freshness verification. SageMaker Data Wrangler and AWS Glue DataBrew offer profiling features. DataZone's business glossary lets teams codify "minimum quality score = 0.85" as a published criterion that downstream consumers can enforce.

Data provenance is where the data originally came from (one point in the past). Data lineage is the full graph of transformations from origin to current use (a full trajectory). AI data governance needs both — provenance to answer "can we legally use this?" and lineage to answer "which model version included which data?" Reference: https://docs.aws.amazon.com/glue/latest/dg/components-overview.html

AWS Glue Data Catalog + Lake Formation for Training Data

AWS Glue Data Catalog is the metadata foundation for AI data governance on AWS. Lake Formation is the access-control and fine-grained permissions layer that sits on top.

AWS Glue Data Catalog as AI metadata store

AWS Glue Data Catalog stores schemas, partition information, table statistics, and classification labels for S3 data, Redshift tables, RDS databases, and more. For AI data governance, the Data Catalog becomes the central registry: every S3 prefix that might feed a SageMaker training job, a Bedrock fine-tuning job, or a Bedrock Knowledge Base should be cataloged first so that downstream services query metadata rather than guessing schema.

Glue Crawlers for automatic metadata discovery

Glue Crawlers scan S3 paths, infer schemas, and register tables in the Data Catalog automatically. For AI data governance teams, crawlers keep the catalog fresh as new training batches land in S3 — without a crawler, catalog entries drift out of date and lineage accuracy collapses.

AWS Lake Formation permissions for training data

AWS Lake Formation adds database-, table-, column-, row-, and cell-level permissions on top of Data Catalog objects. An AI data governance team can grant a SageMaker training role permission to only the non-PII columns of a customer table, or only the rows where consent_given = true. Lake Formation Tag-Based Access Control (LF-TBAC) lets you attach tags like ai_training_approved=yes and grant permissions at scale.

Lake Formation data filters for PII exclusion

Data filters in Lake Formation let you create named filtered views — for example, "customer_table_no_pii" that excludes columns email, phone, ssn. SageMaker jobs that use this view never see the raw PII columns, satisfying the principle of purpose limitation in GDPR and similar AI data governance frameworks.
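
Conceptually, the filtering that Lake Formation performs server-side can be sketched in a few lines. This is a local simulation for intuition only; real enforcement happens inside Lake Formation, never in consumer code:

```python
PII_COLUMNS = {"email", "phone", "ssn"}  # columns the named filter excludes

def apply_data_filter(rows: list[dict]) -> list[dict]:
    """Simulate a 'customer_table_no_pii' view: drop PII columns, and
    keep only rows where the data subject consented (row-level filter)."""
    return [
        {k: v for k, v in row.items() if k not in PII_COLUMNS}
        for row in rows
        if row.get("consent_given")
    ]

rows = [
    {"id": 1, "email": "a@example.com", "ssn": "123", "consent_given": True},
    {"id": 2, "email": "b@example.com", "ssn": "456", "consent_given": False},
]
print(apply_data_filter(rows))  # [{'id': 1, 'consent_given': True}]
```

The SageMaker training role querying the filtered view has no way to reconstruct the excluded columns, which is the point.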

Lake Formation versus IAM alone

Plain S3 bucket IAM policies are object-level and coarse. Lake Formation permissions are schema-aware and fine-grained. If the exam scenario says "restrict access to specific columns of a table used for AI training," the answer is Lake Formation, not IAM.

Amazon Macie — S3 PII Scanning at Scale for Training Data

Amazon Macie is AWS's managed service for discovering and classifying sensitive data in Amazon S3 buckets. Macie uses machine learning plus pattern matching to identify PII, personal health information, credential material, and custom-defined sensitive types across massive S3 estates.

What Amazon Macie finds

Macie ships with managed data identifiers for names, street addresses, email addresses, phone numbers, US and international government IDs, driver licenses, credit card numbers, AWS access keys, API tokens, and more. Each finding reports the bucket, object key, identifier type, and number of occurrences.

Macie in the AI data governance workflow

In an AI data governance pipeline, Macie runs before training. A typical flow: raw data lands in a staging S3 bucket, Macie scans it for PII, findings are routed to AWS Security Hub and Amazon EventBridge, remediation pipelines redact or quarantine offending objects, and only clean data is promoted to the training bucket. This pattern converts "we think there's no PII" into "we can prove there's no PII" — exactly what AI data governance auditors want.
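
A remediation function triggered by those EventBridge findings might implement the promote-or-quarantine decision along these lines. The finding types shown match Macie's naming style, but the routing logic and bucket names are hypothetical:

```python
QUARANTINE_TYPES = {
    "SensitiveData:S3Object/Personal",
    "SensitiveData:S3Object/Credentials",
}

def route_object(finding_type):
    """Decide where a scanned staging object goes next. No finding means
    promote to the training bucket; PII or credential findings go to
    quarantine for redaction; anything else gets a human look."""
    if finding_type is None:
        return "s3://training-data-clean/"       # hypothetical bucket
    if finding_type in QUARANTINE_TYPES:
        return "s3://quarantine-for-redaction/"  # hypothetical bucket
    return "s3://manual-review/"                 # hypothetical bucket

print(route_object("SensitiveData:S3Object/Personal"))
```

Wiring this into a Lambda target of the EventBridge rule gives you the auditable promotion path the paragraph describes.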

Custom data identifiers in Macie

Custom data identifiers let you define your own patterns — employee IDs, internal customer numbers, proprietary product SKUs — and scan for them alongside the managed identifiers. For AI data governance in regulated industries, custom identifiers are how you enforce organization-specific data classes.
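
At its core, a custom data identifier is a regular expression (optionally plus keywords and proximity rules) that Macie evaluates alongside its managed identifiers. The employee-ID format below is a made-up example of the kind of regex you would register; here it runs locally just to show what it would match:

```python
import re

# Hypothetical internal employee ID: 'EMP-' followed by six digits.
# In Macie this regex would be registered as a custom data identifier.
EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")

sample = "Ticket raised by EMP-204817; cc EMP-990012 and the ops alias."
matches = EMPLOYEE_ID.findall(sample)
print(matches)  # ['EMP-204817', 'EMP-990012']
```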

Macie sensitivity scoring

Macie's automated sensitive data discovery assigns each S3 bucket a sensitivity score based on scan findings. Buckets hosting training data with high sensitivity scores should either be remediated or routed through column-filter views in Lake Formation before any ML consumer reads them.

Macie versus Amazon GuardDuty versus Inspector

Macie finds sensitive data inside S3 objects. GuardDuty detects threat activity (anomalous API calls, malware, credential exfiltration). Inspector scans compute workloads (EC2, ECR images, Lambda) for vulnerabilities. For AI data governance PII questions, Macie is the right answer; GuardDuty and Inspector handle different security concerns.

Amazon Macie scans Amazon S3 buckets. It does not scan DynamoDB tables, RDS databases, or in-flight prompts. For in-flight text PII redaction at inference time, you pair Macie with Amazon Comprehend PII entity detection (for text) and Amazon Bedrock Guardrails Sensitive Information Filters (for Bedrock model invocations). Macie is the training-data tool; Comprehend is the prompt/output tool. Reference: https://docs.aws.amazon.com/macie/latest/user/what-is-macie.html

Amazon Comprehend PII Entity Detection — Redact in Prompts and Outputs

Amazon Comprehend is AWS's managed natural language processing service. Its PII detection feature identifies personally identifiable information entities in free text and can return either the entity list or a redacted version of the original text.

Comprehend PII entities

Comprehend PII detection recognizes entities such as NAME, EMAIL, PHONE, ADDRESS, SSN, CREDIT_DEBIT_NUMBER, BANK_ACCOUNT_NUMBER, PASSPORT_NUMBER, DRIVER_ID, US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER, IP_ADDRESS, MAC_ADDRESS, URL, USERNAME, PASSWORD, and several more. Each entity detection returns the type, location (offset and length), and a confidence score.

Real-time versus batch detection

Comprehend offers synchronous DetectPiiEntities for request-time processing and asynchronous PII detection jobs for batch processing across large S3 corpora. For AI data governance in live Bedrock applications, the synchronous API runs as a pre-processing step on every user prompt and as a post-processing step on every model output.

Redaction mode — ContainsPiiEntities and DetectPiiEntities

Comprehend's ContainsPiiEntities returns only the PII entity type labels present in the text, each with a confidence score, which makes it cheap enough to run on every request. DetectPiiEntities returns character offsets for in-place redaction. Redacting before sending prompts to a Bedrock foundation model prevents PII from entering model provider infrastructure at all, even when the Bedrock data policy already promises no training use.
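
The offset-based redaction that DetectPiiEntities enables can be sketched as below. The entity dicts mimic the API's Type/BeginOffset/EndOffset fields, but the sample response is fabricated, not a real Comprehend call. One detail matters: spans must be replaced from the end of the string backwards so that earlier offsets stay valid:

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each detected PII span with [TYPE], working right to left
    so earlier BeginOffset values are not invalidated by the edits."""
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text

prompt = "Contact Jane Doe at jane@example.com about the claim."
# Shaped like a DetectPiiEntities response; values fabricated:
entities = [
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36},
]
print(redact(prompt, entities))
# Contact [NAME] at [EMAIL] about the claim.
```

Running this as a pre-processing step means the foundation model only ever sees the bracketed placeholders.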

Output-side PII filtering

Models sometimes hallucinate names or phone numbers, or regurgitate PII from their training data. Running Comprehend PII detection on model outputs before returning them to the end user provides a belt-and-braces defense. This is conceptually similar to Bedrock Guardrails Sensitive Information Filters but offers more granular control in cases where you are not using Bedrock, such as SageMaker-hosted custom LLMs.

Comprehend PII versus Bedrock Guardrails Sensitive Information Filters

Bedrock Guardrails Sensitive Information Filters apply automatically to Bedrock model invocations configured with a guardrail. Amazon Comprehend PII is a standalone service you can orchestrate around any AI workflow — Bedrock, SageMaker, Lex, third-party LLMs called via Lambda. The exam likes both: Guardrails for Bedrock-specific scenarios, Comprehend for general-purpose text PII.

Amazon DataZone — Cross-Team Data Access Governance

Amazon DataZone is AWS's data management service for publishing, discovering, and governing data across teams. For AI data governance, DataZone solves the "which team can use which dataset for which AI purpose?" question at organizational scale.

DataZone projects and environments

DataZone organizes work into projects (business initiatives) and environments (the compute and storage resources available to a project). An AI team might have a "Claim Triage Model" project with a dev environment and a prod environment. Data producers publish assets to a domain catalog; consumers request subscription access through their project.

DataZone business glossary

A business glossary captures human-readable definitions of terms ("customer", "active subscriber", "churn") alongside the physical columns that implement them. For AI data governance, glossary terms bridge the gap between a data scientist asking "what counts as an active customer?" and the actual Glue Catalog column that answers it.

Subscription requests and data-sharing workflows

When a data scientist requests access to a published asset for AI training, DataZone routes the request to the data owner, records the business justification, and — upon approval — provisions Lake Formation permissions automatically. Every approval is auditable and revocable. This is the AI data governance workflow for regulated industries.

DataZone asset metadata for AI

DataZone assets carry both technical metadata (columns, types, statistics from Glue) and business metadata (owner, stewardship contact, consent flag, license type, sensitivity level). AI teams filter the catalog for assets whose metadata says "ai_training_approved = yes" and subscribe only to those.
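
That catalog-filtering step, modeled locally for intuition (the asset records and the ai_training_approved field follow the text's convention rather than a real DataZone API response):

```python
# Illustrative asset records with business metadata, per the convention
# described above; not the shape of an actual DataZone API response.
assets = [
    {"name": "claims_2023", "ai_training_approved": "yes"},
    {"name": "customer_pii_raw", "ai_training_approved": "no"},
    {"name": "product_reviews", "ai_training_approved": "yes"},
]

def subscribable_for_training(catalog: list[dict]) -> list[str]:
    """Names of assets whose business metadata approves AI training use."""
    return [a["name"] for a in catalog if a["ai_training_approved"] == "yes"]

print(subscribable_for_training(assets))  # ['claims_2023', 'product_reviews']
```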

DataZone versus Lake Formation versus Glue Catalog

Glue Data Catalog is the technical metadata store. Lake Formation enforces fine-grained permissions. DataZone is the business-layer catalog, subscription workflow, and glossary on top. They are complementary, not overlapping: AI data governance at scale uses all three.

If the AIF-C01 scenario asks about cross-team data discovery and a subscription-based access workflow with business metadata, choose Amazon DataZone. If it asks about row-level or column-level permissions on a table, choose AWS Lake Formation. If it asks about schema discovery and metadata storage, choose AWS Glue Data Catalog. Do not collapse them into a single answer. Reference: https://docs.aws.amazon.com/datazone/latest/userguide/what-is-datazone.html

Regional Data Sovereignty — EU Training Data Must Stay EU

Regional data sovereignty is the legal requirement that data remain within a specified jurisdiction. AI data governance inherits AWS's Region-based model: training data placed in an EU Region stays in EU Region datacenters unless someone explicitly moves it.

Region choice is the primary control

For an EU-only AI training dataset, create S3 buckets in eu-central-1, eu-west-1, or another EU Region. Register those S3 paths in the Glue Data Catalog of the same Region. Run Glue ETL jobs in that Region. Train SageMaker jobs in that Region. Invoke Bedrock models in that Region. Every one of those services honors the Region you chose — AWS does not silently copy data across Regions.

Bedrock foundation model Regional availability

Not every Bedrock foundation model is available in every Region. An AI data governance plan for EU-only training must confirm that the chosen foundation model — Claude, Titan, Llama, Mistral — has an EU Region endpoint available. If the model is only available in us-east-1, either select a different model or accept the data-egress and regulatory trade-offs (rarely acceptable under GDPR).

SCPs, AWS Config, and Lake Formation Regional controls

Defense in depth: add service control policies (SCPs) that deny resource creation in non-EU Regions, AWS Config rules that flag non-EU resources, and Lake Formation Regional registrations that refuse to catalog out-of-Region data. Each layer catches mistakes the others miss.
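
A minimal Region-deny SCP of the kind described might look like the following. This is a sketch only: the EU Region list, and the NotAction exemptions for global services that have no Region, would need review against your own organization's baseline before use:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideEURegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "route53:*",
        "cloudfront:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["eu-central-1", "eu-west-1"]
        }
      }
    }
  ]
}
```

Attached at the organization or OU level, this denies resource activity in any non-EU Region while leaving the listed global services usable.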

Cross-Region inference gotcha on Bedrock

Amazon Bedrock offers a cross-Region inference feature that automatically routes a model invocation to available capacity across a group of Regions. For AI data governance in sovereignty-constrained workloads, cross-Region inference must be disabled or scoped to a group of Regions within the same jurisdiction (for example, EU-only Region group). Unscoped cross-Region inference can move prompt data outside the originating Region — an AI data governance violation that must be proactively prevented.
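
Cross-Region inference profiles carry a geography prefix in their IDs (for example eu. or us.), so a deployment check can refuse any profile outside the approved jurisdiction. The sketch below assumes that prefix convention; the specific profile IDs are illustrative in form:

```python
def profile_in_jurisdiction(profile_id: str, geo_prefix: str) -> bool:
    """True if a Bedrock cross-Region inference profile ID is scoped to
    the approved geography, judged by its ID prefix (e.g. 'eu.')."""
    return profile_id.startswith(geo_prefix + ".")

# Illustrative profile IDs following the geographic-prefix convention:
assert profile_in_jurisdiction("eu.anthropic.claude-3-haiku-20240307-v1:0", "eu")
assert not profile_in_jurisdiction("us.anthropic.claude-3-haiku-20240307-v1:0", "eu")
```

A check like this belongs in the deployment pipeline, not in runtime code, so a misconfigured profile never ships at all.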

AWS European Sovereign Cloud and GovCloud

For the highest data sovereignty requirements, AWS operates AWS GovCloud (US) Regions (FedRAMP High, ITAR) and has announced AWS European Sovereign Cloud for EU customers who need extra isolation. AIF-C01 awareness-level: know these exist as options. Most exam scenarios are solved with standard commercial EU Regions plus SCPs and Lake Formation.

AI data governance candidates often assume that choosing an EU Region automatically prevents any EU data from leaving. That is true for S3, SageMaker training jobs, and single-Region Bedrock invocations, but NOT automatically true for Bedrock cross-Region inference, SageMaker JumpStart pulls from global model registries, or services that operate globally by design (IAM, CloudFront, Route 53 metadata). Every AI data governance plan in a sovereignty-constrained workload must explicitly review each service's Regional behavior. Reference: https://docs.aws.amazon.com/bedrock/latest/userguide/data-protection.html

Amazon Bedrock Data Policy — Your Data Is Not Used to Train Provider Models

The Amazon Bedrock data policy is a cornerstone of AI data governance on AWS and a frequent exam target. Memorize it precisely.

The core guarantee

Your prompts, completions, and fine-tuning data submitted to Amazon Bedrock are not used by AWS and are not shared with the third-party foundation model providers to train or improve the underlying foundation models. AWS encrypts your data in transit and at rest, isolates per-customer fine-tuned adapters, and does not read your content except as needed to operate the service.

Why this matters for AI data governance

Many consumer-grade AI products retain and reuse prompts for model improvement — a non-starter for enterprise AI data governance with confidential, regulated, or copyrighted content. Bedrock's policy inverts that default: enterprise data stays enterprise data. AI data governance teams rely on this guarantee when approving Bedrock for production workloads.

What is encrypted and where

Data in transit between your VPC and Bedrock endpoints uses TLS. Data at rest in Bedrock Knowledge Bases, Bedrock Agents, and fine-tuned model adapters is encrypted using AWS-managed or customer-managed KMS keys. Your fine-tuning data is isolated to your account and cannot be accessed by other customers.

Limits of the guarantee — read the per-provider terms

Third-party foundation models accessed via Bedrock (Anthropic Claude, Meta Llama, Mistral, Cohere, AI21, Stability AI) are hosted on AWS infrastructure under the Bedrock data policy. However, each provider's Bedrock-specific terms may include small variations — for example, around abuse monitoring, retention for safety investigations, or use of aggregate metadata. AI data governance best practice is to read each provider's terms page in the Bedrock console during procurement review, and update the review whenever AWS onboards a new model family.

Amazon Titan and Amazon Q are AWS-managed

Amazon Titan foundation models and Amazon Q Business/Developer are fully AWS-managed and follow the Bedrock data policy and Amazon Q enterprise terms respectively. For customers who want the simplest AI data governance story (one vendor, one contract, one policy), Titan and Amazon Q reduce the vendor surface.

Amazon Bedrock over AWS PrivateLink — Keep Traffic Off the Public Internet

AWS PrivateLink lets Amazon Bedrock traffic stay on the AWS private network rather than traversing the public internet. For AI data governance in regulated industries, this is a non-negotiable setting.

VPC interface endpoints for Bedrock

You create a VPC interface endpoint for com.amazonaws.<region>.bedrock-runtime and com.amazonaws.<region>.bedrock. Applications running in private subnets route Bedrock InvokeModel calls through the endpoint. Traffic never touches the public internet.

Prompts to a foundation model often contain the most sensitive data in a workflow — customer records, internal documents, source code. Keeping that traffic on the AWS backbone reduces exposure to internet-based interception, routing anomalies, and DNS attacks. Combined with the Bedrock data policy, PrivateLink provides a complete "data never leaves AWS" story for Bedrock-based AI workflows.

AI data governance teams should also provision PrivateLink endpoints for S3, AWS Glue, Amazon Comprehend, Amazon Macie, and SageMaker. An AI pipeline is only as private as its weakest link; if Bedrock traffic uses PrivateLink but the upstream S3 fetch goes over the internet, the AI data governance claim is only partial.

PrivateLink is a network isolation feature. It does not replace IAM, encryption, or the Bedrock data policy. AI data governance uses PrivateLink alongside these controls, never instead of them.

Glue Data Catalog = schema and metadata registry. Lake Formation = fine-grained table/column/row permissions. DataZone = cross-team catalog and subscription workflow. Macie = S3 PII scanning at scale. Comprehend PII = text PII detection and redaction in prompts and outputs. Bedrock data policy = AWS does not train provider models on your data. Bedrock PrivateLink = traffic off the public internet. Region selection = data residency enforcement. Reference: https://docs.aws.amazon.com/bedrock/latest/userguide/data-protection.html

Model Provider Data Usage Policies Vary — Read the Fine Print

Amazon Bedrock hosts models from multiple providers, and while the overarching Bedrock data policy applies to all of them, AI data governance requires reading each provider's specific terms during procurement review. The AIF-C01 exam expects awareness that provider policies differ, not memorization of each clause.

Per-provider terms surface

In the Amazon Bedrock console, each model catalog entry links to the provider's Bedrock-specific terms — Anthropic, Meta, Mistral AI, Cohere, AI21 Labs, Stability AI, Amazon Titan, and others. Terms differ on matters such as acceptable use restrictions, content that may be flagged for abuse review, and geographic availability.

What AI data governance teams should check

During procurement: does the provider's acceptable use policy conflict with our intended use case? Is there a retention window for abuse review that could keep prompt data beyond the Bedrock data policy guarantee? Are there geographic restrictions on using the model with data from particular jurisdictions? These checks are the "read the fine print" step in AI data governance.

Guarded variants and evaluation models

Some providers offer distinct Bedrock endpoints — standard, instruct, or safety-tuned variants. Terms may differ slightly between variants. AI data governance teams should record which variant is used, not just which provider.

Open-source models via Bedrock Marketplace and SageMaker JumpStart

Open-source foundation models brought into Bedrock Marketplace or deployed via SageMaker JumpStart carry their source licenses (Llama 3 Community License, Apache 2.0, Mistral license, and so on). AI data governance must track both the AWS-side data policy and the upstream open-source license — redistribution, fine-tuning, and commercial use rules differ materially across open-source licenses.

Compliance Frameworks that Touch AI Data Governance

AI data governance on AWS intersects with multiple compliance programs. The AIF-C01 exam asks you to recognize which framework applies to a given scenario rather than memorize clause-level detail.

GDPR — personal data of EU residents

The EU General Data Protection Regulation governs personal data of EU residents. For AI data governance, GDPR imposes lawful basis for processing, purpose limitation, data minimization, right to access, right to erasure, and explicit consent for certain processing types. AWS Artifact provides the GDPR Data Processing Addendum. AI data governance controls — Macie for classification, Comprehend for redaction, Lake Formation for purpose-limited column access — implement the technical side of GDPR compliance.

HIPAA — US protected health information

The Health Insurance Portability and Accountability Act applies to protected health information in the US. Bedrock and SageMaker are both HIPAA-eligible services when used under a signed Business Associate Addendum (BAA). AI data governance for healthcare AI workloads adds PHI-specific controls: Macie managed identifiers for medical record numbers, Lake Formation column-filters for provider-specific columns, and CloudTrail logging of every model invocation.

EU AI Act — risk-based AI regulation

The EU AI Act tiers AI systems by risk level (unacceptable risk, high risk, limited risk, minimal risk). AI data governance under the EU AI Act emphasizes training data quality, bias monitoring, human oversight, and technical documentation — all of which map to the AWS services in this topic plus SageMaker Clarify and Bedrock Model Evaluation.

ISO/IEC 42001 — AI management system

ISO/IEC 42001 is the international management system standard for AI. AWS has achieved accredited ISO/IEC 42001 certification for several AI services, including Amazon Bedrock, and distributes the certificate through AWS Artifact. AI data governance programs aiming at ISO 42001 alignment lean heavily on DataZone and Lake Formation for the documented asset lifecycle the standard expects.

Other frameworks

SOC 2, ISO 27001, and FedRAMP apply to the infrastructure hosting AI workloads. AWS holds those certifications and publishes reports in AWS Artifact. AI-specific obligations layer on top through the frameworks listed above.

Common Exam Traps on AI Data Governance and PII

Several confusions recur on AIF-C01 Task 5.2 AI data governance items. Studying them explicitly is worth several exam points.

Amazon Macie versus Amazon Comprehend PII

Macie scans Amazon S3 objects at rest. Comprehend detects PII entities in free text at request time. If the scenario says "discover PII in a training dataset in S3," the answer is Macie. If the scenario says "redact PII from user prompts before sending to Bedrock," the answer is Comprehend (or Bedrock Guardrails Sensitive Information Filters).
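A minimal sketch of the prompt-side redaction path: Comprehend's `DetectPiiEntities` returns entity types with character offsets, and the caller splices the redactions in. The `sample_entities` list below stands in for a Comprehend response so the helper can run without an AWS call; the real invocation is shown commented.

```python
def redact_pii(text, entities):
    """Replace each detected PII span with its entity type, e.g. [PHONE]."""
    # Splice right-to-left so earlier offsets stay valid as the text shrinks.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = (text[:ent["BeginOffset"]]
                + f"[{ent['Type']}]"
                + text[ent["EndOffset"]:])
    return text

# In production the entities come from Amazon Comprehend:
# import boto3
# comprehend = boto3.client("comprehend")
# entities = comprehend.detect_pii_entities(
#     Text=prompt, LanguageCode="en")["Entities"]

# Shape mirrors Comprehend's DetectPiiEntities response (sample data):
sample_entities = [{"Type": "PHONE", "BeginOffset": 11, "EndOffset": 23}]
print(redact_pii("Call me at 555-123-4567 today", sample_entities))
# prints: Call me at [PHONE] today
```

The redacted string, not the original prompt, is what then goes to the Bedrock `InvokeModel` call.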

AWS Glue Data Catalog versus SageMaker Feature Store

Data Catalog stores metadata about datasets in S3 and other stores. SageMaker Feature Store stores computed feature values for training and real-time inference. Exam items that mention "reusable feature engineering across training and inference" point to Feature Store. Items about "schema registry for training data" point to Data Catalog.

Lake Formation versus IAM

IAM alone manages coarse object-level and API-level permissions. Lake Formation adds column-, row-, and cell-level permissions on tables registered with the Data Catalog. AI data governance questions about fine-grained data access default to Lake Formation.
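The column-level grant above can be sketched with Lake Formation's `GrantPermissions` API, which takes a `TableWithColumns` resource. The role ARN, database, table, and column names below are placeholders; the call itself is commented.

```python
def column_grant(role_arn, database, table, columns):
    """Request body for lakeformation.grant_permissions restricting a
    role to specific columns (sketch; names are placeholders)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,  # only these columns are readable
            }
        },
        "Permissions": ["SELECT"],
    }

# import boto3
# lf = boto3.client("lakeformation")
# lf.grant_permissions(**column_grant(
#     "arn:aws:iam::111122223333:role/SageMakerTrainingRole",
#     "customers_db", "customer_table", ["age_band", "region"]))
```

Because enforcement happens at query time in Lake Formation, the SageMaker training role never sees the excluded columns even with broad S3 IAM permissions.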

Amazon DataZone versus AWS Glue Data Catalog

DataZone is the business-layer catalog with subscription workflows, business glossaries, and cross-project sharing. Glue Data Catalog is the technical-layer catalog. They work together; DataZone publishes assets that are backed by Glue Catalog tables.

Bedrock data policy versus model provider terms

The Bedrock data policy prohibits AWS and providers from training on your Bedrock data. Individual provider terms layer additional constraints and occasional exceptions (abuse monitoring windows, geographic availability). Do not assume all providers have identical terms.

Regional residency versus encryption

Encryption protects confidentiality. Region selection enforces data residency. Encryption does not prevent data from moving Regions; only Region choice does. When a scenario asks "keep data in Germany only," the right answer leads with Region selection and SCPs, not KMS.
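A sketch of the SCP pattern for the "Germany only" scenario, assuming an allow-list of EU Regions. `aws:RequestedRegion` is the standard condition key; the exempt global services in `NotAction` are an illustrative assumption and would need tailoring per organization.

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyNonEURegions",
    "Effect": "Deny",
    "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
    "Resource": "*",
    "Condition": {
      "StringNotEquals": {
        "aws:RequestedRegion": ["eu-central-1", "eu-west-1", "eu-west-3", "eu-north-1"]
      }
    }
  }]
}
```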

Macie sensitivity score is a signal, not a gate

Macie's automated sensitivity score helps prioritize but does not automatically block any workflow. AI data governance teams still need Lake Formation data filters, SCP denies, or pipeline-level quarantine logic to actually stop sensitive data from flowing into training.
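The quarantine decision itself is logic you own. The sketch below assumes a simplified finding shape mirroring Macie's severity field; in practice the findings come from the `macie2` `ListFindings`/`GetFindings` APIs, shown commented.

```python
def should_quarantine(finding, blocked=("High",)):
    """Decide whether an object should be moved out of the training
    prefix. `finding` mirrors the shape of a Macie finding (sketch)."""
    severity = finding.get("severity", {}).get("description", "")
    return severity in blocked

# Pulling real findings (sketch):
# import boto3
# macie = boto3.client("macie2")
# ids = macie.list_findings()["findingIds"]
# for f in macie.get_findings(findingIds=ids)["findings"]:
#     if should_quarantine(f):
#         ...  # copy object to quarantine bucket, tag, alert

print(should_quarantine({"severity": {"description": "High"}}))   # True
print(should_quarantine({"severity": {"description": "Low"}}))    # False
```

The point the paragraph makes survives in code form: Macie raises the flag, but the block or quarantine action is a separate, deliberate control.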

Key Numbers and Must-Memorize Facts

For AIF-C01 AI data governance you need to recognize rather than memorize exact numbers, but a few facts recur.

  • Amazon Macie ships with 100+ managed data identifiers covering global PII patterns and credentials.
  • Amazon Comprehend detects 20+ PII entity types synchronously per request.
  • AWS Glue Data Catalog is Regional; catalog entries do not replicate across Regions by default.
  • AWS Lake Formation permissions apply at database, table, column, row, and cell granularity.
  • Amazon DataZone subscriptions are auditable end-to-end via CloudTrail.
  • Amazon Bedrock stores fine-tuning data encrypted with AWS-managed KMS keys by default; customer-managed KMS keys are optional.
  • Amazon Bedrock data policy applies to prompts, completions, and fine-tuning data — all three.
  • Bedrock PrivateLink uses VPC interface endpoints with Regional endpoint names.

Expect AIF-C01 exam items on AI data governance in these shapes:

  1. "A healthcare company wants to scan S3 buckets of medical transcripts for PII before training a foundation model." Answer: Amazon Macie.
  2. "An application built on Amazon Bedrock must redact user phone numbers from prompts before invoking a model." Answer: Amazon Comprehend DetectPiiEntities (or Bedrock Guardrails Sensitive Information Filters).
  3. "A team needs a single catalog where data scientists can discover and subscribe to approved training datasets." Answer: Amazon DataZone.
  4. "A data platform must restrict a SageMaker training role to specific columns of a customer table." Answer: AWS Lake Formation column permissions.
  5. "An EU bank must ensure training data never leaves the EU." Answer: Select an EU Region, enforce SCPs denying non-EU resource creation, and verify the Bedrock foundation model is available in that Region.
  6. "A company wants assurance AWS does not use Bedrock prompts to train provider models." Answer: The Amazon Bedrock data policy provides this guarantee in writing.
  7. "A security team must prevent Bedrock traffic from traversing the public internet." Answer: Amazon Bedrock VPC interface endpoints via AWS PrivateLink.
  8. "A compliance officer asks which AWS service centralizes schema and metadata for AI training data." Answer: AWS Glue Data Catalog.

FAQ — AI Data Governance and PII Top Questions

Q1. What is the difference between Amazon Macie and Amazon Comprehend for PII detection?

Amazon Macie scans Amazon S3 buckets for PII at rest, using managed and custom data identifiers across potentially petabytes of training data. Amazon Comprehend detects PII entities in free text synchronously or in batch jobs, and can return redacted text. Macie is the training-data governance tool; Comprehend is the runtime prompt and output redaction tool. On AIF-C01, Macie is almost always the right answer when the scenario involves S3 buckets, and Comprehend is almost always the right answer when the scenario involves live text flowing to or from a foundation model.
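On the Macie side, the scan described above is a classification job. The helper below builds parameters for `macie2.create_classification_job` as a one-time scan using Macie's managed identifiers; account ID, bucket, and job names are placeholders and the call is commented.

```python
def one_time_pii_scan(account_id, buckets, job_name):
    """Parameters for macie2.create_classification_job scanning the
    given S3 buckets once (sketch; names are placeholders)."""
    return {
        "jobType": "ONE_TIME",
        "name": job_name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": buckets}
            ]
        },
    }

# import boto3
# macie = boto3.client("macie2")
# macie.create_classification_job(**one_time_pii_scan(
#     "111122223333", ["training-data-raw"], "pre-train-pii-scan"))
```

A recurring (`SCHEDULED`) job with the same definition covers pipelines where the training bucket is continuously refilled.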

Q2. Does Amazon Bedrock use my prompts to train its foundation models?

No. The Amazon Bedrock data policy explicitly states that your prompts, completions, and fine-tuning data are not used by AWS and are not shared with third-party model providers to train or improve foundation models. This is a named AWS commitment in the Bedrock documentation and is a core AI data governance assurance for enterprise adoption. Individual provider terms may add minor clauses (such as abuse monitoring windows) but cannot override the core Bedrock data policy.

Q3. How do AWS Glue Data Catalog, AWS Lake Formation, and Amazon DataZone fit together for AI data governance?

They layer. AWS Glue Data Catalog stores technical metadata — schemas, partitions, statistics. AWS Lake Formation adds fine-grained access control (column, row, cell) on top of Data Catalog tables. Amazon DataZone publishes Data Catalog tables as business assets with business glossaries and a subscription workflow for cross-team access. An end-to-end AI data governance stack uses all three: Glue Catalog for metadata, Lake Formation for enforcement, DataZone for the user-facing catalog and approvals.

Q4. How do I keep EU training data inside the EU on AWS?

Select an EU AWS Region (eu-central-1, eu-west-1, eu-north-1, eu-west-3, eu-south-1, eu-south-2, or eu-central-2) for every service that touches the data — S3, Glue, SageMaker, Bedrock. Apply AWS Organizations service control policies that deny resource creation in non-EU Regions. Use AWS Config rules to flag drift. Confirm the Bedrock foundation model you plan to use is available in an EU Region and disable or scope cross-Region inference so prompts never route outside the EU. Sign the GDPR Data Processing Addendum via AWS Artifact. Together these steps give you defensible EU-only AI data governance.
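A minimal sketch of the drift-flagging step, assuming you maintain a simple inventory of (resource ARN, Region) pairs; real drift detection would lean on AWS Config, but the check reduces to a Region allow-list matching the EU Regions above.

```python
EU_REGIONS = {"eu-central-1", "eu-west-1", "eu-north-1", "eu-west-3",
              "eu-south-1", "eu-south-2", "eu-central-2"}

def non_eu_resources(inventory):
    """Flag inventory entries outside EU Regions. `inventory` is a list
    of (resource_arn, region) pairs from your asset tracking (sketch)."""
    return [(arn, region) for arn, region in inventory
            if region not in EU_REGIONS]

print(non_eu_resources([
    ("arn:aws:s3:::train-data", "eu-central-1"),
    ("arn:aws:sagemaker:us-east-1:111122223333:notebook/x", "us-east-1"),
]))
# prints: [('arn:aws:sagemaker:us-east-1:111122223333:notebook/x', 'us-east-1')]
```

Any non-empty result is a residency violation to remediate before the next training run.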

Q5. What should I audit in my Bedrock provider's fine print?

Read each Bedrock provider's terms page in the console for acceptable use restrictions, abuse monitoring windows, retention behavior in abuse cases, geographic availability, and any variance from the core Bedrock data policy. Providers differ; AWS does not flatten their terms. AI data governance teams record which provider, which model version, and which variant is used in each workload, and repeat the fine-print review whenever AWS onboards a new model family.

Q6. Is Amazon Bedrock HIPAA-eligible for PHI in AI workloads?

Yes, Amazon Bedrock is HIPAA-eligible under a signed Business Associate Addendum (BAA) with AWS. This means you can process protected health information through Bedrock foundation models when the BAA is in place and when you layer the usual HIPAA controls: encryption at rest and in transit, IAM least privilege, CloudTrail auditing, and — for AI data governance specifically — Macie scanning of training data, Comprehend PII redaction on prompts and outputs, and Lake Formation column-level access enforcement. Always confirm the specific model and Region combination on the AWS HIPAA-eligible services page, which is updated periodically.

Q7. What is the "would-you-train-on-this?" audit and when should I run it?

The "would-you-train-on-this?" audit is a pre-training review where legal, privacy, and engineering stakeholders sample the dataset and ask whether each row is defensibly usable for AI model training — considering consent, licensing, intellectual property, PII exposure, and residency. Run it before every fine-tuning or continued pre-training job, not just at project kickoff, because upstream data pipelines change over time and what was safe last quarter may not be safe this quarter. On AWS, the audit is supported by Macie scans for PII, Glue Data Catalog lineage records, DataZone consent-tagged assets, and Lake Formation column filters that let you sample clean subsets without exposing raw data to reviewers.

Further Reading

官方資料來源