CloudFormation Stacks and StackSets Operations

AWS CloudFormation is the SOA-C02 service-of-record for provisioning and maintaining cloud resources. Domain 3 (Deployment, Provisioning, and Automation) is worth 18 percent of the exam, and Task Statement 3.1 explicitly demands the candidate "create, manage, and troubleshoot AWS CloudFormation" — note the verb troubleshoot. Where SAA-C03 asks which infrastructure-as-code service to choose at architectural design time, SOA-C02 puts you in front of a stack that is stuck in UPDATE_ROLLBACK_FAILED, a StackSet operation that failed in 5 of 200 accounts, a DELETE_FAILED caused by a non-empty S3 bucket, and a drifted resource that someone modified outside CloudFormation. The exam tests fixing CloudFormation, not designing IaC architecture.

This guide walks CloudFormation from the SysOps angle: the full stack lifecycle and every status code you might see at 3am, change sets as the production safety net, drift detection and remediation, the four DeletionPolicy values and how UpdateReplacePolicy differs, stack policies that block accidental updates to stateful resources, rollback triggers that watch CloudWatch alarms during deployment, StackSets in self-managed vs service-managed mode, and organization-wide deployment with auto-deployment to new accounts. It also covers the recurring SOA-C02 scenario shapes: UPDATE_ROLLBACK_FAILED recovery via ContinueUpdateRollback, StackSet failure tolerance settings, the CreationPolicy resource signal timeout, and the parameter store integration that surprises candidates.

Why CloudFormation Sits at the Heart of SOA-C02 Domain 3

The official SOA-C02 Exam Guide v2.3 lists five skills under Task Statement 3.1, and CloudFormation appears in three of them directly: "Create, manage, and troubleshoot AWS CloudFormation"; "Provision resources across multiple AWS Regions and accounts (for example, AWS RAM, CloudFormation StackSets, IAM cross-account roles)"; and "Identify and remediate deployment issues (for example, service quotas, subnet sizing, CloudFormation errors, permissions)." Task Statement 3.2 then layers automation on top — Systems Manager and CloudFormation are the only two automation services the exam guide names by example for "automate deployment processes."

At the SysOps tier the framing is operational. SAA-C03 asks "which IaC tool should the architect choose for a multi-account landing zone?" SOA-C02 asks "the stack update failed at 60 percent, the stack is now in UPDATE_ROLLBACK_FAILED, three resources are stuck in UPDATE_FAILED, and the production database has a DeletionPolicy: Retain — what is the next step?" The answer is rarely a different IaC tool; it is ContinueUpdateRollback with a ResourcesToSkip list. CloudFormation Stacks and StackSets is the topic where every later Domain 3 topic plugs in — AMI lifecycle and Image Builder pipelines (3.1) hand off AMIs that CloudFormation references in launch templates; Systems Manager Automation (3.2) is the AWS-managed runbook engine that CloudFormation can invoke during stack operations and that operates on stack-deployed resources; scheduled tasks (3.2) often are CloudFormation custom resources or stack creation triggers.

Template: a JSON or YAML document describing the desired state of AWS resources. Templates declare Parameters, Mappings, Conditions, Resources, Outputs, and Transform.
Stack: a single deployment of a template — the unit CloudFormation manages as a whole. Every resource in a stack is created, updated, and deleted together.
Change set: a preview of the changes CloudFormation will make to a stack if you execute the change set. Lets you verify replacement vs in-place modification before any resource is touched.
Drift: a difference between the stack's expected configuration (per the template) and the actual configuration of a resource (per the AWS API). Caused by out-of-band edits.
Rollback: automatic reversion of stack changes when a CREATE or UPDATE fails. Produces statuses like ROLLBACK_IN_PROGRESS, UPDATE_ROLLBACK_IN_PROGRESS, and on failure UPDATE_ROLLBACK_FAILED.
DeletionPolicy: a resource-level attribute controlling what happens when the resource is removed from the stack — Delete (default for most), Retain, Snapshot (for stateful services), or RetainExceptOnCreate.
UpdateReplacePolicy: a resource-level attribute controlling what happens to the old resource when an update requires replacement. Accepts Delete (default), Retain, or Snapshot.
Stack policy: a JSON policy attached to a stack that grants or denies update permissions on specific resources within that stack — protects stateful resources from accidental modification.
Rollback trigger: a CloudWatch alarm referenced during stack create/update; if the alarm fires within the configured monitoring time, CloudFormation automatically rolls back.
StackSet: a container that lets you deploy a single template to many accounts and many regions, managed centrally from an administrator account.
Stack instance: one stack created by a StackSet in a specific account-region pair. A StackSet has one stack instance per (account, region).
Self-managed permissions: StackSet mode where you create the IAM roles in the administrator and target accounts manually.
Service-managed permissions: StackSet mode where AWS Organizations creates and manages the roles automatically; supports auto-deployment to new accounts.
Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html

Plain-Language Explanation of CloudFormation Stacks and StackSets

CloudFormation has more vocabulary than any other Domain 3 topic. Three analogies make the constructs stick.

Analogy 1: The Construction Blueprint and the Building Permit Office

A CloudFormation template is the architectural blueprint for a building — it lists every wall, pipe, and outlet without actually building anything. A stack is the finished building that the city built from that blueprint; the city tracks which buildings exist and what blueprint produced each one. A change set is the renovation permit application the city posts on the door before any contractor swings a hammer — neighbors and the building owner read the permit and see "we will replace the load-bearing wall (means the building is rebuilt) and add a new bathroom (means a small in-place change)" and either approve it or reject it before any irreversible damage. Drift detection is the building inspector who walks every room with the original blueprint in hand and writes "the kitchen now has a wall that the blueprint never showed — someone built it without a permit". The DeletionPolicy: Retain on the foundation is the no-demolition clause in the deed — even if you tear up the rest of the building, the foundation stays. The stack policy is the building code taped to the door of the server room: "do not modify the database without a special permit, even if the architect's blueprint says to." A StackSet is the franchise builder — the same blueprint used to construct identical stores in 50 cities across 5 countries, all managed from corporate HQ. Service-managed StackSets is when the city's AWS Organizations zoning office automatically grants permits in every member city; self-managed StackSets is when each city's mayor has to sign a permit individually.

Analogy 2: The Restaurant Kitchen Recipe Book

A template is a recipe in the kitchen recipe book — ingredients, steps, exact temperatures, plating instructions. A stack is the plated dish that came out of the kitchen built from that recipe; every dish is traceable to the recipe version. Parameters are the ingredient substitutions the customer requests at order time ("medium-rare instead of well-done") that the recipe accepts. Outputs are the plating notes passed forward to the next station ("add this sauce ID to the dessert later"). Cross-stack references via Outputs and Fn::ImportValue are how the soup station hands a stock pot to the sauce station — the soup recipe Outputs the stock ARN, the sauce recipe ImportValues it. A change set is the chef's tasting plate — the kitchen plates a sample, the head chef tastes it and approves before sending the full order out. A rollback trigger wired to a CloudWatch alarm is the kitchen smoke detector — if smoke goes off during cooking, the whole dish is thrown out and we revert to the previous menu state. The monitoring time of 30 minutes is how long the head chef stares at the dish after plating; if the customer sends it back within 30 minutes, the kitchen rolls back the change.

Analogy 3: The Library Card Catalog and Inter-Branch Loans

A stack is one book on a library shelf — checked out, tracked, and managed as a unit. The template is the publisher's manuscript. Drift is when a librarian writes notes in the margins; the original manuscript no longer matches the book on the shelf, and a future reprint will overwrite the marginalia unless someone formally adds them to the manuscript. A StackSet is the multi-branch library system: corporate HQ pushes the same book to every branch in the city. A stack instance is one branch's copy of that book. Concurrent regions in a StackSet operation is how many branches receive new shipments at once — push to too many branches at once and the trucks gridlock the loading docks; the default of 1 region at a time is conservative for a reason. Failure tolerance is how many branches can refuse a shipment before HQ stops the rollout — if 5 branches reject the book due to fire-code violations, do we keep going to the next 195 or pause the campaign?

For SOA-C02, the construction blueprint analogy is the most useful when a question mixes change sets, drift detection, and DeletionPolicy. The blueprint-vs-building distinction makes "drift" obvious — the building no longer matches the blueprint someone is holding. When the question is about StackSets across accounts, the multi-branch library analogy clarifies why concurrent regions and failure tolerance both exist as separate knobs. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html

CloudFormation Stack Lifecycle: Statuses You Will Read at 3am

Before troubleshooting makes sense you need a mental map of the stack lifecycle and every status code CloudFormation may show.

Create lifecycle

A CREATE operation transitions through CREATE_IN_PROGRESS → CREATE_COMPLETE on success. On failure the default behavior is automatic rollback: CREATE_IN_PROGRESS → ROLLBACK_IN_PROGRESS → ROLLBACK_COMPLETE. The ROLLBACK_COMPLETE state is terminal — you cannot update or recover from it. The only valid action is delete the stack and re-create.

If automatic rollback is disabled with --disable-rollback (the OnFailure: DO_NOTHING option in the API), the stack stays in CREATE_FAILED so you can inspect partial resources and the error trail before deleting. This is a debugging mode for templates with hard-to-reproduce failures, never a production setting.

Update lifecycle

A successful UPDATE runs UPDATE_IN_PROGRESS → UPDATE_COMPLETE_CLEANUP_IN_PROGRESS → UPDATE_COMPLETE. The cleanup phase removes resources that the update replaced — for example, if an instance was rebuilt with a new AMI, the old instance is terminated only during cleanup. CloudFormation considers the stack updateable again only after UPDATE_COMPLETE.

On failure, automatic rollback kicks in: UPDATE_IN_PROGRESS → UPDATE_ROLLBACK_IN_PROGRESS → UPDATE_ROLLBACK_COMPLETE. The original state is restored and the stack is updateable again. So far so good.

The dangerous status is UPDATE_ROLLBACK_FAILED. This means the rollback itself encountered an error — for example, a resource cannot be restored to its previous configuration because some external dependency was deleted, or an IAM role was removed mid-rollback. The stack is now stuck. The recovery API is ContinueUpdateRollback, optionally with a ResourcesToSkip parameter listing the resource logical IDs you want CloudFormation to skip during the rollback retry. Skipped resources are marked UPDATE_COMPLETE in the stack metadata even though they may not match the template — this is a deliberate operational override.

Delete lifecycle

A DELETE operation runs DELETE_IN_PROGRESS → DELETE_COMPLETE. On failure, the stack enters DELETE_FAILED. Common causes are non-empty S3 buckets (S3 buckets must be empty before deletion), retained resources blocking deletion order, IAM permission gaps, or termination protection on EC2 instances. Recovery is to fix the underlying cause and retry DELETE, optionally with RetainResources to skip the problematic logical IDs.

Status quick reference

Status	Meaning	Recovery
`CREATE_COMPLETE`	Stack creation succeeded	Updateable
`CREATE_FAILED`	Failure with rollback disabled	Inspect, then delete
`ROLLBACK_COMPLETE`	Create failed, rolled back automatically — terminal	Delete and re-create
`UPDATE_COMPLETE`	Update succeeded	Updateable
`UPDATE_ROLLBACK_COMPLETE`	Update failed, original state restored	Updateable
`UPDATE_ROLLBACK_FAILED`	Rollback itself failed	`ContinueUpdateRollback` API
`DELETE_FAILED`	Delete encountered errors	Fix cause, retry delete (optionally `RetainResources`)
`REVIEW_IN_PROGRESS`	Stack created from change set, change set not yet executed	Execute or delete the change set
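
The table above can be read as a lookup from stack status to the operator's next step. A minimal Python sketch follows; the `RECOVERY` dictionary and `next_action` helper are hypothetical names for illustration, and the action strings only paraphrase the table rather than call any AWS API.

```python
# Hypothetical lookup: stack status -> recovery step, paraphrasing the table above.
RECOVERY = {
    "CREATE_COMPLETE": "none needed; stack is updateable",
    "CREATE_FAILED": "inspect partial resources, then delete the stack",
    "ROLLBACK_COMPLETE": "delete the stack and re-create it",
    "UPDATE_COMPLETE": "none needed; stack is updateable",
    "UPDATE_ROLLBACK_COMPLETE": "none needed; stack is updateable",
    "UPDATE_ROLLBACK_FAILED": "call ContinueUpdateRollback (optionally with ResourcesToSkip)",
    "DELETE_FAILED": "fix the cause and retry delete (optionally with RetainResources)",
    "REVIEW_IN_PROGRESS": "execute or delete the change set",
}


def next_action(status: str) -> str:
    """Return the recovery step for a status, or a wait hint for other *_IN_PROGRESS states."""
    if status in RECOVERY:
        return RECOVERY[status]
    if status.endswith("_IN_PROGRESS"):
        return "wait for the operation to finish"
    return "unknown status; check the stack events"


print(next_action("UPDATE_ROLLBACK_FAILED"))
```

Keeping ROLLBACK_COMPLETE and UPDATE_ROLLBACK_FAILED as separate keys mirrors the distinction the exam tests: one is deleted and re-created, the other is recovered in place.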

Many candidates conflate ROLLBACK_COMPLETE and UPDATE_ROLLBACK_FAILED. ROLLBACK_COMPLETE comes from a failed CREATE: the stack was never successful in the first place, so there is nothing to update or roll back further; the only path forward is DELETE then re-create. UPDATE_ROLLBACK_FAILED comes from a failed rollback during UPDATE: the stack was previously healthy, the update partially succeeded, the rollback partially failed, and the recovery is ContinueUpdateRollback, possibly with ResourcesToSkip. Confusing them on SOA-C02 costs easy points. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-update-rollback.html

Stack Creation Timeout: The Hidden 5-Minute Signal Default

A subtle SOA-C02 gotcha: the stack itself has no default creation timeout (one can be set with TimeoutInMinutes), but individual resources enforce their own. The most relevant SOA-C02 number is CreationPolicy.ResourceSignal.Timeout, which defaults to 5 minutes (PT5M) and can be raised to at most 12 hours (PT12H) for instances and Auto Scaling groups expecting a cfn-signal from user-data. A WaitCondition has no default at all: its Timeout property is required and is capped at 43,200 seconds (12 hours).

If user-data on a launching EC2 instance does not send a success signal with cfn-signal within the timeout window, the stack create fails with Failed to receive 1 resource signal(s) within the specified duration. The fix is one of:

Make the user-data faster.
Increase the CreationPolicy.ResourceSignal.Timeout value (PT15M, PT1H, etc.).
Pre-bake dependencies into a golden AMI so user-data does not need to install them.
Switch from cfn-signal to a custom resource that polls for application readiness.

Default CreationPolicy.ResourceSignal.Timeout: 5 minutes (PT5M).
Maximum CreationPolicy.ResourceSignal.Timeout: 12 hours (PT12H).
WaitCondition timeout: required property with no default; maximum 12 hours (43,200 seconds).
Change set retention: a change set persists until it is executed or deleted; there is no time-based expiry. Executing one change set deletes the stack's other change sets, because they no longer match the updated stack, so treat any unexecuted change set as stale once the stack has changed underneath it.
Default StackSet operation concurrency: 1 region at a time (regions are deployed sequentially by default; can be raised with RegionConcurrencyType: PARALLEL).
Default StackSet account concurrency: MaxConcurrentCount = 1 (one account at a time per region unless raised through MaxConcurrentCount or MaxConcurrentPercentage).
Default StackSet failure tolerance: FailureToleranceCount = 0 — first failure stops the operation.
Maximum Resources in a single template: 500 (raised from the older limit of 200).
Maximum template body size: 1 MB through S3 upload, 51,200 bytes inline through API.
Maximum Parameters per template: 200.
Maximum Outputs per template: 200.
Maximum Mappings per template: 200.
Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cloudformation-limits.html
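
The template quotas listed above lend themselves to a pre-flight check in a deployment pipeline. The sketch below is illustrative only: `check_template`, `LIMITS`, and the byte constants are hypothetical names, the numbers are copied from this guide, "1 MB" is assumed to mean 1,048,576 bytes, and the size is measured on a JSON serialization rather than on the file you would actually upload.

```python
import json

# Quotas as quoted in this guide (assumptions for the sketch, not read from AWS).
LIMITS = {"Resources": 500, "Parameters": 200, "Outputs": 200, "Mappings": 200}
MAX_INLINE_BYTES = 51_200        # template body passed directly through the API
MAX_S3_BYTES = 1024 * 1024       # template uploaded to S3; assumes 1 MB = 1,048,576 bytes


def check_template(template: dict) -> list:
    """Return human-readable quota violations for a parsed template."""
    problems = []
    for section, limit in LIMITS.items():
        count = len(template.get(section, {}))
        if count > limit:
            problems.append(f"{section}: {count} exceeds limit of {limit}")
    size = len(json.dumps(template).encode("utf-8"))
    if size > MAX_S3_BYTES:
        problems.append(f"body is {size} bytes; exceeds the S3 upload limit")
    elif size > MAX_INLINE_BYTES:
        problems.append(f"body is {size} bytes; must be uploaded to S3, not passed inline")
    return problems


# Example: 501 trivially small resources trips only the resource-count quota.
big = {"Resources": {f"Topic{i}": {"Type": "AWS::SNS::Topic"} for i in range(501)}}
for problem in check_template(big):
    print(problem)
```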

Common Deployment Errors and How to Diagnose

CloudFormation surfaces errors as stack events in the console (and via DescribeStackEvents API). Each event has a logical ID, resource type, status, and a status reason — the status reason is your primary diagnostic input.
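
When a stack fails, the event list usually contains one real failure followed by several cancellations and rollback entries, so the first step is finding the earliest meaningful failure. The sketch below assumes events shaped like DescribeStackEvents output (the field names are the real ones; the sample data and the `root_cause` helper are hypothetical) and treats any reason containing "cancelled" as a side effect rather than a cause.

```python
from typing import Optional


def root_cause(events: list) -> Optional[dict]:
    """Return the earliest *_FAILED event that is not a cancellation side effect."""
    failed = [
        e for e in events
        if e["ResourceStatus"].endswith("_FAILED")
        and "cancelled" not in e.get("ResourceStatusReason", "").lower()
    ]
    # ISO-8601 strings in one timezone sort chronologically; datetime objects work too.
    return min(failed, key=lambda e: e["Timestamp"], default=None)


# Hypothetical events, newest first, as the API returns them.
events = [
    {"Timestamp": "2024-05-01T03:02:10Z", "LogicalResourceId": "web-prod",
     "ResourceType": "AWS::CloudFormation::Stack", "ResourceStatus": "ROLLBACK_IN_PROGRESS",
     "ResourceStatusReason": "The following resource(s) failed to create: [WebServer, AppBucket]."},
    {"Timestamp": "2024-05-01T03:02:05Z", "LogicalResourceId": "AppBucket",
     "ResourceType": "AWS::S3::Bucket", "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": "2024-05-01T03:02:04Z", "LogicalResourceId": "WebServer",
     "ResourceType": "AWS::EC2::Instance", "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "The image id '[ami-xxxxx]' does not exist"},
]

cause = root_cause(events)
print(cause["LogicalResourceId"], "-", cause["ResourceStatusReason"])
```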

Insufficient IAM permissions

The most common deployment failure. CloudFormation runs operations using either:

The user/role invoking the stack operation — its permissions must cover every resource action the template implies.
A CloudFormation service role specified at stack creation — CloudFormation uses the service role to create resources, decoupling stack-execution permissions from invoker permissions.

If a stack event reads User: arn:aws:iam::123:role/X is not authorized to perform: ec2:RunInstances, you have an IAM gap. Fix the role policy or attach a service role with broader permissions.

Service quota exceeded

If the stack tries to create the 21st VPC in a region with a quota of 20, the resource event reads The maximum number of VPCs has been reached. Resolution: request a quota increase via Service Quotas, clean up unused resources, or split the stack across accounts. Domain 3.1 explicitly tests "service quotas" as a deployment issue category.

Invalid property values

The template references an AMI ID that does not exist in the region, an instance type unavailable in the AZ, or an SSL certificate that is not validated. The error is usually clear: The image id '[ami-xxxxx]' does not exist. Resolution: parameterize the AMI ID with Parameters of type AWS::SSM::Parameter::Value<AWS::EC2::Image::Id> so CloudFormation pulls the correct AMI from SSM Parameter Store at deploy time.

Dependency cycle

If two resources mutually reference each other through Ref or GetAtt, CloudFormation cannot decide which to create first. The error is Circular dependency between resources: [A, B]. Resolution: break the cycle so the dependency runs in only one direction, or introduce a third resource (often an SSM Parameter or a Lambda-backed custom resource) that brokers the value.
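
A dependency cycle can be caught before deployment by building the reference graph yourself. The following is a rough sketch, not CloudFormation's own algorithm: `references` and `find_cycle` are hypothetical helpers that follow only Ref, Fn::GetAtt, and DependsOn in a JSON-style template and ignore Fn::Sub and other intrinsic functions.

```python
def references(node) -> set:
    """Collect logical IDs referenced via Ref or Fn::GetAtt anywhere inside a value."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                found.add(value)
            elif key == "Fn::GetAtt":
                # Accepts both ["Logical", "Attr"] and "Logical.Attr" forms.
                found.add(value[0] if isinstance(value, list) else value.split(".")[0])
            else:
                found |= references(value)
    elif isinstance(node, list):
        for item in node:
            found |= references(item)
    return found


def find_cycle(template: dict):
    """Return one dependency cycle as a list of logical IDs, or None if there is none."""
    resources = template.get("Resources", {})
    graph = {}
    for name, body in resources.items():
        deps = references(body.get("Properties", {}))
        depends_on = body.get("DependsOn", [])
        deps |= {depends_on} if isinstance(depends_on, str) else set(depends_on)
        # Drop parameters and pseudo parameters such as AWS::Region.
        graph[name] = sorted(d for d in deps if d in resources)

    state = {}  # logical ID -> "visiting" or "done"

    def visit(name, path):
        state[name] = "visiting"
        for dep in graph[name]:
            if state.get(dep) == "visiting":
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep, path + [dep])
                if cycle:
                    return cycle
        state[name] = "done"
        return None

    for name in graph:
        if name not in state:
            cycle = visit(name, [name])
            if cycle:
                return cycle
    return None


# Hypothetical template: two security groups that allow traffic from each other.
template = {"Resources": {
    "WebSG": {"Type": "AWS::EC2::SecurityGroup", "Properties": {
        "GroupDescription": "web tier",
        "SecurityGroupIngress": [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
                                  "SourceSecurityGroupId": {"Ref": "AppSG"}}]}},
    "AppSG": {"Type": "AWS::EC2::SecurityGroup", "Properties": {
        "GroupDescription": "app tier",
        "SecurityGroupIngress": [{"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
                                  "SourceSecurityGroupId": {"Fn::GetAtt": ["WebSG", "GroupId"]}}]}},
}}

print(find_cycle(template))
```

In this example the cycle disappears once one of the inline ingress rules is moved into its own resource, which is the "third resource" approach described above.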

Subnet sizing

The template tries to launch 50 instances into a subnet with only 32 available IPs. Error: There are not enough free addresses in subnet. Resolution: enlarge the subnet via re-templating, or split across additional subnets in other AZs. The exam guide explicitly lists "subnet sizing" as a deployment issue.

A common SOA-C02 distractor: a template hardcodes an AMI ID, the stack works in us-east-1 but fails in eu-west-1. Candidates may pick "use Mappings" — viable but verbose. The cleaner SOA-C02 answer is AWS::SSM::Parameter::Value<AWS::EC2::Image::Id> — CloudFormation reads the AMI ID from SSM Parameter Store at stack-create time, supports the AWS-managed /aws/service/ami-amazon-linux-latest/... parameters, and adapts to each region automatically. Mappings is the legacy answer; SSM Parameter Store types are the modern answer. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/parameters-section-structure.html
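
To make the SSM-typed parameter concrete, here is a small sketch that builds a JSON template with one. The parameter type string and the public Amazon Linux 2 parameter path are real; the logical IDs `LatestAmiId` and `WebServer` and the instance type are placeholders chosen for illustration.

```python
import json

# Public SSM parameter that AWS maintains for the latest Amazon Linux 2 AMI in each region.
LATEST_AL2 = "/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2"

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "LatestAmiId": {
            # CloudFormation resolves the SSM parameter to an AMI ID at deploy time.
            "Type": "AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>",
            "Default": LATEST_AL2,
        }
    },
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": {"Ref": "LatestAmiId"},  # no hardcoded, region-specific ami-... value
                "InstanceType": "t3.micro",
            },
        }
    },
}

print(json.dumps(template, indent=2))
```

Because the same parameter path exists in every region, the template needs no per-region Mappings block.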

Change Sets: Previewing Stack Updates Before Execution

A change set is a preview of every change CloudFormation will make to an existing stack — every resource that will be added, modified in-place, replaced (i.e., destroyed and recreated), or removed. Change sets are the production safety net that prevents accidental data loss.

Why change sets exist

Direct stack updates (UpdateStack API) execute immediately. If the update changes a property that requires replacement on a stateful resource — for example, modifying the DBSubnetGroupName of an RDS instance — CloudFormation creates a new RDS instance and deletes the old one. The old data is gone unless you wrapped the resource with DeletionPolicy: Snapshot or UpdateReplacePolicy: Snapshot. A direct update can therefore wipe a database with one CLI call.

A change set forces a review step. For the same update, the change set lists RDSInstance with Action: Modify and Replacement: True, so the operator sees the warning before any resource is touched and can cancel the change, edit the template, or proceed knowingly.

Creating and executing change sets

aws cloudformation create-change-set \
  --stack-name web-prod \
  --template-body file://template.yaml \
  --change-set-name pr-432-update \
  --change-set-type UPDATE

CloudFormation generates the change set in CREATE_PENDING → CREATE_COMPLETE. Inspect with describe-change-set. Each change is classified as:

Add — new resource will be created.
Modify — existing resource will be modified in place (no replacement).
Modify with RequiresRecreation: Always — the resource will be replaced; old destroyed.
Modify with RequiresRecreation: Conditionally — replacement depends on additional context.
Remove — resource will be deleted.

When ready, execute:

aws cloudformation execute-change-set --change-set-name pr-432-update --stack-name web-prod

If you discover the change set is wrong, delete it without execution: aws cloudformation delete-change-set .... Deleting a change set has no effect on the live stack.
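
Before executing, a pipeline can screen the change list automatically and stop on anything destructive. The sketch below assumes a dictionary shaped like describe-change-set output (Changes, ResourceChange, Action, Replacement are the real field names); the `risky_changes` helper and the sample data are hypothetical.

```python
def risky_changes(change_set: dict) -> list:
    """Return (logical ID, type, action, replacement) for changes that may destroy a resource."""
    risky = []
    for change in change_set.get("Changes", []):
        rc = change.get("ResourceChange", {})
        replacement = rc.get("Replacement", "False")  # only present for Modify actions
        if rc.get("Action") == "Remove" or replacement in ("True", "Conditional"):
            risky.append((rc["LogicalResourceId"], rc["ResourceType"], rc["Action"], replacement))
    return risky


# Hypothetical change set for the web-prod stack.
change_set = {"Changes": [
    {"Type": "Resource", "ResourceChange": {
        "Action": "Modify", "LogicalResourceId": "RDSInstance",
        "ResourceType": "AWS::RDS::DBInstance", "Replacement": "True"}},
    {"Type": "Resource", "ResourceChange": {
        "Action": "Modify", "LogicalResourceId": "WebServerGroup",
        "ResourceType": "AWS::AutoScaling::AutoScalingGroup", "Replacement": "False"}},
    {"Type": "Resource", "ResourceChange": {
        "Action": "Add", "LogicalResourceId": "AlarmTopic",
        "ResourceType": "AWS::SNS::Topic"}},
]}

for logical_id, rtype, action, replacement in risky_changes(change_set):
    print(f"REVIEW {logical_id} ({rtype}): action={action}, replacement={replacement}")
```

Treating Conditional as risky is a deliberately cautious choice: the replacement decision is made only at execution time, so a human should look first.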

Change sets for new stacks

Change sets work for new stack creation too: pass --change-set-type CREATE. The stack is created in REVIEW_IN_PROGRESS state, and only when you execute the change set does the stack become real. This is useful for templates with conditional resources where the operator wants to verify which resources will be created before committing.

SOA-C02 favors change sets for any "the team accidentally replaced production database" scenario. The fix is procedural — require change sets in the deployment pipeline, never invoke UpdateStack directly on production stacks. Combined with DeletionPolicy: Snapshot and a stack policy denying updates to the database, the database is triple-protected against accidental destruction. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-changesets.html

Drift Detection: When Reality Diverges From the Template

Drift is the difference between a stack's expected state (per the template at the last successful operation) and the actual state of resources (per the AWS API right now). Drift happens when someone modifies a stack-managed resource through the console, CLI, or another tool — out-of-band edits that the next CloudFormation update would silently overwrite.

Detecting drift

Drift detection runs on demand:

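aws cloudformation detect-stack-drift --stack-name web-prod
# returns a StackDriftDetectionId; detection runs asynchronously
aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id <StackDriftDetectionId>
# once DetectionStatus is DETECTION_COMPLETE, list the per-resource results
aws cloudformation describe-stack-resource-drifts --stack-name web-prod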

Each resource is reported as IN_SYNC, MODIFIED, DELETED, or NOT_CHECKED (the resource type does not support drift detection). For modified resources, CloudFormation lists each property that differs, both expected and actual values.

Drift detection is read-only — it does not change anything. The operator decides remediation.
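
Drift results are easier to act on when reduced to the resources and properties that actually differ. This sketch assumes a list shaped like the StackResourceDrifts array from describe-stack-resource-drifts (the field names are real); the `summarize_drift` helper and the sample records are hypothetical.

```python
def summarize_drift(drifts: list) -> list:
    """Return report lines for resources that are MODIFIED or DELETED."""
    lines = []
    for drift in drifts:
        status = drift["StackResourceDriftStatus"]
        if status in ("IN_SYNC", "NOT_CHECKED"):
            continue
        lines.append(f"{drift['LogicalResourceId']}: {status}")
        for diff in drift.get("PropertyDifferences", []):
            lines.append(
                f"  {diff['PropertyPath']}: expected {diff['ExpectedValue']!r}, "
                f"actual {diff['ActualValue']!r}"
            )
    return lines


# Hypothetical drift results for three resources.
drifts = [
    {"LogicalResourceId": "WebSG", "StackResourceDriftStatus": "MODIFIED",
     "PropertyDifferences": [{"PropertyPath": "/SecurityGroupIngress/0/CidrIp",
                              "ExpectedValue": "10.0.0.0/8", "ActualValue": "0.0.0.0/0",
                              "DifferenceType": "NOT_EQUAL"}]},
    {"LogicalResourceId": "LogsBucket", "StackResourceDriftStatus": "IN_SYNC"},
    {"LogicalResourceId": "OldQueue", "StackResourceDriftStatus": "DELETED"},
]

print("\n".join(summarize_drift(drifts)))
```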

Remediation strategies

Three strategies, picked per situation:

Update the template to match reality. If the out-of-band change was deliberate (e.g., security team manually tightened a security group), modify the template, commit to Git, and run a stack update. The next stack update will not overwrite the change because the template now matches.
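Re-apply the template to overwrite reality. If the out-of-band change was accidental or unauthorized, bring the resource back to the template definition. A stack update with an unchanged template and parameters will not do this: CloudFormation compares the new template against the previous template, not against the live resource, and rejects the call with "No updates are to be performed". Either revert the property by hand (console or CLI) and re-run drift detection, or push a template change that touches the drifted property so CloudFormation rewrites it.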
Use AWS Config rules to flag drift continuously. The managed Config rule cloudformation-stack-drift-detection-check runs drift detection on a schedule and reports non-compliant stacks. Combined with EventBridge → SNS, this surfaces drift as soon as it appears, not weeks later.

When drift detection won't help

Some resource types do not support drift detection, and some properties of supported types are excluded from drift comparison. The CloudFormation User Guide lists which resource types are supported; common managed services like RDS, EC2 instances, IAM roles, S3 buckets, and Lambda functions are supported, while some custom or legacy resource types are not.

On SOA-C02, "the security team detected drifted resources across 50 stacks" maps to an organization-wide pattern: the AWS Config rule cloudformation-stack-drift-detection-check deployed to every account (for example as an organization Config rule), compliance results collected centrally in a Config aggregator, EventBridge rules on Config compliance change events, and notification or auto-remediation through Systems Manager Automation. The exam favors this end-to-end pipeline answer over a one-off detect-stack-drift CLI call. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-drift.html

DeletionPolicy: Retain, Delete, Snapshot, RetainExceptOnCreate

The DeletionPolicy resource attribute controls what happens when a resource is removed from the stack — either because the stack is deleted or because the resource is removed from the template. Four values:

Delete (default for most resources) — the resource is destroyed when the stack is deleted or the resource is removed from the template.
Retain — the resource is kept when the stack is deleted or removed; it becomes orphaned (no longer managed by any stack) but continues to exist and incur cost.
Snapshot — supported for stateful resources (RDS DB instances, RDS DB clusters, EBS volumes, ElastiCache, Neptune, Redshift, FSx); CloudFormation creates a final snapshot before deleting the resource. Recovery requires manually restoring from the snapshot.
RetainExceptOnCreate — the resource is retained on deletion except during a failed create operation. Useful for resources that should survive normal deletes but should not pile up after failed initial deployments.

The default is Delete for most resources, with a notable exception: AWS::RDS::DBCluster resources, and AWS::RDS::DBInstance resources that do not specify DBClusterIdentifier, default to Snapshot. Always set DeletionPolicy explicitly on stateful resources: RDS, EBS, S3 buckets you do not want to lose, KMS keys, and IAM resources you want to outlive the stack.

Common combinations

RDS database: DeletionPolicy: Snapshot, UpdateReplacePolicy: Snapshot — final snapshot taken on stack delete or replacement.
S3 bucket holding production data: DeletionPolicy: Retain, UpdateReplacePolicy: Retain — bucket survives stack delete; data is safe.
KMS CMK: DeletionPolicy: Retain — KMS keys cannot be deleted instantly anyway (they enter a 7-30 day pending deletion window), and you almost never want to lose a key referenced by ciphertext elsewhere.
Auto Scaling launch template: DeletionPolicy: Delete (default) — launch templates are not stateful; safe to recreate.

`DeletionPolicy` on stack delete vs resource removal

DeletionPolicy applies in both cases:

When the entire stack is deleted via DeleteStack, every resource's DeletionPolicy is honored.
When a single resource is removed from the template (the template no longer declares it), the next stack update applies the resource's DeletionPolicy.

`DeletionPolicy` does NOT apply to update-replacement

If an in-place update is impossible and CloudFormation must replace the resource, DeletionPolicy is not the attribute that protects the old copy — that is the job of UpdateReplacePolicy (next section). This is a common SOA-C02 trap.

A common pattern that goes wrong: candidate adds DeletionPolicy: Snapshot to an RDS instance and assumes the database is safe under all circumstances. It is not. If a future template change requires replacement (such as changing the DB engine version in a way that triggers replacement), DeletionPolicy is irrelevant — UpdateReplacePolicy decides whether the replaced resource is snapshotted. The defensive pair is DeletionPolicy: Snapshot AND UpdateReplacePolicy: Snapshot on every stateful resource. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-deletionpolicy.html
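
The defensive pair can be enforced with a simple template lint. The sketch below is one possible check, not an AWS tool: `unprotected` and `STATEFUL_TYPES` are hypothetical names, the type list is a short illustrative sample, and a missing attribute is treated as Delete, which is deliberately conservative.

```python
# Illustrative sample of stateful resource types; extend to match your own estate.
STATEFUL_TYPES = {
    "AWS::RDS::DBInstance", "AWS::RDS::DBCluster", "AWS::EC2::Volume",
    "AWS::S3::Bucket", "AWS::KMS::Key",
}


def unprotected(template: dict) -> dict:
    """Map each stateful logical ID to the protective attributes it lacks."""
    findings = {}
    for name, body in template.get("Resources", {}).items():
        if body.get("Type") not in STATEFUL_TYPES:
            continue
        # A missing attribute is treated as Delete (conservative assumption).
        missing = [attr for attr in ("DeletionPolicy", "UpdateReplacePolicy")
                   if body.get(attr, "Delete") == "Delete"]
        if missing:
            findings[name] = missing
    return findings


# Hypothetical template: the database has only half of the pair.
template = {"Resources": {
    "ProductionDatabase": {"Type": "AWS::RDS::DBInstance", "DeletionPolicy": "Snapshot",
                           "Properties": {}},
    "DataBucket": {"Type": "AWS::S3::Bucket", "DeletionPolicy": "Retain",
                   "UpdateReplacePolicy": "Retain"},
    "WebLaunchTemplate": {"Type": "AWS::EC2::LaunchTemplate", "Properties": {}},
}}

print(unprotected(template))
```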

UpdateReplacePolicy: Protecting Against Update-Driven Replacement

The UpdateReplacePolicy resource attribute controls what happens to the original resource when an update operation replaces the resource with a new copy. The values are Delete (default), Retain, and Snapshot; unlike DeletionPolicy, there is no RetainExceptOnCreate option.

When does CloudFormation replace?

Some properties cannot be modified on an existing resource. Examples:

Changing DBInstanceClass on RDS in some cases triggers in-place modification, but changing DBSubnetGroupName requires replacement.
Changing Engine on RDS forces replacement.
Changing KeyName on EC2 instance forces replacement.
Changing AvailabilityZone on a resource that is AZ-pinned forces replacement.
Changing the logical ID of a resource is conceptually a delete + create — same effect.

A change set marks each property change with RequiresRecreation: Always | Conditionally | Never. Anything Always triggers replacement, and UpdateReplacePolicy decides the fate of the old copy.

Why `UpdateReplacePolicy: Snapshot` exists

Imagine an RDS instance where someone changes a property that requires replacement. Without UpdateReplacePolicy, the default is Delete — the old database is destroyed during cleanup, with no snapshot. The new database is empty. Production data: gone. With UpdateReplacePolicy: Snapshot, CloudFormation snapshots the old database before deleting it, preserving an emergency recovery path.

Production templates should set both attributes explicitly on every stateful resource. The defensive idiom for RDS, EBS, ElastiCache, Redshift, Neptune, FSx:

ProductionDatabase:
  Type: AWS::RDS::DBInstance
  DeletionPolicy: Snapshot
  UpdateReplacePolicy: Snapshot
  Properties: ...

Setting only one leaves a hole. SOA-C02 explicitly tests this pair. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatereplacepolicy.html

Stack Policies: Blocking Accidental Updates to Stateful Resources

A stack policy is a JSON IAM-style policy attached to a stack that controls which resources can be modified during a stack update. It is enforced server-side by CloudFormation regardless of who triggers the update.

Stack policy structure

A stack policy has Effect, Action, Principal, Resource, and optional Condition. Actions are limited to Update:*, Update:Modify, Update:Replace, Update:Delete. Principal is always *. Resources are referenced by LogicalResourceId/* patterns.

Default stack policy (when none is set): Allow Update:* on * — every resource is updateable.

A common defensive policy:

{
  "Statement": [
    { "Effect": "Allow", "Principal": "*", "Action": "Update:*", "Resource": "*" },
    { "Effect": "Deny",  "Principal": "*", "Action": "Update:*", "Resource": "LogicalResourceId/ProductionDatabase" }
  ]
}

This permits updates to every resource except the production database. Any update affecting ProductionDatabase is rejected at stack-update time.
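
The evaluation logic behind that policy can be sketched in a few lines: an explicit Deny wins, and otherwise at least one matching Allow is required. This is a simplified model for study purposes, not CloudFormation's implementation; `is_update_allowed` is a hypothetical helper that ignores Condition, NotAction, and NotResource.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """Policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def is_update_allowed(policy: dict, logical_id: str, action: str) -> bool:
    """Evaluate a stack policy: explicit Deny wins, otherwise an Allow must match."""
    target = f"LogicalResourceId/{logical_id}"
    allowed = False
    for statement in policy["Statement"]:
        if not any(fnmatchcase(action, a) for a in _as_list(statement["Action"])):
            continue
        if not any(fnmatchcase(target, r) for r in _as_list(statement["Resource"])):
            continue
        if statement["Effect"] == "Deny":
            return False
        allowed = True
    return allowed


# The defensive policy shown above.
policy = {"Statement": [
    {"Effect": "Allow", "Principal": "*", "Action": "Update:*", "Resource": "*"},
    {"Effect": "Deny", "Principal": "*", "Action": "Update:*",
     "Resource": "LogicalResourceId/ProductionDatabase"},
]}

print(is_update_allowed(policy, "WebServer", "Update:Modify"))            # True
print(is_update_allowed(policy, "ProductionDatabase", "Update:Replace"))  # False
```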

Override during a single update

To intentionally update a protected resource, you pass a temporary override stack policy (StackPolicyDuringUpdateBody) that applies only to that single update. This forces a deliberate operator decision rather than relying on the persistent stack policy alone.

Stack policy vs IAM policy

Mechanism	Scope	Example use
Stack policy	Per-stack, controls which resources within the stack can be updated	Block updates to ProductionDatabase logical resource
IAM policy	Per-principal, controls who can call CloudFormation APIs at all	Only DevOps role can call `UpdateStack`
Service Control Policy (SCP)	Per-organization-OU, deny CloudFormation actions in the prod OU	Prevent `DeleteStack` from any role in production accounts
Service role for stack	Per-stack, what AWS resources CloudFormation itself can create	Limit the blast radius of a runaway template

The four mechanisms compose. A robust production setup uses all four — stack policy, IAM, SCP, and a scoped service role.

A subtle SOA-C02 trap: stack policies cover Update:* actions. They do not prevent stack deletion. A DeleteStack API call obeys DeletionPolicy on each resource but ignores the stack policy entirely. To prevent stack deletion, use either (a) IAM denying cloudformation:DeleteStack on that stack ARN, (b) termination protection enabled on the stack (UpdateTerminationProtection), or (c) an SCP at the OU level. Defense in depth means combining stack policy (against updates), termination protection (against deletes), and DeletionPolicy: Retain on irrecoverable resources. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/protect-stack-resources.html

Rollback Triggers and Monitoring Time

CloudFormation can monitor CloudWatch alarms that you specify during a create or update operation and automatically roll back if any of them goes into ALARM. The feature pairs rollback triggers with a monitoring time.

How rollback triggers work

You configure up to 5 rollback triggers per stack operation, each pointing at a CloudWatch alarm ARN. During the operation and for an additional monitoring time (0–180 minutes) after the operation completes, CloudFormation polls the alarms. If any trigger transitions to ALARM, CloudFormation automatically rolls back the stack to its previous state.

This is the closest CloudFormation comes to native canary deployments — you can update a stack that runs a new version of an application, set a rollback trigger on the application's 5xx-error-rate alarm with a 30-minute monitoring time, and any error rate spike within 30 minutes of completion automatically reverts the deployment.

Configuration via CLI

aws cloudformation update-stack \
  --stack-name web-prod \
  --template-body file://template-v2.yaml \
  --rollback-configuration '{
    "RollbackTriggers": [
      { "Arn": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:5xx-error-rate", "Type": "AWS::CloudWatch::Alarm" }
    ],
    "MonitoringTimeInMinutes": 30
  }'

Monitoring time

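The monitoring window starts once CloudFormation has deployed all resources for the operation and runs for the configured number of minutes. The stack stays in an in-progress state during the window and only finalizes to UPDATE_COMPLETE (or CREATE_COMPLETE) when the window ends without an alarm. During this window, the stack is considered "in observation": a trigger alarm going to ALARM causes rollback. Outside the window, alarms have no effect on the stack.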

MonitoringTimeInMinutes: 0 means the monitoring window is zero (alarms only checked during the operation itself, not afterward). 180 is the maximum.
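The limits above (at most 5 triggers, 0 to 180 minutes) can be checked before the request is sent. This is a minimal sketch; build_rollback_configuration is a hypothetical helper, and the alarm ARN is a placeholder. The returned dictionary mirrors the JSON passed to --rollback-configuration in the CLI example.

```python
def build_rollback_configuration(alarm_arns, monitoring_minutes):
    """Validate the documented limits and return a RollbackConfiguration dict."""
    if len(alarm_arns) > 5:
        raise ValueError("at most 5 rollback triggers per stack operation")
    if not 0 <= monitoring_minutes <= 180:
        raise ValueError("monitoring time must be between 0 and 180 minutes")
    return {
        "RollbackTriggers": [
            {"Arn": arn, "Type": "AWS::CloudWatch::Alarm"} for arn in alarm_arns
        ],
        "MonitoringTimeInMinutes": monitoring_minutes,
    }


# Placeholder ARN for illustration only.
example_arn = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:5xx-error-rate"
rollback_config = build_rollback_configuration([example_arn], 30)
```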

Rollback triggers vs blue/green

Rollback triggers handle in-place rollouts where the running stack is the deployment target. For full blue/green deployment — two parallel stacks with traffic shifted via Route 53 weighted records or ALB target group switches — CloudFormation has no native primitive; you compose it manually with two stacks plus an automation runbook. Rollback triggers cover the in-place case; blue/green is a higher-level pattern that uses CloudFormation as the building block.

When a question asks "the team needs to deploy a new application version and automatically revert if 5xx errors spike within 30 minutes" — the answer is rollback triggers with monitoring time. CodeDeploy is out-of-scope for SOA-C02; the exam-sanctioned tool for automatic rollback is CloudFormation rollback triggers. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-rollback-triggers.html

CloudFormation StackSets: Multi-Account, Multi-Region Deployments

A StackSet is a container that lets a single template be deployed as stack instances across many accounts and many regions, all managed centrally from an administrator account. It is the SOA-C02 answer for "deploy this baseline to every account in the organization" — guardrails, logging configurations, IAM roles, security tooling.

StackSets terminology

Administrator account: the account that holds the StackSet definition and from which operations are launched.
Target account: each account that receives a stack instance.
Stack instance: one deployed stack in a (target account, region) pair. A StackSet with 50 accounts × 4 regions has 200 stack instances.
Operation: a create-stack-instances, update-stack-set, or delete-stack-instances action — runs across the configured account-region matrix.

Self-managed permissions vs service-managed permissions

The two modes differ in how the cross-account trust is established:

Self-managed permissions

The administrator account holds an AWSCloudFormationStackSetAdministrationRole.
Each target account holds an AWSCloudFormationStackSetExecutionRole whose trust policy permits the administrator role to assume it.
The operator must create both roles manually (CloudFormation provides templates) before the StackSet can deploy.
Works without AWS Organizations — useful for organizations that have not adopted Organizations.

Service-managed permissions

Requires AWS Organizations with all features enabled (not just consolidated billing).
Trusted access for CloudFormation must be enabled in the organization.
AWS Organizations creates and manages the necessary IAM roles automatically.
Supports auto-deployment: when a new account is added to an organizational unit, the StackSet automatically deploys the stack to the new account — critical for landing zones.
Supports deployment to entire OUs rather than enumerating account IDs.

Feature	Self-managed	Service-managed
Requires Organizations	No	Yes (all features)
IAM roles	Manual	Auto-created
Target by OU	No (account IDs only)	Yes
Auto-deploy to new accounts	No	Yes
Use case	Pre-Organizations or specific account selection	Org-wide baseline, landing zone

StackSet operation knobs

Three knobs every SOA-C02 candidate must know:

MaxConcurrentCount / MaxConcurrentPercentage: how many accounts to deploy to in parallel within a region. Default 1 (serial). Higher values speed deployment but increase blast radius if the template has bugs.
FailureToleranceCount / FailureTolerancePercentage: how many account failures to tolerate before stopping the operation. Default 0 (first failure stops). Higher values allow the rollout to continue past isolated failures.
RegionConcurrencyType: SEQUENTIAL (default, one region at a time) or PARALLEL (all regions simultaneously, much faster but riskier). With SEQUENTIAL, RegionOrder sets the order in which regions are deployed.

A typical org-wide rollout uses MaxConcurrentPercentage: 25, FailureTolerancePercentage: 5, RegionConcurrencyType: SEQUENTIAL — deploy to 25 percent of accounts in parallel within one region at a time, tolerate up to 5 percent failure before stopping.
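To see what those percentages mean in account counts, here is a rough Python sketch. It assumes, based on my reading of the StackSetOperationPreferences documentation, that percentages round down (with concurrency never below 1) and that the default STRICT_FAILURE_TOLERANCE concurrency mode caps actual concurrency at the failure tolerance plus one. The function name effective_preferences is hypothetical.

```python
def effective_preferences(account_count, max_concurrent_pct, failure_tolerance_pct,
                          strict=True):
    """Translate percentage preferences into per-region account counts."""
    # Percentages round down; a concurrency of zero is bumped to one.
    max_concurrent = max(1, account_count * max_concurrent_pct // 100)
    failure_tolerance = account_count * failure_tolerance_pct // 100
    if strict:
        # Assumed STRICT_FAILURE_TOLERANCE behavior: concurrency is limited
        # to failure tolerance + 1 so failures cannot overshoot the tolerance.
        max_concurrent = min(max_concurrent, failure_tolerance + 1)
    return {"max_concurrent": max_concurrent, "failure_tolerance": failure_tolerance}


strict_plan = effective_preferences(200, 25, 5)
soft_plan = effective_preferences(200, 25, 5, strict=False)
```

Under these assumptions, 200 accounts with 25 percent concurrency and 5 percent tolerance give a tolerance of 10 accounts; strict mode then runs 11 accounts at a time rather than 50, while soft mode keeps the full 50.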

Account filtering in service-managed StackSets

Service-managed StackSets support account filters when targeting an OU:

NONE (default) — deploy to every account in the OU.
INTERSECTION — deploy only to listed accounts that are also in the OU.
DIFFERENCE — deploy to every account in the OU except the listed accounts.
UNION — deploy to every account in the OU plus any additionally listed accounts.

A common production pattern is DIFFERENCE to exclude a couple of legacy accounts that cannot accept the baseline yet.
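The four filter types behave like set operations on the OU's membership and the listed accounts. The sketch below models that with Python sets; resolve_targets and the account IDs are hypothetical, and the real resolution happens inside CloudFormation.

```python
def resolve_targets(ou_accounts, listed_accounts, filter_type="NONE"):
    """Model which accounts receive stack instances for each AccountFilterType."""
    ou, listed = set(ou_accounts), set(listed_accounts)
    if filter_type == "NONE":
        return ou
    if filter_type == "INTERSECTION":
        return ou & listed
    if filter_type == "DIFFERENCE":
        return ou - listed
    if filter_type == "UNION":
        return ou | listed
    raise ValueError(f"unknown filter type: {filter_type}")


ou_members = ["111111111111", "222222222222", "333333333333"]
legacy = ["333333333333"]
targets = resolve_targets(ou_members, legacy, "DIFFERENCE")
```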

Default MaxConcurrentCount: 1 (serial deployment within a region).
Default FailureToleranceCount: 0 (first failure stops the operation).
Default RegionConcurrencyType: SEQUENTIAL (one region at a time).
Maximum stack instances per StackSet (service-managed): 2,000 per administrator account per region.
Maximum operation parallelism: 100 accounts at a time (MaxConcurrentCount cap).
Auto-deployment: only available in service-managed mode.
Trusted access for CloudFormation in Organizations: required before any service-managed StackSet can be created.
Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/stacksets-concepts.html

Whenever a SOA-C02 scenario describes "deploy a CloudWatch alarm baseline to every existing and future account in the organization", the canonical answer is service-managed StackSets with auto-deployment enabled, targeting the root or a baseline OU. Self-managed StackSets are correct only when the scenario explicitly says "the company does not use AWS Organizations" or "deploy only to a small fixed list of accounts". Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/stacksets-orgs-enable-trusted-access.html

Cross-Stack References, Nested Stacks, and Stack Resources Across Accounts

CloudFormation provides three composition mechanisms; SOA-C02 expects you to know when each applies.

Outputs and Fn::ImportValue (cross-stack references)

The exporting stack declares an Output with Export: { Name: my-vpc-id }. The importing stack uses Fn::ImportValue: my-vpc-id. Cross-stack references are:

Same account, same region only. No cross-account or cross-region.
Loosely coupled. The exporting stack does not know who imports its outputs.
Sticky. Once a value is imported by another stack, you cannot delete or modify the export until the importing stack stops using it.

Use cross-stack references for shared infrastructure within a single account-region — VPC, subnets, security groups exported from a network stack and imported by every workload stack.

Nested stacks

A nested stack is a stack created as a resource within another stack. The parent stack has a resource of type AWS::CloudFormation::Stack pointing at a child template URL in S3. Nested stacks:

Share the parent stack's lifecycle. Updating the parent updates all nested children.
Are tightly coupled. Useful for re-usable template modules (e.g., a "VPC module" used in many parent templates).
Pass parameters and outputs between parent and child via Parameters and Outputs directly.

Nested stacks are the right tool for reusable internal modules; cross-stack references are right for independently-managed shared resources.

Cross-account / cross-region — use StackSets

CloudFormation has no native cross-account or cross-region stack reference. To deploy related resources across accounts, you compose with StackSets (one StackSet per template) and use IAM cross-account roles for run-time access between deployed resources.

Same account, same region, shared infra: Outputs + Fn::ImportValue.
Reusable template modules: nested stacks (AWS::CloudFormation::Stack).
Multi-account, multi-region baseline: StackSets.
Cross-account resource sharing of an existing resource (e.g., a Transit Gateway): AWS Resource Access Manager (RAM), referenced from CloudFormation as AWS::RAM::ResourceShare.

Mixing these up is a frequent SOA-C02 distractor. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/intrinsic-function-reference-importvalue.html

Intrinsic Functions: The CloudFormation Toolbox

Intrinsic functions are built-in template helpers. You will see a handful repeatedly on SOA-C02:

Ref — reference a parameter or another resource. For a parameter, Ref returns its value; for most resource types it returns the physical ID (e.g., the bucket name for AWS::S3::Bucket or the instance ID for AWS::EC2::Instance), not the logical ID used in the template.
Fn::GetAtt / !GetAtt — get a specific attribute of a resource (!GetAtt MyInstance.PrivateIp).
Fn::Sub / !Sub — substitute variables into a string (!Sub "arn:aws:s3:::${BucketName}/*").
Fn::Join / !Join — concatenate values with a delimiter.
Fn::ImportValue / !ImportValue — import an exported output from another stack.
Fn::FindInMap / !FindInMap — look up a value in a Mappings table by key (!FindInMap [RegionMap, !Ref AWS::Region, AMI]).
Fn::If / !If — pick between values based on a Condition.
Fn::GetAZs — return the list of AZs in a region.
Fn::Cidr — calculate CIDR blocks for subnet sizing.
Fn::Base64 — encode user-data scripts.
Pseudo parameters — AWS::Region, AWS::AccountId, AWS::StackName, AWS::Partition, AWS::URLSuffix.

The exam rarely tests detailed syntax, but knowing which function applies to which use case (especially Sub vs Join and Ref vs GetAtt) is fair game.
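For the Sub vs Join distinction, a loose Python analogy may help: Join concatenates a list with a delimiter, while Sub fills named placeholders inside one string. The functions fn_join and fn_sub below are hypothetical stand-ins, not CloudFormation itself; the real Fn::Sub also resolves resource attributes and pseudo parameters such as ${AWS::Region}, which this sketch does not handle.

```python
from string import Template


def fn_join(delimiter, values):
    """Rough analogue of Fn::Join: concatenate values with a delimiter."""
    return delimiter.join(values)


def fn_sub(template, variables):
    """Rough analogue of Fn::Sub for simple ${Name} placeholders only."""
    return Template(template).substitute(variables)


joined = fn_join("", ["arn:aws:s3:::", "logs-bucket", "/*"])
substituted = fn_sub("arn:aws:s3:::${BucketName}/*", {"BucketName": "logs-bucket"})
```

Both produce the same ARN string; Sub is usually easier to read when the result is mostly literal text with a few variables.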

Scenario Pattern: Stack Stuck in UPDATE_ROLLBACK_FAILED

This is the canonical SOA-C02 troubleshooting scenario. The runbook:

Inspect the stack events. aws cloudformation describe-stack-events --stack-name <name> shows the chain of failures. Identify the resource(s) that failed to roll back and the underlying error message — usually one of: external dependency missing, IAM permission removed mid-rollback, resource modified outside CloudFormation during the rollback.
Decide which resources to skip. For each stuck resource, decide whether (a) you can fix the underlying cause (e.g., re-create a deleted IAM role, re-attach an Internet Gateway), or (b) you accept the resource in its current state and tell CloudFormation to skip it during the retry. Skipping is irreversible — the resource is marked UPDATE_COMPLETE regardless of actual state, so reconciliation later is your responsibility.

Call ContinueUpdateRollback.

aws cloudformation continue-update-rollback \
  --stack-name web-prod \
  --resources-to-skip ResourceA ResourceB

Verify final stack state. The stack should reach UPDATE_ROLLBACK_COMPLETE. From here it is updateable again.
Reconcile any skipped resources. If you skipped a resource, follow up with a stack update or out-of-band fix to bring it back into alignment with the template.
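The first step of the runbook can be sketched as a filter over stack events. The example assumes events shaped like the output of describe-stack-events (newest first) and looks only at the most recent rollback attempt; the function name and the sample events are hypothetical.

```python
def resources_stuck_in_rollback(events, stack_name):
    """Return {logical_id: reason} for resources that failed during the
    most recent rollback attempt. Events must be ordered newest first."""
    stuck = {}
    for event in events:
        is_stack_event = (
            event["ResourceType"] == "AWS::CloudFormation::Stack"
            and event["LogicalResourceId"] == stack_name
        )
        # Stop at the start of the latest rollback; older failures are the
        # ones that caused the rollback, not the ones that blocked it.
        if is_stack_event and event["ResourceStatus"] == "UPDATE_ROLLBACK_IN_PROGRESS":
            break
        if not is_stack_event and event["ResourceStatus"] == "UPDATE_FAILED":
            stuck.setdefault(
                event["LogicalResourceId"], event.get("ResourceStatusReason", "")
            )
    return stuck


sample_events = [  # hypothetical data, newest first
    {"LogicalResourceId": "web-prod", "ResourceType": "AWS::CloudFormation::Stack",
     "ResourceStatus": "UPDATE_ROLLBACK_FAILED"},
    {"LogicalResourceId": "WebRole", "ResourceType": "AWS::IAM::Role",
     "ResourceStatus": "UPDATE_FAILED", "ResourceStatusReason": "Role web-role not found"},
    {"LogicalResourceId": "WebAsg", "ResourceType": "AWS::AutoScaling::AutoScalingGroup",
     "ResourceStatus": "UPDATE_COMPLETE"},
    {"LogicalResourceId": "web-prod", "ResourceType": "AWS::CloudFormation::Stack",
     "ResourceStatus": "UPDATE_ROLLBACK_IN_PROGRESS"},
    {"LogicalResourceId": "WebAsg", "ResourceType": "AWS::AutoScaling::AutoScalingGroup",
     "ResourceStatus": "UPDATE_FAILED", "ResourceStatusReason": "Original update failure"},
]
stuck = resources_stuck_in_rollback(sample_events, "web-prod")
```

The keys of the result are candidates for either a fix or --resources-to-skip; the decision between the two remains a human one.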

A wrong answer that appears frequently among SOA-C02 distractors: "delete the stack and re-create". This destroys the entire stack and all its resources, including ones that are healthy. Always try ContinueUpdateRollback first.

Scenario Pattern: StackSet Operation Failed in 5 of 200 Accounts

A StackSet update was launched against 200 accounts and failed in 5. With the default FailureToleranceCount of 0, the operation stops as soon as a failure is detected, so the other 195 accounts are left in mixed states: some updated, others still on the previous version.

The runbook:

Inspect operation results. aws cloudformation list-stack-set-operation-results --stack-set-name X --operation-id Y shows per-account status. Identify the 5 failed accounts and the error message for each.
Categorize failures. Common causes: missing IAM role in the target account (self-managed mode), resource quota in the target account, region not enabled in the target account, account suspended.
Fix or exclude failed accounts. Either remediate (request quota increase, enable region, fix IAM trust) or exclude them with an account filter on the next operation.
Re-run the operation with higher failure tolerance. Set FailureTolerancePercentage: 10 or higher so a few stragglers do not halt the entire rollout.
For service-managed StackSets, do not count on auto-deployment to repair failed instances. Auto-deployment acts when accounts join or leave the target OU; instances that failed in existing accounts need an explicit re-run of the operation.
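The "categorize failures" step can be sketched as a grouping over per-instance results. The example assumes summaries shaped like the output of list-stack-set-operation-results (Account, Region, Status, StatusReason); the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def group_failures(summaries):
    """Group failed stack instances by their reported reason."""
    by_reason = defaultdict(list)
    for summary in summaries:
        if summary["Status"] == "FAILED":
            reason = summary.get("StatusReason", "unknown")
            by_reason[reason].append((summary["Account"], summary["Region"]))
    return dict(by_reason)


sample_results = [  # hypothetical data
    {"Account": "111111111111", "Region": "us-east-1", "Status": "SUCCEEDED"},
    {"Account": "222222222222", "Region": "us-east-1", "Status": "FAILED",
     "StatusReason": "Execution role cannot be assumed"},
    {"Account": "333333333333", "Region": "us-east-1", "Status": "FAILED",
     "StatusReason": "Execution role cannot be assumed"},
    {"Account": "444444444444", "Region": "eu-west-1", "Status": "FAILED",
     "StatusReason": "Region is not enabled"},
]
failures = group_failures(sample_results)
```

Grouping by reason tends to show that a handful of root causes explain most failures, which keeps the remediation list short.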

Scenario Pattern: DELETE_FAILED Caused by Non-Empty S3 Bucket

A common SOA-C02 trap. The stack contains an AWS::S3::Bucket resource. The stack delete fails with the error "The bucket you tried to delete is not empty." CloudFormation cannot delete a non-empty S3 bucket.

The runbook:

Empty the bucket manually (or via a Lambda-backed custom resource that runs on delete).
Retry stack delete. Optionally pass --retain-resources MyBucket if you want to keep the bucket — CloudFormation removes the resource from the stack but leaves the bucket in place.
For future templates, wrap the bucket with a custom resource that empties it on stack delete, or set DeletionPolicy: Retain so the bucket persists outside the stack.
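Emptying a bucket programmatically, whether by hand or from a custom resource, usually means batching keys, since the S3 DeleteObjects API accepts at most 1,000 keys per request. The sketch below shows only the batching; delete_batches is a hypothetical helper, the keys are made up, and the boto3 call appears as a comment. Versioned buckets would also need each object's VersionId, which is omitted here.

```python
def delete_batches(keys, batch_size=1000):
    """Yield DeleteObjects payloads of at most batch_size keys each."""
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        yield {"Objects": [{"Key": key} for key in chunk], "Quiet": True}


example_keys = [f"logs/{i}.gz" for i in range(2500)]
batches = list(delete_batches(example_keys))

# Assumed usage with boto3 (not executed here):
#   s3 = boto3.client("s3")
#   for payload in batches:
#       s3.delete_objects(Bucket="my-bucket", Delete=payload)
```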

A related trap: the stack contains a KMS CMK with DeletionPolicy: Retain. The stack deletes successfully but leaves the CMK orphaned (still incurs cost). This is by design for Retain, but operators must remember to manually clean up retained resources during cost reviews.

Common Trap: Stack Creation Timeout vs Resource Signal Timeout

A frequent confusion. CloudFormation applies no stack-wide creation timeout by default. CreateStack accepts an optional TimeoutInMinutes (--timeout-in-minutes in the CLI), but unless you set it, the timers that matter are per resource:

Each resource type has its own service-side creation timeout (RDS instances may take up to 1 hour, EKS clusters longer, etc.).
Resources with CreationPolicy.ResourceSignal (typically EC2 instances and ASGs) wait for a cfn-signal --success from user-data within Timeout (default PT5M, max PT12H).
WaitCondition resources take their own Timeout, specified in seconds (maximum 43200, i.e. 12 hours).

If the stack appears stuck in CREATE_IN_PROGRESS, the cause is usually a resource taking longer than expected, not a global "stack timeout". Inspect stack events to find the slow resource.
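ResourceSignal timeouts are written as ISO 8601 durations such as PT15M or PT1H30M. The sketch below converts the hours/minutes/seconds form to seconds and rejects values above the 12-hour maximum; signal_timeout_seconds is a hypothetical helper that handles only this simple subset of ISO 8601.

```python
import re

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")
MAX_SECONDS = 12 * 3600  # 12-hour ceiling for signal timeouts


def signal_timeout_seconds(value):
    """Parse a PT#H#M#S duration and return it in seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    total = hours * 3600 + minutes * 60 + seconds
    if total > MAX_SECONDS:
        raise ValueError("timeout exceeds the 12-hour maximum")
    return total
```

Comparing the parsed timeout with how long the instance's bootstrap really takes is often the quickest way to tell a slow user-data script from a missing cfn-signal call.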

Common Trap: Parameter Store Types Are Underused

Templates that hardcode AMI IDs, instance types, or environment-specific values create per-region or per-environment maintenance debt. The clean SOA-C02 answer:

AWS::SSM::Parameter::Value<String> — read a String parameter from SSM Parameter Store at deploy time.
AWS::SSM::Parameter::Value<AWS::EC2::Image::Id> — read an AMI ID from Parameter Store.
AWS::SSM::Parameter::Value<List<String>> — read a comma-separated list.

Combined with the AWS-managed AMI parameters (/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64) and the AWS-managed ECS-optimized parameters, templates become region-portable without Mappings.
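A minimal template using such a parameter might look like the following, written here as a Python dictionary and serialized to a JSON template body. The logical IDs LatestAmiId and WebServer and the instance type are illustrative choices, not values from any particular environment.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "LatestAmiId": {
            # CloudFormation resolves the SSM path to a region-specific AMI ID.
            "Type": "AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>",
            "Default": "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64",
        }
    },
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": {"Ref": "LatestAmiId"},
                "InstanceType": "t3.micro",  # illustrative value
            },
        }
    },
}

template_body = json.dumps(template, indent=2)
```

The same template body can be deployed unchanged in any region, because the AMI lookup happens in the target region's Parameter Store.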

Common Trap: Stack Policies Block Updates But Not Deletes

As covered above, this is a recurring distractor. Stack policies cover Update:* only. Termination protection plus IAM denies on cloudformation:DeleteStack are how you prevent deletes.

Common Trap: StackSets Concurrent Regions Default Is Sequential

RegionConcurrencyType defaults to SEQUENTIAL. A candidate may assume StackSets deploy to all regions in parallel by default — they do not. For an organization-wide rollout to 4 regions across 200 accounts, sequential region deployment can take hours. Set RegionConcurrencyType: PARALLEL for faster rollouts when blast radius is acceptable, or pair SEQUENTIAL with explicit RegionOrder to deploy to a canary region first.

SOA-C02 vs SAA-C03: The Operational Lens

SAA-C03 and SOA-C02 both touch CloudFormation, but the lenses differ.

Question style	SAA-C03 lens	SOA-C02 lens
IaC selection	"Which service should the architect use to deploy across accounts?"	"The StackSet failed in 5 accounts — what is the next step?"
`DeletionPolicy`	"Which feature retains the database across stack deletes?"	"The team accidentally lost data on stack update — what attribute pair was missing?"
Change sets	"Which feature previews stack changes?"	"The team replaced production database with `UpdateStack` — what process change?"
Drift	Rarely tested	"How do you detect and remediate drift across 50 stacks?"
Rollback	"Why use rollback?"	"The stack is stuck in UPDATE_ROLLBACK_FAILED — what API call?"
Stack policies	Rarely tested	"Block accidental updates to the production database within the stack."
StackSets self vs service-managed	"Which mode supports auto-deploy to new org accounts?"	"The team uses Organizations and wants every new account to receive a baseline — configure."

The SAA candidate selects CloudFormation; the SOA candidate operates it under failure conditions, with multi-account and multi-region scope.

Exam Signal: How to Recognize a Domain 3.1 CloudFormation Question

Domain 3.1 questions on SOA-C02 follow predictable shapes.

"Stack stuck in UPDATE_ROLLBACK_FAILED" → ContinueUpdateRollback API, optionally with ResourcesToSkip. Never delete-and-recreate as the first answer.
"Stack deletion failed" → Most likely a non-empty S3 bucket, a retained resource, or termination protection. Empty bucket / fix dependency / disable protection, then retry. Optionally RetainResources on the retry.
"Production database accidentally replaced during stack update" → Process gap. Require change sets, set UpdateReplacePolicy: Snapshot on stateful resources, attach a stack policy denying Update:* on the database resource.
"Deploy a baseline to every account in the organization" → Service-managed StackSets with auto-deployment, targeting the root or baseline OU.
"Some accounts in StackSet failed" → Adjust FailureToleranceCount / FailureTolerancePercentage, fix root cause, retry.
"Drift in stack-managed resources" → detect-stack-drift + Config rule cloudformation-stack-drift-detection-check + EventBridge → SSM Automation remediation.
"AMI ID hardcoded, fails in another region" → Parameter type AWS::SSM::Parameter::Value<AWS::EC2::Image::Id> with the AWS-managed AMI parameter path.
"Auto-rollback if 5xx errors spike during deployment" → Rollback triggers with monitoring time pointing at a CloudWatch alarm.
"Stack creation waits on an EC2 instance for a signal, then fails" → CreationPolicy.ResourceSignal.Timeout exceeded; user-data did not call cfn-signal --success. Either fix user-data or raise the timeout (default 5 minutes, max 12 hours).
"Insufficient IAM permissions during stack creation" → Either grant the invoker more permissions or use a CloudFormation service role with broader scope.

Domain 3 is 18 percent of the exam, with Task Statement 3.1 covering CloudFormation, AMIs, Image Builder, multi-account/multi-region, and deployment scenarios. CloudFormation alone is the deepest of these and likely accounts for 6 to 9 questions. Mastering UPDATE_ROLLBACK_FAILED recovery, change sets, drift, stack policies, DeletionPolicy / UpdateReplacePolicy, and StackSets configuration is a high-leverage study activity. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html

Decision Matrix — CloudFormation Construct for Each SysOps Goal

Use this lookup during the exam.

Operational goal	Primary construct	Notes
Preview stack changes before applying	Change set	Always for production updates.
Detect out-of-band edits	Drift detection + Config rule `cloudformation-stack-drift-detection-check`	Use a Config aggregator for an org-wide view.
Protect a stateful resource from deletion	`DeletionPolicy: Retain` or `Snapshot`	Snapshot for RDS/EBS; Retain for KMS/S3.
Protect a stateful resource from replacement	`UpdateReplacePolicy: Retain` or `Snapshot`	Pair with `DeletionPolicy`.
Block updates to specific resources within a stack	Stack policy	JSON policy attached to the stack.
Prevent stack deletion	Termination protection + IAM deny `DeleteStack`	Stack policy does NOT block deletes.
Auto-rollback on metric breach during deploy	Rollback triggers + monitoring time	Up to 5 alarms, 0–180 min monitoring.
Recover from `UPDATE_ROLLBACK_FAILED`	`ContinueUpdateRollback` API	Optionally `ResourcesToSkip`.
Recover from `ROLLBACK_COMPLETE`	Delete the stack and re-create	Terminal state — no in-place recovery.
Recover from `DELETE_FAILED`	Fix root cause + retry delete with `RetainResources`	Common cause: non-empty S3 bucket.
Deploy across multiple regions	StackSets with `RegionConcurrencyType`	Default sequential; set PARALLEL for speed.
Deploy across multiple accounts	StackSets — service-managed if Organizations	Self-managed otherwise.
Auto-deploy to new accounts joining OU	Service-managed StackSet with auto-deployment	Requires Organizations all features + trusted access.
Tolerate isolated account failures	`FailureToleranceCount` / `FailureTolerancePercentage`	Default 0 — first failure halts.
Region-portable AMI references	`AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>`	Use AWS-managed parameter paths.
Long-running EC2 first-boot signal	`CreationPolicy.ResourceSignal.Timeout`	Default 5 minutes, max 12 hours.
Reusable template module	Nested stacks (`AWS::CloudFormation::Stack`)	Tightly coupled to parent.
Share infra between independent stacks	Outputs + `Fn::ImportValue`	Same account/region only.
Cross-account resource sharing	AWS RAM (`AWS::RAM::ResourceShare`)	Referenced from CloudFormation.
Limit blast radius of CloudFormation actions	CloudFormation service role + IAM scoping	Service role acts on CloudFormation's behalf.
Audit CloudFormation operations	CloudTrail management events	Every CFN API call logged.
Schedule periodic stack updates	EventBridge rule → Lambda → CloudFormation API	No native CFN scheduler.

Common Traps Recap — CloudFormation Stacks and StackSets

Expect two or three of these distractors on a typical SOA-C02 attempt.

Trap 1: ROLLBACK_COMPLETE is recoverable

It is not. ROLLBACK_COMPLETE from a failed CREATE is terminal — delete and re-create. Only UPDATE_ROLLBACK_FAILED from a failed rollback during update is recoverable via ContinueUpdateRollback.
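One way to memorize which failed status maps to which recovery path is a simple lookup table. The mapping below restates this guide's recommendations; the names RECOVERY_ACTIONS and next_action are hypothetical.

```python
RECOVERY_ACTIONS = {
    "ROLLBACK_COMPLETE": "delete the stack and re-create it",
    "UPDATE_ROLLBACK_FAILED": (
        "fix the cause, then call ContinueUpdateRollback "
        "(optionally with ResourcesToSkip)"
    ),
    "DELETE_FAILED": (
        "fix the cause, then retry the delete "
        "(optionally with RetainResources)"
    ),
    "UPDATE_ROLLBACK_COMPLETE": "no recovery needed; the stack can be updated again",
}


def next_action(stack_status):
    """Return the recommended next step for a stack status."""
    return RECOVERY_ACTIONS.get(stack_status, "inspect stack events before acting")
```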

Trap 2: DeletionPolicy alone protects against update-replacement

It does not. Pair with UpdateReplacePolicy on every stateful resource.

Trap 3: Stack policy blocks stack deletion

It does not. Use termination protection plus IAM denies for that.

Trap 4: Detailed monitoring triggers CloudFormation rollback

It does not. Rollback triggers are explicit CloudWatch alarm ARNs configured on the stack operation, not implicit detailed monitoring.

Trap 5: StackSets deploys to all regions in parallel by default

It does not. RegionConcurrencyType defaults to SEQUENTIAL. Set PARALLEL explicitly for faster rollouts.

Trap 6: Self-managed StackSets supports auto-deployment

It does not. Auto-deployment to new organization accounts requires service-managed permissions plus AWS Organizations all features.

Trap 7: Hardcoded AMI IDs are acceptable

They are not. Use AWS::SSM::Parameter::Value<AWS::EC2::Image::Id> with AWS-managed parameter paths for region portability.

Trap 8: Delete-and-recreate is the right answer for stuck stacks

It is the last resort. Always try ContinueUpdateRollback first; consider RetainResources on a delete; only delete-and-recreate when those fail.

Trap 9: Cross-stack references work cross-account

They do not. Fn::ImportValue works only within the same account and region. Cross-account requires StackSets or AWS RAM.

Trap 10: CloudFormation has a global stack timeout

It does not, unless you set the optional TimeoutInMinutes on CreateStack. Each resource has its own service-side timeout; CreationPolicy.ResourceSignal.Timeout defaults to 5 minutes (maximum 12 hours) for resource-signaling resources.

Trap 11: Empty an S3 bucket via the template

CloudFormation cannot empty buckets. Use a Lambda-backed custom resource that empties on delete, or empty manually before stack delete, or set DeletionPolicy: Retain on the bucket.

Trap 12: Service quotas are tested at deploy time only

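Quotas are not validated up front for the whole template. They are enforced per resource at create time, so a stack can fail partway through provisioning when an account-level limit is reached. CloudFormation's own quotas (maximum stacks per region, maximum resources per template) are also enforced. Track quota usage through Service Quotas with a CloudWatch alarm at 80 percent.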

FAQ — CloudFormation Stacks and StackSets

Q1: A stack is stuck in `UPDATE_ROLLBACK_FAILED`. Should I delete it?

No — at least not first. The recovery API is ContinueUpdateRollback, optionally with ResourcesToSkip listing logical IDs that cannot be rolled back. Inspect stack events to find which resources failed and why. If the underlying cause can be fixed (e.g., re-create a deleted IAM role, re-attach a missing Internet Gateway), fix it and call ContinueUpdateRollback without skips. If a resource is genuinely unrecoverable, skip it — CloudFormation marks it UPDATE_COMPLETE regardless of actual state, and reconciliation becomes your responsibility. Deleting and recreating destroys all stack resources, including healthy ones, and is the wrong first answer on SOA-C02.

Q2: When do I use `DeletionPolicy: Retain` vs `Snapshot` vs leaving the default?

Use Retain for resources that must outlive the stack and cannot be reconstructed from a snapshot — KMS CMKs (referenced by ciphertext outside the stack), S3 buckets holding production data, IAM resources referenced cross-account. Use Snapshot for stateful resources where a final point-in-time copy is sufficient for recovery — RDS DB instances, EBS volumes, ElastiCache clusters, FSx file systems. Leave the default Delete for stateless resources where re-creation from the template is cheap — launch templates, security groups, IAM roles tightly coupled to the stack. The same rule applies to UpdateReplacePolicy — set both attributes together on every stateful resource.

Q3: When should I use change sets?

Always for production stack updates. Change sets preview every resource change including critical replacement events that destroy data. The cost is one extra command before execute-change-set; the benefit is preventing accidental data loss. For new stacks, change sets are useful when the template has conditional resources whose final composition depends on parameter values — review what will be created before committing. For non-production environments, direct UpdateStack may be acceptable for velocity, but mature SysOps teams require change sets in every environment as a habit.

Q4: How do I deploy a CloudWatch alarm baseline to every account in my organization?

Use a service-managed StackSet with auto-deployment enabled, targeting the root OU or a baseline OU. Prerequisites: AWS Organizations with all features (not just consolidated billing), trusted access for CloudFormation enabled, and a delegated administrator account if you want to manage StackSets from a non-management account. The StackSet template defines the alarms; auto-deployment ensures every new account joining the OU gets the alarms automatically. Self-managed StackSets cannot auto-deploy — they require explicit account IDs and manual IAM role provisioning. The exam answer is service-managed StackSets in virtually every "deploy to all org accounts" scenario.

Q5: What does `treatMissingData` from CloudWatch alarms have to do with rollback triggers?

A CloudWatch alarm referenced as a rollback trigger evaluates per its own configuration. If the alarm is in INSUFFICIENT_DATA during the monitoring window, it does not count as ALARM for rollback purposes — only an explicit transition to ALARM triggers rollback. So an alarm with treatMissingData: missing and no recent data points sits in INSUFFICIENT_DATA and never causes a rollback. This is usually the correct behavior — you do not want missing data to revert a deployment. If you need silence to be treated as failure, set the underlying alarm to treatMissingData: breaching so missing periods count as breaches and the alarm transitions to ALARM.

Q6: How do I prevent the production database from being accidentally deleted via CloudFormation?

Defense in depth, four layers: (1) DeletionPolicy: Snapshot on the RDS resource so any deletion produces a final snapshot. (2) UpdateReplacePolicy: Snapshot so any replacement triggered by an update produces a final snapshot. (3) A stack policy that denies Update:* on the database logical resource ID, preventing accidental modifications. (4) Termination protection enabled on the stack itself (EnableTerminationProtection: true, set through the UpdateTerminationProtection API) to block DeleteStack API calls. Optionally add an SCP at the OU level denying cloudformation:DeleteStack against production stack ARNs. Each layer covers a different failure mode; together they make accidental destruction near-impossible.

Q7: What is the difference between self-managed and service-managed StackSets?

Self-managed requires the operator to create the cross-account IAM trust manually — an AWSCloudFormationStackSetAdministrationRole in the administrator account and an AWSCloudFormationStackSetExecutionRole in each target account. It works without AWS Organizations. Service-managed uses AWS Organizations to create and manage the IAM roles automatically, supports targeting OUs (not just account IDs), and supports auto-deployment to new accounts joining the OU. Service-managed requires Organizations all features and trusted access for CloudFormation. The SOA-C02 default answer for organization-wide deployment is service-managed; self-managed is correct only when Organizations is not in use or when targeting a specific fixed list of accounts.

Q8: What are `MaxConcurrentCount` and `FailureToleranceCount` on a StackSet operation?

MaxConcurrentCount (or MaxConcurrentPercentage) controls how many target accounts CloudFormation deploys to in parallel within a single region. Default is 1 (serial). Higher values speed up large rollouts but increase blast radius if the template has bugs. FailureToleranceCount (or FailureTolerancePercentage) controls how many account failures the operation tolerates before halting. Default is 0 — first failure stops everything. For production rollouts to many accounts, typical values are MaxConcurrentPercentage: 25, FailureTolerancePercentage: 5 — deploy to 25 percent of accounts in parallel, tolerate up to 5 percent failures, and halt only when failures exceed the tolerance.

Q9: How do I make a CloudFormation template region-portable for AMI IDs?

Use a parameter of type AWS::SSM::Parameter::Value<AWS::EC2::Image::Id> with a default pointing at an AWS-managed Parameter Store path. For Amazon Linux 2023: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64. CloudFormation reads the parameter at stack-create time, getting the correct region-specific AMI ID without hardcoding. The legacy approach is Mappings with per-region AMI IDs, which works but requires manual maintenance every time AWS releases a new AMI. SSM Parameter Store types are the SOA-C02 recommended approach.

Q10: When does CloudFormation roll back automatically vs leave the stack in a failed state?

By default, every CREATE and UPDATE operation rolls back on failure. On CREATE, --disable-rollback (or OnFailure: DO_NOTHING in the API) turns this off and leaves the stack in CREATE_FAILED for inspection. Older material states that updates have no equivalent, but update-stack and execute-change-set also accept --disable-rollback, which leaves the stack in UPDATE_FAILED with successfully provisioned resources preserved; without that flag, a failed update always rolls back automatically. If the rollback itself fails, the stack enters UPDATE_ROLLBACK_FAILED and waits for ContinueUpdateRollback. Rollback triggers are an additional layer that monitors CloudWatch alarms during the operation and for the configured monitoring time afterward; if a trigger alarm fires, CloudFormation initiates rollback even if the operation itself technically succeeded. This is useful for catching application-level regressions that show up only after deployment.
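
The two ways of switching off rollback on CREATE are alternatives: the CreateStack API accepts either `OnFailure` or `DisableRollback`, not both. The sketch below builds the arguments for boto3's `create_stack` and rejects the invalid combination up front; the helper name `create_stack_kwargs` is hypothetical.

```python
from typing import Any, Dict, Optional

VALID_ON_FAILURE = {"DO_NOTHING", "ROLLBACK", "DELETE"}


def create_stack_kwargs(
    stack_name: str,
    template_body: str,
    on_failure: Optional[str] = None,
    disable_rollback: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build create_stack arguments with at most one failure-handling option (sketch)."""
    if on_failure is not None and disable_rollback is not None:
        raise ValueError("specify OnFailure or DisableRollback, not both")
    if on_failure is not None and on_failure not in VALID_ON_FAILURE:
        raise ValueError(f"unknown OnFailure value: {on_failure}")

    kwargs: Dict[str, Any] = {"StackName": stack_name, "TemplateBody": template_body}
    if on_failure is not None:
        kwargs["OnFailure"] = on_failure
    if disable_rollback is not None:
        kwargs["DisableRollback"] = disable_rollback
    # With neither option set, CloudFormation rolls back a failed create.
    return kwargs
```

Passing `on_failure="DO_NOTHING"` keeps the failed resources in place so the stack events and the resources themselves can be inspected before the stack is deleted.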

Q11: What gets logged for CloudFormation operations?

Every CloudFormation API call appears in CloudTrail management events: CreateStack, UpdateStack, DeleteStack, ExecuteChangeSet, CreateChangeSet, DetectStackDrift, CreateStackInstances, and so on. Each event records the caller (user or service role), the parameters, and the response. For the resources CloudFormation creates, the resource-level events are logged by their respective services (e.g., EC2 RunInstances events). The combination of CloudFormation API events plus resource service events provides complete forensic visibility. For real-time alerting on sensitive CloudFormation actions, either route CloudTrail to CloudWatch Logs and create metric filters, or create EventBridge rules that match the CloudTrail-delivered API events directly, keyed on event names like DeleteStack or UpdateStack against production stack ARNs.
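
An EventBridge rule for this kind of alert is driven by an event pattern. The sketch below builds one for CloudFormation API calls recorded by CloudTrail. The `matches` function is a deliberately simplified local stand-in for checking the pattern against a sample event, not EventBridge's real matching engine, and both function names are hypothetical. Narrowing to production stacks would need an extra condition on the request parameters, which is left out here.

```python
import json
from typing import Any, Dict, Iterable


def cloudformation_alert_pattern(event_names: Iterable[str]) -> Dict[str, Any]:
    """Event pattern for selected CloudFormation API calls delivered via CloudTrail."""
    return {
        "source": ["aws.cloudformation"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["cloudformation.amazonaws.com"],
            "eventName": sorted(event_names),
        },
    }


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified exact-value matcher, for local testing only."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = cloudformation_alert_pattern(["DeleteStack", "UpdateStack"])
pattern_json = json.dumps(pattern)  # value for the rule's EventPattern
```

The serialized `pattern_json` is what a rule definition would carry as its event pattern, with an SNS topic or similar as the target.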

Q12: How do I roll back a specific change after the stack has reached `UPDATE_COMPLETE`?

CloudFormation does not have a "git revert" operation. Two paths: (1) Re-deploy the previous template version. Keep templates in Git, identify the previous tag or commit, and run an UpdateStack with that template body; create a change set first to preview exactly what will revert. (2) Use rollback triggers proactively: set a CloudWatch alarm on the application's error rate as a rollback trigger with a 30-minute monitoring time, so that a regression appearing within that window triggers automatic rollback before the stack finalizes. Once UPDATE_COMPLETE is reached and the monitoring time has expired, re-deploying the previous template manually is the only option.
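
Both paths can be combined in one call: re-deploy the earlier template and attach rollback triggers to that update as well. The sketch below builds the arguments for boto3's `update_stack`, assuming the API limits of at most five triggers and a monitoring time between 0 and 180 minutes. The helper names and the alarm ARN in the usage line are hypothetical.

```python
from typing import Any, Dict, List


def rollback_configuration(
    alarm_arns: List[str], monitoring_minutes: int = 30
) -> Dict[str, Any]:
    """Build a RollbackConfiguration structure from CloudWatch alarm ARNs (sketch)."""
    if not 0 <= monitoring_minutes <= 180:
        raise ValueError("monitoring time must be between 0 and 180 minutes")
    if len(alarm_arns) > 5:
        raise ValueError("a stack operation accepts at most five rollback triggers")
    return {
        "RollbackTriggers": [
            {"Arn": arn, "Type": "AWS::CloudWatch::Alarm"} for arn in alarm_arns
        ],
        "MonitoringTimeInMinutes": monitoring_minutes,
    }


def redeploy_kwargs(
    stack_name: str, previous_template_body: str, alarm_arns: List[str]
) -> Dict[str, Any]:
    """Arguments for update_stack that re-apply an earlier template version."""
    return {
        "StackName": stack_name,
        "TemplateBody": previous_template_body,
        "RollbackConfiguration": rollback_configuration(alarm_arns),
    }


kwargs = redeploy_kwargs(
    "web-prod",
    "{}",  # in practice, the template body from the previous Git tag
    ["arn:aws:cloudwatch:us-east-1:111111111111:alarm:web-5xx-rate"],
)
```

The previous template body would typically come from something like `git show <tag>:template.yaml`, where the tag and file name depend on the repository layout.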

Once CloudFormation provisioning is solid, the next operational layers are: AMI lifecycle and EC2 Image Builder for the artifact CloudFormation deploys; Systems Manager Automation and Patch Manager for the runbook engine that operates on CloudFormation-deployed resources and is itself often CloudFormation-managed; multi-account strategy with AWS Organizations and Control Tower for the org-level container that service-managed StackSets and SCPs operate within; and scheduled tasks and Config auto-remediation for the event-driven automation that detects drift and triggers CloudFormation updates or SSM runbooks.
