Backup, Restore, and Disaster Recovery Procedures

7,400 words · about 37 minutes to read

Backup and Disaster Recovery is the operational test of every SysOps engineer's nerve. It is the topic where the question is not "is the backup running?" but "the production database was corrupted thirty seconds ago by an accidental DROP TABLE; what do you type next?" The SOA-C02 exam guide v2.3 carves out Task Statement 2.3 ("Implement backup and restore strategies") inside Domain 2 (Reliability and Business Continuity, 16 percent). Unlike SAA-C03, which asks you to design a DR architecture from RTO and RPO targets, SOA-C02 asks you to execute the restore. Which AWS Backup recovery point do you select? Which restore parameters do you set? What is the new endpoint after RDS point-in-time restore? Which S3 lifecycle transition is allowed and which is rejected with InvalidArgument? Why is the cross-region snapshot copy you just made unreadable in the destination Region?

This guide is built specifically around the backup and disaster recovery operational workflow on AWS. We walk through AWS Backup plans, vaults, Vault Lock (governance vs compliance modes, including the 72-hour grace window after which compliance mode becomes irreversible), RDS automated backups and point-in-time restore, promoting a read replica during a regional disaster, S3 versioning with MFA Delete, S3 lifecycle rules and the 30-day transition minimum, S3 Cross-Region Replication (CRR) including Replication Time Control (RTC), Data Lifecycle Manager (DLM) for EBS and AMI snapshots, cross-region snapshot copy and the encryption gotcha that catches every candidate, S3 Glacier retrieval tiers and their actual minute-to-hour timings, and the four canonical DR strategies (backup/restore, pilot light, warm standby, multi-site active/active) framed against operational RTO and RPO. Backup and Disaster Recovery is therefore the topic where reading runbooks like a SysOps engineer beats memorizing architectural diagrams like an architect.

Why Backup and Disaster Recovery Sits at the Heart of SOA-C02 Domain 2

The SOA-C02 exam guide names exactly five skills under TS 2.3: automate snapshots and backups (RDS snapshots, AWS Backup, RTO/RPO, Data Lifecycle Manager, retention policy), restore databases (point-in-time restore, promote read replica), implement versioning and lifecycle rules, configure S3 Cross-Region Replication, and perform disaster recovery procedures. Every one of those skills is operational — not "decide which backup strategy fits", but "configure the backup plan, run the restore, validate the recovered resource". Backup and Disaster Recovery is therefore the cleanest expression of the SOA-C02 lens versus the SAA-C03 lens.

At the SysOps tier the framing is procedural, not architectural. SAA-C03 asks "the workload has an RPO of 5 minutes and an RTO of 1 hour — which DR pattern do you choose?" SOA-C02 asks "the team chose pilot light — the primary region just failed — list the steps you execute, in order, to bring the workload up in the secondary region". Backup and Disaster Recovery is also the topic where every other SOA-C02 topic plugs in: CloudWatch alarms (Domain 1) monitor backup job success and CRR replication latency, EventBridge rules (Domain 1.2) route Backup job failure events to remediation, Auto Scaling (Domain 2.1/2.2) launches the warm standby fleet, CloudFormation (Domain 3) deploys the secondary-region stack, KMS keys (Domain 4) re-encrypt the cross-region snapshot copy, VPC (Domain 5) provides the network in the failover Region, and EBS performance (Domain 6) determines whether the restored volume meets the SLA on first I/O. Backup and Disaster Recovery is therefore the topic where every later SOA-C02 skill must compose into a single procedure that actually restores production.

  • RTO (Recovery Time Objective): maximum acceptable elapsed time between the disaster and the workload being available again. RTO answers "how long can we be down?"
  • RPO (Recovery Point Objective): maximum acceptable data loss measured in time. RPO answers "how much recent data can we lose?"
  • AWS Backup: a centralized backup service that orchestrates backups across EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, Storage Gateway, S3, and more, using backup plans and vaults.
  • Backup plan: a policy document defining backup frequency, backup window, lifecycle (transition to cold, deletion), destination vault, and copy-to-other-vault rules.
  • Backup vault: a logical container that stores recovery points; encrypted with a KMS key, and optionally hardened with Vault Lock.
  • Recovery point: a single backup of a single resource at a single moment in time — the unit you actually restore from.
  • Vault Lock: a write-once-read-many (WORM) policy on a backup vault; governance mode can be removed by privileged IAM, compliance mode is irrevocable after the 72-hour grace window expires.
  • Point-in-time restore (PITR): RDS feature that creates a new DB instance restored to any second within the configured backup retention window (up to 35 days).
  • S3 versioning: bucket-level setting that preserves every version of every object; once enabled it can be Suspended but never returned to Disabled.
  • MFA Delete: an additional protection on a versioning-enabled bucket requiring an MFA code to delete object versions or change versioning state; configurable only by the bucket-owning AWS account root user.
  • S3 lifecycle rule: a rule that transitions objects between storage classes or expires them on a schedule.
  • S3 Cross-Region Replication (CRR): asynchronous, automatic replication of objects from a source bucket to a destination bucket in a different Region.
  • Replication Time Control (RTC): SLA-backed CRR feature that replicates 99.99 percent of objects within 15 minutes.
  • Data Lifecycle Manager (DLM): native AWS service that schedules creation, retention, and cross-region copy of EBS snapshots and AMIs.
  • Object Lock: S3 feature that applies WORM retention to individual object versions, in governance or compliance retention mode, optionally with Legal Hold.
  • Reference: https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html

Backup and Disaster Recovery on AWS in Plain Language

Backup vocabulary collapses easily. Three analogies help the constructs stick.

Analogy 1: The Bank Vault and Safety Deposit Boxes

AWS Backup is a bank vault. The backup vault is the physical vault room: bolted, alarmed, KMS-encrypted. Recovery points are the safety deposit boxes stacked inside, each one a snapshot of a customer's assets at one moment. The backup plan is the standing instruction at the customer counter: "every weeknight at 23:00, take a fresh snapshot of accounts tagged tier=prod, transfer last month's box to the cold-storage vault on the third basement, and shred boxes older than seven years". Vault Lock is the time-locked safe door the regulator demanded after a fraud audit. Once it is locked in compliance mode and the 72-hour grace window expires, not even the bank president can shorten retention or remove a box, because the law says the regulator must be able to inspect every transaction for seven years. Vault Lock governance mode is the softer cousin: the safe door is locked, but the bank's chief security officer with a special key (the backup:DeleteBackupVaultLockConfiguration permission) can still unlock it for legitimate reasons. Cross-region copy is the off-site backup vault in a different city, required because if the headquarters building burns down, the in-building vault burns with it.

Analogy 2: The Library Reservation Desk

S3 versioning is the library reservation desk. Each time a borrower returns an annotated copy of a book, the librarian does not overwrite the original — she keeps both versions on the shelf with version IDs (v1: untouched, v2: with marginalia from 2024, v3: with the 2025 reader's correction). When someone "deletes" a book she does not throw it away; she places a delete marker bookmark in front of the stack so casual searches return "not found", but the librarian can lift the bookmark and recover any previous version. MFA Delete is the rare-books cage key that the head librarian — only the head librarian, only with a physical token — must produce before any version can be permanently destroyed. S3 lifecycle rules are the librarian's standing rotation policy: "after 30 days move to the off-site warehouse (Glacier Instant Retrieval), after 90 days move to deep storage (Glacier Deep Archive), after seven years shred — and on a versioning-enabled shelf, expire non-current versions after 90 days so the marginalia copies don't pile up forever". Object Lock compliance mode is the government records vault in the basement that no librarian, not even the chief, can unlock until the legally mandated retention expires.

Analogy 3: The Insurance Policy Tier (DR Strategies)

The four AWS DR strategies map cleanly to insurance product tiers. Backup/restore is basic property insurance — cheap monthly premium, but if the house burns down you live in a hotel for weeks while the rebuild happens (high RTO, high RPO). Pilot light is liability insurance with an emergency apartment on standby — the apartment is empty but plumbed, painted, and warm; if the house burns you move in tonight and only need to bring your clothes (medium RTO, low RPO because the database is already replicating). Warm standby is a fully furnished second home running at minimum staff — small kitchen team, lights on, ready to receive the family within the hour at scaled-up capacity (low RTO, low RPO). Multi-site active/active is two identical homes occupied simultaneously — every meal cooked twice, every bed kept made; expensive, but if one burns you keep eating dinner without missing a course (near-zero RTO, near-zero RPO). On SOA-C02, the question rarely asks which to pick (that is SAA's job); it asks which procedure runs at failover time for the tier already chosen.

For SOA-C02, the bank vault analogy is the most useful when a question mixes Vault Lock modes with retention. Compliance mode after 72 hours = nobody opens it, ever — even AWS Support cannot delete a recovery point until its scheduled retention expires. Governance mode = privileged IAM principals can still delete with the right permissions. Memorize the 72-hour grace window: that is the only time you can change your mind about Compliance Mode. After 72 hours, even the AWS root user cannot shorten retention or delete recovery points before their scheduled expiry. Reference: https://docs.aws.amazon.com/aws-backup/latest/devguide/vault-lock.html

TS 2.3 Reading: SOA Tests Running Backups, Not Designing DR

The SOA-C02 exam guide phrases TS 2.3 as "Implement backup and restore strategies" — note the verb implement, not design. This wording is the single best predictor of question style. SOA does not test "given a 5-minute RPO, which AWS service do you choose?" — that is SAA-C03 territory. SOA tests "the backup plan is configured, the recovery point exists, the production resource is corrupted — what do you do, in order, to restore it?" The five skills the exam guide names directly map to five operational competencies a SysOps engineer must demonstrate at a keyboard.

TS 2.3 skill | What SAA tests | What SOA tests
Automate snapshots and backups | "Pick AWS Backup vs DLM" | "Configure the backup plan with cold-storage transition at 90 days and verify the next backup job in CloudWatch"
Restore databases | "PITR meets the 5-minute RPO requirement" | "Run RDS PITR to 14:32:05 UTC; what is the new endpoint and how do you cut the application over?"
Implement versioning and lifecycle rules | "Versioning protects against accidental delete" | "Lifecycle transition fails: why is 30 days the minimum age for a Standard to Standard-IA transition?"
Configure S3 CRR | "CRR meets cross-region durability requirement" | "CRR is configured but no replication is happening; list the four prerequisite checks"
Perform disaster recovery procedures | "Pilot light is appropriate for this RTO" | "The primary Region is unavailable; execute the pilot light failover steps in the secondary Region"

This wording — implement, restore, configure, perform — is everywhere in TS 2.3. Memorize it as the SOA voice.

AWS Backup: Plans, Vaults, and Recovery Points

AWS Backup is the centralized, policy-driven backup service that the SOA-C02 exam treats as the canonical answer for any "automate backups across multiple resource types" scenario. It supersedes per-service backup configuration (RDS automated backups, EBS snapshot lifecycle, DynamoDB on-demand backup, EFS backups) by orchestrating all of them under one set of policies and one audit surface.

Backup plan structure

A backup plan has three components:

  1. Rules — one or more rules each specifying:
    • Schedule (cron expression or rate, e.g., daily at 05:00 UTC).
    • Backup window — the start window plus completion window in which AWS Backup may run the job.
    • Lifecycle: when to transition to cold storage and when to delete. A recovery point must spend at least 90 days in cold storage, so the delete setting must be at least 90 days later than the cold-transition setting (for example, transition at day 90 and delete no earlier than day 90 + 90 = 180).
    • Destination vault — where the recovery points land.
    • Copy actions — optional cross-Region or cross-account copy with their own lifecycle.
  2. Resource assignments — which resources the plan covers, selected by tag (Backup=daily, Environment=prod) or by ARN list. Tag-based selection is the SOA-preferred pattern because new tagged resources are picked up automatically.
  3. IAM service role — the role AWS Backup assumes to read source resources and write recovery points (AWSBackupServiceRolePolicyForBackup and ...ForRestores).
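
As a rough sketch, the rule fields above could be assembled into the dictionary shape that boto3's `create_backup_plan` accepts as its `BackupPlan` argument. The helper name `build_backup_plan`, the plan and vault names, and the window sizes are illustrative assumptions; the only logic is a check that deletion falls at least 90 days after the cold-storage transition.

```python
def build_backup_plan(name, vault, cold_after_days, delete_after_days):
    """Return a BackupPlan-shaped dict. All names and values are hypothetical."""
    # A recovery point must stay in cold storage at least 90 days before deletion.
    earliest_delete = cold_after_days + 90
    if delete_after_days < earliest_delete:
        raise ValueError(
            f"delete_after_days must be >= {earliest_delete}, got {delete_after_days}"
        )
    return {
        "BackupPlanName": name,
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": vault,
                "ScheduleExpression": "cron(0 5 * * ? *)",  # daily at 05:00 UTC
                "StartWindowMinutes": 60,        # assumed start window
                "CompletionWindowMinutes": 180,  # assumed completion window
                "Lifecycle": {
                    "MoveToColdStorageAfterDays": cold_after_days,
                    "DeleteAfterDays": delete_after_days,
                },
            }
        ],
    }


plan = build_backup_plan("prod-daily", "prod-vault", 90, 180)
```

The dictionary is only built here, not submitted; in a real account it would be passed to the Backup client together with a separate resource selection and IAM role.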

Backup vault structure

A backup vault is the storage container for recovery points. Every vault has:

  • A KMS encryption key — either AWS-managed aws/backup or a customer-managed CMK (CMK gives you key-policy-level control, including cross-account access).
  • A vault access policy — a resource policy controlling which principals can perform vault-level operations.
  • An optional Vault Lock policy (covered next).
  • An optional vault notification SNS topic for backup/restore job state changes.

The vault is regional. To protect against a regional outage, configure the backup plan with a copy action that replicates the recovery point to a vault in a second Region.

Restoring from a recovery point

The cardinal rule: AWS Backup restores create new resources. They do not overwrite the original. Restoring an EBS snapshot creates a new volume (you then attach it). Restoring an RDS recovery point creates a new DB instance (you then update the application connection string or run a CNAME flip). Restoring an EFS recovery point creates a new file system or restores items into a new path inside an existing file system. The SysOps engineer running the restore must therefore plan the cutover step after the restore — the restore alone does not bring the workload back.

A consistent SOA-C02 trap: candidates assume "restore" overwrites the corrupted resource. It does not. AWS Backup restore creates a brand-new EBS volume, RDS instance, EFS file system, or DynamoDB table; the original is untouched (and often still corrupted). The SysOps procedure is: (1) restore to a new resource, (2) validate the data, (3) cut traffic over (DNS update, application config push, attach new EBS volume, point readers at the new endpoint), (4) only then decommission the original. RDS PITR follows the same rule and even forces a new instance ID. Reference: https://docs.aws.amazon.com/aws-backup/latest/devguide/restoring-a-backup.html

AWS Backup Vault Lock: Governance vs Compliance Mode

Vault Lock is the WORM (write-once-read-many) feature that turns a backup vault into a tamper-evident, regulator-grade backup store. It is the SOA-correct answer to any "regulator requires 7-year tamper-proof retention" or "ransomware encrypted our primary AND our backups — make the backups untouchable" scenario.

Two modes

  • Governance mode: Vault Lock is in effect, but principals with sufficient IAM permissions (notably backup:DeleteBackupVaultLockConfiguration) can remove the lock, after which recovery points can be deleted before scheduled expiry and the vault access policy can be loosened. Governance mode protects against accidents and unprivileged users; it does not protect against a malicious privileged insider with the right IAM.
  • Compliance mode: Vault Lock is in effect, and after the 72-hour grace window the lock cannot be removed or weakened by anyone, including the AWS account root user and AWS Support. Recovery points cannot be deleted before their scheduled retention expires, and the minimum and maximum retention values configured at lock time can no longer be changed.

The 72-hour grace window

When you call put-backup-vault-lock-configuration with a grace period (the --changeable-for-days parameter, minimum 3 days, that is 72 hours), the vault is locked in compliance mode and AWS opens the grace window. During that window you can call delete-backup-vault-lock-configuration to back out, for example because you misconfigured the retention values. Once the window closes, the lock becomes immutable. The window exists precisely because compliance mode is irreversible: AWS gives you at least three days to test the configuration before it becomes permanent.
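
The grace-window arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the window is expressed in whole days (3 days matching the 72 hours discussed above); the function names are hypothetical and nothing here calls AWS.

```python
from datetime import datetime, timedelta, timezone


def lock_becomes_immutable_at(locked_at, changeable_for_days=3):
    """Moment after which a compliance-mode lock can no longer be removed."""
    if changeable_for_days < 3:
        raise ValueError("the grace window is at least 3 days (72 hours)")
    return locked_at + timedelta(days=changeable_for_days)


def can_still_back_out(locked_at, now, changeable_for_days=3):
    """True while deleting the vault lock configuration is still possible."""
    return now < lock_becomes_immutable_at(locked_at, changeable_for_days)


locked_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(lock_becomes_immutable_at(locked_at))  # 2025-01-04 12:00:00+00:00
print(can_still_back_out(locked_at, locked_at + timedelta(hours=71)))  # True
print(can_still_back_out(locked_at, locked_at + timedelta(hours=73)))  # False
```

A runbook could record the immutability timestamp next to the change ticket so that the retention values are reviewed before the window closes.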

When to choose which

Requirement | Mode
Protect against accidental deletion by ops engineers | Governance: privileged break-glass still works for legitimate emergencies
Tamper-proof retention for SEC 17a-4, FINRA, HIPAA, PCI-DSS evidence | Compliance: the auditor needs assurance that nobody can alter the record
Defense against ransomware that compromised admin credentials | Compliance: even an attacker with root cannot delete recovery points before scheduled expiry
Internal policy without external audit | Governance: easier to recover from misconfiguration

A SOA-C02 distractor pattern: a question describes a regulator demanding 7-year retention; the candidate selects compliance mode; an answer choice claims "you can change the retention later if business needs evolve". That answer is wrong — after the 72-hour grace window, compliance-mode retention can only be lengthened by adjusting individual recovery point retention upward where allowed, never shortened, and the lock itself cannot be removed. AWS Support cannot remove it. The AWS root user cannot remove it. The only path to delete a compliance-locked recovery point is to wait for its scheduled expiry. Pick compliance mode only when you are certain the policy is correct. Reference: https://docs.aws.amazon.com/aws-backup/latest/devguide/vault-lock.html

  • Vault Lock compliance mode grace window: 72 hours to remove the lock; after that the lock is irreversible.
  • Vault Lock minimum retention is set at lock time and cannot be lowered.
  • AWS Backup cold-storage minimum: 90 days. A recovery point must remain in cold storage at least 90 days before it can be deleted, in addition to whatever time it spent in warm storage.
  • AWS Backup minimum total retention for cold-tiered recovery point: warm period + 90 days cold = at least 90 + 90 = 180 days for typical configurations transitioning to cold at day 90.
  • RDS automated backup retention range: 0 to 35 days (0 disables automated backups, breaking PITR).
  • RDS PITR granularity: any second within the retention window, up to the latest restorable time, which is typically about 5 minutes behind current time.
  • EBS snapshot maximum per Region: 100,000 by default (soft quota, can be increased).
  • Reference: https://docs.aws.amazon.com/aws-backup/latest/devguide/vault-lock.html

RDS Automated Backups and Point-in-Time Restore (PITR)

Even before AWS Backup centralizes everything, RDS has its own automated backup machinery that powers PITR. Understanding it is mandatory for SOA-C02.

How RDS automated backups work

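When you enable automated backups (retention 1–35 days; the default is 7 days when the instance is created in the console and 1 day when it is created through the API or CLI), RDS: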

  1. Takes a daily storage volume snapshot during the configured backup window.
  2. Captures transaction logs every 5 minutes and stores them in an internal S3-backed location.
  3. Combines snapshots + transaction logs to support PITR to any second within the retention period (the most recent restorable time is typically 5 minutes behind current time because of log shipping cadence).

Point-in-time restore procedure

PITR is the canonical "we just dropped the wrong table" recovery. The operational steps:

  1. From the RDS console (or restore-db-instance-to-point-in-time CLI), pick the source DB instance.
  2. Choose Latest restorable time or a custom timestamp within the retention window.
  3. Configure the new DB instance — name, instance class, storage, multi-AZ, security group, subnet group, parameter group. The restore creates a new instance; you cannot overwrite the source.
  4. RDS provisions the new instance, restores the most recent automated snapshot taken before the chosen timestamp, replays transaction logs to the exact second, and reports the new endpoint.
  5. Cut over by updating the application connection string (or, if you used Route 53 / RDS Proxy, point the alias at the new endpoint).
  6. (Optional) Decommission the corrupted source instance after validating the new one.

The cutover step is the operationally interesting one. PITR generates a new endpoint hostname (for example mydb-restored.xyz.us-east-1.rds.amazonaws.com); applications hardcoded to the original endpoint do not auto-switch. SOA-C02 routinely tests "what changes after a PITR?", and the answer always includes "the application must be repointed to the new endpoint".
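
The procedure above can be sketched as a small request builder. The keys mirror the parameter names of boto3's `restore_db_instance_to_point_in_time`; the helper `build_pitr_request`, the instance identifiers, and the restorable-window values are assumptions made for illustration, and the sketch stops short of calling AWS.

```python
from datetime import datetime, timezone


def build_pitr_request(source_id, target_id, restore_time, earliest, latest):
    """Validate a PITR request and return keyword arguments for the restore call."""
    if target_id == source_id:
        # PITR never overwrites the source; it always creates a new instance.
        raise ValueError("target identifier must differ from the source")
    if not (earliest <= restore_time <= latest):
        raise ValueError("restore_time is outside the restorable window")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


# Hypothetical window: 7 days of retention, latest restorable time 14:55 UTC.
earliest = datetime(2025, 3, 3, 15, 0, tzinfo=timezone.utc)
latest = datetime(2025, 3, 10, 14, 55, tzinfo=timezone.utc)
just_before_drop = datetime(2025, 3, 10, 14, 32, 5, tzinfo=timezone.utc)

request = build_pitr_request(
    "prod-db", "prod-db-restored", just_before_drop, earliest, latest
)
```

Validating the timestamp locally keeps a typo in the restore time from turning into a failed API call in the middle of an incident; the cutover to the new endpoint remains a separate step.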

PITR vs snapshot restore

  • PITR: restore to any second within the retention window — rolls forward transaction logs from the most recent automated snapshot.
  • Snapshot restore: restore to the exact snapshot moment (manual snapshots and automated daily snapshots both qualify) — no log replay, restore lands at the snapshot timestamp.

PITR is preferred when the corruption time is known precisely and you want to lose as little data as possible. Snapshot restore is preferred when an older known-good state is needed (the corruption was introduced more than 35 days ago, or you want a specific manual snapshot taken before a release).

Promoting a read replica during disaster

When the primary RDS instance is in an unavailable Region or has suffered a regional event, the SOA procedure is to promote a read replica in another Region:

  1. Verify the read replica is healthy and has acceptable replica lag (CloudWatch ReplicaLag metric).
  2. Optionally pause writes on the primary if reachable, to bound RPO.
  3. From the RDS console (or promote-read-replica CLI), promote the replica.
  4. Promotion breaks replication and converts the replica into a standalone primary. The old primary, when reachable again, must be either rebuilt as a replica of the new primary or decommissioned.
  5. Update the application connection string (or DNS alias) to the promoted instance's endpoint.
  6. Re-establish read replicas off the new primary if the read-scaling tier is required.
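
A hedged sketch of how the promotion steps might be encoded as a checklist generator. The function name, the step wording, and the idea of comparing replica lag against an RPO expressed in seconds are assumptions for illustration; the lag value would come from the ReplicaLag metric.

```python
def promotion_runbook(replica_lag_seconds, rpo_seconds, primary_reachable):
    """Return (rpo_met, ordered steps) for promoting a cross-Region read replica."""
    # Data not yet replicated is lost on promotion, so lag bounds the effective RPO.
    rpo_met = replica_lag_seconds <= rpo_seconds
    steps = ["check ReplicaLag in CloudWatch"]
    if primary_reachable:
        # Only possible when the primary still answers; bounds further data loss.
        steps.append("pause writes on the primary")
    steps += [
        "promote the read replica",
        "repoint the application or DNS alias at the promoted endpoint",
        "re-create read replicas off the new primary if needed",
        "rebuild or decommission the old primary once it is reachable",
    ]
    return rpo_met, steps


rpo_met, steps = promotion_runbook(12, 300, primary_reachable=False)
for number, step in enumerate(steps, start=1):
    print(number, step)
```

The function deliberately returns the RPO verdict rather than refusing to proceed: during a regional outage the replica is usually promoted anyway, and the lag figure is recorded as the actual data loss.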

The single most common SOA-C02 RDS recovery question: "After running point-in-time restore on prod-db, what is the application endpoint?" The answer is a new endpoint on the new DB instance (prod-db-restored or whatever name you supplied). The old prod-db endpoint still points at the corrupted source instance until you delete it. Cutover via application config change, Route 53 CNAME update, or RDS Proxy is mandatory. AWS Backup restores of RDS recovery points behave the same way. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIT.html

S3 Versioning and MFA Delete

S3 versioning is the foundation of S3-side data protection. Once enabled, every PUT of an object creates a new version with a unique version ID; every DELETE writes a delete marker (also a version) that hides the object from default GET requests but keeps all prior versions recoverable.

Versioning states

A bucket has three versioning states:

  • Unversioned (default for new buckets): PUT overwrites the object in place; DELETE is permanent.
  • Versioning-enabled: PUT creates a new version; DELETE writes a delete marker; previous versions are recoverable.
  • Versioning-suspended: already-versioned objects keep their version history, but new PUT operations create version ID null (overwriting any existing null version), and DELETE writes a delete marker over the null version. Suspended is not the same as Disabled: once you enable versioning, you can never return to the original Unversioned state.

Restoring a previous version

Two ways to recover an object that was overwritten or "deleted":

  • Delete the delete marker: if a DELETE wrote a delete marker, removing the marker exposes the most recent prior version, restoring it as the current version.
  • Copy a specific prior version onto itself: aws s3api copy-object with --copy-source "<bucket>/<key>?versionId=<prior>" copies a specific version back as the new current version, useful when you need to restore an older version that has subsequent versions on top.
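
To make the choice between the two recovery paths concrete, here is a minimal decision helper. It assumes the version history for one key has already been listed newest-first as `(version_id, is_delete_marker)` tuples and that the current version is the one to be undone; the function name and the returned labels are hypothetical.

```python
def recovery_action(history):
    """Pick a recovery path for one key from its newest-first version history."""
    if not history:
        return ("nothing-to-restore", None)
    newest_id, newest_is_marker = history[0]
    if newest_is_marker:
        # Object was "deleted": removing the marker exposes the version below it.
        return ("remove-delete-marker", newest_id)
    # Object was overwritten: copy the most recent earlier real version on top.
    for version_id, is_marker in history[1:]:
        if not is_marker:
            return ("copy-prior-version", version_id)
    return ("nothing-to-restore", None)


deleted = [("dm1", True), ("v2", False), ("v1", False)]
overwritten = [("v3", False), ("v2", False), ("v1", False)]
print(recovery_action(deleted))      # ('remove-delete-marker', 'dm1')
print(recovery_action(overwritten))  # ('copy-prior-version', 'v2')
```

The first result is the version ID of the marker to delete; the second is the version ID to name in the copy source.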

MFA Delete

MFA Delete is an extra protection on a versioning-enabled bucket. When enabled:

  • Permanently deleting an object version requires an MFA code in the request.
  • Suspending or re-enabling versioning also requires an MFA code.

MFA Delete has unusually strict rules:

  • It can only be enabled or disabled by the AWS account root user, not by an IAM user — even an admin IAM user cannot toggle it.
  • It must be configured via the AWS CLI or API; the S3 console does not surface the toggle.
  • It requires a hardware MFA device or virtual MFA registered to the root user; the device's serial number plus current code go into the API call.

Candidates routinely lose points by selecting an answer like "the bucket administrator IAM user enables MFA Delete via the S3 console". Wrong on two counts: only the root user can enable MFA Delete, and the console does not expose the option, so it must be done via CLI or SDK. The exam-correct procedure is: log in as root (with MFA), run aws s3api put-bucket-versioning --bucket <name> --versioning-configuration Status=Enabled,MFADelete=Enabled --mfa "<serial> <code>". Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/MultiFactorAuthenticationDelete.html

S3 Lifecycle Rules and the 30-Day Transition Minimum

S3 lifecycle rules automate object transitions between storage classes and object expiration. Lifecycle is the operational lever that turns S3 into a durable, cost-tiered backup target.

Storage class transitions

The supported transitions follow a "flow downhill" pattern (you can move to a colder, cheaper class but generally not back):

  • S3 Standard → S3 Standard-IA, S3 Intelligent-Tiering, S3 One Zone-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, Glacier Deep Archive.
  • S3 Standard-IA → S3 Intelligent-Tiering, S3 One Zone-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, Glacier Deep Archive.
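  • S3 Intelligent-Tiering → S3 One Zone-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, Glacier Deep Archive (Intelligent-Tiering also moves objects between its own frequent-access and infrequent-access tiers internally).
  • Glacier classes → colder Glacier classes only (Instant Retrieval → Flexible Retrieval or Deep Archive; Flexible Retrieval → Deep Archive). There is no transition back to warmer classes via lifecycle; restore is the only way back, and it produces a new copy.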

The 30-day minimums

Two separate 30-day rules are heavily tested:

  • Minimum age before transition from S3 Standard to S3 Standard-IA or S3 One Zone-IA: 30 days. A lifecycle rule transitioning at day 10 fails validation.
  • Minimum stay in S3 Standard-IA / S3 One Zone-IA before transition to Glacier classes: 30 days (so total minimum from Standard → IA → Glacier is at least 60 days when chained).

There are also storage-class-specific minimum durations: once an object lands in S3 Standard-IA or S3 One Zone-IA, it must stay at least 30 days before it can be deleted or transitioned without an early-deletion charge. Glacier Flexible Retrieval has a 90-day minimum; Glacier Deep Archive has a 180-day minimum. Lifecycle rules respect these; ad-hoc deletes pay the early-deletion fee.
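
The two 30-day rules can be expressed as a small pre-flight validator. This is a sketch under stated assumptions: transitions are given as `(days, storage_class)` pairs in ascending day order starting from S3 Standard, the storage class strings follow the S3 API spelling, and the function name and messages are hypothetical. It does not model early-deletion charges.

```python
IA_CLASSES = {"STANDARD_IA", "ONEZONE_IA"}
GLACIER_CLASSES = {"GLACIER_IR", "GLACIER", "DEEP_ARCHIVE"}


def check_transitions(transitions):
    """Return a list of problems; an empty list means the chain passes both rules."""
    problems = []
    prev_days, prev_class = 0, "STANDARD"
    for days, storage_class in transitions:
        # Rule 1: objects must be at least 30 days old before moving to an IA class.
        if storage_class in IA_CLASSES and days < 30:
            problems.append(f"{storage_class} at day {days}: minimum age is 30 days")
        # Rule 2: at least 30 days in an IA class before moving on to Glacier.
        if (prev_class in IA_CLASSES and storage_class in GLACIER_CLASSES
                and days - prev_days < 30):
            problems.append(
                f"{storage_class} at day {days}: needs 30 days after {prev_class}"
            )
        prev_days, prev_class = days, storage_class
    return problems


print(check_transitions([(10, "STANDARD_IA")]))
print(check_transitions([(30, "STANDARD_IA"), (60, "GLACIER")]))  # []
```

Running a check like this before submitting a lifecycle configuration surfaces the day-10 mistake locally instead of as a rejected API call.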

Object size minimums for transitions

Objects smaller than 128 KB are not transitioned by lifecycle rules from S3 Standard or Standard-IA to a colder class — the per-object overhead would exceed the savings. The objects stay where they are. This is why S3 small-object workloads benefit more from Intelligent-Tiering (which has its own small-object handling) than from explicit lifecycle rules.

Versioning + lifecycle interaction

For versioning-enabled buckets, lifecycle rules can target current versions and non-current versions separately:

  • Transitions and Expiration apply to current versions.
  • NoncurrentVersionTransitions and NoncurrentVersionExpiration apply to versions that are no longer current (because a newer version was uploaded).
  • A common pattern: keep current versions in Standard, transition non-current versions to Glacier Flexible Retrieval after 30 days, expire non-current versions after 365 days. This bounds versioning costs while keeping the most recent state hot.
  • A separate setting ExpiredObjectDeleteMarker: true removes orphan delete markers (delete markers whose only underlying versions have been expired) so the bucket does not accumulate dead pointers.
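
The common pattern in the list above might look like the following rule, written in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` takes as `LifecycleConfiguration`. The rule ID and the empty prefix (meaning the whole bucket) are illustrative assumptions; `GLACIER` is the API name for Glacier Flexible Retrieval.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "bound-noncurrent-versions",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},           # empty prefix: apply to every key
            # Current versions stay in Standard: no "Transitions" entry.
            "NoncurrentVersionTransitions": [
                {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
            ],
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
            # Clean up delete markers whose underlying versions have all expired.
            "Expiration": {"ExpiredObjectDeleteMarker": True},
        }
    ]
}

rule = lifecycle_configuration["Rules"][0]
print(rule["NoncurrentVersionTransitions"][0]["NoncurrentDays"])  # 30
print(rule["NoncurrentVersionExpiration"]["NoncurrentDays"])      # 365
```

Keeping the current-version and non-current-version settings in one rule makes it easy to see at a glance that only old versions ever leave Standard.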

A SOA-C02 distractor: a question shows a lifecycle rule with Days: 10 transitioning to Standard-IA. The candidate, eyeballing the policy, picks "the rule will transition objects after 10 days". Wrong: S3 rejects the rule with InvalidArgument because the minimum days for transition to IA is 30. The 90-day Glacier Flexible Retrieval and 180-day Glacier Deep Archive minimums work differently: they are minimum storage durations, so the rule is accepted, but deleting or transitioning an object earlier triggers an early-deletion charge equal to the remaining storage cost. Memorize: 30 days to IA, 90 days minimum in Flexible Retrieval, 180 days minimum in Deep Archive. Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-transition-general-considerations.html

S3 Cross-Region Replication (CRR)

S3 Cross-Region Replication asynchronously copies objects from a source bucket to a destination bucket in a different AWS Region. CRR is the SOA-correct answer for "ensure S3 data survives a regional outage" and "store a copy in a different jurisdiction for compliance".

Prerequisites

CRR has four hard prerequisites — every one of them is exam-tested:

  1. Versioning must be enabled on both source and destination buckets. CRR replicates versions; without versioning the rule cannot exist.
  2. An IAM role must be granted to S3 with permissions to read objects from the source and write to the destination (s3:GetReplicationConfiguration, s3:ListBucket, s3:GetObjectVersionForReplication, s3:GetObjectVersionAcl, s3:GetObjectVersionTagging, plus s3:ReplicateObject, s3:ReplicateDelete, s3:ReplicateTags on the destination).
  3. A replication rule specifies source prefix/tag filters, destination bucket, destination storage class (optional override), and whether to replicate delete markers, replica modifications, and existing objects (the S3 Batch Replication feature backfills objects that existed before the rule was created).
  4. Compatible KMS configuration if either bucket uses SSE-KMS: the replication role must have kms:Decrypt on the source key and kms:Encrypt on the destination key, and the replication rule must list the destination KMS key.
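
The four prerequisite checks lend themselves to a pre-flight function. This is a sketch only: the `CrrSetup` record and its field names are hypothetical, and the values would have to be gathered beforehand from the bucket versioning, replication, and encryption settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrrSetup:
    source_versioning: str            # "Enabled", "Suspended", or "" if never enabled
    dest_versioning: str
    role_arn: Optional[str]           # IAM role S3 assumes for replication
    has_enabled_rule: bool
    uses_sse_kms: bool = False
    rule_names_dest_kms_key: bool = False


def failed_prerequisites(setup):
    """Return the prerequisites that are not met, in the order listed above."""
    failures = []
    if setup.source_versioning != "Enabled" or setup.dest_versioning != "Enabled":
        failures.append("versioning must be Enabled on both buckets")
    if not setup.role_arn:
        failures.append("no IAM replication role configured")
    if not setup.has_enabled_rule:
        failures.append("no enabled replication rule")
    if setup.uses_sse_kms and not setup.rule_names_dest_kms_key:
        failures.append("SSE-KMS in use but the rule names no destination KMS key")
    return failures


broken = CrrSetup("Enabled", "Suspended", None, True, uses_sse_kms=True)
for failure in failed_prerequisites(broken):
    print(failure)
```

For the "CRR is configured but nothing replicates" scenario, walking the checks in this fixed order gives a repeatable troubleshooting sequence.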

What gets replicated, what does not

Replicated by default: new objects (after rule creation), object metadata, ACLs, tags, object lock retention. Optional via rule flags: delete markers (off by default), existing objects (via Batch Replication), replica modifications (when the destination object is modified, replicate back to source — used in bidirectional replication).

Not replicated: objects encrypted with SSE-C (S3 has no key access), objects in source bucket created before the rule (unless Batch Replication runs), objects whose owner does not have the necessary ACL grants on the destination.

Replication Time Control (RTC)

Standard CRR is best-effort with no SLA. Replication Time Control (RTC) is an opt-in feature that adds:

  • A 15-minute SLA: 99.99 percent of objects replicate within 15 minutes of the source PUT.
  • Replication metrics in CloudWatch: BytesPendingReplication, OperationsPendingReplication, and ReplicationLatency, published per rule.
  • EventBridge events for replication failure conditions.

RTC costs more per GB replicated but is the only way to make a measurable RPO commitment on S3 CRR. SOA-C02 favors RTC for any "RPO of 15 minutes for S3 data" scenario.
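A replication rule with RTC switched on might look like the sketch below. The dict follows the ReplicationConfiguration shape accepted by the S3 PutBucketReplication API; the role ARN, bucket ARN, rule ID, and helper name are hypothetical placeholders.

```python
def replication_config(role_arn, dest_bucket_arn, rule_id="crr-rule",
                       prefix="", rtc=True, replicate_delete_markers=False,
                       replica_kms_key_arn=None):
    """Build a ReplicationConfiguration dict for a single CRR rule."""
    destination = {"Bucket": dest_bucket_arn}
    if rtc:
        # RTC and replication metrics are both set on the destination block.
        destination["ReplicationTime"] = {"Status": "Enabled", "Time": {"Minutes": 15}}
        destination["Metrics"] = {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}
    rule = {
        "ID": rule_id,
        "Priority": 1,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix matches the whole bucket
        "DeleteMarkerReplication": {
            "Status": "Enabled" if replicate_delete_markers else "Disabled"
        },
        "Destination": destination,
    }
    if replica_kms_key_arn:
        # SSE-KMS objects need an explicit opt-in plus the destination key.
        rule["SourceSelectionCriteria"] = {"SseKmsEncryptedObjects": {"Status": "Enabled"}}
        destination["EncryptionConfiguration"] = {"ReplicaKmsKeyID": replica_kms_key_arn}
    return {"Role": role_arn, "Rules": [rule]}
```

With boto3 installed and credentials configured, the result could be passed as `ReplicationConfiguration` to `boto3.client("s3").put_bucket_replication(Bucket=..., ReplicationConfiguration=cfg)`. Delete-marker replication defaults to off here, matching the service default described above.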

Same-Region Replication (SRR)

SRR is the same machinery as CRR but with both buckets in the same Region. SRR use cases: aggregate logs from multiple buckets into one analysis bucket in the same Region, replicate prod data to a sandbox bucket in the same Region for testing, separate production and audit copies under different access policies.

CRR is not a substitute for backup

CRR replicates changes including delete markers (if enabled) and overwrites. If a malicious or accidental delete happens in the source bucket and delete-marker replication is on, the destination receives the same delete marker and the data is hidden there too. To survive deletions and corruption you also need versioning (which CRR requires anyway), MFA Delete or Object Lock on the destination, and ideally a separate AWS Backup plan or AWS Backup for S3 backup of the source.

A SOA-C02 trap: candidates assume CRR is a backup. It is replication, not backup. If you enable delete-marker replication, an attacker with s3:DeleteObject on the source can effectively wipe the destination too (by writing delete markers that replicate). The hardened pattern is: CRR with delete-marker replication ON for operational consistency, plus Object Lock in compliance mode on the destination bucket so versions cannot be permanently destroyed, plus an AWS Backup plan as a separate recoverability layer. Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html

Data Lifecycle Manager (DLM) for EBS Snapshots and AMIs

Amazon Data Lifecycle Manager is the native AWS service for scheduled creation, retention, and copy of EBS snapshots and AMIs based on tags. DLM is the SOA-preferred answer for "automate EBS snapshots without writing custom Lambda" or "automate AMI builds for golden image rotation".

DLM policy types

  • EBS snapshot policy — schedules snapshots of EBS volumes selected by tag, with retention, cross-region copy, and cross-account share rules.
  • EBS-backed AMI policy — schedules AMI creation from instances selected by tag (with optional reboot for application consistency), with retention and cross-region copy.
  • Cross-account copy event policy: automatically copies snapshots that another account has shared with you, storing them as encrypted copies in your own account.

Schedule and retention

A DLM schedule specifies:

  • Frequency: a cron expression such as cron(0 5 * * ? *) for daily at 05:00 UTC, or a fixed interval such as every 12 hours.
  • Retention — count-based (keep last 7) or age-based (keep for 30 days). The exam favors count-based for snapshot rotation.
  • Tags to apply to created snapshots so they are themselves selectable by other automation.
  • Cross-region copy — destination Region, encryption KMS key in the destination, copy retention.
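The four schedule elements above map onto a DLM policy roughly as follows. This is a sketch of the PolicyDetails structure used by the DLM CreateLifecyclePolicy API; the tag values, schedule name, and helper name are illustrative assumptions.

```python
def dlm_policy_details(tag_key, tag_value, cron, retain_count,
                       copy_region, copy_kms_arn, copy_retain_days):
    """Build PolicyDetails for a tag-driven EBS snapshot policy."""
    schedule = {
        "Name": "daily-snapshots",  # hypothetical schedule name
        "CreateRule": {"CronExpression": cron},
        "RetainRule": {"Count": retain_count},  # count-based retention
        "TagsToAdd": [{"Key": "CreatedBy", "Value": "dlm"}],
        "CopyTags": True,
        "CrossRegionCopyRules": [
            {
                "Target": copy_region,
                "Encrypted": True,
                "CmkArn": copy_kms_arn,  # key must live in copy_region
                "CopyTags": True,
                "RetainRule": {"Interval": copy_retain_days, "IntervalUnit": "DAYS"},
            }
        ],
    }
    return {
        "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [schedule],
    }
```

The dict would be supplied as `PolicyDetails` to `boto3.client("dlm").create_lifecycle_policy(...)` together with an execution role ARN, a description, and a state. Note that retention on the source side is count-based while the cross-region copy uses an age-based rule.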

DLM vs AWS Backup

When does an SOA candidate pick which?

  • DLM is best when the scope is only EBS snapshots and/or AMIs, the team is already tag-driven, and centralized reporting across resource types is not required. DLM is free (you pay only for the snapshots themselves).
  • AWS Backup is best when the scope is multiple resource types (EC2 + RDS + EFS + DynamoDB), centralized policies and audit reports are required, Vault Lock or compliance mode is needed, or backup-job notifications must be unified.

For EBS-only fleets that already have golden AMI workflows and tag-based ownership, DLM is leaner. For diverse multi-service estates, AWS Backup wins on operational simplicity even though it costs slightly more per recovery point.

Cross-region snapshot copy and the encryption gotcha

DLM (and the manual copy-snapshot API) can copy a snapshot to another Region. The non-obvious operational rule:

  • The source snapshot may be encrypted with a customer-managed CMK in the source Region.
  • The destination Region uses a different KMS key (CMKs are regional). You must specify the destination Region's KMS key in the copy parameters; if you omit it, AWS uses the destination Region's aws/ebs AWS-managed key — which works but loses customer-managed key controls.
  • Cross-Region snapshot copies are NOT automatically encrypted with the same key as the source. The encryption key changes by definition.
  • For cross-account share + cross-region copy, the destination account must have kms:DescribeKey and kms:CreateGrant on the destination key, and the source CMK must grant the destination account kms:CreateGrant for replication.

Candidates often assume a copied snapshot retains the source's customer-managed CMK across Regions. It cannot — KMS keys are regional. The destination snapshot is encrypted with the destination Region's KMS key (you choose which). Worse: if you forget to specify a customer-managed key in the destination, the copy uses the AWS-managed aws/ebs key in the destination Region, breaking the customer-managed-key audit trail. The exam-correct procedure is to pre-create a customer-managed CMK in the destination Region and specify it in the DLM cross-region copy rule or copy-snapshot --kms-key-id parameter. Reference: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-copy-snapshot.html
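One way to make the "always specify the destination key" rule hard to forget is to refuse to build the copy request without it. The helper below is a hypothetical guard, not an AWS API; it returns keyword arguments in the shape the EC2 CopySnapshot call expects, and the snapshot ID, account number, and key ARN in the usage note are placeholders.

```python
def cross_region_copy_kwargs(source_region, snapshot_id, dest_region, dest_kms_key_arn):
    """Build keyword arguments for an EC2 copy_snapshot call issued in dest_region."""
    if not dest_kms_key_arn:
        raise ValueError("no destination key given; the copy would fall back to aws/ebs")
    # ARN layout: arn:partition:kms:region:account:key/key-id
    key_region = dest_kms_key_arn.split(":")[3]
    if key_region != dest_region:
        raise ValueError(f"key lives in {key_region}, but the copy targets {dest_region}")
    return {
        "SourceRegion": source_region,
        "SourceSnapshotId": snapshot_id,
        "Encrypted": True,
        "KmsKeyId": dest_kms_key_arn,
        "Description": f"DR copy of {snapshot_id} from {source_region}",
    }
```

The call itself would be made with a client bound to the destination Region, for example `boto3.client("ec2", region_name="us-west-2").copy_snapshot(**kwargs)`.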

S3 Glacier Storage Classes and Retrieval Tiers

Glacier is not one storage class but three, and each has its own retrieval profile. SOA-C02 expects you to know the retrieval times by heart.

Glacier classes

  • S3 Glacier Instant Retrieval — millisecond retrieval, comparable to S3 Standard for read latency; designed for rarely accessed data (once a quarter) where retrieval latency must remain low. Cheaper storage than Standard-IA, more expensive retrieval.
  • S3 Glacier Flexible Retrieval (formerly "S3 Glacier") — three retrieval tiers (Expedited, Standard, Bulk), 90-day minimum stay; ideal for backup that is rarely retrieved.
  • S3 Glacier Deep Archive — two retrieval tiers (Standard, Bulk), 180-day minimum stay; lowest cost AWS storage; ideal for compliance archives held for years.

Retrieval times — memorize these

  • Glacier Instant Retrieval: milliseconds (no async restore — direct GET like Standard).
  • Glacier Flexible Retrieval — Expedited: 1–5 minutes (small objects only — typically up to 250 MB).
  • Glacier Flexible Retrieval — Standard: 3–5 hours.
  • Glacier Flexible Retrieval — Bulk: 5–12 hours (cheapest retrieval per GB).
  • Glacier Deep Archive — Standard: within 12 hours.
  • Glacier Deep Archive — Bulk: within 48 hours (cheapest of all retrievals).
  • Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/restoring-objects-retrieval-options.html
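The timings above can be turned into a small lookup that picks the cheapest tier able to meet a deadline. This is a study aid rather than an AWS API: it uses the upper bound of each range, treats the 250 MB Expedited limit as a hard cutoff, and assumes the cost ordering Bulk, then Standard, then Expedited.

```python
# Upper-bound retrieval times in hours, taken from the list above.
RETRIEVAL_HOURS = {
    "GLACIER": {"Bulk": 12, "Standard": 5, "Expedited": 5 / 60},
    "DEEP_ARCHIVE": {"Bulk": 48, "Standard": 12},
}
CHEAPEST_FIRST = ["Bulk", "Standard", "Expedited"]
EXPEDITED_MAX_MB = 250  # assumption: treat the typical limit as a hard cutoff


def pick_tier(storage_class, deadline_hours, size_mb):
    """Return the cheapest tier that finishes within deadline_hours, or None."""
    tiers = RETRIEVAL_HOURS[storage_class]
    for tier in CHEAPEST_FIRST:
        if tier not in tiers:
            continue  # Deep Archive has no Expedited tier
        if tier == "Expedited" and size_mb > EXPEDITED_MAX_MB:
            continue
        if tiers[tier] <= deadline_hours:
            return tier
    return None
```

For example, a 100 MB object in Flexible Retrieval with a 6-hour deadline resolves to Standard, while the same deadline against Deep Archive returns None because even its Standard tier can take up to 12 hours.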

How restore works

A Glacier (Flexible Retrieval or Deep Archive) restore is asynchronous: you call restore-object specifying the restore tier and the restored-copy availability period (1–N days). S3 makes a temporary copy in Standard/Standard-IA accessible for the requested period; the underlying Glacier object stays in Glacier. You pay for the temporary copy plus the retrieval fee. After the availability period expires, the temporary copy is removed and the underlying Glacier object remains.
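The restore call described above takes a small request body. The sketch below builds it in the RestoreRequest shape of the S3 RestoreObject API and rejects the one combination that does not exist (Expedited on Deep Archive); the helper name is an assumption.

```python
VALID_TIERS = {
    "GLACIER": {"Expedited", "Standard", "Bulk"},
    "DEEP_ARCHIVE": {"Standard", "Bulk"},
}


def restore_request(storage_class, tier, available_days):
    """Build the RestoreRequest body for an asynchronous Glacier restore."""
    if tier not in VALID_TIERS[storage_class]:
        raise ValueError(f"{tier} is not a retrieval tier for {storage_class}")
    if available_days < 1:
        raise ValueError("the temporary copy must be kept for at least 1 day")
    return {
        "Days": available_days,  # lifetime of the temporary copy
        "GlacierJobParameters": {"Tier": tier},
    }
```

The body would be passed as `RestoreRequest` to `boto3.client("s3").restore_object(Bucket=..., Key=..., RestoreRequest=req)`; the object becomes readable only after the restore job completes.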

For Glacier Instant Retrieval there is no restore step — you GET directly. The class is designed for low-latency reads of rarely accessed data; the cost model is cheaper storage but more expensive per-GB retrieval than Standard.

Disaster Recovery Strategies: Backup/Restore, Pilot Light, Warm Standby, Multi-Site

The AWS Disaster Recovery whitepaper defines four strategies. SOA-C02 expects you to recognize the operational steps for each, not invent the strategy from scratch.

1. Backup and restore

  • RTO: hours to days.
  • RPO: hours (depending on backup frequency).
  • Operational footprint in DR Region: only the backups (S3 cross-region, AWS Backup cross-region copy).
  • Failover steps: at disaster time, provision new infrastructure in the DR Region (CloudFormation), restore data from cross-region backups, redirect traffic.
  • Best for: dev/test, low-priority workloads, where multi-hour downtime is acceptable.

2. Pilot light

  • RTO: tens of minutes to a few hours.
  • RPO: minutes (data is continuously replicated).
  • Operational footprint in DR Region: minimal compute (often zero), but data tier is live: RDS cross-region read replica running, DynamoDB Global Tables active, S3 CRR replicating, AMIs pre-baked.
  • Failover steps: scale up the dormant compute (Auto Scaling group desired-capacity from 0 to N), promote the RDS read replica to primary, update Route 53 to point at the DR Region.
  • Best for: workloads with a stringent RPO that can tolerate an RTO measured in tens of minutes; a common sweet spot.

3. Warm standby

  • RTO: minutes.
  • RPO: seconds to minutes.
  • Operational footprint in DR Region: a scaled-down but always-running copy of production — small ASG capacity, ELB live, database replicating.
  • Failover steps: scale the ASG to full production capacity, update Route 53 (or use a Route 53 failover routing policy that flips automatically), promote DB replica.
  • Best for: revenue-critical workloads where minutes of downtime cost real money.

4. Multi-site active/active

  • RTO: near zero.
  • RPO: near zero.
  • Operational footprint in DR Region: full production capacity, both Regions handling traffic concurrently behind Route 53 latency or weighted routing.
  • Failover steps: there is no failover — Route 53 health checks remove the failed Region from rotation automatically.
  • Best for: tier-0 mission-critical workloads (payments, life-safety); highest cost.

On SOA-C02, when a question describes "the team chose pilot light, the primary Region just went unavailable", the right answer enumerates the procedure: scale ASG up, promote RDS replica, switch Route 53. When a question says "the workload requires near-zero RTO", that is SAA-flavored architecture selection — but on SOA-C02 the same prompt is usually paired with "and we are already in multi-site active/active; configure the Route 53 failover health check correctly". Read the verbs: implement, configure, execute are SOA cues; choose, select, recommend are SAA cues. Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html
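The pilot-light procedure can be written down as an ordered plan before anyone touches the console. The sketch below only builds the plan; it does not call AWS. The operation names correspond to boto3 client methods, while every identifier passed in (ASG name, replica ID, hosted zone, DNS names) is a hypothetical placeholder, and it assumes the ASG's MaxSize already permits the requested capacity.

```python
def pilot_light_failover_plan(asg_name, capacity, replica_id,
                              hosted_zone_id, record_name, dr_dns_name):
    """Return the failover steps in execution order as plain dicts."""
    return [
        {   # 1. wake up the dormant compute tier
            "service": "autoscaling",
            "operation": "set_desired_capacity",
            "params": {"AutoScalingGroupName": asg_name, "DesiredCapacity": capacity},
        },
        {   # 2. promote the cross-region replica to a standalone primary
            "service": "rds",
            "operation": "promote_read_replica",
            "params": {"DBInstanceIdentifier": replica_id},
        },
        {   # 3. point DNS at the DR Region
            "service": "route53",
            "operation": "change_resource_record_sets",
            "params": {
                "HostedZoneId": hosted_zone_id,
                "ChangeBatch": {
                    "Changes": [{
                        "Action": "UPSERT",
                        "ResourceRecordSet": {
                            "Name": record_name,
                            "Type": "CNAME",
                            "TTL": 60,
                            "ResourceRecords": [{"Value": dr_dns_name}],
                        },
                    }]
                },
            },
        },
    ]
```

An executor could walk the list with `getattr(boto3.client(step["service"]), step["operation"])(**step["params"])`, which keeps the runbook reviewable as data and makes a dry run trivial.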

Scenario Pattern: Regulator Demands 7-Year Tamper-Proof Retention

A canonical SOA-C02 scenario. The compliance team says: "Our regulator audits us under SEC 17a-4. Backups must be retained for 7 years and proven tamper-proof — even our administrators must not be able to delete them." The runbook:

  1. AWS Backup vault in the production Region with a customer-managed KMS CMK (so key access is logged in CloudTrail). Configure a backup plan covering the relevant resources (RDS, EBS, DynamoDB, S3) with daily backups and 7-year (2557-day) retention.
  2. AWS Backup Vault Lock in compliance mode with minimum retention of 7 years. Use the 72-hour grace window to verify that the configuration is correct, because once the window closes the lock is irreversible.
  3. Backup plan copy action to a vault in a second Region with the same Vault Lock compliance configuration — protects against regional loss of the primary vault.
  4. For S3 data, enable Object Lock in compliance mode on the source bucket with a 7-year default retention, plus versioning plus CRR with delete-marker replication off so deletions in source do not cascade to destination.
  5. CloudTrail data events on the vaults' KMS keys and the buckets to log every access; deliver to a separate log archive account with its own Object Lock.

This combination is what regulators expect: WORM at multiple layers, separate log evidence, multi-Region durability.
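Step 2 of the runbook comes down to a handful of parameters. The sketch below builds them in the shape of the AWS Backup PutBackupVaultLockConfiguration API, where supplying ChangeableForDays starts the grace window and omitting it leaves the lock in governance mode. The vault name in the test and the helper name are hypothetical.

```python
SEVEN_YEARS_DAYS = 2557  # the 7-year retention figure used in this runbook
MIN_GRACE_DAYS = 3       # the 72-hour grace window


def vault_lock_kwargs(vault_name, min_retention_days=SEVEN_YEARS_DAYS,
                      grace_days=MIN_GRACE_DAYS, compliance=True):
    """Build keyword arguments for put_backup_vault_lock_configuration."""
    kwargs = {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_retention_days,
    }
    if compliance:
        if grace_days < MIN_GRACE_DAYS:
            raise ValueError("the grace window cannot be shorter than 3 days")
        # Supplying ChangeableForDays is what makes the lock immutable
        # once the grace window has elapsed.
        kwargs["ChangeableForDays"] = grace_days
    return kwargs
```

The result would be applied with `boto3.client("backup").put_backup_vault_lock_configuration(**kwargs)`. Because the lock cannot be undone after the grace window, it is worth reviewing the generated values before the call rather than after.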

Scenario Pattern: Cross-Region DR Test for an RDS-Backed Application

Another canonical scenario. The team scheduled a cross-region DR test for a pilot-light architecture. The application uses RDS in us-east-1 with a cross-region read replica in us-west-2 and S3 CRR for static assets. The test runbook:

  1. Validate replication health: CloudWatch ReplicaLag on the read replica is below 5 seconds; CloudWatch BytesPendingReplication for S3 CRR is at zero.
  2. Validate Route 53 health checks are configured with failover routing — primary record points at the us-east-1 ALB, secondary at the us-west-2 ALB; health check targets the us-east-1 health endpoint.
  3. Trigger failover by simulating primary failure: either disable the us-east-1 health check endpoint or change its threshold so Route 53 marks it unhealthy.
  4. Promote the read replica in us-west-2 to a standalone primary.
  5. Scale the dormant ASG in us-west-2 from desired-capacity 0 to production capacity.
  6. Verify Route 53 has failed over (dig the application hostname; observe TTL-bounded transition).
  7. Validate application functionality end-to-end against the secondary Region.
  8. Failback: when the test ends, create a new us-east-1 read replica of the new us-west-2 primary, then promote that replica once it has caught up and repoint Route 53; or, simpler, restore from a snapshot of the new primary into a new us-east-1 instance and switch back during a maintenance window.

The test exposes the operational gaps every team has on its first DR drill: stale ASG launch templates, expired ACM certificates in the secondary Region, security groups that reference IDs from the primary, KMS keys that were never replicated. SOA-C02 favors candidates who recognize these gaps.

Common Traps Recap — Backup and Disaster Recovery

Every SOA-C02 attempt will see two or three of these distractors.

Trap 1: AWS Backup restore overwrites the original

It does not. Restore creates a new resource. Cutover is a separate step.

Trap 2: Vault Lock compliance mode can be removed later

After the 72-hour grace window, no IAM principal — including root — can remove or weaken the lock.

Trap 3: S3 lifecycle transition can fire after 10 days

Transition to Standard-IA or One Zone-IA requires a minimum of 30 days from creation. Glacier minimum stays are 90 days (Flexible Retrieval) and 180 days (Deep Archive).

Trap 4: MFA Delete can be enabled by an IAM admin via the console

It cannot. MFA Delete is root-only and CLI/SDK-only.

Trap 5: CRR is a backup

It is replication. Delete-marker replication, if enabled, propagates deletions. Versioning + Object Lock + AWS Backup are the additional layers.

Trap 6: CRR replicates existing objects automatically

It does not. Only new objects after the rule is created. Existing objects require S3 Batch Replication to backfill.

Trap 7: Cross-Region snapshot copy retains the source KMS key

KMS keys are regional. The destination copy uses a destination-Region key (which you must specify, otherwise the AWS-managed default is used).

Trap 8: PITR can restore to the same DB instance

It cannot. PITR always creates a new DB instance with a new endpoint. Cutover is mandatory.

Trap 9: Promoting a read replica leaves the primary in place

Promotion breaks replication and creates a standalone primary. The original primary is no longer a replication source for the promoted instance and must be reconfigured or decommissioned.

Trap 10: Glacier Expedited retrieval works for any object size

Expedited typically supports objects up to about 250 MB; larger objects must use Standard or Bulk. Plan retrieval tier by object size.

Trap 11: Disabling versioning recovers an unversioned bucket

Versioning, once enabled, can only be Suspended — never Disabled. Existing versions persist until lifecycle expires them.

Trap 12: AWS Backup is free

You pay for storage (warm and cold), restore (per GB), and cross-region copy. Free-tier amounts are minimal; the operational lesson is to lifecycle to cold storage aggressively.

SOA-C02 vs SAA-C03: The Operational Lens

SAA-C03 and SOA-C02 both test backup and DR, but the lenses differ.

| Question style | SAA-C03 lens | SOA-C02 lens |
| --- | --- | --- |
| Choose a DR strategy | "RPO 5 min, RTO 1 hour: which strategy?" | "We chose pilot light: list the failover steps" |
| RDS recovery | "PITR meets the RPO" | "Run PITR; what is the new endpoint and how do you cut over?" |
| S3 cross-region | "CRR meets the durability requirement" | "CRR is configured but no objects are replicating: debug" |
| Vault Lock | "Vault Lock compliance mode for SEC 17a-4" | "Configure compliance mode; navigate the 72-hour grace window" |
| Lifecycle | "Lifecycle to Glacier reduces cost" | "Lifecycle rule rejected: why? (30-day minimum)" |
| EBS snapshots | "DLM automates snapshots" | "DLM cross-region copy lost the customer-managed key: fix" |
| MFA Delete | "MFA Delete protects against accidental deletion" | "Enable MFA Delete: which user, which interface?" |

The SAA candidate selects the strategy; the SOA candidate executes the procedure, troubleshoots when it misbehaves, and operates the recovery during incidents.

Exam Signal: How to Recognize a Domain 2.3 Question

Domain 2.3 questions on SOA-C02 follow predictable shapes. Recognize them and your time on each question drops dramatically.

  • "The restore creates a new resource" — every AWS Backup and RDS PITR question. Look for "what is the endpoint after restore" or "what is the next step".
  • "Lifecycle rule fails or fires unexpectedly" — almost always the 30-day Standard-IA minimum, the 90-day Flexible Retrieval minimum, or the 180-day Deep Archive minimum.
  • "Tamper-proof retention for compliance" — Vault Lock compliance mode plus the 72-hour grace window, often combined with S3 Object Lock compliance mode.
  • "CRR not replicating" — versioning off on one side, IAM role missing, KMS key access denied, or rule was created before existing objects (Batch Replication needed).
  • "Cross-region snapshot copy unreadable" — destination Region's KMS key not specified or destination account lacks key grants.
  • "Disaster recovery procedure" — promote RDS read replica, scale up DR-Region ASG, update Route 53, validate.
  • "Schedule snapshots automatically" — DLM for EBS-only, AWS Backup for multi-resource estates.
  • "Protect against accidental deletion" — versioning + MFA Delete (root user, CLI only).
  • "Ransomware-proof backup" — AWS Backup Vault Lock compliance mode, plus a separate account, plus S3 Object Lock for object-level data.

Domain 2 is 16 percent of SOA-C02 and TS 2.3 takes roughly a third of that domain — expect 6 to 10 questions on backup, restore, and DR procedures specifically. Mastering the 30/90/180-day lifecycle minimums, the 72-hour Vault Lock grace window, the four DR strategies, and the PITR cutover procedure is mandatory. Reference: https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html

Decision Matrix — Backup Construct for Each SysOps Goal

Use this lookup during the exam.

| Operational goal | Primary construct | Notes |
| --- | --- | --- |
| Centralized backup across EC2/EBS/RDS/EFS/DynamoDB | AWS Backup plan + vault | Tag-based selection scales as resources grow. |
| EBS-only snapshot automation | Data Lifecycle Manager (DLM) | Cheaper than AWS Backup if scope is narrow. |
| AMI rotation for golden image | DLM AMI policy or EC2 Image Builder | Image Builder for build pipeline; DLM for retention rotation. |
| RDS recovery to a recent moment | RDS Point-in-Time Restore (PITR) | Creates new instance; cutover required. |
| RDS regional failover | Promote cross-region read replica | Replica must exist; promotion is one-way. |
| Tamper-proof backups for SEC 17a-4 | AWS Backup Vault Lock compliance mode | 72-hour grace window, irreversible afterward. |
| Protect against accidental S3 delete | Versioning + MFA Delete | MFA Delete is root-only, CLI only. |
| Tamper-proof S3 objects | S3 Object Lock compliance mode | Per-version retention; complements Vault Lock. |
| Cost-tier S3 backups | Lifecycle to Standard-IA → Glacier Flexible / Deep Archive | 30-day min to IA, 90-day min in Flexible, 180-day min in Deep Archive. |
| Cross-region S3 durability | S3 CRR | Add RTC for 15-minute SLA. |
| 15-minute RPO on S3 data | CRR + Replication Time Control (RTC) | RTC adds metrics and SLA. |
| Cross-region EBS durability | DLM cross-region copy or AWS Backup copy action | Pre-create destination KMS key. |
| Pilot-light DR for compute | Pre-baked AMI + dormant ASG (desired=0) + cross-region replica | Failover scales ASG and promotes replica. |
| Warm-standby DR | Scaled-down ASG + ELB + replica running | Failover scales up; Route 53 flip. |
| Multi-site active/active | Active in both Regions + Route 53 failover health checks | Highest cost, near-zero RTO/RPO. |
| Glacier rapid retrieval (small object) | Expedited tier: 1–5 minutes | Up to ~250 MB only. |
| Glacier cheapest retrieval | Bulk tier: 5–12 hours (Flexible) or 48 hours (Deep Archive) | Lowest per-GB cost. |
| Aggregate logs to one bucket | S3 Same-Region Replication (SRR) | Same machinery as CRR, same Region. |

FAQ — Backup and Disaster Recovery

Q1: My RDS PITR completed but my application cannot connect — what did I miss?

The new DB instance has a new endpoint hostname. PITR never overwrites the source; it creates mydb-restored (or whatever name you specified) with a new DNS endpoint like mydb-restored.abc.us-east-1.rds.amazonaws.com (the cluster- prefix appears only on Aurora cluster endpoints). Your application is still pointing at the old endpoint. Operational fix: update the application's connection string (in Parameter Store, Secrets Manager, or environment), or, if you used Route 53 or RDS Proxy, repoint the alias at the new instance. Then validate that the new instance has the right security group, subnet group, and parameter group; PITR uses the source's settings by default but lets you override them at restore time. Finally, decommission the old instance only after confirming the new one is serving traffic correctly.
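The restore parameters mentioned above can be assembled ahead of time. The sketch below builds keyword arguments in the shape of the RDS RestoreDBInstanceToPointInTime API and enforces the two rules that matter for the exam: the target must be a new instance, and you restore either to a timestamp or to the latest restorable time, not both. The helper name and the identifiers in the test are hypothetical.

```python
from datetime import datetime, timezone


def pitr_kwargs(source_id, target_id, restore_time=None,
                security_group_ids=None, subnet_group=None):
    """Build keyword arguments for restore_db_instance_to_point_in_time."""
    if target_id == source_id:
        raise ValueError("PITR always creates a new instance; pick a new identifier")
    kwargs = {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
    }
    if restore_time is None:
        kwargs["UseLatestRestorableTime"] = True
    else:
        # Normalize to UTC so the requested moment is unambiguous.
        kwargs["RestoreTime"] = restore_time.astimezone(timezone.utc)
    if security_group_ids:
        kwargs["VpcSecurityGroupIds"] = list(security_group_ids)
    if subnet_group:
        kwargs["DBSubnetGroupName"] = subnet_group
    return kwargs
```

The arguments would be passed to `boto3.client("rds").restore_db_instance_to_point_in_time(**kwargs)`. The new endpoint is only known once the restored instance reports as available, which is why the cutover is a separate step.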

Q2: Why was my S3 lifecycle rule rejected with InvalidArgument?

The most common cause is violating a minimum-age rule. S3 lifecycle requires:

  • At least 30 days before transitioning from S3 Standard to S3 Standard-IA or S3 One Zone-IA.
  • At least 30 days stay in Standard-IA / One Zone-IA before transition to Glacier classes.
  • Glacier Flexible Retrieval has a 90-day minimum stay; Glacier Deep Archive has a 180-day minimum stay.

If your rule says Days: 10 for an IA transition, S3 rejects it. Fix the rule to use 30 or higher. A second cause is conflicting actions in the same rule (transition and expiration on the same day). A third cause is filter syntax — the rule prefix must not collide with another rule's prefix in a way that makes the schedule ambiguous.
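The first two minimum-age rules can be checked locally before a rule is submitted. The validator below is a simplified sketch that mirrors only the rules listed in this answer; it is not the full set of checks S3 performs, and the function name is an assumption. Transitions use the same Days and StorageClass keys as an S3 lifecycle rule.

```python
IA_CLASSES = {"STANDARD_IA", "ONEZONE_IA"}
GLACIER_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def lifecycle_problems(transitions):
    """Return human-readable problems for a list of lifecycle transitions."""
    problems = []
    ia_days = [t["Days"] for t in transitions if t["StorageClass"] in IA_CLASSES]
    for t in transitions:
        days, cls = t["Days"], t["StorageClass"]
        if cls in IA_CLASSES and days < 30:
            problems.append(f"{cls} at day {days}: needs at least 30 days from creation")
        if cls in GLACIER_CLASSES:
            # Find the latest IA transition that happens before this one.
            prior = [d for d in ia_days if d <= days]
            if prior and days - max(prior) < 30:
                problems.append(
                    f"{cls} at day {days}: needs at least 30 days after the IA transition"
                )
    return problems
```

A rule with Days: 10 for STANDARD_IA yields one problem, while 30 days to STANDARD_IA followed by 60 days to GLACIER passes cleanly.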

Q3: Vault Lock compliance mode versus governance mode — which do I pick for ransomware defense?

Compliance mode, every time. Ransomware attackers who compromise admin credentials would, with governance mode, be able to call backup:DeleteRecoveryPoint and wipe your backups before encrypting production. Compliance mode after the 72-hour grace window forbids deletion of recovery points before scheduled expiry — even from the AWS root user. The trade-off: if you misconfigured retention, you cannot fix it; you must live with the configuration until recovery points expire naturally. The 72-hour grace window exists to let you test before the lock becomes irreversible. For SOA-C02, "ransomware" or "insider threat" or "regulator demands tamper-proof" all map to compliance mode.

Q4: How does S3 CRR handle objects that already existed before I configured the rule?

By default, CRR replicates only new objects created after the rule was added — existing objects are not backfilled. To replicate existing objects, run S3 Batch Replication, which creates a one-time job that copies pre-existing objects according to the rule's filter. Batch Replication is billed per object and per GB processed. The operational sequence: create the replication rule, then start a Batch Replication job from the CRR rule's "Replicate existing objects" workflow in the console (or s3control create-job with Operation: S3ReplicateObject). Wait for the batch job to complete before considering the destination bucket fully synchronized.

Q5: After cross-region copying an EBS snapshot, the destination volume restore fails with KMS access denied — why?

KMS keys are regional. The source snapshot was encrypted with a customer-managed CMK in the source Region; the copy operation re-encrypted it with a key in the destination Region. If you did not specify a destination KMS key in the copy parameters, AWS used the destination Region's aws/ebs AWS-managed key — and the IAM principal trying to restore the volume might lack kms:Decrypt on that managed key (unlikely but possible if your account has restrictive SCPs). More commonly, you specified a customer-managed CMK in the destination but the restoring IAM principal does not have a key policy grant on that CMK. Fix: pre-create a customer-managed CMK in the destination Region, grant the restore principal kms:Decrypt, and reference that CMK in the DLM cross-region copy rule (or copy-snapshot --kms-key-id).

Q6: What is the minimum retention I can configure for an AWS Backup recovery point that uses cold storage?

AWS Backup enforces the minimum on the cold side: once a recovery point moves to cold storage it must remain there for at least 90 days before it can be deleted, a minimum stay similar to the early-deletion rules on S3 Glacier. The retention setting must therefore be at least 90 days longer than the transition-to-cold setting; a plan that moves recovery points to cold storage after 30 days needs a total retention of at least 120 days. Plans that try to specify shorter cold retention are rejected. For shorter total retention, keep recovery points in warm storage only (no cold transition); warm storage supports any retention from 1 day onward.

Q7: Should I enable MFA Delete on every production S3 bucket?

Probably not on every bucket, but yes on buckets storing irreplaceable data (regulatory archives, master copies of customer data, the only backup of a critical workload). The trade-off:

  • Pro: even compromised admin IAM credentials cannot permanently delete object versions or change versioning state without the root user's MFA token.
  • Con: every legitimate version cleanup (lifecycle non-current expiration is fine; manual delete-version operations are not) requires the root user with MFA — operationally awkward.

A pragmatic pattern: MFA Delete on the regulatory archive bucket and on the last-line-of-defense backup bucket, but not on every working bucket. For non-archive buckets, prefer S3 Object Lock in governance or compliance mode as a less-disruptive alternative, since Object Lock applies to specific object versions and can coexist with normal IAM-driven deletes for non-locked objects.

Q8: Can I use AWS Backup to back up an S3 bucket?

Yes — AWS Backup added S3 support, and it backs up both the objects and the bucket-level configuration (versioning state, encryption, ACLs, tags, lifecycle, public access block, replication). Backups can be cross-Region and cross-account, and they participate in Vault Lock. The operational use case: you want a single audit surface across RDS, EBS, EFS, DynamoDB, and S3 — instead of relying on S3 versioning + CRR alone, AWS Backup gives you a centralized recovery-point catalog with the same retention and lock policies as the rest of the estate. Restore creates a new bucket or copies into an existing one. AWS Backup for S3 is billed per GB stored in the vault plus restore fees; for write-heavy buckets the storage cost can be meaningful, so most teams use AWS Backup for S3 only on critical buckets and rely on versioning+CRR for the bulk.

Q9: What is the operational difference between ReplicaLag on an RDS read replica and replication latency on S3 CRR?

ReplicaLag (CloudWatch metric on the read replica DB instance) measures how many seconds the replica is behind the primary in transaction-log replay. A replica with 60-second ReplicaLag is one minute stale; promoting it loses up to one minute of writes. The metric is exposed continuously and is the single most-watched RDS replication signal. S3 CRR's ReplicationLatency (CloudWatch, when you have RTC enabled) measures the per-object time from source PUT to destination availability; with RTC the SLA is 99.99 percent within 15 minutes. Without RTC, S3 does not publish replication latency by default, and best-effort replication can take much longer for very large objects or during throttling. SOA-C02 sometimes asks "how do I monitor replication is keeping up?" — the answer is ReplicaLag for RDS and ReplicationLatency (with RTC) for S3 CRR.

Q10: We need an RPO of 15 minutes for S3 data and 5 minutes for an RDS database. What is the operational configuration?

For S3: enable versioning on both source and destination buckets and configure CRR with Replication Time Control (RTC), which provides the SLA-backed 15-minute RPO target. Add CloudWatch alarms on ReplicationLatency and BytesPendingReplication so you are paged when replication falls behind. Leave delete-marker replication off if you do not want CRR to propagate deletions, or enable Object Lock compliance mode on the destination to make versions tamper-proof either way. For RDS: enable Multi-AZ for in-Region failover in 60 to 120 seconds, keep automated backups with 35-day retention for in-Region PITR, and add a cross-region read replica (ReplicaLag is typically under 5 seconds) for the cross-region failover scenario. Set a CloudWatch alarm on ReplicaLag > 300 (5 minutes) and another on the gap between the latest restorable time and the current time. The combination meets both RPO targets with measurable, alertable signals.
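The two alarms could be defined along these lines. The dicts follow the CloudWatch PutMetricAlarm parameter shape; the alarm names, evaluation settings, and missing-data behavior are illustrative choices, and the S3 dimensions assume that RTC or replication metrics are enabled on the rule.

```python
def replica_lag_alarm(db_instance_id, threshold_seconds=300):
    """Alarm when the RDS read replica is more than threshold_seconds behind."""
    return {
        "AlarmName": f"{db_instance_id}-replica-lag",
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # no data from the replica is itself a warning
    }


def s3_replication_latency_alarm(source_bucket, dest_bucket, rule_id,
                                 threshold_seconds=900):
    """Alarm when S3 replication latency exceeds the 15-minute RPO."""
    return {
        "AlarmName": f"{source_bucket}-{rule_id}-replication-latency",
        "Namespace": "AWS/S3",
        "MetricName": "ReplicationLatency",
        "Dimensions": [
            {"Name": "SourceBucket", "Value": source_bucket},
            {"Name": "DestinationBucket", "Value": dest_bucket},
            {"Name": "RuleId", "Value": rule_id},
        ],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

Each dict would be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`, usually with `AlarmActions` pointing at an SNS topic so that the on-call engineer is actually paged.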

Once backup and disaster recovery procedures are in place, the next operational layers are: RDS and Aurora resilience for the database tier where PITR and read replica promotion live, CloudTrail and AWS Config for the audit trail that proves backup operations are running and recovery points are immutable, data protection and encryption for the KMS keys that encrypt vaults and cross-region snapshots, and CloudWatch metrics and alarms for the monitoring layer that catches replication lag and backup job failures before the next disaster forces a real failover.

Official Sources