Disaster recovery (DR) is the discipline of getting a workload running again after a destructive event — a region outage, a ransomware detonation, a bad deployment, an accidentally-deleted bucket, or a cascading dependency failure. On SAP-C02, disaster recovery threads through Domain 2 (Design for New Solutions), Domain 3 (Continuous Improvement), and Domain 4 (Accelerate Workload Migration and Modernization). Every scenario that mentions an RTO in minutes, an RPO in seconds, a regulated industry, a compliance auditor, or the phrase "business continuity" is really a disaster recovery question in disguise.
This guide assumes you already know Associate-level building blocks — Multi-AZ RDS, S3 versioning, EBS snapshots, Auto Scaling, Route 53 health checks — and focuses on the Professional-tier DR patterns you need to eliminate wrong answers quickly: the DR strategy ladder (Backup and Restore, Pilot Light, Warm Standby, Multi-Site Active-Active), AWS Elastic Disaster Recovery (DRS), Route 53 Application Recovery Controller (ARC), Aurora Global Database, DynamoDB Global Tables, S3 Cross-Region Replication, multi-region KMS keys, AWS Backup cross-region copy, multi-region landing zones, and chaos engineering with AWS Fault Injection Service. By the end, the canonical pharmaceutical scenario — RTO 15 minutes, RPO 5 minutes, multi-region, compliant — should read as a single coherent architecture.
Why Disaster Recovery Matters on SAP-C02
At Professional tier, AWS expects you to treat DR as a first-class design input, not a retrofit. Scenarios carry explicit RTO and RPO targets, compliance constraints (HIPAA, PCI DSS, GxP, FedRAMP), and cost ceilings, and the correct answer is the cheapest architecture that meets all three targets. Over-engineering is as wrong as under-engineering — picking Multi-Site Active-Active for a workload whose RTO is 24 hours is a wrong answer because it wastes money; picking Backup and Restore for a workload whose RTO is 5 minutes is a wrong answer because it fails the objective.
The exam also routinely pits AWS DRS against CloudEndure legacy patterns, Route 53 ARC against plain Route 53 failover records, Aurora Global Database managed failover against unmanaged promote, and multi-region active-active against warm standby with failover. Knowing which construct fits each RTO/RPO band is the fastest way to narrow four answer choices to one.
- Recovery Time Objective (RTO): the maximum acceptable time between the outage starting and the workload being functional again. Measured wall-clock.
- Recovery Point Objective (RPO): the maximum acceptable data loss expressed as time — how far back the last consistent data point may be. An RPO of 5 minutes means no more than 5 minutes of transactions may be lost.
- Disaster Recovery Strategy Ladder: the four canonical AWS DR strategies, in order of increasing cost and decreasing RTO/RPO — Backup and Restore, Pilot Light, Warm Standby, Multi-Site Active-Active.
- AWS Elastic Disaster Recovery (DRS): the managed service that continuously replicates source servers (on-prem or in AWS) at the block level into an AWS Region, launching recovery instances on demand.
- Route 53 Application Recovery Controller (ARC): the DR-grade traffic-control service that provides readiness checks, routing controls (dataplane-quorum failover switches), and safety rules, running out of five AWS Regions.
- Aurora Global Database: the Amazon Aurora feature that replicates a cluster to up to five secondary Regions with sub-second lag via the Aurora storage layer, enabling an RPO typically around one second and an RTO of under one minute for managed failover.
- DynamoDB Global Tables: a fully managed multi-Region, multi-active replication for DynamoDB tables using last-writer-wins conflict resolution.
- S3 Cross-Region Replication (CRR): asynchronous object replication between buckets in different Regions. Replication Time Control (RTC) is backed by an SLA that 99.99 percent of new objects replicate within 15 minutes.
- Multi-region KMS keys: KMS key material replicated across Regions, sharing key ID and key material so ciphertext encrypted in Region A can be decrypted in Region B.
- Chaos engineering / game days: deliberate fault injection into production-like systems to validate DR runbooks and build confidence in recovery.
- Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html
Plain-Language Explanation: Disaster Recovery
Disaster recovery has a lot of jargon — RTO, RPO, pilot light, warm standby, active-active. Three different analogies make the constructs sticky.
Analogy 1: The Fire Station Ladder
Picture a city's fire response planning. Backup and Restore is like keeping extinguishers and a fire-safety manual in a locked cabinet — cheap to maintain, but when the fire starts somebody has to run to the cabinet, read the manual, and assemble the hoses from scratch. Pilot Light is like the pilot flame kept always lit in the station's furnace — the core of the fire team is already on duty with trucks fueled and engines warm, but the full crew must be summoned and the ladders and hoses readied before the real response begins. Warm Standby is a scaled-down fire station fully staffed at a neighboring block — trucks are running, crew is on shift, radios on, but the station is half-size; when the call comes in, it responds immediately and scales up by calling additional reserves. Multi-Site Active-Active is two equally-staffed fire stations at either end of town operating simultaneously, each handling half the calls on an ordinary day; when one station burns down, the other already knows the streets and can absorb the full load instantly without any scaling. Each rung up the ladder costs more money, and you only climb as high as your RTO/RPO force you to.
AWS DRS is the automated mutual-aid agreement that continuously mirrors every piece of equipment from your primary station to a backup station — at any moment you can evacuate to the backup and the trucks, hoses, and personnel manifests match byte-for-byte. Route 53 ARC is the 911 dispatcher with a separate, redundant switchboard — even if the main city network melts, the dispatcher still routes calls to the healthy station because the switchboard runs in five cities on a separate power grid. Aurora Global Database is the shared fire-department records system replicated across cities in real time — every incident report written in City A is visible in City B within a second, so the responding crew always has the full history.
Analogy 2: The Restaurant Franchise With Multiple Kitchens
A restaurant chain planning for kitchen fires maps cleanly to DR. Backup and Restore is keeping recipe books and bulk ingredients in a warehouse — after a fire you rent a new kitchen, hire a crew, restock, and reopen days later. Pilot Light is renting a dark kitchen in another city with refrigerators running and a skeleton prep crew — core inventory is always fresh but the line cooks are not on shift; when the primary burns, you call in the rest of the crew and flip the "open" sign. Warm Standby is running a half-capacity kitchen in the second city right now — it serves a few lunch customers daily, which keeps the equipment calibrated and the crew practiced, and if the primary fails the manager scales it up to full dinner service within the hour. Multi-Site Active-Active is running two equal-size kitchens in two cities that both serve customers continuously — with a load balancer (DNS) splitting orders evenly, either kitchen can absorb 100 percent of orders instantly if the other goes dark.
S3 Cross-Region Replication is the shared inventory database updated whenever either kitchen receives a delivery, so both kitchens always know the stock. DynamoDB Global Tables is the POS system that accepts orders at either location and reconciles them across both — a customer who ordered in City A can pick up in City B because the ticket was replicated within seconds. KMS multi-region keys are the master key blanks duplicated at both locations so the same safe can be opened with the same key at either kitchen — no need to rekey ciphertext during failover. Game days are unannounced fire drills where a manager yells "pretend the fryer just exploded" and the crew executes the recovery runbook; without these drills the runbook is fiction.
Analogy 3: The Hospital Emergency Response Plan
A hospital's continuity plan is the closest real-world parallel for a regulated workload. RTO is how long a patient can be without care before harm — for a cardiology workload it is minutes; for a records-archive workload it is days. RPO is how much medical history you can afford to lose — for a live cardiology monitor, effectively zero; for quarterly billing, a day might be acceptable. Backup and Restore is the off-site medical records archive on tape — full history preserved, but slow to retrieve. Pilot Light is a second hospital site with the core electronic-health-records system running but no clinical staff on duty — the data stays current, and staff can be paged in during a crisis. Warm Standby is a half-capacity satellite hospital staffed for walk-in care that can scale to full emergency operations when the main hospital fails. Multi-Site Active-Active is two equal-size hospitals across town with the same EHR, the same staffing, and traffic split by ZIP code — either can absorb the other's patient load immediately during a regional disaster.
For SAP-C02, the fire station ladder analogy is the most useful when a question mixes RTO, RPO, and cost into a single scenario — each rung up the ladder is 2–5x the cost of the previous and cuts RTO/RPO by an order of magnitude. Do not pick the top rung unless the scenario's RTO/RPO forces you there, and do not pick the bottom rung when the RPO is in minutes. Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html
RTO, RPO, and the DR Strategy Ladder
Before picking a specific AWS service, you must internalize how RTO and RPO map to the four canonical DR strategies. SAP-C02 scenarios usually give you an RTO and RPO value explicitly or embed them in the narrative ("the trading floor cannot lose more than 30 seconds of orders"), and the correct strategy is the cheapest one whose theoretical floor meets both numbers.
The ladder at a glance
| Strategy | Typical RTO | Typical RPO | Relative cost | Complexity | When to choose |
|---|---|---|---|---|---|
| Backup and Restore | Hours to days | Hours | 1x (lowest) | Low | Dev/test, tier-3 workloads, compliance archives |
| Pilot Light | Tens of minutes to hours | Minutes | 2–3x | Medium | Tier-2 internal apps, batch workloads |
| Warm Standby | Minutes | Seconds to minutes | 4–6x | Medium-High | Tier-1 customer-facing apps with tight RTO |
| Multi-Site Active-Active | Seconds to under a minute | Near-zero (seconds) | 8x+ | Very High | Global workloads, financial trading, life-critical systems |
Cost multipliers are order-of-magnitude indicators — actual cost depends on workload shape (compute-heavy vs storage-heavy) and region pricing. The key signal on SAP-C02: every rung up the ladder at least doubles ongoing spend, because you are running more infrastructure in the standby Region.
Strategy 1: Backup and Restore
The workload runs in Region A; backups are copied to Region B. In a disaster, infrastructure is provisioned fresh in Region B from CloudFormation/CDK/Terraform and data is restored from backups.
- Data path: AWS Backup → cross-region copy → Backup Vault in Region B. EBS snapshots, RDS snapshots, DynamoDB on-demand backups, S3 objects via CRR, FSx snapshots.
- Compute path: CloudFormation StackSets or a Terraform pipeline deploys the VPC, subnets, security groups, ALB, Auto Scaling group, and application tier on demand.
- RTO driver: how long CloudFormation plus data restore takes — typically hours for non-trivial workloads, because RDS snapshot restore alone is measured in tens of minutes to hours, and very large S3 datasets may need AWS DataSync or S3 Batch Operations to rehydrate.
- RPO driver: backup frequency. AWS Backup scheduled every 1 hour gives ~1-hour RPO; Continuous Backup for Aurora and DynamoDB gives point-in-time recovery within seconds of the event, reducing RPO drastically.
- Right answer when: RTO is 4+ hours, cost is paramount, workloads are non-critical or batch, or for compliance archives where data preservation matters more than quick recovery.
A common SAP-C02 confusion: Backup and Restore as a DR strategy is not the same as having backups. The strategy means you rely on backups as the sole mechanism for recovering into the DR Region, with no pre-provisioned infrastructure standing by. Every workload should have backups; only some workloads use Backup and Restore as their DR strategy. Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/backup-and-restore.html
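The RPO arithmetic for this strategy can be made concrete. A minimal sketch, assuming a worst-case disaster that strikes just before the next backup completes; `copy_lag` is a hypothetical allowance for the cross-region copy job to land, not an AWS parameter:

```python
from datetime import timedelta

def worst_case_rpo(backup_interval, copy_lag=timedelta(0)):
    """Worst-case data loss for Backup and Restore: a disaster striking
    just before the next backup completes loses one full interval of
    transactions, plus however long the cross-region copy takes to land
    in the DR Region's vault."""
    return backup_interval + copy_lag

# Hourly AWS Backup plan whose cross-region copy takes ~10 minutes:
rpo = worst_case_rpo(timedelta(hours=1), copy_lag=timedelta(minutes=10))
```

This is why continuous backup (point-in-time recovery) collapses the first term toward zero: the recovery point is no longer gated on a scheduled job.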
Strategy 2: Pilot Light
A minimal core of the workload runs continuously in Region B — typically the data tier and the network/IAM scaffolding — while compute is stopped or scaled to zero. On failover, compute is started or scaled up, and traffic is redirected.
- Data path: Aurora Global Database secondary, DynamoDB Global Tables replica, S3 CRR destination bucket, EFS replication, FSx cross-region replication. The data tier is always current in Region B.
- Compute path: Launch Templates, Auto Scaling groups at desired-capacity zero, stopped EC2 instances, or undeployed ECS services. Infrastructure exists but consumes minimal cost.
- RTO driver: the time to scale compute up, warm caches, and flip DNS. For an Auto Scaling group starting from zero, 5–15 minutes is realistic; for an EKS cluster with nodes pre-warmed but deployments scaled to zero, 2–5 minutes.
- RPO driver: data replication lag. Aurora Global Database: typically under 1 second. DynamoDB Global Tables: usually under 1 second. S3 CRR without RTC: minutes; with RTC: 15-minute 99.99 percent SLA.
- Right answer when: RTO is 30 minutes to 2 hours, RPO is seconds to a few minutes, and running full warm capacity all day is too expensive.
Strategy 3: Warm Standby
A scaled-down but fully functional version of the workload runs continuously in Region B, serving a fraction of real traffic or just health-check traffic. On failover, it scales up to full capacity and absorbs 100 percent of traffic.
- Data path: same as Pilot Light — Global Database, Global Tables, CRR — with the difference being the application tier is already processing requests.
- Compute path: Auto Scaling groups running at a minimum capacity (e.g., 2 instances in Region B vs 20 in Region A), ECS services at reduced task count, Lambda functions warm via provisioned concurrency, or ALBs already routed to a small fleet.
- RTO driver: the time to scale up from low to full capacity — seconds for Lambda, 1–3 minutes for EC2 Auto Scaling reaching warm pools, minutes for container services depending on image pull cache.
- RPO driver: same as Pilot Light — essentially the replication lag of the chosen data services.
- Right answer when: RTO is 2–15 minutes, RPO is seconds to a minute, the workload is revenue-critical, and the cost of always-on reduced capacity is acceptable.
Strategy 4: Multi-Site Active-Active
Two (or more) Regions serve traffic simultaneously, each at full or near-full capacity. A regional failure instantly shifts all traffic to the surviving Region with no capacity ramp-up.
- Data path: multi-active data stores are mandatory — DynamoDB Global Tables (multi-active writer per Region), Aurora Global Database with write forwarding (single active writer, reads everywhere) or multiple Aurora clusters fronted by app-layer conflict resolution, S3 CRR bidirectional (source and destination in both Regions). Conflict resolution becomes a design issue.
- Compute path: both Regions run the full stack behind DNS-based load balancing (Route 53 latency-based or geoproximity routing) or a global accelerator.
- RTO driver: essentially DNS TTL plus ARC switch latency — sub-minute in most cases, and zero for clients already hitting the surviving Region.
- RPO driver: replication lag between Regions — for DynamoDB Global Tables, usually well under 1 second; for Aurora Global Database with write forwarding, under 1 second but with cross-region write latency for non-local writes.
- Right answer when: RTO under 1 minute, RPO approaching zero, global user base demanding low latency, regulatory or customer commitments requiring no observable downtime.
- Backup and Restore: RTO hours, RPO hours, cost 1x. Redeploy on demand.
- Pilot Light: RTO 30+ min, RPO minutes, cost 2–3x. Data live, compute dark.
- Warm Standby: RTO minutes, RPO seconds, cost 4–6x. Data live, compute small-but-live.
- Multi-Site Active-Active: RTO seconds, RPO near zero, cost 8x+. Both Regions serve real traffic.
- Pick the cheapest rung whose floor meets RTO and RPO — never higher.
Mapping scenarios to the ladder
| Scenario phrase | Ladder answer |
|---|---|
| "Internal reporting tool; can be down overnight" | Backup and Restore |
| "Batch ETL that runs nightly" | Backup and Restore or Pilot Light |
| "Customer-facing web app; 1-hour RTO acceptable" | Pilot Light |
| "E-commerce site for a regional retailer; 15-minute RTO" | Warm Standby |
| "Financial trading platform; 30-second RTO, zero data loss" | Multi-Site Active-Active |
| "Global SaaS with users on 3 continents" | Multi-Site Active-Active with latency routing |
| "Pharma clinical trial app; 15-minute RTO, 5-minute RPO, compliance" | Warm Standby with DRS or Pilot Light+ (see canonical scenario) |
AWS Elastic Disaster Recovery (DRS) — Block-Level Replication and Failback
AWS Elastic Disaster Recovery (DRS), the successor to CloudEndure Disaster Recovery, is AWS's managed block-level DR service. It runs on source servers — on-premises physical, on-premises virtual (VMware, Hyper-V), or EC2 instances in a different Region — and continuously replicates every disk write to a staging area in the target AWS Region. On failover, DRS launches full-fidelity recovery EC2 instances from the replicated blocks within minutes.
How DRS works mechanically
- Install the AWS Replication Agent on each source server. It enumerates disks and attaches a lightweight I/O filter.
- The agent continuously ships block-level changes to a staging subnet in the target Region, where data lands on low-cost staging EBS volumes attached to small replication servers (a fleet DRS manages).
- Point-in-time snapshots are retained for a configurable window of up to 365 days, so you can launch recovery at any of those points — critical for ransomware rollback.
- On failover, DRS launches EC2 instances in the target Region using the latest snapshot (or a chosen PIT), attaches EBS volumes built from the staging data, and applies launch template settings (instance type, subnet, security groups, IAM role).
- After primary recovery, failback reverses replication from AWS back to the source environment.
Source types DRS supports
- On-premises physical servers (Windows Server, Linux) — DRS is the canonical answer for "we need AWS to be our DR site for our data center without rearchitecting our applications".
- On-premises virtual machines (VMware vSphere, Microsoft Hyper-V) — treated the same as physical from DRS's point of view.
- Cross-Region EC2 — replicate an EC2 workload running in `us-east-1` to `us-west-2` for region-level DR without re-platforming to Aurora Global or other managed multi-region services.
- Cross-Cloud — servers running on other clouds can also replicate to AWS.
RTO and RPO characteristics
- RPO: sub-second under steady state, typically under 1 second for a well-provisioned staging area. Bandwidth-limited environments can see RPO drift up during ingest spikes — monitor via CloudWatch.
- RTO: typically 5–20 minutes from failover initiation to EC2 instances in the target Region responding to traffic, depending on instance boot time, post-launch automation, and warm-up.
Failback
Failback is the often-overlooked other half of DR. Once your primary site is healthy, you must return the workload and re-establish protection:
- Install the Failback Client on a server in the original environment (or use the Failback automation for EC2-as-source).
- Reverse replication flows from the recovery EC2 instances to the original server targets.
- Perform a controlled cutover back to the original site.
- Restart replication in the original direction (source → AWS).
SAP-C02 scenarios frequently ask about the full cycle — failover and failback — and the right answer involves DRS on both legs, not just the outbound.
DRS vs other DR patterns
DRS is the right choice when you want like-for-like recovery of an existing application without modifying it. It shines for:
- Lift-and-shift workloads that cannot be refactored to Aurora Global or DynamoDB Global Tables.
- Legacy applications (Windows Server with complex stateful services, third-party appliances with EBS-like volumes) where vendor support requires original binaries.
- Hybrid DR from on-prem data centers to AWS, eliminating the need for a secondary physical site.
- Ransomware protection using PIT snapshots up to 365 days — you can launch a recovery from a week ago before the encryption payload detonated.
DRS is not the right answer for:
- Workloads already running on managed services (Aurora, DynamoDB, Lambda, ECS Fargate) — those services have their own cross-region features that are simpler and cheaper.
- Cases where you actually want a re-platforming migration — that is AWS Application Migration Service (MGN), DRS's sibling product that reuses the same replication agent but is billed and marketed for migration.
AWS Application Migration Service (MGN) and AWS Elastic Disaster Recovery (DRS) share the same replication agent and block-level engine. MGN is optimized for one-way migration with a terminal cutover — after cutover, replication stops. DRS is optimized for continuous ongoing protection — replication runs forever, and you can fail over repeatedly for game days. Pricing reflects this: MGN charges per-server during replication and stops billing after cutover; DRS charges per source server continuously. On SAP-C02, pick MGN when the scenario says "migrate to AWS and decommission the source"; pick DRS when it says "use AWS as our DR site for on-prem workloads". Reference: https://docs.aws.amazon.com/drs/latest/userguide/what-is-drs.html
DRS security and compliance
- Encryption in transit: TLS from agent to staging.
- Encryption at rest: staging EBS volumes encrypted with KMS; recovery volumes inherit encryption.
- Network segmentation: staging subnet is isolated; outbound communication via endpoints to DRS service APIs.
- IAM integration: all operations via IAM roles, auditable in CloudTrail.
- Compliance: DRS is in scope for the major AWS compliance programs (HIPAA eligibility, PCI DSS, SOC, and FedRAMP where applicable).
For regulated industries (pharma, finance, healthcare), DRS is often the answer when the question asks for application-agnostic cross-region DR of a vendor-supplied package that cannot be re-platformed.
Route 53 Application Recovery Controller (ARC) — Readiness and Routing Controls
Amazon Route 53 Application Recovery Controller (ARC) is the DR-grade traffic-control companion to Route 53. Plain Route 53 health checks and failover records are good enough for most workloads, but they have two limitations that regulated, tight-RTO workloads cannot accept: (1) DNS health checks can be slow and noisy under partial failures, and (2) the control plane runs in a single Region. ARC addresses both.
Three pillars of ARC
- Readiness checks — continuous validation that the standby Region can actually take traffic. ARC examines per-resource attributes (Auto Scaling capacity, ALB target group health, RDS instance status, DynamoDB provisioned throughput, IAM roles, KMS keys, service quotas) and reports whether the standby is ready.
- Routing controls — human-operated (or automation-operated) on/off switches that flip traffic between Regions. Each routing control is wired to Route 53 health checks; flipping the control causes the health check to pass or fail, which drives DNS failover or ALB listener rules.
- Safety rules — guardrails on routing controls: assertion rules (e.g., "at least one Region must be on at all times") and gating rules (e.g., "to turn off Region A, Region B must be on and ready") prevent a single operator from flipping both Regions off.
The ARC cluster — five-Region dataplane
The critical Pro-tier detail: ARC's routing-control dataplane runs in five AWS Regions (us-east-1, us-west-2, ap-northeast-1, ap-southeast-2, eu-west-1) as a cluster. You call any cluster endpoint to flip a control, and a quorum of the five Regions must acknowledge. This architecture survives the loss of any single Region — including the one you are trying to fail away from.
Each ARC cluster carries a fixed hourly charge that adds up to a substantial monthly cost, plus per-routing-control costs. SAP-C02 does not test the dollar figures but does test the architectural claim: ARC is more available than Route 53 itself for failover decisions, because the control plane is multi-region.
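A toy model of the quorum behavior (not ARC's actual protocol): a routing-control update commits only when a majority of the five regional endpoints acknowledge it, so losing any one or two Regions, including the one you are failing away from, never blocks failover.

```python
CLUSTER_REGIONS = ["us-east-1", "us-west-2", "ap-northeast-1",
                   "ap-southeast-2", "eu-west-1"]

def flip_accepted(acks):
    """A flip commits only with a majority (3 of 5) of regional
    endpoints acknowledging; a minority partition cannot commit."""
    majority = len(CLUSTER_REGIONS) // 2 + 1
    return len(set(acks) & set(CLUSTER_REGIONS)) >= majority
```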
Readiness checks in depth
Readiness scopes are defined at three levels:
- Recovery group — the top-level DR unit representing an application.
- Cell — a replica of the application in a specific Region or AZ.
- Resource set — the actual AWS resources to monitor (e.g., ALBs, ASGs, RDS instances).
ARC continuously evaluates each resource against its readiness rules (pre-built rules like "ASG desired capacity equals maximum", "ALB target group has healthy instances", "RDS is available", plus custom service-quota rules). A readiness check aggregates into the cell's readiness status, and then the recovery group's readiness.
You use readiness checks as pre-failover confidence: during a game day or real failover, operators verify the target cell is READY before flipping the routing control. This catches the classic "we failed over and discovered the standby's quota was set to 10 instead of 200" problem before it breaks production.
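The rollup described above, resource rules into cell status into recovery-group status, can be sketched as a simple aggregation. The cell and rule names here are hypothetical; the quota mismatch mirrors the classic failure mode just mentioned:

```python
def readiness(cells):
    """Roll per-resource rule results up to cell and recovery-group level,
    mirroring ARC's resource set -> cell -> recovery group aggregation.
    A cell is READY only if every resource rule passes."""
    cell_status = {name: all(rules.values()) for name, rules in cells.items()}
    return {"cells": cell_status, "group_ready": all(cell_status.values())}

# Hypothetical two-cell recovery group; the standby's EC2 quota is too low:
status = readiness({
    "primary-us-east-1": {"asg_capacity": True, "alb_targets_healthy": True, "ec2_quota_ok": True},
    "standby-us-west-2": {"asg_capacity": True, "alb_targets_healthy": True, "ec2_quota_ok": False},
})
```

The standby cell reports NOT READY before anyone flips traffic to it, which is exactly the pre-failover confidence the service exists to provide.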
Routing controls and safety rules in depth
A routing control is a named boolean (On/Off) tied to an ARC cluster. Associate it with a Route 53 health check of the Recovery Control type, which reflects the routing control's state. Route 53 DNS records using failover or weighted routing then react as the health check flips.
Safety rules prevent operator error:
- Assertion rule: "the sum of On states for controls A, B, and C must be at least 1" — ensures at least one cell is always on.
- Gating rule: "control X cannot be turned On unless control Y is On" — enforces dependency order.
- Wait period: enforce a cool-off after a flip.
Safety rules are evaluated by the ARC cluster and reject invalid operations at the control plane — no amount of console clicking can violate them.
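A minimal sketch of how an assertion rule rejects an unsafe flip, assuming a hypothetical two-control setup (primary plus DR). ARC evaluates real rules inside the cluster's control plane, not client-side like this:

```python
class SafetyViolation(Exception):
    pass

def apply_flip(controls, name, state):
    """Evaluate an assertion rule ('at least one cell must stay On')
    against the proposed state before committing the flip; gating rules
    and wait periods would be checked the same way."""
    proposed = {**controls, name: state}
    if not any(proposed.values()):
        raise SafetyViolation("at least one cell must stay On")
    return proposed

controls = {"primary-region": True, "dr-region": False}
controls = apply_flip(controls, "dr-region", True)        # bring DR on first
controls = apply_flip(controls, "primary-region", False)  # now safe to shed primary
```

Flipping both controls off in either order is rejected before it ever reaches DNS, which is the whole point of the guardrail.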
When to use ARC vs plain Route 53 health checks
Plain Route 53 failover records + health checks are sufficient for most web apps — they are free (beyond the health check cost), automatic, and widely deployed. ARC is the right answer when:
- The workload cannot tolerate Route 53's single-Region control-plane limitation.
- Failover must be manually orchestrated by an operator or runbook, not automatic — because automatic DNS failover can misfire on partial failures.
- Compliance requires auditable, explicit failover decisions with safety rules and dual-control sign-off.
- Readiness checks are needed to validate the standby's capacity and quotas before flipping.
SAP-C02 signal phrases for ARC: "highly regulated", "financial services", "healthcare", "auditable failover", "pre-validated standby", "fail over without relying on Route 53 control plane".
When an SAP-C02 scenario says "the company needs to fail over the workload to the DR Region without depending on the health of the primary Region's control plane" or "auditors require explicit, reviewed failover decisions with dual-control", the answer is ARC — not plain Route 53 failover records. Remember that ARC itself costs non-trivially per month, so normal web apps with automated failover stick to Route 53. Reference: https://docs.aws.amazon.com/r53recovery/latest/dg/what-is-route-53-recovery.html
Aurora Global Database — Write Forwarding, Managed Failover, and Detached Standby
Amazon Aurora Global Database is the pro-tier cross-region relational database on AWS. It extends an Aurora cluster to up to five additional AWS Regions, each with its own Aurora cluster using storage-level replication that reaches sub-second lag. It is the default SAP-C02 answer for any scenario that says "cross-region relational database with sub-second RPO".
Architecture
- One primary cluster in the primary Region accepts reads and writes.
- Up to five secondary clusters in other Regions replicate via the Aurora storage layer (not via binlog — Aurora's storage is a distributed shared-storage fabric, so replication is handled beneath the database engine).
- Each secondary is a full Aurora cluster with its own readers, capable of serving low-latency reads in its Region and promoting to a full primary if needed.
Replication lag is typically under 1 second at the storage layer, often measuring in tens to hundreds of milliseconds.
Managed failover
Managed failover (also called managed planned failover) is the modern way to promote a secondary to primary under controlled conditions — during game days, planned maintenance, or after a confirmed Region disruption:
- You initiate managed failover from the console, CLI, or API, specifying the target secondary Region.
- Aurora coordinates the switch: it promotes the chosen secondary to primary, demotes the former primary to a secondary, and re-orients all replication in the background.
- Typical RTO for managed failover is under 1 minute; RPO is near-zero (under 1 second) because replication was up-to-date at the moment of promotion.
Managed failover preserves the global database topology — no manual detach/reattach — and is safer than the older "detach and promote" workaround, which leaves you with separate clusters that must be manually rejoined.
Detached standby (unplanned failover)
For unplanned failover — when the primary Region is actually down and managed failover cannot reach it — you use the detach and promote path:
- Remove the primary cluster from the global database (in the secondary's console, because the primary may be unreachable).
- The secondary becomes a standalone Aurora cluster that you can promote to primary.
- Redirect application writes to the (now independent) cluster.
- After the disaster passes, rebuild the global database by creating a new secondary or restoring from backup.
Detached-standby is the path that meets the sub-minute RTO claim even when the primary Region's Aurora control plane is unavailable. The trade-off: the pre-disaster primary and the now-promoted cluster are two separate databases; reconciling any late transactions is an application concern.
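The choice between the two promotion paths can be sketched as a runbook helper. The RPO bookkeeping here is illustrative, not an AWS API; it assumes managed failover loses no committed data because replication is synchronized at promotion, while detach-and-promote forfeits transactions newer than the replication lag:

```python
def promotion_plan(primary_reachable, replication_lag_s, rpo_budget_s):
    """Decide between the two Aurora Global Database promotion paths.
    Managed failover needs the primary Region's control plane and loses
    no committed data; detach-and-promote works when the primary is down,
    at the cost of any transactions newer than the replication lag."""
    if primary_reachable:
        return {"path": "managed failover",
                "expected_data_loss_s": 0.0,
                "within_rpo": True}
    return {"path": "detach and promote",
            "expected_data_loss_s": replication_lag_s,
            "within_rpo": replication_lag_s <= rpo_budget_s}

# Primary Region hard down, ~0.8 s of lag, 5-minute RPO budget:
plan = promotion_plan(False, replication_lag_s=0.8, rpo_budget_s=300)
```

With sub-second lag, even the unplanned path sits comfortably inside a 5-minute RPO, which is why Aurora Global Database anchors the canonical pharma scenario.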
Write forwarding
Aurora Global Database write forwarding is a feature that lets secondary clusters accept write statements from applications without failover. Writes are forwarded over the AWS backbone to the primary Region, executed there, and the result returned to the originating secondary. This enables apps with global users to use a single write endpoint (the local secondary) while Aurora transparently routes writes to the primary.
Write forwarding characteristics:
- Supported for Aurora MySQL and Aurora PostgreSQL (check current version matrix).
- Adds cross-region write latency (tens to hundreds of ms) — reads in the secondary remain local and fast.
- Offers session consistency levels: `EVENTUAL`, `SESSION`, and `GLOBAL` — trading latency for consistency depending on the workload.
- Ideal for multi-region active-read, single-primary-write patterns where the app wants a simple client while the workload is distributed across Regions.
Aurora Global Database vs Multi-AZ vs read replicas
On SAP-C02, distinguish:
- Multi-AZ Aurora (the default): one cluster, data replicated across three AZs via the storage fabric. Handles AZ failure, not Region failure. RPO zero, RTO 30–60 seconds via automatic failover.
- Aurora cross-region read replicas (legacy): single-region cluster with a read replica in another Region using binlog. Replaced by Global Database for new designs; still appears in legacy systems.
- Aurora Global Database: cluster-of-clusters via storage replication. Region-level DR. RPO under 1 second, RTO under 1 minute managed, under a few minutes unplanned.
When SAP-C02 says "multi-region RDBMS with sub-second RPO and sub-minute RTO", the answer is Aurora Global Database. Plain RDS cross-region read replicas are a weaker option (binlog-based, higher lag). Pick Global Database and then decide between managed failover (planned) and detach-promote (unplanned) based on whether the primary is reachable. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html
DynamoDB Global Tables — Multi-Region, Multi-Active Replication
Amazon DynamoDB Global Tables provides fully managed multi-Region, multi-active replication for DynamoDB tables. It is the NoSQL sibling of Aurora Global Database, but with a fundamentally different consistency model: every Region is a writer, and conflicts are resolved with last-writer-wins based on the item's most recent timestamp.
Architecture
- Create a table in Region A, then add replicas in Regions B, C, etc. (Version 2019.11.21 is the current "global tables v2" — ignore the older v1 that required capacity matching).
- Writes to any replica propagate to all other replicas typically within seconds, often under 1 second.
- Reads are always local to the Region; writes are local and replicated in the background.
- Each replica carries its own provisioned or on-demand capacity configuration.
Consistency model
- Within a single Region, DynamoDB offers strong or eventually consistent reads as usual.
- Across Regions, the replication is eventual (seconds) — a write in Region A is not immediately visible in Region B.
- Conflicts (same key written in two Regions within the replication window) resolve via last-writer-wins based on a per-item update timestamp maintained by DynamoDB (surfaced as the `aws:rep:updatetime` attribute in the legacy 2017 version) — there is no application-controlled conflict resolution without design patterns (see below).
Design patterns to avoid surprises
- Region-sticky sessions: route each user's writes to a single Region (e.g., via Route 53 geoproximity) so conflicts are rare. Cross-region failover moves the user to a new Region; background replication reconciles.
- Idempotent writes: design items so that re-applying the same write is safe.
- Append-only event streams: avoid updates to the same key from multiple Regions by using immutable event keys.
- DynamoDB Streams + Lambda for custom reconciliation: when business logic must merge conflicting writes, read Streams and write reconciled state back.
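The LWW behavior those patterns guard against can be seen in a tiny simulation. This is an illustrative sketch, not the DynamoDB implementation: the `ItemVersion` shape and numeric timestamps are assumptions standing in for the per-item timestamp DynamoDB maintains internally.

```python
from dataclasses import dataclass

@dataclass
class ItemVersion:
    value: str
    update_time: float  # stand-in for the timestamp DynamoDB tracks per item
    region: str

def lww_resolve(a: ItemVersion, b: ItemVersion) -> ItemVersion:
    """Return the version that survives replication: the later timestamp wins."""
    return a if a.update_time >= b.update_time else b

# The same study-protocol item is updated in two Regions 200 ms apart.
us_write = ItemVersion("dose=10mg", update_time=1000.0, region="us-east-1")
eu_write = ItemVersion("dose=20mg", update_time=1000.2, region="eu-west-1")

winner = lww_resolve(us_write, eu_write)
print(winner.region, winner.value)  # the later eu-west-1 write silently overwrites the other
```

The later write wins regardless of which Region it landed in, which is exactly why region-sticky routing and idempotent writes keep conflicts rare and harmless.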
When DynamoDB Global Tables is the right answer
- The workload is NoSQL-friendly: key-value or document, single-item lookups, predictable access patterns.
- Multi-region writes are needed for user latency (global user base).
- RTO must be near-zero and RPO in seconds.
- Cost is acceptable: each replica is essentially a full copy, so you pay for storage and capacity in each Region.
SAP-C02 signal phrases: "multi-region writes", "low-latency for users across continents", "DynamoDB", "eventually consistent across Regions".
Cross-region backup
DynamoDB's point-in-time recovery (PITR) is per-table and per-Region. For compliance, also use AWS Backup with cross-region copy to preserve restorable snapshots in the DR Region independent of Global Tables replication — this defends against logical corruption (a bad deployment writing junk) that Global Tables would cheerfully replicate.
S3 Cross-Region Replication (CRR) and Same-Region Replication (SRR) with Replication Time Control
Amazon S3 Replication mirrors objects between buckets asynchronously. It is the default SAP-C02 answer for cross-region S3 DR, but it has enough nuances — RTC, SRR, replication filters, delete replication, encryption handling, batch operations — that the exam loves to probe edges.
CRR vs SRR
- Cross-Region Replication (CRR): source and destination buckets in different Regions. Used for DR, latency optimization for global reads, and compliance-mandated Region separation.
- Same-Region Replication (SRR): source and destination buckets in the same Region, typically in different accounts. Used for log aggregation into a central account, data sovereignty (keeping replicas in-Region but in a different account), and test data population.
Both share the same underlying replication engine; the Region pairing distinguishes them.
Replication Time Control (RTC)
S3 Replication Time Control (RTC) is the premium replication tier that guarantees 99.99 percent of new objects replicate within 15 minutes, with the majority replicating in seconds. RTC includes:
- A 15-minute SLA for 99.99 percent of objects, backed by AWS service credits.
- Replication metrics in CloudWatch (`BytesPendingReplication`, `OperationsPendingReplication`, `ReplicationLatency`) for real-time monitoring.
- Replication events in EventBridge for automation (e.g., trigger Lambda when an object exceeds a threshold).
RTC costs more than standard replication (per-GB data transfer plus an RTC surcharge). Pick it when the workload has a tight RPO for S3 data — for example, a pharma workload where lab data must reach the DR Region within 15 minutes to meet RPO.
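A minimal sketch of how the RTC CloudWatch metrics could feed an RPO health check. The metric semantics (`ReplicationLatency` in seconds, a count of pending operations) follow the RTC metrics named above; the 10,000-operation backlog threshold is an assumed alerting value, not an AWS default.

```python
def rtc_within_rpo(replication_latency_s: float,
                   operations_pending: int,
                   rpo_seconds: int = 300) -> bool:
    """True if the current replication state still honors the RPO budget."""
    # Latency at or past the RPO means objects written rpo_seconds ago
    # may not yet exist in the DR Region.
    if replication_latency_s >= rpo_seconds:
        return False
    # A growing pending backlog is the early-warning signal before
    # latency itself blows out (assumed threshold for this sketch).
    return operations_pending < 10_000

print(rtc_within_rpo(45.0, 120))   # healthy: lag well inside a 5-minute RPO -> True
print(rtc_within_rpo(900.0, 120))  # 15 minutes of lag breaches a 5-minute RPO -> False
```

In practice you would wire these thresholds into CloudWatch alarms rather than poll them in code; the point is that RPO compliance is observable, not assumed.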
Replication features worth memorizing
- Filter by prefix or object tag — replicate only specific subsets.
- Delete-marker replication — can be toggled on or off; SAP-C02 frequently tests that delete markers are not replicated by default, which is the right design for ransomware-resilient replication (a deletion in the source does not propagate).
- Replica modification sync — when replicas are modified independently, sync changes back.
- Existing object replication — replicate objects that existed before replication was configured, via S3 Batch Replication.
- KMS-encrypted objects — if the source uses SSE-KMS, configure the replication role with `kms:Decrypt` on the source key and `kms:Encrypt` on the destination key; multi-region KMS keys simplify this.
- Cross-account replication — the destination account owns the replicated objects if you enable the owner-override (change replica ownership) option.
- Two-way replication — configure replication on both buckets for active-active patterns.
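Several of these features come together in a single replication rule. Below is a hedged sketch shaped like the `ReplicationConfiguration` payload that `s3.put_bucket_replication` accepts, with RTC on, delete-marker replication off, a prefix filter, and SSE-KMS handling; every ARN, bucket name, and prefix is hypothetical.

```python
replication_config = {
    "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
    "Rules": [{
        "ID": "dr-lab-results",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {"Prefix": "lab-results/"},               # replicate only this subset
        "DeleteMarkerReplication": {"Status": "Disabled"},  # ransomware-safer default
        "SourceSelectionCriteria": {
            # include SSE-KMS-encrypted objects in replication
            "SseKmsEncryptedObjects": {"Status": "Enabled"}
        },
        "Destination": {
            "Bucket": "arn:aws:s3:::dr-lab-results-us-west-2",
            "EncryptionConfiguration": {
                # destination replica of a multi-region key re-encrypts
                # without round-tripping plaintext across Regions
                "ReplicaKmsKeyID": "arn:aws:kms:us-west-2:111122223333:key/mrk-EXAMPLE"
            },
            "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},   # RTC
            "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}, # required with RTC
        },
    }],
}

# Applied (not run here) as:
# s3.put_bucket_replication(Bucket="lab-results-us-east-1",
#                           ReplicationConfiguration=replication_config)
```

Note that enabling `ReplicationTime` requires `Metrics` to be enabled as well, and the SLA window is fixed at 15 minutes.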
S3 multi-region access points
For active-active S3 architectures, S3 Multi-Region Access Points (MRAPs) provide a single global endpoint that routes to the nearest healthy bucket using AWS Global Accelerator. Combined with bidirectional CRR, MRAPs give the application one URL regardless of Region, while S3 handles traffic steering and failover.
A classic SAP-C02 trap: the scenario says "we need DR for our S3 data, we have CRR enabled — are we protected against ransomware?" The answer is: not fully. S3 Replication is forward replication, so an attacker who deletes or encrypts objects in the source sees those changes propagate (unless delete-marker replication is off and versioning is on). For true ransomware protection, combine CRR with S3 Versioning, MFA Delete, Object Lock in compliance mode, and AWS Backup cross-region copy for immutable point-in-time recovery points. Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
Cross-Region KMS Key Management
Encryption keys cross regions differently than data. Getting this right is a frequent SAP-C02 topic because it is subtle.
Single-region KMS keys (default)
A regular AWS KMS key is Region-scoped. Ciphertext encrypted with a key in us-east-1 can only be decrypted by that same key in us-east-1. For cross-region DR you would need to decrypt-then-re-encrypt on the destination side, or design around it.
Multi-region KMS keys
Multi-region KMS keys (introduced in 2021) share key material and key ID across Regions. You create a primary multi-region key in one Region and replicas in other Regions; ciphertext encrypted by any replica is decryptable by any other replica because they share key material.
Key properties:
- All replicas have the same key ID (with a region prefix in the ARN).
- Key material is copied securely between Regions at creation; you can also use imported key material in multi-region mode.
- Rotation applies consistently across replicas.
- Key policies are per-Region — you can allow different principals in different Regions.
- Symmetric, asymmetric, and HMAC keys can all be created as multi-region keys (check the docs for the current support matrix).
Multi-region keys are the right choice when:
- Application data encrypted in Region A must be decrypted in Region B with low cost and low latency.
- Cross-region S3 replication of SSE-KMS-encrypted objects, where ciphertext is re-encrypted by the destination replica without round-tripping plaintext.
- DynamoDB Global Tables with encryption where each replica can decrypt locally.
When to keep single-region keys
- Compliance or data sovereignty requires cryptographic isolation per Region.
- The workload is single-region with cross-region backup only — AWS Backup handles re-encryption automatically using the destination Backup Vault's key.
- Key material must never leave a specific Region (some regulated industries).
For SAP-C02, the key question is: "can the destination Region decrypt the data by itself during failover?" If yes and you use KMS, the answer is multi-region keys. If the answer is "no, we want per-Region isolation with decrypt done via re-encryption during backup copy", the answer is single-region keys + AWS Backup cross-region copy with destination-side re-encryption. Both are correct patterns; the scenario's constraints pick one. Reference: https://docs.aws.amazon.com/kms/latest/developerguide/multi-region-keys-overview.html
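That decision can be reduced to a small helper. The two inputs and the rule of thumb are distilled from this section; this is a study aid, not an AWS API.

```python
def choose_kms_strategy(dr_region_must_decrypt_independently: bool,
                        per_region_crypto_isolation_required: bool) -> str:
    """Distill the section's key question into a deterministic rule of thumb."""
    if per_region_crypto_isolation_required:
        # Sovereignty/compliance wins: keep keys Region-scoped and rely on
        # AWS Backup cross-region copy re-encrypting with the destination vault's key.
        return "single-region keys + AWS Backup cross-region copy"
    if dr_region_must_decrypt_independently:
        return "multi-region KMS keys"
    return "single-region keys"

print(choose_kms_strategy(True, False))  # -> multi-region KMS keys
print(choose_kms_strategy(True, True))   # isolation overrides -> single-region + Backup copy
```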
AWS Backup — Cross-Region and Cross-Account Backup Copy
AWS Backup is the centralized backup service that supports most AWS data services (EBS, RDS, DynamoDB, EFS, FSx, Storage Gateway, S3, Neptune, DocumentDB, Redshift, and more). For DR, three Backup capabilities are critical.
Cross-region backup copy
Backup plans include a copy action that sends every recovery point to a Backup Vault in another Region. The copy is KMS-encrypted in the destination (using the destination vault's key or a multi-region key) and counts toward both vaults' lifecycle policies independently.
Cross-account backup copy
Send recovery points to a dedicated Backup account in another AWS account. The isolation account can be locked down with SCPs denying everyone (including root) backup:Delete* and s3:Delete*, giving you a true tamper-proof recovery point set. This is the gold standard for ransomware-resilient backup.
Vault Lock — WORM compliance
AWS Backup Vault Lock applies WORM (Write-Once-Read-Many) immutability to a vault:
- Governance mode: admins with specific IAM permissions can override the lock.
- Compliance mode: no one, including root, can delete recovery points or shorten retention until the retention expires. Once locked, compliance mode cannot be downgraded.
Vault Lock is the right answer for SAP-C02 scenarios mentioning "seven-year retention", "SEC 17a-4", "ransomware", or "immutable backups with regulatory requirement".
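A toy model of the two lock modes makes the difference concrete. This mirrors the bullets above; it is not AWS Backup's actual enforcement code.

```python
def can_delete_recovery_point(mode: str, retention_expired: bool,
                              caller_has_override: bool) -> bool:
    """Model Vault Lock semantics: who can delete a recovery point, and when."""
    if retention_expired:
        return True  # retention has lapsed; normal lifecycle applies
    if mode == "governance":
        # Admins holding the specific override permission can bypass the lock.
        return caller_has_override
    if mode == "compliance":
        return False  # no one, including root, until retention expires
    raise ValueError("unknown mode")

print(can_delete_recovery_point("compliance", False, caller_has_override=True))  # False
print(can_delete_recovery_point("governance", False, caller_has_override=True))  # True
```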
Organization backup policies
AWS Organizations backup policies propagate backup plans to every account in an OU, ensuring uniform backup posture. Combined with a delegated administrator for AWS Backup, you get one-org-one-backup-posture governance without per-account configuration.
Multi-Region Landing Zone
A multi-region landing zone is the architectural scaffolding that all per-workload DR patterns sit on. SAP-C02 sometimes asks about governance-scale multi-region choices, not just one application's DR.
Components
- AWS Organizations with multi-region governance applied via SCPs.
- AWS Control Tower with governed regions extended to include the DR Region. Control Tower's home region is fixed at creation; secondary Regions are added as governed regions and receive baseline guardrails.
- Organization CloudTrail trail delivering to a Log Archive bucket with S3 CRR replicating to a second-Region copy.
- AWS Config aggregator in the Audit account aggregating resource state from every Region.
- IAM Identity Center with a home Region; the instance itself fails over via a separate strategy (IAM Identity Center has its own DR story — check current docs for details).
- Transit Gateway in each Region, peered across Regions, carrying inter-VPC and hybrid connectivity.
- Route 53 private hosted zones associated with VPCs in multiple Regions for internal DNS resolution.
- KMS multi-region keys for workload data that must fail over.
- AWS Backup vaults in each Region, with cross-region copy plans and cross-account isolation.
- CloudFormation StackSets with service-managed permissions deploying baseline resources to every OU and every Region.
Region pairing strategy
Pick primary and DR Regions with attention to:
- Latency: closer pairs reduce replication lag (affects Aurora Global Database, DynamoDB Global Tables, DRS RPO).
- Compliance: data sovereignty may constrain Region choice (e.g., EU workloads fail over only to EU Regions).
- Service availability: DR Region must support every service the workload uses — some services launch in non-US Regions years late.
- Blast-radius independence: avoid pairing within the same metro area; choose Regions on different power grids and peering fabrics.
- Cost: Region-to-Region data transfer is billed; some pairings are cheaper than others.
Standard pairings
Common enterprise pairings:
- North America: `us-east-1` ↔ `us-west-2` (different coasts, different power regions).
- Europe: `eu-west-1` ↔ `eu-central-1` (Ireland ↔ Frankfurt).
- Asia-Pacific: `ap-northeast-1` ↔ `ap-southeast-1` (Tokyo ↔ Singapore).
SAP-C02 scenarios rarely ask you to pick specific Regions, but they do ask which Regions support a needed service — check service availability before betting on a Region.
Chaos Engineering and Game Days
A DR architecture that has never been tested is fiction. SAP-C02 expects you to know the operational practices that make DR real.
AWS Fault Injection Service (FIS)
AWS Fault Injection Service (FIS) — formerly AWS Fault Injection Simulator — is the managed chaos-engineering service. It executes experiment templates that inject controlled faults:
- EC2 actions: stop instances, reboot instances, terminate instances in specific AZs.
- ECS actions: stop tasks, scale services to zero, throttle container traffic.
- EKS actions: drain nodes, disrupt pods.
- RDS actions: fail over DB clusters, reboot instances.
- Network actions (via Systems Manager): introduce latency, packet loss, DNS disruption.
- IAM action: simulate permission removal.
- Stop conditions: safety-net CloudWatch alarms that abort the experiment if blast radius exceeds threshold.
FIS is the AWS-native way to run pre-production game days: schedule an experiment that kills half the EC2 fleet in us-east-1, verify Auto Scaling recovers, verify Route 53 ARC readiness checks flip correctly, verify alarms page the on-call.
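A hedged sketch of what such an experiment template might look like, shaped like the request `fis.create_experiment_template` accepts: stop half the web fleet in one AZ, with a CloudWatch alarm as the stop condition. The role ARN, alarm ARN, tag values, and AZ filter are hypothetical.

```python
experiment_template = {
    "description": "Game day: lose half the web tier in us-east-1a",
    "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role",
    "stopConditions": [{
        # Safety net: abort the experiment if the error-rate alarm fires.
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:web-5xx-rate",
    }],
    "targets": {
        "web-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"tier": "web"},
            "filters": [{"path": "Placement.AvailabilityZone",
                         "values": ["us-east-1a"]}],
            "selectionMode": "PERCENT(50)",  # blast radius: half the matched fleet
        }
    },
    "actions": {
        "stop-web": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "web-instances"},
        }
    },
}

# Created (not run here) as:
# fis.create_experiment_template(**experiment_template)
```

The stop condition is the part SAP-C02 cares about: an experiment without a safety-net alarm is chaos, not chaos engineering.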
Game days — running them properly
A game day is an organized rehearsal of a disaster scenario. Best practice:
- Define a scenario and success criteria in advance ("Region `us-east-1` becomes unreachable; the workload must serve traffic from `us-west-2` within 15 minutes with RPO under 5 minutes").
- Pre-brief the on-call team — some game days are announced, some unannounced, depending on the confidence level.
- Execute the fault (via FIS or manual action).
- Observe recovery against RTO/RPO objectives.
- Document findings: what broke, what took longer than expected, what wasn't monitored, what didn't alert.
- Fix and re-run — a game day without follow-through is just a drill, not a learning.
Mature organizations run game days quarterly for critical workloads and build them into release gates.
Numbers to Memorize
- 4 DR strategies in order of cost: Backup and Restore, Pilot Light, Warm Standby, Multi-Site Active-Active.
- 5 ARC cluster Regions: `us-east-1`, `us-west-2`, `ap-northeast-1`, `ap-southeast-2`, `eu-west-1` — quorum-based.
- 5 Aurora Global Database secondary Regions maximum per primary.
- 15 minutes is the S3 RTC SLA for 99.99 percent of new objects.
- 365 days maximum PIT retention for AWS DRS snapshots (default 7 days).
- Sub-second typical replication lag for Aurora Global Database and DynamoDB Global Tables.
- Under 1 minute typical RTO for Aurora Global Database managed failover.
- 5–20 minutes typical RTO for DRS recovery.
- Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html
Canonical Scenario — Pharma App With RTO 15 Minutes, RPO 5 Minutes, Multi-Region, Compliance
SAP-C02 scenarios often crystallize into one dense paragraph. Here is the canonical pharma-industry scenario and the full pro-depth answer.
Scenario
A pharmaceutical company operates a clinical-trial management platform that must comply with GxP, HIPAA, and 21 CFR Part 11. The application consists of a web tier (ALB + Auto Scaling Java application), an Aurora PostgreSQL database holding patient data, a DynamoDB table holding study-protocol metadata, an S3 bucket holding lab-result files (up to 5 TB and growing), a set of Lambda functions orchestrating workflows, and a third-party analytics VM appliance (vendor-supplied, no redesign possible) running on EC2 in us-east-1. Business requirements: RTO 15 minutes, RPO 5 minutes, two AWS Regions, auditable failover decisions, backup retention of 7 years with immutability, encryption with customer-managed keys, and the ability to recover from a ransomware event that encrypts production data.
Solution architecture
Foundation
- AWS Control Tower landing zone with `us-east-1` as home Region and `us-west-2` added as a governed region.
- Dedicated accounts: Workloads/Prod (primary app), Security/Audit (delegated admin for GuardDuty/Security Hub/Macie), Log Archive, Backup (isolation account for tamper-proof backups), Network (Transit Gateway hub-and-spoke).
- Root SCPs denying `cloudtrail:StopLogging`, `config:DeleteConfigurationRecorder`, `kms:ScheduleKeyDeletion`, and `backup:DeleteRecoveryPoint` outside break-glass.
- KMS multi-region customer-managed keys with primary in `us-east-1` and replicas in `us-west-2`.
Data tier
- Aurora PostgreSQL Global Database, primary cluster in `us-east-1`, secondary cluster in `us-west-2`. Replication lag under 1 second (meets the 5-minute RPO with generous margin). Managed failover is the planned path; detach-promote is the unplanned path.
- DynamoDB Global Tables across `us-east-1` and `us-west-2`, with the study-protocol metadata table replicated multi-active. PITR enabled in both Regions.
- S3 bucket in `us-east-1` with Cross-Region Replication + RTC to a bucket in `us-west-2` — most objects replicate in seconds, with the 15-minute SLA as the backstop (tight against the 5-minute RPO). S3 Versioning + Object Lock in compliance mode with 7-year retention. Delete-marker replication is off so a delete in the source does not propagate.
Compute tier
- Web tier: Warm Standby in `us-west-2`. ALB + Auto Scaling Group running at a minimum of 2 instances (scaled down from 20 in primary). Launch Templates reference multi-region AMIs.
- Lambda: same functions deployed via CloudFormation StackSets in both Regions. Active in primary, idle in DR.
- Vendor VM appliance: protected via AWS DRS. Replication agent installed on the EC2 instance in `us-east-1`; staging in `us-west-2`. On failover, DRS launches the recovery EC2 instance with the latest block-level state — meets the 15-minute RTO and sub-minute RPO for a vendor-supplied appliance that cannot be refactored.
Traffic control
- Route 53 Application Recovery Controller (ARC) cluster. Routing controls for `region-us-east-1-active` and `region-us-west-2-active`. Safety rules: at least one Region must be on at all times; flipping `us-east-1` off requires `us-west-2` readiness to be `READY`.
- Route 53 failover records reference calculated health checks tied to the routing controls.
- ARC readiness checks validate that `us-west-2` has sufficient ASG capacity, sufficient RDS instance size, required IAM roles, and KMS key policies before approving a flip.
Backup and ransomware resilience
- AWS Backup plans run every 15 minutes for Aurora and DynamoDB (Continuous Backup enabled, PITR within seconds), hourly for EFS, and daily for EBS.
- Cross-region copy to `us-west-2` Backup Vaults.
- Cross-account copy to the isolated Backup account.
- Vault Lock in compliance mode with 7-year retention — meets GxP and SEC-equivalent requirements.
- AWS DRS 30-day PIT retention for the vendor VM, giving ransomware rollback.
Operations
- AWS Fault Injection Service game days run quarterly: terminate the primary Aurora writer, then simulate loss of the `us-east-1` Region via a forced ARC flip, and verify recovery within 15 minutes.
- CloudWatch dashboards per Region monitor replication lag, ASG desired/actual capacity, and RTC pending objects.
- EventBridge rules on ARC routing control state changes notify the compliance team via SNS + Slack.
- Runbooks in Systems Manager Documents with approval steps for manual failover decisions; auditors see the approval trail in CloudTrail.
Why this architecture is correct
- RTO 15 minutes: warm standby compute scales up in minutes; ARC flips within seconds; Aurora managed failover under 1 minute; DRS recovery in 5–20 minutes for the vendor VM.
- RPO 5 minutes: Aurora Global DB sub-second lag, DynamoDB Global Tables sub-second, S3 RTC 15-minute SLA (tight — if the scenario stressed <1 minute RPO for S3 you would add a custom fast-replication path).
- Multi-region compliance: dedicated Log Archive with CRR, KMS multi-region keys, Control Tower governed regions, organization CloudTrail, Config aggregator.
- Auditable failover: ARC safety rules, SSM Document approvals, CloudTrail records every routing-control flip.
- Ransomware: Object Lock compliance mode, Vault Lock compliance mode, cross-account Backup isolation, DRS PIT up to 365 days, delete-marker replication off.
- Cost-justified: Warm Standby, not Multi-Site Active-Active — because RTO 15 minutes allows scale-up time, and active-active would double compute spend without improving the targets.
Common wrong-answer patterns for this scenario
- Multi-Site Active-Active — overkill for 15-minute RTO; doubles cost without benefit.
- Backup and Restore — cannot meet 15-minute RTO with a full stack redeploy.
- Plain Route 53 failover records — lacks the auditable, explicit, quorum-based control ARC provides.
- Aurora cross-region read replica (legacy) — higher lag than Global Database; deprecated pattern.
- DRS for everything — overkill for managed services that have native multi-region features (Aurora, DynamoDB, S3, Lambda).
- Single-region KMS keys — would require decrypt-then-reencrypt during failover, adding complexity.
Common Traps — Disaster Recovery Pro Patterns
Expect to see at least two of these distractors on every SAP-C02 attempt.
Trap 1: Picking the top rung by default
A scenario with RTO 4 hours does not need Multi-Site Active-Active. The correct answer is the cheapest strategy that meets RTO/RPO — usually Pilot Light or Warm Standby. Always read the RTO/RPO numbers before picking the strategy.
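The rule in this trap — pick the cheapest rung that still meets the RTO — can be sketched as a lookup. The typical-RTO minutes below are illustrative assumptions consistent with the decision matrix later in this guide, not AWS-published figures.

```python
LADDER = [  # (strategy, assumed typical worst-case RTO in minutes), cheapest first
    ("Backup and Restore", 24 * 60),
    ("Pilot Light", 30),
    ("Warm Standby", 5),
    ("Multi-Site Active-Active", 1),
]

def cheapest_strategy(rto_minutes: int) -> str:
    """Walk the ladder cheapest-first; return the first rung meeting the RTO."""
    for strategy, typical_rto in LADDER:
        if typical_rto <= rto_minutes:
            return strategy
    return "Multi-Site Active-Active"  # nothing cheaper meets a near-zero RTO

print(cheapest_strategy(4 * 60))  # RTO 4 hours -> Pilot Light, not Active-Active
print(cheapest_strategy(15))      # RTO 15 minutes -> Warm Standby
```

Anything above the returned rung wastes money; anything below it fails the objective — which is the whole exam heuristic in one function.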
Trap 2: Confusing RTO and RPO
RTO is "how long we are down"; RPO is "how much data we lose". A workload can have tight RTO (10 minutes) and loose RPO (1 hour) or vice versa. Each drives a different part of the architecture — RTO drives compute readiness, RPO drives replication frequency.
Trap 3: Relying on Route 53 control plane in tight-RTO scenarios
Plain Route 53 health checks and failover records run out of a single control-plane Region. For compliance-grade tight-RTO workloads that cannot tolerate control-plane Region loss, the answer is Route 53 ARC with its five-Region cluster.
Trap 4: Treating S3 Replication as ransomware protection
CRR forward-replicates, including delete markers if configured. Against ransomware you need Object Lock in compliance mode, MFA Delete on versioning, and cross-account Backup isolation. CRR alone is not enough.
Trap 5: Using Aurora cross-region read replicas instead of Global Database
Legacy Aurora cross-region read replicas (binlog-based) have worse lag than Aurora Global Database (storage-level). For new SAP-C02 designs, Global Database is the answer unless the scenario explicitly pins you to the legacy pattern.
Trap 6: Forgetting that Global Tables resolves conflicts with last-writer-wins
Any scenario with a multi-region DynamoDB Global Table that mentions "two regions may update the same item" must acknowledge LWW. If strong cross-region consistency is required, DynamoDB Global Tables is the wrong answer — consider Aurora Global Database with write forwarding instead.
Trap 7: Using single-region KMS keys in a cross-region DR architecture
If data is encrypted with SSE-KMS in Region A and replicated to Region B, Region B cannot decrypt unless the key policy permits it and the key exists there. Multi-region KMS keys solve both. Single-region keys force a decrypt/re-encrypt pipeline that adds latency and cost.
Trap 8: Believing DRS replaces managed multi-region features
DRS is for lift-and-shift workloads that cannot be refactored. If the workload runs on Aurora, DynamoDB, S3, Lambda, or managed services with native multi-region support, use those — not DRS. Using DRS for an Aurora cluster would replicate compute but not the database layer correctly.
Trap 9: Forgetting failback
Failover is only half. On SAP-C02, DR questions sometimes ask about the full cycle, and the correct answer names DRS failback, Aurora Global Database reverse replication, or explicit backup-and-restore for the return leg.
Trap 10: Assuming automatic failover is always better
Automatic failover (Route 53 health checks flipping DNS) can misfire on partial failures — a slow backend that still responds to TCP health checks keeps the primary "healthy" while users suffer. Tight-RTO workloads prefer operator-initiated failover via ARC, with readiness checks providing pre-validation. "Auditable explicit decisions" in a scenario points to ARC, not automatic DNS failover.
Trap 11: Skipping game days
Untested DR is fiction. Scenarios that mention "the company has never tested the DR plan" are always wrong — the fix is game days with AWS FIS or a manual runbook rehearsal schedule.
Decision Matrix — Which DR Construct for Which Goal
| Goal | Primary construct | Notes |
|---|---|---|
| RTO hours, RPO hours, lowest cost | Backup and Restore | AWS Backup with cross-region copy |
| RTO 30+ min, RPO minutes, moderate cost | Pilot Light | Data replicated live, compute scaled to zero |
| RTO minutes, RPO seconds, higher cost | Warm Standby | Compute scaled down but live |
| RTO seconds, RPO near-zero, highest cost | Multi-Site Active-Active | Both Regions serve traffic |
| Cross-region relational database | Aurora Global Database | Managed failover for planned, detach-promote for unplanned |
| Multi-region writes on RDBMS | Aurora Global DB + write forwarding | Single primary, reads anywhere, writes forwarded |
| Multi-region NoSQL writes | DynamoDB Global Tables | LWW conflict resolution |
| DR for lift-and-shift on-prem or EC2 workloads | AWS DRS | Block-level, PIT up to 365 days |
| Migrate from on-prem and decommission source | AWS MGN (not DRS) | Same agent, different billing |
| Auditable cross-region failover | Route 53 ARC | Five-Region cluster + readiness + safety rules |
| Simple automatic DNS failover | Route 53 failover records + health checks | For non-critical workloads |
| S3 cross-region replication with SLA | S3 CRR + RTC | 15-min 99.99% SLA |
| Ransomware-resilient S3 | Object Lock compliance + versioning + CRR + Backup | Multi-layer |
| Cross-region encryption continuity | KMS multi-region keys | Shared key material |
| Tamper-proof backups | AWS Backup Vault Lock (compliance mode) + cross-account copy | WORM |
| Chaos testing DR runbook | AWS Fault Injection Service | Experiment templates + stop conditions |
| Multi-region governance baseline | Control Tower governed regions + StackSets + Org backup policies | Per-Region baseline |
FAQ — Disaster Recovery Pro Patterns
Q1: How do I choose between Pilot Light and Warm Standby for a scenario with RTO 30 minutes and RPO 1 minute?
Both strategies technically meet the RPO (replication to the DR Region is usually sub-second regardless of which you pick), so the distinguishing factor is the RTO. Pilot Light requires 10–20 minutes to scale compute from zero to serving traffic — for Auto Scaling with image pull, cache warm-up, and database connection pool warm-up. Warm Standby keeps a small fleet already running, so scale-up is 2–5 minutes. With a 30-minute RTO you have headroom, so Pilot Light is acceptable and cheaper. If the scenario adds a twist — "and the app has a 3-minute cache warm-up" or "and one-third of traffic must land in under 5 minutes" — Warm Standby becomes correct because Pilot Light's scale-from-zero path eats too much of the budget.
Q2: When do I need Route 53 ARC instead of plain Route 53 failover records?
Plain Route 53 failover records + health checks are automatic, cheap, and sufficient for most workloads. ARC is needed when: (1) failover must survive the loss of the primary Region's control plane — ARC's dataplane runs across five Regions in a quorum, so flips work even when the primary is totally down; (2) compliance requires explicit, auditable, dual-control failover decisions rather than automatic DNS flips; (3) readiness checks are required to pre-validate the standby's capacity and quotas before approving a flip; (4) safety rules must prevent operator error like turning off both Regions. The monthly cost of an ARC cluster is non-trivial, so normal web apps stay on Route 53 failover. Banking, healthcare, and other regulated workloads upgrade to ARC.
Q3: What is the difference between Aurora Global Database managed failover and detach-and-promote?
Managed failover is the controlled, coordinated promotion of a secondary to primary, orchestrated by Aurora's control plane. It preserves the global database topology, demoting the former primary to a secondary and re-orienting replication. Typical RTO is under 1 minute, RPO near-zero. Use it for planned events and confirmed Region failures where the primary's control plane is still reachable. Detach-and-promote is the emergency path: from the secondary's console, detach the unreachable primary from the global database, which makes the secondary a standalone cluster that can accept writes. After the disaster passes you must rebuild the global database (create a new secondary, or restore from backup). Use detach-and-promote when managed failover cannot reach the primary — this is the true Region-outage scenario.
Q4: Can I use DRS for my Aurora database?
No — DRS replicates block-level EBS-equivalent data for EC2 and on-prem servers. Aurora runs on a managed distributed storage fabric with no customer-accessible EBS volumes. For Aurora cross-region DR, use Aurora Global Database. DRS is for lift-and-shift workloads: EC2 instances running self-managed databases, vendor VM appliances, legacy Windows Server workloads, or on-prem servers replicating into AWS. A common SAP-C02 wrong answer is "use DRS to replicate the Aurora cluster" — it cannot, and Global Database is the right construct.
Q5: How do I protect against ransomware in a DR architecture?
Ransomware defense is layered, because forward replication happily propagates malicious deletes and encryptions. The layers: (1) S3 Versioning with MFA Delete on critical buckets — an attacker cannot permanently delete versions without MFA; (2) S3 Object Lock in compliance mode with appropriate retention — objects cannot be overwritten or deleted until retention expires, period; (3) CRR with delete-marker replication OFF — a delete in source does not propagate to destination; (4) AWS Backup with Vault Lock in compliance mode — backups themselves cannot be deleted; (5) Cross-account Backup copy to an isolation account locked down by SCPs — even root in the production account cannot reach the backups; (6) AWS DRS with long PIT retention (up to 365 days) — recover to a point before the attacker gained access; (7) GuardDuty + Security Hub detection of anomalous API calls indicating compromise. Implement all of these together for regulated workloads.
Q6: Why would I use DynamoDB Global Tables instead of running DynamoDB in one Region with cross-region backup?
The choice depends on your RTO and latency goals. Single-region DynamoDB with cross-region backup gives you RPO in hours (backup frequency) and RTO in minutes (restore from backup or redirect via app logic) — fine for batch or internal workloads. DynamoDB Global Tables gives you RPO in seconds (replication lag), RTO near zero (the DR Region is already serving), and low-latency reads for global users. The trade-off is cost (two or more full copies) and the last-writer-wins consistency model — if your application cannot tolerate LWW semantics across Regions, Global Tables is not a drop-in. SAP-C02 scenarios that mention "global user base" or "multi-region active writes" point to Global Tables; scenarios that only need region-level DR without active-active can use single-region + backup.
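The last-writer-wins caveat is worth internalizing. A toy model of the behavior, assuming two concurrent writes to the same item inside the replication-lag window (Global Tables resolves with an internal system timestamp; the timestamps here are illustrative):

```python
from dataclasses import dataclass

# Toy model of DynamoDB Global Tables' last-writer-wins (LWW) conflict
# resolution: when the same item is written in two Regions within the
# replication lag window, the later write wins in every Region.

@dataclass
class RegionalWrite:
    region: str
    value: str
    timestamp_ms: int  # stand-in for Global Tables' internal system timestamp

def resolve_lww(a: RegionalWrite, b: RegionalWrite) -> RegionalWrite:
    return a if a.timestamp_ms >= b.timestamp_ms else b

us = RegionalWrite("us-east-1", "status=SHIPPED",   timestamp_ms=1_700_000_000_500)
eu = RegionalWrite("eu-west-1", "status=CANCELLED", timestamp_ms=1_700_000_000_900)

winner = resolve_lww(us, eu)
# The us-east-1 write is silently discarded everywhere, even though the
# application in us-east-1 received a successful response for it.
print(winner.region, winner.value)  # eu-west-1 status=CANCELLED
```

If silently discarding an acknowledged write (as above) would corrupt business state, the application needs its own conflict strategy, such as region-pinned writers or idempotent merge logic.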
Q7: How do I test my DR plan without impacting production?
Use AWS Fault Injection Service (FIS) in a pre-production environment that mirrors production topology. Define experiments that terminate EC2 instances, stop ECS services, force RDS failovers, or simulate network partitions — with stop conditions tied to CloudWatch alarms so blast radius is capped. For the full DR rehearsal, schedule a game day in a staging environment where you actually flip the Route 53 ARC routing control, verify the DR Region serves traffic end-to-end, validate readiness checks, and measure actual RTO/RPO against objectives. Mature organizations also run production game days during non-critical time windows (e.g., 2am on a Saturday) with explicit pre-briefing, roll-forward criteria, and roll-back plans — the goal is operational confidence, and production is the only environment that proves the runbook works. Document every finding and create follow-up actions; a game day without follow-through is just a drill.
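An FIS experiment is defined as a template pairing actions and targets with stop conditions. A sketch of the shape boto3's `create_experiment_template` accepts, for an "lose one AZ" experiment; the role ARN, tags, alarm name, and account ID are hypothetical:

```python
# Illustrative FIS experiment template: terminate all tagged app instances
# in one AZ, aborting automatically if a CloudWatch error-rate alarm fires.
# ARNs, tags, and names below are hypothetical.
EXPERIMENT_TEMPLATE = {
    "description": "Terminate one AZ's app instances; abort on error-rate alarm",
    "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role",
    "targets": {
        "app-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"app": "orders", "env": "staging"},
            "filters": [{"path": "Placement.AvailabilityZone",
                         "values": ["us-east-1a"]}],
            "selectionMode": "PERCENT(100)",
        }
    },
    "actions": {
        "terminate-az": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "app-instances"},
        }
    },
    # The stop condition caps the blast radius: FIS halts the experiment
    # the moment the alarm enters the ALARM state.
    "stopConditions": [{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:orders-5xx-rate",
    }],
}

# To run: boto3.client("fis").create_experiment_template(**EXPERIMENT_TEMPLATE),
# then start_experiment(experimentTemplateId=...).
```

The stop condition is the part graders look for: an experiment without one is an outage generator, not a chaos experiment.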
Q8: How does Route 53 ARC's five-Region cluster survive the loss of a Region I am trying to fail away from?
The ARC cluster runs a data plane replica in each of five Regions (us-east-1, us-west-2, ap-northeast-1, ap-southeast-2, eu-west-1). When you flip a routing control, the client SDK calls any of the five cluster endpoints — whichever is reachable — and the call is committed when a quorum (three of five) of Regions acknowledge. If you are failing away from us-east-1 and us-east-1 itself is down, the SDK simply calls one of the other four endpoints and the quorum forms from the remaining four Regions. This is fundamentally different from Route 53's main control plane, which runs solely in us-east-1 — when that Region fails you cannot edit records. ARC was built specifically to solve that control-plane dependency for compliance-grade workloads. On SAP-C02, if the question mentions "control plane of the primary Region is unavailable and we still need to fail over", ARC is the right answer.
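The quorum arithmetic is simple enough to model directly. A toy simulation of the three-of-five commit rule described above:

```python
# Toy quorum model of the ARC cluster: a routing-control change commits
# when at least 3 of the 5 Regional endpoints acknowledge it.
CLUSTER_REGIONS = ["us-east-1", "us-west-2", "ap-northeast-1",
                   "ap-southeast-2", "eu-west-1"]
QUORUM = 3

def can_commit_routing_change(unreachable_regions: set) -> bool:
    reachable = [r for r in CLUSTER_REGIONS if r not in unreachable_regions]
    return len(reachable) >= QUORUM

# Failing away from us-east-1 while us-east-1 itself is down: 4 of 5 remain.
print(can_commit_routing_change({"us-east-1"}))                # True
# Even the simultaneous loss of two cluster Regions leaves a quorum.
print(can_commit_routing_change({"us-east-1", "eu-west-1"}))   # True
# Three Regions down breaks quorum: five Regions tolerate exactly two
# failures, the classic 2F+1 majority design with F=2.
print(can_commit_routing_change({"us-east-1", "eu-west-1", "us-west-2"}))  # False
```

The model also shows why the single-Region main control plane is the weaker construct: its quorum set has size one, so one Regional failure makes record edits impossible.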
Q9: What is the right encryption strategy for a workload with Aurora Global Database, S3 CRR, DynamoDB Global Tables, and AWS Backup cross-region copy?
Use KMS multi-region customer managed keys to encrypt all four services. The primary key lives in the primary Region; replicas live in the DR Region. Because all replicas share key material, encrypted data in the DR Region can be decrypted by the DR-Region replica without round-tripping plaintext cross-Region. Separate keys per service (one for Aurora, one for DynamoDB, one for S3, one for Backup) keep the blast radius of a key policy change small and align to least-privilege. Grant Aurora's secondary cluster's service role access to the replica key; grant S3's replication role kms:Decrypt on the source multi-region key and kms:Encrypt on the destination multi-region key; grant AWS Backup vault access to the replica key. Rotate keys on a schedule aligned to compliance requirements; rotation of the primary key applies to all replicas, because they share key material. For workloads with strict data-sovereignty requirements that forbid key material crossing borders, fall back to single-region keys with re-encryption at the replication boundary (handled natively by AWS Backup cross-region copy), accepting the latency cost.
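Provisioning one multi-region key per service is a short loop over the real KMS APIs for multi-region keys (`create_key` with `MultiRegion=True`, then `replicate_key`). A sketch, with alias names and Regions as assumptions:

```python
# Sketch: one multi-region CMK per service, replicated to the DR Region.
# The alias scheme and Regions are hypothetical; create_key, create_alias,
# and replicate_key are the real KMS APIs involved.
SERVICES = ["aurora", "dynamodb", "s3", "backup"]

def alias_for(service: str) -> str:
    """Per-service aliases mirror the per-service keys that keep the
    blast radius of any single key-policy change small."""
    return f"alias/dr/{service}"

def provision_mrk(service: str, primary_region: str = "us-east-1",
                  dr_region: str = "us-west-2") -> None:
    import boto3  # lazy import: only needed when actually provisioning
    kms = boto3.client("kms", region_name=primary_region)
    key = kms.create_key(MultiRegion=True,
                         Description=f"DR encryption key for {service}")
    key_id = key["KeyMetadata"]["KeyId"]
    kms.create_alias(AliasName=alias_for(service), TargetKeyId=key_id)
    # The replica shares key material with the primary, so DR-Region
    # services decrypt locally without any cross-Region call.
    kms.replicate_key(KeyId=key_id, ReplicaRegion=dr_region)

print([alias_for(s) for s in SERVICES])
```

Key policies and grants for the service roles (the Aurora secondary's role, the S3 replication role, the Backup vault) would be layered on after creation.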
Q10: When should I choose Multi-Site Active-Active over Warm Standby given that both can meet sub-minute RTO with enough automation?
Multi-Site Active-Active is justified when one of these conditions holds: (1) Global user base needs low-latency service from the nearest Region continuously — active-active gives locality, while warm standby keeps the DR Region idle from a user perspective; (2) RTO must be zero for users already hitting the surviving Region — warm standby adds at least the scale-up time for those users migrating, while active-active already has full capacity in both Regions; (3) Regulatory or contractual commitments require no observable downtime; (4) Load exceeds what a single Region can handle — active-active is the capacity strategy as much as the DR strategy. Otherwise Warm Standby is the better choice: it meets sub-minute RTO with 40–60 percent of the ongoing cost, avoids the multi-writer consistency complexity, and is easier to reason about in incident response. The SAP-C02 signal for Multi-Site Active-Active is explicit global-latency requirements or "zero downtime" language; the signal for Warm Standby is "tight RTO but acceptable brief scale-up delay".
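The cost gap is easy to quantify with a back-of-envelope model. A sketch, assuming spend scales roughly linearly with provisioned capacity and a warm standby DR Region sized at about half capacity (both assumptions; real pricing varies by service and Region):

```python
# Back-of-envelope cost comparison: warm standby vs active-active.
# Assumes cost scales linearly with provisioned capacity (an assumption;
# real pricing is per-service and per-Region).

def warm_standby_cost(primary_monthly: float,
                      standby_fraction: float = 0.5) -> float:
    """Primary at full capacity plus a scaled-down DR copy until failover."""
    return primary_monthly * (1 + standby_fraction)

def active_active_cost(primary_monthly: float) -> float:
    """Both Regions provisioned to absorb full load if the other fails."""
    return primary_monthly * 2

p = 10_000.0  # hypothetical monthly spend for one full-capacity Region
print(warm_standby_cost(p))   # 15000.0
print(active_active_cost(p))  # 20000.0
# The DR Region itself costs standby_fraction of a full Region, which is
# where the rough "half the DR-Region cost" intuition comes from.
```

The model omits the automation that warm standby additionally needs to scale up within RTO (pre-baked AMIs, warm pools, pre-provisioned capacity reservations), which narrows the gap somewhat in practice.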
Further Reading
- Disaster Recovery of Workloads on AWS: Recovery in the Cloud (Whitepaper)
- AWS Well-Architected Framework — Reliability Pillar
- AWS Elastic Disaster Recovery (DRS) User Guide
- AWS DRS — Failback
- Route 53 Application Recovery Controller (ARC)
- Route 53 ARC — Routing Controls
- Route 53 ARC — Readiness Checks
- Amazon Aurora Global Database
- Aurora Global Database — Write Forwarding
- Aurora Global Database — Managed Failover and Recovery
- Amazon DynamoDB Global Tables
- Amazon S3 Replication (CRR and SRR)
- S3 Replication Time Control (RTC)
- AWS KMS Multi-Region Keys
- AWS Backup — Cross-Region Copy
- AWS Fault Injection Service (FIS)
- AWS SAP-C02 Exam Guide (PDF)