Disaster Recovery Strategies on AWS define how a workload survives region-scale failures, large-scale data corruption events, and business-ending outages that High Availability alone cannot absorb. For the SAA-C03 exam, you must know the four canonical Disaster Recovery Strategies — Backup and Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active — along with their RTO/RPO profiles, cost trade-offs, and the specific AWS services that implement each pattern. Disaster Recovery Strategies also overlap with data replication mechanics (S3 CRR, Aurora Global Database, DynamoDB Global Tables), traffic failover (Route 53), and block-level replication (AWS Elastic Disaster Recovery). Mastering Disaster Recovery Strategies is essential for Domain 2 (Design Resilient Architectures), which carries 26% of the SAA-C03 exam weight.
This page belongs to Task Statement 2.2 — "Design highly available and/or fault-tolerant architectures." It is the sibling of the high-availability-multi-az topic. HA covers the inside of a single Region (multi-AZ, ALB, ASG); Disaster Recovery Strategies cover the cross-Region and business continuity dimension that HA does not solve. For backup retention policy, WORM, and governance concerns, see data-governance-compliance.
What Are Disaster Recovery Strategies on AWS?
Disaster Recovery Strategies on AWS are pre-designed architectural patterns that let a workload resume operation in a secondary Region (or secondary infrastructure) after a primary failure. AWS documents four canonical Disaster Recovery Strategies in the official "Disaster Recovery of Workloads on AWS" whitepaper — a document the SAA-C03 exam leans on heavily. Each of the four Disaster Recovery Strategies trades cost against recovery speed; your job as a solutions architect is to pick the strategy whose RTO/RPO profile matches the business requirement at the lowest defensible cost.
The SAA-C03 exam tests Disaster Recovery Strategies in three recurring shapes. First, definitional questions ask you to pair an RTO/RPO number with a strategy name. Second, scenario questions describe a business requirement ("financial trading platform, zero data loss tolerated") and ask which Disaster Recovery Strategy fits. Third, service-mapping questions ask which AWS service implements a specific replication guarantee (S3 Replication Time Control for 15-minute SLA, Aurora Global Database for ~1-second lag, DynamoDB Global Tables for active-active writes).
Why Disaster Recovery Strategies Matter Beyond HA
High Availability inside one Region protects against AZ-level failures — a blown transformer, a flooded data center, a cooling outage affecting one building. Disaster Recovery Strategies protect against scenarios HA cannot handle: Region-wide service outages, natural disasters covering hundreds of kilometers, software bugs that propagate across all AZs at once, ransomware that encrypts every EBS volume in the primary Region, and compliance-driven requirements for a DR site in a different jurisdiction. Every serious production workload needs both HA and Disaster Recovery Strategies.
The Four Canonical Disaster Recovery Strategies
AWS organizes Disaster Recovery Strategies on a cost-vs-recovery-time ladder:
- Backup and Restore — lowest cost, highest RTO (hours). Periodic backups copied cross-Region; infrastructure rebuilt on demand.
- Pilot Light — low cost, RTO in tens of minutes. Core data replicated and live; application tier dormant until failover.
- Warm Standby — medium cost, minutes RTO. Scaled-down full stack always running in the DR Region.
- Multi-Site Active-Active — highest cost, seconds RTO. Full production capacity in both Regions, serving live traffic simultaneously.
- RTO (Recovery Time Objective) — the maximum acceptable elapsed time between disaster declaration and service restoration. "How long can we be down?"
- RPO (Recovery Point Objective) — the maximum acceptable amount of recent data that can be lost, measured in time. "How much recent data can we afford to lose?"
- DR Region — the secondary AWS Region chosen to host the DR infrastructure.
- Failover — the act of shifting live traffic from primary to DR Region.
- Failback — returning traffic to the original Region after recovery.
- Pilot Light — a DR strategy where core data is replicated live but compute is dormant, ready to scale.
- Warm Standby — a DR strategy where a scaled-down full stack runs continuously in the DR Region.
Plain-Language Explanation — Disaster Recovery Strategies in Everyday Terms
Disaster Recovery Strategies sound abstract, but the four options map cleanly to everyday backup behavior. Below are three different analogies drawn from deliberately different domains so that at least one will stick in exam conditions.
Analogy 1 — The Insurance Policy Ladder
Disaster Recovery Strategies behave exactly like insurance policies on a house.
- Backup and Restore is cheap term insurance. You pay a small monthly premium. When the house burns down, you file a claim, the insurer takes several weeks to inspect, you find temporary lodging, and eventually you rebuild from scratch. Low cost, long recovery. That is Backup and Restore: cross-Region backups sit quietly in S3 or AWS Backup, and when disaster strikes you spin up new infrastructure from templates plus restored data. RTO measured in hours to a day.
- Pilot Light is like owning a second empty apartment with the utilities already connected, the fridge already plugged in, and a suitcase of essentials ready. If your main house burns, you move in the same day and furnish the rest over the next few hours. In AWS terms, your database replica is live in the DR Region, AMIs and templates are ready, but web servers are not running. When failover fires, you scale out the compute tier and Route 53 shifts traffic.
- Warm Standby is a fully furnished second home that someone already lives in at small scale — perhaps a caretaker. Everything works, but the house can host a family of two, not ten. If your main house fails, your family moves in, and the caretaker ramps up supplies. In AWS, a scaled-down production stack runs continuously; failover triggers a scale-out and traffic shift in minutes.
- Multi-Site Active-Active is owning two fully staffed homes and living in both simultaneously. Family members rotate between them. If one burns down, the other continues without skipping a meal. Highest cost but zero disruption. In AWS, two Regions actively serve user traffic through Route 53 latency-based routing, Global Accelerator, or CloudFront.
This analogy makes the cost-vs-recovery trade-off intuitive. Disaster Recovery Strategies are just how much insurance you choose to buy.
Analogy 2 — The Restaurant Kitchen Contingency Plan
Imagine a chain restaurant deciding how to handle a kitchen fire at its flagship location.
- Backup and Restore is like storing recipe cards and ingredient lists in a safe deposit box. If the kitchen burns, the chain rents a new kitchen, retrieves the recipes, re-orders ingredients, rehires staff. Cheapest approach, but service is gone for days. The "ingredients" correspond to S3 backups copied to the DR Region; the "recipes" correspond to CloudFormation or Terraform templates; the "new kitchen" is the infrastructure you stand up after the disaster.
- Pilot Light is like keeping a refrigerated truck of pre-mixed sauces and dough parked at a second location. The stoves are cold but the perishables are fresh. If the flagship fails, cooks show up, fire the stoves, and start serving within a few hours. In AWS, this maps to a replicated database (Aurora Global Database secondary, DynamoDB Global Tables replica) with compute infrastructure defined but not running.
- Warm Standby is like running a small pop-up kitchen that serves 10% of the flagship's volume every day. It is staffed, tested, and serving real customers. If the flagship fails, the pop-up cranks up to 100% capacity. In AWS, ASGs are running with minimum capacity, ALBs are live, databases are receiving writes via replication.
- Multi-Site Active-Active is like running two equally-sized flagship kitchens that already split customer traffic by geography — east-side customers eat at Kitchen A, west-side at Kitchen B, both serving identical menus. If Kitchen A burns, Kitchen B absorbs the full load. Maximum redundancy, maximum cost. In AWS, DynamoDB Global Tables accept writes in both Regions and resolve conflicts; Aurora Global Database can promote quickly, or you can run two separately-managed clusters with application-level reconciliation.
Analogy 3 — The Airline Operations Playbook
Airlines have literal Disaster Recovery Strategies for aircraft diversions.
- Backup and Restore is like keeping spare-part inventory in a central warehouse. If a plane suffers damage, you ship the parts to the airport, mechanics arrive, the plane is fixed over a long period. Cheapest because you only hold one inventory, but recovery is slow. This mirrors Backup and Restore in AWS: one set of backups, rebuild everything after.
- Pilot Light is like pre-positioning engine rebuild kits at every hub. The engines are not installed in a standby plane, but the kits are there and a mechanic crew can rotate one in quickly. The core asset (the engine kit) is "hot," the surrounding infrastructure (the plane) can be staged quickly. In AWS, the data tier is always live; the compute tier can be stood up from templates in minutes.
- Warm Standby is like keeping a partially-fueled backup aircraft parked at each hub with a skeleton crew. It flies shorter routes daily, but can be upsized to flagship routes within minutes. In AWS, a small but complete production environment is always running.
- Multi-Site Active-Active is like running two full-capacity hubs that serve the same destinations in parallel — Atlanta and Dallas, both hubs for the same routes. If Atlanta closes due to weather, Dallas absorbs every passenger without cancellation. In AWS, both Regions accept live traffic, with Route 53 or Global Accelerator balancing.
Three analogies, one lesson: Disaster Recovery Strategies buy recovery speed with money. The more you pay up front, the faster you recover. The SAA-C03 exam always tells you the RTO/RPO target in the question stem — your job is to pick the cheapest Disaster Recovery Strategy that still meets the target.
RTO and RPO — The Two Numbers That Define Every Disaster Recovery Strategy
Every discussion of Disaster Recovery Strategies starts with two numbers: RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Confusing the two is one of the most common SAA-C03 traps.
RTO — How Long Can We Be Down?
RTO measures the maximum acceptable time between the disaster event and the restoration of service. It is always expressed as a time duration — seconds, minutes, hours, or days. An RTO of 4 hours means the business is willing to tolerate up to 4 hours of downtime. An RTO of 30 seconds means the business needs near-instantaneous cutover. RTO drives which of the four Disaster Recovery Strategies you pick, because each strategy has a different recovery-time floor.
RPO — How Much Data Can We Lose?
RPO measures the maximum acceptable amount of recent data that can be lost, also expressed as time. An RPO of 1 hour means you can lose up to the last 60 minutes of data. An RPO of zero means no committed transaction may be lost. RPO drives your data-replication choice — daily backups give a 24-hour RPO, S3 CRR typically gives a few-minutes RPO, S3 Replication Time Control gives a 15-minute SLA, Aurora Global Database gives ~1-second RPO, DynamoDB Global Tables give sub-second RPO.
The Axis Diagram
Plot RTO on the horizontal axis, RPO on the vertical axis. The four Disaster Recovery Strategies occupy different quadrants:
- Top-right (high RTO, high RPO): Backup and Restore.
- Middle (moderate RTO, low-moderate RPO): Pilot Light.
- Lower-middle (low RTO, low RPO): Warm Standby.
- Bottom-left (near-zero RTO, near-zero RPO): Multi-Site Active-Active.
RTO = time to recover (downtime tolerance). RPO = point-in-time tolerance (data-loss tolerance). If the question mentions "business downtime" → RTO. If it mentions "data loss" or "last committed transaction" → RPO. This single mnemonic resolves roughly half of the SAA-C03 Disaster Recovery Strategies questions.
Typical RTO/RPO Targets for the Four Disaster Recovery Strategies
- Backup and Restore — RTO hours to 24 hours, RPO hours (depends on backup frequency).
- Pilot Light — RTO tens of minutes, RPO minutes.
- Warm Standby — RTO minutes, RPO seconds to minutes.
- Multi-Site Active-Active — RTO near-zero, RPO near-zero.
The SAA-C03 exam uses these bands consistently. Memorize them.
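The selection logic the exam rewards — "pick the cheapest strategy that still meets the stated RTO/RPO" — can be sketched as a lookup over the bands above. The numeric floors here are illustrative exam heuristics, not AWS guarantees:

```python
# Illustrative worst-case RTO/RPO floors (in seconds) for the four
# strategies, ordered cheapest-first. Numbers are exam heuristics drawn
# from the bands above, not AWS SLAs.
STRATEGIES = [
    # (name, RTO floor, RPO floor)
    ("Backup and Restore", 24 * 3600, 24 * 3600),
    ("Pilot Light", 45 * 60, 10 * 60),
    ("Warm Standby", 10 * 60, 60),
    ("Multi-Site Active-Active", 30, 1),
]

def cheapest_strategy(rto_target_s: int, rpo_target_s: int) -> str:
    """Return the cheapest strategy whose floors meet both targets."""
    for name, rto_floor, rpo_floor in STRATEGIES:
        if rto_floor <= rto_target_s and rpo_floor <= rpo_target_s:
            return name
    raise ValueError("No strategy meets the stated targets")

# A stem tolerating 4 hours of downtime and 1 hour of data loss:
print(cheapest_strategy(4 * 3600, 3600))  # -> Pilot Light
```

Because the list is ordered cheapest-first, the first match is automatically the lowest-cost answer — the same scan you should run mentally in the exam.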
Strategy 1 — Backup and Restore (Cheapest, Slowest)
Backup and Restore is the entry-level Disaster Recovery Strategy. It is the right choice for workloads that can tolerate multi-hour downtime and for cost-constrained environments where paying for standby infrastructure is unjustifiable.
How Backup and Restore Works
- Production runs in Region A (primary).
- Scheduled backups copy data to Region B (DR). Typical mechanisms: AWS Backup cross-Region copy, S3 Cross-Region Replication, EBS snapshot cross-Region copy, RDS automated snapshot copy.
- When disaster strikes, an operator (or automation) provisions infrastructure in Region B using Infrastructure-as-Code (CloudFormation, CDK, Terraform), restores data from the cross-Region backups, and directs traffic to the new environment via Route 53 DNS change.
Cost Profile
Backup and Restore carries almost no steady-state cost beyond storage: you pay for the backup data in Region B and occasional cross-Region data transfer. There are no idle EC2 instances, no idle RDS instances, no idle load balancers. This makes Backup and Restore the default choice for dev/test, internal tooling, and non-revenue-critical workloads.
Recovery Profile
Recovery time is long because you are rebuilding infrastructure from scratch at disaster declaration time. Even with perfect automation, bootstrapping a full production stack typically takes at least 30-60 minutes; realistic recoveries span several hours. RPO matches the backup frequency — hourly backups give hour-scale RPO, nightly backups give day-scale RPO.
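The two numbers that define a Backup and Restore deployment fall out of simple arithmetic: worst-case RPO is one full backup interval, and RTO is the serial sum of rebuild, restore, and DNS propagation. The timings below are hypothetical, chosen only to show why this strategy lands in the hours band:

```python
def backup_restore_profile(backup_interval_h: float,
                           rebuild_min: float,
                           restore_min: float,
                           dns_ttl_min: float = 5.0):
    """Worst-case RPO is one full backup interval (data written since the
    last completed backup is lost); RTO is the serial rebuild + restore +
    DNS propagation time. All inputs are hypothetical estimates."""
    rpo_h = backup_interval_h
    rto_min = rebuild_min + restore_min + dns_ttl_min
    return rpo_h, rto_min

# Nightly backups, 45 min of IaC stack build, 2 h of data restore:
rpo, rto = backup_restore_profile(backup_interval_h=24,
                                  rebuild_min=45, restore_min=120)
print(f"RPO up to {rpo}h, RTO about {rto / 60:.1f}h")
```

Shrinking RPO means backing up more often; shrinking RTO means pre-staging infrastructure — which is exactly the step up to Pilot Light.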
AWS Services Commonly Used
- AWS Backup — centralized backup plans with cross-Region copy.
- S3 Cross-Region Replication (CRR) — asynchronous object replication.
- Amazon EBS snapshot cross-Region copy — volumes snapshotted and copied.
- Amazon RDS automated snapshots — shipped to DR Region by scheduled copy.
- CloudFormation / CDK templates — infrastructure blueprints to rebuild on failover.
The exam and real life both punish teams whose Disaster Recovery Strategies exist on paper but have never been exercised. A backup you cannot restore is worthless. If a question mentions "the company has never tested its DR plan," the correct answer usually involves adding regular restore drills, testing via AWS Backup restore jobs, or game-day exercises — not buying more backup storage.
Strategy 2 — Pilot Light
Pilot Light sits one rung up from Backup and Restore. It keeps the core data tier replicated and live in the DR Region while the application and web tiers remain dormant.
How Pilot Light Works
The term comes from a gas furnace: a small always-on flame (the "pilot") can ignite the main burner quickly when needed. In AWS Disaster Recovery Strategies, the pilot is the database or data store:
- Aurora Global Database secondary Region receives continuous replication with typical lag under one second.
- DynamoDB Global Tables replica lives in the DR Region.
- S3 Cross-Region Replication mirrors object data.
- AMIs, launch templates, CloudFormation stacks are pre-staged in the DR Region but the EC2 instances, ECS services, and load balancer capacity are not running (or are running at zero capacity).
Failover Sequence
- Declare disaster.
- Promote Aurora Global Database secondary to primary (or activate the write endpoint).
- Scale up ASGs, ECS services, Lambda concurrency from zero to production capacity.
- Update Route 53 DNS records (or trigger pre-configured failover routing) to point at the DR Region.
- Validate traffic, monitor, freeze primary.
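The failover sequence above is essentially a small state transition: the DR Region's data tier is promoted, its dormant compute tier is scaled out, and traffic flips. A toy model (all names and capacities illustrative, not AWS API calls):

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    db_role: str              # "primary" or "replica"
    asg_desired: int = 0      # pilot light: compute parked at zero
    receives_traffic: bool = False

def pilot_light_failover(primary: Region, dr: Region, prod_capacity: int) -> Region:
    """Toy model of the runbook: promote the replicated data tier,
    scale compute out from zero, shift DNS, freeze the failed Region."""
    dr.db_role = "primary"            # promote the Aurora Global secondary
    dr.asg_desired = prod_capacity    # scale the ASG from 0 to production
    dr.receives_traffic = True        # Route 53 failover record flips
    primary.receives_traffic = False  # freeze the primary
    return dr

us_east = Region("us-east-1", db_role="primary",
                 asg_desired=20, receives_traffic=True)
eu_west = Region("eu-west-1", db_role="replica")  # data live, compute off
pilot_light_failover(us_east, eu_west, prod_capacity=20)
print(eu_west.db_role, eu_west.asg_desired, eu_west.receives_traffic)
```

Note the ordering: the data tier is promoted before traffic shifts, so the first requests landing in the DR Region already have a writable database.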
Cost and Recovery Profile
Pilot Light costs more than Backup and Restore because you pay for the replicated data tier continuously — Aurora Global Database charges for the secondary cluster even when passive, DynamoDB Global Tables charge for replicated writes, S3 replicated storage is billed. But compute cost stays near zero. Recovery takes tens of minutes because most of the time is spent launching and warming up compute capacity.
Phrases like "minimal resources running in DR," "database replicated, app tier off," or "launch configurations ready but no running compute" are Pilot Light markers. If the question stem tolerates tens-of-minutes RTO and emphasizes cost minimization beyond what Backup and Restore can offer, Pilot Light is the answer.
Strategy 3 — Warm Standby
Warm Standby raises the stakes again: a scaled-down but fully functional production stack runs continuously in the DR Region. It handles no user traffic (or a tiny canary slice), but every tier of the architecture is alive.
How Warm Standby Works
- Database tier: Aurora Global Database, DynamoDB Global Tables, or RDS cross-Region read replica — continuously synchronized.
- Application tier: ECS services or ASGs running with minimum-sized capacity (for example, 2 tasks instead of 20, t3.medium instead of c6i.4xlarge).
- Load balancer: ALB/NLB deployed and reachable.
- DNS: Route 53 configured with failover or weighted routing that currently sends 0% (or a tiny percentage) of traffic to the DR Region.
Failover Sequence
- Disaster declared.
- Auto Scaling policies scale the DR Region's compute tier out to production capacity.
- Database promotion (Aurora Global Database unplanned failover typically under a minute).
- Route 53 failover record flips; health checks mark the primary unhealthy and DNS resolution moves to DR.
- Users reconnect; RTO measured in minutes.
Cost and Recovery Profile
Warm Standby is meaningfully more expensive than Pilot Light because the compute tier is running 24/7 at small scale — you pay for instances, load balancers, cache nodes. Recovery is faster because nothing has to cold-start; all that happens is a scale-out. A further benefit: because the DR stack is always live, operational drift between primary and DR is minimal and restore confidence is higher.
Warm Standby Variants
- Static stable scale-down — fixed small fleet, scale out on failover.
- Active-passive with canary traffic — 1-5% of real traffic routed to DR to continuously validate the stack.
The canary variant is especially powerful because it turns the DR Region into a continuously-tested environment rather than an assumption.
Strategy 4 — Multi-Site Active-Active
Multi-Site Active-Active is the top of the Disaster Recovery Strategies ladder. Both Regions serve real user traffic simultaneously; any single-Region failure is absorbed by the other(s) with minimal or zero downtime.
How Multi-Site Active-Active Works
- Data tier — multi-master data stores such as DynamoDB Global Tables (last-writer-wins conflict resolution) or Aurora Global Database with write-forwarding. For strict consistency, some architectures keep a single write Region with synchronous replication and failover-driven write promotion.
- Compute tier — full production capacity in both Regions. ASGs scaled equivalently, ECS clusters equivalently sized.
- Traffic routing — Route 53 latency-based routing, AWS Global Accelerator, or CloudFront with multi-origin failover steer users to whichever Region is healthy and closest.
Failover Characteristics
There is no "failover event" in the traditional sense. If Region A degrades, Route 53 health checks (or Global Accelerator endpoint health) mark the affected endpoints unhealthy and steer new connections to Region B. Existing sessions may reconnect, but no administrator intervention is required. RTO approaches zero; RPO depends on replication tech (DynamoDB Global Tables give sub-second; Aurora Global Database ~1 second).
Cost and Complexity Profile
This is the most expensive Disaster Recovery Strategy — you are paying for two (or more) full production stacks. It is also the most operationally complex: deploying code changes requires multi-Region coordination, and conflict resolution for multi-master data introduces application-level design work. Choose Multi-Site Active-Active only when the business cost of downtime clearly exceeds the doubled infrastructure cost.
- Backup and Restore — RTO hours, RPO hours. AWS Backup + S3 CRR + IaC templates. Cheapest.
- Pilot Light — RTO tens of minutes, RPO minutes. Data tier live (Aurora Global DB, DynamoDB Global Tables, S3 CRR); compute dormant.
- Warm Standby — RTO minutes, RPO seconds to minutes. Scaled-down full stack running; scale out on failover.
- Multi-Site Active-Active — RTO near-zero, RPO near-zero. Full capacity in multiple Regions, live traffic. Most expensive.
S3 Replication — CRR, SRR, and Replication Time Control
S3 replication is the workhorse of Disaster Recovery Strategies for object data. The SAA-C03 exam tests the distinctions between Cross-Region Replication (CRR), Same-Region Replication (SRR), and Replication Time Control (RTC).
Cross-Region Replication (CRR)
CRR copies objects asynchronously from a source bucket in one Region to a destination bucket in another Region. Typical use cases:
- Disaster Recovery Strategies — protect against Region-scale failures.
- Latency reduction — put copies of static assets closer to users in multiple geographies.
- Compliance — maintain copies in specific jurisdictions.
CRR is asynchronous. Replication lag is typically minutes, but not guaranteed by SLA in the default configuration.
Same-Region Replication (SRR)
SRR copies objects to a different bucket in the same Region. It is not itself a DR mechanism (both copies fall together if the Region fails), but it supports:
- Aggregating logs from multiple buckets into one for analytics.
- Copying data between dev/test/prod accounts in the same Region.
- Keeping replicas with different encryption keys, object ownership, or lifecycle rules.
Replication Time Control (RTC)
Replication Time Control is a paid feature that adds a 15-minute SLA and additional monitoring: 99.99% of replicated objects must arrive in the destination within 15 minutes of being uploaded to the source. RTC is the right answer when the question stem requires a guaranteed replication SLA — most commonly, tight RPO targets for financial services, healthcare, or regulated industries.
Other S3 Replication Features
- Delete marker replication — optional replication of delete markers for compliance auditing.
- Replica modification sync (RMS) — bidirectional metadata sync for active-active patterns.
- Batch Replication — back-fill replication for objects that existed before replication was enabled.
- Two-way replication — configure both buckets as source and destination for active-active use.
If the stem says "cross-Region DR" → CRR. If it says "guaranteed 15-minute replication SLA" → CRR with Replication Time Control. If it says "copy to another bucket in the same Region for aggregation" → SRR. If it says "I need to back-fill objects uploaded before replication was enabled" → Batch Replication. These four branches cover nearly every S3 replication question on SAA-C03.
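Those four branches form a small decision function. A minimal sketch of that logic (the function name and flags are illustrative, not an AWS API):

```python
def s3_replication_choice(cross_region: bool,
                          needs_sla: bool = False,
                          backfill_existing: bool = False) -> str:
    """Map exam-stem signals to the S3 replication feature they point at.
    Simplified sketch of the four decision branches described above."""
    if backfill_existing:
        # Replication rules only cover new uploads; pre-existing
        # objects need Batch Replication.
        return "S3 Batch Replication"
    if cross_region:
        return "CRR with Replication Time Control" if needs_sla else "CRR"
    return "SRR"

print(s3_replication_choice(cross_region=True, needs_sla=True))
```

Checking `backfill_existing` first mirrors the real constraint: no replication rule, RTC or otherwise, retroactively copies objects that predate it.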
AWS Backup — Centralized Backup for Disaster Recovery Strategies
AWS Backup is a fully managed service for policy-driven, centralized backup across many AWS services. It is the canonical answer when a scenario asks for "centralized backup across services" or "single backup policy for EBS, RDS, DynamoDB, EFS, and more."
What AWS Backup Supports
- Amazon EBS volumes
- Amazon RDS databases (including Aurora)
- Amazon DynamoDB tables
- Amazon EFS file systems
- Amazon FSx file systems
- Amazon EC2 instances
- AWS Storage Gateway volumes
- Amazon Redshift clusters (via snapshots)
- Amazon S3 (continuous backups with point-in-time restore)
- VMware workloads (via AWS Backup Gateway)
Key Concepts
- Backup Plan — schedule + retention + lifecycle (e.g. daily backup, retain 35 days, transition to cold storage after 7 days).
- Backup Vault — encrypted storage container for recovery points; supports AWS Backup Vault Lock for WORM/immutable backups.
- Resource Assignment — tag-based or resource-based selection of which resources fall under a plan.
- Cross-Region Copy — copy recovery points to a DR Region as part of the backup plan.
- Cross-Account Copy — copy to a separate AWS account (for example a "vault account") to harden against account-compromise scenarios.
- AWS Backup Audit Manager — compliance reporting on whether backups are happening per policy.
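The lifecycle rules in a backup plan are simple age thresholds. A sketch of the example plan above (daily backups, cold storage after 7 days, retention 35 days — the numbers are the example's, the function is illustrative):

```python
def recovery_point_state(age_days: int,
                         cold_after_days: int = 7,
                         retain_days: int = 35) -> str:
    """Lifecycle of a recovery point under the example backup plan:
    warm (standard storage) -> cold storage at `cold_after_days` ->
    expired (eligible for deletion) at `retain_days`."""
    if age_days >= retain_days:
        return "expired"
    if age_days >= cold_after_days:
        return "cold"
    return "warm"

print([recovery_point_state(d) for d in (1, 10, 40)])
```

One subtlety worth knowing for real plans: AWS Backup requires recovery points to stay in cold storage for a minimum period before expiry, so retention must comfortably exceed the cold-transition age.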
Why AWS Backup Is a Disaster Recovery Strategies Primitive
Before AWS Backup, DR for multiple services meant stitching together service-specific snapshot APIs, CloudWatch Events, Lambda automation, and custom retention logic. AWS Backup collapses all of that into a single managed control plane with cross-Region and cross-Account copy built in. For Backup and Restore Disaster Recovery Strategies, AWS Backup is almost always the preferred answer on SAA-C03.
AWS Backup Vault Lock
Vault Lock provides WORM semantics: once enabled in compliance mode, recovery points cannot be deleted even by the root user until their retention period expires. This protects backups against ransomware and malicious insiders — a critical property because ransomware attackers often try to destroy backups before encrypting primary data.
Aurora Global Database — Cross-Region Relational DR
Aurora Global Database is the SAA-C03 answer for "I need a cross-Region MySQL or PostgreSQL database with very low replication lag and fast failover."
Architecture
- One primary Region accepts reads and writes.
- Up to five secondary Regions receive replication with typical lag under one second.
- Replication uses a dedicated storage-layer channel that is independent of the database engine, so it does not consume primary compute or network bandwidth.
- Each secondary Region can host up to 16 read replicas for local-read scale-out.
Failover Behavior
- Managed planned failover — promotes a secondary to primary with near-zero data loss; typical RTO around one minute.
- Unplanned cross-Region failover — detach the secondary, promote it to a standalone cluster, and repoint applications; completes within minutes.
- Managed unplanned failover (newer) — AWS-managed failover for unplanned scenarios with preserved global topology.
Use Cases for Disaster Recovery Strategies
- Pilot Light — secondary Region cluster is live, application tier is dormant.
- Warm Standby — secondary Region hosts a small application stack reading from the local Aurora replicas.
- Multi-Site Active-Active — combined with write-forwarding, allows application tiers in secondary Regions to submit writes that are forwarded to the primary.
Exam Trap — Global Database vs Cross-Region Read Replica
Amazon RDS also supports cross-Region read replicas on non-Aurora engines such as MySQL, MariaDB, and PostgreSQL. These are not the same as Aurora Global Database. Cross-Region read replicas use engine-level logical replication, have higher lag, and do not offer Aurora Global Database's sub-second RPO. When the stem mentions "Aurora" + "cross-Region" + "low RPO," the answer is Aurora Global Database.
DynamoDB Global Tables — Multi-Region Active-Active NoSQL
DynamoDB Global Tables are the NoSQL companion to Aurora Global Database in Disaster Recovery Strategies. They provide multi-Region, multi-active replication with last-writer-wins conflict resolution.
How Global Tables Work
You enable Global Tables on a DynamoDB table and add replica Regions. Writes in any Region are asynchronously replicated to all other replica Regions, typically within one second. Applications can read and write against whichever Region is closest, and last-writer-wins semantics (based on timestamps) handles conflicts.
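Last-writer-wins resolution is easy to model: when the same item receives concurrent writes in two Regions, the write with the later timestamp survives everywhere once replication converges. A toy model (real Global Tables use an internal system timestamp, not an attribute you supply):

```python
def last_writer_wins(versions: list[dict]) -> dict:
    """Resolve conflicting versions of one item by highest timestamp,
    a toy model of DynamoDB Global Tables' conflict resolution."""
    return max(versions, key=lambda v: v["ts"])

# The same item written concurrently in two Regions:
a = {"region": "us-east-1", "ts": 1700000000.120, "price": 10}
b = {"region": "eu-west-1", "ts": 1700000000.350, "price": 12}
print(last_writer_wins([a, b])["price"])  # later eu-west-1 write wins -> 12
```

This is why the caveat below about financial ledgers matters: the us-east-1 write silently loses, which is fine for a shopping cart but not for a balance update.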
Why Global Tables Fit Multi-Site Active-Active
- No single-writer bottleneck — all replicas accept writes.
- Sub-second replication lag — RPO near-zero.
- Automatic conflict resolution — no manual intervention required.
- Integration with Route 53 latency-based routing — applications route to the nearest Region naturally.
Cost and Caveats
- Write capacity must be provisioned (or on-demand) in each replica Region. Writes in Region A are replicated to Regions B and C, so you pay write units in all three.
- Last-writer-wins may be inappropriate for strict-consistency workloads (financial ledgers) — use idempotent, conflict-free data models.
- Point-in-time recovery (PITR) is per-Region; restoring PITR on a Global Table requires extra care.
Route 53 Failover — DNS-Level Traffic Direction
Route 53 is the control plane that makes Disaster Recovery Strategies failover visible to end users. It maps health-check results to DNS answers, shifting traffic between Regions without requiring the application to coordinate directly.
Failover Routing Policy
The simplest pattern: one primary record, one secondary record, one health check. If the health check for the primary fails, Route 53 returns the secondary record. This is the classic active-passive pattern for Pilot Light and Warm Standby Disaster Recovery Strategies.
Health Checks
Route 53 health checks probe endpoints from a global fleet of health checkers. An endpoint is considered healthy as long as at least 18% of the checkers report it healthy, which prevents a single failing checker location from triggering a spurious failover. Health checks support:
- Endpoint health checks — HTTP/HTTPS/TCP probes.
- Calculated health checks — aggregate multiple child checks with AND/OR logic.
- CloudWatch alarm health checks — derive health from CloudWatch metrics (e.g. ALB error rate, custom business metric).
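A calculated health check aggregates its children with a threshold: the parent is healthy when at least N of M child checks are healthy. A minimal sketch of that rule (function name illustrative):

```python
def calculated_health(child_statuses: list[bool], threshold: int) -> bool:
    """Route 53 calculated health check, simplified: the parent check is
    healthy when at least `threshold` child checks report healthy."""
    return sum(child_statuses) >= threshold

# ALB check and app check healthy, dependency check failing; require 2 of 3:
print(calculated_health([True, True, False], threshold=2))  # -> True
```

Setting the threshold to the number of children gives AND semantics; a threshold of 1 gives OR.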
Other Route 53 Routing Policies Relevant to DR
- Weighted routing — shift a gradual percentage of traffic to DR as part of a canary cutover.
- Latency-based routing — send each user to the lowest-latency healthy Region (Multi-Site Active-Active).
- Geolocation routing — pin users to specific Regions by country/continent.
- Multivalue answer routing — return multiple healthy IPs for simple client-side failover.
TTL Implications
DNS TTLs constrain failover speed. If your TTL is 300 seconds, clients may cache the primary record for up to 5 minutes after a failover. Shorten TTLs (30-60 seconds) ahead of planned DR drills, accept that very low TTLs slightly increase DNS query volume and cost, and consider AWS Global Accelerator when DNS caching delay is unacceptable.
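The worst-case user-visible failover time is simple addition: health-check detection time plus one full TTL of stale cached answers. The probe numbers below are illustrative, not Route 53 defaults:

```python
def worst_case_failover_s(detect_s: int, dns_ttl_s: int) -> int:
    """Upper bound on user-visible failover: time for health checks to
    declare the primary unhealthy, plus one full TTL during which clients
    may still serve the stale cached answer for the old Region."""
    return detect_s + dns_ttl_s

# E.g. three failed 10-second probes to detect, then a 300-second TTL:
print(worst_case_failover_s(detect_s=30, dns_ttl_s=300))  # -> 330
```

Cutting the TTL from 300 to 60 seconds removes four minutes from this bound, which is why TTL tuning appears in DR runbooks at all.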
Route 53 failover acts at DNS resolution time — new connections go to the new Region but existing sessions depend on client TTL behaviour. AWS Global Accelerator acts at the anycast-network layer, shifting existing TCP/UDP flows to healthy endpoints in seconds. For sub-minute failover requirements or applications with long-lived connections, add Global Accelerator on top of Route 53.
AWS Elastic Disaster Recovery (DRS) — Block-Level Replication
AWS Elastic Disaster Recovery (AWS DRS, previously known as CloudEndure Disaster Recovery) is the service for block-level replication of servers — EC2 instances, on-premises physical servers, and VMs — into AWS for Disaster Recovery Strategies use cases.
How DRS Works
- Install the AWS Replication Agent on the source server (on-prem, VM, or EC2).
- The agent continuously replicates disk blocks to low-cost staging storage in a target AWS Region (EBS volumes attached to lightweight replication servers).
- On failover, DRS launches fully-provisioned EC2 instances in the target Region using the latest replicated state, typically within minutes.
- On failback, reverse replication brings changes back to the source.
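The continuous block replication in step 2 is what keeps RPO near zero: between replication ticks, the staging area lags the source by only the blocks written since the last tick. A toy model of that loop (dict-as-disk, names illustrative):

```python
def replicate(source: dict, staging: dict) -> None:
    """One replication tick: copy changed blocks to the staging area.
    With continuous ticking, staging lags by at most one tick — a toy
    model of DRS's block-level, near-zero-RPO replication."""
    staging.update(source)

source = {0: b"boot", 1: b"data-v1"}   # blocks on the protected server
staging = {}                            # low-cost staging EBS in the DR Region
replicate(source, staging)

source[1] = b"data-v2"                  # a new write lands on the source
assert staging[1] == b"data-v1"         # staging lags until the next tick
replicate(source, staging)
print(staging[1])
```

On failover, DRS launches recovery instances from the staging volumes' latest state, so the data lost is bounded by that small replication lag rather than by a backup schedule.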
When to Use DRS
- Lift-and-shift DR for on-premises workloads — servers you do not want to re-architect as native AWS services but still need to protect with a DR site in AWS.
- DR for EC2 across Regions — block-level replication where application-level replication is not feasible.
- Cross-cloud DR — replicating from other clouds into AWS.
DRS vs Native-Service Replication
If your workload is already cloud-native (Aurora, DynamoDB, S3), use the service-native replication (Global Database, Global Tables, CRR). DRS shines specifically for legacy or third-party workloads where the block layer is the right unit of replication.
Cost Profile
DRS steady-state cost is low — you pay for the staging EBS volumes and small replication servers. Full-capacity costs only materialize when you launch the recovery instances. This makes DRS a natural fit for Pilot Light Disaster Recovery Strategies for infrastructure that cannot be re-architected.
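The cost asymmetry can be made concrete with a toy model. All unit prices below are made-up placeholders for illustration, not real AWS pricing:

```python
STAGING_GB_MONTH = 0.08          # assumed low-cost EBS staging price, $/GB-month
REPL_SERVER_MONTH = 20.0         # assumed lightweight replication server, $/month
RECOVERY_INSTANCE_MONTH = 300.0  # assumed full-size recovery instance, $/month

def drs_monthly_cost(servers: int, gb_per_server: int, failed_over: bool) -> float:
    """Steady-state DRS cost is staging storage plus a few small
    replication servers; full compute cost appears only on failover."""
    staging = servers * gb_per_server * STAGING_GB_MONTH
    # One small replication server can typically front several source disks.
    replication = max(1, servers // 5) * REPL_SERVER_MONTH
    compute = servers * RECOVERY_INSTANCE_MONTH if failed_over else 0.0
    return staging + replication + compute

steady = drs_monthly_cost(20, 100, failed_over=False)
crisis = drs_monthly_cost(20, 100, failed_over=True)
print(f"steady-state: ${steady:,.0f}/mo, during failover: ${crisis:,.0f}/mo")
```

Even with invented numbers, the shape is the point: steady-state spend is a small fraction of the failed-over spend, which is exactly the Pilot Light cost profile.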
AWS Backup protects snapshots with retention policies — it is a point-in-time backup tool. DRS continuously replicates block-level changes for near-zero-RPO server recovery. Questions asking for "continuous replication of running servers" or "RPO of seconds for legacy workloads" point to DRS; questions asking for "centralized backup of multiple AWS services with retention policy" point to AWS Backup.
Cross-Region vs Cross-AZ — Choosing the Right Failure Boundary
A recurring SAA-C03 trap asks whether Multi-AZ is sufficient or whether Multi-Region is required. The correct answer hinges on the failure scenarios the business must survive.
What Multi-AZ Protects Against
- Single data center loss (fire, flood, power event).
- Single AZ network partition.
- Single-AZ hardware fleet failures.
- Localized natural disasters confined to a single AZ's facilities.
Multi-AZ is High Availability, not Disaster Recovery. Most production workloads on AWS should be Multi-AZ by default — RDS Multi-AZ, ASGs across 2-3 AZs, S3 (multi-AZ by default), EFS (multi-AZ by default).
What Multi-AZ Does Not Protect Against
- Region-wide service control plane outages (rare but real).
- Natural disasters covering an entire Region's geographic footprint.
- Software bugs or misconfigurations that deploy to all AZs simultaneously.
- Regulatory requirements mandating a DR site in a different jurisdiction.
- Ransomware events that encrypt every volume in one Region.
- Account-level compromises (unless combined with cross-account copies).
For any of the above, you need cross-Region Disaster Recovery Strategies.
Decision Framework
Ask four questions, in order:
- Does a regulation require cross-Region DR? (banking, healthcare in some jurisdictions). If yes, Multi-Region is mandatory.
- What is the RTO/RPO? If the business needs sub-minute RTO only for partial-Region (AZ-level) failures, Multi-AZ may be enough; if it needs sub-minute RTO even through a full-Region outage, Multi-Region is required.
- What is the cost tolerance? Multi-Region roughly doubles infrastructure cost in the active-active extreme; quantify against the revenue impact of downtime.
- How globally distributed are users? Global user bases often gain latency benefits from multi-Region deployments that double as DR — the DR cost is offset by performance gains.
Scenarios describing "the primary data center experienced a fire" or "one AZ went dark" typically map to Multi-AZ solutions — not Multi-Region Disaster Recovery Strategies. The exam often distracts with Region-level options when the actual failure boundary is AZ-level. Read the failure description carefully. If only one AZ is mentioned, Multi-AZ is almost always the cost-appropriate answer.
Mapping Disaster Recovery Strategies to Business Requirements
The SAA-C03 exam frequently gives you a business description and asks you to pick the Disaster Recovery Strategy. Use this decision tree.
Step 1 — Read the RTO and RPO Targets
If the question provides explicit RTO/RPO numbers, map directly:
- RTO 24+ hours, RPO 24+ hours → Backup and Restore.
- RTO 1-4 hours, RPO minutes → Pilot Light.
- RTO 5-30 minutes, RPO seconds to minutes → Warm Standby.
- RTO seconds, RPO near-zero → Multi-Site Active-Active.
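The Step-1 mapping is mechanical enough to sketch in code: walk the strategies cheapest-first and take the first one whose achievable RTO/RPO covers the requirement. The threshold values below follow the ranges above and are approximations, not AWS-defined cutoffs:

```python
# (name, achievable RTO in seconds, achievable RPO in seconds), cheapest first
STRATEGIES = [
    ("Backup and Restore",       24 * 3600, 24 * 3600),
    ("Pilot Light",               4 * 3600,       600),
    ("Warm Standby",                30 * 60,        60),
    ("Multi-Site Active-Active",         30,         1),
]

def pick_strategy(required_rto_s: int, required_rpo_s: int) -> str:
    """Return the cheapest strategy whose achievable RTO and RPO both
    fit within the business targets."""
    for name, rto, rpo in STRATEGIES:
        if rto <= required_rto_s and rpo <= required_rpo_s:
            return name
    raise ValueError("No strategy meets the targets")

print(pick_strategy(4 * 3600, 3600))  # Pilot Light
print(pick_strategy(60, 30))          # Multi-Site Active-Active
```

Ordering the list cheapest-first encodes the exam's core rule: never pay for a faster strategy than the RTO/RPO actually demands.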
Step 2 — Read the Cost Signals
If the question emphasises "lowest cost," "cost-constrained," or "minimize expense," favour the cheapest strategy that still meets the RTO/RPO, starting from Backup and Restore. If the question emphasises "mission critical," "zero downtime," "always-on," favour the faster, more expensive strategies, up to Multi-Site Active-Active.
Step 3 — Read the Service Hints
Certain service mentions bias the correct answer:
- "DynamoDB" + "multi-Region" + "active-active" → DynamoDB Global Tables (Multi-Site Active-Active).
- "Aurora" + "near-zero RPO" + "minute-scale RTO" → Aurora Global Database (Pilot Light or Warm Standby).
- "Replicate servers continuously from on-prem" → AWS Elastic Disaster Recovery.
- "Centralized backup across services" → AWS Backup.
- "15-minute replication SLA for objects" → S3 Replication Time Control.
Step 4 — Validate Against Cost Tolerance
Finally, confirm the architecture cost is defensible. Do not pick Multi-Site Active-Active when the stem says "cost-conscious startup" — even if the RTO target technically matches.
Data Durability vs Availability — S3 Eleven Nines Explained
A subtle Disaster Recovery Strategies concept the SAA-C03 exam tests: data durability is not the same as data availability.
- Durability — the probability that data is not lost. S3 Standard offers 99.999999999% (eleven nines) durability, meaning that if you store 10 million objects you can expect to lose, on average, a single object once every 10,000 years.
- Availability — the probability that data is reachable. S3 Standard offers 99.99% availability — four nines.
Eleven nines of durability means data loss is virtually impossible within a Region. But if the entire Region is unreachable (availability event), the data is also unreachable. Cross-Region replication (CRR) moves beyond durability into Disaster Recovery Strategies by giving you a second Region's availability to fall back on.
- Standard — 11 nines durability, 4 nines availability.
- Standard-IA and Intelligent-Tiering — 11 nines durability, 99.9% (three nines) availability.
- One Zone-IA — 11 nines durability but stored in a single AZ (99.5% availability) — NOT suitable for DR-critical data, because loss of that AZ destroys the data.
- Glacier Instant / Flexible / Deep Archive — 11 nines durability, availability varies with retrieval tier.
- S3 Cross-Region Replication — ensures a second-Region copy exists; combats availability risk.
Testing and Exercising Disaster Recovery Strategies
A Disaster Recovery Strategies plan that has never been tested is closer to a hope than a plan. AWS provides several testing mechanisms.
AWS Backup Restore Testing
AWS Backup Restore Testing (launched in 2023) runs automated restore drills on a schedule: AWS Backup spins up a recovery point into a sandbox environment, validates the restore succeeded, and tears it down — all without operator involvement. This is the lowest-friction way to continuously validate Backup and Restore Disaster Recovery Strategies.
DRS Drill Mode
AWS Elastic Disaster Recovery supports "drill" recovery: you can launch recovery instances in the target Region without affecting production replication. This allows full end-to-end failover testing on a schedule (monthly or quarterly) without touching live systems.
Aurora Global Database Switchover
Aurora Global Database supports "managed planned failover" (switchover) — gracefully promote a secondary to primary with zero data loss. Scheduling regular switchover exercises builds operational confidence.
Route 53 Failover Testing
Route 53 supports manual health check failures (via CLI or console) so you can trigger failover routing without taking down the primary endpoint. This lets teams validate DNS-level failover independently of data-tier failover.
Game Day Exercises
Beyond automated testing, full "game day" exercises simulate a disaster: ops teams execute the runbook end to end, with stakeholders timing RTO/RPO achievement against targets. Frequency recommendations: at least annually for every production workload; quarterly for tier-1 workloads.
Key Numbers and Must-Memorize Facts for Disaster Recovery Strategies
- Aurora Global Database replication lag — typically under 1 second; cross-Region RTO ~1 minute for managed failover.
- DynamoDB Global Tables replication — typically under 1 second end-to-end.
- S3 Replication Time Control SLA — 99.99% of objects replicated within 15 minutes.
- S3 Standard durability — 11 nines (99.999999999%).
- AWS Backup cross-Region copy — supported on most AWS Backup-enabled services; lifecycle rules can transition to cold storage.
- RDS cross-Region read replicas (non-Aurora) — available for MySQL, MariaDB, PostgreSQL, Oracle; higher lag than Aurora Global Database.
- Route 53 health check — an endpoint is considered healthy while more than 18% of the Route 53 health checkers (distributed across multiple locations) report it healthy; at or below that threshold it is marked unhealthy.
- DRS recovery launch time — typically minutes from failover signal to running instances.
Common Exam Traps — Disaster Recovery Strategies Boundary Tests
Trap 1 — RTO vs RPO Confusion
Students under exam pressure swap the two. Anchor: RTO is time until service is back (downtime duration); RPO is how much recent data you might lose. "Recovery Point" = a point in time; "Recovery Time" = a duration of outage.
Trap 2 — Pilot Light vs Warm Standby
Both keep data replicated. The distinction: in Pilot Light, the application tier is dormant (scale 0); in Warm Standby, the application tier is running at reduced scale. If the stem says "scaled-down version of production is running" → Warm Standby. If it says "servers are not running, AMIs are ready" → Pilot Light.
Trap 3 — Multi-AZ as Disaster Recovery
Multi-AZ is HA, not DR. If the business requirement includes surviving a full Region outage, Multi-AZ alone is insufficient. The exam loves this trap because Multi-AZ is cheap and tempting; read for Region-level failure scenarios.
Trap 4 — Aurora Global Database vs Aurora Replicas
Aurora read replicas within one Region are for HA and read scale-out. Aurora Global Database replicates across Regions with a separate storage-layer channel. If the stem says "cross-Region," the answer involves Global Database, not in-Region replicas.
Trap 5 — S3 CRR Without Versioning
S3 Cross-Region Replication requires versioning enabled on both source and destination buckets. Exam questions sometimes hide this prerequisite — if a distractor mentions "enable replication without versioning," that option is wrong.
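The versioning prerequisite is easy to encode as a pre-flight check. The dict shape below is a simplified stand-in for illustration, not the real boto3 response format:

```python
def crr_preflight(source: dict, destination: dict) -> list[str]:
    """Return a list of reasons why CRR cannot be enabled; empty means OK.
    Input dicts use a simplified {"versioning": ..., "region": ...} shape."""
    problems = []
    if source.get("versioning") != "Enabled":
        problems.append("source bucket versioning is not Enabled")
    if destination.get("versioning") != "Enabled":
        problems.append("destination bucket versioning is not Enabled")
    if source.get("region") == destination.get("region"):
        problems.append("same Region: this would be SRR, not CRR")
    return problems

src = {"versioning": "Enabled", "region": "us-east-1"}
dst = {"versioning": "Suspended", "region": "eu-west-1"}
print(crr_preflight(src, dst))  # ['destination bucket versioning is not Enabled']
```

Note that "Suspended" fails the check: versioning must be fully Enabled on both sides, which is exactly the detail distractor options tend to omit.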
Trap 6 — DynamoDB Global Tables and Strong Consistency
DynamoDB Global Tables use eventually-consistent multi-master replication with last-writer-wins. For workloads that require strict consistency across Regions (bank ledgers), Global Tables may not be appropriate — the answer might instead be a primary/secondary Aurora Global Database pattern.
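Last-writer-wins can be shown in miniature. This is a simplified model of the conflict-resolution semantics, not DynamoDB's actual internals:

```python
def lww_merge(a: dict, b: dict) -> dict:
    """Resolve two versions of one item by timestamp: last writer wins."""
    return a if a["ts"] >= b["ts"] else b

# Two Regions concurrently update the same account, starting from balance=100:
us_write = {"balance": 150, "ts": 1000.001}  # deposit of 50 applied in us-east-1
eu_write = {"balance": 80,  "ts": 1000.002}  # withdrawal of 20 applied in eu-west-1

merged = lww_merge(us_write, eu_write)
print(merged["balance"])  # 80 — the deposit is silently discarded, not merged to 130
```

The deposit is lost even though both writes succeeded locally, which is precisely why a strictly ordered ledger needs a single-writer pattern rather than Global Tables.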
Trap 7 — AWS Backup vs Native Service Snapshots
Native service snapshots (EBS snapshots, RDS snapshots, DynamoDB backups) still exist alongside AWS Backup. If the stem asks for a "centralized, cross-service backup policy with a single management pane," the answer is AWS Backup. If it asks about a specific snapshot-level feature (EBS fast snapshot restore), the answer may be the native service.
Trap 8 — Route 53 Failover Speed
TTL governs failover speed for clients that cache DNS. Lowering TTL before a planned DR exercise is routine; expect questions that hinge on "users still connecting to the old endpoint 10 minutes after failover" — the answer is DNS TTL caching.
Disaster Recovery Strategies vs High Availability — Scope Boundary
HA and Disaster Recovery Strategies are complementary, not substitutes.
- High Availability (Task 2.2 sibling topic high-availability-multi-az) — Multi-AZ deployments, load balancers, Auto Scaling, RDS Multi-AZ. Protects against in-Region failures. Usually operates with RTO seconds-to-minutes and RPO near-zero for in-Region events.
- Disaster Recovery Strategies (this topic) — cross-Region patterns. Protects against Region-scale failures. Usually operates with RTO ranging from seconds (Multi-Site Active-Active) to hours (Backup and Restore) depending on the strategy.
Most production workloads need both: Multi-AZ for day-to-day resilience and a Disaster Recovery Strategy layered on top for Region-scale events. Asking "should I use HA or DR?" is a false dichotomy; the real question is "what RTO/RPO do I need at each failure boundary, and what architecture meets it at acceptable cost?"
Practice Question Themes — Task 2.2 Mapped Exercises
Use the quiz engine to drill Task 2.2 questions that map to Disaster Recovery Strategies concepts:
- "RTO of 4 hours, RPO of 1 hour, minimise cost" → Backup and Restore with hourly AWS Backup + S3 CRR.
- "RTO minutes, RPO seconds, Aurora-backed app" → Aurora Global Database + Warm Standby.
- "Global active-active gaming leaderboard with low latency from any Region" → DynamoDB Global Tables + Route 53 latency routing.
- "Continuously replicate on-premises VMs to AWS with minute-scale RTO" → AWS Elastic Disaster Recovery.
- "15-minute replication SLA for compliance" → S3 Replication Time Control.
- "Survive account compromise that deletes backups" → AWS Backup with Vault Lock and cross-account copy.
- "Failover DNS when primary ALB fails" → Route 53 failover routing with ALB health check.
- "Reduce failover time below DNS TTL" → add AWS Global Accelerator.
FAQ — Disaster Recovery Strategies Top Questions
Q1. What is the difference between RTO and RPO in Disaster Recovery Strategies?
RTO (Recovery Time Objective) is the maximum acceptable elapsed time between disaster declaration and service restoration — it measures downtime tolerance. RPO (Recovery Point Objective) is the maximum acceptable amount of recent data that can be lost, typically measured as a time window of data — it measures data-loss tolerance. A workload with RTO of 30 minutes and RPO of 5 minutes must be restored within 30 minutes and can lose at most the last 5 minutes of committed data. Disaster Recovery Strategies are chosen by matching RTO and RPO targets against the cost of each strategy; tighter targets demand more expensive strategies.
Q2. How do I choose between the four Disaster Recovery Strategies on AWS?
Start with the business RTO/RPO targets. Backup and Restore suits RTOs of several hours and RPOs of hours — cheapest, slowest. Pilot Light suits RTOs of tens of minutes and RPOs of minutes — data tier live, compute dormant. Warm Standby suits RTOs of minutes and RPOs of seconds to minutes — scaled-down full stack always running. Multi-Site Active-Active suits RTOs near zero and RPOs near zero — full capacity in multiple Regions — but at the highest cost. Always pick the cheapest Disaster Recovery Strategy that meets the required RTO/RPO; overspending on DR is a common and preventable waste.
Q3. What is the difference between S3 Cross-Region Replication and Replication Time Control?
S3 Cross-Region Replication (CRR) asynchronously copies objects from a source bucket to a destination bucket in a different Region. Typical replication lag is minutes, but there is no service-level agreement in the default configuration. S3 Replication Time Control (RTC) is a paid add-on to CRR that provides a 15-minute SLA: 99.99% of replicated objects arrive in the destination within 15 minutes of upload. Use RTC when Disaster Recovery Strategies require a guaranteed replication window — typical in finance, healthcare, and other regulated scenarios. Use plain CRR when best-effort replication is acceptable and you want to minimise cost.
Q4. When should I use Aurora Global Database instead of cross-Region read replicas?
Aurora Global Database uses storage-layer replication with typical lag under one second and supports managed cross-Region failover with RTO of about one minute. It scales to five secondary Regions with up to 16 read replicas each. RDS cross-Region read replicas (for MySQL, PostgreSQL, MariaDB, Oracle) use engine-level logical replication, which has higher lag and less predictable failover behaviour. For Disaster Recovery Strategies requiring near-zero RPO with fast cross-Region failover on Aurora MySQL or Aurora PostgreSQL, always choose Aurora Global Database. Use non-Aurora cross-Region read replicas only when you cannot run Aurora.
Q5. Is DynamoDB Global Tables suitable for every multi-Region workload?
No. DynamoDB Global Tables give sub-second cross-Region replication with last-writer-wins conflict resolution — excellent for Multi-Site Active-Active Disaster Recovery Strategies where the application tolerates eventually-consistent writes. They are not appropriate for workloads requiring strict cross-Region consistency (for example a single global financial ledger with strict ordering), because last-writer-wins may lose legitimate concurrent updates. Evaluate your data model against last-writer-wins semantics before choosing Global Tables; design writes to be idempotent and conflict-free whenever possible.
Q6. Does AWS Elastic Disaster Recovery (DRS) replace AWS Backup?
No — they serve different purposes. AWS Backup is a policy-driven snapshot service that produces point-in-time recovery points across many AWS services (EBS, RDS, DynamoDB, EFS, S3, and more) with retention lifecycle and WORM protection via Vault Lock. AWS Elastic Disaster Recovery (DRS) performs continuous block-level replication of running servers — on-premises, VM, or EC2 — into AWS for fast server-level recovery. AWS Backup is the centralized-backup answer; DRS is the continuous-replication answer. Many production Disaster Recovery Strategies use both: DRS for servers that cannot be re-architected, AWS Backup for managed-service resources.
Q7. How fast is Route 53 failover, and how can I make it faster?
Route 53 failover speed is bounded by two factors: health-check detection time (typically 30 seconds to 2 minutes depending on interval and threshold settings) and DNS TTL (how long clients cache the old record). With a 60-second TTL and fast health checks, end-to-end failover is typically 1-3 minutes. To go faster, combine Route 53 with AWS Global Accelerator, which operates at the anycast-network layer and can shift TCP/UDP flows to healthy endpoints in seconds without depending on DNS caching. Global Accelerator is the right choice when Disaster Recovery Strategies require sub-minute failover for long-lived connections or stateful protocols.
Q8. How do I protect Disaster Recovery Strategies against ransomware or account compromise?
Three layers are standard. First, enable AWS Backup Vault Lock in compliance mode so recovery points are immutable for their retention period — even the root user cannot delete them. Second, configure cross-account backup copies to a dedicated "vault account" with strict IAM boundaries so a compromise of the production account cannot propagate to backups. Third, enforce MFA delete on S3 replica buckets and version-enabled source buckets. Combined, these controls mean that even a full production-account compromise leaves a known-good recovery path in the vault account. For SAA-C03, questions about "protecting backups from ransomware" almost always point to Vault Lock + cross-account copy.
Further Reading on Disaster Recovery Strategies
- AWS Disaster Recovery of Workloads Whitepaper — https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html
- AWS Disaster Recovery Options in the Cloud — https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html
- S3 Replication (CRR and SRR) — https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
- S3 Replication Time Control — https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication-time-control.html
- AWS Backup Developer Guide — https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html
- Aurora Global Database — https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html
- DynamoDB Global Tables — https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html
- Route 53 DNS Failover — https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html
- AWS Elastic Disaster Recovery User Guide — https://docs.aws.amazon.com/drs/latest/userguide/what-is-drs.html
- SAA-C03 Exam Guide (PDF) — https://d1.awsstatic.com/training-and-certification/docs-sa-associate/AWS-Certified-Solutions-Architect-Associate_Exam-Guide.pdf