examhub.cc — the most efficient path to the most valuable certifications
Vol. I

Reliability Improvement: Eliminating SPOFs in the Existing Architecture

7,850 words · ~40 minute read

Diagnostic Entry Point: Reliability Improvement Begins With Honest Inventory

Reliability improvement for existing systems is the single most common SAP-C02 Domain 3 question shape: you are not handed a greenfield whiteboard — you inherit a production workload that has already failed in anger. The exam question almost always begins with a symptom (the application went down three times last quarter), a constraint (cannot take a maintenance window longer than 10 minutes), and an SLA (99.5% must become 99.95%). Your reliability improvement plan must convert that symptom into a ranked list of retrofit actions, each justified against an AWS service that can be introduced without rewriting the application.

The mistake candidates make is jumping straight to solution mode — "add Multi-AZ" — before completing the diagnostic entry point. A proper reliability improvement workflow starts with three parallel data-gathering tasks. First, run an AWS Resilience Hub assessment against the existing application stack so that the tool enumerates resource-level resilience policies and highlights the delta versus your stated RTO/RPO. Second, walk the architecture diagram manually and list every component whose failure would stop the workload; that is your single-point-of-failure (SPOF) register. Third, pull the incident history from CloudWatch Logs, Personal Health Dashboard, and your ticketing system to learn which SPOFs have actually bitten you, so reliability improvement effort is sequenced by real probability instead of paranoia.

This topic walks the full reliability improvement retrofit cycle at SAP-C02 Pro depth. You will learn how to identify SPOFs across compute, data, network, identity, and region dimensions; how to sequence retrofit work so the highest-risk components move first; how to introduce AWS Fault Injection Service (FIS) experiments, Route 53 Application Recovery Controller (ARC) routing controls, RDS Proxy, and per-AZ NAT Gateways onto a live system without rewriting code. Every pattern here is explicitly framed as reliability improvement on a running workload — not a greenfield design exercise.

The SAP-C02 exam rarely asks "what is the most reliable design" — it asks "what is the next action" or "which sequence of actions minimises risk during the retrofit". Your reliability improvement answer must therefore rank interventions by blast radius of failure, cost of change, and whether downtime is required. Reference: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html

Reliability Improvement Scenario: The 99.5% Monolith

Hold this scenario in your head for the rest of the topic; every retrofit pattern will map back to it. A regional monolith runs on a handful of EC2 instances behind one Application Load Balancer, talks to a single-AZ MySQL RDS instance, writes session state to sticky ALB cookies, egresses through one NAT Gateway, terminates TLS with a single ACM certificate in one region, federates through a single on-premises IdP, and points at one hard-coded DynamoDB Streams consumer endpoint. It failed three times in the last quarter — once due to AZ impairment, once due to an RDS storage exhaustion event, and once because the NAT Gateway hit its port allocation limit. The business has moved the SLA from 99.5% to 99.95%, meaning the allowed annual downtime drops from about 43.8 hours to about 4.4 hours.

You have 90 days, no application code ownership, and a hard rule: no maintenance window may exceed 10 minutes. Every reliability improvement action described below must respect that constraint. This kind of real-world box is exactly why the SAP-C02 exam rewards candidates who know which AWS services can be bolted onto a running workload versus which ones demand a rewrite.

Why 99.5% to 99.95% Is a Reliability Improvement Step Change

Moving from 99.5% to 99.95% is not "a little more reliability" — it cuts the allowed downtime budget by a factor of ten. That jump demands structural reliability improvement: removing SPOFs, not just tuning them. A workload that can absorb one AZ failure can remain single-region, but every shared dependency in that one region — one RDS instance, one NAT Gateway, one certificate — becomes an order of magnitude more important.
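The downtime arithmetic behind that step change is worth checking directly; a quick sketch:

```python
def annual_downtime_hours(availability_pct: float) -> float:
    """Allowed downtime per year, in hours, for a given availability SLA."""
    hours_per_year = 365 * 24  # 8,760
    return (1 - availability_pct / 100) * hours_per_year

# 99.5% leaves ~43.8 hours of annual budget; 99.95% leaves ~4.4 hours --
# a 10x reduction in how long you are allowed to be down.
budget_995 = round(annual_downtime_hours(99.5), 1)    # 43.8
budget_9995 = round(annual_downtime_hours(99.95), 2)  # 4.38
```

Four and a half hours a year is less than many teams spend on a single bad incident, which is why tuning existing SPOFs is not enough at 99.95%.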

The Hidden SPOF Inventory Most Teams Miss

Most teams list the obvious SPOFs (the database, the NAT) and stop. The reliability improvement exam pattern tests the hidden ones: a single ACM certificate that will expire; a hardcoded regional endpoint inside Lambda environment variables; a single Identity Provider that gates every federated login; a Route 53 private hosted zone associated with only one VPC; a single AWS account whose root credentials hold the entire blast radius.

Analogy 1: Reliability Improvement as a Home Retrofit Kitchen Project

Think of reliability improvement on an existing AWS workload the same way you think about renovating a working kitchen while the family still eats three meals a day. You cannot take the entire kitchen offline — you have to sequence the work so the stove stays usable while the oven is being replaced, and the fridge keeps running while the countertop is torn out. That is why reliability improvement is a sequencing problem. You do the sink last if the dishwasher already works; you do the gas line first if a leak would stop everything. Every Domain 3 retrofit question is some version of "which appliance do you replace first so the family can still cook tonight?"

Analogy 2: Reliability Improvement as a Swiss Army Knife, Not a New Toolbox

AWS gives you a Swiss Army knife for reliability improvement — Multi-AZ, RDS Proxy, Route 53 ARC, Resilience Hub, FIS, Service Quotas, Auto Scaling mixed instance groups. You do not buy a whole new toolbox; you flip open the right blade. The exam rewards knowing which blade is the right one for each SPOF. Multi-AZ is the big blade for data plane reliability; RDS Proxy is the tweezers for connection storms; Route 53 ARC is the scissors for flipping traffic deterministically; FIS is the magnifying glass that reveals where your knife is still rusty.

Analogy 3: Reliability Improvement as Traffic Signals on a Grid

A highway without signals moves fast until the first accident stops everything. Retrofitting traffic signals onto an existing grid is a reliability improvement: each intersection can fail independently, detours route around outages, and rush-hour surges are absorbed because green-light time adapts. Per-AZ NAT Gateways, Route 53 ARC routing controls, and Auto Scaling mixed instance groups all play the role of traffic signals — they compartmentalise failure, provide explicit fail-over controls, and adapt to load. Sticky sessions on the ALB are the opposite: one intersection without a signal that grinds the whole grid to a halt when that single instance crashes.

SPOF Inventory Checklist: The Reliability Improvement Starting Point

Before any retrofit, you must walk this checklist against the existing architecture. Each entry is a SPOF class that reliability improvement work must either eliminate or explicitly accept with a compensating control.

Single-AZ RDS Instance

A single-AZ RDS instance is the canonical SPOF and the first stop of any reliability improvement plan. AZ impairment, storage controller failure, or hardware replacement all translate to downtime. The fix is converting to Multi-AZ, which introduces a synchronous standby in a second AZ. Multi-AZ failover is automatic and completes in roughly 60–120 seconds, bounded by DNS TTL propagation and application connection pool behaviour.

Single NAT Gateway

One NAT Gateway in one AZ means that if the AZ degrades, every private subnet in the VPC loses egress — including instances in healthy AZs whose route tables point at the dead NAT. Worse, a single NAT is subject to port allocation limits (about 55,000 simultaneous connections to each unique destination). Reliability improvement requires one NAT Gateway per AZ and per-AZ route tables so each AZ egresses independently.

Cross-Region Dependency

A workload that claims to be multi-AZ but silently depends on an S3 bucket, KMS CMK, or DynamoDB table in another region is not actually multi-AZ — it is cross-region, and the cross-region link is the new SPOF. Reliability improvement requires either replicating that dependency into the primary region (S3 Same-Region replication, multi-Region KMS keys, DynamoDB global tables) or explicitly documenting the cross-region dependency in the RTO calculation.

Single ACM Certificate

A single ACM certificate on one ALB looks fine until renewal fails, the private key is revoked, or the domain validation record is deleted. ACM public certs auto-renew (provided the validation records stay in place), but imported certs never do, and private CA certs renew automatically only when issued through ACM. Reliability improvement inventories every TLS termination point, ensures automated renewal, and for disaster recovery scenarios pre-provisions a parallel certificate in the DR region (ACM is regional; a cert in us-east-1 does not serve an ALB in eu-west-1).

Hardcoded Endpoint

A Lambda environment variable containing api.us-east-1.amazonaws.com or an EC2 user-data script with a hardcoded RDS endpoint DNS name is a reliability improvement anti-pattern. When Multi-AZ failover happens, the DNS record updates — but only if clients resolve the DNS name rather than cache the resolved IP. Reliability improvement replaces hardcoded IPs with Route 53 records, connection strings with RDS Proxy endpoints, and region-specific ARNs with SSM Parameter Store lookups resolved at runtime.

Single Identity Provider

A single on-premises IdP is a single authentication SPOF. If the IdP is down, no federated user can sign in — including operators trying to mitigate the outage. Reliability improvement adds break-glass IAM users (with hardware MFA) in IAM Identity Center for emergency access, or uses AWS Managed Microsoft AD with multi-AZ domain controllers.

Single Region

A single-region deployment is a SPOF for region-wide impairments. SAP-C02 reliability improvement rarely forces multi-region adoption — it forces you to state the RTO/RPO consequence of staying single-region, and it tests whether you know which services require multi-region opt-in (Route 53 ARC readiness checks, Global Accelerator, DynamoDB global tables, S3 CRR).

Multi-AZ RDS protects against AZ failure but not against regional failure, logical corruption, accidental deletion, or a poisoned replication stream. Reliability improvement that stops at Multi-AZ still leaves the backup-and-restore story, the cross-region DR story, and the accidental-DELETE story unanswered. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZSingleStandby.html

Retrofit Sequence 1: Single-AZ RDS to Multi-AZ Without Downtime

Converting a single-AZ RDS to Multi-AZ is the first reliability improvement action because it removes the highest-probability, highest-impact SPOF. AWS supports this with a single modify-db-instance call, but Pro-level reliability improvement candidates must know the failure modes.

The Simple Path: Modify-In-Place

For most engines, aws rds modify-db-instance --db-instance-identifier prod-db --multi-az --apply-immediately triggers a snapshot of the primary, provisions a standby in a second AZ, and starts synchronous replication. During this process, there is no downtime on the primary — but write latency increases by a few milliseconds after the standby goes live because commits are now synchronous across AZs. The modification window can last an hour or more for large databases; that is not an outage, just a background task.

The Replica-Promotion Path for Stricter Windows

When the business insists on a controlled cutover rather than "AWS will flip it when ready", reliability improvement uses the read replica promotion path. Create a read replica in AZ B, wait for replication lag to reach zero, then promote the replica and fail application traffic over to the new endpoint using Route 53 or an ALB target group swap. This gives you a known cutover moment, but it is a one-way door: the old primary is orphaned and must be re-ingested or decommissioned. This pattern is the right reliability improvement answer when the exam stem says "we need a rehearsed failover moment".

Application-Side Reliability Improvement for the Failover

The application tier must cooperate. JDBC connection pools, SQLAlchemy engines, and .NET connection strings cache DNS. After a Multi-AZ failover, stale pools point at the old IP and connections fail for up to 5 minutes. Reliability improvement on the client side means setting a short DNS TTL, enabling connection validation on checkout, and ideally placing RDS Proxy between the app and the database so the reconnect storm is handled by a managed middle tier.
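Connection validation on checkout is the key client-side behaviour. As a toy illustration (not a real driver pool — real pools like SQLAlchemy's do this via a `SELECT 1` pre-ping), a stale connection left over from the old primary is discarded instead of being handed to the caller:

```python
class ValidatingPool:
    """Toy connection pool illustrating validate-on-checkout: a connection
    that went stale (e.g. it cached the old primary's IP across a Multi-AZ
    failover) is dropped and replaced rather than returned to the caller."""

    def __init__(self, connect, validate):
        self._connect = connect    # factory: () -> connection
        self._validate = validate  # predicate: connection -> bool
        self._idle = []

    def checkout(self):
        while self._idle:
            conn = self._idle.pop()
            if self._validate(conn):  # a real pool would run "SELECT 1" here
                return conn
            # stale connection: discard it and try the next idle one
        return self._connect()        # nothing usable idle: open fresh

    def checkin(self, conn):
        self._idle.append(conn)


# Simulate a failover: the checked-in connection goes stale, so the next
# checkout silently replaces it instead of failing the request.
made = []
def connect():
    conn = {"stale": False, "id": len(made)}
    made.append(conn)
    return conn

pool = ValidatingPool(connect, lambda c: not c["stale"])
c1 = pool.checkout()
pool.checkin(c1)
c1["stale"] = True          # failover happened while the connection sat idle
c2 = pool.checkout()
assert c2 is not c1 and not c2["stale"]
```

The same principle is why RDS Proxy helps so much: it performs this validation in a managed tier so every application language does not need a well-behaved pool.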

Retrofit Sequence 2: RDS Proxy for Connection Spike Resilience

RDS Proxy is the reliability improvement answer when the workload is susceptible to connection storms — Lambda scale-out, ECS task restarts, or spiky batch workloads exhausting the database's max_connections budget. RDS Proxy sits between the clients and the database, pools connections, and absorbs the spike.

When RDS Proxy Is the Right Reliability Improvement

If the incident history shows "database max_connections exhausted" or "database stuck in high CPU after Lambda surge", that is the RDS Proxy signal. Lambda cold start followed by 1,000 concurrent executions opening 1,000 fresh connections will crush an RDS db.m5.large. RDS Proxy multiplexes those 1,000 Lambda requests onto a manageable pool of 100 real database connections.

RDS Proxy Failover Time Improvement

Without RDS Proxy, Multi-AZ failover takes 60–120 seconds plus client reconnection time. With RDS Proxy in the path, failover appears to the application as roughly 30 seconds of elevated latency, because RDS Proxy handles the reconnect to the new primary behind a stable endpoint. For reliability improvement in tight SLAs, this is a meaningful win.

Inserting RDS Proxy Onto a Live Workload

The retrofit is straightforward: create the RDS Proxy, associate it with the existing database, store database credentials in AWS Secrets Manager, grant the Lambda/ECS execution role secretsmanager:GetSecretValue and rds-db:connect, then swap the endpoint in the application config. RDS Proxy supports IAM authentication as well, which removes password rotation complexity. The reliability improvement blast radius is small: if RDS Proxy misbehaves, revert the endpoint to the database directly.
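The two permissions named above translate into a small IAM policy on the execution role. A sketch — all ARNs here are hypothetical placeholders, and the `rds-db:connect` resource must name the proxy and database user:

```python
import json

# Hypothetical ARNs -- substitute your own account, region, secret, proxy
# ID, and database user name.
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod-db-creds-AbCdEf"
PROXY_USER_ARN = "arn:aws:rds-db:us-east-1:111122223333:dbuser:prx-0123456789abcdef0/app_user"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        # Let RDS Proxy / the app read the database credentials
        {"Effect": "Allow",
         "Action": "secretsmanager:GetSecretValue",
         "Resource": SECRET_ARN},
        # Let the Lambda/ECS role connect to the proxy via IAM auth
        {"Effect": "Allow",
         "Action": "rds-db:connect",
         "Resource": PROXY_USER_ARN},
    ],
}

policy_json = json.dumps(policy, indent=2)
```

Attach this to the Lambda or ECS task role, not to the database itself; the proxy presents the Secrets Manager credentials to the database on the application's behalf.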

RDS Proxy does not accept raw database passwords at runtime — it reads credentials from Secrets Manager, or authenticates clients via IAM. This is actually a reliability improvement and a security improvement in one move: you can rotate database passwords in Secrets Manager without touching application config. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy.html

Retrofit Sequence 3: Single NAT to One NAT per AZ

A single NAT Gateway is the most common reliability improvement finding in Resilience Hub assessments because it is so easy to set up and so easy to forget. The retrofit is pure infrastructure work — no application changes.

The Per-AZ NAT Topology

For each AZ that contains private subnets, provision a NAT Gateway in that AZ's public subnet. Update the private subnet route tables so subnets in AZ A route 0.0.0.0/0 to the NAT in AZ A, subnets in AZ B route to the NAT in AZ B, and so on. Now a NAT Gateway failure in AZ A only affects AZ A workloads; AZ B keeps egressing normally.
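The invariant to enforce is "every private subnet's default route targets the NAT in its own AZ". A small sketch of that mapping, with hypothetical subnet and NAT IDs, which also flags the misconfiguration the retrofit exists to prevent:

```python
def per_az_nat_routes(private_subnets, nat_gateways):
    """Map each private subnet to the NAT Gateway in its *own* AZ so that
    one AZ's NAT failure never removes egress from healthy AZs.

    private_subnets: {subnet_id: az}
    nat_gateways:    {az: nat_gateway_id}
    Returns {subnet_id: nat_gateway_id} for the 0.0.0.0/0 route targets.
    """
    routes = {}
    for subnet_id, az in private_subnets.items():
        if az not in nat_gateways:
            raise ValueError(
                f"{az} has private subnets but no NAT Gateway -- "
                "that AZ cannot egress independently")
        routes[subnet_id] = nat_gateways[az]
    return routes


# Hypothetical IDs: two AZs, one NAT each, one private subnet each.
subnets = {"subnet-a1": "us-east-1a", "subnet-b1": "us-east-1b"}
nats = {"us-east-1a": "nat-aaa", "us-east-1b": "nat-bbb"}
assert per_az_nat_routes(subnets, nats) == {
    "subnet-a1": "nat-aaa", "subnet-b1": "nat-bbb"}
```

The single-NAT anti-pattern is exactly the case where this check raises: private subnets exist in an AZ that has no local NAT, so their route tables must point cross-AZ.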

Cost and Port Exhaustion as Reliability Improvement Drivers

Per-AZ NAT increases cost (each NAT is billed hourly plus data processing), but it also distributes port allocation across gateways, which directly fixes the port-exhaustion class of outage. If the incident history mentions "SNAT port exhaustion" or "outbound connections from many instances to a single third-party API failing intermittently", per-AZ NAT is not optional — it is the reliability improvement.

The Hybrid Approach: VPC Endpoints to Reduce NAT Dependency

Before retrofitting per-AZ NAT, audit what traffic flows through the NAT. If most of it is S3 or DynamoDB, adding Gateway VPC endpoints eliminates that traffic from the NAT path entirely, which is both a cost reduction and a reliability improvement. Interface endpoints for other AWS services (Secrets Manager, SSM, KMS) further reduce NAT load and remove a failure dependency on the public internet for AWS API calls.

Retrofit Sequence 4: Auto Scaling Group Mixed Instance Types and Spot Rebalance

A single instance type in an Auto Scaling group is a reliability improvement SPOF in a less obvious dimension: capacity availability. If m5.large is exhausted in the region, your ASG cannot scale out, which is an outage in disguise during a traffic spike.

Multiple Instance Types for Capacity Diversity

Modify the existing ASG launch template to reference a mixed instance policy with three or four instance types (m5.large, m5a.large, m6i.large, m6a.large) and set the Spot allocation strategy to price-capacity-optimized. The ASG will now draw from whichever capacity pool has inventory, converting a scaling-out failure into a cost dimension.
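As a sketch of the request shape — field names follow the EC2 Auto Scaling MixedInstancesPolicy structure, and the launch template ID is a placeholder:

```python
# Sketch of the MixedInstancesPolicy block passed to CreateAutoScalingGroup /
# UpdateAutoScalingGroup. The launch template ID is hypothetical.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-0123456789abcdef0",
            "Version": "$Latest",
        },
        # Capacity diversity: several interchangeable pools instead of one.
        "Overrides": [
            {"InstanceType": "m5.large"},
            {"InstanceType": "m5a.large"},
            {"InstanceType": "m6i.large"},
            {"InstanceType": "m6a.large"},
        ],
    },
    "InstancesDistribution": {
        "OnDemandBaseCapacity": 2,               # always-on On-Demand floor
        "OnDemandPercentageAboveBaseCapacity": 50,
        "SpotAllocationStrategy": "price-capacity-optimized",
    },
}
```

The Overrides list is what converts "m5.large is exhausted" from an outage into a non-event: the ASG simply fills from the next pool.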

Spot with Capacity Rebalance

For fault-tolerant tiers, mix On-Demand with Spot. Enable Capacity Rebalance so the ASG receives EC2 Spot rebalance recommendations and proactively replaces instances before a Spot interruption. Combined with target tracking scaling, this delivers reliability improvement and cost reduction simultaneously.

Mixed instance ASGs with Spot are one of the rare reliability improvement patterns that reduces cost as it increases reliability. The SAP-C02 exam explicitly tests this combination. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-mixed-instances-groups.html

Retrofit Sequence 5: Replace Sticky ALB Sessions with DynamoDB Session Store

Sticky ALB sessions are a reliability improvement anti-pattern because they pin a user to one instance. When that instance dies, the user is logged out and every in-flight transaction for that user fails. Pro-level reliability improvement moves session state off the instance.

DynamoDB as Session Store

The retrofit pattern: create a DynamoDB table sessions with the session ID as the partition key, a TTL attribute for automatic expiry, and either on-demand capacity or provisioned capacity with auto scaling. Replace the in-memory session middleware with a DynamoDB-backed one (for example, express-session with connect-dynamodb in Node.js, or a DynamoDB-backed distributed cache/session provider in ASP.NET Core).
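The store's logic is small enough to sketch end-to-end. Here the table is injected so the same code runs against a boto3 DynamoDB Table in production and a dict-backed fake in tests; expiry is also checked on read for illustration (DynamoDB's own TTL deletes expired items in the background, with some delay):

```python
import time
import uuid

class DynamoSessionStore:
    """Sketch of a DynamoDB-backed session store. `table` is anything with
    put_item/get_item in the boto3 Table style."""

    def __init__(self, table, ttl_seconds=3600):
        self.table = table
        self.ttl_seconds = ttl_seconds

    def create(self, data):
        session_id = str(uuid.uuid4())
        self.table.put_item(Item={
            "session_id": session_id,                            # partition key
            "expires_at": int(time.time()) + self.ttl_seconds,   # TTL attribute
            "data": data,
        })
        return session_id

    def get(self, session_id):
        item = self.table.get_item(Key={"session_id": session_id}).get("Item")
        if item is None or item["expires_at"] < time.time():
            return None  # missing, or past TTL but not yet swept by DynamoDB
        return item["data"]


class FakeTable:
    """In-memory stand-in for a boto3 DynamoDB Table, for local tests."""
    def __init__(self):
        self.items = {}
    def put_item(self, Item):
        self.items[Item["session_id"]] = Item
    def get_item(self, Key):
        item = self.items.get(Key["session_id"])
        return {"Item": item} if item else {}


store = DynamoSessionStore(FakeTable(), ttl_seconds=60)
sid = store.create({"user": "alice"})
assert store.get(sid) == {"user": "alice"}
assert store.get("no-such-session") is None
```

Because any instance can now look up any session, the instance that originally created the session is no longer special — which is exactly the property that makes stickiness removable.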

Turning Off Stickiness Safely

Once sessions are in DynamoDB, the ALB target group stickiness setting can be turned off. Do this during a low-traffic window and monitor for anomalies. Reliability improvement here also unlocks horizontal scaling — any instance can serve any user — and simplifies blue/green deployments because warm-up no longer requires routing specific users to specific instances.

ElastiCache Alternative

If the workload needs sub-millisecond session lookups, ElastiCache for Redis is the alternative session store. Use cluster mode with multi-AZ failover for reliability improvement at the cache layer; otherwise ElastiCache becomes its own SPOF.

Chaos Testing Retrofit: AWS Fault Injection Service Experiments

A reliability improvement plan is only as trustworthy as its rehearsal. AWS Fault Injection Service (FIS) is the managed chaos engineering service that lets you run controlled failure experiments on existing workloads. On the exam, FIS is the correct reliability improvement answer whenever the stem says "validate the new design survives AZ failure" or "prove the failover actually works".

FIS Experiment Template Structure

An FIS experiment has actions (what to break), targets (which resources), stop conditions (CloudWatch alarms that abort if blast radius grows), and an IAM role authorising FIS to perform the actions. For reliability improvement of the monolith scenario, candidate experiments include:

Experiment Per Component Retrofit

  • RDS: aws:rds:reboot-db-instances with forceFailover = true to validate the Multi-AZ failover path.
  • NAT/AZ: aws:network:disrupt-connectivity against subnets in one AZ to validate per-AZ NAT isolation.
  • EC2: aws:ec2:terminate-instances to validate ASG replacement and session continuity via DynamoDB.
  • API/latency: aws:fis:inject-api-internal-error against selected APIs to validate client retry and circuit breaker logic.

Stop Conditions as Blast-Radius Guards

Every FIS experiment on a production reliability improvement retrofit must attach a CloudWatch alarm as a stop condition — for example, "abort if 5xx error rate exceeds 1% for 2 minutes". If the stop condition fires, FIS automatically rolls back and terminates the experiment. This is the reliability improvement insurance that lets you run chaos on production instead of a clone.
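Putting the pieces together, a sketch of the request body for the RDS failover experiment — shaped after the FIS create_experiment_template API, with all ARNs as hypothetical placeholders:

```python
# Sketch of a create_experiment_template request (boto3 `fis` client).
# Role, DB, and alarm ARNs are placeholders; substitute your own.
experiment_template = {
    "description": "Validate Multi-AZ failover of prod-db",
    "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role",
    "targets": {
        "db": {
            "resourceType": "aws:rds:db",
            "resourceArns": ["arn:aws:rds:us-east-1:111122223333:db:prod-db"],
            "selectionMode": "ALL",
        }
    },
    "actions": {
        "force-failover": {
            "actionId": "aws:rds:reboot-db-instances",
            "parameters": {"forceFailover": "true"},
            "targets": {"DBInstances": "db"},
        }
    },
    # Blast-radius guard: abort automatically if the customer-facing
    # SLO alarm fires during the experiment.
    "stopConditions": [
        {"source": "aws:cloudwatch:alarm",
         "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:prod-5xx-rate"},
    ],
}
```

Note that the stop condition is part of the template itself, not an afterthought bolted onto the run: a template without one should fail code review.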

Running FIS on production without stop conditions is a reliability improvement anti-pattern — it is just an outage with an IAM role. Always configure CloudWatch alarms tied to customer-facing SLOs before running any chaos experiment on a live workload. Reference: https://docs.aws.amazon.com/fis/latest/userguide/stop-conditions.html

Route 53 ARC Retrofit: Deterministic Failover Controls

Route 53 Application Recovery Controller (ARC) is the reliability improvement answer when Route 53 health-check based failover is too probabilistic. ARC adds routing controls (manual on/off switches) and readiness checks (continuous validation that a replica is actually ready).

Adding Routing Controls to Existing Route 53 Records

The retrofit sequence: create an ARC cluster (five regional endpoints for quorum), define a control panel, create routing controls, then attach each routing control to an existing Route 53 record via an ARC alias. The Route 53 record now only resolves to the endpoint whose routing control is On. During a failover, an operator toggles routing controls via the ARC cluster data plane — a quorum-based decision that cannot be silently overridden by a regional control plane outage.

Readiness Checks on Existing Infrastructure

A readiness check continuously evaluates resource configurations (are the ASG capacities matched across regions? is the DynamoDB table healthy? are the Route 53 records identical?) and fires an alarm when drift is detected. Reliability improvement here is reducing the chance of a failed failover because the DR replica quietly diverged from primary.

Zonal Shift for AZ-Level Reliability Improvement

Zonal shift is the lighter-weight ARC feature — it lets you temporarily stop routing ALB/NLB traffic to one AZ without editing load balancer config. For the monolith scenario, zonal shift is the one-click reliability improvement when an AZ is misbehaving but AWS has not yet declared impairment.

Zonal Autoshift

Zonal autoshift is the newer evolution: AWS monitors AZ health and automatically shifts traffic away from an impaired AZ, then shifts back when healthy. Enable this on existing ALBs and NLBs as a zero-operator reliability improvement baseline.

AWS Resilience Hub Assessments for Existing Applications

AWS Resilience Hub is the assessment service that evaluates an existing workload against your stated RTO/RPO and produces a prioritised remediation list. On the SAP-C02 exam, whenever the stem says "assess the current resilience posture of an existing workload", Resilience Hub is the correct reliability improvement answer.

Defining a Resilience Policy

A resilience policy states the RTO and RPO for four disruption types: Software (application-level), Hardware (instance/volume), AZ (one AZ offline), Region (one region offline). Resilience Hub measures the app's ability to meet each RTO/RPO and gives you a "Policy met / Policy breached" verdict per component.
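Conceptually the verdict is a simple comparison per disruption type. A sketch with illustrative (made-up) RTO/RPO numbers in seconds — the single-AZ RDS scenario breaches the AZ line because an AZ outage means a manual restore:

```python
# Illustrative policy: RTO/RPO targets in seconds per disruption type.
POLICY = {
    "Software": {"rto": 300,   "rpo": 60},
    "Hardware": {"rto": 600,   "rpo": 300},
    "AZ":       {"rto": 900,   "rpo": 300},
    "Region":   {"rto": 86400, "rpo": 3600},
}

def verdict(policy, achieved):
    """Per disruption type: 'met' iff achieved RTO and RPO both fit policy."""
    return {
        dtype: ("met" if achieved[dtype]["rto"] <= target["rto"]
                      and achieved[dtype]["rpo"] <= target["rpo"]
                else "breached")
        for dtype, target in policy.items()
    }

# Single-AZ RDS: AZ outage implies restore-from-snapshot, hours not minutes.
achieved = {
    "Software": {"rto": 120,   "rpo": 0},
    "Hardware": {"rto": 300,   "rpo": 0},
    "AZ":       {"rto": 7200,  "rpo": 300},
    "Region":   {"rto": 86400, "rpo": 3600},
}
assert verdict(POLICY, achieved)["AZ"] == "breached"
```

Converting that RDS instance to Multi-AZ is precisely the retrofit that flips the AZ line from breached to met — which is why it sits first in the 90-day plan.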

Adding an Existing Application to Resilience Hub

You can point Resilience Hub at a CloudFormation stack, a Terraform state file, a resource group, or an EKS cluster. It inspects the declared resources, computes the implied resilience based on each resource's configuration (single-AZ RDS = AZ RTO breach; single NAT = AZ RTO breach; no backup plan = hardware RPO breach), and produces a ranked list of recommendations.

Closing the Loop With Resilience Hub

After each reliability improvement retrofit, re-run the assessment and confirm that the corresponding policy line item flips from "breached" to "met". This is the SAP-C02 answer for "how do you verify the remediation was effective" — you do not eyeball the architecture, you re-run the Resilience Hub assessment.

Resilience Hub is the static assessment (did the design meet the policy?). FIS is the dynamic test (did the failover actually work?). You need both for a complete reliability improvement story — and the SAP-C02 exam will distinguish between them. Reference: https://docs.aws.amazon.com/resilience-hub/latest/userguide/what-is.html

Quota Monitoring Retrofit: Service Quotas and CloudWatch

Quota breaches cause outages that look like application bugs. Reliability improvement includes proactive quota monitoring because you cannot scale into a wall.

Service Quotas API for Programmatic Audits

The Service Quotas API lists the applied quota for every AWS service in every region. A weekly Lambda that compares current usage (via CloudWatch metrics or describe-* API calls) against the applied quota and fires an SNS alert at 70% utilisation is a minimal but powerful reliability improvement.
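The comparison logic in that weekly Lambda is a one-liner; the value is in running it everywhere, every week. A sketch with hypothetical quota names and numbers:

```python
def quota_alerts(usage, quotas, threshold=0.70):
    """Return quota names at or above the utilisation threshold -- the
    candidates for an SNS alert or an automatic increase request."""
    return sorted(
        name for name, used in usage.items()
        if quotas.get(name) and used / quotas[name] >= threshold
    )

# Hypothetical snapshot: Lambda concurrency is at 82% of its quota.
usage  = {"lambda-concurrent": 820, "ec2-vcpus": 120, "vpcs": 3}
quotas = {"lambda-concurrent": 1000, "ec2-vcpus": 512, "vpcs": 5}
assert quota_alerts(usage, quotas) == ["lambda-concurrent"]
```

In the real Lambda, `usage` comes from CloudWatch AWS/Usage metrics or describe-* calls and `quotas` from the Service Quotas API; the 70% threshold buys enough lead time for an adjustable-quota increase to be approved before the wall is hit.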

CloudWatch Alarms on Usage Metrics

Many services publish usage metrics under the AWS/Usage namespace. Create CloudWatch alarms on metrics like CallCount for API rate limits or ResourceCount for "number of EC2 instances". Pair with an EventBridge rule that files a Service Quotas increase request automatically when the alarm fires — reliability improvement as automation, not as reactive tickets.

Common Quotas Reliability Improvement Must Cover

Lambda concurrent executions, EC2 vCPU limits per family, VPC count per region, EBS volume storage per region, Route 53 hosted zones per account, and ACM certificates per region are the usual suspects. A reliability improvement audit enumerates these for each account in the organisation and sets headroom targets.

A "service limit" is the old term. Today, AWS calls every tunable cap a "service quota", managed through the Service Quotas console. Some quotas are adjustable (you can request an increase), others are hard (architectural limits you cannot breach). Reliability improvement auditors must tag each relevant quota as adjustable or hard and plan accordingly. Reference: https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html

Self-Healing Automation Retrofit

Reliability improvement that requires a human to read a pager is slower than reliability improvement that runs as code. Once the SPOF inventory is clean, layer EventBridge + Lambda/SSM automation on top.

CloudWatch Alarm to EventBridge to SSM Document

The canonical self-healing loop: CloudWatch alarm fires (e.g., "instance has been unresponsive for 5 minutes"), EventBridge rule matches the alarm state change, target is an SSM Automation Document that restarts the instance, rotates credentials, or replaces an ASG member. The operator is still notified, but the remediation has already started.
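The EventBridge rule in that loop matches on the alarm state change event. Below is the event pattern plus a deliberately simplified matcher so the wiring can be tested locally (EventBridge evaluates patterns server-side, with richer semantics than this sketch):

```python
# Event pattern for "a CloudWatch alarm entered ALARM state".
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

def matches(pattern, event):
    """Tiny subset of EventBridge matching: each pattern key must exist in
    the event; lists mean 'any of these values'; nested dicts recurse."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

# Sample alarm event (alarm name is hypothetical).
event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"state": {"value": "ALARM"},
               "alarmName": "instance-unresponsive-5min"},
}
assert matches(ALARM_PATTERN, event)        # rule fires -> SSM runbook target
assert not matches(ALARM_PATTERN,
                   {**event, "detail": {"state": {"value": "OK"}}})
```

Pinning the pattern to `"value": ["ALARM"]` is what keeps the runbook from firing again on the OK transition when the instance recovers.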

SSM Runbook Catalogue

Build a catalogue of SSM Automation Documents for known failure modes: "restart stuck RDS connection pool", "rotate leaked IAM credential", "force NAT Gateway failover", "invalidate CloudFront distribution". Each runbook is a reliability improvement asset that reduces MTTR on the next incident.

Pro Deep Dive: Diagnostic Entry Point

When the SAP-C02 stem says "an existing workload failed", the Pro-level diagnostic entry point sequence is:

  1. Run Resilience Hub assessment against the current stack to get a resource-level SPOF map.
  2. Pull Personal Health Dashboard events for the past 90 days — have we already been hit by AZ or regional events?
  3. Pull Trusted Advisor service limit and fault-tolerance checks as a quick reliability improvement gut check.
  4. Pull CloudWatch dashboards and Logs Insights for top error sources.
  5. Run an FIS experiment in a pre-production environment to validate what the Resilience Hub assessment predicts.

This entry point sequence itself is testable — questions ask "which AWS service first?" and the answer is almost always Resilience Hub or Trusted Advisor for the assessment phase, followed by FIS for validation.

Pro Deep Dive: SPOF Identification Beyond the Obvious

The beginner SPOF list is "single AZ, single region, single instance". The Pro reliability improvement inventory adds:

Quota SPOFs

The account can scale until it hits a quota and then it cannot. That quota is a SPOF even if the architecture is redundant.

Identity SPOFs

A single IdP, a single break-glass account, a single AWS Organizations management account whose loss would block account-level recovery.

Control-Plane vs Data-Plane SPOFs

Route 53 data plane is globally resilient; Route 53 control plane (record editing) is hosted in us-east-1 and can be impaired. ARC's routing controls specifically use a data-plane quorum API to survive us-east-1 control plane outages. Reliability improvement at Pro depth distinguishes control plane from data plane for every service.

Configuration Drift SPOFs

A region whose AMI is out of date, whose CloudFormation template has diverged, whose security group rules are inconsistent — the DR region that looks ready but is not. Readiness checks in Route 53 ARC are the answer.

Pro Deep Dive: Retrofit Pattern Principles

Every reliability improvement retrofit on an existing workload follows these principles:

Smallest Safe Change First

Enable Multi-AZ before introducing RDS Proxy. Introduce RDS Proxy before moving to Aurora. Each step is reversible or blast-radius contained.

Observable Before Automated

Add CloudWatch metrics and alarms before layering EventBridge remediation. You need to see the symptom before you automate the response.

Rehearse Before Trusting

Every retrofit must be exercised via FIS at least once in staging and once in production (with stop conditions). Reliability improvement that has never failed over has not actually been validated.

Document the Delta

Each retrofit changes the resilience policy. Update Resilience Hub, update the runbook, update the architecture diagram. Undocumented reliability improvement decays quickly.

Walking the Scenario End-to-End: 90-Day Reliability Improvement Plan

For the 99.5% monolith scenario, a Pro-level reliability improvement plan sequenced across 90 days:

Week 1–2: Assessment

Run Resilience Hub against the CloudFormation stack; run Trusted Advisor; tag SPOFs by blast radius; baseline the CloudWatch dashboards and Service Quotas utilisation.

Week 3–4: Data-Layer Reliability Improvement

Convert RDS to Multi-AZ via modify-in-place (no downtime). Extend automated backup retention to 35 days. Introduce RDS Proxy with Secrets Manager-backed credentials. Test failover via an FIS aws:rds:reboot-db-instances experiment with forceFailover.

Week 5–6: Network-Layer Reliability Improvement

Retrofit per-AZ NAT Gateways. Add S3 and DynamoDB Gateway VPC endpoints. Add interface endpoints for Secrets Manager, SSM, and KMS. Validate route tables per AZ. Run FIS network-disruption experiment on one AZ.

Week 7–8: Compute-Layer Reliability Improvement

Convert the static EC2 fleet into an ASG with mixed instance types. Introduce capacity rebalancing. Move session state from sticky ALB cookies to DynamoDB sessions table with TTL. Disable ALB stickiness. Run FIS instance-termination experiment.

Week 9–10: Traffic and Failover Reliability Improvement

Introduce Route 53 ARC cluster, routing controls, and readiness checks. Enable zonal autoshift on the ALB. Document the manual zonal shift procedure. Run a tabletop exercise on AZ failover.

Week 11–12: Validation and Automation Reliability Improvement

Layer CloudWatch to EventBridge to SSM automation for top-three known failure modes. Run a full Resilience Hub re-assessment; confirm every policy item shows "met". Present the new SLA track record to the business.
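The trigger side of that chain is an EventBridge rule matching alarm state transitions; the alarm name below is hypothetical, and the rule's target would be the SSM Automation runbook for that failure mode:

```json
{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "alarmName": ["app-5xx-error-rate"],
    "state": { "value": ["ALARM"] }
  }
}
```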

SAP-C02 questions often ask "which order" rather than "which service". The canonical order for reliability improvement is: data layer first (highest blast radius), then network, then compute, then traffic control, then automation. Memorise this sequence. Reference: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html

Common Reliability Improvement Traps on SAP-C02

Several question patterns consistently catch out candidates who have not rehearsed reliability improvement at Pro depth.

Trap 1: Multi-AZ Means "Fault Tolerant"

Multi-AZ is highly available, not fault tolerant. The exam distinguishes between "tolerates one AZ failure" (Multi-AZ) and "continues serving without human intervention during failure" (true fault tolerance, which requires redundant capacity pre-provisioned across AZs).

Trap 2: Route 53 Failover Routing Is Deterministic

Route 53 health-check-based failover is probabilistic — it depends on health check evaluation timing and TTL. When reliability improvement demands a deterministic flip, ARC routing controls are the right answer, not Route 53 failover records alone.
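The timing dependence is easy to quantify. A rough worst-case bound, with illustrative numbers (a 30-second check interval, a failure threshold of 3, and a 60-second record TTL):

```python
# Illustrative worst-case time for Route 53 health-check-based failover:
# the check must fail `failure_threshold` consecutive times, spaced
# `check_interval` seconds apart, and clients may then keep resolving the
# old record for up to `record_ttl` seconds. All inputs are assumptions.
def worst_case_failover_seconds(check_interval, failure_threshold, record_ttl):
    return check_interval * failure_threshold + record_ttl

print(worst_case_failover_seconds(30, 3, 60))  # 150
```

Two and a half minutes of undefined behaviour is acceptable for many tiers; where it is not, ARC's quorum-backed flip is the exam answer.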

Trap 3: RDS Proxy Is a Performance Tool

RDS Proxy is a reliability improvement tool that also improves connection pooling performance. The exam asks "what problem does RDS Proxy solve", and the right answer centres on connection exhaustion and failover resilience, not raw latency.

Trap 4: Cross-Zone Load Balancing Is Free Reliability Improvement

Cross-zone load balancing is on by default for ALB and incurs no inter-AZ data transfer charge; for NLB it is off by default, and enabling it does incur cross-AZ data charges. Reliability improvement retrofits must account for this cost/traffic trade-off.
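The traffic fraction exposed to that charge is a one-line calculation: with targets spread evenly across n AZs and cross-zone on, a request is equally likely to land on any target, so (n - 1)/n of routed bytes cross an AZ boundary on average.

```python
from fractions import Fraction

# Expected share of cross-zone-routed traffic that crosses an AZ boundary,
# assuming an even target spread across `n_azs` Availability Zones.
def expected_cross_az_fraction(n_azs):
    return Fraction(n_azs - 1, n_azs)

print(expected_cross_az_fraction(3))  # 2/3 of routed bytes cross AZs
```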

Trap 5: Backup Is Not Disaster Recovery

Automated RDS snapshots in the same region are backups, not DR. True DR reliability improvement requires cross-region snapshot copy or cross-region read replicas. SAP-C02 tests this distinction.
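The minimum DR step, cross-region snapshot copy, is sketched below with placeholder identifiers; the call runs in the destination region, naming the source via --source-region, and an encrypted copy needs a KMS key in the destination:

```shell
aws rds copy-db-snapshot \
  --region us-west-2 \
  --source-region us-east-1 \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:111122223333:snapshot:legacy-db-nightly \
  --target-db-snapshot-identifier legacy-db-nightly-dr \
  --kms-key-id alias/dr-snapshots
```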

Adding redundancy without exercising the failover path is theatre. The exam specifically rewards FIS + Resilience Hub + ARC readiness checks as the rehearsal trinity that turns redundancy into actual reliability improvement. Reference: https://docs.aws.amazon.com/fis/latest/userguide/what-is.html

White-Box Walkthrough: Reliability Improvement Decisions Explained

To turn the patterns into exam reflexes, walk the decision tree for each SPOF.

Single-AZ RDS → Pick the Retrofit Path

  • Engine supports Multi-AZ cluster (MySQL 8.0.28+, PostgreSQL 13.6+)? Consider a Multi-AZ DB cluster for lower commit latency and readable standby instances.
  • Engine supports classic Multi-AZ only? Use classic Multi-AZ.
  • Write latency sensitivity is extreme? Evaluate Aurora, not RDS Multi-AZ — a different reliability improvement conversation.

Single NAT → Pick the Topology

  • High NAT cost concern? Add VPC endpoints first, then per-AZ NAT.
  • Port exhaustion incident history? Per-AZ NAT is mandatory.
  • Egress to the public internet is rare? VPC endpoints may remove most of the need for NAT; cover IPv6 egress with an egress-only internet gateway.

Sticky Sessions → Pick the Session Store

  • Latency budget < 5ms and willing to operate a cache tier? ElastiCache for Redis with multi-AZ.
  • Latency budget forgiving (10ms+) and want managed? DynamoDB with TTL.
  • Very short-lived sessions (minutes)? JWT tokens client-side eliminate the server session store entirely.
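The client-side option can be sketched as a signed token: the server keeps no session state at all, and any node can verify a token with the shared key. This is a deliberately simplified JWT-style sketch (raw HMAC over a base64 payload, hypothetical key and claims), not a full JWT implementation:

```python
import base64, hashlib, hmac, json

SECRET = b"rotate-me"  # hypothetical signing key, shared by all web nodes

def sign_token(claims: dict) -> str:
    # Simplified token: base64(payload) + "." + base64(HMAC-SHA256(payload)).
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    return (payload + b"." + sig).decode()

def verify_token(token: str):
    payload, sig = token.encode().rsplit(b".", 1)
    expected = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # tampered, or signed with a different key
    return json.loads(base64.urlsafe_b64decode(payload))

token = sign_token({"user": "u-42", "exp": 1_000_000})
print(verify_token(token))              # {'exp': 1000000, 'user': 'u-42'}
print(verify_token(token[:-2] + "xx"))  # None: signature no longer matches
```

Because verification needs only the shared key, instance termination costs nothing; the trade-off is that tokens cannot be revoked server-side before their expiry claim.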

FAQ

How is reliability improvement different from disaster recovery on the SAP-C02 exam?

Reliability improvement is about eliminating SPOFs inside a region so the workload survives component and AZ failures. Disaster recovery is about surviving region-wide events. The exam uses the phrase "improve reliability" for SPOF remediation within one region and "disaster recovery" for multi-region patterns. RDS Multi-AZ is reliability improvement; RDS cross-region read replica with promotion is disaster recovery. Both topics are tested in SAP-C02 Domain 3, but they answer different question stems.

Can I really convert single-AZ RDS to Multi-AZ with zero downtime?

Yes, for the primary — but the application side must be resilient to DNS changes and to a brief commit latency increase while the standby is being seeded. The AWS operation itself does not pause the primary; reads and writes continue. The reliability improvement risk is on the client side: connection pools that cache DNS or hold stale connections may observe errors for the first few minutes after the change completes. Preemptively introducing RDS Proxy before the conversion removes most of this client-side reliability improvement risk.
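The client-side half of that risk comes down to retry policy. A minimal exponential-backoff-with-jitter schedule (all parameters illustrative) of the kind a connection layer needs so a failover surfaces as a few slow retries rather than hard errors:

```python
import random

# Full-jitter exponential backoff: each retry waits a uniform random time
# between 0 and min(cap, base * 2**attempt). Parameters are assumptions.
def backoff_schedule(max_attempts=5, base=0.5, cap=8.0, rng=None):
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_attempts)]

delays = backoff_schedule(rng=random.Random(0))
print([round(d, 2) for d in delays])  # five jittered waits under a rising cap
```

Jitter matters here: during a failover, every client retries at once, and synchronised retries can themselves overwhelm the freshly promoted primary.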

When should I choose Route 53 ARC routing controls over Route 53 failover records?

Use ARC routing controls when you need a deterministic, quorum-backed traffic flip that does not depend on health check evaluation timing — typically for high-stakes regional failovers where a confused-deputy failover would be worse than the original outage. Use Route 53 failover records for lower-stakes, automated failovers where health checks are sufficient. Reliability improvement maturity usually starts with failover records and graduates to ARC as the workload's criticality grows.

Do I need AWS Fault Injection Service on a small workload?

Yes, in direct proportion to how much weight your reliability improvement story has to bear. If you cannot prove via FIS that the new RDS Multi-AZ actually fails over, and that the application actually reconnects, the reliability improvement is on paper only. FIS is cheap (per-action-minute pricing), and a small monthly chaos experiment is the minimum reliability improvement validation standard for any workload backing an SLA. On the exam, FIS is the canonical answer whenever the stem says "validate" or "prove the resilience".

How do I prioritise reliability improvement when everything looks like a SPOF?

Rank SPOFs on two axes: blast radius if it fails (does it take down the whole workload, or one tier?) and probability of failure (is this component historically unreliable, near a quota, or running on unpatched software?). Fix high-blast, high-probability SPOFs first — almost always the data layer. Then attack the next ring: network egress, compute capacity, identity. Finally, automate remediation for known-recurring failure modes. Resilience Hub's ranked recommendations give you an opinionated starting list, which is usually the right reliability improvement backbone for the first 90 days.
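The two-axis ranking can be sketched as a score and a sort; the register entries and weights below are hypothetical (blast: 1 = one tier, 3 = whole workload; probability: a rough 0-1 estimate from incident history and quota headroom):

```python
# Hypothetical SPOF register scored on the two axes from the text.
spofs = [
    {"name": "single-AZ RDS",      "blast": 3, "probability": 0.4},
    {"name": "single NAT gateway", "blast": 2, "probability": 0.3},
    {"name": "sticky sessions",    "blast": 1, "probability": 0.5},
]

# Fix the highest blast-radius x probability products first.
ranked = sorted(spofs, key=lambda s: s["blast"] * s["probability"], reverse=True)
for s in ranked:
    print(s["name"], round(s["blast"] * s["probability"], 2))
```

With these illustrative weights the data layer ranks first, matching the canonical retrofit order in the 90-day plan.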

What is the difference between Resilience Hub and Trusted Advisor for reliability improvement?

Trusted Advisor gives account-wide reliability improvement signals — service limits, single-AZ RDS, ASG not using Multi-AZ, missing snapshots. It is broad and shallow; the core checks are free for all accounts, while the full check set requires a Business or Enterprise support plan. Resilience Hub is application-scoped, measures your declared RTO/RPO against resource-level configuration, and produces a policy-based "met / breached" verdict. Use Trusted Advisor for hygiene sweeps across the entire account; use Resilience Hub for per-workload SLA-driven reliability improvement plans.

Does moving sessions to DynamoDB introduce a new SPOF?

DynamoDB is a regional, multi-AZ managed service, so moving sessions to DynamoDB reduces your SPOF count rather than adding one. The reliability improvement you gain is: any compute instance can serve any user, and instance replacement is transparent to the user. The one new consideration is DynamoDB throughput — for on-demand tables, throughput scales automatically; for provisioned, you must set auto-scaling so a traffic spike does not throttle session reads. This is a reliability improvement trade in the right direction.

How do I know when reliability improvement work is done?

When every item in the Resilience Hub resilience policy shows "met", every FIS experiment in the catalogue passes cleanly under production load, every quota is below 70% utilisation with automated increase workflows, and the monthly SLA report shows the measured availability consistently above the target for three consecutive months. Reliability improvement is never "done" in the sense of "finished forever" — but it is "done for this policy level", and the next reliability improvement cycle begins only when the SLA target itself is raised.

Exam Signal Summary

For SAP-C02 Domain 3 reliability improvement questions, the canonical answer shape is: identify the SPOF class, pick the AWS-native retrofit service, justify why downtime is minimal, and prove the retrofit via assessment and chaos validation. The services you must have reflexes for are AWS Resilience Hub (assessment), AWS Fault Injection Service (validation), Route 53 ARC (deterministic failover), RDS Proxy (connection resilience), Multi-AZ and per-AZ NAT (infrastructure redundancy), ASG mixed instance types with Capacity Rebalance (compute capacity reliability improvement), DynamoDB session store (stateless tier enablement), and Service Quotas with CloudWatch (proactive limit reliability improvement). Master this set, and the reliability improvement questions become deterministic.

Official Sources