Performance Diagnosis Methodology: From Symptom to Root Cause
Performance tuning remediation is the discipline of finding, quantifying, and fixing performance regressions in systems that already exist in production. Unlike greenfield performance design, where you pick purpose-built databases and caching tiers on day one, it begins with a symptom that a customer, SRE, or dashboard has already caught. The SAP-C02 exam tests this skill almost exclusively through diagnostic question stems: a running workload, a measurable degradation, and a set of telemetry signals, followed by a question about which remediation sequence best restores the SLA without rearchitecting the entire stack.
The mental model this topic teaches is a four-stage loop — symptom, metric, bottleneck, root cause — and then a remediation sequence that prefers reversible changes over irreversible ones. Performance tuning remediation always respects the blast radius of the change: adding a CloudFront distribution in front of an existing ALB is reversible in minutes, but migrating RDS MySQL 5.7 to Aurora MySQL-compatible is a one-way door that must be scheduled and rehearsed. Throughout this topic we will return to a single anchor scenario — an e-commerce checkout whose p99 latency jumps from 200ms to 3 seconds at traffic peak — because it exercises every dimension of performance tuning remediation a Pro-level architect must know.
Performance tuning remediation is the process of restoring or improving an existing AWS workload's latency, throughput, or resource efficiency through targeted diagnostic instrumentation and minimum-blast-radius remediation, without a full rearchitecture. The SAP-C02 exam frames this under Domain 3, Task Statement 3.3. Reference: https://d1.awsstatic.com/training-and-certification/docs-sa-pro/AWS-Certified-Solutions-Architect-Professional_Exam-Guide.pdf
The Seven-Step Remediation Workflow
The performance tuning remediation workflow for any existing AWS system consists of seven disciplined steps. First, define the SLA in measurable terms — p50, p95, p99 latency, error rate, throughput per second. Second, capture a baseline from CloudWatch, Performance Insights, X-Ray, or DevOps Guru over a representative window. Third, localize the bottleneck to a tier — edge, ingress, compute, cache, database, storage. Fourth, identify the root cause within that tier using purpose-built diagnostics. Fifth, design the remediation with a rollback plan. Sixth, execute behind a feature flag or a change window. Seventh, validate that the new baseline holds for at least one full business cycle. Performance tuning remediation without step seven is how teams regress: a fix held for an hour is not a fix.
Why SLA Definition Precedes Tooling
Do not open CloudWatch first. Open the SLA definition first. If the business requires p99 checkout under 800ms but you only measure p50, no amount of performance tuning remediation will close the ticket. Reference: https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/performance-efficiency-pillar.html
A measurable SLA has four components: the metric (latency, throughput, error rate), the percentile (p50, p95, p99, p99.9), the window (per second, per minute, per hour), and the threshold (a hard number like 800 ms). Performance tuning remediation against a fuzzy SLA like "make it faster" generates political debate; performance tuning remediation against "p99 checkout latency under 800ms measured per minute" generates engineering consensus. Every Pro-level architect should push back on vague targets before accepting a tuning ticket.
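The four SLA components can be expressed directly in code. The sketch below is a minimal, illustrative helper (the names and thresholds are assumptions, not an AWS API) that evaluates "p99 checkout latency under 800 ms, measured per minute" against one minute of latency samples, using a nearest-rank percentile:

```python
# Hypothetical SLA check: metric = latency, percentile = p99,
# window = one minute of samples, threshold = 800 ms.

def percentile(samples_ms, p):
    """Nearest-rank percentile of a list of latency samples."""
    ranked = sorted(samples_ms)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

def sla_breached(samples_ms, p=99, threshold_ms=800):
    return percentile(samples_ms, p) > threshold_ms

# 1,000 requests in the window: 985 fast, 15 in the slow tail.
window = [200] * 985 + [3000] * 15
assert sla_breached(window)             # p99 catches the slow tail
assert not sla_breached(window, p=50)   # the median hides it entirely
```

The two assertions at the end are the whole argument of this section: the same traffic passes a p50 target and fails a p99 target, which is why the percentile must be written into the SLA before any tooling is opened.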
The Anchor Scenario: E-Commerce Checkout P99 Jumps from 200ms to 3s at Peak
Throughout this topic we will reason about a single anchor scenario because SAP-C02 questions almost always embed performance tuning remediation in a business-flavored narrative. The scenario: a mid-size e-commerce platform runs its checkout flow on an Application Load Balancer in front of an Auto Scaling group of EC2 m5.large instances, a shared RDS MySQL 5.7 db.m5.2xlarge single-AZ primary, a DynamoDB cart table, and an S3 bucket for product images. Steady-state traffic is 400 requests per second with p99 latency of 200 milliseconds. During Black Friday peak, traffic climbs to 2,400 requests per second and p99 latency rises to 3,000 milliseconds. Conversion drops 18 percent. The CTO wants performance tuning remediation delivered in two sprints with rollback on every change.
We will use this anchor scenario repeatedly. Every diagnostic tool and remediation pattern introduced below will be applied to this scenario at least once. By the end of this topic the reader will have a complete performance tuning remediation playbook, sequenced from lowest risk to highest risk, with measurable expected impact at each step.
A p50 latency report hides exactly the customers whose checkouts fail. If your median checkout is 250ms but your p99 is 3s, then 1 percent of carts — often your highest-intent buyers, who added many items — experience a three-second wait. Performance tuning remediation at Pro level always targets tail latency. Reference: https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
Translating Business Impact to Technical Targets
The CTO's 18 percent conversion drop translates to a specific latency SLA. Industry studies (Amazon, Google, Walmart) show checkout conversion drops roughly 1-2 percent per additional 100ms of p99 latency above a 500ms baseline. A 2.8-second regression (from 200ms to 3,000ms) crosses every known conversion threshold. Performance tuning remediation for this scenario has a fixed target: p99 checkout latency under 800ms at 2,400 RPS sustained for a four-hour peak window. Every remediation proposed below is scored against this target.
Localizing the Bottleneck Tier: Edge, Ingress, Compute, Cache, Database, Storage
Before any specific tool is used, performance tuning remediation demands a tier localization step. A request traversing an AWS web stack passes through six tiers in sequence: edge (DNS, CloudFront, Global Accelerator), ingress (ALB, NLB, API Gateway), compute (EC2, ECS, EKS, Lambda), cache (ElastiCache, DAX, CloudFront cache), database (RDS, Aurora, DynamoDB), and storage (EBS, EFS, S3). A 3-second p99 can live in any of these tiers or across several. Performance tuning remediation that starts at the wrong tier wastes a sprint.
The classical localization trick is the subtraction method. Measure latency at each tier boundary. ALB access logs give request processing and target response times separately. X-Ray traces annotate database call durations separately from compute time. CloudFront access logs record time-to-first-byte from the origin. The tier with the largest delta between baseline and peak is where performance tuning remediation begins. In our e-commerce anchor, ALB target response time went from 150ms to 2,800ms at peak while request processing stayed at 20ms — this localizes the bottleneck below the ALB, into compute or database.
For every existing system: (1) edge, (2) ingress, (3) compute, (4) cache, (5) database, (6) storage. Walk each tier with its native telemetry. Do not skip tiers — performance tuning remediation against the wrong tier is why a three-week sprint produces a three-percent gain. Reference: https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/monitoring.html
The Subtraction Method in Practice
Walk the request from client to storage, recording the latency attributable to each hop. CloudFront records time-taken in standard access logs and origin-fbl (origin first-byte latency) in real-time logs — origin first-byte latency is the ALB-and-below portion. ALB gives request_processing_time, target_processing_time, and response_processing_time as three separate fields. X-Ray segments assign latency to compute versus downstream calls. RDS Performance Insights attributes DB Load to wait events. Subtract backwards: total latency minus CloudFront edge time equals origin time; origin time minus ALB target response time equals non-target overhead; target response time minus X-Ray database segment time equals compute-only time. Performance tuning remediation targets whichever subtraction produces the largest residual.
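The ALB portion of the subtraction can be sketched against a single access-log entry. The field positions below follow the documented ALB log format (the three processing times are the sixth through eighth space-separated fields); the log line itself is fabricated for illustration:

```python
# Synthetic ALB access-log entry, truncated after the status codes.
SAMPLE = ("http 2024-11-29T18:00:01.123456Z app/checkout-alb/abc "
          "203.0.113.9:4431 10.0.1.5:80 0.020 2.800 0.001 200 200")

def alb_tier_latencies(line):
    f = line.split(" ")
    return {
        "request_processing_s": float(f[5]),   # request received -> sent to target
        "target_processing_s": float(f[6]),    # time the target took to respond
        "response_processing_s": float(f[7]),  # response received -> sent to client
    }

t = alb_tier_latencies(SAMPLE)
# Target processing dominates: the bottleneck is below the ALB, in
# compute or database, exactly as in the anchor scenario.
assert t["target_processing_s"] > 10 * t["request_processing_s"]
```

In the anchor scenario this is the one-line diagnosis: 20ms of request processing against 2,800ms of target processing localizes the regression below the load balancer.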
CloudWatch Contributor Insights for Hot Partitions and Hot Connections
CloudWatch Contributor Insights is the single most underused tool in performance tuning remediation on the SAP-C02 exam. Contributor Insights analyzes CloudWatch Logs in near real time and produces a ranked list of "top N contributors" by a key you define. The two canonical use cases are hot partition detection in DynamoDB and hot connection detection in any log stream that records client identifiers.
For DynamoDB hot partition detection, CloudWatch Contributor Insights for DynamoDB can be enabled per table and per global secondary index. Once enabled, Contributor Insights shows the most accessed partition keys in both reads and writes. When our e-commerce checkout's DynamoDB cart table shows that a single partition key — the loyalty-program "vip" tier shard — receives 60 percent of all writes, performance tuning remediation is clear: the partition key design is wrong, and the table needs a repartition strategy.
For hot connection detection, Contributor Insights rules can parse VPC Flow Logs or ALB access logs to rank client IPs, user agents, or URL paths by traffic volume. A single misbehaving client issuing 50 requests per second is a contributor that standard CloudWatch metrics cannot surface because it is invisible in aggregate.
Contributor Insights is billed per rule-hour and per million log events matched. It does not show up under CloudWatch Metrics — it has its own console section. For performance tuning remediation of hot partitions and hot keys, Contributor Insights is the definitive tool. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html
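A Contributor Insights rule is defined as a JSON body passed to PutInsightRule. The sketch below builds one that ranks client IPs in ALB access logs shipped to CloudWatch Logs; the log-group name is an assumption, and field 4 of an ALB log line is the client:port pair:

```python
import json

# Hypothetical rule body for hot-connection detection: rank the top
# contributors by ClientIp across ALB access-log events.
rule_body = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "LogGroupNames": ["/ecommerce/alb/access"],   # assumed log-group name
    "LogFormat": "CLF",
    "Fields": {"4": "ClientIp"},                  # field 4 = client:port
    "Contribution": {"Keys": ["ClientIp"], "Filters": []},
    "AggregateOn": "Count",
}

# Serialize and pass as RuleDefinition to cloudwatch:PutInsightRule
# (e.g. boto3's put_insight_rule).
print(json.dumps(rule_body, indent=2))
```

The same schema handles the DynamoDB-style hot-key hunt for any log stream: change the Keys to whatever identifier you want ranked, and Contributor Insights returns the top-N contributors in near real time.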
DynamoDB Hot Partition Detection and Repartition Strategy
A DynamoDB hot partition is a single physical partition receiving disproportionate traffic. DynamoDB's internal partition boundary is 10 GB of storage or 3,000 RCU / 1,000 WCU per partition, whichever is reached first. When a single partition key receives more than these limits, DynamoDB throttles — even when the table-level provisioned capacity is not exhausted. Adaptive Capacity redistributes unused capacity across partitions automatically, but it does not eliminate the hot partition problem entirely when write concentration is extreme.
Performance tuning remediation for hot partitions follows a fixed sequence. First, confirm the hot partition using Contributor Insights. Second, classify the hot key — is it a known hot key (a celebrity user, a VIP tenant) or an emergent hot key (a product going viral)? Third, apply the repartition strategy that matches the classification. For known hot keys, write-sharding appends one of N random suffixes (0 to N-1) to the partition key on write and fans out the read with N parallel GetItem calls. For emergent hot keys, the remediation is often to introduce a DAX cluster in front of the table to absorb reads. For storage hotness, consider a GSI that projects the data along a different access dimension.
The exam will offer "enable Adaptive Capacity" as a tempting option. Adaptive Capacity is always-on and cannot be turned off — it is a property of DynamoDB itself. If you see that option, it is a distractor. The correct performance tuning remediation for a hot partition is write-sharding or a partition key redesign. Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
Write-Sharding Implementation Pattern
Write-sharding expands a single partition key into N sharded keys by appending a random or hashed suffix. A write to tenant vip becomes a write to one of vip#0, vip#1, ... vip#9. Reads fan out to all N shards and merge results client-side. The shard count N is chosen as the ratio between the hot key's traffic and the 1,000 WCU per-partition ceiling, rounded up. For our e-commerce anchor where the vip key receives 60 percent of 2,400 writes per second, N of 2-3 shards is enough. Performance tuning remediation by write-sharding requires a data migration for existing items — use DynamoDB Streams plus a Lambda to dual-write during the transition.
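The write-shard/fan-out pair can be sketched with an in-memory stand-in for the cart table. The shard count N=3 and key names are illustrative; in production the writes go through the DynamoDB SDK and the N reads run in parallel:

```python
import random

N_SHARDS = 3
table = {}  # partition_key -> list of items (toy stand-in for DynamoDB)

def sharded_write(pk, item):
    """Spread writes for one hot logical key across N physical keys."""
    shard_key = f"{pk}#{random.randrange(N_SHARDS)}"
    table.setdefault(shard_key, []).append(item)

def fan_out_read(pk):
    """Read all N shards (in parallel in real code) and merge client-side."""
    items = []
    for i in range(N_SHARDS):
        items.extend(table.get(f"{pk}#{i}", []))
    return items

for n in range(100):
    sharded_write("vip", {"order": n})

assert len(fan_out_read("vip")) == 100   # merged view is complete
assert "vip" not in table                # no single hot key remains
```

The trade is explicit: writes to the "vip" key now spread across three partitions, at the cost of N reads per logical read and a dual-write migration for existing items.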
AWS X-Ray Service Map for Latency Bottleneck Isolation
The AWS X-Ray service map is the second pillar of performance tuning remediation. X-Ray instruments distributed applications and renders a node-and-edge diagram where each node is a service and each edge is a call. Latency, error rate, and throughput are displayed on each node and edge. When a single downstream call is contributing 80 percent of the total request latency, the service map shows it in one glance.
For our e-commerce anchor, adding the X-Ray SDK to the checkout Lambda and the order-service ECS task reveals a service map with five nodes: API Gateway, checkout Lambda, order-service, RDS MySQL, DynamoDB cart. The edge from order-service to RDS MySQL shows 2,400ms average latency during peak — confirming the bottleneck is the RDS MySQL call. The edge from checkout Lambda to DynamoDB shows 15ms, ruling out DynamoDB for the checkout path specifically even though the earlier Contributor Insights finding is still valid for cart operations.
X-Ray sampling rules control cost. The default rule samples one request per second plus 5 percent of the remainder. For performance tuning remediation of rare p99 outliers, create a custom rule that samples 100 percent of requests to the problem URL pattern for the diagnostic window, then revert. X-Ray trace analytics also support filtering traces by annotation — annotate traces with user tier, region, or feature flag to slice tail latency by cohort.
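The temporary 100-percent rule can be sketched as the parameter dict for CreateSamplingRule. The rule name and URL path are assumptions for the anchor scenario; delete the rule once the diagnostic window closes:

```python
# Hypothetical diagnostic sampling rule: capture every request to the
# checkout path for the duration of the investigation. Pass this dict to
# xray:CreateSamplingRule (e.g. boto3's create_sampling_rule).
checkout_rule = {
    "SamplingRule": {
        "RuleName": "checkout-p99-diagnosis",  # assumed rule name
        "ResourceARN": "*",
        "Priority": 10,          # lower numbers are matched first
        "FixedRate": 1.0,        # sample 100% of matching requests
        "ReservoirSize": 0,
        "ServiceName": "*",
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": "/api/checkout*",   # assumed problem URL pattern
        "Version": 1,
    }
}
```

Because sampling rules are matched in priority order, this rule intercepts checkout traffic before the default rule, and reverting is a single DeleteSamplingRule call.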
X-Ray Insights automatically groups related traces that share an anomaly pattern and produces a timeline. This is especially useful for intermittent performance tuning remediation cases where the p99 spike lasts only two minutes. Reference: https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html
Annotations Versus Metadata in Trace Filtering
X-Ray segments carry two flavors of user data: annotations (indexed, filterable) and metadata (stored but not indexed). Performance tuning remediation benefits from annotations because you can search traces by user tier, feature flag, or geographic region. Add annotations sparingly — X-Ray indexes up to 50 annotations per trace, and unused annotations inflate trace size. A good baseline: annotate every segment with user_tier, deployment_version, and region. Metadata is for diagnostic detail you rarely filter on.
Amazon RDS Performance Insights: Top SQL, Wait Events, TopSQL Digest
Amazon RDS Performance Insights is the mandatory tool for any performance tuning remediation scenario involving RDS, Aurora, or RDS Custom. It is enabled per DB instance at no cost for the default 7-day retention tier. Performance Insights measures DB Load, a single scalar representing how many active sessions are running at any moment. DB Load is broken down by Top SQL, Top Wait Events, Top Hosts, and Top Users.
The Top Wait Events panel is the most diagnostic-dense view in all of AWS. Wait events for MySQL include io/table/sql/handler, synch/cond/sql/MDL_context::COND_wait_status, synch/mutex/innodb/buf_pool_mutex, and hundreds more. When our e-commerce anchor's RDS MySQL 5.7 Performance Insights shows that 70 percent of DB Load during peak is blocked on synch/cond/sql/MDL_context::COND_wait_status, performance tuning remediation points at metadata-lock contention — typically a long-running DDL or a transaction holding a table lock.
The Top SQL panel produces a digest — queries normalized by replacing literals with placeholders — and ranks them by DB Load contribution. This digest is the SQL-level equivalent of Contributor Insights. A common finding is a single SELECT * FROM orders WHERE user_id = ? consuming 40 percent of DB Load — indicating a missing index on user_id.
Performance Insights gives 7 days of free history. Paid long-term retention extends to 24 months. For performance tuning remediation of a quarterly peak like Black Friday, the long-term retention tier is essential so that post-event analysis can compare year-over-year digests. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html
Slow Query Log Versus Performance Insights
The slow query log is the older mechanism — a text log of all queries exceeding long_query_time. It is still useful because Performance Insights does not capture query plans, only aggregate load. The remediation workflow is: use Performance Insights to rank Top SQL, then pull the specific query from the slow query log, then run EXPLAIN to see the plan, then add the index or rewrite the query. Performance tuning remediation for RDS almost always ends with an index change, a query rewrite, a connection-pool adjustment, or an instance class upgrade — in that order of preference.
Interpreting DB Load Relative to vCPU Count
DB Load is measured in average active sessions (AAS). A DB Load of 4.0 on an instance with 8 vCPUs means four sessions are concurrently active on average — if those sessions are on CPU, the instance is 50 percent CPU-saturated. A DB Load exceeding the vCPU count means sessions are queueing for CPU. Performance tuning remediation should always graph DB Load against the instance's vCPU count as a horizontal reference line — the ratio is the saturation metric. For our e-commerce anchor, the db.m5.2xlarge has 8 vCPUs and a peak DB Load of 14 AAS, indicating severe saturation.
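The arithmetic is worth making explicit, because the vCPU reference line turns DB Load from an abstract number into a saturation ratio:

```python
# DB Load vs. vCPU count for the anchor scenario: a db.m5.2xlarge has
# 8 vCPUs and peaks at 14 average active sessions (AAS).

def cpu_saturation(db_load_aas, vcpus):
    """Ratio > 1.0 means sessions are queueing, not just busy."""
    return db_load_aas / vcpus

assert cpu_saturation(4.0, 8) == 0.5     # half the CPU capacity in use
assert cpu_saturation(14.0, 8) == 1.75   # 75% over capacity: queueing
```

A ratio of 1.75 is the quantitative justification for the remediations that follow: either the load must shrink (indexes, caching, async offload) or the vCPU count must grow (instance class upgrade), and the preferred order is load-shrinking first.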
Amazon DevOps Guru: ML-Based Anomaly Detection for Existing Workloads
Amazon DevOps Guru is a machine learning service that continuously analyzes CloudWatch metrics, CloudTrail events, and AWS X-Ray traces for a set of covered resources, and emits reactive insights when anomalous behavior correlates across resources. For performance tuning remediation, DevOps Guru is the catch-all tool that finds what you did not know to look for.
DevOps Guru covers two resource scopes: an entire account, or a CloudFormation stack. For an existing system that grew organically, the account-wide scope is common; for a new microservice with a clean stack boundary, per-stack is cheaper. DevOps Guru pricing is per resource-hour analyzed. A reactive insight bundles an anomaly (CPU spike, increased DB Load, slow API latency), a set of related events (recent deploys, config changes, scaling activity), and a recommendation. The recommendation is often a link to AWS documentation or a specific mitigation.
In our e-commerce anchor, enabling DevOps Guru one week before Black Friday and letting it baseline traffic patterns would have flagged an insight three hours into the peak: "Elevated RDS DatabaseConnections correlated with application latency, recommended to enable RDS Proxy." Performance tuning remediation is faster when DevOps Guru has already done the cross-resource correlation.
DevOps Guru emits proactive insights (warning of likely future issues based on trends) and reactive insights (responding to current anomalies). Reactive insights drive performance tuning remediation today; proactive insights drive capacity planning. Reference: https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html
Baseline Warm-Up Period
DevOps Guru requires one to two weeks of baseline observation before its anomaly detection reaches full accuracy. Enabling it the morning of Black Friday is too late — the model has no idea what normal looks like. Performance tuning remediation planning should include DevOps Guru enablement as a T-minus-14-days step, not a T-zero step. Budget the per-resource-hour cost for the baseline window as part of remediation cost.
Compute Bottleneck Diagnosis and EC2 Rightsizing with Compute Optimizer
Compute Optimizer is the rightsizing tool for performance tuning remediation of EC2, Auto Scaling groups, EBS volumes, Lambda, and ECS tasks on Fargate. It consumes 14 days of CloudWatch metrics by default (93 days with enhanced infrastructure metrics enabled at additional cost) and emits three classes of recommendation: under-provisioned, over-provisioned, and optimized.
For our e-commerce anchor, Compute Optimizer analysis of the m5.large Auto Scaling group shows that during peak, CPU utilization sustains 92 percent and network-in runs at 70 percent of the instance limit. The recommendation is m5.xlarge or m5n.large (the network-optimized variant). Performance tuning remediation here is a simple launch-template update and a rolling replacement via ASG instance refresh — reversible, scheduled, measurable.
Compute Optimizer's EBS recommendations are similarly valuable. A gp2 volume consistently hitting its burst credit ceiling will receive a recommendation to migrate to gp3 with specified baseline IOPS and throughput. For Lambda, Compute Optimizer recommends memory-size changes and surfaces functions whose current memory is both a performance limiter and a cost inefficiency.
The default 14-day lookback misses monthly and quarterly peaks. Enabling Enhanced Infrastructure Metrics, which extends retention to 93 days, costs pennies per resource and catches exactly the Black Friday class of workload that silently over-provisions 11 months of the year. Reference: https://docs.aws.amazon.com/compute-optimizer/latest/ug/enhanced-infrastructure-metrics.html
Recommendation Confidence and Risk Rating
Compute Optimizer tags each recommendation with a risk rating: "Very Low", "Low", "Medium", or "High". Low and Very Low risk recommendations are safe for automated application. Medium risk requires a manual validation step. High risk should never be applied without load testing — the metrics are likely ambiguous or the workload pattern is unusual. Performance tuning remediation via Compute Optimizer should automate Low-risk changes via a change pipeline and escalate Medium and High to human review.
Analogy One — The Restaurant Kitchen at Dinner Rush
Performance tuning remediation at its heart is like running a busy restaurant kitchen at dinner rush. A customer complaint that "the pasta took an hour" is the p99 symptom. A good head chef does not start by yelling at the pasta station. She walks the line: is the ticket printer jammed (the ingress tier), is the prep station out of garlic (a dependency), is a single burner (partition) doing the work of three, is the dishwasher (storage) backed up, is one waiter (client) hogging the expeditor's attention (a hot connection)?
CloudWatch Contributor Insights is the chef noticing that Table 12 has ordered 40 percent of tonight's pasta — a hot partition. X-Ray is the chef timing each station: prep station 2 minutes, sauté 3 minutes, plating 40 minutes. RDS Performance Insights is the chef watching how long the sauté pan waits on the burner — a wait event. DevOps Guru is the sous-chef who has worked here for months and can say "this matches the night three weeks ago when we got slammed." Compute Optimizer is the inventory report that says "you have six burners but only three are on — fire up the other three." Performance tuning remediation in the kitchen is exactly the same loop as on AWS: symptom, metric, bottleneck, root cause, fix, validate.
Analogy Two — The Library Reference Desk at Exam Week
The second analogy that helps internalize performance tuning remediation is a university library's reference desk during final exam week. Steady state: 50 questions per hour, each answered in 3 minutes. Peak: 400 questions per hour, and the queue stretches to the entrance. The librarian's toolkit is the performance tuning remediation toolkit.
Add more reference librarians — EC2 Auto Scaling, except those librarians need training (warm-up time), so maintain a warm pool. Put FAQ cards on the front of the desk so 60 percent of questions can be self-served — a CloudFront caching retrofit in front of the existing ALB. Shard the desk: one librarian for science, one for humanities — write-sharding a hot partition. Install a digital ticketing system so students can submit questions asynchronously and pick up answers later — an SQS-backed worker replacing a synchronous N+1 call. Upgrade the card catalog from index cards to a searchable terminal — gp2 to gp3, MySQL 5.7 to Aurora.
Every tool in performance tuning remediation has a librarian analog. This is not a coincidence — both are human and machine queueing systems obeying Little's Law (concurrent-in-flight equals throughput times latency).
Analogy Three — The Freeway Interchange at Rush Hour
The third analogy is a multi-level freeway interchange during evening rush hour. A single slow merge ramp (a hot partition) creates backpressure across three upstream lanes. Adding a third lane without fixing the merge (scaling compute without fixing the database) just moves the jam. A high-occupancy-vehicle lane is a cache layer — a privileged path for frequently repeated trips. A carpool lot is a CDN — aggregating demand at the edge so only one vehicle makes the full trip. A toll reader with a backed-up queue is a synchronous payment call; replacing it with a transponder is replacing sync with async. Performance tuning remediation on AWS follows traffic engineering principles because both fields are applied queueing theory.
Analogy Four — The Swiss Army Knife of Diagnostic Tools
Performance tuning remediation is a Swiss Army knife of diagnostic blades, each optimized for one cut. CloudWatch standard metrics is the main blade — always there, always sharp enough. Contributor Insights is the tweezers — tiny, specialized, exactly right when you need to pull one hot key out of a pile. X-Ray is the magnifying glass — follows the thread of a single request through the whole system. Performance Insights is the corkscrew — purpose-built for databases, useless for anything else. DevOps Guru is the compass — tells you where you are before you try to go somewhere. Compute Optimizer is the scissors — trims oversized resources without drama. Performance tuning remediation mastery is knowing which blade to pull first.
Lambda Memory Tuning with AWS Lambda Power Tuning
Lambda performance tuning remediation has exactly one good tool: the AWS Lambda Power Tuning open-source project, which runs a state machine that invokes a function at every memory configuration from 128 MB to 10,240 MB and plots cost versus latency on a Pareto curve. Lambda memory controls both RAM and proportional CPU, so doubling memory often halves execution time — and when execution time halves, total cost stays roughly flat because the billed duration is halved while the per-ms price doubles.
Performance tuning remediation for Lambda in our e-commerce anchor is not an immediate priority — the Lambda latency is already sub-20ms. But the technique is universal. For a data-processing Lambda running at 512 MB with 4-second duration, Power Tuning might reveal 1,792 MB as the sweet spot — 1-second duration, same total cost, 4x better latency.
There is no separate CPU setting for Lambda. Allocating 1,769 MB of memory gives the function exactly one vCPU. Above that, multiple vCPUs. Performance tuning remediation for CPU-bound Lambda functions is memory tuning. Reference: https://docs.aws.amazon.com/lambda/latest/dg/configuration-memory.html
Pareto Strategies: Cost, Speed, Balanced
Power Tuning offers three optimization strategies. cost minimizes total bill per invocation. speed minimizes execution time regardless of cost. balanced picks the knee of the Pareto curve — best improvement per dollar. For latency-critical paths, pick speed. For batch workloads, pick cost. For most production performance tuning remediation, balanced is the right default.
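The cost-neutrality claim behind memory tuning is simple GB-second arithmetic, sketched below. The price constant is illustrative (check current Lambda pricing), and the halving of duration assumes a CPU-bound function:

```python
# Lambda compute is billed in GB-seconds, so for a CPU-bound function
# whose duration halves when memory doubles, compute cost is unchanged.
PRICE_PER_GB_S = 0.0000166667  # illustrative x86 on-demand rate

def invocation_cost(memory_mb, duration_s):
    return (memory_mb / 1024) * duration_s * PRICE_PER_GB_S

before = invocation_cost(512, 4.0)   # 512 MB, 4 s  -> 2 GB-s
after = invocation_cost(1024, 2.0)   # 2x memory, 2x faster -> 2 GB-s
assert before == after               # same compute bill, half the latency
```

This is why the balanced strategy so often lands on a higher memory setting than teams expect: the latency win is free whenever the workload scales with the extra CPU.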
Caching Layer Retrofit: Cache-Aside, Write-Through, and DAX
Caching retrofit is the highest-leverage performance tuning remediation in most existing AWS systems. The three caching patterns to know cold are cache-aside (lazy loading), write-through (synchronous cache update on write), and write-behind (asynchronous cache update after write).
Cache-aside is the default retrofit pattern. Application code checks the cache first; on miss, queries the origin, populates the cache with a TTL, and returns. Cache-aside is easy to retrofit because it requires only read-path changes — existing write paths continue to write directly to the origin. The drawback is cold-cache stampede: after a cache flush, every concurrent reader misses simultaneously and hits the origin.
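The cache-aside read path can be sketched in a few lines. A TTL'd in-memory dict stands in for ElastiCache Redis, and the origin function stands in for the catalog query; only the read path changes, exactly as the retrofit requires:

```python
import time

cache = {}  # key -> (value, expires_at); stand-in for Redis

def get_product(key, origin_fetch, ttl_s=60):
    """Cache-aside read: check cache, on miss query origin and populate."""
    hit = cache.get(key)
    if hit and hit[1] > time.monotonic():
        return hit[0]                                # cache hit
    value = origin_fetch(key)                        # miss: go to origin
    cache[key] = (value, time.monotonic() + ttl_s)   # populate with TTL
    return value

calls = []
def fake_origin(key):                 # stand-in for the catalog query
    calls.append(key)
    return {"sku": key, "price": 19.99}

get_product("sku-1", fake_origin)
get_product("sku-1", fake_origin)     # second read served from cache
assert calls == ["sku-1"]             # origin queried exactly once
```

The stampede drawback mentioned above lives in this exact code path: after a flush, every concurrent caller falls through to origin_fetch simultaneously, which is why production retrofits add jittered TTLs or a per-key lock.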
Write-through synchronously writes to both cache and origin on every write. Reads are always fresh. The retrofit cost is higher because every write path must be modified, and write latency increases because the cache is now on the critical path. Write-through is appropriate when reads vastly outnumber writes and cache staleness is intolerable.
For our e-commerce anchor, the performance tuning remediation sequence is: start with ElastiCache Redis in cache-aside mode for product-catalog reads (highest hit rate, most stale-tolerant), measure, then extend to write-through for cart state if cart read staleness is a business problem. DAX is a drop-in cache for DynamoDB — no application code change beyond swapping the SDK client — and is the right choice when DynamoDB is the bottleneck, not RDS.
DAX only caches eventually consistent reads. Strongly consistent reads bypass DAX and go directly to DynamoDB. If your application reads with ConsistentRead=true, DAX adds cost without benefit. Performance tuning remediation must verify the consistency mode before recommending DAX. Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.consistency.html
Migration from Cache-Aside to Write-Through
A common performance tuning remediation arc is migrating from cache-aside to write-through after the initial retrofit reveals staleness complaints. The migration sequence: (1) add write-through code paths behind a feature flag, (2) enable for 1 percent of writes, (3) verify cache and origin consistency, (4) ramp to 100 percent, (5) remove the feature flag and the old cache-aside write path. This is a classic dual-write migration and the same pattern used for any cache retrofit rollout.
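The percentage ramp in steps (2) and (4) can be sketched as a stable hash on the cart id, so each cart consistently takes one path during the transition. All names here are illustrative:

```python
import zlib

def in_ramp(cart_id: str, ramp_percent: int) -> bool:
    """Deterministic bucket: the same cart always lands on the same path."""
    return zlib.crc32(cart_id.encode()) % 100 < ramp_percent

def write_cart(cart_id, item, ramp_percent, write_through, cache_aside):
    if in_ramp(cart_id, ramp_percent):
        write_through(cart_id, item)   # new path: cache and origin together
    else:
        cache_aside(cart_id, item)     # old path: origin only

assert all(in_ramp(c, 100) for c in ("a", "b", "c"))     # fully ramped
assert not any(in_ramp(c, 0) for c in ("a", "b", "c"))   # flag off
```

A stable hash matters more than it looks: a random coin flip per write would let a single cart alternate between paths mid-session, which is exactly the inconsistency the verification step is trying to catch.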
DAX Cluster Sizing and Placement
DAX clusters must be placed in the same VPC as the DynamoDB-consuming application and use the same subnet group. Cluster node count should be sized for read throughput plus the desired redundancy — a three-node cluster offers one primary and two read replicas with automatic failover. Performance tuning remediation with DAX must size the item cache (TTL-based) and the query cache (parameterized-query result cache) separately. For our e-commerce anchor's DynamoDB cart table, DAX is an appropriate retrofit if and only if the application reads with ConsistentRead=false.
Asynchronous Migration: N+1 Sync Calls to SQS-Backed Workers
A pervasive performance tuning remediation pattern is replacing synchronous N+1 call chains with an asynchronous SQS-backed worker. The N+1 antipattern: a checkout API calls an inventory service once, then for each of N cart items calls the inventory service again to decrement stock. If each call is 50ms and the cart has 10 items, the sync-chain latency is 550ms just for inventory — before payment or shipping.
The performance tuning remediation is to decouple. The checkout API emits a single OrderPlaced event to SQS or EventBridge, returns 200 OK to the client, and a separate worker consumes the event and performs the N inventory decrements asynchronously. Client-observed latency drops to the cost of one enqueue — under 10ms. Consistency becomes eventual, which is acceptable for inventory-decrement when paired with a downstream reconciliation step.
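The decoupling can be sketched end to end with an in-memory deque standing in for SQS. The handler does one enqueue and returns; the worker performs the N decrements off the critical path:

```python
from collections import deque

queue = deque()                      # stand-in for an SQS queue
stock = {"sku-1": 10, "sku-2": 10}   # stand-in for the inventory store

def checkout(order_id, items):
    """Emit one OrderPlaced event; client-observed cost is one enqueue."""
    queue.append({"event": "OrderPlaced", "order_id": order_id, "items": items})
    return {"status": 200}

def worker_drain():
    """Consumer side: the N inventory decrements, now asynchronous."""
    while queue:
        event = queue.popleft()
        for sku, qty in event["items"].items():
            stock[sku] -= qty

assert checkout("o-1", {"sku-1": 2, "sku-2": 1})["status"] == 200
worker_drain()
assert stock == {"sku-1": 8, "sku-2": 9}
```

The consistency trade is visible in the two assertions: the 200 response is returned before the decrements happen, which is why the pattern needs the downstream reconciliation step the paragraph above describes.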
For our e-commerce anchor, the checkout flow synchronously calls six downstream services: inventory, loyalty, fraud, tax, shipping, email. Performance tuning remediation converts email, loyalty, and shipping to async SQS-backed workers. Inventory and fraud must remain synchronous because their decisions gate the transaction. Tax remains synchronous for legal accuracy. The result is a three-step synchronous critical path instead of six, cutting compute-tier latency roughly in half.
Default SQS short polling returns immediately with empty responses on idle queues — burning requests. Enable long polling with ReceiveMessageWaitTimeSeconds=20 when retrofitting SQS workers. Performance tuning remediation for SQS cost is often just flipping this one setting. Reference: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-short-and-long-polling.html
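A back-of-envelope sketch of why this one setting matters on an idle queue (the poll interval and the per-million-request price are assumptions for illustration — check current SQS pricing):

```python
# Empty-receive volume for one idle worker loop: short vs. long polling.

SECONDS_PER_MONTH = 30 * 24 * 3600
PRICE_PER_MILLION = 0.40  # assumed USD per million SQS requests

def monthly_receives(poll_interval_s: float) -> int:
    """ReceiveMessage calls issued by one worker polling continuously."""
    return int(SECONDS_PER_MONTH / poll_interval_s)

short = monthly_receives(0.1)   # tight short-polling loop, ~10 calls/sec
long_ = monthly_receives(20.0)  # ReceiveMessageWaitTimeSeconds=20

print(short, long_)  # 25,920,000 vs. 129,600 requests per month
print(round(short / 1e6 * PRICE_PER_MILLION, 2))  # dollars/month per idle worker
```

Two hundred times fewer requests per idle worker, from a one-line queue attribute change.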
Classifying Calls as Sync-Mandatory Versus Async-Safe
Not every downstream call can be made async. The classification rule: if the caller needs the callee's result to form the HTTP response to the client, it must stay synchronous. If the call is side-effect-only (email, audit log, analytics), it is async-safe. Fraud checks block the transaction decision — sync-mandatory. Shipping quote is informational for a post-checkout email — async-safe. Performance tuning remediation correctness depends on classifying every call correctly; the wrong classification creates subtle data-loss bugs.
CDN Retrofit: CloudFront in Front of an Existing Uncached Origin
CDN retrofit is the most reversible high-impact performance tuning remediation in the Pro toolkit. CloudFront sits in front of any HTTP origin — ALB, S3, custom origin — and caches responses at 600+ edge locations. Retrofit is reversible because disabling the distribution or pointing DNS back at the origin takes minutes.
For an uncached origin, the retrofit sequence is: (1) create a CloudFront distribution with the existing origin, (2) define cache behaviors matching URL patterns — static assets /static/* with a 1-year TTL, API paths /api/* with a 0-second TTL or a short TTL based on business freshness requirements, (3) configure origin request and cache key policies to control which headers and cookies are forwarded, (4) enable real-time logs, (5) cut DNS over with a weighted Route 53 record starting at 10 percent, (6) monitor cache hit ratio, (7) ramp to 100 percent.
For our e-commerce anchor, product detail pages are read-dominant — 95 percent of traffic is cacheable with a 60-second TTL. CloudFront retrofit drops origin load by 80 percent at the edge tier, which directly relieves the ALB, the EC2 fleet, and the RDS backing the product catalog. This single remediation is often worth 40 percent p99 improvement on read-heavy workloads.
A misconfigured cache key that includes all cookies will produce a cache hit ratio near zero because every session cookie creates a distinct cache entry. Performance tuning remediation for a low-hit-ratio CloudFront retrofit is almost always a cache-key policy fix. Reference: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/controlling-the-cache-key.html
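A toy model of the failure mode makes it concrete (paths and cookie values are hypothetical):

```python
# Why including all cookies in the cache key collapses the hit ratio:
# every distinct session cookie mints a distinct cache entry.

def cache_key(path: str, cookies: dict, include_cookies: bool) -> tuple:
    if include_cookies:
        return (path, tuple(sorted(cookies.items())))
    return (path,)

# 1,000 users request the same product page with unique session cookies.
requests = [("/product/42", {"session": f"user-{i}"}) for i in range(1000)]

bad = {cache_key(p, c, include_cookies=True) for p, c in requests}
good = {cache_key(p, c, include_cookies=False) for p, c in requests}

print(len(bad), len(good))  # 1000 distinct entries (hit ratio ~0%) vs. 1 entry
```

With cookies in the key, the first request per user is always a miss; with cookies excluded from the key (but optionally still forwarded via the origin request policy), every request after the first is a hit.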
CloudFront Caching Policy and Origin Request Policy
CloudFront separates the cache key (what identifies a cached object) from the origin request (what gets sent to the origin on a cache miss). This separation is a Pro-level feature: you can forward a header to the origin for logging without including it in the cache key. The managed cache policies — CachingOptimized, CachingDisabled, CachingOptimizedForUncompressedObjects — cover most caching cases, and managed origin request policies such as UserAgentRefererHeaders and AllViewer cover common forwarding cases. Custom policies handle the rest.
For performance tuning remediation, the CachingOptimized managed policy is the default starting point. It excludes cookies and query strings from the cache key, maximizing hit ratio. Override only when the application requires query-string-based content variation.
Origin Shield for Reducing Origin Load Further
Origin Shield is a regional caching layer between CloudFront edge locations and the origin. With Origin Shield enabled, edge location cache misses funnel through a single regional cache before reaching the origin. Performance tuning remediation for origins seeing thundering-herd traffic during cache misses can reduce origin load by an additional 30-50 percent via Origin Shield. The trade-off is latency on Origin Shield cache misses (extra regional hop) and the per-request fee. Enable Origin Shield only when the origin itself is the bottleneck.
Database Engine Upgrade Strategy: MySQL 5.7 to 8 and RDS to Aurora
The largest-blast-radius performance tuning remediation in the database tier is an engine upgrade. Two common upgrade paths on SAP-C02: RDS MySQL 5.7 to RDS MySQL 8.0, and RDS MySQL to Aurora MySQL-compatible.
MySQL 5.7 reached standard support end-of-life in February 2024. RDS Extended Support extends the life by up to three years at additional per-vCPU-hour cost. Performance tuning remediation that involves an engine version upgrade is also a license and lifecycle decision. The technical upgrade is a major-version upgrade performed via the RDS console or API — it is an offline operation requiring a maintenance window, typically 15 to 60 minutes of unavailability. Pre-upgrade dry runs are mandatory because deprecated SQL syntax, character set changes, and reserved-word additions between 5.7 and 8.0 can silently break applications.
Migrating from RDS to Aurora is a different operation. Aurora MySQL-compatible is a distinct engine with a distributed storage layer that replicates six copies of the data across three Availability Zones. For our e-commerce anchor, performance tuning remediation toward Aurora delivers three wins: up to 5x throughput on the same instance class, failover typically completed in under 30 seconds, and up to 15 read replicas with replication lag typically in the tens of milliseconds. The migration path is RDS snapshot restore into Aurora, DMS change-data-capture to keep in sync, cutover during a change window, rollback plan via DNS to the old RDS.
Aurora has a different cost model (per IO charges on Aurora Standard, all-inclusive on Aurora I/O-Optimized) and a different failover model (storage-level, not instance-level). Performance tuning remediation toward Aurora requires sizing the workload against the cost model first. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html
Blue/Green Deployment for Database Upgrades
RDS Blue/Green Deployments (released 2022) create a staging replica of the current database, apply the upgrade to the staging copy, and allow a one-minute-or-less cutover switch. Performance tuning remediation via major-version upgrade should always use Blue/Green when available — it converts an hour of downtime into a minute of downtime and provides instant rollback. Blue/Green supports engine upgrades, parameter group changes, and instance class changes in a single operation.
Storage Layer Remediation: gp2 to gp3 and Provisioned IOPS Evaluation
EBS volume type migration is a low-risk, high-reward performance tuning remediation. gp2 performance is tied to volume size — 3 IOPS per GiB (minimum 100) up to a 16,000 IOPS ceiling reached at 5,334 GiB, with volumes under 1,000 GiB able to burst to 3,000 IOPS only while burst credits last. gp3 decouples performance from size — every gp3 volume starts with 3,000 IOPS and 125 MB/s throughput, independent of size. You pay extra only if you provision beyond those baselines, up to 16,000 IOPS and 1,000 MB/s.
For most workloads, gp3 is cheaper than gp2 at equal performance. The migration is performed in-place via volume modification — no detach, no snapshot restore. Existing data is preserved. Downtime is zero. This is the purest example of a reversible performance tuning remediation change.
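The published baseline formulas make the comparison concrete — a sketch of gp2's size-coupled rule against gp3's flat baseline:

```python
# Baseline IOPS by volume size: gp2 scales with size, gp3 is flat.

def gp2_baseline_iops(size_gib: int) -> int:
    # 3 IOPS per GiB, floor of 100, ceiling of 16,000 (hit at 5,334 GiB)
    return max(100, min(16_000, 3 * size_gib))

def gp3_baseline_iops(size_gib: int) -> int:
    # Every gp3 volume starts at 3,000 IOPS regardless of size
    return 3_000

for size in (100, 500, 1_000, 5_334):
    print(f"{size:>5} GiB  gp2={gp2_baseline_iops(size):>6}  gp3={gp3_baseline_iops(size)}")
```

Note the crossover: below 1,000 GiB, gp3's flat 3,000 IOPS beats gp2's baseline outright, which is why small gp2 volumes that have exhausted burst credits are the first candidates for migration.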
For I/O-intensive databases, io2 Block Express scales to 256,000 IOPS and 4,000 MB/s per volume, with sub-millisecond latency and 99.999 percent durability. Performance tuning remediation for RDS instances pegging at the EBS limit should evaluate io2 Block Express before upgrading the instance class, because the I/O ceiling — not CPU — is often the true bottleneck.
aws ec2 modify-volume triggers an asynchronous background migration that can take hours for large volumes but does not require any downtime. Application I/O continues throughout. Performance tuning remediation via gp2 to gp3 should be the first EBS remediation tried. Reference: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-modify-volume.html
Provisioned IOPS Evaluation
Provisioned IOPS volumes (io1, io2, io2 Block Express) guarantee IOPS at all times. gp3 provisions IOPS as a ceiling but performance is best-effort within that ceiling. For database workloads requiring sustained high IOPS with tight tail-latency SLAs, io2 is the correct choice. For batch workloads tolerant of variability, gp3 is cheaper.
S3 Prefix Performance and Request Rate
S3 bucket request rate scales to 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD per prefix per second. Performance tuning remediation for S3-bottlenecked workloads is often a prefix redesign — spreading objects across many prefixes rather than concentrating under a single one. Modern S3 auto-partitions internally, so the old recommendation to hash-prefix keys is less critical than before, but workloads exceeding 5,500 GETs per second on a single prefix still benefit from partitioning.
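When a single prefix does exceed the per-prefix ceiling, a hash-prefix scheme spreads the keys; the shard count and key layout below are illustrative assumptions:

```python
# Hash-prefix key sharding: distribute objects across N prefixes so the
# aggregate request rate scales to N times the per-prefix limit.

import hashlib

SHARDS = 16  # ~16 x 5,500 = 88,000 GET/s aggregate ceiling

def sharded_key(object_id: str) -> str:
    """Derive a stable two-digit shard prefix from the object id."""
    digest = hashlib.md5(object_id.encode()).hexdigest()
    shard = int(digest, 16) % SHARDS
    return f"{shard:02d}/{object_id}"

print(sharded_key("order-123.json"))  # deterministic "NN/order-123.json" key
```

Because the shard is derived from the object id, readers can recompute the full key without a lookup table — the same property that makes the scheme easy to roll back by simply writing new objects unsharded.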
Auto Scaling Retrofit and Warm Pools
Auto Scaling group retrofit is the performance tuning remediation for any existing EC2 fleet sized for peak rather than average. The retrofit sequence: create an AMI from a representative existing instance, create a launch template referencing that AMI, create an ASG using the launch template, set a target-tracking scaling policy (typically 50 percent CPU), and migrate traffic to the new ASG behind the existing ALB target group.
Target tracking is the preferred scaling policy for retrofit because it requires no tuning — just pick the target metric value. Step scaling requires calibrating step sizes. Simple scaling is legacy. Predictive scaling uses historical CloudWatch data to warm capacity ahead of anticipated peaks — essential for Black Friday class workloads.
Warm pools are the retrofit for slow-starting applications. A warm pool maintains pre-initialized instances in a stopped or hibernated state. When the ASG scales out, warm-pool instances start in seconds instead of the 5-10 minute cold-boot-and-bootstrap cycle. Performance tuning remediation for scale-out latency during peak is often a warm pool with a maintained size equal to 20 percent of the ASG maximum.
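The 20-percent rule of thumb from the text as a sizing helper (the fraction is this text's heuristic, not an AWS default):

```python
# Warm pool sizing heuristic: keep a fraction of the ASG maximum as
# pre-initialized, stopped instances ready for fast scale-out.

import math

def warm_pool_size(asg_max: int, fraction: float = 0.20) -> int:
    """Round up so the pool never undershoots the heuristic."""
    return math.ceil(asg_max * fraction)

print(warm_pool_size(50))  # 10 stopped-but-initialized instances
print(warm_pool_size(12))  # 3
```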
ASG instance refresh replaces instances in rolling waves with configurable minimum healthy percentage and health check grace period. Performance tuning remediation changes to launch templates can be rolled out gradually and rolled back via instance refresh stop. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html
Predictive Scaling Versus Reactive Scaling
Predictive scaling uses up to 14 days of historical CloudWatch data to forecast load patterns and pre-provision capacity ahead of the predicted peak. Reactive scaling (target tracking, step) responds to current metrics. Performance tuning remediation for cyclical workloads (daily, weekly) should layer predictive on top of reactive — predictive handles the expected rise, reactive handles the unexpected spike. For our e-commerce anchor, predictive scaling trained on the prior two weeks of data would have warmed the fleet 30 minutes before the 8 p.m. Black Friday peak.
Remediation Sequence: A Priority-Ordered Playbook
Performance tuning remediation is not a single action — it is a sequence. SAP-C02 questions often hand the candidate a scenario with four or five candidate remediations and ask which sequence best balances impact, risk, and reversibility. The Pro-level sequencing principle: reversible before irreversible, cheap before expensive, diagnostic before remedial, reads before writes.
For our e-commerce anchor, the priority-ordered playbook is:
Step one — enable diagnostics. Turn on RDS Performance Insights with long-term retention, enable CloudWatch Contributor Insights on the DynamoDB cart table and ALB access logs, enable X-Ray on the checkout Lambda and order-service ECS task, enable DevOps Guru at account scope. Cost: tens of dollars per month. Risk: zero. Reversibility: instant.
Step two — CloudFront retrofit. Put CloudFront in front of the ALB with CachingOptimized policy for static paths and a short TTL for product catalog paths. Expected impact: 30-50 percent p99 reduction on read-heavy endpoints. Risk: low. Reversibility: DNS cutover.
Step three — ElastiCache Redis retrofit in cache-aside mode. Target the top three SQL digests from Performance Insights. Expected impact: 40-60 percent RDS load reduction on read-dominated queries. Risk: medium. Reversibility: feature flag.
Step four — EBS gp2 to gp3 migration on RDS and EC2. Expected impact: stable IOPS, 20 percent storage cost reduction. Risk: zero. Reversibility: reverse the modify-volume.
Step five — SQS async migration for non-critical downstream calls (email, loyalty, shipping). Expected impact: 40-50 percent compute-tier latency reduction. Risk: medium. Reversibility: feature flag.
Step six — DynamoDB repartition for the hot cart partition. Write-sharding with a random suffix on known hot tenant IDs. Expected impact: eliminates throttling. Risk: medium-high (write-path change). Reversibility: dual-write migration.
Step seven — EC2 rightsizing per Compute Optimizer recommendations. Risk: low. Reversibility: launch template rollback via instance refresh.
Step eight — RDS Multi-AZ conversion (if not already) and consider Aurora migration for next quarter. High blast radius, long lead time — plan but do not rush.
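Step six's write-sharding can be sketched as follows (the shard count, tenant IDs, and key format are illustrative assumptions):

```python
# Write sharding for a hot DynamoDB partition: append a random suffix on
# writes for known hot tenants; scatter-gather across all suffixes on reads.

import random

SHARDS = 8
HOT_TENANTS = {"tenant-big-retailer"}  # identified via Contributor Insights

def write_pk(tenant_id: str) -> str:
    """Partition key for a write: hot tenants fan out across SHARDS keys."""
    if tenant_id in HOT_TENANTS:
        return f"{tenant_id}#{random.randrange(SHARDS)}"
    return tenant_id

def read_pks(tenant_id: str) -> list:
    """All partition keys to query when reading a tenant's items back."""
    if tenant_id in HOT_TENANTS:
        return [f"{tenant_id}#{s}" for s in range(SHARDS)]
    return [tenant_id]

print(len(read_pks("tenant-big-retailer")))  # 8 queries to merge on read
```

The read-side cost — eight queries merged instead of one — is the price of eliminating the write-path throttling, and is why the playbook rates this step medium-high risk with a dual-write migration as the rollback.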
Reversible before irreversible. Cheap before expensive. Diagnostic before remedial. Reads before writes. Apply these in order when the exam asks which remediation to do first. Reference: https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/performance-efficiency-pillar.html
Cumulative Impact Modeling
The playbook produces a cumulative expected improvement. Each step has a multiplicative effect on the p99 latency of the workload segment it touches. Modeling the total impact is not simple addition — CloudFront reduces origin load by 80 percent on read paths, so subsequent ElastiCache retrofit operates on the remaining 20 percent. Performance tuning remediation planning should use a spreadsheet model that tracks latency by path segment before and after each step, with the final p99 computed as the weighted average across all paths.
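A minimal version of that spreadsheet model — the traffic shares, baseline latencies, and per-step multipliers are illustrative assumptions:

```python
# Multiplicative impact model: each remediation scales the latency of the
# path segment it touches; the headline figure is the traffic-weighted
# average across paths, recomputed after each step.

paths = {
    # path: (traffic share, baseline p99 in ms)
    "product-read": (0.70, 1200),
    "checkout":     (0.20, 3000),
    "other":        (0.10, 800),
}

steps = [
    {"product-read": 0.50},                   # CloudFront retrofit
    {"product-read": 0.60, "checkout": 0.80}, # ElastiCache cache-aside
    {"checkout": 0.55},                       # SQS async migration
]

latency = {p: ms for p, (_, ms) in paths.items()}
for step in steps:
    for path, mult in step.items():
        latency[path] *= mult

weighted = sum(share * latency[p] for p, (share, _) in paths.items())
baseline = sum(share * ms for share, ms in paths.values())
print(round(weighted), round(baseline))      # 596 vs. 1520 ms
print(f"improvement: {1 - weighted / baseline:.0%}")
```

Notice that ElastiCache's 0.60 multiplier applies to the latency CloudFront already halved — the multiplicative stacking the text warns is not simple addition.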
Plain Language Explanation: What Performance Tuning Remediation Really Is
Performance tuning remediation sounds like a technical term, but in plain language it just means "your system is slow and you need to fix it without throwing it away." Every real production system eventually gets slow. It might be slow because traffic grew, because data grew, because code got more complex, or because some once-reasonable choice (single-AZ RDS, no caching, synchronous email sending) hit a wall at scale.
Think of performance tuning remediation as a home renovation. You do not tear down the house. You find the room that is the problem — maybe the kitchen is too small, maybe the plumbing backs up when three showers run at once — and you renovate just that room with minimum disruption to the rest of the household. On AWS, the tools for diagnosis are CloudWatch (the thermostat and energy meter), X-Ray (the stud-finder that shows where the wires actually run), Performance Insights (the plumber's video scope), DevOps Guru (the home inspector who has seen a thousand houses), and Compute Optimizer (the right-sizing calculator).
The three analogies earlier in this topic — the restaurant kitchen, the library reference desk, the freeway interchange — are not just pedagogical. Each captures a true property of performance tuning remediation. The kitchen teaches workflow and station isolation. The library teaches queueing and self-service caching. The freeway teaches that a bottleneck upstream determines everything downstream. Performance tuning remediation mastery is fluency in applying all three mental models to whatever scenario the exam or real production throws at you.
One more plain-language principle. You will often see performance tuning remediation scenarios where the "correct" answer is not the most dramatic one. The exam rewards the minimum-viable change that meets the SLA. If CloudFront retrofit fixes p99 for 90 percent of requests, that is the correct answer — not "rearchitect to Aurora Global Database." The architect's job is to stop the bleeding with the smallest incision.
Common Traps in Performance Tuning Remediation Questions
Trap one — offering "enable DynamoDB Adaptive Capacity" as a fix for hot partitions. Adaptive Capacity is always on; it is a distractor. Trap two — recommending DAX for a strongly consistent read workload. DAX only caches eventually consistent reads. Trap three — suggesting gp2 volume resizing to fix IOPS. gp3 decouples IOPS from size — resize is not the fix. Trap four — CloudFront with a cache key that includes all cookies. Hit ratio collapses. Trap five — Aurora migration proposed as a day-one fix. Aurora is a multi-week migration. Trap six — RDS read replicas proposed for write-heavy workloads. Read replicas do not scale writes. Trap seven — configuring Lambda reserved concurrency and expecting it to eliminate cold starts. Reserved concurrency is a cap, not a guarantee of warmth; only provisioned concurrency keeps execution environments initialized. Trap eight — X-Ray enabled everywhere at 100 percent sampling. Cost explosion. Performance tuning remediation requires sampling discipline. Reference: https://d1.awsstatic.com/training-and-certification/docs-sa-pro/AWS-Certified-Solutions-Architect-Professional_Exam-Guide.pdf
Validating the Remediation: Load Testing and Golden Signals
Performance tuning remediation is incomplete without validation. A fix held for an hour at peak is not a fix — it must survive a full business cycle. The validation toolkit has two parts: synthetic load generation and production golden-signal monitoring.
For synthetic load, Distributed Load Testing on AWS (a Solutions Library offering) and open-source tools like k6, Locust, and JMeter generate parameterized traffic profiles. For our e-commerce anchor, a valid load test replays 2,400 requests per second with realistic product-catalog access patterns and realistic cart-size distributions — not a uniform flat load that misses the hot-partition signal entirely.
For production monitoring, the four golden signals (Google SRE) are latency, traffic, errors, and saturation. Every performance tuning remediation should map to a CloudWatch composite alarm covering these four signals, with anomaly-detection thresholds where static thresholds are too rigid. The alarm triggers a runbook (Systems Manager document) that either auto-remediates or pages on-call.
Every performance tuning remediation must be paired with a CloudWatch composite alarm covering latency, traffic, errors, saturation. No alarm means no proof of sustained fix. Reference: https://sre.google/sre-book/monitoring-distributed-systems/
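In miniature, the "held for a full business cycle" check is a conjunction over every interval in the validation window (the signal names and thresholds below are illustrative assumptions):

```python
# Golden-signals gate: a remediation "held" only if latency, errors, and
# saturation stayed inside SLA for every interval of the validation window.

SLA = {"latency_p99_ms": 400, "error_rate": 0.01, "saturation": 0.80}

def remediation_held(samples) -> bool:
    """samples: one dict of signal readings per monitoring interval."""
    return all(
        s["latency_p99_ms"] <= SLA["latency_p99_ms"]
        and s["error_rate"] <= SLA["error_rate"]
        and s["saturation"] <= SLA["saturation"]
        for s in samples
    )

good_day = [{"latency_p99_ms": 350, "error_rate": 0.002, "saturation": 0.60}] * 24
one_bad_hour = good_day[:-1] + [
    {"latency_p99_ms": 2900, "error_rate": 0.002, "saturation": 0.60}
]
print(remediation_held(good_day))      # True
print(remediation_held(one_bad_hour))  # False — held for 23 hours is not held
```

A CloudWatch composite alarm implements this same all-signals conjunction continuously rather than over a batch window.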
Realistic Load Profile Design
A load test that issues uniform random requests across all product pages will never reproduce the hot-partition phenomenon that causes the real p99 spike. Performance tuning remediation validation requires load profiles that match the production traffic distribution: a Zipf-distributed popularity curve where the top 1 percent of products receive 50 percent of traffic. Tools like k6 support weighted request generators for this purpose. Without a realistic profile, a load test's green light is a false positive.
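A Zipf-weighted generator of the kind such weighted samplers implement can be sketched in a few lines (the catalog size, exponent, and seed are illustrative assumptions):

```python
# Zipf-weighted product sampler: rank-r popularity proportional to 1/r^s.
# With s=1 over a 10,000-item catalog, the top 1% draws roughly half the
# traffic — the shape a uniform load test fails to reproduce.

import itertools
import random

CATALOG = 10_000
ZIPF_S = 1.0  # tune until the top-1% share matches production telemetry

weights = [1 / rank ** ZIPF_S for rank in range(1, CATALOG + 1)]
cum_weights = list(itertools.accumulate(weights))

def sample_products(rng: random.Random, k: int):
    """Draw k product ids, rank 1 = most popular."""
    return rng.choices(range(1, CATALOG + 1), cum_weights=cum_weights, k=k)

rng = random.Random(42)
hits = sample_products(rng, 100_000)
top_1pct_share = sum(1 for p in hits if p <= CATALOG // 100) / len(hits)
print(f"top 1% of products received {top_1pct_share:.0%} of traffic")
```

Feeding these ids into the request generator reproduces the hot-key concentration that drives the real p99 spike, which is exactly what the validation run must exercise.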
Rightsizing Workflow: Compute Optimizer + 14-Day Metrics + Change Window
A final Pro-level performance tuning remediation pattern is the rightsizing workflow — the disciplined sequence for acting on Compute Optimizer recommendations without breaking production. Step one: review recommendations in the Compute Optimizer console, filter for high-confidence recommendations (the "Medium risk" or lower rating). Step two: cross-reference with 14 days of CloudWatch metrics to confirm the recommendation matches the actual traffic profile — a 14-day window that happens to span a low-traffic vacation week will under-recommend. Step three: plan a change window with rollback prepared. Step four: execute via launch template update and instance refresh. Step five: monitor for one business cycle before accepting the change.
For the e-commerce anchor, Compute Optimizer's m5.large to m5.xlarge recommendation is high-confidence, matches observed 92 percent peak CPU, and has a trivial rollback. Performance tuning remediation here is a one-afternoon change. The same workflow applied to an RDS instance-class upgrade is a weekend change — same methodology, different blast radius.
Frequently Asked Questions
What is the difference between performance tuning remediation and performance architecture for new solutions?
Performance architecture for new solutions (Domain 2.5) is greenfield design — you pick Aurora, DAX, CloudFront, and purpose-built databases on day one. Performance tuning remediation (Domain 3.3) is retrofit — you inherit an RDS single-AZ MySQL 5.7 with no caching and remediate incrementally with minimum blast radius. The SAP-C02 exam distinguishes them by framing: "new solution" versus "existing system." Performance tuning remediation almost always prefers additive, reversible changes like CloudFront retrofit and ElastiCache cache-aside over wholesale rearchitecture.
When should I use CloudWatch Contributor Insights versus CloudWatch Metrics?
Use CloudWatch Metrics for aggregate dimensions — CPU, latency, connection count. Use Contributor Insights when you need to rank "top N" of something — top partition keys in DynamoDB, top source IPs in VPC Flow Logs, top URL paths in ALB logs. Metrics answer "how much?" Contributor Insights answers "who or what is responsible?" Performance tuning remediation of hot-key phenomena — the single partition doing 60 percent of the work — is a Contributor Insights job, not a Metrics job.
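The "who is responsible" question is at heart a ranked-counter operation, which Contributor Insights performs over log streams; the idea in miniature (the partition keys and counts are hypothetical):

```python
# Metrics answer "how much"; a top-N ranking answers "who".

from collections import Counter

# Toy access log: reads by DynamoDB partition key.
access_log = ["cart#42"] * 600 + ["cart#7"] * 250 + ["cart#99"] * 150

total = len(access_log)                    # the aggregate metric: 1,000 reads
key, count = Counter(access_log).most_common(1)[0]   # the contributor ranking
print(f"{key} is {count / total:.0%} of all traffic")  # the 60% hot key
```

The aggregate alone (1,000 reads) would never reveal that a single key accounts for 60 percent of them — which is precisely the signal that justifies the write-sharding remediation.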
Is Amazon DevOps Guru worth the cost for performance tuning remediation?
For account-wide coverage on a production workload with significant engineering investment, yes. DevOps Guru ML baselines take one to two weeks to reach full accuracy, so enable it well before an anticipated peak. For short-lived workloads or pre-production environments, the cost outweighs the benefit — rely on CloudWatch alarms and X-Ray instead. Performance tuning remediation benefits most from DevOps Guru when the operator is not a dedicated performance engineer — DevOps Guru captures cross-resource correlations that require experienced intuition otherwise.
How do I choose between CloudFront, Global Accelerator, and ElastiCache for latency remediation?
CloudFront caches content at the edge — solves latency for cacheable HTTP content. Global Accelerator uses the AWS backbone to shorten the network path — solves latency for non-cacheable traffic to regional endpoints (TCP or UDP, including gaming and WebSocket). ElastiCache solves latency for repeated expensive database reads inside the application tier. They are complementary, not alternatives. Performance tuning remediation for a slow API often layers all three: CloudFront in front for cacheable GETs, Global Accelerator for non-cacheable API calls from distant regions, ElastiCache inside the application for database offload.
Why is the e-commerce anchor scenario structured as p99 at peak rather than average latency?
Because tail latency is what customers feel. A p50 of 250ms with p99 of 3s means 1 percent of customers — often high-intent bulk buyers — experience three-second waits. Conversion studies show that p99 degradation correlates tightly with cart abandonment, while average latency correlates weakly. Performance tuning remediation at Pro level always targets tail latency. The SAP-C02 exam will specifically use p95 and p99 in question stems to distinguish candidates who optimize for averages (wrong) from those who optimize for tails (correct).
When is it correct to migrate from RDS to Aurora as part of performance tuning remediation?
When the workload is read-heavy enough to benefit from Aurora's up-to-15 read replicas with replication lag typically in the tens of milliseconds, or when write throughput exceeds what a single RDS writer can sustain on the largest available instance class, or when the cost of Multi-AZ storage replication at the RDS level exceeds Aurora I/O-Optimized on equivalent hardware. Aurora migration is a multi-week project — DMS setup, CDC validation, cutover rehearsal, rollback planning. It is never the first remediation. Performance tuning remediation should exhaust CloudFront, ElastiCache, index optimization, and query rewriting before engaging an Aurora migration.
How does Lambda Power Tuning interact with Provisioned Concurrency?
Power Tuning finds the memory configuration that optimizes the cost/latency Pareto frontier for a cold-start-agnostic workload. Provisioned Concurrency eliminates cold starts at an hourly cost. The two are complementary: run Power Tuning first to pick the optimal memory, then decide whether cold starts are a problem for your workload. For latency-sensitive APIs behind API Gateway, Provisioned Concurrency at the optimal memory configuration is typically the Pro-level performance tuning remediation for Lambda cold-start outliers.
What is the typical order of magnitude of improvement from each remediation?
Rough rule-of-thumb impact on p99 for the anchor scenario class of workload: CloudFront retrofit — 30 to 50 percent reduction on read-heavy endpoints. ElastiCache cache-aside on top SQL — 40 to 60 percent RDS load reduction. SQS async migration for non-critical downstream calls — 40 to 50 percent compute-tier latency reduction. EBS gp2 to gp3 — 10 to 20 percent storage-tier latency stability. EC2 rightsizing — negligible latency change, significant cost change. DynamoDB repartition — eliminates throttling (infinite improvement on throttled requests). Aurora migration — 2 to 5x throughput at same instance class. Performance tuning remediation value stacks: combining CloudFront plus ElastiCache plus async migration often cuts p99 by 70 percent or more.
Exam Signal: How SAP-C02 Tests Performance Tuning Remediation
The SAP-C02 exam tests performance tuning remediation under Task Statement 3.3 with approximately 10 to 15 percent of exam questions. Question stems use diagnostic framing: "An existing application is experiencing..." or "A company has a running workload that..." The correct answer almost always prefers the minimum-blast-radius remediation among reasonable options. The distractor answers typically offer a full rearchitecture (wrong — too much blast radius) or a single point fix that does not address the described symptom (wrong — misaligned to evidence).
The rubric the exam applies is the four-principle sequencing heuristic from earlier: reversible before irreversible, cheap before expensive, diagnostic before remedial, reads before writes. If two options are both technically correct, pick the one that is more reversible, cheaper, more diagnostic, or addresses the read path. Performance tuning remediation questions reward architects who respect operational risk, not architects who reach for the biggest hammer. Mastering this topic means internalizing that discipline until it becomes reflex — because the exam and real production both reward exactly the same restraint.