Performance architecture on AWS is the practice of choosing the right compute shape, storage tier, network topology, database engine, and caching layer before you have a latency incident — not after. On SAP-C02, performance architecture is Domain 2, Task Statement 2.5 ("Design a solution to meet performance objectives"), and it sits squarely inside the Well-Architected Performance Efficiency pillar. Every performance architecture question is really a selection and trade-off question: purpose-built beats generic, layered caching beats monolithic caching, purpose-built databases beat one-size-fits-all relational, and asynchronous beats synchronous whenever business rules permit.
This guide assumes you already know Associate-level material (EC2 instance families at the letter level, S3 storage classes by name, RDS vs DynamoDB basics, CloudFront fundamentals) and focuses on the Pro-level performance architecture decisions: when Graviton4 is the right migration target, when io2 Block Express is wasted money, when DAX beats ElastiCache, when Global Accelerator beats CloudFront, when Aurora Serverless v2 is wrong, and how to serve machine-learning inference at 10,000 queries per second with a p99 latency below 50 milliseconds. Performance architecture questions frequently masquerade as cost questions — if you pick the wrong instance family you burn both budget and latency — so keep the cost-efficiency lens open throughout.
Why Performance Architecture Matters on SAP-C02
At Professional tier, AWS expects you to know that performance is a design-time decision, not a runtime knob. The exam will not ask you to tune a slow query; it will ask you to pick the right engine so the query never gets slow in the first place. A typical SAP-C02 performance architecture scenario looks like this: "A genomics startup must serve inference from a 70-billion-parameter model to 10,000 concurrent users at a p99 latency below 50 milliseconds across three regions, with monthly cost under USD 120,000. Choose the correct combination of compute, accelerator, storage, network, and caching." No single service answers that — only an integrated performance architecture does.
The exam also tests your ability to eliminate wrong performance architecture answers fast. If you see "general-purpose M7i" for an ML training workload, it is wrong before you read the rest of the sentence. If you see "CloudFront" for a stateful low-latency TCP/UDP game server, it is wrong. If you see "Lambda provisioned concurrency" at 100,000 QPS steady-state, the math is probably wrong. Performance architecture gives you these fast-reject heuristics.
- Performance Efficiency pillar: one of the six Well-Architected pillars; covers selection, review, monitoring, and trade-offs across compute, storage, database, and network.
- Purpose-built compute: EC2 or accelerator families optimised for a specific workload class — storage (I/D/H), accelerated computing (G/P/Inf/Trn/F/DL), memory-optimised (R/X/U), compute-optimised (C).
- Purpose-built database: engine selected for the access pattern — Aurora for relational OLTP at scale, DynamoDB for key-value at scale, Neptune for graph, Timestream for time-series, Keyspaces for Cassandra, OpenSearch for search/analytics.
- io2 Block Express: Amazon EBS volume type delivering up to 256,000 IOPS and 4,000 MB/s throughput per volume with sub-millisecond latency — the highest-performance EBS block storage AWS offers.
- Elastic Fabric Adapter (EFA): network interface for EC2 that bypasses the OS kernel for HPC/ML collective communication; required for sub-10-microsecond inter-node latency inside a cluster placement group.
- DAX (DynamoDB Accelerator): managed in-memory cache in front of DynamoDB delivering microsecond read latency; write-through for item cache, eventually consistent reads only.
- Aurora Limitless Database: horizontally sharded Aurora that scales a single database beyond the single-writer limit by automatically partitioning across compute shards; PostgreSQL-compatible.
- Reference: https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/welcome.html
Plain-Language Explanation: Performance Architecture
Performance architecture has a lot of moving parts. Four distinct analogies from different domains make the concepts stick.
Analogy 1: The Professional Kitchen
A restaurant is a performance architecture. The general-purpose burner (M-family EC2) is fine for sautéing, but you do not roast a whole brisket on it — you buy a smoker (P-family GPU for ML training) for that exact job, and you buy a sous-vide circulator (Inf-family for inference) for precise repeated tasks. The walk-in freezer is Amazon S3 Glacier Deep Archive — cheap, slow to retrieve. The pantry is S3 Standard — quick to retrieve, moderate cost. The countertop within arm's reach is S3 Express One Zone — single-AZ, fastest access, most expensive per GB. The cutting board on the workstation is instance store NVMe — right under the chef's knife, but if the power cuts off, whatever's on the board is lost (ephemeral).
Caching layers are the mise en place — chopped onions, measured spices, pre-portioned proteins staged closer and closer to the stove. The CDN edge cache is the garnish tray at the pass, already plated. ElastiCache is the reach-in fridge next to the stove. DAX is a specialised reach-in designed exactly for one ingredient (DynamoDB items) — it fits nothing else. RDS Performance Insights is the kitchen display system telling you which station is the bottleneck during dinner rush.
Analogy 2: The Library Reference Desk
DynamoDB access patterns resemble a library. The DynamoDB table is the entire library; the partition key is the Dewey Decimal shelf location. A well-chosen partition key spreads requests evenly across shelves; a bad partition key (everyone requests one book) creates a hot shelf and the librarian can't move fast enough. Adaptive capacity is the library quietly assigning a second clerk to that popular shelf.
DAX is the librarian's cart of recently-requested books rolled right up to the reference desk — the librarian answers "do you have X?" from the cart without walking to the shelf, returning answers in microseconds. ElastiCache-in-front-of-DynamoDB is renting a second library across the street where you manually copy popular books — you have to decide what to copy, you have to invalidate stale copies, and you have to route the reader to the right building. DAX is the built-in solution; ElastiCache-in-front-of-DDB is the custom one you only build when you have reasons DAX can't meet (strongly consistent cache hits, cross-table aggregation, or shared cache across DynamoDB and other services).
Analogy 3: The Highway Network
Network performance is a highway system. The public internet is mixed-traffic surface streets — unpredictable, congested at rush hour. CloudFront is a network of suburban drive-thrus that pre-stages items near the customer so they never have to drive to the factory. Global Accelerator is an HOV on-ramp that carries you from a local anycast entrance onto the private AWS backbone (the interstate) faster than the public internet would. For stateful TCP/UDP traffic like game servers or IoT — things CloudFront refuses to carry — Global Accelerator is the highway; CloudFront is the drive-thru that only sells HTTP(S).
Placement Group — Cluster is a Formula 1 pit lane: all cars parked bumper to bumper in the same garage, so crew communication is measured in microseconds — you get sub-10-microsecond inter-node latency when combined with EFA. Placement Group — Spread is the opposite: cars parked at different circuits for fault isolation. Partition placement is a compromise — a hundred cars split across seven garages, communication within a garage is close, across garages is independent — ideal for HDFS/Cassandra where the application already models the fault domain.
Analogy 4: The Hospital Triage
Asynchronous and streaming performance architecture is hospital triage. A synchronous API is an emergency room — every patient must be seen and discharged before the next one walks in. An SQS queue is the waiting room — patients wait, the doctor pulls them at their own pace, and the hospital never overflows even when a bus crash brings 50 people at once. Kinesis Data Streams is the vital-signs telemetry stream — continuous, ordered, replayable, every device posts data in real time and multiple consumers (cardiology, ICU, analytics) each read independently. Step Functions Express is the five-minute fast-track triage protocol — cheap per workflow, optimised for high volume, sub-five-minute durations. Step Functions Standard is the full hospital admission protocol — longer-running, auditable, can wait for human approval or seven-day callback.
For SAP-C02, the kitchen analogy is the single most useful mental model across performance architecture questions — purpose-built tools, layered caching, and the idea that speed comes from staging things close to where they are used. Reference: https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/selection.html
Performance Architecture Decision Framework
Before any service deep-dive, internalise the four questions the Well-Architected Performance Efficiency pillar asks on every new design. Every SAP-C02 performance architecture question is essentially one of these four in disguise.
- What is the access pattern? Read-heavy or write-heavy? Random or sequential? Key-value, range-scan, graph traversal, full-text search, or relational join? The answer selects the database engine before you consider anything else.
- What is the latency budget? Sub-millisecond (DAX, ElastiCache, instance-store NVMe)? Single-digit milliseconds (DynamoDB, EBS io2 Block Express)? Tens of milliseconds (RDS, S3)? Seconds (Athena, Glue)? Budget dictates tier.
- What is the geographic distribution? Users in one region, multiple regions, or globally? Stateless HTTP(S) or stateful TCP/UDP? This selects CloudFront vs Global Accelerator vs neither.
- What is the traffic shape? Steady, bursty, unpredictable, or scheduled? Shape selects serverless vs provisioned, Auto Scaling policy type, and purchasing model (Savings Plan vs Spot vs On-Demand).
SAP-C02 expects you to select the correct primitive and only then consider tuning. If a question offers "tune the DynamoDB table" and "migrate to DAX" for a 50,000 read/sec hot-item workload, DAX is correct — selection beats tuning. Reference: https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/selection.html
Compute Performance — Purpose-Built Instance Selection
EC2 instance selection at Pro depth is not "use C-family for CPU-heavy" — it is knowing which generation, which accelerator, and which processor architecture for a given workload, plus understanding migration paths.
General-purpose families and the Graviton4 migration
- M7i / M7a (x86): Intel Sapphire Rapids and AMD EPYC 4th-gen — the "default" family, balanced CPU/memory, correct when Graviton compatibility is unknown.
- M7g / M8g (Graviton3 / Graviton4): AWS-designed ARM processors. Graviton4 (M8g/C8g/R8g and peers) delivers roughly 30% better price-performance than comparable x86 for Linux workloads that can recompile. Graviton4 is the correct migration target for 2025+ greenfield Linux workloads unless a dependency is x86-only.
- C7i / C7g / C8g: compute-optimised — lower memory-to-CPU ratio than M-family, right for CPU-bound web/API tiers.
- R7i / R7g / R8g / X2idn / X2iedn / U7i: memory-optimised — R-family for in-memory caches and analytics, X/U-family for SAP HANA and very large in-memory databases (U7i instances reach tens of terabytes of RAM).
Accelerated computing — the G / P / Inf / Trn / F / DL matrix
Accelerated computing is where SAP-C02 performance architecture questions get specific. The wrong family can turn a USD 200,000/month bill into a USD 2 million/month bill.
- P5 / P5e / P5en / P6 (NVIDIA H100/H200/B200): training large language models, HPC with CUDA, distributed training with NCCL. These are the heaviest, most expensive GPUs AWS offers. P5en supports EFA v2 for sub-10-microsecond multi-node collectives.
- G6 / G6e (NVIDIA L4 / L40S): graphics, small-model inference, 3D rendering, remote workstations — a fraction of P-family cost.
- Inf2 (AWS Inferentia2): inference-only custom silicon. Up to 40% better price-performance than comparable GPU-based instances for supported model architectures (transformers, diffusion, CNN). Correct answer for "10k QPS LLM inference at lowest cost".
- Trn1 / Trn2 (AWS Trainium): training-only custom silicon. Up to 50% lower cost-to-train than comparable GPU instances. Trn2 (and Trn2 UltraServer) is the correct answer for "train a 70B-parameter model on AWS at lowest cost".
- F2 (FPGA): custom hardware acceleration for genomics, video transcoding, and ASIC prototyping.
- DL1 / DL2q (Habana Gaudi / Qualcomm AI 100): alternative deep-learning accelerators — appear rarely on the exam but good to recognise.
If a scenario explicitly demands lowest inference cost and the model is a mainstream transformer or CNN, AWS Inferentia (Inf2) is the target. Neuron SDK compiles PyTorch/TensorFlow models to Inferentia — no code rewrite needed. Reference: https://docs.aws.amazon.com/ec2/latest/instancetypes/ac.html
Storage-optimised families — I / D / H
Storage-optimised instances exist because EBS is not always fast enough for local-NVMe-hungry workloads.
- I4i / I4g / I7ie / I8g: local NVMe SSD with up to millions of random IOPS per instance. Correct for NoSQL-on-EC2 (self-managed Cassandra/ScyllaDB/ClickHouse), large OLTP caches, high-throughput log processing. Data is ephemeral — instance stop/terminate loses storage.
- D3 / D3en: local HDD with tens of terabytes per node. Correct for distributed file systems (HDFS/MapReduce), data warehousing on EC2.
- H1: mixed storage-optimised, legacy — largely superseded by D3 for new designs.
Nitro Enclaves — isolated compute inside an EC2 instance
AWS Nitro Enclaves is not a separate instance family but a feature of Nitro-based instances. An enclave is an isolated VM carved out of the parent EC2 instance with no persistent storage, no network, no operator access — it exists only to process secrets. You communicate with an enclave only via vsock (virtio sockets), and the enclave provides a cryptographic attestation document that KMS can verify before releasing keys.
SAP-C02 scenarios where Nitro Enclaves is the correct answer:
- Processing payment card data (PCI DSS scope reduction) in a carved-out enclave while the parent instance handles non-sensitive work.
- Multi-party computation or privacy-preserving ML where the training set must never be readable by the instance operator.
- Attestation-gated access to KMS keys for cryptographic signing services.
A common distractor: "use Nitro Enclaves to isolate untrusted code". That is what Firecracker micro-VMs (under Lambda/Fargate) or containers are for. Nitro Enclaves is specifically for sensitive-data isolation with cryptographic attestation — it cannot reach the network, cannot mount storage, and cannot run arbitrary multi-purpose workloads. Reference: https://docs.aws.amazon.com/enclaves/latest/user/nitro-enclave.html
Graviton migration strategy
A recurring SAP-C02 scenario: "a fleet of 2,000 EC2 instances on M5 must reduce cost by 30% without rearchitecting." The correct answer is typically a Graviton migration following this sequence:
- Inventory language runtimes with AWS Compute Optimizer's Graviton recommendations — it flags which workloads are Graviton-ready.
- Rebuild container images as multi-arch (arm64 + x86_64) via AWS CodeBuild ARM builders or Docker Buildx.
- Roll out first to non-production behind an ALB with mixed target groups, observing p50/p99 and error rate.
- Gradually shift traffic; use an EC2 Auto Scaling mixed-instances policy with both m6g.large and m6i.large during the transition.
- Retire x86 after two weeks of stable metrics, updating the Savings Plans commitment from EC2 Instance Savings Plans to a Compute Savings Plan (Compute SPs apply across families and architectures).
Storage Performance — EBS io2 Block Express, FSx, S3 Express, and Instance Store
Storage selection at Pro depth hinges on IOPS, throughput, latency, durability, and whether the data survives an instance stop.
Amazon EBS volume type selection
- gp3: general-purpose SSD. Baseline 3,000 IOPS and 125 MB/s with independent provisioning up to 16,000 IOPS and 1,000 MB/s. Default choice for most new workloads.
- io2 / io2 Block Express: high-performance SSD with 99.999% durability. io2 Block Express delivers up to 256,000 IOPS and 4,000 MB/s per volume with sub-millisecond latency and 64 TiB capacity. Correct for I/O-intensive databases (Oracle, SAP HANA on EBS, high-volume OLTP).
- st1 / sc1: throughput-optimised HDD and cold HDD — sequential throughput workloads only (log processing, data warehouse append).
- gp2 (legacy): burst-credit model; avoid in new designs — gp3 is cheaper and more predictable.
io2 Block Express costs roughly 5x gp3 for equivalent IOPS. The correct answer is io2 Block Express only when the workload explicitly needs over 16,000 IOPS or 1,000 MB/s sustained, five-nines durability, or sub-millisecond consistency. For a typical web app database, gp3 is the right performance architecture choice. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/provisioned-iops.html
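The gp3 vs io2 Block Express call can be captured as a small rule-of-thumb selector. This is an illustrative sketch of the exam heuristic (the thresholds are the gp3 provisioning ceilings quoted above), not an official sizing tool:

```python
def pick_ebs_volume_type(iops: int, throughput_mbps: int,
                         needs_five_nines_durability: bool = False,
                         sequential_only: bool = False) -> str:
    """Rule-of-thumb EBS selector based on the gp3 provisioning ceilings.

    gp3 provisions up to 16,000 IOPS and 1,000 MB/s; anything beyond that
    (or a five-nines durability requirement) points at io2 Block Express.
    Purely sequential throughput workloads can drop to st1 HDD.
    """
    if sequential_only and iops <= 500:
        return "st1"                      # throughput-optimised HDD
    if (iops > 16_000 or throughput_mbps > 1_000
            or needs_five_nines_durability):
        return "io2 Block Express"        # up to 256,000 IOPS / 4,000 MB/s
    return "gp3"                          # default for most new workloads

# A typical web-app database fits comfortably inside gp3's ceiling:
print(pick_ebs_volume_type(iops=8_000, throughput_mbps=400))      # gp3
print(pick_ebs_volume_type(iops=120_000, throughput_mbps=3_000))  # io2 Block Express
```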
FSx family — purpose-built shared file systems
- FSx for Lustre: purpose-built for HPC and ML training. Delivers hundreds of GB/s throughput and millions of IOPS, integrates natively with S3 (lazy-loaded working set, export on completion). Correct answer for "distributed ML training reading a 50 TB dataset across 256 GPU nodes".
- FSx for NetApp ONTAP: enterprise NFS/SMB with snapshots, cloning, SnapMirror replication. Correct answer for "migrate existing NetApp workloads to AWS without re-architecting".
- FSx for Windows File Server: SMB with Active Directory integration, DFS. Correct for Windows workgroup file shares.
- FSx for OpenZFS: POSIX-compliant with snapshots, cloning, high IOPS for transactional file workloads.
S3 performance architecture
S3 is massively parallel but only if you design for it.
- S3 Standard: default, multi-AZ, 11 nines durability, ~100 ms first-byte latency. Scales to thousands of requests per second per prefix.
- S3 Intelligent-Tiering: automatic tiering across Frequent / Infrequent / Archive Instant / Archive / Deep Archive with a small monthly monitoring fee. Correct default for data whose access pattern is unknown or changes over time.
- S3 Express One Zone: single-AZ, consistent single-digit-millisecond first-byte latency — up to 10x lower latency than S3 Standard for small objects at the cost of single-AZ durability. Correct answer for "AI/ML training datasets accessed repeatedly within a single AZ" or "latency-sensitive interactive analytics directly on S3".
- S3 Standard-IA / One Zone-IA: infrequent access with a 30-day minimum; retrieval costs apply.
- S3 Glacier Instant Retrieval / Flexible Retrieval / Deep Archive: archival tiers with retrieval SLAs from milliseconds to 48 hours.
Performance patterns:
- Use multipart upload for objects larger than 100 MB; multipart parallelism and retry granularity is what makes S3 fast for large files.
- Use S3 Transfer Acceleration when a global client base uploads to a single bucket — clients hit the nearest CloudFront edge and ride the AWS backbone to the bucket region.
- For millions of requests per second, distribute request load across many key prefixes; per-prefix limits are 3,500 PUT/POST/DELETE and 5,500 GET/HEAD per second.
- Enable S3 Byte-Range Fetches for large objects being processed in parallel (Athena, Spark).
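The per-prefix limits translate directly into a fan-out calculation. A small sketch using the figures quoted above:

```python
import math

# Per-prefix S3 request ceilings quoted above.
PUT_LIMIT_PER_PREFIX = 3_500   # PUT/POST/DELETE per second
GET_LIMIT_PER_PREFIX = 5_500   # GET/HEAD per second

def prefixes_needed(put_rps: int, get_rps: int) -> int:
    """Minimum number of key prefixes to spread load across so that
    no single prefix exceeds its per-second request ceiling."""
    return max(
        math.ceil(put_rps / PUT_LIMIT_PER_PREFIX),
        math.ceil(get_rps / GET_LIMIT_PER_PREFIX),
        1,
    )

# One million GETs/sec needs at least ceil(1_000_000 / 5_500) = 182 prefixes:
print(prefixes_needed(put_rps=0, get_rps=1_000_000))  # 182
```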
Instance store (ephemeral NVMe)
I-family local NVMe offers millions of random IOPS with sub-100-microsecond latency — faster than any EBS volume. Use it for:
- Cache tiers where data can be rebuilt from upstream (Redis/KeyDB on EC2 holding precomputed results).
- Scratch storage for shuffle/sort in Spark/MapReduce.
- Transactional logs for self-managed databases that replicate across nodes (ScyllaDB, Aerospike).
Never use instance store for data that cannot be rebuilt — stop/terminate loses it.
Network Performance — Placement Groups, EFA, Global Accelerator vs CloudFront
Network is the layer SAP-C02 loves to probe because the wrong primitive is off by an order of magnitude.
Placement Groups — Cluster, Partition, Spread
- Cluster placement group: all instances placed in the same rack for low-latency, high-bandwidth inter-node communication. Combined with EFA, delivers sub-10-microsecond inter-node latency and up to 3,200 Gbps per instance on supported families. Correct for tightly coupled HPC, distributed ML training, in-memory analytics.
- Partition placement group: instances split across up to 7 logical partitions per AZ; no two partitions share hardware. Correct for distributed systems that model their own fault domain (HDFS, Cassandra, Kafka) — replicas across partitions.
- Spread placement group: each instance on distinct hardware, maximum fault isolation. Limited to 7 instances per AZ. Correct for small numbers of critical stateful instances.
Elastic Fabric Adapter (EFA)
EFA is a special EC2 network interface that bypasses the OS kernel for collective communication — it is the only way to get MPI and NCCL performance approaching on-premises HPC. Available on P4/P5/Trn1/Trn2/C5n/C6i/C7i/HPC6a/HPC7a/HPC7g. Required when:
- Multi-node ML training with NCCL / AllReduce.
- MPI-based HPC simulation (CFD, weather, molecular dynamics).
- Distributed in-memory databases with high inter-node RPC volume.
Global Accelerator vs CloudFront — the decision tree
This is the single most tested network performance architecture decision at Pro level.
- Amazon CloudFront is a content delivery network (CDN). It caches HTTP(S) responses at edge locations. It terminates TLS at the edge and originates to ALB/NLB/S3/custom origin. It does not carry raw TCP/UDP. Use when the workload is web/API content, streaming video, static assets and the acceleration comes from caching plus TLS termination.
- AWS Global Accelerator is a network accelerator. It provides two static anycast IP addresses that route clients to the nearest edge, then carries traffic across the AWS global backbone to regional endpoints (ALB, NLB, EC2, Elastic IP). It accepts both TCP and UDP, preserves source IP, and routes stateful connections deterministically. It does not cache content. Use when the workload is non-HTTP (game servers, VoIP, IoT, MQTT), HTTP APIs that need sticky connections, multi-region active-active with instant failover, or stateful TCP that cannot tolerate re-handshake.
If the scenario says "UDP game server", "MQTT broker", "preserve client source IP", or "multi-region HTTP API with instant regional failover via a single IP", the answer is Global Accelerator. If the scenario says "static website", "video on demand", "signed URLs", or "edge WAF", the answer is CloudFront. If the scenario needs both (e.g., static assets + stateful API), use both in combination. Reference: https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html
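The decision tree can be encoded as a helper, purely as an illustration of the exam heuristic (the function name and parameters are invented for this sketch):

```python
def pick_edge_service(protocol: str, cacheable: bool,
                      needs_static_ip: bool = False,
                      needs_source_ip: bool = False) -> str:
    """Encode the CloudFront-vs-Global-Accelerator heuristic.

    Non-HTTP traffic (UDP game servers, MQTT, VoIP) can only ride
    Global Accelerator; HTTP APIs needing static anycast IPs or client
    source-IP preservation also point at Global Accelerator; everything
    cacheable over HTTP(S) belongs on CloudFront.
    """
    if protocol.lower() not in ("http", "https"):
        return "Global Accelerator"       # CloudFront carries HTTP(S) only
    if needs_static_ip or needs_source_ip:
        return "Global Accelerator"
    # Even non-cacheable HTTP(S) benefits from CloudFront's edge TLS
    # termination and backbone routing to the origin.
    return "CloudFront"

print(pick_edge_service("udp", cacheable=False))    # Global Accelerator
print(pick_edge_service("https", cacheable=True))   # CloudFront
```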
CloudFront performance patterns at scale
- Origin Shield: an extra regional cache tier between edge and origin — reduces origin load by de-duplicating edge-miss requests. Use when many edge locations each miss the same origin object (typical for popular assets).
- Cache behaviours: path-based cache policies, TTL overrides, query-string normalisation. Critical for high cache-hit ratio.
- CloudFront Functions (runs at every edge location, sub-millisecond execution, scales to millions of requests per second) vs Lambda@Edge (runs at regional edge caches, millisecond-plus, full Node.js/Python runtime). Use Functions for header/URL rewrites; Lambda@Edge for anything requiring SDK calls or larger code.
- Origin Access Control (OAC): replaces the legacy Origin Access Identity for S3 — correctly signs requests with SigV4 and supports KMS-encrypted buckets.
Database Performance — Purpose-Built, Aurora at Pro Depth, DynamoDB Patterns
At Pro level, you should not default to "RDS" — the exam expects you to map the access pattern to the correct engine, then pick the right shape within that engine.
Purpose-built database selection matrix
- Relational OLTP with strict consistency → Amazon Aurora (MySQL/PostgreSQL-compatible) or RDS.
- Relational OLTP at massive scale beyond single-writer → Aurora Limitless Database.
- Key-value / document at any scale → Amazon DynamoDB.
- Graph traversal (fraud, social, knowledge) → Amazon Neptune.
- Time-series (IoT, metrics, logs) → Amazon Timestream.
- Cassandra-compatible wide-column → Amazon Keyspaces.
- Search and log analytics → Amazon OpenSearch Service.
- Ledger / immutable audit → Amazon QLDB (deprecated; AWS now points customers to Aurora PostgreSQL for ledger-style workloads, but QLDB remains exam-relevant).
- In-memory cache → ElastiCache for Redis/Valkey/Memcached, or MemoryDB for Redis (durable).
Aurora at Pro Depth — Serverless v2, Limitless, Global Database, Parallel Query
Aurora is the relational engine SAP-C02 tests most heavily.
- Aurora Serverless v2: scales on-line in half-ACU increments (0.5 ACU = ~1 GiB RAM + proportional CPU) from a minimum floor to a configured ceiling. Correct for variable or unpredictable workloads. Unlike v1, v2 scales without dropping connections and supports read replicas, Global Database, and RDS Proxy. The minimum 0.5 ACU floor means it is no longer "scales to zero" by default, though a 0-ACU option is available on newer releases for dev/test.
- Aurora Limitless Database: horizontally sharded Aurora PostgreSQL. Automatically distributes tables across multiple compute shards, transparent to the application for most queries. Correct answer when "single writer cannot keep up with 100,000+ writes/sec" and the application is PostgreSQL-compatible.
- Aurora Global Database: cross-region replication with typical lag under one second (an RPO of roughly one second) and managed failover that promotes a secondary region to take writes in about a minute. Correct for multi-region active-passive with aggressive RTO/RPO, and for global read-latency reduction via read-only replicas in each region.
- Aurora Parallel Query: pushes SQL processing into the Aurora storage layer across thousands of storage nodes, dramatically speeding up long-running analytical queries on an OLTP database. Enable when OLTP and light OLAP co-exist on one Aurora cluster.
Unless you are on a recent version that supports a 0-ACU minimum, Aurora Serverless v2 will bill at least the configured minimum ACU 24/7. For truly spiky workloads with long idle periods, Aurora Serverless v1 (legacy, pausing) or DynamoDB on-demand may be the correct performance architecture + cost answer. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.html
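The billing floor is easy to quantify. A sketch, assuming a hypothetical per-ACU-hour price (check the current regional price list before using this for real sizing):

```python
# Aurora Serverless v2 bills the configured ACU floor around the clock.
# The per-ACU-hour price below is an illustrative assumption, not a
# published rate.
ACU_HOUR_PRICE_USD = 0.12   # hypothetical rate for illustration
HOURS_PER_MONTH = 730

def monthly_floor_cost(min_acu: float) -> float:
    """Minimum monthly bill implied by the configured ACU floor alone."""
    return round(min_acu * ACU_HOUR_PRICE_USD * HOURS_PER_MONTH, 2)

# Even an idle cluster with a 0.5 ACU floor is never free:
print(monthly_floor_cost(0.5))   # 43.8
print(monthly_floor_cost(8))     # 700.8
```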
DynamoDB performance patterns at Pro depth
- Partition key design is 80% of DynamoDB performance. A hot partition key (all writes to one user ID, all reads to one product SKU) throttles regardless of provisioned capacity. Compose keys with high-cardinality attributes; use write sharding (user#<id>#<0–9>) when a single logical entity takes disproportionate traffic.
- Adaptive capacity reallocates throughput across partitions within a table — runs continuously, no configuration needed. It mitigates but does not eliminate bad key design.
- On-demand capacity mode bills per-request, no provisioning, instant scaling to 2x previous peak. Correct for unpredictable or new workloads. Roughly 7x the per-request cost of fully-utilised provisioned — break-even depends on utilisation.
- Provisioned capacity with auto-scaling is cheaper when utilisation is steady and predictable (70%+). Auto-scaling responds in minutes, not seconds — cold bursts will throttle.
- Global Secondary Indexes (GSIs): separate throughput, eventually consistent only. Use for alternate access patterns. GSI hot key issues exist independently of the base table.
- Local Secondary Indexes (LSIs): share base table partition capacity, must be defined at table creation, strongly consistent. Use rarely; composite GSI is usually a better choice.
- DynamoDB Streams: ordered per-partition-key change log with 24-hour retention. Feeds Lambda (via Event Source Mapping) or Kinesis Data Streams (via managed DDB-to-Kinesis pipe).
- TTL: per-item expiration attribute; DynamoDB deletes within 48 hours of expiry at no cost. Correct for session data, temporary caches, time-bound records. TTL deletions generate stream events for downstream cleanup.
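The write-sharding pattern from the first bullet can be sketched in a few lines; the 0-to-9 suffix range and key format follow the example above, everything else is illustrative:

```python
import random

SHARD_COUNT = 10   # the 0-9 suffix range from the pattern above

def sharded_write_key(user_id: str) -> str:
    """Compose a write-sharded partition key: user#<id>#<0-9>.

    Spreading one hot logical entity across 10 physical partition keys
    multiplies its effective write throughput roughly tenfold."""
    return f"user#{user_id}#{random.randrange(SHARD_COUNT)}"

def scatter_read_keys(user_id: str) -> list[str]:
    """Reads must scatter-gather across every shard suffix."""
    return [f"user#{user_id}#{n}" for n in range(SHARD_COUNT)]

print(sharded_write_key("42"))        # e.g. user#42#7
print(scatter_read_keys("42")[:2])    # ['user#42#0', 'user#42#1']
```

The trade-off is explicit in the second helper: every read of the sharded entity fans out to ten keys, so sharding pays off only when writes, not reads, are the hot path.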
DAX vs ElastiCache-in-front-of-DynamoDB
- DAX is the built-in item/query/scan cache. Microsecond latency for cache hits. Write-through: writes go through DAX to DynamoDB synchronously. Reads are eventually consistent only — strongly consistent reads bypass DAX. Correct for read-heavy DynamoDB workloads with repeated key reads.
- ElastiCache-in-front-of-DynamoDB is a manual pattern: application writes to DynamoDB and invalidates/updates an ElastiCache entry. More work, more failure modes, but appropriate when:
- Cache must be shared across DynamoDB and other data sources (e.g., join results from DDB + RDS).
- Cache must serve strongly consistent hits (possible via single-writer cache design, not DAX).
- Custom data structures (sorted sets, streams, pub/sub) are required — only Redis/Valkey offer these.
OpenSearch with UltraWarm and Cold Storage
OpenSearch manages hot, UltraWarm, and Cold tiers so you do not store a year of logs on SSD.
- Hot tier: data nodes with EBS SSD, indexed for full-speed queries.
- UltraWarm: S3-backed, queried at roughly 10x slower latency than hot, at roughly 1/10 the cost per GB. Correct for 30-to-90-day log retention.
- Cold storage: S3-backed, detached from the cluster — queryable only after reattachment. Correct for 90+-day retention where ad-hoc query is acceptable.
- Index State Management (ISM) policies automate transitions between tiers.
A recurring SAP-C02 scenario: "logs cost USD 50k/month on OpenSearch, retention must stay at 1 year." Correct answer: ISM policy moving indices older than 30 days to UltraWarm and older than 90 days to Cold — typical cost reduction 60–80%. Reference: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ultrawarm.html
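The savings from tiering are straightforward arithmetic. A storage-only sketch: the roughly 10:1 hot-to-UltraWarm cost ratio comes from the text above, while the absolute per-GB rates are assumptions for illustration:

```python
# Illustrative per-GB-month storage rates. Only the ~10:1 hot-to-UltraWarm
# ratio is taken from the text; the absolute figures are assumptions.
HOT_PER_GB = 0.30
WARM_PER_GB = 0.03     # roughly 1/10 of hot
COLD_PER_GB = 0.01     # detached, near-S3 pricing (assumption)

def yearly_cost_per_gb(hot_days: int, warm_days: int, cold_days: int) -> float:
    """Blended 12-month storage cost for 1 GB flowing through the tiers."""
    months = lambda days: days / 30
    return round(months(hot_days) * HOT_PER_GB
                 + months(warm_days) * WARM_PER_GB
                 + months(cold_days) * COLD_PER_GB, 4)

all_hot = yearly_cost_per_gb(365, 0, 0)
tiered = yearly_cost_per_gb(30, 60, 275)
print(all_hot, tiered, f"{1 - tiered / all_hot:.0%} saved")
```

Note this counts storage only; the quoted 60-80% real-world reduction also reflects data-node compute that the warm and cold tiers avoid.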
RDS Performance Insights
RDS Performance Insights provides a top-down load view of a database — average active sessions (AAS) by wait event, SQL statement, user, and host. It is the correct diagnostic tool for:
- Identifying the SQL statement consuming most CPU.
- Attributing lock waits to specific queries.
- Long-term performance trending across 2 years of retention on paid tier.
- Integration with DevOps Guru for RDS to auto-detect anomalies.
Caching Hierarchy — Browser → CloudFront → ElastiCache → DAX → Origin
Performance architecture at scale is layered caching. The earlier in the chain a request is served, the faster and cheaper it is.
- Browser cache (HTTP Cache-Control + ETag). Serves in zero milliseconds from the client. Configure CloudFront to emit correct cache headers with a long max-age for immutable assets (hash-named bundles).
- CloudFront edge cache. Serves in tens of milliseconds from the nearest edge. Tune cache behaviours, query-string normalisation, cookie forwarding, and Origin Shield for cache-hit ratios above 90% on static assets.
- Global Accelerator (optional for non-cacheable TCP/UDP). Routes bypass public internet but do not cache.
- Regional cache — ElastiCache / DAX / MemoryDB. Serves in microseconds to low milliseconds from in-memory nodes. ElastiCache for general-purpose caching and shared state; DAX specifically for DynamoDB items; MemoryDB when the cache must survive a node failure with durability.
- Read replicas (Aurora/RDS). Serves reads from replicas to offload the writer. Lag is typically sub-100ms but the application must tolerate eventual consistency.
- Origin (Aurora/RDS/DynamoDB/S3). The source of truth. Every request that reaches origin costs more and takes longer than any cached layer.
Cache sizing and eviction
- Time-based eviction (TTL): the default; correct when staleness of `x` seconds is acceptable.
- Write-through invalidation: application invalidates cache on every write; correct when read-your-own-writes is required.
- Lazy loading (cache-aside): populate on miss. Simple but risks thundering herd on cold start — mitigate with request coalescing or pre-warming.
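The cache-aside pattern is short enough to sketch end to end. An in-process dict stands in for ElastiCache/Redis here, and `loader` stands in for the origin read (an Aurora query or DynamoDB GetItem); names are illustrative:

```python
import time

class CacheAside:
    """Minimal cache-aside (lazy loading) sketch with TTL eviction."""

    def __init__(self, loader, ttl_seconds=300):
        self.loader = loader          # origin read, called only on miss
        self.ttl = ttl_seconds
        self.store = {}               # key -> (value, expires_at)

    def get(self, key):
        hit = self.store.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]                             # cache hit
        value = self.loader(key)                      # miss: go to origin
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value

    def invalidate(self, key):
        # Write-through invalidation: call on every origin write when
        # read-your-own-writes is required.
        self.store.pop(key, None)
```

The thundering-herd risk lives in `get`: on a hot-key miss, every concurrent caller runs `loader` at once unless you add coalescing.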
Cache failure modes
- Thundering herd on cache miss: when a hot key expires, thousands of requests simultaneously hit the origin. Mitigate with jittered TTL, single-flight (one request populates, others wait), or SQS-buffered asynchronous repopulation.
- Cache stampede on restart: a cold cache restart drives a burst of origin traffic. Mitigate with warm-up scripts, staggered node restarts, and graceful-degradation paths.
- Stale cache under deploy: cache keyed on old code serves data to new code. Mitigate with a version prefix in cache keys (`v7:user:123`).
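The three mitigations above — versioned keys, jittered TTLs, and single-flight population — each fit in a few lines. A sketch, with the version constant and jitter fraction as illustrative values:

```python
import random
import threading

DEPLOY_VERSION = "v7"   # bump per deploy: an implicit cache flush

def cache_key(namespace, raw_key):
    # Version prefix ensures old-code entries are never read by new code.
    return f"{DEPLOY_VERSION}:{namespace}:{raw_key}"

def jittered_ttl(base_seconds, jitter_fraction=0.1):
    # Spread expirations of hot keys across +/-10% so thousands of
    # entries do not expire in the same second.
    delta = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-delta, delta)

class SingleFlight:
    """Serialise concurrent misses on one key: one caller at a time
    runs the loader; in practice each waiter re-checks the cache after
    acquiring the lock, so only the first call reaches the origin."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def do(self, key, fn):
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            return fn()
```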
The full chain: Browser → CloudFront → Global Accelerator (if non-HTTP) → ElastiCache / DAX / MemoryDB → Read Replica → Origin. When a question gives you a layered design, check that every layer is present and correctly typed for the workload. Reference: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html
Asynchronous and Streaming Patterns — Kinesis, Lambda, Step Functions Express
Not every workload can be made faster synchronously; many get faster only by becoming asynchronous.
Kinesis Data Streams + Lambda for real-time ingestion
Kinesis Data Streams is an ordered, replayable, multi-consumer stream. Its performance model:
- Shards: each shard supports 1 MB/sec or 1,000 records/sec write, and 2 MB/sec read shared across all standard (polling) consumers. Scale by adding shards.
- Enhanced fan-out: each registered consumer gets a dedicated 2 MB/sec per shard, pushed over HTTP/2 — correct when multiple consumers each need full throughput.
- Lambda Event Source Mapping: polls shards and invokes Lambda with batches. Parallelization factor (1–10) can run multiple Lambda invocations per shard in parallel while preserving order within a partition key. Correct for 10x throughput on the same shard count when order-per-key is sufficient.
- Extended retention: up to 365 days for replay scenarios.
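The Lambda Event Source Mapping described above is configured through a single API call. A sketch, assuming a boto3 `lambda` client and placeholder names:

```python
def attach_kinesis_consumer(lambda_client, function_name, stream_arn):
    """Attach a Kinesis stream to a Lambda with parallelization
    factor 10: up to 10 concurrent batches per shard, with ordering
    still preserved per partition key.

    `lambda_client` would be boto3.client("lambda"); the function and
    stream names are placeholders.
    """
    return lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        StartingPosition="LATEST",
        BatchSize=500,                     # records per invocation (max 10,000)
        MaximumBatchingWindowInSeconds=1,  # trade a little latency for batching
        ParallelizationFactor=10,          # 1-10 concurrent batches per shard
        BisectBatchOnFunctionError=True,   # isolate poison records on failure
        MaximumRetryAttempts=3,
    )
```

With `ParallelizationFactor=10`, ten batches per shard process in parallel, so throughput scales roughly 10x without resharding — the lever the exam expects you to name.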
Step Functions Express vs Standard
- Standard: up to 1 year duration, exactly-once execution, full audit history. Correct for human-in-the-loop, long-running, audited workflows. Pricing per state transition.
- Express: up to 5 minutes duration, at-least-once execution, logs to CloudWatch. Priced per request + duration + memory — much cheaper at high volume. Correct for high-throughput ingestion pipelines, IoT event processing, serverless microservice orchestration.
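The Standard/Express choice is a single field at creation time. A sketch, assuming a boto3 `stepfunctions` client; all names and ARNs are placeholders:

```python
def create_express_workflow(sfn_client, name, definition, role_arn, log_group_arn):
    """Create an Express state machine. Express workflows keep no
    server-side execution history, so logging to CloudWatch Logs is
    how you retain any audit trail.
    """
    return sfn_client.create_state_machine(
        name=name,
        definition=definition,   # Amazon States Language, as a JSON string
        roleArn=role_arn,
        type="EXPRESS",          # the alternative is "STANDARD"
        loggingConfiguration={
            "level": "ERROR",                # ALL | ERROR | FATAL | OFF
            "includeExecutionData": False,   # keep payloads out of logs
            "destinations": [
                {"cloudWatchLogsLogGroup": {"logGroupArn": log_group_arn}}
            ],
        },
    )
```

Note the type cannot be changed after creation — picking Standard vs Express is a design-time decision, which is exactly why the exam tests it.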
Async request patterns
- API Gateway + SQS + Lambda worker: client POSTs, API Gateway enqueues immediately, Lambda worker processes asynchronously, client polls status. Correct for long-running requests that would otherwise time out.
- EventBridge + Lambda: client publishes an event, multiple consumers subscribe. Correct for fan-out with filtering.
- DynamoDB Streams + Lambda: write triggers downstream processing. Correct for materialised views, search indexing, audit logs.
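The first pattern (API Gateway + SQS + worker + status poll) reduces to three handlers. A self-contained sketch in which `queue.Queue` stands in for SQS and a dict stands in for the DynamoDB status table; names are illustrative:

```python
import queue
import uuid

jobs = {}                  # stands in for a DynamoDB status table
work_q = queue.Queue()     # stands in for SQS

def submit(payload):
    """API Gateway integration: enqueue and return 202 immediately,
    instead of holding the connection open past its timeout."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "PENDING"}
    work_q.put((job_id, payload))
    return {"statusCode": 202, "jobId": job_id}

def worker():
    """SQS-triggered Lambda worker: drain the queue, record results."""
    while True:
        job_id, payload = work_q.get()
        jobs[job_id] = {"status": "DONE", "result": payload.upper()}
        work_q.task_done()

def get_status(job_id):
    """GET /jobs/{id}: the client polls until status is DONE."""
    return jobs.get(job_id, {"status": "NOT_FOUND"})
```

In the real architecture, `submit` runs in the API-facing Lambda, `worker` is an SQS event-source Lambda, and `jobs` is a DynamoDB table keyed on jobId with a TTL attribute to expire finished records.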
Scenario: ML Serving at 10,000 QPS with p99 < 50 ms
Apply the full performance architecture framework to a canonical Pro-level scenario.
Requirements:
- Serve a 70-billion-parameter transformer model.
- 10,000 queries per second steady-state, 25,000 QPS peak.
- p99 latency < 50 ms end-to-end, measured at the client.
- Three regions (us-east-1, eu-west-1, ap-northeast-1), active-active.
- Monthly cost ceiling USD 120,000.
- PII in queries must never leave the enclave; key material protected by KMS.
Performance architecture:
- Compute — Inferentia2 (inf2.48xlarge). Inferentia2 delivers the best price-per-inference for a mainstream transformer. Neuron SDK compiles the model from PyTorch. 12 inf2.48xlarge per region (36 total) at ~USD 2,500/month reserved-equivalent each — roughly USD 90,000/month compute. GPU (P5) would cost 3–4x more.
- Accelerated storage — FSx for Lustre linked to S3. Model weights staged on Lustre, whose sub-millisecond read latency makes cold-start weight loading fast; S3 is the authoritative store with cross-region replication.
- Tokenisation and PII redaction — Nitro Enclaves. Raw queries land in the enclave; PII redaction happens inside the enclave; only redacted prompts leave. KMS attests the enclave before releasing the encryption key.
- Network — Global Accelerator with regional NLB endpoints. Static anycast IPs route clients to the nearest healthy region; instant regional failover. Preserves source IP for audit. CloudFront is not appropriate here because inference responses are non-cacheable and per-query.
- Caching — ElastiCache for Valkey in front of the model tier for deterministic prompts (repeat queries). Hit rate of even 20% removes 2,000 inference calls/sec. Cache-aside with MD5-hashed prompt as key, 5-minute TTL.
- Metadata store — DynamoDB on-demand for request metadata and user session state. On-demand handles the 2.5x peak without throttling. DAX in front for the session read path (p99 < 5 ms).
- Async audit log — Kinesis Data Firehose → S3 → Athena. Every inference emits an audit record; Firehose buffers and delivers to S3 partitioned by region/date/hour for Athena analytics without inline cost.
- Regional data — Aurora Global Database for user and billing data. Writer in us-east-1, read-only replicas in eu-west-1 and ap-northeast-1 for sub-50ms reads; RPO < 1s for cross-region writer failover.
- Observability — RDS Performance Insights for Aurora + CloudWatch Embedded Metrics Format from Lambda/Inferentia runners + X-Ray end-to-end tracing. p99 latency tracked at every layer.
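The deterministic-prompt cache in the design above hinges on a stable cache key. A sketch of one way to build it — hashing the canonical (model, prompt, decoding parameters) tuple, with the version prefix and names as illustrative choices:

```python
import hashlib
import json

def prompt_cache_key(model_id, prompt, params):
    """Deterministic cache key for repeat inference requests.

    Identical (model, prompt, params) tuples hash to the same key, so
    deterministic prompts can be served from ElastiCache without
    touching the Inferentia tier. JSON canonicalisation (sorted keys,
    fixed separators) keeps the hash stable across callers.
    """
    canonical = json.dumps(
        {"model": model_id, "prompt": prompt, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return "v1:infer:" + hashlib.md5(canonical.encode()).hexdigest()

# Cache-aside with the design's 5-minute TTL would then be roughly:
#   key = prompt_cache_key(model, prompt, params)
#   value = cache.get(key) or run_inference_and_set(key, ttl=300)
```

Note MD5 is fine here because the key only needs to be deterministic and well-distributed, not cryptographically secure.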
Latency budget breakdown (target p99 end-to-end < 50 ms):
- Client → Global Accelerator edge: 5 ms
- Global Accelerator backbone → regional NLB: 5 ms
- NLB → Inferentia pod: 2 ms
- Tokenisation (Nitro Enclave): 3 ms
- DAX session lookup: 2 ms
- Inference (cached): 3 ms / (uncached): 30 ms
- Response path: 5 ms
- Uncached total: 52 ms — slightly over the 50 ms target with zero slack; cache hits (25 ms total) and overlapping tokenisation with the session lookup claw back the margin.
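The budget arithmetic is worth checking explicitly, because the exam expects you to notice when a path does not fit:

```python
# Fixed path cost shared by every request (milliseconds, from the list above):
fixed = 5 + 5 + 2 + 3 + 2 + 5   # edge, backbone, NLB, enclave, DAX, response
cached = fixed + 3              # 25 ms — comfortably inside the budget
uncached = fixed + 30           # 52 ms — just over the 50 ms target
```

This is why the cache hit rate is not optional in the design: only the cached path actually meets the SLA with margin.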
Expect a question nearly identical in structure: a latency SLA, a throughput target, a cost ceiling, and multi-region requirements. The correct answer is always a combination of purpose-built compute (Inf/Trn), Global Accelerator (not CloudFront), a layered cache, a purpose-built database, and async telemetry. Any answer missing Inferentia for cost-sensitive inference, or substituting CloudFront for stateful API traffic, is wrong. Reference: https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/welcome.html
Performance Monitoring and Validation
Performance architecture is not complete without continuous measurement.
- Amazon CloudWatch: baseline metrics, composite alarms, anomaly detection, dashboards. Use Contributor Insights on DynamoDB to find hot partition keys in production.
- AWS X-Ray: distributed tracing across Lambda, API Gateway, Step Functions, ECS, EKS. Correct tool to identify which hop owns p99 latency.
- RDS Performance Insights: database-layer wait-event analysis (see above).
- CloudWatch RUM (Real-User Monitoring): browser-side latency from actual end users.
- CloudWatch Synthetics: scripted canaries measuring end-to-end availability and latency.
- AWS DevOps Guru: ML-based anomaly detection across application, RDS, Lambda metrics — surfaces regressions without threshold tuning.
- Compute Optimizer: continuous right-sizing recommendations for EC2, Lambda, EBS, ECS, ASG.
Common Traps and Fast-Reject Heuristics
- "Use CloudFront for a UDP workload" → wrong, CloudFront is HTTP(S) only. Use Global Accelerator.
- "Use DAX for strongly consistent reads" → wrong, DAX serves eventually consistent only.
- "Use Aurora Serverless v2 for a workload that is idle 95% of the time" → check if the minimum ACU floor makes it more expensive than provisioned-small or DynamoDB on-demand.
- "Use io2 Block Express for a web app database" → usually over-spec; gp3 suffices unless explicitly needing > 16,000 IOPS or sub-ms consistency.
- "Use P5 for inference" → possible but expensive; Inf2 is the cost-optimal answer for mainstream transformer inference.
- "Use Lambda provisioned concurrency for 100,000 QPS steady-state" → possible but cost-prohibitive; containers on Fargate or EC2 likely cheaper at that scale.
- "Use Nitro Enclaves to isolate untrusted tenant code" → wrong, use Firecracker (via Fargate/Lambda) or containers.
- "Use instance store for the transactional log of record" → wrong, it is ephemeral.
- "Use CloudFront Functions for SDK calls" → wrong, use Lambda@Edge; Functions has no SDK access.
- "Use Spread placement group for 50 EC2 instances" → wrong, Spread is limited to 7 instances per AZ; use Partition instead.
- "Use Kinesis Data Firehose for replay" → wrong, Firehose is fire-and-forget; use Kinesis Data Streams for replay.
- "Use Aurora Global Database for multi-region active-active writes" → correct only if using write-forwarding from secondary regions; otherwise Aurora Global Database has one writer region. For true multi-master, use DynamoDB Global Tables or Aurora Limitless within one region.
Decision Matrix — Which Primitive for Which Performance Goal
| Goal | Primary primitive | Notes |
|---|---|---|
| Train 70B-parameter model lowest cost | Trn2 / Trn2 UltraServer + EFA + FSx Lustre | Savings over P5 of 30–50% on supported architectures |
| Inference at 10k+ QPS lowest cost | Inf2 + Neuron SDK | GPU only if model architecture incompatible with Neuron |
| Sub-10-microsecond inter-node latency | Cluster Placement Group + EFA | Required for tight HPC/ML collectives |
| Stateful TCP/UDP global routing | Global Accelerator | Not CloudFront |
| Static web + video global delivery | CloudFront (with Origin Shield) | Edge cache, not backbone acceleration |
| Key-value reads at 50k+/sec on same item | DAX | Eventually consistent reads only |
| Shared cache across DDB + RDS + custom | ElastiCache for Valkey/Redis | Pick over DAX when multi-source required |
| Relational OLTP beyond single writer | Aurora Limitless Database | PostgreSQL-compatible only |
| Multi-region relational active-passive, RPO < 1s | Aurora Global Database | Promote secondary in ~1 minute |
| Multi-region key-value active-active | DynamoDB Global Tables | Last-writer-wins on conflict |
| Log retention 30d hot / 90d warm / 1y cold | OpenSearch + ISM + UltraWarm + Cold | 60–80% cost reduction vs all-hot |
| Single-AZ ultra-low-latency S3 | S3 Express One Zone | 10x lower latency, single-AZ durability |
| Cross-region fast upload | S3 Transfer Acceleration | Backbone acceleration on upload |
| Millions of IOPS local storage | Instance Store (I-family) | Ephemeral, rebuild on restart |
| Up to 256k IOPS durable block | io2 Block Express | 99.999% durability, sub-ms |
| HPC/ML shared filesystem | FSx for Lustre | Integrates with S3 |
| High-volume short-duration workflow | Step Functions Express | Up to 5 minutes, cheap per run |
| Long-running audited workflow | Step Functions Standard | Up to 1 year, exactly-once |
| Isolated processing of PII with attestation | Nitro Enclaves | No network, no storage, KMS-attested |
| Database wait-event analysis | RDS Performance Insights | Top SQL by AAS |
FAQ
When should I use DAX vs ElastiCache in front of DynamoDB?
Use DAX when the workload is a simple read-heavy DynamoDB pattern with eventually consistent reads — DAX is purpose-built, managed, and requires no application changes beyond swapping the SDK client. Use ElastiCache (Redis or Valkey) in front of DynamoDB only when you need features DAX lacks: strongly consistent cache hits, cross-source caching (DynamoDB + RDS + computed results), custom data structures (sorted sets, streams, geospatial), or cache sharing across services. The cost and operational overhead of running ElastiCache is higher; pick it only when you can name the specific DAX limitation you are working around.
Should I migrate to Graviton4 now?
For Linux workloads, yes, if your dependencies have ARM builds — which almost all mainstream runtimes (Java, Python, Node.js, Go, Rust, .NET) do in 2025. Graviton4 typically delivers 30% better price-performance versus comparable x86. Migration steps: inventory with Compute Optimizer, rebuild container images as multi-arch, roll out to non-prod behind a mixed-instance ASG, validate p99 metrics, then shift production. Keep Compute Savings Plans (not EC2 Instance Savings Plans) during migration so your commitment applies across architectures. Workloads that should stay on x86: those using x86-only libraries (rare today), legacy Windows-on-EC2, or any ISV binary without an ARM build.
When is Global Accelerator worth the extra cost versus CloudFront?
Global Accelerator costs ~USD 0.025/hour per accelerator plus data transfer. It is worth it when: the workload is stateful TCP or UDP (game servers, VoIP, MQTT, TCP-based APIs with long-lived connections); the workload needs source-IP preservation; the architecture is multi-region active-active with instant failover; or the performance gain from using the AWS backbone (rather than the public internet) is measurable for your user geography. It is not worth it when the workload is cacheable HTTP(S) content — CloudFront already accelerates via edge caching plus TLS termination. Some high-traffic architectures use both: CloudFront for static assets and public HTML, Global Accelerator for the stateful API tier.
How do I decide between Aurora Serverless v2, Aurora provisioned, Aurora Limitless, and DynamoDB for a new OLTP workload?
Start with the access pattern. If the schema is key-value and access is by primary key with known access patterns, DynamoDB is the performance architecture answer — it scales to any QPS and any size with single-digit-millisecond latency, no servers to manage. If the workload needs SQL joins, transactions across multiple tables, and relational modelling, pick Aurora. Within Aurora: Serverless v2 if the workload is variable or unpredictable and you want auto-scaling without connection drops; provisioned if the workload is steady and you want Reserved Instance pricing; Limitless if a single writer cannot absorb your write throughput (typically > 100k writes/sec on PostgreSQL). Use Aurora Global Database on top of any of the above for multi-region disaster recovery and read locality.
What is the correct instance family for training a large language model on AWS?
For cost-sensitive training of mainstream architectures (transformers, diffusion models, CNNs), AWS Trainium (Trn2 or Trn2 UltraServer) is typically the correct answer — up to 50% better price-per-training versus comparable NVIDIA GPU instances for supported models, with the Neuron SDK handling PyTorch/TensorFlow compilation. For frontier-model research where you need the absolute latest NVIDIA CUDA features or custom kernels, P5e / P5en / P6 (H100/H200/B200) are the correct answer, at higher cost. Always combine with EFA, Cluster Placement Group, and FSx for Lustre for multi-node distributed training. Single-node debugging or small-scale fine-tuning may use G6 or a single P5.
How do I meet a p99 latency SLA below 50 ms for a global API?
Combine: Global Accelerator for source-IP-preserving, backbone-accelerated network ingress; regional active-active deployments in at least three regions; DynamoDB with DAX or ElastiCache for Valkey for sub-millisecond hot-path reads; asynchronous write path via SQS or Kinesis so writes do not block the read path; CloudFront in front of static content to remove non-critical traffic from the dynamic tier; compute on the lightest-weight primitive that meets the requirement (Lambda with provisioned concurrency for sub-100ms cold starts, Fargate for medium, EC2 for heavy); and X-Ray end-to-end tracing to find every hop that threatens the budget. Measure p99 per region from synthetic canaries (CloudWatch Synthetics) and from real users (CloudWatch RUM) — both are required to close the loop.
What are the most common performance architecture traps on SAP-C02?
Three traps dominate: (1) picking general-purpose compute for an accelerator workload — if the scenario mentions ML training, HPC simulation, or video transcoding, a non-accelerator answer is almost certainly wrong; (2) picking CloudFront for stateful TCP/UDP — Global Accelerator is the answer for anything that is not HTTP(S) cacheable; (3) picking ElastiCache over DAX for a pure DynamoDB read cache — DAX is purpose-built and the correct answer unless a specific limitation rules it out. A fourth trap is defaulting to RDS when DynamoDB is correct — read carefully for "at any scale", "single-digit-millisecond latency at 100k QPS", or "serverless" signals that point to DynamoDB.
Summary
Performance architecture on AWS is a selection and trade-off discipline: pick the right compute family (Graviton4 for general-purpose, Inf/Trn for ML, I/D/H for storage, Nitro Enclaves for sensitive compute), the right storage tier (gp3 default, io2 Block Express for demanding IOPS, FSx Lustre for HPC/ML, S3 Express One Zone for single-AZ low latency, instance store for ephemeral), the right network primitive (Placement Groups + EFA for HPC, Global Accelerator for stateful global TCP/UDP, CloudFront for cacheable HTTP(S)), the right database (Aurora Serverless v2 / Limitless / Global Database for relational at scale, DynamoDB + DAX for key-value at scale, OpenSearch with UltraWarm for search/logs), and layered caching (browser → CloudFront → ElastiCache/DAX/MemoryDB → read replicas → origin). Asynchronous and streaming patterns (Kinesis, Step Functions Express) remove synchronous bottlenecks entirely. On SAP-C02, the correct performance architecture answer is almost always a combination that names the right primitive at every layer, uses purpose-built where purpose-built exists, and measures end-to-end via X-Ray, RDS Performance Insights, and CloudWatch. Master the fast-reject heuristics (CloudFront is HTTP only, DAX is eventually consistent, Inf2 is cost-optimal inference, Trn2 is cost-optimal training, Aurora Global Database failover is about one minute, Global Accelerator preserves source IP) and performance architecture questions become the fastest-scoring section of the exam.