
Performance Optimization: EBS, RDS, EC2, and S3

7,400 words · about 37 minutes to read

Performance tuning on AWS is the SysOps engineer's quietest, hardest skill: when an application is just slightly slower than yesterday, the metric panels look almost normal, the customer complaints come in sideways through Twitter, and the team has to find the bottleneck before the next standup. SOA-C02 Task Statement 6.2 — "Implement performance optimization strategies" — codifies this skill into five operational levers: recommending compute resources from performance metrics, monitoring and modifying Amazon EBS, implementing S3 performance features, monitoring and tuning RDS, and enabling enhanced EC2 capabilities. Where SAA-C03 asks "which volume type for this workload?", SOA-C02 asks "the existing gp2 volume's BurstBalance is at 12 percent and dropping — what is the runbook?". Compute, Storage, and Database Performance Tuning is the topic where every CloudWatch metric you learned in Domain 1 becomes a tuning knob in production.

This guide walks through performance tuning on AWS from the SysOps angle: which CloudWatch metrics distinguish a throttled volume from a bottlenecked database, when to switch from gp2 to gp3 and what changes about IOPS billing, how ModifyVolume lets you change type, size, and IOPS on a live volume, why S3 prefix scaling matters more than bucket throughput, when Transfer Acceleration is worth the surcharge, how Performance Insights surfaces the top SQL and wait events that explain a slow query, when RDS Proxy fixes connection-storm symptoms, and how cluster, spread, and partition placement groups shape network behaviour for HPC, fault isolation, and big-data workloads. You will see the SOA-C02 scenario shapes the exam loves: BurstBalance approaching zero, p99 latency creeping up after a deploy, multipart upload thresholds for very large objects, ENA enabled but enhanced networking still off, and Compute Optimizer recommendations that flag both over-provisioning and under-provisioning.

Why Performance Tuning Sits Inside SOA-C02 Domain 6

Domain 6 (Cost and Performance Optimization) is worth 12 percent of the SOA-C02 exam — only two topics share the budget — but Task Statement 6.2 threads through every other domain. The Auto Scaling group from Domain 2 only scales correctly if its metric source is publishing at the right resolution and the EBS volume keeps up with bursts. The RDS Multi-AZ deployment from Domain 2.1 still shows read latency spikes during business hours that stay unexplained until Performance Insights is enabled. The CloudFormation deployment in Domain 3 fails when EBS quotas are reached. The VPC Flow Log analysis in Domain 5 sometimes points at network throttling that an enhanced networking instance type would have absorbed. Domain 6.2 is the operational surface where storage, compute, and database tuning meet.

The exam's framing is operational, not architectural. SAA-C03 asks "which storage class fits this workload?". SOA-C02 asks "the workload is already on gp2, BurstBalance dropped from 100 to 30 over four hours, the application is slowing down, what do you change right now without downtime?". The answer is ModifyVolume to gp3 with provisioned IOPS — but the candidate must know the volume types, the metrics that signal saturation, the no-downtime migration path, and the cost implication. SOA-C02 expects you to read CloudWatch metrics like a doctor reads vitals: VolumeQueueLength consistently above 1 means the volume is the bottleneck; BurstBalance declining means a gp2 burst is depleting; high BurstBalance but slow application means the bottleneck is elsewhere (CPU, network, database).

  • IOPS (Input/Output Operations Per Second): the number of read or write operations the storage subsystem completes per second. EBS volumes have an IOPS ceiling that depends on volume type, size, and instance bandwidth.
  • Throughput: the bytes per second the storage moves. Distinct from IOPS — a workload doing many small operations is IOPS-bound, while a workload doing few large operations is throughput-bound.
  • BurstBalance: a CloudWatch metric for gp2 and st1/sc1 volumes representing the percentage of remaining burst credits. Drops as the volume sustains operations above its baseline.
  • VolumeQueueLength: pending I/O requests at the EBS volume. Sustained values above 1 indicate the volume is the bottleneck.
  • Multipart upload: the S3 protocol for splitting an object into parts that upload in parallel and are reassembled by S3 into a single object. Recommended for objects larger than 100 MB and required above 5 GB.
  • Transfer Acceleration: an S3 feature that routes uploads through CloudFront edge locations and AWS's backbone for faster cross-continent transfers, for an additional per-GB fee.
  • Performance Insights: an RDS and Aurora dashboard that visualizes database load (DBLoad, AAS) and identifies the top SQL statements, hosts, users, and wait events.
  • DB Load (AAS): Average Active Sessions — the number of database sessions actively running queries at any moment. The headline Performance Insights metric.
  • Wait event: the resource a database session is waiting for (CPU, IO:DataFileRead, Lock:transaction, etc.). The diagnosis lever inside Performance Insights.
  • RDS Proxy: a managed connection pool sitting in front of RDS or Aurora. Reduces connection overhead for serverless and high-concurrency workloads.
  • Enhanced networking: EC2 capability that uses SR-IOV via the Elastic Network Adapter (ENA) or Elastic Fabric Adapter (EFA) to deliver higher bandwidth, higher PPS, and lower latency than the default virtio interface.
  • Placement group: a logical grouping of EC2 instances that influences how AWS places them on the physical hardware — Cluster (close, low latency), Spread (apart, fault isolation), or Partition (separated by failure domain).
  • Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html

Compute, Storage, and Database Performance Tuning in Plain Language

Performance tuning vocabulary stacks fast. Three analogies help the constructs stick.

Analogy 1: The Highway Lane Upgrade

Storage and network performance behave like a highway. IOPS is the number of cars per minute that pass a checkpoint — many small cars going through quickly. Throughput is the total tons of cargo moved per minute — fewer trucks but each carrying more. A road can be IOPS-bound (the toll booth is slow even though each car is small) or throughput-bound (the lanes are jammed by oversized trucks). gp2 is a two-lane road that can briefly add a third lane during burst hour — BurstBalance is the fuel gauge for that temporary lane; when it hits zero, the road shrinks back to two lanes and traffic crawls. gp3 is a three-lane road by design with a baseline of 3,000 cars/minute (IOPS) and 125 tons/minute (throughput) regardless of road length, with the option to pay extra for additional lanes up to 16,000 cars/minute and 1,000 tons/minute. io2 Block Express is the express bullet train track: 256,000 cars/minute, sub-millisecond latency, but only Nitro-class stations (instance types) can connect. VolumeQueueLength is the queue at the on-ramp — if cars are waiting to enter the highway, the highway itself is the bottleneck regardless of speed limit. The SysOps engineer's job is to read the on-ramp queue, the lane-balance gauge, and the toll-booth speed, and decide whether the answer is more lanes (provisioned IOPS), bigger trucks (throughput), or a different road altogether (volume type swap).

Analogy 2: The Restaurant Kitchen Choosing Appliances

Picking an EC2 instance type and storage is like a restaurant kitchen choosing appliances. The CPU is the head chef — does most of the active cooking. Memory is the prep counter space — if the counter is full, the chef has to keep clearing dishes (paging). Network is the dumbwaiter to the dining room — bandwidth-bound when many large plates leave at once, latency-bound when many small orders need urgent delivery. Enhanced networking with ENA is upgrading the dumbwaiter from manual to motorized: same kitchen, same chef, but plates move faster and arrive more reliably. Instance store is the stainless steel countertop right next to the stove — extremely fast, but if the stove is unplugged (instance stop or terminate), everything on it goes in the bin (data is ephemeral). EBS is the walk-in fridge — durable, networked, you can detach and attach to a different stove. Cluster placement group is putting all the stations of a tasting menu side-by-side so chefs can hand plates over instantly with no walking — perfect for high-frequency communication; everyone in one corner of the rack. Spread placement group is scattering the salad station, soup station, and grill across separate kitchens so a single fire only takes one out — fault isolation. Partition placement group is the multi-floor kitchen where each floor is its own failure domain but each floor has multiple stations; perfect for distributed databases like Cassandra and HDFS where a "rack" maps to a partition.

Analogy 3: The Gym Treadmill Speed Adjustment

Database tuning with Performance Insights is like watching a gym treadmill display. DB Load (AAS) is the current speed in km/h — how hard the engine is working. Top SQL is the list of exercises currently running — squat, bench, deadlift; one of them is hogging the time. Wait events are the reason an exercise is slow: IO:DataFileRead is "waiting for the equipment to arrive from storage", Lock:transaction is "waiting for the squat rack to be free", CPU is "the trainer is fully booked". RDS Proxy is the receptionist who pools client signups so the gym does not have to spin up a new locker for every visitor — a Lambda gym sees thousands of one-minute drop-ins, and without a receptionist the locker room (DB connection limit) overflows in seconds. Multi-AZ failover is moving to the identical mirror gym across the street when this one floods — pre-warmed, identical equipment, the membership card still works. The SysOps engineer reads the speed dial (AAS), spots the slow exercise (top SQL), identifies whether the bottleneck is the equipment (IO:DataFileRead → add IOPS or read replica) or the schedule (Lock:transaction → query rewrite or index), and either adjusts the treadmill (modify instance class) or installs a receptionist (RDS Proxy).

For SOA-C02, the highway lane analogy is most useful when a question mixes EBS volume types and metrics — BurstBalance, VolumeQueueLength, and IOPS/Throughput limits are all lane-and-toll-booth concepts. The gym treadmill analogy is the right one for any RDS Performance Insights scenario, because the language of wait events maps cleanly to "what is the workout waiting for". Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html

Recommending Compute Resources from Performance Metrics

The first 6.2 skill in the exam guide is "Recommend compute resources based on performance metrics." This is the SysOps version of right-sizing — not a one-shot architecture decision but a recurring operational review.

CloudWatch metrics that drive compute recommendations

The headline EC2 metrics in AWS/EC2:

  • CPUUtilization — percent of CPU in use. Sustained above 80 percent suggests under-provisioning; sustained below 10 percent suggests over-provisioning.
  • NetworkIn / NetworkOut — bytes per period. Approaching the instance type's documented bandwidth ceiling means the network is the bottleneck.
  • NetworkPacketsIn / NetworkPacketsOut — packets per period. Many small packets stress the packet-per-second (PPS) limit even before the bandwidth limit.
  • StatusCheckFailed_System / StatusCheckFailed_Instance — 0 means healthy, 1 means failed.

The metrics that come from inside the OS via the CloudWatch agent (in the CWAgent namespace):

  • mem_used_percent — the most-tested missing metric. Sustained above 90 percent means memory pressure.
  • disk_used_percent per mount — filesystem fullness, not the same as EBS volume fullness.
  • procstat counts — process and thread counts.
  • swap_used_percent — non-zero swap usage on a workload that should not swap means memory is undersized.
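Folded together, these thresholds make a toy first-pass classifier. A minimal sketch, assuming the 80/10 percent CPU and 90 percent memory cutoffs quoted above (the function, its 50 percent "modest memory" cutoff, and the verdict strings are illustrative, not any AWS API):

```python
from typing import Optional

def classify(cpu_avg_pct: float, mem_avg_pct: Optional[float]) -> str:
    """Rough right-sizing verdict from sustained CPU and memory utilization.

    Thresholds follow this guide's rules of thumb: sustained CPU above 80
    percent or memory above 90 percent suggests under-provisioning; CPU
    below 10 percent with modest memory suggests over-provisioning.
    """
    if mem_avg_pct is None:
        # No CloudWatch agent means no memory signal: never downsize blind.
        return "inconclusive: install the CloudWatch agent for mem_used_percent"
    if cpu_avg_pct > 80 or mem_avg_pct > 90:
        return "under-provisioned"
    if cpu_avg_pct < 10 and mem_avg_pct < 50:
        return "over-provisioned"
    return "optimized"

print(classify(6.0, None))   # no memory data: inconclusive
print(classify(6.0, 95.0))   # low CPU but pinned memory: under-provisioned
print(classify(6.0, 20.0))   # genuinely idle: over-provisioned
```

The second call is the trap case from the section below: CPU alone says "downsize", but the memory signal flips the verdict.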

AWS Compute Optimizer — the ML-based right-sizing engine

AWS Compute Optimizer ingests up to 14 days of CloudWatch metrics (and optionally CloudWatch agent memory metrics for richer recommendations) and emits structured recommendations across EC2 instances, EC2 Auto Scaling groups, EBS volumes, Lambda functions, and ECS services on Fargate.

Each recommendation falls into one of these classifications:

  • Optimized — current configuration is sized correctly.
  • Under-provisioned — increase resources; the workload is being throttled.
  • Over-provisioned — decrease resources; you are paying for unused capacity.
  • Not optimized — a simpler pass/fail classification used in place of the above for some resource types.

Compute Optimizer also reports a performance risk score (1–4) indicating how confident it is that the recommendation will not regress performance. A risk score of 1 is safe to action; a score of 4 means the recommendation may degrade performance and you should validate first.

For richer EC2 recommendations, have the CloudWatch agent publish the memory metric; Compute Optimizer consumes it automatically, producing right-sizing that accounts for memory pressure (without it, Compute Optimizer assumes memory is not the constraint). The separate enhanced infrastructure metrics option is a paid setting that extends the metric lookback window from 14 days to three months, useful for workloads with longer business cycles.

Operational recommendation workflow

  1. Open Compute Optimizer in the management console (or query via CLI / API).
  2. Filter by recommendation classification (start with under-provisioned to fix performance complaints, then over-provisioned to recover budget).
  3. Review the candidate instance types — Compute Optimizer suggests up to three options ranked by performance risk and cost.
  4. For ASG recommendations, update the launch template to the new instance type and trigger an instance refresh (covered in the AMI/Image Builder topic).
  5. Validate the new configuration against the original metrics for at least one full business cycle before declaring done.

Compute Optimizer's quality scales with input quality. A two-day-old instance produces low-confidence recommendations. Without the CloudWatch agent's memory metric, Compute Optimizer cannot detect memory-bound workloads — it will recommend a smaller instance because CPU looked low, when in fact memory was already pinned at 95 percent. SOA-C02 expects the candidate to know that the CloudWatch agent's memory metric (and a sufficiently long metric history) is a prerequisite for trustworthy right-sizing. Reference: https://docs.aws.amazon.com/compute-optimizer/latest/ug/what-is-compute-optimizer.html

EBS Metric Monitoring: IOPS, Throughput, QueueLength, BurstBalance

Amazon EBS publishes a rich set of CloudWatch metrics in the AWS/EBS namespace. Reading them correctly is the difference between a 30-second diagnosis and a four-hour incident.

The core EBS metrics

  • VolumeReadOps / VolumeWriteOps — total operations completed in the period. Divide by period seconds to get IOPS.
  • VolumeReadBytes / VolumeWriteBytes — total bytes read or written. Divide by period seconds to get throughput.
  • VolumeTotalReadTime / VolumeTotalWriteTime — total seconds spent on operations. Divide by ops to get average latency per operation.
  • VolumeQueueLength — average pending I/O requests at the volume. The single most diagnostic metric.
  • VolumeIdleTime — seconds with no operations. The inverse of utilization.
  • BurstBalance — for gp2, st1, sc1 only — the percentage of burst credits remaining (0–100).
  • VolumeThroughputPercentage / VolumeConsumedReadWriteOps — for io1/io2 only — how close the volume is to its provisioned IOPS limit.
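The divide-by-period arithmetic looks like this in practice. A minimal sketch with hypothetical datapoint sums (the variable names are mine, not CloudWatch's):

```python
PERIOD_S = 300  # a standard 5-minute CloudWatch period

# Hypothetical datapoint sums for one period:
volume_read_ops = 450_000
volume_write_ops = 150_000
volume_read_bytes = 7_372_800_000
volume_total_read_time = 90.0  # seconds spent servicing the reads

iops = (volume_read_ops + volume_write_ops) / PERIOD_S            # 2,000 IOPS
read_throughput_mib_s = volume_read_bytes / PERIOD_S / 1024**2    # ~23.4 MiB/s
avg_read_latency_ms = volume_total_read_time / volume_read_ops * 1000  # ~0.2 ms

print(iops, read_throughput_mib_s, round(avg_read_latency_ms, 3))
```

The same three derived numbers (IOPS, throughput, average latency) are what the decision tree below actually consumes.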

How to read the metrics

The diagnostic decision tree:

  1. Is VolumeQueueLength consistently above 1? Yes → the volume is the bottleneck. Either provision more IOPS/throughput or switch volume type.
  2. Is BurstBalance declining (gp2/st1/sc1)? Yes → the workload is burning burst credits and will throttle to baseline once depleted. Either upsize the volume to raise the baseline or migrate to gp3 where IOPS is decoupled from size.
  3. Is VolumeThroughputPercentage near 100 percent (io1/io2)? Yes → you have hit the provisioned IOPS ceiling. Use ModifyVolume to raise the provisioned IOPS.
  4. Is the measured IOPS (operations divided by period) pinned at the documented EBS IOPS limit for the instance type? Yes → the instance, not the volume, may be the bottleneck (its EBS bandwidth is exhausted). Move to a larger or EBS-optimized instance type.

The SysOps engineer must check both the volume side (provisioned IOPS) and the instance side (EBS bandwidth attached to the instance type). A 64,000-IOPS io2 volume attached to a 4,750-IOPS instance ceiling delivers only 4,750 IOPS.
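That interplay is literally a min() over the two ceilings, using the numbers from the example above:

```python
def effective_iops(volume_provisioned_iops: int, instance_ebs_iops_ceiling: int) -> int:
    # The slower side wins: the volume cannot exceed the instance's EBS
    # bandwidth, and the instance cannot push the volume past its provisioning.
    return min(volume_provisioned_iops, instance_ebs_iops_ceiling)

# The 64,000-IOPS io2 volume behind a 4,750-IOPS instance ceiling:
print(effective_iops(64_000, 4_750))  # 4750
```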

Setting up the canonical alarms

A baseline production EBS dashboard always includes alarms on:

  • BurstBalance < 20% for any gp2 / st1 / sc1 volume — early warning before depletion.
  • VolumeQueueLength > 1 average over 5 minutes — the volume is becoming the bottleneck.
  • For io1/io2: VolumeThroughputPercentage > 90% — approaching the provisioned IOPS ceiling.
  • For application-critical mounts: agent-collected disk_used_percent > 85% — filesystem fullness.
::warning

gp2 volumes silently throttle from burst (up to 3,000 IOPS) down to baseline (3 IOPS per GB, minimum 100) the moment BurstBalance reaches zero. The application slows down, the team scrambles to find a cause, and the metric that reveals it (BurstBalance) is not on most default dashboards. SOA-C02 routinely tests this scenario — the candidate must know to alarm on BurstBalance < 20% and to migrate to gp3 (or a larger gp2) when the burst pattern is regular rather than occasional. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_cloudwatch_ebs.html

::
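The time-to-throttle can be estimated from gp2's documented credit model (a 5.4 million I/O credit bucket, refilled at the 3 IOPS-per-GB baseline). A sketch, illustrative rather than an exact simulation:

```python
BUCKET_CREDITS = 5_400_000  # gp2 I/O credit bucket size (per AWS docs)

def gp2_baseline_iops(size_gb: int) -> int:
    # 3 IOPS per GB, floored at 100 and capped at 16,000.
    return min(max(3 * size_gb, 100), 16_000)

def seconds_until_throttle(size_gb: int, sustained_iops: int) -> float:
    """How long a full credit bucket lasts at a demand above baseline."""
    baseline = gp2_baseline_iops(size_gb)
    if sustained_iops <= baseline:
        return float("inf")  # baseline covers the demand; credits never drain
    return BUCKET_CREDITS / (sustained_iops - baseline)

# A 500 GB gp2 volume (baseline 1,500 IOPS) driven at the 3,000 IOPS burst cap
# drains a full bucket in exactly one hour:
print(seconds_until_throttle(500, 3000) / 3600)  # 1.0
```

A regular daily workload that drains the bucket in an hour is exactly the "burst pattern is regular" case where gp3 is the right move.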

EBS Volume Types Compared: gp2, gp3, io1, io2, st1, sc1

Choosing and modifying volume types is the single most operational EBS decision on SOA-C02.

General Purpose SSD — gp2 vs gp3

  • gp2: legacy general-purpose SSD. Baseline IOPS is 3 IOPS per GB (minimum 100), with burst up to 3,000 IOPS for volumes smaller than 1,000 GB (at 1,000 GB and above, the baseline already meets or exceeds 3,000, so burst is moot). Maximum 16,000 IOPS at 5,334 GB. Throughput scales with size up to 250 MiB/s. IOPS and throughput are coupled to volume size — the only way to get more IOPS on a small volume is to grow it.
  • gp3: modern general-purpose SSD. Baseline 3,000 IOPS and 125 MiB/s regardless of size, from 1 GB up. Provisioned IOPS up to 16,000 and provisioned throughput up to 1,000 MiB/s, both billed independently of size. About 20 percent cheaper than gp2 at equivalent baseline performance. The default choice for new general-purpose workloads.

The gp2 → gp3 migration is a no-downtime ModifyVolume call. For SOA-C02, the giveaway scenario phrases are "BurstBalance dropping" or "the IOPS we get does not match the size we need" — both point to gp3.

Provisioned IOPS SSD — io1, io2, io2 Block Express

  • io1: legacy provisioned-IOPS SSD. Up to 64,000 IOPS per volume on Nitro instances, 50:1 IOPS-to-GB ratio max. Durability is 99.8–99.9 percent, the same as gp2/gp3. Used for mission-critical OLTP and large databases.
  • io2: same envelope as io1 but 99.999 percent annual durability and a 500:1 IOPS-to-GB ratio. The default replacement for io1.
  • io2 Block Express: io2 on Nitro instances with up to 256,000 IOPS per volume, 4,000 MiB/s throughput, 64 TiB size, and sub-millisecond latency. Required for the largest SAP HANA, Oracle, and SQL Server workloads. Requires a Nitro-based instance type.

For SOA-C02, the io1 → io2 migration is recommended for the durability bump. io2 Block Express is the answer when the scenario mentions sub-millisecond latency or IOPS above 64,000.

Throughput-Optimized HDD — st1

  • st1: low-cost HDD optimized for streaming sequential workloads — big data, log processing, data warehouse staging. Throughput up to 500 MiB/s per volume; IOPS-bound for random access. Cannot be a boot volume.

Cold HDD — sc1

  • sc1: lowest-cost HDD for infrequently accessed sequential data — cold archives that still need block storage. Throughput up to 250 MiB/s. Cannot be a boot volume.

Magnetic (deprecated)

  • standard / magnetic: legacy generation; do not select for new workloads.

Live volume modification — ModifyVolume

The Elastic Volumes feature lets you change type, size, provisioned IOPS, and provisioned throughput on a live attached volume without detaching or stopping the instance. The general flow:

  1. Issue aws ec2 modify-volume --volume-id vol-... --volume-type gp3 --iops 6000 --throughput 250 --size 500.
  2. The volume enters the modifying state, then optimizing, then completed. During optimizing performance is reduced — plan modifications during low-traffic windows.
  3. Operating-system-level steps are still required after a size change: growpart and resize2fs (or xfs_growfs) to extend the partition and filesystem. The block device grew but the OS does not auto-grow the filesystem.
  4. After a type change you do not need OS-level work — the change is transparent.
  5. Cooldown: after a successful modification, EBS enforces a 6-hour cooldown before the same volume can be modified again.

SOA-C02 frequently tests "the volume size was modified but df -h still shows the old size" — the candidate must know the OS-level resize commands.

  • gp3 baseline: 3,000 IOPS and 125 MiB/s, regardless of size.
  • gp3 max: 16,000 IOPS, 1,000 MiB/s, 16 TiB size — provisioned independently.
  • gp2 baseline: 3 IOPS per GB; max 16,000 IOPS at 5,334 GB; burst to 3,000 IOPS on small volumes.
  • io2 max (standard): 64,000 IOPS, 1,000 MiB/s, 16 TiB.
  • io2 Block Express max: 256,000 IOPS, 4,000 MiB/s, 64 TiB, sub-millisecond latency, Nitro instances only.
  • st1: streaming sequential HDD, up to 500 MiB/s, cannot boot.
  • sc1: cold HDD, up to 250 MiB/s, cannot boot.
  • ModifyVolume cooldown: 6 hours between modifications on the same volume.
  • Durability: gp2/gp3/io1 = 99.8–99.9%; io2 / io2 Block Express = 99.999% annual durability.
  • Reference: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volume-types.html
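The cheat sheet collapses into a toy first-pass selector. A sketch; the thresholds and return strings are mine, and real selection also weighs cost, durability, and boot requirements:

```python
def pick_volume_type(required_iops: int,
                     sub_ms_latency: bool = False,
                     sequential_hdd_ok: bool = False) -> str:
    """Illustrative first-pass EBS type selection from the limits above."""
    if sequential_hdd_ok:
        return "st1"                 # streaming sequential workloads, non-boot
    if sub_ms_latency or required_iops > 64_000:
        return "io2 Block Express"   # remember: Nitro instances only
    if required_iops > 16_000:
        return "io2"
    return "gp3"                     # the general-purpose default

print(pick_volume_type(6_000))                          # gp3
print(pick_volume_type(100_000, sub_ms_latency=True))   # io2 Block Express
```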

A SOA-C02 distractor: "the application requires sub-millisecond latency at 100,000 IOPS" — the candidate picks io2 Block Express, but the scenario also says the workload runs on m4.4xlarge (Xen, not Nitro). io2 Block Express is supported only on Nitro instances (m5/m6i/m7i/c5/c6i/c7i/r5/r6i/r7i and X-series). The fix is to migrate to a Nitro instance family first, then attach io2 Block Express. Reference: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volume-types.html

S3 Performance Features: Multipart, Transfer Acceleration, Byte-Range Fetches

S3 performance tuning is mostly client-side — S3 itself scales effectively without limit when used correctly, but the client must use the right protocol features.

Request rate per prefix — the most-tested S3 performance fact

S3 scales per prefix, not per bucket. A prefix is any leading portion of the object key name — images/thumbnails/ and images/originals/ are two different prefixes. Each prefix supports:

  • 3,500 PUT/COPY/POST/DELETE requests per second.
  • 5,500 GET/HEAD requests per second.

There is no fixed bucket-wide limit. To exceed 5,500 GET/sec on a hot dataset, fan out across multiple prefixes. The classic example: instead of bucket/2026-04-25/file-001, write bucket/abc/2026-04-25/file-001, bucket/def/2026-04-25/file-002 — randomized first-character prefixes spread requests across many partitions.
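One common fanout implementation derives the leading prefix from a hash of the key. A sketch; the three-character MD5 prefix is an illustrative choice, not an AWS requirement:

```python
import hashlib

def fanned_out_key(date: str, filename: str, width: int = 3) -> str:
    """Prepend a deterministic hash prefix so hot keys spread across many
    S3 prefixes, and therefore across many request-rate partitions."""
    digest = hashlib.md5(filename.encode()).hexdigest()[:width]
    return f"{digest}/{date}/{filename}"

print(fanned_out_key("2026-04-25", "file-001"))
```

Because the prefix is derived from the filename, readers can recompute the full key without a lookup table — a design advantage over truly random prefixes.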

Multipart upload

For objects larger than 100 MB, AWS recommends multipart upload. For objects above 5 GB, multipart is required — single-PUT cannot exceed 5 GB. Each part is at most 5 GB, with up to 10,000 parts per upload, for a maximum object size of 5 TB.

Multipart upload benefits:

  • Parallelism: parts upload concurrently from the client, multiplying total throughput.
  • Resumability: if a part fails, only that part is retried. A four-hour 1 TB upload does not restart from zero on a single network blip.
  • Early start: the client can begin uploading parts before the full object size is known.

Multipart uploads must either complete (CompleteMultipartUpload) or abort (AbortMultipartUpload); incomplete uploads remain in the bucket as orphan parts that you pay for. Configure a lifecycle rule to abort incomplete multipart uploads after N days.
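The part-count and size limits above pin down a minimum part size for very large objects. A sketch, using the 100 MB recommendation as the default part size:

```python
import math

MAX_PARTS = 10_000
DEFAULT_PART = 100 * 1024**2   # start from the 100 MB recommendation (MiB here)
MAX_OBJECT = 5 * 1024**4       # the 5 TB S3 object-size ceiling (TiB here)

def part_size_for(object_bytes: int) -> int:
    """Smallest whole-MiB part size that keeps an upload under 10,000 parts."""
    if object_bytes > MAX_OBJECT:
        raise ValueError("object exceeds the 5 TB S3 object limit")
    size = max(DEFAULT_PART, math.ceil(object_bytes / MAX_PARTS))
    return math.ceil(size / 1024**2) * 1024**2  # round up to a whole MiB

one_tb = 1024**4
print(part_size_for(one_tb) // 1024**2, "MiB parts")             # 105
print(math.ceil(one_tb / part_size_for(one_tb)), "parts total")  # 9987
```

The takeaway: at 1 TB the 100 MB default would already blow past 10,000 parts, so clients must scale the part size with the object.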

Byte-range fetches

For large objects, downloads can be parallelized using HTTP Range: headers. The client makes multiple concurrent GETs for distinct byte ranges of the same object and reassembles them locally. On multi-GB downloads this shortens the total download time to roughly the time of the slowest range.
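A byte-range fetch plan is just a slicing of [0, size) into inclusive Range headers. A minimal sketch; the 8 MiB chunk size is an arbitrary illustrative choice:

```python
def byte_ranges(object_size: int, chunk: int = 8 * 1024**2):
    """Yield HTTP Range header values covering an object in chunk-sized slices."""
    for start in range(0, object_size, chunk):
        end = min(start + chunk, object_size) - 1  # Range offsets are inclusive
        yield f"bytes={start}-{end}"

# A 20 MiB object splits into two full 8 MiB slices plus a 4 MiB tail,
# each fetched by its own concurrent GET:
print(list(byte_ranges(20 * 1024**2)))
```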

Transfer Acceleration

S3 Transfer Acceleration routes uploads through the CloudFront edge network and AWS's private backbone instead of going directly across the public internet. The client uploads to the nearest edge, AWS moves the bytes over its backbone to the destination region.

When to use:

  • Cross-continent uploads where the user is far from the bucket region (e.g., users in Tokyo uploading to a us-east-1 bucket).
  • Large objects (the per-GB fee is amortized over the bytes saved).
  • Workloads where consistent performance matters more than the lowest cost.

When not to use:

  • Same-region uploads — Transfer Acceleration adds cost without benefit.
  • Small objects under a few MB — the extra cost rarely justifies the marginal speedup.

The S3 Transfer Acceleration Speed Comparison Tool estimates the percent improvement Transfer Acceleration would give for a specific bucket from your network location, before you enable it.

CloudFront in front of S3 for downloads

For downloads of static or cacheable content to many clients, CloudFront sitting in front of S3 is usually faster and cheaper than Transfer Acceleration. CloudFront caches at the edge so subsequent reads do not even hit the bucket. Transfer Acceleration is for uploads and for one-time large downloads where caching does not help.

  • Per-prefix request rate: 3,500 PUT/COPY/POST/DELETE per second, 5,500 GET/HEAD per second.
  • Multipart upload threshold (recommended): objects > 100 MB.
  • Multipart upload required: objects > 5 GB cannot use single PUT.
  • Maximum part size: 5 GB. Maximum parts per upload: 10,000. Maximum object size: 5 TB.
  • Single PUT max object size: 5 GB.
  • Transfer Acceleration: routes through CloudFront edges + AWS backbone; per-GB surcharge.
  • Multipart abort lifecycle rule: should be set on every bucket that ingests large uploads.
  • Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

A common SOA-C02 distractor implies S3 has a "bucket throughput limit" or that "all objects share the same scaling pool". They do not. Each prefix has its own scaling envelope (3,500 writes/sec, 5,500 reads/sec). A bucket with a single hot prefix throttles even though the bucket-wide aggregate is well below any imaginary limit. The fix is to redistribute keys across more prefixes, not to provision a larger bucket. Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

RDS Performance Insights: Top SQL, Wait Events, AAS

RDS Performance Insights is the SysOps team's diagnostic dashboard for RDS and Aurora. It is enabled per database instance (one toggle in the console or --enable-performance-insights on the CLI).

The DB Load (AAS) metric

The headline Performance Insights metric is DB Load, expressed as Average Active Sessions (AAS). AAS is the number of database sessions actively running queries at any moment. The reference baseline is the number of vCPUs on the instance — if AAS is consistently above the vCPU count, the database is CPU-bound. If AAS is high but mostly non-CPU wait events (IO, lock, network), the bottleneck is elsewhere.

Top dimensions

Performance Insights groups DB Load by dimensions you can pivot through:

  • Top SQL — which queries are accumulating the most DB Load. The first place to look when overall load is high.
  • Top waits — which wait events dominate (CPU, IO:DataFileRead, IO:XactSync, Lock:transaction, etc.). The diagnostic axis.
  • Top hosts — which client hosts originate the load. Useful for finding the rogue cron job.
  • Top users — which DB users are consuming load. Useful in shared-database environments.
  • Top databases — which logical database (in multi-DB instances).

Common wait event diagnoses

  • CPU — the SQL is computationally expensive. Optimize the query, add an index, scale up the instance class.
  • IO:DataFileRead — the SQL is reading data from disk that was not in the buffer cache. Add memory (larger instance), provision more IOPS, or add an index to read fewer pages.
  • IO:XactSync (PostgreSQL) / io_completion (SQL Server) — write commits waiting on storage. Provision more IOPS or move to io2 Block Express.
  • Lock:transaction — sessions waiting on row locks held by another transaction. The application has lock contention; investigate slow transactions.
  • Client:ClientRead — the database is waiting on the client to read results. The client app is slow, not the DB.
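These diagnoses condense into a small playbook table. A sketch; the mapping and wording are this guide's summary, not a Performance Insights API:

```python
# Top wait event -> (what it means, first remediation to try)
WAIT_EVENT_PLAYBOOK = {
    "CPU":               ("query is compute-heavy", "optimize the SQL, add an index, or scale up"),
    "IO:DataFileRead":   ("reads are missing the buffer cache", "add memory, IOPS, or an index"),
    "IO:XactSync":       ("commits are waiting on storage", "provision more IOPS or a faster volume"),
    "Lock:transaction":  ("row-lock contention", "find and shorten the blocking transaction"),
    "Client:ClientRead": ("database is waiting on a slow client", "tune the application, not the DB"),
}

def diagnose(top_wait_event: str) -> str:
    meaning, action = WAIT_EVENT_PLAYBOOK.get(
        top_wait_event, ("unknown wait", "check the engine's wait-event reference"))
    return f"{top_wait_event}: {meaning} -> {action}"

print(diagnose("IO:DataFileRead"))
```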

Retention and pricing

  • Free tier: 7 days of Performance Insights data retention.
  • Paid (long-term retention): up to 24 months for an additional fee.

For SOA-C02, the giveaway is "the team needs trend analysis over the past quarter" → long-term retention; "the team needs to debug last week's incident" → free tier is enough.

Enhanced Monitoring vs Performance Insights

Two RDS-specific monitoring features that are easy to confuse:

  • CloudWatch metrics — 60-second default granularity; instance-level CPU, IOPS, connections, replica lag; free.
  • Enhanced Monitoring — configurable 1, 5, 10, 15, 30, or 60-second granularity; OS-level metrics from inside the DB host (per-process CPU/memory, OS load); billed per instance, with cost scaling with granularity.
  • Performance Insights — 1-second sampling; DB load, top SQL, wait events; free for 7 days of retention, paid for up to 24 months.

Performance Insights answers "which SQL and which waits". Enhanced Monitoring answers "what is the OS doing". They complement each other; neither replaces the other.

SOA-C02 questions about RDS performance almost always have Performance Insights as the right answer. The exam guide explicitly names Performance Insights under Task Statement 6.2. The candidate must know that AAS = active sessions, that wait events name the bottleneck, and that the free 7-day retention is enough for most operational debugging. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html

The 24-month retention number is paid. SOA-C02 sometimes phrases the question to make 24 months sound free. The free tier is exactly 7 days. If the scenario requires longer retention for capacity-planning trends or audit, choose long-term retention and accept the additional cost. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html

RDS Proxy: Connection Pooling for Connection Storms

RDS Proxy is a managed connection pool that sits between application clients and the RDS or Aurora database. It maintains a warm pool of connections to the database and multiplexes client connections across them.

Why RDS Proxy exists

Database connections are expensive to open. A typical PostgreSQL or MySQL connection consumes memory and a backend process. Lambda scaling out to thousands of concurrent invocations creates a connection storm — each Lambda instance opens its own connection, the database hits its max_connections limit, and new connections error with too many connections. The traditional fix (application-side connection pool) does not work for serverless because each Lambda is a fresh process.
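The storm is simple arithmetic. A sketch with hypothetical numbers (the connection limit and the multiplexing ratio below are illustrative; real multiplexing depends on transaction behaviour):

```python
import math

MAX_CONNECTIONS = 1_365   # hypothetical engine connection limit for this example
MULTIPLEX_RATIO = 20      # illustrative: 20 client sessions per backend connection

lambdas = 3_000           # concurrent invocations, one connection attempt each
direct = lambdas                               # no proxy: one backend conn per Lambda
pooled = math.ceil(lambdas / MULTIPLEX_RATIO)  # behind a proxy: 150 backend conns

print("direct overflows:", direct > MAX_CONNECTIONS)  # True: "too many connections"
print("pooled overflows:", pooled > MAX_CONNECTIONS)  # False: the proxy absorbs the storm
```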

RDS Proxy solves this by:

  • Pooling and reusing database connections across many client sessions.
  • Multiplexing — many client sessions share fewer backend connections.
  • Holding extra clients in a queue when backend connections are saturated, smoothing burst.
  • Failing over faster than direct RDS connections during Multi-AZ events (the proxy retries to the new primary instead of forcing the client to reconnect).
  • Integrating with Secrets Manager for credential rotation without application changes.
  • Enforcing IAM authentication at the proxy layer.

When to use RDS Proxy

  • Lambda functions hitting RDS or Aurora.
  • Web applications with high connection churn (each request opens a new connection).
  • Workloads requiring faster failover (proxy reduces failover times on Multi-AZ).
  • Workloads requiring Secrets Manager-backed credential rotation without restarting application instances.

When RDS Proxy is overkill

  • Long-running EC2-hosted apps with their own well-tuned application-side pool.
  • Workloads with very low connection rates (the proxy adds latency overhead).

Pricing

RDS Proxy is billed per hour per vCPU of the underlying database. Not free, but typically cheaper than the cost of database connection-limit incidents.

The classic SOA-C02 scenario: "the team migrated their API to Lambda, and now the database returns Too many connections during traffic spikes". The answer is RDS Proxy, full stop. Increasing max_connections is a temporary patch that masks the cost of high connection churn; RDS Proxy is the architectural fix. Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy.html

EC2 Enhanced Networking: ENA, SR-IOV, and Bandwidth

Enhanced networking is an EC2 capability that delivers higher bandwidth, higher PPS, and lower latency by bypassing the hypervisor's standard virtio path with SR-IOV (Single-Root I/O Virtualization).

The two enhanced networking adapters

  • ENA (Elastic Network Adapter) — the modern enhanced networking adapter on Nitro instances. Up to 100 Gbps on the largest instance types (p4d, p5, c7gn). The default for any current-generation instance.
  • EFA (Elastic Fabric Adapter) — a specialized adapter for HPC and ML workloads that adds OS-bypass for MPI and NCCL. Used on c5n, c6gn, c7gn, p3dn, p4d, p5. Requires the EFA driver. Out of scope for typical SysOps; included for completeness.

The legacy Intel ixgbevf (VF) driver is still on the exam as a historical alternative for older c3/c4/r3 instances; assume current-generation = ENA.

Enabling enhanced networking

For ENA:

  1. Use a current-generation instance type (most are ENA-capable by default — check the documentation table).
  2. The AMI must include the ENA driver (Amazon Linux 2/2023 and current Ubuntu/RHEL/SUSE/Windows AMIs do).
  3. The enaSupport attribute on the AMI and instance must be true (Amazon-published AMIs have this set; for custom AMIs, set with aws ec2 modify-instance-attribute --ena-support).

If any of the three is missing, the instance falls back to the standard virtio interface and you do not get enhanced networking — even though the instance is capable. SOA-C02 tests this misconfiguration: "the SysOps team chose c5n.18xlarge for a 100 Gbps workload but is only seeing 10 Gbps" — the AMI is missing the ENA driver or enaSupport=false.

Instance bandwidth ceilings

Each instance type has a documented bandwidth ceiling (e.g., m5.large ≈ 10 Gbps burst, c5n.18xlarge ≈ 100 Gbps sustained). The ENA adapter is necessary but not sufficient — the instance type's ceiling caps actual throughput. SOA-C02 favors network-bound scenarios where the candidate must (a) confirm enhanced networking is on, (b) confirm the instance is sized for the bandwidth requirement, (c) check NetworkPacketsIn/Out for PPS limits.

Three things must be true for ENA-based enhanced networking to actually work: a supported instance type, an AMI with the ENA driver baked in, and enaSupport=true on the AMI/instance. Missing any one results in a silent fallback to virtio at much lower bandwidth. The SysOps engineer's diagnostic is ethtool -i eth0 — the driver should be ena, not vif. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html
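The ethtool diagnostic above can be sketched as a small parser. This is a minimal illustration, assuming the standard `ethtool -i` output format; the sample strings below are illustrative, not captured from a real instance.

```python
# Check whether the ENA driver is active, given the text of `ethtool -i eth0`.
def ena_active(ethtool_output: str) -> bool:
    """Return True when the `driver:` line of `ethtool -i` names ena."""
    for line in ethtool_output.splitlines():
        if line.strip().startswith("driver:"):
            return line.split(":", 1)[1].strip() == "ena"
    return False

# Illustrative outputs: a Nitro instance with ENA, and a Xen fallback (vif).
nitro = "driver: ena\nversion: 2.8.0\nbus-info: 0000:00:05.0\n"
xen_fallback = "driver: vif\nversion:\nbus-info: vif-0\n"
```

If the fallback driver is reported, the fix path is the one described above: install the ENA driver in the AMI and set enaSupport=true.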

Instance Store: Ephemeral NVMe for High IOPS

Instance store is local block storage physically attached to the host machine. Unlike EBS (network-attached, durable), instance store is:

  • Ephemeral: lost on stop, hibernation, termination, or hardware failure; survives reboot only.
  • High-performance: NVMe SSD with millions of IOPS and gigabytes-per-second throughput on i3, i3en, i4i, m5d, c5d, r5d series.
  • Free with the instance: no separate billing. Capacity is determined by the instance type.
  • Cannot be attached or detached: it is part of the instance.

When to use instance store

  • Caching layers — Redis-on-EC2, Memcached fleets, application-side caches that can be re-warmed from durable storage.
  • Scratch / temp data — sort spill files, temporary index builds, video transcoding intermediate frames.
  • Distributed databases with their own replication — Cassandra, MongoDB, Elasticsearch using instance store for the data path with replication providing durability across nodes.
  • Performance benchmarking where EBS network bandwidth is the bottleneck.

When NOT to use instance store

  • Single-source-of-truth data that must survive instance failure.
  • Data that must persist across stop/start (aws ec2 stop-instances discards instance store).
  • Workloads expecting EBS snapshots — instance store cannot be snapshotted.

Operational considerations

  • The data is encrypted at rest by default on Nitro instances (AES-256, AWS-managed keys, no charge).
  • Instance store volumes appear as block devices (/dev/nvme1n1, etc.) but require manual filesystem creation and mounting.
  • Replacing an instance type usually means losing instance store contents — the new instance has fresh, empty volumes.

SOA-C02 may ask "the workload needs over 1 million IOPS for a sort spill at sub-millisecond latency, and durability is not required because the data is rebuilt on each run". This is the textbook instance store use case — io2 Block Express tops out at 256K IOPS per volume, instance store on i4i.32xlarge exceeds 1M IOPS at much lower cost. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html
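The decision in that scenario can be written down as a small rule, under the assumptions stated in the text (256,000 IOPS per io2 Block Express volume; instance store cannot persist or be snapshotted). A hedged sketch:

```python
# Choose between EBS and instance store from two signals: the IOPS requirement
# and whether the data must survive the instance (durability or snapshots).
IO2_BLOCK_EXPRESS_MAX_IOPS = 256_000  # per-volume maximum, per the text above

def pick_block_storage(required_iops: int, needs_durability: bool) -> str:
    if needs_durability:
        return "ebs"                    # instance store cannot persist or snapshot
    if required_iops > IO2_BLOCK_EXPRESS_MAX_IOPS:
        return "instance-store"         # e.g. i4i NVMe for >1M IOPS scratch data
    return "ebs-or-instance-store"      # either fits; compare cost and latency
```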

EC2 Placement Groups: Cluster, Spread, Partition

Placement groups influence how AWS places EC2 instances on the underlying physical hardware to optimize for either low latency or fault isolation.

Cluster placement group

  • Packs instances close together within a single Availability Zone, on the same rack or adjacent racks.
  • Low latency (sub-millisecond) and high bandwidth (up to 10 Gbps per flow, 100 Gbps aggregate on ENA-supported instances).
  • Use case: HPC tightly-coupled workloads — MPI simulations, in-memory data analytics, ML training synchronization.
  • All instances in the group should be the same type (mixing types reduces packing efficiency).
  • Single AZ — no fault isolation. A rack-level failure can take down the entire group.

Spread placement group

  • Spreads instances across distinct underlying hardware so a single hardware failure affects at most one instance.
  • Each instance is on a separate rack with its own network and power.
  • Maximum 7 instances per AZ in a spread group.
  • Use case: small fleets of critical instances that must each survive — domain controllers, MQTT brokers, license servers.
  • Across AZs is supported.

Partition placement group

  • Divides the group into partitions, each of which is on distinct racks with no shared infrastructure.
  • Up to 7 partitions per AZ, each with many instances.
  • Use case: large distributed workloads where the framework knows about partitions / racks — Cassandra, HDFS, Kafka. The application maps replicas to different partitions so a rack failure takes out at most one replica.
  • The partition number is exposed via instance metadata for the application to consume.

Comparison table

Attribute | Cluster | Spread | Partition
Goal | Low latency | Fault isolation | Distributed fault isolation
Hardware | Same rack/adjacent | Distinct racks | Partition = rack
Size limit | Practical for tens of instances | 7 per AZ | 7 partitions × many instances per AZ
Multi-AZ | No | Yes | Yes
Instance types | Should be uniform | Any | Any
Typical workload | HPC, MPI, ML training | Small critical fleet | Cassandra, HDFS, Kafka

Operational notes

  • A placement group is created empty, then instances are launched into it. You cannot retroactively add a running instance to a Cluster placement group.
  • For Spread / Partition, you can move a stopped instance into the group via aws ec2 modify-instance-placement, then start it again.
  • Cluster placement groups should use ENA-enabled instance types in the same family for predictable bandwidth.

SOA-C02 tests recognition of the workload signal: "low-latency HPC simulation" → Cluster; "small fleet of MQTT brokers that must each survive a host failure" → Spread; "Cassandra cluster with replicas spread across rack-level failure domains" → Partition. The candidate who memorizes "Spread for HA" without mapping to scale ends up choosing Spread for a 200-node Cassandra cluster — wrong, because Spread caps at 7 per AZ. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html
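The workload-to-group mapping above can be encoded as a decision helper. A minimal sketch, assuming the limits stated in this section (Spread caps at 7 instances per AZ):

```python
# Map fleet size and workload shape to a placement group.
SPREAD_MAX_PER_AZ = 7

def choose_placement_group(instances_per_az: int,
                           tightly_coupled: bool,
                           partition_aware: bool) -> str:
    if tightly_coupled:
        return "cluster"      # HPC / MPI / ML training sync: pack for latency
    if partition_aware or instances_per_az > SPREAD_MAX_PER_AZ:
        return "partition"    # Cassandra/HDFS/Kafka, or any fleet > 7 per AZ
    return "spread"           # small fleet of individually critical hosts
```

A 200-node Cassandra ring across 3 AZs (about 67 instances per AZ) is both partition-aware and too large for Spread, so the helper lands on Partition, matching the exam answer above.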

Scenario Pattern: EBS Volume Hitting BurstBalance Zero — Upgrade Path

A canonical SOA-C02 troubleshooting scenario. The runbook:

  1. Identify the volume type. Console → EC2 → Volumes → check Volume type. If it is gp2, st1, or sc1, BurstBalance applies. If it is gp3 or io2, BurstBalance does not exist (no burst credits — performance is provisioned).
  2. Confirm BurstBalance is the bottleneck. CloudWatch → Metrics → AWS/EBS → BurstBalance. Trending toward 0 over the past hours means burst credits are depleting; once at 0 the volume throttles to baseline.
  3. Decide on the upgrade path. Two options:
    • Migrate to gp3 with ModifyVolume. gp3 baseline is 3,000 IOPS regardless of size — solves most gp2 burst depletion cases at lower cost. For higher IOPS, additionally provision IOPS up to 16,000.
    • Upsize the gp2 volume so the baseline (3 IOPS/GB) matches the steady-state demand. Less efficient and more expensive than gp3 in nearly all cases — included only as the legacy answer.
  4. Issue the modification. aws ec2 modify-volume --volume-id vol-... --volume-type gp3 --iops 6000 --throughput 250. The volume enters modifying, then optimizing (during which performance is reduced), then completed.
  5. Validate with metrics. After the modification settles, watch VolumeQueueLength — it should drop to under 1 if the new IOPS/throughput fits the workload. Watch VolumeReadOps/VolumeWriteOps divided by period for actual achieved IOPS.
  6. Set up the alarm. Create a BurstBalance < 20% alarm on any remaining gp2 volumes — and ideally migrate them all to gp3 over time. For gp3, alarm on VolumeQueueLength > 1 instead.

The most common root cause is a steady-state IOPS demand that exceeded gp2's size-coupled baseline; the fix is gp3 in 95 percent of cases. SOA-C02 expects this exact sequence.
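The size-coupled baseline behind that root cause can be sketched numerically. A minimal Python model of the documented gp2 burst mechanics (baseline of 3 IOPS per GiB with a 100-IOPS floor and 16,000 cap, a 5.4-million-credit bucket, burst to 3,000 IOPS):

```python
# gp2 burst-credit arithmetic: how long a volume can sustain demand above
# its baseline before BurstBalance hits zero and the volume throttles.
BURST_BUCKET = 5_400_000   # I/O credits in a full bucket
BURST_IOPS = 3_000         # maximum burst rate

def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume of the given size."""
    return min(max(100, 3 * size_gib), 16_000)

def seconds_until_throttle(size_gib: int, demand_iops: int) -> float:
    """Seconds a full bucket lasts at a steady demand above baseline.

    Returns infinity when demand is at or below baseline, because credits
    refill as fast as they drain.
    """
    baseline = gp2_baseline_iops(size_gib)
    effective = min(demand_iops, BURST_IOPS)
    if effective <= baseline:
        return float("inf")
    return BURST_BUCKET / (effective - baseline)
```

A 100 GiB gp2 volume (baseline 300 IOPS) driven at 3,000 IOPS empties its bucket in 5.4M / (3000 - 300) = 2,000 seconds, about 33 minutes, which is exactly why the gp3 flat 3,000-IOPS baseline makes the symptom disappear.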

Scenario Pattern: RDS p99 Latency Spike — Top SQL → Wait Events → Fix

Another canonical scenario. The application reports p99 latency tripling during business hours. The runbook:

  1. Open Performance Insights for the database. Set the time window to the spike period.
  2. Read the DB Load (AAS) chart. Compare against the vCPU count. If AAS is far above vCPUs, the DB is saturated; if AAS is moderate but the application p99 still spikes, the bottleneck might be elsewhere (network, app tier).
  3. Slice by Top SQL. The list shows which queries are accumulating load. Often a single new query (post-deploy) dominates.
  4. Slice by Top Waits. This names the bottleneck:
    • CPU — the SQL is computationally heavy. Add an index, rewrite the query, or scale the instance class.
    • IO:DataFileRead — the working set exceeds the buffer cache. Add memory (larger instance class), provision more IOPS, or add an index to read fewer pages.
    • Lock:transaction — application has lock contention. Investigate the transaction holding the lock; commit faster or refactor.
    • Client:ClientRead — the DB is waiting on the client. The application is slow, not the DB.
  5. Check DBConnections. If connections are at the max_connections limit, that alone causes new requests to queue or fail. The fix is RDS Proxy.
  6. Check ReadLatency / WriteLatency on the underlying storage. If they are higher than expected (>10ms for SSD), the storage is throttling — provision more IOPS or move to io2 Block Express.
  7. Apply the targeted fix (index, query rewrite, instance class change, IOPS bump, RDS Proxy, or read replica for read scaling), then re-validate against the same Performance Insights view for the next business-hours window.

For SOA-C02, the signal phrases match the wait events: "high latency on a complex JOIN" → Top SQL + index suggestion; "many short queries timing out" → DBConnections + RDS Proxy; "writes are slow" → ReadLatency/WriteLatency + provisioned IOPS.
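Step 2 of the runbook (AAS versus vCPU count) can be expressed as a small classifier. A sketch under the rule of thumb in the text; the 0.75 warning band is an illustrative threshold, not an AWS-published formula:

```python
# Classify DB Load from Performance Insights: compare average active
# sessions (AAS) against the instance's vCPU count.
def classify_db_load(aas: float, vcpus: int) -> str:
    if aas > vcpus:
        return "saturated"            # more active sessions than cores: queueing
    if aas > 0.75 * vcpus:            # assumed warning band, tune per workload
        return "approaching-limit"
    return "headroom"                 # p99 spike likely lives outside the DB
```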

Common Trap: gp2 Burst Credit Depletion Is Invisible Without an Alarm

CloudWatch does not alarm by default on BurstBalance. A gp2 volume can run for days at full burst, slowly depleting credits, and the team only notices when the application becomes unresponsive after credits hit zero. By then the throttle is in effect and a runbook scramble is on. The fix is preventive: alarm on BurstBalance < 20% so the team has hours of lead time, and migrate from gp2 to gp3 wherever the burst pattern is regular rather than rare.

Common Trap: io2 Block Express Requires Nitro Instances

io2 Block Express delivers up to 256,000 IOPS per volume and sub-millisecond latency, but it is only supported on Nitro-based instance types. Attaching an io2 Block Express volume to a non-Nitro instance type is not allowed; SOA-C02 sometimes embeds an older-generation instance type (m4, c4, r4) in a scenario asking for io2 Block Express performance. The candidate must spot the mismatch and either migrate to a Nitro instance type or accept io2 (non-Block Express) which works on more instance families.

Common Trap: S3 Prefix Scaling Is Per-Prefix, Not Bucket-Wide

Newer SysOps engineers sometimes assume the bucket has a fixed throughput limit. It does not. The 3,500 PUT/sec and 5,500 GET/sec limits are per prefix, and a bucket can have effectively unlimited prefixes. A single hot prefix throttles regardless of bucket size. The fix is key naming: spread keys across many prefixes (e.g., randomized first character) so requests fan out across S3's internal partitioning.
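The key-naming fix can be sketched in a few lines: derive a short hash prefix so sequential keys fan out across many S3 prefixes. The 2-hex-character prefix (256 buckets of keys) is an illustrative choice, not an AWS requirement:

```python
import hashlib

# Prepend a deterministic hash prefix so sequential keys spread across
# many S3 prefixes instead of hammering one shared prefix like "logs/".
def spread_key(original_key: str, hex_chars: int = 2) -> str:
    prefix = hashlib.md5(original_key.encode()).hexdigest()[:hex_chars]
    return f"{prefix}/{original_key}"
```

With this scheme, "logs/2024-06-01/0001.gz" and "logs/2024-06-01/0002.gz" land under different prefixes, each of which gets its own 3,500 PUT/sec and 5,500 GET/sec allotment.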

Common Trap: Performance Insights Free Retention Is 7 Days, Not 24 Months

The 24-month retention number is the paid long-term retention option. The default and free tier is 7 days. SOA-C02 can phrase a question to make 24 months sound free; the candidate must remember that long-term retention is opt-in and billed.

Common Trap: Detailed Monitoring vs CloudWatch Agent for EBS

Detailed EC2 monitoring shifts the EC2 hypervisor metric period from 5 minutes to 1 minute, but it does not affect EBS metrics — those publish at 1-minute granularity by default for any attached volume regardless of detailed monitoring. The CloudWatch agent is what you need for filesystem-level metrics (disk_used_percent), not for EBS volume-level metrics (which the EBS service publishes automatically). Confusing the two leads to dashboards missing critical information.

SOA-C02 vs SAA-C03: The Operational Lens

SAA-C03 and SOA-C02 both touch performance topics, but the lenses differ.

Question style | SAA-C03 lens | SOA-C02 lens
Volume selection | "Which EBS volume type for a high-IOPS database?" | "BurstBalance hit zero on gp2 — what's the runbook to fix without downtime?"
Storage scaling | "Choose an architecture that handles 10x growth." | "Run ModifyVolume to grow size and IOPS; what OS-level commands extend the filesystem?"
S3 throughput | "Architect for high-throughput data ingestion." | "PUT requests are returning 503 SlowDown — diagnose prefix scaling."
RDS performance | "Choose RDS instance class and storage for the workload." | "Performance Insights shows IO:DataFileRead dominating — what's the fix?"
Database connections | "Use RDS Proxy for serverless." | "Lambda is hitting Too many connections — configure RDS Proxy and verify."
Networking | "Choose enhanced-networking-capable instance." | "ENA-capable instance is only delivering 10 Gbps — diagnose missing driver / enaSupport."
Placement | "Which placement group for HPC?" | "200-node Cassandra cluster — Partition with how many partitions per AZ?"
Right-sizing | "Which instance type for this CPU + memory profile?" | "Use Compute Optimizer with enhanced infrastructure metrics; pick risk-1 recommendation."

The SAA candidate selects the right resource. The SOA candidate measures, modifies live, diagnoses, and validates the change.

Exam Signal: How to Recognize a Domain 6.2 Question

Domain 6.2 questions on SOA-C02 follow predictable shapes.

  • "BurstBalance is dropping" → migrate gp2 to gp3 with ModifyVolume.
  • "Sub-millisecond latency at 200,000 IOPS" → io2 Block Express on a Nitro instance.
  • "Too many connections from Lambda" → RDS Proxy.
  • "Slow query during business hours" → Performance Insights → Top SQL → wait events.
  • "PUT 503 SlowDown errors" → S3 prefix redistribution.
  • "Object > 100 MB upload is slow / unreliable" → multipart upload + Transfer Acceleration if cross-continent.
  • "100 Gbps workload only getting 10 Gbps" → ENA driver missing or enaSupport=false.
  • "Cassandra / Kafka / HDFS distributed cluster" → Partition placement group.
  • "HPC simulation needs lowest network latency" → Cluster placement group.
  • "Domain controllers / license servers / brokers must each survive a host failure" → Spread placement group.
  • "Right-size the fleet based on actual usage" → Compute Optimizer with enhanced infrastructure metrics.
  • "Capacity-plan trend over the last quarter" → Performance Insights long-term retention (paid).

With Domain 6 worth 12 percent and Task Statement 6.2 covering the performance half, expect 6 to 8 questions in this exact territory. Mastering the EBS metric tree, gp2-vs-gp3 mechanics, Performance Insights wait events, and placement group selection covers most of them. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html

Decision Matrix — Performance Construct for Each SysOps Goal

Use this lookup during the exam.

Operational goal | Primary construct | Notes
Stop gp2 burst depletion | ModifyVolume → gp3 with provisioned IOPS | Live, no downtime; OS resize only needed for size changes.
Sub-millisecond DB latency at 200K IOPS | io2 Block Express on Nitro | Confirm the instance is Nitro-class first.
Identify slow SQL on RDS | Performance Insights, Top SQL + wait events | Free 7d retention; 24mo retention paid.
Lambda connection storm to RDS | RDS Proxy | Pools, multiplexes, integrates with Secrets Manager.
Upload > 100 MB to S3 | Multipart upload | Required > 5 GB; configure abort lifecycle rule.
Cross-continent S3 upload speedup | Transfer Acceleration | Per-GB surcharge; validate with Speed Comparison Tool.
Many small reads of one large S3 object | Byte-range fetches | Concurrent ranges, parallel HTTP GETs.
S3 hot prefix throttling | Redistribute keys across more prefixes | Per-prefix, not bucket-wide.
100 Gbps workload | Nitro instance + ENA + AMI with ENA driver + enaSupport=true | All three required.
HPC tight coupling | Cluster placement group | Same AZ, same rack — no fault isolation.
Small critical fleet HA | Spread placement group | Max 7/AZ, distinct racks.
200-node Cassandra | Partition placement group | Up to 7 partitions/AZ, many instances each.
1M+ IOPS ephemeral scratch | Instance store on i4i / m5d | Lost on stop/terminate.
Right-size the fleet | Compute Optimizer + CloudWatch agent memory metric | Use risk-1 recommendations first.
Recommend instance from metrics | Compute Optimizer + 14 days of metrics | Enhanced infrastructure metrics for memory awareness.
Live volume size + filesystem extend | ModifyVolume then growpart + resize2fs/xfs_growfs | OS-level resize is mandatory after size change.
Avoid orphan multipart uploads | S3 lifecycle rule: abort incomplete multipart after N days | Otherwise you pay for orphaned parts indefinitely.

Common Traps Recap — Performance Tuning

Every SOA-C02 attempt will see two or three of these distractors.

Trap 1: gp2 burst depletion goes unnoticed without an alarm

CloudWatch does not auto-alarm on BurstBalance. Set one at 20 percent and migrate to gp3 wherever a regular burst pattern recurs.

Trap 2: io2 Block Express on non-Nitro instance types

The exam embeds older-generation instance types in a scenario asking for io2 Block Express; the right answer is a Nitro instance.

Trap 3: S3 prefix scaling confused with bucket scaling

3,500 writes/sec and 5,500 reads/sec are per-prefix. The fix is key redistribution, not bucket sharding.

Trap 4: Performance Insights 24-month retention is not free

The free tier is 7 days. Long-term retention is opt-in and billed.

Trap 5: Detailed monitoring is not the answer for memory or filesystem metrics

Detailed monitoring changes the period of EC2 hypervisor metrics from 5 to 1 minute. It does not produce memory or disk-fullness metrics — that is the CloudWatch agent.

Trap 6: Multipart upload threshold is a recommendation, but 5 GB is a hard limit

Single PUT cannot exceed 5 GB. AWS recommends multipart above 100 MB; on the exam, "object is 8 GB and the team is using single PUT" is a multipart fix.
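The arithmetic behind the trap can be sketched directly. The constants are the documented S3 limits (5 GiB per single PUT, 10,000 parts per multipart upload); the 100 MiB default part size mirrors the recommendation above:

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2
SINGLE_PUT_LIMIT = 5 * GIB   # hard limit for a single PUT
MAX_PARTS = 10_000           # hard limit on multipart part count

def plan_upload(object_bytes: int, part_bytes: int = 100 * MIB):
    """Return (part_count, note) for uploading an object of the given size."""
    if object_bytes <= SINGLE_PUT_LIMIT:
        note = "single PUT allowed (multipart still recommended > 100 MB)"
    else:
        note = "multipart required"
    parts = math.ceil(object_bytes / part_bytes)
    assert parts <= MAX_PARTS, "increase part size to stay under 10,000 parts"
    return parts, note
```

The exam's 8 GB object at 100 MiB parts needs ceil(8192/100) = 82 parts, well under the 10,000-part ceiling.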

Trap 7: ENA driver missing on a custom AMI

Three conditions must hold for enhanced networking: capable instance type, AMI with ENA driver, enaSupport=true. Custom AMIs sometimes miss the driver and silently fall back to virtio.

Trap 8: Placement group sizing limits

Spread is capped at 7 instances per AZ. A 50-node fleet cannot fit in a Spread group — the right answer is Partition or per-instance hardware-distinct provisioning.

Trap 9: ModifyVolume size change without OS-level resize

The block device grows, but the partition and filesystem do not auto-grow. growpart + resize2fs (or xfs_growfs) are mandatory after a size modification.
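The OS-level sequence can be sketched as a command generator. This assumes NVMe device naming (partition suffix `p1`); Xen-style devices such as /dev/xvda name partitions without the `p`, so confirm with lsblk before running anything:

```python
# Generate the post-ModifyVolume grow sequence for a given filesystem.
def resize_commands(device: str, partition: int, fstype: str, mount: str):
    cmds = [f"sudo growpart {device} {partition}"]       # grow the partition
    if fstype == "xfs":
        cmds.append(f"sudo xfs_growfs {mount}")          # xfs grows via mount point
    elif fstype == "ext4":
        cmds.append(f"sudo resize2fs {device}p{partition}")  # ext4 grows via device
    else:
        raise ValueError(f"unsupported filesystem: {fstype}")
    return cmds
```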

Trap 10: Instance store data persists across stop/start

It does not. Instance store survives reboot only. Stop, terminate, hibernate, or hardware failure all discard the data.

FAQ — Compute, Storage, and Database Performance Tuning

Q1: When should I switch from gp2 to gp3?

In nearly all cases as soon as it is operationally convenient. gp3 has a higher baseline (3,000 IOPS, 125 MiB/s regardless of size) than gp2 below 1,000 GB, decouples IOPS and throughput from size (so you can keep small volumes fast), supports independent IOPS and throughput provisioning up to 16,000 IOPS / 1,000 MiB/s, and costs about 20 percent less than equivalent gp2. The migration is aws ec2 modify-volume --volume-id vol-... --volume-type gp3 and runs live with no downtime. The only reason to stay on gp2 is a rare mature workload where you have already over-provisioned size to satisfy IOPS — in that case migrate at the next maintenance window. SOA-C02 considers gp3 the default for new general-purpose workloads.

Q2: How do I know whether the bottleneck is the EBS volume or the EC2 instance?

Check VolumeQueueLength on the volume side and the documented EBS bandwidth ceiling on the instance side. If VolumeQueueLength is consistently above 1, the volume is the bottleneck — increase IOPS, throughput, or migrate volume type. If VolumeQueueLength is below 1 but the application is still slow, the limit is somewhere else: the instance's EBS bandwidth ceiling (each instance type has a documented ceiling that caps total volume throughput from all attached EBS), the network, the CPU, or the database. SOA-C02 sometimes presents a 64,000-IOPS io2 volume on an instance with a 4,750-IOPS ceiling — the candidate must recognize the instance side caps the achievable IOPS regardless of the volume's provisioned IOPS.
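That decision rule can be sketched as a three-way check. The thresholds follow the text above (queue length above 1 means the volume, the instance ceiling caps achieved IOPS), not an AWS-published formula:

```python
# Locate the bottleneck: the volume, the instance's EBS-optimized ceiling,
# or something else entirely (network, CPU, database).
def locate_bottleneck(volume_queue_length: float,
                      achieved_iops: int,
                      instance_iops_ceiling: int) -> str:
    if volume_queue_length > 1:
        return "volume"       # raise IOPS/throughput or change volume type
    if achieved_iops >= instance_iops_ceiling:
        return "instance"     # the instance ceiling caps all attached volumes
    return "elsewhere"        # network, CPU, or the database itself
```

The 64,000-IOPS io2 volume on a 4,750-IOPS instance from the exam scenario lands on "instance": the volume's provisioned IOPS never matter because the instance side caps them first.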

Q3: What is the difference between IOPS and throughput, and which one matters more?

IOPS is operations per second; throughput is bytes per second. They measure different aspects of the same workload. A workload doing many small reads (4 KB each) is IOPS-bound — it can saturate the IOPS ceiling long before the throughput ceiling. A workload doing few large reads (1 MB each) is throughput-bound — it can saturate throughput before IOPS. OLTP databases and transactional applications are typically IOPS-bound; data warehousing, video processing, and big-data scans are typically throughput-bound. gp3 lets you provision both independently, which is exactly the SysOps lever for tuning. The CloudWatch metrics map one-to-one: VolumeReadOps/VolumeWriteOps for IOPS, VolumeReadBytes/VolumeWriteBytes for throughput. Always check both.
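The IOPS-bound versus throughput-bound distinction is a one-line calculation: for a given average I/O size, find which ceiling saturates first. The defaults are the documented gp3 maxima (16,000 IOPS, 1,000 MiB/s):

```python
MIB = 1024 ** 2

# Which gp3 ceiling binds first for a workload with a given average I/O size?
def binding_limit(io_size_bytes: int,
                  iops_ceiling: int = 16_000,
                  throughput_ceiling: int = 1_000 * MIB) -> str:
    throughput_at_max_iops = iops_ceiling * io_size_bytes
    if throughput_at_max_iops < throughput_ceiling:
        return "iops-bound"          # IOPS ceiling is hit first
    return "throughput-bound"        # byte ceiling is hit first
```

For 4 KiB OLTP reads, 16,000 × 4 KiB is only 62.5 MiB/s, far below 1,000 MiB/s, so the workload is IOPS-bound; for 1 MiB scans, 16,000 × 1 MiB would be 15.6 GiB/s, so throughput caps first.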

Q4: When should I use S3 Transfer Acceleration?

When the client is far from the bucket region (cross-continent uploads), the objects are large enough that the per-GB surcharge is amortized, and consistent upload performance matters more than absolute lowest cost. Run the S3 Transfer Acceleration Speed Comparison Tool for the bucket from your network location — if the comparison shows a meaningful percent improvement, enable Transfer Acceleration. For same-region uploads, small objects, or downloads to many clients, Transfer Acceleration is not the right answer. For downloads to many clients of cacheable content, CloudFront is faster and cheaper than Transfer Acceleration because the second request hits the cache. Transfer Acceleration is for uploads and one-time large downloads; CloudFront is for cacheable downloads.

Q5: How does Performance Insights actually find the slow query?

Performance Insights samples the database session list once per second, recording which sessions are active and what each is waiting on. Over time this produces a histogram of DB Load (AAS) sliced by SQL statement, host, user, database, and wait event. The slow query is the SQL with the highest accumulated load — meaning either the most concurrent active sessions running it, or the longest individual runs. The wait events tell you why each session is slow: IO:DataFileRead means storage reads, CPU means computation, Lock:transaction means contention. The top-SQL plus top-waits combination uniquely identifies the bottleneck. Performance Insights does not require enabling the slow-query log or any client-side instrumentation; it works at the database engine layer.

Q6: Should I use RDS Proxy for every RDS deployment?

No. RDS Proxy is the right answer for high connection-churn workloads — Lambda functions hitting RDS, web apps that open a connection per request, microservices with autoscaling fleets. For long-running EC2-hosted applications with their own well-tuned application-side connection pool, RDS Proxy adds latency overhead without solving a real problem. The decision rule is connection rate: if the application is opening more than dozens of new connections per second, or hitting the database's max_connections limit, RDS Proxy is justified. RDS Proxy also accelerates failover during Multi-AZ events, which is itself a reason to deploy it for any tier-1 RDS workload regardless of connection pattern. SOA-C02 favors RDS Proxy whenever the scenario mentions Lambda + RDS.
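The decision rule above can be condensed into a predicate. A hedged sketch; the default churn threshold of 24 new connections per second is an illustrative reading of "more than dozens per second", tune it for your workload:

```python
# Is RDS Proxy justified for this workload? Any one of the three signals
# from the text is sufficient.
def rds_proxy_justified(new_connections_per_sec: float,
                        hitting_max_connections: bool,
                        tier1_multi_az: bool,
                        churn_threshold: float = 24.0) -> bool:
    return (new_connections_per_sec > churn_threshold  # high connection churn
            or hitting_max_connections                 # "Too many connections"
            or tier1_multi_az)                         # faster Multi-AZ failover
```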

Q7: What happens to instance store data when I stop the instance?

It is permanently lost. Instance store survives reboot only. Stop, terminate, hibernate, and underlying hardware failure all discard instance store contents. This is the operational reason instance store is reserved for ephemeral data: caches, scratch space, sort spills, replicated data in distributed databases. If the data must persist across aws ec2 stop-instances, use EBS. If the data must be recoverable from snapshot, use EBS. Instance store cannot be snapshotted, cannot be detached, and cannot be moved between instances.

Q8: How do I pick between Cluster, Spread, and Partition placement groups?

Match the placement group to the workload's communication pattern and failure tolerance. Cluster for HPC, MPI, ML training synchronization, and any workload where instances communicate frequently with each other and benefit from low latency and high bandwidth on a single rack — accepting that a rack failure takes everyone out. Spread for small fleets (≤ 7 per AZ) of individually critical instances — domain controllers, MQTT brokers, license servers — where you want each on a distinct rack so a hardware failure affects at most one. Partition for large distributed workloads where the framework understands "rack" or "partition" — Cassandra, HDFS, Kafka — and you want replicas spread across partitions so a partition failure takes out at most one replica. The size hint is the giveaway: ≤ 7 critical → Spread; HPC tightly-coupled → Cluster; large distributed → Partition.

Q9: My Compute Optimizer recommendations look wrong — why?

Compute Optimizer needs at least 14 days of CloudWatch metrics for a confident recommendation, and it needs the CloudWatch agent's memory metric to detect memory-bound workloads. Without the agent, Compute Optimizer assumes memory is not bound and recommends smaller instances based purely on CPU; if the workload was actually memory-pinned, downsizing per the recommendation will degrade performance. Enable enhanced infrastructure metrics in Compute Optimizer (consumes the agent's memory metric) and ensure the CloudWatch agent is publishing memory and disk metrics for at least two weeks before trusting the recommendations. Also, treat the performance risk score as gating: only act on risk-1 (lowest risk) recommendations without further validation; risk-3 and risk-4 recommendations require benchmarking before applying.

Q10: Why is my c5n.18xlarge instance only delivering 10 Gbps when the spec says 100 Gbps?

Three conditions must all be true for ENA-based enhanced networking to deliver the documented bandwidth: (a) the instance type is ENA-capable (c5n is); (b) the AMI has the ENA kernel driver installed (current Amazon Linux, Ubuntu, RHEL, SUSE, Windows AMIs do; custom AMIs sometimes do not); (c) the enaSupport attribute is set to true on both the AMI and the instance. If any one is missing, the instance falls back to the standard virtio interface and you get standard EC2 networking — which on a c5n.18xlarge is around 10 Gbps. The diagnostic command on Linux is ethtool -i eth0; the driver line should read driver: ena. If it reads driver: vif or another driver, ENA is not active. Fix by updating the AMI (run the ENA installer for your distribution) and setting enaSupport=true on the instance via aws ec2 modify-instance-attribute. After the fix, stop and start (not reboot) the instance for the new attributes to take effect.

Performance tuning interlocks with the rest of SOA-C02. The next operational layers are: EC2 Auto Scaling and ELB high availability for the workload tier whose performance these knobs serve, RDS and Aurora resilience for the database tier where Performance Insights and RDS Proxy live alongside Multi-AZ failover, cost visibility and rightsizing for the budget side of the same Compute Optimizer recommendations, and CloudWatch metrics, alarms, and dashboards for the foundation that makes every metric in this topic observable.

Official Reference Sources