Elastic compute scaling is the beating heart of the AWS Certified Solutions Architect — Associate (SAA-C03) exam. Every resilient, high-performing, and cost-optimized architecture on AWS ultimately rests on the same idea: provision the right compute shape, then add or remove capacity automatically as demand changes. Task statement 3.2 of the SAA-C03 exam guide — "Design high-performing and elastic compute solutions" — funnels directly into this topic, and adjacent tasks in Domain 2 (resilience) and Domain 4 (cost) test it from different angles. This chapter covers elastic compute scaling in the depth the exam actually rewards: EC2 instance families, AWS Graviton, purchasing options, EC2 Auto Scaling Groups, every scaling policy type (target tracking, step, simple, scheduled, predictive), launch templates, mixed-instances policies, cooldowns, warm pools, and Spot interruption handling. Expect elastic compute scaling questions to make up 8 to 12 scenario items on a real SAA-C03 attempt.
What Is Elastic Compute Scaling on AWS?
Elastic compute scaling on AWS is the discipline of matching compute capacity to workload demand automatically, without operator intervention, and in a cost-aware way. At the service level, elastic compute scaling combines three building blocks: (1) a compute runtime — Amazon EC2, AWS Fargate, AWS Lambda, or AWS Batch; (2) a scaling controller — Amazon EC2 Auto Scaling, Application Auto Scaling, Kubernetes HPA/Cluster Autoscaler, or the Lambda concurrency engine; and (3) a signal — CloudWatch metrics, ALB request counts, custom metrics, or time schedules. When you hear "elastic compute scaling" on SAA-C03, the primary reference is EC2 Auto Scaling, but the concept generalizes to every compute surface AWS offers.
Elastic compute scaling solves three architectural problems at once:
- Performance under load — scale out horizontally before latency climbs.
- Cost efficiency at idle — scale in aggressively when demand drops.
- Self-healing — replace unhealthy instances so the desired fleet size is always met.
The SAA-C03 exam guide explicitly lists scalability, high availability, and cost optimization as cross-cutting concerns, which is why elastic compute scaling shows up in so many scenarios. A correct elastic compute scaling answer has to balance all three, not just one.
Why Elastic Compute Scaling Dominates SAA-C03
Elastic compute scaling appears in every domain. Domain 2 (resilience) tests elastic compute scaling via multi-AZ Auto Scaling Groups that survive an AZ failure. Domain 3 (high performance) tests elastic compute scaling via target tracking, predictive scaling, and placement groups. Domain 4 (cost) tests elastic compute scaling via Spot + On-Demand mixed-instances policies, Reserved Instances, and Savings Plans. If you internalize elastic compute scaling deeply, you unlock points in every domain instead of just one.
Elastic compute scaling is the automatic, policy-driven adjustment of compute capacity — typically by adding or removing instances, containers, or concurrent function executions — in response to demand signals, while preserving availability and respecting cost constraints. On AWS, the canonical elastic compute scaling primitive is the EC2 Auto Scaling Group combined with a scaling policy. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html
EC2 Instance Families — Choosing the Right Shape for Elastic Compute Scaling
Elastic compute scaling starts with picking the right instance shape. If the underlying instance family is wrong, no scaling policy in the world will save you. The SAA-C03 exam expects you to map workload profiles to instance families by their family letter.
General Purpose — T and M Families
- T family (T3, T3a, T4g) — burstable general purpose. Baseline CPU is low; CPU credits accumulate during idle and spend during bursts. Ideal for low-traffic web servers, dev/test boxes, microservices with spiky but low-average CPU. T4g uses AWS Graviton.
- M family (M5, M5a, M6i, M6a, M6g, M7g, M7i) — balanced CPU and memory. The go-to family for production web apps, small databases, and backend services that are not obviously compute or memory heavy.
Compute Optimized — C Family
- C family (C5, C5n, C6i, C6g, C6gn, C7g, C7i) — highest CPU per dollar. Batch processing, HPC, scientific modeling, video encoding, ad serving, gaming servers, high-throughput web servers. C6gn adds 100 Gbps networking. Many C-family SKUs are available on AWS Graviton.
Memory Optimized — R, X, and z Families
- R family (R5, R5a, R6i, R6g, R7g) — high memory-to-vCPU ratio for in-memory caches (including self-managed Redis/Memcached), real-time analytics, and mid-size databases.
- X family (X1, X1e, X2, X2idn, X2iedn) — extreme memory (up to ~4 TB). SAP HANA, large in-memory databases, Apache Spark with huge working sets.
- z family (z1d) — high-frequency CPUs plus high memory for relational databases and other workloads with high per-core licensing costs.
Storage Optimized — I, D, and H Families
- I family (I3, I3en, I4i, Im4gn, Is4gen) — high local NVMe IOPS. NoSQL databases (Cassandra, MongoDB), OLTP transaction stores.
- D family (D2, D3, D3en) — dense HDD local storage. MapReduce, distributed file systems, log processing.
- H family (H1) — balanced HDD + compute for big-data workloads.
Accelerated Computing — G, P, Inf, Trn, F Families
- G family (G4dn, G4ad, G5, G5g) — GPU for graphics, game streaming, cost-effective ML inference.
- P family (P3, P4, P5) — high-end NVIDIA GPUs for ML training and HPC.
- Inf (Inf1, Inf2) — AWS Inferentia chips for ML inference.
- Trn (Trn1, Trn1n) — AWS Trainium chips for ML training.
- F (F1) — FPGA for custom silicon acceleration.
T = Turbo / burstable, M = Main / balanced, C = Compute, R = RAM, X = eXtreme memory, I = IOPS, D = Dense HDD, H = HDD balanced, G = Graphics GPU, P = Parallel ML GPU, Inf = Inferentia, Trn = Trainium, F = FPGA. Memorize the first letter and most SAA-C03 elastic compute scaling questions reveal their answer in the scenario text. Reference: https://aws.amazon.com/ec2/instance-types/
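The first-letter mnemonic can be drilled as a tiny lookup table. This is an illustrative sketch (the dict and function names are hypothetical, not an AWS API):

```python
# Map instance-family first letters to workload profiles (mnemonic from this chapter).
# FAMILY_WORKLOAD and family_for are illustrative names, not an AWS API.
FAMILY_WORKLOAD = {
    "t": "burstable general purpose (low-traffic web, dev/test)",
    "m": "balanced general purpose (production web apps, backends)",
    "c": "compute optimized (batch, HPC, encoding, gaming servers)",
    "r": "memory optimized (in-memory caches, real-time analytics)",
    "x": "extreme memory (SAP HANA, huge in-memory databases)",
    "i": "storage optimized, high NVMe IOPS (NoSQL, OLTP)",
    "d": "dense HDD storage (MapReduce, log processing)",
    "g": "GPU graphics and cost-effective ML inference",
    "p": "high-end GPU for ML training and HPC",
}

def family_for(instance_type: str) -> str:
    """Return the workload profile for an instance type like 'm6g.large'."""
    return FAMILY_WORKLOAD.get(instance_type[0].lower(), "unknown family")
```

For example, `family_for("c7g.xlarge")` maps to the compute-optimized profile — exactly the reflex the exam scenarios reward.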
AWS Graviton — ARM-Based Instances
AWS Graviton is AWS's own ARM-based CPU family (Graviton, Graviton2, Graviton3, Graviton4). Graviton instance types carry a g in the generation suffix — for example, M6g, C7g, R7g, T4g. Graviton delivers up to 40% better price-performance than comparable x86 instances for a wide range of workloads, especially web servers, containers, Java/.NET services, open-source databases, and in-memory caches.
On SAA-C03, reach for Graviton whenever the scenario says:
- "Lowest cost at equivalent or better performance" for a compatible workload.
- "Best price-performance for containerized or serverless workloads" — AWS Fargate and AWS Lambda both offer Graviton execution.
- "Sustainable / low-carbon compute" — Graviton generally uses less energy per unit of work.
Not every workload ports cleanly — binary-only Windows apps, some legacy x86 libraries, or GPU-bound training jobs require x86 and/or accelerators — but elastic compute scaling answers that mention "lowest cost without sacrificing performance for a typical web tier or microservice" almost always include Graviton.
If the scenario says "lowest cost", "best price-performance", and the workload is a web server, container, cache, or open-source database, Graviton is usually the right answer — combine Graviton with a standard EC2 Auto Scaling Group for maximum elastic compute scaling value. Reference: https://aws.amazon.com/ec2/graviton/
EC2 Purchasing Options — The Cost Layer of Elastic Compute Scaling
Elastic compute scaling is not only about how many instances to run — it is also about how to buy them. SAA-C03 expects fluent use of all four purchasing options plus Dedicated Hosts for compliance.
On-Demand — The Default
Pay per second with a one-minute minimum and no commitment (some commercial OS and marketplace licensing still bills per hour). On-Demand is the baseline for unpredictable workloads and the fallback when Spot capacity disappears. Elastic compute scaling answers default to On-Demand when the question emphasizes flexibility and short-lived workloads.
Reserved Instances — 1 or 3 Year Commitment
Reserved Instances (RIs) give up to 72% off On-Demand when you commit for 1 or 3 years.
- Standard RI — highest discount, but locked to an instance family, OS, and Region (or a specific AZ for zonal reservations).
- Convertible RI — smaller discount but can be exchanged across families, OS, tenancy.
- Scheduled RI — retired (no longer available for purchase); not typical on SAA-C03.
RIs are a billing construct — they apply automatically to matching running instances. They do not launch instances for you. For elastic compute scaling answers, RIs cover the steady baseline layer of an Auto Scaling Group.
Savings Plans — Commit to Dollars per Hour
Savings Plans commit you to a dollar-per-hour spend for 1 or 3 years in exchange for up to 72% off. Two flavors:
- Compute Savings Plans — apply across EC2, AWS Fargate, and AWS Lambda regardless of instance family, Region, tenancy, or OS. Maximum flexibility; the discount tops out lower (up to 66%) than EC2 Instance Savings Plans.
- EC2 Instance Savings Plans — higher discount, but locked to a specific family and region.
For modern elastic compute scaling designs that mix EC2, Fargate, and Lambda, Compute Savings Plans are usually the SAA-C03 preferred answer because they adapt as the workload architecture evolves.
Spot Instances — Up to 90% Off
Spot Instances use AWS's spare capacity and offer up to 90% discount. The trade-off: AWS can reclaim the instance with a 2-minute interruption notice. Spot is perfect for fault-tolerant, stateless, or checkpointable workloads — data processing, CI builds, stateless web tiers behind a queue, batch jobs, rendering.
Dedicated Hosts and Dedicated Instances
Dedicated Hosts give you a physical server for BYOL (bring-your-own-license) or compliance. Dedicated Instances guarantee your VMs do not share physical hardware with other customers but do not expose the host. Both cost more and reduce elastic compute scaling flexibility; reach for them only when licensing or regulation demands it.
Build production elastic compute scaling cost architectures in four layers: (1) Savings Plans or Reserved Instances cover the steady baseline. (2) On-Demand covers normal peak. (3) Spot covers fault-tolerant bulk capacity via a mixed-instances policy. (4) Dedicated Hosts cover compliance islands. The SAA-C03 exam consistently rewards this layered answer in Domain 4 cost-optimization questions. Reference: https://aws.amazon.com/savingsplans/
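The four-layer split reduces to simple arithmetic over capacity levels. A sketch with hypothetical function and field names:

```python
def capacity_layers(baseline: int, normal_peak: int, max_burst: int) -> dict:
    """Split fleet capacity into purchasing layers:
    - baseline: steady minimum, covered by Savings Plans / RIs
    - normal_peak minus baseline: On-Demand
    - max_burst minus normal_peak: fault-tolerant bulk on Spot
    (Dedicated Hosts for compliance sit outside this arithmetic.)
    """
    assert baseline <= normal_peak <= max_burst
    return {
        "savings_plans_or_ris": baseline,
        "on_demand": normal_peak - baseline,
        "spot": max_burst - normal_peak,
    }
```

A fleet that idles at 10 instances, peaks at 16 on a normal day, and bursts to 40 would commit 10 instances to a Savings Plan, run 6 On-Demand at peak, and cover the remaining 24 with Spot.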
EC2 Auto Scaling Groups — The Core of Elastic Compute Scaling
An EC2 Auto Scaling Group (ASG) is the primary elastic compute scaling primitive on AWS. An ASG groups EC2 instances so they can be treated as a single fleet for scaling, health checking, and replacement.
Key ASG Parameters
- Minimum capacity — the floor. ASG will never drop below this count.
- Desired capacity — the target. ASG actively drives the running count toward this value.
- Maximum capacity — the ceiling. ASG will never exceed this count.
- Launch template / launch configuration — the blueprint for new instances.
- VPC subnets — the ASG picks across these subnets (usually in multiple AZs) to balance instances.
- Health check type — EC2 (default, based on EC2 status checks) or ELB (uses load balancer target health).
- Health check grace period — how long after launch before health checks count; critical for boot-heavy workloads.
- Termination policies — rules for which instance to terminate during scale-in (Default, OldestInstance, NewestInstance, OldestLaunchConfiguration/OldestLaunchTemplate, ClosestToNextInstanceHour, AllocationStrategy).
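The min/desired/max relationship reduces to a clamp: whatever capacity a scaling policy requests, the ASG pins the result between the floor and the ceiling. A minimal sketch (hypothetical function name):

```python
def effective_desired(requested: int, min_size: int, max_size: int) -> int:
    """Desired capacity requested by a scaling policy, clamped to the
    ASG's floor (minimum capacity) and ceiling (maximum capacity)."""
    return max(min_size, min(max_size, requested))
```

A step policy asking for 15 instances against min=2 / max=10 yields a desired capacity of 10, and a scale-in request below the floor is pinned to 2.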
AZ Balancing — Free High Availability from Elastic Compute Scaling
An ASG automatically balances instances across the subnets (AZs) you give it. If one AZ fails, the ASG launches replacements in the remaining AZs to restore desired capacity. Combining a multi-AZ ASG with a load balancer delivers high availability as a natural byproduct of elastic compute scaling — this is why Domain 2 resilience questions so often point at an ASG.
Health Checks and Self-Healing
If an instance fails an EC2 status check or an ELB health check, the ASG terminates it and launches a replacement. Self-healing is one of the headline benefits of elastic compute scaling; a correct SAA-C03 answer often says "Auto Scaling Group with ELB health checks" rather than "manual EC2 instance with CloudWatch alarm to replace on failure."
Default EC2 status checks only verify the hypervisor and instance reachability. If your app crashes but the OS is healthy, EC2 health checks still pass and the ASG leaves a broken instance in rotation. For any ASG fronted by ALB/NLB, switch to ELB health check type so elastic compute scaling actually replaces broken application-layer targets. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html
Launch Templates vs Launch Configurations
A launch template (or the older, deprecated launch configuration) tells the ASG how to launch a new instance — AMI, instance type, key pair, security groups, IAM instance profile, user data, block device mappings, network interfaces, metadata options, and tags.
Why Launch Templates Win
Launch templates are versioned, support every modern EC2 feature (Spot options, Capacity Reservations, T2/T3 Unlimited, placement groups, tag specifications, IMDSv2 metadata options), and can be referenced by multiple services (EC2 Fleet, Spot Fleet, ASG). Launch configurations are legacy; AWS explicitly recommends launch templates for all new elastic compute scaling work, and mixed-instances policies require them.
Multiple Versions and Default Version
You can maintain many versions of a launch template (v1, v2, v3…) and mark one as the default. Rolling an ASG forward is as simple as updating the ASG's launch template version — optionally combined with an instance refresh.
Instance Refresh
Instance refresh replaces instances in an ASG gradually to adopt a new launch template version. You set a MinHealthyPercentage (how much of the fleet must remain healthy during the refresh) and an InstanceWarmup (how long a new instance is considered initializing). Instance refresh is the SAA-C03-approved way to roll elastic compute scaling fleets forward without downtime.
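In boto3, an instance refresh is started with start_instance_refresh. The sketch below builds the request shape (the group and template names are hypothetical) without calling AWS:

```python
# Request shape for EC2 Auto Scaling's StartInstanceRefresh API.
# 'web-asg' and 'web-lt' are hypothetical names.
refresh_request = {
    "AutoScalingGroupName": "web-asg",
    "Strategy": "Rolling",
    "DesiredConfiguration": {
        "LaunchTemplate": {"LaunchTemplateName": "web-lt", "Version": "3"},
    },
    "Preferences": {
        "MinHealthyPercentage": 90,  # keep >= 90% of the fleet in service during the refresh
        "InstanceWarmup": 120,       # seconds a new instance is treated as initializing
    },
}

# To actually start the refresh on a real account:
# import boto3
# boto3.client("autoscaling").start_instance_refresh(**refresh_request)
```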
Scaling Policy Types — The Four (plus Predictive) Families
EC2 Auto Scaling supports multiple scaling policy types that differ in how they decide to adjust desired capacity. SAA-C03 expects you to map a scenario to the correct policy type.
Target Tracking Scaling — The Default Recommendation
Target tracking scaling is the modern default. You pick a metric (CPU utilization, ALB request count per target, network bytes in/out, or a custom CloudWatch metric) and a target value (e.g., 50% CPU). EC2 Auto Scaling creates CloudWatch alarms under the hood and adjusts desired capacity to hold the metric near target. It is self-tuning, easy to set up, and handles scale-out and scale-in symmetrically.
Use target tracking when:
- You can express the workload goal as a single metric target (CPU, RPS per instance, average queue depth via custom metric).
- You want a fire-and-forget policy.
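A target tracking policy holding average CPU at 50% looks like this in boto3. The ASG and policy names are hypothetical; the request is built but not sent:

```python
# Request shape for PutScalingPolicy with a target tracking configuration.
target_tracking_policy = {
    "AutoScalingGroupName": "web-asg",   # hypothetical ASG name
    "PolicyName": "hold-cpu-at-50",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,      # hold average CPU near 50%
        "DisableScaleIn": False,  # allow symmetric scale-in
    },
    "EstimatedInstanceWarmup": 120,  # seconds before a new instance's metrics count fully
}

# import boto3
# boto3.client("autoscaling").put_scaling_policy(**target_tracking_policy)
```

Swapping PredefinedMetricType to ALBRequestCountPerTarget (plus a ResourceLabel) tracks requests per instance instead of CPU.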
Step Scaling — Granular Response to Alarm Breaches
Step scaling reacts to CloudWatch alarm breaches with multiple steps: "if CPU 50–70%, add 1 instance; if 70–85%, add 3; if 85%+, add 5." It is more aggressive and surgical than target tracking when traffic spikes are non-linear. Step scaling does not use cooldowns the same way simple scaling does — it uses an instance warm-up period so the effects of the last scaling action count toward metric calculations.
Use step scaling when:
- Traffic spikes are uneven and you want tiered responses.
- You need fine control over how many instances are added per breach level.
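The tiered response above ("50–70% add 1; 70–85% add 3; 85%+ add 5") is just a threshold ladder. A pure-Python sketch of the decision logic — in reality the evaluation happens inside CloudWatch alarms and EC2 Auto Scaling step adjustments:

```python
def step_scale_out(cpu_percent: float) -> int:
    """Instances to add for a given CPU reading, mirroring the tiers in the text."""
    if cpu_percent >= 85:
        return 5
    if cpu_percent >= 70:
        return 3
    if cpu_percent >= 50:
        return 1
    return 0  # below the alarm threshold: no action
```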
Simple Scaling — Legacy Single-Step
Simple scaling triggers a single action (add or remove N instances or a percentage) and then waits for a cooldown period (default 300 seconds) before reacting to the next alarm. Simple scaling is older and rarely the SAA-C03-preferred answer — step scaling or target tracking almost always wins. The main reason simple scaling still appears is to illustrate cooldown semantics.
Scheduled Scaling — Time-Based
Scheduled scaling changes min/max/desired at a specific date and time (or on a recurring cron schedule). Use it for predictable daily, weekly, or seasonal patterns — "scale up to 20 every weekday at 08:30 UTC, scale down to 5 at 18:00 UTC." Scheduled scaling is complementary to target tracking: the schedule sets the minimum floor, and target tracking handles deviations from expectation.
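The weekday example translates to two scheduled actions with cron recurrences. Names are hypothetical; the requests are built but not sent:

```python
# Request shapes for PutScheduledUpdateGroupAction: scale up weekday mornings,
# scale down weekday evenings. Recurrence is cron, evaluated in UTC by default.
scale_up = {
    "AutoScalingGroupName": "web-asg",   # hypothetical ASG name
    "ScheduledActionName": "weekday-morning-up",
    "Recurrence": "30 8 * * 1-5",        # 08:30 UTC, Monday through Friday
    "MinSize": 20,
    "DesiredCapacity": 20,
}
scale_down = {
    "AutoScalingGroupName": "web-asg",
    "ScheduledActionName": "weekday-evening-down",
    "Recurrence": "0 18 * * 1-5",        # 18:00 UTC, Monday through Friday
    "MinSize": 5,
    "DesiredCapacity": 5,
}

# import boto3
# client = boto3.client("autoscaling")
# client.put_scheduled_update_group_action(**scale_up)
# client.put_scheduled_update_group_action(**scale_down)
```

Note that only MinSize is raised here — target tracking remains free to scale above the scheduled floor if demand exceeds expectation.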
Predictive Scaling — ML-Driven
Predictive scaling uses a machine-learning model trained on up to 14 days of historical load (a minimum of 24 hours is required) to forecast load for the next 48 hours and pre-provision capacity before the forecasted demand arrives. It is the only policy type that proactively scales out; all others react to metrics crossing thresholds. Predictive scaling pairs beautifully with target tracking — predictive sets the base capacity curve; target tracking handles intraday deviations.
Use predictive scaling when:
- The workload has a clear daily/weekly pattern (batch jobs that run every morning, consumer apps with nightly peaks, office-hour enterprise apps).
- Instance boot time is long enough that reactive scaling cannot catch a spike.
- You want to eliminate cold-start latency for predictable surges.
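A predictive scaling policy pairs a metric-pair specification with a target value. The sketch below shows the request shape (hypothetical names; built but not sent):

```python
# Request shape for PutScalingPolicy with PolicyType 'PredictiveScaling'.
predictive_policy = {
    "AutoScalingGroupName": "web-asg",   # hypothetical ASG name
    "PolicyName": "forecast-cpu",
    "PolicyType": "PredictiveScaling",
    "PredictiveScalingConfiguration": {
        "MetricSpecifications": [
            {
                "TargetValue": 50.0,
                "PredefinedMetricPairSpecification": {
                    "PredefinedMetricType": "ASGCPUUtilization",
                },
            }
        ],
        "Mode": "ForecastAndScale",   # start with 'ForecastOnly' to validate the forecast
        "SchedulingBufferTime": 300,  # launch 5 minutes ahead of the forecasted need
    },
}

# import boto3
# boto3.client("autoscaling").put_scaling_policy(**predictive_policy)
```

Running in ForecastOnly mode first is the low-risk rollout path: you can inspect the forecast against reality before letting it change capacity.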
Target tracking = hold a metric at a value, self-tuning. Step scaling = tiered add/remove on alarm breach levels. Simple scaling = one action plus cooldown, legacy. Scheduled scaling = date/time adjustments for known patterns. Predictive scaling = ML forecast pre-provisions for the next 48 hours. For elastic compute scaling on SAA-C03, target tracking is the default answer; predictive is the answer when the scenario says "forecast" or "known pattern + long boot time." Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-scaling-policy-overview.html
Mixed-Instances Policy — Multiple Instance Types and Spot Fallback
A mixed-instances policy is where elastic compute scaling meets deep cost optimization. A single ASG can span multiple instance types and a blend of On-Demand and Spot capacity.
Anatomy of a Mixed-Instances Policy
- LaunchTemplate — the base template.
- Overrides — a list of alternative instance types (and optionally weights and subnets): for example m5.large, m5a.large, m5n.large, m6i.large, m6a.large, m6g.large.
- InstancesDistribution — controls the On-Demand vs Spot split:
  - OnDemandBaseCapacity — absolute number of On-Demand instances the ASG must keep (baseline reliability).
  - OnDemandPercentageAboveBaseCapacity — percentage of additional capacity that must be On-Demand (e.g., 20% On-Demand + 80% Spot above the base).
  - SpotAllocationStrategy — capacity-optimized (the SAA-C03 favored answer for minimum interruption), lowest-price, capacity-optimized-prioritized, or price-capacity-optimized.
  - SpotInstancePools — how many Spot pools to use with lowest-price.
  - SpotMaxPrice — usually left unset to pay the current Spot market price.
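Put together, a Spot-heavy mixed-instances ASG looks like this in boto3. All names are hypothetical; the request is built but not sent:

```python
# Request shape for CreateAutoScalingGroup with a mixed-instances policy:
# 2 On-Demand base, 20% On-Demand above base, the rest Spot, capacity-optimized.
mixed_asg = {
    "AutoScalingGroupName": "web-asg",   # hypothetical names throughout
    "MinSize": 2,
    "MaxSize": 30,
    "DesiredCapacity": 6,
    "VPCZoneIdentifier": "subnet-a,subnet-b,subnet-c",  # spread across three AZs
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-lt",
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": t}
                for t in ["m5.large", "m5a.large", "m5n.large",
                          "m6i.large", "m6a.large", "m6g.large"]
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,
            "OnDemandPercentageAboveBaseCapacity": 20,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
}

# import boto3
# boto3.client("autoscaling").create_auto_scaling_group(**mixed_asg)
```

Six instance types across three AZs gives the allocator up to 18 Spot pools to choose from, which is exactly the interruption-resistance lever the next subsection explains.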
Why Multiple Instance Types Matter for Spot
Spot availability is per instance type per AZ. If your ASG can only use m5.large in one AZ, interruptions cascade. If you allow m5.large, m5a.large, m5n.large, m6i.large, m6a.large, m6g.large across three AZs, you effectively have dozens of Spot pools to choose from, and the capacity-optimized strategy picks the pool least likely to be interrupted. This is the single biggest lever for stable Spot-heavy elastic compute scaling.
Attribute-Based Instance Type Selection
Modern ASGs support attribute-based instance type selection: instead of listing explicit instance types, you specify requirements (vCPU range, memory range, network performance, architecture, accelerator type) and AWS picks matching families. Attribute-based selection keeps the ASG future-proof as AWS releases new instance generations.
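With attribute-based selection, an explicit type list is replaced by an InstanceRequirements override. A sketch of one override entry (the specific values are illustrative):

```python
# An Overrides entry that states requirements instead of explicit instance types.
# AWS resolves this to every current instance type matching the attributes.
attribute_override = {
    "InstanceRequirements": {
        "VCpuCount": {"Min": 2, "Max": 4},
        "MemoryMiB": {"Min": 4096},
        # Allow x86 and Graviton so new generations qualify automatically:
        "CpuManufacturers": ["intel", "amd", "amazon-web-services"],
    }
}
```

Because the requirements are declarative, a new generation such as a future M8g would join the eligible pool with no template change.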
A classic SAA-C03 trap pairs a mixed-instances policy with a launch configuration. Launch configurations do not support mixed-instances policies — only launch templates do. If an answer choice says "launch configuration + Spot fallback + multiple instance types" it is wrong. The correct elastic compute scaling answer always uses launch templates. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-mixed-instances-groups.html
Cooldowns, Warmups, and Lifecycle Hooks
Timing parameters are the under-appreciated half of elastic compute scaling.
Default Cooldown
The default cooldown (applies to simple scaling) is 300 seconds. After any simple scaling action, ASG ignores further alarm breaches until the cooldown elapses. This prevents rapid flapping — scaling out, then scaling in, then scaling out.
Instance Warm-Up (Step and Target Tracking)
Instance warm-up is a different concept: it tells the ASG how long a newly launched instance takes to become useful (boot OS, install app, hydrate cache). During warm-up, the ASG counts the new instance toward capacity but damps the metric contribution so target tracking and step scaling don't over-react.
Lifecycle Hooks
Lifecycle hooks pause instances in a waiting state during launch (autoscaling:EC2_INSTANCE_LAUNCHING) or terminate (autoscaling:EC2_INSTANCE_TERMINATING) so you can run custom logic — configuration management, log drain, graceful connection closure — before the instance goes live or disappears. Lifecycle hooks are how you drain connections before scale-in without losing user sessions.
Scale-In Protection
Scale-in protection marks specific instances (or the whole ASG) as ineligible for scale-in. Use it for stateful leaders in an otherwise stateless fleet, or during a deployment window.
If your app needs 120 seconds to boot, hydrate a cache, and register with service discovery, set InstanceWarmup to at least 120 seconds. A too-short warm-up makes elastic compute scaling over-provision because new instances' CPU reads low while they are still booting. A too-long warm-up delays scale-in of genuinely idle instances.
Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-default-cooldown.html
Warm Pools — Eliminating Boot Time
A warm pool keeps a reserve of instances in a Stopped, Running, or Hibernated state that have already been booted, baked, and prepped. When the ASG needs to scale out, it pulls from the warm pool — skipping the slow boot / user-data / config-management / cache-hydration cycle — and moves the instance to InService in seconds instead of minutes.
When Warm Pools Are the Right Answer
- The instance takes several minutes to boot and become useful (large AMIs, heavy Java warm-up, big cache hydration).
- Scale-out spikes are steep and reactive policies can't catch up.
- You want predictive scaling's benefits for a less-predictable workload.
Warm Pool States
- Stopped — lowest cost (no compute billing), slowest reactivation.
- Running — fastest reactivation, full compute cost.
- Hibernated — in-memory state preserved on disk; medium cost, fast reactivation, requires hibernation-capable instances.
Warm pools integrate with lifecycle hooks so you can complete prep work before the instance joins the warm pool rather than on each scale-out — a subtle but powerful elastic compute scaling pattern.
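Configuring a warm pool is a single PutWarmPool call. The sketch builds the request shape (hypothetical ASG name; not sent):

```python
# Request shape for PutWarmPool: keep at least 2 pre-initialized Stopped
# instances, never preparing more than 10 beyond what the ASG needs.
warm_pool_request = {
    "AutoScalingGroupName": "web-asg",   # hypothetical ASG name
    "PoolState": "Stopped",              # or "Running" / "Hibernated"
    "MinSize": 2,
    "MaxGroupPreparedCapacity": 10,
    "InstanceReusePolicy": {"ReuseOnScaleIn": True},  # return scaled-in instances to the pool
}

# import boto3
# boto3.client("autoscaling").put_warm_pool(**warm_pool_request)
```

ReuseOnScaleIn is the subtle cost win: instead of terminating on scale-in, instances return to the pool already prepped for the next spike.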
Warm pools reduce reactive scale-out latency regardless of pattern. Predictive scaling pre-provisions based on forecast. On SAA-C03, "boot takes 5 minutes, spikes unpredictably" points at a warm pool. "Boot takes 2 minutes, daily 9am peak" points at predictive scaling (often combined with target tracking). Both techniques can coexist. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-warm-pools.html
Spot Interruption Handling — Graceful Scale-In Under Reclaim
Spot Instances save up to 90%, but AWS can reclaim them. Correct Spot handling is a high-frequency SAA-C03 topic.
The 2-Minute Interruption Notice
When AWS decides to reclaim a Spot Instance, it publishes a notice through two channels:
- Instance metadata at http://169.254.169.254/latest/meta-data/spot/instance-action with a termination timestamp.
- EventBridge via the EC2 Spot Instance Interruption Warning event.
Your application has up to 2 minutes to:
- Stop accepting new work (deregister from load balancer target groups).
- Complete or checkpoint in-flight work.
- Flush state to durable storage (S3, DynamoDB, EFS, queue).
- Terminate cleanly.
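A worker can poll the instance-action metadata path and compute how long it has left. The parsing step is pure Python and shown below; the HTTP fetch (IMDSv2 token then GET) is sketched only in comments because it works only on an EC2 instance:

```python
import json
from datetime import datetime, timezone

def seconds_until_interruption(notice_body: str, now: datetime) -> float:
    """Parse a Spot instance-action notice like
    '{"action": "terminate", "time": "2024-05-04T17:11:44Z"}'
    and return the seconds remaining until the reclaim."""
    notice = json.loads(notice_body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return (when - now).total_seconds()

# On the instance, the body comes from IMDS (the path returns 404 until a
# notice exists):
#   TOKEN=$(curl -X PUT http://169.254.169.254/latest/api/token \
#           -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
#   curl -H "X-aws-ec2-metadata-token: $TOKEN" \
#        http://169.254.169.254/latest/meta-data/spot/instance-action
```

The returned budget (up to 120 seconds) is what the drain/checkpoint/flush sequence above must fit inside.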
Capacity Rebalancing
ASGs with CapacityRebalance enabled react to the earlier EC2 Instance Rebalance Recommendation signal — AWS's hint that a Spot Instance is at elevated risk of interruption (but has not yet received the 2-minute notice). ASG launches a replacement ahead of time so the fleet does not dip below capacity when the interruption finally arrives.
Combining Spot with SQS for Fault Tolerance
The canonical SAA-C03 Spot-friendly pattern: SQS queue + ASG of Spot workers with mixed-instances policy + capacity-optimized allocation strategy + lifecycle hooks that drain the message and mark it visible again on interruption. This pattern tolerates interruptions without losing work and delivers maximum elastic compute scaling cost savings.
Spot Instances require the workload to tolerate interruption. Stateful databases, long-running sessions with user affinity, and single-instance deployments are wrong for Spot. SAA-C03 tests this constantly — any question that says "Spot" plus "stateful" plus "no work loss allowed" has a wrong answer somewhere. Correct Spot use cases are stateless, checkpointable, queue-backed, or replicated. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html
Placement Groups — Topology Control for Elastic Compute Scaling
Placement groups control where inside an AZ your instances land relative to each other, which matters for latency-sensitive or availability-sensitive elastic compute scaling.
Cluster Placement Group
All instances packed on the same rack / network spine for lowest inter-instance latency and highest throughput. Ideal for HPC, tightly coupled distributed systems, low-latency market data.
Spread Placement Group
Each instance on distinct underlying hardware — up to seven per AZ. Ideal for small critical fleets (stateful leaders, license servers) where hardware failure of one instance must not take down another.
Partition Placement Group
Instances spread across logical partitions (up to seven per AZ) that map to different hardware racks. Used by distributed databases (Cassandra, HDFS, Kafka) so that the loss of one rack affects only one partition.
Integrating Elastic Compute Scaling with Other Services
Elastic compute scaling rarely operates in isolation.
With Elastic Load Balancing
Attach a target group to the ASG; the load balancer distributes traffic, the ASG replaces unhealthy targets. Use ELB health checks on the ASG for application-layer liveness.
With SQS — Decoupled Worker Scaling
For queue-driven workloads, scale on queue depth per instance using a CloudWatch target tracking policy on a custom metric (ApproximateNumberOfMessagesVisible / GroupInServiceInstances). This is the elastic compute scaling answer to "process a million messages per hour with Spot workers."
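The backlog-per-instance math is straightforward: divide visible messages by a per-worker target and clamp to the ASG bounds. A sketch with hypothetical names — in production the ratio is published as a custom CloudWatch metric that target tracking holds at the target value:

```python
import math

def desired_workers(visible_messages: int, target_backlog_per_worker: int,
                    min_size: int, max_size: int) -> int:
    """Workers needed so each carries roughly target_backlog_per_worker
    messages, clamped to the ASG's min/max capacity."""
    needed = math.ceil(visible_messages / target_backlog_per_worker)
    return max(min_size, min(max_size, needed))
```

With 1,000 visible messages and a target backlog of 100 messages per worker, the fleet settles at 10 workers; an empty queue collapses to the minimum.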
With ECS Service Auto Scaling
Amazon ECS uses Application Auto Scaling (a separate service) to scale tasks — target tracking, step, and scheduled policies on metrics like ECSServiceAverageCPUUtilization, ECSServiceAverageMemoryUtilization, or ALB RequestCountPerTarget. ECS Cluster Auto Scaling then adjusts the underlying EC2 ASG (for EC2 launch type) using Capacity Providers. With AWS Fargate, there is no underlying fleet to manage — elastic compute scaling stops at the task level.
With EKS — Cluster Autoscaler and Karpenter
Amazon EKS uses either the Kubernetes Cluster Autoscaler (which scales the EC2 ASG behind node groups) or AWS Karpenter (which launches right-sized EC2 capacity directly, faster and more efficiently than the Cluster Autoscaler). Horizontal Pod Autoscaler scales pods; Karpenter scales nodes. Together they deliver EKS-native elastic compute scaling.
With Lambda — Concurrency as Scaling
AWS Lambda scales by spawning additional concurrent execution environments automatically. You control elastic compute scaling through reserved concurrency (caps function concurrency to protect downstreams) and provisioned concurrency (pre-warms environments to eliminate cold starts). Lambda can scale to thousands of concurrent invocations in seconds, but per-account regional concurrency limits (default 1000) still apply.
With AWS Batch — Job-Driven Compute
AWS Batch provisions EC2 or Fargate compute on demand to drain a queue of jobs. It supports On-Demand, Spot, and managed compute environments that pick cost-optimal instance types automatically. For long-running, embarrassingly parallel workloads, AWS Batch is the elastic compute scaling answer.
Key Numbers to Memorize for Elastic Compute Scaling
- ASG default cooldown: 300 seconds.
- Spot interruption notice: 2 minutes (120 seconds).
- Predictive scaling training window: up to 14 days of history (minimum 24 hours).
- Predictive scaling forecast horizon: 48 hours.
- Lambda default account concurrency: 1,000 per region (soft limit).
- Lambda max timeout: 15 minutes.
- RI discount: up to 72% vs On-Demand.
- Savings Plans discount: up to 72% vs On-Demand.
- Spot discount: up to 90% vs On-Demand.
- Placement group spread limit: 7 instances per AZ.
- Launch template versions: 10,000 per template (soft limit).
- ASG maximum instances: account-level soft limit (commonly 2,500 per ASG; raise via quota request).
Common Exam Traps in Elastic Compute Scaling
SAA-C03 recycles elastic compute scaling traps in predictable ways. Know them cold.
Trap 1 — Horizontal vs Vertical Scaling
If a scenario says "application uses a single large database with high CPU, add more capacity quickly", vertical scaling (bigger instance type) may be appropriate. If the scenario says "stateless web tier under traffic spikes", horizontal scaling with an ASG is the elastic compute scaling answer. Watch for "stateless" and "traffic spikes" as horizontal-scaling signals.
Trap 2 — Target Tracking vs Predictive Scaling
Target tracking is reactive; predictive is proactive. If the question says "pre-provision before the daily morning peak" → predictive. If it says "hold CPU near 60%" → target tracking. Combine both if the scenario allows.
Trap 3 — Cooldown Misuse
Cooldown applies to simple scaling; step scaling and target tracking use instance warm-up instead. A scenario claiming "increase cooldown to make target tracking smoother" is wrong — the right lever for target tracking is InstanceWarmup.
Trap 4 — Launch Configuration vs Launch Template
Launch configurations cannot use mixed-instances policies, cannot specify multiple instance types, and do not support newer features. If the answer requires mixed-instances policy, Spot blend, or attribute-based selection, launch template is the only correct building block.
Trap 5 — Spot for Stateful Workloads
Spot plus "stateful database with no tolerance for interruption" is a contradiction. The correct Spot patterns are stateless, queue-backed, checkpointable, or replicated. Expect distractors that try to pair Spot with the wrong workload profile.
Trap 6 — Ignoring Multi-AZ in ASG Answers
An ASG in a single AZ loses its resilience benefit. Correct elastic compute scaling answers place an ASG across multiple AZs unless the scenario explicitly constrains to one (usually a licensing or latency edge case).
Trap 7 — Predictive Scaling Without History
Predictive scaling requires at least 24 hours of metric history before it can forecast, and the forecast improves with up to 14 days of data. A brand-new workload cannot use predictive scaling until it accumulates that history. Early-life workloads should use target tracking or scheduled scaling first.
The single most common SAA-C03 elastic compute scaling distractor combines "mixed-instances policy" with "launch configuration". The correct answer must always use a launch template. A secondary mega-trap is "Spot Instances for a database that must never lose data" — Spot requires interruption tolerance, so a stateless worker plus a durable queue is the right pattern, not a raw Spot database. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html
Elastic Compute Scaling vs Serverless — When to Pick Each
Elastic compute scaling via EC2 Auto Scaling is not always the right choice. The SAA-C03 exam wants you to recognize when AWS Lambda or AWS Fargate beats an ASG.
- Pick EC2 Auto Scaling when the workload needs specific instance types (GPU, bare-metal), long-lived processes (> 15 minutes), persistent in-memory state between requests, daemon-style workloads per host, Reserved Instances or Savings Plans discount stacking, or OS-level control.
- Pick AWS Fargate when you want containers without host management, variable workloads where hosts would sit idle, per-task billing granularity, or integration with ECS/EKS orchestration without managing the node fleet.
- Pick AWS Lambda when the workload is event-driven, short (< 15 min), bursty, benefits from zero-idle-cost billing, and does not need persistent connections.
- Pick AWS Batch when you have queued, long-running, embarrassingly parallel jobs that can tolerate scheduling latency.
Elastic compute scaling at its best combines these primitives — a React app on CloudFront + API Gateway + Lambda (elastic at the function level), a backend on ECS Fargate (elastic at the task level), and a batch pipeline on AWS Batch with Spot (elastic at the job level).
Practice Question Patterns — Task 3.2 Scenarios
Drill these elastic compute scaling patterns until the mapping is reflex:
- "Web tier with unpredictable traffic, must survive AZ failure, minimize cost" → multi-AZ ASG with target tracking on CPU or RequestCountPerTarget, mixed-instances policy with Spot + On-Demand baseline.
- "Overnight batch job with 3-hour runtime" → AWS Batch with Spot, not Lambda.
- "Daily 9am traffic spike takes 5 minutes to absorb" → predictive scaling + target tracking + warm pool.
- "Lowest cost for containerized stateless worker" → ECS on Fargate Spot, or ECS on EC2 with Spot via mixed-instances policy.
- "Need to replace unhealthy web instances automatically" → ASG with ELB health check type.
- "Need lowest-latency inter-node communication for HPC" → cluster placement group + C or Hpc family instances.
- "Fault-tolerant fleet of distributed database nodes" → partition placement group.
- "Flexible discount across EC2, Fargate, Lambda" → Compute Savings Plans.
- "Maximum discount, locked to specific family for a year" → EC2 Instance Savings Plans or Standard Reserved Instance.
- "Instance must drain connections before scale-in" → lifecycle hook on EC2_INSTANCE_TERMINATING.
- "Boot takes 10 minutes, spikes unpredictably" → warm pool in Stopped state.
- "Scale-in must protect the leader node" → scale-in protection on the leader instance.
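The connection-draining pattern in that list can be sketched as a lifecycle hook definition, written here as a plain dict whose shape follows the EC2 Auto Scaling PutLifecycleHook API (the group and hook names are hypothetical):

```python
# Pauses a terminating instance in a wait state so an agent or automation
# can drain connections, then signals completion (or the timeout fires).
lifecycle_hook = {
    "AutoScalingGroupName": "web-asg",          # hypothetical ASG name
    "LifecycleHookName": "drain-connections",
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    "HeartbeatTimeout": 300,      # seconds allowed for draining
    "DefaultResult": "CONTINUE",  # proceed with termination if no signal arrives
}
```

The draining agent calls CompleteLifecycleAction when it finishes; until then the instance sits in the Terminating:Wait state.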
Elastic Compute Scaling Boundary with Other SAA-C03 Topics
Elastic compute scaling (task 3.2) brushes against several neighboring topics. Keep the boundaries clean:
- vs Serverless and Containers (2.1) — task 2.1 is "what abstraction?"; task 3.2 is "how does it scale?". Lambda and Fargate appear in both topics but from different angles.
- vs High Availability and Multi-AZ (2.2) — task 2.2 is "can it survive failure?"; task 3.2 is "does it grow with demand?". A good ASG answer satisfies both.
- vs Cost-Optimized Compute (4.2) — task 4.2 focuses on purchasing decisions (Spot, Savings Plans, RI); task 3.2 focuses on scaling mechanics. Cost questions lean on 4.2 framing; performance or elasticity questions lean on 3.2 framing.
- vs Messaging and Decoupling (2.1) — SQS-driven scaling is a classic elastic compute scaling pattern, but the queue itself belongs to the messaging topic; the worker fleet is elastic compute scaling.
- vs High-Performing Database Solutions (3.3) — RDS/Aurora have their own scaling primitives (Aurora Serverless v2, read replicas). Elastic compute scaling lives on the compute tier; database scaling is its own topic.
FAQ — Elastic Compute Scaling Top Questions
Q1. What is the difference between target tracking and step scaling for elastic compute scaling?
Target tracking scaling holds a metric at a chosen target value — you pick "CPU = 50%" and the ASG self-tunes by adding or removing capacity. Step scaling reacts to CloudWatch alarms with tiered responses — "add 1 if CPU 50–70%, add 3 if 70–85%, add 5 if >85%." Target tracking is the SAA-C03-preferred default because it is self-tuning and symmetric; step scaling wins when you need surgical control over the magnitude of response at different breach levels. Both are better than simple scaling, which is legacy.
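The tiered response in that answer can be modeled as a toy function. This is illustrative decision logic only, not the AWS API; real step scaling is configured as StepAdjustments on a policy attached to a CloudWatch alarm:

```python
def step_scaling_adjustment(cpu_percent: float) -> int:
    """Toy model of the tiers above: add 1 instance at 50-70% CPU,
    3 at 70-85%, 5 above 85%, and nothing below the alarm threshold."""
    if cpu_percent > 85:
        return 5
    if cpu_percent > 70:
        return 3
    if cpu_percent >= 50:
        return 1
    return 0

print(step_scaling_adjustment(60))  # 1
print(step_scaling_adjustment(80))  # 3
print(step_scaling_adjustment(95))  # 5
```

Target tracking needs none of this table: you state the goal ("CPU = 50%") and the service computes the adjustments itself, which is why it is the preferred default.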
Q2. When should I use predictive scaling instead of target tracking for elastic compute scaling?
Predictive scaling is the right elastic compute scaling choice when the workload has a known, repeating pattern (daily morning peak, weekly batch run, seasonal business hours) and instance boot time is long enough that reactive target tracking cannot catch the spike in time. Predictive scaling uses up to 14 days of historical data to forecast the next 48 hours and pre-provisions capacity before the spike. The best practice is to combine predictive scaling (sets the curve) with target tracking (handles intraday deviations) so elastic compute scaling is both proactive and reactive.
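A predictive scaling policy implementing this might look like the following sketch, a plain dict mirroring the EC2 Auto Scaling PutScalingPolicy API (names and values are illustrative assumptions):

```python
predictive_policy = {
    "AutoScalingGroupName": "web-asg",        # hypothetical ASG name
    "PolicyName": "morning-peak-forecast",
    "PolicyType": "PredictiveScaling",
    "PredictiveScalingConfiguration": {
        "MetricSpecifications": [{
            "TargetValue": 60.0,
            "PredefinedMetricPairSpecification": {
                "PredefinedMetricType": "ASGCPUUtilization",
            },
        }],
        "Mode": "ForecastAndScale",   # use ForecastOnly first to validate forecasts
        "SchedulingBufferTime": 600,  # launch 10 minutes ahead of the forecast
    },
}
```

Pair this with a target tracking policy on the same group so intraday deviations from the forecast are still handled reactively.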
Q3. How does a mixed-instances policy with Spot fallback work?
A mixed-instances policy lets a single Auto Scaling Group span multiple instance types and blend On-Demand with Spot capacity. You define OnDemandBaseCapacity (baseline reliability), OnDemandPercentageAboveBaseCapacity (how much of additional capacity must remain On-Demand), and a Spot allocation strategy — capacity-optimized is the preferred choice for minimizing interruptions. The ASG picks the least-interrupted Spot pool across your listed instance types and AZs, delivering up to 90% cost savings while preserving fault tolerance. Mixed-instances policies require launch templates — launch configurations cannot be used.
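Put together, those knobs appear in the group definition roughly like this sketch mirroring the CreateAutoScalingGroup API (the template name, subnet IDs, and instance-type choices are hypothetical):

```python
mixed_instances_asg = {
    "AutoScalingGroupName": "workers-asg",                    # hypothetical
    "MinSize": 2,
    "MaxSize": 20,
    "DesiredCapacity": 4,
    "VPCZoneIdentifier": "subnet-aaa,subnet-bbb,subnet-ccc",  # span three AZs
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "worker-template",  # must be a launch template
                "Version": "$Latest",
            },
            # Several similar types widen the Spot pools the ASG can draw from.
            "Overrides": [
                {"InstanceType": "m6i.large"},
                {"InstanceType": "m5.large"},
                {"InstanceType": "m6a.large"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,                  # always-On-Demand floor
            "OnDemandPercentageAboveBaseCapacity": 25,  # 25% OD / 75% Spot above it
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
}
```

With these numbers, the first 2 instances are always On-Demand; of every 4 instances above that floor, 1 is On-Demand and 3 are Spot.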
Q4. What is a warm pool and when should I use it?
A warm pool is a reserve of pre-initialized instances (in Stopped, Running, or Hibernated state) that the ASG can pull from during scale-out, skipping boot, user-data, config-management, and cache-hydration steps. Warm pools are the right elastic compute scaling answer when instances take minutes to become useful — large AMIs, heavy Java warm-up, large cache loads. Stopped warm pools cost almost nothing and reactivate in under a minute; Running warm pools reactivate in seconds but cost full compute; Hibernated warm pools preserve in-memory state on disk for medium cost and fast reactivation.
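A Stopped warm pool for such a slow-booting group might be declared like this sketch, whose shape follows the EC2 Auto Scaling PutWarmPool API (the group name and sizes are illustrative):

```python
warm_pool = {
    "AutoScalingGroupName": "web-asg",  # hypothetical ASG name
    "PoolState": "Stopped",             # near-zero cost, fast reactivation
    "MinSize": 2,                       # keep at least 2 pre-initialized instances
    "MaxGroupPreparedCapacity": 10,     # cap on running + warm instances combined
}
```

Switching `PoolState` to "Running" or "Hibernated" trades cost for reactivation speed as described above.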
Q5. How should I handle Spot Instance interruptions in an elastic compute scaling architecture?
Spot interruption handling requires three layers: (1) watch for the 2-minute interruption notice on instance metadata (http://169.254.169.254/latest/meta-data/spot/instance-action) and/or the EventBridge EC2 Spot Instance Interruption Warning event; (2) during those 2 minutes, stop accepting new work, finish or checkpoint in-flight work, and flush state to durable storage (S3, DynamoDB, SQS); (3) enable CapacityRebalance on the ASG so it launches replacements when AWS sends the earlier Rebalance Recommendation, reducing the window where the fleet is under-provisioned. Combine Spot with SQS-driven worker patterns and capacity-optimized allocation across multiple instance types and AZs for maximum resilience.
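Layer (1) can be sketched as a small poller. The IMDS fetch is injected here so the parsing logic runs off-instance; on a real instance you would issue an IMDSv2 token request and GET the URL below:

```python
import json

# The instance-metadata endpoint that carries the 2-minute notice.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def check_interruption_notice(fetch=lambda: None):
    """Return (action, time) if an interruption notice is pending, else None.
    `fetch` should return the raw JSON document, or None on a 404
    (IMDS returns 404 when no interruption is scheduled)."""
    raw = fetch()
    if raw is None:
        return None
    notice = json.loads(raw)
    return notice.get("action"), notice.get("time")

# Simulated notice in the format IMDS uses:
sample = '{"action": "terminate", "time": "2030-01-01T09:00:00Z"}'
print(check_interruption_notice(lambda: sample))  # ('terminate', '2030-01-01T09:00:00Z')
```

When the poller returns a notice, the worker stops pulling new messages, checkpoints in-flight work to durable storage, and lets the SQS visibility timeout return unfinished messages to the queue.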
Q6. What is the difference between launch templates and launch configurations?
Launch configurations are the older, immutable blueprint for EC2 Auto Scaling. Launch templates are the modern, versioned replacement that supports every current EC2 feature (mixed-instances policies, Spot options, IMDSv2, Capacity Reservations, attribute-based selection, tag specifications). AWS officially recommends launch templates for all new elastic compute scaling work. Mixed-instances policies, predictive scaling, and attribute-based instance type selection all require launch templates — launch configurations do not support them.
Q7. How does elastic compute scaling work for containers on ECS and EKS?
For Amazon ECS, Application Auto Scaling scales the number of tasks using target tracking, step, or scheduled policies on metrics like ECSServiceAverageCPUUtilization or ALB RequestCountPerTarget. For the EC2 launch type, ECS Capacity Providers adjust the underlying EC2 Auto Scaling Group. For Fargate, no host-level scaling exists — elastic compute scaling stops at the task level. For Amazon EKS, Horizontal Pod Autoscaler scales pods, and either the Kubernetes Cluster Autoscaler (backed by EC2 ASG) or AWS Karpenter (direct EC2 provisioning) scales the node layer. Karpenter is faster and more cost-efficient than the Cluster Autoscaler for modern elastic compute scaling on EKS.
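For the ECS case, the task-level policy goes through Application Auto Scaling in two steps. A sketch of the request shapes follows (the cluster and service names are hypothetical):

```python
# Step 1: register the service's desired count as a scalable target.
scalable_target = {
    "ServiceNamespace": "ecs",
    "ResourceId": "service/demo-cluster/demo-service",  # hypothetical
    "ScalableDimension": "ecs:service:DesiredCount",
    "MinCapacity": 2,
    "MaxCapacity": 50,
}

# Step 2: attach a target tracking policy to that target.
task_scaling_policy = {
    "ServiceNamespace": "ecs",
    "ResourceId": "service/demo-cluster/demo-service",
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyName": "cpu-target-tracking",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "TargetValue": 60.0,
    },
}
```

On Fargate this is the whole story; on the EC2 launch type, a capacity provider then grows the underlying ASG to fit the scheduled tasks.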
Q8. When is horizontal scaling better than vertical scaling for elastic compute scaling?
Horizontal scaling (more instances of the same size) is better for stateless workloads, linear-scaling architectures, cost-efficiency under Spot, and resilience against single-instance failure. Vertical scaling (bigger instance type) is better for stateful workloads that cannot be partitioned (large single-node databases), memory-bound workloads where network partitioning is expensive, or quick short-term capacity boosts during migration. On SAA-C03, elastic compute scaling answers almost always prefer horizontal scaling in an ASG — vertical scaling is usually a distractor unless the scenario explicitly describes a stateful workload that cannot be sharded.
Summary — Elastic Compute Scaling at a Glance
- Elastic compute scaling means matching compute capacity to demand automatically via EC2 Auto Scaling (or ECS, EKS, Lambda, Batch equivalents).
- Start by picking the right EC2 instance family (T/M/C/R/X/I/D/G/P) and consider AWS Graviton for up to 40% better price-performance on compatible workloads.
- Layer purchasing options: Savings Plans or Reserved Instances for the steady baseline, On-Demand for peak, Spot with a mixed-instances policy for fault-tolerant bulk capacity.
- Drive elastic compute scaling with launch templates (not launch configurations), an ASG spread across multiple AZs, and the right scaling policy: target tracking as default, step for tiered responses, scheduled for known patterns, predictive for forecastable peaks with long boot times.
- Use warm pools to eliminate boot latency, lifecycle hooks to drain connections, and CapacityRebalance plus capacity-optimized allocation to minimize Spot interruptions.
- Combine elastic compute scaling with ELB for distribution, SQS for decoupled workers, ECS/EKS for containers, and Lambda for event-driven bursts.
- Respect the traps: launch template is required for mixed-instances policies; Spot requires interruption tolerance; multi-AZ is mandatory for resilience; target tracking uses instance warm-up, not cooldown.
Master elastic compute scaling and you unlock points across every SAA-C03 domain. The same mental model powers resilient design (Domain 2), high-performing architecture (Domain 3), and cost optimization (Domain 4) — which is why elastic compute scaling is among the highest-leverage topics in the entire SAA-C03 study plan.