examhub.cc — pass the most valuable certifications in the most efficient way
Vol. I

EC2 Auto Scaling, ELB, and Multi-AZ High Availability

7,350 words · ~37 min read

EC2 Auto Scaling and Elastic Load Balancing are the load-bearing pair behind every horizontally scalable AWS workload, and on SOA-C02 they are tested through the operational lens — not "should the architect use an ALB or an NLB" but "the ALB is returning 502s during peak, the ASG is flapping every ten minutes, and the new instances are being killed before they finish boot — fix it". Domain 2 (Reliability and Business Continuity) is worth 16 percent of the exam, and Task Statements 2.1 and 2.2 conflate scalability and high availability into one running stack: an Auto Scaling group launches instances from a launch template, registers them behind an ELB target group, the ELB health checks decide whether a new instance is fit to serve traffic, and the ASG decides whether to launch more, terminate, or rebalance across Availability Zones. SOA-C02 tests every seam in that pipeline.

This guide walks through the running stack from the SysOps angle: how an Auto Scaling group is configured (launch templates, desired/min/max, AZ distribution, mixed instances), how the three scaling policy types behave (target tracking, step, scheduled) and the math behind target tracking, why cooldown and warm-up exist and how to tune them so the ASG stops flapping, lifecycle hooks for graceful boot/drain, ELB target group internals (instance vs IP target type, health check thresholds), the health check grace period that explains so many "the new instance got killed for being unhealthy" outages, deregistration delay and sticky sessions, the operational differences between ALB, NLB, and GLB, multi-AZ rebalancing, and how Route 53 health checks layer on top for cross-region failover. You will also see the recurring SOA-C02 scenario shapes: 502 / 504 troubleshooting, ASG flapping diagnosis, instance refresh stuck at 0 percent healthy, and "the load balancer routes to old unhealthy instances" symptoms.

Why Auto Scaling and ELB Sit at the Core of SOA-C02 Domain 2

The official SOA-C02 Exam Guide v2.3 lists five skills under Task Statement 2.1 (scalability and elasticity) and four under 2.2 (high availability and resilient environments). Auto Scaling and ELB appear in nearly every one: create and maintain Auto Scaling plans, configure ELB and Route 53 health checks, differentiate single-AZ from Multi-AZ deployments, implement fault-tolerant workloads with EFS and Elastic IPs, and apply Route 53 routing policies (failover, weighted, latency). Where SAA-C03 asks "which load balancer should the architect choose" once at design time, SOA-C02 keeps asking "the running stack is misbehaving — which knob do you turn?" — and that knob is almost always health check grace period, deregistration delay, datapoints to alarm, cooldown, warm-up, or treatMissingData on the scaling alarm.

The SysOps framing is operational: the workload is already designed and deployed; you are running it. The exam loves these recurring shapes — ALB returning 502 during a peak (target unhealthy → deregistration delay too short → in-flight requests dropped), CPU alarm flapping every ten minutes (period 60s with M=N=1 is too sensitive — change to 3-of-5), instance refresh stuck (min healthy percentage too high to make progress), and sticky sessions breaking after a deployment (lb_cookie rotates per target group, but the app_cookie from the application server was not configured). EC2 Auto Scaling, ELB, and Multi-AZ HA is the topic where every later operational decision plugs in: the CloudWatch alarms feeding the scaling policies, the VPC subnets the ASG launches into, the IAM instance profile granting access to S3 and Secrets Manager, and the AWS Backup plan covering the underlying EBS volumes for stateful tiers.

  • Auto Scaling group (ASG): a logical group of EC2 instances managed together — launches and terminates instances to keep the count between min and max, targets desired, and spreads across one or more Availability Zones.
  • Launch template: the immutable definition of how an instance is launched (AMI, instance type, key pair, security groups, user data, IAM instance profile, block device mapping). Launch templates are versioned; ASGs reference a specific version, $Latest, or $Default.
  • Desired / min / max capacity: the three integers that govern an ASG. Desired is the count the ASG tries to maintain; min is the floor (no scale-in below); max is the ceiling (no scale-out above).
  • Scaling policy: the rule that changes desired capacity. Three types: target tracking (KeepMetricAt(50% CPU)), step scaling (+2 if CPU>70, +4 if CPU>85), and scheduled (set desired=10 at 09:00 Mon-Fri).
  • Cooldown period: a wait window after a simple scaling activity completes before the ASG considers another scaling action — prevents thrashing.
  • Warm-up time (instance warmup): the seconds a newly launched instance is excluded from aggregated metrics so a still-booting instance does not skew target tracking.
  • Lifecycle hook: a pause point during launch (autoscaling:EC2_INSTANCE_LAUNCHING) or terminate (autoscaling:EC2_INSTANCE_TERMINATING) where the ASG sends an event and waits for CompleteLifecycleAction before proceeding.
  • Target group: an ELB construct that holds registered targets (EC2 instances by ID, IP addresses, Lambda functions, or other ALBs) and runs health checks against them.
  • Health check grace period: ASG-level setting (in seconds) that delays the start of ELB health-check evaluation for a freshly launched instance — prevents premature termination of still-booting instances.
  • Deregistration delay (connection draining): ELB-level setting (in seconds) that keeps an in-flight connection alive while a target is being removed; default 300, max 3600.
  • Sticky sessions: ELB feature that pins a client to a specific target via a cookie (lb_cookie for ALB-managed, app_cookie for application-managed).
  • Instance refresh: a managed rolling replace of every instance in an ASG when the launch template or AMI changes — MinHealthyPercentage and InstanceWarmup govern progress.
  • Route 53 health check: an external health probe (HTTP/HTTPS/TCP, calculated, or CloudWatch-alarm-backed) that integrates with failover routing, weighted, and latency routing policies.
  • Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html

EC2 Auto Scaling and Elastic Load Balancing, in Plain Language

Auto Scaling and ELB jargon is dense — three analogies make the moving parts intuitive.

Analogy 1: The Restaurant Kitchen Brigade

An Auto Scaling group is a restaurant kitchen brigade. The launch template is the standard recipe + station setup card — every new line cook arrives knowing exactly which station, which knives, which uniform (AMI), which apron color (security group). The desired/min/max capacity is the head chef's staffing plan: minimum three cooks at any time so dinner service never stalls (min), maximum twelve so the kitchen does not run out of stations (max), targeting six during normal service (desired). Target tracking is the head chef's rule "keep every station at 50 percent utilization" — if every cook is slammed at 90 percent, more cooks get called in; if cooks are sitting around at 20 percent, some get sent home. The cooldown period is the 15-minute observation window after calling in a new cook before considering another call — without it, the head chef would frantically call in and dismiss cooks every minute. The warm-up time is the 20 minutes a new cook needs to read the menu and find the salt before their station's "utilization" counts in the chef's averaging — they look idle but they are still ramping up.

The Elastic Load Balancer is the maître d' at the front door. The target group is the list of available stations. A health check is the maître d' walking by each station every ten seconds to confirm the cook is responding ("two thumbs up = healthy"). Health check grace period is the five-minute pass for new cooks — the maître d' will not start judging a brand-new cook for two minutes; otherwise every newly-arrived cook would be fired before opening their station. Deregistration delay is the 300-second courtesy — when a cook ends shift, the maître d' stops sending new orders to that station, but lets the cook finish the dishes already on their counter before walking out. Sticky sessions are the "this is your usual server" rule — a regular customer who likes a specific waiter gets routed back to them on every visit (cookie). Multi-AZ is the two-kitchen restaurant — half the cooks in the east kitchen, half in the west; if a fire shuts down one kitchen, dinner service continues from the other.

Analogy 2: The Building Elevator System

A scaling policy is an elevator dispatch system. Target tracking is the modern algorithm — "keep average wait time under 30 seconds across all elevators". The system observes load, decides which elevator to send where, and the operator never specifies "if floor 5 button is pressed and elevators 1-3 are busy, send elevator 4". Step scaling is the older traffic-tier rules — "if more than 5 buttons pending, dispatch 2 elevators; if more than 10, dispatch 4". You manually defined the steps, the system follows them. Scheduled scaling is the morning rush mode — every weekday at 8:30am, automatically position three elevators on the lobby floor for the upcoming surge, no need to wait for metrics. The cooldown is the doors-closing pause — once an elevator is dispatched to floor 12, the system does not redirect it to floor 7 mid-trip. The lifecycle hook is the elevator inspector window — between car arrival and "ready for passengers", the inspector has a chance to sign off; only after CompleteLifecycleAction does the elevator open its doors.

Analogy 3: The Hospital Emergency Department Triage

An ALB doing path-based routing is triage at the ED entrance. Path /api/* goes to one team (the API target group), /admin/* goes to another team (admin target group), /* goes to general medicine. The listener rule priority is the order of triage questions — lower number is asked first. A target group health check is the bed-status board — green if the bed is ready, red if the patient just left and the bed needs cleaning. The deregistration delay is the bed-cleaning grace period — the bed is no longer accepting new admissions but the current patient is allowed to finish their treatment. NLB instance vs IP target type is the two ways to address a bed: instance (assigned bed number, hospital-managed) vs IP (the patient's own portable monitor IP, useful when the patient moved across wards).

For SOA-C02, the kitchen brigade analogy is the most useful because it covers the full launch → health → drain → rebalance cycle. When a question describes "instances are launching but the ALB still routes to the old unhealthy ones", think: the maître d' has not yet confirmed two thumbs up on the new cook (health check grace period not expired) AND has not finished walking the old cook out (deregistration delay still draining). When a question describes "ASG is flapping every ten minutes", think: head chef is calling cooks in and out without a cooldown observation window. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html

Auto Scaling Group Components: Launch Templates, Desired/Min/Max, AZ Distribution

An Auto Scaling group is the SysOps unit of fleet management. Three groups of settings govern its behavior.

Launch templates (and the legacy Launch Configurations)

A launch template captures everything an instance needs at launch: AMI ID, instance type (or list of types for mixed instances), key pair, security groups, IAM instance profile, user-data script, block device mapping, network interfaces, monitoring, hibernation, metadata options (IMDSv2 only is the SOA-recommended setting), and tags. Launch templates are versioned — every edit creates a new version number. ASGs reference a specific version (1, 2), or the special pointers $Latest (always the newest) and $Default (the version flagged as default).

The older Launch Configurations are immutable, unversioned, and lack support for newer features (mixed instances policies, multiple network interfaces, several newer instance types). AWS officially deprecated launch configurations for new ASGs created on or after 2023-12-31. SOA-C02 may show launch configurations in older question stems, but the exam-correct answer for any modern ASG is a launch template.

Desired, minimum, and maximum capacity

Three integers govern the ASG's count:

  • MinSize — the floor. The ASG never scales below this number (a scale-in policy that would drop below min is clamped at min).
  • MaxSize — the ceiling. The ASG never scales above this number (a scale-out policy that would exceed max is clamped at max).
  • DesiredCapacity — the count the ASG actively maintains. Scaling policies move desired up and down within [min, max].
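The clamping rule above is simple but worth internalizing — a minimal sketch of how any scaling adjustment is bounded by [min, max]:

```python
def apply_scaling(desired, adjustment, min_size, max_size):
    """Clamp a scaling adjustment to the ASG's [min, max] bounds."""
    return max(min_size, min(max_size, desired + adjustment))

# Scale-out beyond MaxSize is clamped at max:
print(apply_scaling(desired=9, adjustment=4, min_size=2, max_size=10))   # 10
# Scale-in below MinSize is clamped at min:
print(apply_scaling(desired=3, adjustment=-4, min_size=2, max_size=10))  # 2
```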

A common SysOps error is setting MaxSize too low for the worst-case demand — the ASG saturates at max, the alarm stays in ALARM, and customers get errors. Another is setting MinSize = MaxSize = DesiredCapacity (a fixed-size ASG), which provides AZ rebalancing and instance replacement on health failure but no elasticity.

AZ distribution and Multi-AZ

An ASG is configured with one or more VPC subnets, each in a distinct AZ. The ASG distributes instances evenly across AZs by default — if you have desired = 6 and three subnets in three AZs, you get 2 instances per AZ. When an AZ becomes impaired, the ASG launches replacement capacity in the surviving AZs.

For HA, the SOA-recommended baseline is at least two AZs (preferably three) per ASG. Single-AZ ASGs are appropriate only for batch / dev workloads where AZ isolation is acceptable.

  • Default cooldown: 300 seconds (simple scaling policies; target tracking and step scaling do not use the default cooldown — they use their own warm-up).
  • Default health check type: EC2 (only EC2 status checks). ELB health checks must be explicitly enabled on the ASG for it to react to ELB target health.
  • Default health check grace period: 300 seconds (5 minutes) when ELB health checks are enabled. Some console and CLI paths default the value to 0 seconds when the field is omitted — a frequent SOA-C02 trap.
  • Default deregistration delay (connection draining): 300 seconds (5 minutes). Range 0–3600 (1 hour).
  • Default ALB target group health check interval: 30 seconds, healthy threshold 5, unhealthy threshold 2 (varies by target type).
  • Default NLB target group health check interval: 30 seconds for TCP/HTTP/HTTPS health checks, healthy and unhealthy thresholds both 3.
  • Maximum scaling rate (target tracking): in some configurations, target tracking adds at most the larger of 1 instance or 10% of the group size per evaluation cycle — tune EstimatedInstanceWarmup to control reactivity.
  • Instance warmup default for instance refresh: inherits the ASG's DefaultInstanceWarmup if set; otherwise the policy's specified warmup; otherwise the health check grace period.
  • Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/Cooldown.html
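The threshold defaults above translate directly into detection latency — the minimum time before a failing target is marked unhealthy is interval × unhealthy threshold. A quick worked sketch:

```python
def time_to_unhealthy(interval_s, unhealthy_threshold):
    """Seconds of consecutive failures before a target is marked unhealthy."""
    return interval_s * unhealthy_threshold

# ALB defaults listed above: 30 s interval, unhealthy threshold 2 -> 60 s
print(time_to_unhealthy(30, 2))  # 60
# NLB defaults: 30 s interval, threshold 3 -> 90 s
print(time_to_unhealthy(30, 3))  # 90
```

This is why shortening the interval (or the threshold) is the usual knob when "the load balancer keeps routing to unhealthy instances for a minute" shows up in a scenario.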

Scaling Policy Types: Target Tracking, Step, Scheduled — Operational Selection

Three scaling policy types, each with a clear operational fit.

Target tracking — the default, most-used type

Target tracking keeps a single metric at a target value, similar to a thermostat. You pick a metric (predefined ASGAverageCPUUtilization, ASGAverageNetworkIn/Out, ALBRequestCountPerTarget, or a custom metric) and a target value (e.g., 50 percent CPU). The ASG creates two CloudWatch alarms behind the scenes — one for scale-out (metric > target) and one for scale-in (metric < target) — and adjusts desired to keep the metric at the target.

Target tracking math:

  • Scale-out is fast and aggressive — the alarm fires after a short evaluation window (often 3 datapoints), and the ASG adds enough instances to bring the metric down. The added count is computed proportionally: if current desired = 10 and average CPU is 80 percent against a target of 40 percent, the policy aims for ~20 instances.
  • Scale-in is slow and conservative — the scale-in alarm requires 15 consecutive datapoints (15 minutes by default) before firing, to prevent termination of capacity that may be needed in the next minute.
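The proportional scale-out math can be sketched in a few lines — this mirrors the 10 → 20 example above, not AWS's exact internal algorithm:

```python
import math

def target_tracking_desired(current_desired, metric_avg, target_value):
    """Proportional capacity estimate behind target tracking:
    new_desired ~= current * (actual metric / target)."""
    return math.ceil(current_desired * metric_avg / target_value)

# Example from the text: 10 instances at 80% CPU against a 40% target -> ~20
print(target_tracking_desired(10, 80, 40))  # 20
```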

Use target tracking when:

  • The metric has a clear target value (CPU at 50%, request count per target at 1000).
  • You want AWS to manage the alarms and scaling math.
  • The workload's load is roughly proportional to the chosen metric.

Step scaling — for tiered responses

Step scaling lets you define adjustment "steps" based on how far the metric has departed from the threshold:

  • CPU 70–80% → +2 instances.
  • CPU 80–90% → +4 instances.
  • CPU >90% → +8 instances.
  • CPU 20–30% → -1 instance.
  • CPU <20% → -2 instances.
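The tiers above behave like a lookup evaluated from the largest breach down — a minimal sketch of the same table (with the 30–70% range as a dead band where no alarm fires):

```python
def step_adjustment(cpu):
    """Step-scaling tiers from the text, evaluated highest breach first."""
    if cpu > 90:
        return +8
    if cpu > 80:
        return +4
    if cpu > 70:
        return +2
    if cpu < 20:
        return -2
    if cpu < 30:
        return -1
    return 0  # inside the 30-70% dead band: no scaling action

print(step_adjustment(95))  # 8
print(step_adjustment(25))  # -1
print(step_adjustment(50))  # 0
```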

Each step is defined by a CloudWatch alarm tied to the policy. Step scaling is appropriate when:

  • The relationship between metric and required capacity is non-linear.
  • You want explicit control over how aggressively the fleet grows under heavy breach.
  • The workload tolerates occasional over- or under-shoot.

Scheduled scaling — for predictable traffic patterns

Scheduled scaling changes desired, min, or max at a specific time. Configured with one-time or recurring (cron) schedules:

  • Every weekday at 08:30 UTC: desired = 10 (morning rush).
  • Every weekday at 19:00 UTC: desired = 4 (after-hours).
  • One-time on Black Friday at 00:00: desired = 50.
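The recurring schedules above use standard UTC cron syntax. A hedged sketch of the request parameters a `put_scheduled_update_group_action` call would take (the ASG name `web-asg` is hypothetical):

```python
# Hypothetical ASG name; Recurrence uses standard UTC cron syntax.
morning_rush = {
    "AutoScalingGroupName": "web-asg",
    "ScheduledActionName": "morning-rush",
    "Recurrence": "30 8 * * MON-FRI",  # every weekday 08:30 UTC
    "DesiredCapacity": 10,
}
after_hours = {
    "AutoScalingGroupName": "web-asg",
    "ScheduledActionName": "after-hours",
    "Recurrence": "0 19 * * MON-FRI",  # every weekday 19:00 UTC
    "DesiredCapacity": 4,
}
# Each dict would be passed as
# autoscaling.put_scheduled_update_group_action(**morning_rush)
print(morning_rush["Recurrence"])
```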

Use scheduled scaling for:

  • Known traffic patterns (business hours, marketing campaigns, batch windows).
  • Pre-warming before a forecast spike — set desired higher before traffic arrives so target tracking does not have to react.
  • Cost reduction in non-prod (scale dev environments to zero overnight).

Scheduled and dynamic (target tracking / step) policies coexist on the same ASG: schedule sets a baseline; target tracking handles the rest.

Predictive scaling

Predictive scaling is a fourth flavor — it analyzes historical CloudWatch metric data, learns daily/weekly patterns, and pre-launches capacity before the predicted demand arrives. It is typically combined with target tracking so predictive handles the baseline curve and target tracking absorbs the unpredicted variance. SOA-C02 mentions predictive scaling occasionally; the high-frequency questions stay on target tracking, step, and scheduled.

Whenever a SOA-C02 scenario describes "the team wants to keep CPU at 50%" or "scale to handle a steady metric", the answer is target tracking. Step scaling shows up in scenarios with explicit tiered responses ("if CPU > 80, add 4; if CPU > 90, add 8"). Scheduled scaling is reserved for explicit time-based patterns ("every weekday at 9am"). When a scenario blends a steady baseline with a predictable surge, the SOA-correct answer is scheduled scaling combined with target tracking on the same ASG. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scaling-target-tracking.html

Cooldown Period and Warm-Up Time: Preventing Thrashing

The two settings most often blamed for ASG flapping.

Cooldown — the post-action observation window (simple scaling)

Cooldown applies to simple scaling policies (the legacy single-step type). After a scale-out or scale-in action completes, the ASG waits the cooldown duration (default 300 seconds) before considering another scaling action of the same type. The intent is to give new instances time to come online and influence the metric before the policy reacts again.

Cooldown does not apply to target tracking or step scaling — those use instance warm-up instead.

Instance warm-up — the metric-aggregation grace period

Instance warm-up (also called EstimatedInstanceWarmup) is the duration during which a freshly launched instance's metrics are excluded from the ASG's aggregated statistics. The reasoning: a brand-new instance at 0 percent CPU would drag down the average and trick target tracking into thinking the fleet is over-provisioned, causing premature scale-in.

A typical web tier needs 60–180 seconds of warm-up; an instance with heavy bootstrapping (configuration management, container pulls, JVM warmup, cache priming) may need 300–600 seconds. The right warm-up matches how long the instance takes to become representative of normal load.
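Why exclusion matters can be shown with a toy simulation — a three-instance fleet where one instance launched 20 seconds ago and is still idle (instance records are hypothetical):

```python
def aggregated_cpu(instances, now, warmup_s):
    """Average CPU over instances whose age exceeds the warm-up window."""
    warm = [i["cpu"] for i in instances if now - i["launched"] >= warmup_s]
    return sum(warm) / len(warm) if warm else None

fleet = [
    {"launched": 0,   "cpu": 70},  # long-running, representative
    {"launched": 0,   "cpu": 74},
    {"launched": 580, "cpu": 2},   # booted 20 s ago, still idle
]
# Without warm-up the idle newcomer drags the average to ~48.7 and
# could trigger a premature scale-in; with a 60 s warm-up it is excluded.
print(aggregated_cpu(fleet, now=600, warmup_s=60))  # 72.0
```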

DefaultInstanceWarmup (ASG-level)

Modern ASGs support DefaultInstanceWarmup as an ASG-level setting that applies to every dynamic scaling policy, every health check grace period default, and every instance refresh — one place to set it instead of duplicating across every policy. SOA-C02 expects you to know this consolidation exists and to prefer setting DefaultInstanceWarmup on the ASG over per-policy warm-ups.

  • Default cooldown (simple scaling only): 300 seconds.
  • Default instance warm-up: depends on policy type — newer policies inherit DefaultInstanceWarmup from the ASG; older ones fall back to 300 seconds.
  • Health check grace period default: 300 seconds for ASGs created with ELB health checks enabled (some console paths default to 0 — verify).
  • Maximum cooldown: no fixed maximum, but rarely makes sense above 600 seconds.
  • Maximum scaling rate per cycle (target tracking): governed by metric staleness and warm-up; a too-short warm-up causes "ladder" scale-out where the policy adds more than necessary.
  • Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/Cooldown.html

A SOA-C02 distractor: "the team set a 300-second cooldown but the target tracking policy is still flapping — why?". The answer is that target tracking ignores the cooldown setting entirely; its anti-thrashing mechanism is the instance warm-up (and the asymmetric scale-out vs scale-in alarm windows AWS configures internally). To stop target tracking flapping, increase EstimatedInstanceWarmup (or DefaultInstanceWarmup on the ASG), not the cooldown. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/Cooldown.html

AZ Rebalancing: How an ASG Reacts to Uneven AZ Distribution

When the ASG's instances become unevenly distributed across AZs — for example because one AZ went impaired and the ASG launched replacements in the surviving AZs, then the impaired AZ recovered — the ASG triggers AZ rebalancing: it launches new instances in the under-populated AZ(s) and terminates instances in the over-populated AZ(s) to restore even distribution.

Rebalancing follows the scale-out, then scale-in pattern: the ASG launches first (temporarily exceeding desired), waits for the new instances to be in service, and then terminates the old ones — preserving capacity throughout. The same instance warm-up applies.

A few operational notes:

  • AZ rebalancing can trigger unexpectedly when an ASG is reconfigured to add a new subnet/AZ. The ASG sees an under-populated AZ and rebalances, replacing instances. Plan for this when expanding to a new AZ.
  • Termination policies (default, oldest, newest, oldest launch configuration, oldest launch template, allocation strategy) determine which instance is chosen to terminate during rebalancing or scale-in. The default policy first picks the AZ with the most instances (rebalancing AZs), then prefers the instance using the oldest launch template or launch configuration.
  • Suspended processes: SysOps engineers can suspend AZRebalance, Launch, Terminate, HealthCheck, ReplaceUnhealthy, AlarmNotification, ScheduledActions, or AddToLoadBalancer to investigate or pause activity. Suspending AZRebalance is a common debugging move when working in a single AZ for testing.

Lifecycle Hooks: Pause Points for Custom Boot and Drain Logic

A lifecycle hook turns the ASG's launch and terminate actions into pausable events that wait for an external CompleteLifecycleAction API call before proceeding.

The two hook types

  • autoscaling:EC2_INSTANCE_LAUNCHING — fires when a new instance reaches Pending:Wait state. The ASG holds the instance in that state until you call CompleteLifecycleAction (or the timeout expires). Use this to bootstrap configuration, attach an EFS volume, register with a third-party config store, prime caches, or run integration tests before letting the instance enter service.
  • autoscaling:EC2_INSTANCE_TERMINATING — fires when an instance is selected for termination. Held in Terminating:Wait. Use this to drain queues, flush logs, deregister from external service registries, snapshot ephemeral state, or upload final logs to S3 before EC2 terminates the instance.

Hook configuration

Each hook has:

  • HeartbeatTimeout — how long the ASG waits before timing out (default 3600 seconds, max 7200). On timeout, the DefaultResult action is taken.
  • DefaultResult — CONTINUE (proceed with the launch/terminate) or ABANDON (abandon the action: a launching instance is terminated; a terminating instance terminates without further waiting).
  • NotificationTargetARN + RoleARN — typically an SNS topic or SQS queue that the bootstrap/drain process consumes; or EventBridge consumes the EC2 Instance-launch Lifecycle Action event directly.

Common operational patterns

  • Configuration management bootstrap: launch hook → SNS → Lambda → run Ansible/Chef → CompleteLifecycleAction CONTINUE.
  • Graceful drain: terminate hook → EventBridge → SSM Automation document → drain queue → flush logs → CompleteLifecycleAction CONTINUE.
  • Container draining: terminate hook → ECS-style task draining → Lambda waits for tasks to finish → CompleteLifecycleAction.
  • Snapshot before terminate: terminate hook → SSM Run Command → aws ec2 create-snapshot → upload manifest to S3 → CompleteLifecycleAction.
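Every pattern above ends the same way: the worker calls `CompleteLifecycleAction`. A sketch of the parameters that call takes (hook, ASG, and instance names here are hypothetical):

```python
# Hypothetical names; the drain/bootstrap worker would call
# autoscaling.complete_lifecycle_action(**params) when its work finishes.
def completion_params(hook, asg, instance_id, ok=True):
    return {
        "LifecycleHookName": hook,
        "AutoScalingGroupName": asg,
        "InstanceId": instance_id,
        "LifecycleActionResult": "CONTINUE" if ok else "ABANDON",
    }

p = completion_params("drain-hook", "web-asg", "i-0abc1234", ok=True)
print(p["LifecycleActionResult"])  # CONTINUE
```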

A common SOA-C02 confusion: "the team needs to run a script on every new instance — user-data or lifecycle hook?" Both are valid in different windows. User-data runs once on first boot as part of the OS bootstrap — fast, atomic, but it cannot block ASG state transitions and is invisible to the ASG. Lifecycle hooks run as a controlled gate that the ASG holds open — they can block, time out, succeed, or ABANDON the launch. For "the instance must finish a one-shot OS install" → user-data. For "the instance must register with an external config server before serving traffic, and we need the ASG to roll back if registration fails" → lifecycle hook. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html

Health Checks: ASG vs ELB — Which the ASG Uses

Two distinct health-check systems run side by side. Confusing them is the most common SOA-C02 mistake.

EC2 status checks (the ASG default)

Every EC2 instance has two status checks published by AWS:

  • StatusCheckFailed_System — AWS infrastructure failed (host, network, AZ). EC2 Auto Recovery handles many of these automatically.
  • StatusCheckFailed_Instance — the instance OS is unreachable (kernel panic, network misconfiguration, disk full).

By default, the ASG uses only EC2 status checks to decide whether an instance is healthy. If both pass, the instance is considered healthy regardless of whether it can serve application traffic.

ELB target group health checks

ELB target groups run their own health checks against the registered targets — typically an HTTP/HTTPS probe at /healthz or /health returning 200, or a TCP probe to a port. This is a richer, application-aware signal: a target with a 200 response from /healthz is verified to be reachable AND the application is responding.

Wiring ELB health into the ASG

The ASG only acts on ELB health if you explicitly enable ELB health checks at the ASG level. The setting (HealthCheckType) has three modes:

  • EC2 (default) — only EC2 status checks count.
  • ELB — both EC2 status checks AND ELB target group health checks count. An instance failing either is replaced.
  • VPC_LATTICE — for VPC Lattice service network targets (newer scope, not on most SOA-C02 questions).

Health Check Grace Period (the SOA gotcha)

When a brand-new instance launches, the ELB starts running health checks against it almost immediately. If the instance has not finished booting (Apache not yet started, Java not yet warm), the health checks fail and the ASG marks the instance unhealthy and terminates it — only to launch another that hits the same fate. This is the "new instance keeps getting killed before serving traffic" symptom.

The fix is the health check grace period: a duration (in seconds) during which ELB health-check failures do not count against the instance. Default is 300 seconds for ASGs with ELB health checks enabled, but several console paths and CLI defaults yield 0. A grace period that is too short kills booting instances; one that is too long delays detection of genuinely broken instances.
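The grace-period rule reduces to one comparison — a failure only counts once the instance's age exceeds the grace period. A minimal sketch:

```python
def counts_against_instance(instance_age_s, grace_period_s):
    """An ELB health-check failure only counts once the instance's age
    exceeds the health check grace period."""
    return instance_age_s > grace_period_s

# Grace period 0: a failure 10 s after launch already counts -> killed mid-boot
print(counts_against_instance(10, 0))    # True
# Grace period 300: the same failure is ignored while the app boots
print(counts_against_instance(10, 300))  # False
```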

A canonical SOA-C02 distractor: "the SysOps team wants the ASG to replace an instance whose application is unhealthy on the ALB, but the ASG keeps the instance running for hours". The fix is to change the ASG's HealthCheckType from EC2 (default) to ELB. The exam frequently shows ASG XML/JSON with HealthCheckType: EC2 and asks the candidate to spot the misconfiguration. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html

The default health check grace period is 300 seconds when ELB health checks are enabled, but some console wizards and the CreateAutoScalingGroup API leave it at 0 if the field is omitted. A grace period of 0 means the ELB starts judging instances the moment they launch, before any application has booted — and the ASG kills every new instance for being unhealthy. The most common ASG launch-failure outage on SOA-C02 is exactly this. Always set HealthCheckGracePeriod explicitly. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/health-check-grace-period.html

Connection Draining and Deregistration Delay: Graceful Termination

When the ASG decides to terminate an instance — for scale-in, instance refresh, or health failure — the ELB must stop sending new connections to that target while letting in-flight connections complete.

Deregistration delay (connection draining) on ALB / NLB

When a target is deregistered (or moved to draining state), the ELB:

  1. Stops sending new connections to it.
  2. Allows existing connections to complete for up to the deregistration delay (default 300 seconds, max 3600 seconds).
  3. After the delay expires, forcibly closes any remaining connections.
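The three steps above can be simulated per connection — each in-flight request either finishes inside the delay or is force-closed when it expires:

```python
def drain_outcome(inflight_remaining_s, deregistration_delay_s):
    """Classify each in-flight connection: completes within the delay,
    or is forcibly closed when the delay expires."""
    return ["completed" if t <= deregistration_delay_s else "force-closed"
            for t in inflight_remaining_s]

# Three requests needing 5 s, 40 s, and 400 s against the 300 s default:
print(drain_outcome([5, 40, 400], 300))
# ['completed', 'completed', 'force-closed']
```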

A too-short delay drops in-flight requests — visible as 502 / 503 spikes in the application logs and ELB access logs. A too-long delay slows scale-in (the ASG must wait for all draining instances before terminating them).

The right value depends on the workload:

  • Stateless web tier with short requests: 30–60 seconds is plenty.
  • API with long-polling or large file uploads: 300–900 seconds.
  • WebSocket or persistent connections: 600–1800 seconds (or terminate WebSockets at the application layer first).

ASG draining vs ELB draining

When the ASG terminates an instance attached to an ELB target group:

  1. The ASG calls DeregisterTargets on the target group.
  2. The target enters draining state on the ELB.
  3. The ASG waits for the deregistration delay to elapse before actually terminating the EC2 instance.
  4. The instance is terminated.

If a lifecycle hook is also configured for EC2_INSTANCE_TERMINATING, the ASG runs both: it deregisters from the ELB AND fires the lifecycle hook simultaneously, waiting for both to complete.

SOA-C02 sometimes asks "the application needs 60 seconds to flush its queue before terminating; what's the right configuration?". The answer is a terminate lifecycle hook that runs the flush logic and calls CompleteLifecycleAction. Deregistration delay alone only stops new connections — it does not give the application a chance to do custom cleanup. Use the deregistration delay to let in-flight ELB connections finish; use the lifecycle hook for custom drain logic. Reference: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay

ALB vs NLB vs GLB: Operational Differences for SysOps

Three ELB types in the modern family. SOA-C02 focuses on ALB and NLB; GLB appears occasionally.

Application Load Balancer (ALB) — Layer 7

  • HTTP / HTTPS / gRPC / WebSocket routing.
  • Path-based (/api/* → group A, /admin/* → group B), host-based (api.example.com vs web.example.com), HTTP header and query string routing.
  • Listener rules with priorities — first match wins; lower priority number is higher priority.
  • Target types: instance (EC2 by ID), ip (any IP including on-prem via VPN/DC), lambda (invoke Lambda for HTTP), alb (chain to another ALB).
  • Sticky sessions via ALB-managed lb_cookie (duration configurable) or application-managed app_cookie (the application sets the cookie; ALB pins to it).
  • WAF integration at Layer 7.
  • Has security groups — operator controls ingress.
  • Use ALB when: HTTP-based traffic, microservice routing, WAF protection, header/path-based routing.

Network Load Balancer (NLB) — Layer 4

  • TCP / UDP / TLS pass-through.
  • Static IP per AZ — each AZ's NLB node gets a stable address, optionally one of your own Elastic IPs per AZ; supports PrivateLink as a service endpoint.
  • Extreme performance — millions of requests per second, ultra-low latency.
  • Target types: instance, ip (must be VPC-routable), alb (rare; NLB-in-front-of-ALB pattern).
  • Preserves source IP when target type is instance (and via Proxy Protocol v2 for ip target type).
  • No security group on the NLB itself historically — security groups attach to the targets only. Newer NLB feature added optional NLB-level security groups; SOA-C02 still tests the historical default that "NLB has no security group".
  • Cross-zone load balancing disabled by default (you pay inter-AZ data transfer when enabled, but distribution is more even). ALB has cross-zone enabled by default.
  • Use NLB when: TCP/UDP non-HTTP, static IP requirement, ultra-high throughput, source IP preservation, PrivateLink endpoint.

Gateway Load Balancer (GLB) — Layer 3 traffic-inspection

  • Distributes IP traffic to a fleet of third-party virtual appliances (firewalls, IDS/IPS, deep packet inspection).
  • Uses the GENEVE protocol on port 6081.
  • Operationally niche — appears occasionally on SOA-C02 in scenarios about deploying a marketplace firewall fleet.

Target group nuances

  • Same target group can be the destination for multiple ALB listener rules.
  • Health check protocol/path/port are configured per target group, not per ALB.
  • Slow start mode (ALB only): newly registered targets receive a linearly increasing share of traffic over a configurable duration (max 900 seconds), preventing cold-cache thundering. Disabled by default.

A SOA-C02 question may describe "the team needs to allow only specific IPs to reach the NLB" and offer "configure a security group on the NLB" as a distractor. Historically, NLB has no security group; you control ingress via the target instances' security groups or NACLs on the subnets. AWS later added optional NLB-level security groups, but the SOA-C02 baseline question still expects "NLB has no SG" as the operational truth. Reference: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html

When a workload needs the same client to keep hitting the same target across requests (legacy session state in instance memory, sticky shopping carts, WebSocket-style affinity), enable sticky sessions on the ALB target group.

Two sticky modes

  • lb_cookie (ALB-managed) — the ALB sets and rotates a cookie named AWSALB. The duration is configurable (default 1 day, max 7 days). When the client returns with the cookie, the ALB routes them to the same target. Per-target-group cookie — different target groups maintain independent stickiness.
  • app_cookie (application-managed) — the application sets its own session cookie (JSESSIONID, PHPSESSID, custom). The ALB observes the cookie value and pins the client to whichever target last set that cookie value. The application controls cookie name, domain, path, and lifetime.

Operational implications

  • lb_cookie is simpler: enable on the target group, ALB does the rest. Works for any application that does not already set its own session cookie.
  • app_cookie is preferred when the application already has its own session management — it avoids two-cookie confusion and lets the application logic control session lifetime.
  • Sticky sessions break when the target the cookie pins to is terminated, deregistered, or fails health checks. The next request rebalances to a new target — and the session state on the old target is lost (unless the application persists state externally).

Configuration knobs

On the target group:

  • Stickiness type — lb_cookie or app_cookie.
  • Stickiness duration — for lb_cookie, how long the cookie is valid.
  • App cookie name — for app_cookie, the name of the application cookie to observe.
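These knobs map to documented target group attribute keys (`stickiness.enabled`, `stickiness.type`, `stickiness.lb_cookie.duration_seconds`, `stickiness.app_cookie.cookie_name`); a sketch of building the payload for boto3's modify_target_group_attributes, with everything else illustrative:

```python
def stickiness_attributes(mode, duration_seconds=86400, app_cookie_name=None):
    """Build target group attributes enabling sticky sessions.

    mode: 'lb_cookie' (ALB-managed AWSALB cookie) or 'app_cookie'
    (pin to the application-set cookie named app_cookie_name).
    """
    attrs = [
        {"Key": "stickiness.enabled", "Value": "true"},
        {"Key": "stickiness.type", "Value": mode},
    ]
    if mode == "lb_cookie":
        # Duration ranges from 1 second to 7 days (604800 s); default 1 day.
        attrs.append({"Key": "stickiness.lb_cookie.duration_seconds",
                      "Value": str(duration_seconds)})
    elif mode == "app_cookie":
        if app_cookie_name is None:
            raise ValueError("app_cookie mode requires the application cookie name")
        attrs.append({"Key": "stickiness.app_cookie.cookie_name",
                      "Value": app_cookie_name})
    else:
        raise ValueError("mode must be 'lb_cookie' or 'app_cookie'")
    return attrs
```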

A SOA-C02 scenario: "after the team deployed a new target group via blue/green, sticky sessions broke for active users". The cause is that lb_cookie is per-target-group — when traffic shifted to a new target group, the old AWSALB cookie was meaningless and every client got rebalanced to a new target. Mitigations: (a) use app_cookie so the application's own session cookie persists across target groups; (b) externalize session state to ElastiCache or DynamoDB so target re-pinning is not catastrophic; (c) accept the disruption as part of blue/green. Reference: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/sticky-sessions.html

Mixed Instances Policy and Spot Integration

Modern ASGs support a mixed instances policy — multiple instance types in the same group, with separate purchase options for on-demand and Spot. Operationally important for cost optimization and resilience.

Policy components

  • Launch template with overrides — specify multiple instance types (m5.large, m5a.large, m4.large, m5n.large) so the ASG can substitute when one type is unavailable.
  • OnDemandPercentageAboveBaseCapacity — the percentage of additional capacity (above the on-demand base) to launch as on-demand. The remainder is Spot.
  • OnDemandBaseCapacity — the absolute number of always-on-demand instances at the bottom (typically your steady baseline).
  • SpotAllocationStrategy — price-capacity-optimized (preferred default) picks the most stable Spot capacity at the lowest price; lowest-price picks cheapest first; capacity-optimized picks deepest pools first.
  • SpotMaxPrice — optional ceiling on what the ASG will pay for Spot.

Operational patterns

  • Steady baseline + bursty Spot: OnDemandBaseCapacity = 4, OnDemandPercentageAboveBaseCapacity = 0, all extra capacity Spot. Saves 60–90% versus pure on-demand.
  • Resilient mixed: OnDemandPercentageAboveBaseCapacity = 50 — half on-demand, half Spot above the base. Trades savings for resilience to Spot interruption.
  • Spot interruption handling: ASG receives a 2-minute interruption notice via instance metadata; configure a lifecycle hook on EC2_INSTANCE_TERMINATING to drain. Spot capacity rebalancing (newer feature) proactively replaces Spot instances showing elevated interruption risk.
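The base/percentage arithmetic behind these patterns can be sketched as follows (rounding of the on-demand share above the base is assumed here to be toward on-demand; the scheduler's exact rounding is an implementation detail):

```python
import math

def capacity_split(desired, base, od_pct_above_base):
    """Split a desired capacity into (on_demand, spot) counts per a mixed
    instances policy: 'base' instances are always on-demand, then
    od_pct_above_base percent of the remainder is on-demand."""
    od_base = min(desired, base)
    above = desired - od_base
    od_above = math.ceil(above * od_pct_above_base / 100)  # assumed rounding
    return od_base + od_above, above - od_above

# "Steady baseline + bursty Spot" from the text: base 4, 0% above base.
# At desired capacity 10 that yields 4 on-demand + 6 Spot instances.
```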

Instance Refresh: Rolling Replace for AMI / Launch Template Updates

When the launch template changes (new AMI version, new instance type, updated user-data), the ASG does not automatically replace existing instances — they continue running the old launch template. Instance refresh is the managed rolling replacement that brings the fleet up to the new launch template version.

How instance refresh works

You start a refresh with StartInstanceRefresh specifying:

  • MinHealthyPercentage — the lower bound on healthy capacity during the refresh (default 90%, range 0–100). Lower values let the refresh proceed faster but with more reduced capacity.
  • InstanceWarmup — how long before a replacement is considered "in service" for the purposes of computing healthy percentage (default inherits from the ASG).
  • CheckpointPercentages + CheckpointDelay — pause points so the team can validate before continuing.
  • Strategy — Rolling (default) or launch-before-terminating (newer; launches the replacement before terminating the old instance).
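Putting those parameters together, a sketch of the boto3 call (the group name, numbers, and checkpoint delay below are illustrative):

```python
def instance_refresh_request(asg_name, min_healthy_pct=90,
                             warmup_seconds=300, checkpoints=None):
    """Build the kwargs for the autoscaling start_instance_refresh API.

    checkpoints is an optional list of percentages at which the refresh
    pauses so the team can validate before continuing."""
    prefs = {
        "MinHealthyPercentage": min_healthy_pct,
        "InstanceWarmup": warmup_seconds,
    }
    if checkpoints:
        prefs["CheckpointPercentages"] = list(checkpoints)
        prefs["CheckpointDelay"] = 600  # illustrative: pause 10 min per checkpoint
    return {"AutoScalingGroupName": asg_name,
            "Strategy": "Rolling",
            "Preferences": prefs}

# boto3.client("autoscaling").start_instance_refresh(
#     **instance_refresh_request("web-asg", min_healthy_pct=75))
```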

The refresh:

  1. Selects a batch of instances (sized to keep the fleet above MinHealthyPercentage).
  2. Launches replacements with the new launch template.
  3. Waits for InstanceWarmup and verifies health.
  4. Terminates the old instances.
  5. Repeats until every instance is on the new launch template.
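The batch sizing in step 1 follows directly from MinHealthyPercentage; a simplified model:

```python
import math

def max_batch_size(capacity, min_healthy_pct):
    """How many instances a refresh can replace at once while keeping at
    least min_healthy_pct of capacity in service (simplified model)."""
    min_healthy = math.ceil(capacity * min_healthy_pct / 100)
    return capacity - min_healthy

# capacity 10 at MinHealthyPercentage 90 -> batches of 1 (slow but safe);
# capacity 4 at MinHealthyPercentage 100 -> batch of 0: the refresh is stuck.
```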

Common operational issues

  • Refresh stuck at 0% complete: MinHealthyPercentage too high — the ASG cannot terminate any instance without dropping below the threshold. Lower the percentage to allow progress.
  • Refresh fails with "instance never reached InService": the new AMI's bootstrap is slower than the warm-up; replacement instances time out. Increase InstanceWarmup or pre-bake more into the AMI.
  • Refresh skips instances: instances in protected scale-in state are skipped — useful for canary nodes you want to preserve.

The exam frequently asks "the team has a new AMI with the latest patches; how do they update the ASG fleet?". The wrong answers are "delete and recreate the ASG", "manually terminate instances and let ASG launch", or "use AWS CodeDeploy" (out of scope). The right answer is: update the launch template version, then StartInstanceRefresh with appropriate MinHealthyPercentage and InstanceWarmup. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html

Multi-AZ Deployments: Spreading the ASG for Fault Tolerance

The single most important HA configuration for an ASG: multiple subnets in multiple AZs. SOA-C02 expects the candidate to know:

  • Minimum two AZs for fault tolerance — three preferred.
  • The ASG distributes evenly across the configured subnets/AZs.
  • When an AZ fails, EC2 launches do not succeed in the failed AZ — the ASG launches replacement capacity in the surviving AZs (this can require enough headroom in those AZs).
  • After the failed AZ recovers, AZ rebalancing redistributes back to even.
  • The ASG must be registered to ELB target groups across all the same AZs so traffic continues to flow.

Cross-zone load balancing — ALB vs NLB

  • ALB: cross-zone load balancing is enabled by default and cannot be disabled. Every ALB target receives roughly equal traffic regardless of AZ skew.
  • NLB: cross-zone load balancing is disabled by default — each AZ's NLB node only forwards to targets in the same AZ. Enabling cross-zone on NLB costs inter-AZ data transfer but evens out distribution when targets are unevenly distributed.

When NLB cross-zone is disabled and one AZ has 9 instances while another has 1, the AZ with 1 instance receives 50% of the traffic (because 50% of the NLB nodes are in that AZ), and that instance gets crushed. SOA-C02 tests this asymmetry: enable NLB cross-zone OR equalize instance count across AZs.
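The asymmetry is simple arithmetic. This sketch models equal traffic per NLB node (one node per AZ), which is the mechanism behind the 9-vs-1 example:

```python
def per_target_share(targets_per_az, cross_zone):
    """Fraction of total traffic each single target in an AZ receives.

    Model: DNS spreads clients evenly across one NLB node per AZ; with
    cross-zone off, each node forwards only to targets in its own AZ."""
    if cross_zone:
        total = sum(targets_per_az.values())
        return {az: 1 / total for az in targets_per_az}
    az_share = 1 / len(targets_per_az)
    return {az: az_share / n for az, n in targets_per_az.items()}

# 9 targets in AZ "a", 1 in AZ "b", cross-zone off:
# the lone "b" target absorbs 50% of all traffic.
```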

Elastic IPs and EFS for Stateful Fault Tolerance

Some workloads cannot be made fully stateless — a SysOps engineer needs the right primitives for the stateful tier.

Elastic IPs

An Elastic IP is a static public IP allocated to your account. Operationally:

  • Used when a specific instance must always be reachable at a known public IP (legacy whitelisted partner integrations, on-prem VPN endpoints).
  • An Elastic IP can be moved to a replacement instance during failover — a Lambda function triggered by a CloudWatch alarm or EventBridge event runs aws ec2 associate-address.
  • Ephemeral public IPs are not preserved on stop/start — only Elastic IPs persist.
  • Elastic IPs NOT associated are charged a small hourly fee (idle EIP penalty) — Trusted Advisor flags this.
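The failover re-association mentioned above reduces to one EC2 API call. A sketch of the kwargs for boto3's associate_address, as a Lambda handler might build them (the IDs are placeholders):

```python
def reassociate_eip_params(allocation_id, instance_id):
    """kwargs for ec2 associate_address when moving an Elastic IP to a
    replacement instance during failover (IDs here are placeholders)."""
    return {
        "AllocationId": allocation_id,
        "InstanceId": instance_id,
        # Allow taking the EIP even if it is still attached to the failed instance.
        "AllowReassociation": True,
    }

# In a Lambda failover handler:
#   boto3.client("ec2").associate_address(
#       **reassociate_eip_params("eipalloc-0abc...", "i-0def..."))
```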

For ELB-fronted workloads, Elastic IPs typically appear on the NLB (one per AZ) rather than on instances directly. ALB does not expose static IPs (use Global Accelerator for static anycast IPs in front of ALB).

Amazon EFS for shared filesystems

Amazon EFS is a managed NFSv4 filesystem that mounts on every instance in the ASG simultaneously. Operationally:

  • Use EFS when multiple instances need shared file state (uploaded media, shared config, content directories for a CMS).
  • Mount target per AZ — make sure every AZ the ASG launches into has an EFS mount target.
  • Performance modes: General Purpose (default) vs Max I/O. Throughput modes: Bursting (default), Provisioned, or Elastic.
  • Lifecycle policies move infrequently-accessed files to EFS-IA or EFS Archive for cost reduction.

FSx alternatives

  • FSx for Windows File Server — SMB shares for Windows-based workloads (Active Directory integration).
  • FSx for Lustre — high-performance compute.
  • FSx for NetApp ONTAP / FSx for OpenZFS — feature-rich shared storage for specialized workloads.

For SOA-C02 stateful HA scenarios:

  • "Multiple EC2 instances need to read and write the same files" → EFS.
  • "A specific instance must always have the same public IP after restart" → Elastic IP.
  • "Windows fleet needs an SMB share" → FSx for Windows.

Route 53 Health Checks and Failover Routing

Layered on top of ELB health checks, Route 53 health checks provide DNS-level health awareness for cross-region failover and weighted routing.

Health check types

  • Endpoint health check: HTTP / HTTPS / TCP probe from Route 53's distributed checkers (15+ regions). Configurable interval (30 seconds standard, 10 seconds fast), failure threshold, request string matching, latency measurement.
  • Calculated health check: combines up to 256 child health checks with a Boolean (AT LEAST N must be healthy).
  • CloudWatch alarm health check: maps a CloudWatch alarm's state to a health check result. Useful when the signal is internal (queue depth, error rate) and not a public HTTP endpoint.

Failover routing policy

A failover routing policy has two records — primary and secondary — each tied to a health check:

  • DNS queries return the primary value as long as its health check is healthy.
  • When the primary health check fails, queries return the secondary.
  • TTL controls how fast clients pick up the change.
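A sketch of what the two records look like as Route 53 ResourceRecordSet dicts for change_resource_record_sets — names, zone IDs, and health check IDs are placeholders:

```python
def failover_record(name, role, alb_dns, alb_zone_id, health_check_id=None):
    """A Route 53 ResourceRecordSet for failover routing.

    role is 'PRIMARY' or 'SECONDARY'; the alias targets an ELB. The
    secondary record typically relies on EvaluateTargetHealth rather
    than its own health check."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ELB's hosted zone ID, not yours
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record

# primary = failover_record("app.example.com", "PRIMARY",
#                           "my-alb.us-east-1.elb.amazonaws.com",
#                           "ZELBZONEID", health_check_id="hc-primary")
```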

Common patterns:

  • Active-passive cross-region: primary us-east-1 ALB, secondary us-west-2 ALB. When the east region health check fails, DNS shifts to west.
  • Active-passive with on-prem: primary on-prem datacenter, secondary AWS warm standby.
  • Weighted: split traffic by percentage across multiple values. Combine with health checks for fault-tolerant blue/green or canary at DNS layer.
  • Latency-based: route to the lowest-latency region for the client.
  • Geolocation / Geoproximity: route by geographic location.
  • Multivalue answer: return up to 8 healthy values for the client to choose, similar to round-robin DNS but with health-aware filtering.

SOA-C02 questions about "what happens if the entire region fails" almost always involve Route 53 failover routing. The pattern: primary record points to ELB in us-east-1 with a health check; secondary record points to ELB in us-west-2. When the primary health check fails, DNS shifts to the secondary. The SysOps team's job is to keep the warm standby ASG/ELB pre-provisioned (capacity sized for failover load) and verify the failover with periodic DR drills. Reference: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html

Scenario Pattern: ALB Returns 502 During Peak — Diagnosis Runbook

A recurring SOA-C02 troubleshooting scenario.

Symptoms

  • ALB access logs show elb_status_code = 502 and target_status_code = - (no response from target).
  • CloudWatch metric HTTPCode_ELB_5XX_Count is elevated; HTTPCode_Target_5XX_Count may also be elevated.
  • Customer-visible errors during scale events.

Runbook

  1. Check target group health: how many targets are healthy vs unhealthy at the moment of the spike? An ALB that lost half its targets during a deregistration is undersized for the remaining traffic.
  2. Check deregistration delay: too-short delay drops in-flight connections during scale-in or instance refresh. If the workload has long-poll or large-upload requests, raise the delay to 300–900 seconds.
  3. Check health check grace period on the ASG: too short → newly launched instances are killed before booting → effective fleet shrinks → 502s.
  4. Check target health check thresholds: UnhealthyThresholdCount = 2 is aggressive. A flaky network blip flaps targets in and out. Raise to 3–5 for less sensitivity (at the cost of slower detection).
  5. Check the application itself: target_status_code = 502 from the target's own response means the application returned 502 — not an ELB-injected one. Check application logs.
  6. Check the ALB idle timeout (a load balancer attribute, default 60 seconds). Long-running requests beyond the idle timeout are cut by the ALB → 504 (gateway timeout) typically, but mis-tuned applications may return 502 first.
  7. Check security group / NACL: target SG must allow ingress from ALB SG on the target port; NACLs must allow ephemeral return ports.

The most common root causes in order: deregistration delay too short during scale events, health check grace period 0 / too short, target group health check thresholds too aggressive.
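Steps 1 and 5 of the runbook hinge on reading the two status-code fields in the access log together; a rough first-pass classifier (a heuristic, not an exhaustive diagnosis):

```python
def triage_5xx(elb_status, target_status):
    """First-pass classification of an ALB access log 5xx entry from
    elb_status_code (int) and target_status_code ('-' if no response)."""
    if elb_status == 502 and target_status == "-":
        return "target never responded: check draining/terminated targets and grace period"
    if elb_status == 504:
        return "timeout: request exceeded ALB idle timeout or target is hung"
    if elb_status == 503 and target_status == "-":
        return "no healthy targets or connection refused during a scale event"
    if target_status == str(elb_status):
        return "application-generated error: check application logs, not the ELB"
    return "mixed signal: correlate with target health history"
```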

Scenario Pattern: ASG Flapping Every Ten Minutes — Tuning Runbook

Another canonical SOA-C02 scenario.

Symptoms

  • ASG launches 2 instances, terminates them ten minutes later, launches 2 again.
  • CloudWatch alarm oscillates OK → ALARM → OK every cycle.
  • CPU graph shows a sawtooth pattern around the target.

Runbook

  1. Datapoints to alarm: if the alarm uses M = N = 1 (one breaching point fires the alarm), every spike triggers scale. Change to 3 of 5 to require sustained breach.
  2. Instance warm-up too short: target tracking includes the new instance's metric immediately, sees low CPU on a still-booting instance, computes "fleet is over-provisioned", scales in. Increase EstimatedInstanceWarmup or DefaultInstanceWarmup to match real boot time (often 180–300 seconds).
  3. Target value too tight: a target of 50% with workload that swings naturally between 40% and 60% will continually trigger small scale actions. Raise to 60% with wider acceptable range, or use scheduled scaling for the predictable component.
  4. Cooldown (only if simple scaling): increase to 300–600 seconds.
  5. Use composite alarm: pair the CPU alarm with a sustained-load alarm (5xx error rate or queue depth) — only scale when both indicate genuine load.
  6. Switch to anomaly detection if the workload has seasonal patterns; static thresholds are flapping because normal swings cross the threshold.

The most common root causes: M=N=1 datapoints, instance warm-up too short, target value too tight.
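The M-of-N fix in step 1 can be illustrated with a toy evaluator:

```python
def alarm_fires(datapoints, threshold, m, n):
    """CloudWatch-style 'M out of N' evaluation: the alarm fires only when
    at least m of the last n datapoints breach the threshold."""
    window = datapoints[-n:]
    return sum(1 for d in window if d > threshold) >= m

cpu = [45, 48, 71]                    # a transient spike is the latest datapoint
assert alarm_fires(cpu, 70, 1, 1) is True   # 1-of-1: one spike triggers a scale action
assert alarm_fires(cpu, 70, 3, 5) is False  # 3-of-5: requires a sustained breach
```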

Scenario Pattern: ELB Routes to Old Unhealthy Instances After Refresh

Symptoms

  • After deploying a new AMI via instance refresh, customers continue hitting old instances that return 5xx.
  • Target group shows a mix of healthy and unhealthy targets.

Runbook

  1. Health check grace period: too long → unhealthy old instances are not detected fast enough by the ASG. Verify it's not set to a value that exceeds the rollout window.
  2. MinHealthyPercentage too high: the refresh did not progress because terminating one instance would breach the threshold; old instances stay in service.
  3. Health check protocol/path/port: if the new AMI changed the health endpoint (e.g., from /health to /healthz), the target group's health check still hits the old path → all new instances appear unhealthy → refresh stalls.
  4. Stuck deregistration: a long deregistration delay combined with persistent connections can keep old targets in draining for the full delay window.
  5. Sticky sessions: clients with valid lb_cookie keep returning to old targets even when newer healthy targets are available. After refresh, clear stickiness for affected sessions or use shorter cookie duration.

Common Traps Recap — Auto Scaling and ELB

Every SOA-C02 attempt sees several of these.

Trap 1: ASG default health check is EC2 only

The ASG ignores ELB target health unless HealthCheckType=ELB is explicitly set. Candidates assume "the ALB is health-checking the target, the ASG must be using that signal" — wrong by default.

Trap 2: Health check grace period default 0 in some paths

API/CLI omitted defaults yield 0. Always set explicitly to ≥120 seconds for any non-trivial application.

Trap 3: Cooldown applies only to simple scaling

Target tracking and step scaling use instance warm-up, not cooldown. Tuning cooldown to fix target tracking flap is futile.

Trap 4: lb_cookie stickiness is scoped to a single target group

Blue/green deployments to a new target group break sticky sessions for active users — the old target group's cookie means nothing to the new one.

Trap 5: NLB has no security group historically

Ingress filtering must happen on the target instances' security groups. (Optional NLB-level SG is a newer feature.)

Trap 6: NLB cross-zone disabled by default

Uneven AZ distribution causes some targets to be crushed while others sit idle. Enable cross-zone explicitly when needed.

Trap 7: Detailed monitoring is required for 1-minute scaling

Default EC2 monitoring is 5-minute period; target tracking with 1-minute alarms requires detailed monitoring on the metric source instances.

Trap 8: Instance refresh stuck because MinHealthyPercentage too high

If the ASG cannot terminate any instance without breaching the threshold, the refresh hangs. Lower the percentage to make progress.

Trap 9: Deregistration delay too short drops in-flight requests

300s is the default for a reason — only lower it for true stateless tiers with short requests.

Trap 10: ASG does not auto-replace when launch template updates

The launch template is only consulted at instance launch. Existing instances keep running the old version. Instance refresh is required to roll out changes.

Trap 11: Single AZ ASG provides no fault tolerance

A SOA-C02 distractor: an ASG in one AZ is described as "highly available". It is not — AZ failure takes the entire fleet offline. Minimum 2 AZs for any production HA claim.

Trap 12: Lifecycle hook abandoned by default on timeout

If DefaultResult is ABANDON (or omitted), a timed-out hook causes the instance to be terminated and rolled back. Set DefaultResult=CONTINUE if the hook is best-effort.
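A sketch of creating such a best-effort hook via boto3's put_lifecycle_hook (the group name, hook name, and timeout are illustrative):

```python
def terminate_hook_params(asg_name, hook_name, timeout_seconds=300):
    """kwargs for the autoscaling put_lifecycle_hook API for a graceful-drain hook.

    DefaultResult=CONTINUE makes the hook best-effort: if the drain logic
    never calls CompleteLifecycleAction, termination proceeds after the
    heartbeat timeout instead of abandoning the action."""
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "HeartbeatTimeout": timeout_seconds,
        "DefaultResult": "CONTINUE",
    }

# boto3.client("autoscaling").put_lifecycle_hook(
#     **terminate_hook_params("web-asg", "drain-queue"))
```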

SOA-C02 vs SAA-C03: The Operational Lens

Question style — SAA-C03 lens vs SOA-C02 lens:

  • Choosing a scaling policy — SAA-C03: "Which scaling type for predictable load?" SOA-C02: "Target tracking is flapping — what's the fix?"
  • ALB vs NLB selection — SAA-C03: "Which load balancer for HTTP routing?" SOA-C02: "ALB returns 502 during peak — diagnose the chain."
  • Health checks — SAA-C03: "Which service performs application health checks?" SOA-C02: "Health check grace period is 0 — every new instance is killed; fix it."
  • Multi-AZ design — SAA-C03: "How to make the architecture fault tolerant?" SOA-C02: "AZ failed; ASG does not launch in surviving AZs — diagnose insufficient capacity vs subnet config."
  • Sticky sessions — SAA-C03: "Which feature enables session affinity?" SOA-C02: "After blue/green deploy, sticky sessions broke — explain and remediate."
  • AMI rollout — SAA-C03: "Which deployment pattern updates AMIs?" SOA-C02: "Instance refresh is stuck at 0% — what setting blocks progress?"
  • Cross-region failover — SAA-C03: "Which Route 53 routing policy enables failover?" SOA-C02: "Configure Route 53 health check + failover record + verify RTO with a DR drill."
  • Cooldown / warm-up — SAA-C03: rarely tested. SOA-C02: heavily tested — pick warm-up vs cooldown for the right policy type.

Exam Signal: How to Recognize a Domain 2.1 / 2.2 Question

Domain 2 questions on SOA-C02 follow predictable shapes.

  • "Instance launches but is killed before serving traffic" → health check grace period default 0, increase to ≥180s.
  • "ASG flapping every N minutes" → datapoints to alarm M-of-N, instance warm-up too short, target value too tight.
  • "ELB returns 502 / 504 during peak" → deregistration delay, health check grace period, target health thresholds, application timeout.
  • "After AMI update, fleet still on old version" → start an instance refresh.
  • "Instance refresh stuck" → MinHealthyPercentage too high, InstanceWarmup too short, health check path changed.
  • "Sticky sessions broke after deployment" → lb_cookie per target group; use app_cookie or externalize state.
  • "Cross-region failover required" → Route 53 health check + failover routing + warm standby in second region.
  • "NLB target distribution uneven" → cross-zone disabled by default; enable or equalize AZ counts.
  • "ASG keeps unhealthy instance" → HealthCheckType is EC2, change to ELB.
  • "Need custom drain logic before terminate" → terminate lifecycle hook, not deregistration delay.

With Domain 2 worth 16 percent and EC2 Auto Scaling + ELB the largest topic in the domain (alongside RDS and backup/DR), expect 5–7 questions specifically in this Auto Scaling / ELB territory. The flapping, 502 troubleshooting, and grace period scenarios alone account for the majority. Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html

Decision Matrix — ASG and ELB Construct for Each SysOps Goal

Use this lookup during the exam.

Operational goal → primary construct — notes:

  • Keep CPU at a target value → target tracking scaling policy — default; AWS manages the alarms.
  • Tiered response to load → step scaling policy — manual steps; explicit control.
  • Predictable traffic spikes → scheduled scaling action — combine with target tracking for unpredicted variance.
  • Stop ASG flapping (target tracking) → increase EstimatedInstanceWarmup / DefaultInstanceWarmup — cooldown does not apply.
  • Stop ASG flapping (simple scaling) → increase cooldown to 300–600s — cooldown is the simple-scaling brake.
  • Replace fleet on new AMI → launch template new version + instance refresh — tune MinHealthyPercentage and InstanceWarmup.
  • Run script before instance enters service → launch lifecycle hook — user-data is a fallback for one-shot install only.
  • Drain queue before terminate → terminate lifecycle hook — deregistration delay alone does not run custom logic.
  • Path-based HTTP routing → ALB with listener rules — lower priority number = higher precedence.
  • TCP/UDP pass-through, static IP → NLB with one EIP per AZ — cross-zone off by default.
  • Stateful session affinity → ALB sticky sessions (lb_cookie or app_cookie) — externalize state for resilience.
  • Stateful shared filesystem → EFS mounted on every instance — mount target per AZ.
  • Stable public IP that survives failure → Elastic IP + Lambda re-association — or NLB with EIPs.
  • Cross-region failover → Route 53 failover routing + health check — warm standby pre-provisioned.
  • ASG reacts to ALB health → set HealthCheckType=ELB on the ASG — default is EC2 only.
  • Avoid replacing booting instances → health check grace period ≥180s — default 0 in some console paths; set explicitly.
  • Long-poll / file upload draining → deregistration delay 300–900s — default 300; raise for long requests.
  • 60–90% cost reduction on fleet → mixed instances policy with Spot — price-capacity-optimized is the preferred allocation strategy.
  • Stop launching in one AZ for testing → suspend AZRebalance + remove subnet — otherwise the ASG tries to rebalance back.
  • Roll out new AMI without downtime → instance refresh, launch-before-terminating strategy — newer; cleaner than rolling.
  • Auto-recover from instance failure → CloudWatch StatusCheckFailed_System alarm + EC2 recover action — or rely on ASG with ELB health type.

FAQ — EC2 Auto Scaling and ELB for SysOps HA

Q1: Why is my ASG marked as healthy in CloudWatch but the ALB is returning 502s?

The most likely cause is that the ASG is using HealthCheckType=EC2 (the default), so it only counts EC2 status checks. The instances pass EC2 status checks (the OS is responsive, the host is fine) but the application is unhealthy at Layer 7. The ALB sees the application failure and returns 502, but the ASG never replaces the instance. Switch the ASG to HealthCheckType=ELB so application failures trigger replacement. Combine this with a sensible health check grace period (≥180 seconds) so freshly launched instances are not killed before booting.

Q2: What is the difference between cooldown and instance warm-up, and which should I tune?

Cooldown is a wait window after a scaling action completes before another scaling action of the same type is allowed — and it applies only to simple scaling policies (the legacy single-action type). Target tracking and step scaling ignore cooldown entirely. Instance warm-up is the duration during which a freshly launched instance's metrics are excluded from the ASG's aggregated statistics so a still-booting instance does not skew the metric. For target tracking flapping, increase EstimatedInstanceWarmup or DefaultInstanceWarmup on the ASG. For simple scaling flapping, increase cooldown. Confusing the two is the most common SOA-C02 anti-flap mis-fix.

Q3: When should I use a lifecycle hook versus user-data versus deregistration delay?

User-data is the one-shot OS bootstrap script that runs on first boot. It is fast and atomic, but the ASG cannot wait on user-data, cannot roll back, and gets no feedback. Use it for trivial OS-level setup (install packages, write config files). Lifecycle hooks pause the ASG at Pending:Wait (launch hook) or Terminating:Wait (terminate hook) and require an explicit CompleteLifecycleAction API call to proceed. Use them for any custom logic that the ASG must wait for, especially graceful drain, external service registration, or pre-flight tests. Deregistration delay is an ELB-level setting that lets in-flight connections complete before a target is forcibly disconnected — it does not run any custom code. Combine: use deregistration delay for connection-level grace, lifecycle hooks for application-level grace.

Q4: Why is my instance refresh stuck at 0% complete?

Two likely causes. (a) MinHealthyPercentage is too high — if the threshold is 100% and the ASG has 4 instances, the refresh cannot terminate any instance without dropping below the threshold, so it never starts. Lower to 75% (or whatever leaves at least one instance available for replacement). (b) InstanceWarmup shorter than real boot time — the refresh launches a replacement, considers it InService after the warm-up, but the actual application has not finished booting and is unhealthy. The refresh sees the unhealthy target as a failure and rolls back. Raise the warm-up to match real boot time, or pre-bake more into the AMI to shorten boot.

Q5: How do I size the health check grace period correctly?

Measure how long a new instance takes from launch to "first 200 OK on /healthz" — this is the minimum grace period. Add 30–60 seconds buffer for variability. Typical values: a stateless web tier on a pre-baked golden AMI: 60–120 seconds. A Java application that needs JVM warm-up: 180–300 seconds. A container host that pulls images from ECR: 300–600 seconds. A Windows instance with full domain join + SSM bootstrap: 600–900 seconds. Setting the grace period too short kills booting instances; too long delays detection of genuinely broken instances. Always test in a non-prod environment.

Q6: What is the difference between ALB and NLB health checks?

ALB target group health checks are HTTP/HTTPS-based by default — they hit a configurable path and port, expect a status code in a configurable range (default 200), and have configurable interval (default 30s), thresholds (5 healthy / 2 unhealthy by default), and timeout. NLB target group health checks support TCP, HTTP, or HTTPS — for TCP, the health check is a connection establishment (no payload); for HTTP/HTTPS, it is similar to ALB but with somewhat different defaults (interval 30s, thresholds 3/3 by default for TCP). NLB health checks happen from each NLB node in each AZ, which is why uneven AZ distribution combined with cross-zone-disabled NLB causes uneven traffic. ALB cross-zone is always on; NLB defaults to off.

Q7: How does Route 53 failover routing work with ELB?

Two records, primary and secondary, each tied to a Route 53 health check. The primary record points to the ELB in region A; the secondary points to the ELB in region B. The Route 53 health check probes the ELB endpoint (or any other URL). When the health check is healthy, DNS queries return the primary value; when it fails, queries return the secondary. The TTL of the record (typically 60 seconds) controls how fast clients pick up the change. Operationally, the secondary region must have the ASG and ELB pre-provisioned and warm — failover is fast for DNS but the workload itself must be ready to receive traffic. Test the failover with periodic DR drills (manually disable primary and verify clients shift).

Q8: Why does my ALB sticky session break after a blue/green deployment?

ALB duration-based stickiness (lb_cookie) uses the load-balancer-generated AWSALB cookie, which encodes the target and is scoped to a single target group. When you switch traffic from the old target group (blue) to the new one (green), the existing AWSALB cookie is meaningless to the new target group, and every client gets rebalanced to a target it has never seen before. If the application stored session state in instance memory, those sessions are lost. Three remediations: (a) use app_cookie stickiness (keyed on the application's own session cookie) so stickiness is portable across target groups, (b) externalize session state to ElastiCache for Redis or DynamoDB so target re-pinning does not lose state, (c) accept the disruption as part of blue/green and notify users. Option (b) is the SOA-recommended long-term fix.
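Remediation (a) is a target group attribute change. A sketch assuming a hypothetical green target group ARN and an application session cookie named `JSESSIONID`:

```shell
# Switch the green target group to application-based stickiness so the
# stickiness key is the app's own session cookie, not the LB-generated one.
# ARN and cookie name are placeholders.
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/green-tg/abc123 \
  --attributes \
    Key=stickiness.enabled,Value=true \
    Key=stickiness.type,Value=app_cookie \
    Key=stickiness.app_cookie.cookie_name,Value=JSESSIONID \
    Key=stickiness.app_cookie.duration_seconds,Value=86400
```

Both blue and green target groups need the same attributes for stickiness to carry across the cutover.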

Q9: When does NLB cross-zone load balancing matter?

NLB cross-zone is disabled by default, meaning each AZ's NLB node only forwards traffic to targets in its own AZ. When AZs have equal target counts (the normal ASG distribution), this is fine — each AZ's traffic share matches its target share. But when AZs are uneven (one AZ has 9 targets, another has 1), the AZ with 1 target gets crushed because it still receives 50% of the NLB traffic (the NLB has equal nodes per AZ). Three fixes: (a) enable cross-zone load balancing on the NLB (costs inter-AZ data transfer, but evens out distribution), (b) maintain equal target counts per AZ via the ASG's even distribution and AZ rebalancing, (c) use fewer AZs in the configuration. SOA-C02 specifically tests recognizing this asymmetry vs ALB which is always cross-zone.
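Fix (a) is a single load balancer attribute. A sketch with a placeholder NLB ARN:

```shell
# Enable cross-zone load balancing on an NLB (it is off by default).
# ARN is a placeholder. Inter-AZ data transfer charges apply once enabled.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/net/my-nlb/abc123 \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true
```

After enabling, each NLB node distributes across all registered targets in every AZ, so the 9-vs-1 imbalance above resolves to an even per-target share.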

Q10: How do I diagnose "ASG is not launching new instances during scale-out"?

Run through this checklist: (a) MaxSize reached — the ASG cannot scale above max; check current desired vs max. (b) No capacity in the target AZ / instance type — Spot pools depleted, on-demand vCPU service quota exceeded, instance type unavailable in the AZ. Check CloudWatch alarm history and ASG activity history. (c) Subnet IP exhaustion — the subnet ran out of IPs; expand it or add another subnet. (d) Launch template references an invalid AMI — AMI deregistered or not available in the region. (e) IAM instance profile missing or deleted — the launch fails with an invalid-instance-profile error. (f) Service quota — vCPU limit, ENI limit, EIP limit, EBS volume count. The ASG Activity tab in the console shows the failure reason directly — that is the first place to look.
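The same activity history is available from the CLI, which is often faster than the console during an incident. A sketch assuming a hypothetical group named `web-asg`:

```shell
# Pull the last 10 scaling activities with their status and failure message.
# "web-asg" is a placeholder; StatusMessage carries the launch error verbatim
# (e.g. quota exceeded, invalid AMI, insufficient capacity).
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name web-asg \
  --max-items 10 \
  --query 'Activities[].[StartTime,StatusCode,StatusMessage]' \
  --output table
```

Reading `StatusMessage` usually resolves the diagnosis in one step, mapping directly onto items (b) through (f) above.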

Q11: Should I use a single ASG or multiple ASGs for a multi-tier app?

One ASG per tier. The web tier ASG is independent from the API tier ASG, which is independent from the worker tier ASG. Each has its own launch template, scaling policy, target group, and health check. This separation lets you scale tiers independently (the API may be CPU-bound while the worker is IO-bound), deploy them independently (instance refresh on the web tier without touching the worker), and tune health checks per tier (web with HTTP /healthz, worker without an ELB at all). A single ASG hosting multiple tiers is an anti-pattern — you cannot scale or update them independently.

Q12: When does scheduled scaling combine with target tracking, and how?

When the workload has both a predictable baseline pattern (business hours, batch windows) and unpredictable variance on top (random user traffic). Scheduled scaling sets the baseline: at 08:30 every weekday, desired = 10; at 19:00, desired = 4. Target tracking handles variance on top of that baseline: keep CPU at 50%, scaling up to 20 if needed. The two coexist on the same ASG — the scheduled action moves desired to the baseline at the exact time, and target tracking continues to react to metrics from there. Without scheduled scaling, target tracking has to wait for the morning surge to actually arrive before reacting, often leaving the first wave of users with degraded performance. Scheduled scaling is the SOA-correct answer for "pre-warm before the predictable spike".
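The two mechanisms coexist as separate API calls against the same ASG. A sketch with placeholder names, using a weekday-morning baseline and a CPU-50% target tracking policy (recurrence is in UTC unless a time zone is set):

```shell
# Scheduled baseline: every weekday at 08:30, set desired capacity to 10.
# "web-asg" and the action name are placeholders.
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name web-asg \
  --scheduled-action-name weekday-morning-prewarm \
  --recurrence "30 8 * * MON-FRI" \
  --desired-capacity 10

# Target tracking then reacts to actual load on top of that baseline.
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name cpu-50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 50.0
  }'
```

The scheduled action moves desired capacity at the exact time; the target tracking policy keeps adjusting it afterwards, bounded by the ASG's min and max.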

Once the running stack is healthy, the next operational layers in SOA-C02 Domain 2 are: RDS and Aurora resilience for the database tier the ASG fronts, with Multi-AZ failover, read replicas, and ElastiCache for caching; Backup, Restore, and Disaster Recovery for the data the workload depends on, including AWS Backup plans, RDS PITR, and S3 Cross-Region Replication; VPC Configuration and Connectivity for the network plumbing the ASG and ELB live in (subnets per AZ, NAT gateway, VPC endpoints, route tables); and CloudWatch Metrics, Alarms, and Dashboards for the metric and alarm signals that feed scaling policies and operational visibility.

Official sources