
CloudWatch Metrics、Alarms 與 Dashboards

7,100 words · about 36 minutes reading time

Amazon CloudWatch is the operational nervous system of every AWS workload, and on SOA-C02 it is the single most heavily examined service. Domain 1 (Monitoring, Logging, and Remediation) is worth 20 percent of the exam — more than any other domain — and Task Statement 1.1 explicitly requires you to "implement metrics, alarms, and filters" using CloudWatch. Where SAA-C03 tests which metric to alarm on at design time, SOA-C02 tests what to do when an alarm fires at 3am: misconfigured thresholds, alarms stuck in INSUFFICIENT_DATA after an instance terminates, missing memory metrics because the CloudWatch agent was never installed, and dashboards that need to span three accounts and two regions for a real incident review.

This guide walks through CloudWatch from the SysOps angle: how metrics arrive in CloudWatch, why EC2 default metrics never include memory or disk, how alarm states transition and what treatMissingData actually does on each state, when to choose anomaly detection over a static threshold, how composite alarms cut noise during failovers, and which dashboard widget types are the right tool for a given operational view. You will also see the recurring SOA-C02 scenario shapes: agent-not-publishing diagnostics, INSUFFICIENT_DATA after an Auto Scaling termination, datapoints-to-alarm tuning to stop ASG flapping, Service Quotas notifications wired through CloudWatch, and AWS Health Dashboard events as a complementary signal to CloudWatch alarms.

Why CloudWatch Sits at the Heart of SOA-C02

The official SOA-C02 Exam Guide v2.3 lists six skills under Task Statement 1.1, and CloudWatch appears in five of them: collect logs from CloudWatch Logs and Logs Insights, collect metrics and logs using the CloudWatch agent, create CloudWatch alarms, create metric filters, create CloudWatch dashboards, and configure notifications via SNS, Service Quotas, CloudWatch alarms, and AWS Health events. Task Statement 1.2 then layers remediation on top — the alarm fires, EventBridge or SNS routes it, and Systems Manager Automation or a Lambda function takes corrective action. CloudWatch is the source of truth for every one of those flows.

At the SysOps tier the framing is operational, not architectural. SAA-C03 asks "which CloudWatch metric should you alarm on for an Auto Scaling group?" SOA-C02 asks "the ASG is flapping — instances are launching and terminating every five minutes, the alarm keeps oscillating between OK and ALARM, what do you change?" The answer is rarely a different metric; it is Datapoints to alarm, evaluation periods, scaling cooldown, or treatMissingData. CloudWatch Metrics, Alarms, and Dashboards is the topic where every later SOA-C02 topic plugs in: ELB health checks (Domain 2), Auto Scaling target tracking policies (Domain 2), CloudFormation stack event monitoring (Domain 3), KMS key usage tracking (Domain 4), VPC Flow Log metric extraction (Domain 5), EBS BurstBalance and RDS Performance Insights baselines (Domain 6).

  • Metric: a time-ordered set of data points published to CloudWatch under a namespace, optionally tagged with dimensions. Native AWS service metrics are auto-published; custom metrics are pushed via PutMetricData.
  • Namespace: a container that isolates metrics from different services or applications (for example AWS/EC2, AWS/RDS, CWAgent, custom MyApp/Prod). Metrics in different namespaces never collide even if they share a name.
  • Dimension: a name-value pair that identifies a specific instance of a metric (InstanceId=i-0abc..., LoadBalancer=app/...). A metric with different dimensions is a different time series.
  • Resolution: the granularity at which data points are stored — standard (60-second intervals) or high-resolution (1, 5, 10, or 30-second intervals for custom metrics).
  • Alarm: a monitor that watches a single metric (or metric math expression) over a time window and transitions among OK, ALARM, and INSUFFICIENT_DATA based on whether the threshold is breached.
  • Period: the length of time over which the statistic is computed (60s, 300s, etc.). Distinct from the evaluation period count.
  • Evaluation period: how many of the most recent periods CloudWatch examines to decide alarm state.
  • Datapoints to alarm: the M-of-N rule — alarm requires M breaching points within the last N evaluation periods.
  • Composite alarm: an alarm whose state is computed from a Boolean expression over other alarms (ALARM("a") AND ALARM("b")).
  • Dashboard: a customizable home page of CloudWatch widgets for visualizing metrics, alarms, and logs across accounts and regions.
  • Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html

CloudWatch Metrics, Alarms, and Dashboards in Plain Language

CloudWatch jargon stacks fast. Three analogies help the constructs stick.

Analogy 1: The Hospital Triage Center

CloudWatch is a hospital triage center for your AWS workload. Metrics are the vital signs — heart rate (CPU), blood pressure (NetworkOut), respiratory rate (DiskWriteOps) — streamed continuously from each patient (resource). Namespaces are the wards that group similar patients together: cardiac ward (AWS/RDS), surgery (AWS/EC2), pediatrics (AWS/Lambda). Dimensions are the patient ID bracelets that identify exactly which person the vitals belong to. Alarms are bedside monitors programmed with thresholds — if heart rate exceeds 120 for three consecutive readings, the nurse station gets paged. Alarm states mirror medical triage colors: green (OK, patient stable), red (ALARM, intervention needed now), grey (INSUFFICIENT_DATA, the monitor stopped reporting — maybe the cable came loose, maybe the patient stepped out, but we cannot tell). The treatMissingData setting is the standing order for grey patients: should we assume they are fine (notBreaching), assume they are in critical condition (breaching), keep the previous status (ignore), or treat the gap as data missing for the threshold check (missing)? Dashboards are the central command monitor wall showing every patient's vitals at a glance. Composite alarms are the doctor's clinical judgement that pages the family only when both heart rate is high and oxygen is low — single high readings on their own happen all the time.

Analogy 2: The Smart Thermostat With Anomaly Detection

A static threshold alarm is the old wall thermostat — turn on the AC when the room exceeds 28 degrees, no matter the season, no matter the time. A CloudWatch anomaly detection alarm is the modern Nest thermostat — it learns "for this house, on weekday afternoons in summer the room normally drifts between 26 and 29 degrees, but on weekend mornings the AC is off and 32 is normal", and it only beeps when reality leaves the learned band of normal. For a SOA-C02 candidate, this is the difference between getting paged every Monday morning at 9am because traffic always spikes (false positive on a static threshold) versus getting paged only when the Monday spike looks unusually different from past Mondays (anomaly detection band).

Analogy 3: The Apartment Building Fire Panel

A CloudWatch dashboard is the fire panel in the apartment building lobby. A metric widget is one smoke detector indicator showing the current sensor reading. A single-value widget with a sparkline is the single LED + last-five-minutes mini chart. An alarm status widget is the bank of panel lamps that turn from green to red the moment any unit triggers. A log widget with Logs Insights queries embedded is the printout strip showing the last 50 sensor messages. A cross-account cross-region dashboard is the multi-building command center where the security firm watches dozens of buildings on one screen — they signed sharing agreements (CloudWatch-CrossAccountSharingRole) with each building owner so they can see the panels remotely.

For SOA-C02, the hospital triage analogy is the most useful when a question mixes alarm states with treatMissingData. When an instance terminates, its metric stops publishing; the alarm is now a grey patient. Whether the alarm should fire depends entirely on the standing order you wrote — if the alarm drives Auto Scaling, you usually want notBreaching so a terminated instance does not spawn a replacement; if the alarm drives a security pager, you usually want breaching so silence is treated as suspicious. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

CloudWatch Metrics Fundamentals: Namespaces, Dimensions, and Resolution

Before alarms make sense you need a precise mental model of how a metric arrives in CloudWatch and how it is identified.

Namespaces — the top-level container

A namespace is the first-level grouping for metrics. AWS reserves the AWS/ prefix for service-published metrics — AWS/EC2, AWS/RDS, AWS/ApplicationELB, AWS/Lambda, AWS/S3, and dozens more. Custom metrics live in any namespace you choose; the CloudWatch agent publishes to CWAgent by default. Namespaces are isolated: a metric named CPUUtilization in AWS/EC2 is a different time series from CPUUtilization in MyApp/Prod, even if they describe the same thing.

Dimensions — the identifying tags on a metric

A dimension is a name-value pair that pins a metric to a specific resource. AWS/EC2 exposes CPUUtilization with dimensions like InstanceId=i-0abc123, AutoScalingGroupName=web-asg, or InstanceType=m5.large. Each unique combination of dimensions creates a separate metric — alarm-able and graph-able as its own series. CloudWatch supports up to 30 dimensions per metric, but each unique combination counts as a distinct metric and is billed separately, so SysOps engineers must be deliberate.

Resolution — standard vs high-resolution

Standard-resolution metrics are stored at 60-second granularity. Every AWS-published metric is standard resolution by default, with one exception called out below. High-resolution metrics are custom metrics published with a StorageResolution of 1 second; they accept data points at 1, 5, 10, or 30-second intervals and can be alarmed on at periods as short as 10 seconds. High-resolution metrics cost more and should be reserved for workloads where a one-minute lag would mask the problem (real-time trading, gaming session quality, IoT telemetry).
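A minimal sketch of what publishing a high-resolution custom metric looks like with PutMetricData. The namespace, metric name, and instance ID below are illustrative, not from the exam guide; the boto3 call itself is commented out because it requires AWS credentials.

```python
# Sketch: building PutMetricData parameters for a custom metric.
# StorageResolution is the switch between standard (60) and high (1) resolution.
from datetime import datetime, timezone

def build_put_metric_data(namespace, metric, value, instance_id, high_resolution=False):
    """Return kwargs for boto3 cloudwatch.put_metric_data."""
    return {
        "Namespace": namespace,
        "MetricData": [{
            "MetricName": metric,
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Timestamp": datetime.now(timezone.utc),
            "Value": value,
            "Unit": "Percent",
            # 1 = high-resolution (sub-minute points); 60 = standard resolution (default)
            "StorageResolution": 1 if high_resolution else 60,
        }],
    }

params = build_put_metric_data("MyApp/Prod", "order_latency_pct", 42.0,
                               "i-0abc123", high_resolution=True)
# boto3.client("cloudwatch").put_metric_data(**params)  # requires AWS credentials
```

Note that high resolution is purely a property of how the data point is stored; the metric name and dimensions are unchanged.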

Metric retention

CloudWatch automatically rolls up older metric data into coarser buckets:

  • Sub-minute (high-resolution) data is retained for 3 hours.
  • 1-minute data is retained for 15 days.
  • 5-minute data is retained for 63 days.
  • 1-hour data is retained for 15 months (455 days).

Beyond 15 months, no metric data is kept. If your SysOps role requires longer retention for capacity planning, export to S3 via metric streams or pull on a schedule with GetMetricData and store the data yourself.
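A sketch of the scheduled-pull option: a GetMetricData request that retrieves 1-hour datapoints (the only granularity retained past 63 days) for archival elsewhere. The metric, dimension value, and time strings are illustrative.

```python
# Sketch: GetMetricData request parameters for pulling 1-hour rollups
# of an EC2 CPU metric, e.g. for self-managed long-term storage.
def build_get_metric_data(instance_id, start, end):
    return {
        "MetricDataQueries": [{
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": 3600,   # 1-hour buckets: the granularity kept for 15 months
                "Stat": "Average",
            },
            "ReturnData": True,
        }],
        "StartTime": start,
        "EndTime": end,
    }

req = build_get_metric_data("i-0abc123", "2024-01-01T00:00:00Z", "2024-04-01T00:00:00Z")
# boto3.client("cloudwatch").get_metric_data(**req)  # requires AWS credentials
```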

  • EC2 default monitoring: 5-minute (300-second) period — every 5 minutes, free.
  • EC2 detailed monitoring: 1-minute (60-second) period — costs extra, must be explicitly enabled.
  • Most non-EC2 AWS services: 1-minute period by default and free (ALB, RDS, Lambda, etc.).
  • High-resolution custom metric: as fast as 1-second intervals; alarm period as short as 10 seconds.
  • Maximum dimensions per metric: 30 (each unique combination = a distinct metric).
  • Metric retention: 3h (sub-minute) / 15d (1-min) / 63d (5-min) / 15 months (1-hour); zero after 15 months.
  • Metric ingestion latency: AWS-published metrics are typically visible within 1–2 minutes of the event; custom PutMetricData is visible in seconds.
  • Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html

CloudWatch Agent: Installing on EC2 and On-Premises for OS-Level Metrics

The single most-tested operational gotcha on SOA-C02 is that EC2 default metrics published by the hypervisor never include memory or disk usage from inside the operating system. The hypervisor cannot see what the guest OS is doing with its allocated RAM. To collect memory, swap, disk-used, and process-level metrics you must install the CloudWatch agent inside each instance and grant it permission to call cloudwatch:PutMetricData.

What the agent provides

The unified CloudWatch agent runs on Amazon Linux 2 / 2023, Ubuntu, RHEL, SUSE, Windows Server, and on-premises Linux/Windows hosts. It collects:

  • Linux metrics: mem_used_percent, swap_used_percent, disk_used_percent per mount, processes_running, network and TCP-state counters, and per-process metrics via the procstat plugin.
  • Windows metrics: Performance Counter equivalents — Memory % Committed Bytes In Use, LogicalDisk % Free Space, Paging File % Usage, custom counter sets.
  • Custom application metrics through StatsD (UDP) or collectd integration.
  • Log files — application logs, system logs, IIS logs — pushed to CloudWatch Log Groups (covered in detail in the Logs topic).

The agent publishes by default into the CWAgent namespace (you can override). Default dimensions include InstanceId plus host, cpu, device, and path depending on the metric.
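A minimal sketch of the agent configuration that produces the memory and disk metrics the hypervisor cannot see. The collection interval and mount point are assumptions; in practice this JSON would live in Parameter Store or on disk as amazon-cloudwatch-agent.json.

```python
# Sketch of a minimal CloudWatch agent config: memory, swap, and root-disk
# usage published into the default CWAgent namespace.
import json

agent_config = {
    "metrics": {
        "namespace": "CWAgent",   # the agent's default; override to rename
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem":  {"measurement": ["mem_used_percent"],
                     "metrics_collection_interval": 60},
            "swap": {"measurement": ["swap_used_percent"]},
            "disk": {"measurement": ["disk_used_percent"],
                     "resources": ["/"]},   # one metric series per listed mount
        },
    },
}
print(json.dumps(agent_config, indent=2))
```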

Installation paths

Three operational paths to install the agent:

  1. AWS Systems Manager State Manager + Run Command + the AWS-managed AmazonCloudWatch-ManageAgent document is the SOA-C02 sanctioned path for fleet-scale install and config refresh.
  2. EC2 Image Builder baking the agent into a golden AMI is the right answer when the scenario emphasizes "every new instance must have monitoring on first boot".
  3. User-data script invoking the platform-specific installer is acceptable for ad-hoc or non-Systems-Manager environments.

IAM requirements (the most common failure point)

The instance needs an IAM instance profile that grants the agent two permissions sets:

  • CloudWatchAgentServerPolicy — managed policy that grants cloudwatch:PutMetricData, ec2:DescribeVolumes, ec2:DescribeTags, logs:PutLogEvents, logs:CreateLogGroup, logs:CreateLogStream, logs:DescribeLogStreams, ssm:GetParameter (for retrieving the agent config from Parameter Store).
  • AmazonSSMManagedInstanceCore if you want Systems Manager Run Command and State Manager to manage the agent lifecycle.

If either policy is missing, the agent runs but silently fails to publish — and the metric you expected never appears in the console. SOA-C02 routinely tests this exact pattern.

::warning

A SysOps candidate sees CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps, StatusCheckFailed_System, and StatusCheckFailed_Instance in AWS/EC2 and assumes that "DiskWriteOps" implies disk-fullness monitoring. It does not — that metric only counts I/O operations against the EBS volume. To alarm on a filesystem filling up at 90 percent you must install the CloudWatch agent and alarm on CWAgent namespace disk_used_percent. Memory utilization is not exposed by the hypervisor at all. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html ::

On-premises hosts

The same agent runs on on-premises servers when registered with Systems Manager as a hybrid managed instance. The host needs an IAM service role (not an instance profile, since it is not EC2) and a Systems Manager activation code. Once registered, the same cloudwatch:PutMetricData permissions apply, and the metrics arrive in CloudWatch under a namespace and dimensions you choose. SOA-C02 can ask "monitor a fleet of on-prem Linux servers in a single CloudWatch dashboard alongside EC2" — the answer is the unified CloudWatch agent + SSM hybrid activation.

EC2 Default vs Detailed Monitoring: 5-Minute vs 1-Minute Granularity

Even before the agent enters the picture, EC2 itself can publish hypervisor-level metrics at two cadences.

Default (basic) monitoring

Default monitoring is enabled automatically and free. It publishes at a 5-minute (300-second) period for the core hypervisor-level EC2 metrics (CPUUtilization, NetworkIn, NetworkOut, NetworkPacketsIn, NetworkPacketsOut, DiskReadOps, DiskWriteOps, DiskReadBytes, DiskWriteBytes), plus StatusCheckFailed_System, StatusCheckFailed_Instance, and StatusCheckFailed at 1-minute (the status checks are the exception: always 1-minute, even at default).

Detailed monitoring

Detailed monitoring publishes the standard EC2 metrics at a 1-minute (60-second) period. It is opt-in per instance, billable per metric per month, and is required when:

  • Auto Scaling target tracking or step scaling policies use a 1-minute period — most production scaling configurations want detailed monitoring on the metric source instances.
  • An alarm needs an evaluation period of 1 minute or shorter.
  • A dashboard needs near-real-time charting at 1-minute granularity for SLO reviews.

Detailed monitoring is enabled per instance through the EC2 console, launch template, RunInstances API, or by tag-based remediation. It does not add memory, disk-fullness, or any new metric — it only changes the period of existing hypervisor-level metrics. Adding detailed monitoring to an instance that lacks the CloudWatch agent still leaves you blind to memory and disk usage.

A common SOA-C02 distractor: "the SysOps team needs to alarm on memory utilization — should they enable detailed monitoring or install the CloudWatch agent?" The question seems to test cost trade-offs, but the trap is that detailed monitoring does not produce a memory metric at all. Detailed monitoring only changes the period of hypervisor-published metrics. The only way to get memory is the CloudWatch agent. Candidates who confuse "detailed = more metrics" lose easy points. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-basic-detailed.html

Metric Math: Combining Metrics with Expressions

Metric math lets you combine, transform, and derive new metrics from raw CloudWatch metrics without publishing custom metrics. Expressions are evaluated server-side on demand, so they add no ingestion cost; they are computed at query time wherever they are referenced, whether in an alarm or on a dashboard.

Common metric math patterns

  • Error rate: (errors / requests) * 100 to compute a percentage from two counters. The classic ALB pattern is 100 * (m_5xx + m_target_5xx) / m_request_count.
  • Sum across instances: SUM(METRICS()) to aggregate CPUUtilization across every instance in an Auto Scaling group when the per-instance dimension fragments the view.
  • Available capacity remaining: 100 - mem_used_percent to alarm on free memory rather than used.
  • Anomaly band: ANOMALY_DETECTION_BAND(m1, 2) to wrap a metric with a learned upper/lower band at two standard deviations.
  • Fill missing values: FILL(m1, 0) to substitute zero for missing data (sometimes the right answer for sparse metrics).
  • Rate of change: RATE(m1) to compute per-second change of a counter.

Using metric math in alarms

A CloudWatch alarm can watch either a raw metric or a metric math expression. The alarm defines the input metrics as IDs (m1, m2) alongside the expression (e1), and exactly one element carries ReturnData: true; that is the series CloudWatch evaluates against the threshold. (ThresholdMetricId comes into play when the comparison target is itself an expression, as with anomaly detection bands.) Two common SOA-C02 patterns:

  • Composite metric alarm for an SLO — alarm fires when error rate exceeds 1 percent over 5 minutes. The expression (m1 / m2) * 100 > 1 with m1 = HTTPCode_Target_5XX_Count and m2 = RequestCount.
  • Capacity-based alarm for batch fleets — alarm when free CPU across the fleet (sum) drops below threshold.
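The first pattern can be sketched as PutMetricAlarm parameters. The load balancer dimension value and alarm name are illustrative; the key mechanic is that only the expression element (e1) has ReturnData set to True, so it is the series compared against the 1 percent threshold.

```python
# Sketch: metric math alarm for a 1% ALB 5xx error-rate SLO over 5 minutes.
lb = {"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}  # hypothetical

mm_alarm = {
    "AlarmName": "alb-error-rate-gt-1pct",
    "Metrics": [
        {"Id": "m1", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "HTTPCode_Target_5XX_Count",
                       "Dimensions": [lb]},
            "Period": 300, "Stat": "Sum"}},
        {"Id": "m2", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "RequestCount",
                       "Dimensions": [lb]},
            "Period": 300, "Stat": "Sum"}},
        # The evaluated series: error rate as a percentage.
        {"Id": "e1", "ReturnData": True,
         "Expression": "(m1 / m2) * 100", "Label": "5xx error rate (%)"},
    ],
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 1.0,
    "EvaluationPeriods": 1,
}
# boto3.client("cloudwatch").put_metric_alarm(**mm_alarm)
```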

When the operations team needs a derived value such as "error rate" or "free memory percentage", a SysOps engineer's first instinct should be metric math, not a custom metric. Custom metrics cost money per metric per dimension, while metric math is free and computed at query time. Custom metrics are still right when the value cannot be derived from existing metrics (for example, queue oldest-message-age) — but for ratios and aggregations across existing metrics, metric math wins. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html

Alarm States: OK, ALARM, INSUFFICIENT_DATA — and treatMissingData

CloudWatch alarms always live in exactly one of three states.

  • OK — the most recent evaluation period(s) have data, and the metric does not breach the threshold per the M-of-N rule.
  • ALARM — the most recent evaluation period(s) breach the threshold per the M-of-N rule.
  • INSUFFICIENT_DATA — there is not enough data to make a decision: the alarm just started, or the metric stopped publishing within the evaluation window.

State transitions trigger alarm actions — OK -> ALARM invokes the AlarmActions, ALARM -> OK invokes the OKActions, and entry into INSUFFICIENT_DATA invokes InsufficientDataActions. Each is a separately configurable list of action ARNs (typically SNS topics, but also EC2 stop/reboot/recover, Auto Scaling policies, Systems Manager, and Lambda via EventBridge).

Period vs evaluation periods vs datapoints to alarm

Three numeric settings determine when a state transition happens:

  • Period (e.g., 60 seconds): the bucket size over which CloudWatch computes a single statistic value (Average, Sum, p99, Max, Min) for the metric.
  • Evaluation periods (often abbreviated N): how many of the most recent periods CloudWatch examines.
  • Datapoints to alarm (often abbreviated M): how many of those N periods must breach for the alarm to fire (the M-of-N rule).

A common SysOps-tier configuration is Period = 60, Evaluation periods = 5, Datapoints to alarm = 3 — meaning the alarm fires when 3 of the last 5 one-minute datapoints breach. This is far less flap-prone than M = N = 1 (alarm on a single breaching period) yet still detects sustained problems within five minutes.
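The M-of-N rule is simple enough to simulate. This toy function (not AWS code, just an illustration of the counting logic) shows why 3-of-5 ignores a single spike but catches a sustained breach:

```python
# Toy simulation of the M-of-N rule: Period = 60s, N = 5, M = 3, threshold > 80.
def evaluate(datapoints, threshold, n, m):
    """Return 'ALARM' if at least m of the last n datapoints breach the threshold."""
    window = datapoints[-n:]
    breaching = sum(1 for v in window if v > threshold)
    return "ALARM" if breaching >= m else "OK"

# One transient spike does not fire (only 1 of 5 breaching)...
assert evaluate([70, 70, 95, 70, 70], threshold=80, n=5, m=3) == "OK"
# ...but a sustained problem does (3 of 5 breaching).
assert evaluate([70, 95, 90, 70, 92], threshold=80, n=5, m=3) == "ALARM"
```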

treatMissingData modes

When a period has no data point at all (the metric did not publish), CloudWatch must decide what to do. The four modes:

  • missing (default) — the missing period is excluded from the M-of-N evaluation. If too many periods are missing, the alarm transitions to INSUFFICIENT_DATA.
  • notBreaching — the missing period is treated as a non-breaching value. The alarm prefers to stay OK.
  • breaching — the missing period is treated as a breaching value. The alarm prefers to fire.
  • ignore — the alarm state is locked at its current value; missing data does not change it.

The right choice depends on what action the alarm drives:

  • Auto Scaling scale-out alarm: notBreaching is usually correct. When an instance terminates the metric stops publishing; you do not want to interpret silence as "high CPU, scale out".
  • Application heartbeat alarm: breaching is correct. Silence means the application stopped publishing the heartbeat — that is the failure you wanted to detect.
  • Status check failed alarm wired to EC2 recovery: missing (default) is fine — if the instance is gone there is nothing to recover, and INSUFFICIENT_DATA does not invoke the recover action.
  • Long-running batch job alarm: ignore is sometimes correct so the alarm holds its OK state through expected gaps.

Multiple SOA-C02 questions hinge on choosing the right treatMissingData mode. The pattern: an Auto Scaling instance terminates, the per-instance CPU metric stops publishing, the alarm transitions to INSUFFICIENT_DATA, and the candidate must pick the correct mode to either (a) keep the alarm quiet so a replacement is not spuriously launched (notBreaching) or (b) page the on-call because the metric source disappeared unexpectedly (breaching). The wrong default ruins every Auto Scaling group and every health monitor in production. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarm-evaluation
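The four modes can be modeled as substitution rules applied before the M-of-N check. This is a deliberate simplification of the real evaluation logic, with None standing in for a period that published no data, but it captures the behavioral difference each mode makes for a "CPU > 80" alarm after an instance terminates:

```python
# Toy model of the four treatMissingData modes for a greater-than alarm.
def fill_missing(window, mode, threshold):
    """Substitute missing (None) datapoints according to treatMissingData."""
    if mode == "notBreaching":
        # Missing counts as a non-breaching value: alarm prefers to stay OK.
        return [threshold if v is None else v for v in window]
    if mode == "breaching":
        # Missing counts as a breaching value: alarm prefers to fire.
        return [threshold + 1 if v is None else v for v in window]
    if mode == "missing":
        # Missing periods are dropped; too few left -> INSUFFICIENT_DATA.
        return [v for v in window if v is not None]
    if mode == "ignore":
        # Alarm holds its current state through the gap.
        return None
    raise ValueError(mode)

window = [85, None, None, None, None]   # instance terminated mid-window
assert fill_missing(window, "notBreaching", 80) == [85, 80, 80, 80, 80]  # stays quiet
assert fill_missing(window, "breaching", 80) == [85, 81, 81, 81, 81]     # fires
assert fill_missing(window, "missing", 80) == [85]                       # likely INSUFFICIENT_DATA
```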

Anomaly detection alarms

A static threshold is a poor fit for traffic that has a daily or weekly seasonality. CloudWatch anomaly detection trains a machine-learning model on the metric's history (typically 14 days) and produces an anomaly band — a learned envelope of expected values per time-of-day. Alarms can fire when the metric:

  • Goes outside the band on either side (LessThanLowerOrGreaterThanUpperThreshold).
  • Goes above only the upper bound (typical for latency or error count).
  • Goes below only the lower bound (typical for request volume — silence is failure).

The band width is set in standard deviations (default 2). A wider band reduces false positives but raises detection latency. Anomaly detection is enabled per metric, and specific time windows can be excluded from model training so that one-off events such as an outage or a load test do not distort the learned baseline of normal behavior.
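An anomaly detection alarm is where ThresholdMetricId is used: instead of a static Threshold, the alarm is compared against the ANOMALY_DETECTION_BAND expression. A sketch, with an illustrative metric and dimension value and the 2-standard-deviation band from the text:

```python
# Sketch: anomaly detection alarm on request volume, firing on either side
# of the learned band (low traffic is also a failure signal here).
ad_alarm = {
    "AlarmName": "requests-outside-learned-band",
    "Metrics": [
        {"Id": "m1", "ReturnData": True, "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "RequestCount",
                       "Dimensions": [{"Name": "LoadBalancer",
                                       "Value": "app/my-alb/0123456789abcdef"}]},
            "Period": 300, "Stat": "Sum"}},
        # The learned envelope: width of 2 standard deviations.
        {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
         "Label": "expected band"},
    ],
    "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
    "ThresholdMetricId": "ad1",   # compare m1 against the band, not a static number
    "EvaluationPeriods": 3,
}
# boto3.client("cloudwatch").put_metric_alarm(**ad_alarm)
```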

Use a static threshold alarm when the metric has an absolute meaning (disk_used_percent > 90, 5xx error count > 100). Use anomaly detection when the metric has a relative meaning that varies with time-of-day or day-of-week (RequestCount for a public website, OrderCount for an e-commerce checkout). On SOA-C02, the giveaway phrase is "the team is tired of false alarms during the morning peak" — that is anomaly detection. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html

Composite Alarms: Combining Multiple Child Alarms

A composite alarm is an alarm whose state is computed from a Boolean expression over other alarms. Where a regular alarm watches a metric, a composite alarm watches ALARM("alarm-name"), OK("alarm-name"), and INSUFFICIENT_DATA("alarm-name") predicates joined with AND, OR, and NOT.

Why composite alarms exist

Real workloads produce alarm storms: when an availability zone has a glitch, dozens of per-instance alarms fire at once and the on-call inbox explodes. Composite alarms collapse the noise into a single high-level signal:

  • ALARM(asg-cpu-high) AND ALARM(asg-error-rate-high) — only page when both compute saturation and error rate are elevated; either alone is normal traffic.
  • (ALARM(rds-cpu) OR ALARM(rds-iops)) AND ALARM(rds-replica-lag) — only page when the database is both stressed and falling behind, ignoring transient spikes.
  • NOT OK(maintenance-window) — suppress paging during scheduled maintenance.

Configuration mechanics

A composite alarm references child alarms by ARN or name within the same account and region. Child alarms can themselves be composite (nested up to 10 levels). The composite alarm has its own AlarmActions, OKActions, and InsufficientDataActions ARNs — typically an SNS topic for high-fidelity paging. Crucially, you can configure the composite alarm to suppress child alarm actions while it is in ALARM state, so the on-call sees one page instead of fifty.

Suppressor alarms

A composite alarm can specify a suppressor alarm and a wait period. While the suppressor is in ALARM, the composite alarm does not transition to ALARM regardless of the rule. Common pattern: a MaintenanceWindow alarm that flips to ALARM during the planned window suppresses every operational composite alarm so the patching cycle does not page anyone.
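The rule expression and suppressor come together in PutCompositeAlarm. A sketch, with hypothetical alarm names, account ID, and wait periods:

```python
# Sketch: composite alarm that pages only when both child alarms fire,
# suppressed while the maintenance-window alarm is in ALARM.
composite = {
    "AlarmName": "web-tier-critical",
    "AlarmRule": 'ALARM("asg-cpu-high") AND ALARM("asg-error-rate-high")',
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:oncall-page"],
    # While this suppressor alarm is in ALARM, the composite does not act.
    "ActionsSuppressor":
        "arn:aws:cloudwatch:us-east-1:111122223333:alarm:maintenance-window",
    "ActionsSuppressorWaitPeriod": 120,       # seconds to wait for suppressor state
    "ActionsSuppressorExtensionPeriod": 180,  # grace period after suppressor clears
}
# boto3.client("cloudwatch").put_composite_alarm(**composite)
```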

When a SOA-C02 scenario describes "the SysOps team is being paged 30 times during every AZ blip" or "the on-call must be notified only when both the load balancer is unhealthy AND the application is returning 5xx", the answer is a composite alarm with ALARM(...) AND ALARM(...), plus action suppression on the child alarms. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html

CloudWatch Dashboards: Widgets, Cross-Account, Cross-Region, and Sharing

A CloudWatch dashboard is a customizable, JSON-backed page of widgets that visualize metrics, alarms, logs, and metric math expressions. Dashboards are the SysOps team's situational awareness tool — the one URL the on-call opens during an incident.

Widget types

  • Line widget — classic time-series chart for one or more metrics; supports left/right Y axes and overlay annotations.
  • Stacked area widget — same as line but stacked, useful for fleet-level visualizations.
  • Number widget (single value) — current value plus a sparkline; ideal for "how many active alarms right now".
  • Gauge widget — circular meter showing where a value sits in a defined range; useful for SLO burn.
  • Bar / pie — categorical comparisons.
  • Alarm status widget — list of alarms color-coded by current state; the most important widget for an incident dashboard.
  • Text widget — Markdown for runbook links and ownership tags.
  • Logs widget — embeds a CloudWatch Logs Insights query result inside the dashboard.
  • Custom widget (Lambda-backed) — render arbitrary content (table, image, JSON) returned by a Lambda function. Powerful for cost-by-tag panels or AWS Health event summaries.
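Dashboards are JSON under the hood, which is what makes the IaC workflow possible. A minimal sketch of a dashboard body with one line widget and one alarm status widget; the region, ASG name, and alarm ARN are illustrative, and the per-widget region field is what enables cross-region dashboards:

```python
# Sketch: minimal CloudWatch dashboard body (metric widget + alarm widget).
import json

dashboard_body = {
    "widgets": [
        {"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
         "properties": {
             "region": "us-east-1",   # per-widget region enables cross-region views
             "metrics": [["AWS/EC2", "CPUUtilization",
                          "AutoScalingGroupName", "web-asg"]],
             "stat": "Average", "period": 60, "title": "web-asg CPU"}},
        {"type": "alarm", "x": 12, "y": 0, "width": 12, "height": 6,
         "properties": {
             "title": "Critical alarms",
             "alarms": ["arn:aws:cloudwatch:us-east-1:111122223333:"
                        "alarm:web-tier-critical"]}},
    ],
}
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="ops-overview", DashboardBody=json.dumps(dashboard_body))
```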

Cross-account, cross-region dashboards

A single dashboard can show data from multiple AWS accounts and regions. Two prerequisites must be set up first:

  1. Sharing accounts (the accounts that hold the source metrics) install the CloudWatch-CrossAccountSharingRole IAM role with a trust policy permitting the monitoring (viewer) account to assume it. This is one-click in the CloudWatch console under Settings.
  2. Monitoring account registers the sharing accounts under Settings → "Manage source accounts".

After setup, when adding a widget you simply pick the account from a dropdown. SOA-C02 frequently asks "consolidate metrics from 12 accounts into one operational dashboard" — the answer is cross-account dashboards via the sharing role, not custom Lambda aggregation.

For cross-region widgets the dashboard JSON includes a region field per widget; the same metric chart can show us-east-1 and eu-west-1 side by side. Cross-region also requires no extra IAM if the accounts already trust each other.

Sharing dashboards externally

A dashboard can be shared with users who do not have AWS console access through three options:

  • Public link (with optional email/Google/SAML SSO sign-in) — useful for executive stakeholders.
  • Specific email addresses with sign-in.
  • Third-party SSO via Cognito identity pools.

Shared dashboards are read-only and respect the source account's data permissions.

A SOA-C02 best practice is to manage dashboards as JSON in a Git repository, deploy via CloudFormation AWS::CloudWatch::Dashboard, and review changes in pull requests. Hand-built dashboards drift, get accidentally deleted, and are not portable across environments. The exam favors managed-IaC answers; ad-hoc console clicks lose points. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html

Service Quotas Notifications: Integrating CloudWatch with Service Quotas

AWS Service Quotas is the service that displays current usage and limits for AWS services across an account. SOA-C02 explicitly tests configuring notifications when usage approaches a quota.

How it works

For supported quotas (most service quotas are supported, including EC2 vCPUs, EBS volume count, Lambda concurrent executions, VPC count, security groups per VPC), Service Quotas publishes a metric in the AWS/Usage namespace to CloudWatch. You then create a CloudWatch alarm on the usage metric with a threshold expressed as a percentage of the quota — typically 80 percent for a yellow alert and 95 percent for a red alert.

The alarm action is an SNS topic notifying the SysOps team. From there the team either (a) requests a quota increase via Service Quotas, (b) cleans up unused resources to free quota, or (c) refactors to a different service that does not consume the quota.
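Metric math has a SERVICE_QUOTA() function purpose-built for this: it returns the applied quota for a usage metric, so the alarm threshold can be expressed as a percentage. A sketch of the 80 percent vCPU alarm; the AWS/Usage dimension values below follow the documented schema for the EC2 vCPU usage metric but should be treated as assumptions to verify against your account:

```python
# Sketch: alarm at 80% of the EC2 on-demand vCPU quota using AWS/Usage
# and the SERVICE_QUOTA() metric math function.
quota_alarm = {
    "AlarmName": "ec2-vcpu-quota-80pct",
    "Metrics": [
        {"Id": "m1", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/Usage", "MetricName": "ResourceCount",
                       "Dimensions": [
                           {"Name": "Service", "Value": "EC2"},
                           {"Name": "Type", "Value": "Resource"},
                           {"Name": "Resource", "Value": "vCPU"},
                           {"Name": "Class", "Value": "Standard/OnDemand"}]},
            "Period": 300, "Stat": "Maximum"}},
        # Usage as a percentage of the currently applied quota.
        {"Id": "e1", "ReturnData": True,
         "Expression": "(m1 / SERVICE_QUOTA(m1)) * 100",
         "Label": "vCPU quota usage (%)"},
    ],
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 80.0,
    "EvaluationPeriods": 1,
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:quota-alerts"],
}
# boto3.client("cloudwatch").put_metric_alarm(**quota_alarm)
```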

Quota request workflow

For a quota increase request, Service Quotas itself or AWS Support handles the change. Operational maturity demands you alarm at 80 percent so you have time to file the increase before saturation, not at 100 percent when production is already failing.

Service Quotas vs Trusted Advisor

Trusted Advisor's "Service Limits" check covers a smaller, older set of quotas and refreshes daily. Service Quotas is the modern, real-time, programmatic source. SOA-C02 prefers Service Quotas for any "alert when approaching a service limit" question.

Operational maturity means the SysOps team gets paged before customers do. On SOA-C02, when a question asks "alert the team before the EC2 vCPU quota is exhausted", the right answer pairs Service Quotas + CloudWatch alarm at 80 percent threshold + SNS topic + a runbook for the on-call (request increase, clean up unused instances, or migrate workload). Alarming at 100 percent guarantees an outage. Reference: https://docs.aws.amazon.com/servicequotas/latest/userguide/configure-cloudwatch.html

Health Dashboard Integration: AWS Health Events vs CloudWatch Alarms

CloudWatch alarms detect your operational issues. AWS Health Dashboard events tell you about AWS's operational issues affecting your account — planned maintenance, service disruptions, security advisories, and lifecycle changes (e.g., a deprecated AMI being retired). A mature monitoring strategy layers both signals.

Two AWS Health surfaces

  • Service Health Dashboard (formerly status.aws.amazon.com): public, regional, all customers.
  • AWS Health Dashboard (formerly Personal Health Dashboard): per-account, account-specific, includes scheduled events for your resources (e.g., EC2 instance retirement on instance ID i-0abc123 next Tuesday).

The AWS Health Dashboard is what the SOA-C02 exam means by "Health events".

Wiring AWS Health into the alarm pipeline

AWS Health publishes events to EventBridge as the aws.health event source. SysOps teams build EventBridge rules that match on event categories (scheduledChange, issue, accountNotification) and route to:

  • An SNS topic for human notification (email, SMS, chat).
  • A Lambda function that opens a Jira ticket or runs a Systems Manager Automation runbook.
  • A CloudWatch dashboard custom widget that renders the current Health events alongside CloudWatch alarms.
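Such a rule's event pattern can be sketched as follows; the service filter is an illustrative assumption (restrict it to whatever the on-call actually owns):

```python
import json

# Sketch of an EventBridge event pattern matching AWS Health scheduled-change
# events. put_rule accepts this as a JSON string via EventPattern.
event_pattern = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "eventTypeCategory": ["scheduledChange"],
        "service": ["EC2", "RDS"],  # assumed filter: the services this team runs
    },
}

event_pattern_json = json.dumps(event_pattern)
```

Dropping the `"service"` key widens the rule to every scheduled change in the account; adding `"issue"` to `eventTypeCategory` also catches active AWS-side incidents.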

For organization-wide visibility, you can enable AWS Health Organizational View (requires AWS Organizations) so events from every member account are aggregated in the management or delegated admin account.

When to use which

Signal source | Detects | Primary tool
CloudWatch alarms | Your workload's metric breaches | CloudWatch + SNS / EventBridge
AWS Health events | AWS-side issues affecting your account | EventBridge aws.health rule
AWS Health Organizational View | AWS-side issues across all org accounts | Management/delegated admin account
AWS Service Health Dashboard | AWS-side global service issues | Public RSS / status page

On SOA-C02, when a scenario asks "the team didn't know AWS was performing planned maintenance on a database, and got paged for failover at 2am" — the gap is AWS Health Dashboard events not being routed to the on-call. The fix is an EventBridge rule on aws.health events of category scheduledChange filtered for affected services, posting to the same SNS topic as critical CloudWatch alarms so the on-call sees both signals. Reference: https://docs.aws.amazon.com/health/latest/ug/getting-started-health-dashboard.html

Scenario Pattern: EC2 Memory Utilization Never Shows in CloudWatch Console

This is the canonical SOA-C02 troubleshooting scenario. The runbook:

  1. Confirm the namespace. Memory metrics live in CWAgent namespace, not AWS/EC2. If the SysOps engineer is searching AWS/EC2 they will never find it.
  2. Verify the agent is installed and running. SSH or Session Manager into the instance: sudo systemctl status amazon-cloudwatch-agent (Linux) or Get-Service AmazonCloudWatchAgent (Windows). If the service is not present, the agent was never installed — fix with Systems Manager Run Command using AWS-ConfigureAWSPackage to install AmazonCloudWatchAgent.
  3. Verify the agent has a config file. The agent does nothing without a config. The config is JSON at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json or fetched from SSM Parameter Store. Use the wizard amazon-cloudwatch-agent-config-wizard for an interactive build, or push from Parameter Store.
  4. Verify the IAM instance profile. The instance must have a role with CloudWatchAgentServerPolicy attached. Check via EC2 console → Instance → Security tab → IAM Role. If missing, attach it (no instance reboot required, but the agent may need a restart: sudo systemctl restart amazon-cloudwatch-agent).
  5. Verify the agent log. /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log shows publish failures. Look for AccessDenied (IAM problem), NetworkUnreachable (security group / NACL / VPC endpoint), or ConfigError (malformed JSON).
  6. Verify network path. If the instance is in a private subnet without a NAT gateway, you need a VPC interface endpoint for CloudWatch (service name com.amazonaws.<region>.monitoring) so the agent can reach the metrics API privately.
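For step 3, a minimal agent config collecting memory and disk metrics into the CWAgent namespace looks roughly like the sketch below; the 60-second interval and root-filesystem choice are assumptions you would tune per workload:

```python
import json

# Minimal CloudWatch agent config sketch for memory and disk-used-percent.
# Would be written to /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
# or stored in SSM Parameter Store and referenced when starting the agent.
agent_config = {
    "metrics": {
        "namespace": "CWAgent",  # the namespace step 1 tells you to search
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {
                "measurement": ["mem_used_percent"],
                "metrics_collection_interval": 60,
            },
            "disk": {
                "measurement": ["disk_used_percent"],
                "resources": ["/"],  # assumed: monitor the root filesystem
                "metrics_collection_interval": 60,
            },
        },
    }
}

agent_config_json = json.dumps(agent_config, indent=2)
```

Malformed JSON here is exactly the `ConfigError` case step 5 tells you to look for in the agent log.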

The most common root causes in order of frequency: missing IAM role policy, missing or wrong agent config JSON, agent installed but not started, and missing VPC endpoint in a private subnet.

Scenario Pattern: Alarm Stuck in INSUFFICIENT_DATA After Instance Termination

Another canonical SOA-C02 scenario. An Auto Scaling group terminates an instance during scale-in. The per-instance CPU alarm that was watching that instance enters INSUFFICIENT_DATA and stays there because no more data points arrive. The alarm's treatMissingData mode determines what happens next:

  • missing (default) — the alarm sits in INSUFFICIENT_DATA forever. Often acceptable for diagnostic alarms.
  • notBreaching — the alarm transitions back to OK. Correct for Auto Scaling scale-out alarms so a terminated instance does not keep triggering scale-out actions.
  • breaching — the alarm transitions to ALARM. Correct only when silence really does mean failure (e.g., a heartbeat metric).
  • ignore — the alarm holds its previous state.
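The four modes can be illustrated with a deliberately simplified model of the evaluation step (this is an illustration of the documented behavior, not the real evaluation engine, which also handles partially-late data):

```python
def evaluate(datapoints, m, n, treat_missing="missing"):
    """Simplified model: datapoints is the most recent n periods, each True
    (breaching), False (ok), or None (missing). Returns the resulting state."""
    window = list(datapoints[-n:])
    if treat_missing == "breaching":
        window = [True if d is None else d for d in window]   # silence = failure
    elif treat_missing == "notBreaching":
        window = [False if d is None else d for d in window]  # silence = healthy
    elif treat_missing == "missing":
        window = [d for d in window if d is not None]  # evaluate real data only
        if not window:
            return "INSUFFICIENT_DATA"  # all periods missing: cannot decide
    elif treat_missing == "ignore":
        if all(d is None for d in window):
            return "PREVIOUS_STATE"  # model's stand-in for "hold current state"
        window = [d for d in window if d is not None]
    return "ALARM" if sum(window) >= m else "OK"

# A terminated instance publishes nothing, so every period is missing:
print(evaluate([None, None, None], m=1, n=3, treat_missing="missing"))       # INSUFFICIENT_DATA
print(evaluate([None, None, None], m=1, n=3, treat_missing="notBreaching"))  # OK
print(evaluate([None, None, None], m=1, n=3, treat_missing="breaching"))     # ALARM
```

The all-missing case is exactly the scale-in scenario above: only `notBreaching` returns the alarm to OK.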

The fix is rarely "delete the alarm". The right answer is usually:

  1. Identify the alarm's purpose (diagnostic, action-driving, or paging).
  2. Pick treatMissingData accordingly.
  3. If the alarm targets a stable resource (an ASG-level metric like GroupInServiceInstances, not a per-instance metric), prefer the ASG-level metric so termination of one instance does not orphan the alarm.

For an ASG, the canonical pattern is: alarm on AWS/EC2 CPUUtilization with the dimension AutoScalingGroupName=web-asg (not InstanceId), so the metric aggregates across the fleet and survives instance churn.
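The fleet-level pattern can be sketched as `put_metric_alarm` parameters; the alarm name, ASG name, threshold, and topic ARN are placeholders:

```python
# Sketch of the ASG-level alarm: dimension on AutoScalingGroupName, not
# InstanceId, so the metric aggregates across the fleet and the alarm
# survives instance churn. Nothing here contacts AWS.
alarm_params = {
    "AlarmName": "web-asg-cpu-high",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    "Statistic": "Average",
    "Period": 60,                    # requires detailed monitoring for 1-minute data
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 3,          # 3-of-5 resists single-minute spikes
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",  # a drained fleet should not keep paging
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder
}
```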

Common Trap: Standard EC2 Metrics Confused with Detailed Monitoring

A persistent SOA-C02 confusion: candidates conflate "detailed monitoring" with "more metrics", because the word detailed implies more. It does not. Detailed monitoring only changes the publishing period of the existing hypervisor metrics from 5 minutes to 1 minute. It adds zero new metric types. Memory, swap, disk used percentage, and process counts come exclusively from the CloudWatch agent regardless of whether detailed monitoring is on or off. Detailed monitoring also costs per metric per month, so enabling it on a fleet of 200 instances has a billing impact you should be ready to defend.

The clean mental separation:

  • More frequent data on existing metrics → enable detailed monitoring.
  • More metric types (memory, disk-fullness, processes) → install the CloudWatch agent.
  • Custom application metrics (queue depth, business KPI) → publish via PutMetricData API or StatsD/collectd via the agent.
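For the third bullet, the shape of a PutMetricData call can be sketched as follows; the `MyApp` namespace, dimension, and value are hypothetical:

```python
# Sketch of put_metric_data parameters for a custom application metric
# (queue depth). One request, one datapoint; nothing here contacts AWS.
custom_metric = {
    "Namespace": "MyApp",  # assumed application namespace, never AWS/*
    "MetricData": [
        {
            "MetricName": "QueueDepth",
            "Dimensions": [{"Name": "Environment", "Value": "prod"}],
            "Value": 42.0,
            "Unit": "Count",
            # "StorageResolution": 1,  # high-resolution: only if sub-minute truly matters
        }
    ],
}
```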

Common Trap: High-Resolution Custom Metric Cost

High-resolution metrics are seductive — 1-second granularity for 10-second alarms — but they cost meaningfully more. Each high-resolution metric is billed at the per-metric rate, and high-resolution alarms cost more than standard alarms (the 10-second alarm SKU vs the 60-second SKU). A SysOps engineer who turns every metric into high-resolution to "have the option" can multiply the CloudWatch bill by 10 to 60 times. The exam-correct rule: high-resolution only for metrics where one-minute lag would mask the problem — real-time trading, gaming, IoT telemetry — and standard resolution everywhere else.

Common Trap: Metric Publishing Latency

CloudWatch is not real-time. AWS-published metrics typically arrive within 1–2 minutes of the underlying event. Custom PutMetricData arrives within seconds, but the alarm evaluation engine still needs the period to close before evaluating. An alarm with a 1-minute period and 1 evaluation period typically fires 60–120 seconds after the breaching condition, not instantly. SOA-C02 sometimes asks "the team needs to detect a failure within 30 seconds and trigger remediation" — the honest answer is that CloudWatch alarms cannot guarantee sub-30-second detection; for that you need application-level health checks, ELB health checks (which have their own thresholds independent of CloudWatch), or Route 53 health checks, depending on the architecture.

SOA-C02 vs SAA-C03: The Operational Lens

SAA-C03 and SOA-C02 both test CloudWatch, but the lenses differ.

Question style | SAA-C03 lens | SOA-C02 lens
Selecting a metric | "Which CloudWatch metric should the architect alarm on for ALB latency?" | "The alarm on TargetResponseTime is flapping every five minutes — what's the fix?"
Default vs detailed monitoring | "Choose the cost-effective monitoring option for development." | "Auto Scaling target tracking is reacting too slowly; the period is 5 minutes — what change?"
treatMissingData | Rarely tested. | Heavily tested — pick the right mode for the alarm's purpose.
Composite alarms | "Which feature reduces alert noise across multiple metrics?" | "Configure a composite alarm with action suppression for the maintenance window."
CloudWatch agent | "The architecture should monitor memory utilization." | "Memory metrics are missing from the CloudWatch console — list every step to diagnose."
Anomaly detection | "Which CloudWatch feature handles seasonal traffic patterns?" | "Configure an anomaly detection alarm with a learned 14-day model and exclude Black Friday."
Dashboards | "Use a CloudWatch dashboard to visualize." | "Build a cross-account, cross-region dashboard with the CloudWatch sharing role."

The SAA candidate selects the metric; the SOA candidate configures the alarm correctly, troubleshoots when it misbehaves, and operates the dashboard during incidents.

Exam Signal: How to Recognize a Domain 1.1 Question

Domain 1.1 questions on SOA-C02 follow predictable shapes. Recognize them and your time on each question drops dramatically.

  • "The metric is missing" — the answer involves the CloudWatch agent, an IAM permission gap, or a VPC endpoint absence. Memory and disk metrics are virtually always agent issues.
  • "The alarm fires too often / too rarely" — the answer adjusts Datapoints to alarm, Evaluation periods, or the period itself. Sometimes the answer is anomaly detection in place of static thresholds.
  • "The alarm is stuck in INSUFFICIENT_DATA" — the answer is treatMissingData configuration, usually notBreaching for ASG alarms.
  • "The on-call is overwhelmed by alarms" — the answer is composite alarms with action suppression on child alarms.
  • "Operations needs visibility across accounts and regions" — the answer is cross-account cross-region CloudWatch dashboards via the sharing role.
  • "Approaching a service limit" — the answer is Service Quotas + CloudWatch alarm at 80 percent + SNS notification.
  • "AWS-side maintenance was missed" — the answer is EventBridge rule on aws.health events to SNS / dashboard.
  • "Compute a derived value (error rate, free capacity)" — the answer is metric math, not a custom metric.

With Domain 1 worth 20 percent and Task Statement 1.1 covering most of CloudWatch, expect 10 to 13 questions in this exact territory. Mastering the patterns in this section is the single highest-leverage study activity for SOA-C02. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

Decision Matrix — CloudWatch Construct for Each SysOps Goal

Use this lookup during the exam.

Operational goal | Primary construct | Notes
Alarm on memory or disk usage | CloudWatch agent + alarm on CWAgent namespace | Default EC2 metrics never expose memory or disk-fullness.
Alarm on EC2 CPU at 1-minute granularity | Detailed monitoring + alarm with 60-second period | Default 5-minute is too coarse for fast scaling.
Compute error rate from two metrics | Metric math expression | Cheaper than publishing a custom rate metric.
Suppress alarm during scheduled maintenance | Composite alarm with suppressor alarm | Or schedule a Lambda action that disables alarms.
Reduce paging during transient blips | Datapoints to alarm M-of-N | E.g., 3 of 5 instead of 1 of 1.
Stop spurious alarms after instance terminates | treatMissingData=notBreaching | Auto Scaling-driven alarms in particular.
Detect missing heartbeat | treatMissingData=breaching | Silence is failure for these.
Adapt to traffic seasonality | Anomaly detection alarm | Two standard deviations is the default band.
Alarm on service quota usage | Service Quotas usage metric + alarm at 80% | AWS/Usage namespace.
Aggregate across many accounts | Cross-account dashboards via CloudWatch-CrossAccountSharingRole | One-click setup in console settings.
React to AWS-side maintenance | EventBridge rule on aws.health | Plus SNS, plus dashboard custom widget.
Auto-recover an EC2 instance | CloudWatch alarm on StatusCheckFailed_System → EC2 recover action | Built-in alarm action, no SNS needed.
Trigger Auto Scaling | CloudWatch alarm → Auto Scaling action | Or target tracking which manages alarms internally.
Run a remediation script | CloudWatch alarm → SNS → Lambda or alarm → EventBridge → SSM Automation | EventBridge route preferred for retry/decoupling.
Manage dashboards as code | CloudFormation AWS::CloudWatch::Dashboard | Versioned in Git, reviewed in PRs.
Centralized logs from on-prem | CloudWatch agent + SSM hybrid activation | One agent, one namespace, one dashboard.

Common Traps Recap — CloudWatch Metrics, Alarms, and Dashboards

Every SOA-C02 attempt will see two or three of these distractors.

Trap 1: Detailed monitoring adds new metrics

It only changes the period of existing hypervisor metrics. Memory, disk-fullness, and process metrics still require the CloudWatch agent.

Trap 2: treatMissingData=missing keeps the alarm useful

missing simply excludes missing periods from M-of-N evaluation; if all periods are missing, the alarm sits in INSUFFICIENT_DATA and never reaches OK or ALARM. For action-driving alarms, choose notBreaching or breaching deliberately.

Trap 3: One alarm per instance scales

Per-instance alarms become orphan alarms when the instance terminates. Prefer ASG-level dimensions (AutoScalingGroupName) so the metric aggregates across the fleet.

Trap 4: Alarm period equals evaluation period

They are independent. Period is the bucket size for the statistic; evaluation periods is the count of buckets examined. Confusing them produces alarms that fire too fast or too slow.

Trap 5: Custom metric for a derived value

If the value can be computed from existing metrics, use metric math. Custom metrics cost per metric per dimension and are unnecessary for ratios and aggregations.

Trap 6: Console-built dashboards as the source of truth

Console dashboards drift, get accidentally deleted, and cannot be reviewed. Manage dashboards as CloudFormation AWS::CloudWatch::Dashboard JSON in Git.

Trap 7: High-resolution metrics for everything

The cost is 10x to 60x standard. Reserve for genuinely sub-minute-critical metrics.

Trap 8: Alarming at 100% of a service quota

By the time you breach 100 percent, production is failing. Alarm at 80 percent and have time to file an increase.

Trap 9: Expecting CloudWatch to detect failures in seconds

AWS metrics arrive 1–2 minutes after the event; alarm evaluation adds the period length. Sub-minute detection requires application-level, ELB, or Route 53 health checks.

Trap 10: Forgetting AWS Health Dashboard

CloudWatch only knows what your workloads say. AWS-side maintenance and service issues come from AWS Health via EventBridge aws.health events.

FAQ — CloudWatch Metrics, Alarms, and Dashboards

Q1: What is the difference between Period and Evaluation Periods?

Period is the bucket size — how long CloudWatch waits before computing one statistic value (Average, Sum, Max, Min, p99) for the metric. Common values are 60 or 300 seconds. Evaluation periods is the count N of the most recent buckets the alarm examines together with Datapoints to alarm (M) to apply the M-of-N rule. So Period = 60, Evaluation periods = 5, Datapoints to alarm = 3 means: every minute, check whether 3 of the last 5 one-minute averages breach. Period sets the resolution; evaluation periods set the window over which the rule is applied. Confusing them is the single most common alarm-tuning error on SOA-C02.

Q2: Why does my CloudWatch alarm sit in INSUFFICIENT_DATA forever?

INSUFFICIENT_DATA means the alarm cannot decide because there is not enough metric data within the evaluation window. Three typical causes: (a) the metric is not publishing at all — install the CloudWatch agent, fix the IAM role, or check the VPC endpoint; (b) the metric is sparse — set treatMissingData to notBreaching for scale-out alarms or breaching for heartbeat alarms; (c) the source resource was terminated — switch the alarm to an ASG-level or service-level dimension instead of a per-instance dimension so the metric survives churn. The right fix depends on the alarm's purpose, not a one-size-fits-all default.

Q3: How do I choose between cross-account dashboards and a custom Lambda aggregator?

Use cross-account dashboards for the vast majority of cases — one-click setup, no Lambda code to maintain, supports cross-region in the same configuration, and works with cross-account metric math. Use a custom Lambda aggregator only when you need to push the aggregated data into a third-party SaaS (Datadog, Grafana Cloud), apply business logic AWS does not natively support (e.g., weighted SLO across accounts), or store long-term summaries beyond CloudWatch's 15-month retention. SOA-C02 prefers the AWS-native answer; pick custom Lambda only when the question explicitly rules out cross-account dashboards.

Q4: When does anomaly detection beat a static threshold, and vice versa?

Anomaly detection wins when the metric has a daily or weekly seasonality — public website traffic, e-commerce checkout volume, login rate. The model learns "normal for Monday 9am is different from Sunday 3am" and only alarms on departures from learned normal. Static thresholds win when the metric has an absolute meaning that does not vary with time — disk_used_percent > 90, replication lag seconds > 60, 5xx error count > 100. The giveaway phrase on SOA-C02 is "tired of false alarms during expected peaks" → anomaly detection; "the SLO requires latency under 200ms" → static threshold at 200.

Q5: How does the M-of-N rule (Datapoints to alarm) actually evaluate?

CloudWatch examines the most recent N evaluation periods. It counts how many of those periods breach the threshold. If at least M of them breach, the alarm transitions to ALARM (or stays there). If fewer than M breach across the latest N periods, the alarm transitions to OK. Missing periods are handled per treatMissingData. The M-of-N rule decouples sensitivity (low M) from sustained-condition detection (high N): M=3, N=5 says "the breach has been frequent in the last few minutes" and resists single-minute spikes; M=1, N=1 says "alarm on the very next breach" and is appropriate only for binary signals like status checks.
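The sensitivity difference can be seen in a small illustration of the rule (simplified: it ignores missing-data handling and state history):

```python
def breaches_in_window(values, threshold, m, n):
    """Illustration of the M-of-N rule: ALARM iff at least m of the
    latest n period statistics breach the threshold."""
    window = values[-n:]
    breaching = sum(v > threshold for v in window)
    return "ALARM" if breaching >= m else "OK"

# One-minute CPU averages with a single spike at minute 3:
cpu = [40, 45, 95, 50, 42]
print(breaches_in_window(cpu, threshold=90, m=3, n=5))  # OK (only 1 of 5 breach)

# A sustained breach over five minutes:
cpu = [95, 92, 96, 91, 94]
print(breaches_in_window(cpu, threshold=90, m=3, n=5))  # ALARM
```

The 3-of-5 alarm ignores the one-minute spike but fires on the sustained condition, which is exactly the flapping fix the exam looks for.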

Q6: What is the cleanest way to send an alarm to a Lambda function?

Two paths. (a) The simple way: alarm action → SNS topic → Lambda subscription. (b) The robust way: alarm action → SNS topic, plus an EventBridge rule on the aws.cloudwatch event source matching CloudWatch Alarm State Change → Lambda target. The EventBridge route gives you retries, dead-letter queues, and decoupling without owning the SNS subscription. SOA-C02 favors the EventBridge route for any "automated remediation" answer because it composes naturally with Systems Manager Automation, Step Functions, and cross-account event buses. Direct alarm-to-Lambda action is also possible but tightly coupled and lacks built-in retries.

Q7: Can a single alarm watch a metric in another account or region?

Not directly. A CloudWatch alarm always evaluates in the account and region where it is defined and watches metrics in the same scope. To alarm on a cross-account or cross-region metric you have three options: (a) publish a copy of the metric into the alarm's account/region with a Lambda + PutMetricData; (b) use CloudWatch metric streams to ship metrics to a central account; (c) deploy the alarm itself into each source account/region and route the alarm state change events to a central EventBridge bus. Option (c) is the SOA-C02 recommended pattern for organization-wide monitoring — alarms stay close to the metric, and aggregation happens at the event-bus layer.

Q8: What is the right way to alarm on the AWS Health Dashboard?

AWS Health events are not CloudWatch metrics — they are events on the aws.health event source in EventBridge. Build an EventBridge rule that matches the event categories and services you care about (scheduledChange, issue, accountNotification for relevant services), then route to: an SNS topic for human notification, a Lambda function to update a Jira/ServiceNow ticket, and a CloudWatch dashboard custom widget that renders ongoing Health events alongside CloudWatch alarms. For multi-account scope, enable AWS Health Organizational View so events from every member account aggregate into the management or delegated admin account before going through your EventBridge pipeline.

Q9: How long does CloudWatch keep my metric data?

CloudWatch automatically rolls metric data into coarser buckets as it ages: high-resolution (sub-minute) data for 3 hours, 1-minute data for 15 days, 5-minute data for 63 days, and 1-hour data for 15 months (455 days). After 15 months, the data is gone. If your SysOps role requires longer retention for capacity planning, audit, or compliance, you must export — either via metric streams to S3/Firehose/third-party, or via scheduled GetMetricData calls into your own data lake. The exam may test this number directly: "what is the longest CloudWatch will retain a metric without manual export?" — 15 months.
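The rollup schedule reduces to a simple lookup, sketched here (ages in hours, using the figures above):

```python
def retained_resolution(age_hours):
    """Finest resolution still available for metric data of a given age,
    per CloudWatch's documented retention rollup schedule."""
    if age_hours <= 3:
        return "high-resolution (sub-minute)"
    if age_hours <= 15 * 24:          # 15 days
        return "1-minute"
    if age_hours <= 63 * 24:          # 63 days
        return "5-minute"
    if age_hours <= 455 * 24:         # 15 months
        return "1-hour"
    return "expired"

print(retained_resolution(2))        # high-resolution (sub-minute)
print(retained_resolution(30 * 24))  # 5-minute
print(retained_resolution(500 * 24)) # expired
```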

Q10: Should I use a composite alarm or metric math for combining signals?

Use metric math when the combined signal is a value — error rate, free capacity, weighted average — and you want to alarm on the computed value. The alarm is a single metric-math alarm. Use a composite alarm when the combined signal is a Boolean over alarm states — "page only when both A and B are in ALARM" or "suppress paging while maintenance window alarm is active". Composite alarms also let you suppress child-alarm actions, which metric math cannot. A common SOA-C02 trick is to offer metric math when the right answer is composite (state combination), or composite when the right answer is metric math (value combination); read the scenario for whether it talks about values or alarm states.
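The two shapes look quite different in configuration, sketched below; metric names are real ALB metrics, but the alarm names in the composite rule are placeholders and dimensions are omitted for brevity:

```python
# Value combination (metric math): one alarm on a computed 5xx error rate,
# passed as the Metrics list of put_metric_alarm.
def stat(metric_id, name):
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB", "MetricName": name},
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

error_rate_metrics = [
    stat("errors", "HTTPCode_Target_5XX_Count"),
    stat("requests", "RequestCount"),
    {"Id": "rate", "Expression": "100 * errors / requests", "ReturnData": True},
]

# State combination (composite): an AlarmRule expression over alarm states —
# page only when both children are in ALARM and no maintenance window is active.
composite_rule = (
    'ALARM("web-asg-cpu-high") AND ALARM("alb-5xx-rate-high") '
    'AND NOT ALARM("maintenance-window-active")'
)
```

The giveaway: the first operates on numbers, the second on alarm states — matching the values-vs-states reading of the scenario.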

Once metrics are in place, the next operational layers are: CloudWatch Logs and Logs Insights for application and system log analysis (the natural follow-on to metrics for SOA-C02 Domain 1), EventBridge rules and SNS notifications for routing alarms into automated remediation pipelines, CloudTrail and AWS Config for audit and configuration compliance signals that feed into the same alarm and dashboard fabric, and EC2 Auto Scaling and ELB high availability for the workload tier whose health these alarms supervise.

Official Sources