EventBridge and SNS for SysOps Remediation

Amazon EventBridge and Amazon Simple Notification Service (SNS) are the two services that turn passive observations into active operations on AWS. SOA-C02 Task Statement 1.2 — "Remediate issues based on monitoring and availability metrics" — is built almost entirely around the chain that runs from a CloudWatch alarm or AWS Config rule, through EventBridge rules and SNS topics, into Systems Manager Automation runbooks, Lambda functions, or SQS queues that perform the corrective action. Where SAA-C03 asks how to design an event-driven architecture, SOA-C02 asks how to wire one up correctly, troubleshoot it when an event never reaches its target, and tune the retry, dead-letter, and filtering behavior so the on-call gets the signal without drowning in noise.

This guide walks through EventBridge and SNS from the operational SysOps angle: how event buses route AWS service events, how event patterns differ from JSON path matching, why scheduled rules are slowly being supplanted by EventBridge Scheduler, the four CloudWatch alarm EC2 actions that bypass SNS entirely, the SNS fan-out pattern with message filtering policies, the retry semantics that change between HTTP/S and AWS-native delivery, and the canonical three-service remediation chain (Config rule, EventBridge rule, SSM Automation document) that SOA-C02 tests in roughly one out of every four Domain 1 questions. You will also see the recurring scenario shapes: cross-region and cross-account event forwarding, SNS message filter policies on MessageAttributes versus the message body, EventBridge rule dead-letter queues, and the operational difference between EventBridge and direct alarm-to-Lambda actions when retries and decoupling matter.

The official SOA-C02 Exam Guide v2.3 describes Task Statement 1.2 in three skills: troubleshoot or take corrective actions based on notifications and alarms, configure Amazon EventBridge rules to invoke actions, and use AWS Systems Manager Automation runbooks to take action based on AWS Config rules. EventBridge and SNS thread through every one of those skills. The CloudWatch alarm in Topic 1 fires; its action publishes to an SNS topic, and its state change is also emitted as an event that an EventBridge rule can match. The Config rule in the CloudTrail and Config topic detects non-compliance; the non-compliance event lands on the default EventBridge bus. The Systems Manager Automation document in Domain 3 executes corrective steps; it is typically invoked by an EventBridge rule or by Config remediation, not by a direct alarm action.

At the SysOps tier the framing is operational, not architectural. SAA-C03 asks "design an event-driven architecture for an order processing pipeline." SOA-C02 asks "the EventBridge rule is configured but the SSM Automation runbook never runs — what is missing?" The answer set is small and predictable: the rule's IAM role lacks permission to invoke the target, the event pattern does not match because of a single quoted-string typo, the target is in another region or account without the cross-region/cross-account plumbing, the rule input transformer is malformed, or the rule lives on the default bus while the source event arrives on a custom bus. EventBridge and SNS is the topic where every monitoring and remediation flow plugs in: CloudWatch alarms route through it (Domain 1), Auto Scaling lifecycle hooks emit events to it (Domain 2), CloudFormation stack events surface here (Domain 3), GuardDuty findings land here (Domain 4), Health Dashboard events flow through it (Domain 1), and Config-driven auto-remediation depends on it (Domain 3).

Event bus: a router for events. The default event bus in each account/region receives events from AWS services. Custom event buses receive application events you publish via PutEvents. Partner event buses receive events from SaaS integrations (Datadog, Auth0, Zendesk, etc.).
Event pattern: a JSON document that filters which events a rule matches. Event patterns use exact match, prefix match, anything-but, numeric, IP range, exists, and equals-ignore-case operators on event fields.
Rule: an EventBridge resource that watches a bus, applies an event pattern OR a schedule expression, and sends matching events to one or more targets.
Target: where a matched event goes. Common targets include Lambda, SNS, SQS, Step Functions, Systems Manager Automation, EC2 actions (reboot/stop/terminate), API destinations, ECS tasks, and Kinesis streams.
Schedule expression: rate(5 minutes) or cron(0 18 ? * MON-FRI *) — a time-based trigger instead of an event pattern.
EventBridge Scheduler: a separate, newer scheduling service with one-time schedules, time zones, flexible time windows, and a higher quota than rule-based schedules.
SNS topic: a fan-out endpoint. Standard topics deliver each message at least once with best-effort ordering; FIFO topics deliver in order with deduplication but only support SQS queue subscribers (SQS FIFO queues when ordering must be preserved end to end).
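Subscription: a delivery target attached to a topic, such as email, SMS, HTTP/S, Lambda, SQS, mobile push, or Amazon Data Firehose. An SNS topic cannot subscribe directly to another topic; cross-account delivery is handled through the topic policy.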
Message filter policy: a JSON document on a subscription that limits which messages reach this subscriber, evaluated against MessageAttributes (default) or the message body (opt-in).
DLQ (dead-letter queue): an SQS queue that receives events EventBridge or SNS could not deliver to the target after retries. Critical for operational visibility.
Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html

Plain-Language Explanation of EventBridge and SNS

EventBridge and SNS jargon lives in three overlapping layers — events, routing, and delivery. Three analogies help.

Analogy 1: The Postal System With Smart Sorting

EventBridge is the postal sorting hub of your AWS account. Every AWS service drops mail into the default event bus, which is the central post office in your region. Event patterns are the address-matching rules the sorters apply: "if the envelope says EC2 and the state is stopped, route to the operations team." Custom event buses are dedicated mailrooms for a single application — your order pipeline has its own bus so its events never collide with AWS service events. Partner buses are incoming mail from outside vendors like Auth0 and Datadog, kept on their own bus so you can apply different rules.

SNS is the fan-out distribution list at the destination. The post office hands one envelope (message) to SNS, and SNS makes copies for everyone on the subscription list — five emails, two pagers, a Slack webhook, and a Lambda function. Message filter policies are like per-subscriber preferences: "only forward me envelopes about my own building, not the whole neighborhood." Dead-letter queues are the undeliverable mail bin — when an SNS HTTP subscriber is offline for too long, the message goes to a separate SQS queue so the postmaster (SysOps engineer) can investigate later.

For SOA-C02 the punchline is: EventBridge sorts and routes one event to a small number of explicitly chosen targets; SNS fans out one message to a list of subscribers, each with its own filter and delivery channel. They compose: an EventBridge rule's target is often an SNS topic so the same event can fan out to humans and Lambda at once.

Analogy 2: The Fire Alarm Wired to Multiple Responses

A CloudWatch alarm is a fire alarm sensor in a building. When it fires, it has four possible actions wired into the panel: an SNS topic that rings the building bell so everyone hears it, an EC2 action that automatically closes a fire door (reboot, stop, terminate, recover), an Auto Scaling action that dispatches more firefighters (scale-out), and a Systems Manager OpsCenter action that opens a ticket for the building manager. SNS is the bell and the human notification channel; the EC2 and Auto Scaling actions are mechanical responses that bypass SNS entirely. EventBridge is the building's central alarm panel that can wire the sensor to a far richer set of responses — call the fire department (SSM Automation runbook), schedule a follow-up inspection (Step Functions), or dim the lights (Lambda).

For SOA-C02, the operational question is which path to use. CloudWatch alarm direct action is fast but limited and tightly coupled. CloudWatch alarm → SNS → email is the simplest human notification. CloudWatch alarm → EventBridge rule → SSM Automation is the production-grade path with retries, decoupling, transforming, and cross-account fan-out.

Analogy 3: The Restaurant Kitchen Order Ticket System

EventBridge is the kitchen ticket rail in a restaurant. Each customer order (event) prints a ticket and slides down the rail. The rules are the station chefs — the grill chef only picks up tickets that say "steak", the salad chef only picks up tickets that say "salad". Multiple chefs can pick up the same ticket if both stations are needed (one event, multiple targets). The scheduled rule is the prep schedule — at 5pm someone always pulls onions out of the cooler regardless of any customer order. The event bus is the rail itself; a private dining room with a private menu has its own custom rail.

SNS is the expediter at the pass who, when a dish is ready, calls out to the runners — runner one delivers to table 5, runner two takes a copy to the manager's tablet, runner three logs the timing in the POS system. The filter policy is each runner saying "I only deliver desserts." The dead-letter queue is the lost-ticket bucket for orders that nobody picked up after enough tries. The fanout pattern — SNS to multiple SQS queues — is the same ticket showing up at the prep station, the inventory system, and the customer feedback survey simultaneously.

For SOA-C02 the postal sorting analogy is the most useful when reasoning about event patterns and cross-account/cross-region forwarding. The fire alarm analogy clarifies why CloudWatch alarm direct actions exist in addition to SNS — those are mechanical responses wired to the panel, not human notifications. The kitchen ticket analogy is the cleanest mental model for SNS fan-out and message filter policies. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html

EventBridge Event Buses: Default, Custom, and Partner

The event bus is the routing channel for events. Every AWS region in every account has exactly one default bus, plus any custom and partner buses you create.

Default event bus

Every AWS service that emits events publishes to the default event bus in the account and region where the event occurred — EC2 instance state changes, S3 object events (if EventBridge is enabled on the bucket), CodeBuild build state, CodePipeline stage transitions, GuardDuty findings, Health Dashboard events, AWS Config compliance changes, CloudWatch alarm state changes, Systems Manager parameter changes, Auto Scaling lifecycle hooks, ECS task state changes, and many more. SOA-C02 expects you to know that the default bus is where AWS service events land, that it is regional, and that you cannot delete it.

Custom event buses

A custom event bus is one you create explicitly. Application events your code publishes via the PutEvents API land on a custom bus you choose, isolating them from AWS service traffic on the default bus. Custom buses also enable cross-account event delivery: you grant another account permission to publish to your bus through a resource-based policy. SOA-C02 scenarios for custom buses include "consolidate events from 50 member accounts onto a single security event bus" and "build a custom application event taxonomy without polluting the default bus".
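The cross-account grant described above can be sketched as a resource-based policy document. This is a minimal illustration, assuming a hypothetical bus ARN and made-up member account IDs; it only builds and prints the JSON and does not call AWS.

```python
import json


def bus_policy(bus_arn, member_account_ids):
    """Build a resource-based policy that lets each listed account call PutEvents on the bus."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": f"AllowPutEventsFrom{account_id}",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "events:PutEvents",
                "Resource": bus_arn,
            }
            for account_id in member_account_ids
        ],
    }


# Hypothetical values, for illustration only.
SECURITY_BUS_ARN = "arn:aws:events:us-east-1:111122223333:event-bus/hypothetical-security-bus"
policy = bus_policy(SECURITY_BUS_ARN, ["444455556666", "777788889999"])
print(json.dumps(policy, indent=2))
```

For dozens of member accounts, a single statement scoped by an organization condition is usually easier to maintain than one statement per account; the per-account form is shown here only because it is the easiest to read.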

Partner event buses

A partner event bus receives events from a SaaS partner — Auth0, Datadog, MongoDB Atlas, Zendesk, PagerDuty, and so on. The partner generates events from their platform, and you create a partner bus configured with their event source ARN. Partner events arrive directly on this bus and never touch the default bus. The exam treats partner buses lightly, but recognize the term.

Archives and replay

EventBridge supports event archives that store events from a bus, with optional retention. Replay lets you re-emit archived events back to the bus for debugging or recovery — for example, replaying the last hour of events through a fixed Lambda. SOA-C02 occasionally tests "the team needs to test a new EventBridge rule against last week's events" — the answer is archive + replay, not custom log scraping.

Event buses per account per region: 100 (default + 99 custom/partner).
Rules per event bus: 300 default soft limit, can be raised.
Targets per rule: up to 5.
Maximum event size: 256 KB (same as SNS message size limit).
PutEvents throughput: 10,000 events per second per region (soft limit).
Schedule expression minimum frequency: rate(1 minute) for rules; EventBridge Scheduler supports 1-minute schedules and one-time schedules with second-level offset windows.
Retry policy for targets: up to 24 hours and 185 attempts by default; configurable per target on the rule.
Cross-account event delivery: requires resource-based policy on the receiving bus.
Cross-region event delivery: requires an IAM role on the rule that allows events:PutEvents on the destination bus.
Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-bus.html

EventBridge Rules: Event Pattern Matching vs Scheduled Rules

A rule is an EventBridge resource that watches a single bus and routes matched events to one or more targets. A rule has exactly one trigger source: either an event pattern (matching live events on the bus) or a schedule expression (firing on a fixed cadence). A rule cannot have both.

Event pattern matching

An event pattern is a JSON document that says "match events that look like this". The pattern is matched against the event payload field-by-field. Patterns support several match operators:

Exact value: "source": ["aws.ec2"] matches when source is exactly aws.ec2.
Prefix: "region": [{"prefix": "us-"}] matches us-east-1, us-west-2, etc.
Suffix: [{"suffix": ".example.com"}] matches FQDNs ending in .example.com.
Anything-but: [{"anything-but": "ap-southeast-1"}] matches any region except that one.
Numeric: [{"numeric": [">=", 500]}] matches numeric values meeting the comparison.
Exists: [{"exists": true}] matches when the field is present at all.
Equals-ignore-case: [{"equals-ignore-case": "WARNING"}] for case-insensitive matches.
IP address range: [{"cidr": "10.0.0.0/8"}] for source IP filtering.

A typical pattern matching EC2 instance state changes looks like:

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["stopped", "terminated"]
  }
}

The pattern is hierarchical and AND-ed across fields; the array values within a field are OR-ed. A plain pattern cannot express OR across different fields; that requires the $or operator or a second rule, and overlooking this is a common cause of "the rule does not fire" troubleshooting.
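The AND-across-fields, OR-within-array behavior can be illustrated with a small local matcher. This is a simplified model written for this guide, not EventBridge's implementation: it handles only exact-value arrays and nested objects, and the sample events are made up.

```python
def matches(pattern, event):
    """Return True when every field in the pattern matches the event.

    Simplified model: only exact-value arrays and nested objects are supported.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse, and every nested field must match (AND).
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Array of allowed values: any single value may match (OR).
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}

stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
running_event = {**stopped_event, "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"}}

print(matches(pattern, stopped_event))
print(matches(pattern, running_event))
```

The first event satisfies all three fields; the second fails only on detail.state, which is enough to reject it because the fields are AND-ed.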

Scheduled rules

A scheduled rule uses a schedule expression instead of an event pattern. Two syntaxes:

rate(value unit) — rate(5 minutes), rate(1 hour), rate(7 days). Simple recurring intervals.
cron(minutes hours day-of-month month day-of-week year) — six fields, with ? allowed in the day-of-month or day-of-week field but not both.

The classic SOA-C02 cron expression is cron(0 18 ? * MON-FRI *), which fires every weekday at 18:00 UTC, with ? in the day-of-month field because day-of-week is specified. Candidates lose points by writing cron(0 18 * * MON-FRI *), which EventBridge rejects because neither day-of-month nor day-of-week is ?, or by writing a five-field POSIX cron such as cron(0 18 * * 1-5), which lacks the year field.

Warning: a surprising amount of SOA-C02 wrong-answer churn comes from the cron syntax. EventBridge cron has six fields (minutes hours day-of-month month day-of-week year), not the POSIX five. Exactly one of day-of-month and day-of-week must be ?; using * in both causes the expression to be rejected. Scheduled rules also evaluate in UTC only; there is no time-zone support. If a scenario requires a job to fire at 9am Asia/Taipei, you must either compute the UTC offset yourself (and, for time zones that observe daylight saving time, remember to update it at each DST change) or switch to EventBridge Scheduler, which has native time-zone support. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html
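The two cron rules stated above (six fields, and exactly one of day-of-month and day-of-week set to ?) can be checked with a short helper. This is a sketch for self-study, assuming those two rules only; it does not validate field ranges or any other part of the syntax that EventBridge enforces.

```python
def check_eventbridge_cron(expression):
    """Return a list of problems found; an empty list means both rules pass."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        return ["expression must have the form cron(...)"]
    fields = expression[len("cron("):-1].split()
    if len(fields) != 6:
        return [f"expected 6 fields, found {len(fields)}"]
    day_of_month, day_of_week = fields[2], fields[4]
    # Exactly one of the two day fields must be the ? placeholder.
    if (day_of_month == "?") == (day_of_week == "?"):
        return ["exactly one of day-of-month and day-of-week must be ?"]
    return []


for candidate in (
    "cron(0 18 ? * MON-FRI *)",
    "cron(0 18 * * MON-FRI *)",
    "cron(0 18 * * 1-5)",
):
    print(candidate, check_eventbridge_cron(candidate))
```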

EventBridge Scheduler vs scheduled rules

Amazon EventBridge Scheduler is a separate, newer scheduling service launched in 2022. It supersedes scheduled rules for new workloads and offers:

Higher quotas — millions of schedules per account vs the rule-per-bus limit.
One-time schedules — fire exactly once at a specified timestamp.
Time-zone support — cron(0 9 * * ? *) evaluated in Asia/Taipei rather than UTC.
Flexible time windows — fire any time within a 15-minute window for load smoothing.
Built-in retry, DLQ, and target encryption.

For SOA-C02, both are testable. A scheduled rule on the default bus is the older, simpler option that works for AWS-service-event-style schedules. EventBridge Scheduler is the answer for "thousands of per-customer schedules", "one-time deferred actions", or any time-zone-sensitive cron.

Rule input transformers

A rule can transform the event before sending it to the target. The input transformer uses input paths that extract fields from the event and a template that interpolates them into a new payload. This is essential when an SSM Automation document expects parameters in a specific shape that does not match the EC2 state change event format.

A common SOA-C02 trap: the EventBridge rule fires correctly but the SSM Automation document fails with a parameter validation error. The fix is an input transformer that maps $.detail.instance-id to the InstanceId parameter the runbook expects. The rule's CloudWatch metrics help localize the fault: a non-zero TriggeredRules count shows that the pattern matched, so if remediation still does not happen, check FailedInvocations and the Automation execution history. In that case the wiring is right but the payload shape is wrong. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rules.html
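The mapping from $.detail.instance-id to an InstanceId parameter can be sketched as follows. The InputPathsMap and InputTemplate keys follow the shape of the PutTargets InputTransformer structure; the render function is a local approximation written for this guide (it supports only simple $.a.b paths), and the sample event is made up.

```python
import json

# Input transformer as it would appear on a rule target.
input_transformer = {
    "InputPathsMap": {"instance": "$.detail.instance-id"},
    "InputTemplate": '{"InstanceId": [<instance>]}',
}


def render(transformer, event):
    """Approximate the transformation locally for simple dotted paths."""
    rendered = transformer["InputTemplate"]
    for name, path in transformer["InputPathsMap"].items():
        if not path.startswith("$."):
            raise ValueError(f"unsupported path: {path}")
        value = event
        for part in path[2:].split("."):
            value = value[part]
        # An unquoted placeholder is replaced by the JSON form of the value.
        rendered = rendered.replace(f"<{name}>", json.dumps(value))
    return json.loads(rendered)


sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(render(input_transformer, sample_event))
```

The rendered payload wraps the instance ID in a list because Automation parameters are passed as lists of strings.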

Common EventBridge Targets

A rule can have up to 5 targets. The most common targets in SOA-C02 territory:

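AWS Lambda function: asynchronous invocation; the event is passed as the function input. EventBridge retries delivery to the Lambda service, while errors raised inside the function are retried according to Lambda's own asynchronous-invocation settings.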
Amazon SNS topic — fan-out to multiple subscribers (humans, Lambda, SQS).
Amazon SQS queue — durable storage for downstream consumers; standard or FIFO.
AWS Systems Manager Automation document (runbook) — multi-step remediation; the rule needs an IAM role with ssm:StartAutomationExecution permission.
AWS Step Functions state machine — orchestrate long-running multi-step workflows.
Amazon ECS task — launch a one-off task on demand.
EC2 actions — RebootInstances, StopInstances, TerminateInstances. Note these are EventBridge built-in actions distinct from CloudWatch alarm EC2 actions.
Amazon Kinesis Data Stream / Firehose — streaming pipeline ingestion.
API destination — outbound HTTPS call to a third-party API with stored authentication.
Cross-account / cross-region event bus — forward the event to a bus in another account or region.

IAM permissions

EventBridge needs permission to invoke each target. For Lambda, SNS, and SQS targets in the same account, EventBridge uses resource-based policies: the target's policy allows the events.amazonaws.com service principal, usually restricted to the rule's ARN through an aws:SourceArn condition. For SSM Automation, Step Functions, ECS, and most other targets, the rule has an explicit execution IAM role that grants the necessary action. Forgetting the IAM role on an SSM Automation target is the single most common reason a rule "fires but does nothing".

Rule dead-letter queues

A target can fail: Lambda throttled, SNS topic deleted, SSM Automation parameters invalid. By default EventBridge retries for up to 24 hours and up to 185 attempts, with exponential backoff. After exhausting retries, the event is discarded unless you configure a dead-letter queue (SQS) on the rule's target. SOA-C02 tests this directly: "events occasionally do not reach the target but no one knows". The answer is a DLQ on the rule target plus a CloudWatch alarm on the DLQ's ApproximateNumberOfMessagesVisible.

Without a DLQ, EventBridge silently discards events after retry exhaustion and the only signal is a missed remediation. With a DLQ, undelivered events are preserved in SQS and a CloudWatch alarm on ApproximateAgeOfOldestMessage pages the on-call. SOA-C02 expects DLQs on every production rule; "configure a DLQ" is a frequent answer choice for "ensure no events are lost". Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rule-dlq.html
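A target definition with a retry policy and a DLQ might look like the sketch below. The RetryPolicy and DeadLetterConfig keys follow the PutTargets target structure; the ARNs and the target Id are hypothetical, and the range check reflects the documented limits as the author of this sketch understands them (0 to 185 attempts, 60 to 86,400 seconds), so treat those bounds as an assumption to verify.

```python
import json

# Hypothetical target entry, shaped like an item in the PutTargets Targets list.
target = {
    "Id": "hypothetical-remediation-target",
    "Arn": "arn:aws:lambda:us-east-1:111122223333:function:hypothetical-remediator",
    "RetryPolicy": {
        "MaximumRetryAttempts": 10,
        "MaximumEventAgeInSeconds": 3600,
    },
    "DeadLetterConfig": {
        "Arn": "arn:aws:sqs:us-east-1:111122223333:hypothetical-rule-dlq",
    },
}


def check_target(entry):
    """Return a list of problems with the retry and DLQ settings."""
    problems = []
    retry = entry.get("RetryPolicy", {})
    attempts = retry.get("MaximumRetryAttempts", 185)
    age = retry.get("MaximumEventAgeInSeconds", 86400)
    if not 0 <= attempts <= 185:
        problems.append("MaximumRetryAttempts must be between 0 and 185")
    if not 60 <= age <= 86400:
        problems.append("MaximumEventAgeInSeconds must be between 60 and 86400")
    if "DeadLetterConfig" not in entry:
        problems.append("no DLQ: events are dropped after retries are exhausted")
    return problems


print(json.dumps(target, indent=2))
print(check_target(target))
```

The SQS queue also needs a queue policy that allows EventBridge to send messages to it; that policy is not shown here.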

CloudWatch alarms have four built-in action categories that fire on state transitions (OK, ALARM, INSUFFICIENT_DATA). Each is independent.

The most common alarm action is an SNS topic ARN. The alarm publishes a JSON message describing the state change to the topic, which fans out to subscribers. A typical pattern:

Critical alarm SNS topic — subscribed by PagerDuty HTTPS endpoint and on-call email.
Warning alarm SNS topic — subscribed by Slack channel webhook and team email.
Info SNS topic — subscribed only by an audit Lambda for log retention.

The alarm action and the SNS topic must be in the same region (alarms cannot directly publish to a cross-region topic). For cross-region notification, either the SNS topic in region A has a Lambda subscriber that re-publishes to region B, or an EventBridge rule that matches the alarm's state change event forwards it to a bus in the other region.

EC2 actions

CloudWatch alarms can directly invoke four EC2-instance actions without going through SNS or EventBridge:

Recover: for impaired instances (system status check failures rooted in AWS-side hardware). The recovered instance keeps its instance ID, private IP, Elastic IP, and metadata. Available only on supported instance types, historically those that use EBS-only storage.
Stop — useful for runaway dev or test instances on a billing alarm.
Terminate — useful with autoscaling for scaled-out instances that fail health checks.
Reboot — quick remediation for soft hangs.

These four actions exist as alarm actions specifically because they are common, latency-sensitive remediations and going through EventBridge → Lambda → EC2 API would add unnecessary indirection. SOA-C02 tests "auto-recover an EC2 instance on system status check failure" — the answer is the CloudWatch alarm recover action, not EventBridge.

Auto Scaling actions

A CloudWatch alarm can directly invoke an Auto Scaling scaling policy ARN as its action. Target tracking and step scaling policies use this internally. Most production setups let target tracking manage the alarms automatically rather than building them by hand.

Systems Manager OpsCenter / Incident Manager actions

Newer alarm action categories include creating an OpsItem in Systems Manager OpsCenter or triggering an Incident Manager response plan. These are operational-incident-management features that show up in some SOA-C02 questions.

Why EventBridge over alarm actions for remediation

Direct alarm actions (SNS, EC2, Auto Scaling) are simple and fast but tightly coupled. For complex remediation (multi-step SSM Automation, conditional branching, retries, DLQ, cross-account fan-out) the SOA-C02 best practice is to keep an SNS action on the alarm AND create an EventBridge rule that matches the alarm's state change event. The SNS topic notifies humans; the EventBridge rule on aws.cloudwatch source matching CloudWatch Alarm State Change invokes the SSM Automation runbook with retries, DLQ, and decoupling.

CloudWatch alarm EC2 actions and EventBridge EC2 built-in targets both exist, and both can reboot or stop an EC2 instance. The CloudWatch alarm recover/stop/terminate/reboot actions are bound to the instance the alarm is monitoring (InstanceId dimension). The EventBridge RebootInstances/StopInstances/TerminateInstances built-in target accepts an instance ID list as a target parameter and is decoupled from the source event. SOA-C02 distractors often offer "alarm action stops the instance" when the scenario actually needs cross-instance logic, which requires EventBridge → Lambda or SSM Automation. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmActions.html

Amazon SNS is a fully managed pub/sub service for fan-out delivery. A topic is the publish endpoint; subscriptions are the delivery targets.

Subscription types

A single SNS topic can have many subscriptions, mixing types:

Email — RFC-compliant email; subscriber must confirm via clickthrough link before receiving messages.
Email-JSON — same but with the full JSON payload.
SMS — text messages; rate-limited and region-restricted (no SMS in some regions like ap-east-1).
HTTP/S — POST to a webhook URL; useful for Slack, PagerDuty, Datadog, ServiceNow.
AWS Lambda — invoke a function with the message as input.
Amazon SQS — durable buffer for downstream consumers; the canonical fan-out pattern.
Application (mobile push) — Apple Push, Firebase Cloud Messaging, Amazon Device Messaging.
Amazon Data Firehose — stream messages into S3, OpenSearch, Redshift, Splunk.
AWS Event Fork Pipelines — pre-built fan-out into archive (S3 via Firehose), analytics, replay, and DLQ pipelines.

Standard vs FIFO topics

Standard topics — best-effort ordering, at-least-once delivery, unlimited throughput. The right default for almost all SOA-C02 notification and remediation patterns.
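FIFO topics: strict ordering by message group ID, exactly-once delivery, up to 300 publish requests per second (3,000 with batching). FIFO topics only deliver to SQS queues (FIFO queues when ordering must be preserved end to end; current SNS also accepts standard queues); Lambda, email, HTTP, and SMS subscriptions are not supported.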

The exam favors standard topics. Pick FIFO only when the scenario explicitly mentions ordered processing or deduplication of identical messages.

Message filter policies

A subscription filter policy is a JSON document that limits which messages reach a subscriber. By default it evaluates against the message's message attributes (key-value metadata sent alongside the body). Optionally, you can enable filtering on the message body itself using FilterPolicyScope: MessageBody.

A typical filter policy:

{
  "severity": ["critical", "high"],
  "region": ["us-east-1", "us-west-2"]
}

This subscriber only receives messages whose severity attribute is critical or high AND whose region attribute is in the listed values. Filter policies support many of the same operators as EventBridge patterns (prefix, anything-but, numeric, exists), although the two operator sets are not identical. The producer sets attributes when calling Publish.
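The AND-across-keys, OR-within-values behavior of the filter policy above can be illustrated locally. This is a simplified model, not the SNS implementation: it handles exact string matching on message attributes only, and the sample messages are made up.

```python
def subscriber_receives(filter_policy, message_attributes):
    """Return True when every policy key is present with an allowed value."""
    for name, allowed_values in filter_policy.items():
        # A missing attribute fails the policy, the same as a wrong value.
        if message_attributes.get(name) not in allowed_values:
            return False
    return True


filter_policy = {
    "severity": ["critical", "high"],
    "region": ["us-east-1", "us-west-2"],
}

print(subscriber_receives(filter_policy, {"severity": "high", "region": "us-east-1"}))
print(subscriber_receives(filter_policy, {"severity": "low", "region": "us-east-1"}))
print(subscriber_receives(filter_policy, {"severity": "critical"}))
```

Only the first message is delivered: the second has a severity outside the allowed list, and the third carries no region attribute at all.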

Fan-out pattern

The canonical SOA-C02 fan-out pattern is one SNS topic → many SQS queues, each queue with its own filter policy and consumed by a different microservice or remediation Lambda. This decouples the publisher from N consumers, gives each consumer durable buffering, and lets each consumer apply its own filter and processing rate without affecting others.

Cross-account topics

An SNS topic policy can grant publish permission to a principal in another account. Subscribers in another account need their own permission to subscribe. SOA-C02 tests "central security topic in the audit account, subscribed by SQS queues in 50 application accounts" — the answer is a cross-account topic policy plus per-account subscription filter policies.

The naive design — one SNS topic per team, per environment, per severity — explodes into hundreds of topics that are hard to govern. The SOA-C02 best practice is one or two well-known topics with rich message attributes, and per-subscriber filter policies that select what each consumer wants. This is also the AWS-recommended fan-out pattern. Reference: https://docs.aws.amazon.com/sns/latest/dg/sns-message-filtering.html

SNS delivery semantics differ by subscriber type, and the differences are tested.

Retry policies

AWS-managed subscribers (Lambda, SQS, Firehose): SNS retries on AWS-side throttling and transient errors through immediate, pre-backoff, backoff, and post-backoff phases; per the SNS delivery-retry documentation this amounts to 100,015 attempts over about 23 days. Messages that still fail land in the subscription's DLQ if one is configured.
HTTP/S subscribers: SNS applies an explicit delivery policy with four retry phases: immediate (no-delay) retries, pre-backoff, backoff, and post-backoff. The default policy makes only 3 retries, 20 seconds apart. The policy is configurable per topic or per subscription via the delivery policy JSON, with the total retry time capped at 3,600 seconds.
Email and SMS — SNS retries internally; failures (bounced email, blocked SMS) are not delivered to the DLQ.

Subscription DLQs

Each subscription can have an SQS DLQ. When SNS exhausts retries, the message lands in the DLQ with metadata about the failure. The DLQ is per subscription, not per topic — a topic with five subscribers needs five DLQ configurations if each subscription needs failure visibility.

Message size and structure

SNS messages are limited to 256 KB (same as EventBridge). For larger payloads, use the SNS Extended Client which stores the payload in S3 and sends only an S3 reference. Alternatively, send a notification message and have the subscriber fetch the full data from S3.

A SOA-C02 distractor: "the third-party HTTP webhook is sometimes down; configure SNS to retry for 24 hours". The default HTTP/S delivery policy makes only a few retries over roughly a minute. A custom delivery policy (numRetries, numNoDelayRetries, numMinDelayRetries, numMaxDelayRetries, minDelayTarget, maxDelayTarget, and backoffFunction) can extend this, but only up to a total of 3,600 seconds, so the durable fix for a flaky webhook is a subscription DLQ rather than a 24-hour retry. Lambda subscribers use the SNS-Lambda integration and retry differently; they never use the HTTP delivery policy. Mixing the two retry models is a frequent trap. Reference: https://docs.aws.amazon.com/sns/latest/dg/sns-message-delivery-retries.html
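A custom HTTP/S delivery policy can be sketched and sanity-checked before it is applied. The healthyRetryPolicy keys below follow the SNS delivery policy JSON; the specific values are illustrative, and the worst-case figure is a deliberately pessimistic estimate (it assumes every delayed retry waits the full maxDelayTarget), not the formula SNS uses.

```python
import json

VALID_BACKOFF = {"arithmetic", "exponential", "geometric", "linear"}

# Illustrative values only.
delivery_policy = {
    "healthyRetryPolicy": {
        "minDelayTarget": 10,
        "maxDelayTarget": 60,
        "numRetries": 20,
        "numNoDelayRetries": 2,
        "numMinDelayRetries": 3,
        "numMaxDelayRetries": 5,
        "backoffFunction": "exponential",
    }
}


def check_retry_policy(policy):
    """Return a list of internal consistency problems."""
    problems = []
    phased = (policy["numNoDelayRetries"] + policy["numMinDelayRetries"]
              + policy["numMaxDelayRetries"])
    if phased > policy["numRetries"]:
        problems.append("phase retries exceed numRetries")
    if policy["minDelayTarget"] > policy["maxDelayTarget"]:
        problems.append("minDelayTarget exceeds maxDelayTarget")
    if policy["backoffFunction"] not in VALID_BACKOFF:
        problems.append("unknown backoffFunction")
    return problems


def worst_case_seconds(policy):
    """Pessimistic bound: every retry that has a delay waits maxDelayTarget."""
    delayed = policy["numRetries"] - policy["numNoDelayRetries"]
    return delayed * policy["maxDelayTarget"]


retry = delivery_policy["healthyRetryPolicy"]
print(json.dumps(delivery_policy))
print(check_retry_policy(retry), worst_case_seconds(retry))
```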

Automated Remediation with Systems Manager Automation

The canonical SOA-C02 remediation chain has three links: AWS Config rule detects non-compliance, EventBridge rule matches the Config compliance change event, Systems Manager Automation runbook executes the corrective steps.

The three-service chain in detail

Config rule evaluates resources against the rule's logic (managed rules like s3-bucket-public-read-prohibited or custom Lambda-backed rules). On non-compliance, Config publishes a Config Rules Compliance Change event to the default EventBridge bus.
EventBridge rule has an event pattern matching aws.config source, Config Rules Compliance Change detail-type, and the specific rule name and NON_COMPLIANT state. The target is an SSM Automation document.
SSM Automation runbook runs corrective actions — for an S3 bucket, this might be AWSConfigRemediation-RemoveS3BucketPublicReadAccess, which calls s3:PutBucketAcl to set ACL to private and s3:PutBucketPolicy to remove public statements.

The IAM role on the EventBridge rule needs ssm:StartAutomationExecution on the runbook plus iam:PassRole on the Automation execution role it hands to Systems Manager. The Automation execution role in turn needs the actual remediation permissions (s3:PutBucketAcl, etc.) and must be assumable by Systems Manager. Forgetting either layer is a frequent SOA-C02 wrong-answer trap.
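The event pattern for the middle link of the chain can be sketched as below. The field names (configRuleName and newEvaluationResult.complianceType under detail) follow the Config Rules Compliance Change event format; the rule name passed in is just the managed rule mentioned above, used as an example.

```python
import json


def config_noncompliance_pattern(rule_name):
    """Build an event pattern that matches one Config rule turning NON_COMPLIANT."""
    return {
        "source": ["aws.config"],
        "detail-type": ["Config Rules Compliance Change"],
        "detail": {
            "configRuleName": [rule_name],
            "newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]},
        },
    }


pattern = config_noncompliance_pattern("s3-bucket-public-read-prohibited")
print(json.dumps(pattern, indent=2))
```

Restricting the pattern to a single rule name keeps the remediation runbook from firing on compliance changes reported by unrelated rules.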

Config managed remediation

Config also supports direct managed remediation without an explicit EventBridge rule — you attach an SSM Automation document directly to the Config rule's "Remediation actions" tab and choose Automatic or Manual mode. Automatic remediation runs immediately on non-compliance; Manual requires a human to click "Remediate" in the Config console.

When SOA-C02 asks "the simplest configuration to auto-remediate non-compliant resources", the answer is Config managed remediation with Automatic mode — no EventBridge rule needed. When the scenario requires conditional logic, fan-out, retries, or cross-account remediation, the answer is the explicit Config → EventBridge → SSM Automation chain.

Common AWS-managed remediation documents

AWSConfigRemediation-RemoveS3BucketPublicReadAccess
AWSConfigRemediation-RemoveS3BucketPublicWriteAccess
AWS-EnableEbsEncryptionByDefault
AWS-DisablePublicAccessForSecurityGroup
AWSConfigRemediation-EnableCloudTrailLogFileValidation
AWS-RestartEC2Instance
AWS-StopEC2Instance
AWSConfigRemediation-EnableEbsEncryptionByDefault

These cover roughly 80 percent of compliance auto-remediation scenarios. Custom Automation documents are written in YAML or JSON with steps like aws:executeAwsApi, aws:executeScript, aws:waitForAwsResourceProperty, and aws:assertAwsResourceProperty.

Expect at least one Domain 1 question that tests this exact three-service chain. The pattern is so heavily tested that recognizing it from the question stem ("auto-remediate non-compliant resources", "Config detects... what is the next step") is worth memorizing as a shortcut. The role-permission pitfalls are also testable — iam:PassRole, ssm:StartAutomationExecution, and the Automation execution role's resource permissions all need to be configured. Reference: https://docs.aws.amazon.com/config/latest/developerguide/remediation.html

Corrective Actions Flow: CloudWatch Alarm → SNS → Lambda → API Call

A second canonical remediation chain runs from a CloudWatch alarm rather than a Config rule.

The simple form

CloudWatch alarm detects a metric breach (e.g., RDS FreeableMemory < threshold).
SNS topic receives the alarm action message.
Lambda function subscribed to the SNS topic parses the message and calls AWS APIs to remediate (e.g., scale RDS to a larger instance class, or restart a process via SSM Run Command).

The robust form (SOA-C02 preferred)

CloudWatch alarm publishes the state change.
EventBridge rule with source aws.cloudwatch and detail-type CloudWatch Alarm State Change, matching the specific alarm name in detail.alarmName and detail.state.value: ["ALARM"].
EventBridge target is the SSM Automation runbook directly (no Lambda needed for AWS-managed remediations).
The rule has a DLQ to catch failed invocations.
The Automation document's role has the minimum permissions for the corrective action.
A separate EventBridge rule on the same alarm event (or a target on the same rule) sends to an SNS topic for human notification.

The robust form is preferred because EventBridge handles retries, DLQ, and rule input transformation, while the SNS topic is dedicated to humans, not to triggering remediation. Mixing remediation logic into Lambda subscribers on the SNS topic conflates notification with action.
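The rule in the robust form can be sketched as an event pattern plus a toy matcher. The matcher below is a deliberately simplified stand-in for EventBridge's matching (exact values and nested objects only, no prefix or anything-but operators), and the alarm name is hypothetical.

```python
alarm_pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["rds-freeable-memory-low"],  # hypothetical alarm name
        "state": {"value": ["ALARM"]},
    },
}


def matches(pattern, event):
    """Simplified matcher: dicts recurse, lists mean 'any of these exact values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def alarm_event(state):
    """Build a trimmed-down alarm state change event for the given state."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {"alarmName": "rds-freeable-memory-low", "state": {"value": state}},
    }


print(matches(alarm_pattern, alarm_event("ALARM")))  # True
print(matches(alarm_pattern, alarm_event("OK")))     # False
```

Matching on the state value keeps the runbook from running again when the alarm returns to OK.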

When direct alarm-to-Lambda action is acceptable

Single-account, single-region, simple remediation logic.
Tight latency requirement (sub-second invocation; SNS adds tens of milliseconds).
Throwaway operational scripts without a long lifecycle.

The exam usually steers toward the EventBridge-based answer, but recognize that direct alarm-to-Lambda is not wrong — it is just less robust.

EventBridge Scheduled Events: Replacing EC2 Cron Jobs

A common SysOps anti-pattern is running cron jobs on individual EC2 instances for periodic tasks — nightly backups, weekly reports, hourly data exports. Each instance becomes a single point of failure and a configuration island. The SOA-C02 sanctioned alternative is EventBridge scheduled rules (or EventBridge Scheduler) invoking a Lambda function or SSM Automation document.

Migration pattern

Audit existing EC2 cron jobs across the fleet.
For each job, identify the work — is it a shell script that calls AWS APIs (rewrite as Lambda), a multi-step procedure (rewrite as SSM Automation document), or a long-running batch job (rewrite as ECS scheduled task or AWS Batch)?
Create an EventBridge scheduled rule (or Scheduler) with the equivalent cron expression in UTC.
Set the target to the new compute resource.
Configure a DLQ on the rule.
Test in staging, then disable the EC2 cron, then delete the EC2 cron entry.

Common scheduled patterns on AWS

Nightly EBS snapshots: cron(0 3 * * ? *) invoking SSM Automation AWS-CreateSnapshot (or use Amazon Data Lifecycle Manager directly).
Weekly compliance report: cron(0 9 ? * MON *) invoking a Lambda function that queries Config and Trusted Advisor and writes a report to S3.
Hourly cleanup: rate(1 hour) invoking a Lambda function that scans for orphaned snapshots, untagged EC2 instances, and unassociated Elastic IPs.
Business-hours scaling: two scheduled rules, one to scale up at 8am UTC on weekdays and one to scale down at 8pm UTC.

For EBS and AMI snapshots, Amazon Data Lifecycle Manager (DLM) is the AWS-native, purpose-built service. For OS-level patching, Systems Manager Maintenance Windows integrates with Patch Manager and supports approval workflows. For everything else, EventBridge scheduled rules or EventBridge Scheduler are the general-purpose answer. SOA-C02 distractors often offer EventBridge for snapshot scheduling when DLM is the cleaner answer; read the question for "managed lifecycle" vs "ad-hoc schedule". Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-scheduler.html

Cross-Account and Cross-Region Event Forwarding

Many SOA-C02 scenarios involve consolidating events from many accounts or regions into a central location for monitoring or remediation.

Cross-account event delivery

To send events from account A to account B:

In account B (the receiver), edit the resource-based policy on the target event bus (default or custom) to allow events:PutEvents from account A's principal.
In account A (the sender), create a rule whose target is arn:aws:events:region:accountB:event-bus/default (or the custom bus name).
Account A's rule needs an IAM role that grants events:PutEvents on the destination bus ARN.

The exam tests both halves — many candidates remember the resource-based policy on the destination but forget the IAM role on the source rule.
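Both halves of the cross-account setup can be written down as policy documents. The sketch below uses hypothetical account IDs, bus name, and role name, and the checker function is only an illustration of "both must be present", not an AWS API.

```python
SOURCE_ACCOUNT = "111111111111"  # account A (sender), hypothetical
DEST_BUS_ARN = "arn:aws:events:us-east-1:222222222222:event-bus/central-security"

# Half 1 (account B): resource-based policy on the destination bus.
bus_resource_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAccountAToPutEvents",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{SOURCE_ACCOUNT}:root"},
            "Action": "events:PutEvents",
            "Resource": DEST_BUS_ARN,
        }
    ],
}

# Half 2 (account A): identity policy on the role attached to the forwarding rule.
source_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "events:PutEvents", "Resource": DEST_BUS_ARN}
    ],
}


def grants_put_events(policy, bus_arn):
    """True if any Allow statement covers events:PutEvents on the bus."""
    return any(
        s["Effect"] == "Allow"
        and s["Action"] == "events:PutEvents"
        and s["Resource"] == bus_arn
        for s in policy["Statement"]
    )


def cross_account_ready(resource_policy, role_policy, bus_arn):
    """Both halves are required; either one alone is insufficient."""
    return grants_put_events(resource_policy, bus_arn) and grants_put_events(role_policy, bus_arn)


print(cross_account_ready(bus_resource_policy, source_role_policy, DEST_BUS_ARN))
```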

Cross-region event delivery

Same pattern: the target ARN points to the bus in another region, and the source rule's IAM role grants events:PutEvents on the destination bus. EventBridge does not forward events across regions on its own; you must configure each forwarding rule explicitly.

Centralized event bus pattern

For a company with 50 accounts, the SOA-C02-correct architecture is:

A central security event bus in a dedicated audit account, in the primary region.
Each member account has rules on its default bus that forward security-relevant events (GuardDuty findings, Config compliance changes, root account activity from CloudTrail) to the central bus.
The central bus has rules that fan out to SNS (security team), Lambda (ticket creation), and S3 (audit log archive via Firehose).

The single most common SOA-C02 wrong answer for cross-account EventBridge is "configure the target bus's resource policy" — that is necessary but insufficient. The source rule also needs an IAM role granting events:PutEvents on the destination ARN. Both must be correct. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-cross-account.html

Scenario Pattern: EC2 Instance Repeatedly Fails Status Checks

A common SOA-C02 troubleshooting scenario. The runbook:

CloudWatch alarm monitoring StatusCheckFailed_System (system status check, indicating AWS-side hardware impairment).
The alarm has the EC2 recover action wired in directly — when the alarm fires, EC2 attempts to recover the instance on different underlying hardware.
If the recovery succeeds, the instance retains its private IP, Elastic IP, instance ID, and EBS attachments. The system status check returns to passing.
If the recovery fails (e.g., the instance type does not support recovery, or the alarm is on StatusCheckFailed_Instance which is OS-side), an EventBridge rule on EC2 Instance State-change Notification matching state: stopped invokes an SSM Automation runbook that runs deeper diagnostics, snapshots the volume for forensics, and creates an OpsItem for the on-call.
An SNS notification fans out to the on-call channel and a ticket is created in the IT service management system.

The EventBridge rule has the runbook as the target, the runbook has an execution role with ec2:DescribeInstances, ec2:CreateSnapshot, ssm:CreateOpsItem, and the rule has a DLQ. The notification flow is decoupled from the remediation flow so a Slack outage does not stop the snapshot.
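The first step of this runbook, the alarm with the built-in recover action, can be sketched as the parameter set a PutMetricAlarm call would take. The instance ID, alarm name, period, and evaluation count are illustrative assumptions; the action ARN follows the arn:aws:automate:REGION:ec2:recover form.

```python
def recover_alarm(instance_id, region):
    """Alarm parameters that recover an instance on system status check failure."""
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        # The built-in EC2 action acts on the instance in this dimension.
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No SNS topic or EventBridge rule is involved in the recovery itself.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


alarm = recover_alarm("i-0123456789abcdef0", "us-east-1")  # hypothetical instance
print(alarm["AlarmActions"][0])
```

Adding an SNS topic ARN to the same AlarmActions list would cover the human notification without touching the recovery path.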

Scenario Pattern: Config Flags Non-Compliant S3 Bucket

This is the canonical Config + EventBridge + SSM Automation scenario.

AWS Config rule s3-bucket-public-read-prohibited continuously evaluates every S3 bucket in the account.
A developer accidentally creates a bucket with a public-read ACL.
Config detects the non-compliance within minutes and publishes a Config Rules Compliance Change event to the default EventBridge bus.
EventBridge rule with event pattern matching source: ["aws.config"], detail-type: ["Config Rules Compliance Change"], detail.configRuleName: ["s3-bucket-public-read-prohibited"], detail.newEvaluationResult.complianceType: ["NON_COMPLIANT"] fires.
The rule's target is SSM Automation document AWSConfigRemediation-RemoveS3BucketPublicReadAccess, with input parameters mapped from the event via input transformer (specifically the bucket name from $.detail.resourceId).
The Automation document's role has s3:PutBucketAcl and s3:PutBucketPolicy permissions on arn:aws:s3:::*.
The runbook calls s3:PutBucketAcl to set ACL to private. Within seconds the bucket is no longer public.
A second target on the same rule sends to an SNS topic notifying the security team for audit purposes.

For the simpler case, the same outcome is achieved with Config managed remediation in Automatic mode — no EventBridge rule needed; the Config rule directly invokes the SSM Automation document. Pick the answer based on whether the scenario emphasizes simplicity (managed remediation) or extensibility/multi-target fan-out (explicit EventBridge rule).
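The input transformer step in this scenario is easy to get wrong, so here is a small simulation of it. It is a simplified model (dotted paths only, placeholders replaced with JSON-encoded values), and S3BucketName is an assumed runbook parameter name, not one confirmed by the scenario.

```python
import json

config_event = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "configRuleName": "s3-bucket-public-read-prohibited",
        "resourceId": "example-public-bucket",  # hypothetical bucket
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}

# Transformer settings: a paths map plus a template with <name> placeholders.
input_paths_map = {"bucket": "$.detail.resourceId"}
input_template = '{"S3BucketName": [<bucket>]}'


def resolve(path, event):
    """Resolve a simple '$.a.b' path; arrays and wildcards are not handled."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    node = event
    for key in path[2:].split("."):
        node = node[key]
    return node


def apply_transformer(paths_map, template, event):
    """Substitute each placeholder with the JSON form of the resolved value."""
    rendered = template
    for name, path in paths_map.items():
        rendered = rendered.replace(f"<{name}>", json.dumps(resolve(path, event)))
    return json.loads(rendered)


print(apply_transformer(input_paths_map, input_template, config_event))
```

The output is a parameter map whose values are lists of strings, which is the shape an Automation target expects for its input.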

Common Trap: SNS HTTP Retry Semantics vs Lambda Subscriber

Candidates frequently assume SNS retries are uniform across subscriber types. They are not.

HTTP/S subscribers use a configurable delivery policy with four retry phases (immediate, pre-backoff, backoff, post-backoff). Default total duration is around an hour.
Lambda subscribers use the SNS-Lambda integration; retry behavior is governed by Lambda's async invocation retry rules (two retries with exponential backoff, then DLQ if configured on the Lambda function or the SNS subscription).
SQS subscribers essentially never fail to deliver if SQS is healthy in the region; failures are rare and result in DLQ delivery.
Email and SMS retry internally; failures (bounce, blocked) are not visible to the publisher.

Mixing these retry models leads to wrong answers like "configure SNS retry policy to fix Lambda execution failures" — the SNS HTTP delivery policy does not apply to Lambda subscribers.

Common Trap: EventBridge Cross-Account Bus

A common SOA-C02 wrong answer for cross-account EventBridge: "create an SNS topic in the source account that publishes to a Lambda in the destination account, the Lambda re-emits the event". This works but is not the AWS-recommended pattern. The correct pattern is direct cross-account event delivery via the destination bus's resource policy + the source rule's IAM role. The Lambda re-emit pattern adds a hop and a moving part for no benefit; SOA-C02 expects the native cross-account flow.

A second cross-account trap: forgetting that the source rule's IAM role needs events:PutEvents permission on the destination bus ARN. Many candidates configure only the destination bus's resource policy and wonder why the source rule matches (its TriggeredRules metric increments) but no event reaches the destination (nothing arrives on the destination bus, and FailedInvocations rises on the source rule).

Common Trap: SNS Message Attributes vs Body for Filtering

By default, SNS subscription filter policies evaluate against message attributes, not the message body. If your producer sends the severity in the JSON body but not as an attribute, the filter policy never matches and either no messages or all messages are delivered (depending on whether the filter is a positive or negative pattern).

The fix is one of:

Set message attributes at publish time so filter policies can use them (preferred — attributes are designed for filtering).
Enable body filtering by setting FilterPolicyScope: MessageBody on the subscription. Then the policy evaluates against fields in the JSON body. Note that body filtering was introduced in late 2022 and requires a JSON message body, not arbitrary text.

SOA-C02 distractors include "the filter policy syntax is wrong" when the actual issue is FilterPolicyScope being unset. Read the scenario for "the producer publishes JSON" vs "the producer sets attributes" to pick the right answer.
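The attribute-versus-body distinction can be demonstrated with a toy evaluator. This sketch handles exact-match policies only and is not SNS's actual implementation; the message contents are made up for illustration.

```python
import json

filter_policy = {"severity": ["critical", "high"]}


def fields_in_scope(message, scope):
    """Return the key-value pairs the policy is evaluated against."""
    if scope == "MessageBody":
        try:
            body = json.loads(message["Message"])
        except ValueError:
            return {}  # body filtering needs a JSON body
        return body if isinstance(body, dict) else {}
    attributes = message.get("MessageAttributes", {})
    return {name: attr["StringValue"] for name, attr in attributes.items()}


def is_delivered(policy, scope, message):
    """Exact-match only: every policy key must carry one of the allowed values."""
    fields = fields_in_scope(message, scope)
    return all(fields.get(key) in allowed for key, allowed in policy.items())


# Producer put the severity in the JSON body only.
body_only = {"Message": json.dumps({"severity": "critical", "host": "db-1"})}
# Producer set the severity as a message attribute.
with_attribute = {
    "Message": "disk full",
    "MessageAttributes": {"severity": {"DataType": "String", "StringValue": "critical"}},
}

print(is_delivered(filter_policy, "MessageAttributes", body_only))  # False
print(is_delivered(filter_policy, "MessageBody", body_only))        # True
```

The same policy text gives opposite results depending on the scope, which is why a syntactically valid policy can still deliver nothing.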

Common Trap: EventBridge Rules Are Regional

EventBridge rules and event buses are regional. An EC2 state change in us-east-1 lands on the default bus in us-east-1, and a rule in us-west-2 cannot match it without explicit cross-region forwarding. This is the same model as CloudWatch alarms (regional) and most other AWS services.

When SOA-C02 asks "consolidate events from 12 regions onto a single dashboard", the answer involves cross-region event forwarding rules (one per source region) routing to the central region's bus, plus a central rule fanning out to SNS or Lambda. There is no "global" EventBridge bus.

SOA-C02 vs SAA-C03: The Operational Lens

Question style	SAA-C03 lens	SOA-C02 lens
Choosing EventBridge vs SNS	"Which service for fan-out to multiple consumers?"	"The rule fires but the SSM Automation never runs — what's missing?"
Event pattern syntax	Rarely tested in depth.	Heavily tested — exact match vs prefix vs anything-but.
CloudWatch alarm action	"Configure the alarm to notify the team."	"Choose between alarm action, EventBridge, and Auto Scaling action."
SNS filter policy	"Use SNS for one-to-many notification."	"The Lambda subscriber receives all messages despite a filter — fix it."
Config remediation	"Use Config to detect non-compliance."	"Wire the Config-EventBridge-SSM Automation chain end to end."
Cross-account events	"Centralize logs in an audit account."	"Configure the resource policy AND the source rule's IAM role."
Scheduled rules	"Use scheduled rules for periodic tasks."	"Convert this EC2 cron job to EventBridge Scheduler with timezone support."
Retry and DLQ	"Use SQS DLQ for failed messages."	"EventBridge silently discards events — add a DLQ on the rule."

The SAA candidate selects the service; the SOA candidate configures it correctly, troubleshoots when an event is lost, and operates the event-driven plumbing during incidents.

Exam Signal: How to Recognize a Domain 1.2 Question

Domain 1.2 questions on SOA-C02 follow predictable shapes.

"The rule fires but the target never runs" — IAM role missing on the rule, or input transformer malformed, or target in another region/account without the cross-region/cross-account plumbing.
"Auto-remediate non-compliant resources" — Config rule with Automatic remediation, OR the explicit Config → EventBridge → SSM Automation chain. Read the scenario for simplicity vs extensibility.
"The on-call missed an event" — DLQ not configured, or filter policy too restrictive, or message attribute not set.
"Schedule a periodic task" — EventBridge scheduled rule for simple cases, EventBridge Scheduler for time-zone or one-time cases, DLM for snapshots, SSM Maintenance Window for patching.
"Convert EC2 cron jobs to AWS" — EventBridge → Lambda or SSM Automation, with a DLQ.
"The HTTP webhook is sometimes down" — SNS HTTP delivery policy with extended retries, plus a subscription DLQ.
"Consolidate events from many accounts" — cross-account event bus with resource policy + IAM role on source rules.
"Auto-recover an EC2 instance" — CloudWatch alarm recover action on StatusCheckFailed_System, not EventBridge.
"Notify when alarm fires AND remediate" — alarm action to SNS for humans, EventBridge rule on alarm state change to SSM Automation for remediation.

Domain 1 is 20 percent of the exam, and Task Statement 1.2 (remediation) covers roughly half of that domain — call it 10 percent of the exam, or 6 to 7 of the 65 questions. Add Domain 3.2 (process automation, which heavily reuses EventBridge) and EventBridge plus SNS together account for ~10 to 13 questions. Mastering the patterns in this section is the second-highest-leverage study activity for SOA-C02 after CloudWatch metrics and alarms. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rules.html

Decision Matrix: EventBridge or SNS for Each SysOps Goal

Operational goal	Primary construct	Notes
Notify humans of an alarm	CloudWatch alarm action → SNS topic	Email, SMS, Slack via HTTP, PagerDuty via HTTP.
Auto-recover an impaired EC2	CloudWatch alarm `recover` action	Direct, no SNS or EventBridge needed.
Auto-remediate Config non-compliance (simple)	Config managed remediation, Automatic mode	No EventBridge rule needed.
Auto-remediate Config non-compliance (extensible)	Config rule → EventBridge rule → SSM Automation	Rule has DLQ; runbook has execution role.
Multi-step remediation workflow	EventBridge → SSM Automation document	Or Step Functions for branching/parallel.
Fan-out one event to many consumers	SNS topic with filter policies per subscription	Or EventBridge rule with up to 5 targets.
Periodic task (single account)	EventBridge scheduled rule	`rate()` or `cron()` UTC expressions.
Periodic task (timezone or one-time)	EventBridge Scheduler	Time zones, one-time, flexible windows.
Periodic EBS snapshot	Amazon Data Lifecycle Manager	Purpose-built, not EventBridge.
Periodic OS patching	Systems Manager Maintenance Window	Integrates with Patch Manager.
Replace EC2 cron jobs	EventBridge → Lambda or SSM Automation	With DLQ.
Consolidate events across accounts	Custom event bus + resource policy + source IAM role	Per-account rule forwards to central bus.
Consolidate events across regions	Cross-region rule + source IAM role on `events:PutEvents`	One forwarding rule per source region.
Durable buffer for downstream consumer	SNS → SQS fan-out	SQS gives at-least-once durable delivery.
Notify when DLQ has messages	CloudWatch alarm on SQS `ApproximateNumberOfMessagesVisible`	Then SNS to on-call.
HTTP webhook with extended retries	SNS HTTP subscription + custom delivery policy	Plus subscription DLQ.
Audit all events centrally	EventBridge archive on the central bus	With S3 export via Firehose.

Common Traps Recap — EventBridge and SNS

Trap 1: EventBridge rule fires but target never runs

IAM role missing on the rule (most common), input transformer malformed, target in another region/account without plumbing, or Lambda permission policy missing the EventBridge service principal.

Trap 2: Filter policy filters by attributes, not body, by default

Default scope is MessageAttributes. Set FilterPolicyScope: MessageBody if the producer puts the discriminator in the JSON body and not as attributes.

Trap 3: HTTP retry semantics applied to Lambda

SNS HTTP delivery policy does not apply to Lambda subscribers. Lambda async invocation retries are separate.

Trap 4: Cross-account requires both resource policy AND IAM role

The destination bus's resource policy is necessary but not sufficient. The source rule's IAM role must grant events:PutEvents on the destination bus ARN.

Trap 5: Cron expression with five fields

EventBridge cron has six fields: cron(minutes hours day-of-month month day-of-week year). Five-field POSIX cron is rejected. ? is required in either day-of-month or day-of-week.
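A quick structural check catches the two mistakes named above before a rule is ever created. This is a simplified sketch: it validates only the field count and the ? placement, not the values inside each field.

```python
import re


def check_eventbridge_cron(expression):
    """Return 'ok' or a short description of the first structural problem."""
    match = re.fullmatch(r"cron\((.*)\)", expression.strip())
    if not match:
        return "not wrapped in cron(...)"
    fields = match.group(1).split()
    if len(fields) != 6:
        return f"expected 6 fields, got {len(fields)}"
    day_of_month, day_of_week = fields[2], fields[4]
    if "?" not in (day_of_month, day_of_week):
        return "day-of-month or day-of-week must be ?"
    return "ok"


for candidate in (
    "cron(0 3 * * ? *)",    # nightly at 03:00 UTC
    "cron(0 3 * * *)",      # five-field POSIX style
    "cron(0 9 * * MON *)",  # both day fields specified
):
    print(candidate, "->", check_eventbridge_cron(candidate))
```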

Trap 6: No DLQ on production EventBridge rules

EventBridge silently discards events after retry exhaustion. Always configure a DLQ with a CloudWatch alarm on its message count.

Trap 7: Detailed monitoring vs CloudWatch alarm action confusion

CloudWatch alarm built-in EC2 actions (recover, stop, terminate, reboot) bypass SNS and EventBridge entirely. They are tied to the alarm's InstanceId dimension. EventBridge EC2 actions accept arbitrary instance ID lists.

Trap 8: SSM Automation execution role permissions

The role that EventBridge assumes needs ssm:StartAutomationExecution and iam:PassRole for the Automation execution role. The Automation document runs under that execution role, which carries the actual remediation permissions. Both are required.

Trap 9: FIFO topic with non-SQS-FIFO subscribers

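FIFO SNS topics deliver only to SQS queues: SQS FIFO queues when ordering must be preserved, and (since 2023) SQS standard queues without the ordering guarantee. Email, SMS, HTTP, and Lambda subscribers cannot subscribe to a FIFO topic.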

Trap 10: Scheduled rules vs DLM for EBS snapshots

For EBS snapshot lifecycle, Amazon Data Lifecycle Manager is the AWS-native answer. EventBridge scheduled rules invoking AWS-CreateSnapshot work but lack DLM's retention rules, fast snapshot restore management, and cross-region copy.

FAQ: EventBridge and SNS Notifications

Q1: What is the difference between EventBridge and SNS, and when do I use each?

EventBridge is an event router. It receives events from AWS services and applications, applies pattern-matching rules, and routes matched events to up to 5 targets. It has buses, rules, archives, replay, and rich event pattern syntax. SNS is a fan-out pub/sub. It receives one published message and copies it to many subscribers (email, SMS, HTTP, Lambda, SQS, mobile push) with optional per-subscriber filter policies. They compose: an EventBridge rule's target is often an SNS topic so the same event can fan out to humans (email, SMS, Slack) and to other AWS services (Lambda, SQS) at once. Pick EventBridge for routing and remediation; pick SNS for fan-out notification with multiple delivery channels.

Q2: Why does my EventBridge rule fire but the target never runs?

Five common causes, in order of frequency: (a) the rule's IAM role is missing or lacks permission to invoke the target, so check the role and the target's resource policy; (b) the input transformer produces a payload that the target rejects with a validation error, so check CloudWatch Logs on the target; (c) the target is throttled (Lambda concurrency, SQS throughput) and the retries were exhausted before delivery, so add a DLQ on the rule; (d) the target was deleted or moved and the ARN is stale; (e) the rule is on the wrong bus (default vs custom). The CloudWatch metric TriggeredRules increments when the rule matches an event, Invocations counts target invocations, and FailedInvocations counts invocations that failed permanently; compare them to localize the problem.

Q3: How do I auto-remediate a Config rule violation?

Two patterns. Simple: configure Config managed remediation with mode Automatic on the Config rule, attach an SSM Automation document (often an AWS-managed AWSConfigRemediation-... document), and supply the document's parameters. Config invokes the document directly on non-compliance. Extensible: build the explicit chain — Config rule emits a Config Rules Compliance Change event, an EventBridge rule with a pattern matching the rule name and NON_COMPLIANT invokes the SSM Automation document, and a second target sends to SNS for audit notification. Pick simple when the scenario emphasizes minimum configuration; pick extensible when fan-out, conditional logic, or cross-account remediation is needed.

Q4: How do I notify subscribers based on a property of the message?

Use SNS subscription filter policies. By default, filter policies evaluate against the message's MessageAttributes (key-value metadata sent alongside the body). Set attributes at publish time: Publish(TopicArn, Message, MessageAttributes={severity: {DataType: String, StringValue: "critical"}}). The subscriber's filter policy {"severity": ["critical", "high"]} then only delivers messages whose severity attribute matches. If the discriminator is in the JSON body and not in attributes, set FilterPolicyScope: MessageBody on the subscription so the policy evaluates against the body. Filter policies support exact match, prefix, anything-but, numeric, exists — the same operator family as EventBridge event patterns.

Q5: What is the right way to forward EventBridge events from member accounts to a central audit account?

Configure direct cross-account event delivery. (1) In the audit account, create a custom event bus (or use the default), and edit its resource-based policy to grant events:PutEvents to the principal ARNs of the member accounts. (2) In each member account, create a rule on the default bus with an event pattern matching the events you want to forward (GuardDuty findings, Config compliance changes, root account CloudTrail events, etc.). (3) The rule's target is the audit account's bus ARN. (4) The rule has an IAM role granting events:PutEvents on the destination bus ARN. (5) In the audit account, create rules on the central bus that fan out to SNS for the security team, Lambda for ticket creation, and S3 archive via Firehose. Both halves — destination resource policy and source IAM role — are required.

Q6: How long does EventBridge retry a failed target invocation?

By default, EventBridge retries with exponential backoff for up to 24 hours and up to 185 attempts. After that, the event is discarded unless you configure a dead-letter queue (SQS) on the rule's target. Production rules should always have a DLQ plus a CloudWatch alarm on the DLQ's ApproximateNumberOfMessagesVisible so the on-call learns about delivery failures. The retry behavior is configurable per target via the RetryPolicy setting: MaximumEventAgeInSeconds, with a minimum of 60 seconds and a maximum of 86,400 seconds (24 hours), and MaximumRetryAttempts, from 0 to 185.
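A retry policy and DLQ sit on the target entry of a rule. The sketch below builds such an entry in the shape used by the PutTargets API as I understand it (RetryPolicy with MaximumEventAgeInSeconds and MaximumRetryAttempts, DeadLetterConfig with the queue ARN), with range checks matching the documented limits. All ARNs and the target ID are hypothetical.

```python
def target_with_retry(target_arn, role_arn, dlq_arn, max_age_seconds=3600, max_attempts=10):
    """Build a rule target entry with an explicit retry policy and a DLQ."""
    if not 60 <= max_age_seconds <= 86400:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 86400")
    if not 0 <= max_attempts <= 185:
        raise ValueError("MaximumRetryAttempts must be between 0 and 185")
    return {
        "Id": "remediation-runbook",  # hypothetical target ID
        "Arn": target_arn,
        "RoleArn": role_arn,
        "RetryPolicy": {
            "MaximumEventAgeInSeconds": max_age_seconds,
            "MaximumRetryAttempts": max_attempts,
        },
        # Events that exhaust the retry policy land here instead of vanishing.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


target = target_with_retry(
    "arn:aws:ssm:us-east-1:111122223333:automation-definition/ExampleRunbook:$DEFAULT",
    "arn:aws:iam::111122223333:role/ExampleEventBridgeInvokeRole",
    "arn:aws:sqs:us-east-1:111122223333:example-rule-dlq",
)
print(target["RetryPolicy"])
```

Shortening the event age is often sensible for remediation: a fix that arrives 23 hours late may do more harm than a message in the DLQ.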

Q7: When do I use EventBridge Scheduler instead of a scheduled EventBridge rule?

EventBridge Scheduler (launched 2022) is the newer scheduling service that supersedes scheduled rules for new workloads. Use Scheduler when: (a) you need millions of schedules per account (rules are limited per bus); (b) you need one-time schedules at a specific timestamp; (c) you need time-zone-aware cron expressions like cron(0 9 * * ? *) evaluated in Asia/Taipei; (d) you need flexible time windows for load smoothing (fire any time within a 15-minute window); (e) you want built-in retry, DLQ, and target encryption configured per schedule. Use a scheduled rule when: (f) you have a small number of simple cron schedules; (g) the schedule is naturally part of an event-driven flow on the default bus; (h) the workload is legacy and rules are already in place. Both are testable on SOA-C02; Scheduler is the modern preferred answer for scale and time-zone scenarios.
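To make the time-zone point concrete, the sketch below pairs a Scheduler-style schedule definition with the UTC conversion you would otherwise do by hand for a scheduled rule. The schedule name and ARNs are hypothetical, and the field names follow the Scheduler CreateSchedule request as I understand it. A fixed UTC+8 offset stands in for Asia/Taipei, which is an assumption that holds only because that zone currently has no daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for Asia/Taipei (UTC+8, no DST assumed).
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")

schedule = {
    "Name": "daily-report-taipei",  # hypothetical
    "ScheduleExpression": "cron(0 9 * * ? *)",
    "ScheduleExpressionTimezone": "Asia/Taipei",
    # Fire any time within 15 minutes of the scheduled time.
    "FlexibleTimeWindow": {"Mode": "FLEXIBLE", "MaximumWindowInMinutes": 15},
    "Target": {
        "Arn": "arn:aws:lambda:ap-northeast-1:111122223333:function:example-report",
        "RoleArn": "arn:aws:iam::111122223333:role/ExampleSchedulerRole",
        "RetryPolicy": {"MaximumEventAgeInSeconds": 3600, "MaximumRetryAttempts": 3},
        "DeadLetterConfig": {"Arn": "arn:aws:sqs:ap-northeast-1:111122223333:example-dlq"},
    },
}

# What a UTC-only scheduled rule would need instead: convert 09:00 local to UTC.
local_fire = datetime(2024, 1, 15, 9, 0, tzinfo=TAIPEI)
utc_fire = local_fire.astimezone(timezone.utc)
equivalent_rule_cron = f"cron({utc_fire.minute} {utc_fire.hour} * * ? *)"
print(equivalent_rule_cron)
```

For a zone with daylight saving time the hand-converted UTC expression would drift twice a year, which is the operational reason to prefer Scheduler there.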

Q8: Why do my SNS HTTPS deliveries fail and the messages disappear?

Two likely causes: (a) the HTTP endpoint is down longer than the SNS retry policy covers; the default policy retries for ~1 hour. Configure a custom delivery policy with extended retries (numRetries, numMaxDelayRetries, minDelayTarget, maxDelayTarget, backoffFunction). Beyond retry exhaustion, configure a subscription DLQ so failed messages land in an SQS queue rather than disappearing. (b) The endpoint returns a non-2xx code that SNS treats as terminal (4xx other than 408/429 by default). Check the endpoint's logs for the actual response. The SOA-C02 trap is assuming Lambda subscriber retry rules apply to HTTP. They do not; HTTP and Lambda use entirely separate retry models.
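A custom HTTP/S delivery policy is a small JSON document. The sketch below uses the healthyRetryPolicy field names mentioned above plus numNoDelayRetries and numMinDelayRetries, and it assumes that the backoff phase receives whatever part of numRetries the other phases do not claim. The specific numbers are illustrative, not recommended values.

```python
import json

delivery_policy = {
    "healthyRetryPolicy": {
        "numRetries": 50,         # total retries across all phases
        "numNoDelayRetries": 0,   # immediate phase
        "numMinDelayRetries": 2,  # pre-backoff phase, spaced minDelayTarget apart
        "numMaxDelayRetries": 38, # post-backoff phase, spaced maxDelayTarget apart
        "minDelayTarget": 10,     # seconds
        "maxDelayTarget": 600,    # seconds
        "backoffFunction": "exponential",
    }
}


def backoff_phase_retries(policy):
    """Retries left for the backoff phase after the fixed-delay phases."""
    p = policy["healthyRetryPolicy"]
    fixed = p["numNoDelayRetries"] + p["numMinDelayRetries"] + p["numMaxDelayRetries"]
    if fixed > p["numRetries"]:
        raise ValueError("numRetries is smaller than the sum of the phase counts")
    if p["minDelayTarget"] > p["maxDelayTarget"]:
        raise ValueError("minDelayTarget exceeds maxDelayTarget")
    return p["numRetries"] - fixed


print(backoff_phase_retries(delivery_policy))
print(json.dumps(delivery_policy))
```

The serialized JSON is what would be set as the subscription's DeliveryPolicy attribute; the subscription DLQ is configured separately through its redrive policy.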

Q9: How do I trigger an SSM Automation runbook from a CloudWatch alarm?

Two paths. Direct: configure an EventBridge rule with source aws.cloudwatch and detail-type CloudWatch Alarm State Change that matches the specific alarm name and detail.state.value: ["ALARM"]. The rule's target is the SSM Automation document. The rule has an IAM role with ssm:StartAutomationExecution and iam:PassRole for the execution role. The Automation document has its own execution role with the actual remediation permissions. The rule has a DLQ. Indirect via SNS: alarm action publishes to SNS topic; a Lambda subscribed to the topic parses the message and calls ssm:StartAutomationExecution. The direct EventBridge path is preferred: fewer moving parts, native retries, no Lambda code to maintain. SOA-C02 favors the direct EventBridge answer for any "automated remediation" scenario.

Q10: Can I use the same SNS topic for both human notification and automated remediation?

Yes, but it is not the SOA-C02 best practice. The clean separation is: SNS topic for humans (email, SMS, Slack, PagerDuty subscribers), EventBridge rule for automation (SSM Automation, Lambda, Step Functions targets). Both fire from the same source event (CloudWatch alarm state change, Config compliance change). Reasons to separate: (a) human notification and automation have different retry, filter, and DLQ requirements; (b) a Slack outage should not block remediation; (c) an Automation document failure should not silence the on-call; (d) message attributes useful for SNS filtering may differ from EventBridge pattern fields. The exception is small, simple environments where one SNS topic with a Lambda subscriber doing remediation alongside human-facing subscriptions is acceptable.

Once event-driven remediation is in place, the next operational layers are: CloudWatch Metrics, Alarms, and Dashboards for the upstream signal that feeds EventBridge and SNS, Systems Manager Automation and Patch Manager for the runbook tier that EventBridge invokes, Scheduled Tasks and Config Auto-Remediation for the periodic and continuous-compliance flows that share EventBridge as their backbone, and CloudWatch Logs and Insights for the application and system log telemetry that often produces the metric filters that drive alarms in the first place.
