examhub.cc · The most efficient path to the most valuable certifications.

Event-Driven and Serverless Architecture Design

7,850 words · ≈ 40 min read

Event-driven serverless architecture is the design paradigm that separates the Solutions Architect Professional from the Associate. SAP-C02 task statements 2.5 (performance) and 2.4 (reliability) both land on the same core question: can you compose Amazon EventBridge, AWS Step Functions, AWS Lambda, AWS AppSync, AWS IoT Core, Amazon SQS, and Amazon SNS into a coherent event-driven serverless architecture that scales to a million devices, respects regional compliance boundaries, survives regional failure, and stays debuggable? This chapter assumes SAA-C03-level depth (Lambda timeouts, Fargate basics, SQS fan-out) and jumps straight into event-driven serverless architecture decisions that only show up at the professional level. If any Associate concept feels thin, revisit the SAA-C03 serverless-and-containers baseline first; here, event-driven serverless architecture is treated as a design discipline, not a service catalog.

Why Event-Driven Serverless Architecture Is a Pro-Level Discipline

Event-driven serverless architecture on SAP-C02 is tested as a design discipline, not as individual service trivia. The exam assumes you know what Lambda and EventBridge are; it asks you to justify a custom bus over a partner bus, to choose between Step Functions Standard and Express when duration limits, delivery semantics, and cost pull in different directions in ways the Associate exam never surfaces, to decide when EventBridge Pipes replaces a Lambda transformer, to reason about how a choreography collapses under a saga rollback, and to place an outbox in a DynamoDB Streams pipeline so the dual-write problem never reaches production. These decisions are why event-driven serverless architecture is Pro-level.

Every scenario in this topic is a composition problem. Event-driven serverless architecture scenarios on SAP-C02 typically present three to six AWS services and ask which stitching strategy satisfies RTO, RPO, idempotency, ordering, cross-account isolation, and cost simultaneously. Expect multi-constraint questions where every constraint rules out one candidate solution.

What Associate Knowledge Is Assumed

Associate-level event-driven serverless architecture (SAA-C03) assumes the reader already knows:

  • AWS Lambda packaging (zip, container image), 15-minute ceiling, 10 GB memory, 10 GB /tmp, 6 MB sync and 256 KB async payload, default 1000 concurrent executions per Region.
  • Basic SQS fan-out (SNS topic plus two SQS subscribers), dead-letter queue fundamentals, at-least-once delivery.
  • EventBridge default bus receiving AWS service events; simple rule-target mapping; Lambda target.
  • Step Functions exists and has Standard and Express; Express is cheaper and faster.
  • API Gateway REST vs HTTP API and WebSocket API; Lambda integration basics.

What SAP-C02 Adds on Top

Event-driven serverless architecture at SAP-C02 adds:

  • Custom, partner, and global-endpoint bus design on EventBridge with schema registry, archive, and replay.
  • EventBridge Pipes for source-filter-enrich-target without Lambda glue code.
  • Cross-account and cross-region event routing and the quota, IAM, and failure semantics of each.
  • Step Functions patterns: saga, callback task tokens, Map iteration (inline vs distributed), Parallel, Choice, circuit breaker via Retry plus Catch plus DLQ.
  • Lambda at scale: reserved plus provisioned plus SnapStart composition, account-level quotas versus per-function quotas, burst versus sustained concurrency, cold-start mitigation decision tree.
  • AppSync for real-time GraphQL subscriptions as an event-driven edge.
  • IoT Core rules engine as a first-class event source for fleets.
  • Choreography vs orchestration trade-offs at scale.
  • Outbox pattern with DynamoDB Streams to fix the dual-write problem.
  • DLQ plus redrive plus observability as a first-class reliability concern.

Event-driven serverless architecture is a software pattern where loosely coupled services communicate by publishing and subscribing to events on a broker, using compute that scales to demand without host management. On AWS, event-driven serverless architecture centers on Amazon EventBridge as the primary broker, AWS Lambda as the primary ephemeral compute, AWS Step Functions as the orchestrator, and Amazon SQS and SNS as the durable buffer and fan-out primitives. The architecture gains its scalability and resilience from the combination of asynchronous messaging, idempotent consumers, and managed scale-to-zero compute. Reference: https://docs.aws.amazon.com/wellarchitected/latest/serverless-applications-lens/welcome.html

Core Design Principles of Event-Driven Serverless Architecture

Event-driven serverless architecture at SAP-C02 is governed by five recurring design principles. Every scenario tests at least two.

Principle One: Events Are Facts, Not Commands

An event records something that happened (OrderPlaced, PaymentFailed) and is immutable. A command tells a specific recipient to do something (ChargeCard). Event-driven serverless architecture scales because any number of subscribers can react to the same fact without the producer knowing. SAP-C02 questions use this distinction to filter answers: if the scenario says one sender must know the receiver succeeded, it is a command pattern and probably wants Step Functions or synchronous API Gateway plus Lambda, not EventBridge fan-out.

Principle Two: Assume At-Least-Once Delivery, Design for Idempotency

Every AWS event broker in event-driven serverless architecture (EventBridge, SQS Standard, SNS, Kinesis Data Streams, DynamoDB Streams) is at-least-once under failure. SAP-C02 expects the architect to design idempotent consumers using an idempotency key stored in DynamoDB with conditional writes, Powertools-style idempotency decorators, or database unique constraints. Only SQS FIFO and SNS FIFO give exactly-once, and they cap throughput.
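A minimal sketch of such an idempotent consumer in Python, assuming a hypothetical DynamoDB table named `idempotency` with partition key `pk` (boto3 is imported lazily inside the handler so the module stays importable without AWS dependencies):

```python
def make_key(event: dict) -> str:
    # Derive the idempotency key from business identifiers, not from
    # broker-assigned message IDs (those change when a message is republished).
    return f"{event['source']}#{event['detail']['orderId']}"

def handle(event: dict, table_name: str = "idempotency") -> str:
    """Process one event effectively once. Table name and key schema
    are illustrative assumptions, not a prescribed layout."""
    import boto3
    from botocore.exceptions import ClientError
    ddb = boto3.client("dynamodb")
    try:
        # Conditional put fails if this key was already recorded, turning
        # at-least-once delivery into effectively-once processing.
        ddb.put_item(
            TableName=table_name,
            Item={"pk": {"S": make_key(event)}},
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return "duplicate-skipped"
        raise
    # ...real side effect (charge card, update projection) goes here...
    return "processed"
```

Deleting the key after a TTL window keeps the table small while still covering the broker's typical redelivery horizon.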

Principle Three: Decouple Producer Scale From Consumer Scale

Event-driven serverless architecture inserts a durable buffer (SQS, Kinesis, DynamoDB Streams) between producer and consumer so bursts do not crash downstream. Lambda reserved concurrency on the consumer and visibility timeout on SQS together shape the consumption curve. This principle forces the outbox pattern in dual-write scenarios.

Principle Four: Observability Is Part of the Architecture

An event-driven serverless architecture without tracing is an unmaintainable one. AWS X-Ray traces across API Gateway, Lambda, Step Functions, EventBridge, and SQS via propagation headers; CloudWatch Embedded Metrics Format (EMF) writes structured metrics; CloudWatch Logs Insights queries across all function log groups. SAP-C02 regularly tests failures that are impossible to debug without tracing.

Principle Five: Regional and Account Boundaries Are Policy, Not Technology

Event-driven serverless architecture scenarios often mandate that personally identifiable information (PII) stays in one Region, or that security events cross from workload accounts to a central security account. EventBridge cross-account buses, resource-based policies, and PrivateLink all exist precisely to encode those policies as infrastructure.

Amazon EventBridge Bus Design at Pro Depth

EventBridge is the broker at the center of event-driven serverless architecture on AWS. SAP-C02 expects deep fluency with bus types, rule targets, schema management, archive and replay, Pipes, and global endpoints.

Bus Types: Default vs Custom vs Partner

EventBridge has three bus categories, and event-driven serverless architecture design starts with choosing correctly:

  • Default event bus exists in every Region of every account at creation. It receives events from AWS services automatically (EC2 state changes, S3 object events when EventBridge notification is enabled, CodePipeline state, etc.). You cannot delete it; you should not publish custom domain events here because mixing AWS-origin and application-origin events on one bus creates rule chaos at scale.
  • Custom event bus is created per business domain or per bounded context. Good event-driven serverless architecture uses one custom bus per bounded context (Orders, Payments, Inventory) so rule sets and retention policies align with domain ownership.
  • Partner event bus receives events from SaaS partners (Auth0, Datadog, Zendesk, Shopify, PagerDuty, Stripe, many more). You cannot publish to a partner bus; only the partner can. Each partner source automatically provisions a dedicated partner bus when you activate the integration.

EventBridge Rules, Targets, and Content-Based Filtering

Each bus holds rules that pattern-match incoming events and route to up to five targets per rule. Rule patterns support:

  • Exact match on event attributes ("source": ["orders.service"]).
  • Prefix match ("prefix": "ord").
  • Anything-but match, numeric match, IP CIDR match, exists check.
  • Nested attribute match into the detail JSON block.

Supported targets include Lambda, Step Functions state machine, SQS, SNS, Kinesis Data Streams, Kinesis Firehose, ECS task, Batch job, API destination (outbound HTTPS), another EventBridge bus (including cross-account or cross-region), API Gateway, and SageMaker pipeline.
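As a sketch, several of these operators combined into one pattern for a hypothetical orders bus (all names and thresholds are illustrative); note that put_rule expects the pattern serialized as a JSON string:

```python
import json

pattern = {
    "source": ["orders.service"],                      # exact match
    "detail-type": [{"anything-but": ["OrderTest"]}],  # anything-but match
    "detail": {                                        # nested detail match
        "shipTo": {"region": [{"prefix": "eu-"}]},     # prefix match
        "total": [{"numeric": [">", 100]}],            # numeric match
    },
}

def rule_args(bus_name: str) -> dict:
    """Keyword arguments for events.put_rule on a boto3 EventBridge client."""
    return {
        "Name": "high-value-eu-orders",
        "EventBusName": bus_name,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }
```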

On SAP-C02, scenarios hinting at Domain-Driven Design, microservices, or multi-team ownership almost always map to custom bus per bounded context. The default bus is reserved for AWS-origin events. Using the default bus for application events is flagged as an anti-pattern because it tangles rules, retention, and cross-account sharing. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-buses.html

Schema Registry and Schema Discovery

EventBridge Schema Registry stores JSON or OpenAPI schemas describing events on a bus. You can enable schema discovery so EventBridge infers schemas from actual events flowing through; you can generate code bindings for Java, Python, TypeScript to get statically typed event objects. In event-driven serverless architecture at professional scale, schema registry prevents the silent breaking change — a producer adds a field and a consumer parser dies — by versioning schemas explicitly.

Archive and Replay

Every custom bus can have an archive attached with retention from one day to indefinite. Archive stores a copy of matching events; replay reprocesses archived events onto the original bus, re-triggering all rules and targets, with a specified time range and optional rule filter. Archive and replay is the canonical answer on SAP-C02 to:

  • "A bug in the payment-completed handler dropped events for two hours; how do we reprocess without replaying upstream systems?"
  • "DR: after a regional failover, how do we reprocess the last 24 hours of events on the secondary Region?"
  • "Testing: how do we inject production-realistic event load into a staging environment?"

Before archive and replay, teams wrote events to S3 via Firehose and built replay tooling themselves. In event-driven serverless architecture on SAP-C02, archive and replay on EventBridge is always preferred when the requirement is "replay past events back through the original rules." If the requirement is to query events with Athena, Firehose to S3 is still the right answer alongside archive. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-archive.html
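A hedged boto3 sketch of the replay call for the first scenario; the archive and bus ARNs are placeholders you would substitute, and the events client is passed in:

```python
from datetime import datetime, timezone

def replay_name(prefix: str, start: datetime) -> str:
    # Deterministic, unique-enough replay name for audit trails.
    return f"{prefix}-{start.strftime('%Y%m%dT%H%M%SZ')}"

def replay_window(events_client, archive_arn: str, bus_arn: str,
                  start: datetime, end: datetime) -> str:
    """Replay archived events from [start, end] back onto the original bus,
    re-triggering every matching rule. Add FilterArns=[rule_arn] to the
    Destination to limit replay to specific rules."""
    resp = events_client.start_replay(
        ReplayName=replay_name("payments-fix", start),
        EventSourceArn=archive_arn,   # arn of the bus's archive
        EventStartTime=start,
        EventEndTime=end,
        Destination={"Arn": bus_arn},
    )
    return resp["ReplayArn"]
```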

EventBridge Pipes: Source to Target Without Lambda

EventBridge Pipes is a point-to-point integration that replaces glue Lambda functions. A Pipe has four stages: Source (SQS, Kinesis, DynamoDB Streams, Kafka, MQ, etc.), optional filter expression, optional enrichment (Lambda, Step Functions Express, API destination, API Gateway), and target (any EventBridge target). Pipes matter for event-driven serverless architecture because they:

  • Eliminate boilerplate Lambda code whose only job is to forward, filter, or reshape events.
  • Provide batching, retries, and DLQ natively.
  • Charge per event and per enrichment call instead of per Lambda millisecond.
  • Maintain event ordering for ordered sources (Kinesis shard, DynamoDB Stream shard, SQS FIFO group).

Typical Pipes use cases:

  • DynamoDB Streams to EventBridge bus with a filter that drops eventName == REMOVE.
  • Kinesis Data Stream to Step Functions Express with an enrichment Lambda that looks up customer tier.
  • SQS FIFO to ECS RunTask where the target does ordered batch processing.
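A sketch of the first use case with boto3's create_pipe, assuming placeholder ARNs and a pipe role that can read the stream and put events; the filter keeps INSERT/MODIFY records and drops REMOVE:

```python
import json

# Pipe filter pattern: everything except REMOVE stream records.
drop_remove = {"eventName": [{"anything-but": ["REMOVE"]}]}

def create_outbox_pipe(pipes_client, stream_arn: str, bus_arn: str,
                       role_arn: str) -> None:
    """Names and ARNs are illustrative; the client is a boto3 'pipes' client."""
    pipes_client.create_pipe(
        Name="ddb-stream-to-bus",
        RoleArn=role_arn,
        Source=stream_arn,
        SourceParameters={
            "DynamoDBStreamParameters": {"StartingPosition": "LATEST"},
            "FilterCriteria": {
                "Filters": [{"Pattern": json.dumps(drop_remove)}],
            },
        },
        Target=bus_arn,
    )
```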

Cross-Account and Cross-Region Event Routing

In multi-account event-driven serverless architecture, EventBridge supports:

  • Cross-account targets: a rule in account A with a target that is a bus in account B. Account B's bus needs a resource-based policy allowing events:PutEvents from account A.
  • Organization-wide policies: use aws:PrincipalOrgID condition in the resource policy to grant any account in the Organization.
  • Cross-region targets: a rule in Region X with a target bus in Region Y. Useful for centralized auditing or for paired DR Regions. Cross-region adds latency and is billed for PutEvents in the destination Region.
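The Organization-wide grant can be sketched as a helper that builds the resource-based policy JSON for events.put_permission (bus ARN and org ID are placeholders):

```python
import json

def org_wide_policy(bus_arn: str, org_id: str) -> str:
    """Resource policy allowing any account in the Organization to
    PutEvents onto this bus. Pass the result as the Policy argument
    of events.put_permission on a boto3 EventBridge client."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowOrgPutEvents",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "events:PutEvents",
            "Resource": bus_arn,
            # Condition narrows the wildcard principal to the Organization.
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
        }],
    })
```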

Global Endpoints for Multi-Region Resilience

EventBridge global endpoints provide an active-passive event ingestion endpoint across two Regions paired with a Route 53 health check. Producers publish to a single global endpoint URL; EventBridge routes to the healthy primary Region and automatically fails over to the secondary Region when the health check trips. With event replication enabled, events are also copied to the secondary Region's bus so that processing and replay there preserve continuity after failover. Global endpoints are the canonical answer on SAP-C02 when a scenario says "applications must continue publishing events during a regional outage without client code changes."

Default bus: free, receives AWS service events automatically, one per account-region, cannot be deleted. Custom bus: pay per PutEvents (1 million free per month aggregate), supports archive, replay, schema registry, one per bounded context. Partner bus: receives from SaaS sources only, provisioned per partner source. Global endpoint: active-passive across two Regions with Route 53 health check, automatic failover for producers. Archive retention: 1 day to indefinite. Replay: select time range and optional event pattern. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html

AWS Step Functions Deep Dive for Event-Driven Serverless Architecture

Step Functions is the orchestrator in event-driven serverless architecture. At Pro depth, SAP-C02 tests Standard versus Express, a catalog of workflow patterns, and error handling that composes into a circuit breaker.

Standard vs Express Workflows

The choice is not purely cost:

  • Standard workflow supports up to one year of execution, exactly-once state transitions, full execution history retained for 90 days, asynchronous invocation only, Activity task workers, callback task tokens. Price: $0.025 per 1000 state transitions (US East). Best for human approvals, long-lived orchestration, compliance-audited workflows.
  • Express workflow supports up to five minutes of execution, at-least-once state transitions, no execution history (CloudWatch Logs instead), synchronous or asynchronous invocation. Price: per-execution and per-GB-second duration (much cheaper at scale). Best for high-volume event processing, API request orchestration, streaming pipelines.

A classic SAP-C02 trap proposes Standard for a high-volume event-processing workflow "because Standard is more reliable." For a workflow finishing in under five minutes that needs 100 million executions per month, Express is both cheaper and has higher per-second state transition capacity (100k per second account default). Standard's 2000 StartExecution per second quota and its per-transition pricing make it the wrong default for event-driven serverless architecture at scale. Reference: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-standard-vs-express.html

Workflow Patterns: Saga, Callback Task Token, Map, Parallel, Choice

Saga pattern implements distributed transactions without two-phase commit. Each step in the saga has a paired compensating step. If step N fails, Step Functions walks the catch chain backward, invoking the compensating action for each completed step. Saga is the SAP-C02 answer for "order placement touches inventory, payment, and shipping services — how do we roll back partial success." Implement with Task states, Catch blocks, and compensating Task states; use Parallel or Map if steps are independent.
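The saga flow above can be sketched as a minimal ASL definition, written here as a Python dict; state names, ARN placeholders, and the two-step shape are illustrative:

```python
# Two-step saga: if ChargePayment fails after ReserveInventory succeeded,
# Catch routes to the compensating release before failing the execution.
saga = {
    "StartAt": "ReserveInventory",
    "States": {
        "ReserveInventory": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:...:function:ReserveInventory",
            "Next": "ChargePayment",
        },
        "ChargePayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:...:function:ChargePayment",
            "Catch": [{
                "ErrorEquals": ["States.ALL"],
                "Next": "ReleaseInventory",   # compensating action
            }],
            "Next": "Done",
        },
        "ReleaseInventory": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:...:function:ReleaseInventory",
            "Next": "OrderFailed",
        },
        "OrderFailed": {"Type": "Fail", "Error": "SagaRolledBack"},
        "Done": {"Type": "Succeed"},
    },
}
```

A real saga adds a Catch on every forward step and chains compensations in reverse order of completion.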

Callback task token (.waitForTaskToken) pauses a Task state until an external process calls SendTaskSuccess or SendTaskFailure with the token. The token is embedded in the input passed to the Task's resource. This enables:

  • Human approval workflows (SQS to email to web form to SendTaskSuccess).
  • External system integration with asynchronous response.
  • Long-running external jobs (manual data review, third-party API with webhook callback).

Maximum wait: one year on Standard, five minutes on Express. Timeouts bubble up as errors into Catch chains.
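A hedged sketch of the completion side, for example the backend behind an approval form; the Step Functions client is passed in and all names are illustrative:

```python
import json

def decision_output(approved: bool) -> str:
    # Payload handed back to the paused state machine as Task output.
    return json.dumps({"decision": "approved" if approved else "rejected"})

def complete_approval(sfn_client, task_token: str, approved: bool) -> None:
    """Resume the .waitForTaskToken Task. The token arrived embedded in
    the Task input (e.g. via SQS and an email link)."""
    if approved:
        sfn_client.send_task_success(taskToken=task_token,
                                     output=decision_output(True))
    else:
        sfn_client.send_task_failure(taskToken=task_token,
                                     error="ApprovalRejected",
                                     cause="Reviewer declined the request")
```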

Map state iterates over an array and executes a sub-workflow for each item. Two modes:

  • Inline Map keeps all iterations in the parent execution history. Good for small arrays (up to 40 concurrent iterations).
  • Distributed Map launches child executions and can process up to 10000 concurrent iterations, reading input from S3 with per-item throughput isolation. This is the professional-depth answer for "process a million rows in parallel."

Parallel state runs a fixed set of branches concurrently and waits for all to finish. Use for independent sub-tasks known at design time (fetch customer, fetch inventory, fetch pricing).

Choice state branches on a JSON condition. Use for content-based routing inside a workflow.

Circuit Breaker via Retry + Catch + DLQ

Step Functions gives you a natural circuit breaker pattern without a dedicated service:

  1. Each Task state has a Retry array with per-error-type exponential backoff, maximum attempts, and jitter.
  2. After retries exhaust, Catch diverts to a named error-handling state.
  3. The error-handling state writes to an SQS DLQ or EventBridge failure bus for offline replay.
  4. A CloudWatch alarm on ExecutionsFailed above a threshold triggers a circuit-breaker Lambda that flips a Parameter Store flag; upstream producers read the flag and stop enqueuing.

The default Retry configuration retries any error three times with a 1.0 backoff rate. For a Lambda that writes to RDS, this can quadruple write pressure during a partial outage. In event-driven serverless architecture at Pro depth, always specify IntervalSeconds, MaxAttempts, BackoffRate, and JitterStrategy: FULL explicitly, and scope Retry to specific error types. Reference: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
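An explicit error-handling fragment along those lines, as a Python dict holding the ASL Retry and Catch arrays (the error names and the DLQ state name are illustrative):

```python
# Retry only throttling-style errors with full-jitter exponential backoff;
# everything else (and exhausted retries) dead-letters via Catch.
rds_task_error_handling = {
    "Retry": [{
        "ErrorEquals": ["Lambda.TooManyRequestsException",
                        "States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 4,
        "BackoffRate": 2.0,
        "JitterStrategy": "FULL",   # spreads retries to avoid thundering herd
    }],
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "SendToDlq",        # state that posts input to an SQS DLQ
    }],
}
```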

Step Functions SDK Integration Modes

Task states invoking AWS SDK operations come in two flavors:

  • Optimized integrations: pre-built for hot-path services (Lambda, SQS, SNS, ECS RunTask, DynamoDB PutItem). Built-in invocation patterns such as .sync (wait for the invoked workflow or task) and .waitForTaskToken, plus synchronous invocation of Express child workflows via the StartSyncExecution API.
  • Non-optimized SDK integrations: raw AWS SDK call to any of 200-plus services. Costs per state transition only, no .sync variant.

Use .sync when Step Functions should block until the child (ECS task, Glue job, Lambda) completes. Without .sync, Task returns immediately after starting the child.

AWS Lambda at Professional Scale

Lambda is the ephemeral compute in event-driven serverless architecture. Pro-depth Lambda covers concurrency controls, SnapStart, cold-start mitigation trees, and quota strategy.

Account-Level and Per-Function Quotas

Lambda quotas that matter on SAP-C02:

  • Account concurrent executions: 1000 per Region by default (soft limit, raise via Service Quotas).
  • Burst concurrency: an initial burst of 500, 1000, or 3000 units depending on Region. After the burst, Lambda adds 500 units per minute until reaching the account quota.
  • Unreserved concurrency minimum: at least 100 units must remain unreserved for uncommitted functions.
  • Invocations per Region: no hard cap (limited by concurrency).
  • Async event age: up to 6 hours in the async queue before discard.
  • Async retry attempts: 2 by default, configurable 0 to 2.
  • Event source mapping SQS batch size: up to 10000 for Standard, 10 for FIFO.

Reserved vs Provisioned Concurrency vs SnapStart Composition

Three controls with distinct semantics:

  • Reserved concurrency sets both a ceiling and a guaranteed floor carved from the account pool. Does not prevent cold starts. Set to zero to disable a function as an emergency circuit breaker.
  • Provisioned concurrency pre-initializes N execution environments that stay warm; invocations up to N have no cold start. Costs hourly per provisioned unit. Can be auto-scaled via Application Auto Scaling target tracking on ProvisionedConcurrencyUtilization.
  • SnapStart snapshots the initialized execution environment after init and restores from the snapshot on cold start. Supported on Java, Python, and .NET managed runtimes (not container images); free on Java, while Python and .NET bill for snapshot caching and restoration. Drops cold starts from seconds to sub-second.

Composition is common in Pro scenarios. A fraud-check Lambda might have reserved concurrency 200 (cap blast radius on downstream RDS), provisioned concurrency 50 (zero cold start for the p99 floor), and SnapStart enabled on its Java runtime as additional insurance.
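That composition can be sketched as an ordered plan of Lambda control-plane calls; the function and alias names are hypothetical, and apply_plan replays the plan against a real boto3 Lambda client:

```python
def concurrency_plan(fn: str = "fraud-check", alias: str = "live") -> list:
    """Reserved cap, provisioned floor, and SnapStart for one function."""
    return [
        # Ceiling and guaranteed floor carved from the account pool.
        ("put_function_concurrency",
         {"FunctionName": fn, "ReservedConcurrentExecutions": 200}),
        # Warm pool for the p99 floor; billed hourly per unit.
        ("put_provisioned_concurrency_config",
         {"FunctionName": fn, "Qualifier": alias,
          "ProvisionedConcurrentExecutions": 50}),
        # SnapStart applies to versions published after this change.
        ("update_function_configuration",
         {"FunctionName": fn,
          "SnapStart": {"ApplyOn": "PublishedVersions"}}),
    ]

def apply_plan(lambda_client, plan: list) -> None:
    for api, kwargs in plan:
        getattr(lambda_client, api)(**kwargs)
```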

Cold Start Mitigation Decision Tree

Use this tree on SAP-C02 scenarios asking to reduce p99 latency:

  1. Is the runtime Java, .NET, or Python on a managed runtime? Enable SnapStart first (free on Java; modest cache-and-restore charges on Python and .NET).
  2. Is latency still above SLA after SnapStart or is it a container image runtime? Add Provisioned Concurrency sized to p95 traffic.
  3. Is the traffic pattern predictable (business hours spike)? Use Application Auto Scaling with scheduled actions on provisioned concurrency.
  4. Is the function VPC-attached? Confirm Hyperplane ENI (modern VPC-Lambda); old-style ENI creation is deprecated.
  5. Is the deployment package large? Move heavy dependencies to a Layer; prune unused code.
  6. Is the handler doing heavy init? Move to init phase so SnapStart captures it warm; lazy-init items that cannot be snapshotted.
  7. Is memory right-sized? Higher memory means faster init (CPU scales with memory).
  8. Is the invoker synchronous with bursty traffic? Add SQS buffer and convert to poll-based; burst is smoothed by the queue.

Event Source Mappings at Scale

Lambda event source mappings (ESM) for SQS, Kinesis, DynamoDB Streams, Kafka, and MQ are Lambda-managed pollers. Pro-depth tuning knobs:

  • Batch size: number of records per invocation. Bigger batch = fewer invocations = lower per-record cost but higher risk on partial failures.
  • Batch window (SQS, Kinesis, DDB Streams): up to 5 minutes; the poller waits to accumulate a fuller batch before invoking.
  • Parallelization factor (Kinesis, DDB Streams): up to 10 concurrent invocations per shard, multiplying effective parallelism without resharding.
  • Filter criteria: server-side filter drops non-matching events before Lambda charges apply; up to 5 filter patterns per ESM.
  • Report batch item failures (SQS, Kinesis, DDB): partial-batch response returns only failed record IDs to redeliver, avoiding full-batch reprocess.
  • Maximum concurrency (SQS only): caps Lambda pollers' concurrent invocations without setting reserved concurrency on the function — protects downstream resources without blocking other invokers.
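A minimal partial-batch handler for SQS, illustrating the report-batch-item-failures contract; the process function here is a stand-in for real business logic:

```python
def process(body: str) -> None:
    # Hypothetical business logic; raises on a poison record.
    if body == "bad":
        raise ValueError("cannot process")

def handler(event, context=None):
    """SQS batch handler returning partial-batch failures so only the
    failed records are redelivered. ReportBatchItemFailures must be
    enabled on the event source mapping for this response to matter."""
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            # Only this record returns to the queue; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list acknowledges the whole batch; raising instead would force the entire batch to be redelivered.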

Reserved = floor and cap, no warming, set to 0 for kill switch. Provisioned = warm pool, hourly cost, auto-scalable. SnapStart = snapshot-based cold-start elimination for Java/Python/.NET managed runtimes (free on Java, priced on Python and .NET; no container images). SQS ESM maximum concurrency = cap Lambda pollers without reserved concurrency. Compose all four for a regulated financial workload. Reference: https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html

AWS AppSync for Real-Time GraphQL in Event-Driven Serverless Architecture

AppSync is the managed GraphQL endpoint and the real-time event-driven edge in event-driven serverless architecture. SAP-C02 tests AppSync when the scenario mentions "real-time," "mobile clients," or "GraphQL subscriptions."

AppSync Capabilities

  • GraphQL API (queries, mutations, subscriptions) with schema-first development.
  • Resolvers written in JavaScript or VTL mapping to data sources: DynamoDB (direct), Aurora Serverless via Data API, Lambda, HTTP endpoints, OpenSearch, EventBridge.
  • Real-time subscriptions over WebSocket (the older MQTT-over-WebSocket transport is legacy); clients subscribe to mutation results filtered by arguments.
  • Event-driven mutations where a mutation triggers a subscription fan-out and simultaneously writes to a data source.
  • Caching at resolver or query level.
  • Authorization modes: API key, IAM, Cognito User Pools, OIDC, Lambda authorizer — combinable per-type.
  • Merged APIs to compose multiple source GraphQL APIs into one endpoint.

When AppSync Fits Event-Driven Serverless Architecture

Use AppSync instead of API Gateway plus Lambda when:

  • Clients need real-time updates without polling (chat, collaboration, dashboards).
  • Mobile clients prefer single-request composition over multiple REST calls.
  • Schema-driven contracts are required and the team uses GraphQL tooling.
  • EventBridge events should push to connected mobile clients via subscription.

AppSync integrates with EventBridge so that any event arriving on a bus can trigger an AppSync mutation, which cascades into subscription notifications to every connected client. This is the Pro-depth answer for "push real-time IoT telemetry to a mobile dashboard."

AWS IoT Core as Event Source in Event-Driven Serverless Architecture

IoT Core is the MQTT broker, device registry, and rules engine for device fleets. IoT Core matters in event-driven serverless architecture because fleet telemetry is the single largest source of asynchronous events in many enterprise architectures.

IoT Core Building Blocks

  • Device Gateway: MQTT, MQTT over WebSocket, or HTTPS ingestion with mutual TLS authentication.
  • Registry: metadata for things, thing types, thing groups.
  • Device Shadow: durable per-device JSON state (desired vs reported vs delta).
  • Rules engine: SQL-like statements against MQTT topic streams routing to AWS services.
  • IoT Core message broker supports millions of concurrent device connections.

Rules Engine as Event Router

Rules engine SQL filters MQTT messages by topic and payload content, then routes to any of: Lambda, SQS, SNS, Kinesis Data Streams, Kinesis Firehose, DynamoDB, S3, Timestream, CloudWatch, Step Functions, EventBridge bus, republish to another MQTT topic. Multiple rules can fire on the same topic for fan-out.

For event-driven serverless architecture at Pro depth, route IoT rules to EventBridge when the event needs to feed the broader enterprise bus, to Kinesis Data Streams when downstream needs ordered per-device analytics, and to Timestream when the event is time-series telemetry.
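A sketch of a rules-engine payload for iot.create_topic_rule routing over-temperature telemetry to Kinesis with per-device partition keys; the topic layout, threshold, and names are illustrative assumptions:

```python
def telemetry_rule(role_arn: str, stream_name: str) -> dict:
    """topicRulePayload for a boto3 IoT client's create_topic_rule.
    topic(2) extracts the device-id segment of fleet/<deviceId>/telemetry,
    giving ordered per-device delivery on the Kinesis shard."""
    return {
        "sql": "SELECT *, topic(2) AS deviceId "
               "FROM 'fleet/+/telemetry' WHERE temperature > 80",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [{
            "kinesis": {
                "roleArn": role_arn,
                "streamName": stream_name,
                "partitionKey": "${topic(2)}",  # device-id keyed ordering
            },
        }],
    }
```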

IoT Core Quotas That Drive Design

  • Messages per second per account: 20000 default; 100000 to 1M per Region with quota increase.
  • Device connections per account: 500000 default; tens of millions with quota increase.
  • Rules engine actions per rule: up to 10.
  • Message payload: 128 KB maximum.

For a million-device fleet at 5000 events per second (the scenario we build later), default quotas are sufficient without increase, but placement (Region selection, sharding across buses) still matters for blast radius.

Choreography vs Orchestration in Event-Driven Serverless Architecture

This trade-off is tested on almost every SAP-C02 event-driven serverless architecture question.

Choreography

Services publish events to EventBridge and independently subscribe to what they care about. No central coordinator. OrderService publishes OrderPlaced; PaymentService subscribes and publishes PaymentCompleted or PaymentFailed; InventoryService subscribes to PaymentCompleted and reserves stock.

Pros:

  • Scales linearly; no single orchestrator bottleneck.
  • New consumers added without modifying producers.
  • Natural loose coupling.

Cons:

  • No global view of flow state; debugging distributed failures is hard.
  • Rollback (saga) logic is scattered across services.
  • Emergent behavior: adding a subscriber can create event storms unintentionally.

Orchestration

Step Functions holds the flow. State machine calls OrderService, then PaymentService, then InventoryService, with explicit retry and catch. Saga compensation lives in one place.

Pros:

  • Global view of flow in Step Functions execution history.
  • Centralized error handling and compensation.
  • Easier onboarding for new engineers.

Cons:

  • Orchestrator can become a scale or reliability bottleneck under naive design (use Express for high volume, Distributed Map for large fan-out).
  • Tight coupling between orchestrator and step implementations.

How to Choose

On SAP-C02, apply these tests:

  • Does the business require an audit trail of every multi-step execution? Orchestration.
  • Is the flow long-running with human approval? Orchestration (Standard) with callback task tokens.
  • Is the flow short, high-volume, and does each service already own its logic? Choreography on EventBridge.
  • Do cross-service failures require compensating actions? Orchestration with Saga state machine, or choreography with a dedicated compensation coordinator service (rare).

Hybrid designs are common: choreography at the top level (OrderPlaced fans out to four services), orchestration inside each service (PaymentService runs a Step Functions saga for charge, rollback, refund).

Outbox Pattern with DynamoDB Streams

The dual-write problem: a service must atomically update its database and publish an event. If the database write succeeds but the event publish fails (or vice versa), state diverges. The outbox pattern fixes this by writing the event to a database-local outbox table in the same transaction as the state change, and a separate process forwards outbox rows to the broker.

DynamoDB Streams Implementation

On AWS, the canonical outbox for event-driven serverless architecture uses DynamoDB:

  1. Application writes both the aggregate item and an outbox-row item in a single DynamoDB transaction (TransactWriteItems).
  2. DynamoDB Streams captures both inserts as stream records.
  3. A filter (EventBridge Pipe filter or Lambda ESM filter) selects only outbox-row records.
  4. EventBridge Pipe forwards to the target bus or to EventBridge directly.
  5. The outbox row optionally has a TTL attribute so DynamoDB auto-expires consumed rows; or a cleanup Lambda deletes after confirmed delivery.

Why this works: DynamoDB Streams is ordered per partition key and at-least-once, so every outbox row is delivered. The single-transaction write means the outbox row exists if and only if the aggregate write committed. Idempotency on the consumer side covers duplicate deliveries.
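The single-transaction write can be sketched as a helper that builds the TransactWriteItems payload, assuming an illustrative single-table layout where aggregate and outbox rows share one table:

```python
import json
import uuid

def outbox_transaction(order: dict, table: str = "orders") -> list:
    """Aggregate item plus outbox row, committed atomically. Table name,
    key schema, and attributes are illustrative."""
    event_id = str(uuid.uuid4())
    return [
        {"Put": {"TableName": table, "Item": {
            "pk": {"S": f"ORDER#{order['id']}"},
            "status": {"S": "PLACED"},
        }}},
        {"Put": {"TableName": table, "Item": {
            "pk": {"S": f"OUTBOX#{event_id}"},
            "type": {"S": "OrderPlaced"},
            "payload": {"S": json.dumps(order)},
        }}},
    ]

def publish(dynamodb_client, order: dict) -> None:
    # Either both items commit or neither does; the stream sees both inserts.
    dynamodb_client.transact_write_items(
        TransactItems=outbox_transaction(order))
```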

The outbox pattern is a messaging pattern that eliminates the dual-write problem by storing outbound events in the same database transaction as the state change, then delivering them to the broker asynchronously through a change-data-capture stream. On AWS, DynamoDB Streams plus EventBridge Pipes is the canonical implementation in event-driven serverless architecture. Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

Dead-Letter Queues, Redrive, and Observability

Dead-letter queues (DLQs) capture messages that failed processing after retries. In event-driven serverless architecture, DLQs appear at several layers: SQS per-queue, SNS per-subscription, Lambda per-function (legacy), EventBridge per-rule, Step Functions via Catch to SQS.

DLQ Configuration Layers

  • SQS redrive policy: set maxReceiveCount on the source queue; after N receives without deletion, SQS moves the message to the DLQ queue.
  • SNS redrive policy: per-subscription DLQ captures deliveries that failed after retries (typically to HTTP endpoints or cross-account targets).
  • Lambda destinations (on-failure): replace legacy Lambda DLQ with a richer context payload routed to SQS, SNS, EventBridge, or another Lambda.
  • EventBridge rule DLQ: rule-level SQS DLQ captures events that could not be delivered to the target (target throttling, invalid payload, missing permission).
  • Step Functions Catch to SQS: Catch all error types into a state that posts the failed execution input to an SQS DLQ for offline triage.
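The SQS redrive policy in the first bullet is a JSON attribute set on the source queue, not the DLQ. A minimal sketch — the count of 5 is an illustrative choice:

```python
import json

def redrive_policy(dlq_arn, max_receive_count=5):
    """Value for the RedrivePolicy queue attribute on the SOURCE queue:
    after max_receive_count receives without deletion, SQS moves the
    message to the DLQ identified by dlq_arn."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })

# boto3 usage (sketch):
# sqs.set_queue_attributes(QueueUrl=source_url,
#                          Attributes={"RedrivePolicy": redrive_policy(dlq_arn)})
```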

SQS Redrive to Source

Redrive from a DLQ back to the source queue is a native SQS capability (first in the console, later via the StartMessageMoveTask API) that removed the need for custom scripts. After fixing the consumer bug, you move the messages from the DLQ back to the original queue for reprocessing. Event-driven serverless architecture at Pro depth relies on this for safe recovery after deploy-bug incidents.

Observability Stack for Event-Driven Serverless Architecture

Pro-depth event-driven serverless architecture needs:

  • AWS X-Ray active tracing on Lambda, API Gateway, Step Functions, SNS, and SQS for end-to-end trace maps. X-Ray propagates via trace headers through async boundaries when source services enable active tracing.
  • CloudWatch Embedded Metrics Format for high-cardinality metrics (per-customer, per-device) without the cost of CloudWatch custom metric API calls.
  • CloudWatch Logs Insights for cross-function queries across many log groups.
  • EventBridge Archive doubling as an event audit log (plus Firehose to S3 plus Athena for ad-hoc querying).
  • Step Functions execution history for workflow-level audit.
  • AWS Lambda Powertools (Python, Java, TypeScript, .NET) libraries implementing structured logging, tracing, and idempotency patterns consistently across functions.
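The EMF bullet works by printing a specially shaped JSON object to the function's log stream; CloudWatch extracts the metric from logs with no PutMetricData call. A hand-rolled sketch of that shape (Lambda Powertools' Metrics utility emits the equivalent structure; the namespace and dimension names here are invented):

```python
import json
import time

def emf_record(namespace, metric, value, unit, dimensions):
    """Build a CloudWatch Embedded Metric Format record. Printing the JSON
    to stdout inside Lambda is enough: CloudWatch parses the _aws envelope
    and materializes the metric, dimensions included, from the log line."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one dimension set
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        **dimensions,          # dimension values live at the top level
        metric: value,         # so does the metric value itself
    }

print(json.dumps(emf_record("Fleet", "StateChanges", 1, "Count", {"TenantId": "t-42"})))
```

High-cardinality dimensions such as per-tenant or per-device IDs stay affordable because the cost is log ingestion, not one custom-metric API call per data point.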

X-Ray trace IDs propagate only if the upstream service enables active tracing. A common SAP-C02 anti-pattern has API Gateway with active tracing on, Lambda with active tracing on, but EventBridge or SNS fan-out in between without trace headers — the trace map dies at the async boundary. Enable tracing at every AWS entry point and use Lambda Powertools to forward trace IDs across custom boundaries. Reference: https://docs.aws.amazon.com/xray/latest/devguide/xray-services.html

Scenario: IoT Fleet of One Million Devices at 5000 Events per Second with Regional Compliance

This is the canonical SAP-C02 event-driven serverless architecture scenario. Requirements:

  • Fleet: 1 million IoT devices globally.
  • Ingestion: 5000 events per second sustained, 20000 per second peak during firmware rollouts.
  • Compliance: EU-originated device data must not leave eu-west-1; US data must not leave us-east-1.
  • Reliability: target 99.95% ingestion availability; no data loss on Regional control-plane impairment.
  • Real-time dashboards: fleet operators need sub-second updates of device state changes.
  • Batch analytics: monthly aggregate across all Regions.
  • Operations: small team, prefer fully managed services.

Architecture Walkthrough

  1. Device provisioning: devices registered per-Region in IoT Core (eu-west-1 and us-east-1). Certificate-based mTLS auth. Device groups tag devices by tenant and compliance zone.
  2. Ingestion: devices publish MQTT to their home-Region IoT Core endpoint. IoT rules engine filters and routes:
    • Telemetry rule: publishes to Kinesis Data Streams in the same Region for ordered per-device analytics.
    • State-change rule: publishes to EventBridge custom bus device-events in the same Region.
    • Alarm rule: publishes directly to SNS topic with SMS and Lambda subscribers for on-call escalation.
  3. Per-Region EventBridge bus: device-events bus receives state changes. Rules route to:
    • Lambda handlers (device-state-update) that UPSERT DynamoDB device-state table with DynamoDB Streams enabled.
    • AppSync mutation via EventBridge target, which pushes subscription updates to mobile and web dashboards.
    • Archive with 30-day retention for replay in case of downstream bug.
  4. Outbox for enterprise events: device-state-update Lambda writes both the aggregate and an outbox row in a DynamoDB transaction. EventBridge Pipe from DynamoDB Streams filters type == outbox and forwards to the device-events bus, guaranteeing durable publishing.
  5. Orchestration for firmware rollout: an operator triggers a Step Functions Standard workflow that Distributed-Maps over 1 million devices, reading the device list from S3, invoking a Lambda per device to send the firmware-update command via IoT Jobs, with callback task token waiting for the device's reported-update state. Retries on transient failure; Catch to an SQS DLQ per device for later triage.
  6. Real-time dashboards: AppSync subscriptions over WebSocket deliver device state changes to connected clients. Subscription filter by tenant ID.
  7. Batch analytics: Kinesis Data Streams in each Region feeds Firehose to S3 in the same Region. An AWS Glue crawler catalogs the partitions; Athena queries aggregate across Regions through a central analytics account via Lake Formation cross-account sharing. Compliance is preserved because raw per-device data stays in Region — only aggregates cross.
  8. Cross-region resilience: EventBridge global endpoint in front of the device-events bus in both Regions, with a Route 53 health check monitoring the primary Region's PutEvents availability. If eu-west-1 fails, eu-west-3 takes over as secondary for EU traffic. IoT Core itself requires per-Region client configuration, so a secondary IoT Core endpoint and device-side fallback logic must be handled in device firmware; AWS cannot solve that layer alone.
  9. Observability: all Lambdas use Powertools for structured logging and X-Ray tracing; Step Functions execution history retained 90 days; CloudWatch Synthetics probes hit IoT endpoints every minute; Security Hub aggregates GuardDuty and Config findings to a central security account; EventBridge Archive on both Region buses for 30 days enables incident replay.

Design Decisions That Make This a Pro-Level Answer

  • Custom bus per Region, not shared global: preserves data residency and isolates blast radius. A global bus would violate compliance.
  • EventBridge Pipes over Lambda forwarder: for DynamoDB Streams to device-events, Pipes is cheaper and removes a Lambda from the critical path.
  • Distributed Map over inline Map: one million devices exceeds inline Map's 40 concurrent iteration ceiling by four orders of magnitude.
  • Callback task token on firmware update: device updates are asynchronous and can take minutes; polling DynamoDB for state would be wasteful. Task token pauses the Step Function until IoT rules engine calls SendTaskSuccess.
  • AppSync mutation triggered from EventBridge rule: replaces a polling dashboard with push; sub-second latency from device event to dashboard.
  • Outbox pattern: eliminates dual-write risk on device-state updates that must produce enterprise events reliably.
  • Global endpoint only for the bus, not for IoT Core: IoT Core endpoints are per-Region; device firmware must handle failover at the transport layer.

Common Traps in Event-Driven Serverless Architecture on SAP-C02

Trap patterns repeat on every SAP-C02 sitting. Master these.

Trap 1: Using Default Bus for Application Events

The default bus is for AWS service events. Application events belong on a custom bus. Scenarios hinting at DDD, microservices, or cross-team ownership that put events on the default bus are wrong.

Trap 2: Standard Workflow for High-Volume Events

Standard supports roughly 2000 StartExecution calls per second and bills per state transition. A workflow running 50 million executions per month, each under five minutes, is Express territory. Do not default to Standard just because it sounds more reliable; Express's at-least-once semantics are acceptable when consumers are idempotent.

Trap 3: Reserved Concurrency Proposed as Cold-Start Fix

Reserved is a cap, not a warm pool. For cold starts, Provisioned Concurrency or SnapStart is correct.

Trap 4: EventBridge Cross-Region Treated as Free

Cross-Region PutEvents is charged in the destination Region and adds latency. Do not propose it casually in cost-sensitive scenarios.

Trap 5: SNS FIFO With SQS Standard Subscriber

SNS FIFO can only publish to SQS FIFO queues. A scenario asking for ordering with a Standard SQS subscriber is a mismatch.

Trap 6: Lambda DLQ Confused With Destinations

Lambda DLQ is legacy and async-only. Lambda Destinations cover success and failure and support richer targets. For new event-driven serverless architecture designs, Destinations is preferred.

Trap 7: Inline Map Used for Large Fan-Out

Inline Map caps at 40 concurrent iterations and records every iteration in the parent execution's history, which can exhaust the 25000-event history limit. For more than 40 concurrent iterations, or input sets too large to carry in the state payload, use Distributed Map with an S3 ItemReader.
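The contrast can be made concrete with an ASL fragment, written here as a Python dict. A sketch under assumptions — bucket, key, and state names are invented, and the inner Task is elided to its skeleton:

```python
# Distributed Map state: each child runs as its own (here Express) execution,
# reading items from an S3 CSV manifest instead of the parent's payload.
fan_out = {
    "Type": "Map",
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"},
        "Parameters": {"Bucket": "device-manifests", "Key": "fleet.csv"},
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "SendFirmwareCommand",
        "States": {
            "SendFirmwareCommand": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "End": True,
            }
        },
    },
    "MaxConcurrency": 1000,  # far past inline Map's ceiling of 40
    "End": True,
}
```

Child executions keep their own histories, so the parent's 25000-event budget is spent on Map bookkeeping, not per-item detail.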

Trap 8: Choreography Proposed Where Compensation Is Required

If the scenario requires rollback on partial failure across multiple services, orchestration with a Saga state machine is usually correct. Pure choreography makes compensation hard to debug at scale.

Trap 9: Polling-Based Dashboards Instead of AppSync Subscriptions

SAP-C02 scenarios mentioning real-time dashboards for tens of thousands of clients should use AppSync subscriptions. Polling API Gateway plus Lambda is cost-inefficient at scale and misses the real-time requirement.

Trap 10: Ignoring Archive and Replay for Replay Requirements

"Replay the last 24 hours of events" points to EventBridge Archive and Replay. Home-grown Firehose-to-S3-plus-Lambda-replay is an anti-pattern when the source is EventBridge.

Async Lambda invocation retries twice by default with exponential backoff before sending to DLQ or Destinations. During a partial RDS outage, those retries can triple write pressure. At Pro depth, configure MaximumRetryAttempts: 0 for functions that write to resources under your own retry logic, or use Step Functions for explicit retry control instead of Lambda's built-in async retry. Reference: https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
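The `MaximumRetryAttempts: 0` setting above maps to Lambda's PutFunctionEventInvokeConfig API. A kwargs sketch — the function name and the one-hour event age are illustrative choices:

```python
def async_invoke_config(function_name, max_retries=0, max_age_s=3600):
    """Kwargs for lambda.put_function_event_invoke_config (boto3).
    MaximumRetryAttempts=0 turns off Lambda's built-in async retries so
    they do not stack on top of your own (or Step Functions') retry logic."""
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,       # Lambda's default is 2
        "MaximumEventAgeInSeconds": max_age_s,     # drop events older than this
    }

# boto3 usage (sketch):
# boto3.client("lambda").put_function_event_invoke_config(
#     **async_invoke_config("orders-writer"))
```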

Event-Driven Serverless Architecture vs Neighboring SAP-C02 Topics

SAP-C02 keeps event-driven serverless architecture separate from adjacent topics. Know the borders.

  • event-driven-serverless-architecture vs new-solutions-reliability: this topic owns the event layer and the serverless compute; reliability topic owns multi-AZ, multi-Region, circuit breakers, and quota planning across all compute. Lambda concurrency details sit in both, but the event flow design sits here.
  • event-driven-serverless-architecture vs containerization-ecs-eks: the container topic owns long-running worker pools consuming events. When a Fargate worker reads from SQS that Lambda cannot handle (> 15 min), the SQS design and the event path are here; the ECS service details are in containers.
  • event-driven-serverless-architecture vs data-analytics-at-scale: Kinesis Data Streams is shared. For ordered telemetry feeding Glue and Athena, the analytics topic owns the batch path; the event broker design lives here.
  • event-driven-serverless-architecture vs new-solutions-performance: AppSync real-time GraphQL sits here because of its event-driven mutation model; caching, DynamoDB DAX, and CloudFront live in performance.
  • event-driven-serverless-architecture vs cicd-iac-deployment-strategy: Lambda alias traffic shifting and blue/green for serverless deployments belong to CI/CD; the workflow and event design belongs here.

Key Numbers to Memorize for Event-Driven Serverless Architecture

  • EventBridge PutEvents: 10000 per second per Region default (soft limit); event size up to 256 KB.
  • EventBridge rule targets: up to 5 per rule.
  • EventBridge Archive retention: 1 day to indefinite.
  • EventBridge global endpoints: two Regions active-passive with Route 53 health check.
  • Step Functions Standard: 1 year maximum execution, 2000 StartExecution per second, 25000 open executions default, 90-day history retention.
  • Step Functions Express: 5 minutes maximum execution, 100000 state transitions per second default.
  • Step Functions Distributed Map: up to 10000 concurrent child executions, input from S3.
  • Step Functions callback task token: 1 year max wait on Standard, 5 minutes on Express.
  • Lambda: 15-minute max timeout, 10 GB memory, 10 GB /tmp, 1000 concurrent per Region default.
  • Lambda Provisioned Concurrency: billed per GB-second while provisioned plus per-invocation discount.
  • Lambda SnapStart: Java, Python, and .NET managed runtimes (no container images); free for Java, priced (cache plus restore charges) for Python and .NET.
  • Lambda async: up to 6 hours in queue, up to 2 retries.
  • SQS Standard: nearly unlimited throughput, at-least-once delivery; SQS FIFO: exactly-once processing and strict ordering at 300 TPS, or 3000 TPS with batching (higher with high-throughput mode).
  • AppSync subscriptions: WebSocket connections, sub-second push.
  • IoT Core: 20000 msg/sec per account default, raisable to 1M; 500000 device connections default.
  • DynamoDB Streams: 24-hour retention, shard-level ordered, at-least-once.

FAQ — Event-Driven Serverless Architecture Top Questions

Q1. When should I use EventBridge instead of SNS for fan-out in an event-driven serverless architecture?

Use EventBridge when you need content-based routing, schema registry, archive and replay, cross-account or cross-Region routing, or partner event sources. Use SNS when you need ultra-low-latency fan-out, very wide fan-out (SNS supports up to 12.5 million subscriptions per topic), SMS or email delivery, or mobile push. EventBridge charges per event ingested; SNS charges per publish plus per delivery. For five or more targets with different filtering logic, EventBridge rules are cleaner than multiple SNS filter policies. For strict FIFO with SQS FIFO subscribers, SNS FIFO is the only answer (EventBridge does not guarantee ordering). In event-driven serverless architecture on SAP-C02, EventBridge is the broker default, and SNS is the specialized fan-out when its unique features are required.
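Content-based routing, the first EventBridge differentiator here, is expressed as an event pattern on a rule. A sketch — the source name, field names, and thresholds are invented for illustration:

```python
import json

# Match only EU devices reporting low battery: prefix and numeric matching
# in one pattern, which SNS filter policies cannot express as richly.
pattern = {
    "source": ["device.telemetry"],
    "detail-type": ["StateChange"],
    "detail": {
        "region": [{"prefix": "eu-"}],
        "batteryPct": [{"numeric": ["<", 20]}],
    },
}
event_pattern_json = json.dumps(pattern)

# boto3 usage (sketch):
# events.put_rule(Name="low-battery-eu", EventBusName="device-events",
#                 EventPattern=event_pattern_json)
```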

Q2. How do I choose between Step Functions Standard and Express for a workflow in event-driven serverless architecture?

Standard workflow supports up to one year of execution, exactly-once state transitions, 90 days of execution history retention, asynchronous invocation only, activity task workers, and callback task tokens up to one year. Express workflow supports up to five minutes of execution, at-least-once state transitions, no built-in execution history (CloudWatch Logs only), synchronous or asynchronous invocation, and up to 100000 state transitions per second. Choose Standard for long-running orchestration, human approval workflows with callback task tokens beyond five minutes, saga patterns requiring audited compensation, and any flow where execution history audit is a requirement. Choose Express for high-volume event-processing workflows, API request orchestration, streaming pipelines, and any flow that fits under five minutes and needs lower per-execution cost. You can also nest Express workflows inside Standard workflows using the states:startExecution.sync service integration to combine the best of both — an audited parent with fast children.

Q3. What is the outbox pattern and how do I implement it on AWS in event-driven serverless architecture?

The outbox pattern eliminates the dual-write problem where a service must atomically update its database and publish an event to a broker. Without outbox, the database write might succeed while the event publish fails (or vice versa), leaving state divergent. On AWS, the canonical implementation is: in a single DynamoDB TransactWriteItems, write both the aggregate item and an outbox-row item representing the event. DynamoDB Streams captures both writes at-least-once in order per partition key. An EventBridge Pipe with a filter selecting only outbox rows forwards them to the target EventBridge bus, with built-in retries and DLQ. The outbox row can have a TTL for auto-cleanup or be deleted explicitly after confirmed delivery. Consumer idempotency (using an event ID stored in a DynamoDB idempotency table with conditional writes) handles at-least-once duplicates. This pattern is the SAP-C02 answer for any scenario that requires guaranteed event publication for every state change.
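The consumer-side idempotency table mentioned above reduces to one conditional write: insert the event ID, and treat a condition failure as "already processed". A kwargs sketch for the low-level boto3 client — the table and key names are assumptions:

```python
def idempotency_put(table, event_id, ttl_epoch):
    """Kwargs for dynamodb.put_item: a conditional write that succeeds only
    the first time an event ID is seen. A duplicate delivery raises
    ConditionalCheckFailedException, and the consumer skips reprocessing."""
    return {
        "TableName": table,
        "Item": {
            "pk": {"S": f"EVENT#{event_id}"},
            "ttl": {"N": str(ttl_epoch)},  # expire the marker automatically
        },
        "ConditionExpression": "attribute_not_exists(pk)",
    }

# boto3 usage (sketch):
# try:
#     ddb.put_item(**idempotency_put("idempotency", event_id, expires))
# except ddb.exceptions.ConditionalCheckFailedException:
#     return  # duplicate delivery: already handled
```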

Q4. How does EventBridge Pipes differ from a Lambda glue function, and when should I choose which in event-driven serverless architecture?

EventBridge Pipes is a managed point-to-point integration with four stages: source, filter, optional enrichment, and target. Pipes supports SQS, Kinesis, DynamoDB Streams, Kafka, MQ as sources, and any EventBridge target as the destination. Choose Pipes when the integration is a straightforward forward or simple filter-and-reshape, when you want to replace boilerplate Lambda code, when cost per event is sensitive (Pipes charges per event instead of per Lambda millisecond), and when ordering must be preserved for stream sources. Choose a Lambda glue function when the transformation is complex, when enrichment requires multiple downstream calls with custom logic, when you need SDK-level control, or when the existing architecture has deep Lambda observability instrumentation. A hybrid pattern uses Pipes with an enrichment Lambda: Pipes handles source polling, filtering, retries, and DLQ, while a small enrichment Lambda does only the business transformation, keeping the Lambda simpler than a full glue function.

Q5. How do I prevent Lambda from overwhelming a downstream relational database in an event-driven serverless architecture?

Multiple controls compose to protect a downstream RDS or Aurora database from Lambda concurrency spikes. First, use SQS as a buffer in front of Lambda with an event source mapping, so Lambda poll rate is shaped by the queue, not direct synchronous pressure. Second, set the SQS event source mapping's MaximumConcurrency parameter to cap the number of concurrent Lambda invocations from that specific queue, leaving unreserved concurrency available for other functions. Third, use reserved concurrency on the Lambda function itself as a ceiling across all invokers. Fourth, use Amazon RDS Proxy in front of the database to pool connections and survive Lambda's connection churn; RDS Proxy reuses connections across invocations and handles failover transparently. Fifth, design for idempotency so retries on transient RDS failure do not corrupt state. Sixth, implement Step Functions retry with exponential backoff and jitter if the Lambda is invoked by a state machine, scoped to specific error types (throttling, transient connection failure). Together these controls turn Lambda from a database thundering herd into a well-behaved consumer.
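The second control above — MaximumConcurrency on the SQS event source mapping — is set via ScalingConfig. A kwargs sketch for lambda.create_event_source_mapping; the cap of 20 is an illustrative choice (the documented minimum is 2):

```python
def sqs_esm_config(function_name, queue_arn, max_concurrency=20, batch_size=10):
    """Kwargs for lambda.create_event_source_mapping (boto3). ScalingConfig's
    MaximumConcurrency caps concurrent invocations from this one queue so a
    burst cannot exhaust the connection pool behind RDS Proxy, while leaving
    the account's unreserved concurrency free for other functions."""
    return {
        "FunctionName": function_name,
        "EventSourceArn": queue_arn,
        "BatchSize": batch_size,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},  # min is 2
    }

# boto3 usage (sketch):
# boto3.client("lambda").create_event_source_mapping(
#     **sqs_esm_config("orders-writer",
#                      "arn:aws:sqs:eu-west-1:111122223333:orders"))
```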

Q6. What is the callback task token pattern in Step Functions and when should I use it in event-driven serverless architecture?

A callback task token (.waitForTaskToken) is a Step Functions Task state integration pattern where the state pauses indefinitely (up to one year on Standard, five minutes on Express) until an external process calls the SendTaskSuccess or SendTaskFailure API with the token. When Step Functions starts the Task with .waitForTaskToken, it generates a unique token and passes it to the Task's resource (Lambda, SQS, SNS, etc.) as part of the payload. The external process — a human approver clicking a link in an email, an IoT device reporting firmware-update completion, a third-party API webhook — captures the token and later calls SendTaskSuccess with a result or SendTaskFailure with an error. Use callback task token when the next step depends on an asynchronous event outside the workflow, when polling would be wasteful, or when the wait time is non-trivial (minutes to days). Common patterns include human approval workflows via SQS-email-webpage, IoT device command acknowledgement, external batch job completion, and third-party API integration with webhook responses. Without task tokens, architects resort to polling loops that are expensive and fragile; task tokens make the wait explicit and state-machine-managed.
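The token flow above can be sketched as an ASL Task state, written here as a Python dict. Queue URL, state names, and the heartbeat value are invented; Step Functions injects the token through the `$$.Task.Token` context path, and the external process later returns it via SendTaskSuccess:

```python
# Task state that sends the token out via SQS, then pauses the execution
# until SendTaskSuccess / SendTaskFailure is called with that token.
wait_for_device = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.eu-west-1.amazonaws.com/111122223333/fw-commands",
        "MessageBody": {
            "deviceId.$": "$.deviceId",
            "taskToken.$": "$$.Task.Token",  # injected by Step Functions
        },
    },
    "HeartbeatSeconds": 3600,  # fail the task if the device never reports in
    "Next": "RecordAck",
}
```

HeartbeatSeconds is the safety valve: without it, a device that never acknowledges would hold the token for up to the full one-year Standard limit.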

Q7. How should I design a multi-region event-driven serverless architecture that respects data residency?

Data residency dictates that events containing regional data must stay in that Region. The architecture uses per-Region custom EventBridge buses, per-Region Lambda consumers, and per-Region DynamoDB tables (not Global Tables if residency forbids replication). Devices or users publish to their home-Region endpoints only; no cross-Region PutEvents for raw data. Aggregated or anonymized events that are allowed to cross can be forwarded via cross-Region EventBridge target to a central analytics Region. For multi-Region resilience of the ingestion path itself, use EventBridge global endpoints in active-passive mode with a Route 53 health check; the secondary Region must be inside the same residency zone (for example, eu-west-1 primary with eu-west-3 secondary for EU data). DR testing must include replay from archive, failover drill, and IoT or client SDK fallback to the secondary endpoint. Compliance artifacts include AWS Config rules asserting that cross-Region replication is disabled on regulated tables, SCPs denying PutEvents to non-residency Regions from workload accounts, and AWS Audit Manager evidence collection. In event-driven serverless architecture on SAP-C02, scenarios mentioning GDPR, HIPAA with Region restrictions, or country-specific data laws always map to this per-Region bus pattern.

Summary — Event-Driven Serverless Architecture at a Glance

  • Event-driven serverless architecture on SAP-C02 is a design discipline combining Amazon EventBridge, AWS Step Functions, AWS Lambda, AWS AppSync, AWS IoT Core, Amazon SQS, and Amazon SNS.
  • EventBridge bus design: default for AWS service events, custom per bounded context, partner per SaaS source, global endpoints for active-passive DR. Use schema registry, archive, and replay at Pro depth; prefer Pipes over Lambda glue for source-filter-enrich-target integration.
  • Step Functions Standard for long-running audited workflows with callback task tokens up to one year; Express for high-volume sub-five-minute processing; Distributed Map for million-scale fan-out; Saga pattern with Retry plus Catch plus compensating states for distributed transactions.
  • Lambda concurrency composes three controls: reserved (cap and floor), provisioned (warm pool with hourly cost), SnapStart (snapshot-based cold-start reduction for Java, Python, and .NET managed runtimes; free for Java). Cold-start mitigation follows a decision tree starting with SnapStart.
  • AppSync delivers real-time GraphQL subscriptions; pair with EventBridge rules for event-driven push to mobile and web clients.
  • IoT Core rules engine routes MQTT to EventBridge, Kinesis, Step Functions, and more; the first-class event source for fleets.
  • Choreography scales with loose coupling; orchestration centralizes compensation and audit. Hybrid designs use both at different layers.
  • Outbox pattern with DynamoDB Streams plus EventBridge Pipes eliminates the dual-write problem.
  • DLQ plus redrive plus X-Ray tracing plus CloudWatch EMF plus archive is the observability stack.
  • The million-device 5000-events-per-second scenario composes per-Region buses, Distributed Map firmware rollout, callback task tokens on device acknowledgement, AppSync subscriptions for dashboards, and EventBridge global endpoints for ingestion resilience.
  • SAP-C02 rewards multi-constraint reasoning: duration, residency, ordering, cost, observability simultaneously. Master event-driven serverless architecture as a composition game, not a service catalog.

Event-driven serverless architecture at this depth is what separates the SAP-C02 candidate who guesses services from the architect who justifies a design under constraint. When you can explain why this scenario uses EventBridge Pipes instead of a Lambda glue, Express instead of Standard, Distributed Map instead of inline Map, and AppSync instead of polling, event-driven serverless architecture stops being an exam topic and becomes a production instinct.

Official sources