AWS Step Functions is the orchestration service that most directly answers Task Statement 4.3 — "Optimize applications by using AWS services and features" — on the AWS Certified Developer Associate (DVA-C02) exam, and it is simultaneously the cleanest answer for distributed error handling scenarios under Task 4.1. AWS Step Functions removes retry loops, timeout babysitting, exception try/catch ladders, and polling code from AWS Lambda functions, replacing them with a JSON state machine written in Amazon States Language (ASL). On DVA-C02, AWS Step Functions appears whenever a question mentions "orchestrate multiple Lambda functions," "manage long-running workflows," "implement a saga pattern," "retry with exponential backoff," "iterate over thousands of S3 objects," or "pause a workflow for human approval." This chapter trains you to recognize every AWS Step Functions construct the DVA-C02 exam can throw at you — Standard vs Express workflows, every ASL state type, intrinsic functions, Retry and Catch semantics, asynchronous callback with task tokens, optimized service integrations, nested workflows, the saga compensation pattern, and Choice-driven circuit breakers. Master AWS Step Functions and you collectively answer roughly 15 percent of Domain 4 on DVA-C02.
What Is AWS Step Functions?
AWS Step Functions is a fully managed serverless orchestrator for distributed applications. With AWS Step Functions you write a state machine (a graph of named states, each with an explicit input, an explicit output, and an explicit next state) as a JSON document in Amazon States Language (ASL). AWS Step Functions executes the state machine by calling AWS services (AWS Lambda, Amazon DynamoDB, Amazon SQS, Amazon SNS, Amazon ECS, AWS Glue, Amazon SageMaker, and many more), managing retries and tracking execution history along the way. You pay only for state transitions (Standard) or for request count and execution duration (Express), depending on workflow type.
AWS Step Functions directly addresses the pain points DVA-C02 loves to probe: AWS Lambda timeouts at 900 seconds, AWS Lambda retry storms when chained via SDK calls, untraceable try/catch ladders spread across three functions, and the observability black hole when a request fans out across five services. AWS Step Functions replaces all of that with a declarative state machine definition, a visual execution view in the console, per-state CloudWatch metrics, and centralized Retry and Catch semantics.
How AWS Step Functions Fits the DVA-C02 Exam Map
On DVA-C02, AWS Step Functions shows up most heavily in Domain 4 (Troubleshooting and Optimization, 18%), but it also overlaps Domains 1 and 2:
- Domain 1 (Development, 32%): AWS Step Functions as orchestration for AWS Lambda functions, Amazon DynamoDB transactions, Amazon SQS fan-out, and event-driven chains.
- Domain 2 (Security, 26%): AWS Step Functions execution role calling downstream service APIs; resource-based policies when a workflow is triggered cross-account.
- Domain 4 (Troubleshooting, 18%): AWS Step Functions Retry and Catch for distributed error handling, bypassing AWS Lambda 15-minute timeout through orchestration, saga compensation, and circuit breakers.
Whenever a DVA-C02 stem mentions "orchestration," "long-running workflow," "visual execution," "retry with exponential backoff across services," or "coordinate multiple AWS services transactionally," AWS Step Functions is almost always the correct answer.
The AWS Step Functions Execution Model at 30,000 Feet
Every AWS Step Functions execution follows the same pattern: (1) a StartExecution API call (from the AWS SDK, EventBridge, API Gateway, Lambda, or another state machine) delivers input JSON; (2) the state machine runs each state in sequence, following Next or Choice transitions; (3) each Task state invokes a downstream AWS service, waits for or polls its result, and applies optional Retry and Catch logic; (4) the final state produces output JSON, which Standard executions expose through DescribeExecution and the console, Synchronous Express executions return directly to the caller, and Asynchronous Express executions surface only through CloudWatch Logs; (5) AWS Step Functions bills per state transition (Standard) or per request plus execution duration and memory (Express).
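AWS Step Functions in Plain Language
Put plainly, AWS Step Functions is a cloud director that strings together a set of AWS Lambda functions and AWS services and has them perform according to a script. The three analogies below make the AWS Step Functions abstraction easy to remember.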
Analogy 1 — The Movie Director With a Script
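Picture AWS Step Functions as the director on a film set. The script (the state machine definition) spells everything out: scene one shoots Lambda A (validate the order), scene two shoots Lambda B (deduct inventory), and if the inventory cannot be deducted the scene is reshot up to 3 times (Retry); if it still fails, the production jumps to the backup scene (Catch) and plays the fallback storyline (fallback branch). The director (the AWS Step Functions service) advances by following the shot log (execution history), and every take is recorded (each state transition is logged). The actors (Lambda, DynamoDB, ECS) only have to perform their own scene; the director handles scene order, timeouts, retakes, and wrapping up. When the film is finished you can see the complete shooting order without digging through each actor's phone records. That is the observability AWS Step Functions provides.
AWS Step Functions = a cloud director that treats distributed microservices as actors and keeps the shot log itself.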
Analogy 2 — The Restaurant Kitchen Order Ticket
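AWS Step Functions also resembles the order ticket system in an upscale restaurant. The server clips an order (the input JSON) onto the kitchen rail. At station one the cold-appetizer cook (Task: Lambda) makes the salad; station two is a decision (Choice state) that routes the ticket differently depending on whether the guest is vegetarian or eats meat; station three is a parallel branch (Parallel state) where the steak and the lobster go on the heat at the same time; finally everything is gathered and served (result aggregation). If the steak burns (an error), the rail follows the SOP and remakes it twice (Retry); if the third attempt also burns, the ticket goes to the head chef to change the dish (Catch) rather than ruining the whole table. The ticket leaves a paper trail from start to finish (execution history, retained for 90 days for Standard workflows), so you can always trace which station took 5 minutes too long.
AWS Step Functions = a menu strung into a visible service line.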
Analogy 3 — The Airport Ground-Crew Checklist
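AWS Step Functions is even more like the ground-crew checklist between an aircraft's landing and its next takeoff. Deplane passengers → unload cargo → refuel → clean the cabin → restock → load cargo → board passengers: each step completes in order. Some steps can run in parallel (Parallel state: refueling and cabin cleaning at the same time), some must wait for the ground crew to report completion (Task Token callback: a crew member badges in to report SendTaskSuccess), and some have a timeout (Task TimeoutSeconds: if the fuel truck has not arrived within 15 minutes, a Catch dispatches a backup truck). The control tower (AWS Step Functions) watches only the checklist and the statuses; it does not get involved in how each fuel truck refuels. When the turnaround is over, the tower has a complete timeline, and dispatchers can review which step was slow.
AWS Step Functions = a checklist for a process that is visible, retryable, time-bounded, and compensable.
Put the three analogies together and the four defining traits of AWS Step Functions become clear: orchestration, error handling, long-running execution, and observability.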
An AWS Step Functions state machine is a JSON document written in Amazon States Language (ASL) that defines an ordered graph of states, transitions, and error-handling rules. Each execution of the state machine is an independent run with its own input, its own history, and its own output. Standard workflows can run for up to 1 year and retain execution history for 90 days after an execution closes; Express workflows stream history to CloudWatch Logs (when logging is enabled) and retain nothing in the service itself. Reference: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html
Standard vs Express Workflows
AWS Step Functions ships two workflow types with fundamentally different pricing, durability, and throughput. On DVA-C02 the choice between Standard and Express is the single most-tested AWS Step Functions decision, so memorize this table cold.
Standard Workflows
Standard AWS Step Functions workflows are the original type. They are optimized for long-running, auditable, exactly-once orchestration.
- Maximum duration: 1 year per execution.
- Execution history retention: 90 days in the Step Functions service itself; full history viewable in the console without extra setup.
- Throughput: up to 2,000 StartExecution API calls per second, 4,000 state transitions per second per account per Region (soft limits).
- Execution semantics: exactly-once execution of each workflow step.
- Pricing: billed per state transition (around 0.000025 USD per transition).
- Use cases: order fulfillment, ETL pipelines, human approval flows, SageMaker training pipelines, infrastructure provisioning.
Standard is the default answer when a DVA-C02 stem mentions "auditable," "long-running," "human approval," "hours/days," or "exactly-once."
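To make the per-transition pricing concrete, the sketch below estimates a Standard workflow bill from the approximate rate quoted in the list above. The function name and the workload numbers are illustrative assumptions, and the estimate ignores any free tier or regional price differences.

```python
# Approximate Standard price per state transition, as quoted in the list above.
PRICE_PER_TRANSITION_USD = 0.000025


def standard_workflow_cost(executions: int, transitions_per_execution: int) -> float:
    """Estimate Standard workflow cost: every state transition is billed."""
    return executions * transitions_per_execution * PRICE_PER_TRANSITION_USD


# Hypothetical workload: 1,000,000 executions with 10 transitions each.
monthly_cost = standard_workflow_cost(1_000_000, 10)
print(f"{monthly_cost:.2f} USD")
```

Because the bill scales with the number of transitions rather than with elapsed time, a workflow that spends days paused costs the same as one that finishes in seconds.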
Express Workflows
Express AWS Step Functions workflows were added for high-volume, short-lived event processing — essentially the "Lambda-speed" version of AWS Step Functions.
- Maximum duration: 5 minutes per execution.
- Execution history retention: none inside AWS Step Functions; you must configure CloudWatch Logs to capture history. Without logging, post-mortem debugging is impossible.
- Throughput: more than 100,000 executions per second per account per Region.
- Execution semantics: two sub-types — Synchronous Express (caller waits for result, at-most-once) and Asynchronous Express (fire-and-forget, at-least-once).
- Pricing: billed per execution count plus per-GB-second of memory used (similar to AWS Lambda billing), typically cheaper than Standard for high-volume short flows.
- Use cases: streaming data processing, IoT ingestion, mobile backend request validation, high-TPS API orchestration.
Express is the default answer on DVA-C02 when a stem mentions "high volume," "short duration," "thousands per second," "low cost per execution," or "IoT/streaming."
Standard AWS Step Functions = 1 year max, 2,000 executions/s, 90-day history in service, exactly-once, per-transition pricing. Express AWS Step Functions = 5 min max, 100,000+ executions/s, CloudWatch Logs required for history, at-most-once (Sync) or at-least-once (Async), per-execution + memory pricing. If the stem says "auditable" or "long-running" → Standard. If it says "high volume" or "short-lived" → Express. If it says "exactly-once" → Standard, because Express Async is at-least-once and Express Sync is at-most-once. Reference: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-standard-vs-express.html
Nested Workflows: Standard Calling Express
A classic AWS Step Functions pattern is a Standard workflow that orchestrates long-running steps and invokes an Express child workflow for high-throughput sub-tasks. You call the child via the arn:aws:states:::states:startExecution.sync optimized integration, the Standard parent waits for the Express child to finish, and you get exactly-once audit on the parent and 100,000-TPS throughput on the child. DVA-C02 tests this pattern explicitly when a scenario mixes an approval-bearing business flow with a bulk processing subroutine.
Amazon States Language (ASL) State Types
Every AWS Step Functions state machine is a JSON document whose top-level States object maps state names to one of eight state types. The DVA-C02 exam expects working fluency in all eight.
Task State
A Task state represents a unit of work executed by a downstream AWS service or an Activity worker. A Task state has a Resource field that is either a Lambda ARN, a service integration ARN (arn:aws:states:::sqs:sendMessage), or an Activity ARN.
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:111122223333:function:ValidateOrder",
"Next": "ChargeCustomer",
"TimeoutSeconds": 30,
"HeartbeatSeconds": 10,
"Retry": [ { "ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3, "IntervalSeconds": 2, "BackoffRate": 2.0 } ],
"Catch": [ { "ErrorEquals": ["States.ALL"], "Next": "RefundAndFail", "ResultPath": "$.error" } ]
}
Task is the workhorse. On DVA-C02 most scenario questions revolve around configuring Retry and Catch on Task states.
Choice State
A Choice state branches based on input data.
"IsBigOrder": {
"Type": "Choice",
"Choices": [
{ "Variable": "$.total", "NumericGreaterThan": 1000, "Next": "FraudCheck" },
{ "Variable": "$.country", "StringEquals": "US", "Next": "USFulfillment" }
],
"Default": "StandardFulfillment"
}
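Choice states have no Retry or Catch; they are pure routing. Rules are evaluated in order and the first match wins. The Default field is optional, but if no rule matches and Default is omitted, the execution fails with a States.NoChoiceMatched error, so define a Default in practice.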
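The routing behaviour of the IsBigOrder example can be modelled in a few lines of Python. This is a simplified sketch, not the service's implementation: it supports only the two comparison operators used above and plain `$.field` paths, and the helper names (`resolve`, `evaluate_choice`, `NoChoiceMatched`) are hypothetical.

```python
from typing import Any


class NoChoiceMatched(Exception):
    """Stands in for the States.NoChoiceMatched error."""


def resolve(path: str, data: dict) -> Any:
    """Resolve a simple '$.a.b' path (no array or filter support)."""
    value: Any = data
    for key in path[2:].split("."):
        value = value[key]
    return value


def evaluate_choice(state: dict, data: dict) -> str:
    """Return the next state name; rules are checked top to bottom."""
    for rule in state["Choices"]:
        value = resolve(rule["Variable"], data)
        if "NumericGreaterThan" in rule and value > rule["NumericGreaterThan"]:
            return rule["Next"]
        if "StringEquals" in rule and value == rule["StringEquals"]:
            return rule["Next"]
    if "Default" in state:
        return state["Default"]
    raise NoChoiceMatched("no rule matched and no Default was defined")


# Same rules as the IsBigOrder state shown above.
IS_BIG_ORDER = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.total", "NumericGreaterThan": 1000, "Next": "FraudCheck"},
        {"Variable": "$.country", "StringEquals": "US", "Next": "USFulfillment"},
    ],
    "Default": "StandardFulfillment",
}

print(evaluate_choice(IS_BIG_ORDER, {"total": 1500, "country": "US"}))
```

A large US order goes to FraudCheck rather than USFulfillment because the first matching rule wins, which is why rule order matters in a Choice state.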
Parallel State
A Parallel state forks execution into N independent branches and merges the results into an array.
"FraudAndInventory": {
"Type": "Parallel",
"Branches": [
{ "StartAt": "FraudCheck", "States": { "FraudCheck": { "Type": "Task", "Resource": "...", "End": true } } },
{ "StartAt": "InventoryReserve", "States": { "InventoryReserve": { "Type": "Task", "Resource": "...", "End": true } } }
],
"Next": "ChargeCustomer"
}
All branches must succeed; any branch error propagates to Retry/Catch on the Parallel state itself.
Map State
A Map state iterates over an array, running the same sub-workflow for each element. Two modes exist: inline (the default, up to 40 concurrent iterations, runs within the parent execution) and distributed (added in 2022, up to 10,000 concurrent child executions, iterates over S3 prefixes or CSV/JSON rows in S3 objects).
"ProcessItems": {
"Type": "Map",
"ItemsPath": "$.items",
"MaxConcurrency": 10,
"ItemProcessor": {
"ProcessorConfig": { "Mode": "INLINE" },
"StartAt": "ProcessOne",
"States": { "ProcessOne": { "Type": "Task", "Resource": "...", "End": true } }
},
"Next": "Summarize"
}
The older Iterator keyword is deprecated; ASL 2022 uses ItemProcessor. The distributed Map is the AWS Step Functions answer for "iterate over 1 million S3 objects" — set ProcessorConfig.Mode to DISTRIBUTED and point ItemReader at an S3 bucket prefix or a CSV/JSON file.
Pass State
Pass is the no-op. It can transform input/output through Result, ResultPath, InputPath, and OutputPath without calling any service. Use Pass to inject constants, reshape payloads, or add debugging checkpoints.
Wait State
Wait pauses the workflow. You can wait for a fixed duration (Seconds), until an absolute timestamp (Timestamp), or take either value from the state input (SecondsPath, TimestampPath).
"WaitAnHour": { "Type": "Wait", "Seconds": 3600, "Next": "SendReminder" }
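Time spent in a Wait state in an AWS Step Functions Standard workflow costs nothing: you pay for the state transition into the Wait state, not for its duration, and no compute runs while the workflow is paused. This is a huge lever compared to a Lambda setTimeout that would tie up concurrency and burn Lambda-seconds.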
Succeed State
Succeed is a terminal state that ends the execution successfully. No Next field.
Fail State
Fail is a terminal state that ends the execution with an error name and cause. Used inside Catch blocks after compensation runs, or to short-circuit a Choice branch.
AWS Step Functions has eight ASL state types: Task (call a service), Choice (branch on data), Parallel (fork-join), Map (iterate a collection, inline or distributed), Pass (no-op/transform), Wait (pause), Succeed (terminal OK), Fail (terminal error). Memorize the mnemonic "Task Chooses Parallel Maps, then Passes and Waits to Succeed or Fail." Almost every DVA-C02 ASL question keys on recognizing which state type fits the scenario. Reference: https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-state-types.html
AWS Step Functions Intrinsic Functions
Intrinsic functions let you transform data inside ASL without invoking a Lambda function, saving a hop and millions of Lambda-ms. They are used in any field that accepts a .$ suffixed JSONPath reference.
The Core Intrinsic Functions
- States.Format(template, args...): sprintf-like string templating, e.g. States.Format('Order {} total ${}', $.orderId, $.total).
- States.StringSplit(string, separator): splits a string into an array. Useful for parsing CSV headers or CloudWatch log lines.
- States.JsonMerge(obj1, obj2, deepMerge): merges two JSON objects. Only shallow merging is currently supported, so deepMerge must be the boolean false.
- States.Array(values...): constructs an array from arbitrary values.
- States.ArrayContains(array, value): boolean membership test. Compute it in a Parameters or ResultSelector field, then branch on the result in a Choice state.
- States.ArrayGetItem(array, index): zero-indexed element access.
- States.ArrayLength(array): array size.
- States.ArrayPartition(array, chunkSize): splits into chunks; essential for batching before a Map state.
- States.ArrayRange(start, end, step): generates numeric arrays (like Python range, except that end is inclusive).
- States.ArrayUnique(array): deduplication.
- States.Base64Encode(string) / States.Base64Decode(base64String): common for S3 object keys and Cognito claims.
- States.Hash(string, algorithm): SHA-256 etc., for idempotency keys.
- States.JsonToString(obj) / States.StringToJson(string): serialize/parse.
- States.MathAdd(a, b) / States.MathRandom(start, end): arithmetic helpers.
- States.UUID(): generate a unique ID for an idempotency token.
Using Intrinsics in Payload
"BuildMessage": {
"Type": "Pass",
"Parameters": {
"greeting.$": "States.Format('Hello {}, order {} is {}', $.user, $.orderId, $.status)",
"combined.$": "States.JsonMerge($.base, $.overrides, false)",
"idempotencyKey.$": "States.UUID()"
},
"Next": "Notify"
}
On DVA-C02 the exam rarely quizzes every intrinsic by name, but States.Format, States.StringSplit, States.JsonMerge, and States.UUID do appear in "which AWS Step Functions feature eliminates this Lambda?" scenario questions.
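For readers who think in code, the following Python functions approximate what a few of these intrinsics compute. They are rough analogues written for illustration, with hypothetical names; the merge helper models only the shallow mode (third argument false) used in the BuildMessage example, and the format helper ignores escaping rules.

```python
def states_format(template: str, *args) -> str:
    """Rough analogue of States.Format: fill each '{}' placeholder in order."""
    parts = template.split("{}")
    if len(parts) - 1 != len(args):
        raise ValueError("placeholder count must match argument count")
    out = parts[0]
    for arg, tail in zip(args, parts[1:]):
        out += str(arg) + tail
    return out


def states_array_partition(items: list, chunk_size: int) -> list:
    """Analogue of States.ArrayPartition: fixed-size chunks, last one may be short."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def states_array_range(start: int, end: int, step: int) -> list:
    """Analogue of States.ArrayRange: unlike Python's range(), end is inclusive."""
    stop = end + 1 if step > 0 else end - 1
    return list(range(start, stop, step))


def states_json_merge(left: dict, right: dict) -> dict:
    """Shallow merge: top-level keys from `right` replace those in `left`."""
    return {**left, **right}


print(states_format("Hello {}, order {} is {}", "Ana", "O-123", "SHIPPED"))
print(states_array_partition(list(range(1, 10)), 4))
print(states_array_range(1, 9, 2))
```

Each of these would otherwise be a small data-shaping Lambda function, which is the replacement opportunity the scenario questions are probing for.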
Before intrinsic functions landed, teams wrote tiny AWS Lambda functions just to concatenate strings or merge JSON. Every one of those is an avoidable cold start plus IAM surface plus billing line item. Scan your state machines for Lambdas whose only job is data shaping — replace them with States.Format, States.JsonMerge, and States.StringSplit to cut latency and cost.
Reference: https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-intrinsic-functions.html
Error Handling Deep Dive: Retry and Catch
Error handling is the single richest vein in the AWS Step Functions exam surface. Every Task and Parallel and Map state supports a Retry array and a Catch array. Master this section and Domain 4 becomes predictable.
Built-in Error Names
AWS Step Functions normalizes downstream errors into named strings you match in ErrorEquals:
- States.ALL: wildcard that matches every known error name except the terminal States.DataLimitExceeded and States.Runtime errors. It must appear alone in ErrorEquals and always in the last Retry/Catch rule.
- States.Timeout: the state exceeded TimeoutSeconds or HeartbeatSeconds.
- States.TaskFailed: downstream service or Lambda returned an error.
- States.Permissions: AWS Step Functions execution role lacks permission to invoke the target.
- States.ResultPathMatchFailure: output couldn't be merged at ResultPath.
- States.ParameterPathFailure: JSONPath in Parameters didn't resolve.
- States.BranchFailed: a branch of Parallel failed.
- States.NoChoiceMatched: Choice had no matching rule and no Default.
- States.IntrinsicFailure: intrinsic function threw (e.g., States.JsonToString on invalid UTF-8).
- States.ExceedToleratedFailureThreshold: distributed Map exceeded its failure tolerance.
Lambda functions can also throw custom error names by having the function throw an exception whose class/name is used as the error name (Python class OrderNotFoundError(Exception), Node.js throw { name: 'OrderNotFoundError', message: '...' }). Custom names match in ErrorEquals ahead of States.ALL.
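A minimal Python handler that raises such a custom error might look like the sketch below. The in-memory `ORDERS` dictionary and the event shape are hypothetical stand-ins for a real data store and payload.

```python
class OrderNotFoundError(Exception):
    """Business error; the class name is what ErrorEquals matches against."""


# Hypothetical in-memory store standing in for a real database lookup.
ORDERS = {"O-123": {"status": "PAID"}}


def lambda_handler(event, context):
    """Return the order status, or raise a named business error."""
    order = ORDERS.get(event["orderId"])
    if order is None:
        raise OrderNotFoundError(f"order {event['orderId']} does not exist")
    return {"orderId": event["orderId"], "status": order["status"]}
```

A Retry rule with "ErrorEquals": ["OrderNotFoundError"] can then treat this failure differently from throttling or service errors.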
The Retry Field
Retry is a list of rules evaluated in order. Each rule specifies:
- ErrorEquals: list of error names this rule matches.
- IntervalSeconds: initial wait before the first retry (default 1).
- MaxAttempts: max retries (default 3; set to 0 to disable retry).
- BackoffRate: multiplier applied to IntervalSeconds each retry (default 2.0).
- MaxDelaySeconds (added 2023): cap the exponential backoff.
- JitterStrategy (added 2023): FULL or NONE, spreads retries to avoid thundering herd.
"Retry": [
{ "ErrorEquals": ["OrderNotFoundError"], "MaxAttempts": 0 },
{ "ErrorEquals": ["Lambda.TooManyRequestsException", "Lambda.ServiceException"],
"IntervalSeconds": 2, "MaxAttempts": 6, "BackoffRate": 2.0, "MaxDelaySeconds": 60, "JitterStrategy": "FULL" },
{ "ErrorEquals": ["States.ALL"], "IntervalSeconds": 1, "MaxAttempts": 2, "BackoffRate": 2.0 }
]
Rules are evaluated top-to-bottom on each failure; the first matching rule is used. MaxAttempts: 0 means "this error is terminal, do not retry" — a common pattern for business errors like "order not found" where retries would be pointless.
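The arithmetic behind these fields is easy to check by hand. The sketch below picks the first matching rule and computes the wait before each retry, using the documented defaults. It is a simplified model: jitter is ignored, States.ALL is treated as matching everything, and the function names are hypothetical.

```python
from typing import Optional

# The same three rules as the Retry example above.
RETRIERS = [
    {"ErrorEquals": ["OrderNotFoundError"], "MaxAttempts": 0},
    {"ErrorEquals": ["Lambda.TooManyRequestsException", "Lambda.ServiceException"],
     "IntervalSeconds": 2, "MaxAttempts": 6, "BackoffRate": 2.0,
     "MaxDelaySeconds": 60, "JitterStrategy": "FULL"},
    {"ErrorEquals": ["States.ALL"], "IntervalSeconds": 1, "MaxAttempts": 2,
     "BackoffRate": 2.0},
]


def match_retrier(error_name: str, retriers: list) -> Optional[dict]:
    """Return the first rule whose ErrorEquals covers the error (top to bottom)."""
    for retrier in retriers:
        names = retrier["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            return retrier
    return None


def backoff_delays(retrier: dict) -> list:
    """Seconds to wait before each retry, without jitter."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    cap = retrier.get("MaxDelaySeconds")
    delays = []
    for attempt in range(attempts):
        delay = interval * rate ** attempt
        delays.append(min(delay, cap) if cap is not None else delay)
    return delays


print(backoff_delays(match_retrier("Lambda.ServiceException", RETRIERS)))
```

For the throttling rule the uncapped sixth delay would be 64 seconds, so MaxDelaySeconds trims it to 60; with JitterStrategy FULL the real waits are randomized rather than exactly these values.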
The Catch Field
Catch runs when an error is not retried: either no Retry rule matches it or all Retry attempts are exhausted. Each Catch rule specifies:
- ErrorEquals: list of error names.
- Next: the fallback state to transition to.
- ResultPath: where to inject the error object into the state's input.
"Catch": [
{ "ErrorEquals": ["States.Timeout"], "Next": "NotifyOpsAndFail", "ResultPath": "$.timeoutError" },
{ "ErrorEquals": ["States.ALL"], "Next": "CompensateAndFail", "ResultPath": "$.error" }
]
ResultPath: Preserving the Original Input on Error
ResultPath is the make-or-break detail most candidates miss. By default, a Catch block replaces the state input with the error object, destroying the business payload (orderId, customerId, etc.) that downstream compensation needs. Set ResultPath to a sub-node (e.g., $.error) and AWS Step Functions injects the error alongside the original input, so the Catch target receives { "orderId": "O-123", "customerId": "C-1", "error": { "Error": "States.Timeout", "Cause": "..." } }.
"ChargeCustomer": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:ChargeCustomer",
"Retry": [{ "ErrorEquals": ["States.ALL"], "MaxAttempts": 3, "BackoffRate": 2, "IntervalSeconds": 2, "JitterStrategy": "FULL" }],
"Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.chargeError", "Next": "RefundAndNotify" }],
"End": true
}
The number-one AWS Step Functions trap on DVA-C02: forgetting ResultPath in Catch. Without ResultPath, the error object replaces the state's input, and your compensation Lambda suddenly has no orderId or customerId to work with. Always set ResultPath to a scoped key like $.error so the fallback state keeps the business payload AND sees the error. If the exam shows a Catch block missing ResultPath and asks why compensation fails to find the order, this is the answer.
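The difference between the default and a scoped ResultPath can be modelled as a small merge function. This is a simplified sketch with a hypothetical name (`apply_result_path`) that handles only `$`, dotted `$.a.b` paths, and null.

```python
import copy
from typing import Any, Optional


def apply_result_path(state_input: dict, result: Any,
                      result_path: Optional[str] = "$") -> Any:
    """Simplified model of ResultPath for '$', '$.a.b', and null."""
    if result_path is None:
        return copy.deepcopy(state_input)  # null: discard the result, keep the input
    if result_path == "$":
        return result                      # default: the result replaces the input
    merged = copy.deepcopy(state_input)
    node = merged
    keys = result_path[2:].split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result                # inject the result next to the input
    return merged


order = {"orderId": "O-123", "customerId": "C-1"}
error = {"Error": "States.Timeout", "Cause": "..."}

print(apply_result_path(order, error))             # business payload is lost
print(apply_result_path(order, error, "$.error"))  # payload and error together
```

With the default path the compensation state sees only Error and Cause; with `$.error` it still has orderId and customerId to act on.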
Reference: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
Timeout and Heartbeat
Every Task state supports TimeoutSeconds (total allowed time) and HeartbeatSeconds (max silence between heartbeats from an Activity worker or a waitForTaskToken callback). If either fires, AWS Step Functions raises States.Timeout, which your Retry/Catch can handle. When TimeoutSeconds is omitted, ASL applies a default of 99999999 seconds, so a stuck task can hold the execution open until the workflow's own limit (1 year for Standard). Set a realistic timeout on every Task; this is an AWS Well-Architected best practice and a DVA-C02 favorite.
Service Integration Patterns
AWS Step Functions exposes three ways to invoke a downstream AWS service. Knowing which one applies to which scenario is mandatory for DVA-C02.
Request-Response (Direct) Integration
Resource ARN pattern: arn:aws:states:::<service>:<action>. AWS Step Functions calls the service API, receives the synchronous response, and proceeds to Next. Example: arn:aws:states:::sqs:sendMessage. No waiting for the downstream work to "finish" — just for the API call to return.
Run-a-Job (.sync) Integration
Resource ARN pattern: arn:aws:states:::<service>:<action>.sync. AWS Step Functions calls the service API, then polls or subscribes to the service's completion event and holds the Task state open until the downstream job is actually done. Available for long-running services like AWS Batch, Amazon ECS RunTask, AWS Glue StartJobRun, Amazon EMR, Amazon SageMaker training jobs, AWS Step Functions child executions, and AWS CodeBuild projects. This is how a Standard state machine can "wait for a 45-minute ECS task to complete" without writing any polling Lambda.
Wait-for-Callback (.waitForTaskToken) Integration
Resource ARN pattern: arn:aws:states:::<service>:<action>.waitForTaskToken. AWS Step Functions calls the service API, injecting a task token into the payload, then pauses the Task state indefinitely (up to 1 year for Standard). Some worker — a human clicking a link in an email, an on-premises process, or a Lambda waiting on SQS — eventually calls SendTaskSuccess(taskToken, output) or SendTaskFailure(taskToken, error, cause) to resume the workflow.
This is the AWS Step Functions answer for:
- Human approval flows (email with Approve/Reject links that hit API Gateway → Lambda → SendTaskSuccess).
- Third-party integrations where the third party calls back asynchronously.
- On-premises workers pulling tasks over the Activity API.
For long-running callbacks, workers call SendTaskHeartbeat(taskToken) periodically; if the Task state's HeartbeatSeconds elapses without a heartbeat, AWS Step Functions raises States.Timeout.
"WaitForApproval": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
"Parameters": {
"TopicArn": "arn:aws:sns:us-east-1:111122223333:ApprovalTopic",
"Message": {
"orderId.$": "$.orderId",
"taskToken.$": "$$.Task.Token"
}
},
"TimeoutSeconds": 86400,
"HeartbeatSeconds": 3600,
"Next": "Fulfill"
}
$$.Task.Token is the context-object path that yields the current task token.
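On the receiving side, something has to hand the token back. The sketch below shows one way an approval handler could do that. The event shape (`taskToken`, `decision`, `orderId`) and the function name are assumptions for illustration; `send_task_success` and `send_task_failure` are the boto3 Step Functions client methods, and the client is passed in so the function can be exercised with a stub.

```python
import json


def handle_approval(sfn_client, event: dict) -> str:
    """Resume or fail the paused Task based on the reviewer's decision."""
    token = event["taskToken"]
    if event["decision"] == "approve":
        # Output must be a JSON string; it becomes the result of the Task state.
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved": True, "orderId": event["orderId"]}),
        )
        return "resumed"
    # A failure surfaces as a named error that Retry/Catch can match.
    sfn_client.send_task_failure(
        taskToken=token,
        error="ApprovalRejected",
        cause=f"Order {event['orderId']} was rejected by the reviewer",
    )
    return "failed"
```

Inside a Lambda function, `sfn_client` would typically be `boto3.client("stepfunctions")`, and the function's role needs permission for states:SendTaskSuccess and states:SendTaskFailure.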
Direct (arn:aws:states:::svc:action) = fire API, continue. Run-a-Job (.sync) = fire API, block until the job finishes (Batch, ECS, Glue, SageMaker, child state machine). Wait-for-Callback (.waitForTaskToken) = fire API with a token, pause until a worker calls SendTaskSuccess/SendTaskFailure. On DVA-C02 match the pattern to the scenario: "wait for human approval" → .waitForTaskToken; "wait for a 20-minute ECS task" → .sync; "drop a message on SQS and move on" → direct.
Reference: https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html
Optimized Integrations for AWS Services
AWS Step Functions ships first-class optimized integrations for high-use services. Key ones for DVA-C02:
- AWS Lambda (lambda:invoke, lambda:invoke.waitForTaskToken): invoke a function, synchronously or with callback.
- Amazon DynamoDB (dynamodb:putItem, dynamodb:getItem, dynamodb:updateItem, dynamodb:deleteItem): direct DynamoDB CRUD from a Task state, no Lambda needed.
- Amazon SQS (sqs:sendMessage, sqs:sendMessage.waitForTaskToken): enqueue messages with or without callback.
- Amazon SNS (sns:publish, sns:publish.waitForTaskToken): publish to a topic with optional task token for approval flows.
- Amazon ECS / AWS Fargate (ecs:runTask, ecs:runTask.sync, ecs:runTask.waitForTaskToken): launch a task, optionally block until it finishes.
- AWS Glue (glue:startJobRun, glue:startJobRun.sync): start an ETL job and wait for it.
- Amazon SageMaker (sagemaker:createTrainingJob.sync, sagemaker:createTransformJob.sync, etc.): orchestrate ML training and batch inference without a Lambda polling loop.
- AWS Batch (batch:submitJob.sync): run containerized compute jobs.
- Amazon EventBridge (events:putEvents): publish events to a bus.
- AWS Step Functions (states:startExecution, states:startExecution.sync): nested workflows.
Each optimized integration saves you a wrapper Lambda and eliminates the associated cold start, IAM role, and billing line item.
Asynchronous Callback with Task Tokens
Task tokens deserve a dedicated section because they are the foundation of human-in-the-loop workflows and the most elegant DVA-C02 answer for "pause a workflow until an external event happens."
The Callback Lifecycle
- AWS Step Functions enters a Task state whose Resource ends in .waitForTaskToken.
- AWS Step Functions generates a unique task token and injects it into the downstream payload (via $$.Task.Token).
- The downstream target (SNS topic, SQS queue, Lambda function, ECS task, HTTP webhook) receives the token along with business context.
- AWS Step Functions suspends the Task state; Standard workflows incur no charge during the wait.
- An out-of-band worker completes its real work and calls one of three APIs:
  - SendTaskSuccess(taskToken, output): Task resumes with output as the state result.
  - SendTaskFailure(taskToken, error, cause): Task fails with the given error, triggering Retry/Catch.
  - SendTaskHeartbeat(taskToken): resets the HeartbeatSeconds timer without completing.
- AWS Step Functions resumes the state machine from the Task.
Heartbeat for Long-Running Workers
If the worker needs an hour but the Task should fail fast when the worker crashes, set HeartbeatSeconds: 300 and have the worker ping SendTaskHeartbeat every 60 seconds. If no heartbeat arrives for 300 seconds (five consecutive missed pings), AWS Step Functions raises States.Timeout and Retry/Catch kicks in. This is how you detect a dead on-premises worker without waiting a full hour.
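A worker loop along the following lines keeps the heartbeat timer fresh while it works. It is a sketch under assumptions: the work is split into small callable steps, the function name and the injectable `clock` are hypothetical conveniences for testing, and `send_task_heartbeat` / `send_task_success` are the boto3 Step Functions client methods.

```python
import time
from typing import Callable


def run_with_heartbeat(sfn_client, task_token: str, work_steps: list,
                       heartbeat_every: float = 60.0,
                       clock: Callable[[], float] = time.monotonic) -> None:
    """Run work in small steps, sending a heartbeat once the interval has elapsed."""
    last_beat = clock()
    for step in work_steps:
        step()  # one bounded unit of real work
        if clock() - last_beat >= heartbeat_every:
            sfn_client.send_task_heartbeat(taskToken=task_token)
            last_beat = clock()
    # All steps finished: complete the Task with an empty JSON result.
    sfn_client.send_task_success(taskToken=task_token, output="{}")
```

Each step should be much shorter than HeartbeatSeconds; if a single step can block for longer than the heartbeat window, the heartbeat belongs on a separate thread instead.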
Activity Workers (Legacy Variant)
AWS Step Functions also supports Activities — long-lived worker pools that poll GetActivityTask and report back via SendTaskSuccess/SendTaskFailure. Activities pre-date .waitForTaskToken and are still useful for on-premises or non-AWS workers. Modern code usually prefers .waitForTaskToken with SNS/SQS fan-out.
Distributed Map for Big-Data Iteration
The distributed Map state, introduced in December 2022, is the AWS Step Functions answer for "iterate over millions of S3 objects without writing a Lambda coordinator."
Key Capabilities
- ItemReader: source of items when the dataset lives in Amazon S3. Can be a CSV file in S3, a JSON array file in S3, or an S3 object listing under a prefix. A JSON array that is already in the state input is selected with ItemsPath instead.
- ItemSelector and ItemBatcher: shape each item before it is passed to the ItemProcessor; optionally batch items into groups of N to reduce overhead.
- MaxConcurrency: up to 10,000 concurrent child executions (each child is a sub-state-machine execution).
- ToleratedFailurePercentage / ToleratedFailureCount: define failure tolerance. Exceeding the threshold raises States.ExceedToleratedFailureThreshold on the parent.
- ResultWriter: optionally write each child's output to S3 for later aggregation; avoids the 256 KB state output limit.
Big-S3 Iteration Pattern
"ProcessAllFiles": {
"Type": "Map",
"ItemReader": {
"Resource": "arn:aws:states:::s3:listObjectsV2",
"Parameters": { "Bucket": "my-data-bucket", "Prefix": "incoming/" }
},
"MaxConcurrency": 1000,
"ToleratedFailurePercentage": 2,
"ItemProcessor": {
"ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
"StartAt": "ProcessOne",
"States": { "ProcessOne": { "Type": "Task", "Resource": "arn:aws:lambda:...:ProcessFile", "End": true } }
},
"ResultWriter": { "Resource": "arn:aws:states:::s3:putObject", "Parameters": { "Bucket": "results-bucket", "Prefix": "run-output/" } },
"End": true
}
Each child execution can itself be Express type for maximum throughput while the parent is Standard for auditability. This is the canonical "Standard parent + Express child" DVA-C02 pattern.
Inline Map is limited to 40 concurrent iterations and shares one parent execution's state — fine for small batches. Distributed Map scales to 10,000 concurrent child executions, accepts S3 as an input source, and has its own failure tolerance knobs. On DVA-C02, if the scenario mentions "process a million S3 objects," "iterate a large CSV," or "fan out across 10,000 items," the answer is distributed Map, not inline Map and not a Lambda coordinator. Reference: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-asl-use-map-state-distributed.html
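To see what a percentage tolerance means at scale, the sketch below models the ToleratedFailurePercentage check with integer arithmetic. It is a simplified model, not the service's algorithm: it considers only the percentage setting, and the function name is hypothetical.

```python
def map_run_fails(total_items: int, failed_items: int,
                  tolerated_failure_percentage: int) -> bool:
    """True when the share of failed items exceeds the tolerated percentage."""
    # Compare failed/total > pct/100 without floating point.
    return failed_items * 100 > tolerated_failure_percentage * total_items


# With 1,000,000 objects and a 2 percent tolerance, 20,000 failures are tolerated.
print(map_run_fails(1_000_000, 20_000, 2))
print(map_run_fails(1_000_000, 20_001, 2))
```

A small tolerance like this lets a bulk job finish despite a handful of corrupt objects while still failing loudly when something systemic is wrong.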
Saga Pattern: Compensating Transactions
The saga pattern is how distributed systems achieve "eventual atomicity" across services that lack a shared transaction manager (e.g., DynamoDB + Stripe + SNS). Each step has a compensating action that undoes it. AWS Step Functions is the canonical AWS implementation of sagas.
Saga State-Machine Skeleton
"StartAt": "ReserveInventory",
"States": {
"ReserveInventory": {
"Type": "Task", "Resource": "arn:aws:lambda:...:ReserveInventory",
"Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "FailAndReportNoCompensation" }],
"Next": "ChargeCustomer"
},
"ChargeCustomer": {
"Type": "Task", "Resource": "arn:aws:lambda:...:ChargeCustomer",
"Retry": [{ "ErrorEquals": ["States.ALL"], "MaxAttempts": 3, "BackoffRate": 2, "IntervalSeconds": 2 }],
"Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "ReleaseInventory" }],
"Next": "ShipOrder"
},
"ShipOrder": {
"Type": "Task", "Resource": "arn:aws:lambda:...:ShipOrder",
"Retry": [{ "ErrorEquals": ["States.ALL"], "MaxAttempts": 3, "BackoffRate": 2, "IntervalSeconds": 2 }],
"Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RefundCustomer" }],
"End": true
},
"RefundCustomer": { "Type": "Task", "Resource": "arn:aws:lambda:...:RefundCustomer", "Next": "ReleaseInventory" },
"ReleaseInventory": { "Type": "Task", "Resource": "arn:aws:lambda:...:ReleaseInventory", "Next": "FailAndReport" },
"FailAndReport": { "Type": "Fail", "Error": "SagaFailed", "Cause": "Compensation executed" },
"FailAndReportNoCompensation": { "Type": "Fail", "Error": "SagaAborted", "Cause": "Failed before any commit" }
}
Every forward Task has a Catch that routes into the compensation chain for the steps that have already committed, and each compensating Task chains to the next upstream compensation. If shipment fails, refund the customer and release inventory. If charge fails, just release inventory (nothing to refund). The state machine definition IS the saga: no external coordinator, no distributed lock, no Kafka log required.
AWS publishes an official saga sample project with DynamoDB bookings (CreateOrder, ReserveFlight, ReserveCar, ReserveHotel, each with Cancel compensations). Expect a DVA-C02 scenario along these exact lines.
A saga is "local transactions + explicit compensations", and AWS Step Functions expresses this naturally through per-Task Catch blocks that route to compensating Tasks. The key insight: compensation order is the reverse of forward order, and each compensating Task must be idempotent (it may run twice if the Task is retried or the whole execution is started again). Use DynamoDB conditional writes (attribute_not_exists) or idempotency keys to make compensations safe under retry. If you generate the key with States.UUID(), do it once in an early state and carry it in the state input, because every call returns a new value.
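A minimal sketch of an idempotent compensation follows. It assumes the idempotency key was generated once upstream and arrives in the event as `idempotencyKey` (a hypothetical field name). `FakeTable` is an in-memory stand-in for a DynamoDB table; its `put_if_absent` method mimics a PutItem call with the condition `attribute_not_exists(pk)` and is not a real SDK method.

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class FakeTable:
    """In-memory stand-in for a DynamoDB table keyed by `pk`."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, item):
        # Mimics PutItem with ConditionExpression attribute_not_exists(pk).
        if item["pk"] in self.items:
            raise ConditionalCheckFailed(item["pk"])
        self.items[item["pk"]] = item


def refund_customer(table, event, issue_refund):
    """Issue the refund at most once per idempotency key."""
    key = f"refund#{event['idempotencyKey']}"
    try:
        table.put_if_absent({"pk": key, "orderId": event["orderId"]})
    except ConditionalCheckFailed:
        # A previous attempt already claimed this key: do nothing.
        return {"refunded": False, "reason": "duplicate"}
    issue_refund(event["orderId"])
    return {"refunded": True}


if __name__ == "__main__":
    table, refunds = FakeTable(), []
    event = {"idempotencyKey": "abc", "orderId": "o-1"}
    print(refund_customer(table, event, refunds.append))
    print(refund_customer(table, event, refunds.append))
```

One design caveat: this sketch writes the marker before issuing the refund, so a crash between the two steps leaves a marker without a refund. A production handler would need to record progress (for example a status attribute) to tell the two cases apart.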
Reference: https://docs.aws.amazon.com/step-functions/latest/dg/sample-project-saga.html
Circuit Breaker with Choice + CloudWatch
A circuit breaker prevents a workflow from hammering a failing downstream service. The AWS Step Functions pattern uses a Choice state that reads a CloudWatch metric alarm state via a small Lambda probe.
Circuit Breaker Skeleton
"CheckBreaker": {
"Type": "Task", "Resource": "arn:aws:lambda:...:ReadCircuitBreakerAlarm",
"ResultPath": "$.breakerState", "Next": "RouteOnBreaker"
},
"RouteOnBreaker": {
"Type": "Choice",
"Choices": [
{ "Variable": "$.breakerState.state", "StringEquals": "OPEN", "Next": "FailFast" },
{ "Variable": "$.breakerState.state", "StringEquals": "HALF_OPEN", "Next": "ProbeCall" }
],
"Default": "NormalCall"
}
The breaker Lambda calls DescribeAlarms on a CloudWatch alarm that watches the downstream service's error rate. If the alarm is ALARM (state = OPEN), the workflow fails fast. If it's INSUFFICIENT_DATA (state = HALF_OPEN), send one probe call. If it's OK (state = CLOSED/Default), call normally. This pattern keeps one misbehaving service from cascading failure across your entire system — a classic Domain 4 optimization technique.
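The probe described above might look like the following sketch. The mapping from alarm state to breaker state comes from the paragraph; the `alarmName` event field and the decision to treat a missing alarm as CLOSED are assumptions made for illustration. The CloudWatch client is passed in so the mapping can be exercised without AWS credentials.

```python
# Mapping taken from the paragraph above; CLOSED falls through to the Choice Default.
ALARM_TO_BREAKER = {"ALARM": "OPEN", "INSUFFICIENT_DATA": "HALF_OPEN", "OK": "CLOSED"}


def read_breaker_state(cloudwatch, alarm_name):
    """Translate a CloudWatch alarm state into the value RouteOnBreaker inspects."""
    response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    alarms = response.get("MetricAlarms", [])
    if not alarms:
        # Assumption: a missing alarm is treated as CLOSED so traffic keeps flowing.
        return {"state": "CLOSED"}
    return {"state": ALARM_TO_BREAKER.get(alarms[0]["StateValue"], "CLOSED")}


def handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the mapping is testable offline."""
    import boto3

    return read_breaker_state(boto3.client("cloudwatch"), event["alarmName"])
```

Because CheckBreaker writes the result to `$.breakerState`, the Choice rules read the returned value at `$.breakerState.state`.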
Execution History and Observability
Standard Workflow Execution History
Standard AWS Step Functions workflows retain 90 days of execution history in the service itself and expose it via:
- The AWS Step Functions console's visual execution view (click any state to see input/output/errors).
- GetExecutionHistory API (max 1,000 events per page).
- CloudWatch Logs (if configured).
- EventBridge execution state change events.
AWS Step Functions publishes execution-level CloudWatch metrics for each state machine (ExecutionsStarted, ExecutionsSucceeded, ExecutionsFailed, ExecutionsTimedOut, ExecutionsAborted), and Task states that call a service emit per-resource metrics (LambdaFunctionsTimedOut, etc.).
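Because GetExecutionHistory is paged, reading a long Standard execution means following `nextToken`. The sketch below assumes a boto3 Step Functions client is passed in as `sfn`; `iter_history_events` and `failed_event_types` are hypothetical helper names.

```python
def iter_history_events(sfn, execution_arn, page_size=1000):
    """Yield every history event for one execution, following nextToken."""
    kwargs = {"executionArn": execution_arn, "maxResults": page_size}
    while True:
        page = sfn.get_execution_history(**kwargs)
        yield from page.get("events", [])
        token = page.get("nextToken")
        if not token:
            return  # last page reached
        kwargs["nextToken"] = token


def failed_event_types(sfn, execution_arn):
    """Collect the types of all failure events, e.g. TaskFailed or ExecutionFailed."""
    return [
        event["type"]
        for event in iter_history_events(sfn, execution_arn)
        if event["type"].endswith("Failed")
    ]
```

In real use, `sfn` would be `boto3.client("stepfunctions")` and `execution_arn` the ARN returned by StartExecution.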
Express Workflow Observability
Express workflows retain no history in the service, so you must enable CloudWatch Logs logging on the state machine. Four log levels:
- ALL: every history event, largest cost.
- ERROR: only failure events.
- FATAL: only terminal failures.
- OFF: nothing; makes debugging impossible.
INCLUDE_EXECUTION_DATA toggles whether input/output payloads are logged (secrets appear in logs if set to true). Choose ERROR + execution data for most dev workloads.
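As a sketch, the logging settings can be assembled as the `loggingConfiguration` structure that CreateStateMachine and UpdateStateMachine accept. In that structure the payload toggle is spelled `includeExecutionData`. The helper name and the log group ARN in the usage line are hypothetical.

```python
VALID_LEVELS = {"ALL", "ERROR", "FATAL", "OFF"}


def express_logging_config(log_group_arn, level="ERROR", include_execution_data=True):
    """Build a loggingConfiguration dict for an Express state machine."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    return {
        "level": level,
        # True logs input/output payloads, so secrets in the payload reach the logs.
        "includeExecutionData": include_execution_data,
        "destinations": [
            {"cloudWatchLogsLogGroup": {"logGroupArn": log_group_arn}}
        ],
    }


if __name__ == "__main__":
    # Hypothetical log group ARN, for illustration only.
    arn = "arn:aws:logs:us-east-1:123456789012:log-group:/aws/vendedlogs/states/demo:*"
    print(express_logging_config(arn))
```

The resulting dict would be passed as the `loggingConfiguration` argument when creating or updating the state machine.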
A favorite DVA-C02 trap: "the team deployed an Express Step Functions workflow and cannot figure out why some executions fail." The fix is always to enable CloudWatch Logs logging at ERROR or ALL level with INCLUDE_EXECUTION_DATA=true, because Express workflows keep no history in the Step Functions service. Standard workflows would have shown the failure in the console immediately. If the stem mentions "Express" and "cannot debug," the answer is CloudWatch Logs.
Reference: https://docs.aws.amazon.com/step-functions/latest/dg/cw-logs.html
X-Ray Tracing
Enable X-Ray tracing on the state machine and each Task's downstream call is traced in the X-Ray service map. Use this for deep latency forensics across nested workflows and service integrations.
Step Functions vs Lambda Chaining vs EventBridge vs SQS
On DVA-C02, one of the most common scenario question families is "pick the right orchestrator." Here is the decision matrix:
Choose AWS Step Functions When
- You need explicit, visualizable, retryable, auditable flow control.
- The workflow has branching (Choice), fan-out (Parallel/Map), or compensation (saga).
- A single flow exceeds AWS Lambda's 15-minute timeout.
- Human approval or external callback is part of the flow.
- You need exactly-once auditing across multi-service transactions.
Choose Direct AWS Lambda Chaining (Lambda calls Lambda or SDK) When
- The flow has exactly one step plus maybe one follow-up.
- There is no need for retry orchestration beyond AWS Lambda's built-in async retry.
- You never need to visualize state or audit per-step outcome.
Choose Amazon EventBridge When
- The architecture is choreographed, not orchestrated — producers publish events and consumers react independently.
- You want pattern-based fan-out to many targets with cross-account routing.
- No single actor owns the "flow"; each service reacts to events autonomously.
Choose Amazon SQS When
- The pattern is producer-consumer decoupling with at-least-once delivery and visibility timeouts.
- You need buffering, not workflow semantics.
- Consumers are pull-based (Lambda event source mapping, ECS polling).
AWS Step Functions overlaps EventBridge, but they solve different problems: AWS Step Functions = orchestration, EventBridge = choreography. If a DVA-C02 question says "one central definition controls the order of steps and retries," it's AWS Step Functions. If it says "multiple services react independently to events," it's EventBridge.
Common AWS Step Functions Exam Traps
Trap 1: Express Retention
Express workflows have no execution history retained by AWS Step Functions. CloudWatch Logs must be enabled. Standard retains 90 days in-service (and marketing docs sometimes say "1 year" referring to max execution duration, not history retention — don't confuse the two).
Trap 2: At-Most-Once vs At-Least-Once
Express Sync = at-most-once (caller gets the result; if AWS Step Functions crashes before completion, no retry). Express Async = at-least-once (AWS Step Functions may deliver duplicates if a crash happens mid-execution). Standard = exactly-once. If a scenario says "exactly-once," only Standard qualifies.
Trap 3: Inline vs Distributed Map Concurrency
Inline Map caps at 40 concurrent iterations; distributed Map goes to 10,000. "Iterate a million objects" = distributed Map, always.
Trap 4: Missing ResultPath in Catch
Without ResultPath, the error object replaces state input, destroying business context for compensation. Always set ResultPath.
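The effect is easy to reproduce in a small simulation. The function below is a hypothetical model of what the Catch hands to its Next state, not AWS code. It supports only the default path `$` and simple dotted paths such as `$.error`.

```python
import copy


def apply_catch_result_path(state_input, error_output, result_path="$"):
    """Return the input that the Catch's Next state receives."""
    if result_path == "$":
        # Default ResultPath: the error object replaces the whole input.
        return error_output
    if not result_path.startswith("$."):
        raise ValueError("only '$' and '$.a.b' style paths are modelled")
    merged = copy.deepcopy(state_input)  # leave the caller's input untouched
    node = merged
    *parents, leaf = result_path[2:].split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = error_output
    return merged


if __name__ == "__main__":
    order = {"orderId": "o-1", "amount": 42}
    error = {"Error": "States.TaskFailed", "Cause": "card declined"}
    print(apply_catch_result_path(order, error))            # business payload is gone
    print(apply_catch_result_path(order, error, "$.error")) # payload plus error
```

With `$.error`, the compensating Task still sees `orderId` and `amount` alongside the error details.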
Trap 5: Choice Has No Retry
Choice states cannot retry — they just route. Wrap any probe logic in a preceding Task that can retry, then Choice on its result.
Trap 6: Choice Without Default
Omit Default on a Choice and AWS Step Functions raises States.NoChoiceMatched if no rule matches. Always include a Default.
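A toy model of that behaviour, using string-equality rules only. `route` and `NoChoiceMatched` are hypothetical stand-ins for the Choice state and the States.NoChoiceMatched error.

```python
class NoChoiceMatched(Exception):
    """Stand-in for the States.NoChoiceMatched runtime error."""


def route(choices, value, default=None):
    """Return the Next state for `value`, mimicking StringEquals rules in order."""
    for expected, next_state in choices:
        if value == expected:
            return next_state
    if default is None:
        # No rule matched and there is no Default: the execution fails here.
        raise NoChoiceMatched(f"no rule matched {value!r}")
    return default


if __name__ == "__main__":
    rules = [("OPEN", "FailFast"), ("HALF_OPEN", "ProbeCall")]
    print(route(rules, "CLOSED", default="NormalCall"))
```

Calling `route(rules, "CLOSED")` without a default raises the error instead of returning a state name.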
Trap 7: Task Token vs Activity ARN
.waitForTaskToken is the modern callback pattern. Activities are the legacy polling pattern. Both work, but the DVA-C02 wording usually implies .waitForTaskToken unless the scenario explicitly says "on-premises workers poll for work."
Memorize these seven traps: (1) Express has no in-service history — require CloudWatch Logs. (2) Express Sync is at-most-once; Express Async at-least-once; only Standard is exactly-once. (3) Inline Map = 40 max; distributed Map = 10,000 max. (4) Missing ResultPath in Catch destroys business context. (5) Choice states have no Retry. (6) Choice states need Default. (7) .waitForTaskToken is the modern callback pattern; Activities are legacy polling. These single-line facts resolve most AWS Step Functions scenario questions on DVA-C02.
Reference: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
AWS Step Functions Pricing Summary
- Standard: roughly 0.000025 USD per state transition (us-east-1). First 4,000 transitions per month are free. No per-execution or per-duration fee.
- Express: billed per request (execution count) plus per-GB-second of duration. Memory is metered automatically based on workload (64 MB increments). Usually cheaper than Standard for short, high-volume flows, because Standard charges for every state transition.
- CloudWatch Logs (Express) charges separately.
- X-Ray traces charge separately.
For DVA-C02 you rarely need exact prices, but you do need to know: Standard is transition-priced, Express is execution+memory priced. That distinction drives "which workflow type is cheaper for high volume?" scenarios.
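A rough cost comparison can be sketched as follows. The Standard figures come from the summary above. The two Express rates are assumptions not stated in this chapter, so check the current pricing page before relying on them. Duration rounding is ignored.

```python
import math

STANDARD_PER_TRANSITION = 0.000025  # USD, from the summary above (us-east-1)
STANDARD_FREE_TRANSITIONS = 4_000   # monthly free tier, from the summary above

# Assumed Express rates (not from this chapter): verify against the pricing page.
EXPRESS_PER_REQUEST = 1.00 / 1_000_000
EXPRESS_PER_GB_SECOND = 0.00001667


def standard_cost(executions, transitions_per_execution):
    """Monthly Standard cost: every state transition is billed."""
    billable = max(0, executions * transitions_per_execution - STANDARD_FREE_TRANSITIONS)
    return billable * STANDARD_PER_TRANSITION


def express_cost(executions, seconds_per_execution, memory_mb=64):
    """Monthly Express cost: requests plus GB-seconds, memory in 64 MB steps."""
    billed_mb = math.ceil(memory_mb / 64) * 64
    gb_seconds = executions * seconds_per_execution * billed_mb / 1024
    return executions * EXPRESS_PER_REQUEST + gb_seconds * EXPRESS_PER_GB_SECOND


if __name__ == "__main__":
    print(f"Standard: {standard_cost(1_000_000, 5):.2f} USD")
    print(f"Express:  {express_cost(1_000_000, 0.1):.2f} USD")
```

Under those assumed rates, one million five-transition executions of 100 ms each come to roughly 125 USD on Standard and about 1.10 USD on Express, which is the gap the "high volume" scenarios are probing.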
AWS Step Functions Limits to Memorize
- Max execution duration: 1 year (Standard), 5 minutes (Express).
- Max state machine definition size: 1 MB.
- Max execution history events: 25,000 (Standard; an execution that reaches this quota fails, so split very long workflows into child executions).
- Max state input/output size: 256 KB.
- Max StartExecution rate: 2,000/s (Standard), 100,000+/s (Express), soft limits.
- Max concurrent executions: no hard per-state-machine cap, but account-level quotas apply.
- Max distributed Map child executions: 10,000 concurrent.
- Max inline Map concurrency: 40.
- Task token callback max wait: 1 year (Standard).
FAQ — AWS Step Functions
Q1. What is AWS Step Functions in one sentence for DVA-C02?
AWS Step Functions is a serverless orchestrator that runs a JSON state machine (written in Amazon States Language) to coordinate AWS Lambda functions and other AWS services with built-in Retry, Catch, Parallel, Map, Choice, timeouts, asynchronous callbacks via task tokens, visual execution history, and per-transition or per-execution billing. On DVA-C02, AWS Step Functions is the default answer whenever the question mentions "orchestrate," "long-running workflow," "visualize execution," "distributed error handling," "saga," or "human approval."
Q2. What is the difference between AWS Step Functions Standard and Express workflows?
AWS Step Functions Standard workflows run up to 1 year, retain 90 days of history in the service, cap at 2,000 executions per second, deliver exactly-once execution semantics, and bill per state transition — use them for auditable, long-running, human-approval, and ETL flows. Express workflows run up to 5 minutes, retain no history in the service (CloudWatch Logs required), scale to 100,000+ executions per second, are either Synchronous (at-most-once) or Asynchronous (at-least-once), and bill per execution plus per-GB-second — use them for high-volume, short-lived event processing, IoT ingestion, and streaming. A common nested pattern is Standard parent + Express child (via .sync optimized integration) to combine audit with throughput.
Q3. How do Retry and Catch interact with ResultPath in AWS Step Functions?
Retry runs first — AWS Step Functions attempts the Task up to MaxAttempts times with IntervalSeconds scaled by BackoffRate (and optionally capped by MaxDelaySeconds with JitterStrategy: FULL). If all retries are exhausted, Catch runs: it matches the error name in ErrorEquals, transitions to the Next state, and — critically — writes the error object to ResultPath. Omitting ResultPath replaces the entire state input with the error, destroying the business payload your compensation needs. Always set ResultPath to a scoped key like $.error so the fallback state has both the original input and the error.
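The wait before each retry follows directly from those fields. A minimal sketch that ignores jitter; `retry_delays` is a hypothetical helper whose defaults mirror the ASL defaults (IntervalSeconds 1, MaxAttempts 3, BackoffRate 2.0).

```python
def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0, max_delay_seconds=None):
    """Return the wait in seconds before each retry attempt, in order."""
    delays = []
    for attempt in range(max_attempts):
        delay = interval_seconds * backoff_rate ** attempt
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)  # cap from MaxDelaySeconds
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(retry_delays())                                        # ASL defaults
    print(retry_delays(interval_seconds=2, max_delay_seconds=5)) # capped backoff
```

With IntervalSeconds 2, BackoffRate 2, and MaxAttempts 3, the waits are 2, 4, and 8 seconds, so Catch runs about 14 seconds after the first failure, plus the time the attempts themselves take.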
Q4. How do task tokens and .waitForTaskToken work for human approval flows?
.waitForTaskToken is the asynchronous callback service-integration pattern. AWS Step Functions generates a task token, injects it into the downstream payload (via $$.Task.Token), pauses the Task state, and waits for a worker to call SendTaskSuccess(token, output), SendTaskFailure(token, error, cause), or SendTaskHeartbeat(token). For human approval, combine .waitForTaskToken with SNS — the approver clicks an email link, which invokes API Gateway → Lambda → SendTaskSuccess. Set TimeoutSeconds to cap wait time (up to 1 year for Standard) and HeartbeatSeconds to detect dead workers.
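The Lambda behind the approval link could be sketched like this. It assumes a boto3 Step Functions client is passed in as `sfn`. The `ApprovalRejected` error name and the `approvedBy` output field are hypothetical choices for illustration.

```python
import json


def complete_approval(sfn, task_token, approved, approver):
    """Resume the paused Task by reporting the approver's decision."""
    if approved:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approvedBy": approver}),  # output must be a JSON string
        )
        return "success"
    sfn.send_task_failure(
        taskToken=task_token,
        error="ApprovalRejected",  # hypothetical error name a Catch could match
        cause=f"Rejected by {approver}",
    )
    return "failure"
```

In real use, `sfn` would be `boto3.client("stepfunctions")` and the token would arrive in the approval link's query string after being injected from `$$.Task.Token`.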
Q5. What is the difference between .sync and .waitForTaskToken service integrations?
.sync (Run-a-Job) blocks the Task until a service-completion event arrives — AWS Step Functions knows how to poll or subscribe for the downstream service's "job done" signal for AWS Batch, Amazon ECS, AWS Glue, Amazon SageMaker, AWS CodeBuild, and nested Step Functions. You write no callback code. .waitForTaskToken blocks the Task until your own worker calls SendTaskSuccess/SendTaskFailure — you inject a token, any external system can complete it. Rule of thumb: if AWS already knows when the service finishes, use .sync; if a human or third-party system controls completion, use .waitForTaskToken.
Q6. When should I use distributed Map vs inline Map vs a Lambda coordinator?
Inline Map is a simple fan-out inside one parent execution, capped at 40 concurrent iterations — fine for small collections. Distributed Map scales to 10,000 concurrent child executions, accepts S3 prefixes or CSV/JSON files as input sources via ItemReader, supports batching and failure tolerance (ToleratedFailurePercentage), and can spawn Express child executions for maximum throughput — this is the correct answer for "process a million S3 objects" on DVA-C02. A hand-rolled Lambda coordinator has no built-in retry, no visibility, no failure tolerance knobs, and it risks Lambda timeouts — almost never the exam's correct answer.
Q7. How does AWS Step Functions implement the saga pattern?
The saga pattern expresses "local transactions + explicit compensations": each forward Task has a Catch block that routes to a compensating Task, which in turn chains to the next upstream compensation. Examples: if ShipOrder fails, route to RefundCustomer → ReleaseInventory → Fail. If ChargeCustomer fails, route to ReleaseInventory → Fail (no refund needed, nothing was charged). Compensations must be idempotent because a Task or the whole execution may be retried. Use DynamoDB conditional writes (attribute_not_exists) or idempotency tokens, generating any States.UUID() token once and carrying it in the state input. AWS publishes an official saga sample project using DynamoDB bookings as the canonical reference.
Q8. When should I use AWS Step Functions vs direct AWS Lambda chaining vs EventBridge?
Use AWS Step Functions for orchestration: one central state machine defines order, retries, compensations, and visibility. Use direct AWS Lambda chaining when the flow is one or two steps and needs no explicit retry or audit (AWS Lambda destinations cover this simple case). Use Amazon EventBridge for choreography: multiple producers and consumers react to events independently, with no central flow owner. The DVA-C02 tell: stems mentioning "control the sequence," "compensate on failure," "visual execution history," or "human approval" point to AWS Step Functions; stems mentioning "multiple services react independently," "pattern matching," or "cross-account event routing" point to EventBridge.
Q9. What CloudWatch metrics and logs does AWS Step Functions emit?
Standard AWS Step Functions emits per-state-machine metrics (ExecutionsStarted, ExecutionsSucceeded, ExecutionsFailed, ExecutionsTimedOut, ExecutionsAborted, ExecutionThrottled) and per-activity metrics (ActivityRunTime, ActivityScheduleTime, ActivitiesSucceeded, etc.). Enable CloudWatch Logs at the state machine level to capture full execution history. Express workflows retain no in-service history, so CloudWatch Logs is mandatory — set log level to ALL or ERROR and INCLUDE_EXECUTION_DATA=true to capture input/output payloads. Enable X-Ray tracing on the state machine for cross-service latency forensics.
Q10. What AWS Step Functions limits must I memorize for DVA-C02?
Memorize: 1 year max execution (Standard), 5 min max execution (Express), 1 MB state machine definition, 256 KB state input/output, 25,000 execution history events (Standard), 2,000 exec/s (Standard), 100,000+ exec/s (Express), 10,000 distributed Map concurrent children, 40 inline Map concurrency, 3 default Retry MaxAttempts, 2.0 default BackoffRate, 1 s default IntervalSeconds, 90-day in-service history retention (Standard only), and the three service-integration patterns (direct / .sync / .waitForTaskToken). Those numbers and names answer the majority of recall questions on AWS Step Functions.
Summary — AWS Step Functions and Distributed Error Handling at a Glance
- AWS Step Functions is the serverless orchestrator of DVA-C02 Domain 4; it replaces Lambda-to-Lambda chaining with a declarative JSON state machine written in Amazon States Language (ASL).
- Standard vs Express: Standard = 1 year, 2,000 exec/s, exactly-once, 90-day history in service, per-transition pricing. Express = 5 min, 100,000+ exec/s, CloudWatch Logs required, at-most-once (Sync) or at-least-once (Async), per-execution+memory pricing.
- Eight ASL state types: Task, Choice, Parallel, Map (inline or distributed), Pass, Wait, Succeed, Fail. Memorize which does what.
- Intrinsic functions (States.Format, States.StringSplit, States.JsonMerge, States.UUID, plus the array/math/hash/base64 helpers) kill plumbing Lambdas.
- Retry: ErrorEquals, MaxAttempts, IntervalSeconds, BackoffRate, MaxDelaySeconds, JitterStrategy. Built-in error names include States.ALL, States.Timeout, States.TaskFailed, States.Permissions, States.NoChoiceMatched.
- Catch: ErrorEquals, Next, ResultPath. Always set ResultPath or you lose the business payload on error.
- Three service-integration patterns: direct (arn:aws:states:::svc:action), Run-a-Job (.sync), Wait-for-Callback (.waitForTaskToken with SendTaskSuccess/SendTaskFailure/SendTaskHeartbeat).
- Optimized integrations for Lambda, DynamoDB, SQS, SNS, ECS, Glue, SageMaker, Batch, EventBridge, and nested Step Functions eliminate wrapper Lambdas.
- Distributed Map iterates up to 10,000 concurrent children over S3 prefixes or CSV/JSON files; pair Standard parent with Express children for audit + throughput.
- Saga pattern: forward Task + compensating Task via Catch; compensations must be idempotent.
- Circuit breaker: Choice state reading a CloudWatch alarm state via a probe Task.
- Standard workflows expose visual execution history in the console; Express workflows require CloudWatch Logs for any post-mortem debugging.
- Orchestration (Step Functions) vs choreography (EventBridge) vs decoupling (SQS) — each solves a different class of problem.
- Master these and DVA-C02 Domain 4 orchestration scenarios become predictable, and Task 4.3 optimization answers become obvious.