
Systems Manager: Automation, Patch Manager, and Session Manager

7,400 words · about 37 minutes' reading time

AWS Systems Manager is the SOA-C02 SysAdmin's Swiss Army knife — a single console that replaces SSH, cron, ad-hoc patch scripts, hand-rolled secrets in EC2 user-data, and the bastion host. On the SOA-C02 exam, Systems Manager threads through Domain 3 Task Statement 3.2 ("automate manual or repeatable processes") and Domain 1 Task Statement 1.2 ("remediate issues based on monitoring and availability metrics"), and it also shows up as the back end of AWS Config auto-remediation, as an EventBridge target alternative to Lambda, and as the only sanctioned way to execute commands against a fleet (Run Command) without exposing port 22. If CloudWatch is the SOA-C02 nervous system, Systems Manager is the SOA-C02 hands.

This guide walks through Systems Manager from the SysOps angle: how the SSM Agent identifies an instance and what makes a managed instance "managed", how Run Command differs from State Manager (one-shot vs continuous desired-state), how Patch Manager baselines define which patches are approved and how the auto-approval delay works, how Maintenance Windows orchestrate patching across a fleet on a schedule, how Automation runbooks chain steps with assumeRole and conditional branches, how Session Manager removes SSH bastions and port 22, how Parameter Store SecureString relates to (and differs from) Secrets Manager, and how Inventory and OpsCenter provide compliance and incident workflows. You will also see the recurring SOA-C02 scenario shapes: agent-not-managed diagnostics, "patches are approved but not installed", maintenance-window concurrency tuning, Automation runbook IAM AssumeRole, and Session Manager VPC endpoint requirements for private subnets.

Why Systems Manager Sits at the Heart of SOA-C02 Domain 3

The official SOA-C02 Exam Guide v2.3 lists three skills under Task Statement 3.2: "use AWS services (for example, Systems Manager, CloudFormation) to automate deployment processes", "implement automated patch management", and "schedule automated tasks by using AWS services (for example, EventBridge, AWS Config)". Systems Manager appears in two of the three skills, and Patch Manager is the only AWS-native answer to the "implement automated patch management" skill — there is no other in-scope service that performs OS patching against an EC2 fleet on a schedule. Domain 3 is worth 18 percent of the exam, and Task Statement 3.2 produces the bulk of those questions.

Domain 1 Task Statement 1.2 then re-uses Systems Manager from a different angle: "use AWS Systems Manager Automation runbooks to take action based on AWS Config rules". When AWS Config detects a non-compliant resource, the EventBridge rule on aws.config events routes to an SSM Automation document that performs the remediation — tag a missing resource, encrypt an unencrypted EBS volume, restrict an open security group. The exam expects you to know the full Config -> EventBridge -> SSM Automation chain end to end, including the IAM remediation role that SSM assumes.

At the SysOps tier the framing is operational, not architectural. SAA-C03 asks "which AWS service should you use to manage patches across a fleet?" SOA-C02 asks "the nightly patching cycle is failing on 12 percent of instances — list the diagnostic steps in order". The answer always touches the same three foundations: SSM Agent must be running, the IAM instance profile must include AmazonSSMManagedInstanceCore, and the network path to the SSM endpoint must exist (NAT gateway, internet gateway, or VPC interface endpoint). Get one of the three wrong and Systems Manager appears to do nothing — silently. That silence is the canonical SOA-C02 trap.

  • SSM Agent: the lightweight daemon (amazon-ssm-agent on Linux, AmazonSSMAgent Windows service) pre-installed on most modern AWS-published AMIs. The agent polls Systems Manager and executes Run Command, State Manager, Automation, and Session Manager work on the host.
  • Managed instance: any EC2 or hybrid (on-prem) host that is registered with Systems Manager. Membership requires three conditions: agent running, IAM instance profile with AmazonSSMManagedInstanceCore (or hybrid activation for on-prem), and outbound network reach to the SSM service endpoint.
  • Document (SSM Document): a JSON or YAML template that defines the steps to run on a managed instance. Document types include Command (Run Command and State Manager), Automation (multi-step orchestration that may target AWS APIs in addition to instances), Session (Session Manager preferences), and Package.
  • Run Command: ad-hoc, one-shot remote execution of a Command document against a target set defined by instance IDs, tags, or resource groups.
  • State Manager: continuous-desired-state — a State Manager association re-runs a Command document on a schedule against a target set to enforce a desired configuration (agent installed, AV definitions current, sshd hardened).
  • Patch Manager: the SSM capability that scans and installs OS patches against managed instances using patch baselines, patch groups, and maintenance windows.
  • Patch baseline: rules that define which patches are auto-approved (by classification, severity, product) and which are explicitly approved or rejected. Each OS family has a default AWS-managed baseline; you can author custom baselines.
  • Patch group: a tag (Patch Group = prod-rhel9) that pins instances to a custom baseline. An instance can belong to exactly one patch group.
  • Maintenance Window: a recurring or one-time schedule that targets instances and runs tasks (Run Command, Automation, Lambda, Step Functions). Patching at scale uses a Maintenance Window with an AWS-RunPatchBaseline task.
  • Automation runbook: a multi-step document with branching, assumeRole, aws:executeAwsApi, aws:executeScript, aws:waitForAwsResourceProperty, and aws:approve steps — used for self-service operations and Config auto-remediation.
  • Session Manager: browser-based or CLI shell access to managed instances without SSH keys, bastion hosts, or open inbound ports — auditable to S3 and CloudWatch Logs.
  • Parameter Store: hierarchical key/value store for configuration data and secrets. Standard tier is free; Advanced tier supports larger values and policies. SecureString uses KMS for encryption at rest.
  • Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html
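The "managed instance" concept above can be made concrete with one CLI call — a minimal sketch (assuming AWS CLI v2 with credentials configured; the region is a placeholder) that lists which hosts Systems Manager currently counts as managed:

```shell
# List managed instances with their ping status, agent version, and OS.
# Region is a placeholder; add --profile as needed.
aws ssm describe-instance-information \
  --region us-east-1 \
  --query 'InstanceInformationList[].{Id:InstanceId,Ping:PingStatus,Agent:AgentVersion,OS:PlatformName}' \
  --output table
```

An EC2 instance missing from this list is failing one of the three membership conditions (agent running, IAM instance profile, or network path).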

Plain-Language Explanations: Systems Manager Automation, Patch Manager, and Run Command

Systems Manager has many sub-services and the names blur together. Three analogies make the structure stick.

Analogy 1: The Building Maintenance Department

Systems Manager is the building maintenance department of a large office complex (your AWS account). Each EC2 instance is a room with a smart lock (the SSM Agent), and the maintenance department holds the master key (the IAM instance profile AmazonSSMManagedInstanceCore). Without that lock fitted and the master key on file, the maintenance team cannot enter — that room is "unmanaged". Run Command is the maintenance worker walking room-by-room performing a one-time job ("change the bulb in every room tagged Floor=3"). State Manager is the standing recurring instruction ("once a week, check that every room's smoke detector is armed and replace the battery if it isn't"). Patch Manager is the building-wide window-replacement program — it has a patch baseline (the spec sheet for which window models are approved), patch groups (north-facing rooms vs south-facing rooms get different glass), and a maintenance window (Sundays 2am to 5am, only ten rooms at once so the building is never fully closed). Automation runbooks are the emergency response binders with numbered steps the maintenance team follows when a fire alarm goes off — step 1 evacuate, step 2 notify, step 3 isolate, step 4 inspect — and each step has an assumeRole clause for which department's keys are needed. Session Manager is the maintenance worker's universal swipe card that opens any room without ever needing the room's individual key (no SSH keypair) and without needing a side door (no bastion host) — and every entry is logged to a tamper-proof register. Parameter Store is the master folder cabinet in the basement holding standard supply specs (paint color, light bulb wattage) — the SecureString drawer is locked with a KMS key.

Analogy 2: The IT Department's Run Book Binder

A SysAdmin's physical run book binder has tabs for "create new VPN user", "rotate database password", "patch RHEL fleet", and "investigate failed login alarm". Each tab has a numbered procedure: step 1, step 2, step 3, with conditional branches ("if step 2 returns RC=4, skip to step 7"). SSM Automation runbooks are exactly that binder, but executable — the document defines steps as aws:executeAwsApi (call an AWS API), aws:executeScript (run a Python or PowerShell script locally on the runner), aws:runCommand (run a Command document on a managed instance), aws:waitForAwsResourceProperty (poll until a condition is met), and aws:approve (pause for a human to approve before continuing). The AWS-managed runbooks like AWS-RestartEC2Instance, AWS-StopEC2Instance, AWS-CreateSnapshot, and AWS-PatchInstanceWithRollback are the pre-printed pages every IT department starts with; you write custom runbooks for the procedures that are unique to your shop. The runbook author specifies an AssumeRole so the runbook can act with elevated privileges that the human user doesn't have directly — the IT department's "after-hours master key" is checked out by the runbook engine, not handed to every junior tech.

Analogy 3: The Hotel Master Patch Schedule

Patching a fleet is like renovating a chain of hotels. You can't close every hotel at once — guests would have nowhere to go. Patch Manager + Maintenance Windows is the chain-wide renovation schedule: every hotel tagged Region=APAC gets repainted Sundays 2am to 5am, with MaxConcurrency = 10% (only one in ten hotels is offline at a time) and MaxErrors = 5% (if more than 5 percent of hotels fail to repaint, halt the chain-wide rollout). The patch baseline is the specification sheet — color, primer brand, finish — and the auto-approval delay is the manufacturer warranty wait: a new paint formula is automatically approved 7 days after release so any factory defects surface first. Patch groups tag hotels with their baseline ("luxury chain uses spec A, budget chain uses spec B"). Patch compliance reporting is the post-renovation inspection report — every hotel's status against its baseline. The whole pipeline is what the SOA-C02 exam means by "implement automated patch management".

For SOA-C02, the building maintenance department analogy is the most useful when a question chains capabilities together — Run Command + State Manager + Patch Manager are all the same maintenance team using different tools, with the same IAM and network prerequisites. Whenever the question hints "the agent isn't responding" or "the patches aren't installing", remind yourself: agent + IAM + network — those three have to all be present. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html

SSM Agent: Lifecycle, Prerequisites, and Troubleshooting

The SSM Agent is the foundation of every Systems Manager capability. If the agent is not registered as a managed instance, no Run Command, no State Manager, no Patch Manager, no Session Manager, no Automation aws:runCommand step will work on that host.

Agent installation

The agent is pre-installed on most AWS-published AMIs since 2017: Amazon Linux 2 / 2023, Ubuntu 16.04+, Windows Server 2008 R2+, RHEL/CentOS 7+, SUSE 12+, macOS, and the Bottlerocket OS used for EKS-managed nodes. For older AMIs, custom AMIs, or on-premises hosts, install via:

  • yum install -y amazon-ssm-agent (RHEL/CentOS/Amazon Linux) or snap install amazon-ssm-agent --classic (Ubuntu).
  • MSI installer for Windows Server.
  • Run Command AWS-ConfigureAWSPackage with parameter name=AmazonSSMAgent if the host is already managed (chicken-and-egg: only useful for upgrades).
  • EC2 Image Builder baking the latest agent version into a golden AMI — the SOA-C02-preferred path for fleet uniformity.

The agent communicates outbound over HTTPS port 443 to the regional Systems Manager endpoints (ssm.<region>.amazonaws.com, ec2messages.<region>.amazonaws.com, and ssmmessages.<region>.amazonaws.com). It never accepts inbound traffic — this is what makes Session Manager safe.

Three prerequisites for a managed instance

For an instance to appear in Fleet Manager → Managed instances, all three must be true:

  1. SSM Agent installed and running (systemctl status amazon-ssm-agent).
  2. IAM instance profile attached with AmazonSSMManagedInstanceCore policy (or a custom equivalent that grants ssm:UpdateInstanceInformation, ssmmessages:*, ec2messages:*, plus the writes needed for inventory/Session Manager logging).
  3. Outbound network path to the three SSM endpoints — via internet gateway (public subnet), NAT gateway (private subnet), or VPC interface endpoints for com.amazonaws.<region>.ssm, com.amazonaws.<region>.ec2messages, and com.amazonaws.<region>.ssmmessages (private subnet without internet egress).

If any of the three is missing, the instance silently does not register. Fleet Manager simply does not list it. Run Command targeting by instance ID returns "InvalidInstanceId" — not a permission error, not a network timeout, just "not a valid SSM instance".
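For the third prerequisite on a private subnet with no internet egress, the three interface endpoints can be created in a loop — a sketch with placeholder VPC, subnet, and security-group IDs:

```shell
# Create the three interface endpoints SSM needs in an air-gapped VPC.
# All IDs are placeholders; the security group must allow inbound 443
# from the instances' subnets.
for svc in ssm ec2messages ssmmessages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.us-east-1.${svc}" \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
done
```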

On-premises and hybrid

For on-prem servers, register the host via a Systems Manager hybrid activation: create an activation with an IAM service role (not instance profile, since there is no EC2), receive an activation code and ID, and run amazon-ssm-agent -register -code <code> -id <id> -region <region> on the host. The host now shows up as a managed instance with an mi- prefix instead of i-. SOA-C02 can ask "patch a fleet that mixes EC2 and on-prem servers in one schedule" — the answer is hybrid activation + a Maintenance Window targeting both sets.
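The activation flow splits into two halves — one CLI call in the AWS account, one agent command on the on-prem host. A sketch (role name and region are placeholders; the activation code and ID returned by the first call are shown as angle-bracket stand-ins):

```shell
# In the AWS account: create a hybrid activation bound to an SSM service role.
aws ssm create-activation \
  --iam-role SSMServiceRoleForHybrid \
  --registration-limit 50 \
  --default-instance-name on-prem-web \
  --region us-east-1

# On the on-prem Linux host: register with the code/ID returned above,
# then restart the agent. The host appears in Fleet Manager with an mi- prefix.
sudo amazon-ssm-agent -register -code "<activation-code>" -id "<activation-id>" -region us-east-1
sudo systemctl restart amazon-ssm-agent
```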

Agent troubleshooting checklist

When an instance is missing from Fleet Manager:

  1. Is the agent running? sudo systemctl status amazon-ssm-agent — if dead, sudo systemctl restart amazon-ssm-agent and check journalctl -u amazon-ssm-agent.
  2. Is the IAM role attached and correct? EC2 console → Instance → Security tab → IAM Role. Confirm the role has AmazonSSMManagedInstanceCore. If the role was attached after launch, restart the agent so it re-reads the credentials.
  3. Can the agent reach the endpoints? From the instance: curl -v https://ssm.us-east-1.amazonaws.com/ — a TLS handshake is good. If timeout, check the route table (NAT gateway present?), security group outbound rules (443 allowed?), NACL outbound and inbound (ephemeral ports allowed back?), and whether the VPC has the three required interface endpoints if it is air-gapped.
  4. Agent version mismatch? Older agents have known bugs against newer SSM features. Patch Manager AWS-RunPatchBaseline requires agent ≥ 3.0 for some operating systems. Update via Run Command AWS-UpdateSSMAgent — but the instance must already be managed to receive it (which is why baking the agent into the AMI is preferred).
  5. Resource limits? Each region has a default quota of managed instances. For very large fleets, request an increase via Service Quotas.

The single most-tested SOA-C02 SSM scenario is "Run Command does not list the instance" or "patches are not installing on 10 percent of the fleet". The answer is virtually always one of: agent not running (or wrong version), IAM instance profile missing AmazonSSMManagedInstanceCore, or network path absent (no NAT gateway, no internet gateway, no VPC interface endpoint). The exam tests the diagnostic order — agent first, IAM second, network third. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/setup-instance-profile.html

  • Agent listening ports: zero inbound; outbound HTTPS 443 only.
  • Three SSM endpoint domains required: ssm.<region>.amazonaws.com, ec2messages.<region>.amazonaws.com, ssmmessages.<region>.amazonaws.com.
  • VPC interface endpoints required for private-subnet-only fleets: com.amazonaws.<region>.ssm, com.amazonaws.<region>.ec2messages, com.amazonaws.<region>.ssmmessages.
  • Default agent heartbeat: every 5 minutes the agent calls UpdateInstanceInformation. An instance that misses two consecutive heartbeats is marked ConnectionLost.
  • Managed-instance prefix: i- for EC2, mi- for hybrid (on-prem) registrations.
  • Required managed policy: AmazonSSMManagedInstanceCore (the modern minimum). The legacy AmazonEC2RoleforSSM is deprecated.
  • Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html

Run Command: Ad-Hoc Fleet Execution Without SSH

Run Command is the simplest Systems Manager capability and the most-used in day-to-day SysOps. It executes a Command document against a target set, gathers stdout/stderr, and reports per-instance success or failure.

How a Run Command invocation flows

  1. The SysOps engineer selects a Command document — for example, AWS-RunShellScript (Linux), AWS-RunPowerShellScript (Windows), AWS-ConfigureAWSPackage (install/update SSM-distributed packages), AWS-UpdateSSMAgent, AmazonCloudWatch-ManageAgent, or a custom document.
  2. Defines targets:
    • By instance IDs — explicit list.
    • By tag — for example tag:Environment=prod. The most common pattern at fleet scale.
    • By Resource Group — a saved query that resolves to a set of instances.
  3. Sets concurrency controls:
    • MaxConcurrency — how many instances run the command simultaneously (absolute number or percentage).
    • MaxErrors — when to halt the rollout. After MaxErrors failures, no further instances start.
  4. Optionally specifies an output S3 bucket and CloudWatch log group so stdout is preserved beyond the 2,500-character console truncation.
  5. Optionally requires CloudWatch alarm integration — if a configured alarm fires during execution, halt the command.

The agent on each target receives the work via the SSM service, runs it locally, and reports back. There is no SSH session, no inbound port, no shared credential.
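The five steps above map to a single send-command invocation — a sketch (bucket and log-group names are placeholders):

```shell
# Run a shell command across all prod-tagged instances, 10% at a time,
# halting after 5% failures, with full output preserved in S3 and CloudWatch.
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:Environment,Values=prod" \
  --parameters 'commands=["df -h /","systemctl is-active amazon-ssm-agent"]' \
  --max-concurrency "10%" \
  --max-errors "5%" \
  --output-s3-bucket-name my-ssm-output-bucket \
  --cloud-watch-output-config "CloudWatchOutputEnabled=true,CloudWatchLogGroupName=/ssm/run-command"
```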

Common Command documents

  • AWS-RunShellScript / AWS-RunPowerShellScript — execute arbitrary commands. Quick and dangerous; prefer purpose-built documents when one exists.
  • AWS-ConfigureAWSPackage — install or upgrade packages from SSM Distributor (CloudWatch agent, Inspector agent, third-party agents like CrowdStrike).
  • AWS-UpdateSSMAgent — upgrade the agent itself.
  • AmazonCloudWatch-ManageAgent — install, configure, and start the CloudWatch agent with a config from Parameter Store.
  • AWS-RunPatchBaseline — scan or install patches per the patch baseline. This is the document Patch Manager invokes inside a Maintenance Window.
  • AWS-StartInteractiveCommand / AWS-StartSSHSession — Session Manager bootstrap documents.

Run Command vs State Manager

Run Command is one-shot: you click it, it runs, it's done. State Manager is continuous-desired-state: you create an association that re-runs a Command document on a schedule against a target set. Two SOA-C02 examples make the distinction clear:

  • "Apply the latest CloudWatch agent config to all Environment=prod instances right now" — Run Command.
  • "Ensure every new Environment=prod instance has the CloudWatch agent installed and configured, forever" — State Manager association, runs every 30 minutes (default) or any cron schedule, automatically picks up new instances that join the tag.

State Manager associations are also the SOA-C02-correct answer for "the auditor wants every instance to be in compliance with our hardening Command document continuously" — Run Command is one-shot and would not catch drift.
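The second example can be sketched as a State Manager association — the Parameter Store path and association name are placeholders, and the parameter keys follow the AmazonCloudWatch-ManageAgent document:

```shell
# Keep the CloudWatch agent configured on every current and future prod
# instance, re-applying the configuration every 30 minutes.
aws ssm create-association \
  --association-name "prod-cloudwatch-agent" \
  --name "AmazonCloudWatch-ManageAgent" \
  --targets "Key=tag:Environment,Values=prod" \
  --parameters 'action=configure,mode=ec2,optionalConfigurationSource=ssm,optionalConfigurationLocation=/prod/cloudwatch-agent/config,optionalRestart=yes' \
  --schedule-expression "rate(30 minutes)"
```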

The SOA-C02 cue is the verb. "Run", "execute", "apply" once → Run Command. "Ensure", "maintain", "continuously" → State Manager association. Both use the same Command documents, but State Manager re-runs them on a schedule and captures compliance status per-instance. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-state.html

Patch Manager: Baselines, Patch Groups, and Scan vs Install

Patch Manager is the SOA-C02 answer for automated OS patching. It scans managed instances against a patch baseline, optionally installs the approved patches, and reports compliance.

Patch baselines

A patch baseline is a set of rules that determine which patches are approved for an OS. Each OS family has a default AWS-managed baseline:

  • AWS-AmazonLinux2DefaultPatchBaseline
  • AWS-WindowsPredefinedPatchBaseline-OS
  • AWS-UbuntuDefaultPatchBaseline
  • AWS-RedHatDefaultPatchBaseline
  • AWS-SUSEDefaultPatchBaseline
  • AWS-CentOSDefaultPatchBaseline
  • AWS-MacOSDefaultPatchBaseline

Each default baseline approves Critical and Important security patches with a 7-day auto-approval delay — meaning a patch released by the OS vendor is automatically approved 7 days after release. The delay protects against bad patches being rolled out before the community discovers regressions.

You can author a custom baseline when:

  • You need a different auto-approval delay (0 days for emergency patching of zero-days, 14 or 30 days for ultra-conservative production environments).
  • You need to approve only a subset of classifications (Security but not Bugfix).
  • You need to explicitly approve a single critical CVE patch ahead of the auto-approval delay.
  • You need to explicitly reject a known-bad patch.
  • You need OS-level patch sources beyond the defaults (e.g., a private package repo).

A custom baseline is associated with a patch group by tagging instances with Patch Group=<group-name>. An instance can belong to exactly one patch group, so the tag value choice matters.
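A custom baseline and its patch group can be wired together in three calls — a sketch with placeholder names and IDs, using a 3-day auto-approval delay for Critical/Important security patches on RHEL:

```shell
# 1. Create the custom baseline (3-day auto-approval, Security classification).
BASELINE_ID=$(aws ssm create-patch-baseline \
  --name "prod-rhel9-baseline" \
  --operating-system REDHAT_ENTERPRISE_LINUX \
  --approval-rules 'PatchRules=[{PatchFilterGroup={PatchFilters=[{Key=CLASSIFICATION,Values=[Security]},{Key=SEVERITY,Values=[Critical,Important]}]},ApproveAfterDays=3}]' \
  --query BaselineId --output text)

# 2. Bind the baseline to the patch group.
aws ssm register-patch-baseline-for-patch-group \
  --baseline-id "$BASELINE_ID" \
  --patch-group "prod-rhel9"

# 3. Put instances in the group via the tag (note the space in the key).
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags 'Key=Patch Group,Value=prod-rhel9'
```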

  • Default auto-approval delay: 7 days for AWS-managed baselines (Critical and Important security patches).
  • Patch Group tag key: literally Patch Group (with a space). Tag value is the patch-group name. An instance can belong to one patch group only.
  • Patch baseline OS coverage: Amazon Linux 2/2023, Windows, Ubuntu, RHEL, SUSE, CentOS, macOS, Oracle Linux, Debian.
  • Patch compliance states: COMPLIANT, NON_COMPLIANT, UNSPECIFIED_DATA (instance has not been scanned).
  • Operations: Scan (report compliance only, no install), Install (install approved patches and reboot if required).
  • Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-patch-baselines.html

Scan vs Install

The AWS-RunPatchBaseline document accepts an Operation parameter:

  • Scan — evaluate the instance against the baseline and report compliance. No patches are installed, no reboots happen. Used for audit and pre-window readiness checks.
  • Install — install all approved patches that are missing, and reboot if any patch requires it (configurable: RebootIfNeeded or NoReboot).

A common operational pattern is:

  1. Daily Scan via State Manager association — every instance reports compliance once a day. The CloudWatch dashboard shows the org-wide compliance percentage.
  2. Weekly Install via Maintenance Window — Sundays 2am to 5am, a Maintenance Window task with Operation=Install patches the fleet in waves controlled by MaxConcurrency and MaxErrors.

Separating Scan from Install lets you observe drift between scheduled patch windows without forcing reboots, and lets compliance reports drive non-patch remediation (e.g., when a single instance fails to patch repeatedly, an EventBridge rule on aws.ssm compliance events pages the on-call).

A common SOA-C02 distractor: the team needs to roll out a critical zero-day patch within 24 hours, but the default baseline has a 7-day auto-approval delay. The candidate who picks "wait 7 days" or "create a new baseline from scratch" gets it wrong. The right answer is one of: (a) explicitly approve the specific patch by KB or CVE in the existing baseline (the explicit list overrides the delay), (b) author a new baseline with auto-approval delay = 0 for the emergency, or (c) use patch lists to scope an emergency rollout. The 7-day delay is a default, not a hard rule. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-patch-baselines.html

Patch compliance reporting

Patch Manager publishes compliance per instance, viewable in:

  • Patch Manager console → Compliance.
  • Systems Manager Inventory → Patch compliance.
  • EventBridge: aws.ssm events of detail-type Configuration Compliance State Change can route to SNS or Lambda.
  • AWS Config — the ec2-managedinstance-patch-compliance-status-check managed rule emits Config compliance findings, which can drive auto-remediation chains.

For audit purposes, the compliance data flows naturally into AWS Security Hub via Config or directly via SSM integrations.

Maintenance Windows: Scheduling Fleet-Wide Operations

A Maintenance Window is a recurring or one-time schedule with three components: a schedule, a set of registered targets, and a set of registered tasks. Maintenance Windows are how Patch Manager runs at fleet scale, and they are also useful for any other recurring operation that should not run during business hours.

Schedule

Schedules accept either:

  • A cron expression, for example cron(0 2 ? * SUN *) for "every Sunday 2am UTC".
  • A rate expression, for example rate(7 days).
  • A one-time schedule with a specific date.

Optional ScheduleTimezone puts the cron in local time (e.g., Asia/Taipei), avoiding the UTC mental conversion. Optional Cutoff specifies how many hours before the window ends to stop launching new tasks (so already-running tasks have time to finish). Optional Duration is how long the window stays open in total.
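These schedule options combine into one create call — a sketch matching the Sunday-2am example (local time, 3-hour window, 1-hour cutoff; the window name is a placeholder):

```shell
# Create a recurring window: Sundays 02:00 Asia/Taipei, open 3 hours,
# with no new task invocations launched in the final hour.
aws ssm create-maintenance-window \
  --name "prod-linux-patching" \
  --schedule "cron(0 2 ? * SUN *)" \
  --schedule-timezone "Asia/Taipei" \
  --duration 3 \
  --cutoff 1 \
  --allow-unassociated-targets
```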

Registered targets

Targets are resolved at task execution time, not at registration:

  • By instance IDs.
  • By tag (the most common — tag:Patch Group=prod-rhel9).
  • By resource group.

Because targets are resolved at execution time, new instances that match the tag automatically join the next window's run.

Registered tasks

A task is one of four types:

  • RUN_COMMAND — invoke a Command document (most common: AWS-RunPatchBaseline).
  • AUTOMATION — invoke an Automation runbook.
  • LAMBDA — invoke a Lambda function.
  • STEP_FUNCTIONS — start a state machine execution.

Each task carries:

  • MaxConcurrency — how many targets are processed simultaneously. 10 (absolute) or 10% (percentage of total). For a 200-instance fleet at MaxConcurrency=10%, the window patches 20 at a time, then the next 20, and so on.
  • MaxErrors — when to halt the rollout. 5 (absolute) or 5%. After this many task failures, no more instances start.
  • TaskInvocationParameters — document-specific parameters (Operation=Install, RebootOption=RebootIfNeeded).
  • ServiceRoleArn — the IAM role the Maintenance Window assumes to execute the task (the maintenance window service role, not the instance profile — these are different IAM identities).
  • Priority — when multiple tasks are registered to the same window, lower-priority tasks run first.
  • MaxConcurrency: absolute number or percentage (e.g., 10%). No default — always specify, or task runs against all targets simultaneously.
  • MaxErrors: absolute number or percentage. After this many failures, the rollout halts. Often set to 5% for production patching.
  • Cutoff time: stop launching new tasks N hours before window end. Default is 0 (no cutoff).
  • Window duration: 1 to 24 hours.
  • Tasks per window: up to 10 tasks per window.
  • Targets per window: up to 50 registered target sets per window.
  • Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-maintenance.html

Scenario Pattern: hundreds of EC2s patched every Sunday 2am

Canonical SOA-C02 design:

  1. Tag every instance to be patched with Patch Group=prod-linux (and Environment=prod).
  2. Custom patch baseline for prod-linux — Amazon Linux 2, Critical/Important security patches, auto-approval delay 7 days, plus an explicit-approval list for any urgent CVEs.
  3. Maintenance Window with cron(0 2 ? * SUN *) and ScheduleTimezone=Asia/Taipei, duration 3 hours, cutoff 30 minutes.
  4. Registered target: tag:Patch Group=prod-linux.
  5. Registered task: AWS-RunPatchBaseline with Operation=Install, RebootOption=RebootIfNeeded, MaxConcurrency=10%, MaxErrors=5%, output to S3 bucket for audit.
  6. State Manager association running AWS-RunPatchBaseline with Operation=Scan daily so compliance is visible mid-week.
  7. EventBridge rule on aws.ssm compliance state-change events, posting to SNS for any instance that fails to patch twice in a row.

This pipeline is the answer to virtually every "patch a fleet" SOA-C02 scenario.
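Steps 3 through 5 of the design can be sketched against an existing window — the window ID, role ARN, account ID, and bucket name are placeholders:

```shell
# Register the tag-based target set with the window...
TARGET_ID=$(aws ssm register-target-with-maintenance-window \
  --window-id mw-0123456789abcdef0 \
  --resource-type INSTANCE \
  --targets "Key=tag:Patch Group,Values=prod-linux" \
  --query WindowTargetId --output text)

# ...then register the patching task with rollout controls and S3 audit output.
aws ssm register-task-with-maintenance-window \
  --window-id mw-0123456789abcdef0 \
  --targets "Key=WindowTargetIds,Values=${TARGET_ID}" \
  --task-type RUN_COMMAND \
  --task-arn "AWS-RunPatchBaseline" \
  --service-role-arn "arn:aws:iam::111122223333:role/MaintenanceWindowRole" \
  --max-concurrency "10%" \
  --max-errors "5%" \
  --task-invocation-parameters 'RunCommand={Parameters={Operation=[Install],RebootOption=[RebootIfNeeded]},OutputS3BucketName=patch-audit-bucket}'
```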

SSM Automation Runbooks: Multi-Step Orchestration

While Run Command targets only managed instances, Automation runbooks orchestrate AWS API calls plus instance commands plus human approvals in a single document. Automation runbooks are how complex remediations and self-service operations are codified.

Automation document structure

An Automation document is YAML (or JSON) with parameters, mainSteps, and an assumeRole. Each step has an action from a small library:

  • aws:executeAwsApi — make an arbitrary AWS API call (DescribeInstances, StopInstances, RestoreDBClusterFromSnapshot, etc.). The most powerful action.
  • aws:executeScript — run a Python or PowerShell script on the runner (the Systems Manager service, not your instance). Useful for transforms and complex logic.
  • aws:runCommand — run a Command document on a managed instance.
  • aws:waitForAwsResourceProperty — poll an API until a property reaches a target value. Used to wait for an instance to enter running or a stack to reach CREATE_COMPLETE.
  • aws:assertAwsResourceProperty — fail the runbook if a property does not match.
  • aws:approve — pause for one or more named principals to approve via console. Used for production-impacting steps that need human judgement.
  • aws:branch — conditional branching based on a previous step's output (the branch logic is defined in the step's Choices list; there is no separate aws:choice action).
  • aws:sleep — wait for a fixed duration.
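A minimal custom runbook wiring two of these actions together might look like the following sketch — the document name, file name, and parameter names are hypothetical:

```shell
# Write a two-step Automation document: stop an instance via aws:executeAwsApi,
# then poll with aws:waitForAwsResourceProperty until it reaches "stopped".
cat > stop-and-wait.yaml <<'EOF'
schemaVersion: '0.3'
description: Stop an EC2 instance and wait for the stopped state.
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  InstanceId:
    type: String
  AutomationAssumeRole:
    type: String
mainSteps:
  - name: stopInstance
    action: aws:executeAwsApi
    inputs:
      Service: ec2
      Api: StopInstances
      InstanceIds:
        - '{{ InstanceId }}'
  - name: waitForStopped
    action: aws:waitForAwsResourceProperty
    inputs:
      Service: ec2
      Api: DescribeInstances
      InstanceIds:
        - '{{ InstanceId }}'
      PropertySelector: '$.Reservations[0].Instances[0].State.Name'
      DesiredValues:
        - stopped
EOF

# Register it as an Automation document.
aws ssm create-document \
  --name "Custom-StopAndWait" \
  --document-type Automation \
  --document-format YAML \
  --content file://stop-and-wait.yaml
```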

AWS-managed runbooks

AWS publishes a large catalog of pre-built runbooks. The high-frequency SOA-C02 ones:

  • AWS-RestartEC2Instance — stop, wait, start an instance. Simple and surprisingly common in scenarios.
  • AWS-StopEC2Instance / AWS-StartEC2Instance.
  • AWS-CreateSnapshot / AWS-DeleteSnapshot.
  • AWS-PatchInstanceWithRollback — install patches and roll back if reboot fails.
  • AWS-UpdateCloudFormationStack.
  • AWS-RunPatchBaseline (also available as a Command document for Run Command).
  • AWSConfigRemediation-* — a family of remediation runbooks intended to be wired to AWS Config rules.

IAM AssumeRole for Automation

A runbook can be invoked by a user, by EventBridge, by State Manager, or by AWS Config remediation. The runbook itself acts under the AssumeRole specified in the document — typically a role with the union of all permissions needed across all steps.

The role's trust policy must trust ssm.amazonaws.com (the Automation service principal). Common errors:

  • Trust policy missing ssm.amazonaws.com — the runbook fails immediately with AssumeRole denied.
  • Role permissions missing one of the API calls — runbook fails at the specific step.
  • Cross-account runbooks — the target account needs a role with trust to the source account, and the source-account role needs sts:AssumeRole on the target-account role.

A SOA-C02 scenario describes a Config rule wired to an Automation remediation runbook that "does not run". The candidate is shown the rule, the EventBridge target, and the runbook YAML. The bug is virtually always one of: (a) the runbook's assumeRole parameter is empty (the runbook tries to use the invoker's permissions, which fail because the invoker is the Config service); (b) the IAM role's trust policy does not include ssm.amazonaws.com; or (c) the role lacks one of the API permissions a step needs. Without a populated AssumeRole, the runbook fails silently. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-permissions.html
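The trust policy this section describes — and the role creation around it — can be sketched as follows (role and file names are placeholders):

```shell
# Trust policy letting the Automation service principal assume the role.
cat > ssm-automation-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ssm.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
  --role-name AutomationRunbookRole \
  --assume-role-policy-document file://ssm-automation-trust.json
# Then attach a permissions policy covering every API call the runbook's
# steps make; a missing permission fails the runbook at that specific step.
```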

Scenario Pattern: Config rule + EventBridge + SSM Automation

The canonical Domain 1.2 + Domain 3.2 chain:

  1. AWS Config rule — for example s3-bucket-public-read-prohibited. When a bucket becomes public, the rule moves to NON_COMPLIANT.
  2. AWS Config remediation configuration — attached to the rule, specifies the SSM Automation runbook (e.g., AWSConfigRemediation-RemoveS3BucketPolicyStatementForS3PublicAccess) and an IAM remediation role.
  3. SSM Automation runbook — runs against the offending bucket, removes the public-grant statement, returns success.
  4. EventBridge rule (optional) — also routes the aws.config non-compliance event to SNS so humans see what happened.

The SOA-C02 trap: candidates wire the EventBridge rule to a custom remediation Lambda function instead of using Config's native remediation feature. Both work; Config remediation is the AWS-native answer when the input is a Config rule, with built-in retry and logging.
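As a sketch, the remediation configuration from step 2 maps to roughly this structure when registered through the AWS Config API (boto3 put_remediation_configurations). The role ARN is a placeholder, and the runbook's exact parameter names should be verified against its schema:

```python
# Sketch of a Config remediation configuration; the account ID, role name,
# and runbook parameter names are illustrative assumptions.
remediation = {
    "ConfigRuleName": "s3-bucket-public-read-prohibited",
    "TargetType": "SSM_DOCUMENT",
    "TargetId": "AWSConfigRemediation-RemoveS3BucketPolicyStatementForS3PublicAccess",
    "Automatic": True,                 # remediate without a human click
    "MaximumAutomaticAttempts": 3,
    "RetryAttemptSeconds": 60,
    "Parameters": {
        # The runbook runs under this role -- the trap from the section above.
        "AutomationAssumeRole": {
            "StaticValue": {"Values": ["arn:aws:iam::111122223333:role/ConfigRemediationRole"]}
        },
        # RESOURCE_ID is substituted with the non-compliant resource by Config.
        "BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
    },
}
# To register: boto3.client("config").put_remediation_configurations(
#     RemediationConfigurations=[remediation])
```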

Session Manager: Replacing SSH and Bastion Hosts

Session Manager provides browser-based or AWS CLI shell access to managed instances without SSH keys, without bastion hosts, and without inbound port 22 or 3389. It is the single most-tested SSM capability on SOA-C02 because it solves a security problem every SysOps team faces.

How a session works

  1. The user opens Systems Manager → Fleet Manager → select instance → Start session (or aws ssm start-session --target i-0abc123 from the CLI).
  2. The Session Manager service authenticates the user against IAM (the user needs ssm:StartSession permission).
  3. The SSM Agent on the target instance receives a request via its existing outbound connection.
  4. The agent spawns a shell (default: ssm-user, with sudo). Stdin and stdout flow through the SSM service back to the user's browser/CLI — no inbound traffic to the instance.
  5. The session is logged: every keystroke and command output can be streamed to CloudWatch Logs and/or S3, optionally KMS-encrypted.

Why Session Manager wins on the exam

Compared with bastion-host SSH, Session Manager:

  • Removes inbound port 22 entirely — security groups can deny all inbound traffic, dramatically shrinking the attack surface.
  • Removes SSH keys — IAM identities authenticate sessions; no shared keypairs to rotate or distribute.
  • Removes bastion EC2 cost — no always-on jump host.
  • Logs every session to S3/CloudWatch — auditable for compliance.
  • Works in private subnets with no NAT/IGW if the VPC has the three SSM interface endpoints.
  • Integrates with IAM session policies to restrict which commands a user can run.
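The IAM control in the last bullet can be sketched as an identity policy that scopes ssm:StartSession to instances carrying a particular tag — a hypothetical example; the tag key and value are illustrative:

```python
import json

# Hypothetical session policy: allow sessions only to instances tagged
# Environment=Dev, using the ssm:resourceTag condition key.
start_session_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ssm:StartSession",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {"StringEquals": {"ssm:resourceTag/Environment": ["Dev"]}},
        }
    ],
}

print(json.dumps(start_session_policy, indent=2))
```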

Session preferences

An SSM document of type Session, named SessionManagerRunShell, holds the account-wide preferences:

  • shellProfile — shell commands to run automatically when a session starts, configured separately for Linux and Windows.
  • runAsEnabled — start session as a custom OS user instead of ssm-user.
  • s3BucketName + cloudWatchLogGroupName — where to log sessions.
  • kmsKeyId — KMS key for encrypting session log data.
  • idleSessionTimeout — auto-terminate idle sessions after N minutes.
  • maxSessionDuration — hard cap on session length.
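Assembled together, the preferences document might look like the following sketch. The field names follow the Session document schema as commonly published; the bucket, log group, and KMS alias are placeholders — verify against the actual document in your account:

```python
import json

# Sketch of an account-wide Session preferences document.
session_prefs = {
    "schemaVersion": "1.0",
    "description": "Session Manager account-wide preferences",
    "sessionType": "Standard_Stream",
    "inputs": {
        "runAsEnabled": False,             # sessions start as ssm-user
        "idleSessionTimeout": "30",        # minutes idle before auto-terminate
        "maxSessionDuration": "120",       # hard cap, in minutes
        "s3BucketName": "my-session-logs",         # illustrative
        "s3EncryptionEnabled": True,
        "cloudWatchLogGroupName": "/ssm/sessions", # illustrative
        "cloudWatchEncryptionEnabled": True,
        "kmsKeyId": "alias/session-logs",          # illustrative CMK alias
        "shellProfile": {"linux": "", "windows": ""},
    },
}

print(json.dumps(session_prefs, indent=2))
```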

Session Manager VPC endpoint requirement for private subnets

For an instance in a private subnet with no internet egress, the same three SSM VPC interface endpoints required for managed-instance registration also support Session Manager:

  • com.amazonaws.<region>.ssm
  • com.amazonaws.<region>.ec2messages
  • com.amazonaws.<region>.ssmmessages

If you also want session logs to go to S3 from the instance directly (rather than via the SSM service), you may need an S3 gateway endpoint.
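A tiny helper (the function name is made up for illustration) makes the three-endpoint checklist mechanical:

```python
# List the three interface endpoints a fully private subnet needs for SSM.
def ssm_endpoint_services(region: str) -> list[str]:
    return [f"com.amazonaws.{region}.{svc}"
            for svc in ("ssm", "ec2messages", "ssmmessages")]

print(ssm_endpoint_services("eu-west-1"))
# → ['com.amazonaws.eu-west-1.ssm', 'com.amazonaws.eu-west-1.ec2messages',
#    'com.amazonaws.eu-west-1.ssmmessages']
```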

Whenever the scenario mentions "secure access to EC2 in a private subnet without exposing port 22" or "audit every administrative session for compliance", the SOA-C02 default answer is Session Manager — never a bastion host, never an SSH keypair, never port 22 open. The candidate who picks bastion + key rotation has missed the SOA-specific operational pattern. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html

By default, Session Manager logs to CloudWatch Logs and S3 are unencrypted. For compliance environments, enable KMS encryption in the session preferences with a customer-managed key. The exam sometimes asks "ensure session logs are protected at rest" — the answer is the KMS key in session preferences, not S3 bucket-level encryption alone (although both together is best practice). Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-enable-encryption.html

Parameter Store: Configuration and Secrets

Parameter Store is Systems Manager's hierarchical key/value store. Parameter names are paths (/myapp/prod/db/host), types are String, StringList, or SecureString, and access is controlled with IAM policies on parameter ARNs.

Tiers

  • Standard tier — free, up to 10,000 parameters per region per account, parameter value up to 4 KB. No parameter policies.
  • Advanced tier — paid (USD 0.05/parameter/month at time of writing), up to 100,000 parameters, value up to 8 KB, supports parameter policies (expiration, expiration notification, no-change notification).

SecureString

A SecureString parameter is encrypted at rest with KMS. The parameter's encryption key is either:

  • The AWS managed key with alias alias/aws/ssm (the default).
  • A customer-managed KMS key, specified at creation time.

Reading a SecureString requires both ssm:GetParameter(s) permission and kms:Decrypt permission on the encryption key. Writing requires ssm:PutParameter and kms:Encrypt.
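A minimal identity policy granting both halves of the read path could look like this sketch; the Region, account ID, parameter path, and key ID are placeholders:

```python
import json

# Sketch: both halves of the SecureString read path in one identity policy.
read_securestring_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # half 1: read the parameter itself
            "Effect": "Allow",
            "Action": ["ssm:GetParameter", "ssm:GetParameters"],
            "Resource": "arn:aws:ssm:us-east-1:111122223333:parameter/myapp/prod/*",
        },
        {   # half 2: decrypt with the parameter's KMS key
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
        },
    ],
}

print(json.dumps(read_securestring_policy, indent=2))
```

Missing the kms:Decrypt statement produces the decrypt AccessDeniedException described later in the traps section.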

Parameter Store vs Secrets Manager

This comparison shows up on virtually every SOA-C02 attempt.

  • Encryption at rest: KMS for both.
  • Cost: Parameter Store is free at the Standard tier (0.05 USD/parameter/month at Advanced); Secrets Manager costs 0.40 USD/secret/month plus per-API-call fees.
  • Native automatic rotation: Parameter Store has none (you must wire your own Lambda); Secrets Manager has built-in Lambda rotation for RDS, Redshift, and DocumentDB, plus custom rotation Lambdas for everything else.
  • Cross-account sharing: Parameter Store supports resource policies only at the Advanced tier; Secrets Manager supports resource policies natively.
  • Versioning: both.
  • Maximum value size: 4 KB Standard / 8 KB Advanced for Parameter Store; 64 KB for Secrets Manager.
  • Use case: application config, simple or non-rotated secrets for Parameter Store SecureString; database credentials and API keys with required rotation for Secrets Manager.

The SOA-C02 cue: "automatic rotation" → Secrets Manager. "Application configuration values, some of which are sensitive" → Parameter Store SecureString. "Hardcoded database password in EC2 user-data, must be removed" → Secrets Manager (because operationally you also want rotation, even if the question doesn't say so explicitly).

Public parameters

AWS publishes useful values as public Parameter Store parameters under /aws/service/:

  • /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 — the latest Amazon Linux 2023 AMI ID. CloudFormation Parameters can reference this so stacks always launch on the latest AMI.
  • /aws/service/global-infrastructure/regions — list of all regions.
  • /aws/service/eks/optimized-ami/... — EKS-optimized AMI IDs.

Public parameters are the SOA-C02-correct answer for "always launch the latest AMI" without baking AMI IDs into templates.
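For example, a CloudFormation fragment (a sketch; the instance resource and type are illustrative) that resolves the public parameter at deploy time:

```yaml
# Sketch: resolve the latest AL2023 AMI at stack create/update time.
Parameters:
  LatestAmiId:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64
Resources:
  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !Ref LatestAmiId
      InstanceType: t3.micro
```

Because the parameter is resolved on every stack create or update, no AMI ID is ever hard-coded in the template.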

Inventory: Software and Configuration Catalog

Systems Manager Inventory collects metadata about managed instances on a schedule:

  • Installed applications (with version).
  • AWS components installed.
  • Network configuration (interfaces, routing).
  • Windows updates installed.
  • Files at specified paths.
  • Registry keys (Windows).
  • Custom inventory types defined by the operator.

Inventory is a State Manager association under the hood — it runs the AWS-GatherSoftwareInventory document on a schedule (default every 30 minutes). The collected data is queryable in the SSM console and via the Inventory API, and it can be synced to a central S3 bucket via Resource Data Sync for cross-account, cross-region querying with Athena or QuickSight.

Common SOA-C02 use cases:

  • "Find every instance running Apache version < 2.4.55" — Inventory query on installed applications.
  • "Audit all Windows hotfixes installed across the org" — Resource Data Sync into S3 + Athena.
  • "Identify instances with a specific configuration file present" — custom inventory type.

OpsCenter: Operational Item Aggregation

OpsCenter is the Systems Manager workflow tool for managing operational issues (called OpsItems). An OpsItem is a tracked operational concern — it has a status (Open, In Progress, Resolved), an assignee, related resources, and a category.

OpsItems are created automatically from:

  • CloudWatch alarm state changes (when configured).
  • AWS Config non-compliance events.
  • AWS Health events.
  • GuardDuty findings.
  • Manual creation by SysOps engineers.

OpsCenter is the SOA-C02 answer for "consolidate operational alerts from multiple AWS services into a single triage view". For full incident response, OpsCenter integrates with Incident Manager (separate sub-feature), which adds runbooks, on-call rotations, and chat channel integration.

Common Trap: SSM Endpoints in Air-Gapped VPCs

A canonical SOA-C02 trap: an instance in a fully private subnet (no NAT, no IGW, no Direct Connect, no public IP) cannot register as managed even though the agent is running and the IAM role is correct. The agent silently retries DNS resolution and TLS handshake against ssm.<region>.amazonaws.com and times out. The fix is the three SSM VPC interface endpoints (ssm, ec2messages, ssmmessages). Forgetting any one of them produces partial functionality:

  • Without ssm endpoint — instance never registers.
  • Without ec2messages endpoint — Run Command and Patch Manager fail.
  • Without ssmmessages endpoint — Session Manager specifically fails (session establishment uses this endpoint), but Run Command may still work.

The exam tests the three-endpoint requirement directly — candidates who only configure the ssm endpoint and assume the rest follows are wrong.

Common Trap: Patch Manager Reboots Production During Business Hours

If you run AWS-RunPatchBaseline with Operation=Install and RebootOption=RebootIfNeeded outside a Maintenance Window — for example as an ad-hoc Run Command — the targeted instances will reboot the moment a patch requires it, regardless of whether you are at lunch or in a SEV1 incident. The Maintenance Window is what enforces the schedule and the concurrency. The SOA-C02 lesson: Run Command for patching is for emergencies or scan-only; production patching belongs in a Maintenance Window with MaxConcurrency and MaxErrors controls.
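For the safe ad-hoc case, a scan-only invocation might be parameterized like this sketch of boto3 send_command keyword arguments; the bucket name and tag value are illustrative:

```python
# Sketch of a scan-only ad-hoc patch invocation. Operation=Scan reports
# compliance without installing anything, so nothing can reboot.
scan_command = {
    "DocumentName": "AWS-RunPatchBaseline",
    "Targets": [{"Key": "tag:Patch Group", "Values": ["prod-web"]}],  # tag key contains a space
    "Parameters": {"Operation": ["Scan"]},
    "MaxConcurrency": "10%",
    "MaxErrors": "5%",
    "OutputS3BucketName": "my-runcommand-output",  # full stdout, not the truncated console view
}
# To execute: boto3.client("ssm").send_command(**scan_command)
```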

Common Trap: Hybrid Activation Code Reuse

Hybrid activation codes are short-lived (default 24 hours, max 30 days) and tied to an instance count limit. Operators sometimes try to reuse a single activation code across hundreds of on-prem hosts and find that registration fails after the limit. The right pattern is to mint activation codes via Infrastructure-as-Code (CloudFormation, Terraform via the Systems Manager APIs) at provisioning time, with appropriate per-batch instance count limits, and to rotate the codes regularly.

Common Trap: Run Command Has No Output

If you do not configure an S3 output bucket or CloudWatch Log Group on a Run Command invocation, command stdout is truncated to the first 2,500 characters in the Run Command console. For any command that produces meaningful output (a multi-line script, a configuration dump, a compliance report), always specify an S3 output location at invocation time — this is a frequent SOA-C02 distractor where candidates think Run Command is broken when actually output is just truncated.

SOA-C02 vs SAA-C03: The Operational Lens

  • Patch management. SAA-C03: "Which AWS service automates OS patching?" SOA-C02: "Patch Manager skipped 12% of instances last Sunday — list the diagnostic order."
  • Bastion replacement. SAA-C03: "Which service securely replaces SSH bastions?" SOA-C02: "Configure Session Manager with KMS-encrypted CloudWatch logs and a 30-minute idle timeout."
  • Secrets management. SAA-C03: "Choose between Parameter Store and Secrets Manager." SOA-C02: "RDS credentials hardcoded in EC2 user-data — migrate to Secrets Manager and rotate every 30 days."
  • Automation. SAA-C03: "Which service runs scheduled remediation?" SOA-C02: "Build a Config rule + EventBridge + SSM Automation chain with the correct AssumeRole."
  • Run Command. SAA-C03: rarely tested. SOA-C02: heavily tested — concurrency, output to S3, MaxErrors, Run Command vs State Manager.
  • Maintenance Windows. SAA-C03: "How do you schedule fleet patching?" SOA-C02: "Configure cron(0 2 ? * SUN *) with MaxConcurrency=10% and MaxErrors=5%."
  • SSM Agent. SAA-C03: "Where does SSM Agent come from?" SOA-C02: "Instance not visible in Fleet Manager — list every prerequisite to verify."

The SAA candidate selects the service; the SOA candidate configures it correctly, troubleshoots when it does not register, and operates the patch cycle, automation pipeline, and session log audit during incidents.

Exam Signal: How to Recognize a Domain 3.2 Question

  • "Patches did not install" — diagnose agent + IAM + network, then check baseline auto-approval delay, then check Maintenance Window MaxErrors halt.
  • "Cannot SSH into the private-subnet instance" — Session Manager + SSM VPC endpoints. Never bastion + SSH key.
  • "Hardcoded credential must be removed" — Secrets Manager (with rotation) or Parameter Store SecureString (without rotation).
  • "Config rule must auto-remediate" — Config remediation configuration → SSM Automation runbook with correct AssumeRole and trust to ssm.amazonaws.com.
  • "Continuous compliance, not one-shot" — State Manager association, not Run Command.
  • "Patch on schedule" — Maintenance Window with cron, registered tags, AWS-RunPatchBaseline task, MaxConcurrency and MaxErrors.
  • "Always launch latest AMI" — Public Parameter Store parameter under /aws/service/.
  • "Operational alerts from multiple sources need triage" — OpsCenter OpsItems.
  • "Audit every administrative session" — Session Manager + S3/CloudWatch Logs + KMS encryption.
  • "Patch on-prem hosts and EC2 in one schedule" — SSM hybrid activation + Maintenance Window targeting both.

With Domain 3 worth 18 percent and Task Statement 3.2 covering Systems Manager-driven automation and patching, expect 8 to 10 questions in this exact territory, plus 2-3 more from Domain 1.2 (Config + EventBridge + SSM Automation chain). Mastering the patterns in this section is one of the highest-leverage study activities for SOA-C02. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html

Decision Matrix — Systems Manager Capability for Each SysOps Goal

Use this lookup during the exam.

  • Run a one-shot command across a fleet → Run Command. Target by tag for fleet scale; specify S3 output for full stdout.
  • Maintain a configuration continuously → State Manager association. Re-runs a Command document on a schedule; new instances picked up via tag.
  • Patch a fleet on a schedule → Patch Manager + Maintenance Window. Patch baseline + Patch Group tag + cron + MaxConcurrency + MaxErrors.
  • Run an emergency patch outside the window → Run Command with AWS-RunPatchBaseline. Or update the baseline auto-approval delay to 0 for the urgent CVE.
  • Replace an SSH bastion → Session Manager. Add S3/CloudWatch logs, KMS encryption, and an IAM session policy.
  • Access EC2 in a private subnet → Session Manager + SSM VPC endpoints. All three: ssm, ec2messages, ssmmessages.
  • Store DB credentials with rotation → Secrets Manager. Built-in Lambda rotation for RDS/Redshift/DocumentDB.
  • Store config values, some sensitive → Parameter Store SecureString. Cheaper than Secrets Manager when no rotation is needed.
  • Always launch the latest Amazon Linux AMI → public Parameter Store parameter /aws/service/ami-amazon-linux-latest/....
  • Auto-remediate a Config rule → Config remediation → SSM Automation runbook. Or EventBridge → SSM Automation. The AssumeRole must trust ssm.amazonaws.com.
  • Multi-step orchestration with branches → Automation runbook. aws:executeAwsApi, aws:branch, aws:approve for human gates.
  • Catalog installed software across the fleet → Inventory + Resource Data Sync. Sync to S3, query with Athena.
  • Triage operational alerts → OpsCenter OpsItems. Aggregates CloudWatch, Config, Health, and GuardDuty findings.
  • Patch on-prem servers → SSM hybrid activation + Patch Manager. The same Maintenance Window targets EC2 and on-prem together.
  • Halt a patch rollout if too many fail → Maintenance Window MaxErrors. Often 5% for production.
  • Restrict admin shell session length → session preferences maxSessionDuration. Plus idleSessionTimeout for inactivity.
  • Update SSM Agent across the fleet → Run Command with AWS-UpdateSSMAgent. Or bake a newer agent into a golden AMI via Image Builder.

Common Traps Recap — Systems Manager Automation, Patch Manager, and Run Command

Trap 1: Patches "approved" but not installed

A patch can be approved by the baseline yet remain uninstalled because the instance is not in scope of any Maintenance Window task with Operation=Install, or because the Maintenance Window halted on MaxErrors, or because the instance is in a different patch group than expected. Check the Maintenance Window history and per-instance compliance state.

Trap 2: Default 7-day auto-approval delay slows emergency patching

Override with explicit approval lists or a custom baseline with auto-approval delay = 0 for the emergency. Don't wait the 7 days.

Trap 3: SSM Agent silently fails without IAM or network

AmazonSSMManagedInstanceCore plus reachability to the three SSM endpoints. Missing either is a silent failure — Fleet Manager simply does not list the instance.

Trap 4: Session Manager logs unencrypted

Default session logs are unencrypted. For compliance environments, set kmsKeyId in session preferences with a customer-managed KMS key.

Trap 5: Automation runbook with empty AssumeRole

Without an AssumeRole, the runbook tries to use the invoker's identity, which fails when the invoker is the Config service. The runbook document must specify an AssumeRole, and that role's trust policy must include ssm.amazonaws.com.

Trap 6: Run Command output truncated to 2,500 characters

Always specify an S3 output bucket and/or CloudWatch Log Group at invocation time for any non-trivial command output.

Trap 7: Parameter Store SecureString without kms:Decrypt

Reading a SecureString needs both ssm:GetParameter and kms:Decrypt on the encryption key. Missing the latter produces an AccessDeniedException on decrypt that candidates often misdiagnose as a parameter access problem.

Trap 8: Maintenance Window task uses instance profile when it should use the maintenance-window service role

The Maintenance Window itself executes under a service role (AWS-SystemsManager-MaintenanceWindowRole or similar), separate from the instance profile on the target instance. Permissions for orchestrating the Run Command come from the service role; permissions for what the document does on the host come from the instance profile. Confusing the two causes obscure permission errors.

Trap 9: Patch Group tag key has a space

The literal tag key is Patch Group (with a space) — not PatchGroup (camelCase) and not patch_group. Misspelling the key causes the instance to fall back to the default baseline silently.

Trap 10: One instance, one patch group

An instance can belong to exactly one patch group at a time. Tagging an instance with Patch Group=foo and Patch Group=bar (the second tag overwrites the first) does not put it in two baselines — it picks the last-applied tag value.

FAQ — Systems Manager Automation, Patch Manager, and Run Command

Q1: Why does Run Command not list my EC2 instance?

Three prerequisites must all be satisfied: (a) the SSM Agent is running on the instance — check systemctl status amazon-ssm-agent; (b) the IAM instance profile has AmazonSSMManagedInstanceCore (or equivalent) attached; and (c) the instance has outbound network reach to ssm.<region>.amazonaws.com, ec2messages.<region>.amazonaws.com, and ssmmessages.<region>.amazonaws.com over HTTPS port 443 — via internet gateway, NAT gateway, or VPC interface endpoints. If the instance is in a fully private subnet, you must create all three VPC interface endpoints. The diagnostic order on the SOA-C02 exam is exactly: agent first, IAM second, network third.

Q2: What is the difference between Run Command and State Manager?

Run Command is one-shot: you click it, it runs once, it's done. State Manager is continuous-desired-state: a State Manager association re-runs a Command document on a schedule against a target set, automatically picking up new instances that match the target tag. Both use the same Command documents under the hood. The SOA-C02 cue is the verb: "execute now" or "apply once" → Run Command; "ensure", "maintain", or "continuously enforce" → State Manager. State Manager also captures per-instance compliance state for each association, which Run Command does not.

Q3: How do I roll out an emergency security patch within 24 hours when the default baseline has a 7-day auto-approval delay?

Three options. (a) Add the specific patch to the baseline's explicit approval list by KB number or CVE — this overrides the auto-approval delay for that patch only. (b) Author a new patch baseline with auto-approval delay = 0 for the emergency, associate it with the affected patch group, and revert later. (c) Use Run Command with AWS-RunPatchBaseline and BaselineOverride parameter pointing to a one-time emergency baseline. The default 7-day delay is not a hard constraint — it is a default to protect against bad patches, and it can be overridden when a known critical vulnerability outweighs the regression risk.
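Option (b) can be sketched as a create_patch_baseline input (boto3); the baseline name and filter values are illustrative assumptions:

```python
# Sketch of an emergency baseline with no auto-approval delay.
emergency_baseline = {
    "Name": "emergency-critical-cve",        # illustrative
    "OperatingSystem": "AMAZON_LINUX_2023",
    "ApprovalRules": {
        "PatchRules": [
            {
                "PatchFilterGroup": {
                    "PatchFilters": [
                        {"Key": "CLASSIFICATION", "Values": ["Security"]},
                        {"Key": "SEVERITY", "Values": ["Critical"]},
                    ]
                },
                "ApproveAfterDays": 0,       # no 7-day wait
                "ComplianceLevel": "CRITICAL",
            }
        ]
    },
}
# To execute: boto3.client("ssm").create_patch_baseline(**emergency_baseline)
```

Remember to associate the new baseline with the affected patch group and revert after the emergency.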

Q4: When should I choose Parameter Store SecureString vs Secrets Manager?

Use Secrets Manager when (a) the secret needs automatic rotation — Secrets Manager has built-in Lambda rotation for RDS, Redshift, DocumentDB credentials, plus custom rotation for any secret; (b) the secret value is large (Secrets Manager allows 64 KB vs SecureString's 8 KB Advanced or 4 KB Standard); or (c) you need fine-grained cross-account sharing via resource policies. Use Parameter Store SecureString when (a) the value is configuration data, possibly sensitive, that does not need rotation; (b) cost matters at high volume — Parameter Store Standard is free; (c) you want hierarchical organization (/myapp/prod/db/host). For the SOA-C02 exam, "rotation" or "RDS credentials" → Secrets Manager; "application config values, some sensitive" → Parameter Store SecureString.

Q5: Why does my SSM Automation runbook fail with "AssumeRole denied"?

The runbook's assumeRole parameter points to an IAM role, and that role's trust policy must include ssm.amazonaws.com as a trusted principal. Without it, the Systems Manager Automation service cannot assume the role, and every step that needs the role's permissions fails. Additionally, the role must have the permissions for every API call any step makes — aws:executeAwsApi calls need explicit IAM permissions for the called API. Common pattern: a remediation runbook that calls s3:PutBucketPolicy will fail at that step if the role does not include that permission, even if AssumeRole itself succeeded. Always check both the trust policy and the permissions when troubleshooting Automation runbook failures.

Q6: How do I patch on-premises servers alongside EC2 instances on the same schedule?

Register the on-prem servers via SSM Hybrid Activation: in the Systems Manager console, create an activation with an IAM service role, receive an activation code and ID, and run the agent on each on-prem host with amazon-ssm-agent -register -code <code> -id <id> -region <region>. The hosts now appear as managed instances with mi- prefix (vs i- for EC2). Tag the on-prem instances with the same Patch Group tag as your EC2 fleet. Configure a Maintenance Window with a target set matching the tag, and the same AWS-RunPatchBaseline task patches both EC2 and on-prem instances in the same window. This is the SOA-C02-canonical answer for unified patching across cloud and on-prem.

Q7: What's the difference between MaxConcurrency and MaxErrors in a Maintenance Window?

MaxConcurrency controls parallelism — how many target instances run the task at the same time. 10 means 10 instances at once; 10% means 10 percent of the target set. After the first batch finishes, the next batch starts. MaxErrors controls failure tolerance — after how many task failures the entire rollout halts. 5 means halt after 5 failures; 5% means halt after 5 percent of the target set fails. The two work together: at MaxConcurrency=10% and MaxErrors=5%, on a 200-instance fleet, the window patches in batches of 20 and halts permanently if 10 of those 200 fail. Production patching typically uses small concurrency (5–20%) and small error tolerance (1–5%) so a bad patch does not take down the whole fleet.
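The percentage arithmetic can be sketched in a few lines. The rounding shown (floor, minimum one instance) is an assumption about how SSM resolves percentage values, not a documented specification:

```python
# Resolve a MaxConcurrency/MaxErrors value ("10" or "10%") against a fleet size.
def resolve(total: int, value: str) -> int:
    if value.endswith("%"):
        return max(1, total * int(value[:-1]) // 100)  # assumed rounding: floor, min 1
    return int(value)

# The worked example from the text: a 200-instance fleet.
print(resolve(200, "10%"))  # batch size → 20 instances at a time
print(resolve(200, "5%"))   # halt threshold → stop after 10 failures
```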

Q8: Can I run an SSM Automation runbook from another account?

Yes, via Automation cross-account, cross-region execution. The runbook is invoked in the source account but executes API calls in target accounts via assumed roles. The setup: (a) in each target account, create an AWS-SystemsManager-AutomationExecutionRole trusted by the source account, with the permissions the runbook needs; (b) in the source account, create an AWS-SystemsManager-AutomationAdministrationRole that can sts:AssumeRole on the target-account roles; (c) invoke the runbook with the TargetLocations parameter listing the target accounts and Regions. The runbook engine assumes the target-account role for each location and executes there. SOA-C02 may ask "patch a multi-account fleet from a central security account" — multi-account, multi-Region Automation execution via TargetLocations is the answer.

Q9: How do Session Manager session logs work, and where do they go?

Sessions can be logged to CloudWatch Logs, S3, or both, configured in the session preferences (the SessionManagerRunShell SSM document). Every keystroke and command output is streamed to the configured destinations. KMS encryption is opt-in via kmsKeyId in the preferences. The instance profile on the managed instance needs s3:PutObject and logs:CreateLogStream / logs:PutLogEvents for the configured destinations, because the SSM Agent uploads the log data from the instance. Session logs are the primary audit trail for SSH-replacement use cases, and SOA-C02 expects you to enable them with KMS encryption for compliance scenarios. Without configured session preferences, sessions are not logged at all — only metadata (who, when, which instance) appears in CloudTrail.

Q10: What happens if a Maintenance Window task is still running when the window's duration ends?

Maintenance Windows have a Cutoff value (in hours) that specifies how long before the window's end to stop launching new tasks. Tasks already in progress are allowed to run to completion, but no new task invocations start after the cutoff. For example, with Duration=4 hours and Cutoff=1 hour, a window that starts at 2am stops launching new tasks at 5am and ends at 6am. This prevents tasks from starting near the end of the window and overrunning it. The SOA-C02 exam can ask "patches were partially applied to the fleet, with new instances not getting picked up after a certain time" — the answer is the cutoff, not a failure. Tune Cutoff to at least the longest expected per-instance task duration so the rollout completes cleanly within the window.
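The timing arithmetic from the example is simple enough to sketch:

```python
from datetime import datetime, timedelta

# Window timing: Duration and Cutoff in hours.
def window_times(start: datetime, duration_h: int, cutoff_h: int):
    """Return (deadline for launching new tasks, window end)."""
    return (start + timedelta(hours=duration_h - cutoff_h),
            start + timedelta(hours=duration_h))

# Duration=4, Cutoff=1, starting Sunday 02:00 (date is illustrative).
deadline, end = window_times(datetime(2024, 1, 7, 2, 0), 4, 1)
print(deadline.strftime("%H:%M"), end.strftime("%H:%M"))  # → 05:00 06:00
```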

Once Systems Manager is in place, the next operational layers are:

  • CloudFormation Stacks and StackSets — infrastructure-as-code provisioning that deploys the SSM resources themselves and complements Systems Manager for reproducible environments.
  • Scheduled Tasks and Config Auto-Remediation — the EventBridge cron rules and Config remediation configurations that drive Systems Manager Automation.
  • EventBridge Rules and SNS Notifications — routing alarms and Health events into the same SSM Automation pipeline.
  • IAM Policies, MFA, SCPs, and Access Troubleshooting — the IAM identities that every Systems Manager capability depends on (instance profiles, automation roles, maintenance window service roles).

Official Sources