Cost visibility and rightsizing is the operational discipline of seeing where every cloud dollar goes and then acting on the data. On SOA-C02 it is Task Statement 6.1 of Domain 6 (Cost and Performance Optimization), a domain that carries 12 percent of the exam weight. SAA-C03 tests cost optimization at the architectural level, for example choosing a purchasing model such as Reserved Instances vs Savings Plans. SOA-C02 explicitly tests running the cost tools day to day: filtering Cost Explorer to find the team that broke the monthly budget, configuring AWS Budgets with automated actions to stop a runaway dev account before the billing cycle closes, interpreting Compute Optimizer's ML risk levels to decide whether a rightsizing recommendation is safe to apply, and reading Trusted Advisor's cost optimization checks to clean up idle EC2 instances, underutilized EBS volumes, unassociated Elastic IPs, and idle RDS databases.
This guide walks through cost visibility and rightsizing from the SysOps angle: how cost allocation tags activate (and why they lag for up to a day), how Cost Explorer filtering finds cost spikes, how Budgets and Budget Actions automate remediation, how Compute Optimizer uses a 14-day lookback window (and a minimum of 30 consecutive hours of metrics) to produce EC2/ASG/Lambda/EBS recommendations with risk ratings, which Trusted Advisor checks need a Business or Enterprise support plan, how to assess workloads for Spot Instance suitability and handle the 2-minute interruption warning, when managed services beat self-managed for cost, and how the AWS Cost and Usage Report (CUR) flows into S3 and Athena for arbitrary granular analysis. You will also see the recurring SOA-C02 cost visibility scenario shapes: the monthly bill spiked 30 percent (find the culprit); Compute Optimizer says an EC2 instance is over-provisioned (decide whether the recommendation is safe to apply); a team is about to blow its budget (a Budget Action attaches a deny SCP).
Why Cost Visibility Sits at the Heart of SOA-C02 Domain 6
The official SOA-C02 Exam Guide v2.3 lists five skills under Task Statement 6.1: implement cost allocation tags; identify and remediate underutilized or unused resources using Trusted Advisor, AWS Compute Optimizer, and AWS Cost Explorer; configure AWS Budgets and billing alarms; assess resource usage patterns to qualify workloads for EC2 Spot Instances; and identify opportunities to use managed services such as Amazon RDS, AWS Fargate, and Amazon EFS. Each of those five skills maps directly to an operational tool in cost visibility and rightsizing — none of them are architectural design questions in the SAA sense. The exam wants to know whether you can run the cost tools and act on what they say.
At the SysOps tier the framing is operational, not architectural. SAA-C03 asks "which purchasing model would minimize cost for this steady-state workload?" SOA-C02 asks "the SysOps team noticed last month's bill jumped 30 percent — walk through the steps in Cost Explorer to find the cause and remediate." The answer is a sequence of operational actions: filter by service to find the spike, then by usage type to identify NAT gateway data processing, then add an S3 VPC gateway endpoint to stop paying for S3 traffic that did not need to traverse the NAT. Cost visibility and rightsizing is the topic where every other Domain 6 dollar saved gets measured — Compute Optimizer recommendations are only useful if you read them, Budget Actions only fire if you configure them, and Spot suitability assessments only matter if you act on the workload tags they produce.
- Cost allocation tag: a user-defined or AWS-generated tag that, once activated in the Billing console, becomes a column in Cost Explorer, Budgets, and the Cost and Usage Report. Without activation the tag exists in the resource but does not appear in cost data.
- AWS Cost Explorer: the interactive cost visibility console — group by service, account, tag, region, usage type; filter and forecast up to 12 months of historical data and 12 months forward.
- AWS Budgets: the alerting and action engine for cost, usage, Reservation utilization, Reservation coverage, Savings Plans utilization, and Savings Plans coverage. Sends notifications via SNS or email and optionally invokes Budget Actions.
- Budget Action: an automated remediation triggered when a budget threshold is breached — apply a deny SCP, attach a deny IAM policy to a user/group/role, or stop EC2/RDS instances.
- AWS Compute Optimizer: an ML-driven recommendation service that analyzes 14 days of CloudWatch metrics for EC2, EC2 Auto Scaling groups, Lambda, and EBS, and emits "current vs recommended" findings with risk levels.
- Trusted Advisor: a multi-category checks engine — Cost Optimization, Performance, Security, Fault Tolerance, Service Limits, Operational Excellence. Full check set requires a Business or Enterprise (or Enterprise On-Ramp) Support plan.
- Reserved Instance (RI): a 1- or 3-year commitment on a specific instance family/region/OS in exchange for up to ~72 percent off On-Demand pricing.
- Savings Plans (SP): a 1- or 3-year commitment on a dollars-per-hour spend in exchange for discount; Compute SP applies broadly, EC2 Instance SP is family-locked but cheaper.
- Spot Instance: spare EC2 capacity sold at up to ~90 percent off On-Demand, reclaimable by AWS with a 2-minute interruption warning.
- AWS Cost and Usage Report (CUR): the most granular billing data feed — per-resource, per-hour line items delivered to S3 in CSV or Parquet, queryable via Athena, Redshift Spectrum, or QuickSight.
- Reference: https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html
Plain-Language Explanation: Cost Visibility, Budgets, and Rightsizing
Cost visibility jargon is dense and the line between Cost Explorer, Budgets, Compute Optimizer, and Trusted Advisor blurs fast. Three analogies pin the constructs in place.
Analogy 1: The Household Electricity Bill Audit
Cost visibility on AWS is a household electricity bill audit. Cost Explorer is the utility company's online portal where you log in once a month and see "$140 for July, broken down by appliance category": kitchen 30 percent, HVAC 45 percent, water heater 15 percent, electronics 10 percent. Cost allocation tags are the labelled smart plugs you stuck on the dishwasher, dryer, and gaming PC so the utility can split out their usage instead of lumping everything into "kitchen". AWS Budgets is the monthly cap you set in the app: "if July spend looks like it will exceed $160, text me at the 80 percent and 100 percent marks". Budget Actions are the smart panel automatically shutting off the gaming PC's circuit when the cap is exceeded, which is actual physical remediation rather than just a notification. Compute Optimizer is the utility's energy advisor who looks at your last two weeks of meter data and says "your fridge is running hot; replacing it with an Energy Star model would save $12 a month at very low risk". Trusted Advisor is the monthly inspection checklist the auditor sends: "you have a porch light still on at 3am every night (idle EC2), an old freezer in the garage drawing 80W (underutilized EBS), a phantom load on a power strip plugged into nothing (unassociated Elastic IP)". Spot Instances are the cheaper off-peak kWh rate: you can run the dishwasher overnight for up to 90 percent off, but the utility may cut your power with only 2 minutes' notice when demand spikes (the interruption warning), so do not put anything critical on the off-peak rate.
Analogy 2: The Fitness Tracker Usage Stats
Compute Optimizer is a Fitbit for your EC2 fleet. After 14 days of wear, the tracker has enough data to say "your average heart rate during work hours is 72 bpm; that m5.2xlarge is sized for someone running marathons, so you should downgrade to m5.large". The risk level is the tracker's confidence rating: "low risk: we are very sure you can drop two sizes" vs "medium risk: we see occasional spikes that the smaller size might not handle, validate first" vs "high risk: do not change without testing". The 14-day lookback is the window of data the tracker reads: a brand-new tracker needs at least 30 consecutive hours of readings before it offers any advice, and a device with sparse data simply does not appear in the recommendation list. The recommendation comes with estimated monthly savings and estimated performance risk, just as the tracker shows estimated calorie burn alongside heart-rate data. A SOA-C02 candidate who treats the risk level as an ignorable knob will eventually apply a risky recommendation, get paged at 2am, and learn the hard way.
Analogy 3: The Restaurant Inventory Waste Audit
Trusted Advisor's cost optimization checks are the restaurant manager's weekly waste audit. Each Monday morning the manager walks the kitchen looking for: produce expiring unused (idle EC2: paying for the instance hours, gaining no value), oversized stock pots that get filled to 10 percent every shift (underutilized EBS volumes: provisioned at 1 TB but using 50 GB), orphan menu items printed on every flyer but never ordered (unassociated Elastic IPs: billing 0.005 USD per hour for nothing), and the freezer that is running but holds nothing for the upcoming week's menu (idle RDS database). The audit produces a checklist with estimated monthly savings and prioritization based on dollar impact. The catch is that the full audit only runs for restaurants on the Business or Enterprise franchise plan; the basic plan only sees a stripped-down weekly summary. That is the support-plan gating of Trusted Advisor: Basic and Developer Support get six core checks, while Business and Enterprise get the full set of 100+.
For SOA-C02, the fitness tracker analogy is most useful when a question mixes Compute Optimizer's risk levels with its data requirements. The exam loves to test "Compute Optimizer shows no recommendation for an instance; why?" The four standard answers are: (a) Compute Optimizer is not opted in for the account; (b) the instance has fewer than 30 consecutive hours of CloudWatch metric data, typically because it was launched very recently; (c) the instance type is not supported; (d) the instance is already classified as Optimized, so there is no change to recommend. Missing memory metrics are not on that list: without the CloudWatch agent, Compute Optimizer still produces recommendations, but they rely on CPU, network, and disk metrics only. Reference: https://docs.aws.amazon.com/compute-optimizer/latest/ug/what-is-compute-optimizer.html
Cost Allocation Tags: Activation, Lag, and Chargeback
Cost allocation tags are the foundation of every per-team or per-project cost view in AWS. Without activated tags, Cost Explorer cannot group spend by anything finer than service and account. Two flavors:
AWS-generated tags
The Billing console exposes a small set of AWS-generated tags that AWS populates on its own — most notably aws:createdBy (who created the resource) and a few service-specific tags. Activation is one click in the Billing console under Cost Allocation Tags → AWS-generated cost allocation tags.
User-defined tags
These are the tags you applied to resources (Project=billing-svc, CostCenter=marketing, Environment=prod). After applying the tag to resources, you must navigate to the Billing console → Cost Allocation Tags → User-defined cost allocation tags, find the tag key, and click Activate. Until activated, the tag is invisible to Cost Explorer, AWS Budgets, and the CUR.
The 24-hour activation lag (the most-tested gotcha)
After activation, the tag does not appear in Cost Explorer immediately: AWS Billing can take up to 24 hours to activate the tag and propagate it into the cost data. SOA-C02 routinely tests this scenario: "a SysOps engineer activated a CostCenter tag at 9am; at 10am the team lead complains the tag is not in Cost Explorer. What is the fix?" The exam-correct answer is "wait up to 24 hours, because activation is asynchronous", not "re-tag the resources" or "create a new tag".
A second nuance: by default, tag activation is forward-looking. The tag becomes a Cost Explorer dimension from the activation moment onward, and spend incurred before activation is not retroactively tagged. For chargeback to work cleanly, activate cost allocation tags as part of the account onboarding runbook, before resources are created.
Multi-account cost allocation tags
In an AWS Organizations setup, cost allocation tag activation happens in the management account's Billing console and applies to consolidated billing for every member account. Member accounts cannot independently activate cost allocation tags for the consolidated bill — the management account owns that decision. This matters for SOA-C02 questions about chargeback in a multi-account environment.
A persistent SOA-C02 trap: candidates expect tag activation to be instant, like applying a Kubernetes label. In reality AWS Billing can take up to 24 hours to propagate a newly activated tag into Cost Explorer, Budgets, and the CUR. Worse, activation is forward-looking by default: spend incurred before activation is not retroactively tagged. Combine this with the operational reality that some resources may already exist without the tag at all, and the proper SysOps onboarding runbook is: (1) define the tag taxonomy first, (2) activate the keys in Billing on day 0, (3) enforce tag presence with an SCP or AWS Config rule, (4) wait 24 hours before producing the first chargeback report. Reference: https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html
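The activation behaviour described above can be modelled in a few lines. This is a minimal sketch, not the Billing pipeline: LineItem and chargeback_by_tag are hypothetical names, the data is made up, and the model ignores the up-to-24-hour propagation delay. It only shows why spend from before activation, or from resources that never carried the key, lands in an unattributed bucket in a per-team chargeback view.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime

UNATTRIBUTED = "(no tag value)"


@dataclass
class LineItem:
    usage_start: datetime
    cost: float
    tags: dict = field(default_factory=dict)


def chargeback_by_tag(items, tag_key, activated_at):
    """Sum cost per value of one cost allocation tag key (simplified model).

    Spend before the key was activated, or on resources that never carried
    the key, is collected in a single unattributed bucket.
    """
    totals = defaultdict(float)
    for item in items:
        value = item.tags.get(tag_key)
        if value is None or item.usage_start < activated_at:
            totals[UNATTRIBUTED] += item.cost
        else:
            totals[value] += item.cost
    return dict(totals)


items = [
    LineItem(datetime(2024, 6, 1), 100.0, {"CostCenter": "marketing"}),
    LineItem(datetime(2024, 6, 20), 40.0, {"CostCenter": "marketing"}),
    LineItem(datetime(2024, 6, 20), 10.0),  # resource created without the tag
]
print(chargeback_by_tag(items, "CostCenter", activated_at=datetime(2024, 6, 15)))
```

With activation on June 15, only the June 20 marketing spend is attributed; the June 1 item and the untagged resource fall into the unattributed bucket, which is the gap the onboarding runbook is meant to close.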
AWS Cost Explorer: Filtering by Service, Tag, Account, and Usage Type
AWS Cost Explorer is the operational team's primary console for understanding where the money went. It supports both historical analysis (up to 12 months of past data) and forecasting (up to 12 months forward).
Group-by dimensions
Cost Explorer can group costs (or usage) by:
- Service — EC2-Instance, RDS, S3, CloudFront, NAT Gateway. The first cut for any cost-spike investigation.
- Linked Account — for Organizations, splits cost by member account.
- Tag — once a cost allocation tag is activated, group by Project, CostCenter, Environment.
- Usage Type — within a service, distinguishes line items (USE2-EBS:VolumeUsage.gp3 vs USE2-EBS:VolumeUsage.io2 vs USE2-DataTransfer-Out-Bytes). Critical for finding NAT gateway data processing or cross-AZ data transfer spikes.
- Region, Availability Zone, Operation, API Operation — finer cuts.
Filter dimensions
Same dimensions as group-by, plus purchase type (On-Demand, Reserved, Savings Plan, Spot, Credit) and resource tags. The filter and group-by combination is how a SysOps engineer drills from "the bill jumped 30 percent" to "the Project=video-encoding tag in us-west-2 saw NAT Gateway data processing spike 8x".
Daily, monthly, and hourly granularity
Daily and monthly granularity are available by default. Hourly granularity and resource-level granularity (each covering the last 14 days) are paid features that must be explicitly enabled in Cost Explorer preferences, and they are billed per 1,000 usage records. SOA-C02 may ask "the SysOps team needs hourly cost detail for a 24-hour incident review": the answer is enabling hourly granularity in Cost Explorer (and accepting the small charge), or querying a CUR that is configured for hourly granularity with resource IDs included.
Forecasting
Cost Explorer's forecast model uses historical patterns to project the end-of-month spend with an 80 percent confidence interval. This forecast is what AWS Budgets uses internally for "forecasted to exceed" alerts.
The canonical investigation flow
Every "monthly bill jumped" question follows this flow in Cost Explorer:
- Total cost by month — confirm there is a spike.
- Group by Service — identify which service grew.
- Group by Usage Type within that service — identify the line item (data processing, IOPS, instance hours).
- Filter to the spiking usage type, group by Linked Account or Tag — identify the team/project responsible.
- Group by Region and AZ — identify whether the spike is broad or localized.
- Remediation — VPC gateway endpoint for S3, instance type rightsizing, idle resource cleanup, etc.
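The first steps of that flow can be sketched as a repeated "which dimension grew the most?" query over exported cost rows. The record layout, the project field (standing in for an activated Project tag), and the figures are hypothetical; the console does this interactively, and the CUR with Athena does it at scale.

```python
from collections import defaultdict


def totals_by(records, month, key, **filters):
    """Sum cost for one month, grouped by `key`, after exact-match filters."""
    out = defaultdict(float)
    for r in records:
        if r["month"] == month and all(r[k] == v for k, v in filters.items()):
            out[r[key]] += r["cost"]
    return dict(out)


def biggest_increase(records, prev, curr, key, **filters):
    """Return (dimension value, delta) with the largest month-over-month growth."""
    before = totals_by(records, prev, key, **filters)
    after = totals_by(records, curr, key, **filters)
    deltas = {k: after.get(k, 0.0) - before.get(k, 0.0) for k in set(before) | set(after)}
    return max(deltas.items(), key=lambda kv: kv[1])


def drill_down(records, prev, curr):
    # Step 2: which service grew; step 3: which usage type inside it;
    # step 4: which project (tag value) inside that usage type.
    service, _ = biggest_increase(records, prev, curr, "service")
    usage_type, _ = biggest_increase(records, prev, curr, "usage_type", service=service)
    project, delta = biggest_increase(
        records, prev, curr, "project", service=service, usage_type=usage_type)
    return service, usage_type, project, delta


records = [  # hypothetical monthly rows
    {"month": "2024-05", "service": "EC2-Instance", "usage_type": "USW2-BoxUsage:m5.large", "project": "web", "cost": 1000.0},
    {"month": "2024-06", "service": "EC2-Instance", "usage_type": "USW2-BoxUsage:m5.large", "project": "web", "cost": 1050.0},
    {"month": "2024-05", "service": "NAT Gateway", "usage_type": "USW2-NatGateway-Bytes", "project": "video-encoding", "cost": 50.0},
    {"month": "2024-06", "service": "NAT Gateway", "usage_type": "USW2-NatGateway-Bytes", "project": "video-encoding", "cost": 400.0},
    {"month": "2024-05", "service": "NAT Gateway", "usage_type": "USW2-NatGateway-Hours", "project": "video-encoding", "cost": 33.0},
    {"month": "2024-06", "service": "NAT Gateway", "usage_type": "USW2-NatGateway-Hours", "project": "video-encoding", "cost": 33.0},
]
print(drill_down(records, "2024-05", "2024-06"))
```

On this made-up data the drill-down lands on NAT gateway data processing for the video-encoding project, which is exactly the case where an S3 gateway endpoint is the usual remediation.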
On SOA-C02, when a question describes "the monthly bill spiked 30 percent — what is the first step", the answer is almost always "open Cost Explorer, group by Service to identify the growing line, then group by Usage Type within that service". Wrong answers offer "open the Cost and Usage Report and run an Athena query" — which is correct in principle but slower than the console drill-down for this initial triage. CUR is for repeated, programmatic, fine-grained reporting; Cost Explorer is for ad-hoc investigation. Reference: https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html
Rightsizing Recommendations in Cost Explorer
Cost Explorer itself produces rightsizing recommendations for EC2 instances based on CloudWatch metrics. The page lives at Cost Explorer → Rightsizing Recommendations.
What it analyzes
Cost Explorer's rightsizing engine looks at the last 14 days of CPU utilization, network I/O, and disk I/O for each EC2 instance in the account (or organization). Instances that show consistently low utilization are flagged with two recommendation types:
- Terminate — the instance shows near-zero utilization across the 14-day window. Likely an idle dev box, a forgotten test instance, or a project that ended.
- Modify (downsize) — the instance has low but non-zero utilization. Cost Explorer suggests a smaller instance type within the same family (e.g., m5.4xlarge → m5.2xlarge) and shows the projected monthly savings.
Cross-instance-family vs same-family recommendations
By default Cost Explorer recommends within the same family. You can enable cross-instance-family rightsizing in the preferences to see recommendations like m5.2xlarge → m6i.xlarge (newer generation, possibly different family). Cross-family recommendations carry slightly higher risk because the underlying CPU architecture and benchmark may differ.
Cost Explorer rightsizing vs Compute Optimizer — the overlap
Cost Explorer's rightsizing is a simpler, baseline view. Compute Optimizer is the deeper, ML-driven engine that covers EC2, ASGs, Lambda, and EBS — not just standalone EC2 — and produces a more granular risk classification. SOA-C02 sometimes tests the boundary: a question asking for "ML-driven recommendations across EC2, Auto Scaling groups, Lambda, and EBS" maps to Compute Optimizer; a question asking "a quick rightsizing list for EC2 inside the Cost Explorer console" maps to Cost Explorer rightsizing. They are complementary; Compute Optimizer is the broader and more nuanced tool.
AWS Compute Optimizer: ML Recommendations for EC2, ASG, Lambda, and EBS
AWS Compute Optimizer is the SOA-C02 exam's headline rightsizing tool. It is opt-in per account or per organization and free for the recommendations themselves; the costs you may incur are for custom CloudWatch metrics (such as memory published by the CloudWatch agent) and for optional paid features such as enhanced infrastructure metrics. It covers four resource types.
EC2 instance recommendations
Compute Optimizer ingests 14 days of CloudWatch metrics (CPU, network I/O, disk I/O, and — if the CloudWatch agent is installed — memory) per EC2 instance and emits one of these findings:
- Under-provisioned — the instance is too small; running near saturation. Recommends upsizing.
- Over-provisioned — the instance is too large; significant headroom. Recommends downsizing with a target instance type and projected savings.
- Optimized — the instance is correctly sized within Compute Optimizer's tolerances; no action recommended.
- None — Compute Optimizer cannot produce a recommendation (fewer than 30 consecutive hours of metric data, or an unsupported instance type).
The 14-day baseline window
Compute Optimizer analyzes a 14-day lookback window of CloudWatch metrics by default, and it needs at least 30 consecutive hours of metric data from a resource before it produces a recommendation. A newly launched instance therefore does not appear in the recommendation list right away. When a question says "Compute Optimizer shows no recommendation for a freshly launched instance, what is the SysOps engineer's next step?", the answer is to wait for metric history to accumulate: at least 30 hours for a first recommendation, with the full 14-day window giving the most representative view of the workload's peaks.
Risk levels
Each recommendation carries a performance risk rating from Very Low to Very High. The rating reflects how confident Compute Optimizer is that the recommended instance type can handle the workload's observed peaks. Very Low and Low risk recommendations can usually be applied with confidence; Medium risk needs a non-production validation; High and Very High risk recommendations should be applied only after explicit load testing.
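The risk guidance above translates naturally into a triage step for a batch of recommendations. This is a minimal sketch: the Risk enum and rollout_plan are hypothetical names, encoding the five levels as 0 through 4 is an assumption of this sketch, and the bucket boundaries simply follow the guidance in this section.

```python
from enum import IntEnum


class Risk(IntEnum):
    VERY_LOW = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


def rollout_plan(recommendations):
    """Bucket (instance_id, risk) pairs by how cautiously to apply them."""
    plan = {"apply": [], "validate_in_nonprod": [], "load_test_first": []}
    for instance_id, risk in recommendations:
        if risk <= Risk.LOW:
            plan["apply"].append(instance_id)  # Very Low / Low: apply with confidence
        elif risk == Risk.MEDIUM:
            plan["validate_in_nonprod"].append(instance_id)
        else:
            plan["load_test_first"].append(instance_id)  # High / Very High
    return plan


print(rollout_plan([("i-a", Risk.LOW), ("i-b", Risk.MEDIUM), ("i-c", Risk.VERY_HIGH)]))
```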
Memory metrics — the CloudWatch agent dependency
Without memory metrics, Compute Optimizer's recommendations are based on CPU, network, and disk I/O only. Memory utilization comes only from the CloudWatch agent publishing into the CWAgent namespace. SysOps teams that want high-confidence recommendations should ensure the agent is installed across the fleet — otherwise Compute Optimizer may recommend a smaller instance that turns out to be memory-bound.
EC2 Auto Scaling group recommendations
For ASGs, Compute Optimizer evaluates the launch template / launch configuration instance type and emits a recommended instance type given the aggregate load profile across the group. The recommendation respects scaling behavior: an ASG that scales out aggressively under load may not need a larger per-instance type if the scaling policy is doing its job.
AWS Lambda function recommendations
Compute Optimizer analyzes the function's memory configuration (which also dictates CPU allocation in Lambda) and its execution duration history, then emits recommendations such as "current 1024 MB → recommended 768 MB" with projected monthly savings. Recommendations can also go the other way, which is often counter-intuitive: Lambda bills duration in GB-seconds, so giving a function more memory (and therefore more CPU) can cut execution time enough that the total cost per invocation drops, even though the price per millisecond rises.
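The GB-second arithmetic behind that counter-intuitive result is easy to check. This sketch assumes a duration price of 0.0000166667 USD per GB-second (verify current regional pricing) and hypothetical durations of 2,000 ms at 512 MB versus 800 ms at 1,024 MB; the per-request fee is identical for both settings and is left out.

```python
# Assumed duration price in USD per GB-second; check current pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def gb_seconds(memory_mb: int, duration_ms: float) -> float:
    """Billed compute for one invocation: memory in GB times duration in seconds."""
    return (memory_mb / 1024) * (duration_ms / 1000)


def duration_cost(memory_mb: int, duration_ms: float) -> float:
    return gb_seconds(memory_mb, duration_ms) * PRICE_PER_GB_SECOND


small = duration_cost(512, 2000)   # hypothetical: 512 MB runs for 2.0 s
large = duration_cost(1024, 800)   # hypothetical: 1024 MB runs for 0.8 s
print(f"512 MB: ${small:.10f} per invocation, 1024 MB: ${large:.10f} per invocation")
```

The larger setting wins here only because the assumed duration drops by more than half; doubling memory doubles the per-millisecond price, so if the duration improvement is smaller than that, the smaller setting stays cheaper.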
Amazon EBS volume recommendations
For EBS, Compute Optimizer looks at IOPS, throughput, and burst-balance trends and recommends:
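- Volume type changes — gp2 → gp3 for cost reduction. gp3 includes a 3,000 IOPS and 125 MiB/s baseline regardless of size at a lower per-GiB price than gp2; large gp2 volumes whose 3 IOPS per GiB baseline exceeds 3,000 IOPS (or high-throughput gp2 volumes) need extra gp3 IOPS or throughput provisioned to match.
- Size and IOPS adjustments — provisioned IOPS that are never used can be reduced.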
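A rough monthly comparison shows why gp2 → gp3 usually saves money even when extra IOPS must be provisioned. The prices are assumed us-east-1 list prices (0.10 vs 0.08 USD per GB-month, 0.005 USD per provisioned IOPS-month above the 3,000 baseline) and should be checked against current pricing; throughput is ignored for simplicity.

```python
# Assumed us-east-1 list prices in USD; verify against current EBS pricing.
GP2_PER_GB_MONTH = 0.10
GP3_PER_GB_MONTH = 0.08
GP3_PER_EXTRA_IOPS_MONTH = 0.005
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gib: int) -> int:
    # gp2 delivers 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000.
    return min(max(3 * size_gib, 100), 16000)


def monthly_costs(size_gib: int) -> tuple[float, float]:
    """Return (gp2 cost, gp3 cost), with gp3 provisioned to match gp2's baseline IOPS."""
    gp2 = size_gib * GP2_PER_GB_MONTH
    extra_iops = max(0, gp2_baseline_iops(size_gib) - GP3_BASELINE_IOPS)
    gp3 = size_gib * GP3_PER_GB_MONTH + extra_iops * GP3_PER_EXTRA_IOPS_MONTH
    return gp2, gp3


for size in (500, 2000):
    gp2, gp3 = monthly_costs(size)
    print(f"{size} GiB: gp2 ${gp2:.2f}/month, gp3 ${gp3:.2f}/month")
```

For a 500 GiB volume this gives 50.00 vs 40.00 USD; for 2,000 GiB, gp2's 6,000 IOPS baseline means 3,000 extra gp3 IOPS, giving 200.00 vs 175.00 USD.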
Enabling Compute Optimizer
Single-account: navigate to the Compute Optimizer console and opt in. Organization-wide: enable in the management account or designate a delegated administrator account; trusted access then propagates the recommendations across all member accounts. Recommendations begin appearing 12–48 hours after opt-in for resources with at least 30 consecutive hours of metrics.
- Data requirements: a 14-day lookback window by default; at least 30 consecutive hours of CloudWatch metrics before a resource gets a recommendation.
- Resource types covered: EC2 instances, EC2 Auto Scaling groups, AWS Lambda functions, Amazon EBS volumes.
- Risk levels: Very Low, Low, Medium, High, Very High — applied directly to each recommendation.
- Findings: Under-provisioned, Over-provisioned, Optimized, None.
- Memory metric dependency: memory-aware recommendations require the CloudWatch agent publishing memory into the CWAgent namespace.
- Cost: free for the recommendations themselves; you pay for custom CloudWatch metrics (such as agent-published memory) and optional paid features.
- Activation latency: 12–48 hours after opt-in for first recommendations.
- Reference: https://docs.aws.amazon.com/compute-optimizer/latest/ug/what-is-compute-optimizer.html
AWS Budgets: Cost, Usage, and Reservation Utilization Budgets
AWS Budgets is the alerting and action engine for monthly, quarterly, or annual spending against a defined target. Budget creation is free up to 2 budgets per account; additional budgets cost a small per-budget-per-day fee.
Budget types
- Cost budget — a dollar threshold per month/quarter/year. The most common SysOps budget.
- Usage budget — a usage threshold per service (e.g., 1 TB of S3 storage, 100,000 Lambda invocations).
- Savings Plans utilization budget — alerts when SP utilization drops below a target (you committed to spend $X/hour but you are only using 80 percent — you are wasting commitment).
- Savings Plans coverage budget — alerts when SP coverage drops below a target percentage of total compute spend.
- RI utilization budget — alerts when Reserved Instance utilization drops below a target (you have RIs but the matching instances are not running).
- RI coverage budget — alerts when RI coverage of eligible spend drops below a target.
Threshold types
- Actual — the alert fires when actual spend reaches the threshold (e.g., 80 percent or 100 percent of budgeted amount).
- Forecasted — the alert fires when Cost Explorer's forecasting model projects end-of-period spend will exceed the threshold. Lets you act before the breach.
Notification targets
- Email addresses (up to 10).
- SNS topic ARN — for chat integrations, ticketing systems, automated remediation pipelines.
- AWS Chatbot for Slack/Microsoft Teams.
Common SysOps budget patterns
- Per-team monthly cost budget filtered by the CostCenter tag — alert at 80 percent forecasted, 100 percent actual.
- Account-level monthly cost budget in the management account — guardrail for the whole organization.
- Service-specific usage budget — e.g., "alert if NAT Gateway data processing exceeds 1 TB this month" (catches the canonical NAT gateway leak).
- SP utilization budget at 95 percent — paged when commitment waste exceeds 5 percent.
By the time a budget is at 100 percent actual, the month is already overspent. SOA-C02 favors the 80 percent forecasted threshold so the SysOps team is paged with enough lead time to act before the cycle ends. Many real-world budgets layer multiple alerts: 50 percent actual (informational), 80 percent forecasted (warning, page the team lead), 100 percent actual (critical, invoke a Budget Action). Reference: https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-managing-costs.html
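The layered-alert pattern above can be sketched as follows. The straight-line forecast is a deliberately naive stand-in for Cost Explorer's forecasting model, and Alert, fired_alerts, and the thresholds are illustrative names and values, not the Budgets API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    percent: float
    basis: str      # "actual" or "forecasted"
    severity: str


# Layered thresholds from the text: 50% actual, 80% forecasted, 100% actual.
DEFAULT_ALERTS = (
    Alert(50, "actual", "info"),
    Alert(80, "forecasted", "warning"),
    Alert(100, "actual", "critical"),
)


def naive_forecast(month_to_date: float, day: int, days_in_month: int) -> float:
    """Straight-line projection of end-of-month spend (stand-in for the real model)."""
    return month_to_date / day * days_in_month


def fired_alerts(budget, month_to_date, day, days_in_month, alerts=DEFAULT_ALERTS):
    """Return the severities whose threshold has been reached."""
    forecast = naive_forecast(month_to_date, day, days_in_month)
    fired = []
    for alert in alerts:
        spend = month_to_date if alert.basis == "actual" else forecast
        if spend >= budget * alert.percent / 100:
            fired.append(alert.severity)
    return fired


print(fired_alerts(budget=1000, month_to_date=300, day=10, days_in_month=30))
```

On day 10 of a 30-day month with 300 USD spent against a 1,000 USD budget, only the forecasted warning fires: the projection is 900 USD while actual spend is still below the 50 percent mark.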
Budget Actions: Automated Remediation on Threshold Breach
AWS Budget Actions turn a passive alert into an automated remediation. When a budget threshold is breached, the Budget Action takes one of three concrete steps.
Three Budget Action types
- Apply a deny SCP — at the AWS Organizations level, attach a service control policy that denies a specified set of actions (e.g., ec2:RunInstances, rds:CreateDBInstance) to the affected account or OU. This is the organizational guardrail for cost overruns.
- Apply an IAM deny policy — attach a managed deny policy to a specified IAM user, group, or role. Stops that principal from making expensive API calls until the policy is removed.
- Stop EC2 or RDS instances — the most direct remediation. The action targets specific instances (by ID) or all matching a tag filter and stops them on threshold breach.
Approval modes
- Automatic — the action runs without human approval as soon as the threshold is breached. Used for non-production budgets where a self-imposed cap is desired.
- Manual approval required — Budgets sends a notification with a link; a designated approver must click to execute. Used for production where automatic shutdown is too risky.
IAM role requirement
Budget Actions need an IAM role that Budgets assumes to perform the action. The role's trust policy must allow budgets.amazonaws.com to assume it, and its permissions policy must grant the relevant actions: organizations:AttachPolicy to attach an SCP; iam:AttachUserPolicy, iam:AttachGroupPolicy, or iam:AttachRolePolicy to attach an IAM policy to a user, group, or role; ec2:StopInstances and rds:StopDBInstance for instance stops. A missing or mis-scoped role makes the action fail quietly: the failure surfaces as an "execution failed" status in the Budgets console rather than as a prominent alert.
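Because a mis-scoped role is the most common failure, it can help to lint the role's policies before relying on the action. This is a minimal sketch: it checks only the trust principal and the actions named in this section, and it ignores wildcards, Deny statements, NotAction, and conditions, so it can report false gaps for policies that use action wildcards. The keys in REQUIRED_ACTIONS are local labels, not Budgets API values.

```python
import json

TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "budgets.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions named in this section, keyed by local (hypothetical) labels.
REQUIRED_ACTIONS = {
    "scp": {"organizations:AttachPolicy"},
    "iam_user": {"iam:AttachUserPolicy"},
    "iam_group": {"iam:AttachGroupPolicy"},
    "iam_role": {"iam:AttachRolePolicy"},
    "stop_instances": {"ec2:StopInstances", "rds:StopDBInstance"},
}


def _as_list(value):
    """IAM accepts either a single string or a list in most policy fields."""
    return value if isinstance(value, list) else [value]


def trusts_budgets(trust_policy):
    """True if some Allow statement lets budgets.amazonaws.com call sts:AssumeRole."""
    for stmt in _as_list(trust_policy.get("Statement", [])):
        principal = stmt.get("Principal", {})
        services = _as_list(principal.get("Service", [])) if isinstance(principal, dict) else []
        actions = _as_list(stmt.get("Action", []))
        if (stmt.get("Effect") == "Allow" and "budgets.amazonaws.com" in services
                and "sts:AssumeRole" in actions):
            return True
    return False


def missing_permissions(permissions_policy, action_type):
    """Required actions that no Allow statement lists explicitly."""
    granted = set()
    for stmt in _as_list(permissions_policy.get("Statement", [])):
        if stmt.get("Effect") == "Allow":
            granted.update(_as_list(stmt.get("Action", [])))
    return REQUIRED_ACTIONS[action_type] - granted


if __name__ == "__main__":
    print(json.dumps(TRUST_POLICY, indent=2))
```

A permissions policy that grants only ec2:StopInstances would be reported as missing rds:StopDBInstance for the stop_instances case.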
Reset behavior
Budget Actions are one-shot per budget threshold breach. Once the budget period ends and resets, the action is dismissed. The applied SCP, IAM policy, or stopped instance state remains in place until manually reverted — this is intentional, so the remediation persists across cycle boundaries unless someone explicitly removes it.
The single most common Budget Action failure on SOA-C02 questions: the action is configured but never actually runs because the IAM role is missing the required permissions. The exam tests this exact pattern — "the budget breached at 100 percent, the team configured a Budget Action to apply a deny SCP, but the SCP was never attached, what is the fix?" The answer is to verify the IAM role's trust policy allows budgets.amazonaws.com and its permissions policy includes organizations:AttachPolicy (or whatever action class is needed). Reference: https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-controls.html
Trusted Advisor Cost Optimization Checks
AWS Trusted Advisor is the multi-category checks engine. The Cost Optimization category is the highest-leverage SysOps tool for cleanup work.
The headline cost checks
- Low Utilization Amazon EC2 Instances — instances with low daily CPU and network I/O over the last 14 days. Candidates for termination or downsize.
- Idle Load Balancers — load balancers with no associated targets or with very low request counts. Candidates for deletion.
- Underutilized Amazon EBS Volumes — EBS volumes with low IOPS over the last 7 days. Candidates for snapshot-and-delete or downsize.
- Unassociated Elastic IP Addresses — Elastic IPs not attached to any running instance. Each unassociated EIP costs ~0.005 USD per hour, which compounds across hundreds of EIPs.
- Amazon RDS Idle DB Instances — RDS instances that have had no connections over the last 7 days, typically non-production databases that were provisioned and forgotten. Candidates for snapshot-and-delete.
- Underutilized Redshift Clusters — clusters with low query volume.
- Savings Plans / Reserved Instance recommendations — cross-references with billing data to suggest commitment purchases.
The support-plan gating
Trusted Advisor's full check set requires a Business, Enterprise, or Enterprise On-Ramp Support plan. The Basic (free) and Developer Support tiers see only six core checks (mostly Security and Service Limits), not the full Cost Optimization category. SOA-C02 tests this directly: "the SysOps team wants to enable the full Trusted Advisor cost checks. What is required?" The answer is "upgrade to at least Business Support".
Trusted Advisor and EventBridge integration
Trusted Advisor publishes check status changes to EventBridge as Trusted Advisor Check Item Refresh Notification events. SysOps teams can build EventBridge rules that match on category=cost_optimizing and route to SNS or Lambda for ticketing automation. This is the operational pattern for "the team wants to be paged the moment a new idle EC2 instance shows up in Trusted Advisor".
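A sketch of such a rule's event pattern, plus a simplified matcher for testing it locally. The field names (source, detail-type, and the check-name and status keys in detail) follow the Trusted Advisor event format as we understand it; verify them against a real event before deploying. This version filters on the cost check names listed earlier rather than on a category field, and the matcher supports only exact values, a small subset of EventBridge pattern syntax.

```python
# Assumed event pattern; confirm field names against a real Trusted Advisor event.
PATTERN = {
    "source": ["aws.trustedadvisor"],
    "detail-type": ["Trusted Advisor Check Item Refresh Notification"],
    "detail": {
        "check-name": [
            "Low Utilization Amazon EC2 Instances",
            "Idle Load Balancers",
            "Underutilized Amazon EBS Volumes",
            "Unassociated Elastic IP Addresses",
            "Amazon RDS Idle DB Instances",
        ],
        "status": ["WARN"],
    },
}


def matches(pattern, event):
    """Simplified matcher: each pattern key must exist in the event, nested
    dicts are matched recursively, and leaf values must equal a listed option.
    Real EventBridge also supports prefix, numeric, anything-but, exists, etc."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


sample = {
    "source": "aws.trustedadvisor",
    "detail-type": "Trusted Advisor Check Item Refresh Notification",
    "detail": {"check-name": "Idle Load Balancers", "status": "WARN"},
}
print(matches(PATTERN, sample))
```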
Refresh cadence
Most Trusted Advisor checks refresh every 24 hours automatically. Manual refresh is available in the console. For programmatic access, the AWS Support API exposes Trusted Advisor checks, which also requires Business Support or higher.
A persistent SOA-C02 trap: the question describes a SysOps engineer trying to view "Idle Load Balancers" or "Low Utilization Amazon EC2 Instances" in Trusted Advisor and finding only six checks visible. The trap distractor is "the checks are still running, wait for refresh"; the real answer is the support-plan gating — Basic/Developer tiers only see the six core checks. Upgrade to Business or Enterprise to unlock the Cost Optimization category. Reference: https://docs.aws.amazon.com/awssupport/latest/user/trusted-advisor.html
Spot Instance Suitability Assessment and Interruption Handling
EC2 Spot Instances offer up to ~90 percent off On-Demand pricing in exchange for interruption risk — AWS can reclaim Spot capacity with a 2-minute warning when On-Demand demand rises. SOA-C02 Task Statement 6.1 explicitly tests assessing workloads for Spot suitability.
Workload characteristics that suit Spot
- Stateless — losing the instance does not lose any client session, request in flight, or unsaved data.
- Fault-tolerant — the workload tolerates instance loss, typically because work is checkpointed or queued (SQS, Kinesis, S3-backed batch).
- Flexible across instance types and AZs — the workload runs equally well on m5, m5a, m6i, c5, etc., so the Spot allocator has many capacity pools to draw from.
- Time-elastic — the workload tolerates delay; if no Spot capacity is available right now, the work can wait.
Classic Spot-suitable workloads: batch processing, big data ETL, CI/CD build workers, render farms, web crawlers, fault-tolerant containerized microservices behind a queue, ASG fleets with mixed-instances policies.
Workload characteristics that do NOT suit Spot
- Stateful — long-running databases, single-instance application servers with local state.
- Hard-deadline — workloads with strict SLAs that cannot accept the 2-minute interruption.
- Single-AZ pinned — workloads that cannot run cross-AZ have a smaller pool to draw from.
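The assessment above can be written down as a checklist function. This is a minimal sketch: the `Workload` record, its field names, and the "at least two instance types / two AZs" threshold are hypothetical illustrations of the criteria, not an AWS tool or API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical self-assessment record; field names are illustrative."""
    name: str
    stateless: bool
    fault_tolerant: bool
    instance_types: list[str]
    availability_zones: list[str]
    time_elastic: bool
    hard_deadline: bool = False


def spot_blockers(w: Workload) -> list[str]:
    """Return the reasons a workload fails the Spot criteria (empty = suitable)."""
    reasons = []
    if not w.stateless:
        reasons.append("stateful")
    if not w.fault_tolerant:
        reasons.append("not fault-tolerant")
    # Assumed threshold: fewer than two types or AZs means few capacity pools.
    if len(w.instance_types) < 2:
        reasons.append("pinned to one instance type")
    if len(w.availability_zones) < 2:
        reasons.append("pinned to one AZ")
    if not w.time_elastic or w.hard_deadline:
        reasons.append("cannot tolerate delay")
    return reasons


ci = Workload("ci-builders", True, True,
              ["m5.large", "m5a.large", "m6i.large", "c5.large"],
              ["us-east-1a", "us-east-1b"], True)
db = Workload("orders-db", False, False, ["r5.2xlarge"], ["us-east-1a"],
              False, hard_deadline=True)
print(spot_blockers(ci))  # []
print(spot_blockers(db))
```

A non-empty result is the list of reasons to keep that tier on On-Demand, Savings Plans, or RIs.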
The 2-minute interruption warning
When AWS decides to reclaim a Spot Instance, it issues a 2-minute warning delivered through two channels:
- EC2 Instance Metadata Service (IMDS) — the metadata path http://169.254.169.254/latest/meta-data/spot/instance-action returns a JSON document with the planned termination time.
- EventBridge EC2 Spot Instance Interruption Warning event — published to the default event bus, can be routed to Lambda, SNS, or Step Functions for graceful shutdown handling.
Workloads should poll IMDS or subscribe to the EventBridge event and use the 2 minutes to: drain in-flight requests from the load balancer, checkpoint state to S3 or DynamoDB, deregister from any service registry, and exit cleanly.
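A minimal, standard-library sketch of the IMDS side follows. It assumes IMDSv2, which means a session token is fetched with a PUT before the metadata read, and a `{"action": ..., "time": ...}` document shape. IMDS answers 404 while no interruption is scheduled. `watch` and its callback are hypothetical names, and they only work on an EC2 instance. The parser is exercised offline with a sample document.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def imds_token(ttl_seconds: int = 60) -> str:
    """Get an IMDSv2 session token via PUT with a TTL header."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def fetch_instance_action(token: str) -> Optional[str]:
    """Return the instance-action JSON, or None on 404 (no interruption yet)."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def seconds_left(body: str, now: datetime) -> float:
    """Seconds between `now` and the document's UTC "time" field."""
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (deadline.replace(tzinfo=timezone.utc) - now).total_seconds()


def watch(on_interrupt: Callable[[float], None], interval: float = 5.0) -> None:
    """Poll until a notice appears, then pass the remaining seconds to the
    caller's drain/checkpoint callback. Only works on an EC2 instance."""
    while True:
        body = fetch_instance_action(imds_token())
        if body is not None:
            on_interrupt(seconds_left(body, datetime.now(timezone.utc)))
            return
        time.sleep(interval)


# Offline check of the parser with a sample document.
SAMPLE = '{"action": "terminate", "time": "2024-05-01T12:02:00Z"}'
NOW = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_left(SAMPLE, NOW))  # 120.0
```

The callback is where the drain, checkpoint, and deregister steps belong. Passing it the remaining seconds lets it skip slow steps when the notice was picked up late.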
Spot Fleet vs EC2 Auto Scaling group with mixed-instances policy
Two SOA-C02-relevant ways to consume Spot at scale:
- EC2 Auto Scaling group with mixed-instances policy — the modern, recommended pattern. Specify a list of compatible instance types and a target Spot percentage; ASG handles allocation, capacity rebalancing, and Spot interruption replacement.
- EC2 Spot Fleet — older, more configurable fleet manager. Still supported but generally superseded by ASG mixed-instances policy for SOA-C02 scenarios.
Capacity Rebalancing
An ASG with Capacity Rebalancing enabled reacts to the EC2 instance rebalance recommendation. This signal marks a Spot Instance as being at elevated interruption risk and can arrive before the 2-minute warning. The ASG then proactively launches replacement Spot capacity in lower-risk pools, which reduces the chance that the 2-minute warning hits multiple instances simultaneously.
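As a sketch, here are the keyword arguments for the Auto Scaling `CreateAutoScalingGroup` call (boto3 `create_auto_scaling_group`) with a mixed-instances policy and Capacity Rebalancing turned on. The group name, launch template name, subnet IDs, and sizes are hypothetical. `OnDemandPercentageAboveBaseCapacity=20` expresses the Spot target as "80 percent Spot above the On-Demand base". The request is only built here, not sent.

```python
def mixed_instances_asg_request(
    name: str,
    launch_template: str,
    instance_types: list[str],
    subnet_ids: list[str],
    on_demand_base: int = 1,
    on_demand_pct_above_base: int = 20,
) -> dict:
    """Build kwargs for CreateAutoScalingGroup (sizes are illustrative)."""
    return {
        "AutoScalingGroupName": name,
        "MinSize": 0,
        "MaxSize": 20,
        "DesiredCapacity": 4,
        # One subnet per AZ: more AZs means more Spot capacity pools.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Replace at-risk Spot instances proactively on rebalance signals.
        "CapacityRebalance": True,
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template,
                    "Version": "$Latest",
                },
                # Each override adds another pool the allocator can draw from.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": on_demand_base,
                "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
                "SpotAllocationStrategy": "price-capacity-optimized",
            },
        },
    }


req = mixed_instances_asg_request(
    "batch-workers", "batch-workers-lt",
    ["m5.large", "m5a.large", "m6i.large", "c5.large"],
    ["subnet-aaa", "subnet-bbb"],
)
# With credentials: boto3.client("autoscaling").create_auto_scaling_group(**req)
```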
- Discount: up to ~90 percent off On-Demand pricing.
- Interruption warning: 2 minutes before reclamation.
- Interruption signal channels: IMDS (/latest/meta-data/spot/instance-action) and EventBridge (EC2 Spot Instance Interruption Warning).
- Suitable workloads: stateless, fault-tolerant, flexible across instance types and AZs, time-elastic.
- Recommended consumption pattern: ASG with mixed-instances policy + Capacity Rebalancing.
- Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html
Savings Plans and Reserved Instances — What SOA-C02 Tests
SOA-C02 does not require the deep math of Savings Plans vs Reserved Instances that SAA-C03 sometimes tests. The exam wants you to understand the commitment models well enough to act on Compute Optimizer and Cost Explorer recommendations.
Reserved Instances (RI)
A 1- or 3-year commitment to a specific instance family, region, OS, and tenancy in exchange for up to ~72 percent off On-Demand. Two flavors: Standard RI (highest discount, locked to the chosen attributes) and Convertible RI (lower discount, can be exchanged for different families during the term). Payment options: All Upfront, Partial Upfront, No Upfront.
Savings Plans (SP)
A 1- or 3-year commitment to a dollar amount of compute spend per hour in exchange for discount. Two flavors:
- Compute Savings Plans — applies broadly to EC2 (any family, any region, any OS), Fargate, and Lambda. Maximum flexibility, ~66 percent maximum discount.
- EC2 Instance Savings Plans — locked to a specific instance family in a specific region; ~72 percent maximum discount, comparable to Standard RI but commitment is dollar-based instead of instance-based.
When SOA-C02 cares about SPs and RIs
The exam tests these in three operational shapes:
- SP utilization budget — page the team when SP commitment is being wasted (utilization < 95 percent).
- SP coverage budget — page when SP coverage of eligible compute spend drops below target.
- Recommendations from Cost Explorer — Cost Explorer surfaces SP/RI purchase recommendations based on usage history; the SysOps engineer's job is to read and act, not derive the math from scratch.
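The two metrics these budgets watch reduce to simple ratios. AWS computes them for you in Cost Explorer and in the SAVINGS_PLANS_UTILIZATION and SAVINGS_PLANS_COVERAGE budget types. The sketch below only shows the arithmetic, using hypothetical hourly figures expressed in Savings Plans dollars.

```python
UTILIZATION_TARGET_PCT = 95.0  # alert threshold used in this guide


def sp_utilization_pct(commitment_per_hour: float,
                       eligible_usage_per_hour: list[float]) -> float:
    """Share of the hourly commitment actually consumed. Usage above the
    commitment in an hour is billed On-Demand, so it is capped per hour."""
    committed = commitment_per_hour * len(eligible_usage_per_hour)
    used = sum(min(u, commitment_per_hour) for u in eligible_usage_per_hour)
    return 100 * used / committed


def sp_coverage_pct(covered_spend: float, on_demand_spend: float) -> float:
    """Share of eligible compute spend that the Savings Plan covered."""
    return 100 * covered_spend / (covered_spend + on_demand_spend)


util = sp_utilization_pct(10.0, [10.0, 10.0, 9.0, 7.0])
print(util, util < UTILIZATION_TARGET_PCT)  # 90.0 True
print(sp_coverage_pct(720.0, 280.0))  # 72.0
```

Low utilization means paying for commitment nobody uses. Low coverage means eligible spend is still running at On-Demand rates. The two budgets therefore catch opposite problems.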
The exam's stance
SOA-C02 explicitly does not test "calculate which purchase model is cheapest given these usage hours" — that is SAA territory. SOA-C02 tests "given Cost Explorer's recommendation, which budget type would alert when SP utilization drops?" The answer is an SP utilization budget.
- Reserved Instance (RI): 1/3-year commit on specific instance family + region + OS. Up to ~72 percent off. Standard or Convertible.
- Savings Plan (SP): 1/3-year commit on dollar/hour compute spend. Compute SP (broad, ~66 percent) or EC2 Instance SP (family-locked, ~72 percent).
- Spot Instance: spare capacity, up to ~90 percent off, 2-minute reclamation warning, reclaimable any time.
- Decision shortcut for SysOps: steady-state predictable load → SP or RI; spiky bursty load on top of baseline → Spot via ASG mixed-instances.
- Reference: https://docs.aws.amazon.com/savingsplans/latest/userguide/what-is-savings-plans.html
Managed Services for Cost Reduction: Fargate vs EC2, RDS vs Self-Managed
SOA-C02 Task Statement 6.1 explicitly mentions "identify opportunities to use managed services (e.g., Amazon RDS, AWS Fargate, Amazon EFS)". The operational question is total cost of ownership (TCO), not sticker price alone.
When managed services win
- AWS Fargate vs self-managed EC2 for containers — Fargate eliminates capacity planning, AMI patching, and host-level scaling. Fargate per-task pricing is higher per CPU-hour than EC2, but the operational savings (no patch cycle, no scaling tuning, no AMI lifecycle) often dominate at small to mid scale.
- Amazon RDS vs self-managed databases on EC2 — RDS handles backups, patching, Multi-AZ failover, read replica provisioning, parameter tuning. The hourly RDS premium over EC2 is small relative to the labor cost of running a self-managed database with the same reliability.
- Amazon EFS vs self-managed NFS — EFS scales storage and throughput automatically and replicates across AZs. Self-managed NFS on EC2 needs file-server provisioning, replication, and capacity management.
- AWS Backup vs self-rolled snapshot scripts — centralized backup plans, vaults, lifecycle, cross-region copy, compliance reports.
- AWS Lambda vs self-managed cron on EC2 — Lambda is billed per millisecond of execution; cron on EC2 needs the host running around the clock.
The TCO framing on SOA-C02
The exam wants you to recognize that managed-service cost is not just the AWS sticker price: it is the sticker price minus the cost of the engineering hours saved. A SysOps team that values its time will favor managed services for any non-differentiating layer: the database, the load balancer, the certificate, the DNS, the backup. Choose self-managed only when the workload genuinely needs control that managed services do not provide (e.g., a database engine RDS does not support, or a Linux distro Fargate does not run).
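The TCO framing can be made concrete by adding labor to the sticker price for each option. Every figure below (monthly AWS cost, ops hours, the hourly labor rate) is hypothetical. The point is the shape of the comparison, not the numbers.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    aws_usd_per_month: float    # sticker price, e.g. from Cost Explorer
    ops_hours_per_month: float  # patching, backups, failover drills, on-call

    def tco(self, labor_usd_per_hour: float) -> float:
        """Sticker price plus the engineering time the option consumes."""
        return self.aws_usd_per_month + self.ops_hours_per_month * labor_usd_per_hour


# Hypothetical figures for one database tier.
self_managed = Option("postgres-on-ec2", 600.0, 20.0)
managed = Option("amazon-rds", 800.0, 4.0)
LABOR = 100.0

for opt in (self_managed, managed):
    print(opt.name, opt.tco(LABOR))
# postgres-on-ec2 2600.0
# amazon-rds 1200.0
print(min((self_managed, managed), key=lambda o: o.tco(LABOR)).name)  # amazon-rds
```

Here the managed option costs more on the bill but less overall once the saved engineering hours are priced in.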
AWS Cost and Usage Report (CUR) + Athena Analysis
For SysOps cost analysis beyond Cost Explorer's pre-built dimensions, the AWS Cost and Usage Report (CUR) provides per-resource, per-hour line items as raw files in S3.
CUR delivery
- Hourly granularity by default; daily and monthly aggregations also available.
- CSV or Apache Parquet format; Parquet is preferred for Athena efficiency.
- Delivery to an S3 bucket in a folder path you specify; AWS writes a manifest JSON per delivery cycle.
- Updates throughout the month — AWS rewrites the CSV/Parquet files as bills are finalized; the final report for a month is available 24 hours after month-end.
Schema highlights
The CUR schema has 100+ columns spanning identity (account, region), product (service, instance type, AZ), pricing (rate code, On-Demand cost, blended cost, unblended cost, amortized cost), reservation (RI ARN, RI utilization), and Savings Plans (SP ARN, SP-effective cost). Custom columns appear for each cost allocation tag once activated.
Querying CUR with Amazon Athena
The standard pattern:
- Configure CUR with Parquet format and Athena integration enabled in the CUR setup.
- AWS delivers an AWS CloudFormation template to the report's S3 prefix; deploying it creates an AWS Glue crawler and an Athena database/table pointing at the S3 prefix.
- Run SQL in Athena:
SELECT product_servicecode, SUM(line_item_unblended_cost) FROM cur_table WHERE line_item_usage_start_date BETWEEN ... GROUP BY product_servicecode ORDER BY 2 DESC.
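The query's grouping logic can be tried locally before touching Athena. This sketch loads a few hypothetical CUR-style rows into an in-memory SQLite table that uses the same three column names and runs the same query. Two simplifications apply: the real Athena column `line_item_usage_start_date` is a timestamp (SQLite stores ISO text here, which compares correctly), and bound parameters stand in for the date range elided above.

```python
import sqlite3

# Hypothetical CUR line items; real CUR tables carry 100+ columns.
rows = [
    ("AmazonEC2", "2024-05-01 00:00:00", 12.50),
    ("AmazonEC2", "2024-05-01 01:00:00", 12.50),
    ("AmazonS3", "2024-05-01 00:00:00", 3.25),
    ("AmazonCloudWatch", "2024-05-01 00:00:00", 4.00),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cur_table (product_servicecode TEXT, "
    "line_item_usage_start_date TEXT, line_item_unblended_cost REAL)"
)
con.executemany("INSERT INTO cur_table VALUES (?, ?, ?)", rows)

QUERY = """
SELECT product_servicecode, SUM(line_item_unblended_cost)
FROM cur_table
WHERE line_item_usage_start_date BETWEEN ? AND ?
GROUP BY product_servicecode
ORDER BY 2 DESC
"""
result = con.execute(QUERY, ("2024-05-01 00:00:00", "2024-05-31 23:59:59")).fetchall()
print(result)  # [('AmazonEC2', 25.0), ('AmazonCloudWatch', 4.0), ('AmazonS3', 3.25)]
```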
When to use CUR over Cost Explorer
- Cost Explorer for ad-hoc investigation, dashboards, and built-in dimensions. Console-driven.
- CUR + Athena for: programmatic reporting, custom chargeback scripts, multi-month trend analysis, finance-team integrations, and any cut Cost Explorer's UI does not support natively.
- CUR + QuickSight for executive dashboards built on top of CUR with visualization.
- CUR granularity: hourly, daily, or monthly — hourly is the most-tested default.
- Format: CSV or Parquet (Parquet preferred for Athena).
- Delivery destination: S3 bucket with manifest JSON per cycle.
- Refresh cadence: multiple times per day during the month; final report ~24 hours after month-end.
- Native query path: Athena via a Glue crawler (created by the CloudFormation template AWS delivers when Athena integration is enabled).
- Reference: https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html
Scenario Pattern: Monthly Cost Spiked 30 Percent — Find the Culprit
This is the canonical SOA-C02 cost visibility scenario. The runbook:
- Open Cost Explorer. Set the date range to "Last 6 Months" and granularity to "Monthly". Confirm the spike — is it month-over-month, or did it gradually creep up?
- Group by Service. Identify the service that grew. Common culprits: EC2-Other (NAT Gateway, EBS, data transfer), CloudWatch (high-cardinality custom metrics or Logs ingestion), S3 (storage growth or cross-region replication), RDS (instance count or storage scaling).
- Filter to that service, group by Usage Type. This is where NAT Gateway data processing vs NAT Gateway hours becomes visible, where S3 PUT/GET vs storage becomes visible, where RDS storage vs instance hours becomes visible.
- Filter to the spiking usage type, group by Linked Account or Tag. Identify which account or team is responsible. If cost allocation tags are not activated, this step fails — that is why Step 0 of any cost discipline is activating tags.
- Group by Region and AZ. A spike concentrated in one region usually points to a workload migration; cross-AZ traffic spikes point to architectural issues.
- Identify and apply remediation. Common patterns: NAT Gateway data processing spike → add S3 / DynamoDB VPC gateway endpoint to bypass NAT for those services; cross-AZ data transfer spike → co-locate the chatty workload tiers in one AZ; oversized RDS → take a snapshot, then modify to a smaller instance class; idle EC2 → terminate per Trusted Advisor or Compute Optimizer recommendation.
- Set a Budget on the affected dimension. Forecasted alert at 80 percent of the new normal so future regressions are caught early.
The SOA-C02 exam-correct phrasing tends to emphasize "use Cost Explorer to drill from Service to Usage Type to Tag" — the multi-step drill-down rather than jumping straight to CUR.
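The drill-down is the same filter-then-group step repeated at each level. The sketch below mimics it on hypothetical, simplified line items (usage-type names are shortened). At each level `top_grower`, a hypothetical helper, picks the value whose cost grew most month over month, and that value becomes the filter for the next level. In practice Cost Explorer does this in the console.

```python
from collections import defaultdict

# Hypothetical, simplified line items.
FIELDS = ("month", "service", "usage_type", "team", "cost")
ITEMS = [dict(zip(FIELDS, row)) for row in [
    ("2024-04", "EC2-Other", "NatGateway-Bytes", "payments", 400.0),
    ("2024-05", "EC2-Other", "NatGateway-Bytes", "payments", 1900.0),
    ("2024-04", "EC2-Other", "NatGateway-Hours", "payments", 100.0),
    ("2024-05", "EC2-Other", "NatGateway-Hours", "payments", 100.0),
    ("2024-05", "EC2-Other", "NatGateway-Bytes", "search", 50.0),
    ("2024-04", "AmazonS3", "TimedStorage-ByteHrs", "search", 900.0),
    ("2024-05", "AmazonS3", "TimedStorage-ByteHrs", "search", 950.0),
]]


def top_grower(items, dim, prev, curr, **filters):
    """One drill-down step: apply equality filters, group by `dim`, and return
    the (value, delta) whose cost grew most from month `prev` to `curr`."""
    delta = defaultdict(float)
    for it in items:
        if any(it[k] != v for k, v in filters.items()):
            continue
        if it["month"] == curr:
            delta[it[dim]] += it["cost"]
        elif it["month"] == prev:
            delta[it[dim]] -= it["cost"]
    return max(delta.items(), key=lambda kv: kv[1])


PREV, CURR = "2024-04", "2024-05"
service, _ = top_grower(ITEMS, "service", PREV, CURR)
usage_type, _ = top_grower(ITEMS, "usage_type", PREV, CURR, service=service)
team, growth = top_grower(ITEMS, "team", PREV, CURR,
                          service=service, usage_type=usage_type)
print(service, usage_type, team, growth)  # EC2-Other NatGateway-Bytes payments 1500.0
```

The result points at NAT data processing for one team, which is the case where a VPC gateway endpoint is the usual fix.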
Scenario Pattern: Compute Optimizer Says EC2 Is Over-Provisioned — Apply With Confidence
The second canonical SOA-C02 scenario. The runbook:
- Read the recommendation. Compute Optimizer shows current m5.4xlarge, recommended m5.2xlarge, monthly savings $147, performance risk Low.
- Verify the baseline window. Confirm 14+ days of metrics. If the workload had an unusual idle period in the window (holiday, freeze), wait for a representative window before applying.
- Check whether memory metrics are present. If the CloudWatch agent is not installed, Compute Optimizer's recommendation is CPU/network-only; a memory-bound workload could be misjudged. Install the agent and let metrics accumulate before trusting Medium-risk recommendations.
- For Low / Very Low risk: apply directly in non-production first, observe one full business cycle, then apply in production via launch template version bump + ASG instance refresh.
- For Medium risk: load-test in non-production with realistic peak traffic before applying anywhere.
- For High / Very High risk: do not apply without explicit load testing and stakeholder sign-off; the recommendation is information only.
- Track savings. After applying, validate in Cost Explorer that the line item dropped by approximately the projected amount. Discrepancies usually point to data transfer or other components Compute Optimizer does not model.
The exam-correct pattern matches recommendation risk to action: Low risk + production rightsizing is acceptable; Medium risk requires non-production validation; High risk requires load testing.
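The runbook's gating logic fits in one function. This is a minimal sketch. The `Risk` enum values serve only to order Compute Optimizer's five labels. The 14-day check follows this guide's baseline rule of thumb, and the returned strings are hypothetical shorthand for the runbook steps.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Compute Optimizer performance-risk labels, ordered low to high
    (the integer values here are for ordering only)."""
    VERY_LOW = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


def next_step(risk: Risk, baseline_days: int,
              memory_bound: bool, has_memory_metrics: bool) -> str:
    if baseline_days < 14:
        return "wait: baseline window too short"
    if memory_bound and not has_memory_metrics:
        return "install CloudWatch agent, then wait for memory metrics"
    if risk <= Risk.LOW:
        return "apply in non-prod, observe a business cycle, then instance refresh in prod"
    if risk == Risk.MEDIUM:
        return "load-test in non-prod at realistic peak before applying"
    return "information only: load test plus stakeholder sign-off"


print(next_step(Risk.MEDIUM, 21, memory_bound=False, has_memory_metrics=False))
```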
Common Trap: Cost Allocation Tag Activation Lag
Already covered above, repeated here because of frequency: cost allocation tags take up to 24 hours to appear in Cost Explorer after activation, and activation is forward-looking — historical spend is not retroactively re-tagged. SysOps onboarding runbooks should activate tag keys on day 0, before any team-level chargeback report is expected.
Common Trap: Compute Optimizer 14-Day Baseline Window
Already covered, repeated for visibility: Compute Optimizer needs 14 days of CloudWatch metrics before producing a recommendation. Newly launched instances, instances with sparse metrics, or instances missing the CloudWatch agent (for memory) will not yield trustworthy recommendations until the baseline accumulates.
Common Trap: Budget Action Requires a Properly Scoped IAM Role
Already covered, repeated for visibility: Budget Actions need an IAM role with a trust policy allowing budgets.amazonaws.com and a permissions policy granting the relevant action class (organizations:AttachPolicy, iam:AttachUserPolicy, ec2:StopInstances, rds:StopDBInstance). Missing or mis-scoped roles cause Budget Actions to fail silently — the budget breaches, the action shows "execution failed", and remediation never happens.
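A sketch of the two policy documents for a stop-instances Budget Action follows. The trust policy lets `budgets.amazonaws.com` assume the role. The permissions policy is scoped to one hypothetical account and Region. `ssm:StartAutomationExecution` is included on the assumption that stop actions run through Systems Manager Automation, so confirm the exact action list against the AWS Budgets documentation.

```python
import json

# Trust policy: allows the AWS Budgets service to assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "budgets.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def stop_instances_policy(account_id: str, region: str) -> dict:
    """Permissions for a stop-EC2/RDS Budget Action in one account/Region.
    The SSM statement is an assumption; verify against the Budgets docs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:StopInstances"],
                "Resource": f"arn:aws:ec2:{region}:{account_id}:instance/*",
            },
            {
                "Effect": "Allow",
                "Action": ["rds:StopDBInstance"],
                "Resource": f"arn:aws:rds:{region}:{account_id}:db:*",
            },
            {
                "Effect": "Allow",
                "Action": ["ssm:StartAutomationExecution"],
                "Resource": "*",
            },
        ],
    }


# Hypothetical account; these strings would go to IAM CreateRole
# (AssumeRolePolicyDocument) and PutRolePolicy (PolicyDocument).
trust_json = json.dumps(TRUST_POLICY)
perms_json = json.dumps(stop_instances_policy("123456789012", "us-east-1"))
```

Running the permissions document through the IAM Policy Simulator before the first breach is cheaper than discovering an "execution failed" action mid-incident.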
Common Trap: Spot Interruption 2-Minute Warning Is the Only Notice
When AWS decides to reclaim a Spot Instance, the 2-minute warning is the only guaranteed signal. There is no earlier guaranteed notice, no negotiation, and no retry; a rebalance recommendation may arrive sooner, but it is not guaranteed. Workloads that need more than 2 minutes for graceful shutdown (long-running database transactions, multi-gigabyte cache warmup) are not Spot-suitable. SysOps engineers who put stateful workloads on Spot will eventually see data loss; the right answer is either On-Demand for stateful tiers or architecting the workload to checkpoint at intervals shorter than 2 minutes.
Common Trap: Trusted Advisor Free-Tier Limited Checks
Already covered, repeated: Basic and Developer Support tiers see only six core Trusted Advisor checks. The full Cost Optimization category (Idle Load Balancers, Underutilized EBS, Unassociated EIPs, etc.) requires Business or Enterprise (or Enterprise On-Ramp) Support.
Common Trap: Cost Explorer Hourly Granularity Is Paid
Cost Explorer's default daily granularity is free; hourly granularity and resource-level granularity are opt-in paid features that bill per 1000 line items. SysOps teams that need 24-hour incident-window cost analysis must either enable hourly granularity in Billing preferences (small charge) or pull the CUR (free file delivery, but Athena queries cost).
Common Trap: Stopped EC2 Instances Still Cost Money for EBS
A frequent SysOps cost shock: stopping an EC2 instance saves the instance-hour charge but does not stop EBS volume billing. The EBS volumes attached to the stopped instance continue to bill at the same rate. Long-stopped instances should be terminated (after taking AMI/snapshot if data is needed) — or the EBS volumes detached and deleted independently. Idle Elastic IPs attached to stopped instances also continue to bill at ~0.005 USD/hour. Trusted Advisor's "Unassociated Elastic IP" and "Underutilized EBS" checks catch both.
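A quick estimate of what a "stopped" instance still costs per month is shown below. The 0.005 USD/hour Elastic IP rate comes from this guide. The 0.08 USD per GB-month EBS price is an assumed example rate, so look up the real price for your volume type and Region. The 730-hour month is the usual billing approximation.

```python
HOURS_PER_MONTH = 730  # common approximation for a billing month


def stopped_instance_monthly_cost(ebs_gb: float,
                                  ebs_usd_per_gb_month: float,
                                  elastic_ips: int,
                                  eip_usd_per_hour: float = 0.005) -> float:
    """EBS storage plus idle Elastic IPs: what keeps billing after a stop."""
    storage = ebs_gb * ebs_usd_per_gb_month
    addresses = elastic_ips * eip_usd_per_hour * HOURS_PER_MONTH
    return storage + addresses


# Hypothetical: 500 GB of volumes at an assumed 0.08 USD/GB-month, one EIP.
monthly = stopped_instance_monthly_cost(500, 0.08, 1)
print(round(monthly, 2))  # 43.65
```

Multiplied across a fleet of forgotten dev instances, this is the line item Trusted Advisor's checks are designed to surface.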
SOA-C02 vs SAA-C03: The Operational Lens
SAA-C03 and SOA-C02 both touch cost, but the lenses differ.
| Question style | SAA-C03 lens | SOA-C02 lens |
|---|---|---|
| Selecting a purchase model | "Choose the most cost-effective purchase model for steady-state EC2." (RI vs SP math) | "Configure a Savings Plans utilization budget at 95 percent — which budget type?" |
| Compute Optimizer | "Which AWS service provides ML-based rightsizing?" | "Compute Optimizer shows no recommendation for the new instance — why?" (14-day baseline) |
| Trusted Advisor | "Which AWS service identifies idle resources?" | "Which support plan is required to see Idle Load Balancers in Trusted Advisor?" |
| Cost Explorer | Rarely tested at depth. | "Walk through the Cost Explorer drill-down: Service → Usage Type → Tag for a 30 percent spike." |
| Budget Actions | Rarely tested. | "Configure a Budget Action to apply a deny SCP — which IAM role permission is required?" |
| Cost allocation tags | "Use cost allocation tags for chargeback." | "Tags were activated this morning but do not appear in Cost Explorer — why?" (24h lag) |
| Spot Instances | "Which workloads are suitable for Spot?" | "Configure interruption handling — which IMDS path and which EventBridge event?" |
| CUR + Athena | "How to query granular billing data?" | "CUR Parquet → Glue → Athena: which integration setting auto-creates the table?" |
The SAA candidate selects the model; the SOA candidate runs the cost tools, configures the alerts and actions, and remediates when the budget breaches.
Exam Signal: How to Recognize a Domain 6.1 Question
Domain 6.1 questions on SOA-C02 follow predictable shapes.
- "The bill spiked" — the answer is the Cost Explorer drill-down: Service → Usage Type → Tag → Region. Sometimes the remediation is a VPC gateway endpoint (NAT data processing), sometimes rightsizing, sometimes cleanup of idle resources.
- "Idle resources should be cleaned up" — the answer is Trusted Advisor cost checks (Idle EC2, Underutilized EBS, Unassociated EIP, Idle RDS), which require Business or Enterprise Support.
- "Compute Optimizer shows no recommendation" — the answer is the 14-day baseline window or missing memory metrics (CloudWatch agent).
- "Compute Optimizer says over-provisioned, can we apply?" — the answer is read the risk level: Low/Very Low → apply with normal change management; Medium → non-prod validate; High/Very High → load test first.
- "The team needs to be paged before going over budget" — the answer is AWS Budgets at 80 percent forecasted plus a notification SNS topic.
- "The team needs to automatically stop overspending" — the answer is Budget Actions (deny SCP, IAM deny policy, or stop EC2/RDS) with a properly scoped IAM role.
- "Tags activated but not in Cost Explorer" — the answer is the 24-hour activation lag.
- "Workload candidate for Spot Instances" — the answer is stateless + fault-tolerant + flexible across types/AZs + time-elastic; consume via ASG mixed-instances policy with Capacity Rebalancing.
- "SP / RI utilization is too low" — the answer is an SP utilization budget or RI utilization budget alerting below 95 percent.
- "Custom granular cost analysis" — the answer is CUR (Parquet) + Athena via Glue crawler.
With Domain 6 worth 12 percent and Task Statement 6.1 covering most of it, expect 6 to 8 questions in this exact territory. Cost Explorer drill-downs, Compute Optimizer interpretation, Budgets+Actions, and the Trusted Advisor support-plan gating are the four highest-frequency shapes. Reference: https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html
Decision Matrix — Cost Tool for Each SysOps Goal
Use this lookup during the exam.
| Operational goal | Primary tool | Notes |
|---|---|---|
| Investigate a monthly bill spike | Cost Explorer drill-down | Service → Usage Type → Tag → Region. |
| Identify underutilized EC2 across a fleet | Compute Optimizer | 14-day baseline; respect risk levels. |
| Find idle resources broadly | Trusted Advisor cost checks | Requires Business or Enterprise Support. |
| Alert before going over budget | AWS Budgets at 80% forecasted | Email or SNS notification. |
| Auto-stop runaway spending | Budget Action | Deny SCP, IAM deny, or stop EC2/RDS — needs IAM role. |
| Track per-team or per-project spend | Cost allocation tags + Cost Explorer | Activate tags; wait 24h for backfill. |
| Build a custom finance dashboard | CUR + Athena + QuickSight | Parquet format for Athena efficiency. |
| Right-size a Lambda function | Compute Optimizer Lambda recommendations | Memory + duration history. |
| Right-size an EBS volume | Compute Optimizer EBS recommendations | gp2 → gp3 is the most common. |
| Right-size an Auto Scaling group launch template | Compute Optimizer ASG recommendations | Aggregates fleet load profile. |
| Determine if a workload fits Spot | Workload assessment (stateless, fault-tolerant, flexible, time-elastic) | Then ASG mixed-instances + Capacity Rebalancing. |
| Handle Spot interruption gracefully | IMDS poll + EventBridge EC2 Spot Instance Interruption Warning | 2-minute window. |
| Detect SP commitment waste | SP utilization budget | Alert below 95% utilization. |
| Detect RI commitment waste | RI utilization budget | Alert below 95% utilization. |
| Audit finalized monthly billing | CUR (final report) | ~24h after month-end. |
| Clean up unassociated EIPs | Trusted Advisor → Unassociated Elastic IP | Each EIP costs ~0.005 USD/h idle. |
| Stop runaway dev account | Budget Action with deny SCP | Per-OU SCP attach. |
| Compare managed vs self-managed TCO | Cost Explorer per-service + operational labor | Sticker isn't the whole story. |
Common Traps Recap — Cost Visibility, Budgets, and Rightsizing
Every SOA-C02 attempt will see two or three of these distractors.
Trap 1: Cost allocation tags work the moment you apply them
They do not. Applying the tag to resources is step one; activating the tag in the Billing console is step two; waiting up to 24 hours for backfill is step three. Activation is forward-looking, so historical spend before activation is not retroactively re-tagged.
Trap 2: Compute Optimizer always has a recommendation
It does not. Newly launched instances (less than 14 days), instances with sparse metrics, instances missing the CloudWatch agent for memory metrics, and instances classified as "Optimized" all yield no actionable recommendation.
Trap 3: Trusted Advisor full checks are free
The full Cost Optimization category requires Business, Enterprise, or Enterprise On-Ramp Support. Basic and Developer tiers see only six core checks (mostly Security and Service Limits).
Trap 4: Stopped EC2 instances cost nothing
The instance-hour charge stops, but EBS volumes still bill and any Elastic IP attached to the stopped instance continues to bill at ~0.005 USD/hour. Truly idle instances should be terminated or have volumes/EIPs cleaned up independently.
Trap 5: Detailed monitoring or hourly Cost Explorer is free
EC2 detailed monitoring bills per metric per month. Cost Explorer hourly granularity bills per 1000 line items queried. Both are small but real charges that compound at fleet scale.
Trap 6: Budget alerts at 100 percent are sufficient
By 100 percent the cycle is already overspent. Operational maturity sets alerts at 80 percent forecasted so the team has lead time to act.
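As a sketch, this is the request body for an 80 percent forecasted alert, shaped for the AWS Budgets `CreateBudget` call (boto3 `budgets.create_budget`). The account ID, budget name, limit, and SNS topic ARN are hypothetical. The topic's access policy must also allow `budgets.amazonaws.com` to publish to it. The request is only built here, not sent.

```python
def forecast_alert_budget(account_id: str, name: str, monthly_limit_usd: float,
                          sns_topic_arn: str, threshold_pct: float = 80.0) -> dict:
    """Monthly cost budget that notifies SNS when the *forecast* crosses
    threshold_pct of the limit, giving lead time before actual overspend."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": name,
            "BudgetLimit": {"Amount": f"{monthly_limit_usd:.2f}", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "FORECASTED",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": threshold_pct,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "SNS", "Address": sns_topic_arn}
                ],
            }
        ],
    }


req = forecast_alert_budget(
    "123456789012", "dev-account-monthly", 5000,
    "arn:aws:sns:us-east-1:123456789012:budget-alerts",
)
# With credentials: boto3.client("budgets").create_budget(**req)
```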
Trap 7: Budget Actions fire automatically without an IAM role
They do not — Budgets needs an IAM role it can assume with the right permissions class. Missing or mis-scoped roles cause silent failure.
Trap 8: Spot Instances can be reclaimed without warning
Reclamation always comes with a 2-minute warning via IMDS and EventBridge. But 2 minutes is the only guaranteed window; there is no earlier guaranteed notice, so workloads that need more than 2 minutes for graceful shutdown are not Spot-suitable.
Trap 9: Cost Explorer rightsizing replaces Compute Optimizer
It does not. Cost Explorer's rightsizing covers EC2 only and is a simpler engine; Compute Optimizer covers EC2, ASG, Lambda, and EBS with a deeper ML model and explicit risk levels. They are complementary; SOA-C02 expects you to use Compute Optimizer for the broader and more nuanced view.
Trap 10: CUR is the same as Cost Explorer
CUR is the raw line-item file (CSV or Parquet) delivered to S3, queryable via Athena/Redshift Spectrum/QuickSight. Cost Explorer is the interactive console with built-in dimensions. Use Cost Explorer for ad-hoc investigation, CUR for programmatic and custom reporting.
FAQ — Cost Visibility, Budgets, and Rightsizing
Q1: Why does my newly activated cost allocation tag not appear in Cost Explorer?
Tag activation in the Billing console is asynchronous — AWS Billing needs up to 24 hours to backfill the tag dimension into Cost Explorer, AWS Budgets, and the Cost and Usage Report. The activation is also forward-looking: historical spend before the activation moment is not retroactively re-tagged, so even after the 24 hours pass, you will only see new spend grouped by the tag, not old. The SysOps best practice is to define the tag taxonomy and activate the keys at account onboarding, before resources are created — and to enforce tag presence with an SCP or AWS Config rule so future resources cannot escape chargeback.
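One common shape for the SCP enforcement is a deny on `ec2:RunInstances` for instance resources when the request carries no value for the required tag. This is a minimal sketch: `CostCenter` is a hypothetical tag key, and `would_deny` is a local stand-in for the `Null` condition's effect, not an AWS API. The JSON would be supplied as the content of an Organizations service control policy.

```python
import json

REQUIRED_TAG = "CostCenter"  # hypothetical tag key

DENY_UNTAGGED_INSTANCES = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyRunInstancesWithoutCostCenter",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            # Scope to the instance resource so volumes, ENIs, etc. are unaffected.
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # "Null": "true" matches when the request has no such tag key.
            "Condition": {"Null": {f"aws:RequestTag/{REQUIRED_TAG}": "true"}},
        }
    ],
}
scp_json = json.dumps(DENY_UNTAGGED_INSTANCES)


def would_deny(request_tags: dict) -> bool:
    """Local illustration of the Null condition: deny when the key is absent."""
    return REQUIRED_TAG not in request_tags


print(would_deny({}), would_deny({"CostCenter": "cc-1234"}))  # True False
```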
Q2: Why is Compute Optimizer not showing a recommendation for my EC2 instance?
Four common causes. (1) Compute Optimizer is not opted in for the account or organization — fix in the Compute Optimizer console. (2) The instance has been running fewer than 14 days — Compute Optimizer requires a 14-day rolling baseline of CloudWatch metrics before producing a recommendation. (3) Memory metrics are missing — for memory-bound workloads, install the CloudWatch agent so the agent publishes mem_used_percent into the CWAgent namespace, otherwise recommendations are based on CPU/network/disk-IO only. (4) The instance is classified as Optimized — already correctly sized, no action recommended. Wait the 14 days, install the agent if memory matters, and the recommendation should appear.
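A minimal CloudWatch agent configuration that publishes `mem_used_percent` (into the default CWAgent namespace) with the instance ID attached as a dimension is shown below. It is generated from Python for consistency with the other examples. The output file is typically loaded with `amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:<path> -s`. Any path you choose is up to you.

```python
import json

AGENT_CONFIG = {
    "metrics": {
        # Tag each datapoint with the instance ID so it lines up with the instance.
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            # Memory utilization is the metric Compute Optimizer reads.
            "mem": {"measurement": ["mem_used_percent"]},
        },
    }
}

config_text = json.dumps(AGENT_CONFIG, indent=2)
print(config_text)
```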
Q3: How do I configure a Budget Action that automatically stops EC2 instances when a budget breaches?
The configuration has four parts: (1) create the AWS Budget with the threshold (cost or usage) you want to act on; (2) create an IAM role whose trust policy allows budgets.amazonaws.com to assume it and whose permissions policy grants ec2:StopInstances on the relevant resource ARNs; (3) configure the Budget Action of type "Apply IAM policy" or "Stop EC2/RDS instances" referencing the IAM role and the target instance IDs or tag filter; (4) choose Automatic or Manual approval. The most common failure mode is the IAM role missing permissions — the action will show "execution failed" in the Budgets console rather than executing. Verify the IAM role with the IAM Policy Simulator before relying on the action.
Q4: My monthly bill jumped 30 percent — what is the first step in Cost Explorer?
Open Cost Explorer, set the date range to "Last 6 Months" and granularity to "Monthly", and group by Service to identify which AWS service grew. Once you have the service, filter to that service and group by Usage Type to identify the specific line item — NAT Gateway data processing vs NAT Gateway hours, S3 PUT/GET vs storage, RDS storage vs instance hours. Then filter to the spiking usage type and group by Linked Account or Tag to identify the responsible team. Finally, group by Region or AZ to spot localized issues. Common remediations: VPC gateway endpoint for S3/DynamoDB to bypass NAT, instance rightsizing, idle resource cleanup, or co-locating chatty tiers in one AZ to reduce cross-AZ data transfer.
Q5: What workloads are suitable for EC2 Spot Instances?
Workloads that are stateless (losing the instance does not lose state), fault-tolerant (work is checkpointed or queued so loss is recoverable), flexible across instance types and AZs (the workload runs equally well on m5, m5a, m6i, c5 etc., giving the Spot allocator many capacity pools), and time-elastic (delays in starting are acceptable). Classic Spot-suitable workloads: batch processing, big data ETL, CI/CD build workers, render farms, web crawlers, fault-tolerant containerized services behind SQS, ASG fleets with mixed-instances policy. Not suitable: stateful databases, single-instance application servers, hard-deadline workloads where the 2-minute interruption warning is too short for graceful shutdown.
Q6: How does the Spot Instance 2-minute interruption warning reach my application?
Two channels. (1) EC2 Instance Metadata Service (IMDS) — the path http://169.254.169.254/latest/meta-data/spot/instance-action returns a JSON document with the planned termination time when an interruption is imminent; the application can poll IMDS every 5–10 seconds and react. (2) Amazon EventBridge — AWS publishes an EC2 Spot Instance Interruption Warning event to the default event bus, which can be routed to Lambda, SNS, Step Functions, or a Systems Manager Automation runbook for centralized handling. The robust pattern is to use EventBridge for fleet-wide handling (drain ALB targets, deregister from service discovery) and IMDS poll inside the instance for local cleanup (flush in-memory state to S3, exit container processes cleanly). Two minutes is the only window — workloads needing longer are not Spot-suitable.
Q7: When do I use AWS Cost and Usage Report (CUR) versus AWS Cost Explorer?
Use Cost Explorer for ad-hoc investigation, executive dashboards using built-in dimensions, console-driven drill-downs (Service → Usage Type → Tag), and forecasting up to 12 months forward. Use CUR + Athena (or Redshift Spectrum or QuickSight) for programmatic reporting, custom chargeback scripts that need per-resource per-hour line items, multi-month trend analysis at fine granularity, finance-team integrations with non-AWS tooling, and any custom cut Cost Explorer's UI does not natively support. The standard CUR setup is Parquet format with Athena integration enabled, which delivers a CloudFormation template that creates a Glue crawler and Athena table. Then run SQL like SELECT product_servicecode, SUM(line_item_unblended_cost) FROM cur_table GROUP BY 1 ORDER BY 2 DESC. CUR final monthly data is available ~24 hours after month-end.
Q8: Why are most Trusted Advisor cost checks missing in my account?
Trusted Advisor's full check set — including the entire Cost Optimization category (Idle EC2, Underutilized EBS, Unassociated Elastic IPs, Idle RDS, Idle Load Balancers, etc.) — requires a Business, Enterprise, or Enterprise On-Ramp Support plan. The free Basic and Developer Support tiers see only six core checks (mostly Service Limits and a small number of Security checks). To unlock the full Cost Optimization category, upgrade the Support plan. SOA-C02 tests this directly: when a question asks "the team wants to enable Idle Load Balancers and Underutilized EBS Volume checks", the right answer pairs the Trusted Advisor service with the Business/Enterprise Support requirement.
Q9: What is the difference between Reserved Instances and Savings Plans for SOA-C02?
Reserved Instances (RIs) are a 1- or 3-year commitment to a specific instance family, region, OS, and tenancy in exchange for up to ~72 percent off On-Demand. Standard RI is locked to those attributes; Convertible RI can be exchanged for different families during the term at a slightly lower discount. Savings Plans (SPs) are a 1- or 3-year commitment to a dollars-per-hour compute spend in exchange for discount. Compute SP applies broadly to EC2 (any family/region/OS), Fargate, and Lambda for ~66 percent maximum discount. EC2 Instance SP is family-locked to a specific region for ~72 percent maximum, comparable to Standard RI but commitment is dollar-based instead of instance-based. SOA-C02 does not require deep math — it tests utilization budgets (alert when SP/RI utilization drops below 95 percent), coverage budgets, and reading Cost Explorer's purchase recommendations.
Q10: Should I use Compute Optimizer or Cost Explorer rightsizing?
Compute Optimizer is the broader and deeper tool — it covers EC2 instances, EC2 Auto Scaling groups, AWS Lambda functions, and Amazon EBS volumes, and emits explicit risk levels (Very Low through Very High) along with each recommendation. Cost Explorer's rightsizing recommendations cover only standalone EC2 instances and are a simpler engine without the same risk granularity. They are complementary. SOA-C02 favors Compute Optimizer for any "ML-driven recommendations across compute and storage" question and uses Cost Explorer rightsizing only for "quick console view of EC2 rightsizing within Cost Explorer". For real operations, install the CloudWatch agent for memory metrics, opt in to Compute Optimizer at the organization level, wait the 14-day baseline, then act on Low and Very Low risk recommendations first.
Further Reading and Related Operational Patterns
- What is AWS Cost Explorer
- Cost Explorer Rightsizing Recommendations
- Activating User-Defined Cost Allocation Tags
- Managing Costs with AWS Budgets
- Configuring AWS Budgets Actions
- What is AWS Compute Optimizer
- Compute Optimizer EC2 Recommendations
- Compute Optimizer Auto Scaling Group Recommendations
- Compute Optimizer Lambda Recommendations
- Compute Optimizer EBS Volume Recommendations
- AWS Trusted Advisor User Guide
- Trusted Advisor Cost Optimization Checks
- EC2 Spot Instances
- Spot Instance Interruptions
- AWS Savings Plans User Guide
- AWS Cost and Usage Report (CUR)
- Querying CUR with Amazon Athena
- AWS SOA-C02 Exam Guide v2.3 (PDF)
Once cost visibility is in place, the next operational layers are: Performance Optimization (EBS, RDS, EC2, S3) for the right-sizing-for-performance side of Domain 6 (the cost lens here meets the performance lens there at Compute Optimizer and CloudWatch metrics); Multi-Account Strategy with Organizations and Control Tower for the SCP guardrails that Budget Actions attach to and for organization-wide cost allocation tag and Compute Optimizer activation; EC2 Auto Scaling, ELB, and Multi-AZ HA for the workload tier whose rightsizing and Spot consumption these tools govern; and CloudWatch Metrics, Alarms, and Dashboards for the underlying metrics that Compute Optimizer and Cost Explorer rightsizing both consume.