examhub.cc · The most efficient path to the most valuable certifications.

Cost Optimization for Existing Production Workloads

7,650 words · ≈ 39 min read

Cost optimization for existing AWS workloads is the discipline of retrofitting FinOps onto an AWS estate that was built without it — without breaking production, without asking teams to rewrite applications, and without causing the kind of political backlash that makes the next optimization initiative dead on arrival. On SAP-C02, cost optimization for existing workloads is Task 3.5 inside Domain 3 (Continuous Improvement for Existing Solutions, 25% of the exam) and it is the only cost topic that tests remediation framing rather than greenfield design. Every question stem starts with a running system — "a company has 20 accounts spending $2M per year", "a CUR report shows $50K per month on NAT Gateway data processing", "200 EC2 instances run at 8% average CPU" — and asks which cost optimization levers to pull, in which order, with which guardrails. Cost optimization for existing workloads is therefore not about knowing the services; it is about sequencing them so that the first week produces visible savings and the fiftieth week still has momentum.

This guide assumes you already know Associate-level cost basics (on-demand vs Reserved Instances vs Spot, S3 storage classes, Trusted Advisor existence) and focuses on the Pro-level audit-to-remediation workflow: AWS Cost and Usage Report (CUR) landed in Amazon S3, queried with Amazon Athena, visualised with Amazon QuickSight; AWS Compute Optimizer and AWS Trusted Advisor feeding idle-resource findings; Savings Plans recommendations balanced with Reserved Instance Exchange; a Spot migration plan that respects interruption tolerance; EC2 generation migrations to Graviton; Amazon EBS gp2 → gp3 bulk conversion; Amazon S3 Storage Lens → S3 Intelligent-Tiering → lifecycle rules; unused resource cleanup; and a data transfer audit that targets AWS PrivateLink VPC endpoints, Amazon CloudFront coverage, and NAT Gateway topology. The goal is Pro-depth decision-making plus a rollout plan that keeps twenty account owners on your side.

Why Cost Optimization for Existing Workloads Matters on SAP-C02

At Professional tier, AWS expects you to treat cost as a first-class architectural constraint, on equal footing with reliability and security. Task 3.5 of the SAP-C02 exam guide ("Identify opportunities for cost optimizations") appears as roughly 15 percent of Domain 3 questions, and community data from recent takers shows the most frequent traps are: recommending Savings Plans commitments before running a coverage-and-utilization report; proposing Spot for stateful workloads; picking S3 Intelligent-Tiering for objects under 128 KB; converting RDS to Graviton without checking engine-version support; and forgetting that Savings Plans cannot be cancelled mid-term.

The exam also loves to pit Compute Optimizer against Trusted Advisor against Cost Explorer rightsizing recommendations, and Savings Plans against Reserved Instances against Spot for the same workload. The fastest way to get these right is to know which tool owns which signal, and which optimization lever applies to which class of waste.

Core term for cost optimization on existing workloads — Cost optimization for existing AWS workloads is the architectural pattern of combining AWS Cost and Usage Report, Amazon Athena, Amazon QuickSight, AWS Compute Optimizer, AWS Trusted Advisor, AWS Cost Explorer, AWS Savings Plans, Amazon EC2 Spot, AWS Graviton migration, Amazon EBS gp3 migration, Amazon S3 Storage Lens with S3 Intelligent-Tiering and lifecycle rules, AWS PrivateLink VPC endpoints, and Amazon CloudFront to retrofit FinOps onto a running AWS estate — driving every dollar of waste to an owner, a remediation action, and a measured unit-cost outcome, without sacrificing reliability or team autonomy.

Plain-Language Explanation: Cost Optimization on Existing Workloads

Cost optimization on an existing AWS estate can feel intimidating because it stitches together ten services and twenty stakeholders. Three analogies make the whole programme click.

Analogy 1 — The Old House Energy Retrofit

Imagine you have just bought a forty-year-old house and the electricity and heating bills are three times your neighbour's. You do not demolish the house; you retrofit it. Cost optimization on an existing AWS workload works the same way. First, you hire an energy auditor who walks every room with a thermal camera — that is the CUR + Athena + QuickSight audit, producing a heat map of which rooms (accounts, services, tags) leak the most cost. The auditor then lists findings in priority order: "the attic insulation is worst, the boiler is 20 years old, half the bulbs are incandescent, the fridge is from 1998" — that is Trusted Advisor + Compute Optimizer + Storage Lens findings, ranked by dollar impact. Next comes the quick-win week: swap all the bulbs to LED (EBS gp2 → gp3, delete unattached EIPs, delete old AMIs) — zero-risk, same-day savings. After that you schedule the seasonal upgrades — insulating the attic in autumn, replacing the boiler in spring — because those need planning and cannot all happen at once (Graviton migration, RDS engine upgrade, Savings Plans commitment). Finally, you lock in the utility's fixed-rate tariff for the stable baseline you now know you need (Savings Plans), while keeping some spot-market flexibility for the months you travel (Spot for batch). The key insight is that sequence matters: you never buy a new boiler before you have read the meter for a month, and you never commit to a fixed tariff before you know your baseline.

Analogy 2 — The Supermarket Inventory Clean-Up

Cost optimization on an existing AWS estate is like taking over a supermarket where the previous manager never used a stock system. The CUR is the receipts from every till for the last twelve months. Athena is the analyst who runs SQL over those receipts. QuickSight is the dashboard on the back-office wall that every department head checks on Monday morning. Trusted Advisor is the store auditor who walks the aisles once a week and notes "this refrigerator is empty but running (unused Elastic IPs), these boxes have expired (old AMIs), this chiller is oversized for the amount of stock it holds (oversized EC2)". Compute Optimizer is the refrigeration engineer who inspects each chiller's internal temperature history and says "you can replace this model with a 40 percent smaller one that will keep the goods at the same temperature" (rightsizing from m5.2xlarge to m6g.large). S3 Storage Lens is the stockroom inventory camera that shows which shelves have not been touched for 90 days — perfect candidates for the deep freezer (S3 Glacier Deep Archive) via lifecycle rules. Savings Plans is the supermarket signing a three-year contract with the dairy supplier for a guaranteed baseline volume at a 30 percent discount, while continuing to buy seasonal extras on the spot market (EC2 Spot). And like a real retail manager, you never tell the stockroom staff on Friday afternoon "we are replacing every chiller on Monday" — you pilot in aisle 7, measure, then roll out aisle by aisle.

Analogy 3 — The Fleet Manager at a Logistics Company

A logistics company with a 500-truck fleet that has grown organically for a decade looks exactly like a 20-account AWS estate. The CUR is the fuel-card data from every truck. Athena plus QuickSight is the fleet dashboard that ranks cost per kilometre per driver per route. Compute Optimizer is the telemetry platform that notes "truck 217 is always at 20 percent load — downsize to a van". Trusted Advisor is the weekly depot inspection that flags unregistered trailers (unattached EBS volumes), trucks parked for 90 days (stopped instances), and trucks with expired road tax (idle Classic Load Balancers). Graviton migration is replacing a diesel fleet with hybrid engines of the same chassis class — same routes, same drivers, 20 percent less fuel, but every vehicle needs a week in the workshop. gp2 → gp3 is swapping an old tyre model for a cheaper, better-performing one with the same bolt pattern — an in-place refit, no new vehicle. Spot is renting drivers from the temp agency for seasonal overflow, accepting they can be recalled at two minutes notice — fine for warehouse-to-warehouse shuttling, suicidal for perishable-goods long-haul. Savings Plans is the three-year volume fuel contract. RI Exchange is swapping an old contract on diesel vans for the same cost commitment on the newer hybrid model. And critically, the fleet manager never announces "all trucks will be replaced next quarter" — they publish a Cost Optimization Portfolio with per-depot savings targets, let each depot pilot two vehicles, and scale what works.

Pick the analogy that clicks — If you are an infrastructure engineer, the old-house retrofit analogy is usually the fastest path to retaining the sequencing rules (audit before buying, quick wins before long projects, rate-lock after baseline is known). If you are a FinOps practitioner, the supermarket inventory analogy maps cleanly to the unit-cost and stakeholder-engagement side. If you are presenting to executives, the fleet manager analogy sells the portfolio view. Use whichever one helps you lock the pattern into long-term memory — on SAP-C02 the exam rewards pattern recognition, not service trivia.

Starting Point: Where Cost Optimization on Existing Workloads Usually Goes Wrong

Before any tooling, recognise the three failure modes that kill cost optimization programmes. First, the "shop first, measure later" anti-pattern: a team buys a $500K Savings Plan before anyone has run a coverage report, then discovers half the commitment is unused because a large workload is being decommissioned next quarter. Second, the "big bang migration" anti-pattern: a central team announces a fleet-wide Graviton migration without a per-service compatibility audit, hits a single library incompatibility, and the project stalls for six months. Third, the "surprise chargeback" anti-pattern: FinOps rolls out tag enforcement via SCP without warning, breaks three deploy pipelines on the first day, and loses political support for the rest of the year.

On SAP-C02, any answer that begins with a commitment purchase, a fleet-wide migration, or a hard enforcement control before any audit has been performed is almost certainly wrong. The correct Pro-level pattern is always observe → segment → pilot → scale → enforce.

SAP-C02 answer ordering rule — When a SAP-C02 scenario asks "which of the following should the solutions architect do first?" in a cost optimization context on an existing workload, the answer is almost never a purchase commitment, a fleet-wide migration, or a lifecycle-rule rollout. It is almost always an observability step: enable the AWS Cost and Usage Report to an Amazon S3 bucket, turn on AWS Compute Optimizer at the organization level, activate Amazon S3 Storage Lens advanced metrics, or enable AWS Trusted Advisor at Business or Enterprise Support. Only once data exists can the follow-up actions be prioritised.

Cost Audit Workflow: CUR + Athena + QuickSight

The AWS Cost and Usage Report is the only dataset that captures every line-item of AWS spend at hourly granularity, with every user-defined cost allocation tag, every RI or Savings Plans amortisation split, every unblended and blended rate. It is the foundational source for Pro-level cost optimization and it is free — you only pay for the Amazon S3 storage and any downstream query cost.

The canonical audit architecture for cost optimization on an existing workload is a three-stage pipeline. First, the management account delivers the CUR in Parquet format to a centralised Amazon S3 bucket in the shared-services or billing account, with hourly granularity, resource IDs enabled, and time-based partitioning (s3://bucket/cur/year=YYYY/month=MM/). Second, AWS Glue crawls the CUR schema and registers it as an external table so Amazon Athena can run SQL queries directly against the Parquet files. Third, Amazon QuickSight connects to Athena as a data source and publishes dashboards that are shared with account owners at the appropriate permission level — account owners see only their slice, FinOps sees the whole estate.

The Three Canonical QuickSight Dashboards

The three cost optimization dashboards every SAP-C02 scenario implicitly expects are:

Dashboard 1 — Tag Coverage Dashboard. Shows the percentage of spend by account that carries each required tag (CostCenter, Environment, Owner, Project). Tag coverage below 80 percent is the single biggest blocker to chargeback and cost optimization because untagged resources cannot be attributed, so they cannot be owned, so they cannot be remediated. A working cost optimization programme lifts tag coverage from 40 percent to 95 percent in the first quarter.

Dashboard 2 — Service Cost by Team Dashboard. Shows monthly spend per team (derived from the CostCenter tag or AWS Cost Categories), broken down by service, with a seven-day and thirty-day trend. The objective is to give each team a unit-cost view: "cost per active user", "cost per transaction", "cost per 1000 API calls". Without a unit-cost view, teams optimise nothing because reducing absolute spend while growing usage looks like failure.

Dashboard 3 — Waste Hunt Dashboard. Joins CUR line-items with Trusted Advisor idle-resource findings and Compute Optimizer recommendations to produce a ranked list: "Top 20 resources by potential monthly savings". Each row shows the resource ID, the account, the tag owner, the current cost, the predicted cost after remediation, and the recommended action. This is the dashboard that drives the weekly waste-hunt standup.

Cost audit data pipeline — The SAP-C02 canonical cost audit pipeline is: AWS Cost and Usage Report → Parquet in Amazon S3 → AWS Glue Data Catalog → Amazon Athena queries → Amazon QuickSight dashboards. Memorise this five-step chain — SAP-C02 questions frequently ask which component is missing, or which alternative (e.g., AWS Cost Explorer) is not sufficient for row-level custom analytics across a 20-account organization.

The First Three Athena Queries

For a twenty-account $2M-per-year estate, the first three Athena queries against the CUR are worth memorising because they produce 80 percent of the initial remediation backlog.

Query 1 ranks line_item_usage_account_id by line_item_unblended_cost for the last 30 days — this identifies the three or four accounts that drive the majority of spend. Query 2 groups by product_product_name and line_item_usage_account_id — this produces the service-by-account spend heat map, from which the top ten line items usually stand out (typically Amazon EC2, Amazon RDS, Amazon S3, AWS Data Transfer, and NAT Gateway). Query 3 filters line_item_operation for NatGateway and DataTransfer line items — NAT Gateway data-processing charges ($0.045/GB in us-east-1) often dominate the data transfer bill and are the lowest-risk, fastest-payback optimization target.
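The three queries can be sketched as SQL builders. This is a minimal sketch: it assumes the Glue crawler registered the CUR as `cur.cur_report` (database and table names are placeholders), and uses the standard snake_case CUR column names that the Parquet export produces.

```python
# Sketch of the three opening Athena queries against the CUR.
# Assumption: the CUR is registered in the Glue Data Catalog as cur.cur_report.

def top_accounts_by_spend(days: int = 30) -> str:
    """Query 1: rank accounts by unblended cost over the trailing window."""
    return f"""
        SELECT line_item_usage_account_id,
               SUM(line_item_unblended_cost) AS total_cost
        FROM cur.cur_report
        WHERE line_item_usage_start_date >= date_add('day', -{days}, now())
        GROUP BY line_item_usage_account_id
        ORDER BY total_cost DESC
    """

def service_by_account_heatmap() -> str:
    """Query 2: service-by-account spend heat map."""
    return """
        SELECT product_product_name,
               line_item_usage_account_id,
               SUM(line_item_unblended_cost) AS total_cost
        FROM cur.cur_report
        GROUP BY product_product_name, line_item_usage_account_id
        ORDER BY total_cost DESC
        LIMIT 50
    """

def nat_and_data_transfer() -> str:
    """Query 3: isolate NAT Gateway processing and data transfer charges."""
    return """
        SELECT line_item_operation,
               SUM(line_item_usage_amount)   AS usage_gb,
               SUM(line_item_unblended_cost) AS total_cost
        FROM cur.cur_report
        WHERE line_item_operation LIKE '%NatGateway%'
           OR line_item_operation LIKE '%DataTransfer%'
        GROUP BY line_item_operation
        ORDER BY total_cost DESC
    """
```

Each string is pasted into the Athena console (or run via the Athena API) against the CUR table; the windowing in Query 1 uses Athena's Presto-style date_add.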

Trusted Advisor + Compute Optimizer Across the Organization

AWS Trusted Advisor and AWS Compute Optimizer are complementary, not competing, tools for cost optimization on existing workloads. Trusted Advisor is the breadth tool — one scan, dozens of check categories, shallow depth per check. Compute Optimizer is the depth tool — narrower scope (EC2, Auto Scaling groups, EBS, Lambda, ECS on Fargate, RDS, commercial DB on EC2) but deep machine-learning-driven recommendations backed by fourteen days of CloudWatch metrics.

Trusted Advisor Cost Optimization Checks at Enterprise Scale

Trusted Advisor at Business or Enterprise Support unlocks the full cost optimization category — roughly 20 checks including Low Utilization Amazon EC2 Instances, Idle Load Balancers, Unassociated Elastic IP Addresses, Underutilized Amazon EBS Volumes, Amazon RDS Idle DB Instances, Amazon S3 Bucket Lifecycle Policy, Amazon EC2 Reserved Instance Lease Expiration, AWS Lambda Functions with High Error Rates, and Savings Plan coverage checks. At organization scale, Trusted Advisor findings are aggregated via AWS Organizations — the management account or a delegated administrator can view organization-wide findings across every linked account.

For cost optimization on an existing 20-account workload, the weekly waste-hunt standup reviews the five highest-dollar-impact Trusted Advisor checks per account: unattached EBS volumes over 30 days old, unassociated Elastic IPs, idle Classic or Application Load Balancers with zero requests over seven days, underutilized EC2 instances (daily CPU at or below 10 percent and network I/O at or below 5 MB on at least 4 of the last 14 days, the thresholds the Trusted Advisor check itself applies), and low-utilization RDS instances.

Compute Optimizer at Organization Level

AWS Compute Optimizer delivered via the organization-level opt-in from the management account produces recommendations for every linked account in one place. The four EC2 finding classifications are Under-provisioned (the instance is CPU-bound or memory-bound and needs a larger size or family), Over-provisioned (the instance can be downsized, typically the highest-value recommendations), Optimized (no change needed), and None (insufficient metric data, fewer than 14 days observed).

Compute Optimizer recommendations are Graviton-aware — for every recommendation it shows the equivalent Graviton-based instance family (e.g., current m5.2xlarge → recommended m6g.large or m7g.large) with the expected monthly savings. For memory-based rightsizing Compute Optimizer requires the CloudWatch agent to be installed on the instance so memory utilization metrics are ingested; without the agent Compute Optimizer can only rightsize on CPU and network.

Compute Optimizer memory blind spot — AWS Compute Optimizer does not see memory utilization by default — the CloudWatch agent must be installed on each EC2 instance to publish mem_used_percent to the CWAgent namespace. If a SAP-C02 question states that Compute Optimizer recommends downsizing memory-intensive Java workloads from r5.4xlarge to m6g.large but the team disagrees, the most likely root cause is that the CloudWatch agent is not installed and the recommendation is based on CPU only. The remediation is to deploy the CloudWatch agent fleet-wide via AWS Systems Manager State Manager, wait 14 days, and re-evaluate.
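The agent configuration that closes the memory blind spot is small. A minimal sketch, shown here as the Python dict you would serialise into the SSM parameter that State Manager pushes fleet-wide (the parameter itself and its name are deployment details, not shown):

```python
import json

# Minimal CloudWatch agent configuration that publishes memory utilization
# to the CWAgent namespace, which is where Compute Optimizer looks for it.
agent_config = {
    "metrics": {
        "namespace": "CWAgent",
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {
                # mem_used_percent is the metric Compute Optimizer ingests
                # for memory-based rightsizing.
                "measurement": ["mem_used_percent"],
                "metrics_collection_interval": 60,
            }
        },
    }
}

print(json.dumps(agent_config, indent=2))
```

After the agent runs fleet-wide for the 14-day lookback window, memory-aware recommendations replace the CPU-only ones.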

The Audit Workflow Pro-Level Sequence

The Pro-level audit workflow that SAP-C02 rewards is: (1) enable CUR + Trusted Advisor Business/Enterprise + Compute Optimizer Organization opt-in in week 1; (2) wait 14 days for Compute Optimizer to accumulate sufficient metric data; (3) in week 3, build the three QuickSight dashboards; (4) in week 4, run the first waste-hunt standup with findings ranked by dollar impact; (5) from week 5 onwards, run the standup weekly and drive tickets from the top of the list downward.

Audit Workflow: The Weekly Waste-Hunt Standup

The weekly waste-hunt standup is the operational cadence that converts cost optimization findings into remediations. It is not a ceremony invented by AWS — it is the FinOps Foundation's "inform, optimise, operate" loop operationalised on top of AWS tooling.

A typical agenda for cost optimization on existing workloads runs 45 minutes. The FinOps lead presents the top ten new findings from the Waste Hunt Dashboard. Each finding has a named owner (derived from the Owner tag or account assignment), a dollar amount, and a recommended action. The owner confirms or challenges the recommendation in five minutes — if confirmed, a ticket is created in the team's issue tracker with a two-week deadline; if challenged, the owner provides the business reason and the finding is muted in QuickSight (a CUR-backed mute-table is updated).

The standup has three explicit anti-patterns it avoids. First, it never name-shames — teams are listed by CostCenter with savings targets, never by individual engineer. Second, it never overrides an owner's rejection — if a team says "this instance must stay at r5.4xlarge because it runs our year-end batch which we cannot measure in a 14-day window", FinOps records the exception and moves on. Third, it never approves commitments at the standup — Savings Plans and Reserved Instance purchases go through a separate monthly forecasting meeting.

Savings Plans Recommendation Engine and Coverage Gaps

AWS Savings Plans are the single highest-dollar-impact cost optimization lever on an existing workload — a 1-year All Upfront Compute Savings Plan delivers a 27 percent to 36 percent discount versus On-Demand, and a 3-year All Upfront delivers up to 66 percent. For a $2M/year estate with 60 percent stable baseline, a correctly sized Savings Plan portfolio can save $400K to $600K annually.
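The savings range quoted above is back-of-envelope arithmetic, not a price-list lookup. A sketch, with the discount rates treated as illustrative midpoints of the published 1-year and 3-year Compute Savings Plans bands:

```python
# Back-of-envelope for the Savings Plans opportunity on a $2M/year estate.
# Discount rates here are illustrative assumptions, not exact price-list values.
estate_annual = 2_000_000
stable_fraction = 0.60            # share of spend that is steady baseline
baseline = estate_annual * stable_fraction

low = baseline * 0.33             # roughly 1-year Compute SP territory
high = baseline * 0.50            # roughly 3-year territory for much of the fleet

print(f"Commitment-eligible baseline: ${baseline:,.0f}/yr")
print(f"Estimated savings range:      ${low:,.0f} to ${high:,.0f}/yr")
```

The point of the exercise is that the opportunity scales with the stable baseline, not with total spend; a bursty estate with a 30 percent baseline halves these numbers.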

Savings Plans come in three flavours: Compute Savings Plans (most flexible — applies to EC2 across all regions, instance families, sizes, tenancies, and operating systems, plus Fargate and Lambda), EC2 Instance Savings Plans (locked to a specific instance family in a specific region, higher discount than Compute, less flexible), and SageMaker Savings Plans (applies to SageMaker ML workloads only). For cost optimization on a heterogeneous 20-account workload, Compute Savings Plans are almost always the correct choice because their flexibility survives instance family migrations (e.g., m5 → m6g) without breaking the commitment.

Reading the Coverage and Utilization Report

AWS Cost Explorer publishes two Savings Plans reports that together drive the commitment decision. Coverage answers "what percentage of my eligible On-Demand spend is covered by a Savings Plan?" — a coverage of 40 percent means 60 percent of your compute spend is still paying full On-Demand rates and is a purchase opportunity. Utilization answers "what percentage of the Savings Plan commitment I already own is being used?" — a utilization of 95 percent means the commitment is healthy; a utilization of 60 percent means you over-committed and are paying for unused capacity.

The Pro-level heuristic for cost optimization on an existing workload is: target 80 percent coverage and 95 percent utilization across the portfolio. Below 80 percent coverage, buy more Savings Plans. Below 95 percent utilization, stop buying and investigate why existing commitments are underused — usually a workload was decommissioned or migrated to a region the EC2 Instance Savings Plan does not cover.
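The heuristic reduces to a two-branch decision, with utilization always checked first, because buying more commitment on top of unused commitment compounds the loss. A minimal sketch:

```python
def sp_next_action(coverage_pct: float, utilization_pct: float) -> str:
    """Portfolio heuristic from the text: target 80% coverage, 95% utilization.
    Utilization problems take priority over coverage gaps."""
    if utilization_pct < 95:
        return "stop buying; investigate why existing commitment is unused"
    if coverage_pct < 80:
        return "buy more Savings Plans, sized at 70-80% of current baseline"
    return "hold; re-evaluate after the next month of CUR data"

print(sp_next_action(40, 97))   # big coverage gap, healthy utilization
print(sp_next_action(85, 60))   # over-committed: fix utilization first
```

On the exam, the same ordering applies: a stem that shows 60 percent utilization is never answered with a purchase.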

AWS provides a Savings Plans Recommendation Engine in Cost Explorer that analyses the last 7, 30, or 60 days of On-Demand usage and proposes a commitment level with estimated monthly savings and breakeven hours. The engine supports three payment options (All Upfront, Partial Upfront, No Upfront) and two terms (1-year, 3-year). For cost optimization on a volatile existing estate, a 1-year No Upfront Compute Savings Plan sized at 70 to 80 percent of current baseline is the safest starting commitment — it captures most of the discount without locking in capital or term beyond the next budget cycle.

Savings Plans are non-cancellable — An AWS Savings Plan commitment is irrevocable for the full 1-year or 3-year term. It cannot be cancelled, refunded, downsized, or transferred to another payer account. On SAP-C02, any answer that suggests "cancel the Savings Plan and repurchase" is automatically wrong. The correct remediation for over-commitment is to drive additional eligible workload to the payer account so the commitment is consumed, or to accept the amortised loss as the cost of a safety margin. This is why the Pro-level commitment heuristic is 70 to 80 percent of baseline, never 100 percent.

Spot Migration Plan: Identify, Pilot, Roll Out

Amazon EC2 Spot capacity offers discounts of up to 90 percent off On-Demand, with the trade-off that AWS can reclaim the capacity with a 2-minute interruption notice. On an existing workload, Spot migration is a surgical exercise: identify interruption-tolerant workloads first, pilot on one service, then roll out.

Identification: Which Existing Workloads Tolerate Spot

The classification rule for cost optimization on existing workloads is: a workload tolerates Spot if it is stateless, horizontally scalable, re-runnable, and not time-critical at the individual-instance level. The canonical candidates are batch ETL jobs on EMR, CI/CD build runners, rendering farms, stateless web tiers behind an ALB with health-check-driven replacement, containerised microservices with graceful shutdown handlers, and Amazon EKS or Amazon ECS workloads with multiple replicas. The canonical non-candidates are single-primary databases (RDS, self-managed PostgreSQL), session-stateful servers without external session stores, any workload with a "no interruption" SLA, and migrations or batch jobs that cannot checkpoint.
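The classification rule is mechanical enough to encode directly. A toy classifier over the four attributes named above (the workload examples are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    stateless: bool
    horizontally_scalable: bool
    re_runnable: bool
    instance_time_critical: bool   # SLA at the individual-instance level

def spot_eligible(w: Workload) -> bool:
    """Spot only if stateless, horizontally scalable, re-runnable,
    and not time-critical at the individual-instance level."""
    return (w.stateless and w.horizontally_scalable
            and w.re_runnable and not w.instance_time_critical)

ci_runner = Workload("ci-build-runner", True, True, True, False)
rds_primary = Workload("orders-db-primary", False, False, False, True)
print(spot_eligible(ci_runner))    # eligible
print(spot_eligible(rds_primary))  # never eligible
```

A single False on any of the first three attributes, or a True on the last, disqualifies the workload; there is no partial credit in this rule.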

The Two-Week Pilot

The Spot migration pilot for cost optimization on an existing workload uses Amazon EC2 Auto Scaling groups with a mixed instances policy: the steady-state floor runs as On-Demand base capacity, all scale-out capacity runs on Spot, spread across four to six instance types in the same vCPU class. The ASG is configured with the capacity-optimized-prioritized allocation strategy so Spot pools with the deepest capacity are chosen first. A Lambda function subscribed to the EC2 Spot Instance Interruption Warning event drains the instance from the ALB target group within the 2-minute notice window.

A two-week pilot measures interruption rate, mean time to replacement, application-level error rate, and unit cost. If the interruption rate stays below 5 percent per day and application-level errors stay within SLO, the workload is promoted to full Spot for scale-out. If interruptions spike or application errors exceed SLO, the workload stays at 50 percent Spot or reverts to full On-Demand.
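In boto3 terms, the pilot's ASG configuration takes the shape below. This is a sketch only: the launch template name, the base capacity of 4, and the instance type list are placeholders for whatever the pilot service actually runs.

```python
# Shape of the MixedInstancesPolicy argument to create_auto_scaling_group
# for the Spot pilot described above. Names and sizes are placeholders.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "pilot-service",   # placeholder
            "Version": "$Latest",
        },
        # Four to six types in the same vCPU class deepens the Spot pools.
        "Overrides": [
            {"InstanceType": t}
            for t in ("m6g.large", "m6gd.large", "m7g.large",
                      "m5.large", "m5a.large")
        ],
    },
    "InstancesDistribution": {
        "OnDemandBaseCapacity": 4,                 # steady-state floor
        "OnDemandPercentageAboveBaseCapacity": 0,  # all scale-out on Spot
        "SpotAllocationStrategy": "capacity-optimized-prioritized",
    },
}
```

The two load-bearing values are OnDemandPercentageAboveBaseCapacity=0 (everything above the floor is Spot) and the capacity-optimized-prioritized strategy (deepest pools first, in Overrides order).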

The Organization-Wide Rollout

The rollout pattern is one service per sprint, piloted in the lowest-traffic account first (typically the development OU), then promoted to staging, then production. At the end of a quarter, a $2M/year estate with 40 percent Spot-eligible workload typically reaches 25 to 30 percent Spot coverage and captures an additional $150K to $200K annual savings on top of Savings Plans.

Spot is not for databases — SAP-C02 questions frequently include a distractor answer that suggests "move the primary RDS instance to Spot" or "run the Elasticsearch master node on Spot". These are always wrong. Amazon EC2 Spot is fundamentally incompatible with stateful single-primary databases because a 2-minute interruption is shorter than most database failover windows and guarantees data loss or split-brain. The correct answer on SAP-C02 is always to run stateful database primaries on On-Demand or Reserved Instances, and to use Spot only for stateless compute and worker tiers.

EC2 Generation Migration: Pre-Graviton → Graviton and t2 → t3 → t3a

EC2 generation migration is the second-highest-dollar cost optimization lever on existing workloads after Savings Plans. Every generation transition delivers 10 to 40 percent price-performance improvement for the same workload, and the migration is typically an in-place instance-type change with a single reboot.

The Three Canonical Generation Migrations

t2 → t3 → t3a. The t2 family uses CPU credits that can throttle aggressively under sustained load; t3 runs in unlimited mode by default (sustained usage above baseline incurs a per-vCPU-hour surcharge instead of throttling) and delivers roughly 10 to 20 percent better price-performance. t3a uses AMD EPYC processors and costs 10 percent less than t3 for the same vCPU/memory ratio, with comparable performance for most general-purpose workloads. For cost optimization on an existing 20-account estate, migrating the fleet from t2 to t3a is typically a one-sprint project with savings of 15 to 25 percent on every migrated instance.

m4 → m5 → m6i → m7i-flex / m7g. The m4 family is Intel Haswell/Broadwell; m5 is Intel Skylake/Cascade Lake; m6i is Intel Ice Lake; m7i-flex is Intel Sapphire Rapids at a lower price point for workloads that do not sustain full CPU utilization. Each generation delivers roughly 10 to 15 percent price-performance improvement for the same workload. m7g is the Graviton3 equivalent — 20 to 40 percent better price-performance than m6i for compatible workloads.

Pre-Graviton → Graviton (m5 → m6g → m7g, c5 → c6g → c7g, r5 → r6g → r7g). Graviton is AWS's ARM64-based processor line (Graviton 2, 3, 4). For compatible workloads Graviton delivers 20 to 40 percent price-performance improvement over the same-generation Intel equivalent. Compatibility requires ARM64-compiled binaries for the OS and every native dependency — pure Java, Python, Node.js, Go, and .NET Core workloads usually migrate with no code changes; C/C++ native binaries, specific commercial software, and workloads with x86-only assembly require recompilation or are incompatible.

The Graviton Migration Workflow

The Pro-level Graviton migration workflow for cost optimization on an existing workload is: (1) run a compatibility inventory across the estate using lscpu, dpkg --print-architecture, or container image manifests; (2) classify workloads into three buckets — green (pure interpreted/JVM, migrate in-place), yellow (requires container rebuild or package recompile), red (commercial software without ARM64 support); (3) pilot one green workload per team; (4) rebuild container images as multi-arch manifests with docker buildx; (5) roll out across the green bucket over a quarter; (6) revisit the yellow bucket after the green bucket is complete.
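Step (2) of the workflow, the green/yellow/red bucketing, comes down to two questions per workload. A toy classifier under those assumptions:

```python
def graviton_bucket(has_native_x86_deps: bool, arm64_available: bool) -> str:
    """Bucket a workload for the Graviton migration described above.
    green: pure interpreted/JVM code, migrate in place.
    yellow: native dependencies exist but ARM64 builds are available,
            so a container rebuild or recompile unblocks it.
    red: commercial or native software with no ARM64 support."""
    if not has_native_x86_deps:
        return "green"
    return "yellow" if arm64_available else "red"

print(graviton_bucket(False, True))   # pure Java service
print(graviton_bucket(True, True))    # Python app with compilable C extension
print(graviton_bucket(True, False))   # x86-only commercial agent
```

In practice the two inputs come from the compatibility inventory in step (1): package architecture listings and container image manifests answer the first question, vendor documentation answers the second.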

RDS Graviton migration is a separate workflow — it requires a minor engine version upgrade and a maintenance-window restart. RDS MySQL, PostgreSQL, MariaDB, and Aurora all support Graviton; SQL Server and Oracle on RDS do not.

S3 Lifecycle Retrofit: Storage Lens → Intelligent-Tiering → Lifecycle Rules

Amazon S3 cost optimization on existing workloads is frequently the highest-ROI initiative after Savings Plans and Graviton, because most estates have a long tail of objects that are stored in S3 Standard but have not been accessed in months. The three-stage workflow is: Storage Lens to observe access patterns, S3 Intelligent-Tiering to handle unpredictable access without lifecycle rules, and lifecycle rules for predictable access patterns.

Stage 1 — Amazon S3 Storage Lens Advanced Metrics

Amazon S3 Storage Lens is the only organization-wide S3 observability tool. At the free tier it provides 28 metrics; advanced metrics (paid) unlocks activity metrics (GET/PUT/DELETE counts per bucket per day), prefix-level aggregation, and 15 additional cost-optimization metrics like "percentage of objects not accessed in 30 days", "bytes in Standard that could be moved to Intelligent-Tiering", and "incomplete multipart uploads". For cost optimization on an existing estate with more than 100 TB in S3, Storage Lens advanced metrics pays for itself in the first week.

The canonical Storage Lens finding for a $2M/year estate is: 60 to 80 percent of objects by count have not been accessed in 90 days, but are stored in S3 Standard. The remediation is to adopt S3 Intelligent-Tiering for net-new objects and use lifecycle rules to transition existing objects.

Stage 2 — S3 Intelligent-Tiering Adoption

S3 Intelligent-Tiering is the default storage class for any object whose access pattern is unknown or variable. It automatically moves objects across three low-latency tiers: Frequent Access, Infrequent Access (after 30 consecutive days of no access), and Archive Instant Access (after 90 days of no access). Two further asynchronous tiers, Archive Access and Deep Archive Access, are opt-in via bucket-level configuration. There is no retrieval charge for the Frequent, Infrequent, and Archive Instant tiers; only the two asynchronous archive tiers charge for retrieval.

Intelligent-Tiering has a monitoring and automation fee of $0.0025 per 1,000 objects per month. Objects smaller than 128 KB are excluded from monitoring: they are not charged the fee, but they are also never moved out of the Frequent Access tier, because below that size the fee would exceed any storage saving. For small-object workloads (e.g., thumbnail image pipelines, per-request log files), Intelligent-Tiering therefore achieves nothing, and a lifecycle rule that transitions to S3 Standard-IA after 30 days is better.
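The 128 KB cut-off falls out of simple arithmetic. A sketch, using illustrative us-east-1 list prices (treat the exact rates as assumptions):

```python
# Why the 128 KB threshold exists, in arithmetic. Prices are illustrative
# us-east-1 list rates and should be treated as assumptions.
MONITORING_FEE = 0.0025 / 1000   # $/object-month
STANDARD = 0.023                 # $/GB-month, S3 Standard
INFREQUENT = 0.0125              # $/GB-month, Infrequent Access tier

def monthly_benefit(object_kb: float) -> float:
    """Saving from demotion to Infrequent Access, minus the monitoring fee.
    Negative means the fee would eat the saving."""
    gb = object_kb / (1024 * 1024)
    return gb * (STANDARD - INFREQUENT) - MONITORING_FEE

print(f"{monthly_benefit(4):+.8f}")     # tiny object: fee dominates
print(f"{monthly_benefit(1024):+.8f}")  # 1 MB object: tiering wins
```

For small objects the per-object fee dwarfs the per-GB saving, which is exactly why AWS neither monitors nor tiers them; the deeper archive tiers widen the per-GB gap but do not change the per-object fee.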

Stage 3 — Lifecycle Rules for Predictable Access

Lifecycle rules apply when access patterns are known and deterministic. The three canonical lifecycle patterns for cost optimization on existing workloads are: (1) application logs — S3 Standard until day 30, transition to S3 Standard-IA at day 30, transition to S3 Glacier Instant Retrieval at day 90, expire at day 365 (typical compliance retention); (2) backup snapshots — S3 Standard until day 7, transition to S3 Glacier Deep Archive at day 7, expire at day 2,555 (7 years, common for financial compliance); (3) build artefacts and CI/CD outputs — expire at day 14 with no transitions (no archival value).

One lifecycle rule every existing workload should have is incomplete multipart upload cleanup — an AbortIncompleteMultipartUpload action with DaysAfterInitiation set to 7. A surprisingly large fraction of $2M/year estates carries 5 to 20 TB of orphaned multipart upload parts that are invisible in the S3 console object listing and silently incur Standard storage cost.
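As a sketch, pattern (1) plus the multipart cleanup action can be expressed as a boto3-style lifecycle payload (the bucket name and prefix are hypothetical):

```python
# Lifecycle configuration sketch in the shape expected by
# s3.put_bucket_lifecycle_configuration (bucket/prefix are hypothetical).
log_lifecycle = {
    "Rules": [
        {
            "ID": "logs-retention",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
            ],
            "Expiration": {"Days": 365},
        },
        {
            "ID": "abort-incomplete-mpu",
            "Filter": {},  # applies bucket-wide
            "Status": "Enabled",
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        },
    ]
}

# Applying it requires boto3 and credentials, e.g.:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-log-bucket", LifecycleConfiguration=log_lifecycle)
```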

S3 Intelligent-Tiering object-size threshold — S3 Intelligent-Tiering charges its $0.0025 per 1,000 objects per month monitoring fee only on objects of 128 KB and larger; objects below that threshold are excluded from monitoring, are always billed at Frequent Access rates, and are never demoted. On SAP-C02, a question that asks "which storage class minimises cost for a workload of 10 billion 4 KB IoT sensor readings" will not have Intelligent-Tiering as the correct answer — the right answer is usually S3 Standard with a lifecycle rule, or aggregation of the small objects into larger Parquet files before upload.

EBS gp2 → gp3 Bulk Migration

The Amazon EBS gp2 → gp3 migration is the single highest-ROI quick-win on an existing workload. gp3 is structurally cheaper than gp2 (20 percent lower storage cost per GB in most regions), delivers a guaranteed 3000 IOPS and 125 MB/s throughput baseline independent of volume size (gp2 scales IOPS linearly with size), and supports in-place type modification via ModifyVolume with no downtime — the instance never reboots and I/O continues throughout.

For cost optimization on an existing 20-account workload with, say, 200 TB of gp2 volumes, the gp2 → gp3 migration delivers $40K to $60K annual savings depending on region. The migration script iterates every volume via aws ec2 describe-volumes --filters Name=volume-type,Values=gp2, runs aws ec2 modify-volume --volume-type gp3, and moves on — at the AWS API rate limit the fleet-wide migration of 200 TB typically completes in 24 to 48 hours.

The only gp3 gotcha is for volumes larger than 1 TB where the gp2 IOPS exceeded 3000 — in that case, the gp3 conversion must also specify a higher --iops value to preserve the workload's performance. For the vast majority of volumes (under 1 TB, running at default gp2 IOPS), the default gp3 3000 IOPS / 125 MB/s envelope is equal or better than the gp2 baseline.
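A hedged sketch of the fleet savings estimate and the per-volume conversion loop, assuming approximate us-east-1 gp2/gp3 rates:

```python
# gp2 -> gp3 fleet savings estimate. Per-GB rates are assumptions
# (approximate us-east-1 list prices).
GP2_PER_GB = 0.10
GP3_PER_GB = 0.08

def annual_gp3_saving(total_gp2_tb):
    """Annual dollar saving from converting a gp2 fleet to gp3."""
    gb = total_gp2_tb * 1000
    return gb * (GP2_PER_GB - GP3_PER_GB) * 12

print(round(annual_gp3_saving(200)))  # 200 TB fleet -> 48000

# The migration is one ModifyVolume call per volume (no detach, no reboot).
# A sketch of the loop — requires boto3 and credentials, not run here:
def migrate_gp2_volumes():
    import boto3
    ec2 = boto3.client("ec2")
    pages = ec2.get_paginator("describe_volumes").paginate(
        Filters=[{"Name": "volume-type", "Values": ["gp2"]}])
    for page in pages:
        for vol in page["Volumes"]:
            # Volumes over 1000 GiB had >3000 gp2 IOPS baseline; those
            # conversions should also pass Iops=... to preserve performance.
            ec2.modify_volume(VolumeId=vol["VolumeId"], VolumeType="gp3")
```

At these assumed rates, 200 TB of gp2 saves $4K/month, consistent with the $40K–$60K annual range above.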

Unused Resource Cleanup: The First Week's Quick Wins

Unused resource cleanup is the week-one dopamine hit for any cost optimization programme on an existing workload. The six canonical unused-resource categories deliver immediate, zero-risk savings and build the political momentum needed for the harder initiatives later.

Unattached Elastic IP addresses. AWS charges $0.005/hour (~$3.60/month) for every Elastic IP that is allocated but not associated with a running resource. A 20-account estate typically has 20 to 80 unattached EIPs. Remediation: aws ec2 describe-addresses filtered for AssociationId=None, release via aws ec2 release-address.

Old Amazon Machine Images (AMIs) and snapshots. Every AMI references one or more EBS snapshots that are billed until the AMI is deregistered and the snapshots deleted. A 20-account estate accumulates 100 to 500 AMIs over its lifetime, many from decommissioned projects. Remediation: deregister AMIs older than 90 days that are not referenced by any launch template or Auto Scaling group, then delete the underlying snapshots.

Stopped EC2 instances over 30 days. A stopped instance incurs no compute charge, but its attached EBS volumes continue to bill at the full rate. A stopped m5.xlarge with a 500 GB gp2 root volume costs $50/month in EBS with zero business value. Remediation: identify stopped instances over 30 days old, confirm with owner, terminate (which releases the EBS) or snapshot-and-terminate.

Detached EBS volumes. Volumes that were detached from a terminated instance but never deleted. Trusted Advisor flags these directly. Remediation: snapshot any volume with business value, delete the volume.

Idle Classic, Application, and Network Load Balancers. Load balancers with zero RequestCount over 7 days typically indicate a decommissioned service. ALBs cost ~$16/month plus LCU charges; NLBs similar. Remediation: confirm with owner and delete.

Unused RDS instances. RDS instances with zero DatabaseConnections over 7 days and CPU under 5 percent are candidates for either deletion or stopping. RDS instances can be stopped for up to 7 days at a time (no compute charge, storage only); longer requires automation that re-stops on schedule.

A first-week cleanup on a $2M/year 20-account estate typically reclaims $30K to $80K annualised with a total engineering effort of two to three person-days.
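As an illustration of how mechanical the sweep is, the unattached-EIP check reduces to a filter over describe-addresses output (the sample payload below is hypothetical but mimics the API's response shape):

```python
# Unattached Elastic IPs: any address in describe-addresses output that
# lacks an AssociationId is allocated but not attached to anything.
def unattached_eips(addresses):
    return [a["AllocationId"] for a in addresses if "AssociationId" not in a]

# Hypothetical sample mimicking `aws ec2 describe-addresses` output:
sample = [
    {"AllocationId": "eipalloc-aaa", "AssociationId": "eipassoc-1"},  # in use
    {"AllocationId": "eipalloc-bbb"},                                 # idle
]
print(unattached_eips(sample))  # ['eipalloc-bbb']
# Each idle allocation is then released with ec2.release_address(AllocationId=...)
```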

Data Transfer Reduction Audit: NAT, VPC Endpoints, CloudFront

Data transfer is the invisible cost line. A CUR query that filters for product_product_name = 'AWS Data Transfer' or line_item_operation LIKE '%NatGateway%' often reveals that data transfer is the second- or third-largest cost line after EC2 and RDS.
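A hedged example of that CUR query as it might run in Athena — the `cur.cur_table` name is a placeholder for your Glue table, and the columns follow the standard CUR schema:

```python
# Athena query over the CUR table to surface data-transfer and NAT spend
# per account. `cur.cur_table` is a placeholder; columns are standard CUR.
DATA_TRANSFER_QUERY = """
SELECT line_item_usage_account_id,
       line_item_operation,
       SUM(line_item_unblended_cost) AS cost
FROM cur.cur_table
WHERE product_product_name = 'AWS Data Transfer'
   OR line_item_operation LIKE '%NatGateway%'
GROUP BY 1, 2
ORDER BY cost DESC
"""
```

Sorting descending by cost makes the NAT data-processing line items jump out immediately when they are the second- or third-largest spend line.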

NAT Gateway Cost Structure

AWS NAT Gateway has two cost dimensions: hourly ($0.045/hour in us-east-1, ~$33/month per NAT) and data processing ($0.045/GB). For a workload that routes 20 TB/month of traffic through a single NAT (e.g., pulls from Amazon ECR, package mirrors, Amazon S3 in a different region), the data-processing charge is $900/month per NAT — often 30 times the hourly charge. Multiply by 3 AZs in each of 15 VPCs and NAT Gateway can reach $40K to $50K/month on a $2M/year estate.
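Those two dimensions compose into a simple monthly cost model (rates assumed, approximate us-east-1 list prices):

```python
# NAT Gateway monthly cost model. Rates are assumptions
# (approximate us-east-1 list prices).
NAT_HOURLY = 0.045      # $/hour per NAT
NAT_PER_GB = 0.045      # $/GB data processing
HOURS_PER_MONTH = 730

def nat_monthly_cost(tb_processed, nat_count=1):
    """Hourly charge plus data-processing charge for a month."""
    hourly = NAT_HOURLY * HOURS_PER_MONTH * nat_count
    data = NAT_PER_GB * tb_processed * 1000
    return hourly + data

# One NAT pushing 20 TB/month: ~$33 hourly + $900 data processing.
print(round(nat_monthly_cost(20)))  # 933
```

The data-processing term dominating the hourly term is what makes traffic reduction (endpoints, CloudFront) a far bigger lever than NAT count.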

The Three-Step NAT Reduction Plan

Step 1 — Add VPC Gateway Endpoints for Amazon S3 and Amazon DynamoDB. Gateway endpoints are free (no hourly charge, no data processing charge) and route S3/DynamoDB traffic directly through the VPC route table without touching the NAT. On a workload that pulls 5 TB/month of build artefacts or ML training data from S3, a gateway endpoint saves $225/month per VPC.

Step 2 — Add VPC Interface Endpoints (AWS PrivateLink) for heavy AWS services. Interface endpoints for Amazon ECR, Amazon CloudWatch Logs, AWS Systems Manager, Amazon SNS, Amazon SQS, AWS KMS, and AWS Secrets Manager eliminate NAT traffic to those service APIs. Interface endpoints cost $0.01/hour per AZ per endpoint plus $0.01/GB data processing — cheaper per GB than NAT ($0.01 vs $0.045), with the fixed hourly cost repaid once the endpoint carries roughly 200 GB/month per AZ at these rates.
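The endpoint-vs-NAT break-even works out as follows under assumed list prices (approximate us-east-1 rates):

```python
# Break-even traffic volume for an interface endpoint vs routing the same
# traffic through NAT. Rates are assumptions (approximate us-east-1 prices).
ENDPOINT_HOURLY_PER_AZ = 0.01
ENDPOINT_PER_GB = 0.01
NAT_PER_GB = 0.045
HOURS_PER_MONTH = 730

def breakeven_gb_per_month(az_count):
    """Monthly GB at which the endpoint's fixed hourly cost is repaid by
    the per-GB saving over NAT data processing."""
    fixed = ENDPOINT_HOURLY_PER_AZ * HOURS_PER_MONTH * az_count
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)

print(round(breakeven_gb_per_month(1)))  # ~209 GB/month for a single-AZ endpoint
print(round(breakeven_gb_per_month(3)))  # ~626 GB/month for a three-AZ endpoint
```

For high-traffic services like ECR and CloudWatch Logs this threshold is cleared almost immediately; for a service touched a few times a day, the endpoint's hourly cost may never pay back.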

Step 3 — Adopt Amazon CloudFront for egress. For any workload that serves content to the public internet, CloudFront egress from an edge location is cheaper than direct egress from a region (and performance is better). Also, data transfer from S3 or an ALB to CloudFront is $0.00/GB (free). Moving a static-asset workload behind CloudFront eliminates the regional egress charge entirely.

NAT Topology: Shared vs Per-AZ

A common cost optimization debate is whether to run one NAT per AZ (fault-tolerant, expensive) or one shared NAT in one AZ (cheaper, introduces cross-AZ data-transfer charges). The Pro-level answer is always per-AZ NAT in production because the cross-AZ data transfer charge ($0.01/GB each way, so $0.02/GB round-trip) exceeds the NAT hourly charge difference at any non-trivial traffic volume, and the single-AZ NAT introduces a reliability SPOF that eliminates any cost saving during an AZ event.

NAT topology on SAP-C02 — Any SAP-C02 answer that proposes "consolidate 3 NAT Gateways into 1 to save cost" in a production VPC is almost always wrong. The correct production topology is one NAT Gateway per AZ, with subnet route tables pointing each AZ's private subnets to the local NAT. The preferred cost optimization lever is to reduce traffic through NAT via VPC Gateway Endpoints (S3, DynamoDB) and VPC Interface Endpoints (ECR, Logs, SSM, SNS, SQS, KMS, Secrets Manager), not to reduce NAT count.
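A quick model of why consolidation backfires, assuming traffic spread evenly across three AZs and the listed rates (approximate us-east-1 prices):

```python
# Cost delta from consolidating 3 per-AZ NATs into 1 shared NAT.
# Rates are assumptions (approximate us-east-1 list prices).
NAT_HOURLY = 0.045
CROSS_AZ_PER_GB = 0.02  # $0.01/GB in + $0.01/GB out
HOURS_PER_MONTH = 730

def consolidation_delta(tb_per_month):
    """Monthly cost change from 3 NATs to 1. Positive = consolidation costs
    MORE. Assumes even AZ spread, so 2/3 of traffic now crosses an AZ
    boundary to reach the shared NAT."""
    hourly_saved = 2 * NAT_HOURLY * HOURS_PER_MONTH
    cross_az_added = (2 / 3) * tb_per_month * 1000 * CROSS_AZ_PER_GB
    return cross_az_added - hourly_saved

# At 20 TB/month, consolidation costs ~$200/month MORE — before counting
# the reliability cost of a single-AZ SPOF.
print(round(consolidation_delta(20)))  # 201
```

The hourly saving caps out at about $66/month, while the cross-AZ charge grows linearly with traffic, so consolidation only "wins" at traffic volumes too small to matter.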

Reserved Capacity Right-Sizing: RI Exchange and Savings Plans Portfolio Balancing

Once an existing workload has Savings Plans coverage and has migrated to newer instance generations, the commitment portfolio itself needs rebalancing. Two mechanisms matter: Convertible Reserved Instance Exchange, and Savings Plans Portfolio composition.

Convertible Reserved Instance Exchange

Convertible RIs (purchased before Savings Plans existed, or for workloads where an RI is still preferable) can be exchanged at any time for one or more Convertible RIs of a different configuration, as long as the new RIs are of equal or greater value — AWS trues up any difference, and the exchange starts a new term of the same length. This is the correct remediation when a workload migrates from m5 to m6g — the original m5 Convertible RI is exchanged for an equivalent m6g Convertible RI, preserving the discount commitment across the migration.

Standard RIs (non-convertible) cannot be exchanged — only sold on the RI Marketplace, which is region-restricted and requires a listing process that can take weeks to clear.

Savings Plans Portfolio Composition

A mature cost optimization portfolio on an existing $2M/year workload typically has a three-layer commitment stack: a 3-year All Upfront Compute Savings Plan for the base 50 percent of compute spend (maximum discount, highest lock-in); a 1-year No Upfront Compute Savings Plan for the mid 20 percent (medium discount, medium lock-in, re-evaluated each year); On-Demand and Spot for the top 30 percent (zero lock-in, captures growth and volatility). Over time, as the base stabilises, a portion of the mid-layer graduates to the base; as new workloads come online, a portion of the top migrates to the mid.

The rebalancing cadence is quarterly — the FinOps team reviews coverage, utilization, forecasted spend, and any planned architecture changes, then purchases incremental commitments to refill the base and mid layers. No single quarterly purchase exceeds 20 percent of total annual spend, so any misforecast is bounded.
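The three-layer split above can be sketched as a sizing helper (the percentages are the article's 50/20/30; the spend figure is hypothetical):

```python
# Three-layer commitment stack sizing, per the 50/20/30 portfolio above.
def layer_commitments(annual_compute_spend):
    """Split annual compute spend into base / mid / flex layers."""
    return {
        "3yr_all_upfront_csp": annual_compute_spend * 0.50,  # base: max discount
        "1yr_no_upfront_csp": annual_compute_spend * 0.20,   # mid: re-evaluated yearly
        "on_demand_and_spot": annual_compute_spend * 0.30,   # flex: growth/volatility
    }

print(layer_commitments(1_000_000))
```

Combined with the quarterly cap (no single purchase over 20 percent of annual spend), this keeps any misforecast bounded to one layer of one quarter.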

Remediation Sequence: The 12-Week Rollout Plan for a $2M/Year 20-Account Estate

The Pro-level remediation sequence for cost optimization on an existing workload balances technical dependencies (you cannot buy a Savings Plan before you have baseline data) with political realities (you cannot survive three consecutive weeks without a visible win).

Weeks 1–2 — Observability. Enable AWS Cost and Usage Report delivery to a central S3 bucket. Enable AWS Compute Optimizer at the Organizations level. Enable AWS Trusted Advisor at Business or Enterprise Support. Enable Amazon S3 Storage Lens advanced metrics. Deploy the CloudWatch agent via AWS Systems Manager State Manager to every EC2 instance. Deliverable: raw data is flowing, but no optimisation has happened yet. This phase produces no visible savings but is a hard dependency for everything else.

Weeks 3–4 — Dashboards and Waste Hunt. Build the three QuickSight dashboards (Tag Coverage, Service Cost by Team, Waste Hunt). Run the first waste-hunt standup. Create tickets for the top 20 unused-resource findings (unattached EIPs, old AMIs, stopped instances, detached volumes, idle LBs). Deliverable: $10K to $20K in first-wave cleanup savings, and the first data-backed cost conversation with each team.

Weeks 5–6 — Quick-Win Migrations. Execute the fleet-wide EBS gp2 → gp3 migration. Execute the t2 → t3a migration. Both are low-risk and easily reversed — the gp3 conversion is online with no downtime, and the t2 → t3a change needs only a stop, instance-type modification, and start. Deliverable: $5K to $15K/month recurring savings, negligible production impact, high team buy-in.

Weeks 7–8 — Savings Plans Round 1. Generate Savings Plans purchase recommendations in Cost Explorer with a 30-day lookback. Purchase a 1-year No Upfront Compute Savings Plan sized at 60 to 70 percent of current baseline (intentionally conservative). Deliverable: largest single-line savings of the programme, $15K to $25K/month.

Weeks 9–10 — S3 Lifecycle and NAT Reduction. Make S3 Intelligent-Tiering the default for new objects — set the storage class at upload in application and SDK configuration, or add a zero-day lifecycle transition to Intelligent-Tiering (S3 has no bucket-level default storage class setting). Apply lifecycle rules for logs, backups, and incomplete multipart uploads. Add VPC Gateway Endpoints for S3 and DynamoDB to every production VPC. Add VPC Interface Endpoints for ECR, Logs, and SSM in the two highest-traffic VPCs. Deliverable: $5K to $15K/month recurring savings.

Weeks 11–12 — Graviton Pilot and Spot Pilot. Identify 3 to 5 Graviton-compatible workloads per team and pilot migration. Identify 2 to 4 stateless worker tiers and pilot Spot via mixed-instances ASG. Deliverable: validated migration playbooks, $2K to $5K/month pilot savings, ready for quarter-two rollout.

At the end of 12 weeks a typical $2M/year 20-account estate has reduced run-rate by 18 to 25 percent ($360K to $500K annualised), has built a weekly cadence that will continue to find 2 to 4 percent additional savings per quarter, and has kept every team owner on side because every remediation came with data, a named owner, and a two-week deadline rather than a surprise.

The 12-week cost optimization sequence — The SAP-C02 Pro-level sequence for cost optimization on existing workloads is: Observability (weeks 1–2) → Waste Hunt (weeks 3–4) → Quick-Win Migrations (weeks 5–6) → Savings Plans (weeks 7–8) → Storage + Network (weeks 9–10) → Graviton + Spot Pilots (weeks 11–12). This ordering is not arbitrary — each phase produces the data or political capital required for the next phase. Memorise the sequence; exam questions routinely ask "which of the following should the architect do first/next?" and the answer always follows this dependency order.

Scenario: Consolidate $2M/Year AWS Spend Across 20 Accounts Without Team Backlash

Put the full picture together with the canonical SAP-C02 scenario. A company runs 20 AWS accounts organised under a single AWS Organization with a shared-services account and a billing account. Annual AWS spend is $2M. The CFO has asked the solutions architect to reduce spend by 20 percent within six months without reducing service quality, and the VP of Engineering has explicitly warned that previous central cost-cutting initiatives have damaged team morale and produced shallow, one-time savings rather than a lasting culture shift.

Month 1 — Observability and quick wins. The architect enables AWS Cost and Usage Report, AWS Compute Optimizer, AWS Trusted Advisor Enterprise, and Amazon S3 Storage Lens advanced metrics. The architect builds the three QuickSight dashboards and runs the first waste-hunt standup. The standup produces 42 tickets across 20 accounts. By month-end, 35 of the 42 are closed, unused-resource cleanup has reclaimed $28K/month, and every team has seen its own spend broken down by service for the first time.

Month 2 — Fleet migrations. The architect executes gp2 → gp3 across 180 TB of EBS (saving $4K/month) and t2 → t3a across 400 instances (saving $3K/month). Both migrations are zero-downtime and communicated to each team one week in advance with a rollback-if-issues policy. No rollbacks are requested.

Month 3 — Savings Plans and S3 lifecycle. Based on 60 days of observed baseline, the architect purchases a 1-year No Upfront Compute Savings Plan at 65 percent of baseline, saving $22K/month. S3 Intelligent-Tiering becomes the default for new objects via a zero-day lifecycle transition. Lifecycle rules are rolled out for log buckets and backup buckets, saving $6K/month.

Month 4 — NAT reduction. The architect adds VPC Gateway Endpoints for S3 and DynamoDB to all 15 production VPCs (saving $5K/month on data processing). VPC Interface Endpoints for ECR, Logs, and SSM are added to the three VPCs with highest NAT data volume (saving an additional $4K/month).

Month 5 — Graviton pilot. Three Java microservices and two Python ML serving tiers pilot a migration from m5 to m6g. Performance is equal or better; cost falls 20 percent. The playbook is published and 12 additional services commit to migrating in month 6.

Month 6 — Spot pilot and portfolio review. Two batch ETL pipelines and one CI/CD runner fleet migrate to Spot via mixed-instances ASGs. Interruption rate is 3 percent; SLO is maintained. A second tranche of Savings Plans is purchased covering the net-new stable workload identified over the five months.

End-of-half-year result. Run-rate has fallen from $2M/year to $1.56M/year — a 22 percent reduction. Every team owner has been engaged weekly. No team has been publicly name-shamed. Every optimization has a documented owner, ticket, and outcome. The FinOps dashboards have become the default Monday-morning starting page for every engineering director. The programme has produced both the absolute savings and the organisational capacity to continue saving — which is the real answer the SAP-C02 exam is testing for.

Common Traps on SAP-C02 for Cost Optimization on Existing Workloads

The five highest-frequency SAP-C02 traps on this topic:

Trap 1 — Buying Savings Plans before running a coverage report. The correct first action is always an audit. Any answer that purchases a commitment in week 1 is wrong.

Trap 2 — Proposing Spot for a stateful workload. Spot is only for stateless, horizontally scalable, re-runnable workloads. Spot for RDS, Elasticsearch master, or stateful singletons is always wrong.

Trap 3 — Picking S3 Intelligent-Tiering for sub-128 KB objects. Objects below 128 KB are never monitored or tiered down, so there is no saving to collect. The correct answer is aggregation into larger objects; note that S3 Standard-IA is usually a trap here too, because it bills a 128 KB minimum per object.

Trap 4 — Proposing a single NAT Gateway to save cost. Cross-AZ data transfer exceeds the NAT savings at any non-trivial volume. The correct answer is per-AZ NATs with VPC endpoints to reduce traffic.

Trap 5 — Suggesting "cancel and rebuy" a Savings Plan. Savings Plans are irrevocable. The correct answer is to drive additional eligible workload to consume the commitment, or to accept the over-commit as the cost of a safety margin.

vs AWS Cost Anomaly Detection and AWS Budgets

AWS Cost Anomaly Detection uses machine learning to identify unusual spend patterns at account, service, or tag level, and alerts via SNS/email. It is a detection tool, not an optimization tool — it surfaces anomalies after they occur. AWS Budgets is a threshold tool — alerts fire when actual or forecasted spend crosses a configured threshold. Neither replaces the audit-and-remediation workflow described above; both are complementary guardrails that belong in the observability layer alongside CUR + QuickSight.

Exam Signal: What to Memorise for SAP-C02

On SAP-C02, cost optimization on existing workloads appears in 8 to 12 questions per exam attempt. The specific signals the exam tests for: the five-step audit pipeline (CUR → S3 → Glue → Athena → QuickSight); the Compute Optimizer memory blind spot (CloudWatch agent required); the Savings Plans irrevocability rule; the Spot eligibility rule (stateless, horizontal, re-runnable only); the S3 Intelligent-Tiering 128 KB threshold; the NAT per-AZ rule with VPC endpoints; the Convertible vs Standard RI exchange rule; and the sequencing rule (observe → segment → pilot → scale → enforce). Memorise these eight signals and you will recognise the correct answer on most Task 3.5 questions within two read-throughs.

FAQ: Cost Optimization for Existing AWS Workloads

Q1. For a 20-account AWS Organization where cost allocation tags have only 40 percent coverage, what is the fastest path to chargeback-ready data?

The correct sequence is: (1) enable the AWS Cost and Usage Report with resource IDs and all user-defined tags; (2) build the QuickSight Tag Coverage Dashboard by joining CUR with a tag-requirement list; (3) apply AWS Organizations tag policies to define the allowed tag keys and value formats and surface non-compliance in AWS Config; (4) enforce tagging at resource-creation time via Service Control Policies using the aws:RequestTag and aws:TagKeys condition keys; (5) retroactively tag existing resources via AWS Resource Groups Tagging API and AWS Config auto-remediation. Tag policies alone do not enforce tags — they only report compliance; the SCP plus Config Auto Remediation combination is what actually lifts coverage from 40 percent to 95 percent.

Q2. A CUR query shows NAT Gateway data processing is $50K/month across 15 VPCs. Which remediation has the highest ROI?

The highest-ROI remediation is adding VPC Gateway Endpoints for Amazon S3 and Amazon DynamoDB to every VPC — gateway endpoints are free (no hourly charge, no per-GB charge), and for a typical workload 40 to 70 percent of NAT data-processing volume is S3 or DynamoDB traffic. Step two is adding VPC Interface Endpoints for Amazon ECR, Amazon CloudWatch Logs, AWS Systems Manager, and AWS KMS — these cost $0.01/hour/AZ plus $0.01/GB but replace the NAT's $0.045/GB charge, paying back once an endpoint carries roughly 200 GB/month per AZ. Step three is moving public-facing egress to Amazon CloudFront. Do not reduce NAT Gateway count in production — cross-AZ data transfer costs exceed NAT savings.

Q3. 200 EC2 instances run at 8 percent average CPU for 24/7. What is the correct cost optimization sequence?

The correct sequence is: (1) confirm Compute Optimizer has 14 days of metrics and the CloudWatch agent is installed for memory visibility; (2) rightsize each instance to the Compute Optimizer recommendation (typically 40 to 60 percent smaller); (3) evaluate whether the workload is schedule-dependent — if the 24/7 runtime is not a business requirement, add Instance Scheduler or AWS Systems Manager automation to stop instances nights and weekends (saves 65 to 75 percent of runtime); (4) only after rightsizing and scheduling, purchase a Savings Plan sized at the post-optimization baseline. Buying a Savings Plan first locks in the over-provisioned baseline and wastes the commitment.
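The scheduling saving in step (3) is straightforward arithmetic — a sketch, assuming a nights-and-weekends business-hours schedule:

```python
# Fraction of 24/7 instance-hours removed by a business-hours schedule.
def schedule_saving_fraction(hours_per_day=12, days_per_week=5):
    """Share of a 168-hour week eliminated by only running business hours."""
    running = hours_per_day * days_per_week
    return 1 - running / (24 * 7)

# A 12-hours-a-day, 5-days-a-week schedule cuts ~64% of instance-hours,
# consistent with the 65-75% range for tighter schedules.
print(f"{schedule_saving_fraction():.0%}")
```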

Q4. A team refuses a Compute Optimizer recommendation to downsize from r5.4xlarge to m6g.large because their application is memory-intensive. How should the architect respond?

The most likely root cause is that the CloudWatch agent is not installed on the instance, so Compute Optimizer sees only CPU and network metrics and is recommending on that basis. The correct response is: (1) deploy the CloudWatch agent via AWS Systems Manager State Manager to publish mem_used_percent; (2) wait 14 days for Compute Optimizer to ingest memory data; (3) re-evaluate the recommendation. If Compute Optimizer still recommends the downgrade after memory data is ingested, the team's concern is unfounded. If the recommendation changes (e.g., to r6g.xlarge), the original recommendation was based on incomplete data. Never force a rightsizing change over a team's objection — collect the data and let the data decide.

Q5. Over-commitment on a 3-year Savings Plan is discovered six months into the term. The coverage is 100 percent but utilization is only 70 percent. What are the available remediations?

AWS Savings Plans are irrevocable and cannot be cancelled, refunded, or downsized. The available remediations are: (1) drive additional eligible workload into the commitment's scope — migrate existing On-Demand EC2, Fargate, or Lambda spend into the consolidated billing family so the commitment is consumed; (2) pause new Savings Plans purchases until the existing commitment is fully utilised; (3) post-mortem the forecast — the over-commit is almost always a workload decommissioning, a migration to a region not covered by an EC2 Instance Savings Plan, or a Graviton migration that broke a family-locked EC2 Instance Savings Plan. Going forward, cap new commitments at 70 to 80 percent of observed baseline rather than 100 percent, and prefer Compute Savings Plans (region- and family-flexible) over EC2 Instance Savings Plans (family-locked) on heterogeneous estates.

Q6. For a workload with 2 billion objects averaging 8 KB each in Amazon S3 Standard, is S3 Intelligent-Tiering the correct cost optimization?

No. Objects smaller than 128 KB are excluded from Intelligent-Tiering's automatic tiering — they are never moved out of the Frequent Access tier, so they see no saving regardless of access pattern. (Even if they were eligible, the $0.0025 per 1,000 objects per month monitoring fee would come to $5,000/month for 2 billion objects.) The correct optimizations are: (1) aggregate small objects into larger objects — 1 MB Parquet files instead of individual 8 KB JSON records — which also improves downstream Amazon Athena query performance; (2) keep objects that must stay individually addressable in S3 Standard, since S3 Standard-IA bills a 128 KB minimum per object and would cost more for 8 KB objects; (3) consider Amazon DynamoDB or Amazon S3 Express One Zone for small-object hot workloads that do not fit the S3 Standard cost model.

Q7. Which AWS service is the correct tool to view organization-wide idle-resource findings across 20 linked accounts?

AWS Trusted Advisor at Business or Enterprise Support is the breadth tool — aggregated organization-wide findings across all linked accounts are surfaced to the management account or the delegated administrator. AWS Compute Optimizer at the organization opt-in level provides EC2/ASG/EBS/Lambda/Fargate rightsizing recommendations across linked accounts. AWS Cost Explorer rightsizing recommendations are per-account and are a subset of Compute Optimizer data. For idle-resource findings specifically (unattached EIPs, idle LBs, underutilized EC2, detached volumes), the primary tool is Trusted Advisor; for rightsizing and generation-migration recommendations the primary tool is Compute Optimizer. Both feed into the Waste Hunt QuickSight dashboard for weekly cadence.

Official sources