Amazon CloudWatch Logs is the operational evidence locker of every AWS workload, and on SOA-C02 it is the second pillar of Domain 1 alongside metrics and alarms. Where metrics tell you that something is wrong, logs tell you what went wrong — the stack trace, the failed query, the rejected packet, the denied IAM call. Domain 1 (Monitoring, Logging, and Remediation) is worth 20 percent of the exam, and Task Statement 1.1 explicitly requires you to "identify, collect, analyze, and export logs (for example, Amazon CloudWatch Logs, CloudWatch Logs Insights, AWS CloudTrail logs)" and to "create metric filters". Where SAA-C03 might ask "which AWS service collects application logs?", SOA-C02 asks "the application log group exists, but the new EC2 instances are not pushing logs — list every diagnostic step", or "write a Logs Insights query that returns the top ten error messages by count over the last six hours".
This guide walks through CloudWatch Logs from the SysOps angle: how the log group / log stream model works, why retention defaults to "Never Expire" (and what that costs), how metric filters convert text patterns into alarm-able numbers, how subscription filters stream logs in real time to Lambda or Firehose for SIEM ingestion, the full Logs Insights query language with the operational queries you will actually run during incidents, the CloudWatch agent's log collection configuration including multiline handling, S3 export for long-term archival, cross-account log aggregation via resource policies and destinations, and the recurring SOA-C02 troubleshooting scenarios — agent-not-publishing, IAM permission gaps, and the trap that metric filters do not retroactively scan history.
Why CloudWatch Logs Sits at the Heart of SOA-C02 Domain 1.1
The official SOA-C02 Exam Guide v2.3 explicitly names CloudWatch Logs and CloudWatch Logs Insights in the first skill bullet under Task Statement 1.1, then names metric filters in the fourth bullet, and CloudWatch agent log collection in the second bullet. Three of six skills under TS 1.1 are CloudWatch Logs work. Add Task Statement 1.2 (remediation based on logs and metrics) and the dependency becomes clear: every alarm-driven remediation in Domain 1.2 either reads from a metric filter on a log group or watches the output of a Logs Insights scheduled query. CloudWatch Logs is also the retention and analysis layer that CloudTrail (Domain 1.1) and VPC Flow Logs (Domain 5.3) deliver into.
At the SysOps tier the framing is operational, not architectural. SAA-C03 asks "which AWS service should the architect use to centralize application logs?" SOA-C02 asks "logs from 50 EC2 instances are not appearing in the expected log group — what diagnostic steps does the SysOps engineer run, and in what order?" The answer is rarely "use a different service"; it is the CloudWatch agent IAM role, the agent config file path, the logs:CreateLogStream permission, the VPC endpoint for logs.region.amazonaws.com, or the multiline log group regex. CloudWatch Logs is the topic where every later operational story plugs in: CloudTrail forwarding to CloudWatch Logs (Domain 1.1), Lambda function logs landing in /aws/lambda/<name> (Domain 3), VPC Flow Logs writing to a log group (Domain 5.3), AWS WAF web ACL logs going to Firehose then optionally CloudWatch Logs, and AWS Config configuration item delivery to a log group as one delivery channel option.
- Log event: a single record ingested into CloudWatch Logs — a timestamp plus a UTF-8 message of up to 256 KB. One log event is what an application call to logger.info(...) produces.
- Log stream: an ordered, append-only sequence of log events from a single source — typically one log stream per EC2 instance, per Lambda invocation container, per ECS task, or per VPC ENI for Flow Logs.
- Log group: a container of log streams that share a retention policy, KMS encryption setting, metric filters, subscription filters, and access controls. Log groups are the unit of operational management.
- Retention policy: how long log events are kept. Values from 1 day to 10 years, plus Never Expire (the default — the most-tested gotcha on SOA-C02).
- Metric filter: a pattern definition attached to a log group that extracts numeric values or counts from matching log events and publishes them as a CloudWatch metric for alarming.
- Subscription filter: a pattern definition that streams matching log events in real time to Kinesis Data Streams, Kinesis Data Firehose, or Lambda (cross-account delivery is possible via a CloudWatch Logs destination, a wrapper around a Kinesis stream in the receiving account).
- Logs Insights: an interactive, purpose-built query language for CloudWatch log groups, with fields, filter, stats, parse, sort, limit, dedup, and display commands. Up to 50 log groups per query, default time range 24 hours.
- Embedded Metric Format (EMF): a JSON schema for log events that lets the agent / Lambda emit logs and metrics in a single write.
- Log group resource policy: an IAM-style policy attached directly to a log group enabling cross-account writes (used by VPC Flow Logs, Route 53 query logging, CloudFront, etc.).
- Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogsConcepts.html
CloudWatch Logs and Logs Insights in Plain Language
CloudWatch Logs jargon stacks fast. Three analogies make the constructs stick before any AWS console screen does.
Analogy 1: The Library of Operational Evidence
CloudWatch Logs is a public library of operational evidence. Log groups are the bookshelves organized by topic — one for the web tier (/var/log/nginx/access.log), one for the application (/aws/lambda/checkout), one for VPC Flow Logs, one for CloudTrail. Log streams are the individual books on each shelf, one book per "author" — one per EC2 instance, one per Lambda execution environment, one per ECS task. Log events are the individual sentences inside each book — append-only, never edited. Retention policy is the library's discard rule: this shelf keeps books for 30 days, that shelf keeps them for 7 years, the rare-books section (Never Expire) keeps them forever and the storage cost compounds. Metric filters are the librarian's tally chart taped to a shelf — every time a book contains the word ERROR, the librarian adds a tick mark to the chart; the chart itself is a number you can graph and alarm on. Subscription filters are the photocopier rigged to copy any matching page as it is written, faxing the copy to a SIEM or a Lambda for real-time analysis. Logs Insights is the research-desk query terminal where a SysOps detective types fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(5m) and gets a chart back in seconds.
Analogy 2: The Restaurant Kitchen Order Tickets
Imagine a busy restaurant where every order ticket is a log event. The kitchen printer (Lambda runtime) prints a new ticket whenever the cashier rings up an order — that is the CreateLogStream call. All tonight's tickets pile up in one dinner-service spike (log stream). The bin of all tickets for "Fri 18:00–22:00 service" is a log group. The chef wants to count how many tickets contained "no onions" — that is a metric filter with the pattern "no onions" and the metric name NoOnionsRequests. The owner wants every "VIP" ticket to be photocopied immediately to the manager's desk — that is a subscription filter routing matching events to a Lambda function (the manager). At end of night, the tickets are archived to a binder (S3 export) for tax records, and the kitchen prints fresh paper for tomorrow. The library and restaurant analogies both insist on the same point: the log group is the unit of policy (retention, encryption, who-can-read, what-pattern-counts), and the log stream is just an arrival channel within it.
Analogy 3: The Postal Service With Search Index
Treat CloudWatch Logs as a postal service plus search index. Every letter (log event) is stamped with a timestamp at the post office (ingestion endpoint) and dropped into a PO box (log stream) addressed to your account. PO boxes are grouped into buildings (log groups) that share a lease term (retention policy) and a security guard (KMS key). The post office's automatic forwarding service (subscription filter) photocopies any letter mentioning a keyword and reroutes the copy to a partner address (Firehose → S3, Lambda, OpenSearch). Logs Insights is the postmaster's full-text search, capable of returning "all letters containing the phrase 'connection refused' across these 12 buildings in the last 24 hours, grouped by sender, ordered by frequency". The takeaway for SOA-C02 candidates: when a question describes "real-time forwarding of matching log events", the answer is a subscription filter; when it describes "find a pattern in historical logs across many groups", the answer is Logs Insights; when it describes "alarm when a pattern appears N times per minute", the answer is a metric filter plus a CloudWatch alarm.
For SOA-C02 alarm-from-logs questions, the restaurant tally chart (metric filter) and the rigged photocopier (subscription filter) are the two distinctions you must keep separate. A metric filter produces a number you can alarm on after the fact. A subscription filter produces a stream of matching events you can react to in real time. The exam regularly offers both as plausible options; pick metric filter when the goal is "count and alarm", pick subscription filter when the goal is "process or forward each match". Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html
CloudWatch Logs Architecture: Log Groups, Log Streams, Retention, and KMS Encryption
Before any query or alarm, you need a precise mental model of how a log event arrives in CloudWatch Logs and where it lives.
The three-tier data model
- Log event — a single timestamp + message record. Maximum message size 256 KB, maximum batch size 1 MB, maximum batch event count 10,000. Events older than 14 days or more than 2 hours in the future are rejected.
- Log stream — a sequence of log events from one source. Streams are append-only and immutable. Naming is conventional: EC2 instance ID, Lambda execution environment, ECS task ARN, ENI ID for Flow Logs.
- Log group — the policy unit. One log group can hold an effectively unlimited number of streams (soft quota typically 1 million) and is where retention, KMS, metric filters, subscription filters, and resource policies attach.
Retention policies — the most-tested default
When you create a log group through any path that does not explicitly set retention — most CLI calls, most SDK calls, automatic creation by Lambda, automatic creation by VPC Flow Logs — the retention defaults to Never Expire. CloudWatch Logs storage is billed per GB per month indefinitely, so a forgotten verbose log group can balloon the bill silently for years.
Supported retention values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653 days, or Never Expire. SysOps engineers should set retention explicitly during creation or remediate via aws logs put-retention-policy --log-group-name <name> --retention-in-days <N>. AWS Config has a managed rule cw-loggroup-retention-period-check that flags log groups exceeding a configured maximum retention; pair with EventBridge → SSM Automation for auto-remediation.
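The cost trap above is easy to quantify. A minimal sketch, assuming an illustrative storage rate (the real per-GB-month price varies by Region and changes over time — check the CloudWatch pricing page before relying on any number here):

```python
# Illustrative cost of a forgotten "Never Expire" log group.
# The rate below is an assumption for the sketch, NOT a quoted AWS price.
STORAGE_PER_GB_MONTH = 0.03  # assumed USD per GB-month of log storage

def never_expire_cost(gb_per_day: float, months: int) -> float:
    """Cumulative storage bill when nothing is ever deleted.

    Each month adds ~30 days of new data, and every prior month's data
    is still billed, so the total grows quadratically over time.
    """
    total = 0.0
    stored_gb = 0.0
    for _ in range(months):
        stored_gb += gb_per_day * 30          # another month of ingestion retained
        total += stored_gb * STORAGE_PER_GB_MONTH
    return round(total, 2)

# 5 GB/day of verbose logs, left alone for 2 years:
print(never_expire_cost(5, 24))
```

The quadratic growth is the point: the same workload with a 30-day retention policy pays a flat monthly storage cost instead.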
KMS encryption at rest
Log groups encrypt at rest by default with an AWS-managed key. For compliance scenarios you can associate a customer-managed KMS key to the log group at creation (KmsKeyId parameter) or via associate-kms-key on an existing group. The KMS key policy must grant the CloudWatch Logs service principal (logs.<region>.amazonaws.com) the standard kms:Encrypt, kms:Decrypt, kms:ReEncrypt*, kms:GenerateDataKey*, kms:DescribeKey actions, scoped to the log group ARN via the kms:EncryptionContext:aws:logs:arn condition. A common SOA-C02 trap: candidates create the key but forget the kms:EncryptionContext condition, and the agent silently fails to publish.
Disassociating a KMS key
Once a CMK is associated with a log group, all subsequent log events are encrypted with it. Existing log events remain encrypted with whatever key was active at write time. Disassociating the key (disassociate-kms-key) prevents future encryption with the CMK and reverts to the AWS-managed key — but historical events remain bound to the CMK and become unreadable if the CMK is disabled or deleted. Plan key rotation and lifecycle accordingly.
- Default log group retention: Never Expire (the SOA-C02 trap default).
- Maximum log event size: 256 KB per message.
- Maximum log batch size: 1 MB or 10,000 events per PutLogEvents call.
- Log event ingestion age window: events from up to 14 days in the past to 2 hours in the future are accepted; outside that window they are rejected.
- Logs Insights time range: default 24 hours; queries can scan up to the retention horizon, but cost/duration scales with bytes scanned.
- Logs Insights log groups per query: up to 50 log groups in a single query.
- Logs Insights query timeout: 15 minutes maximum runtime per query.
- Logs Insights returned events: up to 10,000 rows per query result.
- Subscription filter targets: Kinesis Data Streams, Kinesis Data Firehose, Lambda, or a cross-account CloudWatch Logs destination (which wraps a Kinesis stream) — at most two subscription filters per log group.
- Metric filter limit: 100 metric filters per log group.
- Retention values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653 days, or Never Expire.
- Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch-logs-quotas.html
Metric Filters: Extracting Numeric Metrics from Log Text Patterns
A metric filter is the bridge from unstructured log text to a structured CloudWatch metric you can alarm on. It is one of the highest-yield SOA-C02 features.
How a metric filter works
You attach a metric filter to a log group with three things:
- A filter pattern — what to match. CloudWatch Logs supports two pattern dialects: classic terms-and-quoted-phrases ("ERROR" "timeout"), and JSON pattern matching ({$.statusCode = 500}) for structured JSON log events.
- A metric transformation — what to publish. A metric name (e.g., ApplicationErrorCount), a metric namespace (e.g., MyApp/Prod), a value (typically 1 for count-style filters, or $.latency to extract a numeric field), an optional default value when no event matches in a period, and optional dimensions extracted from the log JSON.
- An optional unit — the metric unit (Count, Bytes, Milliseconds).
Once the filter is created, every new log event that arrives in the log group is evaluated. Matching events publish data points to the named metric in the named namespace. You then build a CloudWatch alarm on that metric exactly as if it were any other metric.
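To make the match-then-count flow concrete, here is a toy local simulation of a count-style filter. It implements only the ?term OR subset of the classic dialect; the real service evaluates patterns server-side, and the events and pattern here are invented:

```python
from collections import Counter

def matches_any_term(message: str, pattern: str) -> bool:
    """Tiny subset of the classic dialect: '?A ?B' matches if the
    event contains A OR B (term matching is case-sensitive)."""
    terms = [t[1:] for t in pattern.split() if t.startswith("?")]
    return any(term in message for term in terms)

# (minute-of-arrival, message) pairs standing in for log events
events = [
    (0, "INFO request ok"),
    (0, "ERROR db timeout"),
    (1, "Error cache miss"),
    (1, "INFO request ok"),
    (2, "INFO request ok"),
]
pattern = "?ERROR ?Error ?error"

# Count matches per period; the default value (0) fills periods with
# no match, mirroring a metric transformation with a default of 0.
counts = Counter(minute for minute, msg in events if matches_any_term(msg, pattern))
datapoints = {minute: counts.get(minute, 0) for minute in range(3)}
print(datapoints)  # -> {0: 1, 1: 1, 2: 0}
```

An alarm on the resulting metric then treats those per-period counts like any other CloudWatch data points.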
Common metric filter patterns
- Count error log lines: pattern ?ERROR ?Error ?error, value 1, metric ErrorLogCount. Alarm when sum > 100 per 5 minutes.
- Count failed SSH attempts: pattern [..., status="Failed", user, ...], value 1, metric FailedSSHAttempts. Pair with composite alarm for brute-force detection.
- Extract HTTP latency from JSON access log: pattern {$.method = "GET" && $.latency > 0}, value $.latency, metric GetLatencyMs (Milliseconds unit). Alarm on p99.
- Count 5xx in CloudFront logs: pattern {$.responseStatus >= 500 && $.responseStatus < 600}, value 1, metric EdgeServerErrors.
- Extract dimensional metric: with JSON pattern, you can extract $.region and $.tenantId as dimensions so one filter publishes a per-region per-tenant time series — useful for multi-tenant SaaS observability.
Pattern syntax — classic vs JSON
Classic pattern matches against the raw text of the log event. Tokens are space-separated terms; quotes mark literal phrases; ? is logical OR; - is exclusion; [...] matches space-delimited columns by position with optional name and predicate.
JSON pattern matches against the log event interpreted as JSON. The pattern is a Boolean expression over $.field paths with operators =, !=, >, <, >=, <=, and Boolean &&, ||. JSON patterns are far more reliable for structured logs than classic patterns because they are unaffected by adjacent text or whitespace.
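A sketch of how the JSON dialect evaluates: resolve each $.field path against the parsed event, then apply the comparison. This illustrates the semantics only — it is not the service's parser, it supports only &&-joined clauses, and the sample event is invented:

```python
import json
import operator

def get_path(event: dict, path: str):
    """Resolve a '$.a.b' selector against a parsed JSON event."""
    node = event
    for key in path.lstrip("$.").split("."):
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            break
    return node

def eval_json_pattern(event: dict, clauses) -> bool:
    """Evaluate an AND of (path, op, value) clauses -- the shape of
    '{$.a = x && $.b > y}'. No '||' support in this sketch."""
    ops = {"=": operator.eq, "!=": operator.ne, ">": operator.gt,
           "<": operator.lt, ">=": operator.ge, "<=": operator.le}
    return all(
        (v := get_path(event, path)) is not None and ops[op](v, rhs)
        for path, op, rhs in clauses
    )

log_line = '{"method": "GET", "statusCode": 500, "latency": 182}'
event = json.loads(log_line)
# Equivalent of {$.statusCode >= 500 && $.statusCode < 600}
print(eval_json_pattern(event, [("$.statusCode", ">=", 500),
                                ("$.statusCode", "<", 600)]))  # True
```

Note how a missing field simply fails the clause — the same reason JSON patterns are robust against adjacent text that breaks classic patterns.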
The single most-tested metric filter gotcha on SOA-C02: a metric filter is forward-looking only. It evaluates log events as they arrive in the log group going forward. Existing log events written before the filter was created are not retroactively scanned, and the metric never gets historical data points. To analyze historical patterns, run a Logs Insights query against the existing log group instead. To produce alarms on a historical baseline, create the filter, wait for enough new data to arrive, then build the alarm — or use Logs Insights for a one-off historical analysis. Candidates who answer "create a metric filter and alarm on yesterday's data" lose the question. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html
Metric filter vs subscription filter vs Logs Insights
This is the canonical SOA-C02 decision and we will return to it in a dedicated decision matrix later. The short answer:
- Metric filter when the goal is "count and alarm".
- Subscription filter when the goal is "react to each match in real time".
- Logs Insights when the goal is "ad-hoc analysis or historical investigation".
CloudWatch Logs Insights: Query Syntax for Operational Troubleshooting
CloudWatch Logs Insights is the purpose-built query language for ad-hoc log analysis. Where metric filters and subscription filters are forward-looking and pattern-driven, Logs Insights is interactive, retroactive, and analytical — the SysOps detective's tool of choice during an incident.
Query structure
A Logs Insights query is a pipeline of commands separated by |. The full grammar:
- fields — choose fields to return; supports built-in fields (@timestamp, @message, @logStream, @log) and any JSON path or extracted field.
- filter — filter rows by Boolean expression; supports =, !=, <, >, like /regex/, in [...], and, or, not.
- parse — extract substrings into named variables using glob-style patterns (* request_id=* status=*) or regex (@message /(?<ip>\d+\.\d+\.\d+\.\d+)/).
- stats — aggregate (count(), sum(), avg(), min(), max(), pct(latency, 99), count_distinct(userId)), optionally grouped by field or by bin(5m).
- sort — order by field, with asc or desc.
- limit — cap result rows (default and max 10,000).
- dedup — collapse duplicate rows by field combination.
- display — final column projection.
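The pipeline semantics can be emulated locally. This sketch runs the equivalent of filter @message like /ERROR/ | stats count() by bin(5m) over invented events (epoch-second timestamps stand in for @timestamp):

```python
from collections import Counter
from datetime import datetime, timezone

def bin_5m(ts: int) -> str:
    """Floor an epoch-seconds timestamp to its 5-minute bucket,
    the way bin(5m) groups rows."""
    floored = ts - ts % 300
    return datetime.fromtimestamp(floored, tz=timezone.utc).strftime("%H:%M")

events = [
    (1700000000, "ERROR upstream 502"),
    (1700000030, "INFO ok"),
    (1700000400, "ERROR upstream 502"),
    (1700000500, "ERROR db timeout"),
]

# filter @message like /ERROR/ | stats count() by bin(5m)
result = Counter(bin_5m(ts) for ts, msg in events if "ERROR" in msg)
print(dict(result))
```

Each output row is one 5-minute bucket with its match count — exactly the shape the console charts as an error-rate timeline.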
Built-in fields
Every log event in Logs Insights exposes the following implicit fields without configuration:
- @timestamp — the event timestamp.
- @message — the raw event message.
- @logStream — the log stream name.
- @log — the log group identifier (account ID + log group name).
- @ingestionTime — when CloudWatch Logs received the event.
- @type — for Lambda logs, the record type (START, END, REPORT); discovered fields like this vary by log source.
For JSON-encoded log events, top-level JSON fields are auto-extracted as named fields without needing parse.
Multi-log-group queries
A single Logs Insights query can scan up to 50 log groups. This is the answer for "find every API error across the entire fleet of microservices in the last 24 hours" — select all 50 microservice log groups in the console, type one query, and inspect results.
Saved queries and dashboards
Insights queries can be saved under a folder hierarchy and re-run by any team member. The query result can also be embedded in a CloudWatch dashboard as a Logs widget, refreshing automatically. Saved queries are part of the SOA operational runbook and should be version-controlled (export the JSON via API, store in Git).
Pricing model
Logs Insights bills per GB of data scanned per query, independent of how many rows you return. Tightening the time range, narrowing log groups, and using indexed filters (@logStream membership, exact = matches) all reduce scanned bytes and cost. SOA-C02 sometimes asks "how to reduce Logs Insights cost" — the answer is "narrow the time range and the log group set", not "switch to a different service".
Memorize the operational shape of these five queries — variations of them appear in scenario questions and in real production work:
- Top N error messages by count — fields @timestamp, @message | filter @message like /ERROR/ | stats count() as cnt by @message | sort cnt desc | limit 10
- Count by minute (timeline) — filter @message like /5xx/ | stats count() by bin(1m) to chart error rate over time.
- Percentile from JSON access log — filter ispresent(latency) | stats pct(latency, 99) by bin(5m) for p99 latency tracking.
- Group by source IP for security — parse @message /(?<ip>\d+\.\d+\.\d+\.\d+)/ | filter @message like /Failed password/ | stats count() by ip | sort count() desc
- Lambda timeout investigation — filter @type = "REPORT" | stats max(@duration), avg(@duration), pct(@duration, 99) by bin(5m) against the /aws/lambda/<name> log group.
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html
Common Logs Insights Queries: Operational Recipes
Beyond syntax, SysOps engineers carry a runbook of canonical queries. Each maps to a real Domain 1.1 scenario.
Top 10 error messages
fields @timestamp, @message
| filter @message like /ERROR/ or @message like /Exception/
| stats count() as occurrences by @message
| sort occurrences desc
| limit 10
This is the first query during any incident. Run across all microservice log groups in the affected stack to surface the dominant error.
p99 latency from ALB access logs (note: ALB access logs deliver natively to S3; this query assumes they have been forwarded into a log group)
filter @logStream like /ALB/
| parse @message /target_processing_time=(?<tpt>\S+)/
| filter ispresent(tpt) and tpt != "-1"
| stats pct(tpt, 99) as p99_seconds, pct(tpt, 50) as p50_seconds by bin(5m)
| sort @timestamp asc
Failed SSH attempts by source IP (Linux /var/log/secure collected by the agent)
filter @message like /Failed password/
| parse @message /from (?<src_ip>\S+) port/
| stats count() as attempts by src_ip
| sort attempts desc
| limit 20
Lambda cold-start frequency
filter @type = "REPORT"
| filter @message like /Init Duration/
| stats count() as cold_starts by bin(5m)
CloudTrail unauthorized API calls (with CloudTrail forwarded to a log group)
filter errorCode = "AccessDenied" or errorCode = "UnauthorizedOperation"
| stats count() as denied_calls by userIdentity.arn, eventName
| sort denied_calls desc
| limit 20
VPC Flow Logs rejected traffic
filter action = "REJECT"
| stats count() as rejects by srcAddr, dstAddr, dstPort
| sort rejects desc
| limit 50
These six queries cover roughly 80 percent of operational investigations. Save them in a "SysOps runbook" folder in the Logs Insights console and rehearse running them — the exam expects familiarity with the syntax shape.
CloudWatch Agent for Log Collection: Config File and Multiline Handling
The unified CloudWatch agent collects log files from Linux and Windows hosts and ships them to log groups. The mechanics matter on SOA-C02 because the most common "logs are missing" troubleshooting scenarios are agent configuration problems.
Agent config file structure
The agent reads a JSON config from /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json (Linux) or C:\ProgramData\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent.json (Windows), or fetches it from SSM Parameter Store. The logs section looks like:
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/nginx/access.log",
"log_group_name": "/myapp/nginx-access",
"log_stream_name": "{instance_id}",
"timestamp_format": "%d/%b/%Y:%H:%M:%S %z",
"multi_line_start_pattern": "{timestamp_format}",
"retention_in_days": 30
}
]
}
}
}
}
Tokens for log_stream_name
Within log_stream_name the agent supports tokens:
- {instance_id} — EC2 instance ID. Most common.
- {hostname} — OS hostname.
- {ip_address} — primary IP.
- {local_hostname} — short hostname.
- {date} — YYYY-MM-DD.
Using {instance_id} is the SOA-C02 convention because it makes log streams trivially traceable back to the source EC2 resource.
Multiline log grouping
Application logs (Java stack traces, Python tracebacks) are inherently multiline — a single logical event spans many lines. The agent groups multiline events using multi_line_start_pattern, a regex that matches the first line of each event. Lines that do not match are appended to the previous event. The shorthand "multi_line_start_pattern": "{timestamp_format}" reuses the timestamp regex automatically. Without this, every line ends up as its own event and stack traces fragment into a hundred unsearchable rows in CloudWatch Logs.
Auto-creating log groups and retention
The agent can auto-create the log group (logs:CreateLogGroup permission required) and set retention via retention_in_days. If retention is omitted, the log group inherits the Never Expire default — feeding directly into the most-tested cost trap on SOA-C02.
Agent permissions
The agent needs logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents, logs:DescribeLogStreams, logs:DescribeLogGroups, plus cloudwatch:PutMetricData if it also pushes metrics. The managed policy CloudWatchAgentServerPolicy packages all of these. The instance profile must include this policy or an equivalent custom policy.
When the CloudWatch agent's retention_in_days is omitted from the config, log groups it auto-creates default to Never Expire and accumulate storage charges indefinitely. The same applies to log groups auto-created by Lambda (/aws/lambda/<name>), VPC Flow Logs, and any service that calls CreateLogGroup without specifying retention. Audit your account regularly with the AWS Config managed rule cw-loggroup-retention-period-check, or with a one-line CLI: aws logs describe-log-groups --query "logGroups[?retentionInDays==null].logGroupName". Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SettingLogRetention.html
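The same audit can be done by post-processing the describe-log-groups JSON locally — groups without a retention policy simply omit the retentionInDays key. The response below is an invented sample in the documented output shape:

```python
import json

# Shape mirrors `aws logs describe-log-groups` output; sample data invented.
response = json.loads("""{
  "logGroups": [
    {"logGroupName": "/aws/lambda/checkout", "storedBytes": 52428800},
    {"logGroupName": "/myapp/nginx-access", "retentionInDays": 30},
    {"logGroupName": "/vpc/flow-logs", "storedBytes": 10737418240}
  ]
}""")

# Equivalent of the JMESPath filter logGroups[?retentionInDays==null]
never_expire = [g["logGroupName"] for g in response["logGroups"]
                if g.get("retentionInDays") is None]
print(never_expire)  # remediation targets for put-retention-policy
```

Feed the resulting names into aws logs put-retention-policy (or an SSM Automation runbook) to close the loop.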
Log Subscription Filters: Real-Time Streaming to Lambda, Firehose, and OpenSearch
Where a metric filter counts matches, a subscription filter routes matches to a downstream consumer in real time. SOA-C02 tests both forms — knowing when each applies is non-negotiable.
Subscription filter mechanics
A subscription filter is attached to a log group with three things:
- A filter pattern — same syntax as metric filter patterns (classic or JSON).
- A destination ARN — the target for matched events.
- An IAM role — the role CloudWatch Logs assumes to write to the destination (must trust
logs.<region>.amazonaws.com).
Each log event that matches the pattern is delivered to the destination in near real time (typically under one second). The event is base64-encoded and gzip-compressed by CloudWatch Logs before delivery; consumers must decompress.
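A consumer must reverse that encoding. This sketch builds a payload in the documented Lambda-event shape (event["awslogs"]["data"] holding base64-of-gzip JSON) and decodes it the way a subscriber handler would; all field values are invented:

```python
import base64
import gzip
import json

# Producer side: what CloudWatch Logs would deliver to a Lambda subscriber.
payload = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/myapp/nginx-access",
    "logStream": "i-0abc1234",
    "subscriptionFilters": ["errors-to-lambda"],
    "logEvents": [
        {"id": "1", "timestamp": 1700000000000, "message": "ERROR upstream 502"},
    ],
}
data = base64.b64encode(gzip.compress(json.dumps(payload).encode()))

# Consumer side: base64-decode, gunzip, parse -- the first three lines
# of essentially every subscription-filter Lambda handler.
decoded = json.loads(gzip.decompress(base64.b64decode(data)))
print(decoded["logGroup"], len(decoded["logEvents"]))
```

Forgetting the gzip step is a classic first-run bug in subscription consumers: base64-decoding alone yields binary garbage, not JSON.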
Supported destinations
- AWS Lambda — invoked synchronously per batch. Use for inline transformation, alerting, or simple SIEM integration. Lambda concurrency limits apply; throttling pushes back into the subscription.
- Kinesis Data Streams — high-throughput, ordered, multi-consumer. Use when many downstream applications must independently consume.
- Kinesis Data Firehose — buffered delivery to S3, Redshift (out of SOA scope), OpenSearch Service, or third-party HTTP endpoints. Use for archival, data lake ingestion, or OpenSearch dashboarding.
- A CloudWatch Logs destination in another account — a wrapper around a Kinesis Data Stream, used for cross-account aggregation.
The two-subscription-filter quota
Each log group supports at most two subscription filters simultaneously. If your architecture requires more than two consumers, fan out via Kinesis Data Streams (multi-consumer) or via Lambda that forwards to multiple targets.
Common subscription filter patterns
- Real-time SIEM ingestion — subscription filter on the /aws/cloudtrail/... log group with pattern {$.errorCode = "*"} → Firehose → OpenSearch Service → security analyst dashboards.
- Streaming application errors to Slack — subscription filter on application log group with pattern ?ERROR ?FATAL → Lambda → Slack webhook.
- Centralizing VPC Flow Logs — subscription filter on Flow Log log group → Firehose → S3 → Athena queries; or → OpenSearch for real-time network analysis.
Cost considerations
Subscription filters charge per event delivered. For verbose log groups (Flow Logs, debug-level application logs), this can quickly exceed the cost of the log group itself. Always pre-filter at the pattern level — never use a wildcard subscription on a 100 GB/day log group unless the downstream is genuinely required.
SOA-C02 frequently asks "the SysOps team set up a subscription filter to forward all log events to OpenSearch and the bill spiked — what to do?". The answer is to tighten the filter pattern so only relevant events forward, and/or replace the always-on subscription with a Firehose-based archival path that buffers and batches. Wildcard subscription on a high-volume log group is an antipattern. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html
Exporting Logs to S3: One-Time Export Task vs Continuous via Firehose
Long-term retention (years) and analytics outside CloudWatch Logs require export to S3. Two operational paths exist; SOA-C02 tests when each applies.
Path 1 — One-time export task
The CreateExportTask API initiates an asynchronous, one-time export of log events from a log group to an S3 prefix, filtered by time range. The export is bulk, not real-time, taking minutes to hours depending on volume. Only one export task per account per Region runs at a time. The task creates an object hierarchy s3://bucket/prefix/<exportTaskId>/... and writes gzip-compressed log files.
When to use:
- Compliance archival before reducing log group retention — export to S3 with a longer lifecycle, then drop log group retention to 30 days.
- Ad-hoc forensic export for a security incident — export the relevant time window to S3, then run Athena queries.
- Account migration — export source account logs to S3, copy across accounts, ingest into the destination.
The S3 bucket must have a bucket policy allowing CloudWatch Logs (logs.<region>.amazonaws.com) to write objects with the appropriate aws:SourceAccount and aws:SourceArn conditions.
Path 2 — Continuous via Subscription Filter → Firehose → S3
For ongoing archival of new log events, the right architecture is a subscription filter on the log group routing to Kinesis Data Firehose with an S3 destination. Firehose buffers events (size and time triggers), gzip-compresses, optionally converts to Parquet/ORC, and writes to S3 in real time. This path scales to high-throughput log groups and integrates naturally with downstream analytics — Athena in particular; Glue and EMR also work, though they sit outside SOA-C02 scope.
When to use:
- Continuous SIEM or data lake ingestion.
- Cost optimization — keep CloudWatch Logs retention short (e.g., 7 days for hot debugging) and Firehose-archive everything to S3 with cheaper storage tiers.
- Format conversion — Firehose can transform JSON to Parquet for Athena performance.
When NOT to export
If the only need is "occasionally search historical logs", Logs Insights against the existing log group is cheaper and faster than export-to-S3-then-Athena. Export only when you need durations beyond CloudWatch Logs retention, columnar format for analytical queries, or distribution to non-AWS systems.
Cross-Account Log Aggregation: Resource Policies and Cross-Account Subscription Filters
Multi-account organizations need a single security or operations account that aggregates logs from every member account. CloudWatch Logs supports two cross-account patterns.
Pattern 1 — Cross-account writes via log group resource policy
A log group resource policy is an IAM policy attached directly to a log group. It grants other AWS accounts or services permission to write log events into the group. This is how AWS native services like VPC Flow Logs, Route 53 query logging, and CloudFront real-time logs publish to a centralized log group; the same mechanism is available to cross-account writes from your own resources.
Example resource policy for a centralized Flow Log target:
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "AWSLogDeliveryWrite",
"Effect": "Allow",
"Principal": { "Service": "delivery.logs.amazonaws.com" },
"Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
"Resource": "arn:aws:logs:us-east-1:111122223333:log-group:central-flow-logs:*",
"Condition": {
"StringEquals": { "aws:SourceAccount": "111122223333" },
"ArnLike": { "aws:SourceArn": "arn:aws:logs:us-east-1:111122223333:*" }
}
}]
}
CloudWatch Logs resource policies are capped at 10 per region per account, and each policy document is limited to 5,120 characters, so you may need to consolidate statements when many sources share one centralized group.
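A policy document like the one above can also be generated programmatically. The sketch below builds the same document for caller-supplied account, region, and log group values; it is a local illustration only, not an AWS API call, and the inputs are placeholders:

```python
import json

def flow_log_delivery_policy(account_id: str, region: str, log_group: str) -> str:
    """Build a CloudWatch Logs resource policy document for log delivery.
    Mirrors the example policy above; all values are caller-supplied."""
    doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AWSLogDeliveryWrite",
            "Effect": "Allow",
            "Principal": {"Service": "delivery.logs.amazonaws.com"},
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnLike": {"aws:SourceArn": f"arn:aws:logs:{region}:{account_id}:*"},
            },
        }],
    }
    return json.dumps(doc, indent=2)

print(flow_log_delivery_policy("111122223333", "us-east-1", "central-flow-logs"))
```

In practice the generated document would be passed to `put-resource-policy`; keeping the builder in one place helps stay under the 5,120-character document limit as sources accumulate.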
Pattern 2 — Cross-account subscription filter via destination
For real-time aggregation, the source account creates a subscription filter pointing at a destination in the receiver account. A destination is a CloudWatch Logs object that wraps a Kinesis Data Stream and an IAM role; its access policy permits source accounts to put records.
The flow:
- Receiver account creates a Kinesis Data Stream, an IAM role with permissions to write to the stream, and a CloudWatch Logs destination referencing the stream and role. The destination has an access policy listing source account IDs.
- Source account creates a subscription filter on its log group with the destination ARN as target.
- Events stream from the source log group through CloudWatch Logs into the receiver-account Kinesis stream, where consumers (Lambda, Firehose, downstream analytics) process them.
This is the SOA-C02 answer for "centralize logs from 50 accounts into a single security account in real time". For batch / non-real-time aggregation, S3 with cross-account bucket policy plus per-source-account export tasks (or Firehose) is simpler.
For large organizations, AWS Organizations enables a delegated administrator pattern where one member account is granted the right to read CloudWatch Logs across the org. Combined with cross-account subscription filters or a centralized S3 logging bucket, this is the cleanest operational model and matches AWS's own logging-account pattern in Control Tower. SOA-C02 favors AWS-native multi-account patterns over hand-built cross-account IAM. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CrossAccountSubscriptions.html
Log Retention and Archival: Setting Retention from 1 Day to Never
Retention is the operational decision that most directly drives the CloudWatch Logs bill. Three layers to think through.
Layer 1 — Hot retention (CloudWatch Logs)
This is the period during which Logs Insights queries return data and metric filters keep evaluating new events. Choose based on debugging cadence:
- 1–7 days — short-lived debug logs, DEBUG-level application output, transient diagnostics.
- 14–30 days — typical web tier and application logs where two-week incident investigation windows are normal.
- 90 days — security-sensitive logs (CloudTrail, VPC Flow Logs) where investigations frequently look back a quarter.
- 365 days or more — compliance-mandated retention; combine with export to S3 Glacier for further cost reduction.
Layer 2 — Warm archival (S3 Standard / S3 IA)
For cost-optimized storage of logs beyond the hot window, export to S3 with lifecycle rules transitioning to S3-IA after 30 days. Athena queries against S3 cost cents per GB scanned and serve compliance lookups well.
Layer 3 — Cold archival (S3 Glacier / Glacier Deep Archive)
For long-term retention required by regulations (HIPAA, SOX, PCI), lifecycle objects to Glacier Deep Archive after 180 days. Retrieval takes hours but costs are an order of magnitude lower than CloudWatch Logs. The standard SOA pattern: CloudWatch Logs 30 days hot, S3 IA 1–2 years warm, Glacier Deep Archive beyond.
Cost implications of "Never Expire"
A log group accumulating 10 GB/day at Never Expire holds roughly 3.6 TB after one year and 18 TB after five years, all billed at the CloudWatch Logs per-GB-month storage rate. Switching the same group to 30-day retention caps storage at 300 GB. The mathematical case for explicit retention is overwhelming, and the operational case is the same — old logs are rarely useful past their hot window.
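The storage math above can be sketched in a few lines; the volumes are the illustrative figures from the paragraph, not AWS pricing:

```python
# Back-of-envelope CloudWatch Logs storage math.
def stored_gb_never_expire(daily_gb: float, days_elapsed: int) -> float:
    """With 'Never Expire', stored bytes grow without bound."""
    return daily_gb * days_elapsed

def stored_gb_with_retention(daily_gb: float, retention_days: int) -> float:
    """With explicit retention, storage plateaus at daily volume x retention."""
    return daily_gb * retention_days

print(stored_gb_never_expire(10, 365))   # 3650 GB (~3.6 TB) after one year
print(stored_gb_with_retention(10, 30))  # caps at 300 GB
```

Multiply either figure by the per-GB-month storage rate for the bill impact; the never-expire curve keeps climbing while the retention curve is flat.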
Scenario Pattern: Application Logs Missing from CloudWatch
A canonical SOA-C02 troubleshooting flow. The runbook in order of frequency:
- Confirm the agent is installed and running. SSH or Session Manager into the EC2 instance: `sudo systemctl status amazon-cloudwatch-agent` (Linux) or `Get-Service AmazonCloudWatchAgent` (Windows). If absent, install via Systems Manager Run Command using `AWS-ConfigureAWSPackage` to install `AmazonCloudWatchAgent`.
- Verify the agent has a log-collection config. A bare agent install does nothing. Check `/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json` for the `logs.logs_collected.files.collect_list` array. Use `amazon-cloudwatch-agent-ctl -a status` to see what the agent thinks it is doing.
- Verify the file path matches reality. The most common bug: the config says `/var/log/myapp/app.log` but the application writes to `/var/log/myapp/application.log`. The agent silently watches a non-existent file.
- Verify IAM permissions. The instance profile must include `CloudWatchAgentServerPolicy`. Missing `logs:CreateLogStream` is the textbook SOA-C02 failure mode: the log group exists, the agent runs, but it cannot create a new stream when an instance launches.
- Verify the network path. If the instance is in a private subnet without a NAT gateway, the agent needs a VPC interface endpoint for `logs.<region>.amazonaws.com`; otherwise calls to the CloudWatch Logs API fail.
- Verify the agent log itself. `/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log` shows publish errors. Look for `AccessDenied` (IAM gap), `connection refused` (network gap), `no such file` (path mismatch), or `InvalidSequenceTokenException` (rare; auto-recovers).
- Check that the log group exists and is in the right region. Region mismatch between agent config and console search is a common time-waster.
The most common root causes in practice: missing logs:CreateLogStream permission, file path mismatch, agent installed but the start command never run, missing VPC endpoint for private subnets.
Scenario Pattern: SysOps Must Find All API Errors Across 50 Instances in 24 Hours
The textbook Logs Insights scenario. The ops engineer is paged at 2am with "the API is returning errors". They need to find which errors, on which instances, with what frequency.
The query:
fields @timestamp, @logStream, @message
| filter @message like /5\d{2}/ or @message like /ERROR/ or @message like /Exception/
| stats count() as occurrences by @logStream, @message
| sort occurrences desc
| limit 100
Run across all 50 instance log groups (multi-select in the console, up to the 50 log group limit). Within seconds the result reveals which messages dominate and which instances are producing them. Drill in by adding | filter @logStream = "i-0abc123" to focus on one host, or by tightening the time range to the last hour.
If the error volume is large enough that scan cost is a concern, add an early filter @timestamp > <epoch_ms> or use the time-range picker to narrow the window. The query is iterative — refine until the dominant pattern is obvious, then propose a metric filter on that pattern so future occurrences fire an alarm without manual re-querying.
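The aggregation that `stats count() by @logStream, @message` performs can be sketched locally; the sample events below are invented:

```python
from collections import Counter
import re

# Match the same error-ish shapes the Insights query filters for,
# then count occurrences per message, highest first.
events = [
    "GET /api/orders 500 Internal Server Error",
    "ERROR db timeout on orders",
    "GET /api/orders 500 Internal Server Error",
    "INFO request ok",
]
pattern = re.compile(r"5\d{2}|ERROR|Exception")
top = Counter(e for e in events if pattern.search(e)).most_common(10)
print(top)
# [('GET /api/orders 500 Internal Server Error', 2), ('ERROR db timeout on orders', 1)]
```

Logs Insights does this server-side across all selected log groups, which is why the query returns in seconds even over millions of events.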
Common Trap: Metric Filters Are Not Retroactive
Repeated for emphasis because it appears as a distractor in nearly every SOA-C02 metric-filter question. A metric filter created today does nothing for log events ingested yesterday. The filter evaluates new events as they arrive in the log group; historical events are not re-scanned.
Practical implications:
- Cannot retro-alarm on an incident that just happened. Use Logs Insights to investigate the past; create a metric filter for forward-looking alarming.
- Cannot back-fill a metric. If a SysOps team needs a metric for capacity-planning over the last 90 days, they must either find an existing CloudWatch metric or run repeated Logs Insights queries and aggregate manually.
- Filter pattern changes are also forward-only. Editing the pattern produces a discontinuity in the metric; old events evaluated under the old pattern stay attributed to the old metric data points.
The clean mental rule: metric filter = forward, Logs Insights = backward.
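The forward-only rule can be captured in a toy model; timestamps and messages here are invented:

```python
# A metric filter only evaluates events ingested at or after the
# filter's creation time; earlier events are never re-scanned.
def metric_count(events, filter_created_at, pattern="ERROR"):
    return sum(1 for ts, msg in events
               if ts >= filter_created_at and pattern in msg)

events = [(100, "ERROR disk full"),
          (200, "ERROR disk full"),
          (300, "all ok")]

# Filter created at t=150: the t=100 error existed but is never counted.
print(metric_count(events, filter_created_at=150))  # 1
```

A Logs Insights query over the same window would see both errors, which is exactly the forward/backward split the mental rule describes.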
Common Trap: Subscription Filter Forwarding Cost
A subscription filter on a verbose log group can multiply the bill many times over. Before adding a subscription:
- Estimate event volume. `aws logs describe-log-groups` returns `storedBytes`; divide by retention days to estimate daily ingest.
- Estimate matched volume. Run a Logs Insights query for the proposed filter pattern over a 24-hour window; this is the rate the subscription will deliver downstream.
- Choose the destination wisely. Lambda billed per invocation can be more expensive than Firehose batched delivery for the same data.
- Pre-filter at the pattern level. Never wildcard-subscribe a 100 GB/day log group "to forward everything" — that pattern explodes both subscription costs and downstream compute.
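The first estimation step above reduces to simple arithmetic; the figures below are illustrative, and in practice `storedBytes` comes from `describe-log-groups`:

```python
def estimated_daily_ingest_gb(stored_bytes: int, retention_days: int) -> float:
    """Approximate daily ingest from a log group's storedBytes, assuming
    storage has reached steady state under the retention window."""
    return stored_bytes / retention_days / 1e9

# A 3 TB log group with 30-day retention ingests roughly 100 GB/day.
print(round(estimated_daily_ingest_gb(3_000_000_000_000, 30)))  # 100
```

Multiply the matched fraction from the Logs Insights sample by this daily rate to size the downstream Lambda or Firehose bill before enabling the subscription.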
CloudWatch Logs vs CloudTrail Logs: Different Ingestion Paths
A SOA-C02 distinction question. CloudWatch Logs ingests application output, OS logs, agent-collected logs, and any service that writes to a log group (Lambda runtime, VPC Flow Logs, AWS WAF web ACL via direct integration). CloudTrail Logs record AWS API calls — who did what on which resource when — and are delivered to S3 by default, optionally also to a CloudWatch log group.
Operational distinctions for the exam:
- Source of data: app/OS for CloudWatch Logs; AWS API plane for CloudTrail.
- Default destination: log group for CloudWatch Logs; S3 for CloudTrail (and optionally a log group).
- Querying: Logs Insights for both; Athena common for CloudTrail-in-S3.
- Retention default: Never Expire for CloudWatch Logs; CloudTrail S3 follows S3 lifecycle rules.
- Use case: debug an application bug → CloudWatch Logs; identify who changed an IAM policy → CloudTrail.
When CloudTrail is forwarded to a log group, the same Logs Insights syntax applies — and the operational power is significant, because security teams get an interactive query language over API audit history.
SOA-C02 vs SAA-C03: The Operational Lens
| Question style | SAA-C03 lens | SOA-C02 lens |
|---|---|---|
| Selecting a logging service | "Which AWS service centralizes application logs?" | "Logs are not appearing in the expected log group — list every diagnostic step in order." |
| Retention | "Choose the most cost-effective archival." | "All log groups are Never Expire and the bill spiked — implement retention policies and one-time export to S3." |
| Metric filter | "Which feature converts log patterns to a metric?" | "The metric filter was just created and the alarm is in INSUFFICIENT_DATA — why? (no historical scan)" |
| Subscription filter | "Which feature forwards logs in real time?" | "Configure cross-account subscription filter via destination + Kinesis Data Stream." |
| Logs Insights | "Which service queries CloudWatch Logs interactively?" | "Write a Logs Insights query returning top 10 errors by count over the last 24 hours grouped by log stream." |
| Cross-account | "Aggregate logs across accounts." | "Receiver account creates a destination wrapping a Kinesis stream; source account creates subscription filter pointing at destination ARN." |
| Encryption | "Encrypt log data at rest." | "Associate CMK; configure key policy with the kms:EncryptionContext:aws:logs:arn condition or the agent fails silently." |
| Multiline logs | Rarely tested. | "Java stack traces show one line per row in CloudWatch Logs — fix the agent multi_line_start_pattern." |
The SAA candidate selects a service; the SOA candidate diagnoses the missing log, writes the Insights query, and configures the cross-account destination pattern correctly.
Exam Signal: How to Recognize a Domain 1.1 Logs Question
Domain 1.1 logs questions on SOA-C02 follow predictable shapes. Recognize them and time per question drops dramatically.
- "Logs are missing" — the answer involves the CloudWatch agent, an IAM permission gap (
logs:CreateLogStreamis the most common), a file-path mismatch, or a missing VPC endpoint in a private subnet. - "Find a pattern in historical logs" — Logs Insights query. The exam shows the query and expects you to identify the right shape:
filter+stats count() by .... - "Alarm when a pattern appears N times" — metric filter + CloudWatch alarm.
- "Forward matching events in real time" — subscription filter + Lambda / Firehose / Kinesis.
- "Aggregate logs from multiple accounts" — cross-account destination + Kinesis Data Stream, OR centralized log group with resource policy.
- "The bill exploded for log storage" — retention defaulted to Never Expire; set explicit retention, optionally export old data to S3.
- "Long-term archival of compliance logs" — Subscription filter → Firehose → S3 + lifecycle to Glacier.
- "Java stack traces fragmenting in CloudWatch Logs" — agent
multi_line_start_patternnot configured. - "Metric filter alarm just created and never fires on historical data" — metric filters are forward-only.
- "Encrypt log data with a customer-managed key" — KMS CMK association + key policy with
kms:EncryptionContext:aws:logs:arncondition.
Combined with the metrics/alarms topic, Domain 1.1 totals roughly 10–13 questions. CloudWatch Logs and Logs Insights typically claim 5 to 8 of those. Mastering metric filter vs subscription filter vs Logs Insights, plus the Never-Expire retention default, plus the agent IAM gotchas, captures nearly all of them. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html
Decision Matrix — Metric Filter vs Subscription Filter vs Logs Insights
The single most-tested SOA-C02 decision in this topic. Use this lookup during the exam.
| Operational goal | Right construct | Why |
|---|---|---|
| Alarm when a string appears N times per minute | Metric filter + CloudWatch alarm | The pattern produces a counted metric for alarming. |
| Stream every matching event to Lambda for processing | Subscription filter → Lambda | Real-time per-event delivery. |
| Stream all logs to OpenSearch for SIEM dashboards | Subscription filter → Firehose → OpenSearch | Buffered, scalable, format-flexible. |
| Investigate "what happened in the last 6 hours" | Logs Insights | Interactive, retroactive, queryable. |
| Find top 10 error messages by count | Logs Insights with stats count() by @message | Aggregation language built in. |
| Archive logs for 7 years for compliance | Subscription filter → Firehose → S3 + Glacier lifecycle | Cheaper than CloudWatch Logs at long retention. |
| Export the last 30 days for a forensic case | CreateExportTask to S3 | One-shot, range-bounded. |
| Forward errors to a Slack channel | Subscription filter → Lambda | Real-time alerting on each event. |
| Compute p99 latency from JSON logs | Logs Insights stats pct(latency, 99) by bin(5m) | Or metric filter extracting $.latency then alarm on p99. |
| Aggregate from 50 accounts in real time | Cross-account subscription filter to destination + Kinesis | Native multi-account log streaming. |
| Cross-account log writes (centralized log group) | Log group resource policy | For services like VPC Flow Logs, CloudFront. |
| Search logs older than 14 days quickly | Logs Insights if within retention; otherwise Athena over exported S3 | Logs Insights cheap up to retention; Athena beyond. |
| Detect anomalies in log volume | Metric filter producing event-count metric + CloudWatch anomaly detection alarm | Combine both topics. |
| Reduce CloudWatch Logs bill on verbose log group | Set retention to days, export old data to S3, drop log level to WARN+ | Operational cost discipline. |
Common Traps Recap — CloudWatch Logs and Logs Insights
Every SOA-C02 attempt will see two or three of these distractors.
Trap 1: Metric filters scan historical logs
They do not. Metric filters are forward-only; create the filter, then alarm on new data. Use Logs Insights for historical investigation.
Trap 2: Default log group retention is reasonable
It is not. Default is Never Expire. Always set retention explicitly at creation; audit existing log groups with the AWS Config managed rule cw-loggroup-retention-period-check.
Trap 3: Subscription filter without rate consideration
A wildcard subscription on a 100 GB/day log group multiplies costs across CloudWatch Logs egress, the destination service, and downstream compute. Always pre-filter at the pattern.
Trap 4: Agent installed equals logs published
The agent does nothing without a config file referencing the right log files and the right log group/stream names. Confirm with amazon-cloudwatch-agent-ctl -a status and check the agent's own log file at /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log.
Trap 5: logs:CreateLogStream is implied by logs:PutLogEvents
It is not. The agent must explicitly hold logs:CreateLogStream to create a stream on first publish. Use CloudWatchAgentServerPolicy rather than hand-rolling — it is the SOA-sanctioned answer.
Trap 6: Logs Insights queries any number of log groups for free
Cost scales linearly with bytes scanned. A query over 50 verbose log groups for 7 days can cost meaningful dollars. Narrow time range and log group set.
Trap 7: Multiline application logs publish as single events
Without multi_line_start_pattern, every line of a stack trace becomes a separate log event, and the trace becomes unsearchable. The fix is in the agent config, not Logs Insights.
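What `multi_line_start_pattern` does can be simulated locally: lines that do not match the start pattern are appended to the previous event. The log lines and the timestamp pattern below are illustrative assumptions:

```python
import re

# Assume each new log event starts with an ISO-like date; continuation
# lines (stack frames) are folded into the preceding event.
start = re.compile(r"^\d{4}-\d{2}-\d{2}")
lines = [
    "2024-01-01 12:00:00 ERROR NullPointerException",
    "    at com.example.Handler.process(Handler.java:42)",
    "    at com.example.Main.main(Main.java:10)",
    "2024-01-01 12:00:01 INFO request ok",
]
events = []
for line in lines:
    if start.match(line) or not events:
        events.append(line)            # new event begins
    else:
        events[-1] += "\n" + line      # continuation of the previous event
print(len(events))  # 2: the full stack trace is one event
```

With the agent configured the same way, a Logs Insights `filter @message like /NullPointerException/` returns the whole trace in a single `@message` row.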
Trap 8: KMS encryption "just works"
The key policy must explicitly trust logs.<region>.amazonaws.com and constrain via the kms:EncryptionContext:aws:logs:arn condition. Without that policy, the agent cannot publish and the log group appears empty.
Trap 9: Subscription filter quota is unlimited
Each log group supports at most two simultaneous subscription filters. For more consumers, use Kinesis Data Streams as a multi-consumer fan-out point.
Trap 10: One-time S3 export covers ongoing archival
CreateExportTask is one-shot. For continuous archival, use a subscription filter → Firehose → S3 pipeline. Mixing the two is also valid: Firehose for new events, plus a one-time export for a backfill window.
FAQ — CloudWatch Logs and Logs Insights
Q1: Why do my CloudWatch Logs alarms based on a brand-new metric filter sit in INSUFFICIENT_DATA?
Two compounding causes. First, metric filters are forward-only — the filter does not retroactively scan log events written before the filter existed, so the metric has no historical data points. Second, the metric is only emitted when matching events arrive — for low-frequency patterns (a particular error type), there may genuinely be no data points yet within the alarm's evaluation window. The fix is twofold: wait for new events to arrive, and configure the alarm's treatMissingData to notBreaching for diagnostic alarms or breaching for absence-is-failure alarms. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html
Q2: What is the cleanest pattern for centralizing logs from many AWS accounts?
For real-time aggregation, use the cross-account subscription filter via destination pattern: the receiver account creates a Kinesis Data Stream and a CloudWatch Logs destination with an access policy listing source account IDs; each source account creates a subscription filter on its log groups pointing at the destination ARN. Events stream in real time to the receiver-account Kinesis stream, where Lambda or Firehose consumers process them. For batch / non-real-time, an alternative is exporting from each account to a centralized S3 bucket with cross-account bucket policy and Athena for queries. SOA-C02 favors the subscription-filter-with-destination pattern when the question emphasizes real-time or organizational governance, and the centralized S3 pattern when the question emphasizes long-term archival or cost. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CrossAccountSubscriptions.html
Q3: What is the difference between a subscription filter and a metric filter?
A metric filter evaluates each incoming log event against a pattern; on match, it publishes a numeric data point to a named CloudWatch metric. The downstream consumer is CloudWatch metrics and alarms. A subscription filter evaluates each incoming log event against a pattern; on match, it streams the entire event to a real-time consumer (Lambda, Kinesis Data Streams, Kinesis Data Firehose). The downstream consumer is a streaming pipeline. Metric filters answer "alarm when X happens N times per period"; subscription filters answer "react to each X event in real time". Both have the same pattern syntax; both are forward-only; both have per-log-group quotas (100 metric filters, 2 subscription filters). On SOA-C02, the giveaway phrase for metric filter is "alarm when…" and for subscription filter is "forward each…to Lambda/SIEM/OpenSearch".
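The contrast reduces to a toy model: the same pattern evaluation feeds two different consumers. The event stream here is invented:

```python
# Same incoming events, same pattern; different downstream artifacts.
events = ["ERROR a", "ok", "ERROR b"]
matches = [e for e in events if "ERROR" in e]

metric_data_point = len(matches)  # what a metric filter publishes (a count)
forwarded = list(matches)         # what a subscription filter delivers (the events)

print(metric_data_point)  # 2
print(forwarded)          # ['ERROR a', 'ERROR b']
```

The metric path loses the event content but is cheap to alarm on; the subscription path keeps every byte but incurs delivery and compute cost per event.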
Q4: How do I write a Logs Insights query that gives me the top N errors with their stack traces?
Build the query in two stages. Stage one finds the top N error message patterns by count; stage two pulls a representative event for each. The first query: filter @message like /ERROR/ or @message like /Exception/ | stats count() as cnt by @message | sort cnt desc | limit 10. Once you have the top messages, copy one of them and run a follow-up: fields @timestamp, @logStream, @message | filter @message = "<exact message>" | sort @timestamp desc | limit 5. The second query returns the most recent occurrences in full, including the multiline stack trace if the agent's multi_line_start_pattern was configured so each trace was ingested as a single event. If stack traces appear fragmented across multiple @message rows, the agent config is the root cause: fix the multiline pattern so future traces arrive whole (already-ingested events stay fragmented).
Q5: How do I avoid the Never Expire retention trap across an entire account?
Three tactics combined. First, explicitly set retention on every log group at creation — in CloudFormation set RetentionInDays; in the CLI pair create-log-group with put-retention-policy; in the CloudWatch agent config use retention_in_days. Second, enable the AWS Config managed rule cw-loggroup-retention-period-check with a maximum allowed retention; the rule flags any log group exceeding the threshold. Third, wire AWS Config non-compliance to an SSM Automation runbook that calls put-retention-policy to remediate automatically. The combination produces a self-healing account where no log group accumulates indefinitely. For one-time cleanup of an existing account, list non-compliant groups with aws logs describe-log-groups --query "logGroups[?retentionInDays==null].logGroupName" and bulk-set retention on each. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SettingLogRetention.html
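The one-time audit step can be simulated locally. The list below mimics the shape of `describe-log-groups` output with invented group names; the filter is the same logic as the `retentionInDays==null` JMESPath query:

```python
# Find log groups with no retention set (i.e., Never Expire).
log_groups = [
    {"logGroupName": "/app/web", "retentionInDays": 30},
    {"logGroupName": "/app/debug"},                      # key absent => Never Expire
    {"logGroupName": "/app/worker", "retentionInDays": None},
]
never_expire = [g["logGroupName"] for g in log_groups
                if g.get("retentionInDays") is None]
print(never_expire)  # ['/app/debug', '/app/worker']
```

Each name in the resulting list is a candidate for `put-retention-policy` in the bulk cleanup pass.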
Q6: When should I use Logs Insights versus exporting to S3 and using Athena?
Use Logs Insights when the data is within CloudWatch Logs retention (and that retention is appropriate for your investigation horizon), the query is exploratory and iterative, and time-to-result matters. Logs Insights is interactive, returns in seconds, and supports the full query language with auto-completion. Use Athena over S3 when the data is older than CloudWatch Logs retention, the query is repetitive (a daily compliance report), the data benefits from columnar format (Parquet via Firehose conversion) for cost-efficient analytics, or the consumer is a non-AWS BI tool. The cost crossover happens at high data volumes and long retention periods — at hundreds of GB per day for years, Athena over Glacier-tiered S3 dramatically beats CloudWatch Logs hot storage with Logs Insights. SOA-C02 typically frames the question around investigation horizon: hot debugging window → Logs Insights; long-term compliance → Athena.
Q7: Why does my CloudWatch agent fail to push logs even though the agent is running?
In order of likelihood: (1) the IAM instance profile lacks logs:CreateLogStream or logs:PutLogEvents — fix by attaching CloudWatchAgentServerPolicy; (2) the agent config references a file path that does not match where the application writes — fix by aligning the path and reloading config with amazon-cloudwatch-agent-ctl -a fetch-config; (3) the instance is in a private subnet with no internet path and no VPC interface endpoint for logs.<region>.amazonaws.com — fix by adding the interface endpoint with the appropriate security group; (4) a customer-managed KMS key is associated with the log group but the key policy lacks the CloudWatch Logs service principal grant or the kms:EncryptionContext:aws:logs:arn condition — fix in the key policy; (5) the agent's own log at /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log shows the exact error — always check this file when troubleshooting before guessing.
Q8: Can a single Logs Insights query span multiple AWS accounts?
Yes, with a setup. CloudWatch cross-account observability allows a monitoring account to query metrics, logs, and traces across linked source accounts. Once linked (one-click setup in the CloudWatch console under Settings → Monitoring account configuration), Logs Insights in the monitoring account shows source-account log groups in the picker; you can select up to 50 across any combination of linked source accounts. This replaces the older pattern of replicating logs into a central account first. SOA-C02 sometimes calls out this feature for "consolidate operational queries across an organization" scenarios. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html
Q9: How do I configure CloudTrail to flow into a CloudWatch log group for Logs Insights queries?
In the CloudTrail console, edit the trail's settings and enable CloudWatch Logs delivery; CloudTrail asks for a log group name (creates one if missing) and an IAM role it can assume to write events. Once the trail is forwarding, every management event (and data events if enabled) lands as a JSON log event in that log group. From there, all Logs Insights idioms apply: filter errorCode = "AccessDenied" | stats count() as cnt by userIdentity.arn, eventName | sort cnt desc finds the top denied API calls by principal. This pairs CloudTrail's audit role with Logs Insights' interactive query, an SOA-C02 favored security investigation pattern.
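The grouped count that query produces can be sketched locally; the sample records below are invented and heavily trimmed relative to real CloudTrail events:

```python
from collections import Counter

# Count AccessDenied events grouped by (principal ARN, API name),
# mirroring `stats count() by userIdentity.arn, eventName`.
records = [
    {"errorCode": "AccessDenied",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/dev"},
     "eventName": "PutObject"},
    {"errorCode": "AccessDenied",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/dev"},
     "eventName": "PutObject"},
    {"userIdentity": {"arn": "arn:aws:iam::111122223333:user/admin"},
     "eventName": "GetObject"},  # no errorCode => successful call, excluded
]
denied = Counter((r["userIdentity"]["arn"], r["eventName"])
                 for r in records if r.get("errorCode") == "AccessDenied")
print(denied.most_common(1))
```

The highest-count pair is the principal and API call to investigate first.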
Q10: What do I do when a log group has too many subscription filters required?
CloudWatch Logs caps each log group at two subscription filters. When the architecture genuinely needs more downstream consumers — say, OpenSearch for real-time dashboards, S3 for archival, Lambda for Slack alerting, and Splunk for the security team — the SOA pattern is to fan out via a single subscription filter to a Kinesis Data Stream, then attach multiple consumer applications (Lambda, Firehose, Kinesis Client Library apps) to the stream. Kinesis Data Streams supports multiple independent consumers without contention. The subscription-filter-to-Kinesis-Data-Stream pattern is also the right answer when ordering matters across consumers, and when one consumer outage should not block the others. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html
Further Reading and Related Operational Patterns
- What is Amazon CloudWatch Logs — User Guide
- CloudWatch Logs Concepts
- Working with Log Groups and Log Streams
- Change Log Data Retention in CloudWatch Logs
- Encrypt Log Data Using AWS KMS
- Creating Metric Filters
- Analyzing Log Data with CloudWatch Logs Insights
- CloudWatch Logs Insights Query Syntax
- Sample CloudWatch Logs Insights Queries
- Real-Time Processing of Log Data with Subscriptions
- Cross-Account Log Data Sharing with Subscriptions
- Exporting Log Data to Amazon S3
- CloudWatch Agent Configuration File for Logs
- Quick Start: Install the CloudWatch Agent on EC2
- CloudWatch Logs Quotas and Limits
- AWS SOA-C02 Exam Guide v2.3 (PDF)
Once log collection, querying, and routing are in place, the next operational layers are: CloudWatch Metrics, Alarms, and Dashboards for the metric side of the same observability fabric (alarms built on metric filters live here), CloudTrail and AWS Config for audit and configuration compliance signals — many of which forward into the same log groups for Logs Insights queries — EventBridge rules and SNS notifications for routing alarms and events into automated remediation pipelines, and EC2 Auto Scaling and ELB high availability for the workload tier whose application logs feed back into this entire pipeline.