VPC network troubleshooting is the signature discipline of SOA-C02. Task Statement 5.3 — "Troubleshoot network connectivity issues" — appears verbatim in the official Exam Guide v2.3 and exists nowhere else in the AWS associate exam family. SAA-C03 designs networks; DVA-C02 consumes networks; only SOA-C02 demands that the candidate diagnose a broken network with VPC Flow Logs in one window, ELB access logs in another, and Reachability Analyzer running in the background. Every scenario in TS 5.3 reads like a runbook entry: an EC2 cannot reach S3, an ALB returns intermittent 502s, a VPN tunnel shows UP but no traffic flows, a Reachability Analyzer report says "blocked" without saying why. VPC network troubleshooting is the one topic where SOA-C02 candidates either know the playbook or guess.
This guide walks the SysOps engineer through the full diagnostic stack: Flow Logs at the packet-metadata layer with their ACCEPT and REJECT actions, ELB access logs at the HTTP layer with request_processing_time and target_status_code, CloudFront access logs at the edge with x-edge-result-type, WAF web ACL logs for blocked-request investigation, Reachability Analyzer for proactive path verification, Network Access Analyzer for access-scope auditing, hybrid VPN tunnel and BGP debugging, Direct Connect virtual interface validation, and ALB target group health-check tuning. You will also see the recurring SOA-C02 scenario shapes: NACL ephemeral-port denies, security groups that reference SGs in a peered VPC, Flow Logs that do not capture link-local traffic, and ALB health checks where the timeout is greater than the interval. VPC network troubleshooting is operational detective work; this topic gives you the toolkit and the runbooks.
Why TS 5.3 Is the Signature Domain of SOA-C02
The official SOA-C02 Exam Guide v2.3 lists exactly four skills under TS 5.3: interpret VPC configurations including subnets, route tables, NACLs, and security groups; collect and interpret logs including VPC Flow Logs, ELB access logs, AWS WAF web ACL logs, and CloudFront logs; identify and remediate CloudFront caching issues; and troubleshoot hybrid and private connectivity issues. None of those four skills appears in any other associate exam. SAA-C03 stops at "configure"; SOA-C02 alone goes to "diagnose when configuration meets reality and reality wins."
At the SysOps tier the framing is forensic. SAA-C03 asks "which type of VPC endpoint should the architect choose for S3?" SOA-C02 asks "the application worked yesterday and stopped at 03:14 today; you have access to Flow Logs and an ALB access log bucket — what's your first move?" The answer is rarely "redesign the VPC". The answer is read the REJECT records in Flow Logs, correlate srcaddr and dstaddr with route tables and security groups, then either run Reachability Analyzer to confirm the hypothesis or jump to ELB target health to see whether the failure is at the LB layer. VPC network troubleshooting is the topic where every previous SOA-C02 topic plugs in: CloudWatch Logs Insights queries Flow Logs (Domain 1), an EventBridge rule on Flow Log REJECT spikes triggers SSM Automation (Domain 1.2 and 3.2), Config flags overly permissive security groups (Domain 4.1), and the ALB target health metric drives Auto Scaling decisions (Domain 2). Network troubleshooting threads through the entire exam.
- VPC Flow Log: a record of IP traffic metadata (5-tuple plus ACCEPT or REJECT decision plus byte and packet counts) captured at a VPC, subnet, or ENI scope and delivered to CloudWatch Logs, S3, or Kinesis Data Firehose.
- ACCEPT vs REJECT: the action recorded for each flow — ACCEPT means security groups and NACLs allowed the traffic, REJECT means at least one of them blocked it.
- 5-tuple: source IP, destination IP, source port, destination port, protocol — the unique identifier of a network flow.
- ELB access log: a per-request log line written by an ALB, NLB, or CLB to S3 every 5 minutes, containing client IP, target IP, processing times, status codes, and TLS metadata.
- Reachability Analyzer: a configuration analysis service that computes whether a packet from a source ENI can reach a destination ENI given the current VPC configuration, without sending any actual traffic.
- Network Access Analyzer: a configuration analysis service that audits which network paths exist in a VPC against an Access Scope you define (e.g., "no path from public internet to RDS").
- Target group health check: an ALB or NLB probe that determines whether a registered target receives traffic, configured with protocol, port, path, interval, timeout, healthy threshold, and unhealthy threshold.
- BGP session: the Border Gateway Protocol exchange between AWS and a customer gateway over Site-to-Site VPN or Direct Connect that propagates routes dynamically.
- VIF (Virtual Interface): a Direct Connect logical interface — public, private, or transit — that carries traffic over a Direct Connect physical connection.
- Idle timeout: the duration an ELB keeps an idle connection open before closing it; mismatch with the application's keep-alive timeout is the most common cause of intermittent 502s.
- Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html
VPC Network Troubleshooting in Plain Language
Network troubleshooting jargon piles up fast — five-tuples, ephemeral ports, BGP, idle timeouts. Three analogies make the constructs stick.
Analogy 1: The Detective Investigation
VPC network troubleshooting is detective work. VPC Flow Logs are the security camera footage — they record everyone who passed through the lobby, which direction, and whether the door buzzed them in (ACCEPT) or rejected the keycard (REJECT). The footage does not tell you why the keycard was rejected — it just shows the rejection happened — so you need to combine it with the building access list (security groups) and the floor-by-floor entry rules (NACLs) to deduce the cause. ELB access logs are the front-desk ledger: each visitor who reached the receptionist is logged with arrival time, who they wanted to see (target), how long the receptionist took to phone the office (processing time), and what response they got (status code). CloudFront logs are the branch-office ledger at every regional reception desk. WAF logs are the security guard's incident report — they list everyone who tried to enter and got turned away at the curb because the screening rule rejected them. Reachability Analyzer is the forensic walkthrough — the detective walks the entire path from front door to vault, opens every internal door, and reports the exact step where the path is blocked. The detective never assumes; the detective reads the logs, walks the path, and only then names the suspect.
Analogy 2: The Doctor's Differential Diagnosis
Network troubleshooting is medical differential diagnosis. The patient (application) presents with a symptom (intermittent 502 from the ALB). The doctor (SysOps engineer) does not jump to a treatment. Instead they run labs: the ELB access log shows target_status_code = 502 for 8 percent of requests, the target group health check reports 9 of 10 instances healthy, and CloudWatch shows TargetResponseTime p99 has spiked. The doctor builds a differential: target keep-alive shorter than ELB idle timeout (most common), security group blocking ephemeral return ports (less common), application crash-loop on one of ten targets (already shown by the health check), and database connection pool exhaustion (which would show as 504, not 502). Each hypothesis is testable with a different log layer. The doctor narrows the differential with data, not gut feel. The exam loves this pattern: it gives you symptoms and three log sources and asks for the first diagnostic step.
Analogy 3: The Electrician Tracing the Wire
Hybrid connectivity troubleshooting is wire-tracing. A VPN tunnel is a wire between the on-prem switch (customer gateway) and the AWS rack (virtual private gateway or transit gateway). The wire has multiple junction boxes: Phase 1 (IKE) authentication is the first junction, Phase 2 (IPSec SA) is the second, routing (static or BGP) is the third, and the on-prem firewall ACL is the fourth. The tunnel "going UP" means junctions 1 and 2 succeeded — the wire is electrified — but "no traffic" means junction 3 or 4 failed. The electrician does not replace the wire; they walk the junctions in order and find the open contact. Direct Connect adds another physical layer (the fiber cross-connect at the colo facility) and another routing layer (the BGP session per VIF). When packets do not flow, the systematic answer is "walk the junctions in order" — never "rebuild the tunnel".
For Flow Log and ELB log questions, the detective analogy frames the answer best: read the logs, then deduce. For ALB 502 vs 504 and target group health questions, the doctor's differential analogy is the cleanest mental model — symptom plus labs plus differential equals diagnosis. For VPN, BGP, and Direct Connect questions, the electrician analogy works because the failure modes are sequential. Use the matching analogy and the answer often becomes obvious without re-reading the question. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html
VPC Flow Logs Fundamentals: Scopes, Fields, and Destinations
Before you can troubleshoot with Flow Logs you need a precise mental model of how a Flow Log arrives, what it records, and what it deliberately omits.
Three scopes — VPC, subnet, ENI
A Flow Log can be enabled at three scopes:
- VPC level — captures traffic for every ENI in the VPC. Easiest to set up but produces the most data.
- Subnet level — captures every ENI in the subnet. Useful when only one tier (public, app, db) is under investigation.
- ENI level — the most surgical scope. Use when the investigation is narrowed to a single instance, NAT gateway ENI, RDS ENI, or VPC endpoint interface.
Multiple Flow Logs can target the same scope with different filters and destinations — for example, one Flow Log to S3 for archive and another to CloudWatch Logs for live Logs Insights queries.
Default vs custom format fields
The default Flow Log format records 14 fields: version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status. The action field is ACCEPT, REJECT, or a dash (written when log-status is NODATA or SKIPDATA). The log-status field is OK, NODATA (no traffic in the aggregation window), or SKIPDATA (the producer dropped records due to capacity).
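A default-format record can be turned into a usable dict with a few lines of Python — a minimal sketch (the helper name and the sample record values are invented for illustration):

```python
# Field names of the default (version 2, 14-field) Flow Log format.
FIELDS = [
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
]

def parse_flow_record(line: str) -> dict:
    """Parse one space-separated default-format Flow Log record."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

# Illustrative record (account ID and addresses are made up).
record = parse_flow_record(
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49761 443 6 "
    "10 8400 1620000000 1620000600 REJECT OK"
)
print(record["action"], record["dstport"])  # REJECT 443
```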
The custom format, introduced after the default, lets you choose from up to 30 fields, including:
- `vpc-id`, `subnet-id`, `instance-id` — the parent resource IDs, so you can join to other inventories.
- `tcp-flags` — a bitmask of flags seen during the aggregation window (FIN=1, SYN=2, RST=4; a SYN-ACK records as 18) for TCP handshake forensics.
- `type` — IPv4, IPv6, or EFA.
- `pkt-srcaddr`, `pkt-dstaddr` — the packet-level addresses inside the IP packet, distinct from `srcaddr`/`dstaddr` (which are the ENI-level addresses). This matters for NAT gateway and Transit Gateway flows, where the ENI sees one address and the packet payload carries another.
- `region`, `az-id`, `sublocation-type`, `sublocation-id` — placement metadata.
- `flow-direction` — `ingress` or `egress` from the ENI's perspective.
- `traffic-path` — an integer code for the egress path taken (for example, 1 = through another resource in the same VPC, 2 = through an internet gateway or gateway VPC endpoint, 3 = through a virtual private gateway).
Custom format is the SOA-C02 production standard. Always include vpc-id, instance-id, tcp-flags, pkt-srcaddr, pkt-dstaddr, and flow-direction at minimum.
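The `tcp-flags` bitmask decodes mechanically with the standard TCP flag bit values — a sketch (a flow that only ever shows SYN=2 is a handshake that never completed):

```python
# Standard TCP flag bit values; tcp-flags in a Flow Log record is the
# bitwise OR of the flags observed during the aggregation window.
TCP_FLAG_BITS = {"FIN": 1, "SYN": 2, "RST": 4, "PSH": 8, "ACK": 16, "URG": 32}

def decode_tcp_flags(value: int) -> list:
    """Return the flag names present in a tcp-flags bitmask."""
    return [name for name, bit in TCP_FLAG_BITS.items() if value & bit]

print(decode_tcp_flags(18))  # ['SYN', 'ACK'] -> a SYN-ACK was observed
print(decode_tcp_flags(2))   # ['SYN'] alone -> handshake never completed
print(decode_tcp_flags(4))   # ['RST'] -> connection reset
```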
Aggregation interval — 10 minutes default, 1 minute optional
Flow Logs aggregate flows over an interval before writing the record. The default is 10 minutes; the alternate is 1 minute, configured at Flow Log creation time. The 1-minute interval produces substantially more records but is effectively mandatory for live troubleshooting — you cannot wait ten minutes during an incident.
Three destinations — CloudWatch Logs, S3, Kinesis Data Firehose
- CloudWatch Logs — best for live Logs Insights queries during incidents. Higher per-GB cost than S3.
- S3 — best for long-term archive and bulk Athena queries. Supports Parquet format for cheaper Athena scans.
- Kinesis Data Firehose — best for streaming Flow Logs into a SIEM (Splunk, Datadog, OpenSearch) or a data lake with custom transformation Lambdas.
A common SOA-C02 pattern is dual delivery: one Flow Log to CloudWatch Logs at 1-minute aggregation for live troubleshooting, plus another to S3 in Parquet at 10-minute aggregation for archive and forensics.
What Flow Logs do NOT capture
This list is heavily tested:
- Traffic to and from `169.254.169.254` (the instance metadata service).
- Traffic to the Amazon DNS server (the `.2` address) when using the AWS-provided DNS — queries to a custom DNS resolver are captured.
- DHCP traffic.
- Traffic to the reserved IP of the default VPC router (the `.1` address).
- Mirrored traffic (that is what VPC Traffic Mirroring is for).
- Windows license activation traffic.
A SysOps engineer who builds an alarm on "no DNS traffic in Flow Logs" is hunting a ghost — the AWS DNS resolver flows are intentionally omitted.
- Default aggregation interval: 10 minutes.
- Alternate aggregation interval: 1 minute (custom-set at creation).
- Default format fields: 14.
- Maximum custom format fields: 30.
- Action values: `ACCEPT`, `REJECT` (a `-` appears when log-status is `NODATA` or `SKIPDATA`).
- Log-status values: `OK`, `NODATA`, `SKIPDATA`.
- Three scopes: VPC, subnet, ENI.
- Three destinations: CloudWatch Logs, S3, Kinesis Data Firehose.
- Not captured: 169.254.169.254 metadata, AWS DNS, DHCP, default router .1 address, license activation, mirrored traffic.
- Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html
Flow Log Analysis: Logs Insights and Athena Patterns
Capturing Flow Logs is half the job. The other half is reading them efficiently during an incident.
CloudWatch Logs Insights queries
For Flow Logs delivered to CloudWatch Logs, Logs Insights is the right tool. The fields are pre-parsed when you select the Flow Log format. Common Insights queries every SOA candidate should recognize:
fields @timestamp, srcaddr, dstaddr, dstport, action
| filter action = "REJECT"
| stats count(*) as denies by srcaddr, dstport
| sort denies desc
| limit 20
This produces "top 20 denied source-IP and destination-port pairs" over the selected time range — the first query during a connectivity incident. If a single ENI dominates the REJECTs to port 443, start with that instance's security group and the subnet's NACL egress rules; note that a missing route (for example, no S3 gateway endpoint) does not produce REJECT records, because REJECT reflects only security group and NACL decisions.
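The same filter-and-count logic can be reproduced offline over exported records — a sketch with invented sample data, equivalent to the Insights query above:

```python
from collections import Counter

# Sample parsed records; in practice these would come from a Flow Log export.
records = [
    {"srcaddr": "10.0.1.5", "dstport": "443", "action": "REJECT"},
    {"srcaddr": "10.0.1.5", "dstport": "443", "action": "REJECT"},
    {"srcaddr": "10.0.3.7", "dstport": "22",  "action": "REJECT"},
    {"srcaddr": "10.0.2.9", "dstport": "443", "action": "ACCEPT"},
]

# Equivalent of: filter action = "REJECT" | stats count(*) by srcaddr, dstport
denies = Counter(
    (r["srcaddr"], r["dstport"]) for r in records if r["action"] == "REJECT"
)
for (src, port), count in denies.most_common(20):
    print(src, port, count)
```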
fields @timestamp, srcaddr, dstaddr, bytes
| filter dstport = 443 and action = "ACCEPT"
| stats sum(bytes) as totalBytes by dstaddr
| sort totalBytes desc
| limit 10
This finds "top 443 destinations by byte volume" — the right query when the question is "why is NAT gateway data processing cost spiking" (the answer is usually a single application chatting with an external SaaS instead of using a VPC endpoint).
fields @timestamp, srcaddr, srcport, dstaddr, dstport, action
| filter dstport in [22, 3389] and action = "REJECT"
| stats count(*) as attempts by srcaddr
| sort attempts desc
This finds "SSH/RDP brute-force probes" — REJECTs to port 22 or 3389 from random sources, useful for security hardening.
Athena over S3
For long-term retention and forensic queries spanning days or weeks, deliver Flow Logs to S3 in Parquet format and query with Athena. The setup steps:
- Enable a Flow Log with destination S3, format custom, file format Parquet, hive-compatible partitioning enabled.
- Run `MSCK REPAIR TABLE flow_logs;` (or use partition projection) to register partitions.
- Query via Athena:
SELECT srcaddr, dstaddr, dstport, sum(bytes) AS total_bytes
FROM flow_logs
WHERE action = 'REJECT'
AND year = '2026' AND month = '04' AND day BETWEEN '24' AND '25'
AND vpc_id = 'vpc-0abc123'
GROUP BY srcaddr, dstaddr, dstport
ORDER BY total_bytes DESC
LIMIT 50;
Parquet plus partitioning by account-id/region/year/month/day reduces Athena scan cost by 10x to 100x compared to plain JSON or CSV.
On SOA-C02, when the scenario says "the on-call needs to see denied traffic in real time", the answer is Flow Logs to CloudWatch Logs at 1-minute aggregation, queried with Logs Insights. When the scenario says "the security team needs to investigate connections from 30 days ago", the answer is Flow Logs to S3 in Parquet, queried with Athena. These are the two canonical destinations and you should not mix them up. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-athena.html
Reading Flow Log Records: ACCEPT, REJECT, and What They Actually Mean
Interpreting Flow Log records is the SOA-C02 core skill. The trap is that ACCEPT does not mean "the application succeeded" and REJECT does not always mean "fix the security group".
ACCEPT does not mean success
ACCEPT means security groups and NACLs both allowed the packet. It does not mean the destination application accepted the connection. A TCP RST from the application after handshake will still appear as ACCEPT in Flow Logs because the security layer let the packet through. Application-layer failures (HTTP 502, database connection refused, TLS handshake error) are invisible to Flow Logs. For those you need ELB access logs, application logs, or VPC Traffic Mirroring.
REJECT means at least one of {SG, NACL} denied
REJECT does not say which of security group or NACL denied. The triage process:
- Check security group rules first: SGs are stateful, so a rule that allows traffic in one direction automatically allows the return traffic. If the SG looks correct, move on.
- Check NACL rules next: NACLs are stateless, so both directions need explicit rules, including the ephemeral return ports (in practice allow 1024–65535; modern Linux defaults to 32768–60999 and Windows Server to 49152–65535). The most common production NACL bug is allowing inbound port 443 but not allowing outbound ephemeral ports for the return path.
- Use Reachability Analyzer to confirm: it explicitly tells you which rule (SG, NACL, route table, or other) blocks the path.
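The ephemeral-port trap in the NACL step can be illustrated with a toy rule evaluator — a hypothetical model of numbered allow/deny rules, not an AWS API:

```python
# Toy model: NACL rules as (rule_number, port_from, port_to, action),
# evaluated lowest rule number first, like real NACLs.
def nacl_allows(rules, port):
    for _num, lo, hi, action in sorted(rules):
        if lo <= port <= hi:
            return action == "allow"
    return False  # implicit deny (the final "*" rule)

inbound = [(100, 443, 443, "allow")]
outbound = [(100, 443, 443, "allow")]  # bug: no ephemeral port range

# An HTTPS client connects in on 443; the stateless NACL must also allow
# the reply OUT to the client's ephemeral source port (e.g., 51515).
print(nacl_allows(inbound, 443))     # True  -> request enters
print(nacl_allows(outbound, 51515))  # False -> reply dropped, logged as REJECT

outbound_fixed = outbound + [(110, 1024, 65535, "allow")]
print(nacl_allows(outbound_fixed, 51515))  # True -> return path fixed
```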
Direction matters
A REJECT with flow-direction = ingress means the inbound rule denied the packet. A REJECT with flow-direction = egress means the outbound rule denied. Without the custom-format flow-direction field you must infer from the relationship between srcaddr, dstaddr, and the ENI's primary IP — adding flow-direction to the Flow Log saves an analysis step.
pkt-srcaddr vs srcaddr — the NAT gateway / TGW gotcha
The default srcaddr and dstaddr are the addresses as seen by the ENI. For a NAT gateway flow, the ENI sees the NAT gateway's address, not the originating instance's private IP. The custom-format pkt-srcaddr and pkt-dstaddr fields show the packet-level addresses, which preserve the original sender. For Transit Gateway and VPC peering flows the same principle applies. SOA-C02 has tested this directly: "the security team can see the NAT gateway hits S3 a million times but cannot see which instance is responsible — what's the fix?" Answer: add pkt-srcaddr to the Flow Log custom format.
Flow Logs tell you the packet was denied; they do not tell you whether the security group, the NACL, the route table, or another control was responsible. To get the specific blocker you must run Reachability Analyzer on the same source-destination pair, which explicitly identifies the offending rule. SOA-C02 expects you to know that Flow Logs are reactive (what happened) and Reachability Analyzer is proactive (what would happen and why). Reference: https://docs.aws.amazon.com/vpc/latest/reachability/what-is-reachability-analyzer.html
ELB Access Logs: Format, Status Codes, and Timing Fields
When the symptom is HTTP-layer (502, 504, slow response, TLS handshake failure), Flow Logs cannot help. You need ELB access logs.
Enabling ELB access logs
Access logs are off by default. Enable per load balancer with an S3 bucket destination. Logs are written every 5 minutes for ALB/NLB and every 5 or 60 minutes for CLB. The S3 bucket must have a bucket policy granting the regional ELB log delivery account s3:PutObject permission — a misconfigured bucket policy is the most common reason logs do not appear.
ALB access log fields (key ones for troubleshooting)
The ALB access log line is space-separated with these critical fields:
- `type` — `http`, `https`, `h2`, `grpcs`, `ws`, `wss`.
- `time` — request timestamp in ISO 8601.
- `elb` — load balancer ARN segment.
- `client:port` — client IP and source port.
- `target:port` — backend target IP and port.
- `request_processing_time` — seconds from the ELB receiving the request to sending it to the target. Should be tiny (< 1 ms); high values mean the ELB is queueing.
- `target_processing_time` — seconds from the ELB sending the request to receiving the response. The application's processing time; high values point to a slow backend.
- `response_processing_time` — seconds from the ELB receiving the response to sending it to the client. Tiny; high values mean network egress contention.
- `elb_status_code` — what the ELB returned to the client.
- `target_status_code` — what the target returned to the ELB.
- `received_bytes`, `sent_bytes` — request and response sizes.
- `request` — method, URI, protocol.
- `user_agent`, `ssl_cipher`, `ssl_protocol` — client metadata.
- `target_group_arn`, `trace_id`, `domain_name`, `chosen_cert_arn`, `matched_rule_priority` — routing metadata.
- `error_reason` — populated when the ELB returns 4xx/5xx; values like `LambdaThrottling`, `Target_FailedHealthChecks`, `Target_ConnectionError` name the cause.
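Access log lines are space-separated with quoted compound fields, so Python's `shlex` splits them cleanly. The line below is a shortened, illustrative fragment — not a complete real log entry:

```python
import shlex

# Truncated, invented ALB log fragment: type, time, elb, client, target,
# three processing times, elb_status_code, target_status_code, request.
line = ('https 2026-04-24T03:14:15.926Z app/my-alb/50dc6c495c0c9188 '
        '203.0.113.9:51515 10.0.2.9:8080 0.001 2.452 0.000 504 - '
        '"GET https://example.com:443/api HTTP/1.1"')

fields = shlex.split(line)  # shlex keeps the quoted request as one token
request_t, target_t, response_t = map(float, fields[5:8])
elb_status, target_status = fields[8], fields[9]

print(target_t, elb_status, target_status)  # 2.452 504 -
# High target_processing_time plus 504 with no target status code:
# the backend, not the load balancer, is the bottleneck.
```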
NLB access logs
NLBs only support access logs for TLS listeners (TLS termination at the NLB). Plain TCP/UDP listeners do not produce access logs because the NLB does not parse the application protocol. For pure TCP NLBs, use Flow Logs at the target ENI scope.
Common SOA-C02 questions on ELB access logs
- "ALB returns 502 and the team needs to find which target is failing." → Filter access logs where `elb_status_code = 502`, then `stats count(*) by target:port` to find the offending target IP.
- "ALB returns 504 — what does it mean?" → The target did not respond before the idle timeout expired; `target_processing_time` will sit at the timeout value. Increase the idle timeout (default 60 seconds) or fix the slow backend.
- "Latency p99 is high but the team does not know whether it is the LB or the app." → Sum `request_processing_time + target_processing_time + response_processing_time`; a high `target_processing_time` isolates the application.
- `elb_status_code` — the HTTP code the ELB returned to the client. This is what the user sees.
- `target_status_code` — the HTTP code the backend target returned to the ELB. This is what the application emitted.
- When `elb_status_code = 502` and `target_status_code = "-"` (empty), the ELB never got a response from the target (connection failure, TLS error, or target not even tried).
- When `elb_status_code = 502` and `target_status_code = 502`, the target itself returned 502 (the backend's upstream is failing, or the application itself returned that code).
- When `elb_status_code = 504` and `target_status_code = "-"`, the ELB gave up waiting for the target — `target_processing_time` will be at the idle timeout.
- Reference: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html
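The status-code decision table above reduces to a small function — a sketch with illustrative wording, not an official AWS diagnostic:

```python
def diagnose(elb_status: int, target_status: str) -> str:
    """Map the (elb_status_code, target_status_code) pair to a likely cause."""
    if elb_status == 502 and target_status == "-":
        return "ELB never got a response: connection or TLS failure to target"
    if elb_status == 502 and target_status == "502":
        return "Target itself returned 502: the backend's upstream is failing"
    if elb_status == 504 and target_status == "-":
        return "ELB gave up waiting: target exceeded the idle timeout"
    return "Compare elb_status_code with target_status_code field by field"

print(diagnose(502, "-"))
print(diagnose(504, "-"))
```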
ALB Target Group Unhealthy Targets — The Health Check Playbook
When ALB target health is the symptom, the playbook is mechanical. Walk it in order.
Health check parameters
ALB target group health checks are configured per target group with these parameters (and their defaults):
- Protocol — HTTP, HTTPS, gRPC. Must match what the target listens on for the health-check endpoint.
- Port — `traffic-port` (use the registered port) or an override port. Override is right when the health endpoint runs on a separate management port.
- Path — for HTTP/HTTPS, the URI to GET. Default `/`.
- HealthCheckIntervalSeconds — interval between checks. Default 30 seconds, range 5–300.
- HealthCheckTimeoutSeconds — how long the LB waits for a response. Default 5 seconds, range 2–120.
- HealthyThresholdCount — consecutive successful checks to declare healthy. Default 5, range 2–10.
- UnhealthyThresholdCount — consecutive failed checks to declare unhealthy. Default 2, range 2–10.
- Matcher — HTTP codes considered healthy. Default `200`. Can be `200-399`, `200,202`, etc.
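A quick validator for these parameters — a sketch using the documented ranges (the function name is made up):

```python
def validate_health_check(interval: int, timeout: int,
                          healthy: int, unhealthy: int) -> list:
    """Check target group health check parameters against documented ranges."""
    problems = []
    if not 5 <= interval <= 300:
        problems.append("interval must be 5-300 seconds")
    if not 2 <= timeout <= 120:
        problems.append("timeout must be 2-120 seconds")
    if timeout >= interval:
        problems.append("timeout must be strictly less than interval")
    if not 2 <= healthy <= 10:
        problems.append("healthy threshold must be 2-10")
    if not 2 <= unhealthy <= 10:
        problems.append("unhealthy threshold must be 2-10")
    return problems

print(validate_health_check(30, 5, 5, 2))  # [] -> the ALB defaults pass
print(validate_health_check(5, 10, 5, 2))  # flags timeout >= interval
```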
The eight diagnostic steps
When targets are unhealthy, walk the list:
- Confirm the target is running and the application is listening. SSH or Session Manager into the instance and `curl http://localhost:8080/health` to verify the app responds.
- Confirm the security group on the target allows inbound from the ALB security group. This is the most common cause. The target's SG must allow the health check port from the ALB's SG (not from `0.0.0.0/0` — that works but is wrong).
- Confirm the health check port matches the registered port unless an override is configured deliberately.
- Confirm the path returns 200 (or whatever the Matcher specifies). If the app's `/health` returns 401 because it expects auth, the LB sees it as unhealthy.
- Confirm timeout is less than interval. If timeout (5s) is greater than interval (3s), checks queue and the target oscillates. AWS validates this — but candidates are tested on the relationship.
- Confirm the unhealthy threshold accounts for legitimate slow starts. A web app that takes 60 seconds to warm up needs `HealthCheckGracePeriodSeconds` on the ASG side (longer grace) plus an unhealthy threshold high enough not to kill the target during boot.
- Confirm subnet / route table allows ALB-to-target reachability. Cross-AZ is fine; cross-VPC requires VPC peering and matching route tables on both sides.
- Confirm the target's response body is acceptable. Some apps return 200 with a body indicating a soft failure ("status:degraded"). The LB sees 200 and marks healthy. Use a Matcher that requires the right code, and have the app return 503 when degraded.
Reason codes
The ALB console (and the Reason field of the DescribeTargetHealth API) returns codes like:
- `Target.FailedHealthChecks` — the target failed health checks (the generic case).
- `Target.Timeout` — the target did not respond within `HealthCheckTimeoutSeconds`.
- `Target.ResponseCodeMismatch` — the target returned a code outside the Matcher.
- `Target.NotInUse` — the target group is not associated with a load balancer.
- `Target.NotRegistered` — the target was deregistered.
- `Target.HealthCheckDisabled` — health checks are not enabled for the target group.
- `Elb.InitialHealthChecking` — checks are still running for the first time after registration.
These codes appear on SOA-C02 directly: a question may give you Target.Timeout and ask for the most likely fix (increase HealthCheckTimeoutSeconds or fix the slow application).
A common configuration mistake: someone sets HealthCheckIntervalSeconds = 5 and HealthCheckTimeoutSeconds = 10. The console will reject this (timeout must be less than interval), but candidates encountering legacy CLBs or NLBs may see a similar antipattern and not recognize the bug. The rule: timeout is always strictly less than interval. The corollary on SOA-C02: when an ALB target oscillates between healthy and unhealthy, suspect timeout-to-interval ratio and the application's startup time relative to the unhealthy threshold. Reference: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
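The oscillation math is back-of-envelope arithmetic worth internalizing with the ALB defaults — a sketch:

```python
# ALB defaults: 30 s interval, 2 consecutive failures to mark unhealthy,
# 5 consecutive successes to mark healthy again.
interval, unhealthy_threshold, healthy_threshold = 30, 2, 5

time_to_unhealthy = interval * unhealthy_threshold  # worst-case detection
time_to_healthy = interval * healthy_threshold      # worst-case recovery

print(time_to_unhealthy, "seconds to mark a dead target unhealthy")
print(time_to_healthy, "seconds for a recovered target to take traffic")

# A slow-starting app that fails its first checks during a 60 s boot can be
# flagged before it is ready; raise the unhealthy threshold or extend
# HealthCheckGracePeriodSeconds on the ASG.
boot_time = 60
print(boot_time >= time_to_unhealthy)  # True -> boot overlaps detection window
```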
Reachability Analyzer — Proactive Path Verification
VPC Reachability Analyzer computes, on demand, whether a packet from a source ENI (or instance, internet gateway, transit gateway attachment, VPN gateway) can reach a destination ENI given the current VPC configuration. It does not send any actual traffic; it analyzes the configuration graph.
What it analyzes
For a given source, destination, protocol, and port, Reachability Analyzer evaluates:
- Source and destination security group rules.
- Network ACL rules on every subnet in the path.
- Route tables on every subnet in the path.
- Internet gateway, NAT gateway, VPC endpoint, VPC peering, Transit Gateway, virtual private gateway routing.
- Prefix lists and elastic load balancer routing logic.
The output is either Reachable with a step-by-step path (subnet by subnet, hop by hop, naming each ENI and rule that allowed traffic) or Not reachable with explanations that name the specific configuration object blocking the path — for example, the denying security group rule (with the SG ID), the denying NACL rule (with the NACL ID), or the route table that lacks a route to the destination.
When to use Reachability Analyzer
- Pre-deployment validation — before launching a new microservice, verify the ALB can reach the new target group and the target can reach the database.
- Incident triage — when Flow Logs show REJECT but you need the specific rule to fix.
- Compliance — confirm "no path exists from public internet to RDS" at audit time.
- Change management — re-run after every NACL or route table change to detect regressions.
Cost
Reachability Analyzer charges per analysis (currently $0.10 per analysis). For a CI/CD pipeline that runs hundreds of analyses per day this adds up; for incident triage it is negligible.
Limits
Reachability Analyzer does not reason about:
- Application-layer security (WAF rules, application code returning 403). It is purely a network-layer service.
- DNS resolution failures. It analyzes IP routing only.
- Cross-Region paths. Source and destination must be in the same Region.
- Asymmetric routing intentionally introduced by transit firewalls. It assumes return-path is symmetric.
For application-layer issues use ELB access logs and WAF logs. For DNS issues use Route 53 Resolver query logging.
On SOA-C02, when a scenario describes "the team has spent two hours staring at security groups and route tables to understand why packets are dropped" — the answer is run Reachability Analyzer on the source-destination pair. It returns the exact blocker in seconds. Candidates who pick "review every NACL by hand" lose easy points. Reference: https://docs.aws.amazon.com/vpc/latest/reachability/what-is-reachability-analyzer.html
Network Access Analyzer — Access Scope Auditing
Network Access Analyzer is the audit-side complement to Reachability Analyzer. Where Reachability Analyzer answers "can A reach B?", Network Access Analyzer answers "is there any path from set X to set Y, given the entire VPC?".
Access scopes
You define an Access Scope in JSON specifying:
- Source — what set of resources or addresses the analysis treats as the origin (e.g., "internet gateway", "any address in 0.0.0.0/0").
- Destination — what set the analysis treats as the target (e.g., "any RDS instance", "subnets tagged tier=db").
- MatchPaths / ExcludePaths — additional path constraints.
Run the access scope and Network Access Analyzer enumerates every path that matches. For "no path from internet to db tier" you expect an empty result; any non-empty result names the offending route, SG, or NACL.
Bundled access scopes
AWS provides predefined scopes for common audit questions:
- Internet Inbound — what internet sources can reach which internal resources.
- Internet Outbound — which internal resources can reach the internet.
- Internal Network Inbound — paths from on-prem (over VPN/DX) to AWS resources.
- Internal Network Outbound — paths from AWS to on-prem.
- Trusted Resources — define a trusted set and find all paths from non-trusted to trusted.
Use cases
- Compliance audit — prove "no internet ingress to PCI scope" by running the bundled Internet Inbound scope filtered to PCI tags and verifying the result is empty.
- Pre-merger network review — when adding a new VPC peering or Transit Gateway attachment, audit what new paths it creates.
- Continuous compliance — schedule Network Access Analyzer via Lambda and EventBridge to detect regressions when a developer adds a wide SG rule.
Reachability Analyzer vs Network Access Analyzer — the cleanest mental separation
| Question | Use |
|---|---|
| Can A reach B right now? | Reachability Analyzer |
| Is there any path from any-internet to my-RDS? | Network Access Analyzer |
| Why is A unable to reach B? | Reachability Analyzer (it names the blocking rule) |
| Audit: did we open any new internet path this quarter? | Network Access Analyzer (run Access Scope, diff results) |
The two services sound similar and are commonly confused on SOA-C02. Reachability Analyzer takes a specific source ENI and specific destination ENI and computes that one path. Network Access Analyzer takes an Access Scope (a set of sources and a set of destinations) and enumerates all paths between them. For a single broken connection, use Reachability Analyzer. For an audit question across the whole VPC, use Network Access Analyzer. Reference: https://docs.aws.amazon.com/vpc/latest/network-access-analyzer/what-is-network-access-analyzer.html
CloudFront Logs and Cache Issue Diagnosis
CloudFront has two logging tiers: standard access logs (delivered to S3 every few minutes, all fields, low cost) and real-time logs (streamed via Kinesis Data Streams, configurable subset of fields, higher cost, sub-second latency).
Standard access log fields (key ones)
- `date`, `time`, `x-edge-location` — when and where (e.g., `IAD89` = Ashburn POP).
- `c-ip` — viewer client IP.
- `cs-method`, `cs-uri-stem`, `cs-uri-query` — request line.
- `cs(Host)`, `cs(User-Agent)`, `cs(Referer)` — request headers.
- `sc-status` — status code returned to the viewer.
- `sc-bytes`, `cs-bytes` — response and request size.
- `time-taken` — total time at the edge to fully serve the response.
- `x-edge-result-type` — the outcome at the edge. Values: `Hit` — served from edge cache. `RefreshHit` — cache had a stale copy, revalidated with origin, served. `Miss` — cache did not have it, fetched from origin. `LimitExceeded`, `CapacityExceeded` — backpressure. `Error` — error returned (look at `sc-status`). `Redirect` — CloudFront returned a redirect.
- `x-edge-detailed-result-type` — finer detail (e.g., `OriginShieldHit`, `MissGeneratedResponse`).
- `x-edge-response-result-type` — the result type from the response side.
- `cs-protocol`, `ssl-protocol`, `ssl-cipher` — TLS metadata.
Common cache issue diagnoses
- "Stale content served to users" — origin updated but viewers see the old version. Cause: `Cache-Control: max-age` is high (or the default TTL on the cache behavior is high) and you have not invalidated. Fix: create an invalidation (`/path/*` pattern) and shorten the origin's `Cache-Control` for that path. Or use a versioned URL (`/v2/asset.js`) so the new URL is a fresh cache key.
- "Cache hit ratio low" — `x-edge-result-type` heavily `Miss`. Causes: query string variations not normalized in the cache policy, cookies forwarded to origin (each unique cookie value creates a new cache key), `Cache-Control: no-store` from the origin. Fix: use a cache policy that normalizes query strings to only the ones that matter, and configure the origin not to send `no-store` for cacheable assets.
- "Some users see content from a different region" — by design; CloudFront serves from the nearest POP. The origin response should not depend on viewer location unless it uses the `CloudFront-Viewer-Country` request header.
- "504 from CloudFront" — the origin response time exceeded the origin response timeout (default 30 seconds, configurable up to 60 for ALB origins, longer for custom origins). Fix: speed up the origin or increase the timeout.
Real-time logs
For incidents where 5-minute log delivery is too slow, configure a real-time log configuration on the cache behavior. Pick the fields, the sampling rate (1–100 percent), and the Kinesis Data Stream destination. Real-time logs land in Kinesis within seconds and can be consumed by Lambda, Kinesis Data Analytics, or OpenSearch for sub-minute incident triage.
For SOA-C02, when troubleshooting CloudFront, the first column to look at is always x-edge-result-type. It immediately tells you whether the request was a hit, a miss, a refresh hit, or an error — which determines the next layer to investigate. A high miss ratio means cache policy work; high Error with sc-status = 502/504 means origin work; high RefreshHit is normal and expected. Reference: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html
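As a concrete illustration of triaging by x-edge-result-type, here is a minimal Python sketch that tallies result types and computes the cache hit ratio from a standard access log. The sample is truncated to five columns for readability (real standard logs carry many more tab-separated fields), but the parsing keys off the `#Fields` header, so the approach carries over.

```python
from collections import Counter

# Truncated, invented sample; real logs have many more tab-separated columns,
# but the #Fields header always names them, so we look columns up by name.
SAMPLE_LOG = (
    "#Version: 1.0\n"
    "#Fields: date time x-edge-location sc-status x-edge-result-type\n"
    "2024-05-01\t12:00:00\tIAD89\t200\tHit\n"
    "2024-05-01\t12:00:01\tIAD89\t200\tRefreshHit\n"
    "2024-05-01\t12:00:02\tIAD89\t200\tMiss\n"
    "2024-05-01\t12:00:03\tIAD89\t502\tError\n"
)

def result_type_counts(log_text):
    """Tally x-edge-result-type values, locating the column via the #Fields header."""
    fields = None
    counts = Counter()
    for line in log_text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]          # column names after the "#Fields:" token
        elif line and not line.startswith("#") and fields:
            row = dict(zip(fields, line.split("\t")))
            counts[row["x-edge-result-type"]] += 1
    return counts

def cache_hit_ratio(counts):
    """Hit and RefreshHit both count as served-from-cache."""
    total = sum(counts.values())
    return (counts["Hit"] + counts["RefreshHit"]) / total if total else 0.0
```

With the four sample rows above, the ratio comes out to 0.5 — a shape that would send you straight to cache policy review.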
AWS WAF Web ACL Logs
WAF logs every request a Web ACL evaluates, with the rule that matched and the action taken. The destination can be CloudWatch Logs, S3, or Kinesis Data Firehose.
Key fields
- `timestamp`, `webaclId`, `terminatingRuleId`, `terminatingRuleType`, `action` (`ALLOW`, `BLOCK`, `COUNT`, `CAPTCHA`, `CHALLENGE`).
- `httpRequest` — block containing client IP, country, headers, URI, args.
- `ruleGroupList` — every rule group evaluated, with the action taken.
- `nonTerminatingMatchingRules` — rules that matched but only set Count, not Block.
Diagnostic patterns
- "Legitimate users are blocked by WAF" — query logs for `action = BLOCK` and the user's IP, find the `terminatingRuleId`. If it is an AWS managed rule (e.g., `AWSManagedRulesCommonRuleSet` / `SizeRestrictions_BODY`), evaluate whether to scope-down the rule (exclude specific paths) or set it to Count first to measure impact.
- "Rate-based rule blocking too aggressively" — adjust the limit, scope to specific URI patterns, or add an IP set exception for known partners.
- "Bots bypassing WAF" — switch the suspect rule from Block to Challenge or CAPTCHA, monitor the result in logs.
- "Need to find SQL injection attempts" — query for `terminatingRuleId` matching `SQLiRuleSet`, group by `httpRequest.clientIp`.
WAF logs vs Flow Logs
Flow Logs see only the 5-tuple. WAF logs see the full HTTP request including headers, query string, and body samples. For Layer 7 attacks (SQLi, XSS, application abuse), only WAF logs help.
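The first diagnostic pattern above — finding which rule is doing the blocking — can be sketched in a few lines, assuming the WAF log records have already been fetched as JSON lines. The records here are trimmed, invented samples; real entries carry many more fields.

```python
import json
from collections import Counter

# Trimmed, invented WAF log records — only the fields this triage uses.
SAMPLE_LOGS = [
    '{"action": "BLOCK", "terminatingRuleId": "SizeRestrictions_BODY", '
    '"httpRequest": {"clientIp": "203.0.113.10", "uri": "/upload"}}',
    '{"action": "ALLOW", "terminatingRuleId": "Default_Action", '
    '"httpRequest": {"clientIp": "198.51.100.7", "uri": "/"}}',
    '{"action": "BLOCK", "terminatingRuleId": "SizeRestrictions_BODY", '
    '"httpRequest": {"clientIp": "203.0.113.10", "uri": "/upload"}}',
]

def blocks_by_rule(log_lines):
    """Count BLOCK actions per terminating rule — the first step when users
    report false positives: the top rule is the scope-down candidate."""
    counts = Counter()
    for line in log_lines:
        rec = json.loads(line)
        if rec["action"] == "BLOCK":
            counts[rec["terminatingRuleId"]] += 1
    return counts
```

A dominant managed-rule ID in the output is the signal to either scope it down or flip it to Count and measure.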
Hybrid Connectivity Troubleshooting: Site-to-Site VPN
A Site-to-Site VPN connects an on-prem network to a VPC via two redundant IPsec tunnels (per VPN connection) terminated at a virtual private gateway (VGW) or transit gateway (TGW).
Tunnel state machine
A tunnel goes through phases:
- DOWN — no IPsec association.
- Phase 1 (IKE) negotiation — authentication and key exchange. Common failures: pre-shared key mismatch, IKE version mismatch (v1 vs v2), encryption/integrity algorithm mismatch, lifetime mismatch. AWS publishes the supported algorithm sets in the tunnel options.
- Phase 2 (IPsec SA) establishment — the actual data SA. Common failures: PFS group mismatch, encryption algorithm mismatch, traffic selector (interesting traffic) mismatch.
- UP — the tunnel is established. Traffic can now flow if routing is correct.
The aws ec2 describe-vpn-connections API returns each tunnel's Status and StatusMessage. The most informative diagnostic strings are IKE_PHASE1_DOWN, IPSEC_PHASE2_DOWN, and the negotiation parameter mismatch messages.
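The telemetry walk can be scripted. A minimal sketch, assuming a response shaped like `describe-vpn-connections` output, where the `VgwTelemetry` list carries per-tunnel `Status` and `StatusMessage` (the connection ID and IPs below are made up):

```python
# Trimmed, illustrative shape of `aws ec2 describe-vpn-connections` output.
RESPONSE = {
    "VpnConnections": [{
        "VpnConnectionId": "vpn-0abc",
        "VgwTelemetry": [
            {"OutsideIpAddress": "203.0.113.1", "Status": "UP", "StatusMessage": ""},
            {"OutsideIpAddress": "203.0.113.2", "Status": "DOWN",
             "StatusMessage": "IPSEC IS DOWN"},
        ],
    }]
}

def down_tunnels(response):
    """Return (connection_id, tunnel_ip, message) for every tunnel not reporting UP —
    the StatusMessage is where Phase 1 / Phase 2 failures are named."""
    bad = []
    for conn in response["VpnConnections"]:
        for t in conn["VgwTelemetry"]:
            if t["Status"] != "UP":
                bad.append((conn["VpnConnectionId"], t["OutsideIpAddress"],
                            t["StatusMessage"]))
    return bad
```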
Tunnel UP but no traffic — the routing problem
A tunnel UP means Phase 1 and Phase 2 succeeded. If traffic still does not flow, the cause is one of:
- VPC route table does not have a route for the on-prem CIDR pointing to the VGW or TGW attachment. Add the route (or enable route propagation for dynamic VPN).
- On-prem route table does not have a route for the VPC CIDR pointing to the customer gateway IP. The on-prem networking team's responsibility.
- Customer gateway firewall drops the traffic before it reaches the IPsec interface. Verify ACL allows the AWS-side CIDR.
- NACL on the VPC subnet blocks the inbound packet. The VGW route delivers the packet to the subnet; NACLs still apply.
- Security group on the destination instance blocks the inbound packet from the on-prem CIDR.
- BGP session DOWN (if dynamic routing is configured) — Phase 2 may be UP without a BGP session, in which case AWS does not learn on-prem routes. Check `aws ec2 describe-vpn-connections` for `BGP_STATUS = ESTABLISHED`.
- Asymmetric routing through two tunnels — packets go out tunnel 1 and return on tunnel 2; some on-prem firewalls drop asymmetric flows.
BGP session diagnosis
For dynamic routing, BGP is the protocol that AWS and the customer gateway use to advertise routes. The BGP session is separate from the IPsec tunnel — Phase 2 can be UP and BGP DOWN. Common BGP issues:
- ASN mismatch — customer gateway configured with the wrong AWS ASN (default 64512 for VGW, configurable) or wrong on-prem ASN.
- Hold timer mismatch — usually fine with defaults, but custom values must agree.
- Authentication mismatch — MD5 password set on one side only.
- Route advertisement filtering — on-prem advertises only specific prefixes; AWS does not learn the full on-prem network.
The CloudWatch metric TunnelState (1 = UP, 0 = DOWN) and TunnelDataIn/TunnelDataOut byte counters per tunnel are essential for monitoring.
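A flap pattern shows up as repeated UP-to-DOWN transitions in the `TunnelState` time series rather than a single sustained state. A toy sketch of spotting it (the sample series is invented):

```python
def count_flaps(samples):
    """Count UP->DOWN transitions in a TunnelState series (1 = UP, 0 = DOWN).
    A nonzero count over a short window means the tunnel is flapping,
    not merely down."""
    return sum(1 for prev, cur in zip(samples, samples[1:])
               if prev == 1 and cur == 0)

# Invented 7-datapoint series: the tunnel dropped twice.
series = [1, 1, 0, 1, 0, 0, 1]
```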
A persistent SOA-C02 distractor: "the VPN tunnel state is UP but the application cannot reach on-prem servers — what's the cause?" The trap is to pick "Phase 1 mismatch" or "PSK wrong" — these would prevent the tunnel from being UP at all. The right answer is routing: the VPC route table is missing the on-prem CIDR route, or the on-prem firewall is dropping the inbound traffic, or BGP is DOWN so routes are not learned. Walk the routing layer, not the IPsec layer. Reference: https://docs.aws.amazon.com/vpn/latest/s2svpn/VPNTroubleshooting.html
Hybrid Connectivity Troubleshooting: Direct Connect
Direct Connect (DX) replaces VPN with a private fiber connection between the customer's network and an AWS Direct Connect location. Troubleshooting DX layers physical, link, and routing concerns.
The three layers
- Physical layer — the fiber cross-connect at the colocation facility. State is reported in the AWS console as `connectionState` (`pending`, `available`, `down`, etc.) and by the partner DC. Common problems: cross-connect not yet patched, optic Tx/Rx light levels out of spec, MTU mismatch.
- Link layer — the BGP-over-Ethernet handshake on each VIF. State is reported as `bgpStatus` per VIF. Common problems: VLAN ID mismatch, BGP ASN mismatch, BGP MD5 password mismatch.
- Routing layer — even with BGP UP, traffic only flows when AWS-side route tables and on-prem route tables agree.
VIF types
- Public VIF — used to reach AWS public services (S3, DynamoDB) over DX. Advertises customer-owned public prefixes.
- Private VIF — connects DX to a single VPC's virtual private gateway. One VPC per private VIF.
- Transit VIF — connects DX to a Direct Connect Gateway, which fans out to multiple Transit Gateways and many VPCs across regions.
CloudWatch metrics for DX
ConnectionState, ConnectionBpsEgress, ConnectionBpsIngress, ConnectionLightLevelTx, ConnectionLightLevelRx, ConnectionPpsEgress, ConnectionPpsIngress, ConnectionErrorCount. Light levels are particularly informative — degraded fiber or dirty connectors show up as drifting Tx/Rx levels before connection drop.
Common DX problems
- VIF stuck in `pending` — the customer router has not yet established BGP. Verify VLAN, ASN, and IP addressing on the customer side.
- VIF `available` but no traffic — same routing checklist as VPN: VPC route table, on-prem route table, NACL, SG. Plus, for a transit VIF, the Direct Connect Gateway and Transit Gateway must have a propagated route.
- High latency / packet loss on DX — check `ConnectionLightLevelTx` / `ConnectionLightLevelRx`, `ConnectionErrorCount`, and MTU (DX supports jumbo frames of 9001 or 8500 depending on connection type — a mismatch causes fragmentation).
- Failover to VPN backup not working — both DX and VPN must advertise the same prefix, and the on-prem router must prefer DX (typically via BGP local-preference) so VPN is the standby. Test by withdrawing the DX prefix and verifying VPN takes over.
DX is a private connection but is not encrypted at the AWS service edge. For compliance regimes that require encryption in transit, layer a Site-to-Site VPN over the DX private VIF, or use MACsec on DX dedicated connections that support it. SOA-C02 has tested this: "the compliance team requires encryption in transit between on-prem and AWS — which architecture meets the requirement?" The answer is VPN-over-DX or MACsec, not DX alone. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/Troubleshooting.html
Scenario Pattern: EC2 Cannot Reach S3 Through a Gateway VPC Endpoint
The runbook for the most common SOA-C02 networking trouble ticket.
- Confirm the endpoint exists and is type Gateway. S3 supports both Gateway (free, route-table based) and Interface (PrivateLink, paid, ENI based). For private subnets needing to reach S3 cheaply, Gateway is the SOA default.
- Confirm the route table associated with the EC2's subnet has an entry for the S3 prefix list pointing to the gateway endpoint. This is the single most common miss. The route is added automatically only if you select the route table during endpoint creation. If you create the endpoint and forget to associate the route table, the route is never added.
- Confirm the endpoint policy allows the desired S3 actions on the desired buckets. The default policy is `*:*`; tightened policies often forget specific bucket ARNs.
- Confirm the security group on the EC2 allows outbound HTTPS (port 443) to the S3 prefix list (or to `0.0.0.0/0` if not yet hardened). Note: outbound SG rules can reference the AWS-managed prefix list (`com.amazonaws.region.s3`), which is the SOA-clean answer.
- Confirm the bucket policy (and any SCP) does not deny the access. A common production trip-up: the bucket policy has `Condition: aws:SourceVpce = vpce-...` requiring requests through a specific endpoint, and the EC2 is hitting an endpoint with a different ID than expected.
- Confirm DNS resolution. Gateway endpoints rely on the standard S3 hostname `bucket.s3.region.amazonaws.com` resolving to public S3 IPs that fall inside the prefix list. The VPC must have `enableDnsHostnames` and `enableDnsSupport` enabled.
- Confirm Flow Logs do not show REJECT to S3 IPs from the EC2's ENI. If they do, run Reachability Analyzer from the EC2's ENI toward the gateway endpoint to name the exact blocker.
The solution is almost always step 2 (missing route) or step 4 (SG outbound block). Reachability Analyzer pinpoints the exact cause without manual hunting.
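Step 2 — the most common miss — can be checked programmatically. A sketch against a hand-built route-table dict shaped like `describe-route-tables` output; gateway-endpoint routes appear with a `DestinationPrefixListId` and the endpoint's ID as the `GatewayId` (the `pl-`/`vpce-` values below are placeholders):

```python
# Illustrative route table; IDs are placeholders, not real resources.
ROUTE_TABLE = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationPrefixListId": "pl-63a5400a", "GatewayId": "vpce-0f00ba11"},
    ]
}

def has_gateway_endpoint_route(route_table, endpoint_id):
    """Step 2 of the runbook: does any prefix-list route point at this endpoint?
    If this returns False, the route table was never associated at creation."""
    return any(
        r.get("DestinationPrefixListId") and r.get("GatewayId") == endpoint_id
        for r in route_table["Routes"]
    )
```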
Scenario Pattern: ALB Returns 502 Intermittently
The doctor's differential diagnosis playbook.
- Pull ALB access logs and filter `elb_status_code = 502`. Group by `target:port` to find the offending backend. If one target is dominant, that target is degraded; deregister and replace.
- Check `error_reason`. Values: `Target_ConnectionError` — TCP connection failed; likely a target SG or NACL issue, or the target has crashed. `Target_FailedHealthChecks` — target is unhealthy but the ALB tried it; should not happen if the target was deregistered — check health check tuning. `Target_TLSHandshakeError` — TLS issue between the ALB and the target (HTTPS target groups); cert expired or chain broken.
- Check `target_status_code`. `target_status_code = 502` means the target itself returned 502 (its own upstream is broken). `target_status_code = "-"` means the ALB never got a response — connection or TLS failure.
- Check keep-alive settings. The most common 502 cause is the target's keep-alive timeout being shorter than the ALB's idle timeout. The ALB reuses the connection; the backend has already closed it; the ALB sends a request on a stale connection and gets a TCP RST; the ALB returns 502 to the client. Fix: set the target's keep-alive timeout greater than the ALB idle timeout (default ALB idle = 60 seconds; many backends default to 5–30 seconds).
- Check the target's health check grace period. During Auto Scaling, if a new instance is registered before it is ready, the ALB sends real traffic and gets 502s until the app boots. Increase `HealthCheckGracePeriodSeconds` on the ASG.
- Check the security group on the target. It must allow inbound from the ALB's SG on the target port.
- Check VPC subnets. ALB needs at least 2 subnets in 2 different AZs; target subnets must have route to ALB subnets (typically the same VPC, so trivial).
- Check application logs. A target that crashes mid-request returns 502 to ALB. Application crash dumps are the final word.
Most production 502 storms trace to step 4 (keep-alive mismatch) — fix it once and the issue disappears.
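The first three steps of the playbook reduce to a grouping exercise. A sketch over already-parsed access-log records — the field names match ALB access logs, but the records themselves are invented:

```python
from collections import Counter

# Invented, already-parsed ALB access-log records, trimmed to the triage fields.
RECORDS = [
    {"target": "10.0.2.11:8080", "elb_status_code": "502",
     "target_status_code": "-", "error_reason": "Target_ConnectionError"},
    {"target": "10.0.2.11:8080", "elb_status_code": "502",
     "target_status_code": "-", "error_reason": "Target_ConnectionError"},
    {"target": "10.0.2.12:8080", "elb_status_code": "200",
     "target_status_code": "200", "error_reason": "-"},
    {"target": "10.0.2.12:8080", "elb_status_code": "502",
     "target_status_code": "502", "error_reason": "-"},
]

def triage_502(records):
    """Group 502s by target, splitting 'target returned 502 itself' from
    'ALB never got a response' (target_status_code == "-", the keep-alive suspect)."""
    by_target = Counter()
    never_responded = Counter()
    for r in records:
        if r["elb_status_code"] == "502":
            by_target[r["target"]] += 1
            if r["target_status_code"] == "-":
                never_responded[r["target"]] += 1
    return by_target, never_responded
```

In the sample, one target never responds at all (connection-level failure) while the other returns its own 502 — two different fixes.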
Scenario Pattern: VPN Tunnel UP but No Traffic
The electrician's wire-tracing playbook.
- Confirm the tunnel is actually UP via CloudWatch `TunnelState = 1` for the tunnel in question (each VPN connection has 2 tunnels — one may be UP and one DOWN, and the application may be hashing onto the DOWN one).
- Confirm the BGP session is `ESTABLISHED` (if dynamic routing). If it is not, traffic cannot flow because routes are not learned.
- Confirm the on-prem route table has a route for the VPC CIDR via the customer gateway IP / DX VIF.
- Confirm NACLs on the VPC subnet allow inbound from the on-prem CIDR and outbound to it (including ephemeral ports for return traffic).
- Confirm the security group on the destination EC2 allows inbound from the on-prem CIDR.
- Confirm the on-prem firewall allows traffic to the AWS-side CIDR; the on-prem firewall is the most opaque link in the chain — always confirm with the on-prem team rather than assuming.
- Run Reachability Analyzer between the on-prem-side endpoint (represented as the VGW or TGW attachment) and the VPC destination ENI. If Reachability Analyzer says reachable but traffic still fails, the problem is on-prem; if it names a specific blocker on AWS side, fix it there.
The order matters. Walking from the AWS side outward (steps 1–6) catches 80 percent of issues; the remaining 20 percent are on-prem firewall (step 7). Reachability Analyzer (step 8) confirms the AWS-side path is clean.
Scenario Pattern: Reachability Analyzer Says Blocked — Why?
When Reachability Analyzer returns "Not reachable", the Explanation Codes field names the cause. Common codes and their fixes:
- `BLOCKED_BY_INGRESS_NACL_RULE` / `BLOCKED_BY_EGRESS_NACL_RULE` — names the NACL ID and the rule number that denies. Fix the NACL or change the deny to an allow.
- `BLOCKED_BY_SECURITY_GROUP_RULE` — no SG rule allows the path. Add the rule. Common gotcha: the SG references another SG (`sg-abc`) that lives in a peered VPC; SG-to-SG references resolve within the same VPC by default, and across VPC peering they work only when both VPCs are in the same Region.
- `NO_ROUTE_TO_DESTINATION` — the route table on a subnet in the path has no entry for the destination CIDR. Add the route or fix route propagation.
- `BLACKHOLE_ROUTE_TO_DESTINATION` — a route exists but points to a deleted target (e.g., the NAT gateway was deleted but the route was not updated). Remove or update the route.
- `MAX_TRANSIT_GATEWAY_ATTACHMENT_LIMIT_EXCEEDED` / similar limit errors — service quota issues.
- `ENI_SECURITY_GROUP_RULES_DENY` — the ENI itself has SG rules that deny.
- `THROUGH_RESOURCE_NOT_REACHABLE` — a transit resource (TGW, peering, endpoint) is unreachable from the source.
The fix is named by the explanation code; you do not need to guess. SOA-C02 may give you an explanation code in the question stem and ask for the corrective action.
Common Trap: NACL Ephemeral Ports Denied
NACLs are stateless — they evaluate inbound and outbound rules independently. When a client makes an outbound HTTPS request, the request goes out on port 443 (allowed by outbound rule), but the return traffic comes back to a high-numbered ephemeral port chosen by the OS:
- Linux kernels use range 32768–60999 (Linux ephemeral range).
- Windows Server 2008 and later uses range 49152–65535 (the IANA dynamic port range).
- AWS NLB uses range 1024–65535 when source-NATting.
The conservative production rule is to allow inbound 1024–65535 on the NACL for return traffic. SysOps engineers who set tight outbound rules (e.g., allow 443 only) but forget the inbound ephemeral allowance get random connection failures that look like packet loss. Flow Logs show REJECT on high source ports — that is the fingerprint.
This trap is heavily tested. The exam-correct answer is "the NACL outbound allows 443 but the NACL inbound does not allow ephemeral ports — add inbound 1024–65535".
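The stateless evaluation can be made concrete with a toy rule engine. This is a simplification — real NACL rules also match protocol and CIDR — but it shows exactly why the return packet dies without the ephemeral allowance:

```python
def nacl_allows(rules, port):
    """Evaluate stateless NACL rules in ascending rule-number order;
    the first matching rule wins, and no match means the implicit deny (*)."""
    for _, (low, high, action) in sorted(rules.items()):
        if low <= port <= high:
            return action == "allow"
    return False  # implicit deny

# The classic mistake: outbound allows 443, inbound mirrors it exactly.
outbound      = {100: (443, 443, "allow")}
inbound_wrong = {100: (443, 443, "allow")}
inbound_fixed = {100: (443, 443, "allow"), 110: (1024, 65535, "allow")}

# The request leaves on 443, but the reply targets the client's ephemeral port.
ephemeral_port = 51515
```

The broken NACL passes the outbound request and silently rejects the reply — exactly the "random connection failures that look like packet loss" fingerprint.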
Common Trap: Security Group Cross-VPC References
A security group rule can reference another security group as the source: Allow inbound 443 from `sg-abc`. This is convenient — it is dynamic and follows the source instances regardless of IP. But there is a subtle limit: an SG in a peered VPC can be referenced only when both VPCs are in the same Region, and SG references do not work at all across a Transit Gateway. If you have a TGW connecting two VPCs and try to reference an SG from VPC A in VPC B's SG rules, it fails — you must reference IP addresses (CIDR) instead.
Reachability Analyzer flags this as BLOCKED_BY_SECURITY_GROUP_RULE with the explanation that the referenced SG is not visible from the destination VPC. The fix on TGW is to use CIDR ranges in SG rules, not SG references. SOA-C02 tests this directly.
Common Trap: Flow Logs Do Not Capture Link-Local Traffic
The list of traffic types Flow Logs do not capture, repeated because it appears on the exam:
- Instance metadata service (`169.254.169.254`).
- Time service (`169.254.169.123`).
- Amazon DNS server (the `.2` address of the VPC CIDR).
- DHCP (UDP 67/68).
- VPC default router (the `.1` address of the VPC CIDR).
- Mirrored traffic (use VPC Traffic Mirroring instead).
- Windows license activation (KMS).
A SysOps engineer who builds a "no DNS traffic" alarm on Flow Logs to detect a VPC DNS outage will miss it entirely — those queries are never logged. The right tools are Route 53 Resolver query logging for DNS and VPC Traffic Mirroring for full packet capture.
Common Trap: ALB Health Check Timeout > Interval
The ALB target group health check has two timing parameters:
- `HealthCheckIntervalSeconds` — how often a check is sent.
- `HealthCheckTimeoutSeconds` — how long to wait for a response.
The console enforces Timeout < Interval. If it were the other way around, checks would queue and overlap, never deciding cleanly. Candidates who see "interval = 5, timeout = 10" should immediately recognize the impossible configuration. The fix: set timeout strictly less than interval, typically interval = 30, timeout = 5 for default web apps or interval = 10, timeout = 3 for fast-failing health endpoints. The corollary: when targets oscillate between healthy and unhealthy, the cause is usually the health endpoint legitimately taking 4–6 seconds with a timeout of 5 — the variance crosses the threshold randomly.
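A small validator capturing the timing rule plus the rough time-to-state-change arithmetic. The threshold defaults are common ALB values, and the "seconds to" figures are approximations that ignore in-flight checks:

```python
def validate_health_check(interval_s, timeout_s,
                          healthy_threshold=5, unhealthy_threshold=2):
    """Mirror the console's rule (timeout strictly less than interval) and
    estimate how long a target takes to change state: roughly
    interval * threshold, ignoring checks already in flight."""
    if timeout_s >= interval_s:
        return {"valid": False, "reason": "timeout must be < interval"}
    return {
        "valid": True,
        "seconds_to_unhealthy": interval_s * unhealthy_threshold,
        "seconds_to_healthy": interval_s * healthy_threshold,
    }
```

The "interval = 5, timeout = 10" distractor from the text fails immediately; the recommended "interval = 30, timeout = 5" marks a dead target unhealthy in about a minute with the default thresholds.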
Common Trap: ELB Idle Timeout vs Backend Keep-Alive
The ALB's idle timeout (default 60 seconds) is the time the ALB keeps an idle connection to a target before closing. The target's keep-alive timeout (varies by application — Apache 5 seconds, NGINX 75, Node.js 5, Tomcat 60) is the time the target keeps the connection idle before closing. If the target's keep-alive is shorter than the ALB's idle timeout, the target closes first; the ALB still thinks the connection is open; the next request gets a TCP RST and the ALB returns 502 to the client. The fix is to set the target's keep-alive timeout greater than the ALB idle timeout (e.g., NGINX keepalive_timeout 75s matches ALB default 60s with margin). This is one of the highest-leverage SOA-C02 troubleshooting facts and explains an enormous fraction of intermittent 502s in production.
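Using the defaults quoted above, a one-liner identifies which backends need tuning against the ALB's 60-second idle default — any keep-alive that is not strictly greater than the idle timeout is in the 502-risk zone:

```python
# Keep-alive defaults quoted in the text (seconds).
BACKEND_KEEPALIVE_DEFAULTS = {"apache": 5, "nginx": 75, "nodejs": 5, "tomcat": 60}
ALB_IDLE_TIMEOUT = 60  # ALB default idle timeout, seconds

def needs_keepalive_tuning(defaults, alb_idle=ALB_IDLE_TIMEOUT):
    """Backends whose keep-alive is not strictly greater than the ALB idle
    timeout can close a connection the ALB still considers reusable ->
    stale-connection RSTs -> intermittent 502s."""
    return sorted(name for name, ka in defaults.items() if ka <= alb_idle)
```

Only NGINX's default (75s) clears the bar; Apache, Node.js, and even Tomcat's exactly-60s default all need raising.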
SOA-C02 vs SAA-C03: Network Troubleshooting Belongs to SOA
SAA-C03 hardly tests troubleshooting; SOA-C02 owns the discipline.
| Question style | SAA-C03 lens | SOA-C02 lens |
|---|---|---|
| VPC Flow Logs | "Which service should the architect enable for network visibility?" | "Read the Flow Log REJECT records and identify the blocker." |
| ELB access logs | "Where should ALB access logs be stored?" | "ALB returns 502; filter access logs to identify the failing target." |
| Reachability Analyzer | Rarely tested. | "Reachability Analyzer says blocked at SG rule X — what is the fix?" |
| ALB target health | "Configure target group health checks." | "Targets oscillating between healthy and unhealthy — diagnose timeout vs interval." |
| Hybrid VPN | "Choose between VPN and Direct Connect." | "VPN tunnel UP but no traffic — walk the routing layer." |
| WAF logs | "Which service detects SQL injection?" | "Legitimate users blocked by WAF; identify the rule and decide scope-down." |
| CloudFront logs | "CloudFront integrates with WAF." | "Cache hit ratio dropped from 92% to 60% — diagnose query string normalization in cache policy." |
| Network Access Analyzer | Rarely tested. | "Audit: prove no internet path to RDS." |
The SAA candidate selects services; the SOA candidate reads the logs, walks the layers, and fixes the cause.
Exam Signal: Domain 5.3 Question Patterns
TS 5.3 questions follow predictable shapes. Recognize them and you cut your reading time in half.
- "What is the FIRST step to diagnose..." — the right answer is almost always Reachability Analyzer (for connectivity questions) or read the relevant log (Flow Log for L3/L4, ELB access log for HTTP, WAF log for L7). Avoid "redesign the network" answers.
- "The application is reachable but slow..." — the answer is in ELB access log timing fields (`request_processing_time`, `target_processing_time`, `response_processing_time`) or RDS Performance Insights. Latency is rarely a security-group issue.
- "Tunnel is UP but no traffic..." — the answer is routing (VPC route table, on-prem route table, BGP session), not IPsec.
- "Targets are intermittently unhealthy..." — the answer is health check tuning (timeout vs interval, threshold counts) or keep-alive mismatch.
- "Cache miss ratio is too high..." — the answer is CloudFront cache policy (query strings, cookies, headers) or origin `Cache-Control` headers.
- "Audit: no path from X to Y..." — the answer is Network Access Analyzer with a defined Access Scope, not manual VPC review.
- "The team needs to know which instance behind the NAT gateway hits S3..." — the answer is Flow Logs custom format with `pkt-srcaddr` at the NAT gateway ENI.
With Domain 5 worth 18 percent and TS 5.3 covering the troubleshooting half, expect 5 to 7 questions in this exact territory. Mastering Flow Logs, ELB access logs, Reachability Analyzer, and the canonical scenario runbooks is the single highest-leverage SOA-specific study activity. SAA candidates do not have this content; SOA candidates must own it. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html
Decision Matrix — Which Tool for Each Symptom
Use this lookup during the exam.
| Symptom | First tool | Why |
|---|---|---|
| EC2 cannot reach S3 / DynamoDB | Reachability Analyzer + Flow Logs | Names the missing route or SG/NACL rule. |
| ALB returns 502 intermittently | ALB access logs (`error_reason`, `target_status_code`) | Identifies target failure vs keep-alive mismatch. |
| ALB returns 504 | ALB access logs (`target_processing_time`) | Backend slower than idle timeout. |
| Target group targets unhealthy | ALB target health Reason codes + SG check | Names timeout, mismatch, or unreachability. |
| Cache hit ratio dropped | CloudFront access logs (`x-edge-result-type`, query string fields) | Diagnoses cache policy regression. |
| Stale CloudFront content | CloudFront invalidation + cache policy review | TTL too long or no versioned URL. |
| Legitimate users blocked | WAF logs (`terminatingRuleId`) | Names the blocking rule. |
| Suspicious activity from one IP | Flow Logs filter REJECT by `srcaddr` | High REJECT count = scanning. |
| NAT gateway data spike | Flow Logs custom format with `pkt-srcaddr` | Identifies originating instance. |
| VPN tunnel DOWN | `aws ec2 describe-vpn-connections` StatusMessage | Names IKE Phase 1 or 2 failure. |
| VPN tunnel UP but no traffic | Routing layer + BGP status check | Tunnel is the wire, routing is the destination. |
| DX VIF stuck pending | BGP session + customer router config | VLAN, ASN, IP addressing on customer side. |
| Path audit (no internet → RDS) | Network Access Analyzer Access Scope | Set-to-set analysis vs Reachability one-to-one. |
| Need to find ALL paths to a resource | Network Access Analyzer | Reachability Analyzer is point-to-point. |
| HTTP-layer attack signature | WAF logs + CloudFront real-time logs | Sub-minute incident triage. |
| TCP RST mid-stream | Flow Logs custom format with `tcp-flags` | RST flag bitmask = 4. |
| Cross-account VPC traffic visibility | Flow Logs to centralized S3 bucket + Athena | Single SQL query across accounts. |
| Application-layer 403 | Flow Logs cannot help (ACCEPT) | Need application or WAF logs. |
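One row in the matrix deserves a concrete note: `tcp-flags` in Flow Logs is a bitmask OR-ed across the aggregation interval (FIN = 1, SYN = 2, RST = 4, with SYN-ACK logged as 18), so spotting a mid-stream reset is a bitwise test:

```python
# Flow Logs tcp-flags bit values, OR-ed over the aggregation interval.
FIN, SYN, RST = 1, 2, 4

def has_rst(tcp_flags):
    """True when the aggregated flow saw at least one RST —
    the mid-stream reset fingerprint from the decision matrix."""
    return bool(tcp_flags & RST)
```

A logged value of 6 (SYN | RST) means the connection was reset during the interval; 19 (SYN-ACK | FIN) means it opened and closed cleanly.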
Common Traps Recap — VPC Network Troubleshooting
Every SOA-C02 attempt will see two or three of these distractors.
Trap 1: Flow Logs name the rule that blocked
They do not. Flow Logs report ACCEPT or REJECT, not which SG/NACL rule decided. Reachability Analyzer names the rule.
Trap 2: ACCEPT in Flow Logs means the application succeeded
It only means SGs and NACLs permitted the packet. The application can still return 502, drop the connection, or fail TLS — invisible to Flow Logs.
Trap 3: Detailed monitoring or default monitoring affects Flow Logs
Flow Logs are independent of EC2 metric monitoring. They aggregate at 1 or 10 minutes per the Flow Log configuration, regardless of the EC2 metric setting.
Trap 4: SG references work across Transit Gateway
They do not. SG-to-SG references only work within the same VPC or across same-Region VPC peering with explicit enablement. For TGW-connected VPCs, use CIDR ranges in SG rules.
Trap 5: NACL is stateful
It is stateless. Both inbound and outbound rules must be configured, including ephemeral return ports (1024–65535).
Trap 6: VPN tunnel UP means traffic flows
UP means Phase 1 and Phase 2 succeeded. Traffic still requires correct routing on both sides plus BGP session ESTABLISHED for dynamic routing.
Trap 7: Direct Connect is encrypted by default
It is not. DX is private but not encrypted at the AWS service edge. Use VPN-over-DX or MACsec for encryption.
Trap 8: Reachability Analyzer can analyze cross-Region paths
It cannot. Source and destination must be in the same Region. Cross-Region requires multiple analyses joined manually or Network Access Analyzer.
Trap 9: Network Access Analyzer is the same as Reachability Analyzer
Reachability Analyzer is one-to-one (specific source ENI to specific destination ENI). Network Access Analyzer is set-to-set (Access Scope between sets of resources). Use the right tool for the question type.
Trap 10: ALB health check timeout can be greater than interval
It cannot — the console rejects this configuration. Timeout is always strictly less than interval. Suspect this configuration shape only as a distractor.
Trap 11: Flow Logs capture all traffic in the VPC
They do not capture instance metadata, AWS DNS, DHCP, default router, mirrored traffic, or KMS license activation. For DNS use Route 53 Resolver query logging; for full packet use VPC Traffic Mirroring.
Trap 12: CloudFront x-edge-result-type "Hit" guarantees fresh content
A Hit is fresh per the cache TTL. If the TTL is too long and the origin updated, Hit will serve stale content until the cache entry expires or you invalidate.
FAQ — VPC Network Troubleshooting
Q1: When should I use Reachability Analyzer vs reading Flow Logs?
Use Reachability Analyzer when you need to know why a path is or is not working — it names the specific rule (SG, NACL, route table) that allows or blocks. Use Flow Logs when you need to know what actually happened at the packet level over time — they tell you which connections were attempted and whether each was accepted or rejected. The combined workflow during an incident: Flow Logs say "REJECT from 10.0.1.5 to 10.0.2.10:443"; Reachability Analyzer between those ENIs says "BLOCKED_BY_SECURITY_GROUP_RULE on sg-abc rule 3"; you fix sg-abc rule 3. Flow Logs are reactive, Reachability Analyzer is proactive — both are useful, in that order, during real triage.
Q2: Why does my ALB return 502 even though the target is healthy?
The most common cause is a keep-alive mismatch: the target's keep-alive timeout is shorter than the ALB's idle timeout (default 60s). The ALB keeps the connection open after a request; the target closes it; the ALB tries to reuse a stale connection and gets a TCP RST; the ALB returns 502 to the client. The fix is to set the target's keep-alive timeout greater than the ALB idle timeout — for example NGINX keepalive_timeout 75s against the default ALB idle of 60s. ALB access logs with error_reason = Target_ConnectionError and target_status_code = "-" are the fingerprint. Other 502 causes include the target genuinely returning 502 (its own upstream is broken), TLS handshake failures (HTTPS target groups), and ALB-to-target SG rules being absent.
Q3: A user reports "the website is slow from Asia but fast from the US" — where do I look?
Three possibilities, each with a different log layer. (a) CloudFront edge selection — the user in Asia is being served from a far-away POP because of DNS resolver geography. Check CloudFront access logs x-edge-location for the user's IP. (b) Origin latency — CloudFront has to fetch from origin in us-east-1 because the cache missed; look at time-taken and x-edge-result-type = Miss. The fix is better cache policy (more hits) or Origin Shield (regional cache layer). (c) Network path — the underlying internet path between Asia and CloudFront's POP is congested. CloudFront uses the AWS backbone for the POP-to-origin leg, so the slowness is on the user-to-POP segment, which AWS does not control. The fix is to use AWS Global Accelerator for static-IP edge entry, or accept the geographic reality. SOA-C02 typically wants answers (a) or (b).
Q4: How do I detect "data exfiltration" via Flow Logs?
Build a Logs Insights query that finds high-byte-volume outbound flows from internal subnets to external destinations. A starter query:
fields @timestamp, srcaddr, dstaddr, bytes, action
| filter action = "ACCEPT" and isIpv4InSubnet(srcaddr, "10.0.0.0/8") and not isIpv4InSubnet(dstaddr, "10.0.0.0/8")
| stats sum(bytes) as totalBytes by srcaddr, dstaddr
| sort totalBytes desc
| limit 50
Then cross-reference the top destination IPs with known partner IP ranges; unknown destinations with high byte volume are exfiltration candidates. For higher-fidelity detection use GuardDuty, which has built-in detectors for unusual data egress patterns and DNS-based exfiltration signatures. Flow Logs alone show the data movement; GuardDuty provides the threat intelligence to call it suspicious. SOA-C02 favors the GuardDuty answer for "detect threats", with Flow Logs as the supporting evidence layer.
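The cross-referencing step can be scripted with the standard-library ipaddress module. A sketch under illustrative assumptions: the partner CIDR list is hypothetical (in practice it comes from your CMDB or partner onboarding records), and the input is the (dstaddr, totalBytes) pairs the Logs Insights query above returns:

```python
import ipaddress

# Hypothetical known-partner ranges for illustration.
PARTNER_CIDRS = [ipaddress.ip_network("203.0.113.0/24")]

def unknown_destinations(top_flows: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Filter (dstaddr, totalBytes) pairs down to destinations outside
    every known partner range -- the exfiltration candidates."""
    def known(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in PARTNER_CIDRS)
    return [(ip, b) for ip, b in top_flows if not known(ip)]

flows = [("203.0.113.7", 9_000_000), ("198.51.100.9", 42_000_000)]
print(unknown_destinations(flows))  # [('198.51.100.9', 42000000)]
```

Anything this leaves behind with a high byte count is what you hand to GuardDuty findings (or your security team) for threat-intelligence confirmation.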
Q5: Why does the same Flow Log show different srcaddr values for the same connection?
Two patterns explain this. (a) NAT gateway translation — the NAT gateway's ENI sees traffic with srcaddr equal to the originating instance's private IP for outbound, and srcaddr equal to the external server's IP for the return. Without pkt-srcaddr in the custom format, you cannot tell at the NAT gateway ENI which instance is the originator after translation; with pkt-srcaddr, the original instance IP is preserved. (b) Flow direction confusion — the same TCP connection produces two flow records (one for each direction) with srcaddr and dstaddr swapped. The custom-format flow-direction field disambiguates. The general rule is to always include pkt-srcaddr, pkt-dstaddr, and flow-direction in production Flow Logs to avoid this confusion during incident triage.
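Pattern (b) can be handled mechanically: the two directional records of one connection collapse to the same key once the endpoints are sorted. A sketch assuming custom-format records that carry flow-direction (the default format does not):

```python
def connection_key(rec: dict) -> tuple:
    """Collapse the two directional records of one TCP connection into a
    single key by sorting the (addr, port) endpoints, so ingress and
    egress records group together during triage."""
    a = (rec["srcaddr"], rec["srcport"])
    b = (rec["dstaddr"], rec["dstport"])
    return (rec["protocol"],) + tuple(sorted([a, b]))

egress  = {"srcaddr": "10.0.1.5",  "srcport": "49152",
           "dstaddr": "10.0.2.10", "dstport": "443",
           "protocol": "6", "flow_direction": "egress"}
ingress = {"srcaddr": "10.0.2.10", "srcport": "443",
           "dstaddr": "10.0.1.5",  "dstport": "49152",
           "protocol": "6", "flow_direction": "ingress"}
print(connection_key(egress) == connection_key(ingress))  # True -> same connection
```

Pattern (a) still requires pkt-srcaddr in the format, because after NAT translation no amount of endpoint sorting recovers the original instance IP.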
Q6: VPN tunnel keeps flapping every few hours — what's the cause?
Three common causes. (a) DPD (Dead Peer Detection) timeout mismatch — AWS sends DPD probes; if the customer gateway does not respond within the timeout, AWS tears down the tunnel and renegotiates. Increase the DPD interval on the customer side or align with AWS defaults. (b) Phase 2 lifetime expiry without rekey — IPsec SAs have a finite lifetime (default 1 hour); rekeying should be transparent, but some implementations drop traffic during rekey. Check customer gateway logs around the lifetime boundary. (c) Overlapping NAT or asymmetric routing — the on-prem network may be NATting traffic to a different IP for some flows, breaking the IPsec selector. Use Reachability Analyzer to confirm the AWS-side path, then escalate to the on-prem networking team. Graphing the CloudWatch metric TunnelState over time reveals the flap pattern; the AWS console's StatusMessage for each tunnel names the negotiation parameter that is failing.
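Counting flaps in the TunnelState series makes the pattern concrete. A sketch assuming one datapoint per period with 1 = UP and 0 = DOWN, as retrieved from CloudWatch:

```python
def count_flaps(samples: list[int]) -> int:
    """Count UP-to-DOWN transitions in a TunnelState time series.
    A regular cadence (every few hours) points at DPD timeouts or
    Phase 2 lifetime expiry rather than random packet loss."""
    return sum(1 for prev, cur in zip(samples, samples[1:])
               if prev == 1 and cur == 0)

# Hourly samples over 8 hours: the tunnel drops at hours 3 and 7,
# a suspiciously regular interval worth comparing to the SA lifetime.
print(count_flaps([1, 1, 1, 0, 1, 1, 1, 0, 1]))  # 2
```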
Q7: How do I monitor Direct Connect health proactively?
Set CloudWatch alarms on these metrics per VIF/connection. (a) ConnectionState — alarm on transition away from up. (b) ConnectionLightLevelTx and ConnectionLightLevelRx — alarm on values outside the optical spec range (typically -8 to +5 dBm for 1G, -10 to -1 dBm for 10G). Drift indicates degraded fiber. (c) ConnectionErrorCount — alarm on any non-zero rate, indicating physical-layer errors. (d) ConnectionBpsEgress and ConnectionBpsIngress — alarm on saturation (e.g., > 80% of port speed) so you can plan capacity before users see drops. (e) BgpStatus per VIF (returned by describe-virtual-interfaces) — alarm on transition away from up. Pair this with a VPN backup connection that takes over if DX fails — and test the failover quarterly so you know it works before you need it.
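The light-level check in (b) reduces to a range comparison per port speed. A sketch using the spec ranges quoted above; treat the exact thresholds as illustrative and confirm them against your optic's datasheet before wiring them into an alarm:

```python
# Illustrative optical-power spec ranges in dBm, keyed by port speed,
# taken from the figures quoted in the text above.
SPEC_DBM = {"1G": (-8.0, 5.0), "10G": (-10.0, -1.0)}

def light_level_ok(speed: str, dbm: float) -> bool:
    """True when the measured Tx/Rx light level sits inside the spec
    range for the port speed; drift outside it means degraded fiber."""
    lo, hi = SPEC_DBM[speed]
    return lo <= dbm <= hi

print(light_level_ok("10G", -5.0))   # True
print(light_level_ok("10G", -12.5))  # False -> degraded fiber, open a ticket
```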
Q8: Can I use Reachability Analyzer in a CI/CD pipeline?
Yes, and SOA-C02 favors this answer. The workflow: a pull request to your network-IaC (CloudFormation, Terraform) triggers a CI job that (a) deploys to a staging account, (b) runs aws ec2 create-network-insights-path and aws ec2 start-network-insights-analysis for each critical source-destination pair (e.g., ALB-to-target, target-to-RDS, target-to-S3), (c) waits for completion, (d) parses the result for NetworkPathFound = true. If any required path is unreachable, the PR is blocked. The cost is $0.10 per analysis; for a few hundred analyses per day it is negligible compared to a production outage. This pattern is the SOA-clean answer to "how do we prevent network regressions before they reach production".
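The CI gate can equally be written with boto3 instead of the raw CLI. A sketch of steps (b)–(d); the `ec2` argument is a boto3 EC2 client (or a stub under test), and the 30-second polling budget is an assumption for illustration:

```python
import time

def verify_path(ec2, source: str, destination: str, port: int = 443) -> bool:
    """Create a Network Insights path, start an analysis, and poll until
    it completes; return whether a reachable path was found. Callers
    fail the pipeline when this returns False."""
    path_id = ec2.create_network_insights_path(
        Source=source, Destination=destination,
        DestinationPort=port, Protocol="tcp",
    )["NetworkInsightsPath"]["NetworkInsightsPathId"]
    analysis_id = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path_id,
    )["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]
    for _ in range(30):
        result = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id],
        )["NetworkInsightsAnalyses"][0]
        if result["Status"] != "running":
            return bool(result.get("NetworkPathFound"))
        time.sleep(1)
    return False  # treat a timed-out analysis as a failed gate
```

In the pipeline you would call this once per critical pair (ALB-to-target, target-to-RDS, target-to-S3) and block the PR on the first False; injecting the client also makes the gate unit-testable without touching AWS.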
Q9: My Logs Insights query against Flow Logs is slow — how do I speed it up?
Three optimizations. (a) Narrow the time range — Logs Insights scans all matching log groups within the time window; a query over the last 24 hours scans 24x more data than the last hour. Use the smallest window that contains the incident. (b) Filter early — put the most selective filter clauses first (filter action = "REJECT" before filter dstport = 443) so subsequent stages process less data. (c) Use field projection — Logs Insights automatically projects only the fields you reference, so referencing fewer fields means less data shuffled. For very large Flow Log volumes (TB per day), prefer Athena over S3 in Parquet format with partitioning by year/month/day/account/region — Athena scans typically run 5–20x faster than Logs Insights at scale and cost less per scan.
Q10: My ALB is returning 502 only during deployments — what's wrong?
This is the target group deregistration delay problem. When an instance is deregistered (during deploy, scale-in, or manual termination), the target enters the draining state and stops receiving new requests, but in-flight requests continue. The default deregistration delay is 300 seconds (5 minutes). If the deregistration delay is shorter than the longest in-flight request, the connection is forcibly closed and the client sees 502. The fix is to set the deregistration delay to longer than the longest expected request — for typical web apps 30–60 seconds is plenty; for long-running APIs (video upload, ML inference) it might be 5–15 minutes. The corollary: the HealthCheckGracePeriodSeconds on the ASG side must be longer than the application's startup time so new instances do not receive traffic before they are ready. Both deregistration delay and grace period must be tuned to the application's actual behavior, not left at defaults.
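Both timing rules can be checked in one place. A sketch of the two inequalities stated above; the example numbers are illustrative:

```python
def deploy_timing_problems(dereg_delay_s: int, max_request_s: int,
                           grace_period_s: int, startup_s: int) -> list[str]:
    """Flag the two deployment-timing mistakes: a deregistration delay
    shorter than the longest in-flight request, and an ASG health check
    grace period shorter than application startup."""
    problems = []
    if dereg_delay_s < max_request_s:
        problems.append("deregistration delay < longest request -> 502s on deploy")
    if grace_period_s < startup_s:
        problems.append("grace period < startup time -> traffic before ready")
    return problems

print(deploy_timing_problems(30, 120, 60, 90))    # both problems flagged
print(deploy_timing_problems(300, 120, 180, 90))  # []
```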
Further Reading and Related Operational Patterns
- VPC Flow Logs — User Guide
- Flow Log Record Examples
- Querying VPC Flow Logs with Athena
- VPC Reachability Analyzer
- Network Access Analyzer
- Application Load Balancer Access Logs
- Network Load Balancer Access Logs
- ALB Target Group Health Checks
- Troubleshoot Your Application Load Balancer
- CloudFront Access Logs
- AWS WAF Web ACL Logging
- Troubleshooting Site-to-Site VPN
- Troubleshooting AWS Direct Connect
- Analyzing Log Data with CloudWatch Logs Insights
- AWS SOA-C02 Exam Guide v2.3 (PDF)
Once VPC network troubleshooting is in your toolkit, the related operational layers are: VPC configuration and connectivity for the design-time controls (subnets, route tables, NACLs, security groups, endpoints, peering, TGW, VPN) whose correctness this topic verifies; Route 53 DNS and CloudFront for the DNS and CDN layers that sit in front of the VPC and have their own failure modes; WAF, Shield, and network protection for the Layer 7 controls whose logs you read here for false-positive triage; and CloudWatch Logs and Logs Insights for the query engine that powers Flow Log and ELB log analysis at scale.