examhub.cc — pass the most valuable certifications with the most efficient method
Vol. I

VPC Configuration, Subnets, NACLs, and Private Connectivity

7,800 words · approx. 39 minutes reading time

Amazon VPC is the foundational fabric of every AWS workload, and on SOA-C02 it is the spine of Domain 5 (Networking and Content Delivery, 18 percent of the exam). Task Statement 5.1 is verbatim "Implement networking features and connectivity" — and the SysOps lens is sharply different from the architect's view. Where SAA-C03 asks "which networking option fits this design", SOA-C02 asks "the EC2 instance in the private subnet still cannot reach S3, list every step you would check", or "the VPC peering connection is established but traffic does not flow, why", or "the SSM agent runs but the instance does not appear in Fleet Manager — what private connectivity is missing". The exam wants to know whether you can configure a working network and troubleshoot a broken one without falling back to "open everything 0.0.0.0/0".

This guide walks the SOA-C02 candidate through VPC configuration and connectivity from the operational angle: how subnets, route tables, internet gateways, and NAT gateways combine to produce public and private subnets; how security groups (stateful) and network ACLs (stateless) overlap and where the ephemeral-port trap bites; when a Gateway VPC endpoint beats an Interface VPC endpoint and the cost difference; why VPC peering is non-transitive and when Transit Gateway is the right escape hatch; what a Site-to-Site VPN actually configures end to end; and how Session Manager plus SSM Interface endpoints replace the public SSH bastion entirely. Each section closes with the SysOps-tier troubleshooting heuristics and exam signals.

Why VPC Configuration Sits at the Heart of SOA-C02 Domain 5

The official SOA-C02 Exam Guide v2.3 lists three skills under Task Statement 5.1, and every one of them maps to topics in this guide: configure a VPC (subnets, route tables, NACLs, security groups, NAT gateway, internet gateway); configure private connectivity (Session Manager, VPC endpoints, VPC peering, VPN); and configure AWS network protection services (WAF, Shield — covered in the dedicated WAF and Shield topic). Task Statement 5.3 then layers troubleshooting on top, and that is where VPC Flow Logs, ELB access logs, and Reachability Analyzer come into the picture (covered in the network troubleshooting topic).

At the SysOps tier the framing is operational, not architectural. SAA-C03 asks "which VPC endpoint type should the architect choose for an S3 access pattern from a private VPC?" SOA-C02 asks "the application in the private subnet receives 403 Forbidden when calling S3, the security group allows 443 outbound, the route table has the gateway endpoint route — what else could be wrong?" The answer is rarely a different architecture; it is usually a missing endpoint policy, a wrong region, an NACL ephemeral port rule, or a DNS resolution flag on the VPC.

VPC configuration is the topic where every other Domain 5 topic plugs in. Route 53 health checks need network reachability. CloudFront origin access control needs a VPC for the origin. WAF attaches to ALBs that live in subnets. Network troubleshooting reads VPC Flow Logs from the very subnets you configure here. Get VPC right and the rest of Domain 5 is straightforward; get it wrong and every connectivity question becomes a dice roll.

  • VPC: a logically isolated virtual network in AWS, scoped to a single region, with a primary IPv4 CIDR block (and optional secondary CIDRs and IPv6 blocks).
  • Subnet: a range of IP addresses within a VPC, scoped to a single Availability Zone. A subnet is "public" if its route table has a 0.0.0.0/0 route to an internet gateway, otherwise it is "private".
  • Route table: a set of rules (routes) that determine where network traffic from a subnet or gateway is directed. Each subnet must be associated with exactly one route table.
  • Internet gateway (IGW): a horizontally scaled, highly available VPC component that allows communication between the VPC and the internet. One IGW per VPC.
  • NAT gateway: a managed Network Address Translation service that allows instances in a private subnet to initiate outbound connections to the internet while remaining unreachable from the internet. AZ-scoped, billed per hour and per GB processed.
  • Security group (SG): a stateful virtual firewall attached to an ENI. Allow rules only — return traffic is automatically permitted.
  • Network ACL (NACL): a stateless firewall attached to a subnet. Both allow and deny rules; return traffic must be explicitly allowed (ephemeral ports).
  • VPC endpoint: a private connection from a VPC to an AWS service without traversing the public internet. Two types: Gateway (S3, DynamoDB) and Interface (PrivateLink-based ENI).
  • VPC peering: a one-to-one network connection between two VPCs that allows private IP routing. Non-transitive.
  • Transit Gateway (TGW): a regional network transit hub that connects VPCs and on-premises networks through a hub-and-spoke topology. Supports transitive routing.
  • Site-to-Site VPN: an IPsec VPN connection between a virtual private gateway (or transit gateway) on the AWS side and a customer gateway on the on-premises side.
  • Session Manager: an AWS Systems Manager capability that provides interactive shell access to EC2 instances without SSH, bastion hosts, or open inbound ports.
  • Reference: https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html

VPC Configuration and Connectivity in Plain Language

VPC jargon stacks fast — subnets, route tables, gateways, endpoints, peering, transit. Three analogies make the constructs stick.

Analogy 1: The Apartment Building Floor Plan

A VPC is a purpose-built apartment building on a private lot. Its CIDR block is the block of street numbers the building owns — say 10.0.0.0/16, room numbers 10.0.0.0 through 10.0.255.255. Subnets are the floors of the building, each tied to a specific Availability Zone wing. Route tables are the directory boards at each elevator landing — "to leave the building (0.0.0.0/0), take the front door (IGW); to reach the storage warehouse (S3 prefix list), take service elevator A (Gateway endpoint); to visit the next building (peer VPC CIDR), use the breezeway (peering connection)". The internet gateway is the front lobby door — open to the street, but only the floors whose directory points there can reach it. A NAT gateway is the mailroom — residents from the back floors (private subnets) can drop outbound mail through the mailroom and receive replies, but no outsider can deliver directly to those rooms. Security groups are the smart door locks per apartment that remember who you let in (stateful — return traffic implicit). Network ACLs are the floor security gates that check every person both ways and have no memory (stateless — return traffic must be on its own allow list, hence the ephemeral-port rule). VPC peering is a direct breezeway between two buildings, but no transit allowed — you cannot walk through Building A to reach Building C. Transit Gateway is the central transit station at the city center — every building connects to it once and it handles all the inter-building routing.

Analogy 2: Office Building Security and Visitor Management

A public subnet is a ground-floor lobby — anyone with the right ID badge (security group rule) can walk in from the street. A private subnet is a secured upper floor — there is no direct street entrance; visitors come through the lobby and ride a managed elevator (NAT gateway) for outbound errands, but the elevator does not let strangers up. Session Manager is the video-call concierge service — instead of giving every employee a master key (SSH key) and a public hallway entrance (port 22 to 0.0.0.0/0), the concierge takes a video call from each authorized employee, verifies their badge in IAM, and unlocks the office door from inside. The SSM Interface VPC endpoint is the internal phone line that lets the concierge work without an outside phone — the office floor never needs to be reachable from the street. A bastion host is the old guard shack at the gate — once useful, now the concierge service has replaced it for SOA-C02 purposes.

Analogy 3: Postal Sorting Facility

A route table is the postal sorting belt — a letter labelled with destination CIDR 0.0.0.0/0 falls into the IGW chute; a letter labelled s3.us-east-1.amazonaws.com prefix list falls into the Gateway endpoint chute; a letter labelled 10.20.0.0/16 (peer VPC) falls into the peering chute. Most-specific route wins — a letter for 10.20.5.0/24 is sorted by the /24 belt before the /16 belt sees it. The internet gateway is the outgoing mail truck to the city. The NAT gateway is the return-address-rewriting station — outgoing letters from private apartments get the building's main return address, so replies come back to the building, where the station rewrites them back to the original apartment. A VPC endpoint is the dedicated courier line to a specific recipient (S3, DynamoDB, SSM) that bypasses the city post office entirely — letters never leave the AWS private network. A VPN tunnel is a diplomatic pouch — every letter is sealed and authenticated before it leaves the building, decrypted only at the partner site.

For SOA-C02, the apartment building floor plan analogy is the most useful when a question mixes route tables with NACLs and security groups — you can mentally walk a packet from elevator landing to floor gate to apartment door. The postal sorting analogy is sharpest for route table troubleshooting (most-specific match), and the office security analogy is the cleanest mental model for Session Manager replacing the SSH bastion. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html

VPC Fundamentals: CIDR Blocks, Subnets, Route Tables, IGW, and NAT

Before any connectivity question makes sense, you need a precise mental model of how the VPC's primitives compose.

CIDR blocks and subnet sizing

A VPC has a primary IPv4 CIDR block chosen at creation, between /16 (65,536 addresses) and /28 (16 addresses). You can add up to four secondary CIDRs later. CIDR blocks must come from the RFC 1918 private ranges — 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 — or be a non-overlapping public range you own. The CIDR is fundamental: changing the primary CIDR after creation is not allowed (you would create a new VPC), and overlapping CIDRs make VPC peering and Transit Gateway routing impossible.

A subnet is a range carved out of the VPC CIDR, scoped to a single AZ. A /24 subnet has 256 addresses, of which AWS reserves five: the network address (.0), VPC router (.1), DNS (.2), future use (.3), and broadcast (.255). So a /24 subnet has 251 usable addresses. A /28 subnet (the smallest allowed) has 11 usable. SOA-C02 will sometimes ask "the team needs at least 100 instances in a subnet, what is the smallest CIDR" — the answer is /25 (128 - 5 = 123 usable).
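The reserved-address arithmetic is easy to check with Python's standard-library ipaddress module — a quick sketch:

```python
import ipaddress

# AWS reserves 5 addresses per subnet: network address, VPC router,
# DNS, future use, and broadcast.
AWS_RESERVED = 5

def usable_addresses(cidr: str) -> int:
    """Usable instance addresses in an AWS subnet of the given CIDR."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED

def smallest_prefix_for(hosts: int) -> int:
    """Smallest subnet (numerically largest prefix) that fits `hosts`
    instances after AWS's 5 reserved addresses."""
    for prefix in range(28, 15, -1):  # /28 is the smallest AWS subnet
        if 2 ** (32 - prefix) - AWS_RESERVED >= hosts:
            return prefix
    raise ValueError("does not fit in a single /16 subnet")

print(usable_addresses("10.0.1.0/24"))   # 251
print(usable_addresses("10.0.1.0/28"))   # 11
print(smallest_prefix_for(100))          # 25  (128 - 5 = 123 usable)
```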

Public vs private subnets — defined by route tables

A subnet is public if its route table has a default route 0.0.0.0/0 pointing to an internet gateway. A subnet is private if it does not. There is no "public subnet" toggle; the property emerges from the routing.

For an instance to actually reach the internet from a public subnet, three things must be true:

  1. The subnet's route table has 0.0.0.0/0 → IGW.
  2. The instance has a public IP or Elastic IP (auto-assign public IPv4 enabled, or EIP attached).
  3. Security groups and NACLs allow the traffic in both directions.

Miss any one of those and the instance's internet connectivity looks configured but does not work — a classic SOA-C02 scenario.

Route table mechanics

A route table is a list of routes, each with a destination CIDR and a target. Most-specific route wins: if 10.0.0.0/16 → local and 10.0.5.0/24 → peering connection both exist, traffic to 10.0.5.10 goes through the peering connection. Every VPC has a default main route table that all subnets inherit unless explicitly associated with a custom route table; SysOps best practice is to create explicit custom route tables per subnet tier (one for public, one or more for private) so the main route table stays minimal.

The implicit local route covers the entire VPC CIDR — instances in any subnet within the VPC can reach each other by private IP without any added route. You cannot delete or modify the local route.
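Most-specific matching is a longest-prefix lookup. A minimal sketch (the peering and IGW IDs are hypothetical placeholders):

```python
import ipaddress

# A route table as (destination CIDR, target); IDs are hypothetical.
ROUTES = [
    ("10.0.0.0/16", "local"),           # implicit VPC-wide local route
    ("10.0.5.0/24", "pcx-0abc12345"),   # peering connection
    ("0.0.0.0/0",   "igw-0def67890"),   # internet gateway
]

def route_for(dest_ip: str) -> str:
    """Target of the most-specific (longest-prefix) matching route."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [(ipaddress.ip_network(cidr).prefixlen, target)
               for cidr, target in ROUTES
               if ip in ipaddress.ip_network(cidr)]
    return max(matches)[1]  # highest prefix length wins

print(route_for("10.0.5.10"))   # pcx-0abc12345 (the /24 beats the /16)
print(route_for("10.0.9.9"))    # local
print(route_for("8.8.8.8"))     # igw-0def67890
```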

Internet gateway

An internet gateway is a horizontally scaled, redundant, regional VPC component. There is no throughput limit you need to size, no AZ choice — one IGW per VPC, attached to the VPC, then referenced from public route tables. The IGW performs source/destination IP translation for instances with public IPs, so a packet leaving an EC2 instance with public IP 54.x.x.x appears on the internet sourced from 54.x.x.x even though the instance internally has private 10.0.x.x.

NAT gateway

A NAT gateway lets instances in private subnets initiate outbound connections to the internet while remaining unreachable from inbound. Operational facts:

  • AZ-scoped: a NAT gateway lives in one AZ and uses one Elastic IP. For HA across AZs, deploy one NAT gateway per AZ and point each AZ's private subnet route table at its same-AZ NAT gateway.
  • Pricing: charged per hour the NAT gateway exists and per GB of data processed. The data-processing charge is the surprise — at high volume NAT gateway becomes the largest single line item on the bill.
  • Bandwidth: scales automatically up to 100 Gbps and 10 million packets per second.
  • Targets: place in a public subnet with 0.0.0.0/0 → IGW; private subnets route 0.0.0.0/0 → NAT gateway.

The legacy alternative is a NAT instance — a self-managed EC2 instance running NAT software. Cheaper for very low volume, but you own patching, HA, scaling, and source/destination check disablement. SOA-C02 strongly prefers the managed NAT gateway except in cost-sensitive low-volume scenarios.

  • VPC CIDR: /16 (65,536) to /28 (16) IPv4 addresses; up to 4 secondary CIDRs.
  • Subnet reserved addresses: 5 per subnet (network, VPC router, DNS, reserved, broadcast).
  • Smallest subnet: /28 → 11 usable addresses.
  • One IGW per VPC. One main route table per VPC. One NACL per subnet.
  • Route table max routes: 50 static routes per route table (raise via support to 1000); BGP-propagated routes do not count.
  • Security group max rules: 60 inbound + 60 outbound per SG (default), up to 5 SGs per ENI → 300 + 300 effective rules per ENI. Quota raisable.
  • NACL max rules: 20 inbound + 20 outbound per NACL (default 20, max 40 each).
  • Ephemeral port range: Linux 32768–60999, Windows Server 2008+ 49152–65535, AWS NAT/ELB 1024–65535. Use 1024–65535 for NACL outbound return traffic to be safe.
  • NAT gateway: AZ-scoped; per-hour + per-GB billed; 100 Gbps and 10M pps capacity.
  • VPC peering: default 50 active peering connections per VPC, hard maximum 125.
  • Transit Gateway: 5,000 attachments per TGW; 10,000 routes per route table.
  • Reference: https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html

Security Groups vs Network ACLs: Stateful vs Stateless and the Ephemeral Port Trap

Two firewalls operate inside every VPC, and they behave fundamentally differently. SOA-C02 tests this distinction more than any other Domain 5 topic.

Security groups — stateful, allow-only, ENI-attached

A security group is attached to an Elastic Network Interface (ENI). It contains only allow rules; there is no deny rule (denial is the absence of allow). Crucially, security groups are stateful: if an inbound rule allows traffic in, the response traffic is automatically permitted out, regardless of outbound rules. Likewise, if an outbound rule allows traffic out, the response is automatically permitted in.

Operational facts:

  • One ENI can have up to 5 security groups attached (raisable to 16).
  • Default outbound rule on a new SG is allow all; default inbound rule is deny all (empty list).
  • Rule sources can be CIDR ranges, another security group ID (the most useful pattern — "allow port 443 from sg-web"), or a managed prefix list.
  • SG rule changes take effect immediately on existing connections.

Security group references are the single best operational pattern. Instead of enumerating IPs, write rules like "allow port 5432 from sg-app to sg-db". The rule keeps working as instances launch and terminate; you never edit the rule again.

Network ACLs — stateless, allow + deny, subnet-attached

A network ACL (NACL) is attached to a subnet. It contains both allow and deny rules, evaluated in numbered order from lowest to highest. Crucially, NACLs are stateless: return traffic is not automatically permitted. Every direction must be explicitly allowed.

This is where the ephemeral port trap lives. When a client makes an outbound TCP connection, the OS picks a random source port from the ephemeral range:

  • Linux (most modern distributions): 32768–60999.
  • Windows Server 2008+: 49152–65535.
  • AWS NAT gateway, AWS Lambda, AWS ELBs: 1024–65535.

If the client is on the AWS side and the NACL only allows outbound to port 443 but does not allow inbound on the ephemeral range, the response packets are dropped at the NACL even though the security group is fine. The reverse is also true: a server in a subnet with restrictive outbound NACL must allow outbound on the ephemeral range so its responses to clients reach them.

The exam-correct pattern: NACL inbound and outbound both include a rule for TCP 1024–65535 ALLOW to cover ephemeral return traffic for all OSes and AWS services.
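The trap can be simulated in a few lines (port numbers and ranges are illustrative):

```python
# Toy model of the ephemeral-port trap: a client opens TCP 443 outbound;
# the server's reply comes back TO the client's ephemeral source port.
EPHEMERAL = range(1024, 65536)   # AWS-safe NACL return-traffic range

def sg_allows_reply(outbound_allowed: bool) -> bool:
    # Security groups are stateful: if the outbound request was allowed,
    # the reply is permitted automatically.
    return outbound_allowed

def nacl_allows_reply(inbound_rules: list, reply_dst_port: int) -> bool:
    # NACLs are stateless: the reply must match an explicit inbound allow.
    return any(reply_dst_port in r for r in inbound_rules)

reply_port = 51515  # random ephemeral source port chosen by the client OS

print(sg_allows_reply(outbound_allowed=True))                       # True
print(nacl_allows_reply([range(443, 444)], reply_port))             # False: dropped!
print(nacl_allows_reply([range(443, 444), EPHEMERAL], reply_port))  # True
```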

Operational facts:

  • One NACL per subnet (a subnet can be associated with only one NACL at a time, but one NACL can be associated with many subnets).
  • Default NACL allows all inbound and outbound. Custom NACLs deny all by default until you add allow rules.
  • Rule numbers must be unique within a NACL; lower numbers are evaluated first; first match wins.
  • An explicit deny rule cannot be overridden by a higher-numbered allow rule — once denied, always denied.
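First-match-wins over numbered rules, sketched with hypothetical rule numbers:

```python
# NACL rules are evaluated in ascending rule-number order; the first
# match wins, and a matching DENY cannot be overridden by a later ALLOW.
def evaluate_nacl(rules, port):
    """rules: list of (rule_number, port_range, 'ALLOW' | 'DENY')."""
    for number, ports, action in sorted(rules):
        if port in ports:
            return action
    return "DENY"  # the implicit final deny (rule *)

rules = [
    (100, range(80, 81),   "ALLOW"),
    (200, range(443, 444), "DENY"),
    (300, range(0, 65536), "ALLOW"),   # broad allow, evaluated last
]

print(evaluate_nacl(rules, 80))    # ALLOW (rule 100)
print(evaluate_nacl(rules, 443))   # DENY  (rule 200 wins; rule 300 never reached)
print(evaluate_nacl(rules, 22))    # ALLOW (falls through to rule 300)
```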
  • Scope — Security group: ENI (instance). Network ACL: subnet.
  • State — Security group: stateful (return traffic implicit). Network ACL: stateless (return traffic explicit).
  • Rule types — Security group: allow only. Network ACL: allow + deny.
  • Rule evaluation — Security group: all rules considered; if any allows, allow. Network ACL: numbered order; first match wins.
  • Defaults — New security group: inbound deny all, outbound allow all. Custom NACL: deny all both ways; default NACL: allow all.
  • Reference targets — Security group: CIDR, SG ID, prefix list. Network ACL: CIDR only.
  • Ephemeral port concern — Security group: none (stateful). Network ACL: critical; inbound and outbound must both allow 1024–65535 for return traffic.

If the question describes a subnet-wide block that cannot be exempted by an instance-level rule, the answer is NACL deny. If the question describes "allow only this app tier to reach this DB tier" using identifiers that survive instance churn, the answer is SG-to-SG referencing. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html

A common SOA-C02 distractor: "the security group allows port 443, but the instance still cannot receive web traffic — what is wrong?" The NACL on the subnet is denying inbound port 443 (or denying inbound on the ephemeral return range for outbound traffic). NACLs are evaluated before security groups for inbound and after for outbound; a NACL deny is the floor and no SG allow can lift it. Candidates who only check the security group lose easy points. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html

VPC Endpoints — Gateway Type for S3 and DynamoDB

A VPC endpoint is a private connection from your VPC to an AWS service that does not traverse the public internet. Two flavors exist, and SOA-C02 tests both.

Gateway VPC endpoints — S3 and DynamoDB only

Gateway endpoints exist for exactly two services: Amazon S3 and Amazon DynamoDB. Operational facts:

  • No charge — gateway endpoints are free. No hourly fee, no data processing fee.
  • Configured as a route table entry — when you create the endpoint, AWS adds a route to the chosen route tables: destination = the service's prefix list (managed by AWS, e.g., pl-63a5400a for S3 in us-east-1), target = the gateway endpoint.
  • Region-scoped — the endpoint serves only the same-region S3 buckets or DynamoDB tables. Cross-region S3 access still goes over the internet (or through other paths).
  • No security group; access is controlled by the endpoint policy (an IAM resource policy on the endpoint itself) plus the IAM identity policies and bucket/table policies.
  • No DNS change — instances still use the public S3 hostname s3.region.amazonaws.com, and the route table redirects the traffic privately.

The configuration steps:

  1. Create the gateway endpoint in the VPC for service com.amazonaws.region.s3 (or dynamodb).
  2. Associate it with the route tables of the subnets that need the endpoint.
  3. Optionally tighten the endpoint policy (default: allow all). A common SOA pattern is to restrict the endpoint policy to specific bucket ARNs to prevent data exfiltration to external buckets.
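The "restrict to specific bucket ARNs" pattern in step 3 can be sketched as a policy document — shown here as a Python dict for readability, with a hypothetical bucket name:

```python
import json

# Restrictive S3 gateway-endpoint policy: only the listed bucket is
# reachable through the endpoint, blocking exfiltration to arbitrary
# external buckets. "my-app-bucket" is a hypothetical name.
ENDPOINT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-app-bucket",
            "arn:aws:s3:::my-app-bucket/*",
        ],
    }],
}

print(json.dumps(ENDPOINT_POLICY, indent=2))
```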

Why use it: cost (NAT gateway data processing fees go away for S3 traffic), security (traffic stays on the AWS network), and reliability (no internet dependency).

The S3 NAT gateway cost trap

A canonical SOA-C02 scenario: the bill has a huge NAT gateway data processing line, the team greps the VPC Flow Logs and finds it is mostly traffic to 52.x.x.x (S3 public IPs). The fix is one click — add a Gateway VPC endpoint for S3 to the relevant route tables. The data goes through the endpoint instead of the NAT, the data processing charge disappears, and the application URLs do not change.

For any VPC where private-subnet workloads call S3 or DynamoDB, add the Gateway endpoint. There is no downside: no cost, no rotation overhead, no DNS change, immediate cost reduction on NAT gateway data processing. SOA-C02 routinely tests this as the answer to "reduce NAT gateway charges" or "avoid internet routing for S3 access". Reference: https://docs.aws.amazon.com/vpc/latest/privatelink/gateway-endpoints.html
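The arithmetic behind the trap can be sketched with assumed ballpark prices (us-east-1-ish; verify against current AWS pricing before relying on the numbers):

```python
# Back-of-envelope cost of routing S3 traffic through NAT versus a
# Gateway endpoint. Both prices are ASSUMED for illustration only.
NAT_HOURLY = 0.045        # assumed USD per NAT-gateway-hour
NAT_PER_GB = 0.045        # assumed USD per GB processed by NAT

def monthly_nat_cost(gb_processed: float, hours: int = 730) -> float:
    """Monthly NAT gateway cost: existence charge + data processing."""
    return NAT_HOURLY * hours + NAT_PER_GB * gb_processed

# 10 TB/month of S3 traffic pushed through the NAT gateway:
print(round(monthly_nat_cost(10_000), 2))   # 482.85

# Same traffic via a free Gateway VPC endpoint: only the NAT hourly
# charge remains (for whatever real internet traffic is left).
print(round(monthly_nat_cost(0), 2))        # 32.85
```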

Interface VPC Endpoints — PrivateLink for Everything Else

Every other AWS service that supports a VPC endpoint uses an Interface VPC endpoint (a.k.a. AWS PrivateLink). This is the bigger, more expensive, more flexible model.

Interface endpoint mechanics

An interface endpoint creates an Elastic Network Interface (ENI) with a private IP from your subnet's CIDR for each AZ you select. The ENI is the entry point — your application calls the service's normal hostname, and Route 53 (private DNS for the endpoint) resolves it to the ENI's private IP, so traffic stays on the AWS private network.

Operational facts:

  • One ENI per AZ — pick subnets in each AZ where the endpoint should be reachable. For HA, enable in at least two AZs.
  • Hourly charge per AZ plus per-GB data processing charge. At very high volume the data processing fee can rival NAT, but for moderate API volume the endpoint is much cheaper than NAT.
  • Security group attached to the endpoint ENI — control which sources can reach it (typically allow port 443 from the application's SG).
  • Endpoint policy for IAM-style access control on the endpoint itself.
  • Private DNS — when enabled (default for most AWS services), instances calling secretsmanager.region.amazonaws.com or ssm.region.amazonaws.com resolve to the ENI private IPs automatically. The VPC must have enableDnsSupport and enableDnsHostnames both set to true.

Common interface endpoints on SOA-C02

The Systems Manager trio is the most heavily tested:

  • com.amazonaws.region.ssm — Systems Manager API (Run Command, State Manager, Parameter Store).
  • com.amazonaws.region.ssmmessages — Session Manager data channel for shell sessions.
  • com.amazonaws.region.ec2messages — required for SSM agent legacy communication.

For a private EC2 instance to be Session-Manager-reachable without a NAT gateway, all three of these endpoints must exist and be reachable from the instance's subnet, and the security groups must allow port 443 from the instance to the endpoint ENIs.
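A minimal sketch of "which Session Manager endpoints are still missing", given the endpoint service names already present in the VPC. The function and its naming are illustrative, not an AWS API; service names follow the com.amazonaws.<region>.<service> convention from the list above:

```python
# The three interface endpoints Session Manager needs in a NAT-less VPC.
REQUIRED = {"ssm", "ssmmessages", "ec2messages"}

def missing_ssm_endpoints(existing_services: set, region: str) -> set:
    """Given the service names of the VPC's existing interface endpoints
    (e.g. 'com.amazonaws.us-east-1.ssm'), return the Session Manager
    endpoints still missing."""
    prefix = f"com.amazonaws.{region}."
    present = {s.removeprefix(prefix) for s in existing_services}
    return {prefix + svc for svc in REQUIRED - present}

existing = {"com.amazonaws.us-east-1.ssm", "com.amazonaws.us-east-1.logs"}
print(sorted(missing_ssm_endpoints(existing, "us-east-1")))
# ['com.amazonaws.us-east-1.ec2messages', 'com.amazonaws.us-east-1.ssmmessages']
```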

Other frequently tested interface endpoints: kms, secretsmanager, logs (CloudWatch Logs), monitoring (CloudWatch metrics), events (EventBridge), sns, sqs, ec2, ecr.api and ecr.dkr (out of scope for SOA, but relevant context), lambda.

The DNS resolution requirement

Private DNS only works when the VPC has both enableDnsSupport and enableDnsHostnames set to true. If either is false, instances trying to resolve secretsmanager.us-east-1.amazonaws.com will not get the endpoint's private IP back — they will get the public IP and the call will fail (or go through NAT if NAT exists, defeating the purpose). This is one of the highest-frequency SOA-C02 trap configurations.

On SOA-C02, when a question describes "the application in the private subnet is configured to call Secrets Manager, the interface endpoint exists, the security groups are correct, but the call still fails with timeout or DNS resolution error" — check enableDnsSupport AND enableDnsHostnames on the VPC. Both must be true for the endpoint's private DNS feature to override the public hostname resolution. New VPCs created via the console default both to true; VPCs imported via CloudFormation or API can have enableDnsHostnames set false. Reference: https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html
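The two-flag check reduces to a tiny helper (illustrative, not an AWS API; the booleans correspond to the enableDnsSupport and enableDnsHostnames VPC attributes):

```python
# Both VPC DNS attributes must be true for an interface endpoint's
# private DNS to override the public hostname resolution.
def private_dns_ready(enable_dns_support: bool,
                      enable_dns_hostnames: bool) -> list:
    """Return the list of VPC attribute fixes still needed."""
    problems = []
    if not enable_dns_support:
        problems.append("set enableDnsSupport=true on the VPC")
    if not enable_dns_hostnames:
        problems.append("set enableDnsHostnames=true on the VPC")
    return problems

print(private_dns_ready(True, True))    # [] -> private DNS can work
print(private_dns_ready(True, False))   # one fix needed: until then the
                                        # endpoint hostname resolves to
                                        # PUBLIC IPs and the call fails
```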

Use the Gateway endpoint wherever it exists (S3, DynamoDB) — it is free. Use the Interface endpoint for everything else, but be aware of the per-AZ hourly fee plus data processing fee. The break-even versus a NAT gateway depends on traffic volume; for SOA-C02, when the question emphasizes "avoid the internet" or "compliance requires private connectivity to AWS APIs", choose the Interface endpoint regardless of cost. When the question emphasizes "reduce NAT gateway data processing for S3 or DynamoDB traffic", the Gateway endpoint is the no-brainer answer. Reference: https://docs.aws.amazon.com/vpc/latest/privatelink/concepts.html

VPC Peering: Limits, Non-Transitive Routing, and the Hub Trap

A VPC peering connection is a one-to-one network link between two VPCs that allows private IP routing between them. It works across accounts and across regions (inter-region peering, since 2017).

Mechanics and configuration

A peering connection is established via a request/accept handshake:

  1. Owner of VPC-A initiates a peering request to VPC-B (same or different account).
  2. Owner of VPC-B accepts the request.
  3. Both sides add route table entries pointing the other VPC's CIDR at the peering connection ID (pcx-...). Without these, the peering exists but no traffic flows.
  4. Security groups and NACLs on both sides must allow the desired traffic.

For cross-account peering, the requester and accepter are different AWS accounts; for inter-region, the two VPCs are in different regions and the peering uses the AWS backbone, not the public internet.

Non-transitive routing — the most-tested trap

VPC peering is non-transitive. If VPC-A peers with VPC-B and VPC-B peers with VPC-C, VPC-A cannot reach VPC-C through B, even if you add routes pointing A → C through the A-B peering. AWS explicitly does not forward across peering connections. The only ways to make A talk to C are:

  • Peer A directly with C (full mesh — works, but at N VPCs requires N(N-1)/2 connections, unmanageable past about 5 VPCs).
  • Use AWS Transit Gateway as a hub (the modern answer).
  • Use a third-party transit appliance (out of scope for SOA).
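The mesh arithmetic from the first bullet, as a quick sketch:

```python
# Full-mesh peering connections needed for N VPCs: N * (N - 1) / 2,
# versus exactly one Transit Gateway attachment per VPC.
def peering_mesh(n: int) -> int:
    return n * (n - 1) // 2

for n in (3, 5, 10, 30):
    # VPC count, full-mesh peering connections, TGW attachments
    print(n, peering_mesh(n), n)
# 30 VPCs: 435 peering connections to manage, or 30 TGW attachments.
```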

Other VPC peering limitations

  • No overlapping CIDRs. Two VPCs with the same CIDR (or any overlap) cannot peer.
  • No edge-to-edge routing — VPC-B's IGW, NAT gateway, VPN, or Transit Gateway are not reachable from VPC-A through the peering. A peering carries only direct VPC-to-VPC traffic.
  • No DNS resolution by default — VPC-A resolving a private DNS name in VPC-B's private hosted zone requires explicit DNS sharing (via Route 53 Resolver rules or by enabling DNS resolution for the peering). The peering itself does not propagate DNS.
  • Active peering connections per VPC: default 50, hard maximum 125. Beyond that, switch to Transit Gateway.

The classic SOA-C02 trap: three VPCs (A, B, C). A peers with B, B peers with C, the engineer adds a route in A's table pointing C's CIDR to the A-B peering connection — and it does not work. Traffic from A reaches B and is dropped. The fix is either to peer A directly with C, or replace the mesh with a Transit Gateway. Candidates who do not know "non-transitive" pick the wrong answer every time. Reference: https://docs.aws.amazon.com/vpc/latest/peering/invalid-peering-configurations.html

AWS Transit Gateway: Hub-and-Spoke Replacement for Peering Mesh

AWS Transit Gateway (TGW) is a regional network transit hub that connects VPCs and on-premises networks through a hub-and-spoke topology. It is the answer to peering's non-transitive limitation and the answer to mesh management at scale.

Why Transit Gateway exists

A full mesh of N VPCs needs N(N-1)/2 peering connections. Beyond about 5 VPCs the operational burden — adding routes on both sides for every new pair, tracking who owns which connection, propagating DNS — becomes untenable. Transit Gateway lets you attach each VPC, on-premises VPN, and Direct Connect once, then control routing centrally through TGW route tables.

Operational components

  • Transit Gateway — the hub itself, regional, charged per attachment-hour and per GB of traffic processed.
  • Attachments — VPC attachments (one per VPC, multi-AZ ENIs in chosen subnets), VPN attachments, Direct Connect Gateway attachments, peering attachments to other TGWs (cross-region).
  • TGW Route tables — control which attachments can route to which. Multiple route tables let you implement segmentation: production VPCs route only to the production route table, dev VPCs to the dev table, shared services VPCs are reachable from both.
  • Route propagation vs static routes — VPN attachments can BGP-propagate routes; VPC routes are static.
  • Resource Access Manager (RAM) — share the TGW across AWS Organizations accounts so member accounts can attach their VPCs without owning the TGW.

Inter-region Transit Gateway peering

TGWs in different regions can peer with each other, allowing global hub-and-spoke architectures. Inter-region traffic uses the AWS backbone.

When SOA-C02 picks Transit Gateway

  • "We have 30 VPCs that all need to talk to a shared services VPC" → TGW.
  • "We have peering connections between every pair of VPCs and route table changes are unmanageable" → migrate to TGW.
  • "A spoke VPC needs to reach the on-premises network through the same path as another spoke" → TGW with VPN/DX attachment.

On SOA-C02, the moment a scenario mentions more than a handful of VPCs needing any-to-any or shared-services connectivity, the answer is Transit Gateway, not VPC peering. TGW also unlocks transitive routing (which peering forbids) and centralizes routing policy. The opposite is also true: for exactly two VPCs that need direct connectivity at minimal cost, VPC peering is simpler and cheaper than TGW. Reference: https://docs.aws.amazon.com/transit-gateway/latest/tgw/what-is-transit-gateway.html

Site-to-Site VPN: IPsec Tunnels, BGP, and Customer Gateways

A Site-to-Site VPN connects on-premises networks to a VPC over IPsec tunnels riding the public internet. It is the cheap, flexible hybrid-connectivity path; AWS Direct Connect is the higher-cost, dedicated alternative (mostly out of scope for SOA-C02 deep-dive).

Components

  • Customer gateway (CGW) — the on-premises side: a physical or virtual VPN device with a public IP. AWS represents it as a CGW resource holding the IP and BGP ASN.
  • Virtual private gateway (VGW) — the AWS-side VPN concentrator attached to a single VPC. Or:
  • Transit Gateway VPN attachment — attach the VPN directly to a TGW for multi-VPC hybrid connectivity.
  • VPN connection — pairs the CGW with the VGW (or TGW), creates two IPsec tunnels (for redundancy across AWS data centers), and yields a configuration file you load into the on-prem VPN device.

Two IPsec tunnels per VPN connection

Every Site-to-Site VPN connection provisions two tunnels to two different AWS public IP endpoints in different data centers within the region. This is for HA: if one tunnel fails, traffic fails over to the other with no downtime. For full HA, you also configure a second customer gateway on the on-prem side (separate physical device, separate ISP) and a second VPN connection — that gives you four tunnels.

Static routing vs BGP

Two routing models:

  • Static — you list the CIDR blocks reachable on-premises in the VPN connection configuration; AWS adds routes to the VGW/TGW route table; the on-prem device similarly has static routes.
  • BGP (dynamic) — the on-prem device runs BGP with the AWS endpoint, advertising the on-prem CIDRs and learning the VPC CIDRs. BGP enables automatic failover between tunnels and dynamic route propagation.

SOA-C02 prefers BGP for any scenario that emphasizes "automatic failover" or "dynamic routing", and static for "simple single-tunnel low-traffic site".

Pre-shared key

Each tunnel uses a pre-shared key (PSK) for IPsec authentication. AWS generates default PSKs but allows you to override during VPN connection creation; for compliance scenarios where the security team owns key generation, this is the configuration you want.

Site-to-Site VPN troubleshooting checklist

When the VPN connection is up in the AWS console but on-prem traffic does not flow:

  1. Both tunnels show UP in the VPC console? If only one is up, traffic still flows but redundancy is lost.
  2. Route table on the AWS side has the on-prem CIDR pointing to the VGW (or TGW)? Without route propagation enabled, BGP-learned routes do not appear automatically.
  3. On-prem device's tunnel interfaces are configured with the inside IPs from the AWS-provided config file?
  4. Security groups on the EC2 instances allow traffic from the on-prem CIDR?
  5. NACLs on the relevant subnets allow inbound from on-prem CIDR and outbound on the ephemeral range?
  6. On-prem firewall allows ESP (IP protocol 50), UDP 500 (IKE), and UDP 4500 (NAT-T) outbound to the AWS VPN endpoint IPs?

Choose BGP when: there are multiple tunnels and you want automatic failover; there are multiple VPN connections and you want load distribution; the on-prem network changes frequently (new subnets) and you want them auto-advertised. Choose static when: a single tunnel is fine; the on-prem device does not support BGP; CIDRs are stable. SOA-C02 favors BGP for any "fault-tolerant hybrid connectivity" scenario. Reference: https://docs.aws.amazon.com/vpn/latest/s2svpn/VPC_VPN.html
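The console check in step 1 can be scripted. A minimal sketch, assuming the input is shaped like the response from boto3's ec2.describe_vpn_connections(), whose VgwTelemetry entries report a Status of UP or DOWN per tunnel; the helper name down_tunnels and the sample IDs are illustrative:

```python
def down_tunnels(response):
    """Collect (vpn-id, tunnel outside IP) for every tunnel whose
    VgwTelemetry status is not UP."""
    problems = []
    for conn in response.get("VpnConnections", []):
        for tunnel in conn.get("VgwTelemetry", []):
            if tunnel.get("Status") != "UP":
                problems.append((conn["VpnConnectionId"],
                                 tunnel.get("OutsideIpAddress")))
    return problems

# Sample shaped like describe-vpn-connections output: second tunnel is DOWN.
sample = {"VpnConnections": [{
    "VpnConnectionId": "vpn-0abc",
    "VgwTelemetry": [
        {"OutsideIpAddress": "203.0.113.10", "Status": "UP"},
        {"OutsideIpAddress": "203.0.113.20", "Status": "DOWN"},
    ],
}]}
print(down_tunnels(sample))  # [('vpn-0abc', '203.0.113.20')]
```

In practice you would feed this the live response and alarm when the list is non-empty, which mirrors the CloudWatch TunnelState approach mentioned later in this guide.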

Session Manager: Replacing the SSH Bastion Entirely

For SOA-C02, AWS Systems Manager Session Manager is the canonical answer to "how do you administer EC2 instances in a private subnet without SSH or a bastion host?" It is also one of the most heavily tested SSM scenarios across both Domain 3 and Domain 5.

What Session Manager replaces

Traditional pattern: a public bastion host (jump box) in a public subnet with port 22 open to the corporate IP range. Engineers SSH to the bastion, then SSH from the bastion to private-subnet instances. Operational pain: SSH key distribution and rotation, bastion host hardening and patching, audit log scattering, public IP exposure on the bastion, port 22 exposure even if narrowly scoped.

Session Manager replaces this with:

  • Browser-based or AWS CLI shell access to the instance, no SSH client, no SSH keys.
  • No public IP required on the target instance.
  • No inbound port open on the security group of the target — Session Manager runs over an outbound HTTPS (port 443) connection initiated by the SSM agent.
  • Full audit trail of every keystroke logged to CloudWatch Logs or S3.
  • IAM-controlled access — ssm:StartSession permission on the user, and the instance must trust SSM via the instance profile.

What Session Manager requires on the instance

  • SSM agent installed and running. Pre-installed on most modern AWS-provided AMIs (Amazon Linux 2 / 2023, Ubuntu LTS, Windows Server 2016+).
  • IAM instance profile attached, with the managed policy AmazonSSMManagedInstanceCore (allows ssm:UpdateInstanceInformation, the messaging APIs, etc.).
  • Network reachability to Systems Manager endpoints. Three options:
    • Public subnet with internet route — the agent reaches ssm.region.amazonaws.com over the internet.
    • Private subnet with NAT gateway — same as above through NAT.
    • Private subnet with Interface VPC endpoints for com.amazonaws.region.ssm, com.amazonaws.region.ssmmessages, com.amazonaws.region.ec2messages — no NAT or internet required.

The third option is the SOA-C02 sanctioned answer for "administer instances in a private subnet without internet access".

Session Manager + Interface endpoints — the canonical SOA architecture

For a fully private-subnet workload that needs operational access:

  1. Create three Interface VPC endpoints in the VPC for ssm, ssmmessages, ec2messages.
  2. Attach them to a subnet in each AZ where instances run.
  3. Attach a security group to the endpoints allowing inbound 443 from the instance security group.
  4. Attach the AmazonSSMManagedInstanceCore policy to the instance role.
  5. Confirm the instance appears in Systems Manager Fleet Manager → Managed instances.
  6. Engineers run aws ssm start-session --target i-0abc123 from their laptop (with appropriate IAM) and get an interactive shell.

No bastion host. No public IP. No port 22. No SSH keys. Full audit trail.
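The three endpoint service names in step 1 follow one pattern per region. A tiny sketch (ssm_endpoint_services is a hypothetical helper; each name would be passed to the Interface-endpoint creation step above):

```python
def ssm_endpoint_services(region):
    """The three Interface endpoint service names Session Manager needs
    when the subnet has no internet path."""
    return ["com.amazonaws.%s.%s" % (region, svc)
            for svc in ("ssm", "ssmmessages", "ec2messages")]

print(ssm_endpoint_services("us-east-1"))
# ['com.amazonaws.us-east-1.ssm',
#  'com.amazonaws.us-east-1.ssmmessages',
#  'com.amazonaws.us-east-1.ec2messages']
```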

Whenever a SOA-C02 question asks "how to allow operations to access EC2 instances in a private subnet without exposing SSH" or "how to eliminate the bastion host while preserving administrative access", the answer pairs Session Manager with the three SSM Interface VPC endpoints (ssm, ssmmessages, ec2messages) plus the AmazonSSMManagedInstanceCore instance profile. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html

For compliance scenarios, configure session logging at the Session Manager preferences level: choose a CloudWatch Log group and/or an S3 bucket, optionally with KMS encryption. Every session's full transcript is persisted, including the user identity (IAM principal), the instance ID, and the start/end timestamps. This is the artifact your auditor wants. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/setup-create-vpc.html

AWS WAF and Shield — Brief Mention (See the Dedicated Topic)

Task Statement 5.1 also names AWS WAF and AWS Shield as network protection services. Those are covered in depth in the WAF, Shield, and Network Protection Services study note. Two operational facts to remember when reading the rest of this guide:

  • WAF operates at Layer 7 (HTTP) and attaches to CloudFront, ALB, API Gateway, or AppSync. It does not attach to a VPC, NACL, or security group. It inspects request content, not just IP/port.
  • Shield Standard is automatic and free for L3/L4 DDoS protection on every AWS resource. Shield Advanced is a paid service for additional L7 protection, cost protection on DDoS-induced scaling bills, and access to the AWS Shield Response Team.

VPC primitives (subnets, NACLs, security groups) are the Layer 3/4 defenses; WAF is the Layer 7 defense; Shield wraps both. The dedicated topic walks through rate-based rules, managed rule groups, Firewall Manager multi-account deployment, and Global Accelerator's built-in DDoS protection.

Scenario Pattern: EC2 in Private Subnet Cannot Reach S3

This is the canonical SOA-C02 troubleshooting scenario. The runbook:

  1. Confirm the instance has a working IAM role. Without an instance profile granting s3:GetObject (or whatever action), even a perfect network path returns 403. Check via EC2 console → Instance → Security tab.
  2. Confirm the security group allows outbound 443. Most default SGs allow all outbound, but some teams lock it down. The SG must allow port 443 to the S3 prefix list (or 0.0.0.0/0).
  3. Confirm the NACL allows outbound 443 AND inbound on the ephemeral range (1024–65535). NACL is stateless; outbound 443 alone leaves the response packets stranded.
  4. Confirm the route table has a path to S3. Either via NAT gateway (0.0.0.0/0 → NAT, then NAT → IGW) or via a Gateway VPC endpoint for S3 (S3 prefix list → endpoint).
  5. If using a Gateway endpoint, confirm the endpoint is associated with the subnet's route table. Creating the endpoint without associating it with the right route table is the most common misconfiguration.
  6. Check the endpoint policy. Default endpoint policy is allow-all; if the security team narrowed it to specific bucket ARNs and the request hits a different bucket, it will be denied at the endpoint.
  7. Check the bucket policy and ACL. A bucket policy that requires aws:SourceVpce = vpce-... blocks any access from outside the listed endpoint IDs.
  8. Check region. Gateway endpoints are region-scoped — they only serve same-region buckets. A bucket in us-east-1 accessed from a VPC in us-west-2 will not use the us-west-2 S3 endpoint.

Most common root causes in order of frequency: missing endpoint route in the subnet's route table, NACL ephemeral port rule missing, IAM role missing, cross-region bucket access.
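The top root cause, a missing endpoint route, is checkable from describe-route-tables output. A sketch, assuming the standard shape where an associated Gateway endpoint appears as a route with a managed prefix-list destination (pl-...) targeting a vpce-... gateway; the helper name and sample IDs are illustrative:

```python
def has_gateway_endpoint_route(route_table):
    """True if any route in this route-table dict targets a VPC endpoint
    (vpce-...) via a managed prefix list (pl-...), which is what an
    S3/DynamoDB Gateway endpoint association looks like."""
    for route in route_table.get("Routes", []):
        if (route.get("DestinationPrefixListId", "").startswith("pl-")
                and route.get("GatewayId", "").startswith("vpce-")):
            return True
    return False

private_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationPrefixListId": "pl-63a5400a", "GatewayId": "vpce-0f1e2d"},
]}
print(has_gateway_endpoint_route(private_rt))  # True
```

If this returns False for the subnet's route table, you have found the misconfiguration in step 5 before touching IAM or bucket policies.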

Scenario Pattern: VPC Peering Doesn't Route — Transitive Trap

The second canonical SOA-C02 scenario. Three VPCs:

  • VPC-A 10.0.0.0/16
  • VPC-B 10.1.0.0/16
  • VPC-C 10.2.0.0/16

A peers with B (pcx-AB). B peers with C (pcx-BC). The team adds to A's route table: 10.2.0.0/16 → pcx-AB, expecting traffic from A to be routed to B and then on to C. Packets leave A, reach B, and are silently dropped.

The diagnosis runbook:

  1. Verify the peering connections are Active. A pending or failed peering yields silent drops.
  2. Check the route tables on every leg. A→B requires A has 10.1.0.0/16 → pcx-AB and B has 10.0.0.0/16 → pcx-AB. A→C through B needs A has 10.2.0.0/16 → pcx-AB AND B has 10.2.0.0/16 → pcx-BC AND C has 10.0.0.0/16 → pcx-BC. Even with all three routes, AWS does not forward across peerings.
  3. The fix is one of:
    • Add direct peering A↔C with the appropriate routes.
    • Replace the peering mesh with a Transit Gateway: attach A, B, and C to the TGW, set up TGW route tables.
    • For limited cross-VPC service exposure, use PrivateLink (Interface endpoint with NLB origin) to expose a specific service from B to A and from B to C, without full network reachability.

The "AHA" of the trap is that AWS explicitly documents the non-transitive behavior — it is not a bug, it is by design. Memorize it.

Common Trap: NAT Gateway Charges per GB

A high-frequency SOA-C02 cost trap. NAT gateway has two billing dimensions: per hour the gateway exists ($0.045/hour in most regions) and per GB of data processed ($0.045/GB in most regions). The data processing fee applies to every byte going through the NAT, in either direction.

For a private subnet that pulls 1 TB/day from the internet, that is roughly 30 TB (30,000 GB) per month × $0.045/GB = $1,350/month in data processing alone, on top of the per-hour charge. Add the Gateway VPC endpoint for S3 (free) or the Interface endpoint for the relevant AWS service (much cheaper than NAT at most volumes) and the bill drops dramatically.
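The arithmetic is worth internalizing. A sketch using the per-hour and per-GB figures quoted above (typical-region prices taken from the text; 730 hours/month assumed):

```python
NAT_HOURLY = 0.045   # USD per hour the gateway exists (typical region)
NAT_PER_GB = 0.045   # USD per GB processed, either direction

def nat_monthly_cost(gb_processed, hours=730):
    """Both billing dimensions: existence per hour plus per-GB processing."""
    return NAT_HOURLY * hours + NAT_PER_GB * gb_processed

# 1 TB/day is roughly 30,000 GB/month.
print(round(nat_monthly_cost(30_000), 2))  # 1382.85 (of which 1350.00 is data processing)
```

The per-hour charge ($32.85/month here) is noise next to the data processing fee, which is why the fix is rerouting the heavy traffic through an endpoint, not deleting the gateway.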

The exam pattern: "the team noticed the NAT gateway data processing line on the bill is $X — what is the likely fix?" The answer is almost always "add a VPC endpoint for the high-volume service".

Common Trap: Gateway Endpoint vs Interface Endpoint Pricing

Candidates sometimes treat "VPC endpoint" as a single concept and miss the cost split:

  • Gateway endpoint (S3, DynamoDB only) — free. No hourly fee, no data processing fee.
  • Interface endpoint — paid. Hourly fee per AZ where the endpoint ENI exists, plus a per-GB data processing fee.

For S3 and DynamoDB, always use the Gateway endpoint — there is no Interface endpoint version of these services for general use (S3 has an interface endpoint variant for PrivateLink-based bucket policies, but it costs more and is rarely the right answer for SOA-C02).

For everything else, the Interface endpoint is your only option, and the cost trade-off versus NAT depends on traffic volume and whether the workload tolerates internet egress at all.

Common Trap: NACL vs SG Deny Override

This is a recurring SOA-C02 confusion. A security group has only allow rules; the absence of an allow is a deny, but no SG can express an explicit deny that overrides another SG. A NACL has both allow and deny rules; an explicit deny in a NACL cannot be overridden by any SG allow on the instance.

The packet flow is:

  • Inbound to instance: route table → NACL inbound (subnet) → security group inbound (ENI) → instance.
  • Outbound from instance: instance → security group outbound (ENI) → NACL outbound (subnet) → route table.

If the NACL denies inbound on port X, no SG on any instance in that subnet can accept port X. This is the architectural use case for NACLs — subnet-wide bans the instance owner cannot override (e.g., "no SSH from any IP at the subnet level, even if a developer accidentally opens SG rule 22 from 0.0.0.0/0").
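The stateless evaluation that makes both the ephemeral-port trap and the subnet-wide ban work can be modeled directly. A toy first-match evaluator, not an AWS API; rule numbers and port ranges are illustrative:

```python
def nacl_evaluate(rules, port):
    """Stateless NACL pass: rules are (rule_number, (low, high), action)
    tuples, scanned in ascending rule number; the first matching rule
    wins; no match hits the implicit deny (* rule)."""
    for _, (low, high), action in sorted(rules):
        if low <= port <= high:
            return action
    return "DENY"

outbound = [(100, (443, 443), "ALLOW")]          # HTTPS request leaves
inbound_broken = [(100, (443, 443), "ALLOW")]    # no ephemeral rule
inbound_fixed = [(100, (443, 443), "ALLOW"),
                 (110, (1024, 65535), "ALLOW")]  # ephemeral range added

print(nacl_evaluate(outbound, 443))         # ALLOW: request goes out
print(nacl_evaluate(inbound_broken, 50321)) # DENY: response to ephemeral port dropped
print(nacl_evaluate(inbound_fixed, 50321))  # ALLOW: response now returns
```

A stateful security group has no equivalent of the DENY branch, which is exactly why a NACL deny is the subnet-wide floor no SG can override.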

SOA-C02 vs SAA-C03: The Operational Lens

SAA-C03 and SOA-C02 both test VPC, but the lenses differ.

  • Choosing endpoint type — SAA-C03: "Which endpoint type fits S3 access from a private VPC?" SOA-C02: "The S3 calls fail — list every check from route table to endpoint policy."
  • Public vs private subnet — SAA-C03: "Which tier should the database go in?" SOA-C02: "The instance has a public IP but cannot reach the internet — what is missing?"
  • NACL vs SG — SAA-C03: "Which provides subnet-level filtering?" SOA-C02: "The NACL outbound 443 is allowed but the response is dropped — fix?" (ephemeral ports)
  • VPC peering — SAA-C03: "Architect a multi-VPC connectivity solution." SOA-C02: "VPC-A → VPC-C through VPC-B does not work — diagnose." (transitive trap)
  • Transit Gateway — SAA-C03: "When to choose TGW over peering." SOA-C02: "Migrate the existing 8-VPC peering mesh to TGW — list steps."
  • Session Manager — SAA-C03: "Which service replaces the bastion host?" SOA-C02: "Configure Session Manager for a private subnet with no internet — list endpoints."
  • Site-to-Site VPN — SAA-C03: "Choose VPN vs Direct Connect." SOA-C02: "The VPN tunnel is up but routes do not propagate — what flag?" (route propagation)
  • NAT vs endpoint — SAA-C03: "Architect cost-effective private connectivity." SOA-C02: "Reduce NAT gateway data processing for S3 traffic — implement now."

The SAA candidate selects the architecture; the SOA candidate configures the route table, opens the right port, picks the right endpoint, and diagnoses why the packet did not arrive.

Exam Signal: How to Recognize a Domain 5.1 Question

Domain 5.1 questions on SOA-C02 follow predictable shapes. Recognize them and your time per question drops dramatically.

  • "The instance cannot reach the internet" — check route table 0.0.0.0/0 destination, public IP/EIP, security group outbound, NACL outbound + ephemeral inbound.
  • "The instance cannot reach S3 / DynamoDB" — check IAM role, security group, NACL ephemeral, route table for Gateway endpoint or NAT, endpoint policy, bucket policy aws:SourceVpce constraint.
  • "The instance cannot reach an AWS API like Secrets Manager / SSM / KMS from a private subnet" — check Interface VPC endpoint exists, security group on endpoint allows 443 from instance, VPC has enableDnsSupport and enableDnsHostnames true, instance subnet's AZ is in the endpoint's AZ list.
  • "VPC peering does not forward traffic to the third VPC" — non-transitive trap; switch to TGW or add direct peering.
  • "VPN tunnel is up but on-prem routes don't appear" — enable route propagation on the VPC route table for the VGW, or check BGP on-prem advertising.
  • "Eliminate the bastion host" — Session Manager + SSM Interface endpoints + AmazonSSMManagedInstanceCore instance profile.
  • "Reduce NAT gateway data processing charges" — Gateway endpoint for S3/DynamoDB (free); or Interface endpoint for the heavy AWS API.
  • "NACL allows outbound but response is dropped" — ephemeral port range 1024–65535 inbound rule on the NACL.
  • "Security group rule cannot exempt the deny" — NACL deny is the floor; SG cannot override.
  • "Connectivity broke after a route table change" — check that the subnet is still associated with the right route table; check the most-specific route is the intended target.

With Domain 5 worth 18 percent of the exam and Task Statement 5.1 carrying VPC, endpoints, peering, TGW, VPN, and WAF/Shield, expect 5 to 7 questions in this exact territory. Mastering the patterns in this section is one of the highest-leverage study activities for SOA-C02. Reference: https://d1.awsstatic.com/onedam/marketing-channels/website/aws/en_US/certification/approved/pdfs/docs-sysops-associate/AWS-Certified-SysOps-Administrator-Associate_Exam-Guide.pdf

Decision Matrix — VPC Construct for Each SysOps Goal

Use this lookup during the exam.

Each entry maps an operational goal to its primary construct, with operational notes.

  • Allow private subnet to reach internet outbound → NAT gateway. One per AZ for HA; route 0.0.0.0/0 → NAT in the private route table.
  • Allow private subnet to reach S3 / DynamoDB privately → Gateway VPC endpoint. Free; route table entry for the service prefix list.
  • Allow private subnet to reach an AWS API (Secrets Manager, SSM, KMS) privately → Interface VPC endpoint. One ENI per AZ; SG allows 443; private DNS on.
  • Subnet-wide ban that instance owners cannot override → NACL deny rule. Rules are numbered; the lowest-numbered match wins.
  • App-tier-to-DB-tier filtering that survives instance churn → Security group referencing another SG ID. "Allow 5432 from sg-app to sg-db."
  • Connect 2 VPCs in the same region → VPC peering. Cheapest; remember it is non-transitive.
  • Connect many VPCs in an any-to-any topology → Transit Gateway. Hub-and-spoke; share via RAM across accounts.
  • Connect on-premises to a single VPC → Site-to-Site VPN with a Virtual Private Gateway. BGP for failover; two redundant tunnels.
  • Connect on-premises to many VPCs → Site-to-Site VPN attached to a Transit Gateway. One VPN, multi-VPC reach.
  • Eliminate the SSH bastion → Session Manager + SSM Interface endpoints + IAM instance profile. No public IP, no port 22.
  • Audit operator shell sessions → Session Manager logging to CloudWatch Logs / S3 with KMS. Per-session transcript persisted.
  • Reduce NAT gateway data processing for S3 → Add a Gateway VPC endpoint for S3. Free; immediate cost reduction.
  • Restrict S3 access to only this VPC → Bucket policy with aws:SourceVpce. Plus an endpoint policy for defense in depth.
  • HA outbound NAT across AZs → One NAT gateway per AZ + per-AZ private route tables. Cross-AZ NAT routing wastes data transfer cost.
  • Cost-sensitive low-volume NAT → NAT instance. You own patching, HA, and disabling the source/destination check.
  • Detect a failed VPN tunnel → CloudWatch TunnelState metric per tunnel. Alarm on < 1 for either tunnel.
  • Block specific-country traffic at L7 → AWS WAF geo-match rule. Not a VPC concept; WAF attaches to ALB/CloudFront.
  • Block a UDP flood at L3/L4 → Shield Standard (automatic) + AWS Network Firewall. NACLs and SGs are not designed for DDoS.

Common Traps Recap — VPC Configuration and Connectivity

Every SOA-C02 attempt will see two or three of these distractors.

Trap 1: NACL is stateful

It is not. NACLs are stateless; outbound and inbound rules are independent. Every TCP response needs an explicit inbound (or outbound) rule on the ephemeral port range.

Trap 2: Detailed monitoring fixes the network

VPC connectivity issues are not solved by enabling detailed monitoring or VPC Flow Logs alone. Flow Logs help diagnose after the failure; the fix is route table, SG, NACL, or endpoint configuration.

Trap 3: VPC peering carries transitive routes

It does not. A → B and B → C does not yield A → C. Use Transit Gateway or direct peering.

Trap 4: Interface endpoint without DNS attributes

enableDnsSupport and enableDnsHostnames both must be true for private DNS on Interface endpoints to resolve correctly.

Trap 5: Gateway endpoint forgotten in the subnet's route table

Creating the endpoint is not enough; you must associate it with the route table of every subnet that needs it.

Trap 6: NACL ephemeral range too narrow

Different OSes use different ephemeral ranges. The safe rule is TCP 1024–65535 ALLOW in both inbound and outbound on the NACL.

Trap 7: Security group rule with overlapping deny intent

You cannot write "deny" in a security group. If the rule must be a hard deny, it belongs in a NACL.

Trap 8: NAT gateway charges only by the hour

NAT gateway charges per hour AND per GB processed. The data processing fee is the silent budget killer.

Trap 9: Cross-region S3 via Gateway endpoint

Gateway endpoints are region-scoped. A VPC in us-west-2 with an S3 Gateway endpoint cannot reach a us-east-1 bucket through it; that traffic still goes over the internet (or NAT).

Trap 10: Bastion host as the SSH answer

For SOA-C02, the modern answer is Session Manager. A bastion host is a legacy pattern the exam considers second-best.

Trap 11: Default route table propagation off

Propagation is a per-route-table setting for VPN BGP routes. If it is off, BGP-learned routes do not appear and on-prem routes do not work even though the tunnel is up.

Trap 12: Single NAT gateway = cross-AZ data transfer

A single NAT gateway in AZ-a serves traffic from AZ-b and AZ-c, but cross-AZ traffic incurs data transfer charges. Best practice is one NAT gateway per AZ.

FAQ — VPC Configuration and Connectivity

Q1: When does a Gateway VPC endpoint beat an Interface VPC endpoint?

Gateway endpoints exist only for S3 and DynamoDB and are free. Interface endpoints exist for nearly every other AWS service and cost per AZ-hour plus per-GB data processing. So the rule is simple: for S3 and DynamoDB, always use the Gateway endpoint — there is no scenario where the Interface endpoint variant is the SOA-C02 right answer for general access. For every other service, the Interface endpoint is your only option, and the question becomes whether the Interface endpoint's hourly + data fee is cheaper than the alternative path (NAT gateway, internet egress) at your traffic volume. For most SOA-C02 questions the framing is "private connectivity required" and the cost comparison is moot — the Interface endpoint is the correct answer because the workload cannot tolerate internet egress at all.

Q2: Why does my private subnet instance fail to reach Secrets Manager even though the Interface endpoint exists?

Three high-frequency causes. First, the VPC may not have enableDnsSupport and enableDnsHostnames both set to true — without both, the private DNS name secretsmanager.region.amazonaws.com does not resolve to the endpoint's private IP, and the application falls back to the public IP, which from a private subnet just times out. Second, the security group on the endpoint ENI may not allow inbound 443 from the application's security group — endpoint ENIs need their own SG, separate from the application's. Third, the endpoint may not be enabled in the AZ where the calling instance lives — endpoints are per-AZ ENIs, and an instance in an AZ without an endpoint ENI cannot use the endpoint via private DNS. Check all three before assuming the endpoint is broken.

Q3: How do I choose between VPC peering and Transit Gateway for connecting multiple VPCs?

For exactly two VPCs that need full any-to-any connectivity, VPC peering is cheaper and simpler — no per-attachment hourly fee, no per-GB processing fee for traffic within the peering. For three or more VPCs needing any-to-any connectivity, Transit Gateway wins because peering's non-transitive routing forces you into a full mesh that becomes unmanageable past about five VPCs (10 connections at 5 VPCs, 45 at 10, 190 at 20). TGW also handles VPN and Direct Connect attachments natively, lets you segment routing with multiple TGW route tables, and supports inter-region TGW peering for global hub-and-spoke. The general SOA-C02 rule: more than five VPCs or any need for transitive routing → TGW; exactly two VPCs with simple connectivity → peering.
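The mesh growth quoted above is just the pairwise-connection formula n(n-1)/2, easy to verify:

```python
def full_mesh_peerings(n):
    """Peering connections needed for any-to-any connectivity among n
    VPCs: one per distinct pair, since peering is non-transitive."""
    return n * (n - 1) // 2

print([full_mesh_peerings(n) for n in (2, 5, 8, 10, 20)])
# [1, 10, 28, 45, 190]
```

With a Transit Gateway the same topologies need only n attachments, which is why the management gap widens so quickly past a handful of VPCs.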

Q4: What is the canonical Session Manager configuration for a fully private VPC?

Five components. (1) The EC2 instance has the SSM agent running — pre-installed on Amazon Linux 2 / 2023, Ubuntu LTS, Windows Server 2016+. (2) The instance has an IAM instance profile with the AmazonSSMManagedInstanceCore managed policy attached. (3) The VPC has three Interface VPC endpoints: com.amazonaws.region.ssm, com.amazonaws.region.ssmmessages, com.amazonaws.region.ec2messages, each enabled in the AZs where instances run. (4) The endpoint security groups allow inbound 443 from the instance security groups. (5) The user starting the session has IAM permission ssm:StartSession on the target instance. With those five in place, the engineer runs aws ssm start-session --target i-0abc123 and gets an interactive shell over the AWS private network with no public IP, no SSH key, no port 22, and a full audit trail logged to CloudWatch Logs or S3.

Q5: Why does my Site-to-Site VPN tunnel show UP but on-prem traffic does not flow to the VPC?

Most common cause: route propagation is not enabled on the VPC route table for the Virtual Private Gateway. BGP-learned routes are advertised to the VGW but only appear in the VPC route table if propagation is on. Second cause: the on-prem device is BGP-advertising the wrong CIDRs — verify the advertised prefixes match what AWS expects. Third cause: NACLs on the relevant subnets do not allow inbound from the on-prem CIDR or do not allow outbound on the ephemeral range for return traffic. Fourth cause: security groups on the destination instances do not allow traffic from the on-prem CIDR range. Fifth cause: the on-prem firewall is not allowing ESP (IP protocol 50), UDP 500 (IKE), and UDP 4500 (NAT-T). Walk through each layer in order — route propagation is the highest-frequency miss on SOA-C02.

Q6: How do I deny SSH from a specific country at the subnet level?

NACLs only support CIDR-based rules, not country-based. To deny an entire country at the subnet level, you would need a list of all CIDR ranges assigned to that country (large and changing), packaged as deny rules in the NACL. At scale this is impractical. The exam-correct answer for country-level blocking is AWS WAF geo-match rules attached to the workload's CloudFront distribution or ALB — WAF maintains the country-to-CIDR mapping and updates it automatically. NACLs are the right tool for "deny this specific bad IP range" or "subnet-wide ban regardless of SG"; WAF is the right tool for L7 country, header, body content, and rate-based filtering. WAF is covered in the dedicated SOA-C02 topic.

Q7: What is the difference between a NAT gateway and a NAT instance?

A NAT gateway is a managed AWS service: AWS provisions, scales, patches, and operates it. AZ-scoped, charged per hour and per GB processed, capacity up to 100 Gbps. A NAT instance is a regular EC2 instance running NAT software (typically a community AMI with iptables masquerading). You manage patching, HA (which means an Auto Scaling group of size 1 with health checks), and you must disable the EC2 source/destination check for it to forward packets. NAT instance is cheaper at very low volume because you pay only the EC2 instance cost, but for any production-grade availability and throughput, the NAT gateway wins on operational simplicity. SOA-C02 prefers NAT gateway for every scenario except "minimize cost for a low-volume dev environment".

Q8: How does the most-specific route rule work in a VPC route table?

When a packet leaves an instance, the VPC router examines the subnet's route table and picks the route with the longest matching prefix to the destination IP. Concretely: if the route table has 10.0.0.0/16 → local and 10.0.5.0/24 → pcx-AB (peering), traffic to 10.0.5.10 goes through the peering connection because /24 is more specific than /16. This is how Gateway endpoints, peering routes, and VPN routes selectively override the default 0.0.0.0/0 → IGW or NAT. For SOA-C02 questions involving "the traffic is going to the wrong place", check the route table for a more-specific route than the one you expected to win.
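The selection rule is easy to model with Python's ipaddress module. A toy router over (cidr, target) pairs; the route targets mirror the example above and are illustrative:

```python
import ipaddress

def pick_route(routes, dest_ip):
    """Longest-prefix match, as the VPC router does: among all routes
    whose CIDR contains the destination, the most specific wins."""
    ip = ipaddress.ip_address(dest_ip)
    candidates = [(ipaddress.ip_network(cidr), target)
                  for cidr, target in routes
                  if ip in ipaddress.ip_network(cidr)]
    return max(candidates, key=lambda c: c[0].prefixlen)[1]

routes = [("10.0.0.0/16", "local"),
          ("10.0.5.0/24", "pcx-AB"),
          ("0.0.0.0/0", "nat-gw")]

print(pick_route(routes, "10.0.5.10"))  # pcx-AB: /24 beats /16
print(pick_route(routes, "10.0.9.10"))  # local: only /16 and /0 match
print(pick_route(routes, "8.8.8.8"))    # nat-gw: default route
```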

Q9: When does a security group rule referencing another security group beat a CIDR rule?

Security group references win whenever the source set is "instances belonging to another tier" rather than a fixed IP range. Imagine an Auto Scaling group of web servers behind an ALB — the IPs change every time an instance is launched or replaced, so a CIDR-based rule on the database security group would have to be updated constantly. By writing the database SG rule as "allow 5432 from sg-web", the rule keeps working as web instances come and go because it identifies the source by SG membership, not by IP. This is the SOA-C02 best-practice pattern for any tier-to-tier filtering. The only reason to use CIDR-based rules in a security group is when the source is on-premises (no SG to reference) or in a peer VPC where SG references work only across same-region peerings (cross-region peerings do not propagate SG references).

Q10: Why does the team get charged for cross-AZ data transfer with a single NAT gateway?

Because every NAT gateway is AZ-scoped — it lives in exactly one AZ — and traffic from instances in other AZs must cross the AZ boundary to reach it. AWS charges $0.01/GB for inter-AZ data transfer on top of the NAT gateway's per-hour and per-GB processing fees. A single NAT gateway shared by three AZs incurs cross-AZ charges for two of the three AZs' traffic. The fix is one NAT gateway per AZ, with each AZ's private subnet route table pointing to its same-AZ NAT gateway. This pattern also improves availability — an AZ failure that takes out one NAT gateway does not affect the other two AZs' egress.
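A quick sketch of the penalty, using the $0.01/GB inter-AZ figure above (helper name and traffic volumes are illustrative):

```python
INTER_AZ_PER_GB = 0.01  # USD/GB inter-AZ data transfer (figure from the text)

def cross_az_transfer_cost(gb_per_az, azs=3):
    """With a single NAT gateway, every AZ except the NAT's own pays
    inter-AZ transfer on its egress before NAT processing fees even start."""
    return gb_per_az * (azs - 1) * INTER_AZ_PER_GB

# 10,000 GB/month per AZ across three AZs: two AZs pay the penalty.
print(cross_az_transfer_cost(10_000))  # 200.0
```

One NAT gateway per AZ makes this term zero, at the price of two extra NAT hourly charges, which is usually the cheaper and more available trade.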

Q11: How many VPC peering connections can I have, and when do I switch to Transit Gateway?

The default soft quota is 125 active peering connections per VPC (raisable via support). Practically, the management overhead becomes painful much sooner — at 8 VPCs, a full mesh requires 28 peering connections, each with route table entries on both sides, with no central place to enforce routing policy or segment traffic. Most operations teams switch to Transit Gateway between 5 and 10 VPCs. TGW also unlocks transitive routing, central route table policy, integration with VPN and Direct Connect attachments, and cross-account sharing via AWS RAM. The SOA-C02 trigger phrase is "the team is struggling to manage routing across many VPCs" → TGW.

Once VPC primitives are configured, the next operational layers are:

  • Route 53 DNS and CloudFront for the public-facing edge of the VPC.
  • WAF, Shield, and network protection services for the L7 and DDoS defenses on top of the L3/L4 firewalls covered here.
  • VPC network troubleshooting for the Flow Logs, ELB access logs, and Reachability Analyzer toolkit that finds problems in this fabric after the fact.
  • EC2 Auto Scaling and ELB high availability for the workload tier whose subnets, security groups, and route tables are designed in this guide.

Official Sources