
Data Analytics and Streaming Architecture at Scale

8,200 words · ≈ 41 min read

Data analytics at scale is the defining architectural theme of the AWS Certified Solutions Architect Professional (SAP-C02) exam. Where the Associate exam treats analytics as a single service selection (Amazon Athena, or Amazon Redshift, or Amazon EMR), the Professional exam treats data analytics at scale as a multi-account, multi-team, multi-region governance problem: how does a 500-analyst organization share curated datasets across twenty product teams without violating GDPR, without duplicating storage, and without forcing every team through a central ETL bottleneck? Every SAP-C02 scenario about data analytics at scale layers three concerns on top of each other — the physical plane (where do the bytes live and how do they flow), the logical plane (who owns which dataset and who consumes it), and the governance plane (who may see which column, which row, which cell, under which legal basis). Getting data analytics at scale right on SAP-C02 is worth roughly eight to twelve exam questions on a typical sitting, which is why this is one of the longest topics in the study guide.

This study guide walks through every service and pattern the SAP-C02 exam uses to test data analytics at scale: the modern data lake pattern with Amazon S3 bronze, silver, and gold zones plus Apache Iceberg and open table formats; AWS Glue Data Catalog federation and AWS Lake Formation LF-Tags with row-level and column-level security at organization scale; the streaming decision tree across Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon Managed Service for Apache Flink, and Amazon MSK; Amazon Redshift RA3 with Amazon Redshift Serverless, Amazon Redshift Spectrum, federated query, concurrency scaling, materialized views, and zero-ETL integrations; Amazon DataZone as the cross-team governance portal; cross-account data sharing through AWS Data Exchange, Amazon Redshift Data Sharing, and AWS Resource Access Manager; Amazon OpenSearch Service for observability and search; Amazon SageMaker Feature Store and the lake house integration; and AWS Glue Data Quality plus Amazon Deequ for data quality at scale. A running scenario — a retail organization with a 20-team data lake serving 500 analysts, bound by GDPR — threads the whole topic together so you can see how data analytics at scale looks in a real SAP-C02 answer.

Data Analytics at Scale — the SAP-C02 Mental Model

Data analytics at scale on AWS is not a single service; it is a layered architecture. The SAP-C02 exam rewards candidates who can name the layer and pick the right service for that layer, not candidates who memorize service features in isolation. There are six layers in the reference model the SAP-C02 exam uses for data analytics at scale, and every scenario maps to one or more of them.

The six layers of data analytics at scale

Layer one is ingestion — getting data into the platform from operational databases, applications, partners, SaaS tools, IoT devices, and clickstreams. SAP-C02 services in this layer include Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon MSK, AWS Database Migration Service, AWS DataSync, AWS Transfer Family, and zero-ETL integrations from Amazon Aurora and Amazon DynamoDB into Amazon Redshift.

Layer two is storage — the actual bytes. Amazon S3 is the canonical data lake substrate; Amazon Redshift RA3 managed storage is the warehouse substrate; Amazon OpenSearch Service storage is the search substrate. The key SAP-C02 insight for data analytics at scale is that storage and compute must be decoupled, so that multiple engines can read the same bytes without duplicating them.

Layer three is the catalog — the metadata layer that describes tables, schemas, and partitions. AWS Glue Data Catalog is the default; it federates out to Hive Metastore on Amazon EMR, to Amazon Redshift via Amazon Redshift Spectrum external schemas, and to cross-account Glue catalogs. Data analytics at scale means one catalog spanning the whole organization.

Layer four is governance — the permissions, lineage, data quality, and auditing layer. AWS Lake Formation is the permissions plane for Amazon S3-backed tables; Amazon DataZone is the business-facing governance portal; AWS Glue Data Quality and Amazon Deequ are the data quality rule engines; AWS CloudTrail and AWS Lake Formation audit logs close the loop for compliance.

Layer five is compute and query — the engines that read the bytes and return answers. Amazon Athena for ad-hoc serverless SQL, Amazon Redshift Serverless and Amazon Redshift RA3 for curated marts, Amazon EMR Serverless for Apache Spark workloads, Amazon Managed Service for Apache Flink for streaming analytics, and Amazon SageMaker for machine learning workloads. Data analytics at scale requires that every engine honors the same Lake Formation permissions.

Layer six is consumption — the dashboards, notebooks, APIs, and embedded experiences that business users actually interact with. Amazon QuickSight is the native BI tool; third-party BI tools connect through Amazon Athena or Amazon Redshift JDBC/ODBC; Amazon DataZone projects give analysts a self-service portal.

Why SAP-C02 treats data analytics at scale differently from SAA-C03

SAA-C03 will accept "Amazon Athena plus AWS Lake Formation plus Amazon QuickSight" for most analytics scenarios. SAP-C02 will not accept that answer when the scenario involves twenty teams, multi-account AWS Organizations, petabyte-scale storage, sub-second concurrency, cross-region compliance, or active-active failover. The Professional exam is looking for the composite answer: Lake Formation LF-Tags federated across accounts, Amazon Redshift RA3 with Redshift Data Sharing for curated marts, Amazon Managed Service for Apache Flink for streaming aggregates, Amazon DataZone for self-service discovery, and AWS Glue Data Quality rules attached to every silver-zone table. Data analytics at scale is always a multi-service, multi-account answer on SAP-C02.

Plain-Language Explanation: Data Analytics at Scale on AWS

Data analytics at scale can feel abstract, so here are four analogies from different domains that make the architecture concrete for the SAP-C02 exam.

Analogy 1 — The library network

A single library is easy to govern. A national library network of twenty branches serving half a million readers is a completely different problem, and that is what data analytics at scale looks like.

  • Each branch library is one team's AWS account with its own Amazon S3 data lake.
  • The national card catalog is the AWS Glue Data Catalog federated across accounts — a reader in branch A can look up a book shelved in branch B without leaving their seat.
  • The national librarian council is AWS Lake Formation with LF-Tags — they define rules like "all books tagged GDPR-EU may only be read by readers with an EU-resident bracelet," and every branch enforces them the same way.
  • The self-service discovery kiosk is Amazon DataZone — a reader types "quarterly revenue" and the kiosk lists every branch that has a matching dataset, who owns it, and how to request access.
  • The inter-library loan desk is AWS Data Exchange plus Amazon Redshift Data Sharing plus AWS Resource Access Manager — when branch A wants to share a curated collection with branch B (or with a partner), these services move the permission, not the book.
  • The reading rooms are the query engines — Amazon Athena for quick reference, Amazon Redshift for heavy research, Amazon EMR for rebinding old manuscripts.
  • The rare books vault is Amazon S3 Glacier for cold archival; the current-events display is Amazon OpenSearch Service for hot searchable data.

When a SAP-C02 scenario describes twenty teams, many accounts, shared datasets, and central governance, the library network analogy tells you the answer is data analytics at scale using Lake Formation plus DataZone plus Redshift Data Sharing — not a single monolithic data warehouse.

Analogy 2 — The modern kitchen brigade

A restaurant kitchen at scale is a brigade of specialized stations, and data analytics at scale on AWS works the same way.

  • The receiving dock is ingestion — Amazon Kinesis Data Streams, Amazon MSK, and AWS DMS receive raw ingredients (events, CDC streams, files).
  • The cold storage walk-in is the bronze zone in Amazon S3 — everything lands here untouched, like vegetables still in their crates.
  • The prep station is the silver zone — AWS Glue jobs or Amazon EMR Serverless clean, dedupe, and re-package ingredients into Apache Parquet or Apache Iceberg tables.
  • The plating station is the gold zone — Amazon Redshift materialized views or AWS Glue jobs compose dimensional marts that look exactly like what the diner will see.
  • The expediter is AWS Lake Formation — they watch every dish leaving the kitchen and make sure no restricted ingredient ends up on a plate that should not have it (row-level, column-level, and cell-level security).
  • The menu is Amazon DataZone — diners (analysts) see only the dishes they are allowed to order, with descriptions written in business language.
  • The head chef is the data governance council — they decide which stations exist, which recipes are gold standard, and who gets to taste what.

The kitchen brigade analogy is the right mental model whenever a SAP-C02 scenario asks about bronze, silver, and gold zones, multi-team ownership, and quality gates. Data analytics at scale is never one person doing everything; it is a brigade with explicit handoffs.

Analogy 3 — The toolbox

The streaming layer of data analytics at scale confuses many candidates because AWS offers so many overlapping services. The toolbox analogy pins each one to a use case.

  • Amazon Kinesis Data Streams is the screwdriver — low-level, general-purpose, gives you full control over producers and consumers, retention up to 365 days. Reach for it when you write your own consumer code and need millisecond-level control.
  • Amazon Data Firehose is the nail gun — high-throughput, one-shot, configure the destination and fire. Reach for it when the job is "land this stream in Amazon S3 (or Amazon OpenSearch Service, or Amazon Redshift, or Splunk, or a third-party HTTP endpoint) without writing a consumer."
  • Amazon Managed Service for Apache Flink is the power drill with variable speed — continuous, stateful, windowed computation in Apache Flink. Reach for it when the job is "compute a one-minute tumbling average per user across a million events per second."
  • Amazon MSK is the industrial lathe — full Apache Kafka semantics, partitions, consumer groups, exactly-once processing, tiered storage. Reach for it when the organization has invested in the Apache Kafka ecosystem and needs compatibility.
  • Amazon Managed Service for Apache Flink Studio (formerly Amazon Kinesis Data Analytics Studio) is the bench grinder — interactive notebooks for streaming SQL and Python.

The toolbox analogy answers the single most common SAP-C02 trap: "Amazon Kinesis Data Streams or Amazon MSK?" The right pick is the right tool for the job, and the exam rewards the pick that matches the language in the scenario.

Analogy 4 — The insurance policy

Data quality and governance at scale behave like an insurance policy on the whole data analytics platform.

  • AWS Glue Data Quality rules are the monthly premium — you pay a small amount of compute every day to detect broken data before it poisons the gold zone.
  • Amazon Deequ is the DIY home inspection — library-based rule framework on Amazon EMR or Amazon SageMaker when you need richer rules than Glue Data Quality provides out of the box.
  • AWS Lake Formation LF-Tags are the policy rider — a single tag expression covers a thousand tables so that a GDPR classification change takes effect everywhere at once.
  • Amazon DataZone lineage is the claims record — when something goes wrong in a dashboard, lineage shows you every upstream hop so you can file the right claim (and fix the right bug).

Data analytics at scale without data quality and governance is uninsured; the first incident wipes out the value. The SAP-C02 exam rewards answers that wire quality and governance in from day one, not bolted on later.

Modern Data Lake — S3 Bronze, Silver, Gold, and Apache Iceberg

The data lake is the storage backbone of data analytics at scale on AWS. The SAP-C02 exam expects you to know the multi-zone pattern, the open table format options, and the catalog federation design.

Bronze, silver, and gold zones at organization scale

At SAA-C03 you learned bronze (raw), silver (cleansed), gold (consumable) as a single-account pattern. At SAP-C02 scale, each zone is partitioned further. Bronze is a landing zone per source system, owned by the ingestion team, retained by regulatory policy. Silver is the shared curated layer, owned by a central data platform team, written by AWS Glue ETL or Amazon EMR Serverless jobs that also run AWS Glue Data Quality rules. Gold is multi-owner — each product team owns its own gold marts in its own AWS account, registered into a federated AWS Glue Data Catalog so other teams can discover and query them.

Apache Iceberg, Apache Hudi, and Delta Lake — open table formats

Raw Apache Parquet files on Amazon S3 do not support transactions, schema evolution, or time travel. Open table formats solve this by adding a metadata layer on top of Parquet files. Apache Iceberg is the first-class citizen on AWS — supported natively by Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, and AWS Lake Formation. Apache Hudi and Delta Lake are also supported by Amazon EMR and AWS Glue, but Apache Iceberg is the SAP-C02 default answer for "we need ACID transactions, row-level updates and deletes, schema evolution, and time travel on the data lake."

For data analytics at scale, Apache Iceberg matters because GDPR right-to-be-forgotten requests require row-level deletes, which raw Parquet cannot express. Apache Iceberg delete files plus a scheduled compaction job turn "forget this customer" into a one-line SQL statement.

Apache Iceberg is an open table format that adds ACID transactions, row-level updates and deletes, schema evolution, hidden partitioning, and snapshot-based time travel to Apache Parquet files on object storage. AWS Glue Data Catalog natively tracks Iceberg table metadata, and Amazon Athena, Amazon Redshift, AWS Glue, and Amazon EMR read and write Iceberg tables with full Lake Formation governance. Apache Iceberg is the SAP-C02 default for a modern data lake that needs transactional semantics at scale.
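To make the merge-on-read mechanics concrete, here is a toy Python model of how delete files work. This is not the real Iceberg format or API — just a sketch of the idea that data files stay immutable, a right-to-be-forgotten request appends a cheap delete marker, and a later compaction physically removes the rows:

```python
# Toy model (NOT the real Iceberg format) of merge-on-read deletes:
# data files are immutable; "forget this customer" appends a delete
# entry, and readers subtract deleted rows at scan time until a
# compaction rewrites the data files without them.

data_files = {
    "part-0001.parquet": [{"customer_id": 1, "email": "a@x"},
                          {"customer_id": 2, "email": "b@x"}],
    "part-0002.parquet": [{"customer_id": 3, "email": "c@x"}],
}
delete_files = []  # each entry: a predicate captured from DELETE FROM

def delete_from(predicate):
    """Equivalent of: DELETE FROM customers WHERE <predicate> -- a cheap metadata write."""
    delete_files.append(predicate)

def scan():
    """Readers merge data files with delete files on the fly."""
    for rows in data_files.values():
        for row in rows:
            if not any(pred(row) for pred in delete_files):
                yield row

def compact():
    """Scheduled compaction physically rewrites files; the PII is now gone."""
    global delete_files
    for name in list(data_files):
        data_files[name] = [r for r in data_files[name]
                            if not any(p(r) for p in delete_files)]
    delete_files = []

delete_from(lambda r: r["customer_id"] == 2)   # right-to-be-forgotten request
assert [r["customer_id"] for r in scan()] == [1, 3]
compact()                                       # delete files drained, bytes gone
```

The key property the sketch illustrates: queries see the deletion immediately, while the expensive file rewrite is deferred to the scheduled compaction job.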

Glue Data Catalog federation across accounts

Data analytics at scale almost always spans multiple AWS accounts in AWS Organizations. A single central Glue Data Catalog can become a bottleneck, so AWS supports Glue Data Catalog federation: each producing account keeps its own Glue Data Catalog, and consumer accounts register federated catalogs that point to the producers. AWS Lake Formation brokers cross-account grants through AWS Resource Access Manager. For data analytics at scale on SAP-C02, federation is the default, and you should never propose copying catalog entries between accounts manually.

Choosing file layout for data analytics at scale

Partition by the column that most queries filter on (typically event date and tenant ID). Keep files between 128 MB and 1 GB. Use ZSTD compression on Apache Parquet for the silver and gold zones; use Snappy for the bronze zone where write throughput matters more. Run scheduled compaction on Apache Iceberg tables to prevent small-file fragmentation. Apply partition projection on Amazon Athena tables with millions of partitions so that you avoid AWS Glue crawler runtime overhead. These choices are identical to SAA-C03, but at SAP-C02 they must be applied consistently across all twenty teams — which is exactly what AWS Glue Data Quality and Amazon DataZone enforce.
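The compaction guidance above can be sketched as a greedy bin-packing plan. This is an illustrative planner, not how AWS Glue or Iceberg compaction is actually implemented; the 512 MB target is an assumed midpoint of the 128 MB to 1 GB sweet spot:

```python
# Sketch: greedily group small files into compaction batches that land
# inside the 128 MB - 1 GB sweet spot (assumed target here: 512 MB).
TARGET = 512 * 1024 * 1024

def plan_compaction(file_sizes, target=TARGET):
    """Group file sizes (bytes) into batches of roughly `target` bytes each."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            groups.append(current)        # close the batch, start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

mb = 1024 * 1024
small_files = [64 * mb] * 12              # twelve 64 MB fragments
groups = plan_compaction(small_files)
# eight fragments merge into one ~512 MB file; the remainder forms a second
assert [sum(g) for g in groups] == [512 * mb, 256 * mb]
```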

Lake Formation at Organization Scale — LF-Tags, Row/Column Security, Cross-Account

AWS Lake Formation is the governance workhorse of data analytics at scale on AWS. At SAP-C02 scale, manual grants do not work; LF-Tags are not optional.

LF-Tag ontology for 20 teams

Design an LF-Tag ontology up front. A canonical organization uses four dimensions: zone (bronze, silver, gold), domain (finance, marketing, product, ops), sensitivity (public, internal, confidential, restricted, pii), and geography (us, eu, apac). Every database, table, and sensitive column carries the full tag set. Grants are expressed as tag expressions — for example, "the EU analyst group has SELECT on geography=eu AND sensitivity IN (public, internal, confidential) AND zone IN (silver, gold)." A single tag change re-permissions thousands of tables. This is how data analytics at scale stays governable.
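The tag-expression semantics can be sketched in a few lines of Python. The tag names and values mirror the ontology above; the matching logic is a simplified model of how Lake Formation resolves a grant, not a real API call:

```python
# Simplified model of LF-Tag grant resolution: a grant expression maps
# tag keys to allowed values, and a table matches only if EVERY key in
# the expression is satisfied (AND semantics across keys).

def matches(grant_expression, table_tags):
    return all(table_tags.get(key) in allowed
               for key, allowed in grant_expression.items())

# "EU analysts: SELECT on geography=eu AND sensitivity IN (...) AND zone IN (...)"
eu_analyst_grant = {
    "geography":   {"eu"},
    "sensitivity": {"public", "internal", "confidential"},
    "zone":        {"silver", "gold"},
}

orders_eu = {"zone": "gold", "domain": "finance",
             "sensitivity": "internal", "geography": "eu"}
pii_eu    = {"zone": "gold", "domain": "finance",
             "sensitivity": "pii", "geography": "eu"}

assert matches(eu_analyst_grant, orders_eu)     # visible to EU analysts
assert not matches(eu_analyst_grant, pii_eu)    # pii never matches this grant
```

Retagging a table from sensitivity=internal to sensitivity=pii instantly flips the result for every grant that excludes pii — which is exactly why one tag change re-permissions thousands of tables.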

Row-level and column-level security at organization scale

AWS Lake Formation data filters combine row filter expressions with included or excluded column lists. At organization scale, you create one filter per sensitivity-geography-domain combination and grant it to the matching role. The EU-retail-analyst role gets a data filter with region = 'EU' as the row filter and the customer_pii_email column excluded. The US-retail-analyst role gets a parallel filter with region = 'US'. When a GDPR scope changes, you edit the filter, not every query. Amazon Athena, Amazon Redshift Spectrum, AWS Glue, and Amazon EMR all honor the filter automatically.
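What a data filter enforces at query time can be modeled as a row predicate plus a column-projection step. The names (customer_pii_email, region values) are illustrative, matching the example above; real enforcement happens inside the engines, not in user code:

```python
# Sketch of Lake Formation data-filter semantics: apply the row-filter
# predicate first, then drop excluded columns, before any engine
# (Athena, Redshift Spectrum, Glue, EMR) returns results.

def apply_data_filter(rows, row_filter, excluded_columns):
    for row in rows:
        if row_filter(row):
            yield {k: v for k, v in row.items() if k not in excluded_columns}

customers = [
    {"region": "EU", "name": "Anna", "customer_pii_email": "anna@example.eu"},
    {"region": "US", "name": "Bob",  "customer_pii_email": "bob@example.com"},
]

# The EU-retail-analyst filter: region = 'EU', customer_pii_email excluded
eu_retail_analyst = list(apply_data_filter(
    customers,
    row_filter=lambda r: r["region"] == "EU",
    excluded_columns={"customer_pii_email"},
))
assert eu_retail_analyst == [{"region": "EU", "name": "Anna"}]
```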

Cross-account Lake Formation sharing

AWS Lake Formation supports two cross-account sharing modes. In named resource sharing, you grant a specific table to a specific account; AWS Lake Formation uses AWS Resource Access Manager to send the share, and the consumer account accepts it. In LF-Tag sharing, you grant a tag expression to an account; any table that later gains the matching tag becomes available to the consumer automatically, which is the scalable pattern for data analytics at scale. Consumer accounts query through Amazon Athena or Amazon Redshift Spectrum with their own IAM roles; Lake Formation vends scoped S3 credentials.

For any SAP-C02 scenario with more than five producer tables or more than three consumer accounts, choose AWS Lake Formation LF-Tag-based cross-account sharing over named resource sharing. Named sharing requires a new grant for every new table; LF-Tag sharing means a table created tomorrow with the right tag set is automatically visible to the right consumers, with no new grant. Data analytics at scale breaks down under named sharing because the grant surface grows with tables times consumers.
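The grant-surface arithmetic is worth making explicit. Under named sharing, every (table, consumer) pair needs its own grant; under LF-Tag sharing, grants scale with tag expressions times consumers, which is a small constant in practice. The numbers below are illustrative:

```python
# Back-of-envelope: why named-resource sharing breaks down at scale.

def named_grants(tables, consumers):
    return tables * consumers          # one grant per (table, consumer) pair

def lf_tag_grants(tag_expressions, consumers):
    return tag_expressions * consumers # grants track the ontology, not the tables

# Illustrative: 1,000 tables, 20 consumer accounts, 5 tag expressions
assert named_grants(1_000, 20) == 20_000   # grows as tables are added
assert lf_tag_grants(5, 20) == 100         # flat as tables are added
```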

Lake Formation plus AWS Organizations service-level grants

In AWS Organizations, AWS Lake Formation supports grants to the entire organization, to an organizational unit (OU), or to a specific account. Granting to an OU means a newly-onboarded account in that OU inherits access automatically. This is the pattern the SAP-C02 exam rewards when a scenario says "new product teams should gain access to the shared gold zone without manual onboarding."

GDPR-specific Lake Formation patterns

For GDPR compliance at data analytics at scale, three Lake Formation patterns matter most. First, use sensitivity=pii LF-Tags to mark every PII column and grant access only to the data protection officer role. Second, use row-level filters on gdpr_consent = true so queries automatically exclude records where consent has been withdrawn. Third, use Apache Iceberg row-level deletes plus an Apache Iceberg DELETE FROM statement triggered by right-to-be-forgotten requests. AWS CloudTrail plus AWS Lake Formation audit logs prove enforcement to auditors.

Streaming at Scale — Kinesis, Firehose, Managed Flink, and MSK

Streaming is the second pillar of data analytics at scale on SAP-C02. The exam tests the decision tree across Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon Managed Service for Apache Flink, and Amazon MSK.

Amazon Kinesis Data Streams — general-purpose streaming

Amazon Kinesis Data Streams is the foundational AWS streaming service. It partitions data into shards (provisioned mode) or scales shards automatically (on-demand mode). Retention is configurable from 24 hours to 365 days. Kinesis Data Streams is the right pick when you need many independent consumers reading the same stream at different speeds, when you need replay for up to a year, when you need millisecond latency with full control over consumer code, and when you already have AWS Lambda, AWS Glue streaming, or Amazon Kinesis Client Library applications reading from the stream.
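Shard routing is worth internalizing: Kinesis hashes each record's partition key with MD5 and maps the result into a shard's hash-key range, so all records for one key land on one shard in order. The sketch below is a simplified model with equal ranges; real shards carry explicit hash-key ranges that split and merge:

```python
import hashlib

# Simplified model of Kinesis shard routing: MD5 of the partition key is
# mapped into one of N equal hash-key ranges, one per shard. (Real shards
# have explicit, resizable hash-key ranges.)

def shard_for(partition_key, num_shards):
    digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(digest // range_size, num_shards - 1)

# All records for one user hash to one shard, preserving per-key ordering
assert shard_for("user-42", 4) == shard_for("user-42", 4)
assert 0 <= shard_for("user-42", 4) < 4
```

This is also why a hot partition key throttles a single shard no matter how many shards the stream has — a favorite SAP-C02 distractor.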

Amazon Data Firehose — zero-code delivery

Amazon Data Firehose (formerly Amazon Kinesis Data Firehose) is the no-code streaming delivery service. It buffers by size or time, optionally transforms with AWS Lambda, optionally converts to Apache Parquet or Apache ORC, and writes to Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, Snowflake, or a generic HTTP endpoint. Firehose is the right pick when the destination is one of the supported sinks, when you tolerate 60-second-plus latency, and when you do not want to manage consumers. For data analytics at scale, Firehose is the default for bronze-zone ingestion on Amazon S3.
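The buffer-by-size-or-time behavior is the essence of Firehose, and it can be sketched as a tiny class. The thresholds below are illustrative, not Firehose's actual limits, and real Firehose flushes on a timer rather than only on put:

```python
import time

# Sketch of Firehose-style buffering: flush when the buffer reaches a
# size hint OR an age limit, whichever comes first.

class Buffer:
    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=60,
                 clock=time.monotonic):
        self.max_bytes, self.max_seconds, self.clock = max_bytes, max_seconds, clock
        self.records, self.size, self.opened = [], 0, None
        self.flushed = []                      # batches delivered to the sink

    def put(self, record: bytes):
        if self.opened is None:
            self.opened = self.clock()         # buffer age starts at first record
        self.records.append(record)
        self.size += len(record)
        if (self.size >= self.max_bytes
                or self.clock() - self.opened >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.records:
            self.flushed.append(self.records)  # e.g. one S3 object per batch
        self.records, self.size, self.opened = [], 0, None

buf = Buffer(max_bytes=10, max_seconds=60)
buf.put(b"aaaa")                               # 4 bytes: below both thresholds
buf.put(b"bbbbbbb")                            # 11 bytes total: size threshold trips
assert buf.flushed == [[b"aaaa", b"bbbbbbb"]]
```

The size-or-time trade-off is exactly why Firehose has inherent latency: small trickles wait out the time threshold before landing in S3.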

Amazon Managed Service for Apache Flink — stateful stream processing

Amazon Managed Service for Apache Flink (formerly Amazon Kinesis Data Analytics) runs Apache Flink applications for stateful, windowed, exactly-once streaming computation. It is the right pick when you need to compute per-user session aggregates, sliding-window metrics, stream-to-stream joins, or complex event processing. Managed Flink reads from Amazon Kinesis Data Streams, Amazon MSK, or Amazon MSK Serverless, and writes to any sink Apache Flink supports. For data analytics at scale, Managed Flink is where streaming aggregates land in near-real-time gold tables.
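To see what a tumbling window actually computes, here is the one-minute per-user average from the toolbox analogy, written as plain Python over a finite event list. Real Flink adds watermarks, managed state, and scale-out; this only shows the windowing arithmetic:

```python
from collections import defaultdict

# Sketch: one-minute tumbling average per user. Each event is assigned to
# the window [floor(ts/60)*60, +60); averages are computed per (window, user).

def tumbling_averages(events, window_seconds=60):
    """events: iterable of (timestamp_seconds, user, value)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, user, value in events:
        window = (ts // window_seconds) * window_seconds
        sums[(window, user)] += value
        counts[(window, user)] += 1
    return {key: sums[key] / counts[key] for key in sums}

events = [(0, "u1", 10.0), (30, "u1", 20.0), (61, "u1", 50.0)]
result = tumbling_averages(events)
assert result[(0, "u1")] == 15.0     # first minute: (10 + 20) / 2
assert result[(60, "u1")] == 50.0    # second minute: one event
```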

Amazon MSK — Apache Kafka compatibility

Amazon Managed Streaming for Apache Kafka (Amazon MSK) runs real Apache Kafka with full protocol compatibility. Amazon MSK is the right pick when the organization already uses Apache Kafka, when you need consumer groups with exactly-once semantics, when you need the richer Apache Kafka ecosystem (Apache Kafka Connect, Apache Kafka Streams, Confluent Schema Registry-compatible tooling), or when you need tiered storage for long-retention streams. Amazon MSK Serverless is the auto-scaling option for variable throughput.

The SAP-C02 exam writes parallel scenarios where Kinesis Data Streams and Amazon MSK both technically work. The discriminator is almost always in the scenario text. Keywords like "Apache Kafka producers already exist," "consumer groups," "Apache Kafka Connect," or "third-party Kafka-compatible tooling" point to Amazon MSK. Keywords like "AWS-native," "Amazon Kinesis Client Library," "shard-level control," "pay per shard-hour," or "AWS Lambda event source mapping" point to Amazon Kinesis Data Streams. Do not assume Amazon MSK is always better because it is Apache Kafka; Kinesis Data Streams has simpler operations and tighter AWS integration for many data analytics at scale workloads.

The streaming decision tree at data analytics at scale

Start at the top. Is the destination exactly one of S3, Redshift, OpenSearch, Splunk, Snowflake, or HTTP, with no custom consumer logic? Use Amazon Data Firehose. Is the computation stateful with windowed aggregates or stream joins? Use Amazon Managed Service for Apache Flink reading from Kinesis Data Streams or Amazon MSK. Is the organization standardized on Apache Kafka protocol and tooling? Use Amazon MSK. Otherwise, use Amazon Kinesis Data Streams with AWS Lambda or AWS Glue streaming consumers. This four-branch tree answers almost every SAP-C02 streaming question about data analytics at scale.
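The four-branch tree above can be written down as a function, which is a useful self-test drill. The scenario flags are simplified abstractions of exam wording, not any real API:

```python
# The streaming decision tree as code. Inputs are simplified scenario flags.

FIREHOSE_SINKS = {"s3", "redshift", "opensearch", "splunk", "snowflake", "http"}

def pick_streaming_service(destination=None, needs_custom_consumer=False,
                           stateful_windowed=False, kafka_standardized=False):
    # Branch 1: supported sink, no custom logic, no stateful computation
    if destination in FIREHOSE_SINKS and not (needs_custom_consumer
                                              or stateful_windowed):
        return "Amazon Data Firehose"
    # Branch 2: windowed aggregates or stream joins
    if stateful_windowed:
        return "Amazon Managed Service for Apache Flink"
    # Branch 3: organization standardized on Kafka protocol and tooling
    if kafka_standardized:
        return "Amazon MSK"
    # Branch 4: everything else
    return "Amazon Kinesis Data Streams"

assert pick_streaming_service(destination="s3") == "Amazon Data Firehose"
assert pick_streaming_service(stateful_windowed=True) == \
    "Amazon Managed Service for Apache Flink"
assert pick_streaming_service(kafka_standardized=True) == "Amazon MSK"
assert pick_streaming_service(needs_custom_consumer=True) == \
    "Amazon Kinesis Data Streams"
```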

Redshift at Scale — RA3, Serverless, Spectrum, Federated Query, Zero-ETL

Amazon Redshift is the data warehouse pillar of data analytics at scale. SAP-C02 tests six Redshift capabilities that do not appear on SAA-C03.

Amazon Redshift RA3 managed storage

RA3 node types (ra3.xlplus, ra3.4xlarge, ra3.16xlarge) decouple compute from storage. Data lives in Redshift Managed Storage (RMS), which is backed by Amazon S3 under the hood, and compute nodes cache the hot working set on local SSD. You can scale compute and storage independently; you can pause and resume; you can run concurrency scaling clusters that burst capacity during peaks. RA3 is the SAP-C02 default for any provisioned Redshift answer — DS2 and DC2 are legacy.

Amazon Redshift Serverless

Amazon Redshift Serverless removes cluster sizing entirely. You specify a base Redshift Processing Unit (RPU) capacity; Serverless auto-scales up during load and pauses when idle. Billing is per RPU-second. For data analytics at scale on SAP-C02, Redshift Serverless is the right pick when workloads are unpredictable (dev/test, ad-hoc analytics, infrequent but intensive jobs) or when you want the simplicity of no cluster management. RA3 provisioned is still the right pick for steady-state 24x7 production workloads where Reserved Instance pricing and predictable capacity matter.

Amazon Redshift Spectrum — read the lake in place

Amazon Redshift Spectrum lets a Redshift cluster query data directly on Amazon S3 through an external schema registered in AWS Glue Data Catalog, with AWS Lake Formation enforcement. This is how a Redshift session joins warehouse fact tables with data lake silver-zone Apache Iceberg tables without loading them. For data analytics at scale, Spectrum is the bridge that lets the gold-zone warehouse stay lean while still answering questions that need raw lake data.

Amazon Redshift Federated Query

Amazon Redshift Federated Query lets Redshift run SQL directly against Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL, Amazon RDS for MySQL, and Amazon Aurora MySQL. This is different from Amazon Redshift Spectrum, which reads Amazon S3. Federated query is the SAP-C02 answer for "run a Redshift-native SQL that joins a warehouse table with an operational database table without running ETL." At data analytics at scale, federated query is often superseded by zero-ETL integrations.

Amazon Redshift Concurrency Scaling

Redshift Concurrency Scaling adds transient compute clusters when the main cluster runs out of concurrent query slots, billing only for the seconds those transient clusters run. Each Redshift cluster earns one free hour of concurrency scaling per 24 hours of normal operation, which covers most workloads. For data analytics at scale, enabling concurrency scaling is the cheapest way to handle a 500-analyst peak without oversizing the base cluster.
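The free-credit arithmetic is worth working through once. The sketch below models per-second billing offset by one free hour of credit accrued per day of operation; the hourly rate is an assumed illustration, not a published price:

```python
# Back-of-envelope concurrency-scaling cost model: billed per second,
# offset by one free hour of credit per 24 hours of cluster operation.

def concurrency_scaling_cost(burst_seconds_per_day, days,
                             hourly_rate=3.26):      # ASSUMED rate, for illustration
    free_seconds = days * 3600                       # 1 free hour accrued per day
    billable = max(0, burst_seconds_per_day * days - free_seconds)
    return billable * hourly_rate / 3600

# 30 minutes of bursting per day stays entirely inside the free credit
assert concurrency_scaling_cost(30 * 60, days=30) == 0

# 2 hours/day: 1 hour/day is free, the other hour is billed
monthly = concurrency_scaling_cost(2 * 3600, days=30)
assert abs(monthly - 30 * 3.26) < 1e-6               # ~30 billed hours
```

The first assertion is the exam's point: most real workloads burst less than an hour a day, so concurrency scaling is effectively free headroom.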

Amazon Redshift Materialized Views

Materialized views precompute and store query results. Auto-refresh re-runs the view when underlying data changes; incremental refresh updates only the affected rows. For data analytics at scale, gold-zone dashboards against Redshift should almost always sit on materialized views — queries are sub-second, Amazon QuickSight stays responsive for hundreds of concurrent users, and the underlying fact tables are read only during refresh.

Zero-ETL integrations

Zero-ETL integrations replicate data from Amazon Aurora MySQL, Amazon Aurora PostgreSQL, Amazon RDS for MySQL, and Amazon DynamoDB into Amazon Redshift in near real time, with no glue code and no ETL pipeline. The source table surfaces inside Redshift as if it were native, within seconds of writes. For data analytics at scale on SAP-C02, zero-ETL is the answer whenever the scenario says "we want operational data available for analytics within seconds, without building and operating an ETL pipeline."

RA3 for steady provisioned workloads with managed storage. Redshift Serverless for unpredictable or intermittent workloads. Redshift Spectrum to query Amazon S3 lake tables from a Redshift session. Redshift Federated Query to query operational RDS/Aurora PostgreSQL and MySQL without ETL. Zero-ETL for near-real-time replication from Aurora, RDS, and DynamoDB. Concurrency Scaling for transient peaks. Materialized Views for gold-zone dashboard speed. Redshift Data Sharing for cross-account live sharing. These eight capabilities define Redshift at data analytics at scale on SAP-C02.

Data Sharing Across Accounts — Redshift Data Sharing, AWS Data Exchange, AWS RAM

Data analytics at scale almost always requires sharing live datasets across AWS accounts — between product teams, between subsidiaries, and with external partners. SAP-C02 tests the differences between the sharing services.

Amazon Redshift Data Sharing

Amazon Redshift Data Sharing lets a producer Redshift cluster share live tables, views, and schemas with one or more consumer Redshift clusters with no data copy. Consumers query with their own credentials, pay their own compute, and always see the latest data. Supported across RA3 provisioned and Redshift Serverless. This is the SAP-C02 answer for "finance team's warehouse must query live marketing team's warehouse across accounts without nightly extracts."

AWS Data Exchange

AWS Data Exchange is the marketplace for third-party datasets, including Amazon S3 datasets, Amazon Redshift datasets, and Amazon API Gateway APIs. A data provider publishes; a subscriber receives entitlements and can query through the native service (Amazon Athena for S3 datasets, Amazon Redshift Data Sharing for Redshift datasets). For data analytics at scale, AWS Data Exchange is the right answer for "consume third-party financial data in our data lake without custom ingest pipelines."

AWS Resource Access Manager (AWS RAM)

AWS Resource Access Manager (AWS RAM) is the generic cross-account resource-sharing service AWS Lake Formation uses under the hood. You rarely invoke AWS RAM directly for data analytics at scale — Lake Formation abstracts it — but the SAP-C02 exam may name AWS RAM when describing how Lake Formation cross-account shares are delivered and accepted.

Amazon S3 cross-account access patterns

Beyond Lake Formation, Amazon S3 supports three cross-account access patterns the SAP-C02 exam expects you to know. First, S3 bucket policies grant other accounts s3:GetObject or s3:ListBucket. Second, S3 Access Points per-account simplify large permission sets. Third, S3 Access Grants (newer) map IAM identities to specific S3 prefixes without editing bucket policies. For data analytics at scale, prefer Lake Formation for anything catalog-aware; use S3 access patterns only for non-tabular bulk data (images, video, raw logs).

Use Amazon Redshift Data Sharing when the consumer is a Redshift cluster and the data is already in Redshift. Use AWS Lake Formation cross-account sharing when the consumer queries Amazon S3 data lake tables with Amazon Athena, Amazon Redshift Spectrum, or Amazon EMR. Use AWS Data Exchange when the data comes from (or is published to) a third party. Use AWS RAM directly only when sharing non-data resources like Amazon VPC subnets or AWS Transit Gateway. Mixing these up is one of the most common SAP-C02 mistakes in data analytics at scale scenarios.

Amazon DataZone — Self-Service Governance Portal at Scale

Amazon DataZone is the business-facing governance portal for data analytics at scale. It sits on top of AWS Lake Formation and AWS Glue Data Catalog and provides a self-service experience that SAA-C03 does not test.

What Amazon DataZone provides

Amazon DataZone provides business glossaries (map technical column names to business terms), data product catalogs (publish curated datasets with SLAs and ownership), access request workflows (analyst requests access, data owner approves, Lake Formation grant is applied), lineage (upstream-to-downstream tracing), and environment templates (each DataZone project gets a pre-configured Amazon Athena workgroup, Amazon Redshift access, and AWS Glue database). For data analytics at scale, Amazon DataZone is what turns raw Lake Formation into a product that 500 analysts actually want to use.

The DataZone domain and project model

A DataZone domain is the top-level tenant — typically one per organization. Inside a domain, projects are the unit of collaboration — each product team has a project. Projects publish data assets (tables, dashboards, machine learning models) with metadata, ownership, and SLAs. Other projects subscribe to data assets through access request workflows. For data analytics at scale, domain-project-asset is the mental model the exam uses.
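The domain-project-asset model can be sketched as a small in-memory data structure (all names hypothetical). In real Amazon DataZone, an approved access request triggers a Lake Formation grant; here the approval just records a subscription:

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    owner_project: str
    subscribers: set = field(default_factory=set)

@dataclass
class Domain:
    """Top-level tenant: one per organization. Projects publish assets;
    other projects subscribe through an access request workflow."""
    name: str
    assets: dict = field(default_factory=dict)

    def publish(self, project: str, asset: str) -> None:
        self.assets[asset] = Asset(asset, owner_project=project)

    def request_access(self, project: str, asset: str, approved: bool) -> None:
        # In DataZone, approval by the data owner applies the underlying
        # Lake Formation grant; this sketch only tracks the subscription.
        if approved:
            self.assets[asset].subscribers.add(project)

org = Domain("retail-org")
org.publish("payments-team", "gold.daily_settlements")
org.request_access("finance-team", "gold.daily_settlements", approved=True)
print(org.assets["gold.daily_settlements"].subscribers)
```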

When Amazon DataZone is the right answer

Pick Amazon DataZone when the scenario describes self-service data discovery, business-friendly metadata, cross-team access request workflows, lineage visualization across Glue jobs and Redshift loads, or standardized analytics environment provisioning per team. If the scenario says "analysts cannot find the right dataset" or "every access request takes two weeks because it involves a ticket, a Lake Formation grant, a Redshift grant, and an Athena workgroup change," the answer is Amazon DataZone.

Amazon DataZone does not replace AWS Lake Formation; it sits on top. Lake Formation is the enforcement layer — LF-Tags, data filters, cross-account grants. Amazon DataZone is the user experience layer — glossaries, catalogs, workflows. For data analytics at scale, pick Amazon DataZone whenever the pain point is discovery, request workflow, or business metadata, and pick Lake Formation whenever the pain point is enforcement, permission granularity, or cross-account sharing. Most SAP-C02 scenarios need both.

OpenSearch at Scale — Observability, Search, and Serverless

Amazon OpenSearch Service completes the data analytics at scale picture for text-heavy and time-series observability workloads.

OpenSearch Service provisioned clusters

Amazon OpenSearch Service provisioned clusters expose full-text search, aggregations, and OpenSearch Dashboards on indices sized by node count and instance type. At data analytics at scale, OpenSearch Service handles log analytics (Amazon Data Firehose → OpenSearch), product catalog search (application-driven index writes), and SIEM (Amazon OpenSearch Service Security Analytics plugin).

Amazon OpenSearch Serverless

Amazon OpenSearch Serverless auto-scales independently on indexing OpenSearch Compute Units (OCUs) and search OCUs. You define a collection (time-series, search, or vector search) and OpenSearch Serverless handles capacity. For data analytics at scale, OpenSearch Serverless is the right pick when traffic is variable — for example, customer-facing product search with traffic spikes, or intermittent SIEM investigation workloads.

OpenSearch for observability at organization scale

At 20 teams and 500 analysts, centralized observability becomes a data analytics at scale problem. Pattern: Amazon CloudWatch Logs subscription filters ship logs to Amazon Data Firehose; Firehose writes to both Amazon S3 (cold archival and long-term Athena queries) and Amazon OpenSearch Service (hot search for the last 30 days). This tiered pattern keeps OpenSearch cluster size — and cost — under control while preserving the ability to query anything, ever, through Amazon Athena.
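The routing logic behind the tiered pattern is simple enough to sketch. Assuming a hypothetical 30-day hot window, a query planner sends recent time ranges to OpenSearch and everything older to Athena over the S3 archive:

```python
from datetime import date

HOT_WINDOW_DAYS = 30  # hypothetical OpenSearch hot-tier retention

def query_target(event_date: date, today: date) -> str:
    """Route a log query: data inside the hot window is served from
    OpenSearch; older data survives only in the S3 archive and is
    queried through Athena."""
    age_days = (today - event_date).days
    return "opensearch" if age_days <= HOT_WINDOW_DAYS else "athena"

today = date(2025, 6, 30)
assert query_target(date(2025, 6, 25), today) == "opensearch"
assert query_target(date(2024, 11, 1), today) == "athena"
```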

OpenSearch vector search for retrieval-augmented generation

Amazon OpenSearch Serverless supports vector search collections, which the SAP-C02 exam is starting to test in machine learning scenarios. A retail organization building retrieval-augmented generation (RAG) over its product catalog embeds product descriptions with Amazon Bedrock, stores embeddings in an OpenSearch Serverless vector collection, and performs semantic search from the generative application. For data analytics at scale, this is where the analytics plane meets the machine learning plane.

SageMaker Feature Store and Lake House Integration

Machine learning at scale is tightly coupled to data analytics at scale because both engines need the same curated features.

Amazon SageMaker Feature Store

Amazon SageMaker Feature Store is the managed feature repository. It has two stores: an online store (low-latency key-value lookup for real-time inference, backed by a DynamoDB-style engine) and an offline store (Amazon S3 backed, in Apache Iceberg or Apache Parquet format, registered in AWS Glue Data Catalog, queryable by Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR). For data analytics at scale, the offline store is a data lake table — any feature the ML team creates is automatically available to analysts, governed by the same AWS Lake Formation policies.

The lake house pattern

A lake house unifies the data lake (Amazon S3 plus Apache Iceberg plus AWS Glue Data Catalog) with the data warehouse (Amazon Redshift RA3 plus Redshift Managed Storage). Queries execute where it makes sense — Amazon Redshift for sub-second dashboard queries on curated marts, Amazon Athena for ad-hoc exploration on lake tables, and Amazon EMR Serverless for Apache Spark workloads — but all of them honor the same AWS Lake Formation permissions and read the same Apache Iceberg tables. This is the canonical "lake house" SAP-C02 answer, and SageMaker Feature Store slots into it naturally.

Data Quality at Scale — Glue Data Quality and Amazon Deequ

Data analytics at scale without data quality enforcement is operational debt. SAP-C02 tests two specific tools.

AWS Glue Data Quality

AWS Glue Data Quality lets you declare Data Quality Definition Language (DQDL) rules on a table — for example, "column order_id must be unique," "column email must match a regex," "column revenue must be non-negative." Rules run on demand, on schedule, or inside an AWS Glue ETL job. Results are recorded as data quality scores and can block downstream jobs or publish to Amazon EventBridge. For data analytics at scale, AWS Glue Data Quality is the default because it integrates with AWS Glue Data Catalog and AWS Lake Formation out of the box.
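To make the rule semantics concrete, here is a plain-Python sketch of three DQDL-style checks evaluated against sample rows. The rows and the email regex are illustrative; in practice the rules would be DQDL strings such as IsUnique "order_id" evaluated by Glue, not hand-written code:

```python
import re

# Sample rows standing in for a table (hypothetical data).
rows = [
    {"order_id": "A1", "email": "a@example.com", "revenue": 19.99},
    {"order_id": "A2", "email": "b@example.com", "revenue": 5.00},
]

def check_unique(rows, col):
    vals = [r[col] for r in rows]
    return len(vals) == len(set(vals))

def check_regex(rows, col, pattern):
    return all(re.fullmatch(pattern, r[col]) for r in rows)

def check_non_negative(rows, col):
    return all(r[col] >= 0 for r in rows)

# Each key mirrors a DQDL rule; each value is pass/fail for this batch.
results = {
    'IsUnique "order_id"': check_unique(rows, "order_id"),
    'ColumnValues "email" matches regex': check_regex(
        rows, "email", r"[^@]+@[^@]+\.[^@]+"),
    'ColumnValues "revenue" >= 0': check_non_negative(rows, "revenue"),
}
# Glue reports a similar 0-1 data quality score per run.
score = sum(results.values()) / len(results)
print(results, score)
```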

Amazon Deequ

Amazon Deequ is an open-source library for unit-test-style data quality checks on Apache Spark DataFrames, developed by AWS. Deequ runs on Amazon EMR, Amazon EMR Serverless, and AWS Glue's Apache Spark runtime. It supports anomaly detection, column profiling, and constraint verification. For data analytics at scale, Deequ is the right pick when the rules are too complex for DQDL or when the organization already has a Spark-based data platform.

Quality gates in the pipeline

Wire AWS Glue Data Quality rules into every silver-zone write — a failed rule marks the batch bad and quarantines it. Wire a second layer of rules into every gold-zone materialization — a failed rule blocks the materialized view refresh and pages the owning team. For data analytics at scale, two-layer quality gates prevent bad data from reaching 500 analysts' dashboards.

The Retail Scenario — 20 Teams, 500 Analysts, GDPR, End-to-End Design

Pulling everything together, here is a reference design for the SAP-C02 retail scenario: a retail organization with 20 product teams, a data lake serving 500 analysts, and GDPR obligations.

Account structure

Use AWS Organizations with OUs for Producers (per-team ingestion and silver ownership), Consumers (per-team gold marts and analyst access), Central Governance (Lake Formation administrator account), and Data Platform (shared services: Amazon DataZone domain, AWS Glue Data Catalog hub, Amazon Redshift data sharing producer). Each team has one producer account and one consumer account.

Ingestion design

Operational Amazon Aurora PostgreSQL databases replicate to Amazon Redshift through zero-ETL integrations. Clickstream events flow from the storefront through Amazon Kinesis Data Streams to Amazon Managed Service for Apache Flink (for real-time session aggregates) and to Amazon Data Firehose (for raw event archival to Amazon S3 bronze zone in Apache Parquet). Partner feeds arrive via AWS Data Exchange.

Storage design

Bronze zone per producer account — raw events and raw CDC, retention by regulatory rule. Silver zone in a shared Data Platform account — cleansed Apache Iceberg tables, partitioned by event date and tenant, written by AWS Glue ETL jobs with AWS Glue Data Quality rules. Gold zone per consumer account — team-owned marts, also Apache Iceberg, surfaced through AWS Glue Data Catalog federation.

Governance design

AWS Lake Formation in the Central Governance account holds the LF-Tag ontology — zone, domain, sensitivity, geography. PII columns are tagged sensitivity=pii and geography=eu or geography=us as applicable. Data filters enforce per-country row restrictions. Cross-account grants use LF-Tag sharing at the OU level — every Consumer account in the Retail OU automatically sees silver-zone tables tagged domain=retail at their cleared sensitivity level. Amazon DataZone provides the 500-analyst self-service portal.
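Why tag-based grants scale can be shown with a toy evaluation function: a grant is a tag expression (tag key mapped to allowed values), and a table is visible only when its tags satisfy every clause. The tag values below mirror the ontology in this section, but the matching logic is a simplified sketch, not Lake Formation's actual algorithm:

```python
def matches(grant: dict, table_tags: dict) -> bool:
    """A table satisfies a grant when, for every tag key in the grant,
    the table's value for that key is among the allowed values."""
    return all(table_tags.get(k) in allowed for k, allowed in grant.items())

# Consumer-account grant: silver-zone retail data, cleared up to
# 'internal' sensitivity (hypothetical values).
analyst_grant = {
    "zone": {"silver"},
    "domain": {"retail"},
    "sensitivity": {"public", "internal"},
}

orders = {"zone": "silver", "domain": "retail", "sensitivity": "internal"}
customers_pii = {"zone": "silver", "domain": "retail", "sensitivity": "pii"}

assert matches(analyst_grant, orders)             # visible to the analyst
assert not matches(analyst_grant, customers_pii)  # filtered out by tags
```

A new table tagged by the ETL convention becomes visible to every matching grant with no per-table work — which is exactly the property named sharing lacks.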

Query and consumption design

Amazon Redshift Serverless per Consumer account for interactive analytics and gold marts, with Redshift Data Sharing from the central finance warehouse so every team sees live corporate financials. Amazon Athena per Consumer account for ad-hoc SQL on silver and gold Apache Iceberg tables. Amazon QuickSight with SPICE per team for dashboards, embedded into internal tools for store managers. Amazon OpenSearch Serverless for store-level operational search and observability.

GDPR compliance design

Right-to-be-forgotten requests trigger an Apache Iceberg DELETE FROM across every table containing the customer ID, verified by AWS CloudTrail and AWS Lake Formation audit logs. Row-level filters on gdpr_consent = true drop non-consenting records from every analyst query automatically. Cross-border data transfer controls are enforced by geography LF-Tags — a US-based analyst simply cannot be granted a tag expression that includes geography=eu AND sensitivity=pii.
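A toy in-memory simulation of two of these mechanisms — the consent row filter applied to every analyst read, and the right-to-be-forgotten delete (the semantics of an Apache Iceberg DELETE FROM) — with hypothetical customer rows:

```python
table = [
    {"customer_id": "c1", "gdpr_consent": True,  "total": 120.0},
    {"customer_id": "c2", "gdpr_consent": False, "total": 75.0},
    {"customer_id": "c3", "gdpr_consent": True,  "total": 42.5},
]

def analyst_read(rows):
    # Lake Formation data filter: gdpr_consent = true.
    return [r for r in rows if r["gdpr_consent"]]

def forget(rows, customer_id):
    # Iceberg equivalent: DELETE FROM orders WHERE customer_id = :id
    # (followed by compaction to physically remove the rows).
    return [r for r in rows if r["customer_id"] != customer_id]

assert len(analyst_read(table)) == 2   # c2 dropped by the consent filter
table = forget(table, "c1")            # right-to-be-forgotten for c1
assert [r["customer_id"] for r in table] == ["c2", "c3"]
```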

Data quality and machine learning design

AWS Glue Data Quality rules enforce completeness, uniqueness, and freshness at every bronze-to-silver transition. Amazon Deequ on Amazon EMR Serverless runs statistical profiling on the full silver zone nightly. Amazon SageMaker Feature Store offline store is an Apache Iceberg table in the silver zone; online store serves real-time recommendation inferences for the storefront.

This composite is the canonical SAP-C02 data analytics at scale answer. Every variation the exam presents is a slice of this pipeline.

Common SAP-C02 Traps in Data Analytics at Scale

Trap 1 — Single-account analytics thinking

Candidates who pattern-match from SAA-C03 will propose a single Amazon Athena workgroup and a single Lake Formation admin for a 20-team organization. SAP-C02 expects multi-account AWS Organizations design with Lake Formation LF-Tag cross-account sharing. Single-account answers are wrong at data analytics at scale.

Trap 2 — Named sharing instead of tag-based sharing

As organization size grows, named Lake Formation sharing becomes unmanageable. SAP-C02 rewards LF-Tag sharing. If the scenario hints at "dozens of tables" or "new tables added weekly," pick tag-based sharing.

Trap 3 — Using Redshift Federated Query when zero-ETL exists

Redshift Federated Query executes against the live operational database at query time, which adds load to the source and requires you to write cross-database joins yourself. Zero-ETL integrations replicate the source automatically for near-real-time analytics with no query-time impact on the operational database. When the scenario says Aurora MySQL or Aurora PostgreSQL, zero-ETL is usually the better answer.

Trap 4 — Kinesis Data Streams versus Amazon MSK

Amazon MSK is not always better because it is Apache Kafka. If the scenario does not mention Apache Kafka, Apache Kafka Connect, or consumer groups, Kinesis Data Streams is simpler and cheaper. The SAP-C02 exam plants these keywords deliberately.

Trap 5 — Raw Parquet instead of Apache Iceberg for GDPR

Right-to-be-forgotten requires row-level deletes. Raw Apache Parquet on Amazon S3 cannot express row-level deletes efficiently. Apache Iceberg can. Whenever the scenario mentions GDPR, right-to-be-forgotten, or row-level compliance deletes, Apache Iceberg is part of the answer.

Trap 6 — Amazon DataZone versus AWS Lake Formation

DataZone is the user experience layer; Lake Formation is the enforcement layer. Picking DataZone when the question asks about column-level enforcement is wrong; picking Lake Formation when the question asks about analyst self-service discovery is also wrong. Read the pain point carefully.

Trap 7 — OpenSearch for long-term analytics

Amazon OpenSearch Service is optimized for text search and recent-time-range analytics. Do not propose OpenSearch as the primary storage for five-year historical analytics — use Amazon S3 plus Amazon Athena for that, and OpenSearch only for the hot window. SAP-C02 tests the tiered pattern.

Trap 8 — QuickSight SPICE for live operational dashboards

SPICE is a cache refreshed on schedule. If the scenario requires sub-second latency on continuously updated data (for example, a live store-manager view of current-hour sales), use a direct query to Amazon Redshift with materialized views, not SPICE.

The biggest SAP-C02 trap in data analytics at scale is collapsing the whole pipeline into a single service answer. The exam rewards answers that name the right service per layer — Amazon Kinesis Data Streams for ingestion, Amazon Managed Service for Apache Flink for streaming compute, Apache Iceberg on Amazon S3 for storage, AWS Glue Data Catalog with Lake Formation LF-Tags for governance, Amazon Redshift RA3 with Data Sharing for curated marts, Amazon Athena for ad-hoc query, Amazon DataZone for discovery, and Amazon QuickSight with SPICE for dashboards. Answers that propose only Amazon Redshift, or only Amazon Athena, or only Amazon EMR, almost always lose points on SAP-C02.

Key Numbers to Memorize for Data Analytics at Scale

  • Amazon Kinesis Data Streams retention: 24 hours to 365 days (extended retention billed separately).
  • Amazon Kinesis Data Streams on-demand mode: up to 200 MB/s or 200,000 records/s per stream, auto-scaling.
  • Amazon Data Firehose buffering: 1 MB to 128 MB and 60 to 900 seconds for Amazon S3 destinations.
  • Amazon MSK tiered storage: primary broker storage plus a virtually unlimited low-cost tier; reads older than the primary retention are served from the tier transparently.
  • Amazon Redshift RA3 managed storage: decoupled from compute; billed per GB-month.
  • Amazon Redshift Concurrency Scaling: one free hour per 24 hours of main-cluster usage.
  • Amazon Redshift Data Sharing: live access, zero copy, across accounts and regions (for RA3 and Serverless).
  • Amazon Redshift zero-ETL sources: Amazon Aurora MySQL, Amazon Aurora PostgreSQL, Amazon RDS for MySQL, Amazon DynamoDB (verify current list).
  • AWS Lake Formation LF-Tags: up to 50 tag keys per catalog and up to 1,000 values per key (verify current quotas).
  • Amazon DataZone domain: one per organization is typical; projects nest inside.
  • Apache Iceberg time travel: query any past snapshot by timestamp or snapshot ID.
  • Amazon OpenSearch Serverless: independent indexing OCUs and search OCUs; vector collections for RAG.
  • AWS Glue Data Quality: DQDL rules, data quality score per run, integrated with Amazon EventBridge.

Always verify quotas and limits against the current AWS documentation before the exam — data analytics at scale services evolve quickly.

FAQ — Top Questions on Data Analytics at Scale for SAP-C02

Q1. When should I pick Apache Iceberg over plain Apache Parquet for data analytics at scale?

Pick Apache Iceberg whenever the data lake needs ACID transactions, row-level updates or deletes (for example, GDPR right-to-be-forgotten), schema evolution without rewriting history, or snapshot-based time travel. Pick plain Apache Parquet only for append-only, immutable datasets such as raw bronze-zone event logs. On SAP-C02, any scenario that mentions GDPR, regulatory deletes, schema change, or point-in-time reproducibility points to Apache Iceberg. Data analytics at scale on AWS increasingly defaults to Apache Iceberg because Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, and AWS Lake Formation all support it natively.

Q2. Amazon Kinesis Data Streams, Amazon MSK, or Amazon Data Firehose — which one does SAP-C02 reward?

All three can appear in the same scenario. Amazon Data Firehose is the right pick when the destination is a supported sink (Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, Snowflake, HTTP) and you tolerate minute-level latency with zero consumer code. Amazon Kinesis Data Streams is the right pick when you need multiple independent consumers, millisecond latency, custom consumer logic, or replay up to 365 days. Amazon MSK is the right pick when the organization standardizes on Apache Kafka protocol and ecosystem — Apache Kafka Connect, consumer groups, Schema Registry. For stateful streaming compute (windowed aggregates, joins), layer Amazon Managed Service for Apache Flink on top of whichever of Kinesis Data Streams or MSK you chose. Data analytics at scale often uses all three in different parts of the same pipeline.

Q3. How does AWS Lake Formation scale governance to a 20-team, 500-analyst organization?

Through LF-Tags plus cross-account LF-Tag sharing plus Amazon DataZone. Design a four-dimensional LF-Tag ontology (zone, domain, sensitivity, geography) once. Apply tags automatically during AWS Glue ETL by convention. Grant tag expressions to principals and OUs — never grant explicit table names. Share tags across accounts through AWS Lake Formation cross-account LF-Tag sharing, which uses AWS Resource Access Manager under the hood. Layer Amazon DataZone on top for business-facing discovery and access request workflows. This design scales because adding a new table is a tag-assignment problem, not a grant-writing problem, and adding a new consumer is an OU-onboarding problem, not a per-table problem.

Q4. Amazon Redshift RA3, Redshift Serverless, or Redshift Spectrum — how do I pick?

They are complementary, not competing. Pick Amazon Redshift RA3 provisioned for steady-state 24x7 production warehouses that benefit from Reserved Instance pricing and predictable capacity. Pick Amazon Redshift Serverless for unpredictable or intermittent workloads — dev/test environments, ad-hoc analytics, per-team marts with bursty query patterns, workloads where the simplicity of no cluster management outweighs a small compute premium. Use Amazon Redshift Spectrum from either RA3 or Serverless to query Amazon S3 data lake tables in place — it is not an alternative to RA3 or Serverless, it is a feature of both. Data analytics at scale usually combines all three: RA3 for the central finance warehouse, Serverless per team, Spectrum to join warehouse fact tables with data lake silver-zone tables.

Q5. How do I handle GDPR compliance at data analytics at scale on AWS?

Four mechanisms together. First, tag every PII column with sensitivity=pii LF-Tags and restrict access through tag expressions — never grant PII access to general analyst roles. Second, use AWS Lake Formation row-level data filters on gdpr_consent = true so withdrawn-consent records drop out of every query automatically. Third, store tables in Apache Iceberg so right-to-be-forgotten requests translate to DELETE FROM SQL statements with row-level deletes, and schedule Iceberg compaction to physically remove deleted rows. Fourth, enable AWS CloudTrail and AWS Lake Formation audit logging across every account to prove enforcement to auditors, and use Amazon DataZone lineage to show where PII flows. Data analytics at scale under GDPR requires all four — missing any one is a compliance gap.

Q6. When does Amazon DataZone add value beyond what AWS Lake Formation already provides?

When the pain point is discovery, workflow, or business metadata rather than enforcement. Lake Formation answers "is this principal allowed to read this column?" — it does not answer "which of the 3,000 tables in our lake has quarterly revenue by store?" or "who do I ask for access to the marketing attribution dataset?" or "how does this dashboard trace back to the source systems?". Amazon DataZone answers those questions through business glossaries, data product catalogs, access request workflows, and lineage visualization. At data analytics at scale with hundreds of analysts, a pure Lake Formation deployment is technically correct but practically unusable without a DataZone-style portal on top.

Q7. Should I use zero-ETL integrations or Amazon Redshift Federated Query for operational data analytics?

Zero-ETL is preferred for near-real-time analytics on operational data from Amazon Aurora MySQL, Aurora PostgreSQL, Amazon RDS for MySQL, and Amazon DynamoDB. Replication happens automatically in seconds, the source table surfaces as a native Redshift table, and there is no ETL pipeline to operate. Amazon Redshift Federated Query is preferred for interactive SQL against smaller operational tables where query-time freshness matters more than sustained throughput, and where zero-ETL is not yet supported for your source engine. On SAP-C02, if the scenario names a supported zero-ETL source and says "near-real-time analytics without ETL," zero-ETL is the answer. If the scenario names PostgreSQL or MySQL on RDS with a cross-database join requirement on an interactive query, Federated Query fits.

Q8. How do I keep Amazon Athena cost predictable at 500-analyst data analytics at scale?

Five levers. First, enforce Amazon Athena workgroups per team with per-query and per-workgroup scan limits so a runaway query cannot scan the entire lake. Second, store all silver and gold tables as Apache Iceberg (or at minimum Apache Parquet with Snappy or ZSTD compression) and partition by the most-used WHERE predicate. Third, enable Amazon Athena query result reuse at the workgroup level so repeated dashboard queries return cached results at zero scan cost. Fourth, move frequent sub-second dashboards off Amazon Athena onto Amazon Redshift materialized views or Amazon QuickSight SPICE. Fifth, use AWS CloudTrail plus Amazon Athena workgroup CloudWatch metrics to identify the top ten most expensive queries weekly and rewrite them. At data analytics at scale, cost control is always policy plus data layout plus caching — not any single trick.
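A back-of-envelope model shows why data layout dominates the bill. The $5-per-TB-scanned figure is the commonly quoted Athena price (verify current pricing), and the reduction factors for columnar formats and partition pruning are illustrative assumptions, not measurements:

```python
PRICE_PER_TB = 5.00  # commonly quoted Athena price per TB scanned

def monthly_cost(tb_scanned_per_query: float, queries_per_month: int) -> float:
    return tb_scanned_per_query * queries_per_month * PRICE_PER_TB

# 500 analysts, 40 queries each per month, naive full scans of a 2 TB table:
naive = monthly_cost(2.0, 500 * 40)

# Columnar Parquet/Iceberg reading ~1/5 of the bytes, plus date
# partitioning touching ~1/30 of a monthly table per query:
optimized = monthly_cost(2.0 * (1 / 5) * (1 / 30), 500 * 40)

print(f"naive ~= ${naive:,.0f}/mo, optimized ~= ${optimized:,.0f}/mo")
```

Under these assumptions the same query load drops from roughly $200,000 to roughly $1,300 per month — the point being the two-orders-of-magnitude shape of the saving, not the exact numbers.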

Summary — Data Analytics at Scale on SAP-C02

Data analytics at scale on AWS is a layered architecture problem, not a service selection problem. The SAP-C02 exam expects composite answers that span ingestion (Amazon Kinesis, Amazon MSK, zero-ETL), storage (Amazon S3 plus Apache Iceberg plus Amazon Redshift RA3 managed storage), catalog (AWS Glue Data Catalog federation), governance (AWS Lake Formation LF-Tags with cross-account sharing plus Amazon DataZone for discovery), compute (Amazon Athena, Amazon Redshift Serverless, Amazon EMR Serverless, Amazon Managed Service for Apache Flink), and consumption (Amazon QuickSight plus embedded analytics plus SageMaker Feature Store). A 20-team, 500-analyst, GDPR-bound retail organization exercises every one of these layers, which is exactly why the SAP-C02 exam leans on this scenario. Memorize the six layers, memorize the streaming decision tree, memorize the Redshift capability matrix, and memorize the Lake Formation plus DataZone split, and data analytics at scale becomes one of the highest-yield topics on the exam.

Official sources