Amazon Athena, AWS Lake Formation, and Amazon QuickSight together form the analytics layer of the AWS data lake. For the AWS Certified Solutions Architect Associate (SAA-C03) exam, Task Statement 3.5 expects you to design high-performing data ingestion and transformation solutions, and Athena, Lake Formation, and QuickSight are the three services the exam uses to test your grasp of serverless query, centralized governance, and business intelligence. This topic is one of the most scenario-heavy on the SAA-C03 blueprint because a single data lake question can mix ad-hoc SQL, fine-grained permissions, dashboard embedding, and the classic Athena vs Redshift vs EMR selection in one prompt. Getting Amazon Athena, AWS Lake Formation, and Amazon QuickSight right is therefore worth roughly five to seven exam questions on a typical sitting.
This study guide covers every Amazon Athena feature the SAA-C03 exam will test (including Athena Federated Query, Athena workgroups, query result reuse, and partitioning with columnar formats), every AWS Lake Formation concept you must memorize (blueprints, the permissions model, LF-Tags for ABAC, row-level and column-level security, and data filtering), the Amazon QuickSight capabilities that appear in case studies (SPICE, embedding, ML Insights, row-level security), the role of Amazon OpenSearch Service for text and log search, and the decision framework for Amazon Athena vs Amazon Redshift vs Amazon EMR. It closes with the five FAQ questions that most candidates miss.
What is a Data Lake and Analytics Architecture on AWS
A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale, typically on Amazon S3, and lets multiple analytics services read that data in place without moving it. In AWS terms, the data lake is S3 plus the AWS Glue Data Catalog (the metadata layer), governed by AWS Lake Formation (the permissions layer), queried by Amazon Athena (serverless SQL), Amazon Redshift Spectrum (warehouse-grade SQL), or Amazon EMR (Spark and Hadoop), and visualized by Amazon QuickSight.
Why the data lake pattern wins on SAA-C03
The SAA-C03 exam favors the data lake pattern because it lines up with five of the six AWS Well-Architected pillars: it is cost-optimized (S3 storage is cheap, and Athena charges only per terabyte scanned), it is secure (Lake Formation provides fine-grained access on top of IAM and KMS), it is reliable (S3 is designed for eleven nines of durability), it is performant when the data is partitioned and columnar, and it is operationally efficient because the catalog is shared across services. Any exam scenario that mentions "store all raw data," "query on demand," "different teams need different views," or "build a dashboard for the business" is almost always solved with Amazon Athena plus AWS Lake Formation plus Amazon QuickSight.
The reference analytics architecture the exam uses
The exam keeps coming back to the same reference pipeline: ingestion (Amazon Data Firehose, formerly Amazon Kinesis Data Firehose, or AWS DMS) lands raw data in Amazon S3, AWS Glue crawlers register the schema in the AWS Glue Data Catalog, AWS Lake Formation applies permissions, Amazon Athena runs ad-hoc SQL on the S3 data, Amazon QuickSight builds the dashboard, and Amazon OpenSearch Service indexes the log and text subset for search. Memorize this pipeline; SAA-C03 scenarios describe variations of it by name.
Plain-Language Explanation: Athena, Lake Formation, and QuickSight
Amazon Athena, AWS Lake Formation, and Amazon QuickSight may sound abstract, so here are three analogies that make them concrete for the SAA-C03 exam.
Analogy 1: The library
Imagine your company data lake as a giant public library.
- The books on the shelves are the raw files in Amazon S3.
- The card catalog is the AWS Glue Data Catalog — it tells you which shelf, which row, which book.
- The head librarian is AWS Lake Formation — they decide which reader can open which section, which chapter, and even which paragraph (row-level and column-level security). They can also hand out colored bracelets (LF-Tags) so that any reader wearing the red bracelet can read the red-tagged shelves.
- The reading desk is Amazon Athena — you sit down with a SQL question and Amazon Athena reads the books in place, no checkout needed, and bills you by the page scanned.
- The summary posters on the wall are Amazon QuickSight dashboards — distilled visual insight for anyone who does not want to read books themselves.
- The express search terminal is Amazon OpenSearch Service — type a keyword, get matching passages instantly.
If the SAA-C03 question describes a central repository that multiple teams query in different ways with different permissions, the library analogy tells you the answer is a data lake governed by AWS Lake Formation with Amazon Athena and Amazon QuickSight on top.
Analogy 2: The open-book exam
Amazon Athena behaves like an open-book exam.
- You do not load anything into a database beforehand; the data stays on Amazon S3 exactly as it lives in the library.
- You write a SQL query the way you would write an answer — and Amazon Athena flips through only the pages relevant to your question.
- If your "textbook" is neatly organized by chapter (partitions) and printed in a compact format (Apache Parquet or ORC), Amazon Athena finishes in seconds and charges only for the pages it read.
- If your textbook is a messy pile of CSV paragraphs, Amazon Athena still answers, but reads more pages and charges more.
This is the core SAA-C03 lesson about Amazon Athena cost and performance: partitioning plus columnar formats equals fewer bytes scanned equals lower Amazon Athena cost and faster Amazon Athena responses.
Analogy 3: The door access card
AWS Lake Formation works like an enterprise badge system.
- Your IAM identity is your name on file at reception.
- Your Lake Formation permissions are the RFID permissions on your badge — "this badge opens databases A and B, tables X and Y, columns 1 through 4, but rows where region = APAC only."
- LF-Tags are color-coded stickers on your badge — "all badges with blue stickers open all blue rooms," so admins do not have to update every badge individually.
- Data filters are the partial room access within a single room — you may enter the conference room but only sit in the first three chairs.
Any SAA-C03 prompt about "different analysts must see different columns" or "contractors can see only aggregated rows" is an AWS Lake Formation fine-grained access question, not an IAM question.
Amazon Athena — Serverless SQL on Amazon S3, Pricing, and Partitioning
Amazon Athena is an interactive query service that runs standard ANSI SQL on data stored in Amazon S3 without any servers to manage. It is built on Presto and Trino, uses the AWS Glue Data Catalog as its metastore, and is billed per terabyte of data scanned. Amazon Athena is the default SAA-C03 answer for ad-hoc SQL, log analysis, and any "query the lake without loading it" scenario.
Amazon Athena core mechanics you must know
Amazon Athena reads data directly from Amazon S3. There are no clusters to provision and no capacity to plan. You submit a query in the Amazon Athena console, through the JDBC/ODBC driver, or through the StartQueryExecution API, and Amazon Athena returns results plus a saved copy in the Amazon Athena query results S3 bucket. Amazon Athena charges USD 5.00 per terabyte scanned in most regions (verify current pricing) with a 10 MB minimum per query, and there is no charge for DDL statements.
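The pricing rule above can be turned into a quick estimate. The following is a minimal sketch, assuming the USD 5.00 per terabyte figure quoted in the text, binary units (1 TB = 1024^4 bytes), and no rounding of bytes scanned; the function name `estimate_athena_cost` is illustrative and not part of any AWS SDK.

```python
PRICE_PER_TB_USD = 5.00            # figure quoted in the text; verify current pricing
BYTES_PER_MB = 1024 ** 2           # assumption: binary units
BYTES_PER_TB = 1024 ** 4
MIN_BILLED_BYTES = 10 * BYTES_PER_MB   # 10 MB minimum per query


def estimate_athena_cost(bytes_scanned: int) -> float:
    """Estimate the USD scan charge for a single query."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    if bytes_scanned == 0:
        return 0.0  # for example a DDL statement, which the text says is free
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * PRICE_PER_TB_USD
```

A query that scans a single kilobyte is billed as if it had scanned 10 MB, which is why very small lookups against Athena are never quite free.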
Amazon Athena Federated Query
Amazon Athena Federated Query lets you run the same SQL across non-S3 sources such as Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon DocumentDB, Amazon ElastiCache, HBase on Amazon EMR, SAP HANA, Snowflake, and custom sources. It works through AWS Lambda-based data source connectors deployed from the AWS Serverless Application Repository. Each connector translates Amazon Athena query fragments into the native API of the target store. On the SAA-C03 exam, any scenario that says "run a single SQL join across S3 and DynamoDB" or "query an on-premises database without ETL" points to Amazon Athena Federated Query.
Amazon Athena workgroups
Amazon Athena workgroups are the isolation and governance primitive inside Amazon Athena. A workgroup lets you separate users, teams, and applications; enforce per-query and per-workgroup data-scan limits; publish CloudWatch metrics for Amazon Athena usage; require query result encryption; and control the Amazon S3 location where query results are written. You can scope IAM policies to individual workgroups and tag each workgroup for cost allocation, which is how large organizations attribute Amazon Athena spend to different business units. Data usage control limits are a favorite SAA-C03 trap: they are set per workgroup and can cancel queries that exceed the scan threshold.
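As an illustration of these workgroup settings, here is a minimal sketch that builds the keyword arguments for the boto3 Athena `create_work_group` call without sending it. The workgroup name, bucket, and the helper `build_workgroup_request` are hypothetical, and the 10,000,000-byte floor on the per-query limit is taken from the API documentation as I understand it, so treat it as an assumption.

```python
MIN_SCAN_CUTOFF_BYTES = 10_000_000  # assumption: documented API minimum


def build_workgroup_request(name: str, output_location: str,
                            scan_limit_bytes: int) -> dict:
    """Build kwargs for athena.create_work_group (no AWS call is made here)."""
    if scan_limit_bytes < MIN_SCAN_CUTOFF_BYTES:
        raise ValueError("per-query scan limit is below the assumed minimum")
    return {
        "Name": name,
        "Configuration": {
            "ResultConfiguration": {
                "OutputLocation": output_location,  # where results are written
                "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
            },
            "EnforceWorkGroupConfiguration": True,    # clients cannot override
            "PublishCloudWatchMetricsEnabled": True,  # usage metrics
            "BytesScannedCutoffPerQuery": scan_limit_bytes,
        },
    }


request = build_workgroup_request(
    "marketing-adhoc",                       # hypothetical workgroup name
    "s3://example-athena-results/marketing/",  # hypothetical bucket
    50 * 1024 ** 3,                           # cancel queries past 50 GB
)
# With credentials configured: boto3.client("athena").create_work_group(**request)
```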
Amazon Athena query result reuse
Amazon Athena query result reuse is a cost-saving feature that returns a previously stored result when the same query is re-run within a configurable maximum age (up to 7 days). When an identical query is re-run within that window, Amazon Athena returns the stored result with zero data scanned and therefore zero scan charge. Reuse is requested per query, through the console option or the ResultReuseConfiguration parameter of StartQueryExecution. For SAA-C03 scenarios about "the same dashboard query runs every 15 minutes and we want to reduce Amazon Athena cost," enabling query result reuse is the first answer.
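A minimal sketch of how result reuse might be requested through the API follows. It only builds the keyword arguments for boto3's `start_query_execution`; the helper name, workgroup, database, and SQL text are hypothetical.

```python
MAX_REUSE_AGE_MINUTES = 7 * 24 * 60  # 7 days, the maximum age stated above


def build_query_request(sql: str, workgroup: str, database: str,
                        max_age_minutes: int = 60) -> dict:
    """Build kwargs for athena.start_query_execution with result reuse enabled."""
    if not 0 <= max_age_minutes <= MAX_REUSE_AGE_MINUTES:
        raise ValueError("max_age_minutes must be between 0 and 10080")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


dashboard_query = build_query_request(
    "SELECT region, SUM(total) FROM orders GROUP BY region",  # hypothetical SQL
    workgroup="dashboards",
    database="sales",
    max_age_minutes=15,
)
# With credentials configured:
# boto3.client("athena").start_query_execution(**dashboard_query)
```

Setting the maximum age to match the dashboard refresh interval means repeated runs inside that interval can be served without a new scan.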
To reduce Amazon Athena cost and latency, always do three things on the data lake. First, partition the data in Amazon S3 by columns that appear most often in WHERE predicates, such as date, region, or tenant, and avoid partition keys with so many distinct values that each partition holds only tiny files. Second, store files in a columnar format such as Apache Parquet or Apache ORC. Third, compress with Snappy or ZSTD. Partitioning prunes the files Amazon Athena has to open, columnar formats let Amazon Athena read only the columns the SQL references, and compression cuts the bytes scanned. Together they routinely reduce Amazon Athena cost by 90 percent or more.
Partitioning and partition projection
Partitioning splits a table across multiple S3 prefixes so that Amazon Athena can prune files at query time. Classic Hive-style partitioning uses directory names like s3://bucket/table/year=2026/month=04/day=20/. For tables with thousands of partitions, you can enable partition projection, which computes partition values and locations at query time from rules stored in the table properties instead of looking up partition metadata in the AWS Glue Data Catalog, removing the need to run MSCK REPAIR TABLE or Glue crawlers to register new partitions. Partition projection is the SAA-C03 answer whenever a scenario mentions "very large number of partitions" or "new partitions arrive every minute."
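The two layouts can be sketched in a few lines. The first helper reproduces the Hive-style prefix shown above; the second returns the kind of table properties partition projection relies on. This is a sketch only: the single `dt` partition key, the start date, and both helper names are assumptions for illustration, and the property keys should be checked against the Athena partition projection documentation.

```python
from datetime import date


def hive_partition_prefix(base: str, day: date) -> str:
    """Return the Hive-style S3 prefix for one day of data."""
    return (f"{base.rstrip('/')}/year={day.year:04d}"
            f"/month={day.month:02d}/day={day.day:02d}/")


def projection_properties(base: str, start: str = "2024-01-01") -> dict:
    """Table properties for date-based partition projection on a 'dt' key."""
    return {
        "projection.enabled": "true",
        "projection.dt.type": "date",
        "projection.dt.format": "yyyy-MM-dd",
        "projection.dt.range": f"{start},NOW",
        # ${dt} is substituted by Athena at query time, not by Python
        "storage.location.template": f"{base.rstrip('/')}/dt=${{dt}}/",
    }


prefix = hive_partition_prefix("s3://bucket/table", date(2026, 4, 20))
props = projection_properties("s3://bucket/table")
```

With projection, a new day of data becomes queryable as soon as the objects land under the templated prefix, because nothing has to be registered in the catalog first.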
Apache Parquet and Apache ORC
Apache Parquet and Apache ORC are columnar file formats. Because Amazon Athena bills by data scanned, and because SQL typically references only a few columns out of many, columnar formats can reduce Amazon Athena cost by 30 to 90 percent compared to row-oriented JSON or CSV. Parquet is the default recommendation on SAA-C03. AWS Glue ETL jobs can convert CSV or JSON into Parquet, and Amazon Data Firehose can convert records to Parquet inline before writing to Amazon S3.
Athena Performance Optimization — Partitioning, Columnar Formats, and Compression
Amazon Athena performance is almost entirely a function of how much data it has to read from Amazon S3. The SAA-C03 exam tests four optimization levers.
Lever 1 — Partition pruning
Amazon Athena uses the WHERE clause to skip entire partitions. A query that filters by WHERE dt = '2026-04-20' against a table partitioned on dt reads only one day of files. A query without a partition predicate scans everything. Partition pruning is free and should be the first optimization on any SAA-C03 scenario.
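To make the effect of pruning concrete, here is a toy model, not Athena itself: a table is represented as a mapping from partition value to bytes stored, and a query either names one `dt` partition or scans them all. The sizes are invented for illustration.

```python
from typing import Dict, Optional

GB = 1024 ** 3


def bytes_scanned(partition_sizes: Dict[str, int],
                  dt: Optional[str] = None) -> int:
    """Bytes read by a query, with or without a partition predicate on dt."""
    if dt is None:
        return sum(partition_sizes.values())  # no predicate: full scan
    return partition_sizes.get(dt, 0)          # predicate: one partition only


# Hypothetical table with three daily partitions
sizes = {"2026-04-18": 40 * GB, "2026-04-19": 42 * GB, "2026-04-20": 38 * GB}

full_scan = bytes_scanned(sizes)
pruned_scan = bytes_scanned(sizes, dt="2026-04-20")
```

In this example the pruned query reads 38 GB instead of 120 GB, and the gap widens with every additional day of history.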
Lever 2 — Columnar formats
Storing data as Apache Parquet or Apache ORC means Amazon Athena loads only the columns the SQL touches. A SELECT of three columns out of thirty scans roughly ten percent of the bytes compared to the same SELECT against CSV.
Lever 3 — Compression
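Compression reduces the bytes Amazon Athena reads. Inside Apache Parquet or Apache ORC, Snappy or ZSTD is applied per column chunk, so the files stay splittable and parallel readers can share the work. A GZIP-compressed CSV or JSON file usually compresses well but is not splittable, so a single reader must process the whole file and Amazon Athena cannot distribute the work as efficiently.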
Lever 4 — File size and count
Very small files cause Amazon Athena listing overhead. The recommendation is to keep file sizes between 128 MB and 1 GB. Many tiny files (the small-file problem) are a classic SAA-C03 trap; the fix is to compact files using AWS Glue or Amazon EMR.
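Compaction is normally done with an AWS Glue or Amazon EMR job, but the grouping logic can be sketched independently of either. The following is a simplified greedy planner, assuming a 128 MB target and files processed in listing order; `plan_compaction` is a hypothetical helper, not an AWS API.

```python
from typing import List

MB = 1024 ** 2


def plan_compaction(file_sizes: List[int],
                    target_bytes: int = 128 * MB) -> List[List[int]]:
    """Group small files into batches of at least target_bytes each.

    Each batch would be rewritten as one larger output file. The last
    batch may be smaller than the target.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches


# 300 files of 1 MB each collapse into three output files
plan = plan_compaction([1 * MB] * 300)
```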
Partition by the most-used WHERE predicate. Store as Apache Parquet with Snappy compression. Keep file sizes at 128 MB to 1 GB. Use Athena workgroups to control scan limits. Enable Athena query result reuse for repeated dashboards. These five practices are the core of every SAA-C03 Amazon Athena cost optimization question.
AWS Lake Formation — Centralized Data Lake Governance and Fine-Grained Access
AWS Lake Formation is the service that turns a raw S3 data lake into a governed data lake. It provides a single place to register Amazon S3 locations, define databases and tables in the AWS Glue Data Catalog, and grant permissions at the database, table, column, row, and cell level. AWS Lake Formation permissions are enforced by integrated engines including Amazon Athena, Amazon Redshift Spectrum, Amazon EMR (with runtime roles), AWS Glue, and Amazon QuickSight.
How AWS Lake Formation replaces manual S3 policies
Before AWS Lake Formation, you granted data lake access by writing S3 bucket policies, IAM policies, and KMS key policies. With AWS Lake Formation, you grant permissions the way you grant rights in a relational database, conceptually: GRANT SELECT ON TABLE sales.orders TO IAM_ROLE arn:aws:iam::...:role/Analyst. For registered locations, AWS Lake Formation, not the S3 bucket policy, becomes the authoritative access control layer for integrated engines. When a principal queries through Amazon Athena, AWS Lake Formation vends temporary credentials for the registered S3 location, and the engine returns only the columns and rows the principal is permitted to see.
AWS Lake Formation blueprints
AWS Lake Formation blueprints are pre-built workflows that ingest data from common sources into the data lake. Blueprints include database snapshots, database incremental loads, and log file ingestion (AWS CloudTrail logs and Elastic Load Balancing access logs from Classic Load Balancers and Application Load Balancers). A blueprint generates the AWS Glue crawlers, jobs, and triggers behind the scenes. On the SAA-C03 exam, any scenario that says "quickly onboard an on-premises database into the data lake with minimal custom code" points to a Lake Formation blueprint.
AWS Lake Formation permissions model
AWS Lake Formation permissions are defined on catalog objects (databases, tables, and columns), not on S3 paths. The main permission types are data access permissions (SELECT, INSERT, DELETE), metadata permissions (such as ALTER, DROP, and DESCRIBE), and the grant option, which controls who can pass permissions onward. Permissions are additive, and once a location is governed by Lake Formation, a query through an integrated engine without the corresponding Lake Formation grant is denied even if the IAM principal has s3:GetObject.
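A column-scoped grant on a catalog object might look like the following sketch, which builds the keyword arguments for boto3's Lake Formation `grant_permissions` call without sending it. The account ID, role name, database, table, and column names are all hypothetical.

```python
from typing import Iterable


def build_column_grant(role_arn: str, database: str, table: str,
                       columns: Iterable[str]) -> dict:
    """Build kwargs for lakeformation.grant_permissions (SELECT on some columns)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


grant = build_column_grant(
    "arn:aws:iam::111122223333:role/Analyst",  # hypothetical principal
    "sales",
    "orders",
    ["order_id", "order_total", "region"],     # PII columns left out
)
# With credentials configured:
# boto3.client("lakeformation").grant_permissions(**grant)
```

Note that the grant names a database, table, and columns in the catalog and never mentions an S3 path.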
LF-Tags for attribute-based access control (ABAC)
LF-Tags are key-value pairs you attach to databases, tables, or columns; you then grant a principal access to everything carrying a given tag combination. For example, tag all PII columns with sensitivity=pii and grant the data-scientist role read access only to sensitivity=public. Scaling ABAC through LF-Tags is far simpler than granting explicit permissions on thousands of tables. The SAA-C03 exam uses LF-Tags whenever a scenario mentions "many tables, many principals, tag-based strategy."
An LF-Tag is a key-value label in AWS Lake Formation that lets you implement attribute-based access control across the AWS Glue Data Catalog. Grants are expressed in terms of tag expressions rather than explicit resource ARNs, which scales governance to thousands of tables without enumerating each one.
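The matching rule behind a tag expression can be modeled locally. This is a simplified sketch, not the Lake Formation engine: it assumes a resource matches when, for every key in the expression, the resource carries that key with one of the listed values. Table names and tags are invented.

```python
from typing import Dict, List, Set


def matches(resource_tags: Dict[str, str],
            expression: Dict[str, Set[str]]) -> bool:
    """True if the resource satisfies every key in the tag expression."""
    return all(resource_tags.get(key) in values
               for key, values in expression.items())


def visible_tables(catalog: Dict[str, Dict[str, str]],
                   expression: Dict[str, Set[str]]) -> List[str]:
    """Names of catalog tables the expression grants access to, sorted."""
    return sorted(name for name, tags in catalog.items()
                  if matches(tags, expression))


# Hypothetical catalog: table name -> LF-Tags
catalog = {
    "customers_raw": {"sensitivity": "pii"},
    "orders": {"sensitivity": "public"},
    "products": {"sensitivity": "public"},
}

# The data-scientist role is granted sensitivity=public only
data_scientist_view = visible_tables(catalog, {"sensitivity": {"public"}})
```

Adding a fourth public table requires only tagging it; the existing grant covers it without any change.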
Row-level and column-level security
AWS Lake Formation column-level security (called column filtering) restricts which columns a principal sees. Column-level security is the SAA-C03 pattern for "analysts can see order totals but not customer PII." AWS Lake Formation row-level security (called row filtering) restricts which rows a principal sees based on a filter expression. Row-level security is the SAA-C03 pattern for "the EU analyst can only see orders from region = EU." A single data filter in AWS Lake Formation can combine both row and column filters into one reusable object, which is then granted to principals.
Data filtering — cell-level security
AWS Lake Formation data filtering combines row and column filters into named objects called data filters. A data filter is attached to a specific table and may include a row filter expression together with an included or excluded column list. When a principal queries through Amazon Athena or Amazon Redshift Spectrum, AWS Lake Formation applies the data filter transparently. The result is cell-level security (the intersection of allowed rows and allowed columns) without rewriting SQL.
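The "intersection of allowed rows and allowed columns" can be illustrated with an in-memory model. This is a sketch of the concept only, with rows as dictionaries and the row filter as a Python predicate; real data filters use a SQL-like filter expression evaluated by the query engine.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def apply_data_filter(rows: List[Row],
                      row_predicate: Callable[[Row], bool],
                      allowed_columns: List[str]) -> List[Row]:
    """Keep only permitted rows, then only permitted columns in each row."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]


# Hypothetical orders table
orders = [
    {"order_id": 1, "region": "EU", "total": 120.0, "email": "a@example.com"},
    {"order_id": 2, "region": "US", "total": 80.0, "email": "b@example.com"},
    {"order_id": 3, "region": "EU", "total": 45.5, "email": "c@example.com"},
]

# EU analyst: rows where region = 'EU', all columns except email
eu_view = apply_data_filter(
    orders,
    row_predicate=lambda r: r["region"] == "EU",
    allowed_columns=["order_id", "region", "total"],
)
```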
If the exam scenario asks for column-level, row-level, or cell-level access control on data lake tables, the answer is AWS Lake Formation data filters and LF-Tags, not a bespoke IAM and S3 bucket policy. Raw IAM and S3 can only grant or deny entire objects; they cannot hide specific columns or rows. AWS Lake Formation is the only AWS service that provides fine-grained data lake governance integrated with Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR.
Lake Formation Permissions vs S3 Bucket Policies — Layered Access Control
AWS Lake Formation does not replace IAM or S3 bucket policies — it layers on top of them. The SAA-C03 exam frequently tests the evaluation order.
The three layers
Layer one is IAM — the principal must have IAM permissions for the Amazon Athena or Amazon Redshift API actions (for example, athena:StartQueryExecution). Layer two is the AWS Glue Data Catalog resource policy and AWS Lake Formation permissions — AWS Lake Formation vends S3 credentials only if the principal has SELECT on the requested columns and rows. Layer three is the underlying S3 bucket policy and KMS key policy — AWS Lake Formation assumes an S3 data access role that needs S3 and KMS permissions.
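When a query is denied, it helps to check the layers in the order described above. The following is a toy troubleshooting model, assuming each layer can be reduced to a single yes or no; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessCheck:
    iam_allows_engine_api: bool        # e.g. athena:StartQueryExecution
    lake_formation_grant: bool         # SELECT on the requested columns/rows
    data_role_reads_s3_and_kms: bool   # data access role has S3 and KMS rights


def first_failing_layer(check: AccessCheck) -> Optional[str]:
    """Return the first layer that blocks the query, or None if all pass."""
    if not check.iam_allows_engine_api:
        return "IAM"
    if not check.lake_formation_grant:
        return "Lake Formation"
    if not check.data_role_reads_s3_and_kms:
        return "S3/KMS"
    return None


def query_allowed(check: AccessCheck) -> bool:
    """A query succeeds only when every layer allows it."""
    return first_failing_layer(check) is None
```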
Lake Formation administrators
AWS Lake Formation administrators are IAM principals who can register S3 locations, define LF-Tags, and grant permissions. You designate Lake Formation administrators in the Lake Formation console. On SAA-C03, the distinction between Lake Formation administrator (manages Lake Formation governance) and data lake consumer (runs Amazon Athena queries) is frequently tested.
AWS Lake Formation governs queries that go through Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, or AWS Glue. It does not govern a principal that calls s3:GetObject directly. If your security posture requires that only Lake Formation-integrated engines can read the lake, you must also restrict the S3 bucket policy to deny direct object access from non-Lake-Formation roles. Many SAA-C03 candidates forget the S3 layer and pick Lake Formation as a silver bullet.
Amazon QuickSight — BI Dashboards, SPICE In-Memory Engine, and ML Insights
Amazon QuickSight is the AWS cloud-native business intelligence service. It provides interactive dashboards, paginated reports, embedded analytics, and machine-learning-powered insights at per-user or per-session pricing. Amazon QuickSight integrates natively with Amazon Athena, Amazon Redshift, Amazon RDS, Amazon S3 through AWS Lake Formation, Salesforce, and on-premises databases through Direct Connect or private link.
SPICE — the in-memory accelerator
SPICE (Super-fast, Parallel, In-memory Calculation Engine) is the columnar in-memory cache inside Amazon QuickSight. Datasets imported into SPICE are queried at in-memory speed regardless of the underlying source. SPICE capacity is purchased per gigabyte, and each Amazon QuickSight user typically has a default SPICE allocation. For SAA-C03 scenarios about "dashboards must stay responsive for thousands of viewers while Amazon Athena cost must stay low," SPICE is the answer — refresh SPICE once per hour, let viewers hit SPICE instead of hitting Amazon Athena per click.
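A back-of-the-envelope comparison shows why SPICE keeps Athena cost low. The numbers below are invented for illustration (5,000 dashboard views per day, 2 GB scanned per query, hourly SPICE refresh), the USD 5.00 per terabyte rate is the one quoted earlier, and SPICE capacity charges are deliberately left out.

```python
PRICE_PER_TB_USD = 5.00   # Athena scan price quoted in this guide; verify
GB_PER_TB = 1024          # assumption: binary units


def daily_athena_cost(queries_per_day: int, gb_scanned_per_query: float) -> float:
    """Daily Athena scan cost in USD for a fixed query size."""
    return queries_per_day * gb_scanned_per_query / GB_PER_TB * PRICE_PER_TB_USD


# Every viewer click hits Athena directly
direct_query_cost = daily_athena_cost(queries_per_day=5000, gb_scanned_per_query=2.0)

# Viewers hit SPICE; Athena is queried only by the hourly refresh
spice_refresh_cost = daily_athena_cost(queries_per_day=24, gb_scanned_per_query=2.0)
```

Under these assumptions the Athena bill drops from roughly USD 49 to under USD 0.25 per day, because the number of scans no longer depends on the number of viewers.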
Embedded analytics
Amazon QuickSight embedded analytics let you embed dashboards and authoring experiences directly into your SaaS product or internal portal, with per-session pricing that scales with end-user activity. There are two embedding modes: registered-user embedding (each end user has an Amazon QuickSight identity) and anonymous embedding (end users are unauthenticated, identified only by your app's session). Embedded analytics is the SAA-C03 answer for "a SaaS vendor wants to ship dashboards to thousands of customer users at low cost."
ML Insights
Amazon QuickSight ML Insights uses built-in machine learning to detect anomalies, forecast future values, and generate auto-narratives (natural-language summaries of dashboards) without writing any ML code. Amazon Q in QuickSight extends this with natural-language Q&A — end users type "What were last quarter's top five products by margin?" and get an answer. On SAA-C03 this is the "business users want ML-powered dashboards with no data scientists" pattern.
Row-level security in QuickSight
Amazon QuickSight row-level security (RLS) restricts which rows each user or group can see within a single shared dataset. You define RLS rules in a permissions dataset that maps users to allowed row filter values. RLS in QuickSight complements AWS Lake Formation row-level security: Lake Formation enforces at the data lake query layer, while QuickSight RLS enforces inside QuickSight datasets, including those imported into SPICE. The SAA-C03 exam may present either pattern; the right answer depends on whether the restriction should happen before data reaches QuickSight (Lake Formation) or inside QuickSight (QuickSight RLS).
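The idea of a permissions dataset can be modeled in a few lines. This is a simplified sketch of the concept, not QuickSight's implementation: it assumes each rule names a user and a value per restricted field, that a blank value means "all values", and that a user with no rule sees nothing. User names and data are invented.

```python
from typing import Dict, List

Row = Dict[str, object]


def visible_rows(user: str, rules: List[Dict[str, str]],
                 rows: List[Row]) -> List[Row]:
    """Rows a user may see under the simplified rule model described above."""
    user_rules = [r for r in rules if r["UserName"] == user]
    if not user_rules:
        return []  # assumption: no rule means no access

    def allowed(row: Row) -> bool:
        # A row is visible if any one of the user's rules matches it
        return any(
            all(value in (None, "") or row.get(field) == value
                for field, value in rule.items() if field != "UserName")
            for rule in user_rules
        )

    return [row for row in rows if allowed(row)]


# Hypothetical permissions dataset and shared dataset
rules = [
    {"UserName": "ana", "region": "EU"},
    {"UserName": "admin", "region": ""},   # blank: all regions
]
dataset = [
    {"order_id": 1, "region": "EU"},
    {"order_id": 2, "region": "US"},
]
```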
When Amazon QuickSight queries Amazon Athena, Amazon Redshift Spectrum, or the AWS Glue Data Catalog against a Lake Formation-governed data lake, AWS Lake Formation permissions are enforced for the Amazon QuickSight principal automatically. This means you can centralize fine-grained access policy once in AWS Lake Formation and have it apply to both SQL users and Amazon QuickSight dashboards. You do not need to duplicate the rule in two places.
Athena vs Amazon Redshift — Ad-Hoc Query vs Data Warehouse Decision
This is the single most-tested distinction in SAA-C03 analytics questions.
Amazon Athena — when to pick it
Amazon Athena is the right answer when the data already lives in Amazon S3, when queries are ad hoc or infrequent, when you want pay-per-scan pricing with no always-on cost, when different teams need different views of the same files, and when you want to avoid ETL into a warehouse. Amazon Athena is inherently serverless, has no warm-up time, and charges nothing when idle.
Amazon Redshift — when to pick it
Amazon Redshift is the right answer when you need petabyte-scale data warehousing with complex joins and repeated queries across hundreds of business users, when you need sub-second dashboard latency against curated marts, when you have a predictable workload that benefits from reserved node pricing, or when you need features such as materialized views, stored procedures, and Amazon Redshift ML. Amazon Redshift Serverless blurs the line by auto-scaling capacity and pausing when idle, but Amazon Redshift is still positioned as the "serious warehouse" while Amazon Athena is positioned as "query the lake in place."
Amazon Redshift Spectrum — the bridge
Amazon Redshift Spectrum lets an Amazon Redshift cluster query data directly in Amazon S3 through the AWS Glue Data Catalog, the same way Amazon Athena does, but inside an Amazon Redshift SQL session so you can join warehouse tables with lake tables. On the SAA-C03 exam, Amazon Redshift Spectrum appears whenever the scenario needs "join warehouse fact tables with external S3 files without loading the S3 files."
Amazon EMR — when to pick it
Amazon EMR is the right answer when you need full programmatic control over Apache Spark, Apache Hadoop, Apache Hive, Apache HBase, Presto, Trino, or Apache Flink; when you run long batch jobs such as terabyte-scale ETL, machine learning preprocessing, or graph processing; and when you benefit from Spot Instances on the task fleet for cost savings. Amazon EMR is not serverless by default (though Amazon EMR Serverless exists), and it requires cluster lifecycle management.
The SAA-C03 exam writers love three-way traps. Amazon Athena is for ad-hoc SQL on Amazon S3 with no infrastructure. Amazon Redshift is for high-concurrency enterprise data warehousing with dedicated capacity and materialized views. Amazon EMR is for code-first big-data processing in Apache Spark or Apache Hadoop where you need framework-level control. If the question mentions "ad hoc, infrequent, serverless SQL," pick Amazon Athena. If it mentions "thousands of BI users, sub-second dashboards, ETL into curated marts," pick Amazon Redshift. If it mentions "Spark job, Hadoop, custom code, TB-scale transformation," pick Amazon EMR.
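The three-way decision can be written down as a small rule of thumb. This is a study aid under stated assumptions, not an official selection algorithm: it reduces each scenario to two yes or no questions and checks the code-first requirement before the warehouse requirement.

```python
def pick_service(needs_custom_code: bool,
                 high_concurrency_warehouse: bool) -> str:
    """Map two scenario traits to the service this guide recommends."""
    if needs_custom_code:
        # Spark, Hadoop, custom transformation code
        return "Amazon EMR"
    if high_concurrency_warehouse:
        # thousands of BI users, sub-second dashboards, curated marts
        return "Amazon Redshift"
    # ad hoc, infrequent, serverless SQL on data already in S3
    return "Amazon Athena"
```

For example, `pick_service(needs_custom_code=False, high_concurrency_warehouse=False)` returns "Amazon Athena", which matches the "ad hoc, infrequent, serverless SQL" wording in the paragraph above.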
Amazon OpenSearch Service — Search and Log Analytics
Amazon OpenSearch Service is the managed OpenSearch (and legacy Elasticsearch) engine that indexes text and time-series data for sub-second search. It complements Amazon Athena — Amazon Athena answers "sum revenue by region" (analytical SQL), while Amazon OpenSearch Service answers "find every log line containing ERROR in the last 15 minutes" (text search and observability).
When to pick OpenSearch on SAA-C03
Use Amazon OpenSearch Service when the workload is log analytics (application logs, AWS CloudTrail, VPC Flow Logs, Elastic Load Balancing access logs), full-text search (product catalog search, wiki search), real-time observability with OpenSearch Dashboards, or security information and event management (SIEM) with the Security Analytics plugin. A common SAA-C03 pipeline is Amazon Data Firehose → Amazon OpenSearch Service for logs, in parallel with Amazon Data Firehose → Amazon S3 → Amazon Athena for long-term analytics.
OpenSearch Serverless
Amazon OpenSearch Serverless removes cluster sizing by auto-scaling OpenSearch Compute Units (OCUs) for the indexing and search workloads independently. It is the SAA-C03 answer when the scenario says "variable search traffic, no capacity planning, pay-per-use."
Building and Securing Data Lakes — Bronze/Silver/Gold Zone Architecture
The multi-zone (or medallion) data lake pattern is the architectural reference pattern for the SAA-C03 exam.
Bronze (raw) zone
The bronze zone holds raw, unmodified data as it arrives from the source, with no schema enforcement and often in the original format (JSON, CSV, XML). Ingestion comes from Amazon Data Firehose, AWS DMS, AWS DataSync, or direct producer writes. AWS Lake Formation registers the S3 location and grants restrictive permissions (typically only the ETL role).
Silver (curated) zone
The silver zone holds cleansed, deduplicated, schema-enforced data, usually in Apache Parquet with partitioning by ingestion date. AWS Glue or Amazon EMR transforms bronze to silver. AWS Lake Formation grants SELECT on specific columns to analyst roles using LF-Tags.
Gold (consumable) zone
The gold zone holds business-ready aggregates and dimensional models designed for specific BI use cases. AWS Glue or Amazon Redshift materializes gold tables. Amazon QuickSight dashboards read from gold via Amazon Athena or Amazon Redshift Spectrum.
Zone-level governance with LF-Tags
Assign an LF-Tag zone=bronze|silver|gold to every table. Grant the ETL role access to zone=bronze,zone=silver. Grant the analyst role access to zone=silver,zone=gold. Grant the business user role access to zone=gold only. One LF-Tag hierarchy scales to thousands of tables. This is the reference pattern the SAA-C03 exam rewards.
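The zone grants described above can be expressed as tag-policy grants. The sketch below builds one set of keyword arguments per role for boto3's Lake Formation `grant_permissions` call, using an `LFTagPolicy` resource; nothing is sent to AWS. The account ID and role names are hypothetical, and the choice of SELECT plus DESCRIBE as the granted permissions is an assumption.

```python
from typing import Dict, List

# Role-to-zone mapping taken from the paragraph above
ZONE_ACCESS: Dict[str, List[str]] = {
    "EtlRole": ["bronze", "silver"],
    "AnalystRole": ["silver", "gold"],
    "BusinessUserRole": ["gold"],
}

ACCOUNT_ID = "111122223333"  # hypothetical account


def build_zone_grant(role_name: str, zones: List[str]) -> dict:
    """Build kwargs for lakeformation.grant_permissions using an LF-Tag policy."""
    return {
        "Principal": {
            "DataLakePrincipalIdentifier":
                f"arn:aws:iam::{ACCOUNT_ID}:role/{role_name}"
        },
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": "zone", "TagValues": list(zones)}],
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],  # assumption: read-only access
    }


zone_grants = {role: build_zone_grant(role, zones)
               for role, zones in ZONE_ACCESS.items()}
```

Three grants cover every current and future table, as long as each table carries a `zone` tag.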
Visualization Strategies — QuickSight vs Third-Party BI Tools
SAA-C03 rarely asks about third-party BI tools such as Tableau or Looker by name, but it does ask about architectural integration.
Connection patterns
Third-party BI tools connect to Amazon Athena over JDBC/ODBC, to Amazon Redshift over JDBC/ODBC, or to Amazon S3 directly for extract. The first two patterns respect AWS Lake Formation column-level and row-level permissions as long as the tool queries through Amazon Athena or Amazon Redshift Spectrum. Direct s3:GetObject access from a third-party tool bypasses AWS Lake Formation governance.
When to pick Amazon QuickSight
Pick Amazon QuickSight when you want pay-per-session pricing, tight integration with AWS IAM Identity Center, SPICE for scale, embedded analytics for SaaS products, Amazon Q in QuickSight for natural-language Q&A, and no infrastructure to run. Pick a third-party tool when the organization already has enterprise agreements, needs complex pixel-perfect paginated reports, or requires advanced customization that Amazon QuickSight does not offer.
Key Numbers to Memorize (Athena per-TB pricing, QuickSight SPICE capacity)
SAA-C03 often tests numeric thresholds rather than prose.
- Amazon Athena pricing: USD 5.00 per terabyte scanned in most regions, 10 MB minimum per query, DDL free.
- Amazon Athena query result reuse maximum age: 7 days.
- Amazon Athena data usage control limit: set per workgroup, can cancel queries exceeding scan threshold.
- Apache Parquet recommended file size: 128 MB to 1 GB.
- AWS Lake Formation permission scope: database, table, column, row, cell.
- AWS Lake Formation data filter: defined on a single table, combines a row filter with a column list, and is granted to principals together with SELECT.
- Amazon QuickSight Standard Edition vs Enterprise Edition: Enterprise adds SPICE at scale, row-level security, audit, hourly refresh.
- Amazon QuickSight SPICE default user allocation: typically 10 GB per Enterprise user, additional capacity purchased in GB-months.
- Amazon QuickSight embedded analytics: pricing per session (30-minute sessions) for anonymous embedding.
- Amazon QuickSight ML Insights: available only in Enterprise Edition.
- Amazon OpenSearch Service OCU: independent indexing OCUs and search OCUs scale separately in OpenSearch Serverless.
Always verify the latest numbers against the AWS documentation before the exam — pricing is the one detail that drifts.
Common Exam Traps — Athena vs Redshift for Interactive Queries, Lake Formation vs Glue Catalog
SAA-C03 uses a repeating set of traps in analytics questions.
Trap 1 — Athena vs Redshift for interactive queries
Candidates sometimes pick Amazon Redshift for any BI dashboard question. That is wrong when the data already lives in Amazon S3 and the dashboard refreshes infrequently — Amazon Athena plus Amazon QuickSight SPICE is cheaper and simpler. Amazon Redshift wins only when concurrency, sub-second repeated queries, and curated marts dominate.
Trap 2 — Lake Formation vs AWS Glue Data Catalog confusion
The AWS Glue Data Catalog is the metadata store. AWS Lake Formation is the permissions layer on top of the catalog. You cannot have AWS Lake Formation without the AWS Glue Data Catalog, but you can have the AWS Glue Data Catalog without AWS Lake Formation (IAM-based permissions only). When the question asks about column-level or row-level security, AWS Lake Formation is required.
Trap 3 — Athena Federated Query vs running ETL
If the scenario says "run a one-time SQL across Amazon DynamoDB and Amazon S3," pick Amazon Athena Federated Query, not an AWS Glue ETL job that copies DynamoDB into Amazon S3. Federated Query avoids the data movement.
Trap 4 — QuickSight SPICE vs live query
If the scenario says "hundreds of viewers, stable data refreshed hourly," load into SPICE. If the scenario says "real-time operational dashboard where every refresh must see the latest millisecond," use a direct query (live connection) to the source. Do not default to SPICE for everything.
Trap 5 — OpenSearch vs Athena for log search
For "find this error string in the last 15 minutes," Amazon OpenSearch Service is the right answer. For "aggregate error rates by hour over the last 30 days," Amazon Athena on Amazon S3 logs is cheaper. Many scenarios layer both — OpenSearch for hot search, Athena for cold analytics.
Trap 6 — EMR Serverless vs Athena
If the scenario needs Apache Spark code (PySpark, Scala, DataFrames), Amazon EMR Serverless is the answer, and it is just as serverless as Amazon Athena. On the exam, Amazon Athena is the SQL query service. Do not pick Amazon Athena for code-first ETL just because it is serverless.
Trap 7 — Partition projection vs Glue crawlers
For tables with millions of partitions arriving frequently, partition projection beats AWS Glue crawlers on both cost and freshness. Crawlers suit discovery of new tables; projection suits high-cardinality partition growth on known tables.
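Partition projection is configured through table properties rather than by registering each partition. The sketch below builds those properties for a single date partition column and renders them as a TBLPROPERTIES clause. The column name `dt`, the S3 prefix, and the start date are hypothetical.

```python
def projection_properties(s3_prefix, start_date):
    """Table properties that let Athena compute daily partitions for column `dt`."""
    return {
        "projection.enabled": "true",
        "projection.dt.type": "date",
        "projection.dt.format": "yyyy/MM/dd",
        "projection.dt.range": f"{start_date},NOW",  # open-ended range up to today
        "projection.dt.interval": "1",
        "projection.dt.interval.unit": "DAYS",
        # ${dt} is filled in by Athena at query time, not by Python.
        "storage.location.template": f"{s3_prefix}/${{dt}}/",
    }


def to_tblproperties(props):
    """Render the properties as a TBLPROPERTIES clause for CREATE or ALTER TABLE."""
    body = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n  {body}\n)"


props = projection_properties("s3://example-logs/app", "2023/01/01")
print(to_tblproperties(props))
```

Because the partition locations are derived from the template at query time, no crawler run or partition-registration step is needed when a new day's prefix appears.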
Analytics Architecture Pattern — Ingest → Store → Catalog → Query → Visualize
The canonical SAA-C03 analytics pipeline connects every service in this topic.
- Ingest — Amazon Data Firehose (formerly Amazon Kinesis Data Firehose; streaming), AWS DMS (CDC from databases), AWS DataSync (file transfer), or AWS Transfer Family (SFTP). Firehose can convert to Parquet and partition on the fly.
- Store — Amazon S3 organized by bronze/silver/gold zones with partitioning by date.
- Catalog — AWS Glue crawlers or manual Data Definition Language (DDL) populate the AWS Glue Data Catalog.
- Govern — AWS Lake Formation registers S3 locations, assigns LF-Tags, and grants column/row/cell-level permissions.
- Transform — AWS Glue ETL or Amazon EMR transforms bronze to silver to gold, enforcing schema and data quality.
- Query — Amazon Athena for ad-hoc SQL, Amazon Redshift Spectrum for warehouse joins, Amazon EMR for Spark workloads, Amazon OpenSearch Service for text and logs.
- Visualize — Amazon QuickSight dashboards powered by SPICE, with embedded analytics for SaaS consumers and Amazon Q in QuickSight for natural-language Q&A.
Memorize this pipeline and the order of services. A large portion of SAA-C03 analytics questions describe a slice of this exact pipeline.
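For the Store stage, one common layout is a zone prefix followed by Hive-style date partitions. The helper below sketches that convention. The bucket name, table name, and the exact `year=/month=/day=` key scheme are assumptions for illustration, not a required layout.

```python
from datetime import date

ZONES = ("bronze", "silver", "gold")  # raw, cleaned, and curated zones


def partition_prefix(bucket, zone, table, day):
    """Return the S3 prefix for one day's partition of a table in a given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    # Hive-style key=value folders let query engines prune partitions by date.
    return (
        f"s3://{bucket}/{zone}/{table}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
    )


prefix = partition_prefix("example-lake", "silver", "orders", date(2024, 3, 5))
print(prefix)
```

A query that filters on year, month, and day then reads only the matching prefixes, which is what keeps pay-per-scan costs down in the Query stage.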
FAQ — Athena, Lake Formation, and QuickSight Top Questions
Q1. When should I choose Amazon Athena over Amazon Redshift on SAA-C03?
Choose Amazon Athena when the data already lives in Amazon S3, queries are ad hoc or infrequent, and you want pay-per-scan serverless pricing with zero idle cost. Choose Amazon Redshift when you need petabyte-scale warehousing, high-concurrency sub-second dashboards, materialized views, stored procedures, or a classic extract-transform-load pipeline into curated marts. Amazon Redshift Serverless is the middle ground when you want warehouse features without cluster sizing.
Q2. How does AWS Lake Formation differ from AWS Glue Data Catalog?
The AWS Glue Data Catalog is the metadata layer — databases, tables, and column schemas. AWS Lake Formation is the governance layer on top — LF-Tags, column-level permissions, row-level permissions, and data filters. AWS Lake Formation cannot function without the AWS Glue Data Catalog, but the catalog can be used with plain IAM permissions if you do not need fine-grained access. On SAA-C03, fine-grained access always means AWS Lake Formation.
Q3. What are LF-Tags, and when do I use them instead of explicit grants?
LF-Tags are key-value labels on databases, tables, or columns that let you grant permissions by tag expression (attribute-based access control). Use LF-Tags whenever the organization has many tables and many principals — tagging once scales better than granting explicit permissions on every resource. Use explicit grants for small or static schemas where enumeration is simple.
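The scaling argument is easier to see with a toy model. The sketch below is a simplified, local simulation of tag-based matching, not the Lake Formation API. It assumes an expression matches a resource when every tag key in the expression is present on the resource with one of the allowed values. All table names and tags are made up.

```python
def matches(expression, resource_tags):
    """True when every key in the expression has an allowed value on the resource."""
    return all(
        resource_tags.get(key) in allowed_values
        for key, allowed_values in expression.items()
    )


# Hypothetical catalog: each table is tagged once.
tables = {
    "orders": {"domain": "sales", "classification": "internal"},
    "customers_pii": {"domain": "sales", "classification": "restricted"},
    "web_logs": {"domain": "marketing", "classification": "internal"},
}

# One grant expressed as tags instead of one grant per table.
analyst_grant = {"domain": ["sales"], "classification": ["internal", "public"]}

visible = sorted(name for name, tags in tables.items() if matches(analyst_grant, tags))
print(visible)
```

A new sales table tagged `classification=internal` becomes visible to the same analysts without any new grant, which is the practical advantage over enumerating resources.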
Q4. How do I make Amazon QuickSight dashboards scale to thousands of viewers without blowing up Amazon Athena cost?
Import the dashboard dataset into SPICE and refresh SPICE at a reasonable cadence (for example hourly). Viewers then query SPICE at in-memory speed rather than hitting Amazon Athena per click, which avoids repeated scans. Combine SPICE with Amazon Athena query result reuse for the refresh step itself, and enable Amazon QuickSight row-level security if different viewers must see different row subsets.
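A back-of-the-envelope calculation shows why this matters. Every number below is an assumption chosen for illustration: the per-terabyte price, the data scanned per query, the viewer count, and the clicks per viewer. The model also ignores SPICE capacity and QuickSight user pricing, so treat it as a sketch of the scan cost only.

```python
def daily_scan_cost(queries_per_day, tb_scanned_per_query, price_per_tb):
    """Cost of one day's Athena scans under simple pay-per-scan pricing."""
    return queries_per_day * tb_scanned_per_query * price_per_tb


PRICE_PER_TB = 5.0            # assumed price per TB scanned; verify current pricing
TB_PER_QUERY = 0.002          # hypothetical: each query scans about 2 GB
VIEWERS, CLICKS = 2000, 10    # hypothetical viewers and clicks per viewer per day

# Direct query: every click runs a query against Athena.
live_cost = daily_scan_cost(VIEWERS * CLICKS, TB_PER_QUERY, PRICE_PER_TB)
# SPICE: only the 24 hourly refreshes reach Athena.
spice_cost = daily_scan_cost(24, TB_PER_QUERY, PRICE_PER_TB)

print(f"direct query: {live_cost:.2f}, hourly SPICE refresh: {spice_cost:.2f}")
```

Under these assumed figures the scan cost drops from about 200 to about 0.24 per day, because the number of Athena queries depends on the refresh schedule rather than on viewer activity.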
Q5. What is the SAA-C03 quick rule for Amazon Athena vs Amazon EMR vs Amazon Redshift vs Amazon OpenSearch Service?
Amazon Athena is for serverless SQL on Amazon S3 (ad-hoc, infrequent, cost-efficient). Amazon Redshift is for high-concurrency data warehousing (enterprise BI, curated marts, sub-second dashboards). Amazon EMR is for code-first big-data processing (Spark, Hadoop, custom frameworks). Amazon OpenSearch Service is for text search and log analytics (observability, full-text, SIEM). Read the scenario, match the dominant verb — "query," "warehouse," "transform with Spark," or "search logs" — and pick the corresponding service.
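The quick rule is effectively a lookup table. The snippet below writes it down as one, as a memorization aid rather than a real selection tool. The function name is hypothetical, and the keys are the four dominant verbs from the answer above.

```python
SERVICE_BY_VERB = {
    "query": "Amazon Athena",
    "warehouse": "Amazon Redshift",
    "transform with spark": "Amazon EMR",
    "search logs": "Amazon OpenSearch Service",
}


def pick_service(dominant_verb):
    """Map the scenario's dominant verb to the service the quick rule suggests."""
    key = dominant_verb.strip().lower()
    if key not in SERVICE_BY_VERB:
        raise KeyError(f"no quick rule for verb: {dominant_verb!r}")
    return SERVICE_BY_VERB[key]


choice = pick_service("Search logs")
```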
Q6. Does AWS Lake Formation protect data if a user has direct s3:GetObject permissions?
No. AWS Lake Formation governs queries that route through Amazon Athena, Amazon Redshift Spectrum, Amazon EMR with runtime roles, or AWS Glue. A principal with direct s3:GetObject and s3:ListBucket permissions bypasses AWS Lake Formation. To make AWS Lake Formation the exclusive access path, tighten the S3 bucket policy to deny direct access from any role except the AWS Lake Formation data access role. This defense-in-depth step is a frequent SAA-C03 gotcha.
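A minimal sketch of such a bucket policy follows. It denies object reads to every principal except one role. The bucket name and role ARN are hypothetical, and a real policy would usually also exempt ingestion and administrative roles and cover `s3:ListBucket` on the bucket itself, so review it carefully before applying anything like it.

```python
import json


def deny_direct_reads_policy(bucket, lake_formation_role_arn):
    """Bucket policy that blocks object reads from all principals except one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDirectReadsExceptLakeFormationRole",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # The deny applies whenever the caller is NOT the exempted role.
                "Condition": {
                    "ArnNotEquals": {"aws:PrincipalArn": lake_formation_role_arn}
                },
            }
        ],
    }


policy = deny_direct_reads_policy(
    "example-lake",                                         # hypothetical bucket
    "arn:aws:iam::111122223333:role/LakeFormationAccess",   # hypothetical role
)
print(json.dumps(policy, indent=2))
```

An explicit Deny overrides any Allow in an identity policy, so a principal holding `s3:GetObject` can no longer read the objects directly and has to go through a Lake Formation integrated service.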