AWS Glue ETL and Amazon EMR are the two AWS services that transform raw data into analysis-ready data at scale. On the AWS Certified Solutions Architect Associate (SAA-C03) exam, Task Statement 3.5 asks you to "determine high-performing data ingestion and transformation solutions" — which is a direct test of whether you can pick AWS Glue ETL for small-to-medium serverless jobs, Amazon EMR for large-scale Spark and Hadoop clusters, and AWS Step Functions to orchestrate the full pipeline. Expect three to five exam questions on AWS Glue ETL, EMR, and Step Functions in every SAA-C03 sitting.
This study guide decodes every piece of the AWS Glue ETL family (Data Catalog, crawlers, Glue Studio visual ETL, DataBrew recipes, Schema Registry, job bookmarks, triggers, PySpark and Python shell jobs, DPU pricing), walks through Amazon EMR (EMR on EC2 vs EMR Serverless vs EMR on EKS, master/core/task node architecture, HDFS vs EMRFS, Spot for task nodes), and finishes with the Glue-vs-EMR decision tree you need to answer scenario questions in under thirty seconds.
What Is Data Transformation in the AWS Analytics Ecosystem?
Data transformation is the middle step in the classic pattern Ingest → Store → Transform → Catalog → Query → Visualize. Raw data lands in Amazon S3 from sources like Amazon Kinesis Data Streams, Amazon Data Firehose, AWS Database Migration Service (DMS), or AWS DataSync. Before analysts or dashboards can use it, the data usually needs cleaning (drop duplicates, mask PII), reshaping (join, aggregate, pivot), and re-encoding (CSV to Apache Parquet, JSON to ORC) so that downstream query engines such as Amazon Athena and Amazon Redshift Spectrum run faster and cheaper.
AWS offers three primary compute options for transformation, and the SAA-C03 exam tests the trade-off between them:
- AWS Glue ETL — serverless Spark (or lightweight Python shell) with zero cluster management. Pay per DPU-second.
- Amazon EMR — managed Hadoop ecosystem (Spark, Hive, Presto, Flink, HBase) for the largest and most customized workloads.
- AWS Lambda — only for tiny transformations under 15 minutes and 10 GB memory; not the primary ETL engine, but valid for event-driven micro-transforms.
Around these three live two catalog and governance services (AWS Glue Data Catalog and AWS Lake Formation), one orchestration service (AWS Step Functions), and the storage layer (Amazon S3). The exam tests whether you can assemble the right pieces.
Why data transformation matters for SAA-C03
Task Statement 3.5 explicitly names data ingestion and transformation. Glue and EMR account for the vast majority of transformation questions. Missing a single scenario keyword (like "serverless," "Spark code control," or "Spot task nodes") is the single biggest reason architects lose points in Domain 3.
Plain-Language Explanation: AWS Glue ETL and EMR
Three analogies make AWS Glue ETL and Amazon EMR click.
Analogy 1 — The commercial kitchen
Imagine the data lake is a restaurant.
- Amazon S3 is the walk-in cold room where you dump every raw delivery (CSV, JSON, Avro, logs).
- AWS Glue Data Catalog is the inventory clipboard that lists every ingredient, which shelf it is on, and what it is shaped like (schema). AWS Glue Crawlers are the sous-chefs who walk the shelves and update the clipboard automatically.
- AWS Glue ETL jobs are the prep cooks — serverless, show up only when there is work, go home the minute the prep is done. No idle wages.
- AWS Glue Studio is the visual recipe card (drag-and-drop) that tells the prep cook what to chop.
- AWS Glue DataBrew is a quality-control chef who fixes typos, trims fat, and standardizes portions using pre-built recipes — no code needed.
- Amazon EMR is the industrial 24-burner line kitchen with custom equipment. You pay for the full kitchen (the cluster), you tune every burner (Spark configs), you handle the shift schedule (node sizes). Best when you cook for thousands of covers.
- AWS Step Functions is the head chef who calls the tickets: "first Glue, then EMR, then Redshift COPY, then alert." The orchestrator, not a cook itself.
If the scenario says "5 GB of logs per day, no ops team," grab a prep cook (AWS Glue ETL). If the scenario says "20 PB of clickstream, custom Spark tuning, GPU ML preprocessing," fire up the industrial kitchen (Amazon EMR).
Analogy 2 — The assembly line at a factory
Picture a conveyor belt factory.
- Amazon Kinesis or S3 upload is the incoming raw-material delivery dock.
- AWS Glue crawlers are barcode scanners that auto-label every pallet (table name, columns, types) into the AWS Glue Data Catalog registry.
- AWS Glue Schema Registry is the parts-compatibility manual — every producer and consumer must match the declared schema, preventing a screw-head mismatch downstream.
- AWS Glue ETL PySpark jobs are the robotic arms that reshape parts on the line (filter, join, aggregate, convert CSV to Parquet).
- AWS Glue job bookmarks are the "last processed" sticker on each pallet — the robot remembers where it stopped, so next run only processes new pallets (incremental loads, no reprocessing).
- AWS Glue triggers are the line start-button rules — cron, on-demand, or event-driven (job A finished → start job B).
- Amazon EMR on EC2 is the heavy-forge station where you control every hammer (master, core, task nodes). EMR task nodes on Spot Instances are temp workers paid 70–90% less.
- HDFS is the local parts bin on each forge (ephemeral). EMRFS is the central warehouse (Amazon S3) shared across the whole factory — survives cluster shutdown.
- EMR Serverless is an "on-demand robot rental" — only pay per job, no cluster to plan.
- EMR on EKS is the Kubernetes-managed version where the robots run inside containers scheduled by the company-wide K8s cluster.
- AWS Step Functions is the PLC controller wiring the whole line together into a state machine.
Analogy 3 — The postal sorting center
The data lake is a letter-sorting center.
- Incoming sacks (raw Amazon S3 objects) arrive at loading bays.
- A crawler (AWS Glue crawler) walks the bays, reads postcodes, and updates the central directory (AWS Glue Data Catalog).
- A prep clerk (AWS Glue ETL job) reroutes letters into sorted bins (CSV → Parquet → partition by date).
- A heavy-duty sorter (Amazon EMR) is used on peak days — the whole Christmas rush cluster.
- Spot-instance task nodes are the seasonal part-time clerks — cheap, but you cannot guarantee they show up.
- AWS Step Functions is the postmaster's daily work schedule calling "first sort, then reroute, then ship, then report."
- Amazon Athena and Amazon Redshift at the end of the line are the clerks who answer customer questions.
These three pictures share one rule: Glue is serverless prep, EMR is a managed factory, Step Functions is the orchestrator. Keep this mental model and the exam questions become keyword matching.
AWS Glue — The Serverless ETL Platform
AWS Glue is AWS's serverless data-integration service. It has five major sub-components that the SAA-C03 exam tests as one bundle.
AWS Glue Data Catalog — Central metadata store
The AWS Glue Data Catalog is a centralized, Apache Hive–compatible metadata repository. It stores databases (logical groupings), tables (schema, column types, partition spec), connections (to RDS, Redshift, MongoDB, JDBC, Kafka), and classifiers (rules for crawler detection).
The Data Catalog is free for the first million objects and priced per 100,000 objects per month above that. The same catalog is shared across Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Glue ETL jobs, and AWS Lake Formation. That means one table registration powers every AWS analytics service in your account — a massive design win.
The AWS Glue Data Catalog is a persistent, managed metadata store that holds database and table definitions (schema, location in Amazon S3, partition information) for data processed by AWS Glue, Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and AWS Lake Formation. It is a region-level service, Apache Hive metastore compatible, and forms the backbone of every AWS data lake.
AWS Glue Crawlers — Automatic schema discovery
A crawler is a managed process that scans a data store (Amazon S3 prefix, Amazon DynamoDB table, JDBC database, Delta Lake, Apache Iceberg table, Apache Hudi table) and automatically populates or updates tables in the AWS Glue Data Catalog. The crawler infers column names, types, compression, file formats (CSV, JSON, Avro, Parquet, ORC, XML), and partition layout (for example, s3://bucket/year=2026/month=04/day=20/).
Key crawler exam facts:
- Crawlers can run on-demand or on a schedule (cron expression).
- Crawlers support incremental crawls for Amazon S3 — only new or changed partitions are processed, cutting cost.
- A crawler uses an IAM role to read the source and write to the catalog.
- Classifiers can be built-in or custom (for proprietary formats).
On SAA-C03 scenarios where the question says "automatically keep the Glue Data Catalog up to date as new S3 partitions land daily," the correct pattern is a scheduled AWS Glue Crawler with incremental crawl mode. Writing and maintaining manual ALTER TABLE ADD PARTITION scripts is the wrong answer in almost every exam case.
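In code, the "scheduled crawler with incremental crawl mode" pattern boils down to a handful of parameters. The sketch below shows the shape of a boto3 create_crawler call; the bucket, role, database, and crawler names are hypothetical placeholders:

```python
# Sketch: parameters for a scheduled, incremental AWS Glue crawler.
# All names, ARNs, and paths below are hypothetical placeholders.
crawler_params = {
    "Name": "daily-sales-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "sales_lake",
    "Targets": {"S3Targets": [{"Path": "s3://example-lake/raw/sales/"}]},
    # Run every day at 02:00 UTC.
    "Schedule": "cron(0 2 * * ? *)",
    # Incremental crawl: only S3 folders added since the last run are scanned.
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # pick up new columns
        "DeleteBehavior": "LOG",                 # never silently drop tables
    },
}

# With AWS credentials configured, this dict would be passed to:
#   boto3.client("glue").create_crawler(**crawler_params)
```

The RecrawlPolicy line is what turns a full re-scan into the cheaper incremental crawl the exam expects.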
AWS Glue ETL Jobs — PySpark and Python shell
An AWS Glue job is the compute unit that runs your transformation code. There are three job types:
- Spark (PySpark or Scala) — distributed, runs on the serverless Apache Spark engine optimized for data-lake workloads. Use for large datasets (GBs to TBs), joins, aggregations, and format conversions. Supports Glue's DynamicFrame abstraction on top of Spark DataFrame for schema-flexible handling.
- Spark Streaming — consume from Amazon Kinesis Data Streams or Amazon MSK, emit to S3, DynamoDB, etc. Continuously running job type.
- Python shell — a single Python worker (no Spark) for small jobs (less than 1 GB). Ideal for lightweight API calls, small-file transformations, or orchestrating other services. Cheaper because it runs as 1 DPU or 1/16 DPU.
A Glue job runs on Data Processing Units (DPUs). One DPU = 4 vCPU + 16 GB memory + 64 GB disk. Spark jobs default to 10 DPUs (configurable 2–100+). Python shell jobs default to 1/16 or 1 DPU.
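The job types and DPU sizing above map directly onto Glue's create_job parameters. A minimal sketch, with hypothetical role, script locations, and job names:

```python
# Sketch: boto3 create_job parameter shapes for the two common Glue job types.
# Role ARN, script paths, and job names are hypothetical placeholders.
spark_job = {
    "Name": "csv-to-parquet",
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
    "Command": {
        "Name": "glueetl",               # Spark job type
        "ScriptLocation": "s3://example-scripts/csv_to_parquet.py",
        "PythonVersion": "3",
    },
    "GlueVersion": "4.0",
    "WorkerType": "G.1X",                # 1 DPU per worker (4 vCPU, 16 GB)
    "NumberOfWorkers": 10,               # ~10 DPUs, the Spark default
}

python_shell_job = {
    "Name": "tiny-api-sync",
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
    "Command": {
        "Name": "pythonshell",           # single Python worker, no Spark
        "ScriptLocation": "s3://example-scripts/api_sync.py",
        "PythonVersion": "3.9",
    },
    "MaxCapacity": 0.0625,               # 1/16 DPU: smallest Glue footprint
}
```

Note the asymmetry: Spark jobs are sized with WorkerType/NumberOfWorkers, while Python shell jobs use MaxCapacity with the fractional 1/16-DPU option.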
AWS Glue Studio — Visual ETL
AWS Glue Studio is a web-based low-code visual interface for building, running, and monitoring AWS Glue ETL jobs. Users drag Source → Transform → Target nodes on a canvas. Under the hood, Glue Studio generates PySpark code you can edit. Built-in transforms include Join, Filter, SelectFields, ApplyMapping, Aggregate, SQL, Custom Transform, and data-quality rules.
Glue Studio is the right pick when the exam scenario mentions "data engineers without Spark experience" or "visual no-code ETL."
AWS Glue DataBrew — Visual data preparation with recipes
AWS Glue DataBrew is a purely visual, no-code data-preparation tool aimed at data analysts (not engineers). DataBrew ships with 250+ pre-built transforms (recipes) like "remove special characters," "split by delimiter," "format dates," and "mask PII columns." Recipes are reusable and version-controlled.
DataBrew profiles datasets (statistics, missing values, outliers) and can run jobs on a schedule. It is distinct from Glue Studio — DataBrew is for data cleaning by analysts; Glue Studio is for ETL pipelines by engineers.
SAA-C03 scenarios love this trap. AWS Glue DataBrew = visual data-prep recipes for business analysts (no code, focus on cleaning). AWS Glue Studio = visual ETL pipeline builder for engineers (generates PySpark under the hood). AWS Glue ETL jobs = code-first PySpark/Python shell for full control. If the question emphasizes "analysts with no SQL or coding skills clean and profile data," pick DataBrew. If it says "drag-and-drop ETL pipeline generating Spark," pick Glue Studio. If it says "write custom PySpark transformations," pick plain Glue ETL.
AWS Glue Schema Registry — Centralized schema governance
AWS Glue Schema Registry lets producers and consumers of streaming data validate and evolve schemas centrally. It supports Apache Avro, JSON Schema, and Protobuf. Integrates with Amazon Kinesis Data Streams, Amazon MSK, AWS Lambda, and Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics).
The exam value: pick Schema Registry when the scenario emphasizes "prevent incompatible schema changes from breaking downstream consumers in a streaming pipeline." Registered schemas have compatibility modes (BACKWARD, FORWARD, FULL, NONE).
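As a sketch, registering a schema with a compatibility mode might look like the boto3 create_schema parameters below; the registry name, schema name, and Avro record are made up for illustration:

```python
import json

# Sketch: registering an Avro schema with BACKWARD compatibility in the
# AWS Glue Schema Registry. Registry and schema names are hypothetical.
avro_schema = {
    "type": "record",
    "name": "ClickEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}

create_schema_params = {
    "RegistryId": {"RegistryName": "streaming-events"},
    "SchemaName": "click-event",
    "DataFormat": "AVRO",
    # BACKWARD: consumers using the new schema can still read data written
    # with the previous schema, so existing consumers keep working.
    "Compatibility": "BACKWARD",
    "SchemaDefinition": json.dumps(avro_schema),
}

# Would be passed to boto3.client("glue").create_schema(**create_schema_params)
```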
AWS Glue Job Bookmarks — Incremental processing state
Job bookmarks are AWS Glue's built-in mechanism for remembering which files (or which database rows) have already been processed. On the next run, Glue skips processed records and only transforms new data.
Job bookmark modes:
- Enable — track and persist state.
- Disable — reprocess everything each run.
- Pause — run without updating state (useful for debugging or reprocessing a specific partition).
Job bookmarks are the exam-correct answer for "incremental ETL without duplicate processing" and "resume from last successful run."
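The bookmark idea can be illustrated with a toy simulation. This is not Glue's real implementation (Glue persists bookmark state internally and toggles it via the --job-bookmark-option job argument); it just shows why enabled bookmarks prevent duplicate processing:

```python
# Toy illustration of the job-bookmark idea: remember which S3 keys were
# already processed and transform only the new ones on each run.
def incremental_run(all_keys, bookmark):
    """Return (keys to process this run, updated bookmark state)."""
    new_keys = sorted(set(all_keys) - bookmark)
    return new_keys, bookmark | set(new_keys)

bookmark = set()                                            # "Enable", first run
run1, bookmark = incremental_run(["a.csv", "b.csv"], bookmark)
run2, bookmark = incremental_run(["a.csv", "b.csv", "c.csv"], bookmark)
# run1 processes a.csv and b.csv; run2 processes only c.csv — no reprocessing
```

Disabling bookmarks corresponds to resetting the state to an empty set before every run; Pause corresponds to reading the state but skipping the update step.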
AWS Glue Triggers — Starting jobs automatically
Triggers start Glue jobs and crawlers. Three trigger types:
- Scheduled — cron expression (e.g., every hour, daily at 02:00 UTC).
- On-demand — manual or API-invoked (StartJobRun).
- Conditional (event) — start a downstream job when an upstream job (or crawler) finishes with status SUCCEEDED, FAILED, TIMEOUT, or STOPPED. Enables Glue-only multi-step workflows.
For complex multi-step orchestration across non-Glue services (Lambda, EMR, Athena, SNS), AWS Step Functions is the better choice.
AWS Glue Pricing — The DPU model
AWS Glue is billed per DPU-hour, prorated to the second, with a 1-minute minimum for ETL and Python shell jobs (Glue 2.0 and later) and a 10-minute minimum for crawlers. As of current pricing:
- Spark ETL job — ~$0.44 per DPU-hour, 1-minute billing minimum per run, minimum 2 DPUs.
- Python shell job — ~$0.44 per DPU-hour with 1/16 or 1 DPU options, 1 minute minimum.
- Glue Streaming ETL — ~$0.44 per DPU-hour, minimum 2 DPUs, continuous billing.
- Data Catalog — first 1M objects free, then per 100K objects per month; first 1M requests free, then per million requests.
- Crawler — ~$0.44 per DPU-hour, 10-minute minimum.
- DataBrew — per session-minute (interactive) + per node-hour (jobs).
One DPU = 4 vCPU + 16 GB RAM + 64 GB disk. Glue Spark jobs default to 10 DPUs. Pricing is roughly $0.44 per DPU-hour billed by the second, minimum 1 minute (ETL) or 10 minutes (crawlers). Python shell jobs can run on 1/16 DPU for micro workloads — a key cost-optimization lever. If an SAA-C03 question says "smallest Glue footprint for a 100 MB API-call job," the answer is Python shell with 1/16 DPU.
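A back-of-envelope calculator makes the DPU model concrete. The rate and minimums below are the approximate figures quoted above; always check current AWS pricing:

```python
# Rough Glue cost model: ~$0.44 per DPU-hour, billed per second,
# with a 1-minute minimum for ETL and Python shell jobs.
def glue_job_cost(dpus, runtime_seconds, rate_per_dpu_hour=0.44, min_seconds=60):
    billed_seconds = max(runtime_seconds, min_seconds)  # billing minimum
    return dpus * rate_per_dpu_hour * billed_seconds / 3600

# 10-DPU Spark ETL job running 8 minutes:
spark_cost = glue_job_cost(10, 8 * 60)       # about $0.59
# 1/16-DPU Python shell job running 30 s (billed as 60 s):
shell_cost = glue_job_cost(0.0625, 30)       # a small fraction of a cent
```

The second example is why Python shell at 1/16 DPU is the exam answer for tiny jobs: even with the 1-minute minimum, the run costs well under a cent.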
Amazon EMR — Managed Big-Data Clusters
Amazon EMR (Elastic MapReduce) is AWS's managed big-data platform that runs Apache Spark, Apache Hadoop, Apache Hive, Presto/Trino, Apache HBase, Apache Flink, Apache Livy, and Apache Zeppelin. EMR is the right choice when AWS Glue's serverless Spark is not enough: massive datasets (hundreds of TBs to PBs), custom Spark tuning, the full Hadoop ecosystem, or long-running interactive clusters.
EMR deployment models — Three flavors
Amazon EMR comes in three flavors, each suited to a different operational profile.
1. EMR on EC2 — The classic managed cluster
The original EMR model. You launch a cluster of EC2 instances; EMR installs and configures Hadoop/Spark. You own cluster sizing, auto-scaling policies, instance purchasing, and termination. Long-running or transient (job-run-then-terminate) modes supported.
- Best for — predictable recurring workloads, mixed-framework pipelines (Spark + HBase + Hive), integration with custom AMIs or bootstrap actions.
- Cost control — Reserved Instances or Savings Plans for core, Spot for task nodes.
2. Amazon EMR Serverless — Pay-per-use Spark and Hive
Amazon EMR Serverless runs Spark or Hive applications without provisioning any cluster. You create an application (a runtime environment), submit jobs, and AWS handles capacity. You pay per vCPU-second, memory-GB-second, and storage-GB-second actually consumed.
- Best for — bursty or unpredictable ETL, teams that want Spark power without cluster ops, modernizing long-running EMR clusters to per-job billing.
- Key exam point — EMR Serverless replaces the operational burden of running a 24/7 cluster; it is closer to Glue ETL in spirit but gives full Spark engine control (custom Spark configs, custom JARs).
3. Amazon EMR on EKS — Spark jobs in Kubernetes
EMR on EKS lets you run Spark applications on an existing Amazon EKS cluster. You submit Spark jobs through the EMR API; EMR schedules them as pods on your EKS nodes. You share the EKS cluster with other non-Spark workloads, maximizing utilization.
- Best for — organizations standardized on Kubernetes, multi-tenant analytics platforms, cost-sharing across teams, unified Kubernetes observability.
- Pricing — pay for EMR on EKS pricing uplift plus the underlying EKS/EC2 costs.
EMR on EC2 for long-running or predictable recurring clusters with full framework support. EMR Serverless for bursty Spark or Hive jobs where you want zero cluster management (closest to AWS Glue's ops model, but with full Spark engine control). EMR on EKS for Kubernetes-first organizations that already run EKS and want Spark to be another workload inside the same cluster. If the SAA-C03 scenario emphasizes "no cluster to manage, pay per job," pick EMR Serverless. If it says "already running EKS, want to consolidate," pick EMR on EKS. If it says "full Hadoop ecosystem with HBase and long-running cluster," pick EMR on EC2.
EMR cluster anatomy — Master, core, and task nodes
An EMR on EC2 cluster has three node types, and the exam loves testing this distinction.
- Master node (primary node) — exactly one per cluster. Runs the YARN ResourceManager, HDFS NameNode, and job coordination services. If the master node dies on a non-HA cluster, the whole cluster is lost. EMR supports Multi-Master (HA) with three master nodes for high availability.
- Core nodes — run HDFS DataNode and YARN NodeManager. They store data in HDFS and run compute tasks. Losing a core node risks data loss (unless replication saves it). Core nodes should almost always be On-Demand or Reserved — never Spot-only for production.
- Task nodes — compute-only YARN NodeManagers, no HDFS. They are stateless workers. Losing a task node only loses in-flight compute, which YARN reschedules. Task nodes are the perfect target for EC2 Spot Instances, cutting compute cost by up to 90%.
On SAA-C03 scenarios asking how to reduce EMR cost, the correct pattern is Spot Instances on task nodes only. Putting core nodes on Spot risks HDFS data loss when Spot reclaims capacity. Putting the master on Spot risks losing the entire cluster. Task nodes are stateless, so Spot interruption is safe and saves up to 90%. Any answer saying "put all nodes on Spot" is wrong.
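The exam-correct purchasing pattern can be sketched as the Instances block of a boto3 run_job_flow request. Instance types and counts below are illustrative, not recommendations:

```python
# Sketch: EMR on EC2 instance groups following the Spot-on-task-nodes-only
# pattern. Instance types and counts are illustrative placeholders.
emr_instances = {
    "InstanceGroups": [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},   # cluster coordinator
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 2},   # holds HDFS
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": "m5.xlarge", "InstanceCount": 4},   # stateless, Spot-safe
    ],
    # Transient-cluster pattern: terminate once all submitted steps finish.
    "KeepJobFlowAliveWhenNoSteps": False,
}

# Sanity check: only the TASK group should run on Spot.
spot_roles = {g["InstanceRole"] for g in emr_instances["InstanceGroups"]
              if g["Market"] == "SPOT"}
```

With credentials configured, this dict would be passed as the Instances argument of boto3.client("emr").run_job_flow.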
EMR storage — HDFS vs EMRFS
EMR supports two file systems, and the SAA-C03 exam tests the difference directly.
- HDFS (Hadoop Distributed File System) — local replicated storage on the core node EBS/instance store volumes. Ephemeral — lost when the cluster terminates. Fast for intermediate shuffles and temporary MapReduce data.
- EMRFS (EMR File System) — an implementation of HDFS semantics over Amazon S3. Durable — survives cluster termination. Used for input, output, and long-term data. Enables the transient-cluster pattern: spin up EMR, read from S3, process, write back to S3, terminate.
On exam questions about "persisting data after cluster termination," EMRFS (Amazon S3) is the answer. HDFS is for in-cluster temporary storage.
HDFS lives on core-node disks and dies with the cluster. EMRFS lives on Amazon S3 and survives cluster termination. Production EMR pipelines almost always read input from EMRFS (S3), use HDFS only for intermediate shuffle data, then write output back to EMRFS (S3). This pattern allows transient EMR clusters — launch, process, terminate — which is the cheapest and most durable design.
EMR instance fleets and auto-scaling
EMR supports two provisioning models:
- Uniform instance groups — each node role (master, core, task) uses a single instance type. Simple.
- Instance fleets — specify up to five instance types per role plus weighted capacity; EMR picks the cheapest combination to meet target capacity. Supports both On-Demand and Spot with Spot fallback. Strong for cost optimization.
EMR Managed Scaling automatically adjusts task (and optionally core) capacity based on YARN metrics, within minimum and maximum limits you set.
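EMR Managed Scaling is configured as a set of compute limits. A sketch of the policy shape accepted by boto3's put_managed_scaling_policy, with illustrative numbers:

```python
# Sketch: an EMR Managed Scaling policy. The capacity numbers are
# illustrative, not recommendations.
managed_scaling = {
    "ComputeLimits": {
        "UnitType": "Instances",
        "MinimumCapacityUnits": 3,          # never shrink below master + core
        "MaximumCapacityUnits": 20,         # ceiling for scale-out
        # Cap On-Demand capacity so growth beyond it comes from Spot:
        "MaximumOnDemandCapacityUnits": 5,
        # Keep core (HDFS-bearing) capacity fixed; scale task nodes only:
        "MaximumCoreCapacityUnits": 3,
    }
}
```

Capping core capacity while letting task capacity float is the managed-scaling expression of the same rule as above: elasticity belongs on the stateless task tier.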
AWS Step Functions — Orchestrating the ETL Pipeline
AWS Step Functions is a serverless workflow orchestrator that coordinates AWS services into a state machine. For data pipelines, Step Functions stitches together AWS Glue jobs, AWS Glue crawlers, Amazon EMR steps, AWS Lambda functions, Amazon Athena queries, AWS Batch jobs, Amazon SageMaker jobs, and custom HTTP endpoints into a durable, retriable, observable workflow.
Why Step Functions for ETL orchestration
- Durable state — each transition is checkpointed; resume after failures.
- Retry and catch — declarative retry policies with exponential backoff.
- Parallel branches — fan-out and fan-in Parallel and Map states.
- Service integrations — optimized integrations for Glue StartJobRun, EMR RunJobFlow or AddStep, Athena StartQueryExecution, Lambda Invoke, and "wait for callback" task tokens.
- Two workflow types — Standard (up to 1 year, exactly-once workflow execution, paid per state transition) and Express (up to 5 minutes, at-least-once, paid per request + duration; ideal for high-volume event-driven ETL).
Typical ETL pipeline orchestrated by Step Functions
- Lambda triggers on S3 PutObject (file lands).
- Step Functions starts the state machine.
- Glue Crawler updates the Data Catalog.
- Glue ETL job converts raw CSV to Apache Parquet.
- Parallel branch — run an EMR Serverless heavy aggregation for analytics AND an Athena data-quality query for validation.
- Choice state — if validation passes, continue; else SNS alert.
- Glue job loads into Amazon Redshift.
- SNS publishes "pipeline complete."
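A trimmed-down version of this pipeline, expressed as an Amazon States Language definition, might look like the sketch below. Only three of the steps are shown, and the ARNs, job, crawler, and topic names are placeholders:

```python
import json

# Sketch: a three-state slice of the ETL pipeline in Amazon States Language.
# All ARNs and resource names are hypothetical placeholders.
state_machine = {
    "StartAt": "CrawlRawZone",
    "States": {
        "CrawlRawZone": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "raw-zone-crawler"},
            "Next": "ConvertToParquet",
        },
        "ConvertToParquet": {
            "Type": "Task",
            # .sync = Step Functions waits for the Glue job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "csv-to-parquet"},
            # Declarative retry with exponential backoff:
            "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30,
                       "MaxAttempts": 2, "BackoffRate": 2.0}],
            "Next": "NotifyComplete",
        },
        "NotifyComplete": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
                "Message": "pipeline complete",
            },
            "End": True,
        },
    },
}

definition_json = json.dumps(state_machine)  # what create_state_machine accepts
```

The .sync suffix and the Retry block are the two features that make Step Functions preferable to chained Glue triggers for cross-service pipelines.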
On the SAA-C03 exam, any question that describes "a multi-step ETL workflow with conditional branching, retries, and visibility" almost always points to AWS Step Functions over chaining Glue conditional triggers or raw Lambda calls.
AWS Glue vs Amazon EMR — The Decision Framework
This is the single most tested comparison in Task 3.5.
| Dimension | AWS Glue ETL | Amazon EMR |
|---|---|---|
| Management model | Serverless (no cluster) | Managed cluster (EC2, Serverless, EKS) |
| Primary engine | Apache Spark (PySpark, Scala) or Python shell | Spark, Hadoop, Hive, Presto, HBase, Flink |
| Ideal dataset size | MB to low TB | Large TB to PB |
| Setup time | Seconds (job start) | 5–10 minutes (on EC2 launch); seconds (EMR Serverless) |
| Spark version control | Limited Glue versions | Full control, any Spark version, custom JARs |
| Cluster customization | None (fully managed) | Full (bootstrap actions, custom AMI, kernel config) |
| Integrated Data Catalog | Native | Via Glue Data Catalog or Hive metastore |
| Cost model | Per DPU-second (minimum 1 min or 10 min) | Per node-hour (EC2) or per job (Serverless) |
| Cost sweet spot | Short, variable, infrequent jobs | Long-running or massive jobs where cluster utilization is high |
| Ops overhead | Near zero | Cluster sizing, scaling, upgrades (less on Serverless/EKS) |
| Best for | Serverless ETL, quick schema conversion, Data Catalog pipelines | Custom Spark tuning, Hadoop ecosystem, large-scale ML preprocessing |
Rule-of-thumb decision tree
- Dataset under a few hundred GB + no Spark expertise required + infrequent or unpredictable runs → AWS Glue ETL.
- Dataset hundreds of GB to few TB + standard Spark patterns + want serverless → AWS Glue ETL (or EMR Serverless if you need Spark custom tuning).
- Multi-TB dataset + full Spark tuning, custom JARs, or multi-framework (Spark + HBase + Flink) → Amazon EMR on EC2.
- PB-scale + team already on Kubernetes → Amazon EMR on EKS.
- Small one-off API-driven transform under 1 GB → Glue Python shell (cheapest) or AWS Lambda.
- Continuous streaming transform from Kinesis/MSK → Glue streaming ETL or EMR Serverless Spark Streaming.
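The decision tree above can be encoded as a toy helper function. This is a heuristic for exam practice, matching the rules of thumb in this section, not an official AWS sizing rule:

```python
# Heuristic encoding of the Glue-vs-EMR decision tree above. Thresholds
# mirror the rules of thumb in this guide and are not official guidance.
def pick_transform_service(dataset_gb, custom_spark=False,
                           kubernetes_org=False, continuous_cluster=False):
    if kubernetes_org:
        return "EMR on EKS"               # team already standardized on K8s
    if continuous_cluster:
        return "EMR on EC2"               # high utilization beats per-job billing
    if dataset_gb < 1:
        return "Glue Python shell or Lambda"  # micro transforms
    if custom_spark:
        # Custom Spark tuning: serverless if moderate, EC2 at multi-TB scale.
        return "EMR Serverless" if dataset_gb < 10_000 else "EMR on EC2"
    return "AWS Glue ETL"                 # serverless default
```

For example, the classic "50 GB Spark jobs a few times a day" scenario resolves to AWS Glue ETL, while "running Spark continuously 20 hours per day" flips to EMR on EC2.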
SAA-C03 scenarios often include a cost constraint. The usual trap: the question describes a "data science team that runs Spark jobs a few times a day on 50 GB datasets" and asks for the most cost-effective option. AWS Glue ETL wins — you pay only for job-run DPU-seconds, no idle cluster. Amazon EMR on EC2 with a 24/7 cluster would be more expensive because of idle node-hours. But if the scenario says "running Spark continuously 20 hours per day on 10 TB," flip the answer to Amazon EMR because continuous utilization makes a dedicated cluster cheaper than per-second Glue DPU billing.
ETL Pattern Cookbook — Source to Athena
The canonical AWS data-lake pattern the exam expects you to know:
- Ingest — data lands in the Amazon S3 raw zone (s3://lake/raw/...) from Amazon Data Firehose, AWS DMS, AWS DataSync, or direct upload.
- Catalog — an AWS Glue Crawler registers the raw data as tables in the AWS Glue Data Catalog.
- Transform — an AWS Glue ETL job (or Amazon EMR Spark) converts raw CSV/JSON to Apache Parquet with partitioning (e.g., year=/month=/day=) and writes to the curated zone (s3://lake/curated/...).
- Re-catalog — a Glue Crawler on the curated zone updates the catalog with the new Parquet tables.
- Query — Amazon Athena reads the curated Parquet tables via SQL; Amazon Redshift Spectrum joins them with warehouse tables.
- Visualize — Amazon QuickSight dashboards over Athena or Redshift.
- Orchestrate — AWS Step Functions wires steps 1–5 together with error handling.
- Govern — AWS Lake Formation enforces row/column/cell-level access on all of the above via the Data Catalog.
Why Parquet and partitioning matter
Amazon Athena and Amazon Redshift Spectrum are priced per TB scanned. Converting from CSV (row-based) to Apache Parquet (columnar + compressed) typically reduces scanned bytes by 30–90%. Partitioning further prunes scanned data by date, region, or tenant. The correct SAA-C03 answer to "how do I lower Athena costs" is almost always convert to Parquet + partition + compress — and AWS Glue ETL is the standard tool to do it.
Key Numbers to Memorize
- Glue DPU — 1 DPU = 4 vCPU + 16 GB RAM + 64 GB disk. Default 10 DPUs for Spark ETL job.
- Glue pricing — ~$0.44 per DPU-hour (billed per second). Spark ETL 1-minute minimum per job; crawlers and DataBrew jobs 10-minute minimum.
- Glue Python shell — 1/16 DPU or 1 DPU. Cheapest Glue option for micro-jobs.
- Glue Data Catalog — first 1M objects and 1M requests free per month per account.
- Glue Studio — visual ETL, generates PySpark, same pricing as Glue ETL.
- Glue DataBrew — 250+ pre-built recipes, per-session interactive pricing plus per node-hour for jobs.
- Glue Schema Registry — free, supports Avro, JSON Schema, Protobuf.
- Glue job bookmarks — Enable / Disable / Pause modes.
- EMR master node — 1 per cluster (or 3 for Multi-Master HA), always On-Demand.
- EMR core nodes — store HDFS, run compute, On-Demand or Reserved recommended.
- EMR task nodes — compute only, Spot-eligible for up to 90% savings.
- EMR file systems — HDFS (ephemeral) + EMRFS (S3, durable).
- EMR Serverless — per vCPU-second + per GB-second of memory + per GB-second of storage.
- Step Functions Standard — up to 1 year, exactly-once workflow execution, per state transition.
- Step Functions Express — up to 5 minutes, at-least-once, per request + duration.
- Glue triggers — Schedule, On-demand, Conditional (event).
Common Exam Traps
- AWS Glue vs Amazon EMR — serverless with minimal ops vs full Spark/Hadoop control. Questions with "no infrastructure to manage" and "infrequent jobs" mean AWS Glue. Questions with "custom Spark tuning," "HBase," "Presto," or "petabyte-scale continuous cluster" mean EMR.
- DataBrew vs Glue Studio vs plain Glue ETL — DataBrew is no-code recipes for analysts; Glue Studio is low-code visual ETL for engineers; plain Glue ETL is PySpark code.
- EMR node types Spot policy — Spot on task nodes only. Core nodes risk HDFS loss; master nodes risk cluster loss.
- HDFS vs EMRFS — HDFS is ephemeral on core nodes; EMRFS is durable on Amazon S3. Output that must survive cluster termination goes to EMRFS.
- AWS Glue Crawlers — the exam expects you to choose a scheduled crawler over manual ALTER TABLE ADD PARTITION scripts for continuously-arriving S3 data.
- Glue job bookmarks vs re-processing — bookmarks enable incremental loads; Disable is only for full reprocessing or debugging.
- Glue triggers vs Step Functions — Glue conditional triggers handle simple Glue-only chains. Anything involving Lambda, EMR, Athena, SNS, or retries should use Step Functions.
- EMR Serverless vs EMR on EC2 — Serverless = no cluster, per-job billing, Spark/Hive only. EMR on EC2 = full Hadoop ecosystem, long-running, manual scaling.
- EMR on EKS — only relevant when the organization already runs EKS. Do not pick it otherwise.
- Glue Schema Registry — the right answer when the scenario emphasizes "enforce schema compatibility across streaming producers and consumers."
- CSV to Parquet conversion — the canonical "reduce Athena cost" answer, done with AWS Glue ETL.
- Python shell job — the right pick for small (<1 GB) Python tasks; avoid spinning up a Spark cluster for trivial work.
On SAA-C03, any multi-step ETL pipeline with conditional logic, retries, error-handling, or cross-service integration should use AWS Step Functions — not chained Glue conditional triggers, not Lambda calling Lambda, and not cron-based scripts. Step Functions provides durable state, visual DAGs in the console, and built-in service integrations with Glue StartJobRun, EMR AddStep, Athena StartQueryExecution, and Lambda Invoke.
AWS Glue, EMR, and Step Functions vs Adjacent Topics
Glue ETL vs Kinesis Data Firehose transformations
Both can convert data formats and write to Amazon S3. The difference:
- Amazon Data Firehose — near-real-time streaming delivery with built-in format conversion (JSON → Parquet/ORC) and Lambda-based inline transformation; a 60-second buffering interval is typical.
- AWS Glue streaming ETL — full PySpark streaming engine, arbitrary transforms, window operations, joining with reference data.
If the scenario is "deliver Kinesis data to S3 with simple format conversion," pick Firehose. If it requires complex joins or windowed aggregations, pick Glue streaming ETL or EMR Serverless Spark Streaming.
Glue Data Catalog vs Lake Formation
Glue Data Catalog is the metadata store. AWS Lake Formation sits on top and adds fine-grained access control (row, column, cell), centralized audit, and simplified setup. Lake Formation uses the Glue Data Catalog under the hood. The right pick when the scenario says "fine-grained permissions on data lake tables across analysts" is Lake Formation.
Glue vs AWS Data Pipeline
AWS Data Pipeline is a legacy orchestration service — avoid in new designs. The modern equivalents are AWS Step Functions (for general orchestration) and Amazon Managed Workflows for Apache Airflow (MWAA) (for Airflow-based teams).
FAQ — AWS Glue ETL and EMR Data Transformation Top Questions
1. When should I use AWS Glue vs Amazon EMR on the SAA-C03 exam?
Pick AWS Glue ETL when the scenario emphasizes serverless operation, small-to-medium datasets, infrequent or unpredictable runs, standard PySpark transformations, and tight Data Catalog integration. Pick Amazon EMR when the scenario requires custom Spark tuning, the full Hadoop ecosystem (HBase, Hive, Presto, Flink), very large datasets (hundreds of TBs to PBs), long-running interactive clusters, or cost optimization with Spot Instances on task nodes. The rule of thumb: Glue for serverless simplicity, EMR for scale and customization. For short bursty Spark workloads that need full Spark engine control without cluster ops, Amazon EMR Serverless is the middle ground.
2. What is the difference between AWS Glue Studio and AWS Glue DataBrew?
AWS Glue Studio is a visual low-code ETL pipeline builder for data engineers — you drag source-transform-target boxes on a canvas and Glue Studio generates editable PySpark code that runs as an AWS Glue ETL job. AWS Glue DataBrew is a pure no-code data-preparation tool for data analysts — you apply 250+ pre-built recipes (remove nulls, mask PII, reformat dates, split columns) without writing any code. Glue Studio produces full ETL pipelines; DataBrew produces cleaned datasets and profiling reports. SAA-C03 scenarios naming "data analysts who do not write code" map to DataBrew; scenarios naming "engineers who want visual Spark authoring" map to Glue Studio.
3. How do AWS Glue job bookmarks work and when should I disable them?
Job bookmarks track state about already-processed files (by path and modification time for S3) or rows (by ordered columns for JDBC). On the next run, AWS Glue skips processed data and only transforms new records — enabling incremental ETL without duplication. Set bookmarks to Enable for production incremental loads, Pause to run without updating state (useful when you need to reprocess a particular partition without losing overall bookmark progress), and Disable for a full re-run that reprocesses everything. In SAA-C03 scenarios about "efficient daily incremental ETL," the right answer is Enable job bookmarks.
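The three modes map to the documented `--job-bookmark-option` default argument on the Glue job (bookmarks also require a `transformation_ctx` on each source in the job script so Glue can key its state). A small helper showing the mapping — the helper function itself is illustrative, but the option values are the ones Glue expects:

```python
# Job bookmarks are switched via the --job-bookmark-option default argument;
# the three values below are the documented Glue settings.

BOOKMARK_MODES = {
    "enable": "job-bookmark-enable",    # incremental: skip processed data
    "pause": "job-bookmark-pause",      # use existing state, don't advance it
    "disable": "job-bookmark-disable",  # full re-run: ignore state entirely
}

def bookmark_argument(mode: str) -> dict:
    """Return the argument pair for a job definition or start_job_run call."""
    return {"--job-bookmark-option": BOOKMARK_MODES[mode]}

# e.g. boto3.client("glue").start_job_run(JobName="daily-incremental-etl",
#                                         Arguments=bookmark_argument("enable"))
```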
4. Why should EMR task nodes use Spot Instances but not core nodes?
Task nodes are stateless — they run the YARN NodeManager and compute tasks only, with no HDFS storage. If a Spot interruption reclaims a task node, YARN reschedules its tasks on surviving nodes, so Spot pricing on task nodes saves up to 90% versus On-Demand with minimal risk. Core nodes run the HDFS DataNode and hold replicated HDFS blocks; losing multiple core nodes simultaneously (which Spot interruptions can cause) risks HDFS data loss and cluster failure. The master node runs the cluster coordinators; losing the master on a non-HA cluster destroys the entire cluster. The production EMR pricing pattern: On-Demand or Reserved for master and core nodes, Spot for task nodes via instance fleets with diversified Spot capacity.
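That pricing pattern can be sketched as the `InstanceFleets` section of an EMR `run_job_flow` request: On-Demand capacity for master and core, diversified Spot capacity for task. Instance types and counts here are illustrative assumptions, not recommendations; check the EMR API reference for the full request shape.

```python
# Hedged sketch: EMR instance fleets with On-Demand master/core and a
# diversified Spot task fleet. Instance types/counts are placeholders.

def emr_instance_fleets() -> list:
    return [
        {"InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
         "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
        {"InstanceFleetType": "CORE", "TargetOnDemandCapacity": 4,
         "InstanceTypeConfigs": [{"InstanceType": "m5.2xlarge"}]},
        # Task fleet: several instance types deepen the Spot pool; if Spot
        # can't be provisioned within the timeout, fall back to On-Demand.
        {"InstanceFleetType": "TASK", "TargetSpotCapacity": 8,
         "InstanceTypeConfigs": [
             {"InstanceType": "m5.2xlarge"},
             {"InstanceType": "m5a.2xlarge"},
             {"InstanceType": "r5.2xlarge"},
         ],
         "LaunchSpecifications": {"SpotSpecification": {
             "TimeoutDurationMinutes": 10,
             "TimeoutAction": "SWITCH_TO_ON_DEMAND"}}},
    ]

fleets = emr_instance_fleets()
```

The design choice worth noticing: Spot appears only on the TASK fleet, and nowhere on MASTER or CORE — exactly the exam rule.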
5. What is the difference between EMR Serverless and EMR on EKS?
Amazon EMR Serverless removes cluster management entirely — you create a Spark or Hive application, submit jobs, and AWS auto-provisions capacity, billing per vCPU-second and memory-GB-second. You never see nodes. Best for teams that want Spark power with zero ops. Amazon EMR on EKS runs Spark jobs as pods on an existing Amazon EKS cluster that you manage. Best for organizations already standardized on Kubernetes that want to consolidate Spark workloads into the same EKS cluster as their services, sharing capacity and observability. SAA-C03 scenarios: "no cluster to manage" points to EMR Serverless; "we already run EKS and want one Kubernetes control plane" points to EMR on EKS.
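"You never see nodes" is visible in the API itself: an EMR Serverless job submission names an application, a role, and a Spark driver — no instance types, no fleets. A hedged sketch with placeholder IDs and paths; the dict would be passed to `boto3.client("emr-serverless").start_job_run(**req)`.

```python
# Hedged sketch: EMR Serverless Spark job submission. Application ID, role
# ARN, and script path are illustrative placeholders.

def spark_job_request(application_id: str, role_arn: str,
                      script_s3_uri: str) -> dict:
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_s3_uri,
                # Normal spark-submit tuning still works; AWS sizes capacity.
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    }

req = spark_job_request(
    "00example-app-id", "arn:aws:iam::123456789012:role/emr-serverless-job",
    "s3://my-bucket/scripts/transform.py")
```

Contrast with the EMR-on-EC2 request, which is dominated by instance fleet and node configuration — the absence of any capacity fields here is the whole point of the service.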
6. Should I use AWS Step Functions or AWS Glue triggers to orchestrate a multi-step ETL pipeline?
Use AWS Glue conditional triggers only when the pipeline is a simple linear chain of Glue-only jobs (Glue job A finishes → Glue job B starts). The moment the pipeline involves non-Glue services — AWS Lambda, Amazon EMR, Amazon Athena queries, SNS notifications, external HTTP APIs, approval gates, or retries with exponential backoff — use AWS Step Functions. Step Functions provides durable state machines, visual workflow diagrams, built-in retries and catches, parallel branches, and native service integrations for Glue StartJobRun, EMR AddStep, Athena StartQueryExecution, and Lambda Invoke. For SAA-C03, "orchestrate a complex ETL workflow" almost always points to Step Functions.
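The features listed above — synchronous service integration, retries with backoff, catch-and-notify — all show up directly in the Amazon States Language definition. A minimal sketch (job names, topic ARNs are placeholders): run a Glue job with `glue:startJobRun.sync` so the state machine waits for completion, retry twice with exponential backoff, and branch to an SNS notification on success or failure.

```python
import json

# Hedged sketch of an ASL state machine: Glue job -> SNS notification,
# with retry/catch. All ARNs and names are illustrative placeholders.
SNS_TOPIC = "arn:aws:sns:us-east-1:123456789012:etl-alerts"

etl_state_machine = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # .sync = Step Functions waits until the Glue job finishes.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "csv-to-parquet"},
            # Built-in retry with exponential backoff (30s, then 60s).
            "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30,
                       "MaxAttempts": 2, "BackoffRate": 2.0}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": SNS_TOPIC, "Message": "ETL succeeded"},
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": SNS_TOPIC, "Message": "ETL failed"},
            "End": True,
        },
    },
}

definition = json.dumps(etl_state_machine)  # pass to create_state_machine
```

None of this — cross-service calls, durable retries, failure branches — is expressible with Glue triggers alone, which is the exam's dividing line.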
7. What is the cheapest way to convert CSV files in S3 to Apache Parquet for Athena?
The standard pattern for SAA-C03 is an AWS Glue ETL job (Spark, PySpark) that reads CSV from an S3 raw prefix, repartitions and converts to Apache Parquet with Snappy compression, and writes to an S3 curated prefix. Use AWS Glue crawlers before and after to register schemas in the Data Catalog. Keep the job on a modest DPU count (2–10 DPUs) and schedule it hourly or daily. For very small datasets (<1 GB, simple schema), a Glue Python shell job using pandas/pyarrow at 1/16 DPU can be cheaper. The resulting Parquet files typically reduce Athena scan costs by 30–90%. Amazon EMR is overkill for this pattern unless the datasets exceed several TB.
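The Spark-vs-Python-shell cost argument is easy to check with back-of-envelope arithmetic. The rate below is an assumption (roughly $0.44 per DPU-hour; verify current pricing for your Region), and the sketch ignores per-run billing minimums:

```python
# Back-of-envelope Glue cost comparison. Rate is an ASSUMED placeholder —
# check aws.amazon.com/glue/pricing for your Region's current price.
GLUE_RATE_PER_DPU_HOUR = 0.44

def glue_job_cost(dpus: float, minutes: float) -> float:
    """Cost of one run billed per DPU-second (billing minimums ignored)."""
    return dpus * (minutes / 60.0) * GLUE_RATE_PER_DPU_HOUR

spark_run = glue_job_cost(dpus=10, minutes=10)      # modest Spark ETL job
shell_run = glue_job_cost(dpus=1 / 16, minutes=10)  # Python shell job

# At equal duration the Python shell run costs 1/160th of the Spark run,
# which is why it wins for tiny (<1 GB) datasets.
```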
Further Reading
- AWS Glue Developer Guide — ETL, Data Catalog, crawlers. https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
- AWS Glue Studio User Guide — visual ETL. https://docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio.html
- AWS Glue DataBrew Developer Guide — visual data prep. https://docs.aws.amazon.com/databrew/latest/dg/what-is.html
- AWS Glue Schema Registry. https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html
- AWS Glue Job Bookmarks. https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
- AWS Glue Pricing — DPU model. https://aws.amazon.com/glue/pricing/
- Amazon EMR Management Guide. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html
- Amazon EMR Master, Core, and Task Nodes. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html
- Amazon EMR Serverless User Guide. https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/what-is-emr-serverless.html
- Amazon EMR on EKS. https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks.html
- AWS Step Functions Developer Guide. https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
- AWS Step Functions — Glue Integration. https://docs.aws.amazon.com/step-functions/latest/dg/connect-glue.html
- AWS Certified Solutions Architect Associate (SAA-C03) Exam Guide. https://d1.awsstatic.com/training-and-certification/docs-sa-associate/AWS-Certified-Solutions-Architect-Associate_Exam-Guide.pdf
Summary
AWS Glue ETL and Amazon EMR together cover every data-transformation scenario on SAA-C03. AWS Glue is the serverless answer: crawlers auto-discover schemas, the Glue Data Catalog centralizes metadata for Athena/Redshift Spectrum/EMR, Glue Studio provides visual ETL on top of PySpark, DataBrew provides no-code recipes for analysts, Schema Registry enforces streaming schema compatibility, job bookmarks enable incremental loads, and triggers or Step Functions orchestrate runs — all priced by DPU-second. Amazon EMR is the full-control answer with three flavors: EMR on EC2 (master + core + task nodes, Spot on task nodes only, HDFS for ephemeral storage and EMRFS on S3 for durable storage), EMR Serverless (per-job Spark/Hive billing, no cluster), and EMR on EKS (Spark pods on existing Kubernetes). AWS Step Functions wires the whole pipeline into durable state machines with retries and parallel branches. Master the Glue-vs-EMR decision tree, remember that Spot belongs only on task nodes, and know that Parquet plus partitioning is the canonical cost-optimization answer for downstream Athena queries.