Amazon Kinesis Data Streams is the core real-time streaming service on AWS, and Amazon Data Firehose is its managed delivery companion. On the AWS Certified Developer Associate (DVA-C02) exam, Task Statement 1.1 (Develop code for applications hosted on AWS) tests whether you can pick Amazon Kinesis Data Streams over Amazon SQS, configure shard capacity without throttling producers, write a KCL consumer that checkpoints correctly, and deliver ingested records into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, or Splunk with Amazon Data Firehose. Amazon Kinesis Data Streams appears in at least three exam scenarios on every DVA-C02 form, and the Amazon Kinesis Data Streams vs Amazon SQS vs Amazon EventBridge decision matrix is the single highest-value thing to memorize inside Domain 1.
This study guide covers every Amazon Kinesis Data Streams concept in scope for DVA-C02: shards and the 1 MB/s (or 1000 records/s) write limit, the 2 MB/s read limit, partition keys and hot-shard design, on-demand versus provisioned capacity mode, retention from 24 hours up to 365 days, enhanced fan-out for consumers, the Kinesis Producer Library (KPL) with aggregation and compression, the Kinesis Client Library (KCL) with Amazon DynamoDB checkpointing and leases, the Amazon Kinesis Agent for log files, Amazon Data Firehose destinations (Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, HTTP endpoints) with buffering, Lambda transformation, dynamic partitioning, and Apache Parquet / Apache ORC conversion, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) for streaming SQL, resharding through split-shard and merge-shards, and ProvisionedThroughputExceededException handling with exponential backoff. Expect 6 to 10 questions across the exam form on these Amazon Kinesis Data Streams topics.
What are Amazon Kinesis Data Streams and Firehose?
Amazon Kinesis Data Streams is a serverless, durable, ordered, replay-capable streaming service. Producers write records, consumers read records, and Amazon Kinesis Data Streams stores every record for a configurable retention window (24 hours by default, extendable to 365 days) so that any consumer can rewind and re-read. Amazon Kinesis Data Streams is the real-time nerve system of modern AWS applications: clickstream pipelines, IoT telemetry, log aggregation, financial tick data, and change data capture from Amazon DynamoDB Streams all flow through Amazon Kinesis Data Streams.
Amazon Data Firehose (previously Amazon Kinesis Data Firehose) is the zero-code sibling of Amazon Kinesis Data Streams. It takes a stream of records and delivers them to a destination you pick — Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, Snowflake, or a generic HTTP endpoint — with optional buffering, AWS Lambda transformation, dynamic partitioning, and conversion to Apache Parquet or Apache ORC. Amazon Data Firehose is near-real-time (minimum 60-second buffer, typical 1–5 minutes), while Amazon Kinesis Data Streams is true real-time (sub-second for enhanced fan-out consumers).
Amazon Managed Service for Apache Flink (the renamed Kinesis Data Analytics) sits on top of Amazon Kinesis Data Streams to run streaming SQL and Apache Flink applications for windowed aggregations, anomaly detection, and stream-to-stream joins.
Together, these three services form the AWS streaming trio: Amazon Kinesis Data Streams for ingest, Amazon Managed Service for Apache Flink for processing, and Amazon Data Firehose for delivery.
Why Amazon Kinesis Data Streams matters for DVA-C02
DVA-C02 Domain 1 (Development with AWS Services) weighs 32 percent of the exam. Task Statement 1.1 explicitly lists "integrating applications and data with appropriate AWS services (for example, Amazon SQS, Amazon SNS, Amazon Kinesis, Amazon EventBridge)." The V2.1 exam guide (December 12, 2024) continues to emphasize ordered streaming, replay semantics, and Lambda-based consumers. Every production-grade architecture question that says "real-time," "clickstream," "sub-second," "multiple independent consumers," or "replay for 7 days" is pointing at Amazon Kinesis Data Streams.
Amazon Kinesis Data Streams explained in plain language
Amazon Kinesis Data Streams sounds intimidating, but three plain analogies make it click.
Analogy 1 — The conveyor belt at a sushi restaurant
Picture a sushi restaurant with a rotating conveyor belt.
- Shards are the individual lanes of the belt. Each lane can carry 1 MB of sushi per second, or 1000 plates per second, whichever limit is hit first. Reading off each lane is faster — up to 2 MB per second.
- Partition key is the sushi label (salmon, tuna, eel). The chef (producer) writes the label on every plate; the belt machine hashes the label and drops the plate on a lane. Same label always lands on the same lane, which is how Amazon Kinesis Data Streams preserves order per customer, per IoT device, or per session.
- Retention is how long the belt loops before throwing sushi out. Default is 24 hours. You can keep the belt looping for up to 365 days if customers want to re-eat history.
- Consumers are diners sitting around the belt. A standard diner shares the 2 MB/s view of each lane with everyone else at the table (up to five diners total, shared bandwidth). An enhanced fan-out diner has a dedicated pipe that delivers the full 2 MB/s to that diner alone, with sub-200-millisecond latency.
- Amazon Data Firehose is the auto-packaging station at the end of the belt: every plate that falls off gets boxed (buffered), optionally re-plated (Lambda transformed), and shipped to the warehouse (Amazon S3), the freezer (Amazon Redshift), the taste-index (Amazon OpenSearch Service), or the courier (Splunk or HTTP endpoint).
If the exam question says "sub-second latency to a custom consumer," reach for a dedicated Amazon Kinesis Data Streams lane with enhanced fan-out. If it says "drop ingested records into Amazon S3 with no code," reach for Amazon Data Firehose.
Analogy 2 — The postal sorting facility
Amazon Kinesis Data Streams is a postal sorting facility.
- Producers (Kinesis Producer Library, Kinesis Agent, AWS SDK clients) are the mail trucks dumping letters at the loading dock.
- Partition key is the ZIP code on the envelope. The facility hashes the ZIP into one of several sorting lines (shards).
- Sequence number is the postmark timestamp the sorting machine stamps on every letter, monotonically increasing per shard.
- Checkpoint is the last read marker a mail carrier drops at the shelf when they take their break. When they come back, they pick up right where the marker sits — this is exactly what the Kinesis Client Library (KCL) writes into an Amazon DynamoDB checkpoint table.
- Lease is the clipboard the facility hands to each carrier: "you own this sorting line until you drop the clipboard." If a carrier dies, another picks up the clipboard and continues.
- Amazon Data Firehose is the local mailman who does the last-mile delivery with zero code.
Every Amazon Kinesis Data Streams question maps onto this picture. "Hot shard" is one ZIP code getting 10x more mail than the others. "Resharding" is splitting or merging sorting lines.
Analogy 3 — The Swiss Army knife
Amazon Kinesis is a three-blade Swiss Army knife.
- The long cutting blade is Amazon Kinesis Data Streams — you decide exactly how to slice, how many consumers watch, and how long to keep history.
- The corkscrew is Amazon Data Firehose — pre-shaped for one job (delivery). Pull it, data lands in Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, Snowflake, or HTTP.
- The scissors is Amazon Managed Service for Apache Flink — pre-cut patterns (windowed aggregations, anomaly detection) over the live stream, using SQL or Java Apache Flink code.
If the scenario only says "deliver to Amazon S3 or Amazon Redshift," don't reach for the long blade — the corkscrew is all you need.
Amazon Kinesis Data Streams Architecture — Shards, Partition Keys, Sequence Numbers
Amazon Kinesis Data Streams stores records in shards. A shard is the base unit of capacity in Amazon Kinesis Data Streams, and understanding the shard contract is the single most important thing on the DVA-C02 exam for this topic.
Shard capacity (memorize exactly)
Per shard, Amazon Kinesis Data Streams guarantees:
- Write: 1 MB per second or 1000 records per second, whichever comes first.
- Read (classic): 2 MB per second, shared across all standard consumers on that shard, with a cap of 5 `GetRecords` calls per second per shard.
- Read (enhanced fan-out): 2 MB per second per consumer, not shared, sub-200-millisecond latency via `SubscribeToShard` push.
Exceed any of these write limits and the producer sees ProvisionedThroughputExceededException. Exceed the read limit on classic consumers and you see ProvisionedThroughputExceededException from GetRecords or ReadProvisionedThroughputExceeded in CloudWatch.
Amazon Kinesis Data Streams shard = 1 MB/s OR 1000 records/s write, 2 MB/s classic read (shared, 5 GetRecords/s), 2 MB/s per consumer enhanced fan-out. Memorize these four numbers — they appear in at least one DVA-C02 question per form.
Partition key — the routing function
Every PutRecord call to Amazon Kinesis Data Streams carries a partition key (a UTF-8 string, up to 256 characters). Amazon Kinesis Data Streams runs an MD5 hash over the partition key and maps the resulting 128-bit integer to a hash key range owned by exactly one shard. All records with the same partition key land on the same shard, in order.
The partition key is your contract for per-key ordering. If your application needs all events for user u-123 or sensor iot-abc to be processed in order, use the user ID or sensor ID as the partition key. Never use a constant partition key (like "default") for a high-volume producer — every record will land on a single shard and you will create a permanent hot shard, limiting your whole stream to 1 MB/s regardless of how many shards you provisioned.
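To make the routing concrete, here is a minimal sketch that mimics what Amazon Kinesis Data Streams does internally: MD5 the partition key and look up which shard owns the resulting 128-bit value. The `shard_for_key` helper and the two-shard ranges are illustrative, not an AWS API.

```python
import hashlib

def shard_for_key(partition_key, shard_ranges):
    """Mimic Kinesis routing: MD5 the partition key, find the owning hash range."""
    h = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    for i, (start, end) in enumerate(shard_ranges):
        if start <= h <= end:
            return i
    raise ValueError("hash fell outside all shard ranges")

# Two shards splitting the 128-bit hash space in half (as CreateStream would).
MAX_HASH = 2**128 - 1
ranges = [(0, MAX_HASH // 2), (MAX_HASH // 2 + 1, MAX_HASH)]

# The same key always routes to the same shard, preserving per-key order.
assert shard_for_key("user-123", ranges) == shard_for_key("user-123", ranges)
```

This is also why a constant partition key is fatal: every record hashes to the same value, so one shard absorbs the whole stream.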
Sequence number — the ordering token
Amazon Kinesis Data Streams assigns every successful PutRecord a shard-scoped, monotonically increasing sequence number. The sequence number is not globally unique across shards — it is only ordered within the shard. Consumers use sequence numbers for:
- Checkpointing (resume from sequence number X after a crash).
- `GetShardIterator` with `AT_SEQUENCE_NUMBER`, `AFTER_SEQUENCE_NUMBER`, `TRIM_HORIZON` (oldest), or `LATEST` (newest).
- De-duplication when the producer retried (same application data, different sequence number — producer-side idempotency is the producer's job).
Amazon Kinesis Data Streams orders records within a shard, not across shards. If global ordering matters, you must either pipe everything into a single shard (accepting the 1 MB/s ceiling) or use Amazon SQS FIFO with a single message group. The DVA-C02 exam repeatedly tries to trick you by saying "strict global ordering" — that is an Amazon SQS FIFO scenario, not an Amazon Kinesis Data Streams scenario.
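The classic iterator-based read loop looks like the following sketch. The Kinesis client is injected so the function works with a real boto3 `client("kinesis")` or a test stub; the stream and shard names are hypothetical.

```python
def read_shard(kinesis, stream_name, shard_id, max_batches=3):
    """Classic consumer loop: get an iterator at TRIM_HORIZON, then page GetRecords."""
    it = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # oldest record still inside retention
    )["ShardIterator"]
    records = []
    for _ in range(max_batches):
        resp = kinesis.get_records(ShardIterator=it, Limit=1000)
        records.extend(resp["Records"])
        it = resp.get("NextShardIterator")
        if it is None:  # shard was closed by a reshard and is fully drained
            break
    return records

# records = read_shard(boto3.client("kinesis"), "clickstream", "shardId-000000000000")
```

Production code would loop indefinitely and persist a checkpoint; this is exactly the bookkeeping the KCL automates for you.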
Capacity Mode — On-Demand vs Provisioned
Amazon Kinesis Data Streams offers two capacity modes. Picking the right one is a common DVA-C02 scenario question.
Provisioned mode
You specify the shard count up front. You pay per shard-hour plus PUT payload units. You manually reshard when traffic grows. Best fit when:
- Traffic is predictable and steady.
- Cost matters and you can estimate capacity.
- You need fine-grained control over partition key distribution.
On-demand mode
Amazon Kinesis Data Streams automatically scales shard capacity to match observed traffic, up to a default ceiling of 200 MB/s write and 400 MB/s read per stream (quota raisable). You pay per GB written and per GB read. No resharding API calls. Best fit when:
- Traffic is bursty or unknown.
- You value zero operational overhead over per-GB cost.
- You are building a new application and want to ship fast.
On-demand Amazon Kinesis Data Streams mode doubles capacity relative to the previous 30-day peak automatically, but if you throw a 10x burst from zero, you can still see throttling for a short period. For predictable, spiky traffic where the burst is known, plan a capacity-mode switch or pre-warm the stream before the burst.
You can switch between modes twice per 24 hours on a stream.
Retention — 24 Hours Default, Up to 365 Days
Amazon Kinesis Data Streams retains every record for a configurable period. Extended retention is a DVA-C02 favorite trap.
- Default retention: 24 hours.
- Extended retention: up to 7 days with standard pricing, up to 365 days with long-term retention pricing.
Retention matters for three reasons:
- Replay: any consumer can reset its iterator to `TRIM_HORIZON` and re-process every in-window record. Perfect for backfilling a new downstream consumer.
- Recovery: if a Lambda consumer chokes for two days, extended retention buys you time to fix and replay.
- Compliance: long-term retention lets you keep immutable, append-only event history without copying to Amazon S3.
Amazon Kinesis Data Streams is not a data lake. The maximum retention is 365 days, and you pay per GB-month for extended retention. For cold storage beyond 365 days, use Amazon Data Firehose to land records in Amazon S3 and let Amazon S3 Lifecycle move them to Amazon S3 Glacier classes. This pattern — Amazon Kinesis Data Streams for short-term replay, Amazon Data Firehose for long-term archival — appears verbatim on multiple DVA-C02 practice sets.
Producers — SDK, Kinesis Producer Library, and Kinesis Agent
Amazon Kinesis Data Streams accepts writes through three main producer paths on DVA-C02.
AWS SDK — PutRecord and PutRecords
Direct SDK calls. PutRecord writes one record at a time; PutRecords batches up to 500 records or 5 MB per call. Use the SDK when your application already manages its own buffering, when you are inside AWS Lambda (which discourages long-lived KPL background threads), or when you need fine-grained control over partition keys and sequence numbers.
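One PutRecords subtlety worth a sketch: the call can partially fail. The response carries a `FailedRecordCount` and a per-record result list aligned 1:1 with the request, and it is your job to re-submit only the failed entries. The client is injected so this runs against a real boto3 Kinesis client or a stub; names are illustrative.

```python
def put_records_with_retry(kinesis, stream_name, entries, max_attempts=3):
    """Resend only the entries that failed, as reported per-record in the response."""
    pending = list(entries)  # entries look like {"Data": b"...", "PartitionKey": "..."}
    for _ in range(max_attempts):
        resp = kinesis.put_records(StreamName=stream_name, Records=pending)
        if resp["FailedRecordCount"] == 0:
            return []
        # Response records align 1:1 with the request; failures carry an ErrorCode.
        pending = [
            entry
            for entry, result in zip(pending, resp["Records"])
            if "ErrorCode" in result
        ]
    return pending  # still-failing entries go to your dead-letter path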
Kinesis Producer Library (KPL) — batched, compressed, asynchronous
The KPL is an asynchronous C++ wrapper (with Java and Python bindings) that sits between your application and Amazon Kinesis Data Streams. The KPL does four important things no SDK caller gets for free:
- Aggregation: combines many small user records into one Amazon Kinesis Data Streams record (up to 1 MB). You pay far less PUT-payload-units fees and get massive throughput gains.
- Batching: groups aggregated records into `PutRecords` calls automatically.
- Retries with backoff: if a shard is throttled, the KPL retries transparently.
- CloudWatch metrics: per-stream and per-shard built-in metrics.
Downstream consumers must use the Kinesis Client Library (KCL) to de-aggregate KPL-written records back into user records — this is the KPL-KCL pairing that the DVA-C02 exam loves to test.
Amazon Kinesis Agent — log files to Amazon Kinesis Data Streams
The Amazon Kinesis Agent is a pre-built Java daemon that tails log files on Linux servers and writes them to Amazon Kinesis Data Streams or Amazon Data Firehose. It supports CSV-to-JSON conversion, multiline patterns, file rotation, and at-least-once delivery. Use the Amazon Kinesis Agent when you want clickstream or web-server logs flowing into Amazon Kinesis Data Streams without writing custom producer code.
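The agent is driven by a JSON configuration file at `/etc/aws-kinesis/agent.json`. A minimal sketch (the file path and stream name are hypothetical; the keys follow the agent's documented flow format):

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/app/*.log",
      "kinesisStream": "clickstream-ingest",
      "partitionKeyOption": "RANDOM"
    }
  ]
}
```

A flow targeting Amazon Data Firehose instead uses a `deliveryStream` key in place of `kinesisStream`.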
Aggregation is the Kinesis Producer Library (KPL) feature that packs multiple logical user records into a single Amazon Kinesis Data Streams PutRecord payload (up to 1 MB) to raise throughput and lower cost. Consumers must use the Kinesis Client Library (KCL) or the RecordAggregator utility to de-aggregate on read.
Producer Error Handling — ProvisionedThroughputExceededException and Backoff
This is the single most tested Amazon Kinesis Data Streams error-handling scenario on DVA-C02.
When a producer exceeds 1 MB/s or 1000 records/s on a shard, the API returns ProvisionedThroughputExceededException (HTTP 400, error code ProvisionedThroughputExceededException). The SDK and the KPL both retry automatically, but the correct developer response is layered:
- Exponential backoff with jitter on the producer side. Start at 100 ms, double each retry, cap at a few seconds, and add random jitter so all producers do not retry in lockstep.
- Re-evaluate the partition key distribution. If one partition key is overheating a single shard, no amount of resharding helps — you need a better key.
- Reshard (split the hot shard) to increase write capacity.
- Switch to on-demand mode if throttling is chronic.
- Use the KPL, which already implements backoff plus aggregation, removing most throttles.
The SDK will give up after the configured retry count; application code is responsible for a dead-letter path (Amazon SQS, Amazon S3, or a local disk buffer) so that records are not silently dropped.
If you see ProvisionedThroughputExceededException but CloudWatch IncomingBytes and IncomingRecords are well below the shard's 1 MB/s limit, the culprit is a hot partition key — one key is hashing to one shard and maxing it out. Increasing shard count will not help. Fix the partition key. This distinction is a classic DVA-C02 trap.
Consumers — Classic vs Enhanced Fan-Out
Amazon Kinesis Data Streams supports two consumer styles, and the distinction drives exam questions about latency and multi-consumer isolation.
Classic (shared-throughput) consumers
- Use the `GetRecords` pull API.
- All registered classic consumers on a shard share the 2 MB/s read budget.
- Hard cap of 5 `GetRecords` calls per shard per second.
- Poll latency: typically 200 ms to 1 second.
- Cheap; no per-consumer fee.
- Recommended maximum: 1 to 2 classic consumers per shard. Beyond 5, every consumer suffers.
Enhanced fan-out consumers
- Use the `SubscribeToShard` HTTP/2 push API.
- Each registered consumer gets a dedicated 2 MB/s read pipe per shard.
- Sub-200 ms propagation delay.
- Up to 20 enhanced fan-out consumers per stream (default quota).
- Extra charge per consumer-shard-hour and per GB delivered.
The right rule on DVA-C02 is: if you have one or two batch-tolerant consumers, use classic mode to save cost. If you have three or more consumers that each need the full stream at sub-second latency (for example, a fraud engine, an ML feature store, and a real-time dashboard), pay for enhanced fan-out. Classic consumers competing on one shard will starve each other and surface ReadProvisionedThroughputExceeded alarms.
Kinesis Client Library (KCL) — Stateful Consumers with Checkpointing
The Kinesis Client Library (KCL) is the recommended way to build stateful consumers for Amazon Kinesis Data Streams. The KCL handles three hard problems the raw SDK does not.
Lease coordination via Amazon DynamoDB
The KCL creates an Amazon DynamoDB table (named after your application) where every shard is represented as a lease. Worker processes compete for leases cooperatively:
- One worker owns one shard lease at a time.
- Leases are renewed on a heartbeat interval.
- If a worker dies, its lease expires and another worker takes it over.
- As you add worker instances, the KCL rebalances leases automatically.
Checkpointing via Amazon DynamoDB
The KCL records the last successfully processed sequence number into the same Amazon DynamoDB table. After a crash or deploy, the new lease holder resumes from the last checkpoint, giving you at-least-once processing semantics.
You call checkpointer.checkpoint() explicitly inside your record processor — typically after a batch of records has been written to the downstream store. Checkpointing too often burns Amazon DynamoDB write capacity; checkpointing too rarely increases re-processing after a recovery.
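The cadence trade-off can be sketched as a small policy object. This is a hypothetical helper, not the KCL API — the real KCL hands your processor a checkpointer object, and this class only decides *when* to call `checkpointer.checkpoint()`.

```python
import time

class CheckpointPolicy:
    """Checkpoint after `every_n` records or `every_s` seconds, whichever first."""

    def __init__(self, every_n=500, every_s=30.0, clock=time.monotonic):
        self.every_n, self.every_s, self.clock = every_n, every_s, clock
        self.count, self.last = 0, clock()

    def should_checkpoint(self):
        """Call once per processed record; True means checkpoint now."""
        self.count += 1
        if self.count >= self.every_n or self.clock() - self.last >= self.every_s:
            self.count, self.last = 0, self.clock()
            return True
        return False
```

Inside `processRecords`, you would call `should_checkpoint()` per record and invoke the KCL checkpointer whenever it returns True.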
De-aggregation and shutdown hooks
The KCL automatically de-aggregates KPL-aggregated records, calls initialize, processRecords, and shutdown (with reason TERMINATE for shard-end after resharding, or ZOMBIE for lease loss) on your record processor, and surfaces error handling primitives.
Every KCL consumer application needs an Amazon DynamoDB table for lease tracking and checkpointing. The table is created automatically on first launch, but your KCL IAM role must have dynamodb:CreateTable, dynamodb:DescribeTable, dynamodb:GetItem, dynamodb:PutItem, dynamodb:UpdateItem, dynamodb:DeleteItem, and dynamodb:Scan permissions. Missing Amazon DynamoDB permissions is a common KCL startup failure on the DVA-C02 exam.
AWS Lambda as a Kinesis Consumer — Batch Size, Bisect on Error, Parallelization
AWS Lambda is the most common Amazon Kinesis Data Streams consumer in serverless architectures. Lambda event-source mapping polls the stream on your behalf and invokes your function with a batch of records.
Batch size and batch window
- `BatchSize`: 1 to 10,000 records per Lambda invocation (default 100).
- `MaximumBatchingWindowInSeconds`: 0 to 300 seconds. Lambda waits at most this long to accumulate a full batch.
Tune these two together. A large batch plus a long window maximizes throughput and minimizes invocation cost, but raises end-to-end latency and per-invocation blast radius.
Parallelization factor
- `ParallelizationFactor`: 1 to 10. Splits one shard into up to 10 concurrent Lambda invocations, preserving per-partition-key order while removing the one-Lambda-per-shard bottleneck.
Bisect batch on error
- `BisectBatchOnFunctionError`: `true` or `false`. On an error, Lambda halves the failing batch and retries, isolating the poison-pill record in O(log n) invocations instead of reprocessing the whole batch forever.
ReportBatchItemFailures
Return a batchItemFailures list from your Lambda response to checkpoint past the successful portion of the batch and retry only the remaining records, eliminating full-batch replay on a single-record error.
Starting position and retry limits
- `StartingPosition`: `TRIM_HORIZON` (oldest), `LATEST` (newest), or `AT_TIMESTAMP`.
- `MaximumRetryAttempts`: -1 (infinite) to 10,000.
- `MaximumRecordAgeInSeconds`: 60 to 604,800. Records older than this are dropped and skipped.
- `OnFailure` destination: Amazon SNS or Amazon SQS for records that exhaust retries.
For any Amazon Kinesis Data Streams Lambda consumer on DVA-C02, configure five knobs: BatchSize for throughput, MaximumBatchingWindowInSeconds for latency ceiling, ParallelizationFactor for per-shard concurrency, BisectBatchOnFunctionError + ReportBatchItemFailures for poison-pill isolation, and an OnFailure Amazon SQS or Amazon SNS destination for the dead-letter path. A question that mentions "poison pill" or "one bad record blocks the shard" is asking for the bisect plus ReportBatchItemFailures combo.
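A minimal handler using the partial-batch-response contract can be sketched as follows. The `process` function is a placeholder for your business logic; the event shape (`Records[].kinesis.data` base64-encoded, `Records[].kinesis.sequenceNumber`) is the standard Kinesis event-source payload.

```python
import base64
import json

def process(payload):
    """Placeholder business logic; raises to simulate a poison pill."""
    if payload.get("poison"):
        raise ValueError("bad record")

def handler(event, context):
    """With ReportBatchItemFailures enabled, Lambda checkpoints past every
    record before the first reported failure and retries only from there."""
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            # Report the failing sequence number; Lambda retries from this record.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
            break  # stop here so per-shard ordering is preserved
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list tells Lambda the whole batch succeeded.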
Resharding — Split and Merge Shards
As traffic grows or shrinks, you reshape an Amazon Kinesis Data Streams stream (in provisioned mode) by resharding.
SplitShard
Split one parent shard into two child shards. Use split to:
- Double write capacity on a hot shard.
- Redistribute a concentrated partition-key range.
The parent shard closes immediately after the split. Existing records on the parent remain readable until retention expires; new records matching either half of the parent's hash key range flow into the new children.
MergeShards
Combine two adjacent parent shards into one child shard. Use merge to:
- Reduce cost when traffic falls.
- Simplify a stream after a burst subsided.
Only shards whose hash key ranges are adjacent can be merged.
Resharding lifecycle rules
- The parent shard becomes `CLOSED` and eventually `EXPIRED` (after retention).
- Consumers must finish draining the parent before moving to the children — the KCL handles this automatically.
- Resharding is a control-plane operation; expect 30 seconds to a few minutes before capacity stabilizes.
- Batch multiple splits into a single resharding campaign to avoid thrashing.
UpdateShardCount — the shortcut
UpdateShardCount scales the whole stream to a new target shard count without manual split or merge choreography. It supports scaling up to double or down to half the current count per call, and is subject to daily limits.
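For a manual split, `SplitShard` needs a `NewStartingHashKey` strictly inside the parent's hash key range; picking the midpoint halves the range evenly. A small sketch (the follow-up API call is shown as a comment, with hypothetical arguments):

```python
def split_point(starting_hash_key, ending_hash_key):
    """Midpoint of a shard's hash key range, for SplitShard's NewStartingHashKey."""
    start, end = int(starting_hash_key), int(ending_hash_key)
    if end - start < 1:
        raise ValueError("shard range too small to split")
    # Children become [start, mid - 1] and [mid, end].
    return str((start + end) // 2 + 1)

# A one-shard stream owns the whole 128-bit hash space.
mid = split_point("0", str(2**128 - 1))
# kinesis.split_shard(StreamName="clickstream",
#                     ShardToSplit="shardId-000000000000",
#                     NewStartingHashKey=mid)
```

Splitting at a point other than the midpoint is how you carve off a concentrated partition-key range from a hot shard.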
Splitting or merging shards invalidates any shard ID you cached in application code. Consumers should always enumerate shards via ListShards before starting, and the KCL does this for you. A DVA-C02 scenario that says "after a reshard, some records stopped being processed" is almost always because a hand-rolled consumer hard-coded the old shard IDs.
Amazon Data Firehose — Zero-Code Delivery to S3, Redshift, OpenSearch, Splunk, and HTTP
Amazon Data Firehose is the managed, zero-code delivery pipeline for streaming data. It sits next to Amazon Kinesis Data Streams or accepts direct PUT, and lands records in a downstream system with buffering, optional transformation, and optional format conversion.
Sources
- Direct PUT: your producer calls the Amazon Data Firehose PutRecord or PutRecordBatch API.
- Amazon Kinesis Data Streams: Firehose reads an existing Amazon Kinesis Data Streams stream as its source.
- Amazon MSK: Firehose reads from Amazon Managed Streaming for Apache Kafka.
- AWS IoT, CloudWatch Logs, CloudWatch Events: native integrations.
Destinations (DVA-C02 core list)
- Amazon S3 — the most common destination. Firehose writes objects under a configurable prefix.
- Amazon Redshift — Firehose first writes to Amazon S3, then issues a `COPY` to the Amazon Redshift cluster.
- Amazon OpenSearch Service — Firehose delivers records to an Amazon OpenSearch Service domain index, with a parallel Amazon S3 backup.
- Splunk — Firehose delivers to an HTTP Event Collector (HEC) endpoint in Splunk.
- HTTP endpoint — any HTTPS endpoint (Datadog, New Relic, MongoDB, Sumo Logic, Coralogix, and more).
- Snowflake — native Snowflake ingestion.
Buffering — the near-real-time tax
Amazon Data Firehose buffers records until a size threshold or a time threshold is reached, whichever comes first.
- Buffer size: 1 MB to 128 MB (destination-dependent).
- Buffer interval: 60 seconds to 900 seconds.
Minimum end-to-end latency is the buffer interval, which is why Amazon Data Firehose is "near-real-time" rather than real-time. If you need sub-second latency, use Amazon Kinesis Data Streams with a direct consumer.
Lambda transformation
Attach an AWS Lambda function to Amazon Data Firehose and every buffered record batch is passed through the function before delivery. Typical transformations:
- JSON reshape, enrichment, PII redaction.
- CSV-to-JSON conversion.
- Error filtering with `ProcessingFailed` outcomes.
Firehose automatically sends failed transformations to a separate Amazon S3 error prefix for replay.
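The transformation Lambda follows a fixed contract: the event carries `records` with a `recordId` and base64 `data`, and each returned record must echo the `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. A minimal enrichment sketch (the added `ingested` field is just an example):

```python
import base64
import json

def handler(event, context):
    """Firehose transformation: reshape each record, flag unparseable ones."""
    out = []
    for rec in event["records"]:
        try:
            doc = json.loads(base64.b64decode(rec["data"]))
            doc["ingested"] = True  # example enrichment
            data = base64.b64encode((json.dumps(doc) + "\n").encode()).decode()
            out.append({"recordId": rec["recordId"], "result": "Ok", "data": data})
        except ValueError:
            # Firehose routes these to the configured S3 error prefix for replay.
            out.append({"recordId": rec["recordId"],
                        "result": "ProcessingFailed",
                        "data": rec["data"]})
    return {"records": out}
```

Appending a newline per record matters when the destination is Amazon S3, so downstream Amazon Athena sees one JSON object per line.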
Dynamic partitioning
Dynamic partitioning lets Amazon Data Firehose derive Amazon S3 prefixes from the record content itself, using JQ-style expressions or a Lambda function. Example: partition by customer_id and event_date so Amazon Athena can prune Amazon S3 reads. Dynamic partitioning is enabled at stream-creation time and cannot be toggled afterwards.
Format conversion — Apache Parquet and Apache ORC
Amazon Data Firehose can convert incoming JSON records into Apache Parquet or Apache ORC, referencing an AWS Glue Data Catalog schema. This is the single most exam-relevant Amazon Data Firehose feature beyond destinations: Apache Parquet and Apache ORC cut Amazon Athena query cost and Amazon Redshift Spectrum scan cost by 5x to 10x versus raw JSON.
On DVA-C02, the moment a question says "no code," "deliver straight to Amazon S3 or Amazon Redshift," "minimal operational overhead," or "convert to Apache Parquet for Athena," the answer is Amazon Data Firehose, not Amazon Kinesis Data Streams. Conversely, any mention of "sub-second," "custom business logic per record," or "replay 3 days later" points back to Amazon Kinesis Data Streams with a custom consumer.
Amazon Managed Service for Apache Flink — Streaming SQL on Amazon Kinesis Data Streams
Amazon Managed Service for Apache Flink (formerly Amazon Kinesis Data Analytics) runs fully managed Apache Flink applications over Amazon Kinesis Data Streams or Amazon MSK. DVA-C02 scope is recognition-level, not deep Apache Flink.
- Apache Flink SQL: run windowed aggregations, joins, and pattern matching on live streams with SQL.
- Apache Flink for Java / Scala / Python: full Flink DataStream API for custom streaming logic.
- Studio notebooks: interactive Apache Zeppelin notebooks for exploratory streaming SQL.
- Output: Amazon Kinesis Data Streams, Amazon Data Firehose, AWS Lambda, Amazon S3.
When to use Apache Flink on AWS:
- Real-time aggregations (per-minute counts, percentiles).
- Anomaly detection.
- Stream-to-stream joins.
- Enrichment with reference data.
If the scenario is pure "deliver to Amazon S3," do not use Apache Flink — use Amazon Data Firehose. If the scenario is "calculate a five-minute rolling average on live IoT data," Apache Flink is the answer.
Amazon Kinesis Data Streams vs Amazon SQS vs Amazon EventBridge — The Decision Matrix
This is the highest-value comparison on DVA-C02 for the Amazon Kinesis Data Streams topic.
| Dimension | Amazon Kinesis Data Streams | Amazon SQS | Amazon EventBridge |
|---|---|---|---|
| Pattern | Ordered, replayable stream | Work queue | Event router / bus |
| Ordering | Per-shard (via partition key) | FIFO = strict; Standard = best-effort | Not guaranteed |
| Replay | Yes, up to 365 days | No (message consumed once) | Optional (event archive with replay) |
| Consumers | Many independent, parallel | One consumer per message | Many targets per rule |
| Latency | Sub-second (enhanced fan-out) | ~Sub-second | ~Sub-second |
| Throughput | MB/s per shard | Unlimited (standard) | High |
| Use case | Clickstream, IoT, change-data capture, logs | Decoupling, async work, retries | Cross-service event routing, SaaS integration |
Key DVA-C02 exam triggers:
- "Multiple independent consumers of the same stream" → Amazon Kinesis Data Streams.
- "Replay 7 days of history" → Amazon Kinesis Data Streams.
- "Ordered clickstream from 10 million users" → Amazon Kinesis Data Streams with user ID partition key.
- "Decouple order producer from order processor, one message per order, no replay" → Amazon SQS.
- "Route AWS service events to many targets" → Amazon EventBridge.
Common DVA-C02 Exam Traps for Amazon Kinesis Data Streams
Seven traps that drop points on every DVA-C02 form.
- 1 MB/s per shard is write, not read. Read is 2 MB/s per shard (classic). Swap these and you will misjudge capacity.
- Global ordering does not exist in Amazon Kinesis Data Streams. Only per-shard ordering. Global ordering is an Amazon SQS FIFO scenario with one message group.
- A hot partition key cannot be fixed by adding shards. Fix the key. Then reshard.
- `ProvisionedThroughputExceededException` can come from either producer-side write limits or consumer-side read limits. Same exception name, different root cause.
- Amazon Data Firehose buffers. If the scenario insists on sub-second, Amazon Data Firehose is wrong no matter how convenient it looks.
- KPL-produced records need the KCL to de-aggregate. A raw SDK consumer reads one KPL record as one record, not as the N user records inside.
- Enhanced fan-out is per consumer, per shard, and per hour — it costs real money. Do not default to it; use it when you have three or more real-time consumers.
If you write user records through the KPL but read with a hand-rolled SDK GetRecords consumer, each Amazon Kinesis Data Streams record you pull contains multiple aggregated user records. Without the RecordDeaggregator utility or the KCL, your consumer will process "one record" that is actually a protobuf envelope of 50 user records. The DVA-C02 exam will hide this gotcha behind a throughput mismatch question.
Monitoring Amazon Kinesis Data Streams with CloudWatch
A short list of CloudWatch metrics every Amazon Kinesis Data Streams developer should alarm on for DVA-C02:
- `IncomingBytes`, `IncomingRecords` — write volume, compared against shard limits.
- `WriteProvisionedThroughputExceeded` — producer throttling count.
- `ReadProvisionedThroughputExceeded` — consumer throttling count.
- `GetRecords.IteratorAgeMilliseconds` — how far behind consumers are (the canary metric for Lambda consumer health).
- `PutRecord.Latency`, `GetRecords.Latency` — data-plane request latency.
Alarm on rising GetRecords.IteratorAgeMilliseconds — if it grows without bound, the consumer is falling behind, and you need to add shards, raise the ParallelizationFactor, or increase Lambda memory.
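Such an alarm can be expressed as put_metric_alarm parameters. The alarm name, 60-second period, and one-minute threshold below are illustrative choices, not prescribed values:

```python
def iterator_age_alarm(stream_name: str, threshold_ms: int = 60_000) -> dict:
    """Build put_metric_alarm kwargs for a consumer-lag alarm on one stream."""
    return {
        "AlarmName": f"{stream_name}-iterator-age",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,            # evaluate every minute
        "EvaluationPeriods": 5,  # five consecutive breaches before alarming
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# With live credentials you would pass these straight to CloudWatch:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**iterator_age_alarm("telemetry"))
```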
End-to-End Reference Architecture
A typical DVA-C02 production architecture using Amazon Kinesis Data Streams:
- Ingest: millions of IoT devices publish telemetry every second. An ingestion tier uses the KPL (via the AWS SDK) with deviceId as the partition key.
- Stream: Amazon Kinesis Data Streams in on-demand mode with 24-hour default retention.
- Real-time consumers (enhanced fan-out):
  - AWS Lambda updates an Amazon DynamoDB table for device state.
  - Amazon Managed Service for Apache Flink computes per-minute rolling averages.
- Archival consumer: Amazon Data Firehose reads the same Amazon Kinesis Data Streams stream, buffers for 5 minutes, converts to Apache Parquet via an AWS Glue schema, and writes to Amazon S3 with dynamic partitioning by deviceType and date.
- Analytics: Amazon Athena queries the Apache Parquet files. Amazon Redshift Spectrum joins against the data warehouse.
- Observability: CloudWatch alarms on GetRecords.IteratorAgeMilliseconds and WriteProvisionedThroughputExceeded.
This is the pattern behind 70 percent of Amazon Kinesis Data Streams DVA-C02 scenario questions.
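The ingest leg of this pattern reduces to a put_record call keyed on deviceId. A minimal sketch of the call parameters; the helper name and payload shape are illustrative, and only the PartitionKey routing is the point:

```python
import json

def telemetry_put_params(stream_name: str, device_id: str, payload: dict) -> dict:
    """Build kinesis put_record kwargs; using deviceId as the partition key
    keeps each device's readings ordered on a single shard."""
    return {
        "StreamName": stream_name,
        "PartitionKey": device_id,                    # shard routing + per-key ordering
        "Data": json.dumps(payload).encode("utf-8"),  # record payload, up to 1 MB
    }

# With live credentials:
# import boto3
# boto3.client("kinesis").put_record(
#     **telemetry_put_params("telemetry", "dev-42", {"temp": 21.5}))
```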
FAQ: Amazon Kinesis Data Streams Top Questions
1. When should I choose Amazon Kinesis Data Streams over Amazon SQS on DVA-C02?
Choose Amazon Kinesis Data Streams when the scenario mentions any of: real-time streaming, multiple independent consumers reading the same data, replay for hours or days, ordering per key (partition key), or continuous data like clickstream, IoT, or change data capture. Choose Amazon SQS when the scenario says decouple, one worker per message, retry with visibility timeout, or strict FIFO ordering across the whole queue (Amazon SQS FIFO with a single message group). Amazon Kinesis Data Streams is a log; Amazon SQS is a queue.
2. What is the correct way to handle ProvisionedThroughputExceededException from a producer?
Implement exponential backoff with jitter starting at around 100 ms. Then diagnose: if the error comes from a subset of partition keys, fix the partition key distribution (add entropy or use a better key). If all keys are evenly distributed and the stream is maxed out, either call UpdateShardCount to scale up, split the hottest shards with SplitShard, or switch the stream to on-demand mode. Finally, use the KPL in your producer — the KPL already implements aggregation, batching, and retries, which eliminates most throttling in practice.
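The backoff advice above can be sketched as a full-jitter delay generator. The helper name and the retry loop are illustrative; only the roughly 100 ms starting point and exponential growth with jitter come from the text:

```python
import random

def backoff_delays(base_ms: int = 100, cap_ms: int = 10_000, attempts: int = 5):
    """Full-jitter exponential backoff: before retry N, sleep a random
    duration between 0 and min(cap, base * 2**N) milliseconds."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap_ms, base_ms * 2 ** attempt))

# Retry loop sketch around a throttled put_record call:
# for delay_ms in backoff_delays():
#     try:
#         client.put_record(**params)
#         break
#     except client.exceptions.ProvisionedThroughputExceededException:
#         time.sleep(delay_ms / 1000)
```

Full jitter (random between zero and the exponential ceiling) spreads retries from many throttled producers so they do not re-collide on the same instant.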
3. Can I have more than two Amazon Kinesis Data Streams consumers on a shard without performance loss?
Not with classic consumers. Classic consumers share the 2 MB/s read budget per shard, capped at 5 GetRecords calls per second per shard total. Beyond two classic consumers, you will observe ReadProvisionedThroughputExceeded and rising GetRecords.IteratorAgeMilliseconds. Use enhanced fan-out, which gives every registered consumer a dedicated 2 MB/s pipe per shard and sub-200 ms latency. Enhanced fan-out supports up to 20 consumers per stream (default quota, raisable).
4. What is the difference between Amazon Data Firehose and Amazon Kinesis Data Streams on DVA-C02?
Amazon Kinesis Data Streams is real-time and requires you to write consumers (AWS Lambda, KCL applications, or enhanced fan-out subscribers). It stores records for up to 365 days and is priced per shard-hour (provisioned) or per GB (on-demand). Amazon Data Firehose is near-real-time (60 to 900 second buffering), zero-code, and delivers directly to Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, HTTP, or Snowflake. Amazon Data Firehose can invoke a Lambda transformation and convert to Apache Parquet or Apache ORC. If the scenario asks for sub-second latency or replay, pick Amazon Kinesis Data Streams. If it asks for no-code delivery to S3 or Redshift, pick Amazon Data Firehose. They frequently pair: Amazon Kinesis Data Streams as the live source, Amazon Data Firehose as the archival consumer writing Apache Parquet to Amazon S3.
5. How does Lambda avoid reprocessing an entire batch on a single poison-pill record?
Enable BisectBatchOnFunctionError so Lambda splits a failing batch in half and retries each half, isolating the bad record in O(log n) invocations. Also return batchItemFailures from your Lambda response, identifying the specific record sequence numbers that failed; Lambda will checkpoint past the successful records and retry only the failed ones. Pair these with MaximumRetryAttempts and an OnFailure destination (Amazon SQS or Amazon SNS) so that records that exhaust retries end up in a dead-letter location rather than blocking the shard forever.
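A minimal handler following the batchItemFailures pattern might look like the sketch below; process and its poison-pill condition are placeholders for real business logic, and ReportBatchItemFailures must be enabled on the event source mapping for the return value to matter:

```python
import base64
import json

def process(payload: dict) -> None:
    """Placeholder business logic; raises to mark the record as failed."""
    if payload.get("poison"):
        raise ValueError("poison pill")

def handler(event, context):
    """Kinesis-triggered Lambda that reports partial batch failures so
    Lambda checkpoints past successful records and retries the rest."""
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            failures.append(
                {"itemIdentifier": record["kinesis"]["sequenceNumber"]}
            )
    return {"batchItemFailures": failures}
```

For stream sources, Lambda treats the lowest reported sequence number as the new checkpoint and retries from that record onward.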
6. What does the KCL write into Amazon DynamoDB and what permissions does it need?
The Kinesis Client Library (KCL) creates an Amazon DynamoDB table named after your application. Each shard has one row that stores the current lease holder (which worker owns the shard), the lease counter (used for renewal and fencing), and the last checkpoint sequence number. Your KCL application needs IAM permissions for dynamodb:CreateTable, dynamodb:DescribeTable, dynamodb:GetItem, dynamodb:PutItem, dynamodb:UpdateItem, dynamodb:DeleteItem, and dynamodb:Scan, plus kinesis:DescribeStream, kinesis:GetRecords, kinesis:GetShardIterator, kinesis:ListShards, and kinesis:SubscribeToShard (for enhanced fan-out), and cloudwatch:PutMetricData for publishing KCL metrics to CloudWatch.
7. When should I split a shard vs merge shards vs use UpdateShardCount?
Use SplitShard when one specific shard is hot (unbalanced hash key distribution) and you want to carve that shard into two children covering half the hash key range each. Use MergeShards when two adjacent shards both have low utilization and you want to reduce shard cost. Use UpdateShardCount when you want to scale the whole stream up or down without manually choreographing splits and merges — it doubles up or halves down per call, with daily limits. For unpredictable traffic, prefer on-demand mode and let AWS manage capacity entirely.
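The doubles-up-or-halves-down constraint on UpdateShardCount can be illustrated with a small planner that computes the successive calls needed to reach a distant target (the helper name and the rounding on the halving step are assumptions of this sketch; daily call limits still apply on top):

```python
import math

def reshard_steps(current: int, target: int) -> list[int]:
    """Plan successive UpdateShardCount targets: each call may at most
    double, or at most halve, the current open shard count."""
    steps = []
    while current != target:
        if target > current:
            current = min(target, current * 2)           # scale up, <= 2x per call
        else:
            current = max(target, math.ceil(current / 2))  # scale down, >= half per call
        steps.append(current)
    return steps

# Each planned value would be one call:
# for count in reshard_steps(2, 10):
#     kinesis.update_shard_count(StreamName="telemetry",
#                                TargetShardCount=count,
#                                ScalingType="UNIFORM_SCALING")
```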
8. Does Amazon Data Firehose support conversion to Apache Parquet, and why does it matter?
Yes. Amazon Data Firehose can convert JSON records to Apache Parquet or Apache ORC columnar formats on the fly, using an AWS Glue Data Catalog table for the schema. Columnar formats cut Amazon Athena query cost by 5 to 10x and accelerate Amazon Redshift Spectrum scans. Combine Apache Parquet conversion with dynamic partitioning (partition by date or customer) to unlock predicate pushdown. On DVA-C02, any scenario that says "reduce Amazon Athena query cost" or "efficient long-term analytics store" is hinting at Apache Parquet via Amazon Data Firehose.
Summary — Amazon Kinesis Data Streams on DVA-C02
Amazon Kinesis Data Streams is the ordered, replayable, multi-consumer streaming backbone of AWS. Memorize five things and you will answer every Amazon Kinesis Data Streams question on DVA-C02.
- Shard contract: 1 MB/s or 1000 records/s write, 2 MB/s classic read (shared, 5 GetRecords/s), 2 MB/s per consumer enhanced fan-out.
- Partition key decides which shard; hot partition keys cause throttling that resharding alone cannot fix.
- KPL + KCL is the production-grade producer-consumer pair; KPL aggregates, KCL de-aggregates and checkpoints into Amazon DynamoDB.
- Amazon Data Firehose is zero-code near-real-time delivery to Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, HTTP, and Snowflake, with Lambda transformation, dynamic partitioning, and Apache Parquet / Apache ORC conversion.
- Lambda knobs — BatchSize, MaximumBatchingWindowInSeconds, ParallelizationFactor, BisectBatchOnFunctionError, ReportBatchItemFailures, OnFailure destination — solve every Amazon Kinesis Data Streams Lambda consumer question.
Know the Amazon Kinesis Data Streams vs Amazon SQS vs Amazon EventBridge decision matrix, know the Amazon Kinesis Data Streams vs Amazon Data Firehose boundary (real-time vs near-real-time, code vs no-code), and Amazon Kinesis Data Streams becomes a reliable points bucket on the DVA-C02 exam.