Kafka 1.0 on HDInsight lights up real time analytics scenarios

11th July 2018 Anthony Mashford 0 Comments

Data engineers love Kafka on HDInsight as a high-throughput, low-latency ingestion platform in their real time data pipeline. They already leverage Kafka features such as message compression, configurable retention policy, and log compaction. With the release of Apache Kafka 1.0 on HDInsight, customers now get key features that make it easy to implement the most demanding scenarios. Here is a quick introduction:

Idempotent producers so that you don’t have to deduplicate

Consider a cellular billing system, in which the producer writes the amount of data consumed by users to a Kafka topic called data-consumption-events. If the broker or the connection fails, the producer will not get an acknowledgment of a message write and will retry that message. This will lead to duplicate writes to the system, causing users to be overbilled.

In critical scenarios like above, data engineers had to write and maintain custom deduplication logic, such as hashing and saving message ids. However, with idempotent producers turned on, Kafka handles that logic for you. Records include unique producer ids and the sequence number of the message. Kafka brokers will only accept a message from a producer if the sequence number is exactly one more than the committed sequence number for that producer. Producer retries will not result in duplicates. This ensures idempotency guarantees within a single producer session.

Transactions to ensure accuracy

Also known as atomicity, this Kafka feature enables multi-topic and multi-partition atomic writes. A producer can send messages in a batch such that a consumer either consumes all messages in the batch or none. And since Kafka records consumer offsets, both production and consumption can be batched together in one unit.

Let’s build upon the above cellular billing system. Consider an application that consumes individual events from the data-consumption-events topic, calculates the monthly cumulative and writes that to a Kafka topic named monthly-data-usage. The customer portal showing data usage is based upon the latest entry in the monthly-data-usage topic. Before the Kafka transactions feature, if a failure happened between consuming events and writing the cumulative total, the total would be inaccurate. This is because some of the committed consumption events would be excluded from the total. With transactions enabled, consuming events and writing the total can be batched as a single unit. Now if a failure happens before writing to monthly-data-usage, consumer offsets are not committed. The application will consume the missed events to ensure an accurate total.

Exactly once processing

Until now, you could get at-least once or at-most once semantics with Kafka on HDInsight. Combining the idempotent producer with transactions and consumer offsets allows you to achieve exactly once semantics in Kafka. This means that your application calculates the correct result as input messages are neither duplicated nor lost.

Continuing our simple cellular billing example, exactly once semantics means that any data consumption event by the customer is reflected in the total exactly once. The customer can be confident in viewing accurate usage in the portal.

This can be extended to various other scenarios in a production environment, with multiple processing stages and intermediate results. Lost or duplicate data at any step can cause a ripple effect across the data pipeline. With Kafka on HDInsight, you can guarantee accurate results with no lost or duplicated messages.

Record headers

Most messaging systems support message headers. Headers are separate from the message payload, which may be encrypted or compressed. In the cellular billing scenario, headers may contain information such as cell tower id or network provider id, enabling scenarios like automated routing, message flow monitoring and metadata auditing.

Previously, if the data pipeline included message headers from other systems, the headers either had to be discarded, or converted into the payload format and included with the payload itself. Data engineers were required to create and manage custom wrapping and unwrapping code. With support for record headers in Kafka on HDInsight, all the routing and filtering scenarios can be achieved natively. This enables much lower end to end latencies, which is critical for real-time systems.

In addition to the major features above, there are numerous performance improvements and bug fixes in this version. For a complete comparison, check out the release notes of Apache Kafka 1.0 and 0.11. The previous version of Kafka on HDInsight was 0.10.

Get started with Kafka on Azure HDInsight. Follow us on @AzureHDInsight or HDInsight blog for the latest updates. For questions and feedback, reach out to AskHDInsight@Microsoft.com.

Source: Azure Blog Feed

Mashford's Musings

Kafka 1.0 on HDInsight lights up real time analytics scenarios

Idempotent producers so that you don’t have to deduplicate

Transactions to ensure accuracy

Exactly once processing

Record headers

Leave a Reply Cancel reply