Streaming Data is Undergoing a Paradigm Shift

Clay Ratliff
6 min read · Oct 18, 2023

Change is coming.

[Image: a large hub in space connecting a virtual network]

Kafka Tiered Storage for the Masses

If you’re a Kafka fan, you may have heard that Apache Kafka 3.6 RC 0 implements tiered storage. This means a preview of the feature will be available in open-source Kafka.

Until now, tiered storage was not supported natively in Kafka and has only been available through third-party tools or modules, most of them proprietary, such as Confluent’s tiered storage module and Amazon Managed Streaming for Apache Kafka. In practice, if you wanted tiered storage, you were almost certainly locked into paying a vendor for it.

With the release of 3.6 RC 0, tiered storage is now available to the masses. This may sound like a small change, but in reality it’s a pretty big deal, and I believe we’re on the precipice of a true paradigm shift in how we approach streaming data.
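For the curious, here is a rough sketch of what turning the feature on for a topic might look like. It assumes the KIP-405 property names (remote.storage.enable, local.retention.ms, and the broker-side remote.log.storage.system.enable) ship unchanged in 3.6, and it uses the confluent-kafka Python client purely for illustration; the topic name and broker address are placeholders.

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Assumes a 3.6 broker already started with tiered storage enabled,
# i.e. remote.log.storage.system.enable=true plus a RemoteStorageManager
# plugin configured in server.properties (see KIP-405 for the details).
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "clickstream",            # hypothetical topic name
    num_partitions=6,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",   # tier older segments to remote storage
        "local.retention.ms": "86400000",  # keep ~1 day on the brokers' local disks
        "retention.ms": "2592000000",      # ~30 days total, mostly in cheaper storage
    },
)

for name, future in admin.create_topics([topic]).items():
    future.result()  # raises if the topic could not be created
```

The point is less the exact property names than the shape of the change: how long data lives on expensive broker disks and how long the topic retains data overall become two separate knobs.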

Why This is Important

Kafka has become a huge part of the streaming data ecosystem and is arguably one of, if not the, most common entry points for data into a distributed system. Data is commonly consumed in real time, but historical data can also be fetched, as long as it falls within the retention policy. This is where Kafka has traditionally shown its weakness. To understand that weakness, we need a little background.

How Kafka Works (High Level)

Kafka stores messages in append-only log segments on the brokers’ local disks. The retention policy for these logs, configured per topic or system-wide, determines the oldest message that can be fetched. This mechanism guarantees that consumers won’t miss messages even if their application fails or otherwise loses connectivity.
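To make that concrete, here is a minimal sketch of asking a broker for the oldest offset that retention still holds, using the confluent-kafka Python client (my choice, not something the article prescribes); the topic, partition, and broker address are placeholders.

```python
from confluent_kafka import Consumer, TopicPartition

# Minimal sketch: probe the retention boundary of one partition.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "retention-probe",          # placeholder consumer group
})

tp = TopicPartition("events", 0)  # hypothetical topic, partition 0
low, high = consumer.get_watermark_offsets(tp, timeout=10)
print(f"oldest retained offset: {low}, next offset to be written: {high}")

# Re-reading from `low` replays everything the retention policy still
# holds -- this is the window a recovering consumer can fall back on.
consumer.close()
```

Anything older than that low watermark has already been deleted by the broker and is simply gone.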

The Storage Issue

The total storage needed for the cluster is proportional to the number of topics/partitions, the message rate, and the retention period. As a grossly over-simplified example, without getting into topics, partitions, replication factors, index size, etc., the approximate storage size per node of a hypothetical cluster with 12 nodes, a 3-day retention policy, 10k messages per second, and a message size of 1 KB can be determined using the…
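As a sketch of that arithmetic (ignoring replication, indexes, and compression, per the caveat above), the numbers work out roughly like this:

```python
# Grossly over-simplified sizing: bytes = rate * size * retention, spread over nodes.
messages_per_sec = 10_000
message_size_bytes = 1_000             # ~1 KB per message
retention_seconds = 3 * 24 * 60 * 60   # 3-day retention policy
nodes = 12

total_bytes = messages_per_sec * message_size_bytes * retention_seconds
per_node_bytes = total_bytes / nodes

print(f"cluster total: {total_bytes / 1e12:.2f} TB")    # ~2.59 TB
print(f"per node:      {per_node_bytes / 1e9:.2f} GB")  # ~216 GB
```

And that is before replication: with a typical replication factor of 3, every one of those bytes is stored three times over, which is exactly the cost pressure tiered storage is meant to relieve.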


Clay Ratliff

Looking for our dreams in the second half of our lives as novice sailors as we learn to live on our floating home SV Fearless https://svfearless.substack.com/