How do you know if your service is at its limits?
I am not a Kafka expert, but I do work with it on an almost daily basis. Recently, one of my clients told me they were considering implementing a feature that would double their Kafka workload and asked if I could support them by helping right-size a new cluster. The ensuing conversation uncovered some questions I didn't have answers for, and I thought I'd share my search for one of those. Namely: had the client already saturated their consumer groups?
What is a Consumer Group?
In Kafka, a consumer reads messages from the broker. For example, temperature data published to a topic (by a producer) is read by an application that analyzes the data for trends (a consumer). Simple enough.
However, if there are 5,000 sensors publishing data to that topic every minute, my single app may not be able to read the messages fast enough, and a backlog starts to build up. In this case, one solution is to create multiple consumers that process the data in parallel. Enter consumer groups. Consumer groups allow multiple consumers to read from a single topic without stepping on each other's toes, parallelizing message processing across consumers to provide horizontal scalability.
What is Consumer Group Saturation?
Consumer group saturation is pretty much what it sounds like. Just like a single consumer can get overwhelmed by the volume of messages it has to process, so too can a consumer group. Basically, messages arrive faster than the group can process them, and the backlog grows without ever shrinking back down. This is referred to as saturation.
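To make that concrete, here's a toy simulation (not real Kafka, just arithmetic) of what happens to the backlog when the arrival rate exceeds the group's processing rate:

```python
# A toy simulation of saturation: if messages arrive faster than the
# group can process them, the backlog only ever grows.
def backlog_over_time(arrival_rate, processing_rate, minutes):
    """Track the unprocessed-message backlog minute by minute."""
    backlog, history = 0, []
    for _ in range(minutes):
        backlog = max(0, backlog + arrival_rate - processing_rate)
        history.append(backlog)
    return history

# 5,000 messages/min arriving, 4,500/min processed: saturated.
print(backlog_over_time(5000, 4500, 5))  # [500, 1000, 1500, 2000, 2500]
# 5,000 arriving, 6,000 processed: the group keeps up.
print(backlog_over_time(5000, 6000, 5))  # [0, 0, 0, 0, 0]
```

The first run never catches up; the second drains everything each minute. That gap between the two rates is what the rest of this article is about detecting.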
Saturation can occur for a bunch of different reasons, but the most common ones can be grouped into:
- Not enough consumers in the group
- Network latency
- Consumer group rebalancing
- Consumers are slow/inefficient
The not-enough-consumers issue tends to sneak up when you're not paying attention and don't have good monitoring in place. When your message volume increases gradually over a long enough period that the trend goes unnoticed, you'll find that your backlog grows sporadically at first, then consistently, until you're never able to catch up.
Network latency tends to be transient, but if you're already on the edge of saturation it can become a persistent problem: the larger backlog it creates may never be cleared before the next event pushes you over. These kinds of network events exacerbate the situation outlined above.
Kafka topics are divided into partitions, which distribute the load across the cluster by spreading a topic's partitions over its brokers. When a consumer joins a consumer group, it is assigned a subset of partitions that it will read from. When a consumer is added to or removed from a consumer group, Kafka rebalances the group to spread the load evenly, i.e., it reassigns the partitions among the consumers. And while the rebalancing process does its best to cause as little disruption to consumer processing as possible, in reality a small storm of controlled chaos results: while the rebalance is in progress, some consumers may sit idle while others are overloaded. Adding and removing consumers frequently will wreak havoc on topics with high message throughput.
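Here's a rough sketch of the reassignment, using a round-robin-style split (the real protocol lives in the Kafka client's partition assignors and is more involved; the consumer names here are hypothetical):

```python
# A rough sketch of spreading a topic's partitions across the members of
# a consumer group, modeled loosely on round-robin assignment.
def assign_partitions(partitions, consumers):
    """Map each consumer to a subset of partitions, round-robin style."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(6))  # a topic with 6 partitions
before = assign_partitions(partitions, ["c1", "c2"])
after = assign_partitions(partitions, ["c1", "c2", "c3"])  # c3 joins: rebalance
print(before)  # {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
print(after)   # {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

Notice that when c3 joins, almost every partition changes hands; during that shuffle, messages on a moving partition aren't being processed by anyone, which is the "controlled chaos" described above.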
Much like not having enough consumers in the group, slow/inefficient processing of messages tends to sneak up on you over time, only in this case, it's probably due to application/code changes in the consumers that are processing the messages, rather than in the number of messages to process. Of course, you could turn that on its head and say that instead, it’s an issue with code optimization rather than consumer group saturation, but from Kafka’s point of view (and likely your DevOps/SREs perspective), “po-ta-to”, “po-TAH-to”.
How Do I Know It’s Saturated?
Before I throw out some key indicators, I want to take a moment to point out the obvious: you need to establish a baseline for your performance metrics, because as is the case with almost everything, each use case has its own peculiar needs. So make sure you know what your system looks like when it's behaving well.
Assuming that you know what these metrics look like under normal circumstances, here are some things you can monitor to determine if your consumer groups are saturated, or headed that way.
Consumer Group Lag:
What does your consumer group lag look like? The difference between the latest offset in a partition (the log end offset) and the group's committed offset is the number of unprocessed messages for that consumer. Automated monitoring implementation is beyond the scope of this article, but the most convenient way to check consumer group lag in an ad-hoc fashion is the kafka-consumer-groups.sh tool included with your Kafka installation (run it with --describe --group to see per-partition lag).
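The arithmetic behind that lag figure is simple. A minimal sketch, using hypothetical offset numbers (in practice these come from the broker, e.g. via kafka-consumer-groups.sh or an admin client):

```python
# Consumer group lag per partition:
#   lag = (log end offset of the partition) - (offset the group committed).
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag and the total for the group."""
    lag = {p: log_end_offsets[p] - committed_offsets.get(p, 0)
           for p in log_end_offsets}
    return lag, sum(lag.values())

end = {0: 1200, 1: 980, 2: 1500}       # latest offset per partition
committed = {0: 1150, 1: 980, 2: 700}  # the group's committed offsets
per_partition, total = consumer_lag(end, committed)
print(per_partition)  # {0: 50, 1: 0, 2: 800}
print(total)          # 850
```

Note that lag is per partition: partition 2 here is the real problem, which a single group-wide total would hide.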
If the number of unprocessed messages is growing, or holds steady at a high level without shrinking, you're probably saturated.
Lag Rate:
Be sure to monitor the lag rate, i.e., is the consumer group lag growing or shrinking over time? You can use this to spot trends and see if you're approaching saturation before you start suffering from it.
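One simple way to turn lag samples into a trend signal is to fit a slope to recent readings and alert when it's persistently positive. A sketch, with a made-up alert threshold:

```python
# Fit a least-squares slope to lag samples taken at regular intervals;
# a persistently positive slope means the backlog is growing.
def lag_slope(samples):
    """Slope of lag samples (messages per sampling interval)."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

growing = [100, 220, 310, 450, 580]  # lag sampled every minute
steady = [500, 480, 510, 495, 505]
THRESHOLD = 50  # arbitrary: alert if lag grows >50 msgs/min
print(lag_slope(growing) > THRESHOLD)  # True: climbing ~119 msgs/min
print(lag_slope(steady) > THRESHOLD)   # False: noisy but flat
```

The second series sits at a high absolute lag but isn't growing; whether that's acceptable depends on the baseline you established earlier.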
Processing Time:
How fast are your consumers processing messages? If per-message processing time is constant but consumer lag is increasing, you're saturated.
Resource Usage:
High CPU and memory usage may indicate that the consumers are struggling to keep up with the load.
How Do I Fix It?
The best medicine is preventative: avoid getting into trouble in the first place with proper monitoring. Make sure you're monitoring trends that can alert you when, for example, your message throughput has grown by more than 10 percent over the past month.
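That kind of trend alert doesn't need anything fancy. A sketch, assuming you record daily message counts somewhere (the 10 percent threshold is the example figure from above, not a recommendation):

```python
# Alert when average daily throughput this month has grown more than
# `threshold` (fractional) over last month's average.
def throughput_alert(last_month, this_month, threshold=0.10):
    """Return True when average daily throughput grew past the threshold."""
    prev = sum(last_month) / len(last_month)
    curr = sum(this_month) / len(this_month)
    return (curr - prev) / prev > threshold

print(throughput_alert([100_000] * 30, [115_000] * 30))  # True: 15% growth
print(throughput_alert([100_000] * 30, [104_000] * 30))  # False: 4% growth
```

The point is to alert on the trend, well before the backlog itself starts growing.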
Avoid Consumer Group Rebalances:
Are consumers rebalancing frequently? Rebalances are typically caused by adding or removing consumers from a consumer group. Avoid doing this on a frequent basis as it can wreak havoc with your message processing.
Remember that partitions are how Kafka spreads the load over the cluster allowing for scalability and parallelism. This leads to some potential gotchas.
- If you have more consumers than partitions in a consumer group, some of the consumers will sit idle.
- If there are more partitions than consumers, some consumers will be assigned multiple partitions (every partition still gets consumed, but those consumers carry more load).
Consumer groups and partitions need to be deliberately balanced. On an efficiency note, you can never remove partitions from a topic, only add them, so if you're unsure about future growth, starting smaller is better.
Add More Consumers:
If you’ve determined that a lack of consumers is the source of saturation, one of the easiest ways to deal with this is to simply add more consumers to the group. Just be sure that you balance the number of partitions with the number of consumers.
Slow or Inefficient Consumers:
If you’ve determined that the consumers are too inefficient, then it’s time to look at optimization or re-architecting how your messages are consumed and/or processed. Go figure.
Network Latency:
There's very little most of us can do about this one other than change providers. The good news is that, of the causes above, it's the one you're least likely to encounter as a persistent issue.
If you found this quick overview of saturation useful please let me know by clapping or subscribing so I know what you like. If you hated it, or have any comments, corrections, or additions, please leave me a comment and I’ll try to use that feedback in the future.