How to Evaluate, Choose, and Scale Your Message Broker Effectively?
Selecting the right message broker is a critical decision that impacts your application's scalability, resilience, and overall performance. With so many broker options, each with its unique features and trade-offs, making an informed choice can be challenging.
Let's look at some examples of what can go wrong when a key broker attribute is overlooked.
Partitioning: Consider the case of an early-stage deployment of Apache Kafka by a prominent e-commerce platform. They didn't adequately plan for partitioning, causing an imbalance in message distribution. As a result, some partitions became hotspots, leading to message backlogs. This misconfiguration became acutely problematic during peak shopping events, where the increased message volume further strained these hotspots, causing significant processing delays. Customers experienced lag in receiving order updates, and the ripple effect was a slower order processing pipeline, impacting user experience and sales.
Geographic Distribution and Fault Tolerance: In another instance, a global financial services company using RabbitMQ overlooked the importance of geographic distribution and fault tolerance. When one of their primary data centers faced an outage, they couldn't fail over efficiently to their secondary site, leading to substantial data loss and trading downtime, which translated to significant financial losses and reputational damage.
Guaranteed Delivery: A popular ride-hailing app faced severe backlash when it rolled out a new promotional campaign. For every ride a user took, they were promised reward points. However, due to the lack of guaranteed delivery in their message broker setup, many point allocation messages were lost. This resulted in riders not seeing their promised points, leading to a flood of customer complaints, negative PR, and even some users shifting to competitors.
Message Ordering: An online auction platform learned the hard way about the importance of message ordering. Bids were processed out of order due to the lack of strict message sequencing in their broker. This led to earlier bids being recognized as the winning ones, even if a higher bid was placed a few moments later. The result was disgruntled users, erroneous auction outcomes, and a significant blow to the platform's credibility.
Retention Policies: A news streaming service provided users with a feature to look back at major events from the past week. However, due to incorrect retention policies in their message broker, older news items were prematurely purged. When users tried accessing them, they were met with errors or missing content, significantly affecting user experience and trust.
Schema Evolution: A healthcare analytics company used a message broker to process and store patient data. As the company grew and expanded its services, new data types and fields were introduced. However, they hadn't planned for schema evolution in their initial setup. When newer message formats were introduced, it caused data inconsistencies and even crashes in parts of their analytics pipeline that expected the older schema. This led to delays in reporting and inaccurate health analytics, a grave concern in the medical field.
Now let's look at the attributes to weigh when setting up a message broker:
Latency Requirements: Speed of message delivery.
Guaranteed Delivery: Assurance of every message's delivery.
Durability: Persistence of messages across broker restarts and failures.
Security: Measures protecting data integrity and privacy, such as authorization using Access Control Lists (ACLs).
Scalability: System's adaptability to increased load.
High Availability: System uptime and reliability.
Fault Tolerance: Functionality amidst internal failures.
Resiliency: Recovery capability from unexpected disruptions.
Monitoring and Logging: Oversight and record-keeping of operations.
Backup and Recovery: System data preservation and restoration.
Message Ordering: Sequential delivery requirement.
Retention Policies: Duration messages are retained.
Cost: Total expenditure for the system.
Partitioning: Parallelism in data processing.
Daily Traffic: Average daily message count.
Hourly Peak Traffic: Peak hourly message count.
RPM (Requests Per Minute): System's real-time handling rate.
Batch vs. Stream Processing: Mode of data processing.
Dead Letter Queues: Repository for unprocessable messages.
Replay Capability: Ability to reprocess past messages.
Multi-Tenancy: Multiple applications sharing infrastructure.
Message Filtering: Routing based on message criteria.
End-to-end Latency: Total time from sender to receiver.
Message Compression: Reducing message size for efficiency.
Integration with Other Systems: Interoperability with external platforms.
Throttling and Rate Limiting: Control over message processing rate.
Consumer Flexibility: Multiple consumers accessing the same messages, for example through multiple consumer groups.
Priority Queuing: Message delivery based on priority.
Push vs. Pull Models: Mechanism of message delivery.
Schema Evolution: Adaptability to message format changes.
Managed Cloud Solution: Utilizing cloud provider services.
Self-managed Solution Requirement: Independent system operation.
Support and Community: Availability of help and resources.
Geographic Distribution: Data distribution across regions.
Protocol Support: Recognized messaging protocols.
Deduplication: Removal of redundant messages.
Ecosystem: Available extensions and plugins.
Quota Management: Setting limits on users or applications.
Auditing: Tracking and review of activities.
Load Balancing: Equal distribution of workloads.
Clustering: Grouping systems to work as a single entity.
Licensing Model: Software usage and distribution terms.
Single Record Size: Size of an individual message.
Acceptable Consumer Lag: Allowed delay in message processing.
Operational Complexity: Complexity in daily operations.
How to Calculate Consumers, Consumer Groups, Partitions, CPU, and Disk Space?
Deciding on the number of consumer groups, partitions, and resource allocations such as memory, CPU, and disk space is pivotal for the efficient operation of a message broker. Here's a guide on how to make these decisions:
1. Consumer Groups and Consumers:
Consumer Groups: Determine the consumer groups based on the different types of processing that messages require. Each consumer group should represent a unique type of processing.
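Number of Consumers: The number of consumers in a group depends on the volume of messages and the processing time each message requires. Align it with the number of partitions: in partition-based brokers such as Kafka, each partition is read by at most one consumer in a group, so consumers beyond the partition count sit idle.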
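To make the relationship between consumers and partitions concrete, here is a minimal sketch of a round-robin assignment. It is a simplified model, not any broker's actual assignment algorithm, and the function name and consumer names are illustrative assumptions.

```python
def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partition ids over consumers round-robin (simplified model)."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for partition in range(num_partitions):
        # Each partition goes to exactly one consumer in the group.
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions over three consumers: every consumer gets two partitions.
balanced = assign_partitions(6, ["c1", "c2", "c3"])

# Two partitions over three consumers: one consumer has nothing to read.
oversubscribed = assign_partitions(2, ["c1", "c2", "c3"])
idle = [name for name, parts in oversubscribed.items() if not parts]

print(balanced)
print("idle consumers:", idle)
```

In this model, adding a fourth consumer to a two-partition topic only adds another idle process, which is why the consumer count is usually sized against the partition count.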
2. Partitions:
Throughput: More partitions allow higher throughput due to parallelism but might increase management overhead.
Consumer Parallelism: Having more partitions than consumers lets the load be rebalanced as consumers join or leave a group. However, far more partitions than consumers adds management overhead without adding parallelism.
Fault Tolerance: More partitions allow for better distribution of replicas, improving fault tolerance.
3. Scaling:
Horizontal Scaling: Adding more broker instances or clusters. Useful for distributing load and improving fault tolerance.
Vertical Scaling: Increasing the resources (CPU, memory) of existing brokers. Useful when individual brokers are CPU- or memory-bound and larger instances are available.
Calculations for Resource Allocation:
CPU and Memory:
Monitor the CPU and memory usage under different loads, and allocate resources based on the peak usage observed.
Consider leaving some buffer for unexpected spikes.
Disk Space:
[Disk Space] = [Average Message Size] × [Message Retention Period] × [Message Ingestion Rate]
Ensure extra space for replicas, partition logs, and unexpected volume spikes.
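As a worked example of the disk space formula, the sketch below multiplies the three terms and then applies a replication factor and a headroom fraction for the extra space mentioned above. The function name, the parameters, and the sample numbers are illustrative assumptions. Note that the ingestion rate and the retention period must use the same time unit (seconds here).

```python
def required_disk_bytes(
    avg_message_bytes: int,
    messages_per_second: int,
    retention_seconds: int,
    replication_factor: int = 1,
    headroom: float = 0.0,
) -> float:
    """Disk space = size x retention x ingestion rate, scaled for replicas and headroom."""
    raw = avg_message_bytes * messages_per_second * retention_seconds
    return raw * replication_factor * (1 + headroom)


SEVEN_DAYS = 7 * 24 * 60 * 60  # retention period in seconds

# Hypothetical workload: 1,000-byte messages at 5,000 messages/sec, kept for 7 days.
raw_bytes = required_disk_bytes(1_000, 5_000, SEVEN_DAYS)

# Same workload with 3 replicas and 25% headroom for spikes.
total_bytes = required_disk_bytes(1_000, 5_000, SEVEN_DAYS, replication_factor=3, headroom=0.25)

print(f"raw: {raw_bytes / 1e12:.3f} TB, provisioned: {total_bytes / 1e12:.2f} TB")
```

For this hypothetical workload the raw log is about 3 TB, while the provisioned total across the cluster is over 11 TB, which shows how much replication and headroom dominate the final figure.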
Number of Partitions:
[Number of Partitions] = [Peak Throughput] / [Throughput per Partition]
Consider the number of consumers and the desired level of parallelism.
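The partition formula can be sketched as follows. Rounding up and taking the larger of the throughput-based count and the desired consumer parallelism is one way to combine the two considerations above. It is an assumption of this sketch rather than a rule from any broker's documentation, and the sample throughput figures are hypothetical.

```python
import math


def partitions_needed(
    peak_throughput_mb_s: float,
    per_partition_mb_s: float,
    desired_consumers: int = 0,
) -> int:
    """Partitions = peak throughput / throughput per partition, rounded up."""
    if per_partition_mb_s <= 0:
        raise ValueError("per-partition throughput must be positive")
    by_throughput = math.ceil(peak_throughput_mb_s / per_partition_mb_s)
    # Never go below the number of consumers that should work in parallel.
    return max(by_throughput, desired_consumers, 1)


# Hypothetical figures: 125 MB/sec peak, 10 MB/sec measured per partition.
throughput_only = partitions_needed(125, 10)
with_parallelism = partitions_needed(125, 10, desired_consumers=16)

print(throughput_only, with_parallelism)
```

The per-partition throughput is best taken from a measurement on your own hardware and message sizes, since it varies widely between setups.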
Number of Consumer Groups and Consumers:
Align with the number of unique message processing types or use cases.
Ensure there are enough consumers to handle the partition load, allowing for parallel processing and failover.
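A rough way to estimate the consumer count for one group is to multiply the arrival rate by the per-message processing time, round up, and add spare capacity for failover. The sketch below does that under the assumption that each consumer processes one message at a time. All names and numbers are illustrative.

```python
import math


def consumers_needed(
    messages_per_second: int,
    processing_ms_per_message: int,
    spare: int = 1,
) -> int:
    """Consumers kept fully busy by the load, plus spare instances for failover."""
    busy = math.ceil(messages_per_second * processing_ms_per_message / 1000)
    return max(busy, 1) + spare


# Hypothetical group: 2,500 messages/sec, 5 ms of processing per message.
needed = consumers_needed(2_500, 5)

# In a partition-based broker, consumers beyond the partition count stay idle.
partitions = 12
effective = min(needed, partitions)

print(f"needed={needed}, effective with {partitions} partitions={effective}")
```

When the effective count is lower than the needed count, as in this example, the partition count is the bottleneck and should be raised before adding consumers.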
Now let's look at a few benchmarks showing what performance a single instance can deliver:
Right-sizing Apache Kafka Clusters
Consider these AWS m5 instances for some benchmarks:
Formula for Sustained Throughput Limit
Here, r is the replication factor.
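The formula itself is not reproduced in this text. The sketch below assumes it takes the form commonly given in AWS's Kafka right-sizing guidance: the cluster's sustained ingress is bounded by the smallest of the storage, EBS network, and EC2 network limits, each scaled by the number of brokers and the replication factor. Treat the exact form as an assumption and check it against the original source. The network figures in the example are hypothetical, not real m5 specifications.

```python
def sustained_throughput_limit(
    storage_mb_s: float,
    ebs_network_mb_s: float,
    ec2_network_mb_s: float,
    brokers: int,
    r: int,
    consumer_groups: int,
) -> float:
    """Assumed upper bound on cluster ingress throughput in MB/sec."""
    return min(
        # Every message is written to storage r times across the cluster.
        storage_mb_s * brokers / r,
        # Those writes also travel over each broker's EBS network link.
        ebs_network_mb_s * brokers / r,
        # EC2 network carries replication (r - 1) plus one read per consumer group.
        ec2_network_mb_s * brokers / (consumer_groups + r - 1),
    )


# Three brokers, replication factor 3, two consumer groups.
limit = sustained_throughput_limit(
    storage_mb_s=250,      # EBS baseline throughput per broker
    ebs_network_mb_s=300,  # hypothetical
    ec2_network_mb_s=400,  # hypothetical
    brokers=3,
    r=3,
    consumer_groups=2,
)
print(limit)
```

With these inputs the storage term is the smallest, so the storage volume is the bottleneck. Raising the number of consumer groups lowers only the EC2 network term.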
For a three-node cluster with a replication factor of 3 and two consumer groups, the recommended ingress throughput limits as per Equation 1 are as follows.
EBS volume with a baseline throughput of 250 MB/sec:
[Table: recommended sustained throughput limit]
EBS volume with a baseline throughput of 1000 MB/sec:
[Table: provisioned throughput configuration | recommended sustained throughput limit]