How We Made @platformatic/kafka 223% Faster (And What We Learned Along the Way)


Discover how @platformatic/kafka achieved a remarkable 223% performance boost by overhauling its benchmarking methodology and addressing critical bottlenecks, ultimately outperforming native clients. Learn the key lessons from this optimization journey.


By Paolo Insogna

A few months ago, we published an article explaining our motivation for building yet another Kafka client for Node.js. The initial benchmarks were highly promising, showing superior performance to KafkaJS and competitive results against native clients. However, a significant concern emerged: these numbers didn't align with our observations in actual production environments.

Despite continued testing and analysis, the results consistently failed to reflect our real-world production experience. We observed high variance and small sample sizes, leading to a lack of confidence in whether we were truly measuring what we intended.

This prompted a fundamental re-evaluation of our approach. Our goal was not merely to make @platformatic/kafka faster, but to first ensure that our testing methodology was sound. It turned out our initial approach was indeed flawed. Correcting this put us on a path that ultimately led to substantial performance improvements.

Performance Summary

Here’s a snapshot of the performance achieved with v1.21.0:

  • Producer (Single Message): 92,441 operations/second — 48% faster than KafkaJS
  • Producer (Batch): 4,465 operations/second — 53% faster than KafkaJS
  • Consumer: 159,828 operations/second — 9% faster than our previous version

Notably, the single-message producer performance represents a remarkable 223% improvement over v1.16.0.

Benchmark Methodology Issues

Our initial benchmarking for the first blog post employed what appeared to be a standard method: send messages, measure elapsed time, and calculate operations per second.

The core problem was that timing measurements were only captured every 100 messages. Furthermore, for rdkafka-based libraries, we weren't properly awaiting delivery reports. Essentially, messages were sent without tracking their actual acknowledgment. This made our timing measurements inconsistent and inherently unreliable.

Our initial results clearly reflected these methodological flaws:

Notice the significant variance for node-rdkafka at ±67.58%. Such high variance indicates unreliable measurements. Additionally, a sample size of only 100 operations was statistically insufficient.

We undertook a complete rewrite of our benchmark suite, incorporating the following critical improvements:

  • Per-operation timing: Instead of sampling every 100 messages, we now measure the timing for each individual operation. This provides significantly more granular data and drastically reduces variance.
  • Proper delivery tracking: For rdkafka-based libraries, we now send a message and explicitly wait for its specific delivery report before timing the next operation. This guarantees accurate per-message timing.
  • Substantially larger sample sizes: We increased our sample size from 100 to 100,000 for most tests. While this extends execution time, it yields statistically meaningful and reliable results.
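To make the first two points concrete, here is a minimal sketch of per-operation timing with `process.hrtime.bigint()`. This is illustrative only, not the actual suite from BENCHMARKS.md: each operation is awaited and timed individually, which is what yields both the ops/second figures and the ±% spread quoted throughout this article.

```javascript
// Minimal per-operation timing sketch: measure every operation individually
// instead of sampling once per 100 messages. Illustrative only.
async function benchmark (operation, iterations) {
  const samples = new Array(iterations);

  for (let i = 0; i < iterations; i++) {
    const start = process.hrtime.bigint();
    await operation(); // e.g. send one message and await its acknowledgment
    samples[i] = Number(process.hrtime.bigint() - start); // nanoseconds
  }

  const mean = samples.reduce((a, b) => a + b, 0) / iterations;
  const variance = samples.reduce((a, b) => a + (b - mean) ** 2, 0) / iterations;
  const stddev = Math.sqrt(variance);

  return {
    opsPerSecond: 1e9 / mean,
    // Relative spread, comparable to the ±% figures in the tables
    relativeStddev: (stddev / mean) * 100
  };
}
```

Awaiting each operation before starting the next timer is also exactly what proper delivery tracking means for the rdkafka-based clients: a sample only ends when the delivery report arrives.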

When these corrected methodologies were applied, the benchmark numbers changed dramatically across all libraries, especially for those based on rdkafka:

| Library | Producer Single | Producer Batch | Consumer |
| --- | --- | --- | --- |
| @platformatic/kafka v1.21.0 | 92,441 op/s | 4,465 op/s | 159,828 op/s |
| @platformatic/kafka v1.16.0 | 28,596 op/s | 3,779 op/s | 146,862 op/s |
| KafkaJS | 62,450 op/s | 2,923 op/s | 120,279 op/s |
| node-rdkafka | 16,488 op/s | 701 op/s | 133,526 op/s |
| Confluent KafkaJS | 19,721 op/s | 2,311 op/s | 139,881 op/s |
| Confluent rdkafka | 21,587 op/s | 2,648 op/s | 127,146 op/s |

It's crucial to understand that the underlying libraries themselves hadn't changed; we had simply begun measuring their performance accurately. However, these improved benchmarks also highlighted specific performance bottlenecks within our own @platformatic/kafka implementation that required immediate attention.

Identifying and Addressing Performance Bottlenecks

With precise measurements now in place, we could accurately pinpoint where @platformatic/kafka was spending its time and where optimization opportunities existed.

While our v1.16.0 performance numbers were respectable (28,596 operations/second for single messages), the ±34.18% variance was a significant concern. In production environments, such high variance translates directly to unpredictable latency spikes, which fundamentally contradicts our design objectives.

We initiated systematic profiling, and the first major bottleneck identified was CRC32C computation. We were calculating checksums for every message (a requirement of the Kafka protocol) using a pure JavaScript implementation. While functional, this approach exhibited both low throughput and high variance.

To address this, we integrated @node-rs/crc32, a native Rust implementation (see #126). The improvement was immediate and substantial, not only in throughput but also in consistency, leading to significantly more predictable timing.
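For context, a table-driven CRC-32C in pure JavaScript looks roughly like this. This is an illustrative sketch of the kind of code a native call replaces, not our actual implementation:

```javascript
// Build the CRC-32C (Castagnoli) lookup table once, at module load.
const CRC32C_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0x82f63b78 ^ (c >>> 1) : c >>> 1;
  }
  CRC32C_TABLE[n] = c >>> 0;
}

// Process the input one byte at a time through the lookup table.
function crc32c (buffer) {
  let crc = 0xffffffff;
  for (let i = 0; i < buffer.length; i++) {
    crc = CRC32C_TABLE[(crc ^ buffer[i]) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}
```

Every single message pays for this byte-by-byte loop, so swapping it for one native call both raised throughput and removed a source of timing jitter.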

@baac0 contributed a pull request that refactored error handling in request serialization (see #154). Initially, this was viewed primarily as code cleanup, but this assessment proved incorrect. By handling errors asynchronously rather than blocking the serialization path, we eliminated an entire category of event loop blockages, resulting in a substantial increase in throughput.
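To illustrate the pattern (with hypothetical names, not the actual code from #154): instead of letting a failure unwind synchronously through the serialization hot path, the error is deferred to a later tick, so the loop driving serialization keeps running.

```javascript
// Hypothetical sketch of asynchronous error handling in a serializer.
// Names are illustrative, not @platformatic/kafka internals.
function serializeMessage (message, callback) {
  let payload;

  try {
    // Stand-in for the real wire-format encoder
    payload = JSON.stringify(message);
  } catch (error) {
    // Defer the failure to a later tick instead of throwing on the hot path.
    process.nextTick(callback, error);
    return;
  }

  process.nextTick(callback, null, Buffer.from(payload));
}
```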

@jmdev12 identified a subtle bug in our metadata request handling (see #144). We were improperly mixing callbacks within kPerformDeduplicated, occasionally causing requests to hang or retry unnecessarily. Resolving this issue significantly improved connection handling reliability.

We also introduced a handleBackPressure option (see #127) to provide users with finer control over flow control behavior. While the Kafka protocol includes back-pressure mechanisms where brokers can signal clients to slow down, our previous implementation wasn't handling this consistently. The new option allows for fine-tuning how the client responds to these signals.

After implementing these changes, we re-ran the benchmarks:

The result: an improvement from 28,596 to 92,441 operations/second—a 223% gain. More significantly, observe the dramatic variance reduction to just ±1.05%.

Batch Processing Performance

While single-message performance is crucial for real-time event streaming, many Kafka workloads involve bulk data pipelines that send hundreds or thousands of messages in batches.

Our batch performance was already competitive in v1.16.0 (3,779 operations/second for batches of 100 messages). With the same optimizations applied, we observed further improvements:

This represents an 18% improvement, bringing batch performance to 4,465 operations/second. Crucially, we now outperform KafkaJS by 53% in batch scenarios—a performance difference that becomes substantial when processing millions of messages daily.

Consumer Performance Improvements

Our consumer implementation was performing well in initial tests, but we identified and addressed several bugs. Issues included flawed partition assignment logic (see #138) and edge cases in lag computation that could produce incorrect results (see #153).
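For readers unfamiliar with the metric, per-partition lag is the distance between the broker's high-watermark and the next offset the consumer will fetch. A hedged sketch with illustrative names (not @platformatic/kafka's internals), showing the kind of edge cases such a computation must guard against:

```javascript
// Hypothetical per-partition lag computation. Names are illustrative.
function computeLag (highWatermark, fetchOffset) {
  // Guard the edge cases: before the first fetch either value may be unknown,
  // and a stale high-watermark could otherwise yield a negative lag.
  if (typeof highWatermark !== 'bigint' || typeof fetchOffset !== 'bigint') {
    return 0n;
  }

  const lag = highWatermark - fetchOffset;
  return lag > 0n ? lag : 0n;
}
```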

Addressing these bugs led to a performance improvement from 146,862 to 159,828 operations/second:

While the 9% throughput increase is valuable, the reduction in variance to ±1.75% is arguably more significant. This compares very favorably to node-rdkafka's ±19.16% and the Confluent clients' ±18-24% variance. In production environments, consistent performance often outweighs peak throughput.

Performance Architecture: How Pure JavaScript Outperforms Native Bindings

We frequently encounter questions about how a pure JavaScript implementation can outperform native bindings to librdkafka. The answer isn't a single "silver bullet" optimization but rather the cumulative effect of multiple deliberate architectural decisions:

  • Minimal buffer copying: Every buffer allocation and copy adds overhead and garbage collection pressure. We designed the entire protocol handling layer to work with buffer slices and views wherever possible. When processing over 90,000 messages per second, avoiding unnecessary allocations significantly impacts both throughput and latency consistency.
  • Direct protocol implementation: There's no abstraction layer between our application code and the Kafka wire protocol. Less indirection means fewer function calls, reduced stack manipulation, and more predictable performance characteristics. This also grants us the flexibility to optimize hot paths without architectural constraints.
  • Non-blocking event loop usage: Node.js performs optimally when leveraged according to its design principles—specifically, with asynchronous operations that do not block the event loop. The error handling refactor was particularly impactful here. We had been blocking on error serialization in several code paths, and eliminating these blocks substantially reduced latency spikes.
  • Proper stream implementation: Node.js streams provide built-in back-pressure management when used correctly. When network sockets become full, the stream automatically pauses writes. Similarly, when consumers cannot keep pace, the fetch loop pauses. This ensures predictable memory usage and prevents unbounded memory growth.
  • Hot path optimization: Operations like CRC32C checksums, Murmur2 partition hashing, and varint encoding execute for every single message. We profiled these operations extensively, optimized them, and profiled again. The migration to native CRC32C via Rust was the largest single improvement, but countless smaller optimizations compound significantly at scale.
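The first point is easy to demonstrate with Node.js Buffers: `subarray()` returns a view over the same memory, while `Buffer.from()` allocates and copies. (The frame layout here is a made-up example, not the Kafka wire format.)

```javascript
// Zero-copy in miniature: a view shares the source Buffer's memory,
// while a copy gets its own allocation.
const frame = Buffer.from('0005hello', 'ascii'); // pretend frame: length prefix + payload

const view = frame.subarray(4, 9); // no allocation: shares frame's memory
const copy = Buffer.from(view);    // new allocation: independent bytes

view[0] = 0x48; // 'H': mutating the view also mutates frame, but not the copy
```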

It's important to acknowledge that librdkafka itself implements similar optimizations and is exceptionally well-optimized C code. However, it must cross the Node.js/C++ boundary for nearly every operation, and that boundary crossing carries measurable overhead. By remaining entirely within JavaScript, we bypass that overhead altogether.

The Journey Continues

What began as a nagging doubt about our benchmark methodology evolved into something far more valuable: a comprehensive understanding of our library's performance characteristics and a remarkable 223% improvement in single-message throughput.

The lessons gleaned from this experience are worth emphasizing:

  1. Measurement matters: Flawed benchmarks don't just waste time; they obscure real performance issues. By correcting our methodology, we exposed bottlenecks we hadn't even known existed.
  2. Community contributions are invaluable: The pull requests from our contributors didn't merely fix bugs—they fundamentally improved our throughput and reliability.
  3. Consistency matters as much as peak performance: Reducing variance from ±34% to ±1% means your p99 latencies become predictable, which is precisely what production systems truly require.

The results speak for themselves: @platformatic/kafka v1.21.0 now delivers 92,441 operations/second for single messages and 159,828 operations/second for consumption, with variance consistently under ±2% across all scenarios. It is a 99% pure JavaScript library, yet it confidently outperforms libraries built on highly optimized C code.

If you're developing Node.js applications where Kafka performance is critical, we strongly encourage you to evaluate @platformatic/kafka:

npm install @platformatic/kafka

We invite you to run the benchmarks on your own infrastructure; we've published the complete test suite in BENCHMARKS.md. Test it against your specific workload patterns. And if you discover issues or have optimization ideas, we warmly welcome contributions at github.com/platformatic/kafka. After all, that's precisely how we achieved these improvements.

All benchmarks were executed on an M2 Max MacBook Pro with Node.js 22.19.0 against a three-broker Kafka cluster. Results may vary based on hardware and network configurations, though relative performance characteristics should remain comparable.