Recent Enhancements to Rust Compiler Performance

Software Development

Discover the latest advancements in Rust compiler optimization, including improvements to VecCache and libc compilation, rustdoc-json enhancements, and strategies for accelerating large API crate builds and macOS performance.

The Rust compiler has seen significant performance improvements over the past six months, with ongoing efforts to enhance build speeds and memory efficiency.

Compiler Improvements

Several key optimizations have been integrated:

  • #142095 (VecCache Optimization): An optimization targeting VecCache, a key-value store for densely-numbered IDs, was implemented. By optimizing the common case where keys reside in the first segment (holding 4096 entries), this change yielded instruction count (icount) reductions across numerous benchmarks, exceeding 4% in some scenarios.
  • #148040 (Trivial Consts Fast Path): A fast path was added for lowering trivial constants, dramatically reducing compile times for the libc crate by 5-15%. This is a notable improvement given libc's prominence (ranked #12 by recent downloads, #7 by all-time downloads on crates.io). This optimization also reduced icounts for other benchmarks by up to 10%.
  • #147293 (Query System Debug Call Optimization): A computation feeding a debug! call on a hot path in the query system was made conditional, so it now runs only when debug logging actually needs it. This resulted in icount reductions across many benchmarks, exceeding 3% in the best cases. It is a classic example of the kind of micro-optimization described in "The Rust Performance Book".
  • #148706 (Temporary Scope Handling): Handling of temporary scopes was optimized, leading to icount reductions on several benchmarks (up to 3%) and a 5% reduction in peak memory usage for secondary benchmarks involving very large literals.
  • #143684 (LLVM 21 Upgrade): The Rust compiler's underlying LLVM version was upgraded to LLVM 21. Historically, each LLVM update has boosted Rust compiler speed, and this one achieved a mean icount reduction of 1.70% and a mean cycle-count reduction of 0.90% across benchmarks. Wall-time, the metric users actually perceive, nonetheless showed a slight mean increase of 0.26%. Icount and cycle-count improvements usually correlate well with wall-time for significant changes, so this discrepancy is an anomaly, and it raises questions about how representative the test machine is.
  • #148789 (format_args!() and fmt::Arguments Reimplementation): format_args!() and fmt::Arguments were reimplemented for improved space efficiency, yielding numerous small icount gains and substantial reductions (30-38%) on the large-workspace stress test. The write-up for this change goes deep into the optimization details, including memory-layout diagrams, for readers interested in intricate technical improvements.
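The VecCache fast path described above can be illustrated with a small sketch. This is not the compiler's actual implementation; SegmentedCache and its methods are made-up names, with a 4096-entry first segment chosen to mirror the description of the common case:

```rust
// Illustrative sketch of a segmented key-value store for densely-numbered
// IDs. Lookups and inserts for IDs below 4096 hit the first segment
// directly; larger IDs fall through to rarely-used overflow segments.
const FIRST_SEGMENT: usize = 4096;

struct SegmentedCache<V> {
    first: Vec<Option<V>>,     // covers IDs 0..4096 (the common case)
    rest: Vec<Vec<Option<V>>>, // overflow segments, rarely touched
}

impl<V: Clone> SegmentedCache<V> {
    fn new() -> Self {
        Self { first: vec![None; FIRST_SEGMENT], rest: Vec::new() }
    }

    fn insert(&mut self, id: usize, value: V) {
        if id < FIRST_SEGMENT {
            // Fast path: a single indexed write into the first segment.
            self.first[id] = Some(value);
        } else {
            // Slow path: locate (and grow, if needed) an overflow segment.
            let off = id - FIRST_SEGMENT;
            let seg = off / FIRST_SEGMENT;
            while self.rest.len() <= seg {
                self.rest.push(vec![None; FIRST_SEGMENT]);
            }
            self.rest[seg][off % FIRST_SEGMENT] = Some(value);
        }
    }

    fn get(&self, id: usize) -> Option<&V> {
        if id < FIRST_SEGMENT {
            self.first[id].as_ref() // fast path: no division, no extra lookup
        } else {
            let off = id - FIRST_SEGMENT;
            self.rest.get(off / FIRST_SEGMENT)?[off % FIRST_SEGMENT].as_ref()
        }
    }
}

fn main() {
    let mut cache = SegmentedCache::new();
    cache.insert(7, "query result");
    cache.insert(5000, "rare large id");
    assert_eq!(cache.get(7), Some(&"query result"));
    assert_eq!(cache.get(5000), Some(&"rare large id"));
    assert_eq!(cache.get(9999), None);
}
```

The point of the optimization is that the first branch covers the overwhelming majority of lookups, so keeping it free of division and extra indirection pays off across many benchmarks.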

Procedural Macro Optimizations in Bevy

A new compiler flag, -Zmacro-stats, introduced in June, helps measure code generated by macros. This flag was instrumental in optimizing #[derive(Arbitrary)] from the arbitrary crate (used for fuzzing) and streamlining code generated by #[derive(Reflect)] in Bevy.

The #[derive(Reflect)] macro, used for reflection on many types, previously generated a significant amount of code. In the bevy_ui crate, for instance, macro expansion almost quadrupled the code size: to the crate's 16,000 lines (563,000 bytes) of source it added a further 27,000 lines (1,544,000 bytes) of generated code.

Subsequent pull requests attacked redundancies in the generated code: unnecessary calls, duplicate type bounds, const _ blocks, closures, arguments, trait bounds, attributes, and impls were removed, and repeated code was factored out.
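As a rough illustration of what "factoring out repetitions" means for macro output (the names below, such as describe_impl, are invented for this sketch and are not Bevy or Reflect APIs): instead of every derived impl expanding the same multi-line body, the generated code can delegate to one shared helper, so each expansion shrinks to a single call:

```rust
// Shared helper: the repeated logic lives here once, compiled once,
// rather than being re-expanded inside every derived impl.
fn describe_impl(type_name: &str, field_names: &[&str]) -> String {
    format!("{} {{ {} }}", type_name, field_names.join(", "))
}

struct Window;
struct Cursor;

// What an optimized derive might emit per type: a one-line call into
// the shared helper instead of a full inline body.
impl Window {
    fn type_description() -> String {
        describe_impl("Window", &["title", "width", "height"])
    }
}

impl Cursor {
    fn type_description() -> String {
        describe_impl("Cursor", &["icon", "visible"])
    }
}

fn main() {
    assert_eq!(Cursor::type_description(), "Cursor { icon, visible }");
    println!("{}", Window::type_description());
}
```

With hundreds of derived types, trimming each expansion this way compounds into the kind of large reductions in generated code reported below.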

As a result, for the bevy_window crate, the code generated by #[derive(Reflect)] was reduced by 39%. This led to a 16% reduction in cargo check wall-time and a 5% decrease in peak memory usage for that crate. Similar improvements are anticipated across other Bevy crates and projects utilizing #[derive(Reflect)].

Procedural macros are challenging to write, and until recently there was no easy way to measure how much code they generate. Yet the point of generation is the most efficient place to optimize: the macro has context-specific information that later compiler stages would otherwise have to reconstruct, and emitting less code in the first place means less to parse and store in memory.

rustdoc-json Enhancements

Discussions at RustWeek 2025 concerning rustdoc-json (invoked with --output-format=json) and its impact on cargo-semver-checks performance led to a key optimization:

  • #142335 (Allocation Reduction): This PR reduced the number of allocations performed by rustdoc-json, resulting in wall-time reductions of up to 10% and peak memory usage reductions of up to 8%.

Further attempts to improve rustdoc-json's speed were less successful. JSON's simplicity and readability benefit newcomers, but its space inefficiency limits performance for heavy-duty tools like cargo-semver-checks when processing large codebases. The obvious space optimizations, such as shortening field names, omitting default values, or interning strings, would compromise readability and flexibility. For performance-oriented users, the likely answer is a new, specialized output format; a draft attempt exists in #142642.
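To make the interning trade-off concrete, here is a minimal string-interner sketch (illustrative only, not rustdoc's code): strings that recur throughout the output, such as type paths, are stored once and referenced by a numeric ID, which cuts allocations and output size but makes the raw data unreadable without the lookup table:

```rust
use std::collections::HashMap;

// Minimal string interner: each distinct string is allocated once and
// afterwards referred to by a small integer ID.
#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>, // string -> ID, for deduplication
    strings: Vec<String>,      // ID -> string, for resolution
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.map.get(s) {
            return id; // repeat: no new allocation
        }
        let id = self.strings.len() as u32;
        self.map.insert(s.to_owned(), id);
        self.strings.push(s.to_owned());
        id
    }

    fn resolve(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("core::option::Option");
    let b = interner.intern("core::option::Option");
    assert_eq!(a, b); // duplicates share one stored copy
    assert_eq!(interner.resolve(a), "core::option::Option");
}
```

In a JSON document, emitting IDs like 17 in place of long repeated paths shrinks the output considerably, which is exactly why it helps performance and exactly why it hurts human readability.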

Faster Compilation of Large API Crates

An experimental flag, -Zhint-mostly-unused, was introduced to tell the compiler that most of a crate's items are likely to go unused, allowing it to defer work on them. This can significantly improve compile times for users who consume only a small fraction of a very large crate, and is particularly beneficial for substantial API crates such as windows, rustix, and aws-sdk-ec2.

Faster Rust Builds on macOS

macOS offers a hidden setting that can accelerate Rust build times.

General Progress

Performance measurement was split into two periods because the test machine changed in July. The first period (May 20 to June 30, 2025, old machine) showed a mean wall-time reduction of 3.19%; the second (July 1 to December 3, 2025, new machine) showed a mean wall-time reduction of 2.65%. Mean peak memory usage changes were mixed (+1.18% and -1.50%), and mean binary size saw increases of 0.45% and 2.56%.

Despite mixed metrics, the overall reduction in wall-times is positive. Compilers naturally tend to slow down with a continuous stream of bug fixes and new features if active performance work is absent. Therefore, even minor improvements are valuable. Notably, the new test machine itself contributed to an approximate 20% reduction in wall-times, highlighting the potential benefits of hardware upgrades.