Revolutionizing Rust Documentation: Introducing Arborium for Advanced Syntax Highlighting


Explore Arborium, a Rust-based solution leveraging tree-sitter for robust syntax highlighting in docs.rs and beyond. This article details integration challenges and evaluates client-side, rustdoc, and backend approaches for enhanced code readability.

Approximately two weeks ago, a discussion began about a visual limitation of docs.rs: code blocks in languages other than Rust are rendered without syntax highlighting. The goal was to have them displayed with full highlighting instead. Naturally, there are reasons for the current state of affairs. An investigation into them, started in a GitHub issue, led to a brief but productive discussion. The path forward still remained unclear, which prompted an exploration of three distinct approaches to the problem.

Before delving into the solutions, it's essential to understand the foundational context.

Background

Rust provides a tool, rustdoc, for generating HTML and JSON documentation for crates directly from doc comments (/// for items, //! for the enclosing module or crate). This capability is highly beneficial, allowing users to access offline documentation and preview its appearance prior to publication. Once the documentation for a crate is refined and completed, the crate is published to crates.io.

Publishing places the crate into one of two build queues at docs.rs. Upon successful compilation, the documentation is stored in a 7.75TiB data bucket and gets a dedicated space on the internet, with a navigation bar that ties it into the docs.rs ecosystem. This bucket contains immutable HTML, CSS, and JavaScript, which remain unchanged unless a fresh rustdoc build is performed. The docs.rs team performs such rebuilds for the latest versions of all crates but not for historical versions. This immutability explains a primary challenge in implementing features like syntax highlighting: rebuilding every version of every crate ever published with a new highlighting feature is simply not feasible.

Problems

The challenge extends beyond the immutability of historical documentation. Several other issues arise when considering syntax highlighting:

Solution Selection: Many code highlighting solutions exist; choosing the optimal one is complex.
Language Support: Which programming languages should be supported?
Reliability & Quality: Can the chosen solution be trusted to run consistently and produce high-quality output?
Dynamic Linking: Does it necessitate dynamic linking, which can introduce deployment complexities?
Platform Compatibility: Does it build successfully across all target platforms supported by rustdoc?
HTML Markup Size: Syntax-highlighted code generally results in larger HTML output. How much larger? Is this size increase acceptable given storage and bandwidth considerations?
Implementation Effort: Who will undertake the significant effort required to implement and maintain this?

The answers, broadly, point to tree-sitter for 96 languages (chosen by popular demand), with affirmative responses to quality and platform compatibility, minimal size increase, and the commitment to implementation.

Solutions

tree-sitter has been used for six years of extensive website development and has established itself as a gold standard for syntax highlighting. While a Language Server Protocol (LSP) server could offer superior semantic highlighting, its resource requirements (loading source code, dependencies, and the entire sysroot) make it impractical for generating documentation offline.

Note: LSP (Language Server Protocol) is the communication standard used by tools like rust-analyzer and code editors for semantic highlighting. However, its significant demands on time and memory (due to loading all source code and dependencies) render it unsuitable for offline syntax highlighting.

While crates for the tree-sitter core and tree-sitter-highlight exist, assembling the complete solution typically requires manual effort. This involves finding appropriate grammars for each language. Languages like Rust or C++ benefit from readily available, high-quality, up-to-date grammars from the tree-sitter-grammars GitHub organization. However, for less common languages, finding or creating a suitable grammar can be time-consuming, sometimes necessitating cleanup and regeneration from older tree-sitter versions, potentially removing rules that cause compilation issues.

Note: "Regenerate" refers to processing a grammar's grammar.js and potentially scanner.cc files through the tree-sitter CLI to generate the C code required for the parser.

This process must be repeated for every language intended for highlighting. After collecting 18 different grammars and facing the same highlighting needs across various projects, the idea emerged to create a comprehensive solution for the community.

These grammars, along with their automatically generated crates, export a single symbol: a pointer to a struct containing parsing tables and scanner function pointers. Beyond the basic parsing functionality, grammars often ship highlight and injection queries.

Highlight queries map nodes of the parse tree to categories such as keywords, functions, numbers, and strings, which is what makes it possible to apply meaningful colors. Without them, a parse tree exists, but its nodes carry no information about how they should be colored.

Injection queries, on the other hand, define how other grammars are nested within the primary one. For instance, Svelte components embed JavaScript and CSS within HTML, and sometimes TypeScript, so these languages must be injected. While tree-sitter-highlight offers a callback system for injections, managing dependencies and implementing this callback falls to the developer.
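The dependency side of injections can be illustrated with a small sketch. It assumes a hypothetical table that lists, for each language, the languages its injection queries may embed, and it computes every grammar that has to be available to highlight a given root language. The table contents and the function names are illustrative assumptions, not arborium's or tree-sitter-highlight's API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical injection table: for each language, the languages its
/// injection queries may embed. The entries are illustrative only.
fn injection_table() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("svelte", vec!["html", "css", "javascript", "typescript"]),
        ("html", vec!["css", "javascript"]),
        ("css", vec![]),
        ("javascript", vec![]),
        ("typescript", vec![]),
    ])
}

/// Returns every grammar needed to highlight `root`, following injections
/// transitively. Languages missing from the table are kept but not expanded.
fn required_grammars(
    table: &BTreeMap<&'static str, Vec<&'static str>>,
    root: &'static str,
) -> BTreeSet<&'static str> {
    let mut needed = BTreeSet::new();
    let mut stack = vec![root];
    while let Some(lang) = stack.pop() {
        // `insert` returns false for a language that was already visited,
        // which also protects against cycles in the table.
        if needed.insert(lang) {
            if let Some(children) = table.get(lang) {
                stack.extend(children.iter().copied());
            }
        }
    }
    needed
}
```

With this table, `required_grammars(&injection_table(), "svelte")` yields the Svelte grammar plus HTML, CSS, JavaScript, and TypeScript. An injection callback can only succeed if all of those grammars were compiled in.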

This complex setup, honed over six years of personal experience, led to the creation of arborium.

Arborium

Introducing arborium (homepage: arborium.bearcove.eu).

For the 96 languages most frequently requested, arborium provides curated grammars. Each grammar has been sourced, integrated, refined, and validated to ensure its highlight queries function correctly. Licenses and attributions are meticulously preserved, and each language is integrated into the main arborium crate via cargo feature flags.

arborium also manages dependencies between grammars. For example, if a project depends on Svelte, it automatically includes the crates needed to highlight HTML, CSS, and JavaScript within Svelte components. Similar to the core tree-sitter crates, arborium's individual grammar crates offer limited standalone functionality. Their primary utility is realized through the main arborium crate, which provides straightforward APIs for code highlighting:

use arborium::Highlighter;

let mut highlighter = Highlighter::new();
let html = highlighter.highlight_to_html("rust", "fn main() {}")?;

This example hides the incremental parsing and highlighting capabilities of tree-sitter behind a single call; more complex APIs are available for more intricate needs.

The highlighting output is highly configurable, from the chosen theme (with several built-in options) to the style of HTML. The default output uses a modern, compact, and widely supported format:

<a-k>keyword</a-k>

For those preferring a more traditional and verbose style, perhaps with Brotli compression mitigating size concerns, an alternative is available:

<span class="code-keyword">keyword</span>
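To make the size difference between the two formats concrete, here is a minimal sketch that renders one keyword token in either style. Only the `a-k` tag and the `code-keyword` class come from the examples above. The `Style` enum and both function names are hypothetical and are not arborium's API.

```rust
/// Output style for a highlighted token. `Compact` mirrors the short
/// custom-element form and `Verbose` the span-with-class form shown above.
#[derive(Clone, Copy)]
enum Style {
    Compact,
    Verbose,
}

/// Escapes the characters that are unsafe inside HTML text content.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Renders one keyword token in the requested style.
fn render_keyword(style: Style, text: &str) -> String {
    let body = escape_html(text);
    match style {
        Style::Compact => format!("<a-k>{body}</a-k>"),
        Style::Verbose => format!("<span class=\"code-keyword\">{body}</span>"),
    }
}
```

Before compression, the compact form adds 11 bytes of markup per keyword token and the verbose form adds 34, which is why the choice matters for pages with many highlighted tokens.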

For terminal users, arborium can generate output with ANSI escape codes, supporting optional background colors, margins, padding, and borders to enhance visibility.

Crucially, the Rust crates within arborium are configured to compile for the wasm32-unknown-unknown target via Cargo. This was a significant challenge, requiring the provision of just enough libc symbols to satisfy the grammars' requirements. The arborium-sysroot/wasm-sysroot provides these, including headers like assert.h, ctype.h, endian.h, and inttypes.h.

A point of clarification: earlier demonstrations of a "WASM playground" generated with tree-sitter build --wasm and tree-sitter playground target wasm32-wasi. arborium, however, targets wasm32-unknown-unknown, which requires a custom provision of system functions. Most functions are simple (e.g., isupper, islower), with memory management functions like malloc and free provided by dlmalloc. Since all these crates compile with a Rust (and underlying C) toolchain to wasm32-unknown-unknown, they can run in a browser with a minimal amount of glue code.
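As a rough illustration of how small most of these functions are, the sketch below implements `isupper` and `islower` for the "C" locale with a C calling convention. It does not claim to reproduce arborium's sysroot, and the text does not say whether the real implementations are written in C or in Rust. A real shim would also be exported under its unmangled C name so that the grammar's C code can link against it, and that export attribute is deliberately left out here.

```rust
use std::os::raw::c_int;

/// Sketch of `isupper` for the "C" locale: non-zero for 'A' through 'Z',
/// zero for everything else (including EOF, which is negative).
pub extern "C" fn isupper(c: c_int) -> c_int {
    (c >= 'A' as c_int && c <= 'Z' as c_int) as c_int
}

/// Sketch of `islower` for the "C" locale: non-zero for 'a' through 'z'.
pub extern "C" fn islower(c: c_int) -> c_int {
    (c >= 'a' as c_int && c <= 'z' as c_int) as c_int
}
```

Functions like these need no allocator and no operating system, which is why only `malloc` and `free` require a real implementation such as dlmalloc.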

Three approaches were considered for integrating arborium into the documentation ecosystem.

Angle 1: Client-Side Script Inclusion

Currently, to enable syntax highlighting for non-Rust languages in published crate documentation, users can follow instructions on arborium.bearcove.eu. This involves creating an HTML file within the repository and adding metadata to Cargo.toml for docs.rs build process integration. This approach is demonstrated on the arborium_docsrs_demo page with its sources available in the arborium repository. The solution dynamically detects if it's running on docs.rs and responsively matches the active theme (light, dark, or Ayu) for consistency.

While the themes may not align with all aesthetic preferences, consistency was prioritized. This solution is advantageous because it is immediately functional and requires no additional effort from the Rust docs team, avoiding modifications to rustdoc, their build pipeline, or infrastructure. It serves as an effective escape hatch, similar to how others have integrated KaTeX for LaTeX equations or rendered diagrams (see rustdoc-katex-demo).

However, this client-side approach has significant drawbacks. It relies on both JavaScript and WebAssembly, forcing users to download potentially large grammar bundles (hundreds of kilobytes) even for small code blocks. More critically, it poses a security risk. Allowing third-party JavaScript into the main page context is generally ill-advised. While docs.rs currently offers limited exploitable data, this might change. This is considered poor practice, and the docs.rs team is likely aware of the vulnerability.

The security implications are substantial: if arborium were widely adopted client-side on docs.rs pages, a malicious update to the arborium package on npm could instantly compromise millions of users. While users could pin to specific package versions, this would prevent them from receiving crucial updates. Ideally, all JavaScript deployed on docs.rs pages should originate from the docs.rs team, limiting the attack surface. Consequently, for a long-term, resourced solution, alternative approaches are necessary.

Angle 2: Integration within Rustdoc

arborium comprises Rust crates containing highly portable C code, without dynamic linking, plugin folders, or asynchronous loading. It bundles grammars and the code needed for highlighting. Leveraging this, a pull request was submitted to rustdoc to add support for highlighting other languages (Rust PR #149944). Despite its seemingly modest size (+537, -11 lines), this PR effectively pulls in millions of lines of C code (tree-sitter generated parsers). This raises the critical question of which grammars to bundle, a decision that requires careful consideration.

Integrating all 96 languages significantly increases the rustdoc binary size: a custom rustdoc build with all languages compiled in measures 171MB, compared to 22MB for rustdoc on the main branch. (Figure: custom rustdoc with 96 languages compiled in, top; main-branch rustdoc, bottom.) An increase of this size is a potential point of contention for maintainers. Therefore, a third approach is proposed.
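To put the binary sizes quoted above in perspective, the arithmetic can be spelled out. The per-language figure is only a crude average and assumes the whole increase is attributable to the 96 grammars; individual grammars vary widely in size.

```rust
/// Sizes quoted in the text, in megabytes.
const RUSTDOC_MAIN_MB: f64 = 22.0;
const RUSTDOC_ALL_LANGS_MB: f64 = 171.0;
const LANGUAGE_COUNT: f64 = 96.0;

/// Returns (absolute increase in MB, growth factor, average MB per language).
fn binary_size_impact() -> (f64, f64, f64) {
    let increase = RUSTDOC_ALL_LANGS_MB - RUSTDOC_MAIN_MB;
    let factor = RUSTDOC_ALL_LANGS_MB / RUSTDOC_MAIN_MB;
    let per_language = increase / LANGUAGE_COUNT;
    (increase, factor, per_language)
}
```

That works out to an increase of 149MB, roughly a 7.8x growth of the binary, or about 1.55MB per language on average.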

Angle 3: Backend Processing Only

If integrating hundreds of programming, markup, and configuration languages into the client-side rustdoc is deemed unfeasible due to binary size, the alternative is to perform highlighting in the docs.rs backend.

This is where arborium-rustdoc comes in. It functions as a post-processor specifically designed for rustdoc. It detects code blocks within HTML files and applies highlighting. Additionally, it patches the main CSS file to append its own styles. Testing arborium-rustdoc on all dependencies of the facet monorepo revealed a minimal impact on documentation size, with the ~900MB doc folder increasing by only 24KB. This low overhead suggests that this backend solution is highly feasible.
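The post-processing idea can be sketched in a few lines of standard-library Rust. This is not arborium-rustdoc's implementation. It assumes that non-Rust code blocks appear in the generated HTML as `<pre class="language-LANG"><code>BODY</code></pre>`, and it takes the highlighter as a caller-supplied closure. The function name and the closure signature are hypothetical.

```rust
/// Rewrites every `<pre class="language-LANG"><code>BODY</code></pre>` block
/// in `html`, replacing BODY with `highlight(LANG, BODY)`.
/// Anything that does not match this exact (assumed) shape is left untouched.
/// Note that BODY is HTML-escaped text; a real tool would unescape it before
/// parsing and escape the highlighted output again.
fn highlight_code_blocks<F>(html: &str, mut highlight: F) -> String
where
    F: FnMut(&str, &str) -> String,
{
    const OPEN: &str = "<pre class=\"language-";
    const MID: &str = "\"><code>";
    const CLOSE: &str = "</code></pre>";

    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find(OPEN) {
        let lang_start = start + OPEN.len();
        // Copy everything up to and including the opening prefix verbatim.
        out.push_str(&rest[..lang_start]);
        rest = &rest[lang_start..];

        // The remainder must look like `LANG"><code>BODY</code></pre>`.
        let Some(lang_end) = rest.find('"') else { break };
        let Some(after_mid) = rest[lang_end..].strip_prefix(MID) else { continue };
        let Some(body_len) = after_mid.find(CLOSE) else { break };

        let (lang, body) = (&rest[..lang_end], &after_mid[..body_len]);
        out.push_str(lang);
        out.push_str(MID);
        out.push_str(&highlight(lang, body));
        out.push_str(CLOSE);
        rest = &after_mid[body_len + CLOSE.len()..];
    }
    // Whatever is left contains no further matching blocks.
    out.push_str(rest);
    out
}
```

Because the function only touches the body of matching blocks, the surrounding page structure stays byte-for-byte identical, which is consistent with the very small size increase reported above.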

Post-Mortem

The most challenging aspect of this project was the CI setup. While GitHub Actions is manageable for small packages, orchestrating 2x96 builds plus supporting packages, along with publishing with provenance to two platforms, proved exceptionally complex. Appreciation is extended to Depot.dev for their generous donation of powerful CI runners, which were instrumental in the project's completion.

To manage the complexity, plugin jobs were distributed into ten tree-themed groups in the CI workflow. Given the severe impact of CI failures, as much logic as possible was moved out of YAML configuration and into a cargo-xtask script, which proved to be quite effective. This included not just visual progress indicators, but also rigorous validation: ensuring every generated artifact could be loaded in a browser by parsing the WebAssembly bundle and checking its imports via walrus (a more robust approach than simple wasm-objdump piping). The build engineering involved extensive techniques, such as using Blake3 hashes for input caching.
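The input-caching idea can be sketched as follows: fingerprint the inputs of a build step and skip the step when the fingerprint matches the one recorded last time. To stay within the standard library, this sketch swaps Blake3 for `DefaultHasher`, which is not a cryptographic hash and whose output is not guaranteed to be stable across Rust releases, so it is only a stand-in. The `InputCache` type and its method are hypothetical and do not describe the project's xtask code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Fingerprint of a build step's inputs, given as (file name, contents) pairs.
/// DefaultHasher stands in for Blake3 here; a real on-disk cache should use
/// a stable, collision-resistant hash instead.
fn fingerprint(inputs: &[(&str, &[u8])]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (name, contents) in inputs {
        name.hash(&mut hasher);
        contents.hash(&mut hasher);
    }
    hasher.finish()
}

/// Hypothetical cache: step name -> fingerprint of the inputs last built.
#[derive(Default)]
struct InputCache {
    seen: HashMap<String, u64>,
}

impl InputCache {
    /// Returns true if `step` must be rebuilt, and records the new fingerprint.
    fn needs_rebuild(&mut self, step: &str, inputs: &[(&str, &[u8])]) -> bool {
        let current = fingerprint(inputs);
        let previous = self.seen.insert(step.to_string(), current);
        previous != Some(current)
    }
}
```

With nearly two hundred grammar builds per run, skipping steps whose grammar.js and scanner sources are unchanged is where most of the saved CI time would come from.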

Conclusion

Arborium has been developed for long-term sustainability, aiming for a 20-year lifespan. It is proudly donated to the commons under an Apache2+MIT license, with the hope of fostering accurate syntax highlighting across the web, mirroring the advancements seen in code editors.

tree-sitter has the potential to revolutionize language tooling once more, this time by simplifying the integration process for developers who lack the time or expertise to assemble its components themselves. Further details are available on the arborium website.

For docs.rs specifically, arborium-rustdoc as a post-processing step is the recommended solution. It offers speed, supports a comprehensive range of languages, and avoids the security and bundle size concerns associated with client-side or integrated rustdoc modifications. Furthermore, it can be easily sandboxed for enhanced security.