Benchmarking Pitfalls: Understanding and Preventing Compiler Optimizations

Programming

Discover common pitfalls in code benchmarking caused by aggressive compiler optimizations. Learn why compilers like LLVM can elide or constant-fold computations, leading to inaccurate performance metrics. Explore practical strategies, including dynamic input parameters and result validation, to ensure your benchmarks accurately reflect real-world performance without relying on platform-specific 'black box' functions.

The article discusses how compilers can aggressively optimize code, which is usually beneficial for performance but detrimental for accurate benchmarking. It illustrates how compilers, such as LLVM, can replace iterative summations with closed-form mathematical expressions (like Gauss's formula) and even optimize more complex polynomial sums. This behavior means that instead of measuring the intended computation, one might end up benchmarking a highly optimized no-op.

The author highlights two primary pitfalls in benchmarking:

  1. Entire Computation Elision: If a computation() function's result is not used, a smart compiler might completely remove the function call, leading to a benchmark that measures virtually nothing.

     ```zig
     const start = now();
     _ = computation(); // Compiler might optimize this away if result is unused
     const elapsed = now() - start;
     ```
  2. Constant Folding: Even if the computation isn't entirely removed, compilers can perform constant folding when parameters are known at compile time. This can make parts of the computation much faster than they would be in a real-world scenario where inputs vary at runtime.

     ```zig
     const parameter_a = 1_000_000;
     const parameter_b = 1_000;

     const start = now();
     _ = computation(parameter_a, parameter_b); // Parts might be constant-folded
     const elapsed = now() - start;
     ```

For a practical demonstration, refer to this example: https://godbolt.org/z/T9EcTb8zq

While many languages offer explicit functions like Rust's hint::black_box or Zig's mem.doNotOptimizeAway to prevent optimizations, the author finds these "dragon oil" solutions problematic. Their semantics are often tricky, as they contradict the compiler's core principle of semantic-preserving transformations by introducing a "transparent" but semantically unexplainable element into the pipeline.

A simpler and more direct approach is recommended, which can be summarized as "opening the box and checking if the cat is there."

Consider benchmarking a binary search function, fn insertion_point(xs: []const u32, x: u32) usize { ... }, using the following scaffold:

```zig
fn benchmark(arena: Allocator) !void {
    const element_count = try parameter("element_count", 1_000_000);
    const search_count = try parameter("search_count", 10_000);

    const elements: []const u32 = make_elements(arena, element_count);
    const searches: []const u32 = make_searches(arena, search_count);

    const start = now();
    var hash: usize = 0; // usize, so it can absorb insertion_point's result
    for (searches) |key| {
        hash +%= insertion_point(elements, key);
    }
    const elapsed = now().duration_since(start);

    print("hash={} ", .{hash});
    print("elapsed={} ", .{elapsed});
}

fn parameter(comptime name: []const u8, default: u64) !u64 {
    const value = if (process.hasEnvVarConstant(name))
        try process.parseEnvVarInt(name, u64, 10)
    else
        default;
    print(name ++ "={} ", .{value});
    return value;
}
```

This approach tackles compiler optimizations in two ways:

  1. Runtime Overridable Parameters: The parameter function retrieves values from environment variables, falling back to a default. Since the values can be specified at runtime, the compiler cannot assume them to be constants at compile time, effectively preventing constant folding. This also provides the added benefit of easily re-running benchmarks with different parameters without recompilation.
  2. Utilizing Computation Results: A simple "hash" (in this case, a sum of all binary search indexes) is computed and then printed. By using and printing the result of the computation, the compiler is forced to execute the entire benchmark logic, preventing it from being optimized away.

Beyond ensuring accurate timing, this method offers a valuable bonus: a built-in correctness check. If an optimization introduces a bug or leads to an incorrect answer, the computed hash will immediately reflect an unexpected value, signaling an issue that might otherwise go unnoticed.

The author advises against relying on "black box" functions for future benchmarks. Instead, one should employ natural anti-optimizing-compiler remedies:

  • Make input parameters runtime overridable (with compile-time defaults).
  • Print the result (or a hash thereof).