In Defense of Lock Poisoning in Rust: Ensuring System Correctness
Recent discussions have centered on the benefits and drawbacks of lock (mutex) poisoning in Rust, prompted by a proposal to make the default mutex non-poisoned, meaning it would silently unlock on panic. As a staunch advocate for lock poisoning, I aim to consolidate and present my views on this matter.
My core beliefs are:
- Unexpected cancellations within critical sections can severely compromise system correctness.
- Lock poisoning is vital for upholding the correctness of critical sections in Rust.
- Poisoning extends beyond mutexes, and providing a straightforward method to track such states (e.g., via a `Poison<T>` wrapper) offers significant value.
- While there's an inherent elegance in separating locking mechanisms from panic-induced poisoning, the critical importance of lock poisoning outweighs these conceptual concerns.
What is Poisoning?
Rust, like many multithreaded languages, employs mutexes: a mechanism to ensure that data is accessed by only one thread at a time. Rust's approach to mutexes is particularly robust:
- Rust's single-ownership model utilizes shared (`&`) and exclusive (`&mut`) references. Most data structures are designed so that mutations necessitate a `&mut` reference.
- In Rust, the data protected by a mutex is owned by the mutex itself. This differs from many other languages, where managing the mutex and its guarded data separately can lead to errors.
- Acquiring a lock begins with a shared reference (`&Mutex<T>`). Upon successful acquisition, a `MutexGuard<T>` is returned, signaling exclusive access to the guarded data.
- The `MutexGuard` provides a `&mut T`, granting exclusive access to the data.
- When the `MutexGuard` is dropped, the lock is released. The duration during which the lock is held is known as the critical section.
This design is generally sound. Consider an example involving the processing of incoming messages for a set of operations, where multiple threads might be involved, requiring a mutex to guard the internal state.
A simple implementation:
```rust
use std::{collections::HashMap, sync::Mutex};

struct OperationId(/* ... */);

enum OperationState {
    InProgress { /* ... */ },
    Completed { /* ... */ },
}

impl OperationState {
    // Here, `process_message` consumes self and returns self. In practice this
    // is often because the state has some internal data that requires
    // processing by ownership.
    fn process_message(self, message: Message) -> Self {
        match self {
            /* ... */
        }
    }
}

struct Operations {
    ops: Mutex<HashMap<OperationId, OperationState>>,
}

impl Operations {
    /// Process a message, updating the internal operation state appropriately.
    pub fn process(&self, id: &OperationId, message: Message) {
        // Obtain a lock on the HashMap.
        let mut lock = self.ops.lock().unwrap();

        // Once the lock has been acquired, it's guaranteed that no other
        // threads have any kind of access to the data. So a `&mut` reference
        // can safely be handed to us.
        //
        // This step is shown for pedagogical reasons. Generally, `ops` is not
        // obtained explicitly. Instead, lock.remove and lock.insert are used
        // directly, as `lock` dereferences to the underlying HashMap.
        let ops: &mut HashMap<_, _> = &mut *lock;

        // Retrieve the element from the map to process it.
        let Some(state) = ops.remove(id) else {
            // (return a not-found error here)
            return;
        };
        let next_state = state.process_message(message);
        ops.insert(id.clone(), next_state);

        // At this point, lock is dropped, and the mutex is available to other
        // threads.
    }
}
```
This represents a typical use of mutexes: safeguarding one or more invariants or properties. These invariants are maintained when the mutex is unlocked. In this case, the Operations::ops field is expected to provide complete and up-to-date tracking of all operations.
Crucially, while the mutex is held, the invariant is temporarily violated. To process a message, we remove an operation's state from the map, create a new state, and then reinsert it. During this interim, Operations::ops temporarily lacks that specific operation, thus not tracking all operations. This temporary violation is acceptable because no other threads observe this intermediate state. The code is responsible for restoring the operation to the map before the mutex is released.
However, this guarantee breaks down with unexpected errors. Practitioners often distinguish between two types of errors:
- An expected error is one that can occur during normal operation, such as a user specifying an unwritable directory.
- An unexpected error is one that should not occur during normal operation, like a fixed string literal failing to parse as a regex.
In Rust, expected errors are typically handled with the Result type, while unexpected errors lead to panicking. While this isn't a strict rule (some high-availability systems might model unexpected errors via Result, or scripts might handle all errors as panics), in production-grade Rust, panics are reserved for unexpected errors, and lock poisoning is designed around this assumption.
What if a Panic Occurs?
Consider a panic within OperationState::process_message. Rust offers two panic configurations:
- Unwind (default): The stack is unwound, and cleanup code is executed. Panics can be caught using `catch_unwind` within the same thread, or via `JoinHandle::join` from another thread.
- Abort: The entire process crashes immediately without cleanup.
Many real-world applications (including those shipped by Oxide) configure panics to abort, which renders much of this discussion moot. For the remainder of this article, we will focus on the default unwind behavior.
Program Behavior on Unwind:
- If a panic occurs within a `catch_unwind` block, an `Err(E)` is returned, where `E` is the panic payload.
- If no `catch_unwind` is present and the panic occurs on the main thread, a message is printed, and the program exits with a non-success code.

  ```rust
  fn main() {
      panic!("This is a panic message");
  }
  ```

  This program will print the panic message and exit with an error.
- If no `catch_unwind` is present and the panic occurs on a different thread, a message is printed, and the panic payload is returned as the `Err` result of `JoinHandle::join`.

  ```rust
  use std::thread;

  fn main() {
      let join_handle = thread::spawn(|| {
          panic!("This is a panic message");
      });
      join_handle.join().expect("child thread succeeded");
  }
  ```

  This will result in two panics: one in the child thread, and one in the main thread when `expect` is called, leading to a non-success exit code.
- If a non-main thread panics and is not joined:

  ```rust
  use std::{thread, time::Duration};

  fn main() {
      thread::spawn(|| {
          panic!("This is a panic message");
      });
      thread::sleep(Duration::from_secs(5));
  }
  ```

  This program will print the panic message from the spawned thread but then exit with a successful exit code. The panic, if not explicitly processed by `catch_unwind` or `JoinHandle::join`, can easily be overlooked.
The consequence for our Operations example is significant: If the Rust binary is set to unwind on panic, a non-main thread panics within a critical section, there's no catch_unwind, and the child thread is not joined (or its error is ignored), then the mutex invariant is permanently violated. The data protected by the mutex becomes logically corrupted, and the in-progress operation is lost.
While this scenario involves multiple conditions, they represent either default Rust behavior or common coding patterns.
Rust's designers anticipated this problem and introduced lock poisoning as a detection mechanism. The process is as follows:
- When a lock is released, Rust checks if the thread is currently panicking. If so, the mutex is marked as *poisoned*.
- The next attempt to acquire the lock will return a `PoisonError` instead of a `MutexGuard`.
Most code responds to PoisonError by immediately panicking via .lock().unwrap(), which is known as propagating panics. However, PoisonError can be handled more explicitly. It's important to note that PoisonError and poisoning, in general, are advisory; the underlying data can still be retrieved, and the poison bit can even be cleared in Rust 1.77 and later.
Despite the ability to handle PoisonError explicitly, .lock().unwrap() remains the prevalent practice. This fact strengthens the argument for retaining poisoning, rather than removing it, while simultaneously improving its ergonomics. The primary goal is detection, not necessarily recovery.
In summary: if a child thread panics within a critical section, the guarded data may be left in an inconsistent or logically corrupt state. Poisoning marks the mutex to indicate this. If the parent thread doesn't wait on the child, this might be the sole indicator that a panic occurred within a critical section. This combination of factors makes lock poisoning a crucial feature.
Unexpected Cancellations
Is the issue of inconsistent mutex-guarded state confined solely to panic unwinding? I contend that it is a broader characteristic of unexpected cancellations: situations where a critical section is initiated with the expectation of completion, but the process is interrupted.
In Rust, two primary sources of unexpected cancellations exist, exhibiting strong parallels:
- Panics, as discussed.
- Future cancellations in async Rust at an `await` point.
As observed in real-world scenarios and documented internally at Oxide, unexpected future cancellations have led to numerous mutex invariant violations, prompting the decision to entirely avoid Tokio mutexes.
This problem is particularly acute with Tokio mutexes due to their Send nature. While the standard library's MutexGuard is not Send (preventing await points within a critical section guarded by std::sync::Mutex from migrating across OS threads), Tokio's Mutex type is Send. This allows await points within a critical section, thus reintroducing the risk of unexpected cancellations and data corruption. Furthermore, Tokio mutexes do not poison if a critical section panics. While this is less critical for systems configured to abort on panic, it adds another layer of concern. My perspective here is shaped by extensive experience with these issues in async Rust, and a strong desire to prevent this "footgun" from affecting synchronous Rust.
Do Panics in Critical Sections Always Cause Invariant Violations?
One might ask if poisoning is often too conservative. My answer is that while panics don't always cause invariant violations, their frequency and the potentially unbounded downsides of corrupted state make lock poisoning a valuable and strong heuristic.
Firstly, if a critical section merely reads mutex-guarded data (perhaps written by another function), a panic won't cause invariant violations. In such cases, an RwLock might be more appropriate.
Secondly, some simple write operations can also avoid invariant violations, such as updating simple counters:

```rust
use std::sync::Mutex;

#[derive(Default)]
struct Counters {
    read_count: u64,
    write_count: u64,
}

let mutex = Mutex::new(Counters::default());

// On read:
mutex.lock().unwrap().read_count += 1;

// On write:
mutex.lock().unwrap().write_count += 1;
```

(While more complex data might require a mutex, a simple counter like this is more typically represented as an atomic type.)
Finally, it's sometimes possible to architect code carefully to be unwind safe. This means that if a panic occurs, either internal invariants are not violated, or the violation can be easily detected (effectively tracking a "poison bit" internally). For instance, Rust's standard library HashMap and BTreeMap are designed this way. In our Operations example, instead of removing an operation entirely, we could replace it with an Invalid sentinel state.
In these specific cases, a panic in a critical section may not be harmful, and the default .lock().unwrap() approach could reduce system availability. However, it's crucial to remember that code changes over time. Rust's type system generally provides strong resilience against future changes (e.g., mutable access). However, unwind safety, much like async cancel safety, is not natively encoded in Rust's type system. While rudimentary support for unwind safety exists in the type system, it is often ignored by users, and there is a proposal to remove it. This means code that is safe today could become unsafe tomorrow with seemingly innocuous modifications.
The primary downside of a `.lock().unwrap()` that prematurely panics is reduced availability or denial of service, which is a bounded risk. Conversely, the downsides of an undetected panic are unbounded, ranging from denial of service to severe data leakage, such as sensitive HTTP requests being misdirected. Faced with a bounded, potentially serious downside versus a flaw that could cripple an organization, the choice for a default becomes clear.
What About Writing Panic-Free Code?
One could meticulously ensure critical sections are panic-free. Yet, maintaining this property as code evolves is exceptionally difficult. Even a simple println! can panic. Moreover, if a critical section genuinely cannot panic, then the mutex's poisoning behavior becomes irrelevant.
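As one illustration of how subtle this is: `println!` panics if writing to stdout fails (e.g., a broken pipe). A sketch of the panic-averse alternative, handling the `io::Result` from `writeln!` explicitly (the fallback behavior here is illustrative):

```rust
use std::io::{self, Write};

fn main() {
    // `println!` would panic if this write failed; `writeln!` against a
    // locked stdout handle returns an `io::Result` instead.
    let mut out = io::stdout().lock();
    if let Err(e) = writeln!(out, "processing message") {
        // Degrade gracefully instead of unwinding mid-critical-section,
        // e.g. by incrementing a dropped-log counter.
        let _ = e;
    }
}
```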
Where Else Can Panics Cause Invariant Violations?
Historically, in Rust 1.0, panics could only be detected at thread boundaries via JoinHandle::join. This meant invariant violations caused by panics required:
- Shared data guarded by a mutex.
- A thread panicking mid-critical section.
Since then, two features were added to Rust:
- `catch_unwind` in Rust 1.9.
- Scoped threads in Rust 1.63.
These additions allow panics to leave arbitrary data (not just mutex-guarded data) in an inconsistent state. Consider rewriting our Operations example without a mutex, requiring exclusive access (&mut self) for modifications:
```rust
#[derive(Default)]
struct Operations {
    ops: HashMap<OperationId, OperationState>,
}

impl Operations {
    /// Process a message, updating the internal operation state appropriately.
    ///
    /// Note: this now requires &mut self, not just &self.
    pub fn process(&mut self, id: &OperationId, message: Message) {
        // Retrieve the element from the map to process it.
        let Some(state) = self.ops.remove(id) else {
            // (return a not-found error here)
            return;
        };
        let next_state = state.process_message(message);
        self.ops.insert(id.clone(), next_state);
    }
}
```
Without mutexes, this is not a classical critical section. However, the invariant that ops tracks all operations still applies, and it is temporarily violated, with the expectation of restoration before the function returns. Since &mut implies no other access to this data, the intermediate state is not visible elsewhere.
But like with mutexes, this breaks down with unwinding. With catch_unwind:
```rust
use std::panic::{self, AssertUnwindSafe};

let mut operations = Operations::default();
// ...

// `&mut Operations` is not `UnwindSafe`, so the closure must be wrapped
// in `AssertUnwindSafe` to compile -- itself a hint that the data may be
// left inconsistent if a panic is caught.
let result = panic::catch_unwind(AssertUnwindSafe(|| {
    operations.process(id, message);
}));
```
And with scoped threads:
```rust
use std::thread;

let mut operations = Operations::default();
// ...

thread::scope(|s| {
    let join_handle = s.spawn(|| {
        operations.process(id, message);
    });
});
```
If process_message panics, Operations is logically corrupted. This failure mode has prompted a proposal for a Poison<T> wrapper that poisons on panicking, which is an excellent and sensible idea.
Separating Mutexes from Poisoning?
Alongside the Poison<T> wrapper, some suggestions propose that the current std::sync::Mutex type in the next Rust edition should silently unlock on panic instead of poisoning. The follow-up would be that the current Mutex<T> becomes Mutex<Poison<T>>.
(It's worth noting another non-poisoning option: the mutex remains locked indefinitely, as C programmers might expect. This offers default safety in a sense, but once a thread is stuck waiting, recovery is challenging. This seems strictly worse than a poisoning mutex, so I'll assume the proposal implies silent unlocking.)
I acknowledge the conceptual elegance of this proposal:
- Composability: The `Poison` wrapper could be used with various mutexes, including those like `parking_lot`'s that currently silently unlock on panic. Single-threaded mutex equivalents like `RefCell` could also benefit.
- Zero-cost abstractions: Only users requiring poisoning would incur its cost.
- Independence: As observed, not all mutexes need poisoning, and poisoning is useful outside mutexes, suggesting the two are independent concerns.
While these points hold true, I consistently return to the unbounded downside of an undetected panic and the ease with which a system can become corrupted. Mutexes and poisoning, despite their individual merits, are not as independent as they initially appear. My experience writing Rust code over many years suggests that most mutex usages benefit from poisoning, and most critical instances of poisoning involve mutex-guarded data.
Why are Mutexes Special?
As established, poisoning can be valuable in non-mutex contexts. What makes mutexes a uniquely important place for it? This relates to what I term the cancellation blast radius. Corrupted data is only significant if it represents shared mutable state that has been altered in a way that violates an invariant and is visible externally. If a typical &mut reference panics mid-operation, the data it refers to is likely torn down, rendering invariant violations irrelevant. (The exception here is Drop implementations, which may need to be written robustly to guard against temporary invariant violations during unwinding.) However, the entire purpose of mutexes is to facilitate shared access across threads, meaning data corruption is almost certainly externally visible.
While some use cases, such as metrics or best-effort logging, might benefit from non-poisoning mutexes, these should not dictate the default behavior.
Specifically, I worry that a common complaint about lock poisoning — that it introduces too much friction — will be exacerbated. Requiring Mutex<Poison<T>> instead of Mutex<T> adds even more friction, potentially pushing developers toward non-poisoning mutexes more often. This could lead to serious production issues.
This situation highlights a tension between zero-cost abstractions and "safety by default." While I'd appreciate performance numbers to quantify the impact, I suspect the incremental cost of checking the poison flag (a single atomic load with relaxed ordering) is minimal compared to the cost of acquiring the lock itself.
What About parking_lot Mutexes?
I previously mentioned that parking_lot's mutexes silently unlock on panic. A significant portion of the Rust ecosystem uses parking_lot, often for performance reasons. Does this imply that code relying on parking_lot faces these unbounded downsides?
The answer varies, but generally (especially in library code), I believe it does. For example, a critical section in parity-db is quite large, and reasoning about its unwind-safety seems very challenging—precisely the type of code mutex poisoning is designed to guard. In parity-db's case, the binary is configured to abort on panic, mitigating the issue. However, reusable Rust libraries cannot mandate panic = 'abort', making this a genuine concern if such code were in a public library.
Just Ship with panic = 'abort'?
A common response to this class of issues is to forego unwinding entirely and always abort on panic. This ties back to the cancellation blast radius: aborting the process guarantees that in-memory state is torn down.
I have considerable sympathy for this approach; it's what we implement at Oxide. (Why write this post if it doesn't affect my workplace? Firstly, I care deeply about the health of Rust globally. Secondly, libraries must function correctly with unwinds. Most importantly, we have firsthand experience with the pain of unexpected async cancellations at Oxide, so we understand the severity.)
Crucially, aborting on panic works perfectly with the current approach: .lock().unwrap() always succeeds. Whether mutexes poison or not only matters when panic = 'unwind'.
This leads to what I believe is the core driver of much discussion:
Typing in .lock().unwrap() is Annoying
I understand this complaint completely. Writing .lock().unwrap() everywhere is cumbersome. It adds extra characters in a language already dense with syntax, and rustfmt can break lines of code due to it.
These are valid points. However, there's a superior solution that doesn't sacrifice the crucial benefits of poisoning: in the next Rust edition, make lock() automatically panic if the mutex is poisoned! (And introduce a lock_or_poison method for the current behavior, as try_lock is already taken.)
Let's compare the different options:
| Aspect | `lock().unwrap()` | Auto-panic (`lock()`) | Removing poison |
|---|---|---|---|
| Syntax noise | Medium: `.unwrap()` everywhere | Low: just `lock()` | Low by default, high with `Poison<T>` |
| Safety by default | ✅ Panics propagate | ✅ Panics propagate | ❌ Silent corruption possible |
| Opt-out available | ✅ `lock().unwrap_or_else()` | ✅ `lock_or_poison()` | ❌ Must opt in via `Poison<T>` |
| Works with `panic = 'abort'` | ✅ | ✅ | ✅ |
| Ergonomics | Poor | Good | Good without poison, poor with `Poison<T>` |
| Backwards compatibility | Current behavior | Requires new edition | Requires new edition |
Based on this comparison, I believe the answer is clear: if a breaking change is warranted, it is far better to make lock() automatically panic on poisoning than to have panics silently unlock.
Conclusion
Concurrent programming is inherently challenging. Rust significantly eases this complexity, and lock poisoning is an indispensable part of its solution. We should strive to avoid any regressions in this area.
Providing a Poison<T> wrapper is a logical and beneficial step. However, making the default std::sync::Mutex silently unlock on panic would be a mistake.
Should Rust's standard library offer non-poisoning mutexes at all? That's a more complex question. I'm concerned that their mere presence could lower the barrier for incorrect usage, particularly in libraries where panic = 'abort' cannot be assumed. Nevertheless, non-poisoning mutexes do have some legitimate applications, so I wouldn't strongly object if their trade-offs are thoroughly documented.
Articulating these thoughts has been immensely helpful for me in clarifying my position, and I hope it proves insightful for you too.