Corrosion: Fly.io's Unconventional Global Service Discovery System

Distributed Systems

Dive into Corrosion, Fly.io's unique distributed service discovery system. Discover its gossip-based design, challenges, and key lessons from real-world outages.

Fly.io transforms Docker containers into Fly Machines: micro-VMs running on our custom hardware across the globe. The most challenging aspect of operating this platform isn't managing servers or the network; it's seamlessly integrating these two components.

Several times a second, as customer CI/CD pipelines provision or deprovision Fly Machines, our state synchronization system broadcasts updates across our internal mesh. This ensures that edge proxies from Tokyo to Amsterdam maintain accurate routing tables, enabling them to route requests for applications to the nearest customer instances.

On September 1, 2024, at 3:30 PM EST, a new Fly Machine came online with a "virtual service" configuration option that a developer had just deployed. Within a few seconds, every proxy in our fleet experienced a hard lockup. It was the worst outage we've ever faced: a period during which no end-user requests could reach our customer applications at all.

Distributed systems act as blast amplifiers. By propagating data across a network, they also propagate bugs in the systems that rely on that data. In the case of Corrosion, our state distribution system, those bugs propagate quickly. The proxy code handling that Corrosion update succumbed to a notorious Rust concurrency "footgun": an if let expression over an RwLock incorrectly assumed in its else branch that the lock had been released. This led to an instant and virulently contagious deadlock.
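The footgun looks roughly like this (a minimal sketch, not the actual proxy code). In Rust editions before 2024, the temporary read guard created in an if let scrutinee lives until the end of the entire if/else expression, so the else branch still holds the read lock:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

fn lookup_or_insert(routes: &RwLock<HashMap<String, String>>, app: &str) -> String {
    // BUGGY version (deadlocks, or may panic, depending on the lock
    // implementation): the read guard from the scrutinee is still held
    // when the else branch tries to take the write lock.
    //
    // if let Some(backend) = routes.read().unwrap().get(app) {
    //     backend.clone()
    // } else {
    //     routes.write().unwrap()                  // <- lock-order disaster
    //         .entry(app.to_string())
    //         .or_insert_with(|| "fallback".to_string())
    //         .clone()
    // }

    // FIXED version: end the read borrow before taking the write lock.
    let cached = routes.read().unwrap().get(app).cloned();
    match cached {
        Some(backend) => backend,
        None => routes
            .write()
            .unwrap()
            .entry(app.to_string())
            .or_insert_with(|| "fallback".to_string())
            .clone(),
    }
}
```

The buggy version compiles cleanly, which is exactly what makes it a footgun.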

A lesson we've learned the hard way: never trust a distributed system without an interesting failure story. If a distributed system hasn't ruined a weekend or kept you up overnight, you don't truly understand it yet. This is precisely why we're introducing Corrosion, an unconventional service discovery system we built for our platform and open-sourced.

Our Face-Seeking Rake

State synchronization is the most difficult problem in running a platform like ours. So why embark on building a risky new distributed system for it? Because no matter what we try, that particular challenge is always waiting for us. The core reason lies in our orchestration model.

Virtually every mainstream orchestration system (including Kubernetes) relies on a centralized database to make decisions about where to place new workloads. Individual servers track what they're running, but the central database remains the source of truth. At Fly.io, to achieve global scale across dozens of regions, we invert this concept: individual servers are the authoritative source of truth for their own workloads.

In our platform, our central API "bids out" work to what effectively functions as a global market of competing physical "worker" servers. By shifting the authoritative source of information from a central scheduler to individual servers, we scale out without bottlenecking on a database that demands both responsiveness and consistency across locations like São Paulo, Virginia, and Sydney.

While the bidding model is elegant, it's insufficient for routing network requests. To enable an HTTP request in Tokyo to efficiently find the nearest instance in Sydney, we genuinely require a global map of every application we host.

For longer than we should have, we relied on HashiCorp Consul to route traffic. Consul is fantastic software. Don't build a global routing system on it. Subsequently, we built SQLite caches of Consul. SQLite: also fantastic. But don't do this either.

Like an unattended turkey deep-frying on the patio, truly global distributed consensus promises deliciousness while often yielding only immolation. Consensus protocols like Raft break down over long distances. Moreover, they work against the architecture of our platform: our Consul cluster, running on the most powerful hardware we could acquire, wasted time guaranteeing consensus for updates that couldn't conflict in the first place.

Corrosion

To build a global routing database, we moved away from distributed consensus and drew inspiration from actual routing protocols.

A protocol like OSPF shares the same operating model and many of our constraints. OSPF is a "link-state routing protocol," which, conveniently for us, means that routers are sources of truth for their own links and are responsible for quickly communicating changes to every other router, enabling the network to make forwarding decisions.

We have things easier than OSPF. Its flooding algorithm can't assume connectivity between arbitrary routers (solving that problem is OSPF's core purpose). However, we operate a global, fully connected WireGuard mesh between our servers. All we need to do is gossip efficiently.

Corrosion is a Rust program that propagates a SQLite database using a gossip protocol.

Similar to Consul, our gossip protocol is built on SWIM. Start with the simplest, most basic group membership protocol you can imagine: every node spams every node it learns about with heartbeats. Now, just two tweaks: first, in each step of the protocol, spam a random subset of nodes, not the entire set. Second, instead of panicking when a heartbeat fails, mark the node "suspect" and ask another random subset of neighbors to ping it for you. SWIM converges on global membership very quickly.
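One SWIM probe round can be sketched like this (a toy, deterministic model, not Corrosion's implementation; "reachability" is simulated with a lookup table, and real SWIM picks helpers at random):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum State { Alive, Suspect }

struct Cluster {
    // reachable[(from, to)] == true means `from` can ping `to` right now.
    reachable: HashMap<(&'static str, &'static str), bool>,
    members: HashMap<&'static str, State>,
}

impl Cluster {
    fn ping(&self, from: &'static str, to: &'static str) -> bool {
        *self.reachable.get(&(from, to)).unwrap_or(&false)
    }

    // One probe step: `prober` pings `target` directly; on failure it asks
    // `helpers` (a random subset in real SWIM) to ping on its behalf. Only
    // if every indirect probe also fails is the target marked Suspect.
    fn probe(&mut self, prober: &'static str, target: &'static str, helpers: &[&'static str]) {
        if self.ping(prober, target) {
            self.members.insert(target, State::Alive);
            return;
        }
        let indirectly_ok = helpers
            .iter()
            .any(|h| self.ping(prober, *h) && self.ping(*h, target));
        let state = if indirectly_ok { State::Alive } else { State::Suspect };
        self.members.insert(target, state);
    }
}
```

The indirect-probe step is what keeps one flaky network path from spuriously declaring a healthy node dead.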

Once membership is established, we use QUIC between nodes in the cluster to broadcast changes and reconcile state for new nodes.

Corrosion functions like a globally synchronized database. You can open it with SQLite and simply read from its tables. What makes it particularly interesting is what it doesn't do: no locking, no central servers, and no distributed consensus. Instead, we leverage our orchestration model: workers own their own state, so updates from different workers almost never conflict.

We do impose some order. Every node in a Corrosion cluster will eventually receive the same set of updates, in some order. To ensure every instance arrives at the same "working set" picture, we use cr-sqlite, the CRDT SQLite extension.

cr-sqlite works by marking specified SQLite tables as CRDT-managed. For these tables, changes to any column of a row are logged in a special crsql_changes table. Updates to tables are applied using last-write-wins with logical timestamps (i.e., causal ordering rather than wall-clock ordering).
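Column-level last-write-wins merging can be sketched like this (a simplified model in the spirit of cr-sqlite, not its actual code; field names are illustrative). Each value carries a logical version, with a stable site id as a tiebreaker, so every node converges on the same winner regardless of arrival order:

```rust
#[derive(Clone, PartialEq, Debug)]
struct ColumnValue {
    value: String,
    db_version: u64, // logical clock, not wall-clock time
    site_id: u64,    // stable tiebreaker between concurrent writers
}

// Keep whichever write has the higher (version, site_id) pair. Because the
// comparison is deterministic, applying a set of updates in any order
// yields the same final value on every node.
fn merge(local: &mut ColumnValue, remote: ColumnValue) {
    let remote_wins = (remote.db_version, remote.site_id) > (local.db_version, local.site_id);
    if remote_wins {
        *local = remote;
    }
}
```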

As rows are updated in Corrosion's ordinary SQL tables, the resulting changes are collected from crsql_changes. They're bundled into batched update packets and gossiped.

When things are running smoothly, Corrosion is easy to reason about. Many consumers of Corrosion's data don't even need to know it exists, just where the database is. We don't fret over "leader elections" or anxiously monitor metrics for update backlogs. And it's incredibly fast.

Shit Happens

This is a story about how we made one good set of engineering decisions and never experienced any problems. Please clap.

We've already told you about the worst problem Corrosion was involved with: efficiently gossiping a deadlock bug to every proxy in our fleet, shutting down our entire network. Really, Corrosion was just a bystander during that outage. But it perpetrated others.

Consider a classic operations problem: the unexpectedly expensive DDL change. You wrote a simple migration, tested it, merged it to main, and went to bed, wrongly assuming the migration wouldn't cause an outage when it ran in production. Happens to the best of us.

Now, add a twist. You made a trivial-seeming schema change to a CRDT table hooked up to a global gossip system. When the deploy runs, thousands of high-powered servers around the world join a chorus of database reconciliation messages that melts down the entire cluster.

That happened to us last year when a team member added a nullable column to a Corrosion table. New nullable columns are kryptonite to large Corrosion tables: cr-sqlite needs to backfill values for every row in the table. It played out as if every Fly Machine on our platform had suddenly changed state simultaneously, just to mess with us.

A gnarlier war story: for a long time, we ran both Corrosion and Consul, operating under the assumption that two distributed systems provide twice the resiliency. One morning, a Consul mTLS certificate expired. Every worker in our fleet severed its connection to Consul.

We should have been fine. We had Corrosion running. Except: under the hood, every worker in the fleet was executing a backoff loop trying to reestablish connectivity to Consul. Each of those attempts re-invoked a code path to update Fly Machine state. That code path incurred a Corrosion write.

By the time we figured out what the hell was happening, we were literally saturating our uplinks almost everywhere in our fleet. We apologize to our uplink providers.

It's been a long time since anything like this has happened at Fly.io, but preventing the next one is basically all we think about anymore.

Iteration

In retrospect, our Corrosion rollout repeated a mistake we made with Consul: we built a single global state domain. Nothing about Corrosion's design required us to do this, and we're currently unwinding that decision. Hold that thought. We did get some significant payoffs from some smaller improvements.

First, and most importantly, we watchdogged everything. We showed you a contagious deadlock bug, lethal because our risk model was missing "these Tokio programs might deadlock." Not anymore. Our Tokio programs all have built-in watchdogs; an event-loop stall will bounce the service and trigger a king-hell alerting racket. Watchdogs have prevented multiple outages. Minimal code, easy win. Do this in your own systems.
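The watchdog idea can be sketched with plain OS threads (a minimal sketch, not our actual Tokio setup). The event loop bumps a heartbeat counter; a separate thread notices when the counter stops moving and reports a stall instead of letting the process limp along deadlocked:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Watch a heartbeat counter; if it stops advancing for `stall_after`,
// declare a stall. In production this would alert and restart the service
// rather than just returning.
fn watch(heartbeat: Arc<AtomicU64>, stall_after: Duration) -> thread::JoinHandle<bool> {
    thread::spawn(move || {
        let mut last = heartbeat.load(Ordering::Relaxed);
        let mut last_change = Instant::now();
        loop {
            thread::sleep(Duration::from_millis(10));
            let now = heartbeat.load(Ordering::Relaxed);
            if now != last {
                last = now;
                last_change = Instant::now();
            } else if last_change.elapsed() > stall_after {
                return true; // stall detected
            }
        }
    })
}
```

The healthy path is the event loop incrementing the counter every iteration; a deadlocked loop stops incrementing, and the watchdog fires.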

Then, we extensively tested Corrosion itself. We've written about a bug we found in the Rust parking_lot library. We spent months looking for similar bugs with Antithesis. Again: highly recommend. It easily retraced our steps on the parking_lot bug; the bug wouldn't have been worth the blog post if we'd been using Antithesis at the time.

Multiverse debugging is killer for distributed systems.

No amount of testing will make us implicitly trust a distributed system. So we've made it simpler to rebuild Corrosion's database from our workers. We maintain checkpoint backups of the Corrosion database on object storage. That was a smart move. When things truly went haywire last year, we had the option to reboot the cluster, which is ultimately what we did. That consumes some time (the database is large and propagation is expensive), but diagnosing and repairing distributed systems mishaps takes even longer.

We've also improved the way our workers feed Corrosion. Until recently, any time a worker updated its local database, we published the same incremental update to Corrosion.

But now we've eliminated partial updates. Instead, when a Fly Machine changes, we republish the entire data set for that Machine. Because of how Corrosion resolves changes to its own rows, the node receiving the republished Fly Machine automatically filters out the no-op changes before gossiping them. Eliminating partial updates prevents a bunch of bugs (and, we believe, kills off a couple of sneaky ones we've been chasing). We should have done it this way from the beginning.
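The no-op filter amounts to a column-wise diff (an illustrative sketch, not Corrosion's code): when a worker republishes a Machine's full row, only the columns whose values actually differ from what the node already has survive to be gossiped:

```rust
use std::collections::HashMap;

// Given the node's current view of a row and a full republished copy,
// return only the columns that actually changed. Unchanged columns are
// dropped before gossip, so republishing the whole row costs little.
fn changed_columns(
    current: &HashMap<String, String>,
    republished: &HashMap<String, String>,
) -> HashMap<String, String> {
    republished
        .iter()
        .filter(|(col, val)| current.get(*col) != Some(*val))
        .map(|(c, v)| (c.clone(), v.clone()))
        .collect()
}
```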

Finally, let's revisit that global state problem. After the contagious deadlock bug, we concluded we need to evolve beyond a single cluster. So we undertook a project we call "regionalization," which creates a two-level database scheme. Each region we operate in runs a Corrosion cluster with fine-grained data about every Fly Machine in that region. The global cluster then maps applications to regions, which is sufficient to make forwarding decisions at our edge proxies.
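A two-level lookup can be sketched like this (names and data shapes are illustrative, not our schema). The global cluster only answers "which regions run this app?"; the per-region cluster holds the fine-grained Machine data:

```rust
use std::collections::HashMap;

struct GlobalCluster {
    app_regions: HashMap<String, Vec<String>>, // app -> regions running it
}

struct RegionCluster {
    machines: HashMap<String, Vec<String>>, // app -> machine addresses
}

// An edge proxy resolves an app in two steps: the global map narrows the
// search to a region, then that region's cluster supplies a concrete
// machine address.
fn route(
    global: &GlobalCluster,
    regions: &HashMap<String, RegionCluster>,
    app: &str,
    preferred: &[&str], // the proxy's region preference, nearest first
) -> Option<String> {
    let candidates = global.app_regions.get(app)?;
    // Pick the nearest preferred region that actually runs the app...
    let region = preferred.iter().find(|r| candidates.iter().any(|c| c == **r))?;
    // ...then consult that region's cluster for a concrete machine.
    regions.get(*region)?.machines.get(app)?.first().cloned()
}
```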

Regionalization significantly reduces the blast radius of state bugs. Most information we track doesn't need to be globally relevant outside its region (importantly, most code changes related to what we track are also region-local). We can roll out changes to this kind of code in ways that, worst case, threaten only a single region.

The New System Works

Most distributed systems face state synchronization challenges. Corrosion has a different "shape" than most of those systems:

  • It doesn't rely on distributed consensus, like Consul, ZooKeeper, etcd, Raft, or rqlite (which we came very close to using).
  • It doesn't rely on a large-scale centralized data store, like FoundationDB or databases backed by S3-style object storage.
  • It's nevertheless highly distributed (each of thousands of workers runs a node), converges quickly (in seconds), and presents as a simple SQLite database. Neat!

It wasn't easy getting here. Corrosion is a significant part of what every engineer at Fly.io who writes Rust works on.

Part of what's making Corrosion work is that we're careful about what we put into it. Not every piece of state we manage needs gossip propagation.

tkdb, the backend for our Macaroon tokens, is a much simpler SQLite service backed by Litestream. So is Pet Sematary, the secret store we built to replace HashiCorp Vault.

Still, there are probably lots of distributed state problems that want something more like a link-state routing protocol and less like a distributed database. If you think you might have one of those, feel free to take Corrosion for a spin.

Corrosion is Jérôme Gravel-Niquet's brainchild. For the last couple of years, much of its iteration was led by Somtochi Onyekwere and Peter Cai. The work was alternately cortisol- and endorphin-inducing. We're glad to finally get to talk about it in detail.

Last updated • Oct 22, 2025

Author Thomas Ptacek @tqbf

Author Peter Cai @PeterCxy
