Disaggregation: Revolutionizing Cloud Database Architectures for Elasticity and Efficiency

Cloud Database Architecture

Explore disaggregation in cloud databases and its benefits for elastic scalability and cost efficiency, with a look at modern architectures, key tradeoffs, and future research shaping the next generation of data systems.

September 08, 2025

This summary is based on a concise VLDB'25 paper that explores disaggregation in cloud databases. The paper offers several profound insights, making it a valuable read.

The primary benefit of cloud infrastructure, compared to on-premise solutions, is its elastic scalability. Users can dynamically adjust resources and only pay for what they consume. Traditional database architectures, such as shared-nothing, often fail to fully leverage this advantage, leading to a growing trend of cloud-native databases adopting disaggregated designs.

Disaggregation is primarily driven by the inherent asymmetry between compute and storage resources:

  • Compute resources are significantly more expensive than storage in a cloud environment.
  • Compute demand experiences rapid fluctuations, whereas storage requirements typically grow at a slower pace.
  • Compute components can often be stateless and thus easier to scale, while storage is inherently stateful.

Decoupling these components allows compute to scale elastically, while storage can remain relatively stable and cost-effective.

Review of Disaggregation in the Clouds

Early cloud-native systems, such as Snowflake and Amazon Aurora, pioneered the separation of compute and storage into independent clusters. However, modern systems are extending the concept of disaggregation even further.

Socrates exemplifies this by splitting storage into three distinct services: a Logging service (characterized by a small footprint and strict latency requirements), a Page cache, and a Durable page store. This granular approach allows each service to be independently tuned for specific performance-to-cost tradeoffs. For instance, the logging service can leverage faster storage hardware to meet its stringent latency demands.

Further examples of disaggregation include computation pushdown (seen in Redshift Spectrum and S3 Select), intermediate caching (Snowflake), dedicated metadata services (Lakehouse architectures), and memory disaggregation (PolarDB). Many other critical database functions, such as indexing, concurrency control, and query optimization, remain largely underexplored in this context. This suggests a significant opportunity for a unified middleware layer between compute and storage to consolidate and manage these diverse functions.

Although not explicitly mentioned in the paper, this discussion closely parallels the microservices trend in general systems design. Decomposing monolithic applications into smaller, independently scalable services enhances modularity, improves resource efficiency, and facilitates better sharing and pooling of resources across various workloads. It's plausible that disaggregated databases will evolve similarly to microservices: starting with a straightforward split, progressing to numerous fine-grained services, and eventually requiring sophisticated orchestration layers, observability tools, and service meshes. While the focus today is on compute and storage, the future could involve dozens of database microservices (potentially even encompassing concurrency control), seamlessly integrated by a middleware layer that bears a striking resemblance to Kubernetes.

Tradeoffs in Disaggregated Design

The primary tradeoff in a disaggregated design is performance. Given that disaggregated components are physically separated, the communication overhead between them can be substantial.

A 2019 study indicated a tenfold throughput reduction compared to an optimized shared-nothing system. While optimizations can help mitigate this gap, disaggregation should only be implemented when its benefits demonstrably outweigh the associated network costs. This fundamental tradeoff actively drives ongoing research into techniques aimed at reducing inter-component communication overhead within distributed systems.

Rethinking Core Protocols

Many traditional distributed database protocols are designed under the assumption of a shared-nothing architecture. With disaggregation, some of these core assumptions become invalid, presenting not only new challenges but also significant opportunities for innovation.

For instance, the Two-Phase Commit (2PC) protocol typically encounters a blocking problem when a failed node's log becomes inaccessible. However, with disaggregated storage, logs reside in a shared, highly reliable service. The Cornus 2PC protocol (2022) capitalizes on this by allowing active nodes to cast a 'NO' vote on behalf of failed nodes by directly writing to their respective logs. A compare-and-swap API is used to ensure that only a single decision is definitively recorded.
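A toy sketch of that voting mechanism, assuming the shared log exposes a compare-and-swap primitive (the names and structure here are my own illustration, not the paper's API):

```python
import threading

class SharedLog:
    """Stand-in for a disaggregated log service with a CAS primitive."""
    def __init__(self):
        self._votes = {}           # node_id -> recorded vote
        self._lock = threading.Lock()

    def compare_and_swap(self, node_id, expected, new):
        """Atomically record `new` for node_id only if the current value
        equals `expected`. Returns the value stored after the call."""
        with self._lock:
            current = self._votes.get(node_id)
            if current == expected:
                self._votes[node_id] = new
                return new
            return current

def vote(log, node_id, decision):
    """A participant records its own vote; the first write wins."""
    return log.compare_and_swap(node_id, None, decision)

def vote_on_behalf(log, failed_node_id):
    """Cornus-style: an active node force-writes NO for a suspected-failed
    peer. If the peer already voted, the CAS fails and its vote stands."""
    return log.compare_and_swap(failed_node_id, None, "NO")

log = SharedLog()
vote(log, "n1", "YES")             # n1 voted before being suspected
print(vote_on_behalf(log, "n1"))   # -> "YES": the existing vote stands
print(vote_on_behalf(log, "n2"))   # -> "NO": a forced abort vote is recorded
```

The CAS is what makes this safe: whichever write lands first (the node's own vote or a peer's forced NO) becomes the single recorded decision, so no node blocks waiting for a failed participant's log.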

Disaggregating the Query Engine

Pushdown techniques aim to reduce data movement by executing query operators closer to the storage layer. While this concept has been explored in specialized database machines, Smart SSDs, and Processing-in-Memory (PIM) architectures, it is particularly well-suited for cloud environments. Capitalizing on this, PushdownDB utilizes S3 Select to offload both basic and advanced operators, resulting in a 6.7x reduction in query execution time and a 30% cut in cost.
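The data-movement argument behind pushdown can be shown with a toy model. Real PushdownDB issues SQL expressions to S3 Select; here "storage" is just a list of rows, and the row counts are illustrative, not the paper's measurements.

```python
# Toy model contrasting "ship everything" with filter pushdown.
STORAGE = [{"id": i, "price": i % 100} for i in range(10_000)]

def scan_then_filter(pred):
    """Baseline: move every row to the compute node, filter there."""
    moved = list(STORAGE)                    # all rows cross the network
    return [r for r in moved if pred(r)], len(moved)

def pushdown_filter(pred):
    """Pushdown: the storage side evaluates the predicate, ships matches only."""
    moved = [r for r in STORAGE if pred(r)]  # filtering happens "at storage"
    return moved, len(moved)

pred = lambda r: r["price"] < 5
res_a, moved_a = scan_then_filter(pred)
res_b, moved_b = pushdown_filter(pred)
assert res_a == res_b     # same answer, far fewer rows moved with pushdown
```

With a 5% selective predicate, pushdown ships 500 rows instead of 10,000; the more selective the operator, the larger the win, which is why offloading pays for itself despite the storage side's limited compute.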

FlexPushdownDB further refines this by combining pushdown with caching. This allows operators such as filters or hash probes to execute locally on cached data while also leveraging remote pushdown capabilities, with the results seamlessly merged. This hybrid approach significantly outperforms either technique when used in isolation, demonstrating a 2.2x speedup.
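A minimal sketch of that hybrid plan, assuming data is divided into segments with a subset held in a local cache (the segment layout and names are my own illustration, not the system's API):

```python
# FlexPushdownDB-style hybrid execution: filter cached segments locally,
# push the filter down for uncached segments, then merge the results.
STORAGE = {seg: [{"seg": seg, "v": v} for v in range(100)] for seg in range(10)}
CACHE = {seg: STORAGE[seg] for seg in (0, 1, 2)}   # hot segments held locally

def hybrid_scan(pred):
    # Local path: evaluate the predicate over cached segments on compute.
    local = [r for seg in CACHE for r in CACHE[seg] if pred(r)]
    # Pushdown path: storage evaluates the predicate for uncached segments.
    remote_segs = [s for s in STORAGE if s not in CACHE]
    pushed = [r for s in remote_segs for r in STORAGE[s] if pred(r)]
    return local + pushed                           # merged result

rows = hybrid_scan(lambda r: r["v"] < 2)
```

Cached segments cost no network transfer at all, and uncached segments still benefit from reduced data movement, which is why the hybrid beats either caching or pushdown alone.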

Enabling New Capabilities and Embracing New Hardware

Contemporary applications demand queries that reflect the most recent transactions, rather than data hours old. While HTAP (Hybrid Transactional/Analytical Processing) systems address this, they often necessitate migration to entirely new database engines. Disaggregated architectures present a compelling opportunity in this area. Hermes (VLDB'25) leverages this by positioning itself strategically between the compute and storage layers. Hermes intercepts both transactional logs and analytical reads, dynamically merging recent updates into queries in real-time. These updates are then batched for eventual persistence to stable storage.
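The interception idea can be sketched with a toy middle layer that overlays unflushed log entries onto reads from the (stale) base store. The names and structure are my own illustration of the concept, not Hermes's actual design.

```python
base_store = {"x": 1, "y": 2}   # durable storage, updated only in batches
recent_log = []                 # tail of the transactional log, not yet flushed

def commit(key, value):
    """Transactions land in the log first."""
    recent_log.append((key, value))

def fresh_read(key):
    """Analytical read: base data overlaid with unflushed log entries."""
    merged = dict(base_store)
    for k, v in recent_log:
        merged[k] = v
    return merged.get(key)

def flush():
    """Batch the log tail into durable storage."""
    for k, v in recent_log:
        base_store[k] = v
    recent_log.clear()

commit("x", 10)
assert fresh_read("x") == 10    # the query reflects the latest transaction
assert base_store["x"] == 1     # ...before it is ever persisted
flush()
assert base_store["x"] == 10
```

The point is that freshness comes from the merge at read time, so the durable store can keep its batch-oriented, analytics-friendly write path.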

Furthermore, disaggregation both encourages and simplifies the adoption of novel hardware. Different components within the architecture can utilize specialized hardware like GPUs, RDMA, or CXL to achieve optimal cost-performance tradeoffs. The paper highlights a GPU-based DuckDB engine (VLDB'25) that demonstrates substantial speedups through aggressive parallelism.

Discussion

The paper outlines several promising new research directions. For academic systems researchers, an impactful project would involve taking a monolithic database (e.g., Postgres, RocksDB, or MySQL) and systematically transforming it into a disaggregated database. This endeavor should go beyond a mere proof-of-concept, requiring a thorough investigation into the efficiency tradeoffs of various alternative designs. It's equally important to examine the software engineering implications, cost-to-production considerations, resilience tradeoffs, and potential metastability risks. By comparing and contrasting different transformation pathways, such research could provide a comprehensive roadmap—and identify potential pitfalls—for numerous databases contemplating similar architectural redesigns or implementations.

I reviewed a paper last year that made initial inroads into this area, but I believe substantial further research and development are still required.

For distributed protocol designers, the section on "Rethinking Core Protocols" serves as an excellent blueprint. The suggestion is to select other fundamental protocols—such as consensus, leader election, replication, and caching—and re-examine them within a disaggregated architectural context, evaluating both the novel opportunities that arise and the new challenges that are introduced.

For those interested in further reading, I have previously covered disaggregated architectures multiple times on this blog.