Transparent Analytics from PostgreSQL to ClickHouse with pg_clickhouse


Discover pg_clickhouse, an Apache 2-licensed PostgreSQL extension designed to simplify analytics migrations. It enables transparent execution of ClickHouse queries directly from PostgreSQL, eliminating the need to rewrite complex SQL and enhancing performance for data-intensive workloads.

The migration of analytics workloads to ClickHouse Cloud reveals a significant trend: after self-hosted ClickHouse, PostgreSQL is the most common source for data migrations. While ClickPipes simplifies data replication, users still face challenges when migrating analytical queries and application code from PostgreSQL to ClickHouse. To address this, efforts began months ago to streamline the migration of analytical queries.

Today, we are pleased to announce the release of pg_clickhouse v0.1.0, an Apache 2-licensed PostgreSQL extension designed to transparently execute analytics queries on ClickHouse directly from PostgreSQL.

pg_clickhouse is available for download from PGXN and GitHub. You can also quickly test it by spinning up a Docker instance:

docker run --name pg_clickhouse -e POSTGRES_PASSWORD=my_pass \
-d ghcr.io/clickhouse/pg_clickhouse:18

For a guided introduction, consider starting with the tutorial.
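As a foreign data wrapper, the extension should follow the standard FDW setup flow. The sketch below is illustrative only: the wrapper, server, and option names are assumptions, not confirmed pg_clickhouse syntax, so check the tutorial for the exact DDL.

```sql
-- Illustrative sketch: names and options follow common FDW conventions
-- and may not match pg_clickhouse's actual syntax.
CREATE EXTENSION pg_clickhouse;

CREATE SERVER clickhouse_svr
    FOREIGN DATA WRAPPER clickhouse_fdw  -- hypothetical wrapper name
    OPTIONS (host 'my-instance.clickhouse.cloud', port '9440');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER clickhouse_svr
    OPTIONS (user 'default', password 'my_pass');

-- Mirror the ClickHouse tables into a local schema.
IMPORT FOREIGN SCHEMA "default"
    FROM SERVER clickhouse_svr INTO uk;
```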

Goals

Organizations often build applications backed by PostgreSQL, handling both transactional business data and analytical workloads like logging and metrics. As products scale, increased user traffic and data volume lead to slower analytical queries for real-time features and observability systems.

Developers frequently use PostgreSQL read replicas as a provisional solution. Eventually, the need arises to shift these workloads to specialized analytics databases like ClickHouse. While ClickPipes ensures rapid data migration, the challenge remains: how to manage existing PostgreSQL queries, often embedded in SQL libraries or ORMs, that were written for that data?

The most time-consuming part isn't moving the data, which ClickPipes excels at, but rather rewriting months or years of analytics SQL deeply integrated into dashboards, ORMs, and cron jobs.

We envisioned a PostgreSQL extension that would minimize the need to migrate these queries. The goal was to enable workload migration simply by pointing existing queries at a new Postgres database or schema after data migration.

This led to the development of pg_clickhouse with several key objectives:

  • Provide ClickHouse query execution from PostgreSQL.
  • Allow existing PostgreSQL queries to run unmodified.
  • Push down query execution to ClickHouse for efficiency.
  • Create a foundation for continuous query and pushdown evolution.

Imagine ClickHouse tables appearing as regular PostgreSQL tables, perhaps within a separate schema but with an identical structure. This setup would allow existing queries to function as before, requiring only a change to the search_path.
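Concretely, if foreign tables mirroring the ClickHouse data were created in a schema named clickhouse (a hypothetical name), the only change an application would need is:

```sql
-- Resolve unqualified table references against the foreign-table schema
-- first; existing queries then hit the ClickHouse-backed tables unchanged.
SET search_path = clickhouse, public;
```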

History

The SQL/MED standard addresses this exact use case through database extensions called foreign data wrappers (FDWs), which enable external data management via SQL. PostgreSQL has supported FDWs since version 9.1 (2011), with writable foreign tables arriving in 9.3, and a robust array of FDW extensions has grown since then.

Initially, clickhouse_fdw was identified as an existing solution. This project, originating in 2019 from a fork of postgres_fdw and the ClickHouse C++ library, supported raw data access and query pushdown, including some JOINs and aggregate functions. However, it received only basic maintenance after late 2020 and did not benefit from recent pushdown improvements in the PostgreSQL FDW API, which include support for advanced aggregations, SEMI-JOINs, and subqueries. It also lacked comprehensive testing beyond Linux and lagged behind new PostgreSQL releases. Consequently, much of its functionality was imported into the new pg_clickhouse project, retaining the Apache 2 license for consistent distribution.

Improvements

Building upon the foundation of clickhouse_fdw and postgres_fdw, pg_clickhouse was developed to modernize the code and build process, address bugs and shortcomings, and evolve into a complete product offering near-universal pushdown for analytics queries and aggregations.

Key advancements include:

  • Adopting the standard PGXS build pipeline for PostgreSQL extensions.
  • Adding prepared INSERT support and adopting the latest supported release of the ClickHouse C++ library.
  • Creating test cases and CI workflows to ensure compatibility with PostgreSQL versions 13-18 and ClickHouse versions 22-25.
  • Supporting TLS-based connections for both the binary protocol and the HTTP API, a requirement for ClickHouse Cloud.
  • Adding Bool, Decimal, and JSON data type support.
  • Pushing down aggregate functions transparently, including ordered-set aggregates like percentile_cont().
  • Pushing down SEMI JOINs.

These latter two features significantly enhance an FDW's suitability for an analytics database. The core benefit of an analytics database is its execution speed; an FDW that merely returns millions of rows for PostgreSQL to aggregate would negate much of this advantage.
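A quick way to check whether a given query benefits is standard PostgreSQL EXPLAIN: a fully pushed-down aggregate appears as a single Foreign Scan node, while a local Aggregate node sitting above a Foreign Scan means rows are being shipped back to PostgreSQL. For instance (the table is the one used in the examples below):

```sql
-- A fully pushed-down plan shows one Foreign Scan node with
-- "Relations: Aggregate on (...)"; a local Aggregate over a
-- Foreign Scan means PostgreSQL is aggregating fetched rows itself.
EXPLAIN (VERBOSE)
SELECT type, round(avg(price))
FROM uk.uk_price_paid
GROUP BY type;
```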

Aggregate Pushdown

Ordered-set aggregates are particularly challenging to map between different database engines due to differing syntax. Ideally, an aggregate function fully pushes down to ClickHouse for efficient execution. Consider this query, adapted from the HouseClick project:

SELECT
    type,
    round(min(price)) + 100 AS min,
    round(max(price)) AS max,
    round(percentile_cont(0.5) WITHIN GROUP (ORDER BY price)) AS median,
    round(percentile_cont(0.25) WITHIN GROUP (ORDER BY price)) AS "25th",
    round(percentile_cont(0.75) WITHIN GROUP (ORDER BY price)) AS "75th"
FROM
    uk.uk_price_paid
GROUP BY
    type

This query uses three aggregate functions. min() and max() push down automatically, as they have identical names in both ClickHouse and PostgreSQL. However, percentile_cont(), which returns the value at a given fraction of the ordered input, interpolating between adjacent values when necessary, has no direct equivalent in ClickHouse, nor does ClickHouse support the WITHIN GROUP (ORDER BY x) ordered-set aggregate syntax.

ClickHouse does, however, provide parametric aggregate functions, such as quantile, which implement a subset of the ordered set aggregate syntax. Thus, pg_clickhouse transparently rewrites this query for ClickHouse as:

SELECT
    type,
    (round(min(price)) + 100),
    round(max(price)),
    round(quantile(0.5)(price)),
    round(quantile(0.25)(price)),
    round(quantile(0.75)(price))
FROM
    uk.uk_price_paid
GROUP BY
    type

Notice the conversion of percentile_cont()'s direct arguments (0.5, 0.25, 0.75) to quantile()'s parametric constants, and the ORDER BY arguments to function arguments: percentile_cont(0.5) WITHIN GROUP (ORDER BY price) => quantile(0.5)(price).

Furthermore, pg_clickhouse, like clickhouse_fdw before it, translates PostgreSQL aggregate FILTER (WHERE) expressions to ClickHouse -If combinators. Here's the full PostgreSQL query from HouseClick:

SELECT
    type,
    round(min(price)) + 100 AS min,
    round(min(price) FILTER (WHERE town = 'ILMINSTER' AND district = 'SOUTH SOMERSET' AND postcode1 = 'TA19' )) AS min_filtered,
    round(max(price)) AS max,
    round(max(price) FILTER (WHERE town = 'ILMINSTER' AND district = 'SOUTH SOMERSET' AND postcode1 = 'TA19' )) AS max_filtered,
    round(percentile_cont(0.5) WITHIN GROUP (ORDER BY price)) AS median,
    round(percentile_cont(0.5) WITHIN GROUP (ORDER BY price) FILTER (WHERE town = 'ILMINSTER' AND district = 'SOUTH SOMERSET' AND postcode1 = 'TA19' )) AS median_filtered,
    round(percentile_cont(0.25) WITHIN GROUP (ORDER BY price)) AS "25th",
    round(percentile_cont(0.25) WITHIN GROUP (ORDER BY price) FILTER (WHERE town = 'ILMINSTER' AND district = 'SOUTH SOMERSET' AND postcode1 = 'TA19' )) AS "25th_filtered",
    round(percentile_cont(0.75) WITHIN GROUP (ORDER BY price)) AS "75th",
    round(percentile_cont(0.75) WITHIN GROUP (ORDER BY price) FILTER (WHERE town = 'ILMINSTER' AND district = 'SOUTH SOMERSET' AND postcode1 = 'TA19' )) AS "75th_filtered"
FROM
    uk.uk_price_paid
GROUP BY
    type

Running this with EXPLAIN reveals a fully pushed-down query plan:

                     QUERY PLAN
---------------------------------------------------
 Foreign Scan  (cost=1.00..-0.90 rows=1 width=112)
   Relations: Aggregate on (uk_price_paid)

This ensures the heavy lifting remains within ClickHouse, avoiding the transfer of millions of rows back to PostgreSQL. Using EXPLAIN (VERBOSE) outputs the query sent to ClickHouse (reformatted):

SELECT
    type,
    (round(min(price)) + 100),
    round(minIf(price,((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))),
    round(max(price)),
    round(maxIf(price,((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))),
    round(quantile(0.5)(price)),
    round(quantileIf(0.5)(price,((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))),
    round(quantile(0.25)(price)),
    round(quantileIf(0.25)(price,((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))),
    round(quantile(0.75)(price)),
    round(quantileIf(0.75)(price,((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0)))
FROM
    uk.uk_price_paid
GROUP BY
    type;

Each FILTER (WHERE) expression is converted into an -If suffixed ClickHouse function, effectively computing the equivalent filtering. For example, min(price) FILTER (WHERE town = 'ILMINSTER' AND district = 'SOUTH SOMERSET' AND postcode1 = 'TA19') becomes minIf(price,((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0)).

SEMI JOIN Pushdown

During the development of pg_clickhouse, pushdown capabilities were tested against TPC-H — a benchmark for decision support workloads — loaded into ClickHouse. Initially, with Decimal type support, 10 of 22 queries ran quickly, but only 3 fully pushed down from pg_clickhouse foreign tables to ClickHouse sources. One example was Query 3's joins:

EXPLAIN (ANALYZE, COSTS)
-- using default substitutions
select
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from
    customer,
    orders,
    lineitem
where
    c_mktsegment = 'BUILDING'
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < date '1995-03-15'
    and l_shipdate > date '1995-03-15'
group by
    l_orderkey,
    o_orderdate,
    o_shippriority
order by
    revenue desc,
    o_orderdate
LIMIT 10;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Foreign Scan  (cost=0.00..-10.00 rows=1 width=44) (actual time=60.146..60.162 rows=10.00 loops=1)
   Relations: Aggregate on (((customer) INNER JOIN (orders)) INNER JOIN (lineitem))
 FDW Time: 0.106 ms
 Planning:
   Buffers: shared hit=230
 Planning Time: 6.973 ms
 Execution Time: 61.567 ms
(7 rows)

Many other queries failed to push down efficiently, particularly those using JOINs to subqueries or EXISTS subqueries in WHERE clauses. Query 4 illustrates the latter case (with ANALYZE disabled due to long execution times):

EXPLAIN (COSTS, VERBOSE, BUFFERS)
-- using default substitutions
select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '1993-07-01' and o_orderdate < date(date '1993-07-01' + interval '3 month')
    and exists (select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate)
group by
    o_orderpriority
order by
    o_orderpriority;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=-80.86..-80.36 rows=200 width=40)
   Output: orders.o_orderpriority, (count(*))
   Sort Key: orders.o_orderpriority
   ->  HashAggregate  (cost=-90.50..-88.50 rows=200 width=40)
         Output: orders.o_orderpriority, count(*)
         Group Key: orders.o_orderpriority
         ->  Nested Loop  (cost=3.50..-93.00 rows=500 width=32)
               Output: orders.o_orderpriority
               Join Filter: (orders.o_orderkey = lineitem.l_orderkey)
               ->  HashAggregate  (cost=2.50..4.50 rows=200 width=4)
                     Output: lineitem.l_orderkey
                     Group Key: lineitem.l_orderkey
                     ->  Foreign Scan on tpch.lineitem  (cost=0.00..0.00 rows=0 width=4)
                           Output: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l_tax, lineitem.l_returnflag, lineitem.l_linestatus, lineitem.l_shipdate, lineitem.l_commitdate, lineitem.l_receiptdate, lineitem.l_shipinstruct, lineitem.l_shipmode, lineitem.l_comment
                           Remote SQL: SELECT l_orderkey FROM tpch.lineitem WHERE ((l_commitdate < l_receiptdate))
               ->  Foreign Scan on tpch.orders  (cost=1.00..-0.50 rows=1 width=36)
                     Output: orders.o_orderkey, orders.o_custkey, orders.o_orderstatus, orders.o_totalprice, orders.o_orderdate, orders.o_orderpriority, orders.o_clerk, orders.o_shippriority, orders.o_comment
                     Remote SQL: SELECT o_orderkey, o_orderpriority FROM tpch.orders WHERE ((o_orderdate >= '1993-07-01')) AND ((o_orderdate < '1993-10-01')) ORDER BY o_orderpriority ASC NULLS LAST
 Planning:
   Buffers: shared hit=236
(20 rows)

Two foreign scans deep in the plan are highly inefficient for such a straightforward query.

These issues were addressed through two main approaches:

  • Adjusting costs to encourage the PostgreSQL planner to push down queries, aligning with analytics use cases.
  • Crucially, adding support for SEMI JOIN pushdown.

These enhancements dramatically improved execution, bringing 21 of the 22 queries to under 1 second, with full pushdown achieved for 12, including Query 4:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
-- using default substitutions
select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '1993-07-01' and o_orderdate < date(date '1993-07-01' + interval '3 month')
    and exists (select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate)
group by
    o_orderpriority
order by
    o_orderpriority;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Foreign Scan  (cost=1.00..5.10 rows=1000 width=40) (actual time=51.835..51.847 rows=5.00 loops=1)
   Output: orders.o_orderpriority, (count(*))
   Relations: Aggregate on ((orders) LEFT SEMI JOIN (lineitem))
   Remote SQL: SELECT r1.o_orderpriority, count(*) FROM tpch.orders r1 LEFT SEMI JOIN tpch.lineitem r3 ON (((r3.l_commitdate < r3.l_receiptdate)) AND ((r1.o_orderkey = r3.l_orderkey))) WHERE ((r1.o_orderdate >= '1993-07-01')) AND ((r1.o_orderdate < '1993-10-01')) GROUP BY r1.o_orderpriority ORDER BY r1.o_orderpriority ASC
 FDW Time: 0.056 ms
 Planning:
   Buffers: shared hit=242
 Planning Time: 6.583 ms
 Execution Time: 54.937 ms
(9 rows)
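The Remote SQL illustrates the general rewrite: a correlated EXISTS subquery in PostgreSQL maps onto ClickHouse's LEFT SEMI JOIN, which emits each left-hand row at most once when a match exists. A minimal sketch of the equivalence:

```sql
-- PostgreSQL form: orders with at least one matching lineitem.
SELECT o_orderkey
FROM orders o
WHERE EXISTS (
    SELECT 1 FROM lineitem l WHERE l.l_orderkey = o.o_orderkey
);

-- ClickHouse form the FDW can generate instead:
SELECT o_orderkey
FROM orders AS o
LEFT SEMI JOIN lineitem AS l ON o.o_orderkey = l.l_orderkey;
```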

The table below compares query performance across regular PostgreSQL tables, pg_clickhouse before SEMI-JOIN optimization, and pg_clickhouse with SEMI-JOIN support (as released). Tests were conducted using TPC-H data at a scaling factor of 1 on PostgreSQL and ClickHouse tables. A '✅' signifies full pushdown, while a dash indicates query cancellation after 1 minute.

| Query | Postgres Runtime | Original Runtime | SEMI JOIN Runtime |
|------:|-----------------:|-----------------:|------------------:|
| 1     | 4478ms           | ✅ 82ms          | ✅ 73ms           |
| 2     | 560ms            | -                | -                 |
| 3     | 1454ms           | ✅ 74ms          | ✅ 74ms           |
| 4     | 650ms            | -                | ✅ 67ms           |
| 5     | 452ms            | -                | ✅ 104ms          |
| 6     | 740ms            | ✅ 33ms          | ✅ 42ms           |
| 7     | 633ms            | -                | ✅ 83ms           |
| 8     | 320ms            | -                | ✅ 114ms          |
| 9     | 3028ms           | -                | ✅ 136ms          |
| 10    | 6ms              | 10ms             | ✅ 10ms           |
| 11    | 213ms            | -                | ✅ 78ms           |
| 12    | 1101ms           | 99ms             | ✅ 37ms           |
| 13    | 967ms            | 1028ms           | 1242ms            |
| 14    | 193ms            | 168ms            | ✅ 51ms           |
| 15    | 1095ms           | 101ms            | 522ms             |
| 16    | 492ms            | 1387ms           | 1639ms            |
| 17    | 1802ms           | -                | 9ms               |
| 18    | 6185ms           | -                | 10ms              |
| 19    | 64ms             | 75ms             | 65ms              |
| 20    | 473ms            | -                | 4595ms            |
| 21    | 1334ms           | -                | 1702ms            |
| 22    | 257ms            | -                | 268ms             |

This table demonstrates significant performance improvements for most queries when using pg_clickhouse foreign tables with SEMI-JOIN support. While the query optimizer selected slower plans for a few cases (queries 13, 15, and 16), and Query 2's performance requires further investigation, the overall performance gain is undeniable.

The Future

These improvements represent a significant milestone, and this first release is a strong step forward. However, development continues. The primary focus is completing pushdown coverage for analytical workloads before introducing Data Manipulation Language (DML) features. The roadmap includes:

  • Optimizing planning for the remaining 10 un-pushed-down TPC-H queries.
  • Testing and fixing pushdown for ClickBench queries.
  • Supporting transparent pushdown of all PostgreSQL aggregate functions.
  • Supporting transparent pushdown of all PostgreSQL functions.
  • Implementing comprehensive subquery pushdown.
  • Allowing server-level and user-level ClickHouse settings via CREATE SERVER and CREATE USER.
  • Supporting all ClickHouse data types.
  • Supporting lightweight DELETEs and UPDATEs.
  • Supporting batch insertion via COPY.
  • Adding a function to execute an arbitrary ClickHouse query and return its results as tables.
  • Adding support for pushdown of UNION queries when all branches query the remote database.

And more. We encourage you to install pg_clickhouse from GitHub and PGXN releases and test it with your real-world workloads. Your feedback on any pushdown issues via the project's issue tracker will be invaluable for further improvements.