The COUNT(DISTINCT) Problem in Postgres (and How HLL Fixes It)



This content originally appeared on DEV Community and was authored by Ismael Vacco

Exact COUNT(DISTINCT) queries can crush Postgres at scale. In this post, I’ll show how HyperLogLog (HLL) gives you fast, tiny, and composable unique counts with ~1% error—saving time, CPU, and storage.

In my work building analytics pipelines, I ran into the pain of counting distinct users on billions of rows. Even with indexes and aggregations, queries were slow and expensive. That’s when I turned to HyperLogLog (HLL). It let me trade a tiny bit of accuracy for massive performance and storage gains. In this post, I’ll walk through the COUNT(DISTINCT) problem and how HLL solves it in Postgres.

The problem: exact distinct counts on massive data

Say you have an events table with billions of rows:

CREATE TABLE events (
  user_id      uuid NOT NULL,
  occurred_at  timestamptz NOT NULL,
  event_type   text,
  payload      jsonb
);

A common metric is daily active users (DAU) or other unique counts:

SELECT date_trunc('day', occurred_at) AS day,
       COUNT(DISTINCT user_id)        AS unique_users
FROM events
WHERE occurred_at >= now() - interval '60 days'
GROUP BY 1
ORDER BY 1;

This looks simple—and melts CPUs. Why?

  • COUNT(DISTINCT) needs either a sort or a hash set of all seen keys per group.
  • With high cardinality and wide time ranges, memory spills and disk sorts happen.
  • Even with indexes, exact distinct remains expensive for huge datasets.
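
You can watch this happen with EXPLAIN. Plan shape varies by Postgres version and data, but on a large events table expect a sort (or a hash that spills to disk) over every matching row:

EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', occurred_at) AS day,
       COUNT(DISTINCT user_id)        AS unique_users
FROM events
WHERE occurred_at >= now() - interval '60 days'
GROUP BY 1
ORDER BY 1;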

If you only need fast trending metrics and are OK with a tiny error (e.g., 0.8–2%), approximate distinct is the right trade‑off.

Enter HyperLogLog (HLL)

HyperLogLog is a probabilistic data structure that estimates the cardinality of a set. It’s fixed‑size, mergeable, and extremely fast. In Postgres, you get it via the hll extension.

  • Storage is tiny (hundreds of bytes to a few KB per group, depending on parameters).
  • You can union HLLs (e.g., sum days into months) with no raw data.
  • Error is predictable and configurable. A common setting yields ~0.8% relative error.

Accuracy rule of thumb: error ≈ 1.04 / sqrt(m), where m = 2^log2m. With log2m = 14 (m = 16384), error ≈ 1.04 / 128 ≈ 0.81%.
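
If you want to tabulate that rule of thumb for candidate settings, plain SQL will do it (no extension required):

-- Expected relative error (%) for log2m = 10..16
SELECT log2m,
       round((100 * 1.04 / sqrt(2 ^ log2m))::numeric, 2) AS approx_pct_error
FROM generate_series(10, 16) AS log2m;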

Install the extension

CREATE EXTENSION IF NOT EXISTS hll;

The hll extension ships separately from core Postgres, so install the postgresql-hll package on the server first (on all nodes if you’re sharded).

HLL basics in Postgres

Key functions you’ll use:

  • hll_hash_* — Hash a value to a 64‑bit digest (hll_hash_text, hll_hash_integer, etc.).
  • hll_add_agg(hashval) — Aggregate many hashed values into an HLL sketch.
  • hll_union(hll, hll) / hll_union_agg(hll) — Merge sketches.
  • hll_cardinality(hll) — Get the estimated unique count.
  • hll_empty(log2m, regwidth, expthresh, sparseon) — Create an empty sketch with chosen parameters.
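
Here’s a tiny smoke test tying these together; three hashed values with one duplicate should estimate roughly 2:

SELECT hll_cardinality(hll_add_agg(h)) AS approx_distinct
FROM (VALUES
  (hll_hash_text('alice')),
  (hll_hash_text('bob')),
  (hll_hash_text('alice'))
) AS v(h);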

One‑off approx count

SELECT hll_cardinality(
  hll_add_agg(hll_hash_text(user_id::text))
)
FROM events
WHERE occurred_at >= now() - interval '60 days';

Daily uniques materialized (recommended)

Create a compact daily sketch table you can roll up at query time:

CREATE TABLE daily_uniques (
  day   date PRIMARY KEY,
  sketch hll NOT NULL
);

Backfill the last 60 days:

INSERT INTO daily_uniques (day, sketch)
SELECT day,
       hll_add_agg(hll_hash_text(user_id::text)) AS sketch
FROM (
  SELECT occurred_at::date AS day, user_id
  FROM events
  WHERE occurred_at >= now() - interval '60 days'
) src
GROUP BY day
ON CONFLICT (day) DO UPDATE
  SET sketch = hll_union(daily_uniques.sketch, EXCLUDED.sketch);

Query daily uniques instantly:

SELECT day, hll_cardinality(sketch)::bigint AS unique_users
FROM daily_uniques
ORDER BY day;
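
Because sketches merge, the daily table also answers questions it was never explicitly built for. For example, trailing 7‑day uniques via a self‑join, with no raw events touched (a sketch):

SELECT d.day,
       hll_cardinality(hll_union_agg(w.sketch))::bigint AS uniques_7d
FROM daily_uniques d
JOIN daily_uniques w
  ON w.day BETWEEN d.day - 6 AND d.day
GROUP BY d.day
ORDER BY d.day;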

Monthly rollups (no raw scans!)

SELECT date_trunc('month', day)::date AS month,
       hll_cardinality(hll_union_agg(sketch))::bigint AS uniques
FROM daily_uniques
GROUP BY 1
ORDER BY 1;

Choosing parameters (precision vs size)

HLL precision is controlled primarily by log2m (number of registers = 2^log2m).

  • m = 16384 (log2m = 14) → ~0.81% error; dense sketch ≈ m × regwidth / 8 ≈ 10 KB at regwidth = 5.
  • m = 4096 (log2m = 12) → ~1.6% error; dense sketch ≈ 2.5 KB at regwidth = 5.

Create empty sketches explicitly if needed:

-- log2m=14, regwidth=5, expthresh=0 (skip the explicit encoding), sparseon=1 (sparse encoding for small sets)
SELECT hll_empty(14, 5, 0, 1);

Tip: All sketches you union must share the same parameters (log2m, regwidth, etc.). Keep them consistent at table boundaries.
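
postgresql-hll also supports parameterizing the column type itself (typmod arguments in the same order as hll_empty); if your version supports it, this enforces matching sketches at the schema level:

-- log2m=14, regwidth=5 enforced for every sketch stored in this column
CREATE TABLE daily_uniques_hiprec (
  day    date PRIMARY KEY,
  sketch hll(14, 5) NOT NULL
);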

Incremental updates

For streaming/event‑driven systems, upsert into the daily sketch as events arrive:

-- Example for a single day batch upsert
WITH d AS (
  SELECT CURRENT_DATE AS day,
         hll_add_agg(hll_hash_text(user_id::text)) AS delta
  FROM new_events_batch
  GROUP BY 1
)
INSERT INTO daily_uniques (day, sketch)
SELECT day, delta FROM d
ON CONFLICT (day) DO UPDATE
  SET sketch = hll_union(daily_uniques.sketch, EXCLUDED.sketch);

Or maintain a materialized view and schedule REFRESH MATERIALIZED VIEW CONCURRENTLY.
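
A minimal sketch of that variant (note that a full refresh rescans events, so the upsert path above usually scales better):

CREATE MATERIALIZED VIEW daily_uniques_mv AS
SELECT occurred_at::date AS day,
       hll_add_agg(hll_hash_text(user_id::text)) AS sketch
FROM events
GROUP BY 1;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX daily_uniques_mv_day_idx ON daily_uniques_mv (day);

REFRESH MATERIALIZED VIEW CONCURRENTLY daily_uniques_mv;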

Accuracy you can trust

HLL has well‑studied error bounds. In practice:

  • With log2m = 14, relative error ≈ 0.8%.
  • For very small sets (cardinality below roughly 2.5·m), sparse and linear‑counting modes keep accuracy high.
  • You can empirically validate by sampling a few days and comparing to exact counts, as in the query below.

WITH exact AS (
  SELECT day, COUNT(DISTINCT user_id)::bigint AS exact
  FROM (
    SELECT occurred_at::date AS day, user_id
    FROM events
    WHERE occurred_at BETWEEN DATE '2025-07-01' AND DATE '2025-07-07'
  ) t
  GROUP BY 1
), approx AS (
  SELECT day, hll_cardinality(sketch)::bigint AS approx
  FROM daily_uniques
  WHERE day BETWEEN DATE '2025-07-01' AND DATE '2025-07-07'
)
SELECT e.day,
       e.exact,
       a.approx,
       ROUND(100.0 * (a.approx - e.exact) / NULLIF(e.exact,0), 2) AS pct_err
FROM exact e JOIN approx a USING (day)
ORDER BY 1;

Performance and storage wins

  • Sketch tables are tiny (often >90% smaller than raw aggregations).
  • Queries over sketches are CPU‑light and avoid massive sorts/hashes.
  • Rollups (day→week→month) require only hll_union_agg, no raw re‑aggregation.

In data‑warehouse practice, I’ve seen 10x+ speedups and order‑of‑magnitude storage savings when moving unique‑count metrics to HLL‑based tables. Your mileage will vary by schema and hardware—measure!
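
Measuring the storage side is a single query against the catalogs:

SELECT pg_size_pretty(pg_total_relation_size('daily_uniques')) AS sketch_table_size,
       pg_size_pretty(pg_total_relation_size('events'))        AS raw_table_size;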

Common gotchas

  • Consistency: Don’t change HLL parameters once you’ve started; union requires matching configs.
  • Hashing: Always hash the stable identity (e.g., user_id), not a mutable attribute.
  • Small groups: For very small sets, exact COUNT(DISTINCT) may be cheap enough, and it stays exact.
  • Explain it: Stakeholders must accept approximate counts. Document expected error.

Optional: using Go to pre‑aggregate

If you’re building Go services, you can either push raw events and let Postgres aggregate, or compute sketches in Go and store them.

  • Server‑side HLL keeps your pipeline simpler.
  • If you pre‑aggregate in Go (e.g., per shard/partition), store the sketch as bytea and convert to hll on ingest, or directly as hll via a custom function.

Wrap‑up

COUNT(DISTINCT) is perfect… until scale arrives. HLL lets you trade a tiny bit of accuracy for huge wins in performance and cost, while keeping your queries simple and your metrics composable.

In my case, introducing HLL completely changed how we approached unique metrics at scale. What used to be painful, long‑running queries became lightweight, predictable operations we could build real dashboards on. If you’re hitting the same wall, HLL might save your database too.

Have you used HLL in Postgres (or Redis/BigQuery/Redshift)? What configs worked best for your cardinality? Drop your experience in the comments!

Appendix: reference snippets

Create extension

CREATE EXTENSION hll;

Core functions

SELECT hll_hash_text('abc');
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id::text))) FROM events;
SELECT hll_cardinality(hll_union_agg(sketch)) FROM daily_uniques;

Empty sketch with chosen params

SELECT hll_empty(14, 5, 0, 1);

