This content originally appeared on DEV Community and was authored by Ismael Vacco
Exact `COUNT(DISTINCT)` queries can crush Postgres at scale. In this post, I’ll show how HyperLogLog (HLL) gives you fast, tiny, and composable unique counts with ~1% error, saving time, CPU, and storage.
In my work building analytics pipelines, I ran into the pain of counting distinct users on billions of rows. Even with indexes and aggregations, queries were slow and expensive. That’s when I turned to HyperLogLog (HLL). It let me trade a tiny bit of accuracy for massive performance and storage gains. In this post, I’ll walk through the COUNT(DISTINCT) problem and how HLL solves it in Postgres.
The problem: exact distinct counts on massive data
Say you have an `events` table with billions of rows:
```sql
CREATE TABLE events (
    user_id     uuid NOT NULL,
    occurred_at timestamptz NOT NULL,
    event_type  text,
    payload     jsonb
);
```
A common metric is daily active users (DAU) or other unique counts:
```sql
SELECT date_trunc('day', occurred_at) AS day,
       COUNT(DISTINCT user_id) AS unique_users
FROM events
WHERE occurred_at >= now() - interval '60 days'
GROUP BY 1
ORDER BY 1;
```
This looks simple—and melts CPUs. Why?
- `COUNT(DISTINCT)` needs either a sort or a hash set of all seen keys per group.
- With high cardinality and wide time ranges, memory spills and disk sorts happen.
- Even with indexes, exact distinct remains expensive for huge datasets.
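You can confirm this on your own data by inspecting the plan. On a large `events` table you’ll typically see a `HashAggregate` or a sort feeding a `GroupAggregate`, often spilling to disk:

```sql
-- Inspect how Postgres executes the exact distinct count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', occurred_at) AS day,
       COUNT(DISTINCT user_id) AS unique_users
FROM events
WHERE occurred_at >= now() - interval '60 days'
GROUP BY 1;
```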
If you only need fast trending metrics and are OK with a tiny error (e.g., 0.8–2%), approximate distinct is the right trade‑off.
Enter HyperLogLog (HLL)
HyperLogLog is a probabilistic data structure that estimates the cardinality of a set. It’s fixed‑size, mergeable, and extremely fast. In Postgres, you get it via the `hll` extension.
- Storage is tiny (hundreds of bytes to a few KB per group).
- You can union HLLs (e.g., sum days into months) with no raw data.
- Error is predictable and configurable. A common setting yields ~0.8% relative error.
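As a toy illustration of mergeability (this assumes the `hll` extension is installed; the values are made up), two sketches built independently can be unioned, and the shared value is counted once:

```sql
-- Build two small sketches from literal values, then merge them.
WITH a AS (
    SELECT hll_add_agg(hll_hash_text(v)) AS s
    FROM unnest(ARRAY['u1', 'u2', 'u3']) AS v
), b AS (
    SELECT hll_add_agg(hll_hash_text(v)) AS s
    FROM unnest(ARRAY['u3', 'u4']) AS v
)
SELECT hll_cardinality(hll_union(a.s, b.s)) FROM a, b;  -- ≈ 4
```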
Accuracy rule of thumb: error ≈ `1.04 / sqrt(m)`, where `m = 2^log2m`. With `log2m = 14` (`m = 16384`), error ≈ `1.04 / 128 ≈ 0.81%`.
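If you want to eyeball the trade‑off for other register counts, the rule of thumb is easy to tabulate in SQL itself:

```sql
-- Relative error ≈ 1.04 / sqrt(2^log2m) for a range of register counts.
SELECT log2m,
       (2 ^ log2m)::int AS m,
       round((1.04 / sqrt(2 ^ log2m))::numeric, 4) AS rel_error
FROM generate_series(10, 16) AS log2m;
```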
Install the extension
```sql
CREATE EXTENSION IF NOT EXISTS hll;
```
Ensure the extension is installed on your Postgres instance (on all nodes if you’re sharded).
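You can verify it’s available with a quick catalog check:

```sql
-- Confirm the extension is installed and see its version.
SELECT extname, extversion FROM pg_extension WHERE extname = 'hll';
```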
HLL basics in Postgres
Key functions you’ll use:
- `hll_hash_*` — Hash a value to a 64‑bit digest (`hll_hash_text`, `hll_hash_integer`, etc.).
- `hll_add_agg(hashval)` — Aggregate many hashed values into an HLL sketch.
- `hll_union(hll, hll)` / `hll_union_agg(hll)` — Merge sketches.
- `hll_cardinality(hll)` — Get the estimated unique count.
- `hll_empty(log2m, regwidth, expthresh, sparseon)` — Create an empty sketch with chosen parameters.
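For example, the hash functions can be exercised directly on literals:

```sql
-- Each call returns a 64-bit hll_hashval, ready to feed hll_add_agg.
SELECT hll_hash_text('abc'), hll_hash_integer(42);
```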
One‑off approx count
```sql
SELECT hll_cardinality(
         hll_add_agg(hll_hash_text(user_id::text))
       )
FROM events
WHERE occurred_at >= now() - interval '60 days';
```
Daily uniques materialized (recommended)
Create a compact daily sketch table you can roll up at query time:
```sql
CREATE TABLE daily_uniques (
    day    date PRIMARY KEY,
    sketch hll NOT NULL
);
```
Backfill the last 60 days:
```sql
INSERT INTO daily_uniques (day, sketch)
SELECT day,
       hll_add_agg(hll_hash_text(user_id::text)) AS sketch
FROM (
    SELECT occurred_at::date AS day, user_id
    FROM events
    WHERE occurred_at >= now() - interval '60 days'
) src
GROUP BY day
ON CONFLICT (day) DO UPDATE
SET sketch = hll_union(daily_uniques.sketch, EXCLUDED.sketch);
```
Query daily uniques instantly:
```sql
SELECT day, hll_cardinality(sketch)::bigint AS unique_users
FROM daily_uniques
ORDER BY day;
```
Monthly rollups (no raw scans!)
```sql
SELECT date_trunc('month', day)::date AS month,
       hll_cardinality(hll_union_agg(sketch))::bigint AS uniques
FROM daily_uniques
GROUP BY 1
ORDER BY 1;
```
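The same trick works for any ad‑hoc window, again without touching raw events:

```sql
-- Unique users across an arbitrary date range, from daily sketches alone.
SELECT hll_cardinality(hll_union_agg(sketch))::bigint AS uniques
FROM daily_uniques
WHERE day BETWEEN DATE '2025-07-01' AND DATE '2025-07-31';
```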
Choosing parameters (precision vs size)
HLL precision is controlled primarily by `log2m` (number of registers `m = 2^log2m`).
- `log2m = 14` → `m = 16384` registers → ~0.81% error. Sketch size ~1.5 KB.
- `log2m = 12` → `m = 4096` registers → ~1.6% error. Sketch ~384–512 B.
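You can check what your sketches actually cost on disk with `pg_column_size`:

```sql
-- Per-row sketch size in bytes (varies with sparse vs dense encoding).
SELECT day, pg_column_size(sketch) AS sketch_bytes
FROM daily_uniques
ORDER BY day DESC
LIMIT 5;
```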
Create empty sketches explicitly if needed:
```sql
-- log2m=14, regwidth=5, expthresh=0 (skip the explicit representation),
-- sparseon=1 (keep the sparse encoding, better for small sets)
SELECT hll_empty(14, 5, 0, 1);
```
Tip: All sketches you union must share the same parameters (log2m, regwidth, etc.). Keep them consistent at table boundaries.
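One way to enforce this, assuming your version of the extension supports type modifiers as described in the postgresql‑hll README, is to pin the parameters in the column type itself so incompatible sketches are rejected at the boundary (table name is illustrative):

```sql
-- Hypothetical table pinning log2m=14 and regwidth=5 at the type level.
CREATE TABLE daily_uniques_pinned (
    day    date PRIMARY KEY,
    sketch hll(14, 5) NOT NULL
);
```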
Incremental updates
For streaming/event‑driven systems, upsert into the daily sketch as events arrive:
```sql
-- Example for a single-day batch upsert
WITH d AS (
    SELECT CURRENT_DATE AS day,
           hll_add_agg(hll_hash_text(user_id::text)) AS delta
    FROM new_events_batch
    GROUP BY 1
)
INSERT INTO daily_uniques (day, sketch)
SELECT day, delta FROM d
ON CONFLICT (day) DO UPDATE
SET sketch = hll_union(daily_uniques.sketch, EXCLUDED.sketch);
```
Or maintain a materialized view and schedule `REFRESH MATERIALIZED VIEW CONCURRENTLY`.
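A minimal sketch of that approach (names are illustrative; note that `CONCURRENTLY` requires a unique index on the view):

```sql
CREATE MATERIALIZED VIEW daily_uniques_mv AS
SELECT occurred_at::date AS day,
       hll_add_agg(hll_hash_text(user_id::text)) AS sketch
FROM events
GROUP BY 1;

CREATE UNIQUE INDEX ON daily_uniques_mv (day);

-- Run on a schedule (cron, pg_cron, etc.):
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_uniques_mv;
```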
Accuracy you can trust
HLL has well‑studied error bounds. In practice:
- With `log2m = 14`, relative error ≈ 0.8%.
- For very small sets (< ~`2.5m`), HLL++ uses a sparse/linear counting mode to keep accuracy high.
- You can empirically validate by sampling a few days and comparing to exact counts:
```sql
WITH exact AS (
    SELECT day, COUNT(DISTINCT user_id)::bigint AS exact
    FROM (
        SELECT occurred_at::date AS day, user_id
        FROM events
        -- Half-open range so the last day is fully covered:
        WHERE occurred_at >= DATE '2025-07-01'
          AND occurred_at <  DATE '2025-07-08'
    ) t
    GROUP BY 1
), approx AS (
    SELECT day, hll_cardinality(sketch)::bigint AS approx
    FROM daily_uniques
    WHERE day BETWEEN DATE '2025-07-01' AND DATE '2025-07-07'
)
SELECT e.day,
       e.exact,
       a.approx,
       ROUND(100.0 * (a.approx - e.exact) / NULLIF(e.exact, 0), 2) AS pct_err
FROM exact e
JOIN approx a USING (day)
ORDER BY 1;
```
Performance and storage wins
- Sketch tables are tiny (often >90% smaller than raw aggregations).
- Queries over sketches are CPU‑light and avoid massive sorts/hashes.
- Rollups (day→week→month) require only `hll_union_agg`, no raw re‑aggregation.
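To quantify the storage difference on your own instance:

```sql
-- Compare total on-disk footprint of raw events vs the sketch table.
SELECT pg_size_pretty(pg_total_relation_size('events'))        AS events_size,
       pg_size_pretty(pg_total_relation_size('daily_uniques')) AS sketches_size;
```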
In data‑warehouse practice, I’ve seen 10x+ speedups and order‑of‑magnitude storage savings when moving unique‑count metrics to HLL‑based tables. Your mileage will vary by schema and hardware—measure!
Common gotchas
- Consistency: Don’t change HLL parameters once you’ve started; union requires matching configs.
- Hashing: Always hash the stable identity (e.g., `user_id`), not a mutable attribute.
- Small groups: For very small sets, exact `COUNT(DISTINCT)` may be fine, and it’s exact.
- Explain it: Stakeholders must accept approximate counts. Document expected error.
Optional: using Go to pre‑aggregate
If you’re building Go services, you can either push raw events and let Postgres aggregate, or compute sketches in Go and store them.
- Server‑side HLL keeps your pipeline simpler.
- If you pre‑aggregate in Go (e.g., per shard/partition), store the sketch as `bytea` and convert to `hll` on ingest, or directly as `hll` via a custom function.
Wrap‑up
`COUNT(DISTINCT)` is perfect… until scale arrives. HLL lets you trade a tiny bit of accuracy for huge wins in performance and cost, while keeping your queries simple and your metrics composable.
In my case, introducing HLL completely changed how we approached unique metrics at scale. What used to be painful, long‑running queries became lightweight, predictable operations we could build real dashboards on. If you’re hitting the same wall, HLL might save your database too.
Have you used HLL in Postgres (or Redis/BigQuery/Redshift)? What configs worked best for your cardinality? Drop your experience in the comments!
Appendix: reference snippets
Create extension
```sql
CREATE EXTENSION hll;
```
Core functions
```sql
SELECT hll_hash_text('abc');
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id::text))) FROM events;
SELECT hll_cardinality(hll_union_agg(sketch)) FROM daily_uniques;
```
Empty sketch with chosen params
```sql
SELECT hll_empty(14, 5, 0, 1);
```