⚡ Scaling User Search with Bloom Filters in Node.js



This content originally appeared on DEV Community and was authored by Abhinav

When your system grows to millions of users, even the simplest operations—like checking if a phone number or email already exists—can become costly.

Yes, you can add database indexes, but every lookup still eats up I/O and CPU cycles. Under heavy signup traffic, this can quickly overwhelm your database.

This is where Bloom Filters come to the rescue. 🌸

🌱 What is a Bloom Filter?

A Bloom Filter is a probabilistic data structure used for set membership checks. It allows us to ask:

👉 “Does this value possibly exist?”

It can say:

  • ❌ Definitely Not → Safe to skip DB.
  • ✅ Might Exist → Confirm with DB.

This tiny compromise (allowing false positives, but never false negatives) gives us O(1) lookups with very little memory usage.

🔬 Anatomy of a Bloom Filter (Abscissa)

At its core, a Bloom filter is just:

  1. A Bit Array (size m) → starts with all 0’s.
  2. k Hash Functions → each maps an input to one of the m positions.

👉 When we add an element:

  • Run it through all k hash functions.
  • Flip those positions in the bit array to 1.

👉 When we check an element:

  • Run it through the same k functions.
  • If all those positions are 1 → the element might exist.
  • If any position is 0 → it definitely does not exist.

📈 Visual Abscissa

Think of it as a number line (abscissa = x-axis):

Bit Array (size m)
0 1 2 3 4 5 6 7 8 9 ... m-1
[0][0][0][0][0][0][0][0][0][0]
  • Each hash function picks some positions along this line.
  • Adding "alice@example.com" might flip positions 3, 7, and 9.
  • Checking "bob@example.com"? If one of its hash positions is still 0, we know Bob isn’t in the set.

⚖ Balancing Act

  • More bits (m) → fewer collisions, lower false positives.
  • More hash functions (k) → more accuracy, but also more computation.
  • The sweet spot depends on expected number of elements n.

Formula for optimal k:

k=n/m(​ln2)

This balance is why Bloom filters are tiny in memory yet mighty in scale.

🏗 Our Architecture

We built a Bloom Filter Service in Node.js that acts as a fast gatekeeper before the database.

It consists of:

  1. Routes Layer → API endpoints for clients.
  2. Handler Layer → Processes requests, interacts with the service.
  3. Service Layer → Manages Bloom filters, population, refresh, and lookups.

📜 Routes Layer

We expose three endpoints under /bloom_filter:

import express from 'express';
import { getBloomFilterStatus, refreshBloomFilter, checkPhoneExists } from './handler.js';

const router = express.Router();

router.get('/status', getBloomFilterStatus);
router.post('/refresh', refreshBloomFilter);
router.get('/check', checkPhoneExists);

export default router;
  • GET /status → Monitor filters.
  • POST /refresh → Force a refresh.
  • GET /check?phoneNumber=... → Check existence.

⚙ Handler Layer

The handlers sit between API requests and the service. They manage errors and responses:

import userSearchBloomFilter from '../../services/userSearchBloomFilter.js';
import { generateInternalServerErrorRepsonse } from '../../utils/errorHandler.js';

export async function getBloomFilterStatus(req, res) {
    try {
        const status = userSearchBloomFilter.getStatus();
        return res.status(200).json({ success: true, data: status });
    } catch (error) {
        const errorResponse = await generateInternalServerErrorRepsonse(error, 'getBloomFilterStatus');
        return res.status(500).json(errorResponse);
    }
}

export async function refreshBloomFilter(req, res) {
    try {
        await userSearchBloomFilter.refresh();
        return res.status(200).json({
            success: true,
            message: 'Bloom filter refreshed successfully'
        });
    } catch (error) {
        const errorResponse = await generateInternalServerErrorRepsonse(error, 'refreshBloomFilter');
        return res.status(500).json(errorResponse);
    }
}

export async function checkPhoneExists(req, res) {
    try {
        const { phoneNumber } = req.query;

        if (!phoneNumber) {
            return res.status(400).json({
                success: false,
                error: 'Phone number is required'
            });
        }

        const mightExist = userSearchBloomFilter.mightExist(phoneNumber);

        return res.status(200).json({
            success: true,
            data: {
                phoneNumber,
                mightExist,
                note: mightExist
                    ? 'Might exist - check database'
                    : 'Definitely does not exist'
            }
        });
    } catch (error) {
        const errorResponse = await generateInternalServerErrorRepsonse(error, 'checkPhoneExists');
        return res.status(500).json(errorResponse);
    }
}

👉 Notice how checkPhoneExists does not immediately hit the DB. It asks the Bloom filter first.

🧠 Service Layer: Core Bloom Filter Logic

This is where the real magic happens.

We maintain four Bloom filters:

  • emailFilter
  • phoneFilter
  • alternateEmailFilter
  • alternatePhoneFilter

Each filter is initialized with a target capacity and error rate (0.01 = 1%).

import BloomFilter from '../utils/BloomFilter.js';
import User from '../models/user.js';
import logger from '../setup/logger.js';

class UserSearchBloomFilterService {
    constructor() {
        this.emailFilter = new BloomFilter(100000, 0.01);
        this.phoneFilter = new BloomFilter(100000, 0.01);
        this.alternateEmailFilter = new BloomFilter(50000, 0.01);
        this.alternatePhoneFilter = new BloomFilter(50000, 0.01);
        this.isInitialized = false;
        this.lastUpdated = null;
        this.updateInterval = 24 * 60 * 60 * 1000; // 24 hours
    }

🔄 Populating the Filters

On startup, we fetch users in batches and add their identifiers into the filters:

async populateFilters() {
    const batchSize = 1000;
    let offset = 0;
    let hasMoreUsers = true;

    while (hasMoreUsers) {
        const users = await User.query(qb => {
            qb.select('email', 'phone_number', 'alternate_email', 'alternate_phone');
            qb.whereNotNull('email').orWhereNotNull('phone_number');
            qb.limit(batchSize);
            qb.offset(offset);
        }).fetchAll();

        const userList = users.toJSON();
        if (userList.length === 0) {
            hasMoreUsers = false;
            break;
        }

        userList.forEach(user => {
            if (user.email) this.emailFilter.add(user.email);
            if (user.phone_number) this.phoneFilter.add(user.phone_number);
            if (user.alternate_email) this.alternateEmailFilter.add(user.alternate_email);
            if (user.alternate_phone) this.alternatePhoneFilter.add(user.alternate_phone);
        });

        offset += batchSize;
        logger.info(`Processed ${offset} users for bloom filter`);
    }

    logger.info(`Bloom filter population completed. Total users processed: ${offset}`);
}

This ensures all existing users are represented in the filter.

⚡ Lookup Logic

When a phone/email check request arrives:

mightExist(searchKey) {
    if (!this.isInitialized) {
        return true; // fail-safe until initialized
    }

    const normalizedKey = searchKey.toLowerCase().trim();

    return (
        this.emailFilter.mightContain(normalizedKey) ||
        this.phoneFilter.mightContain(normalizedKey) ||
        this.alternateEmailFilter.mightContain(normalizedKey) ||
        this.alternatePhoneFilter.mightContain(normalizedKey)
    );
}

👉 If it returns false, we know for sure the user doesn’t exist.
👉 If it returns true, we query the DB to confirm.

🕒 Auto-Refreshing

To stay fresh with new data, we schedule a 24-hour refresh:

schedulePeriodicUpdate() {
    setInterval(async () => {
        try {
            logger.info('Starting scheduled bloom filter update');
            await this.refresh();
        } catch (error) {
            logger.error('Scheduled bloom filter update failed:', error);
        }
    }, this.updateInterval);
}

This clears and repopulates the filters.

📊 Status Reporting

Finally, we can inspect filter health:

getStatus() {
    return {
        isInitialized: this.isInitialized,
        lastUpdated: this.lastUpdated,
        emailFilterStats: this.emailFilter.getStats(),
        phoneFilterStats: this.phoneFilter.getStats(),
        alternateEmailFilterStats: this.alternateEmailFilter.getStats(),
        alternatePhoneFilterStats: this.alternatePhoneFilter.getStats()
    };
}

🚀 Example Flow

A signup form checks if 1231881971 exists:

  1. Client calls → GET /bloom_filter/check?phoneNumber=<phoneNumber>
  2. Bloom filter says:
  • ❌ Not in set → Return immediately (skip DB).
  • ✅ Might exist → Query DB to confirm.

This cuts DB load massively.

bloom-filter-search-postman

✅ Benefits

  • O(1) Lookups → Super fast.
  • Reduced DB Load → Fewer queries.
  • Scalable → Handles millions of entries.
  • Fault-Tolerant → Always errs on the safe side (false positives only).

⚠ Limitations

  • False Positives → Might say “exists” when it doesn’t.
  • No Deletion → Standard Bloom filters don’t support removing entries.
  • Cold Start → Until initialized, returns “might exist” to avoid false negatives.

🌟 Final Thoughts

Bloom Filters are not a database replacement—they’re an optimization layer.
Think of them as a bouncer at the club entrance:

  • If you’re definitely not on the list, you’re turned away instantly.
  • If you might be on the list, they’ll double-check inside.

For high-traffic systems, this simple gatekeeping can make a world of difference.


This content originally appeared on DEV Community and was authored by Abhinav