How to Find Top-K Trending Hashtags from a Stream of Tweets Using Count-Min Sketch



This content originally appeared on DEV Community and was authored by VipraTech Solutions

When working with massive streams of data like hashtags from tweets, storing and processing every individual hashtag becomes impractical due to high memory usage and real-time constraints.

👉 The goal:

Find the top-K trending hashtags in real time, from a continuous stream of tweets.

✅ Problem Statement

A large-scale tweet stream sends thousands of hashtags per second.

Your task: Continuously identify the most popular hashtags trending right now.

✅ Challenges:

  • Huge volume of hashtags makes exact counting infeasible.
  • Limited memory and need for fast processing.
  • Real-time approximate results with good accuracy.

🧐 Two Practical Approaches

1⃣ Multiple CMS Time Window Approach

✔ What It Solves:

Enables answering questions like:

  • “What are the top trending hashtags in the last 15 minutes?”
  • “What are the top trending hashtags in the last 30 minutes?”
  • “What are the top trending hashtags in the last 1 hour?”

✔ How It Works:

  • Maintain multiple CMS instances, one per fixed time window (e.g., one CMS per minute).
  • Keep only the latest N CMS instances corresponding to the desired time window (sliding window).
  • At query time, sum counts across the relevant CMS instances.

✅ Java Implementation Example:

public class SlidingWindowCMS {
    private final int windowSize;
    private final LinkedList<CountMinSketch> cmsList;

    public SlidingWindowCMS(int windowSize, int rows, int cols) {
        this.windowSize = windowSize;
        this.cmsList = new LinkedList<>();
    }

    public void addNewMinuteCMS(CountMinSketch newCms) {
        if (cmsList.size() >= windowSize) {
            cmsList.removeFirst();
        }
        cmsList.addLast(newCms);
    }

    public int query(String key) {
        int totalCount = 0;
        for (CountMinSketch cms : cmsList) {
            totalCount += cms.count(key);
        }
        return totalCount;
    }
}

2⃣ Decaying Count Approach

✔ What It Solves:

Answers questions such as:

  • “What are the currently trending hashtags, giving more weight to recent data?”

✔ How It Works:

  • Use a single CMS instance.
  • Periodically apply a decay factor to all counters (e.g., multiply by 0.99 every minute).
  • Recent hashtags remain significant while older counts fade automatically.
  • Eliminates the need for storing multiple CMS instances.

✅ Java Implementation Example:

public class DecayingCMS {
    private final CountMinSketch cms;
    private final double decayFactor;

    public DecayingCMS(int rows, int cols, double decayFactor) {
        this.cms = new CountMinSketch(rows, cols);
        this.decayFactor = decayFactor;
    }

    public void add(String key) {
        cms.add(key);
    }

    public int query(String key) {
        return cms.count(key);
    }

    public void applyDecay() {
        cms.applyDecay(decayFactor);
    }
}

✅ Count-Min Sketch Implementation Example:

public class CountMinSketch {
    private final int[][] table;
    private final int[] seeds;
    private final int rows;
    private final int cols;
    private final Random rand = new Random();

    public CountMinSketch(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.table = new int[rows][cols];
        this.seeds = new int[rows];
        for (int i = 0; i < rows; i++) seeds[i] = rand.nextInt();
    }

    public void add(String key) {
        for (int i = 0; i < rows; i++) {
            int index = Math.abs(hash(key, seeds[i]) % cols);
            table[i][index]++;
        }
    }

    public int count(String key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < rows; i++) {
            int index = Math.abs(hash(key, seeds[i]) % cols);
            min = Math.min(min, table[i][index]);
        }
        return min;
    }

    public void applyDecay(double factor) {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                table[i][j] *= factor;
    }

    private int hash(String key, int seed) {
        int hash = 0;
        for (char c : key.toCharArray()) {
            hash = hash * seed + c;
        }
        return hash;
    }
}

✅ Conclusion

When processing streams of hashtags in real time:

  • ✅ Use Multiple CMS Time Windows if you need exact top-K in fixed time windows (e.g., last 15 min, 1 hour).
  • ✅ Use Decaying Counts CMS for smooth, approximate real-time trending insights without separate windows.

Both approaches solve real-world problems depending on your requirements.

👉 For a deeper overview of how Count-Min Sketch works in general, check out our detailed CMS overview 👉 Count-Min Sketch Overview.


This content originally appeared on DEV Community and was authored by VipraTech Solutions