This content originally appeared on DEV Community and was authored by ZeeshanAli-0704
Introduction
The world generates massive amounts of data every day. It is practically impossible to store all this data on a single server. This necessitates horizontal scaling, where data is distributed across multiple servers.
While vertical scaling allows for storing all data in one place, horizontal scaling requires careful organization to ensure rapid access to data across different servers.
We will first discuss a basic hashing approach and its implementation, followed by designing a more resilient system to address this problem using the principle of consistent hashing.
Problem
Imagine we have n data objects that need to be stored across k different servers. Now, consider that the configuration of these servers can change over time:
- Any server can be shut down
- A new server can be added to the system
Given these potential changes, we need to design a system that can rapidly retrieve required data blocks and efficiently transfer data between servers when configurations change.
Hashing Implementation
The hashing implementation involves distributing data across different servers based on a hash function.
When adding a new data block to our system, we use its key in the hash function, which outputs the server number to which this block will belong.
For data retrieval, we calculate the hash value of the given key to find the server where the information is stored.
It is crucial that the hash function uniformly distributes data so that each server has approximately the same amount of data. If not, one server may become overloaded.
This system works well until changes are made.
For example, if server S3 is shut down, its data is no longer accessible, and new data that hashes to its bucket cannot be added.
Whenever a server is shut down, its data becomes inaccessible, necessitating the redistribution of all data blocks across the remaining servers.
Similarly, adding a new server requires modifying the hash function to account for the new server.
This problem is known as the Rehashing Problem.
In the case of system configuration changes, redistributing all data is resource-intensive and inefficient for large data volumes and frequent changes.
Consistent Hashing
Consistent hashing is a more resilient alternative to the basic hashing system.
Consistent hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table. It assigns servers and data objects positions on an abstract circle or hash ring, allowing them to scale without affecting the overall system.
Details with Example
Consistent hashing hashes both data and servers to the same set of values [0, n]. To visualize, imagine all hash values are located on a ring (or clock). Each server has its own hash range.
A server’s hash range is the interval of hash values on the hash ring before the server’s hash value and after the hash value of the nearest server counter-clockwise.
To determine which server a key belongs to, move clockwise from the key’s hash value until you reach a server’s hash value. That server will store the data for the key.
Hashed values for servers should be stored in ascending order for rapid access. Using binary search, you can find a server storing a given key in O(log S) time, where S is the number of servers.
Shutting Down a Server
If a server is shut down, delete its hash value and transfer its data to the next server clockwise. This advantageously avoids redistributing all data, unlike simple hashing.
Adding a New Server
When adding a new server, transfer the data associated with hash values between the new server’s hash value and the nearest server’s hash value counter-clockwise.
Uneven Distributions
Consistent hashing can result in uneven data distribution. This may be due to the hash function not uniformly generating keys, leading to disproportional hash ranges among servers.
Even with even data distribution initially, configuration changes can skew the distribution over time, increasing average response time.
To mitigate this, periodically redistribute data using another hash function when distribution becomes skewed. However, this is suboptimal for millions or billions of data objects.
Virtual Nodes
Virtual nodes extend consistent hashing, making the system more resilient to uneven data distributions. Each server is hashed multiple times with different hash functions, creating multiple virtual nodes.
Shutting down a server deletes all its virtual nodes, transferring data to multiple servers. Adding a new server involves calculating hash values for its virtual nodes with the same hash functions used for other servers.
Increasing the number of virtual nodes aligns hash ranges better, but also increases the time for configuration changes and requires additional metadata storage.
It is best to choose the number of virtual nodes based on the problem, available servers, and data quantity. Tuning this parameter helps find the optimal trade-off.
Including diagrams can help visualize these concepts better.
This content originally appeared on DEV Community and was authored by ZeeshanAli-0704
