Data Sharding

Sharding is data partitioning strategy for distributing the data to multiple servers, where the data cannot fit in one server. It is also called as horizontal data partition

When can you Shard?
  • Data is too large fit in one server
  • Too many writes and IO’s can’t scale with single server
Types Of Sharding

There are plenty of approaches are available for sharding, but I want discuss few of the most commonly used ones.

Hash Based Sharding

In Hash based sharding, we use primary key of the data to identify the partition/shard it belongs to read and write. It also called as application sharding, as the application will be able identify the shard using a hash function.

PartitionId = hashFunction(Key) % numberOfShards
Dynamic Sharding

One of the main drawback with hash based sharding is data growth in particular shard/partition. Let’s assume partition1 has more data than it can fit it to one box, then we need to repartition the data in server and distribute to multiple server and also we need to change the partition algorithm to address the repartition data. In Dynamic sharding we will try to address this problem.

Dynamic sharding is extension of hash based sharding, however we will use an external lookup service for mapping details of partitionId and nodeId.

Dynamic sharding with lookup service

Let’s assume initially we have decided 256 partition is enough for 4 nodes/servers, but the data growth in server1 is more. We can address this in two ways

  • Add a node and break the partition range to 0-32 in node1 and 32-64 in new node.
  • We can redistribute the data to available nodes

It gives flexibility to redistribute the data among nodes without affecting the clients.

Dynamic sharding with lookup service after repartition

It has been widely used by lot of data stores use this sharding strategy. In HDFS filesystem metadata has been stored in Name node and clients uses this information to route the request to relevant data nodes. But one main draw back with this approach is NameNode is single point of failure and also any change in partition, it needs to update the details in single unit of work

HBase also uses similar lookup service based sharding for the regions present in region server, but it uses zookeeper to store the information. So it overcomes the single point of failure by using zookeeper, as zookeeper is strongly consistent and highly available

Entity Group Sharding

One main drawbacks with above mentioned approach in case of RBDMS. It has queries which may require join, in that case if the data is distributed, then it will need to fork and join across multiple nodes.

Entity Group sharding overcomes this drawback by storing the data relevant to the key in same shard for multiple tables/entities. So that single shard queries can be done. But the cross column joins can still lead to multiple shards query, this also can be addressed by choosing the relevant key for sharding based on query pattern.

Pitfalls to look for while Sharding
  • Identify the right partition key, otherwise we may end up with repartition
  • Shard is a single atomic unit, so each partition should fit in one server.
  • Hot spotting – Ex. Let’s assume you are designing twitter/instagram like application, you have used userId as shard key for the posts, if the user is celebrity, then the shard which has their posts will suffer with hot spotting
  • Sharding is operationally heavy and complex, do if really have to do