Database Sharding
Database sharding is a horizontal scaling technique that partitions a large database across multiple machines (shards). Each shard holds a subset of the data, so the system as a whole can handle more data and traffic than a single machine can support.
Explanation
When a single database can no longer handle the read/write volume, you distribute the data across multiple machines. Each shard is an independent database containing a horizontal partition of the data, and the sharding logic determines which shard holds each row.

Sharding strategies:
Range-based: shard by value ranges (users A-M on shard 1, N-Z on shard 2). Simple, but creates hotspots if one range is more active than the others.
Hash-based: a hash function applied to the shard key determines the shard. Distributes data evenly, but makes range queries hard because adjacent keys land on different shards.
Directory-based: a lookup table maps keys to shards. The most flexible option, but the lookup itself can become a bottleneck.

The shard key is the most critical decision. A good key distributes writes evenly and allows most queries to touch only one shard, avoiding cross-shard joins. user_id is a common shard key for user data: user-specific queries go to exactly one shard. A bad key creates hotspots. For example, range-sharding on a monotonically increasing key such as a creation timestamp sends all new writes to one shard.

Cross-shard complications: cross-shard joins require querying all shards and merging the results in the application. Distributed transactions that span shards typically require a protocol such as two-phase commit, which is complex and slow. Re-sharding (splitting a full shard) is operationally painful.

Exhaust all other options first: read replicas, caching, query optimization, and vertical scaling.
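The range-based and directory-based strategies described above can be sketched as two small routing functions. This is a minimal illustration, not a production router: the shard names, the range boundaries, the tenant IDs, and the function names (rangeShard, directoryShard, reassign) are all hypothetical, and an in-memory Map stands in for what would normally be a shared lookup store.

```javascript
// Hypothetical shard names and boundaries, for illustration only.
const RANGES = [
  { upTo: 'm', shard: 'shard-1' }, // first letter up to "m"
  { upTo: 'z', shard: 'shard-2' }, // first letter "n" through "z"
];

// Range-based: pick the first range whose upper bound covers the key.
function rangeShard(username) {
  const first = username[0].toLowerCase();
  const match = RANGES.find(r => first <= r.upTo);
  if (!match) throw new Error(`No range covers key: ${username}`);
  return match.shard;
}

// Directory-based: an explicit lookup table from key to shard.
// A Map stands in for a shared store that every router would consult.
const directory = new Map([
  ['tenant-42', 'shard-1'],
  ['tenant-77', 'shard-2'],
]);

function directoryShard(tenantId) {
  const shard = directory.get(tenantId);
  if (shard === undefined) throw new Error(`Unknown key: ${tenantId}`);
  return shard;
}

// Moving a tenant is a directory update (plus copying its data, not shown).
function reassign(tenantId, shard) {
  directory.set(tenantId, shard);
}
```

The trade-off shows up in the code: the range router needs no state beyond the boundary list, while the directory router can place any key anywhere but has to consult the table on every request.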
Code Example
// Hash-based sharding: route queries to the right shard.
// createDbConnection is assumed to be defined elsewhere and to return a
// client exposing query(text, params), such as a node-postgres Pool.
const NUM_SHARDS = 4;
const shards = [
  createDbConnection(process.env.SHARD_0_URL),
  createDbConnection(process.env.SHARD_1_URL),
  createDbConnection(process.env.SHARD_2_URL),
  createDbConnection(process.env.SHARD_3_URL),
];

function getShardIndex(userId) {
  // Assumes userId ends in four hex characters (e.g. a UUID string).
  // Consistent hashing is preferred in production (minimizes re-sharding).
  const suffix = String(userId).slice(-4);
  if (!/^[0-9a-f]{4}$/i.test(suffix)) {
    throw new Error(`Cannot derive shard from user id: ${userId}`);
  }
  return parseInt(suffix, 16) % NUM_SHARDS;
}

async function getUserById(userId) {
  const db = shards[getShardIndex(userId)];
  return db.query('SELECT * FROM users WHERE id = $1', [userId]);
}

// Cross-shard query: scatter-gather (expensive!)
async function searchAllShards(query) {
  const results = await Promise.all(
    shards.map(db => db.query(
      'SELECT * FROM users WHERE name ILIKE $1', [`%${query}%`]
    ))
  );
  return results.flatMap(r => r.rows);
}
// This issues one query per shard, O(shards), and waits for the slowest one.
// Design queries to avoid cross-shard access when possible.
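The comment above recommends consistent hashing over plain modulo routing. The following is a minimal sketch of a hash ring with virtual nodes, written to show why: it counts how many keys change shard when a fifth shard is added under each scheme. The names hash32 and HashRing, the shard names, the 100 virtual nodes per shard, and the use of SHA-256 as the hash are all assumptions made for illustration.

```javascript
const crypto = require('crypto');

// Map any string to a 32-bit unsigned integer position on the ring.
function hash32(value) {
  return crypto.createHash('sha256').update(value).digest().readUInt32BE(0);
}

class HashRing {
  // vnodes: virtual nodes per shard; more vnodes smooths the distribution.
  constructor(shardNames, vnodes = 100) {
    this.vnodes = vnodes;
    this.ring = []; // sorted array of { point, shard }
    shardNames.forEach(name => this.addShard(name));
  }

  addShard(name) {
    for (let i = 0; i < this.vnodes; i++) {
      this.ring.push({ point: hash32(`${name}#${i}`), shard: name });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // A key belongs to the first ring point at or after its hash,
  // wrapping around to the start of the ring if there is none.
  getShard(key) {
    const h = hash32(key);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].shard;
  }
}

// Compare how many keys move when a fifth shard is added.
const keys = Array.from({ length: 10000 }, (_, i) => `user-${i}`);
const ring = new HashRing(['shard-0', 'shard-1', 'shard-2', 'shard-3']);
const before = keys.map(k => ring.getShard(k));
ring.addShard('shard-4');
const after = keys.map(k => ring.getShard(k));

const movedRing = keys.filter((k, i) => after[i] !== before[i]).length;
const movedModulo = keys.filter(k => hash32(k) % 4 !== hash32(k) % 5).length;
console.log({ movedRing, movedModulo });
```

On the ring, a key only moves if one of the new shard's points lands between the key and its previous owner, so every key that moves goes to the new shard and the rest stay put. With modulo routing, changing the divisor from 4 to 5 reassigns most keys, which is the re-sharding cost the ring is meant to reduce.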
Why It Matters for Engineers
Database sharding is a core systems design topic for staff/principal engineer roles. 'Design Instagram,' 'design a key-value store,' and 'scale your database to 10 billion rows' all involve sharding decisions. Understanding shard key selection, hotspots, cross-shard query costs, and re-sharding complexity demonstrates senior engineering judgment. In production, sharding decisions are expensive to reverse. A poor shard key can require months of painful data migration to fix.
Related Terms
SQL · NoSQL · Cache · CAP Theorem