Beyond the Basics: A Data-Driven Guide to Database Partitioning

Author: Prasanna Padmanabhan | Published: January 15, 2025


For architects managing massive datasets, query performance isn't just a goal; it's a constant battle. When your tables swell to 40 million records and beyond, even well-indexed queries can slow to a crawl. The solution is often partitioning, but choosing the right strategy—None, HASH, RANGE, or Subpartitioning—can be paralyzing.

This analysis, based on benchmarks from a 40-million-record table, moves beyond theory to provide a data-driven framework for making that choice. The key takeaway? The best partitioning strategy is dictated entirely by your application's most critical query patterns.


Level 1: When Simple Partitioning Shines

Simple partitioning involves splitting a table based on a single rule. It's highly effective when your queries consistently filter on one specific key.

RANGE Partitioning: The Time-Series Champion 🗓️

If your application frequently queries data within date or time ranges, RANGE partitioning is your best option. By partitioning the table by start_time, the database can ignore irrelevant time periods entirely.

Query: select * from <table> where start_time > '...' and start_time < '...';
Result:
  - Non-Partitioned: 5,476 ms
  - RANGE Partitioned: 0.9 ms

This is the classic use case for partitioning time-series data, like logs, IoT events, or financial transactions.
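As a concrete sketch, RANGE partitioning by time might look like the following. This is hypothetical MySQL-style DDL; the table name, columns, and partition boundaries are illustrative, not taken from the benchmark (note that MySQL requires the partition key to appear in every unique key, hence the composite primary key):

```sql
-- Illustrative MySQL-style RANGE partitioning on a time column.
CREATE TABLE events (
    id         BIGINT   NOT NULL,
    company_id BIGINT   NOT NULL,
    start_time DATETIME NOT NULL,
    payload    TEXT,
    PRIMARY KEY (id, start_time)  -- partition key must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(start_time)) (
    PARTITION p2024q4 VALUES LESS THAN (TO_DAYS('2025-01-01')),
    PARTITION p2025q1 VALUES LESS THAN (TO_DAYS('2025-04-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```

With this layout, a query bounded on start_time only touches the partitions whose ranges overlap the predicate; the rest are pruned before any rows are read.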

HASH Partitioning: For High-Cardinality Lookups 🔀

If your workload involves looking up data for a specific entity, like a user or a company, HASH partitioning on that entity's ID is optimal. It evenly distributes data, and for queries on that ID, the database knows exactly which partition to check.

Query: select count(*) from <table> where company_id='...';
Result:
  - Non-Partitioned: ~900 ms
  - HASH Partitioned: ~67 ms

Crucial Pitfall: Using the wrong strategy can be worse than using none at all. Running the count(*) query against the RANGE-partitioned table took 45 seconds on its first run, because the database had to scan every single time-based partition to find all records for that one company.
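A HASH-partitioned layout for this lookup pattern might be sketched as follows. Again, this is hypothetical MySQL-style DDL with illustrative names and an arbitrary partition count:

```sql
-- Illustrative MySQL-style HASH partitioning on the lookup key.
CREATE TABLE events (
    id         BIGINT   NOT NULL,
    company_id BIGINT   NOT NULL,
    start_time DATETIME NOT NULL,
    PRIMARY KEY (id, company_id)  -- partition key must be part of every unique key
)
PARTITION BY HASH (company_id)
PARTITIONS 16;
```

Because the hash of company_id deterministically maps each row to one of the 16 partitions, an equality predicate on company_id lets the database open exactly one partition instead of all of them.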


Level 2: The Power of Subpartitioning (Range + Hash) 🚀

What if your most critical queries filter on both time and a specific ID? This is where subpartitioning delivers peak performance. By partitioning first by RANGE (start_time) and then subpartitioning by HASH (company_id), you enable a powerful two-level pruning system.

Query: select * from <table> where start_time='...' and company_id='...';
Result:
  - Non-Partitioned: ~19,000 ms
  - HASH Partitioned: ~3,500 ms
  - RANGE Partitioned: ~3,600 ms
  - Range + Hash Subpartitioned: ~28 ms 🚀

When a query provides both keys, the database first isolates the correct time range and then, within that small slice, instantly locates the data for the specific company. This is the "sweet spot" for subpartitioning and offers an order-of-magnitude improvement over any other strategy.
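The two-level scheme can be sketched like this; it is hypothetical MySQL-style DDL (MySQL permits HASH subpartitions only under RANGE or LIST partitioning), with illustrative names, boundaries, and subpartition count:

```sql
-- Illustrative RANGE (time) + HASH (company_id) subpartitioning.
CREATE TABLE events (
    id         BIGINT   NOT NULL,
    company_id BIGINT   NOT NULL,
    start_time DATETIME NOT NULL,
    PRIMARY KEY (id, start_time, company_id)
)
PARTITION BY RANGE (TO_DAYS(start_time))
SUBPARTITION BY HASH (company_id)
SUBPARTITIONS 8 (
    PARTITION p2025q1 VALUES LESS THAN (TO_DAYS('2025-04-01')),
    PARTITION p2025q2 VALUES LESS THAN (TO_DAYS('2025-07-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```

A query supplying both start_time and company_id first prunes to a single RANGE partition, then to a single HASH subpartition within it, which is the two-level pruning described above.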


The Universal Rule: When Partitioning Can't Help 🛑

Partitioning is not a magic bullet. If a query does not filter on a partition key, the database must scan all partitions, making the query just as slow as it would be on a non-partitioned table. This demonstrates that partitioning is a complement to, not a replacement for, a proper indexing strategy.

Query: select * from <table> where channel_id='...'; (a non-indexed, non-partitioned column)
Result: All strategies, including subpartitioning, took ~57-60 seconds.
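In this situation the fix is an index rather than a partitioning change. A minimal sketch, assuming the same illustrative table and column names as above:

```sql
-- A plain secondary index on the unfiltered column; partitioning cannot substitute for this.
CREATE INDEX idx_events_channel_id ON events (channel_id);
```

On a partitioned table this index is maintained per partition, so it speeds up the channel_id lookup regardless of which partitioning strategy (or none) is in place.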

Architect's Decision Framework

Don't choose your partitioning strategy in a vacuum. Let your queries guide your architecture.

| If Your Most Critical Query Is... | Recommended Strategy | Why? |
| --- | --- | --- |
| WHERE start_time = '...' AND company_id = '...' | Range + Hash Subpartitioning | Unlocks two-level partition pruning for maximum efficiency. |
| WHERE start_time BETWEEN '...' AND '...' | RANGE Partitioning | Most direct and efficient way to query time-series data. |
| WHERE company_id = '...' | HASH Partitioning | Fastest for direct lookups on a specific ID. |
| WHERE other_column = '...' | None (Focus on Indexing) | Partitioning provides no benefit. The problem is a missing index. |

Ultimately, you must optimize for the queries that are most frequent and most critical to your application's performance. By understanding these trade-offs, you can design a partitioning scheme that provides true, data-driven results.


Questions about database partitioning strategies for your application? Let's discuss how to optimize your data architecture for the queries that matter most.

💬 Share Your Thoughts

Have insights to share or questions about this post? I'd love to hear from you!