Hello everyone,
I’m planning to create a Delta Lake table on Databricks with an estimated size of ~50 TiB. The table includes three date columns — year, month, and day — and most of my queries will filter on these fields.
I’m trying to decide whether to use partitioning or clustering (Z‑Ordering) for this table. Which approach would you recommend for optimal query performance, file management, and maintenance at this scale? Are there any best‑practice guidelines around:
- Choosing partition keys and granularity for date data
- When to prefer clustering over partitioning (or vice versa)
- Expected impact on data skipping, file sizes, and maintenance overhead
Any advice — including example configurations or links to relevant documentation — would be greatly appreciated. Thank you!
Accepted Solutions
Hi Yutaro,
How are you doing today? For a 50 TiB Delta table, a mix of partitioning and clustering (Z-Ordering) will give you the best performance and manageability. Since your queries filter by year, month, and day, I'd recommend partitioning by year and month: this keeps partitions at a reasonable size without creating too many small files. Partitioning by day could produce a huge number of tiny files, which would slow queries down.

To further optimize query performance, Z-Order on the day column (or another frequently queried field) within each partition. This groups similar data together in the same files, improving data skipping and query speed. A good setup is to partition by (year, month) and then periodically run OPTIMIZE ... ZORDER BY (day) to keep files compacted. This approach balances query speed, storage efficiency, and maintenance effort. Let me know if you need help fine-tuning it!
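A minimal sketch of that setup in Databricks SQL. The table and column names here (`events`, `event_id`, `payload`) are hypothetical, and the WHERE clause values are just an example of limiting OPTIMIZE to recent partitions:

```sql
-- Hypothetical table illustrating the suggested layout:
-- partition by (year, month), then Z-Order within partitions by day.
CREATE TABLE IF NOT EXISTS events (
  event_id BIGINT,
  payload  STRING,
  year     INT,
  month    INT,
  day      INT
)
USING DELTA
PARTITIONED BY (year, month);

-- Run periodically (e.g. after each daily load) to compact small files
-- and co-locate rows with similar day values for better data skipping.
OPTIMIZE events
WHERE year = 2024 AND month = 6   -- restrict to recent partitions to limit cost
ZORDER BY (day);
```

Restricting OPTIMIZE with a partition predicate keeps the maintenance job cheap, since older partitions that are no longer written to only need to be optimized once.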
Regards,
Brahma
Hi Brahma,
Thank you so much for your detailed explanation and clear recommendations! Partitioning by year and month, combined with regularly running OPTIMIZE … ZORDER BY day, makes perfect sense for balancing performance and manageability at this scale. I’ll implement this approach and monitor file sizes and query speed as you suggested.
If you have any tips on how often I should schedule the OPTIMIZE job for a 50 TiB table, I’d greatly appreciate it.
Thanks again for your help!
Best regards,
Yutaro
Hi Brahma,
I want to extend my deepest gratitude for your detailed and insightful response. Your explanation was incredibly helpful, and your advice on scheduling, targeting recent partitions, and monitoring performance has given me a clear path forward. I truly appreciate the time and effort you invested in sharing your expertise. Your support means a lot, and I feel much more confident moving ahead thanks to your guidance.
Thank you once again for all your help!
Best regards,
Yutaro
Hey Yutaro,
Thank you so much for the kind words—it honestly means a lot! I'm really glad the guidance helped and that you're feeling more confident moving forward. You're doing all the right things by asking the right questions and planning ahead. If you ever run into anything else or just want to bounce around ideas, feel free to reach out anytime. Always happy to help!
Wishing you all the best with your project!
– Brahma