Hello everyone,
I’m planning to create a Delta Lake table on Databricks with an estimated size of ~50 TiB. The table includes three date columns — year, month, and day — and most of my queries will filter on these fields.
I’m trying to decide whether to use partitioning or clustering (Z‑Ordering) for this table. Which approach would you recommend for optimal query performance, file management, and maintenance at this scale? Are there any best‑practice guidelines around:
- Choosing partition keys and granularity for date data
- When to prefer clustering over partitioning (or vice versa)
- Expected impact on data skipping, file sizes, and maintenance overhead
Any advice — including example configurations or links to relevant documentation — would be greatly appreciated. Thank you!
Accepted Solutions
Hi Yutaro,
How are you doing today? For a 50 TiB Delta table, a mix of partitioning and clustering (Z-Ordering) will give you the best performance and manageability. Since your queries filter by year, month, and day, I'd recommend partitioning by year and month: this keeps partitions at a reasonable size without creating too many small files. Partitioning by day could produce a huge number of tiny files, which would slow queries down.

To further optimize query performance, Z-Order on the day column (or another frequently queried field) within each partition. This groups similar data together in the same files, improving data skipping and query speed. A good setup is to partition by (year, month) and then periodically run OPTIMIZE ... ZORDER BY (day) to keep files compacted. This approach balances query speed, storage efficiency, and maintenance effort. Let me know if you need help fine-tuning it!
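A minimal sketch of that setup in Databricks SQL. The table and column names here (`events`, `event_id`, `payload`) are hypothetical, and the WHERE clause values are just an example of limiting OPTIMIZE to recent partitions:

```sql
-- Hypothetical table illustrating the suggested layout:
-- partition by (year, month), then Z-Order within partitions by day.
CREATE TABLE IF NOT EXISTS events (
  event_id BIGINT,
  payload  STRING,
  year     INT,
  month    INT,
  day      INT
)
USING DELTA
PARTITIONED BY (year, month);

-- Run periodically (e.g. after each daily load) to compact small files
-- and co-locate rows with similar day values for better data skipping.
OPTIMIZE events
WHERE year = 2024 AND month = 6   -- restrict to recent partitions to limit cost
ZORDER BY (day);
```

Restricting OPTIMIZE with a partition predicate keeps the maintenance job cheap, since older partitions that are no longer written to only need to be optimized once.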
Regards,
Brahma
Hi Brahma,
Thank you so much for your detailed explanation and clear recommendations! Partitioning by year and month, combined with regularly running OPTIMIZE … ZORDER BY day, makes perfect sense for balancing performance and manageability at this scale. I’ll implement this approach and monitor file sizes and query speed as you suggested.
If you have any tips on how often I should schedule the OPTIMIZE job for a 50 TiB table, I’d greatly appreciate it.
Thanks again for your help!
Best regards,
Yutaro
Hi Brahma,
I want to extend my deepest gratitude for your detailed and insightful response. Your explanation was incredibly helpful, and your advice on scheduling, targeting recent partitions, and monitoring performance has given me a clear path forward. I truly appreciate the time and effort you invested in sharing your expertise. Your support means a lot, and I feel much more confident moving ahead thanks to your guidance.
Thank you once again for all your help!
Best regards,
Yutaro
Hey Yutaro,
Thank you so much for the kind words—it honestly means a lot! I'm really glad the guidance helped and that you're feeling more confident moving forward. You're doing all the right things by asking the right questions and planning ahead. If you ever run into anything else or just want to bounce around ideas, feel free to reach out anytime. Always happy to help!
Wishing you all the best with your project!
– Brahma